I have recently upgraded my lab’s reference machine to Oracle Linux 6 and have experimented today with its network failover capabilities. I seemed to remember that network bonding on Xen didn’t work, so I was curious to test it on new hardware. As always, I am running this on my openSuSE 11.2 lab server, which features these components:
xen-3.4.1_19718_04-2.1.x86_64
Kernel 2.6.31.12-0.2-xen
libvirt-0.7.2-1.1.4.x86_64
Now for the fun part: I cloned my OL6REF domU, and in about 10 minutes had a new system to experiment with. The necessary new NIC was added quickly before registering the domU with XenStore. All you need to do in this case is to add another interface to the domU’s XML definition, as in this example (00:16:1e:1b:1d:1f already existed):
...
...
After registering the domU using a call to “virsh define bondingTest.xml” the system starts as usual, except that it has a second NIC, which at this stage is unconfigured. Remember that the Oracle Linux 5 and 6 network configuration is in /etc/sysconfig/network and /etc/sysconfig/network-scripts/.
The first step is to rename the server: change /etc/sysconfig/network to match your new server name. That’s easy :)
Now to the bonding driver. RHEL 6 and OL 6 have deprecated /etc/modprobe.conf in favour of /etc/modprobe.d and its configuration files. It’s still necessary to tell the kernel that it should use the bonding driver for my new device, bond0, so I created a new file, /etc/modprobe.d/bonding.conf, with just one line in it:
alias bond0 bonding
That’s it. Don’t put any further information about module parameters in the file; this is deprecated. The documentation clearly states “Important: put all bonding module parameters in ifcfg-bondN files”.
Now I had to create the configuration files for eth0, eth1 and bond0. They are created as follows:
Now for the bonding parameters: there are a few of interest. First, I wanted to set the mode to active-passive (the bonding driver calls it active-backup), which is what Oracle recommends, the rationale being that it is simple. Additionally, you have to set either the arp_interval/arp_ip_target parameters or a value for miimon to allow for speedy link failure detection. My BONDING_OPTS for bond0 is therefore as follows:
The test is going to be simple: first I’ll bring up the interface bond0 by issuing a “service network restart” command on the Xen console, followed by a “xm network-detach” command issued from the dom0. The output of the network restart command is here:
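A minimal sketch of what the three files could look like, written by a small shell function. The mode and miimon values match what the kernel reports in /var/log/messages when the bond comes up (active-backup, MII interval 1000), and the address is the one ifconfig shows for bond0. The function name write_bond_config, the target directory argument and the ONBOOT/BOOTPROTO/USERCTL settings are my assumptions, not the files actually used here.

```shell
#!/bin/bash
# Sketch only: write ifcfg files for bond0 with eth0 and eth1 as slaves.
# write_bond_config is a hypothetical helper; point it at a scratch directory
# first and copy the result to /etc/sysconfig/network-scripts once reviewed.
write_bond_config() {
    local dir="$1"
    mkdir -p "$dir" || return 1

    # The bonding master carries the IP configuration and all module options.
    cat > "$dir/ifcfg-bond0" <<'EOF'
DEVICE=bond0
IPADDR=192.168.99.126
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=active-backup miimon=1000"
EOF

    # Each slave only points at its master; it has no address of its own.
    local nic
    for nic in eth0 eth1; do
        cat > "$dir/ifcfg-$nic" <<EOF
DEVICE=$nic
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
EOF
    done
}
```

Calling write_bond_config /tmp/bondtest leaves the three files in /tmp/bondtest for inspection before anything under /etc is touched.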
[root@rhel6ref network-scripts]# service network restart
Shutting down loopback interface: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface bond0: [ OK ]
[root@rhel6ref network-scripts]# ifconfig
bond0 Link encap:Ethernet HWaddr 00:16:1E:1B:1D:1F
inet addr:192.168.99.126 Bcast:192.168.99.255 Mask:255.255.255.0
inet6 addr: fe80::216:1eff:fe1b:1d1f/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:297 errors:0 dropped:0 overruns:0 frame:0
TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:9002 (8.7 KiB) TX bytes:1824 (1.7 KiB)
eth0 Link encap:Ethernet HWaddr 00:16:1E:1B:1D:1F
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:214 errors:0 dropped:0 overruns:0 frame:0
TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6335 (6.1 KiB) TX bytes:1272 (1.2 KiB)
Interrupt:18
eth1 Link encap:Ethernet HWaddr 00:16:1E:1B:1D:1F
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:83 errors:0 dropped:0 overruns:0 frame:0
TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2667 (2.6 KiB) TX bytes:552 (552.0 b)
Interrupt:17
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
The kernel traces these operations in /var/log/messages:
May 1 07:55:49 rhel6ref kernel: bonding: bond0: Setting MII monitoring interval to 1000.
May 1 07:55:49 rhel6ref kernel: bonding: bond0: setting mode to active-backup (1).
May 1 07:55:49 rhel6ref kernel: ADDRCONF(NETDEV_UP): bond0: link is not ready
May 1 07:55:49 rhel6ref kernel: bonding: bond0: Adding slave eth0.
May 1 07:55:49 rhel6ref kernel: bonding: bond0: Warning: failed to get speed and duplex from eth0, assumed to be 100Mb/sec and Full.
May 1 07:55:49 rhel6ref kernel: bonding: bond0: making interface eth0 the new active one.
May 1 07:55:49 rhel6ref kernel: bonding: bond0: first active interface up!
May 1 07:55:49 rhel6ref kernel: bonding: bond0: enslaving eth0 as an active interface with an up link.
May 1 07:55:49 rhel6ref kernel: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
May 1 07:55:49 rhel6ref kernel: bonding: bond0: Adding slave eth1.
May 1 07:55:49 rhel6ref kernel: bonding: bond0: Warning: failed to get speed and duplex from eth1, assumed to be 100Mb/sec and Full.
May 1 07:55:49 rhel6ref kernel: bonding: bond0: enslaving eth1 as a backup interface with an up link.
This shows an active device of eth0, with eth1 as the passive device. Note that the MAC addresses of all devices are identical (which is expected behaviour). Now let’s see what happens to the channel failover when I take a NIC offline. First of all I have to check in xenstore which NICs are present:
I would like to take the active link away, which is at index 0. Let’s try:
# xm network-detach bondingTest 0
The domU shows the link failover:
May 1 08:00:46 rhel6ref kernel: bonding: bond0: Warning: the permanent HWaddr of eth0 - 00:16:1e:1b:1d:1f - is still in use by bond0.
Set the HWaddr of eth0 to a different address to avoid conflicts.
May 1 08:00:46 rhel6ref kernel: bonding: bond0: releasing active interface eth0
May 1 08:00:46 rhel6ref kernel: bonding: bond0: making interface eth1 the new active one.
May 1 08:00:46 rhel6ref kernel: net eth0: xennet_release_rx_bufs: fix me for copying receiver.
Oops, there seems to be a problem with the xennet driver, but never mind. The important information is in the lines above: the active eth0 device has been released, and eth1 jumped in. Next I think I will have to run a workload against the interface to see if that makes a difference.
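Besides reading /var/log/messages, the bonding driver exposes its current state in /proc/net/bonding/bond0. The following is a small sketch for pulling the active slave out of that file; the helper name active_slave is made up for this example, and the optional file argument only exists so the function can be tried against a saved copy.

```shell
#!/bin/bash
# Sketch only: report the active slave of a bond from the bonding driver's
# status file. active_slave is a hypothetical helper name.
active_slave() {
    local status_file="${1:-/proc/net/bonding/bond0}"
    # The driver prints a line such as "Currently Active Slave: eth1".
    awk -F': ' '/^Currently Active Slave:/ { print $2 }' "$status_file"
}
```

Run inside the domU before and after the xm network-detach call, it should report eth0 first and eth1 afterwards, in line with the kernel messages above.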
And the reverse …
I couldn’t possibly leave the system in the “broken” state, so I decided to add the NIC back. That’s yet another online operation I can do:
I can also see that the kernel added the new interface back in.
May 2 05:05:31 rhel6ref kernel: bonding: bond0: Adding slave eth0.
May 2 05:05:31 rhel6ref kernel: bonding: bond0: Warning: failed to get speed and duplex from eth0, assumed to be 100Mb/sec and Full.
May 2 05:05:31 rhel6ref kernel: bonding: bond0: enslaving eth0 as a backup interface with an up link.
Despite all recent progress in other virtualisation technologies I am staying faithful to Xen. The reason is simple: it works for me. Paravirtualised domUs are the fastest way to run virtual machines on hardware that doesn’t support virtualisation in the processor.
I read about cgroups yesterday, a feature that’s appeared in kernel 2.6.38 and apparently was back-ported into RedHat 6. Unfortunately I can’t get hold of a copy, so I decided to use Oracle’s clone instead. I wanted to install the new domU on my existing lab environment, which is a 24G RAM Core i7 920 system with 1.5TB of storage. The only limitation I can see is the low number of cores; I wish I could rent an Opteron 6100 series system instead (for the same price).
Creating the domU
The first setback was the failure of virt-manager. Virt Manager is OpenSuSE’s preferred tool to create Xen virtual machines. I wrote about virt-manager and OpenSuSE 11.2 some time ago and went back to this post for instructions. However, the logic coded into the application doesn’t seem to handle OL6: it repeatedly failed to start the PV kernel. I assume the directory structure on the ISO has changed, or there is some other configuration issue, maybe even a PEBKAC.
That was a bit of a problem, because it meant I had to do the legwork all on my own. Thinking about it, virt-manager is not really to blame: OL6 wasn’t yet released when the tool came out. So be it, at least I’ll learn something new today. To start with, I needed to create a new “disk” to contain my root volume group. The way the dom0 is set up doesn’t allow me to use LVM logical volumes: all the space is already allocated. My domUs are all stored in /var/lib/xen/images/domUName. I started off by creating the top level directory for my new OL6 reference domU:
# mkdir /var/lib/xen/images/ol6
Inside the directory I created the sparse file for my first “disk”:
# cd /var/lib/xen/images/ol6
# dd if=/dev/zero of=disk0 bs=1024k seek=8192 count=0
This will create a sparse file (much like a temp file in Oracle) for use as my first virtual disk. The next step was to extract the kernel and initial ramdisk from the ISO image and store it somewhere convenient. My default location for xen kernels is /m/xenkernels/domUName. The new kernel is a PVOPS kernel (but still not dom0 capable!) so copying it from a loopback mounted ISO image’s isolinux/vmlinuz location was enough. There is also only one initrd to copy. You should get them into the xen kernel location as shown here:
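A minimal sketch of that copy step, assuming the ISO is already loopback mounted. The helper name copy_pv_boot_files, the mount point and the ISO path are placeholders of mine; the isolinux/vmlinuz location comes from the text above, and isolinux/initrd.img is where I would expect the single initrd to be.

```shell
#!/bin/bash
# Sketch only: copy the installer kernel and initrd from a mounted ISO tree.
# copy_pv_boot_files is a hypothetical helper; both paths are arguments.
copy_pv_boot_files() {
    local iso_root="$1" dest="$2"
    mkdir -p "$dest" || return 1
    # Both files sit in the isolinux directory of the installation media.
    cp "$iso_root/isolinux/vmlinuz" "$iso_root/isolinux/initrd.img" "$dest/"
}

# Typical use as root (the ISO file name and /mnt/ol6 are placeholders):
#   mount -o loop,ro /path/to/ol6.iso /mnt/ol6
#   copy_pv_boot_files /mnt/ol6 /m/xenkernels/ol6
#   umount /mnt/ol6
```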
Next, I copied the contents of the ISO image to /srv/www/htdocs/ol6.
We now need a configuration file to start the domU initially. The config file below worked for me. I have deliberately not chosen a libvirt-compatible XML file to keep the example simple. We’ll convert to xenstore later …
As I am forwarding the VNC port I needed a fixed one. In my putty session I forwarded local port 5911 to my dom0’s port 5911. Now start the domU using the “xm create /tmp/rhel6ref” command. Within a few seconds you should be able to point your vncviewer to localhost:11 and connect to the domU. That’s all! From now on the installation is really just a matter of “next”, “next”, “next”. Some things have changed though; have a look at these selected screenshots.
Walking through the installation
First of all you need to configure the network for the domU to talk to your staging server. I always use a manual configuration, and configure IPv4 only. The URL setup is interesting: I used http://dom0/ol6 as the repository and got further. When in doubt, check the access and error logs in /var/log/apache2.
After the welcome screen I was greeted with a message stating that my xen-vbd-xxx device had to be reinitialised. Huh? But ok, so I did that and progressed. I then entered the hostname and got to the partitioning screen. I chose to “use all space” and ticked the box next to “review and modify partitioning layout”. Remember that ext4 is now the default for all file systems, but OpenSuSE’s pygrub can’t read it. The important step is to ensure that you have a separate /boot partition outside any LVM devices, and that it’s formatted with ext2. The ext3 file system might also work, but I decided to stick with ext2 which I knew pygrub could deal with. I also tend to rename my volume group to rootvg, instead vg_hostname as the installer suggests.
The VNC interface now became a little bit difficult to use when being asked to select a timezone, so I deferred that to later. I ran into a bit of a problem when it came to the package selection screen. Suddenly the installer, which had happily read all data from my Apache setup, claimed it couldn’t read the repodata.xml file. I thought that was strange but then manually pointed it to the Server/repodata/repomd.xml file and clicked on the install button. Unfortunately the installer now couldn’t read the first package. The reason was quickly identified in the access log:
192.168.99.124 - - [01/Apr/2011:14:01:42 +0200] "GET /ol6/Server/Packages/alsa-utils-1.0.21-3.el6.x86_64.rpm HTTP/1.1" 403 1036 "-" "Oracle Linux Server (anaconda)/6.0"
192.168.99.124 - - [01/Apr/2011:14:03:10 +0200] "GET /ol6/Server/Packages/alsa-utils-1.0.21-3.el6.x86_64.rpm HTTP/1.1" 403 1036 "-" "Oracle Linux Server (anaconda)/6.0"
HTTP 403 errors (i.e. FORBIDDEN). They were responsible for the problem with the repomd.xml file as well:
192.168.99.124 - - [01/Apr/2011:14:00:53 +0200] "GET /ol6/repodata/repomd.xml HTTP/1.1" 403 1036 "-" "Oracle Linux Server (anaconda)/6.0"
The reason for these could be found in the error log:
[Fri Apr 01 14:00:53 2011] [error] [client 192.168.99.124] Symbolic link not allowed or link target not accessible: /srv/www/htdocs/ol6/repodata
Oha. Where do these come from? Fair enough, the directory structure has changed:
Now then, because this is my lab and I’m solely responsible for its security, I changed the Options in my Apache’s server root to FollowSymLinks. Do not do this in real life! Create a separate directory, or alias, and don’t compromise your server root. Enough said …
A reload of the apache2 daemon fixed that problem, but I had to start from scratch. This time however it ran through without problems.
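To see up front which entries Apache will refuse, it helps to list the symbolic links in the staged tree before starting the installer. This is only a sketch; list_symlinks is a helper name I made up, and the path in the usage line is the one from the error log.

```shell
#!/bin/bash
# Sketch only: list symbolic links in a staged installation tree, the
# entries Apache refuses to serve without FollowSymLinks.
# list_symlinks is a hypothetical helper name.
list_symlinks() {
    local tree="$1"
    # -type l matches the links themselves; ls -ld shows each link's target.
    find "$tree" -type l -exec ls -ld {} +
}

# Typical use on the dom0:
#   list_symlinks /srv/www/htdocs/ol6
```

An empty result means the tree can be served without relaxing the Options of the server root.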
Cleaning Up
When the installer prompts you for a reboot, don’t click on the “OK” button just yet. The configuration file needs to be changed to use the bootloader pygrub. Change the configuration file to something similar to this:
The only change is the replacement of the kernel and ramdisk lines with bootloader. You may have to xm destroy the VM for the change to take effect; a reboot doesn’t seem to trigger a reparse of the configuration file.
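The edit itself can be scripted. The sketch below removes the kernel and ramdisk assignments from an xm configuration file and appends a bootloader line; use_pygrub is a hypothetical helper, and /usr/bin/pygrub is an assumed location that you should verify on your own dom0.

```shell
#!/bin/bash
# Sketch only: rewrite an xm configuration file so the domU boots through
# pygrub instead of the installer kernel. use_pygrub is a hypothetical helper;
# /usr/bin/pygrub is an assumed path, so check where your dom0 installs it.
use_pygrub() {
    local cfg="$1" pygrub="${2:-/usr/bin/pygrub}"
    local tmp
    tmp=$(mktemp) || return 1
    # Keep every line except the kernel and ramdisk assignments.
    grep -Ev '^[[:space:]]*(kernel|ramdisk)[[:space:]]*=' "$cfg" > "$tmp"
    # Add the bootloader line that replaces them.
    printf 'bootloader = "%s"\n' "$pygrub" >> "$tmp"
    # Keep a copy of the installation config, then overwrite in place.
    cp "$cfg" "$cfg.install" && cat "$tmp" > "$cfg"
    local rc=$?
    rm -f "$tmp"
    return $rc
}
```

With the configuration file used in this post the call would be use_pygrub /tmp/rhel6ref, which leaves the original behind as /tmp/rhel6ref.install.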
With that done, restart the VM and enjoy the final stages of the configuration. If your domU doesn’t start now, you probably forgot to format /boot with ext2 and it is ext4. In that case you will have to do some research on whether or not you can save your installation.
The result is a new Oracle Linux reference machine!
# ssh root@192.168.99.124
The authenticity of host '192.168.99.124 (192.168.99.124)' can't be established.
RSA key fingerprint is 3d:90:d5:ef:33:e1:15:f8:eb:4a:38:15:cd:b9:f1:7e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.99.124' (RSA) to the list of known hosts.
root@192.168.99.124's password:
[root@rhel6ref ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 (Santiago)
[root@rhel6ref ~]# uname -a
Linux rhel6ref.localdomain 2.6.32-100.28.5.el6.x86_64 #1 SMP Wed Feb 2 18:40:23 EST
2011 x86_64 x86_64 x86_64 GNU/Linux
If you want, you can register the domU in xenstore: while the domU is up and running, dump the XML file using “virsh dumpxml domUName > /tmp/ol6.xml”. My configuration file looked like this:
This post is about the installation of Grid Infrastructure, and where it’s really getting exciting: the 3rd NFS voting disk is going to be presented and I am going to show you how simple it is to add it into the disk group chosen for OCR and voting disks.
Let’s start with the installation of Grid Infrastructure. This is really simple, and I won’t go into too much detail. Start by downloading the required file from MOS; a simple search for patch 10098816 should bring you to the download page for 11.2.0.2 for Linux. Just make sure you select the 64bit version. The file we need just now is called p10098816_112020_Linux-x86-64_3of7.zip. The file names don’t necessarily relate to their contents; the readme helps in finding out which piece of the puzzle is used for what functionality.
I alluded to my software distribution method in one of the earlier posts, and here’s all the detail to come. My dom0 exports the /m directory to the 192.168.99.0/24 network, the one accessible to all my domUs. This really simplifies software deployments.
This creates the subdirectory “grid”. Switch back to edcnode1 and log in as oracle. As I already explained I won’t use different accounts for Grid Infrastructure and the RDBMS in this example.
If you have not already done so, mount the /m directory on the domU (which requires root privileges). Move to the newly unzipped “grid” directory under your mount point and begin to set up the user equivalence. On edcnode1 and edcnode2, create RSA and DSA keys for SSH:
[oracle@edcnode1 ~]$ ssh-keygen -t rsa
Any questions can be answered with the return key; it’s important to leave the passphrase empty. Repeat the call to ssh-keygen with the argument “-t dsa”. Navigate to ~/.ssh and create the authorized_keys file as follows:
Change the permissions on the authorized_keys file to 0400 on both hosts; sshd won’t consider the file when you log in if its permissions are too open. With all of this done, you can add all the unknown hosts to each node’s known_hosts file. The easiest way is a for loop:
[oracle@edcnode1 ~]$ for i in edcnode1 edcnode2 edcnode1-priv edcnode2-priv; do ssh $i hostname; done
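One way to do this is sketched below, under the assumption that the public keys of both nodes have first been collected in a single directory (for example by copying them over with scp). The helper name build_authorized_keys and the directory arguments are hypothetical.

```shell
#!/bin/bash
# Sketch only: assemble authorized_keys from the public keys of both nodes.
# build_authorized_keys is a hypothetical helper. It expects the *.pub files
# of every node to have been copied into one directory beforehand, for
# example with scp from edcnode2.
build_authorized_keys() {
    local pubdir="$1" sshdir="${2:-$HOME/.ssh}"
    mkdir -p "$sshdir" && chmod 700 "$sshdir" || return 1
    # Start from scratch so that repeated runs do not duplicate keys.
    rm -f "$sshdir/authorized_keys"
    cat "$pubdir"/*.pub > "$sshdir/authorized_keys" || return 1
    # Read-only for the owner; sshd rejects files that others can write to.
    chmod 0400 "$sshdir/authorized_keys"
}
```

The resulting file has to be identical on edcnode1 and edcnode2, so either run the function on both nodes with the same set of keys or copy the finished file across.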
Run this twice on each node, acknowledging the question if the new address should be added. Important: Ensure that there is no banner (/etc/motd, .profile, .bash_profile etc) writing to stdout or stderr or you are going to see strange error messages about user equivalence not being set up correctly.
I hear you say: but 11.2 can create user equivalence in OUI now. This is of course correct, but I wanted to run cluvfy now, which requires a working setup.
Cluster Verification
It is good practice to run a check to see if the prerequisites for the Grid Infrastructure installation are met, and keep the output. Change to the NFS mount where the grid directory is exported, and execute runcluvfy.sh as in this example:
The nice thing is that you can run the fixup script now to fix kernel parameter settings:
[root@edcnode2 ~]# /tmp/CVU_11.2.0.2.0_oracle/runfixup.sh
/usr/bin/id
Response file being used is :/tmp/CVU_11.2.0.2.0_oracle/fixup.response
Enable file being used is :/tmp/CVU_11.2.0.2.0_oracle/fixup.enable
Log file location: /tmp/CVU_11.2.0.2.0_oracle/orarun.log
Setting Kernel Parameters...
fs.file-max = 327679
fs.file-max = 6815744
net.ipv4.ip_local_port_range = 9000 65500
net.core.wmem_max = 262144
net.core.wmem_max = 1048576
Repeat this on the other node. Obviously you should fix any other problem cluvfy reports before proceeding.
In the previous post I created the /u01 mount point. Double check that /u01 is actually mounted, otherwise you’d end up writing on your root_vg’s root_lv, not an ideal situation.
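A quick way to make that check scriptable is to look the directory up in /proc/mounts. The following is a sketch only; is_mounted is a hypothetical helper, and its second argument merely allows it to be tried against a saved copy of the mount table.

```shell
#!/bin/bash
# Sketch only: succeed if a directory is a mount point of its own.
# is_mounted is a hypothetical helper; the second argument defaults to
# /proc/mounts and exists mainly to make the function easy to try out.
is_mounted() {
    local dir="$1" mounts="${2:-/proc/mounts}"
    # Field 2 of each /proc/mounts line is the mount point.
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}
```

Before starting the installer, a line such as is_mounted /u01 || echo "/u01 is not mounted" >&2 on each node catches the problem early.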
You are now ready to start the installer: type in ./runInstaller to start the installation.
Grid Installation
This is rather mundane, and instead of providing screenshots, I opted for a description of the steps to execute in the OUI session.
Screen 01: Skip software updates (I don’t have an Internet connection on my lab)
Screen 02: Install and configure Grid Infrastructure for a cluster
Screen 03: Advanced Installation
Screen 04: Keep defaults or add additional languages
Screen 05: Cluster Name: edc, SCAN name edc-scan, SCAN port: 1521, do not configure GNS
Screen 06: Ensure that both hosts are listed in this screen. Add/edit as appropriate. Hostnames are edcnode{1,2}.localdomain, VIPs are to be edcnode{1,2}-vip.localdomain. Enter the oracle user’s password and click on next
Screen 07: Assign eth0 to public, eth1 to private and eth2 to “do not use”.
Screen 08: Select ASM
Screen 09: disk group name: OCRVOTE with NORMAL redundancy. Tick the boxes for “ORCL:OCR01FILER01”, “ORCL:OCR01FILER02” and “ORCL:OCR02FILER01”
Screen 10: Choose suitable passwords for SYS and ASMSNMP
Screen 11: Don’t use IPMI
Screen 12: Assign DBA to OSDBA, OSOPER and OSASM. Again, in the real world you should think about role separation and assign different groups
Screen 15: Ignore all. There should only be references to swap, cvuqdisk, ASM device checks and NTP. If you have additional warnings, fix them first!
Screen 16: Click on install!
The usual installation will now take place. At the end, run the root.sh script on edcnode1 and after it completes, on edcnode2. The output is included here for completeness:
[root@edcnode1 u01]# /u01/app/11.2.0/grid/root.sh 2>&1 | tee /tmp/root.sh.out
Running Oracle 11g root script...
The following environment variables are set as:
ORACLE_OWNER= oracle
ORACLE_HOME= /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [/usr/local/bin]:
Copying dbhome to /usr/local/bin ...
Copying oraenv to /usr/local/bin ...
Copying coraenv to /usr/local/bin ...
Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
root wallet
root wallet cert
root cert export
peer wallet
profile reader wallet
pa wallet
peer wallet keys
pa wallet keys
peer cert request
pa cert request
peer cert
pa cert
peer root cert TP
profile reader root cert TP
pa root cert TP
peer pa cert TP
pa peer cert TP
profile reader pa cert TP
profile reader peer cert TP
peer user cert
pa user cert
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-2672: Attempting to start 'ora.mdnsd' on 'edcnode1'
CRS-2676: Start of 'ora.mdnsd' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'edcnode1'
CRS-2676: Start of 'ora.gpnpd' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'edcnode1'
CRS-2672: Attempting to start 'ora.gipcd' on 'edcnode1'
CRS-2676: Start of 'ora.gipcd' on 'edcnode1' succeeded
CRS-2676: Start of 'ora.cssdmonitor' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'edcnode1'
CRS-2672: Attempting to start 'ora.diskmon' on 'edcnode1'
CRS-2676: Start of 'ora.diskmon' on 'edcnode1' succeeded
CRS-2676: Start of 'ora.cssd' on 'edcnode1' succeeded
ASM created and started successfully.
Disk Group OCRVOTE created successfully.
clscfg: -install mode specified
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
CRS-4256: Updating the profile
Successful addition of voting disk 38f2caf7530c4f67bfe23bb170ed2bfe.
Successful addition of voting disk 9aee80ad14044f22bf6211b81fe6363e.
Successful addition of voting disk 29fde7c3919b4fd6bf626caf4777edaa.
Successfully replaced voting disk group with +OCRVOTE.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
2. ONLINE 9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
3. ONLINE 29fde7c3919b4fd6bf626caf4777edaa (ORCL:OCR02FILER01) [OCRVOTE]
Located 3 voting disk(s).
CRS-2672: Attempting to start 'ora.asm' on 'edcnode1'
CRS-2676: Start of 'ora.asm' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.OCRVOTE.dg' on 'edcnode1'
CRS-2676: Start of 'ora.OCRVOTE.dg' on 'edcnode1' succeeded
ACFS-9200: Supported
ACFS-9200: Supported
CRS-2672: Attempting to start 'ora.registry.acfs' on 'edcnode1'
CRS-2676: Start of 'ora.registry.acfs' on 'edcnode1' succeeded
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded
[root@edcnode2 ~]# /u01/app/11.2.0/grid/root.sh 2>&1 | tee /tmp/rootsh.out
Running Oracle 11g root script...
The following environment variables are set as:
ORACLE_OWNER= oracle
ORACLE_HOME= /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [/usr/local/bin]:
Copying dbhome to /usr/local/bin ...
Copying oraenv to /usr/local/bin ...
Copying coraenv to /usr/local/bin ...
Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node edcnode1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded
[root@edcnode2 ~]#
Congratulations! You have a working setup! Check if everything is ok:
It’s about time to deal with this subject. If you have not done so already, start the domU “filer03”. Log in as openfiler and ensure that the NFS server is started. On the services tab click on enable next to the NFS server if needed. Next navigate to the shares tab, where you should find the volume group and logical volume created earlier. The volume group I created is called “ocrvotenfs_vg”, and it has 1 logical volume, “nfsvol_lv”. Click on the name of the LV to create a new share. I named the new share “ocrvote”: enter this in the popup window and click on “create sub folder”.
The new share should appear underneath the nfsvol_lv now. Proceed by clicking on “ocrvote” to set the share’s properties. Before you get to enter these, click on “make share”. Scroll down to the host access configuration section in the following screen. In this section you could set all sorts of technologies: SMB, NFS, WebDAV, FTP and RSYNC. For this example, everything but NFS should be set to “NO”.
For NFS, the story is different: ensure you set the radio button to “RW” for both hosts. Then click on Edit for each machine. This is important! The anonymous UID and GID must match the Grid owner’s UID and GID. In my scenario I entered “500” for both. You can check your settings using the id command as oracle: it will print the UID and GID plus other information.
The UID/GID mapping then has to be set to all_squash, IO mode to sync, and write delay to wdelay. Leave the default for “requesting origin port”, which was set to “secure < 1024” in my configuration.
I decided to create /ocrvote on both nodes to mount the NFS export:
[root@edcnode2 ~]# mkdir /ocrvote
Edit the /etc/fstab file to make the mount persistent across reboots. I added this line to the file on both nodes:
The “addr” option instructs Linux to use the storage network to mount the share. Now you are ready to mount the device on all nodes, using the “mount /ocrvote” command.
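For illustration only, here is a sketch of how such an fstab entry can be put together. The helper fstab_line is hypothetical and all of its arguments are placeholders: I am not reproducing the exact line from my nodes here, and the mount options are the ones Oracle commonly documents for Clusterware files on NFS under Linux, so verify them against the documentation for your release.

```shell
#!/bin/bash
# Sketch only: print an /etc/fstab entry for the NFS voting disk share.
# fstab_line is a hypothetical helper and every argument is a placeholder
# for your own filer name, storage-network address and export path.
fstab_line() {
    local filer="$1" storage_ip="$2" export_path="$3" mountpoint="${4:-/ocrvote}"
    # Mount options follow the values Oracle commonly documents for
    # Clusterware files on NFS under Linux; verify them for your release.
    local opts="rw,bg,hard,intr,rsize=32768,wsize=32768,tcp,noac,vers=3,timeo=600"
    printf '%s:%s %s nfs %s,addr=%s 0 0\n' \
        "$filer" "$export_path" "$mountpoint" "$opts" "$storage_ip"
}

# Example with made-up values (192.0.2.10 is a documentation address):
#   fstab_line filer03 192.0.2.10 /mnt/ocrvotenfs_vg/nfsvol_lv/ocrvote
```

The printed line can be reviewed and then appended to /etc/fstab on both nodes.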
I changed the export on the filer to the uid/gid combination of the oracle account (or, on an installation with separate grid software owner, to its uid/gid combination):
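The file that later shows up as /ocrvote/nfsvotedisk01 has to exist before ASM can discover it. A minimal sketch of creating it follows; make_votedisk is a hypothetical helper, the 500 MB default mirrors the SIZE 500M clause ASMCA uses when adding the disk, and the file mode is my assumption.

```shell
#!/bin/bash
# Sketch only: create a zero-filled file to serve as the NFS voting disk.
# make_votedisk is a hypothetical helper. The 500 MB default mirrors the
# "SIZE 500M" clause in the ASM alert log; the file mode is an assumption.
make_votedisk() {
    local file="$1" size_mb="${2:-500}"
    dd if=/dev/zero of="$file" bs=1M count="$size_mb" 2>/dev/null || return 1
    # ASM must be able to read and write the file as the Grid owner.
    chmod 0660 "$file"
}

# Typical use as the Grid owner (oracle in this setup):
#   make_votedisk /ocrvote/nfsvotedisk01
```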
You only need to do this on one node. Recall that the current state is:
[oracle@edcnode1 ~]$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
2. ONLINE 9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
3. ONLINE 29fde7c3919b4fd6bf626caf4777edaa (ORCL:OCR02FILER01) [OCRVOTE]
Located 3 voting disk(s).
ASM sees it the same way:
SQL> select mount_status,header_status, name,failgroup,library
2 from v$asm_disk
3 /
MOUNT_S HEADER_STATU NAME FAILGROUP LIBRARY
------- ------------ ------------------------------ --------------- ------------------------------------------------------------
CLOSED PROVISIONED ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED PROVISIONED ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED PROVISIONED ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED PROVISIONED ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED MEMBER OCR01FILER01 OCR01FILER01 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED MEMBER OCR01FILER02 OCR01FILER02 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED MEMBER OCR02FILER01 OCR02FILER01 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
7 rows selected.
Now here’s the idea: you add the NFS location to the ASM diskstring in addition to “ORCL:*” and all is well. But that didn’t work:
SQL> show parameter disk
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups string
asm_diskstring string ORCL:*
SQL>
SQL> alter system set asm_diskstring = 'ORCL:*, /ocrvote/nfsvotedisk01' scope=memory sid='*';
alter system set asm_diskstring = 'ORCL:*, /ocrvote/nfsvotedisk01' scope=memory sid='*'
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-15014: path 'ORCL:OCR01FILER01' is not in the discovery set
Regardless of what I tried, the system complained. Grudgingly I used the GUI – asmca.
After starting asmca, click on Disk Groups. Then select diskgroup “OCRVOTE”, and right click to “add disks”. The trick is to click on “change discovery path”. Enter “ORCL:*, /ocrvote/nfsvotedisk01” (without quotes) into the dialog field and close it. Strangely, the NFS disk now appears. Make two ticks: one in front of the disk path, and one in the quorum box. A click on the OK button starts the magic, and you should be presented with a success message. The ASM instance reports a little more:
ALTER SYSTEM SET asm_diskstring='ORCL:*','/ocrvote/nfsvotedisk01' SCOPE=BOTH SID='*';
2010-09-29 10:54:52.557000 +01:00
SQL> ALTER DISKGROUP OCRVOTE ADD QUORUM DISK '/ocrvote/nfsvotedisk01' SIZE 500M /* ASMCA */
NOTE: Assigning number (1,3) to disk (/ocrvote/nfsvotedisk01)
NOTE: requesting all-instance membership refresh for group=1
2010-09-29 10:54:54.445000 +01:00
NOTE: initializing header on grp 1 disk OCRVOTE_0003
NOTE: requesting all-instance disk validation for group=1
NOTE: skipping rediscovery for group 1/0xd032bc02 (OCRVOTE) on local instance.
2010-09-29 10:54:57.154000 +01:00
NOTE: requesting all-instance disk validation for group=1
NOTE: skipping rediscovery for group 1/0xd032bc02 (OCRVOTE) on local instance.
2010-09-29 10:55:00.718000 +01:00
GMON updating for reconfiguration, group 1 at 5 for pid 27, osid 15253
NOTE: group 1 PST updated.
NOTE: initiating PST update: grp = 1
GMON updating group 1 at 6 for pid 27, osid 15253
2010-09-29 10:55:02.896000 +01:00
NOTE: PST update grp = 1 completed successfully
NOTE: membership refresh pending for group 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:05.285000 +01:00
GMON querying group 1 at 7 for pid 18, osid 4247
NOTE: cache opening disk 3 of grp 1: OCRVOTE_0003 path:/ocrvote/nfsvotedisk01
GMON querying group 1 at 8 for pid 18, osid 4247
SUCCESS: refreshed membership for 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:06.528000 +01:00
SUCCESS: ALTER DISKGROUP OCRVOTE ADD QUORUM DISK '/ocrvote/nfsvotedisk01' SIZE 500M /* ASMCA */
2010-09-29 10:55:08.656000 +01:00
NOTE: Attempting voting file refresh on diskgroup OCRVOTE
NOTE: Voting file relocation is required in diskgroup OCRVOTE
NOTE: Attempting voting file relocation on diskgroup OCRVOTE
NOTE: voting file allocation on grp 1 disk OCRVOTE_0003
2010-09-29 10:55:10.047000 +01:00
NOTE: voting file deletion on grp 1 disk OCR02FILER01
NOTE: starting rebalance of group 1/0xd032bc02 (OCRVOTE) at power 1
Starting background process ARB0
ARB0 started with pid=29, OS id=15446
NOTE: assigning ARB0 to group 1/0xd032bc02 (OCRVOTE) with 1 parallel I/O
2010-09-29 10:55:13.178000 +01:00
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
2010-09-29 10:55:15.533000 +01:00
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0xd032bc02 (OCRVOTE)
GMON updating for reconfiguration, group 1 at 9 for pid 31, osid 15451
NOTE: group 1 PST updated.
2010-09-29 10:55:17.907000 +01:00
NOTE: membership refresh pending for group 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:20.481000 +01:00
GMON querying group 1 at 10 for pid 18, osid 4247
SUCCESS: refreshed membership for 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:23.490000 +01:00
NOTE: Attempting voting file refresh on diskgroup OCRVOTE
NOTE: Voting file relocation is required in diskgroup OCRVOTE
NOTE: Attempting voting file relocation on diskgroup OCRVOTE
Superb! But did it kick out the correct disk? Yes it did: you now see OCR01FILER01 and OCR01FILER02 plus the NFS disk:
[oracle@edcnode1 ~]$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
2. ONLINE 9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
3. ONLINE 6107050ad9ba4fd1bfebdf3a029c48be (/ocrvote/nfsvotedisk01) [OCRVOTE]
Located 3 voting disk(s).
Preferred Mirror Read
One of the cool new 11.1 features allows administrators of stretched RAC systems to instruct each ASM instance to read mirrored extents from its local storage rather than always reading the primary extents. This can speed up data access in cases where data would otherwise have been sent from the remote array. Setting this parameter is crucial to many implementations. In preparation for the RDBMS installation (to be detailed in the next post), I created a disk group consisting of 4 ASM disks, two from each filer. The syntax for the disk group creation is as follows:
SQL> create diskgroup data normal redundancy
2 failgroup sitea disk 'ORCL:ASM01FILER01','ORCL:ASM01FILER02'
3* failgroup siteb disk 'ORCL:ASM02FILER01','ORCL:ASM02FILER02'
SQL> /
Diskgroup created.
As you can see, all disks from sitea are from filer01 and form one failure group. The other disks, originating from filer02, form the second failure group.
You can see the result in v$asm_disk, as this example shows:
Now all that remains to be done is to instruct the ASM instances to read from the local storage if possible. This is performed by setting an instance-specific init.ora parameter. I used the following syntax:
SQL> alter system set asm_preferred_read_failure_groups='DATA.SITEB' scope=both sid='+ASM2';
System altered.
SQL> alter system set asm_preferred_read_failure_groups='DATA.SITEA' scope=both sid='+ASM1';
System altered.
So I’m all set for the next step, the installation of the RDBMS software. But that’s for another post…
Finally I have some more time to work on the next article in this series, dealing with the setup of my two cluster nodes. This is actually going to be quite short compared to the other articles so far. This is mainly due to the fact that I have streamlined the deployment of new Oracle-capable machines to a degree where I can comfortably set up a cluster in 2 hours. It’s a bit more work initially, but it paid off. The setup of my reference VM is documented on this blog as well, search for virtualisation and opensuse to get to the article.
When I first started working in my lab environment I created a virtual machine called “rhel55ref”. In reality it’s OEL, because of Red Hat’s windooze like policy to require an activation code. I would have considered CentOS as well, but when I created the reference VM the community hadn’t provided the “update 5”. I like the brand new shiny things most :)
Seems like I’m lucky now as well: with the introduction of Oracle’s own Linux kernel I am ready for the future. Hopefully Red Hat will get their act together soon and release version 6 of their distribution. As much as I like Oracle I don’t want them to dominate the OS market too much. With Solaris now in their hands as well…
Anyway, to get started with my first node I cloned my template. Moving to /var/lib/xen/images, all I had to do was to “cp -a rhel55ref edcnode1”. One repetition for edcnode2 gave me my second node. Xen (or libvirt for that matter) stores the VM configuration in xenstore, a backend database which can be interrogated easily. So I dumped the XML configuration for my rhel55ref VM and stored it in edcnode1.xml and edcnode2.xml. The command to dump the information is “virsh dumpxml domainName > edcnode1.xml”, repeated for edcnode2.xml.
The domU folder contains the virtual disk for the root file system of my VM, called disk0. I then created a new “hard disk”, called disk1 to contain the Oracle binaries. Experience told me not to have that too small, 20G should be enough for my /u01 mountpoint for Grid Infrastructure and the RDBMS binaries.
[root@dom0]# /var/lib/xen/images/edcnode1 # dd if=/dev/zero of=disk01 bs=1 count=0 seek=20G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 1.3869e-05 s, 0.0 kB/s
I like to speed the file creation up by using the sparse file trick: the file disk1 will be reported to be 20G in size, but it will only use that space if the virtual machine needs it. It’s a bit like Oracle creating a temporary tablespace.
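To make the sparse file trick a bit more tangible, here is a minimal sketch that creates a small sparse file in a scratch directory and compares the size it reports with the space it really occupies. The function names are made up for this illustration, and it assumes GNU dd, stat and du as found on a typical Linux dom0.

```shell
# make_sparse FILE SIZE: create FILE with an apparent size of SIZE (dd suffixes such as 1G)
make_sparse() {
    # count=0 writes no data at all; seek only moves the end of the file
    dd if=/dev/zero of="$1" bs=1 count=0 seek="$2" 2>/dev/null
}

# size the guest will be told about, in bytes
apparent_bytes() { stat -c %s "$1"; }

# space really allocated on the host file system, in kB
allocated_kb() { du -k "$1" | cut -f1; }

demo=$(mktemp -d)
make_sparse "$demo/disk01" 1G
echo "apparent: $(apparent_bytes "$demo/disk01") bytes, allocated: $(allocated_kb "$demo/disk01") kB"
rm -rf "$demo"
```

The allocated figure only grows once the domU starts writing to the disk, so keep an eye on free space in the file system holding the images.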
With that information it’s time to modify the dumped XML file. Again it’s important to define MAC addresses for the network interfaces, otherwise the system will try and use dhcp for your NICs, destroying the carefully crafted /etc/sysconfig/network-scripts/ifcfg-eth{0,1,2} files. Oh, and remember that the first three octets are reserved for Xen, so don’t change “00:16:3e”! Your UUID also has to be unique. In the end my first VM’s XML description looked like this:
You can see that the interfaces refer to br1, br2, and br3. These are the ones that were previously defined in the first article. The tag “” in the tag doesn’t matter as that will be dynamically assigned anyway.
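If you don’t want to invent MAC addresses by hand, something along these lines can produce candidates that keep the reserved 00:16:3e prefix. This is only a sketch: random_xen_mac is a hypothetical helper name, and randomness alone does not guarantee uniqueness, so compare the result with the MACs already used by your other domUs before pasting it into the XML file.

```shell
# random_xen_mac: print a random MAC address with the Xen-reserved 00:16:3e prefix
random_xen_mac() {
    # three random bytes from the kernel, printed as hex pairs separated by colons
    suffix=$(od -An -N3 -tx1 /dev/urandom | tr -s ' ' ':' | sed 's/^://; s/:$//')
    printf '00:16:3e:%s\n' "$suffix"
}

# one address per virtual NIC (eth0, eth1, eth2)
random_xen_mac
random_xen_mac
random_xen_mac
```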
When done, you can define the new VM and start it:
You are directly connected to the VM’s console (80×24, just like in the old times!) and have to wait a looooong time for the DHCP requests for eth0, eth1 and eth2 to time out. This is the first thing to address. As root, log in to the system and navigate straight to /etc/sysconfig/network-scripts to change ifcfg-eth{0,1,2}. Alternatively, use system-config-network-tui to change the network settings.
The following settings should be used for edcnode1:
eth0: 192.168.99.56/24
eth1: 192.168.100.56/24
eth2: 192.168.101.56/24
These are the settings for edcnode2:
eth0: 192.168.99.58/24
eth1: 192.168.100.58/24
eth2: 192.168.101.58/24
The nameserver for both is my dom0 – in this case 192.168.99.10. Enter the appropriate hostname as well as the nameserver. Note that 192.168.99.57 and 59 are reserved for the node VIPs, hence the “gap”. Then edit /etc/hosts to enter the information about the private interconnect, which for obvious reasons is not included in DNS. If you like, persist your public and VIP information in /etc/hosts as well. Don’t do this with the SCAN, it’s not suggested to have the SCAN resolve through /etc/hosts although it works.
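For those who prefer editing the files over the TUI, here is a rough sketch of what a static interface file could look like, generated by a small helper. The helper name ifcfg_static is hypothetical and the set of keys is an assumption about a minimal static setup, so compare it with the files your build already ships before overwriting anything.

```shell
# ifcfg_static DEVICE IPADDR NETMASK: print a minimal static ifcfg file
ifcfg_static() {
    cat <<EOF
DEVICE=$1
BOOTPROTO=static
IPADDR=$2
NETMASK=$3
ONBOOT=yes
TYPE=Ethernet
EOF
}

# edcnode1, public interface; once the output looks right, redirect it to
# /etc/sysconfig/network-scripts/ifcfg-eth0
ifcfg_static eth0 192.168.99.56 255.255.255.0
```

The same call with eth1/192.168.100.56 and eth2/192.168.101.56 covers the other two interfaces of edcnode1.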
Now’s the big moment: restart the network services and get out of the uncomfortable 80×24 character limitation:
[root@edcnode1]# service network restart
The complete configuration is printed here for the sake of completeness for edcnode1:
Next on the agenda is the iscsi-initiator. This isn’t part of my standard build and had to be added. All my software is exported from the dom0 via NFS and mounted to /mnt/
It's important to edit the initiator name, i.e. the name the initiator reports back to OpenFiler. I changed it to include edcnode1 and edcnode2 on their respective hosts. The file to edit is /etc/iscsi/initiatorname.iscsi
Time to get serious now:
[root@edcnode1 ~]# /etc/init.d/iscsi start
iscsid is stopped
Starting iSCSI daemon: [ OK ]
[ OK ]
Setting up iSCSI targets: iscsiadm: No records found!
[ OK ]
We are ready to roll. First, we need to discover the targets from the OpenFiler appliance. Start with the first one, filer01:
A restart of the iscsi service will automatically log in and persist the settings (this is very wide output, which works best in 1280xsomething resolution):
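A discovery against a portal is typically done with iscsiadm in sendtargets mode. The sketch below wraps the call in a small dry-run helper (run and discover_targets are names invented for this example) so that it only prints the command by default; set DRY_RUN=0 and run it as root to really query the filer.

```shell
# DRY_RUN=1 (the default here) prints commands instead of executing them
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# discover_targets PORTAL_IP: ask the portal which iSCSI targets it offers
discover_targets() {
    run iscsiadm -m discovery -t sendtargets -p "$1"
}

discover_targets 192.168.101.50   # filer01 on the storage network
```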
[root@edcnode1 ~]# service iscsi restart
Stopping iSCSI daemon:
iscsid dead but pid file exists [ OK ]
Starting iSCSI daemon: [ OK ]
[ OK ]
Setting up iSCSI targets: Logging in to [iface: default, target: iqn.2006-01.com.openfiler:asm01Filer01, portal: 192.168.101.50,3260]
Logging in to [iface: default, target: iqn.2006-01.com.openfiler:ocrvoteFiler01, portal: 192.168.101.50,3260]
Logging in to [iface: default, target: iqn.2006-01.com.openfiler:asm02Filer01, portal: 192.168.101.50,3260]
Login to [iface: default, target: iqn.2006-01.com.openfiler:asm01Filer01, portal: 192.168.101.50,3260]: successful
Login to [iface: default, target: iqn.2006-01.com.openfiler:ocrvoteFiler01, portal: 192.168.101.50,3260]: successful
Login to [iface: default, target: iqn.2006-01.com.openfiler:asm02Filer01, portal: 192.168.101.50,3260]: successful
[ OK ]
Fine! Now over to fdisk the new devices. I know that my “local” storage is named /dev/xvd*, so anything new (“/dev/sd*”) will be iSCSI provided storage. If you are unsure you can always check the /var/log/messages file to see which devices have just been discovered. You should see something similar to this output:
The output will continue with /dev/sdb and other devices exported by the filer.
Prepare the local Oracle Installation
Using fdisk, modify /dev/xvdb, create a partition spanning the whole disk and set its type to “8e” – Linux LVM. It’s always a good idea to use LVM to install Oracle binaries into, it makes later extension of a filesystem easier. I’ll add the fdisk output here for this device but won’t for later partitioning exercises.
[root@edcnode1 ~]# fdisk /dev/xvdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.
The number of cylinders for this disk is set to 1305.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-1305, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-1305, default 1305):
Using default value 1305
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
Once /dev/xvdb1 is ready, we need to start its transformation into a logical volume. First, a physical volume is to be created:
[root@edcnode1 ~]# pvcreate /dev/xvdb1
Physical volume "/dev/xvdb1" successfully created
The physical volume (“PV”) is then used to form a volume group (“VG”). In real life, you’d probably have more than 1 PV to form a VG… I named my volume group “oracle_vg”. The existing volume group is called “root_vg” by the way.
[root@edcnode1 ~]# vgcreate oracle_vg /dev/xvdb1
Volume group "oracle_vg" successfully created
Wonderful! I never quite remember how many extents this VG has, so I need to query it. When using --size 10g the LV creation will throw an error: some internal overhead reduces the available capacity to something just shy of 10G:
[root@edcnode1 ~]# vgdisplay oracle_vg
--- Volume group ---
VG Name oracle_vg
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 10.00 GB
PE Size 4.00 MB
Total PE 2559
Alloc PE / Size 0 / 0
Free PE / Size 2559 / 10.00 GB
VG UUID QgHgnY-Kqsl-noAR-VLgP-UXcm-WADN-VdiwO7
Right, so now let’s create a logical volume (“LV”) with 2559 extents:
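The arithmetic behind “just shy of 10G” is easy to check with the numbers from vgdisplay. The sketch below uses two hypothetical helper functions; the commented commands at the end are an assumption about how the LV could be created, with “orabin_lv” as a made-up LV name and ext3 as an assumed file system.

```shell
# extents_to_mb EXTENTS PE_SIZE_MB: capacity offered by a number of physical extents
extents_to_mb() { echo $(( $1 * $2 )); }

# mb_to_extents SIZE_MB PE_SIZE_MB: extents needed for a requested size, rounded up
mb_to_extents() { echo $(( ($1 + $2 - 1) / $2 )); }

echo "the VG offers $(extents_to_mb 2559 4) MB"
echo "a 10G LV would need $(mb_to_extents 10240 4) extents"

# Allocating by extent count avoids the size mismatch, for example:
#   lvcreate -l 2559 -n orabin_lv oracle_vg
#   mkfs.ext3 /dev/oracle_vg/orabin_lv
```

With 4 MB extents the volume group is exactly one extent short of a full 10G, which is why asking for the extent count works where a round size does not.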
You are done! Create the mountpoint for your oracle installation, /u01/ in my case, and grant oracle:oinstall ownership to it. In this lab exercise I didn’t create a separate owner for the Grid Infrastructure to avoid potentially undiscovered problems in 11.2.0.2 and stretched RAC. Finally add this to /etc/fstab to make it persistent:
Now continue to partition the iSCSI volumes, but don’t create file systems on top of them. You should not assign a partition type other than the default “Linux” to it either.
ASMLib
Yes I know… the age old argument, but I decided to use it anyway. The reason is simple: scsi_id doesn’t return a value in para-virtualised Linux, which makes it impossible to set up device name persistence with udev. And ASMLib is easier to use anyway! But if your system administrators are database agnostic and not willing to learn the basics about ASM, then rolling out ASMLib is probably not a good idea. It’s only a matter of time until someone executes an “rpm -Uhv kernel*” on your box and of course a) doesn’t tell the DBAs and b) doesn’t bother applying the ASMLib kernel module. But I digress.
Before you are able to use ASMLib you have to configure it on each cluster node. A sample session could look like this:
[root@edcnode1 ~]# /etc/init.d/oracleasm configure
Configuring the Oracle ASM library driver.
This will configure the on-boot properties of the Oracle ASM library
driver. The following questions will determine whether the driver is
loaded on boot and what permissions it will have. The current values
will be shown in brackets ('[]'). Hitting <ENTER> without typing an
answer will keep that current value. Ctrl-C will abort.
Default user to own the driver interface []: oracle
Default group to own the driver interface []: dba
Start Oracle ASM library driver on boot (y/n) [n]:
Scan for Oracle ASM disks on boot (y/n) [y]:
Writing Oracle ASM library driver configuration: done
Dropping Oracle ASMLib disks: [ OK ]
Shutting down the Oracle ASMLib driver: [ OK ]
[root@edcnode1 ~]#
Now with this done, it is possible to create the ASMLib maintained ASM disks. For the LUNs presented by filer01 these are:
ASM01FILER01
ASM02FILER01
OCR01FILER01
OCR02FILER01
The disks are created using the /etc/init.d/oracleasm createdisk command as in these examples:
[root@edcnode1 ~]# /etc/init.d/oracleasm createdisk asm01filer01 /dev/sda1
Marking disk "asm01filer01" as an ASM disk: [ OK ]
[root@edcnode1 ~]# /etc/init.d/oracleasm createdisk asm02filer01 /dev/sdc1
Marking disk "asm02filer01" as an ASM disk: [ OK ]
[root@edcnode1 ~]# /etc/init.d/oracleasm createdisk ocr01filer01 /dev/sdb1
Marking disk "ocr01filer01" as an ASM disk: [ OK ]
[root@edcnode1 ~]# /etc/init.d/oracleasm createdisk ocr02filer01 /dev/sdd1
Marking disk "ocr02filer01" as an ASM disk: [ OK ]
Switch over to the second node now to validate the configuration and to continue the configuration of the iSCSI LUNs from filer02. Define the domU with a similar configuration file as shown above for edcnode1, and start the domU. Once the wait for DHCP timeouts is over and you are presented with a login, set up the network as shown above. Install the iscsi initiator package, change the initiator name and discover the targets from filer02 in addition to those from filer01.
Still on the second node, continue with the mounting of the iSCSI devices:
[root@edcnode2 ~]# service iscsi start
iscsid (pid 2802) is running...
Setting up iSCSI targets: Logging in to [iface: default, target: iqn.2006-01.com.openfiler:asm02Filer02, portal: 192.168.101.51,3260]
Logging in to [iface: default, target: iqn.2006-01.com.openfiler:ocrvoteFiler02, portal: 192.168.101.51,3260]
Logging in to [iface: default, target: iqn.2006-01.com.openfiler:asm01Filer01, portal: 192.168.101.50,3260]
Logging in to [iface: default, target: iqn.2006-01.com.openfiler:asm01Filer02, portal: 192.168.101.51,3260]
Logging in to [iface: default, target: iqn.2006-01.com.openfiler:ocrvoteFiler01, portal: 192.168.101.50,3260]
Logging in to [iface: default, target: iqn.2006-01.com.openfiler:asm02Filer01, portal: 192.168.101.50,3260]
Login to [iface: default, target: iqn.2006-01.com.openfiler:asm02Filer02, portal: 192.168.101.51,3260]: successful
Login to [iface: default, target: iqn.2006-01.com.openfiler:ocrvoteFiler02, portal: 192.168.101.51,3260]: successful
Login to [iface: default, target: iqn.2006-01.com.openfiler:asm01Filer01, portal: 192.168.101.50,3260]: successful
Login to [iface: default, target: iqn.2006-01.com.openfiler:asm01Filer02, portal: 192.168.101.51,3260]: successful
Login to [iface: default, target: iqn.2006-01.com.openfiler:ocrvoteFiler01, portal: 192.168.101.50,3260]: successful
Login to [iface: default, target: iqn.2006-01.com.openfiler:asm02Filer01, portal: 192.168.101.50,3260]: successful
Partition the disks from filer02 the same way as shown in the previous example. On edcnode2, fdisk reported the following as new disks:
Disk /dev/sda doesn't contain a valid partition table
Disk /dev/sdb doesn't contain a valid partition table
Disk /dev/sdf doesn't contain a valid partition table
Disk /dev/sda: 10.6 GB, 10670309376 bytes
Disk /dev/sdb: 2650 MB, 2650800128 bytes
Disk /dev/sdf: 10.7 GB, 10737418240 bytes
Note that /dev/sda and /dev/sdf are the 2 10G LUNs for ASM data, and /dev/sdb is the OCR/voting disk combination. Next, create the additional ASMLib disks:
[root@edcnode2 ~]# /etc/init.d/oracleasm scandisks
...
[root@edcnode2 ~]# /etc/init.d/oracleasm createdisk asm01filer02 /dev/sda1
Marking disk "asm01filer02" as an ASM disk: [ OK ]
[root@edcnode2 ~]# /etc/init.d/oracleasm createdisk asm02filer02 /dev/sdf1
Marking disk "asm02filer02" as an ASM disk: [ OK ]
[root@edcnode2 ~]# /etc/init.d/oracleasm createdisk ocr01filer02 /dev/sdb1
Marking disk "ocr01filer02" as an ASM disk: [ OK ]
[root@edcnode2 ~]# /etc/init.d/oracleasm listdisks
ASM01FILER01
ASM01FILER02
ASM02FILER01
ASM02FILER02
OCR01FILER01
OCR01FILER02
OCR02FILER01
Perform another scandisks command on edcnode1 to have all the disks:
[root@edcnode1 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@edcnode1 ~]# /etc/init.d/oracleasm listdisks
ASM01FILER01
ASM01FILER02
ASM02FILER01
ASM02FILER02
OCR01FILER01
OCR01FILER02
OCR02FILER01
Summary
All done! And I seriously thought initially that this was going to be a shorter post than the others, how wrong I was. Congratulations on having arrived here at the bottom of the article by the way.
In the course of this post I prepared my virtual machines to begin the installation of Grid Infrastructure. The ASM disk names will be persistent across reboots thanks to ASMLib, and no messing around with udev for that matter. You might notice that there are 2 ASM disks from filer01 but only 1 from filer02 for the voting disk/OCR diskgroup, and that’s for a reason. I’m cheeky and won’t tell you here, that’s for another post later…
On to the next part in the series. This time I am showing how I prepared the iSCSI openFiler “appliances” on my host. This is quite straightforward, if one knows how it works :)
Setting up the openFiler appliance on the dom0
OpenFiler 2.3 has a special download option suitable for paravirtualised Xen hosts. Proceed by downloading the file from your favourite mirror. The file name I am using is “openfiler-2.3-x86_64.tar.gz”; you might have to pick another one if you don’t want a 64bit system.
All my domUs go to /var/lib/xen/images/vm-name, and so do the openFiler ones. I am not using LVM to present storage to the domUs because my system came without free space I could have turned into a physical volume. Here are the steps to create the openFiler; remember to repeat them 3 times, once for each storage provider.
Begin with the first openFiler appliance. Whenever you see numbers in {} then that implies that the operation has to be repeated for each of the numbers in the curly braces.
# cd /var/lib/xen/images/
# mkdir filer0{1,2,3}
# cd filer0{1,2}
Next create the virtual disks for the appliance. I use 4G for the root file system and one 5G + 2 10G disks. The 5G disk will later on be part of the OCR and voting files disk group, whereas the other two are going to be the local ASM disks. These steps are for filer01 and filer02, the iSCSI target providers.
# dd if=/dev/zero of=disk01 bs=1 count=0 seek=4G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 1.3296e-05 s, 0.0 kB/s
# dd if=/dev/zero of=disk02 bs=1 count=0 seek=5G
# dd if=/dev/zero of=disk03 bs=1 count=0 seek=10G
# dd if=/dev/zero of=disk04 bs=1 count=0 seek=10G
For the NFS filer03, you only need two 4G disks, disk01 and disk02. For all filers, a root partition has to be created. You also have to create a file system on the “root” volume:
# mkfs.ext3 disk01
mke2fs 1.41.9 (22-Aug-2009)
disk01 is not a block special device.
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
262144 inodes, 1048576 blocks
52428 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1073741824
32 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 21 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
openSUSE-112-64-minimal:/var/lib/xen/images/filer01 #
Prepare to mount the root volume as a loop device, and also label the disk. Once mounted, copy the contents of the downloaded openfiler tarball into it as shown in this example:
# e2label disk01 root
# mkdir tmpmnt/
# mount -o loop disk01 tmpmnt/
# cd tmpmnt
# tar --gzip -xvf /m/downloads/openfiler-2.3-x86_64.tar.gz
With this done, we need to extract the kernel and the initial RAMdisk for later use in the xen config file. I have not experimented with pygrub for the openfiler appliances, someone with more knowledge may correct me here. This in any case works for this demonstration:
# mkdir /m/xenkernels/openfiler
# cp -a /var/lib/xen/images/filer01/tmpmnt/boot /m/xenkernels/openfiler
Here are the files now stored inside the kernel directory on the dom0:
# ls -l /m/xenkernels/openfiler/
total 9276
-rw-r--r-- 1 root root 770924 May 30 2008 System.map-2.6.21.7-3.20.smp.gcc3.4.x86_64.xen.domU
-rw-r--r-- 1 root root 32220 Jun 28 2008 config-2.6.21.7-3.20.smp.gcc3.4.x86_64.xen.domU
drwxr-xr-x 2 root root 4096 Jul 1 2008 grub
-rw-r--r-- 1 root root 1112062 Jul 1 2008 initrd-2.6.21.7-3.20.smp.gcc3.4.x86_64.xen.domU.img
-rw-r--r-- 1 root root 5986208 May 14 18:01 vmlinux
-rw-r--r-- 1 root root 1558259 Jun 28 2008 vmlinuz-2.6.21.7-3.20.smp.gcc3.4.x86_64.xen.domU
With this information at hand, we can construct a xen configuration file, such as the following:
In plain English, this verbose XML file describes the VM as a paravirtualised linux system with 4 hard disks and 2 network interfaces. The MAC must be static, otherwise you’ll end up with network problems each time you boot. The MAC also has to be unique across all currently started domUs! Change the UUID, name, paths to the disks (“source file”) and MAC addresses for filer02. The same applies to filer03, but this one only uses 2 disks, xvda and xvdb, so please remove the disk tags for disk03 and disk04.
Define the VM in xenstore and start it, while staying attached to the console:
Repeat this for filer02.xml and filer03.xml in separate terminal sessions.
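As a sketch of the define-and-start step: the helper below registers the XML file and then boots the domU attached to its console. It assumes the domain name inside the XML matches the file name, that your virsh accepts --console on start (otherwise run “virsh console NAME” afterwards), and it only prints the commands unless DRY_RUN=0 is set. The helper names are made up for this example.

```shell
# DRY_RUN=1 (the default here) prints commands instead of executing them
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# define_and_start NAME: register NAME.xml with libvirt, then start the domU
# and stay attached to its console
define_and_start() {
    run virsh define "$1.xml" &&
    run virsh start "$1" --console
}

define_and_start filer01
```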
Eventually, you are going to be presented with the welcome screen:
Welcome to Openfiler NAS/SAN Appliance, version 2.3
You do not appear to have networking. Please login to start networking.
Configuring the OpenFiler domU
Log in as root (which doesn’t have a password, you should change this now!) and correct the missing network information. We have 2 virtual NICs, eth0 for the public network, and eth1 for the storage network. As root, navigate to /etc/sysconfig/network-scripts/ and edit ifcfg-eth{0,1}. In our example, we need 2 static interfaces. For eth0 for example, the existing file has the following contents:
[root@localhost network-scripts]# vi ifcfg-eth0
# Device file installed by rBuilder
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
TYPE=Ethernet
Similarly, change ifcfg-eth1 for address 192.168.101.50 and restart the network:
[root@localhost network-scripts]# service network restart
After this, ifconfig should report the correct interfaces and you are ready to access the web console.
The network for filer02 uses 192.168.99.51 for eth0 and 192.168.101.51 for eth1. Similarly, filer03 uses 192.168.99.52 for eth0 and 192.168.101.52 for eth1.
All domUs are in the internal network, so you have to set up some port forwarding rules. The easiest way to do this is in your $HOME/.ssh/config file. For my server, I set up the following options:
martin@linux-itgi:~> cat .ssh/config
Host *eq8
HostName eq8
User martin
Compression yes
# note the white space
LocalForward 4460 192.168.99.50:446
LocalForward 4470 192.168.99.51:446
LocalForward 4480 192.168.99.52:446
LocalForward 5902 192.168.99.56:5902
# other hosts
Host *
PasswordAuthentication yes
FallBackToRsh no
martin@linux-itgi:~>
I am forwarding the local ports 4460, 4470, 4480 on my PC to the openfiler appliances. This way, I can enter https://localhost:44{6,7,8}0 to access the web frontend for the openFiler appliance. This is needed, as you can’t really administer them otherwise. When using Firefox, you’ll get a warning about certificates. I have added security exceptions because I know the web server is not conducting a man in the middle attack on me. You should always be careful adding unknown certificates to your browser in other cases.
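If you would rather not touch your ssh configuration, the same forwards can be given on the command line with -L. The following sketch only prints the resulting command; forward_spec is a hypothetical helper, and the user and host are the ones from my config file.

```shell
# forward_spec LOCAL_PORT HOST REMOTE_PORT: build one -L argument for ssh
forward_spec() { printf '%s %s:%s:%s' -L "$1" "$2" "$3"; }

# one-off equivalent of the three openFiler LocalForward lines
echo ssh \
    $(forward_spec 4460 192.168.99.50 446) \
    $(forward_spec 4470 192.168.99.51 446) \
    $(forward_spec 4480 192.168.99.52 446) \
    martin@eq8
```

Each -L argument reads as local port, then the target host and port as seen from the server you ssh into.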
Administering OpenFiler
NOTE: The following steps are for filer01 and filer02 only!
Once logged in as user “openfiler” (the default password is “password”), you might want to secure that password. Click on Accounts -> Admin Password and make the changes you like.
Next I recommend you verify the system setup. Click on System and review the settings. You should see the network configured correctly, and can change the hostname to filer0{1,2}.localdomain. Save your changes. Networking settings should be correct, if not you can update them here.
Next we need to partition our block devices. Previously unknown to me, openFiler uses the “gpt” format to partition disks. Click on Volumes -> Block devices to see all the block devices. Since you are running a domU, you can’t see the root device /dev/xvda. For each device (xvd{b,c,d}) create one partition spanning the whole of the “disk”. You can do so by clicking on the device name. Scroll down to the “Create partition in /dev/xvdx” section and fill in the data. Click “create” to create the partition. Note that you can’t see the partitions in fdisk should you log in to the appliance as root.
Once the partitions are created, it’s time to create volumes to be exported as iSCSI targets. Still in “Volumes”, click on “Volume Groups”. I chose to create the following volume groups:
ASM_VG with member PVs xvdc1 and xvdd1
OCRVOTE_VG with member PV xvdb1
Once the volume groups are created, you should proceed by creating logical volumes within these. Click on “Add Volume” to access this screen. You have a drop-down menu to select your volume group. For OCRVOTE_VG I opted to create the following logical volumes (you have to set the type to iSCSI rather than XFS):
ocrvote01_lv, about 2.5G in size, type iSCSI
ocrvote02_lv, about 2.5G in size, type iSCSI
For volume group ASM_VG, I created these logical volumes:
asmdata01_lv, about 10G in size, type iSCSI
asmdata02_lv, about 10G in size, type iSCSI
We are almost there! The storage has been carved out of the pool of available storage, and what remains to be done is the definition of the iSCSI targets and ACLs. You can define very fine grained access to iSCSI targets, and even for iSCSI discovery! This example tries to keep it simple and doesn’t use any CHAP authentication for iSCSI targets and discovery. In the real world you’d very much want to implement these security features though.
Preparing the iSCSI part
We are done for now on the Volumes tab. First, we need to enable the iSCSI target server. In “Services”, ensure that the “iSCSI target server” is enabled. If not, click on the link next to it. Before we can export any LUNs, we need to define who is eligible to mount them. In openFiler, this is configured via ACLs. Go to the “System” tab and scroll down to the “Network access configuration” section. Fill in the details of our cluster nodes here as shown below. These are the settings for edcnode1:
Name: edcnode1
Network/Host: 192.168.101.56
Netmask: 255.255.255.255 (IMPORTANT: it has to be 255.255.255.255, NOT 255.255.255.0)
Type: share
The settings for edcnode2 are identical, except for the IP address which is 192.168.101.58. Remember, we are configuring the “STORAGE” network here! Click on “Update” to make the changes permanent. You are now ready to create the iSCSI targets, of which there will be 3: one for the OCR/voting disks, and two for the ASM LUNs.
Back on the Volumes tab, click on “iSCSI targets”. You will be notified that no targets have been defined yet. You will have to define the following targets for filer01:
iqn.2006-01.com.openfiler:ocrvoteFiler01
iqn.2006-01.com.openfiler:asm01Filer01
iqn.2006-01.com.openfiler:asm02Filer01
Leave the default settings, they will do for our example. You simply add the name to the “Target IQN” field and then click on “Add”. The targets currently don’t support any LUNs yet, something that needs addressing in this step.
Switch to target iqn.2006-01.com.openfiler:ocrvoteFiler01 and then use the tab “LUN mapping” to map a LUN. In the list of available LUNs add ocrvote01_lv and ocrvote02_lv to the target. Click on “network ACL” and allow access to the LUN from edcnode1 and edcnode2. For the first ASM target, map asmdata01_lv and set the permissions, then repeat for the last target with asmdata02_lv.
Create the following targets for filer02:
iqn.2006-01.com.openfiler:ocrvoteFiler02
iqn.2006-01.com.openfiler:asm01Filer02
iqn.2006-01.com.openfiler:asm02Filer02
The mappings and settings for the ASM targets are identical to filer01, but for the OCRVOTE target only export the first logical volume, i.e. ocrvote01_lv.
NFS export
The third filer, filer03, is a little bit different in that it only exports an NFS share to the cluster. It only has one data disk, disk02. In a nutshell, create the filer as described to the point where it’s accessible via its web interface. The high level steps for it are:
Partition /dev/xvdb into 1 partition spanning the whole disk
Create a volume group ocrvotenfs_vg from /dev/xvdb1
Create a logical volume nfsvol_lv, approx 1G in size with ext3 as its file system
Enable the NFS v3 server (Services tab)
From there on the procedure is slightly different. Click on “Shares” to access the network shares available from the filer. You should see your volume group with the logical volume nfsvol_lv. Click on the link “nfsvol_lv” and enter “ocrvote” as subfolder name. A new folder icon with the name ocrvote will appear. Click on this one, and in the pop-up dialog click on “Make share”. You should set the following on the now opening lengthy configuration dialog:
Public guest access
Host access for edcnode1 and edcnode2 for NFS RW (select the radio button)
Click on edit to access special options for edcnode1 and edcnode2. Ensure that the anonymous UID and GID match the ones for the grid software owner. The UID/GID mapping has to be “all_squash”, IO mode has to be “sync”. You can ignore the write delay and origin port for this example.
Leave all other protocols deselected
Click update to make the changes permanent
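For readers more at home with plain NFS servers, the settings in this dialog roughly correspond to an /etc/exports entry like the one printed by the sketch below. The export path, the client address and the UID/GID of 500 are assumptions made purely for illustration; openFiler maintains its own configuration, so treat this as a way to read the options rather than something to paste into the appliance.

```shell
# exports_line PATH CLIENT UID GID: print one /etc/exports style entry using
# the options selected in the share dialog (rw, sync, all_squash)
exports_line() {
    printf '%s %s(rw,sync,all_squash,anonuid=%s,anongid=%s)\n' "$1" "$2" "$3" "$4"
}

# hypothetical path and IDs; one entry per cluster node
exports_line /mnt/ocrvotenfs_vg/nfsvol_lv/ocrvote 192.168.101.56 500 500
exports_line /mnt/ocrvotenfs_vg/nfsvol_lv/ocrvote 192.168.101.58 500 500
```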
That was it! The storage layer is now perfectly set up for the cluster nodes which I’ll discuss in a follow-on post.
openSUSE-112-64-minimal:/var/lib/xen/images/filer01 # mkfs.ext3 disk01
mke2fs 1.41.9 (22-Aug-2009)
disk01 is not a block special device.
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
262144 inodes, 1048576 blocks
52428 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1073741824
32 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 21 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
openSUSE-112-64-minimal:/var/lib/xen/images/filer01 # mkdir tmpmnt
e2label disk01 root
mount -o loop disk01 tmpmnt/
cd tmpmnt
tar --gzip -xvf ../openfiler-2.3-x86_64.tar.gz
# only for the first time
mkdir /m/xenkernels/openfiler
cp -a /var/lib/xen/images/filer01/tmpmnt/boot /m/xenkernels/openfiler/
I promised in the introduction to introduce my lab environment in the first part of the series. So here we go…
OpenSuSE
Similar to the Fedora project, SuSE (now Novell) have come up with a community distribution some time ago which can be freely downloaded from the Internet. All these community editions give the users a glimpse at the new and upcoming Enterprise distribution, such as RHEL or SLES.
I have chosen the OpenSuSE 11.2 distribution for the host operating system. It has been updated to xen 3.4.1, kernel 2.6.31.12 and libvirt 0.7.2. These packages provide a stable execution environment for the virtual machines we are going to build. Alternative xen-based solutions have not been considered. During initial testing I found that Oracle VM 2.1.x virtual machines could not mount iSCSI targets without kernel-panicking and crashing. Citrix’s xenserver is too commercial, the community edition is lacking needed features, and finally Virtual Iron had already been purchased by Oracle.
All kernel 2.6.18-x based distributions such as Red Hat 5.x and clones were discarded for lack of features and their age. After all, 2.6.18 was introduced three years ago and although features were back-ported to it, xen support is way behind what I needed. The final argument in favour of OpenSuSE was the fact that SuSE provide a xen-capable 2.6.31 kernel out of the box. Although it is perfectly possible to build one’s own xen-kernel, this is an advanced topic and not covered here. OpenSuSE also makes configuring the networking bridges very straightforward through good integration into yast, the distribution’s setup and configuration tool.
The host system uses the following components:
Single Intel Core i7 processor
24GB RAM
1.5 TB hard disk space in RAID 1
The whole configuration can be rented from hosting providers, something I have chosen to do. The host has run a four node 11.2 cluster plus 2 additional virtual machines for Enterprise Manager Grid Control 11.1 without problems. In my experience the huge amount of memory is the greatest benefit of the above configuration. Allocating four GB of RAM to each VM helped a lot.
Terminology
You should be roughly familiar with the concepts behind XEN virtualisation; the following list explains the most important terminology.
Hypervisor: The enabling technology to run virtual machines. The hypervisor used in this document is the xen hypervisor.
dom0: The dom(ain) 0 is the name for the host. The dom0 has full access to all the system’s peripherals.
domU: In Xen parlance, the domU is a virtual machine. Xen differentiates between paravirtualised and fully virtualised machines. Paravirtualisation broadly speaking offers superior performance, but requires a modified operating system. I am going to use paravirtualised domUs.
Bridge: A (virtual) network device used for IP communication between virtual machines and the host.
Prerequisites
Start off by installing the openSuSE 11.2 distribution, choosing either the GNOME or the KDE desktop. Long years of exposure to Red Hat based systems made me choose the GNOME desktop. Once the installation has completed, start the yast administration tool and click on the “install hypervisor and tools” button. This will install the xen-aware kernel and add the necessary entry to the GRUB boot loader. Once completed, reboot the server and boot the xen kernel. You don’t need to configure any network bridges at this stage, even though yast prompts you to do so.
Networking on the dom0
RAC requires at least 2 NICs per cluster node, in addition to fibre channel connectivity to the storage. In our example I am going to use iSCSI targets for storage, provided by the OpenFiler community edition. It is good practice to separate storage communication from any other communication, the same as with the cluster interconnect. Therefore, a third bridge will be used. Production setups would of course use a different setup, but as iSCSI serves the purpose quite well I decided to implement it. Also, a production cluster would feature redundancy everywhere, including NICs and HBAs. Remember that redundancy can prevent outages!
The communication between the cluster nodes will be channeled over virtual switches, so-called bridges. It used to be quite difficult to set up a network bridge for XEN, but openSuSE’s yast configuration tool makes this quite simple. My host has the following bridges configured:
br0 This is the only bridge that has a physical interface bridged, normally eth0 or bond0. It won’t be used for the cluster and is used purely to allow my ssh traffic coming in. If not yet configured,
br1 I used br1 as a host only network for the public cluster communication. It does not have a bridged physical interface
br2 This is in use for the private cluster interconnect. This bridge doesn’t have a physical NIC configured
br3 Finally this bridge will be used to allow iSCSI communication between the filers and the cluster nodes. Neither does this have a physical NIC configured
I said a number of times that configuring a bridge was quite tedious, and for some other distributions it still is. It requires quite a bit of knowledge of the bridge-utils package and the naming conventions for virtual and physical network interfaces in XEN. To configure a bridge in OpenSuSE, start yast, and click on the “Network Settings” icon to start the network configuration.
The configuration tool will load the current network configuration. Bridge br0 should be configured to bridge the public interface name, usually eth0. All other bridges should not bridge physical devices, effectively making them host-only. If you haven’t configured a network bridge when you installed the xen hypervisor and tools, it’s time to do so now. Identify your external networking device in the list of devices shown on the “Overview” page. Take note of all settings such as IP address, netmask, gateway, MTU, routes, etc. You can get this information by selecting your external NIC and clicking on the “Edit” button.
You should see a Network Bridge entry in the list of interfaces, which probably uses DHCP. Select it and click on “Edit”. Enter all the details you just copied from your actual physical NIC and ensure that under tab “Bridged Devices” that interface is listed. Click on “Next”. Confirm the warning that a device is already configured with these settings. This will effectively deconfigure the physical device and replace it with the bridge.
Adding the host-only bridges is easier. Select the “Add” option next, and on the following screen make sure that “Bridge” is selected as the device type. The configuration name will be set correctly; don’t change it unless you know what you are doing. In the following Network Card Setup screen, assign a static IP address, a subnet, and optionally a hostname. I left the hostname blank for all but the public bridge br0.
Finish the configuration assistant. Before restarting the network ensure you have an alternative means of getting to your machine, for example using a console. If the network is badly configured, you might be locked out.
The network setup
The following IP addresses are used for the example cluster:
IP Address Range          Used For
192.168.99.50-52          The IP addresses for the web interfaces of the openFiler iSCSI “SAN”
192.168.99.53-55          The Single Client Access Name for the cluster
192.168.99.56-59          Node public and virtual IP addresses
192.168.100.56 and 58     Private cluster interconnect
192.168.101.56 and 58     Storage subnet for the cluster nodes
192.168.101.50-52         The IP addresses for the iSCSI interfaces of the openFiler “SAN”
Some of these addresses need to go into DNS. Edit your DNS server’s zone files and add the following to the zone’s forward lookup file:
; extended distance cluster
filer01 IN A 192.168.99.50
filer02 IN A 192.168.99.51
filer03 IN A 192.168.99.52
edc-scan IN A 192.168.99.53
edc-scan IN A 192.168.99.54
edc-scan IN A 192.168.99.55
edcnode1 IN A 192.168.99.56
edcnode1-vip IN A 192.168.99.57
edcnode2 IN A 192.168.99.58
edcnode2-vip IN A 192.168.99.59
The reverse lookup looks as follows:
; extended distance cluster
50 IN PTR filer01.localdomain.
51 IN PTR filer02.localdomain.
52 IN PTR filer03.localdomain.
53 IN PTR edc-scan.localdomain.
54 IN PTR edc-scan.localdomain.
55 IN PTR edc-scan.localdomain.
56 IN PTR edcnode1.localdomain.
57 IN PTR edcnode1-vip.localdomain.
58 IN PTR edcnode2.localdomain.
59 IN PTR edcnode2-vip.localdomain.
The public network maps to bridge br1 on network 192.168.99/24, the private network is supported through br2 in the 192.168.100/24 subnet, and the storage will go through br3, using the 192.168.101/24 subnet.
Reload the DNS service now to make these changes active.
You should use the “host” utility to check if the SCAN resolves in DNS to be sure it all works.
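The check with “host” can be scripted. The sketch below counts the address lines that host prints for a name and compares the count with the three SCAN addresses defined above; the helper names count_addresses and check_scan are made up for this example, and the usage line assumes your resolver knows the localdomain zone.

```shell
# Count the "has address" lines in `host` output read from stdin.
# grep -c prints 0 (and returns non-zero) when nothing matches.
count_addresses() {
    grep -c 'has address'
}

# Succeed only if the name resolves to the expected number of addresses.
# $1 = name to look up, $2 = expected number of A records
check_scan() {
    found=$(host "$1" 2>/dev/null | count_addresses) || true
    [ "$found" -eq "$2" ]
}

# Usage on a machine that queries the DNS server edited above:
# check_scan edc-scan.localdomain 3 && echo "SCAN resolves to three addresses"
```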
That’s it: you have successfully set up the dom0 for working with the virtual machines. Continue with the next part of the series, which is going to introduce openFiler and how to install it as a domU with minimal effort.
Finally time for a new series! With the arrival of the new 11.2.0.2 patchset I thought it was about time to try and set up a virtual 11.2.0.2 extended distance or stretched RAC. So, it’s virtual, fair enough. It doesn’t allow me to test things like the impact of latency on the inter-SAN communication, but it allowed me to test the general setup. Think of this series as a guide after all the tedious work has been done, and SANs happily talk to each other. The example requires some understanding of how XEN virtualisation works, and it’s tailored to openSuSE 11.2 as the dom0 or “host”. I have tried OracleVM in the past but back then a domU (or virtual machine) could not mount an iSCSI target without a kernel panic and reboot. Clearly not what I needed at the time. OpenSuSE has another advantage: it uses a recent kernel, not the 3-year-old 2.6.18 you find in Enterprise distributions. Also, xen is recent (openSuSE 11.3 even features xen 4.0!) and so is libvirt.
The Setup
The general idea follows the design you find in the field, but with fewer cluster nodes. I am thinking of 2 nodes for the cluster, and 2 iSCSI target providers. I wouldn’t use iSCSI in the real world, but my lab isn’t connected to an EVA or similar. A third site will provide quorum via an NFS-provided voting disk.
Site A will consist of filer01 for the storage part, and edcnode1 as the RAC node. Site B will consist of filer02 and edcnode2. The iSCSI targets are going to be provided by openFiler’s domU installation, and the cluster nodes will make use of Oracle Enterprise Linux 5 update 5. To make it more realistic, site C will consist of another openFiler instance, filer03, to provide the NFS export for the 3rd voting disk. Note that openFiler seems to support NFS v3 only at the time of this writing. All systems are 64bit.
The network connectivity will go through 3 virtual switches, all “host only” on my dom0.
Public network: 192.168.99/24
Private network: 192.168.100/24
Storage network: 192.168.101/24
As in the real world, private and storage network have to be separated to prevent iSCSI packets clashing with Cache Fusion traffic. Also, I increased the MTU for the private and storage networks to 9000 instead of the default 1500. If you want to use jumbo frames, check that your switch supports them.
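One way to confirm that jumbo frames really pass end to end is to send a full-size ICMP packet with fragmentation disallowed. The sketch below computes the payload size for a given MTU; icmp_payload is a name made up for this example, and the ping call (Linux iputils syntax) is shown as a comment because the target address depends on your setup.

```shell
# Largest ICMP payload that fits into a single frame of the given MTU:
# MTU minus 20 bytes for the IPv4 header minus 8 bytes for the ICMP header.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# From a machine on the private or storage network, with TARGET set to a
# peer address on the same network (an assumption, pick your own):
#   ping -M do -c 3 -s "$(icmp_payload 9000)" "$TARGET"
# If any hop is still at MTU 1500 the ping fails instead of fragmenting.
```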
Grid Infrastructure will use ASM to store OCR and voting disks, and the inter-SAN replication will also be performed by ASM in normal redundancy. I am planning on using preferred mirror read and intelligent data placement to see if that makes a difference.
Known limitations
This setup has some limitations, such as the following ones:
You cannot test inter-site SAN connectivity problems
You cannot make use of udev for the ASM devices: a xen domU doesn’t report anything back from /sbin/scsi_id, which makes the mapping to /dev/mapper impossible (maybe someone knows a workaround?)
Network interfaces are not bonded; you certainly would use bonded NICs in real life
No “real” fibre channel connectivity between the cluster nodes
So much for the introduction. I’ll post the setup step-by-step. The intended series will consist of these articles:
Introduction to XEN on openSuSE 11.2 and dom0 setup
Introduction to openFiler and its installation as a virtual machine
Setting up the cluster nodes
Installing Grid Infrastructure 11.2.0.2
Adding third voting disk on NFS
Installing RDBMS binaries
Creating a database
That’s it for today, I hope I got you interested and following the series. It’s been real fun doing it; now it’s about writing it all up.
I have already written about the renamedg command, but since then fell in love with ASMLib. The use of ASMLib introduces a few caveats you should be aware of.
USAGE NOTES
This document presents research I performed with ASM in a lab environment. It should be applicable to any environment, but you should NOT use this for production: the renamedg command is still buggy, and you should not mess with ASM disk headers in an important system such as production or staging/UAT. You set the importance here! The recommended setup for cloning disk groups is to use a data guard physical standby database on a different storage array to create a real time copy of your production database on that array. Again, do not use your production array for this!
Walking through a renamedg session
Oracle ASMLib introduces a new value to the ASM header, called the provider string as the following example shows:
[root@asmtest ~]# /etc/init.d/oracleasm querydisk /dev/xvdc1
Device "/dev/xvdc1" is marked an ASM disk with the label "VOL1"
The prefix “ORCLDISK” is automatically added by ASMLib and cannot easily be changed.
The problem with ASMLib is that the renamedg command does NOT update the provider string, which I’ll illustrate by walking through an example session. Disk group “DATA”, set up with external redundancy and two disks, DATA1 and DATA2, is to be cloned to “DATACLONE”.
The renamedg command requires the disk group that is to be cloned to be stopped. To prevent nasty surprises, you should manually stop the databases using that disk group first.
So apart from files from other disk groups no files are open, especially not referring to disk group DATA.
Now comes the part where you copy the LUNs, and this entirely depends on your system. The EVA series of storage arrays I worked with in this particular project offered a “snapclone” function, which used COW to create an identical copy of the source LUN, with a new WWID (which can be an input parameter to the snapclone call). When you are using device-mapper-multipath then ensure that your sys admins add the newly created LUNs to the /etc/multipath.conf file on all cluster nodes!
I am using Xen in my lab, which makes it simpler: all I need to do is to copy the disk containers on the dom0 and then add the new block devices to the running domU (“virtual machine” in Xen language). This can be done easily as the following example shows:
In the example, rac11gr2drnode{1,2} are the domU, the backend device is the copied file on the file system, the front end device in the domU is xvd{g,h}, and the mode is read/write, shareable. The exclamation mark here is crucial, or else the second domU can’t mount the new block device because it is already exclusively mounted to another domU.
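Put together, the description amounts to xm block-attach calls of the form domain, backend, frontend, mode. The sketch below prints the commands instead of running them; the image directory and the file names dataclone1.img and dataclone2.img are hypothetical, as is the helper name print_block_attach.

```shell
# Print one xm block-attach call per domU and cloned disk container.
# $1 = directory on the dom0 that holds the copied containers (hypothetical)
print_block_attach() {
    for domu in rac11gr2drnode1 rac11gr2drnode2; do
        # backend file -> frontend device; mode w! = read/write, shareable
        printf 'xm block-attach %s file:%s/dataclone1.img xvdg w!\n' "$domu" "$1"
        printf 'xm block-attach %s file:%s/dataclone2.img xvdh w!\n' "$domu" "$1"
    done
}

print_block_attach /var/lib/xen/images/clones
```

Review the printed commands, then run them as root on the dom0.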
The fdisk command in my example immediately “sees” the new LUNs; with device mapper multipathing you might have to go through iterations of restarting multipathd and discovering partitions using kpartx. It is again very important to have all disks presented to all cluster nodes!
Here’s the sample output from my system:
[root@rac11gr2drnode1 ~]# fdisk -l | grep Disk | sort
Disk /dev/xvda: 4294 MB, 4294967296 bytes
Disk /dev/xvdb: 16.1 GB, 16106127360 bytes
Disk /dev/xvdc: 5368 MB, 5368709120 bytes
Disk /dev/xvdd: 16.1 GB, 16106127360 bytes
Disk /dev/xvde: 16.1 GB, 16106127360 bytes
Disk /dev/xvdf: 10.7 GB, 10737418240 bytes
Disk /dev/xvdg: 16.1 GB, 16106127360 bytes
Disk /dev/xvdh: 16.1 GB, 16106127360 bytes
I cloned /dev/xvdd and /dev/xvde to /dev/xvdg and /dev/xvdh.
Do NOT run /etc/init.d/oracleasm scandisks yet! Otherwise the renamedg command will complain about duplicate disk names, which is entirely reasonable.
I dumped all headers for disks /dev/xvd{d,e,g,h}1 to /tmp to be able to compare.
[root@rac11gr2drnode1 ~]# kfed read /dev/xvdd1 > /tmp/xvdd1.header
# repeat with the other disks
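Dumping and comparing the four headers can be scripted. The sketch below is an illustration with made-up helper names: dump_headers writes one file per device to /tmp, and header_fields keeps only the kfed lines for provider string, disk name and disk group name, which are the ones that matter for a rename.

```shell
# Dump the ASM header of each given device to /tmp/<device>.header.
# Needs root privileges and kfed in the PATH.
dump_headers() {
    for dev in "$@"; do
        kfed read "$dev" > "/tmp/$(basename "$dev").header"
    done
}

# Filter kfed output on stdin down to provider string, disk name, group name.
header_fields() {
    grep -E 'kfdhdb\.(driver\.provstr|dskname|grpname)'
}

# Usage:
#   dump_headers /dev/xvdd1 /dev/xvde1 /dev/xvdg1 /dev/xvdh1
#   header_fields < /tmp/xvdd1.header
```

Running header_fields against the dumps before and after each renamedg phase makes the changes easy to spot with diff.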
Start with phase one of the renamedg command:
[root@rac11gr2drnode1 ~]# renamedg phase=one dgname=DATA newdgname=DATACLONE \
> confirm=true verbose=true config=/tmp/cfg
Parsing parameters..
Parameters in effect:
Old DG name : DATA
New DG name : DATACLONE
Phases :
Phase 1
Discovery str : (null)
Confirm : TRUE
Clean : TRUE
Raw only : TRUE
renamedg operation: phase=one dgname=DATA newdgname=DATACLONE confirm=true
verbose=true config=/tmp/cfg
Executing phase 1
Discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with
disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with
disk number:1 and timestamp (32940276 1937075200)
Checking for hearbeat...
Re-discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with
disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with
disk number:1 and timestamp (32940276 1937075200)
Checking if the diskgroup is mounted
Checking disk number:0
Checking disk number:1
Checking if diskgroup is used by CSS
Generating configuration file..
Completed phase 1
Terminating kgfd context 0x2b7a2fbac0a0
[root@rac11gr2drnode1 ~]#
You should always check “$?” for errors: the message “terminating kgfd context” sounds bad, but isn’t. At the end of phase one the header is unchanged; only phase two modifies it.
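Checking the exit code is easy to forget in a long session, so it can be wrapped. This is a sketch rather than a transcript: run_checked is a hypothetical helper that makes a non-zero exit code visible, and the phase two call in the comment assumes the same parameters and config file as the phase one call above.

```shell
# Run a command and report a non-zero exit status on stderr.
# Returns the exit status of the command that was run.
run_checked() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "FAILED (exit code $rc): $*" >&2
    fi
    return "$rc"
}

# Phase two, assuming /tmp/cfg was generated by phase one:
#   run_checked renamedg phase=two dgname=DATA newdgname=DATACLONE \
#       confirm=true verbose=true config=/tmp/cfg
```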
Although the original disks (/dev/xvdd1 and /dev/xvde1) had their disk group name changed, the provider string remained untouched. So if we were to issue a scandisks command now through /etc/init.d/oracleasm, there’d still be duplicate disk names. This is a bug in my opinion, and a bad thing.
Renaming the disks is straightforward; the difficult bit is to find out which have to be renamed. Again, you can use kfed to figure that out. I knew the disks to be renamed were /dev/xvdd1 and /dev/xvde1 after consulting the header information.
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvdd1 DATACLONE1
Renaming disk "/dev/xvdd1" to "DATACLONE1": [ OK ]
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvde1 DATACLONE2
Renaming disk "/dev/xvde1" to "DATACLONE2": [ OK ]
I then performed a scandisks operation on all nodes just to be sure… I had corruption of the disk group before :)
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac11gr2drnode1 tmp]#
[root@rac11gr2drnode2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac11gr2drnode2 ~]#
The output on all cluster nodes should be identical; on my system I found the following disks:
Sure enough, the cloned disks were present. Although everything seemed ok at this point, I could not start disk group DATA and had to reboot the cluster nodes to rectify that problem. Maybe there is some not so transient information stored somewhere about ASM disks. After the reboot, CRS started my database correctly, and with all dependent resources:
[oracle@rac11gr2drnode1 ~]$ srvctl status database -d dev
Instance dev1 is running on node rac11gr2drnode1
Instance dev2 is running on node rac11gr2drnode2
A very short post about a cool new feature I noticed today. RAC 11.2 has moved a lot of commands previously having their own syntax into crsctl. One of the cool new things is the fact that crsctl status resource -t (“tabular”) reports state details. Here I could see that my lab environment had a stuck archiver. Other state details include information about the cluster time synchronisation daemon ctss, or ASM instances. Have a look at my 4 node cluster: