One of the major adventures this time of the year involves installing RAC 11.2.0.2 on Solaris 10 10/09 x86-64. The system setup included EMC PowerPath 5.3 as the multipathing solution to shared storage.
I initially asked for 4 BL685c G6 with 24 cores, but in the end “only” got two, which is still plenty of resources to experiment with. I especially like the output of this command:
$ /usr/sbin/psrinfo | wc -l
24
Nice! Actually, it’s 4 Opteron processors:
$ /usr/sbin/prtdiag | less
System Configuration: HP ProLiant BL685c G6
BIOS Configuration: HP A17 12/09/2009
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Opteron                          Proc 1
Opteron                          Proc 2
Opteron                          Proc 3
Opteron                          Proc 4
So much for the equipment. The operating system showed 4 NICs, all called bnxeN where N was 0 through 3. The first interface, bnxe0, will be used for the public network. The second NIC is to be ignored, and the final 2, bnxe2 and bnxe3, will be used for the highly available cluster interconnect feature. This way I can avoid the use of SFRAC, which inevitably would have meant a clustered Veritas file system instead of ASM.
One interesting point to notice is that the Oracle MOS document 1210883.1 specifies that the interfaces for the private interconnect are on the same subnet. So node1 will use 192.168.0.1 for bnxe2 and 192.168.0.2 for bnxe3. Similarly, node2 uses 192.168.0.3 for bnxe2 and 192.168.0.4 for bnxe3. The Oracle example is actually a bit more complicated than it could have been, as they use a /25 subnet mask. But ipcalc confirms that the addresses they use are all well within the subnet:
Address:   10.1.0.128           00001010.00000001.00000000.1 0000000
Netmask:   255.255.255.128 = 25 11111111.11111111.11111111.1 0000000
Wildcard:  0.0.0.127            00000000.00000000.00000000.0 1111111
=>
Network:   10.1.0.128/25        00001010.00000001.00000000.1 0000000 (Class A)
Broadcast: 10.1.0.255           00001010.00000001.00000000.1 1111111
HostMin:   10.1.0.129           00001010.00000001.00000000.1 0000001
HostMax:   10.1.0.254           00001010.00000001.00000000.1 1111110
Hosts/Net: 126                  (Private Internet)
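If ipcalc isn't at hand, the same subnet arithmetic can be double-checked with Python's standard ipaddress module. This is only a sketch: the network is the one from the ipcalc output, while the two candidate addresses in the demo are made up for illustration and do not come from the MOS note.

```python
import ipaddress

# The /25 network from the ipcalc output above.
net = ipaddress.ip_network("10.1.0.128/25")

def in_subnet(addr: str) -> bool:
    """Return True if addr falls inside the /25 network."""
    return ipaddress.ip_address(addr) in net

# hosts() leaves out the network and the broadcast address.
hosts = list(net.hosts())
summary = {
    "netmask": str(net.netmask),
    "broadcast": str(net.broadcast_address),
    "host_min": str(hosts[0]),
    "host_max": str(hosts[-1]),
    "host_count": len(hosts),
}

if __name__ == "__main__":
    print(summary)
    # Hypothetical addresses, one inside and one outside the range.
    for candidate in ("10.1.0.130", "10.1.0.10"):
        print(candidate, in_subnet(candidate))
```

The same check works for the 192.168.0.x addresses I use on bnxe2 and bnxe3, once you plug in whatever netmask your interconnect uses.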
This setup will have some interesting implications which I’ll describe a little later.
Part of the test was to find out how mature the port to Solaris on Intel was. So I decided to start off by installing Grid Infrastructure on node 1 first, and extend the cluster to node2 using the addNode.sh script in $ORACLE_HOME/oui/bin.
The installation uses 2 different accounts to store the Grid Infrastructure binaries separately from the RDBMS binaries. Operating system accounts are oragrid and oracle.
Oracle: uid=501(oracle) gid=30275(oinstall) groups=309(dba),2046(asmdba),2047(asmadmin)
OraGrid: uid=502(oragrid) gid=30275(oinstall) groups=309(dba),2046(asmdba),2047(asmadmin)
I started off by downloading files 1, 2 and 3 of patch 10098816 for my platform. The download counts for this patch stood at 243 for x64 against 751 for SPARC, so not a massive uptake of this patchset for Solaris, it would seem.
As the oragrid user I created user equivalence with RSA and DSA ssh keys. A little utility will do this for you now, but I’m old-school and created the keys and exchanged them between the hosts myself. Not too bad a task on only 2 nodes.
The next step was to find out about the shared storage, and that took me a little while, I freely admit. I hadn’t used the EMC PowerPath multipathing software before and found it difficult to approach, mainly for the lack of information about it. Or maybe I just didn’t find it, but device-mapper-multipath, for instance, is easier to understand. Additionally, the fact that this was Solaris on Intel made it a little more complex.
First I needed to know what the device names actually mean. As on Solaris SPARC, /dev/dsk lists the block devices and /dev/rdsk lists the raw devices, so the raw devices are what I’m after. Next I checked the devices, emcpower0a to emcpower9a. In the course of the installation I found out how to deal with these. First of all, on Solaris Intel you have to create an fdisk partition on the LUN before it can be dealt with in the SPARC way. So for each device you would like to use, run fdisk against its emcpowerNp0 device, i.e.
# fdisk /dev/rdsk/emcpower0p0
If there is no partition, simply say “y” to the question if you want to use all of it for Solaris and exit fdisk. Otherwise, delete the existing partition (AFTER HAVING double/triple CHECKED THAT IT’S REALLY NOT NEEDED!) and create a new one of type “Solaris2”. It didn’t seem necessary to make it active.
Here’s a sample session:
bash-3.00# fdisk /dev/rdsk/emcpower0p0
No fdisk table exists. The default partition for the disk is:

  a 100% "SOLARIS System" partition

Type "y" to accept the default partition, otherwise type "n" to edit the partition table.
Y
Now let’s check the result:
bash-3.00# fdisk /dev/rdsk/emcpower0p0
Total disk size is 1023 cylinders
Cylinder size is 2048 (512 byte) blocks
                                         Cylinders
Partition   Status    Type          Start   End   Length    %
=========   ======    ============  =====   ===   ======   ===
    1       Active    Solaris2          1  1022    1022    100

SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Exit (update disk configuration and exit)
   6. Cancel (exit without updating disk configuration)
Enter Selection: 6
bash-3.00#
This particular device will be used for my OCRVOTE disk group, which is why it’s only 1G. The next step is identical to SPARC: start the format tool, select partition, change the fourth partition to use the whole disk (with an offset of 3 cylinders at the beginning of the slice) and label it. With that done, exit the format application.
This takes me back to the discussion of the emcpower device names. The trailing letters a to p refer to the slices of the device, while a trailing p followed by a number, as in emcpower0p0, stands for the fdisk partition. /dev/rdsk/emcpowerNc is slice 2 of the Nth multipathed device, in other words the whole disk. I usually create a slice 4, which translates to emcpowerNe.
After having completed the disk initialisation, I had to ensure that the devices I was working on were really shared. Unfortunately the emcpower devices are not consistently named across the cluster. What is emcpower0a on node1 turned out to be emcpower2a on the second node. How to check? The powermt tool comes to the rescue. Similar to “multipath -ll” on Linux, the powermt command can show the underlying disks which are aggregated under the emcpowerN pseudo device. So I wanted to know if my device /dev/rdsk/emcpower0e was shared. What I was really interested in was the native device:
# powermt display dev=emcpower0a | awk \
> '/c[0-9]t/ {print $3}'
c1t50000974C00A611Cd6s0
c2t50000974C00A6118d6s0
Well, does that exist on the other node?
# powermt display dev=all | /usr/sfw/bin/ggrep -B8 c1t50000974C00A611Cd6s0
Pseudo name=emcpower3a
Symmetrix ID=000294900664
Logical device ID=0468
state=alive; policy=SymmOpt; priority=0; queued-IOs=0;
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
==============================================================================
3072 pci@39,0/pci1166,142@12/pci103c,1708@0/fp@0,0 c1t50000974C00A611Cd6s0 FA 8eA active alive 0 0
So yes, it was there. Cool! I checked the LUNs for the 2 other OCR/voting disks and they were shared as well. The final piece was to change the ownership of the devices to oragrid:asmdba and the permissions to 0660.
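Doing this by eye for more than a handful of LUNs gets tedious, so here is a minimal Python sketch of the same lookup. It assumes the output of “powermt display dev=all” has been captured to a string per node and that it looks like the listing above; the regular expression for native device names and the function name are my own, not anything PowerPath provides.

```python
import re

# Shape taken from the powermt listing above (separator lines left out);
# only the "Pseudo name=" line and the native path matter here.
SAMPLE = """\
Pseudo name=emcpower3a
Symmetrix ID=000294900664
Logical device ID=0468
state=alive; policy=SymmOpt; priority=0; queued-IOs=0;
3072 pci@39,0/pci1166,142@12/pci103c,1708@0/fp@0,0 c1t50000974C00A611Cd6s0 FA 8eA active alive 0 0
"""

# Native Solaris device names of the form cXtYdZsN (assumed pattern).
NATIVE = re.compile(r"\bc\d+t[0-9A-Fa-f]+d\d+s\d+\b")

def pseudo_by_native(powermt_output: str) -> dict:
    """Map each native path to the pseudo device it is listed under."""
    mapping = {}
    pseudo = None
    for line in powermt_output.splitlines():
        line = line.strip()
        if line.startswith("Pseudo name="):
            pseudo = line.split("=", 1)[1]
            continue
        if pseudo is not None:
            for native in NATIVE.findall(line):
                mapping[native] = pseudo
    return mapping

if __name__ == "__main__":
    print(pseudo_by_native(SAMPLE))
```

Run against the captured output of each node, a lookup of the same native name shows which emcpower pseudo device it hides behind on that node, or nothing at all if the LUN was not presented there.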
Project settings
Another item to look at is the project settings for the grid owner and oracle. It’s important to set projects correctly, otherwise the installation will fail when ASM is starting. All newly created users inherit the settings from the default project. Unless the sys admins set the limits of the default project high enough, you will have to change them. You can use “prctl -i project default” to check all the values for this project.
I usually create a project for the grid owner, oragrid, as well as for the oracle account. My settings are as follows for a maximum SGA size of around 20G:
projadd -c "Oracle Grid Infrastructure" 'user.oracle'
projmod -s -K "process.max-sem-nsems=(privileged,256,deny)" 'user.oracle'
projmod -s -K "project.max-shm-memory=(privileged,20GB,deny)" 'user.oracle'
projmod -s -K "project.max-shm-ids=(privileged,256,deny)" 'user.oracle'
Repeat this for the oragrid user, then log in as oragrid and check that the project is actually assigned:
# id -p oragrid
uid=223(oragrid) gid=30275(oinstall) projid=100(user.oragrid)
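Since the same four commands have to be repeated per account, a small generator keeps the two projects in step. The following is just a sketch in Python (3.8 or later for shlex.join): the resource controls and values are the ones shown above, while the function name and the project comments in the demo are made up. It only prints the commands, it does not run them.

```python
import shlex

# Resource controls and values as used in the commands above.
CONTROLS = [
    ("process.max-sem-nsems", "256"),
    ("project.max-shm-memory", "20GB"),
    ("project.max-shm-ids", "256"),
]

def project_commands(user: str, comment: str) -> list:
    """Build the projadd/projmod command lines for the project user.<user>."""
    project = "user." + user
    cmds = [shlex.join(["projadd", "-c", comment, project])]
    for name, value in CONTROLS:
        setting = f"{name}=(privileged,{value},deny)"
        cmds.append(shlex.join(["projmod", "-s", "-K", setting, project]))
    return cmds

if __name__ == "__main__":
    # The comments are hypothetical labels, pick your own.
    for user, comment in (("oracle", "Oracle RDBMS"),
                          ("oragrid", "Oracle Grid Infrastructure")):
        print("\n".join(project_commands(user, comment)))
```

shlex.join takes care of the quoting, so the parentheses in the resource control values survive a copy and paste into a root shell.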
Installing Grid Infrastructure
Finally ready to start the installer! The Solaris installation isn’t any different from Linux except for the aforementioned fiddling with the raw devices.
The installation went smoothly, and I ran orainstRoot.sh and root.sh without any problem. If anything, it was a bit slow, taking 10 minutes to complete root.sh on node1. You can tail the rootcrs_node1.log file in /data/oragrid/product/11.2.0.2/cfgtoollogs/crsconfig to see what’s going on behind the scenes. This is certainly one of the biggest improvements over 10g and 11g Release 1.
Extending the cluster
The MOS document I alluded to earlier suggests having all the private NIC IP addresses in the same subnet. That isn’t necessarily to the liking of cluvfy. Each host fails to reach the bnxe3 address of the other, as shown in this example. Tests executed from node1:
bash-3.00# ping 192.168.0.1
192.168.0.1 is alive
bash-3.00# ping 192.168.0.2
192.168.0.2 is alive
bash-3.00# ping 192.168.0.3
192.168.0.3 is alive
bash-3.00# ping 192.168.0.4
^C
192.168.0.4 is not replying
Tests executed from node2:
bash-3.00# ping 192.168.0.1
192.168.0.1 is alive
bash-3.00# ping 192.168.0.2
^C
bash-3.00# ping 192.168.0.3
192.168.0.3 is alive
bash-3.00# ping 192.168.0.4
192.168.0.4 is alive
I decided to ignore this for now, and sure enough, the cluster extension didn’t fail. As I’m not using GNS, the command to add the node was
$ ./addNode.sh -debug -logLevel finest "CLUSTER_NEW_NODES={loninengblc208}" \
"CLUSTER_NEW_VIRTUAL_HOSTNAMES={loninengblc208-vip}"
This is actually a little more verbose than I needed, but it’s always good to be prepared for an SR with Oracle.
However, the OUI command performs a prerequisite check before the actual call to runInstaller, and that repeatedly failed, complaining about connectivity on the bnxe3 network. Checking the contents of the addNode.sh script I found an environment variable, “$IGNORE_PREADDNODE_CHECKS”, which can be set to “Y” to force the script to ignore the prerequisite checks. With that set, the addNode operation succeeded.
RDBMS installation
This is actually not worth reporting, as it’s pretty much the same as on Linux. However, there is a small caveat specific to Solaris x86-64: many files in the Oracle inventory didn’t have the correct permissions. When launching runInstaller to install the binaries, I was bombarded with complaints about file permissions.
For example, oraInstaller.properties has the wrong permissions. Example for Solaris Intel:
# ls -l oraInstaller.properties
-rw-r--r-- 1 oragrid oinstall 317 Nov 9 15:01 oraInstaller.properties
On Linux:
$ ls -l oraInstaller.properties
-rw-rw---- 1 oragrid oinstall 345 Oct 21 12:44 oraInstaller.properties
There were a few more, which I fixed using these commands:
$ chmod 770 ContentsXML
$ chmod 660 install.platform
$ chmod 770 oui
$ chmod 660 ContentsXML/*
$ chmod 660 oui/*
Once the permissions were fixed the installation succeeded.
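To find the offenders before the installer complains, a quick audit of the inventory directory helps. Here is a minimal Python sketch that reports every file and directory lacking group write permission, which is what the Linux listing above has and the Solaris one lacks. The function name is mine, and the path in the demo is a placeholder, not the real inventory location.

```python
import os
import stat
import tempfile

def missing_group_write(inventory_dir: str) -> list:
    """List paths below inventory_dir whose mode lacks group write permission."""
    offenders = []
    for root, dirs, files in os.walk(inventory_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if not os.stat(path).st_mode & stat.S_IWGRP:
                offenders.append(path)
    return sorted(offenders)

if __name__ == "__main__":
    # Demo on a throwaway directory; point it at your inventory instead.
    demo = tempfile.mkdtemp()
    props = os.path.join(demo, "oraInstaller.properties")
    open(props, "w").close()
    os.chmod(props, 0o644)  # same mode as in the Solaris listing
    print(missing_group_write(demo))
```

Anything the function returns is a candidate for the chmod commands shown above.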
DBCA
Nothing to report here, it’s the same as for Linux.
I have already written about the renamedg command, but have since fallen in love with ASMLib. The use of ASMLib introduces a few caveats you should be aware of.
This document presents research I performed with ASM in a lab environment. It should be applicable to any environment, but you should NOT use this for production. The renamedg command is still buggy, and you should not mess with ASM disk headers in an important system such as production or staging/UAT. You set the importance here! The recommended setup for cloning disk groups is to use a Data Guard physical standby database on a different storage array to create a real time copy of your production database on that array. Again, do not use your production array for this!
Oracle ASMLib introduces a new value to the ASM header, called the provider string, as the following example shows:
[root@asmtest ~]# kfed read /dev/oracleasm/disks/VOL1 | grep prov
kfdhdb.driver.provstr: ORCLDISKVOL1 ; 0x000: length=12
This can be verified with ASMLib:
[root@asmtest ~]# /etc/init.d/oracleasm querydisk /dev/xvdc1
Device "/dev/xvdc1" is marked an ASM disk with the label "VOL1"
The prefix “ORCLDISK” is automatically added by ASMLib and cannot easily be changed.
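The relationship between provider string and ASMLib label is simple enough to sketch in a few lines of Python. This assumes kfed output in the format shown above; the function name is made up, and returning None when no label follows the prefix is my choice, not documented behaviour.

```python
import re

PREFIX = "ORCLDISK"
PROVSTR = re.compile(r"^kfdhdb\.driver\.provstr:\s+(\S+)")

def asmlib_label(kfed_output: str):
    """Return the ASMLib label embedded in the provider string, or None."""
    for line in kfed_output.splitlines():
        match = PROVSTR.match(line.strip())
        if match:
            value = match.group(1)
            if value.startswith(PREFIX):
                # Whatever follows the prefix is the label; empty means none.
                return value[len(PREFIX):] or None
            return None
    return None

if __name__ == "__main__":
    sample = "kfdhdb.driver.provstr: ORCLDISKVOL1 ; 0x000: length=12"
    print(asmlib_label(sample))
```

Fed with the kfed line from the example, it yields the same label that querydisk reports.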
The problem with ASMLib is that the renamedg command does NOT update the provider string, which I’ll illustrate by walking through an example session. Disk group “DATA”, set up with external redundancy and two disks, DATA1 and DATA2, is to be cloned to “DATACLONE”.
The renamedg command requires the disk group to be cloned to be stopped. To prevent nasty surprises, you should stop the databases using that diskgroup manually.
[grid@rac11gr2drnode1 ~]$ srvctl stop database -d dev
[grid@rac11gr2drnode1 ~]$ ps -ef | grep smon
grid 3424 1 0 Aug07 ? 00:00:00 asm_smon_+ASM1
grid 17909 17619 0 15:13 pts/0 00:00:00 grep smon
[grid@rac11gr2drnode1 ~]$ srvctl stop diskgroup -g data
[grid@rac11gr2drnode1 ~]$
You can use the new “lsof” command of asmcmd to check for open files:
ASMCMD> lsof
DB_Name  Instance_Name  Path
+ASM     +ASM1          +ocrvote.255.4294967295
asmvol   +ASM1          +acfsdg/APACHEVOL.256.724157197
asmvol   +ASM1          +acfsdg/DRL.257.724157197
ASMCMD>
So apart from files in other disk groups no files are open, and in particular none that belong to disk group DATA.
Now comes the part where you copy the LUNs, and this entirely depends on your system. The EVA series of storage arrays I worked with in this particular project offered a “snapclone” function, which used copy-on-write (COW) to create an identical copy of the source LUN with a new WWID (which can be an input parameter to the snapclone call). If you are using device-mapper-multipath, ensure that your sys admins add the newly created LUNs to the /etc/multipath.conf file on all cluster nodes!
I am using Xen in my lab, which makes it simpler: all I need to do is copy the disk containers on the dom0 and then add the new block devices to the running domU (“virtual machine” in Xen language). This can be done easily, as the following example shows:
Usage: xm block-attach
xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/rac11gr2drShared/oradata1.clone xvdg w!
xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/rac11gr2drShared/oradata1.clone xvdg w!
xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/rac11gr2drShared/oradata2.clone xvdh w!
xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/rac11gr2drShared/oradata2.clone xvdh w!
In the example, rac11gr2drnode{1,2} are the domUs, the backend device is the copied file on the file system, the front end device in the domU is xvd{g,h}, and the mode is read/write, shareable. The exclamation mark here is crucial, or else the second domU can’t mount the new block device because it is already exclusively mounted to another domU.
The fdisk command in my example immediately “sees” the new LUNs. With device mapper multipathing you might have to go through iterations of restarting multipathd and discovering partitions using kpartx. It is again very important to have all disks presented to all cluster nodes!
Here’s the sample output from my system:
[root@rac11gr2drnode1 ~]# fdisk -l | grep Disk | sort
Disk /dev/xvda: 4294 MB, 4294967296 bytes
Disk /dev/xvdb: 16.1 GB, 16106127360 bytes
Disk /dev/xvdc: 5368 MB, 5368709120 bytes
Disk /dev/xvdd: 16.1 GB, 16106127360 bytes
Disk /dev/xvde: 16.1 GB, 16106127360 bytes
Disk /dev/xvdf: 10.7 GB, 10737418240 bytes
Disk /dev/xvdg: 16.1 GB, 16106127360 bytes
Disk /dev/xvdh: 16.1 GB, 16106127360 bytes
I cloned /dev/xvdd and /dev/xvde to /dev/xvdg and /dev/xvdh.
Do NOT run /etc/init.d/oracleasm scandisks yet! Otherwise the renamedg command will complain about duplicate disk names, which is entirely reasonable.
I dumped all headers for disks /dev/xvd{d,e,g,h}1 to /tmp to be able to compare.
[root@rac11gr2drnode1 ~]# kfed read /dev/xvdd1 > /tmp/xvdd1.header # repeat with the other disks
Start with phase one of the renamedg command:
[root@rac11gr2drnode1 ~]# renamedg phase=one dgname=DATA newdgname=DATACLONE \
> confirm=true verbose=true config=/tmp/cfg
Parsing parameters..
Parameters in effect:
 Old DG name : DATA
 New DG name : DATACLONE
 Phases :
 Phase 1
 Discovery str : (null)
 Confirm : TRUE
 Clean : TRUE
 Raw only : TRUE
renamedg operation: phase=one dgname=DATA newdgname=DATACLONE confirm=true verbose=true config=/tmp/cfg
Executing phase 1
Discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with disk number:1 and timestamp (32940276 1937075200)
Checking for hearbeat...
Re-discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with disk number:1 and timestamp (32940276 1937075200)
Checking if the diskgroup is mounted
Checking disk number:0
Checking disk number:1
Checking if diskgroup is used by CSS
Generating configuration file..
Completed phase 1
Terminating kgfd context 0x2b7a2fbac0a0
[root@rac11gr2drnode1 ~]#
You should always check “$?” for errors. The message “Terminating kgfd context” sounds bad, but isn’t. At the end of phase one there is no change to the header. Only phase two changes it:
[root@rac11gr2drnode1 ~]# renamedg phase=two dgname=DATA newdgname=DATACLONE config=/tmp/cfg
Parsing parameters..
renamedg operation: phase=two dgname=DATA newdgname=DATACLONE config=/tmp/cfg
Executing phase 2
Completed phase 2
Now there are changes:
[root@rac11gr2drnode1 tmp]# grep DATA *header
xvdd1.header:kfdhdb.driver.provstr: ORCLDISKDATA1 ; 0x000: length=13
xvdd1.header:kfdhdb.dskname: DATA1 ; 0x028: length=5
xvdd1.header:kfdhdb.grpname: DATACLONE ; 0x048: length=9
xvdd1.header:kfdhdb.fgname: DATA1 ; 0x068: length=5
xvde1.header:kfdhdb.driver.provstr: ORCLDISKDATA2 ; 0x000: length=13
xvde1.header:kfdhdb.dskname: DATA2 ; 0x028: length=5
xvde1.header:kfdhdb.grpname: DATACLONE ; 0x048: length=9
xvde1.header:kfdhdb.fgname: DATA2 ; 0x068: length=5
xvdg1.header:kfdhdb.driver.provstr: ORCLDISKDATA1 ; 0x000: length=13
xvdg1.header:kfdhdb.dskname: DATA1 ; 0x028: length=5
xvdg1.header:kfdhdb.grpname: DATA ; 0x048: length=4
xvdg1.header:kfdhdb.fgname: DATA1 ; 0x068: length=5
xvdh1.header:kfdhdb.driver.provstr: ORCLDISKDATA2 ; 0x000: length=13
xvdh1.header:kfdhdb.dskname: DATA2 ; 0x028: length=5
xvdh1.header:kfdhdb.grpname: DATA ; 0x048: length=4
xvdh1.header:kfdhdb.fgname: DATA2 ; 0x068: length=5
Although the original disks (/dev/xvdd1 and /dev/xvde1) had their disk group name changed, the provider string remained untouched. So if we were to issue a scandisks command now through /etc/init.d/oracleasm, there’d still be duplicate disk names. This is a bug in my opinion, and a bad thing.
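Spotting the clash in a long grep listing is easier with a little help. Here is a Python sketch that groups the header dumps by provider string and reports every string claimed by more than one disk. It assumes input in the “file:field: value” shape of the grep output above; the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as
# "xvdd1.header:kfdhdb.driver.provstr: ORCLDISKDATA1 ; 0x000: length=13"
LINE = re.compile(r"^(?P<file>[^:]+):kfdhdb\.driver\.provstr:\s+(?P<prov>\S+)")

def duplicate_provider_strings(grep_output: str) -> dict:
    """Return {provider string: [header files]} for strings seen more than once."""
    seen = defaultdict(list)
    for line in grep_output.splitlines():
        match = LINE.match(line.strip())
        if match:
            seen[match.group("prov")].append(match.group("file"))
    return {prov: files for prov, files in seen.items() if len(files) > 1}

if __name__ == "__main__":
    sample = "\n".join([
        "xvdd1.header:kfdhdb.driver.provstr: ORCLDISKDATA1 ; 0x000: length=13",
        "xvdd1.header:kfdhdb.grpname: DATACLONE ; 0x048: length=9",
        "xvdg1.header:kfdhdb.driver.provstr: ORCLDISKDATA1 ; 0x000: length=13",
        "xvdg1.header:kfdhdb.grpname: DATA ; 0x048: length=4",
    ])
    print(duplicate_provider_strings(sample))
```

An empty result after the rename step is a reasonable signal that scandisks will no longer trip over duplicate labels.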
Renaming the disks is straightforward. The difficult bit is to find out which ones have to be renamed. Again, you can use kfed to figure that out. I knew the disks to be renamed were /dev/xvdd1 and /dev/xvde1 after consulting the header information.
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvdd1 DATACLONE1
Renaming disk "/dev/xvdd1" to "DATACLONE1": [ OK ]
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvde1 DATACLONE2
Renaming disk "/dev/xvde1" to "DATACLONE2": [ OK ]
I then performed a scandisks operation on all nodes just to be sure… I had corruption of the disk group before :)
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac11gr2drnode1 tmp]#
[root@rac11gr2drnode2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac11gr2drnode2 ~]#
The output on all cluster nodes should be identical. On my system I found the following disks:
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm listdisks
ACFS1
ACFS2
ACFS3
ACFS4
DATA1
DATA2
DATACLONE1
DATACLONE2
VOL1
VOL2
VOL3
VOL4
VOL5
Sure enough, the cloned disks were present. Although everything seemed ok at this point, I could not start disk group DATA and had to reboot the cluster nodes to rectify that problem. Maybe there is some not so transient information stored somewhere about ASM disks. After the reboot, CRS started my database correctly, and with all dependent resources:
[oracle@rac11gr2drnode1 ~]$ srvctl status database -d dev
Instance dev1 is running on node rac11gr2drnode1
Instance dev2 is running on node rac11gr2drnode2
The ps command in the ways I use it most (ps -ef and ps auxwww) doesn’t display the scheduling class for a process. Oracle have cunningly released a patchset to update Grid Infrastructure that changes the scheduling class of the VKTM and LMSn ASM processes to “Timeshare” instead of Realtime.
So far so good, but I had no idea how to display the scheduling class of a process, so some man page reading and Internet research were in order. After some digging around I found out that using the BSD command line syntax combined with the “--format” option does the trick. The difficult bit was figuring out which format identifiers to use. All the information ps can get about a process is recorded in /proc/pid/stat. Parsing this with a keen eye, however, proves difficult due to the sheer number of fields in the file. So back to using ps(1).
Here’s the example. Before applying the workaround to the patch, Oracle ASM’s VKTM (virtual keeper of time) and LMSn (global cache services process) run with TS priority:
[oracle@rac11gr2node2 ~]$ ps ax --format uname,pid,ppid,tty,cmd,cls,pri,rtprio \
> | egrep "(vktm|lms)" | grep asm
grid 4296 1 ? asm_vktm_+ASM2 TS 24 -
grid 4318 1 ? asm_lms0_+ASM2 TS 24 -
After applying the workaround the scheduling class changed:
[oracle@rac11gr2node1 ~]$ ps ax --format uname,pid,ppid,tty,cmd,cls,pri,rtprio | egrep "(vktm|lms)" | grep asm
grid 2352 1 ? asm_vktm_+ASM1 RR 41 1
grid 2374 1 ? asm_lms0_+ASM1 RR 41 1
Notice how the cls field changed, and also that the rtprio is now populated. I have learned something new today.
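For the curious, the same two values can be pulled out of /proc/pid/stat directly once you know where to look. The Python sketch below relies on the field numbering in the Linux proc(5) man page (field 40 is rt_priority, field 41 is policy) and on the class abbreviations from the ps man page; the function names are my own. Treat it as an illustration, ps remains the more convenient tool.

```python
# Scheduling policy numbers mapped to the CLS abbreviations ps prints.
POLICIES = {0: "TS", 1: "FF", 2: "RR", 3: "B", 5: "IDL"}

def sched_from_stat(stat_line: str):
    """Return (class abbreviation, rt_priority) from one /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so split on the last ')' rather than on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), hence field N lives at rest[N - 3].
    rt_priority = int(rest[40 - 3])
    policy = int(rest[41 - 3])
    return POLICIES.get(policy, "?"), rt_priority

def sched_of_pid(pid: int):
    """Convenience wrapper that reads the stat file of a live process."""
    with open(f"/proc/{pid}/stat") as fh:
        return sched_from_stat(fh.read())

if __name__ == "__main__":
    import os
    print(sched_of_pid(os.getpid()))
```

For a process in the RR class this returns the same rtprio that the ps output above shows in its last column.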