When using Locally Managed Tablespaces (LMT) with variable, system managed extent sizes (AUTOALLOCATE) and data files residing in ASM, the Allocation Unit (AU) size can make a significant difference to the algorithm that searches for free extents. When searching for free extents >= the AU size, the algorithm seems to consider only free extents on AU boundaries in order to avoid I/O splitting. Furthermore the algorithm seems to use two extent sizes when searching for free extents: a "desired" (for example 8MB) and a "minimum acceptable" (for example 1MB) extent size. However, when performing the search, the "desired" size seems to be the relevant one for limiting the search to free extents on AU boundaries. This can lead to some surprising side effects, in particular when using 4MB AUs. It effectively means that although you might have plenty o
During some testing I encountered an ORA-00214 during startup of an Oracle 11.2.0.2 database instance:
ORA-00214: control file '+RECO_XXXX/test/controlfile/current.334.755391511'
version 268 inconsistent with file
'+DATA_XXXX/test/controlfile/current.299.755390399' version 265
This is a RAC instance on Exadata, but all techniques in this article will work on any Oracle 11.2.x database using ASM.
This message means the database found two control files with different versions. If this message appears while the database is open, the database will crash. If an instance is started after this message, the same error appears, and the database remains in nomount state. Further diagnosis: the control file version in the recovery area is more recent than the version in the data disk group (version 268 versus version 265).
From “Exadata – the Sequel: Exadata V2 is Still Oracle” (http://www.teradata.com/t/assets/0/206/276/5bfc4694-ce82-4a07-867d-3f104...)
Shared Everything vs. Shared Nothing
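The word “Nothing” tends to leave a negative impression, but that is by no means the case here.
Teradata avoids I/O bottlenecks by adopting the Shared Nothing approach.
Exadata uses ASM to distribute data evenly across the Cells. This is called the SAME architecture (Stripe And Mirror Everywhere, or Everything). Because Parallel Query reads from evenly striped Cells, I/O waits occur even among the Parallel Query Slaves (labelled “Worker” in the figure below), and the I/O waits grow in proportion to the number of concurrent sessions.
Whether a query finishes in a few seconds or takes tens of minutes, every Parallel Query Slave scans every Cell.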
This post is about the installation of Grid Infrastructure, and where it’s really getting exciting: the 3rd NFS voting disk is going to be presented and I am going to show you how simple it is to add it into the disk group chosen for OCR and voting disks.
Let’s start with the installation of Grid Infrastructure. This is really simple, and I won’t go into too much detail. Start by downloading the required file from MOS; a simple search for patch 10098816 should bring you to the download page for 11.2.0.2 for Linux. Just make sure you select the 64bit version. The file we need just now is called p10098816_112020_Linux-x86-64_3of7.zip. The file names don’t necessarily relate to their contents; the readme helps to find out which piece of the puzzle is used for what functionality.
I alluded to my software distribution method in one of the earlier posts, and here’s all the detail to come. My dom0 exports the /m directory to the 192.168.99.0/24 network, the one accessible to all my domUs. This really simplifies software deployments.
So starting off, the file has been unzipped:
openSUSE-112-64-minimal:/m/download/db11.2/11.2.0.2 # unzip -q p10098816_112020_Linux-x86-64_3of7.zip
This creates the subdirectory “grid”. Switch back to edcnode1 and log in as oracle. As I already explained I won’t use different accounts for Grid Infrastructure and the RDBMS in this example.
If you have not already done so, mount the /m directory on the domU (which requires root privileges). Move to the newly unzipped “grid” directory under your mount point and begin to set up the user equivalence. On edcnode1 and edcnode2, create RSA and DSA keys for SSH:
[oracle@edcnode1 ~]$ ssh-keygen -t rsa
Any questions can be answered with the return key; it’s important to leave the passphrase empty. Repeat the call to ssh-keygen with argument “-t dsa”. Navigate to ~/.ssh and create the authorized_keys file as follows:
[oracle@edcnode1 .ssh]$ cat *.pub >> authorized_keys
Then copy the authorized_keys file to edcnode2 and add the public keys:
[oracle@edcnode1 .ssh]$ scp authorized_keys oracle@edcnode2:`pwd`
[oracle@edcnode1 .ssh]$ ssh oracle@edcnode2
If you are prompted, add the host to the ~/.ssh/known_hosts file by typing in “yes”.
[oracle@edcnode2 .ssh]$ cat *.pub >> authorized_keys
Change the permissions on the authorized_keys file to 0400 on both hosts, otherwise it won’t be considered when trying to log in. With all of this done, you can add all the unknown hosts to each node’s known_hosts file. The easiest way is a for loop:
[oracle@edcnode1 ~]$ for i in edcnode1 edcnode2 edcnode1-priv edcnode2-priv; do ssh $i hostname; done
Run this twice on each node, acknowledging the question if the new address should be added. Important: Ensure that there is no banner (/etc/motd, .profile, .bash_profile etc) writing to stdout or stderr or you are going to see strange error messages about user equivalence not being set up correctly.
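The banner problem can be checked for before cluvfy complains. This is a minimal sketch, not part of the original procedure: `is_quiet` and `check_hosts` are hypothetical helper names, and the idea is simply that a non-interactive `ssh host true` must not print anything at all.

```shell
# Hypothetical helper: succeeds only if the given command prints nothing
# on stdout or stderr. The command's own exit status is ignored.
is_quiet() {
  out=$("$@" 2>&1)
  [ -z "$out" ]
}

# Run the check against every host name passed as an argument.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_hosts() {
  for h in "$@"; do
    is_quiet ssh -o BatchMode=yes "$h" true || echo "unexpected output from $h"
  done
}
```

On this cluster the call would be `check_hosts edcnode1 edcnode2 edcnode1-priv edcnode2-priv`; any host listed in the output still has a banner, a chatty profile script, or a login problem.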
I hear you say: but 11.2 can create user equivalence in OUI now. This is of course correct, but I wanted to run cluvfy now, which requires a working setup.
Cluster Verification
It is good practice to run a check to see if the prerequisites for the Grid Infrastructure installation are met, and keep the output. Change to the NFS mount where the grid directory is exported, and execute runcluvfy.sh as in this example:
[oracle@edcnode1 grid]$ ./runcluvfy.sh stage -pre crsinst -n edcnode1,edcnode2 -verbose -fixup 2>&1 | tee /tmp/preCRS.tx
The nice thing is that you can run the fixup script now to fix kernel parameter settings:
[root@edcnode2 ~]# /tmp/CVU_11.2.0.2.0_oracle/runfixup.sh
/usr/bin/id
Response file being used is :/tmp/CVU_11.2.0.2.0_oracle/fixup.response
Enable file being used is :/tmp/CVU_11.2.0.2.0_oracle/fixup.enable
Log file location: /tmp/CVU_11.2.0.2.0_oracle/orarun.log
Setting Kernel Parameters...
fs.file-max = 327679
fs.file-max = 6815744
net.ipv4.ip_local_port_range = 9000 65500
net.core.wmem_max = 262144
net.core.wmem_max = 1048576
Repeat this on the other node. Obviously you should fix any other problem cluvfy reports before proceeding.
In the previous post I created the /u01 mount point. Double check that /u01 is actually mounted, otherwise you’d end up writing on your root_vg’s root_lv, which is not an ideal situation.
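One way to double check the mount is to look the directory up in the kernel’s mount table. The sketch below is an illustration only: `is_mounted` is a hypothetical helper name, and the optional second argument exists purely so the function can be tried against a sample file.

```shell
# Hypothetical helper: returns 0 if $1 is listed as a mount point in the
# mounts table given as $2 (defaults to /proc/mounts), 1 otherwise.
is_mounted() {
  awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

Calling `is_mounted /u01 || echo "/u01 is not mounted"` before starting the installer would catch the situation described above.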
You are now ready to start the installer: type in ./runInstaller to start the installation.
Grid Installation
This is rather mundane, and instead of providing screenshots, I opted for a description of the steps to execute in the OUI session.
The usual installation will now take place. At the end, run the root.sh script on edcnode1 and after it completes, on edcnode2. The output is included here for completeness:
[root@edcnode1 u01]# /u01/app/11.2.0/grid/root.sh 2>&1 | tee /tmp/root.sh.out
Running Oracle 11g root script...
The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME= /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [/usr/local/bin]:
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...
Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
  root wallet
  root wallet cert
  root cert export
  peer wallet
  profile reader wallet
  pa wallet
  peer wallet keys
  pa wallet keys
  peer cert request
  pa cert request
  peer cert
  pa cert
  peer root cert TP
  profile reader root cert TP
  pa root cert TP
  peer pa cert TP
  pa peer cert TP
  profile reader pa cert TP
  profile reader peer cert TP
  peer user cert
  pa user cert
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-2672: Attempting to start 'ora.mdnsd' on 'edcnode1'
CRS-2676: Start of 'ora.mdnsd' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'edcnode1'
CRS-2676: Start of 'ora.gpnpd' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'edcnode1'
CRS-2672: Attempting to start 'ora.gipcd' on 'edcnode1'
CRS-2676: Start of 'ora.gipcd' on 'edcnode1' succeeded
CRS-2676: Start of 'ora.cssdmonitor' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'edcnode1'
CRS-2672: Attempting to start 'ora.diskmon' on 'edcnode1'
CRS-2676: Start of 'ora.diskmon' on 'edcnode1' succeeded
CRS-2676: Start of 'ora.cssd' on 'edcnode1' succeeded
ASM created and started successfully.
Disk Group OCRVOTE created successfully.
clscfg: -install mode specified
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
CRS-4256: Updating the profile
Successful addition of voting disk 38f2caf7530c4f67bfe23bb170ed2bfe.
Successful addition of voting disk 9aee80ad14044f22bf6211b81fe6363e.
Successful addition of voting disk 29fde7c3919b4fd6bf626caf4777edaa.
Successfully replaced voting disk group with +OCRVOTE.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
 2. ONLINE   9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
 3. ONLINE   29fde7c3919b4fd6bf626caf4777edaa (ORCL:OCR02FILER01) [OCRVOTE]
Located 3 voting disk(s).
CRS-2672: Attempting to start 'ora.asm' on 'edcnode1'
CRS-2676: Start of 'ora.asm' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.OCRVOTE.dg' on 'edcnode1'
CRS-2676: Start of 'ora.OCRVOTE.dg' on 'edcnode1' succeeded
ACFS-9200: Supported
ACFS-9200: Supported
CRS-2672: Attempting to start 'ora.registry.acfs' on 'edcnode1'
CRS-2676: Start of 'ora.registry.acfs' on 'edcnode1' succeeded
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded
[root@edcnode2 ~]# /u01/app/11.2.0/grid/root.sh 2>&1 | tee /tmp/rootsh.out
Running Oracle 11g root script...
The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME= /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [/usr/local/bin]:
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...
Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node edcnode1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded
[root@edcnode2 ~]#
Congratulations! You have a working setup! Check if everything is ok:
[root@edcnode2 ~]# crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.OCRVOTE.dg
               ONLINE  ONLINE       edcnode1
               ONLINE  ONLINE       edcnode2
ora.asm
               ONLINE  ONLINE       edcnode1                 Started
               ONLINE  ONLINE       edcnode2
ora.gsd
               OFFLINE OFFLINE      edcnode1
               OFFLINE OFFLINE      edcnode2
ora.net1.network
               ONLINE  ONLINE       edcnode1
               ONLINE  ONLINE       edcnode2
ora.ons
               ONLINE  ONLINE       edcnode1
               ONLINE  ONLINE       edcnode2
ora.registry.acfs
               ONLINE  ONLINE       edcnode1
               ONLINE  ONLINE       edcnode2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       edcnode2
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       edcnode1
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       edcnode1
ora.cvu
      1        ONLINE  ONLINE       edcnode1
ora.edcnode1.vip
      1        ONLINE  ONLINE       edcnode1
ora.edcnode2.vip
      1        ONLINE  ONLINE       edcnode2
ora.oc4j
      1        ONLINE  ONLINE       edcnode1
ora.scan1.vip
      1        ONLINE  ONLINE       edcnode2
ora.scan2.vip
      1        ONLINE  ONLINE       edcnode1
ora.scan3.vip
      1        ONLINE  ONLINE       edcnode1
[root@edcnode2 ~]#
[root@edcnode1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
 2. ONLINE   9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
 3. ONLINE   29fde7c3919b4fd6bf626caf4777edaa (ORCL:OCR02FILER01) [OCRVOTE]
Located 3 voting disk(s).
Adding the NFS voting disk
It’s about time to deal with this subject. If you have not done so already, start the domU “filer03”. Log in as openfiler and ensure that the NFS server is started. On the services tab click on enable next to the NFS server if needed. Next navigate to the shares tab, where you should find the volume group and logical volume created earlier. The volume group I created is called “ocrvotenfs_vg”, and it has 1 logical volume, “nfsvol_lv”. Click on the name of the LV to create a new share. I named the new share “ocrvote”; enter this in the popup window and click on “create sub folder”.
The new share should now appear underneath nfsvol_lv. Proceed by clicking on “ocrvote” to set the share’s properties. Before you get to enter these, click on “make share”. Scroll down to the host access configuration section in the following screen. In this section you could enable all sorts of protocols: SMB, NFS, WebDAV, FTP and RSYNC. For this example, everything but NFS should be set to “NO”.
For NFS, the story is different: ensure you set the radio button to “RW” for both hosts. Then click on Edit for each machine. This is important! The anonymous UID and GID must match the Grid Owner’s uid and gid. In my scenario I entered “500” for both. You can check your settings using the id command as oracle: it will print the UID and GID plus other information.
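To get just the two numbers the openFiler dialog asks for, the standard `id` options can be combined. This is a small sketch; `uid_gid` is a hypothetical helper name, and it defaults to the current user when no account is given.

```shell
# Hypothetical helper: print "uid:gid" for a user (default: current user).
uid_gid() {
  user=${1:-$(id -un)}
  printf '%s:%s\n' "$(id -u "$user")" "$(id -g "$user")"
}
```

Run as `uid_gid oracle` on a cluster node; in the setup described here both values were 500.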
The UID/GID mapping then has to be set to all_squash, IO mode to sync, and write delay to wdelay. Leave the default for “requesting origin port”, which was set to “secure < 1024” in my configuration.
I decided to create /ocrvote on both nodes to mount the NFS export:
[root@edcnode2 ~]# mkdir /ocrvote
Edit the /etc/fstab file to make the mount persistent across reboots. I added this line to the file on both nodes:
192.168.101.52:/mnt/ocrvotenfs_vg/nfsvol_lv/ocrvote /ocrvote nfs rw,bg,hard,intr,rsize=32768,wsize=32768,tcp,noac,nfsvers=3,timeo=600,addr=192.168.101.51
The “addr” option instructs Linux to use the storage network to mount the share. Now you are ready to mount the device on all nodes, using the “mount /ocrvote” command.
I changed the export on the filer to the uid/gid combination of the oracle account (or, on an installation with separate grid software owner, to its uid/gid combination):
[root@filer03 ~]# cd /mnt/ocrvotenfs_vg/nfsvol_lv/
[root@filer03 nfsvol_lv]# ls -l
total 44
-rw------- 1 root    root     6144 Sep 24 15:38 aquota.group
-rw------- 1 root    root     6144 Sep 24 15:38 aquota.user
drwxrwxrwx 2 root    root     4096 Sep 24 15:26 homes
drwx------ 2 root    root    16384 Sep 24 15:26 lost+found
drwxrwsrwx 2 ofguest ofguest  4096 Sep 24 15:31 ocrvote
-rw-r--r-- 1 root    root      974 Sep 24 15:45 ocrvote.info.xml
[root@filer03 nfsvol_lv]# chown 500:500 ocrvote
[root@filer03 nfsvol_lv]# ls -l
total 44
-rw------- 1 root    root     7168 Sep 24 16:09 aquota.group
-rw------- 1 root    root     7168 Sep 24 16:09 aquota.user
drwxrwxrwx 2 root    root     4096 Sep 24 15:26 homes
drwx------ 2 root    root    16384 Sep 24 15:26 lost+found
drwxrwsrwx 2 500     500      4096 Sep 24 15:31 ocrvote
-rw-r--r-- 1 root    root      974 Sep 24 15:45 ocrvote.info.xml
[root@filer03 nfsvol_lv]#
ASM requires zero-padded files as ASM “disks”, so create one:
[root@filer03 nfsvol_lv]# dd if=/dev/zero of=ocrvote/nfsvotedisk01 bs=1G count=2
[root@filer03 nfsvol_lv]# chown 500:500 ocrvote/nfsvotedisk01
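If you want to convince yourself that the file really is zero-padded before handing it to ASM, a quick check is to strip all NUL bytes and count what is left. This is a sketch of my own, not a step from the original procedure; `is_zero_filled` is a hypothetical helper name, and it reads the whole file, so expect it to take a moment on a 2 GB file.

```shell
# Hypothetical helper: returns 0 if the file is non-empty and consists
# only of zero bytes, 1 otherwise.
is_zero_filled() {
  [ -s "$1" ] || return 1
  # Delete every NUL byte and count the bytes that remain.
  nonzero=$(tr -d '\000' < "$1" | wc -c | tr -d ' ')
  [ "$nonzero" -eq 0 ]
}
```

For example: `is_zero_filled ocrvote/nfsvotedisk01 && echo "ok"`. Run it before the disk is added to a disk group, because ASM writes its header to the file afterwards.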
Add the third voting disk
Almost there! Before performing any change to the cluster configuration it is always a good idea to take a backup.
[root@edcnode1 ~]# ocrconfig -manualbackup
edcnode1     2010/09/24 17:11:51     /u01/app/11.2.0/grid/cdata/edc/backup_20100924_171151.ocr
You only need to do this on one node. Recall that the current state is:
[oracle@edcnode1 ~]$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
 2. ONLINE   9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
 3. ONLINE   29fde7c3919b4fd6bf626caf4777edaa (ORCL:OCR02FILER01) [OCRVOTE]
Located 3 voting disk(s).
ASM sees it the same way:
SQL> select mount_status,header_status, name,failgroup,library
  2  from v$asm_disk
  3  /
MOUNT_S HEADER_STATU NAME                           FAILGROUP       LIBRARY
------- ------------ ------------------------------ --------------- ------------------------------------------------------------
CLOSED  PROVISIONED                                                 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED  PROVISIONED                                                 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED  PROVISIONED                                                 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED  PROVISIONED                                                 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED  MEMBER       OCR01FILER01                   OCR01FILER01    ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED  MEMBER       OCR01FILER02                   OCR01FILER02    ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED  MEMBER       OCR02FILER01                   OCR02FILER01    ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
7 rows selected.
Now here’s the idea: you add the NFS location to the ASM diskstring in addition to “ORCL:*” and all is well. But that didn’t work:
SQL> show parameter disk
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string
asm_diskstring                       string      ORCL:*
SQL>
SQL> alter system set asm_diskstring = 'ORCL:*, /ocrvote/nfsvotedisk01' scope=memory sid='*';
alter system set asm_diskstring = 'ORCL:*, /ocrvote/nfsvotedisk01' scope=memory sid='*'
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-15014: path 'ORCL:OCR01FILER01' is not in the discovery set
Regardless of what I tried, the system complained. Grudgingly I used the GUI – asmca.
After starting asmca, click on Disk Groups. Then select disk group “OCRVOTE”, and right-click to “add disks”. The trick is to click on “change discovery path”. Enter “ORCL:*, /ocrvote/nfsvotedisk01” (without quotes) into the dialog field and close it. Strangely, the NFS disk now appears. Tick two boxes: the one in front of the disk path, and the quorum box. A click on the OK button starts the magic, and you should be presented with a success message. The ASM instance reports a little more:
ALTER SYSTEM SET asm_diskstring='ORCL:*','/ocrvote/nfsvotedisk01' SCOPE=BOTH SID='*';
2010-09-29 10:54:52.557000 +01:00
SQL> ALTER DISKGROUP OCRVOTE ADD QUORUM DISK '/ocrvote/nfsvotedisk01' SIZE 500M /* ASMCA */
NOTE: Assigning number (1,3) to disk (/ocrvote/nfsvotedisk01)
NOTE: requesting all-instance membership refresh for group=1
2010-09-29 10:54:54.445000 +01:00
NOTE: initializing header on grp 1 disk OCRVOTE_0003
NOTE: requesting all-instance disk validation for group=1
NOTE: skipping rediscovery for group 1/0xd032bc02 (OCRVOTE) on local instance.
2010-09-29 10:54:57.154000 +01:00
NOTE: requesting all-instance disk validation for group=1
NOTE: skipping rediscovery for group 1/0xd032bc02 (OCRVOTE) on local instance.
2010-09-29 10:55:00.718000 +01:00
GMON updating for reconfiguration, group 1 at 5 for pid 27, osid 15253
NOTE: group 1 PST updated.
NOTE: initiating PST update: grp = 1
GMON updating group 1 at 6 for pid 27, osid 15253
2010-09-29 10:55:02.896000 +01:00
NOTE: PST update grp = 1 completed successfully
NOTE: membership refresh pending for group 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:05.285000 +01:00
GMON querying group 1 at 7 for pid 18, osid 4247
NOTE: cache opening disk 3 of grp 1: OCRVOTE_0003 path:/ocrvote/nfsvotedisk01
GMON querying group 1 at 8 for pid 18, osid 4247
SUCCESS: refreshed membership for 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:06.528000 +01:00
SUCCESS: ALTER DISKGROUP OCRVOTE ADD QUORUM DISK '/ocrvote/nfsvotedisk01' SIZE 500M /* ASMCA */
2010-09-29 10:55:08.656000 +01:00
NOTE: Attempting voting file refresh on diskgroup OCRVOTE
NOTE: Voting file relocation is required in diskgroup OCRVOTE
NOTE: Attempting voting file relocation on diskgroup OCRVOTE
NOTE: voting file allocation on grp 1 disk OCRVOTE_0003
2010-09-29 10:55:10.047000 +01:00
NOTE: voting file deletion on grp 1 disk OCR02FILER01
NOTE: starting rebalance of group 1/0xd032bc02 (OCRVOTE) at power 1
Starting background process ARB0
ARB0 started with pid=29, OS id=15446
NOTE: assigning ARB0 to group 1/0xd032bc02 (OCRVOTE) with 1 parallel I/O
2010-09-29 10:55:13.178000 +01:00
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
2010-09-29 10:55:15.533000 +01:00
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0xd032bc02 (OCRVOTE)
GMON updating for reconfiguration, group 1 at 9 for pid 31, osid 15451
NOTE: group 1 PST updated.
2010-09-29 10:55:17.907000 +01:00
NOTE: membership refresh pending for group 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:20.481000 +01:00
GMON querying group 1 at 10 for pid 18, osid 4247
SUCCESS: refreshed membership for 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:23.490000 +01:00
NOTE: Attempting voting file refresh on diskgroup OCRVOTE
NOTE: Voting file relocation is required in diskgroup OCRVOTE
NOTE: Attempting voting file relocation on diskgroup OCRVOTE
Superb! But did it kick out the correct disk? Yes it did. You now see OCR01FILER01 and OCR01FILER02 plus the NFS disk:
[oracle@edcnode1 ~]$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
 2. ONLINE   9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
 3. ONLINE   6107050ad9ba4fd1bfebdf3a029c48be (/ocrvote/nfsvotedisk01) [OCRVOTE]
Located 3 voting disk(s).
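A check like this is easy to script, for example to alert when fewer than three voting disks are online. The sketch below assumes the output layout shown above; `count_online_votedisks` is a hypothetical helper name and simply counts the numbered rows whose state is ONLINE.

```shell
# Hypothetical helper: reads `crsctl query css votedisk` output on stdin
# and prints how many voting disks are reported as ONLINE.
count_online_votedisks() {
  awk '$1 ~ /^[0-9]+\.$/ && $2 == "ONLINE" { n++ } END { print n + 0 }'
}
```

Usage would be `crsctl query css votedisk | count_online_votedisks`; for the output above it prints 3.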
Preferred Mirror Read
One of the cool new 11.1 features allows administrators of stretched RAC systems to instruct ASM to read mirrored extents rather than primary extents. This can speed up data access in cases where data would otherwise have been sent from the remote array. Setting this parameter is crucial to many implementations. In preparation for the RDBMS installation (to be detailed in the next post), I created a disk group consisting of 4 ASM disks, two from each filer. The syntax for the disk group creation is as follows:
SQL> create diskgroup data normal redundancy
  2  failgroup sitea disk 'ORCL:ASM01FILER01','ORCL:ASM01FILER02'
  3* failgroup siteb disk 'ORCL:ASM02FILER01','ORCL:ASM02FILER02'
SQL> /
Diskgroup created.
As you can see, all disks in sitea are from filer01 and form one failure group. The other disks, originating from filer02, form the second failure group.
You can see the result in v$asm_disk, as this example shows:
SQL> select name,failgroup from v$asm_disk;
NAME                           FAILGROUP
------------------------------ ------------------------------
ASM01FILER01                   SITEA
ASM01FILER02                   SITEA
ASM02FILER01                   SITEB
ASM02FILER02                   SITEB
OCR01FILER01                   OCR01FILER01
OCR01FILER02                   OCR01FILER02
OCR02FILER01                   OCR02FILER01
OCRVOTE_0003                   OCRVOTE_0003
8 rows selected.
Now all that remains to be done is to instruct the ASM instances to read from the local storage if possible. This is performed by setting an instance-specific init.ora parameter. I used the following syntax:
SQL> alter system set asm_preferred_read_failure_groups='DATA.SITEB' scope=both sid='+ASM2';
System altered.
SQL> alter system set asm_preferred_read_failure_groups='DATA.SITEA' scope=both sid='+ASM1';
System altered.
So I’m all set for the next step, the installation of the RDBMS software. But that’s for another post…
Finally time for a new series! With the arrival of the new 11.2.0.2 patchset I thought it was about time to try and set up a virtual 11.2.0.2 extended distance or stretched RAC. So, it’s virtual, fair enough. It doesn’t allow me to test things like the impact of latency on the inter-SAN communication, but it allowed me to test the general setup. Think of this series as a guide after all the tedious work has been done, and SANs happily talk to each other. The example requires some understanding of how XEN virtualisation works, and it’s tailored to openSuSE 11.2 as the dom0 or “host”. I have tried OracleVM in the past but back then a domU (or virtual machine) could not mount an iSCSI target without a kernel panic and reboot. Clearly not what I needed at the time. OpenSuSE has another advantage: it uses a recent kernel, not the 3 year old 2.6.18 you find in Enterprise distributions. Also, xen is recent (openSuSE 11.3 even features xen 4.0!) and so is libvirt.
The Setup
The general idea follows the design you find in the field, but with fewer cluster nodes. I am thinking of 2 nodes for the cluster, and 2 iSCSI target providers. I wouldn’t use iSCSI in the real world, but my lab isn’t connected to an EVA or similar. A third site will provide quorum via an NFS provided voting disk.
Site A will consist of filer01 for the storage part, and edcnode1 as the RAC node. Site B will consist of filer02 and edcnode2. The iSCSI targets are going to be provided by openFiler’s domU installation, and the cluster nodes will make use of Oracle Enterprise Linux 5 update 5. To make it more realistic, site C will consist of another openFiler instance, filer03, to provide the NFS export for the 3rd voting disk. Note that openFiler seems to support NFS v3 only at the time of this writing. All systems are 64bit.
The network connectivity will go through 3 virtual switches, all “host only” on my dom0.
As in the real world, private and storage network have to be separated to prevent iSCSI packets clashing with Cache Fusion traffic. Also, I increased the MTU for the private and storage networks to 9000 instead of the default 1500. If you would like to use jumbo frames you should check that your switch supports them.
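Whether an interface really runs with the larger MTU can be read from sysfs. The following is a sketch under my own assumptions: `iface_mtu` and `has_jumbo_frames` are hypothetical helper names, the interface names depend on your system, and the optional second argument only exists to point the functions at a different sysfs root.

```shell
# Hypothetical helper: print the MTU of interface $1 as reported by sysfs.
# $2 optionally overrides the sysfs root (default /sys/class/net).
iface_mtu() {
  cat "${2:-/sys/class/net}/$1/mtu"
}

# Returns 0 if the interface MTU is at least 9000.
has_jumbo_frames() {
  [ "$(iface_mtu "$@")" -ge 9000 ]
}
```

For example, `has_jumbo_frames eth1 || echo "eth1 is not using jumbo frames"`, with eth1 standing in for whichever interface carries the private network. To test the path end to end, a ping with the don't-fragment bit set and an 8972 byte payload (9000 minus 28 bytes of IP and ICMP headers), such as `ping -M do -s 8972 edcnode2-priv`, should only succeed if every hop accepts the large frames.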
Grid Infrastructure will use ASM to store OCR and voting disks, and the inter-SAN replication will also be performed by ASM in normal redundancy. I am planning on using preferred mirror read and intelligent data placement to see if that makes a difference.
Known limitations
This setup has some limitations, such as the following ones:
So much for the introduction. I’ll post the setup step-by-step. The intended series will consist of these articles:
That’s it for today, I hope I got you interested and following the series. It’s been real fun doing it; now it’s about writing it all up.
I have already written about the renamedg command, but since then fell in love with ASMLib. The use of ASMLib introduces a few caveats you should be aware of.
This document presents research I performed with ASM in a lab environment. It should be applicable to any environment, but you should NOT use this for production: the renamedg command is still buggy, and you should not mess with ASM disk headers in an important system such as production or staging/UAT. You set the importance here! The recommended setup for cloning disk groups is to use a data guard physical standby database on a different storage array to create a real time copy of your production database on that array. Again, do not use your production array for this!
Oracle ASMLib introduces a new value to the ASM header, called the provider string as the following example shows:
[root@asmtest ~]# kfed read /dev/oracleasm/disks/VOL1 | grep prov
kfdhdb.driver.provstr: ORCLDISKVOL1 ; 0x000: length=12
This can be verified with ASMLib:
[root@asmtest ~]# /etc/init.d/oracleasm querydisk /dev/xvdc1
Device "/dev/xvdc1" is marked an ASM disk with the label "VOL1"
The prefix “ORCLDISK” is automatically added by ASMLib and cannot easily be changed.
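Since the label is just the provider string minus that prefix, it can be extracted from the kfed output directly. This is an illustrative sketch only; `asmlib_label` is a hypothetical helper name and it assumes the `kfdhdb.driver.provstr` line looks like the one shown above.

```shell
# Hypothetical helper: reads kfed output on stdin and prints the ASMLib
# label, i.e. the provider string with the "ORCLDISK" prefix removed.
asmlib_label() {
  sed -n 's/^kfdhdb\.driver\.provstr:[[:space:]]*ORCLDISK\([^[:space:];]*\).*/\1/p'
}
```

With the disk from the example, `kfed read /dev/oracleasm/disks/VOL1 | asmlib_label` should print VOL1, which is the same label querydisk reports.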
The problem with ASMLib is that the renamedg command does NOT update the provider string, which I’ll illustrate by walking through an example session. Disk group “DATA”, setup with external redundancy and two disks, DATA1 and DATA2, is to be cloned to “DATACLONE”.
The renamedg command requires the disk group to be cloned to be stopped. To prevent nasty surprises, you should stop the databases using that diskgroup manually.
[grid@rac11gr2drnode1 ~]$ srvctl stop database -d dev
[grid@rac11gr2drnode1 ~]$ ps -ef | grep smon
grid      3424     1  0 Aug07 ?        00:00:00 asm_smon_+ASM1
grid     17909 17619  0 15:13 pts/0    00:00:00 grep smon
[grid@rac11gr2drnode1 ~]$ srvctl stop diskgroup -g data
[grid@rac11gr2drnode1 ~]$
You can use the new “lsof” command of asmcmd to check for open files:
ASMCMD> lsof
DB_Name  Instance_Name  Path
+ASM     +ASM1          +ocrvote.255.4294967295
asmvol   +ASM1          +acfsdg/APACHEVOL.256.724157197
asmvol   +ASM1          +acfsdg/DRL.257.724157197
ASMCMD>
So apart from files from other disk groups no files are open, especially not referring to disk group DATA.
Now comes the part where you copy the LUNs, and this entirely depends on your system. The EVA series of storage arrays I worked with in this particular project offered a “snapclone” function, which used COW to create an identical copy of the source LUN, with a new WWID (which can be an input parameter to the snapclone call). When you are using device-mapper-multipath then ensure that your sys admins add the newly created LUNs to the /etc/multipath.conf file on all cluster nodes!
I am using Xen in my lab, which makes it simpler: all I need to do is copy the disk containers on the dom0 and then add the new block devices to the running domU (“virtual machine” in Xen language). This can be done easily as the following example shows:
Usage: xm block-attach
xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/rac11gr2drShared/oradata1.clone xvdg w!
xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/rac11gr2drShared/oradata1.clone xvdg w!
xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/rac11gr2drShared/oradata2.clone xvdh w!
xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/rac11gr2drShared/oradata2.clone xvdh w!
In the example, rac11gr2drnode{1,2} are the domUs, the backend device is the copied file on the file system, the front end device in the domU is xvd{g,h}, and the mode is read/write, shareable. The exclamation mark here is crucial, or else the second domU can’t mount the new block device because it is already exclusively mounted to another domU.
The fdisk command in my example immediately “sees” the new LUNs, with device mapper multipathing you might have to go through iterations of restarting multipathd and discovering partitions using kpartx. It is again very important to have all disks presented to all cluster nodes!
Here’s the sample output from my system:
[root@rac11gr2drnode1 ~]# fdisk -l | grep Disk | sort
Disk /dev/xvda: 4294 MB, 4294967296 bytes
Disk /dev/xvdb: 16.1 GB, 16106127360 bytes
Disk /dev/xvdc: 5368 MB, 5368709120 bytes
Disk /dev/xvdd: 16.1 GB, 16106127360 bytes
Disk /dev/xvde: 16.1 GB, 16106127360 bytes
Disk /dev/xvdf: 10.7 GB, 10737418240 bytes
Disk /dev/xvdg: 16.1 GB, 16106127360 bytes
Disk /dev/xvdh: 16.1 GB, 16106127360 bytes
I cloned /dev/xvdd and /dev/xvde to /dev/xvdg and /dev/xvdh.
Do NOT run /etc/init.d/oracleasm scandisks yet! Otherwise the renamedg command will complain about duplicate disk names, which is entirely reasonable.
I dumped all headers for disks /dev/xvd{d,e,g,h}1 to /tmp to be able to compare.
[root@rac11gr2drnode1 ~]# kfed read /dev/xvdd1 > /tmp/xvdd1.header # repeat with the other disks
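The "repeat with the other disks" step can be written as a loop. This is only a sketch: the function name dump_headers and the KFED override variable are hypothetical conveniences, not part of any Oracle tooling. The only real command used is "kfed read", as shown above.

```shell
# Hypothetical helper: dump the ASM header of each device into
# <outdir>/<device basename>.header using "kfed read".
# KFED is an illustrative override; by default kfed is taken from PATH.
dump_headers() {
    outdir=$1
    shift
    for dev in "$@"; do
        "${KFED:-kfed}" read "$dev" > "$outdir/$(basename "$dev").header" || return 1
    done
}
```

On my system the call would be "dump_headers /tmp /dev/xvdd1 /dev/xvde1 /dev/xvdg1 /dev/xvdh1".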
Start with phase one of the renamedg command:
[root@rac11gr2drnode1 ~]# renamedg phase=one dgname=DATA newdgname=DATACLONE \
> confirm=true verbose=true config=/tmp/cfg
Parsing parameters..
Parameters in effect:
  Old DG name   : DATA
  New DG name   : DATACLONE
  Phases        :
    Phase 1
  Discovery str : (null)
  Confirm       : TRUE
  Clean         : TRUE
  Raw only      : TRUE
renamedg operation: phase=one dgname=DATA newdgname=DATACLONE confirm=true verbose=true config=/tmp/cfg
Executing phase 1
Discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with disk number:1 and timestamp (32940276 1937075200)
Checking for hearbeat...
Re-discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with disk number:1 and timestamp (32940276 1937075200)
Checking if the diskgroup is mounted
Checking disk number:0
Checking disk number:1
Checking if diskgroup is used by CSS
Generating configuration file..
Completed phase 1
Terminating kgfd context 0x2b7a2fbac0a0
[root@rac11gr2drnode1 ~]#
You should always check “$?” for errors: the message “Terminating kgfd context” sounds bad, but isn’t. At the end of phase one the disk headers are still unchanged. Only phase two modifies them:
[root@rac11gr2drnode1 ~]# renamedg phase=two dgname=DATA newdgname=DATACLONE config=/tmp/cfg
Parsing parameters..
renamedg operation: phase=two dgname=DATA newdgname=DATACLONE config=/tmp/cfg
Executing phase 2
Completed phase 2
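Because the exit status is the only reliable success indicator here, both phases can be wrapped in a small checker. The sketch below assumes renamedg exits non-zero on failure, which is what checking “$?” relies on. The function name run_checked is hypothetical.

```shell
# Hypothetical wrapper: run a command, report its exit status, and pass a
# non-zero status back to the caller so a script can stop before phase two.
run_checked() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "FAILED (exit $rc): $*" >&2
        return "$rc"
    fi
    echo "OK: $*"
}
```

Used as "run_checked renamedg phase=one ... && run_checked renamedg phase=two ...", the second phase only runs when the first one reported success.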
Now there are changes:
[root@rac11gr2drnode1 tmp]# grep DATA *header
xvdd1.header:kfdhdb.driver.provstr: ORCLDISKDATA1 ; 0x000: length=13
xvdd1.header:kfdhdb.dskname: DATA1 ; 0x028: length=5
xvdd1.header:kfdhdb.grpname: DATACLONE ; 0x048: length=9
xvdd1.header:kfdhdb.fgname: DATA1 ; 0x068: length=5
xvde1.header:kfdhdb.driver.provstr: ORCLDISKDATA2 ; 0x000: length=13
xvde1.header:kfdhdb.dskname: DATA2 ; 0x028: length=5
xvde1.header:kfdhdb.grpname: DATACLONE ; 0x048: length=9
xvde1.header:kfdhdb.fgname: DATA2 ; 0x068: length=5
xvdg1.header:kfdhdb.driver.provstr: ORCLDISKDATA1 ; 0x000: length=13
xvdg1.header:kfdhdb.dskname: DATA1 ; 0x028: length=5
xvdg1.header:kfdhdb.grpname: DATA ; 0x048: length=4
xvdg1.header:kfdhdb.fgname: DATA1 ; 0x068: length=5
xvdh1.header:kfdhdb.driver.provstr: ORCLDISKDATA2 ; 0x000: length=13
xvdh1.header:kfdhdb.dskname: DATA2 ; 0x028: length=5
xvdh1.header:kfdhdb.grpname: DATA ; 0x048: length=4
xvdh1.header:kfdhdb.fgname: DATA2 ; 0x068: length=5
Although the original disks (/dev/xvdd1 and /dev/xvde1) had their disk group name changed, the provider string remained untouched. So if we were to issue a scandisks command now through /etc/init.d/oracleasm, there’d still be duplicate disk names. This is a bug in my opinion, and a bad thing.
Renaming the disks is straightforward; the difficult bit is finding out which ones have to be renamed. Again, you can use kfed to figure that out. After consulting the header information I knew the disks to be renamed were /dev/xvdd1 and /dev/xvde1.
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvdd1 DATACLONE1
Renaming disk "/dev/xvdd1" to "DATACLONE1": [ OK ]
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvde1 DATACLONE2
Renaming disk "/dev/xvde1" to "DATACLONE2": [ OK ]
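Finding the devices that carry the new group name can also be scripted against the header dumps. This is a sketch under two assumptions: the dumps are named <device>.header as in my /tmp directory, and the kfdhdb.grpname line has the layout shown in the grep output above. The function name headers_in_group is hypothetical.

```shell
# Hypothetical helper: print the device name of every <device>.header dump
# in a directory whose kfdhdb.grpname field equals the given group name.
headers_in_group() {
    dir=$1
    group=$2
    for f in "$dir"/*.header; do
        [ -f "$f" ] || continue
        # The group name is the second field on the kfdhdb.grpname line.
        name=$(awk '$1 == "kfdhdb.grpname:" { print $2; exit }' "$f")
        [ "$name" = "$group" ] && basename "$f" .header
    done
    return 0
}
```

With fresh dumps taken after phase two, "headers_in_group /tmp DATACLONE" lists the partitions whose ASMLib label still needs to be changed.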
I then performed a scandisks operation on all nodes just to be sure… I had corruption of the disk group before :)
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac11gr2drnode1 tmp]#

[root@rac11gr2drnode2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac11gr2drnode2 ~]#
The output on all cluster nodes should be identical. On my system I found the following disks:
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm listdisks
ACFS1
ACFS2
ACFS3
ACFS4
DATA1
DATA2
DATACLONE1
DATACLONE2
VOL1
VOL2
VOL3
VOL4
VOL5
Sure enough, the cloned disks were present. Although everything seemed OK at this point, I could not start disk group DATA and had to reboot the cluster nodes to rectify that problem. Maybe some not-so-transient information about ASM disks is stored somewhere. After the reboot, CRS started my database correctly, along with all dependent resources:
[oracle@rac11gr2drnode1 ~]$ srvctl status database -d dev
Instance dev1 is running on node rac11gr2drnode1
Instance dev2 is running on node rac11gr2drnode2