Troubleshooting Grid Infrastructure startup

Today has been an interesting day: one of my blades decided to reboot after an EXT3 journal error. The hard facts first:

  • Oracle Linux 5.5 with kernel 2.6.18-194.11.4.0.1.el5
  • Oracle 11.2.0.2 RAC
  • Bonded NICs for private and public networks
  • BL685-G6 with 128G RAM

I first noticed the node had problems when I tried to list all the databases configured on the cluster and got the dreaded “cannot communicate with the CRSD”:

[oracle@node1.example.com] $ srvctl config database
PRCR-1119 : Failed to look up CRS resources of database type
PRCR-1115 : Failed to find entities of type resource that match filters (TYPE ==ora.database.type) and contain attributes DB_UNIQUE_NAME,ORACLE_HOME,VERSION
Cannot communicate with crsd

Not too great, especially since everything worked when I left yesterday. What could have gone wrong? An obvious reason would be a reboot, and fair enough, there had been one:

[grid@node1.example.com] $ uptime
09:09:22 up  2:40,  1 user,  load average: 1.47, 1.46, 1.42

The next step was to check if the local CRS stack was up, or better, to check what was down. Sometimes it’s only crsd which has a problem. In my case everything was down:

[grid@node1.example.com] $ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[grid@node1.example.com] $ crsctl check cluster -all
**************************************************************
node1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
CRS-4404: The following nodes did not reply within the allotted time:
node2,node3, node4, node5, node6, node7, node8

The CRS-4404 message was slightly misleading: I assumed all cluster nodes were down after a clusterwide reboot - sometimes a single node reboot triggers worse things. However, logging on to node 2 I saw that all nodes but the first were ok.
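If you want to see exactly which of the lower-stack daemons are down on the local node, the -init flag of crsctl does the trick. Roughly (illustrative, abbreviated output):

[grid@node1.example.com] $ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  OFFLINE
ora.crsd
      1        ONLINE  OFFLINE
...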

CRSD really needs CSSD to be up and running, and CSSD requires the OCR to be there. I wanted to know if the OCR was impacted in any way:

[grid@node1.example.com] $ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
ORA-29701: unable to connect to Cluster Synchronization Service

Well, it seemed that the OCR location was unavailable. I knew that on this cluster the OCR is stored in ASM. Common reasons for the PROC-26 error are:

  • Unix admin upgrades the kernel but forgets to upgrade the ASMLib kernel module (common grief with ASMLib!)
  • Storage is not visible on the host, i.e. SAN connectivity broken/taken away (happens quite frequently with storage/sys admin unaware of ASM)
  • Permissions not set correctly on the block devices (not an issue when using ASMLib)

I checked ASMLib and it reported a working status:

[oracle@node1.example.com] $ /etc/init.d/oracleasm status
Checking if ASM is loaded: yes
Checking if /dev/oracleasm is mounted: yes

That was promising: /dev/oracleasm/ was populated and the matching kernel modules were loaded. /etc/init.d/oracleasm listdisks listed all my disks as well. Physical storage not being accessible (PROC-26) seemed a bit unlikely now.

I could rule out permission problems since ASMLib was working fine, and I also ruled out the kernel upgrade/missing module problem by comparing the RPM with the kernel version: they matched. So maybe it was storage related?
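For reference, that comparison boils down to something like this - a sketch with illustrative package versions; the exact names depend on your distribution and ASMLib release:

[root@node1 ~]# uname -r
2.6.18-194.11.4.0.1.el5
[root@node1 ~]# rpm -qa | grep oracleasm
oracleasm-2.6.18-194.11.4.0.1.el5-2.0.5-1.el5
oracleasm-support-2.1.3-1.el5
oracleasmlib-2.0.4-1.el5

The kernel version embedded in the oracleasm kernel module package has to match the output of uname -r.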

Why did the node go down?

Good question, and one usually directed at the unix administration team. Luckily I have a good contact placed right inside that team and could get the following excerpt from /var/log/messages around the time of the crash (6:31 this morning):

Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192116, count = 1
Mar 17 06:26:06 node1 kernel: Aborting journal on device dm-2.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192216, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192166, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192122, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192140, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192174, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 10 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_truncate: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_orphan_del: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_delete_inode: Journal has aborted
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 kernel: ext3_abort called.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Mar 17 06:26:06 node1 kernel: Remounting filesystem read-only
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:41076
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:41076
Mar 17 06:26:09 node1 kernel: SysRq : Resetting
Mar 17 06:31:15 node1 syslogd 1.4.1: restart.

So it looks like a file system error triggered the reboot - I’m glad the box came back up ok on its own. The $GRID_HOME/log/<hostname>/alert<hostname>.log file didn’t show anything specific to storage. Normally you would see it start counting a node down if it lost contact with the voting disks (in this case OCR and voting disks share the same diskgroup).

And why does Clusterware not start?

After some more investigation it seemed there was no underlying problem with the storage, so I tried to manually start the cluster, tailing the ocssd.log file for possible clues.

[root@node1 ~]# crsctl start cluster
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2674: Start of 'ora.cssd' on 'node1' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'node1'
CRS-2681: Clean of 'ora.cssd' on 'node1' succeeded
CRS-5804: Communication error with agent process
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'

… the command eventually failed. The ocssd.log file showed this:

...
2011-03-17 09:47:49.073: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996354 ms, node 0x2aaab008a260 { host 'node4', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-a7491097, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.084: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996364 ms, node 0x2aaab008a630 { host 'node6', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-2f6ece1c, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.113: [    CSSD][1113332032]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846440, LATS 10996434, lastSeqNo 30846437, uniqueness 1300108895, timestamp 1300355268/3605443434
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355257, LATS 10996434, lastSeqNo 31355254, uniqueness 1300344405, timestamp 1300355268/10388584
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372473, LATS 10996434, lastSeqNo 31372470, uniqueness 1297097908, timestamp 1300355268/3605182454
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384686, LATS 10996434, lastSeqNo 31384683, uniqueness 1297098093, timestamp 1300355268/3604696294
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388819, LATS 10996434, lastSeqNo 31388816, uniqueness 1297098327, timestamp 1300355268/3604712934
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612975, LATS 10996434, lastSeqNo 29612972, uniqueness 1297685443, timestamp 1300355268/3603054884
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203293, LATS 10996434, lastSeqNo 31203290, uniqueness 1297156000, timestamp 1300355268/3604855704
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 3, node33, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355258, LATS 10996434, lastSeqNo 31355255, uniqueness 1300344405, timestamp 1300355268/10388624
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372474, LATS 10996434, lastSeqNo 31372471, uniqueness 1297097908, timestamp 1300355268/3605182494
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384687, LATS 10996434, lastSeqNo 31384684, uniqueness 1297098093, timestamp 1300355268/3604696304
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388821, LATS 10996434, lastSeqNo 31388818, uniqueness 1297098327, timestamp 1300355268/3604713224
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612977, LATS 10996434, lastSeqNo 29612974, uniqueness 1297685443, timestamp 1300355268/3603055224
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846441, LATS 10996474, lastSeqNo 30846438, uniqueness 1300108895, timestamp 1300355269/3605443654
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355259, LATS 10996474, lastSeqNo 31355256, uniqueness 1300344405, timestamp 1300355268/10389264
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203294, LATS 10996474, lastSeqNo 31203291, uniqueness 1297156000, timestamp 1300355269/3604855914
2011-03-17 09:47:49.619: [    CSSD][1116485952]clssnmSendingThread: sending join msg to all nodes
...

The interesting bit is the “but no network HB”, i.e. something must be wrong with the network configuration. I quickly checked the output of ifconfig and found a missing entry for my private interconnect. If you are unsure which interface that is, it’s defined in the GPnP profile:
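You can dump the profile with gpnptool, which ships with Grid Infrastructure. The network section looks roughly like this - an illustrative sketch in which bond1.251 is the interconnect from this cluster and everything else is a placeholder:

[grid@node1.example.com] $ gpnptool get
...
<gpnp:Network-Profile>
  <gpnp:HostNetwork id="gen" HostName="*">
    <gpnp:Network id="net1" IP="10.x.x.0" Adapter="bond0" Use="public"/>
    <gpnp:Network id="net2" IP="172.x.x.0" Adapter="bond1.251" Use="cluster_interconnect"/>
  </gpnp:HostNetwork>
</gpnp:Network-Profile>
...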


Now that’s a starting point! I tried to bring up bond1.251, but that failed:

[root@node1 network-scripts]# ifup bond1.251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
ERROR: could not add vlan 251 as bond1.251 on dev bond1

The “invalid argument” didn’t mean too much to me, so I dug a little deeper to find out which argument was invalid:

[root@node1 network-scripts]# which ifup
/sbin/ifup
[root@node1 network-scripts]# view /sbin/ifup
# turned out it's a shell script! Let's run with debug output enabled
[root@node1 network-scripts]# bash -x /sbin/ifup bond1.251
+ unset WINDOW
...
+ MATCH='^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$'
+ [[ bond1.251 =~ ^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$ ]]
++ echo bond1.251
++ LC_ALL=C
++ sed 's/^[a-z0-9]*\.0*//'
+ VID=251
+ PHYSDEV=bond1
+ [[ bond1.251 =~ ^vlan[0-9]{1,4}? ]]
+ '[' -n 251 ']'
+ '[' '!' -d /proc/net/vlan ']'
+ test -z ''
+ VLAN_NAME_TYPE=DEV_PLUS_VID_NO_PAD
+ /sbin/vconfig set_name_type DEV_PLUS_VID_NO_PAD
+ is_available bond1
+ LC_ALL=
+ LANG=
+ ip -o link
+ grep -q bond1
+ '[' 0 = 1 ']'
+ return 0
+ check_device_down bond1
+ echo bond1
+ grep -q :
+ LC_ALL=C
+ ip -o link
+ grep -q 'bond1[:@].*,UP'
+ return 1
+ '[' '!' -f /proc/net/vlan/bond1.251 ']'
+ /sbin/vconfig add bond1 251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
+ /usr/bin/logger -p daemon.info -t ifup 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
+ echo 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
ERROR: could not add vlan 251 as bond1.251 on dev bond1
+ exit 1

Hmmm - so it seemed that the underlying interface bond1 was missing, which was true: the output of ifconfig didn’t show it as configured, and trying to start it manually using ifup bond1 failed as well. It turned out that the ifcfg-bond1 file was missing and had to be recreated from the documentation. On Red Hat based systems the network configuration files live in /etc/sysconfig/network-scripts/ifcfg-interfaceName.
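For illustration, a minimal ifcfg-bond1 might look like the following - the bonding options here are assumptions, so take the real values from your own network documentation:

DEVICE=bond1
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=1 miimon=100"

With the recreated file in place, I was back in the running: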

[root@node1 network-scripts]# ll *bond1*
-rw-r--r-- 1 root root 129 Mar 17 10:07 ifcfg-bond1
-rw-r--r-- 1 root root 168 May 19  2010 ifcfg-bond1.251
[root@node1 network-scripts]# ifup bond1
[root@node1 network-scripts]# ifup bond1.251
Added VLAN with VID == 251 to IF -:bond1:-
[root@node1 network-scripts]#

Now I could try to start the lower stack again:

CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.evmd' on 'node1'
CRS-2676: Start of 'ora.evmd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'node1'
CRS-2681: Clean of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded
[root@node1 network-scripts]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Brilliant - problem solved. This is actually the first time that an incorrect network configuration prevented a cluster I looked after from starting. The best indication in this case is in the gipcd log file, but it didn’t occur to me to have a look at it since the errors appeared so clearly storage related.

DBCA failing to create RAC databases

This is a weird problem I ran into today. As part of an automation project the code deploys RAC One Node databases across a cluster, depending on the capacity available on the node. The nodes are 128G RAM BL685c G6 blades currently, but will be upgraded to G7 later.

Now, my problem was that after the weekend we couldn’t deploy any more RAC One databases, except on one node; DBCA simply created single instance databases instead. Newly created databases were properly registered in the OCR and their build completed ok - just not as RAC One databases. Take this database for example:

$ srvctl config database -d MYDB
Database unique name: MYDB
Database name: MYDB
Oracle home: /u01/app/oracle/product/11.2.0.2
Oracle user: oracle
Spfile: +DATA/MYDB/spfileMYDB.ora
Domain: example.com
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: MYDB
Database instances: MYDB
Disk Groups: DATA
Mount point paths:
Services:
Type: SINGLE
Database is administrator managed

How come? We were sure that we passed the RACOneNode flag to dbca, and it could indeed be found on the command line. Trying again I spotted it in the process table (alongside the sys and system passwords … you should change these as soon as DBCA completes!):

[rac]oracle@node2.example.com $ ps -ef|grep MYDB
oracle   14865 14854 65 11:22 ?        00:00:07 /u01/app/oracle/product/11.2.0.2/jdk/jre/bin/java -Doracle.installer.not_bootstrap=true -DORACLE_HOME=/u01/app/oracle/product/11.2.0.2 -DSET_LAF= -Dsun.java2d.font.DisableAlgorithmicStyles=true -Dice.pilots.html4.ignoreNonGenericFonts=true -DDISPLAY= -DJDBC_PROTOCOL=thin -mx128m -classpath ... oracle.sysman.assistants.dbca.Dbca -silent -createDatabase -templateName /u01/app/oracle/product/admin/templates/Default.dbc -gdbName MYDB.example.com -RACOneNode -RACOneNodeServiceName MYDB_APP.example.com -sid MYDB -sysPassword xxx -systemPassword xxx -emConfiguration NONE -totalMemory 4096 -storageType ASM -asmSysPassword xxx -diskGroupName DATA -initParams db_create_file_dest=+DATA,cpu_count=1 -nodelist node2
oracle   15415  5109  0 11:22 pts/0    00:00:00 grep MYDB

So why the problem? Looking at the dbca trace file I found these lines:

[main] [ 2011-03-14 14:15:31.845 GMT ] [SQLEngine.initialize:363]  Starting Reader Thread...
[main] [ 2011-03-14 14:15:31.927 GMT ] [OracleHome.initOptions:1240]  executing: startup nomount pfile='/u01/app/oracle/product/11.2.0.2/dbs/initDBUA0.ora'
[main] [ 2011-03-14 14:15:55.417 GMT ] [SQLEngine.done:2167]  Done called
[main] [ 2011-03-14 14:15:55.418 GMT ] [OracleHome.initOptions:1247]  ORA-00304: requested INSTANCE_NUMBER is busy

oracle.sysman.assistants.util.sqlEngine.SQLFatalErrorException: ORA-00304: requested INSTANCE_NUMBER is busy

 at oracle.sysman.assistants.util.sqlEngine.SQLEngine.executeImpl(SQLEngine.java:1655)
 at oracle.sysman.assistants.util.sqlEngine.SQLEngine.executeSql(SQLEngine.java:1903)
 at oracle.sysman.assistants.util.OracleHome.initOptions(OracleHome.java:1241)
 at oracle.sysman.assistants.dbca.backend.SilentHost.initialize(SilentHost.java:179)
 at oracle.sysman.assistants.dbca.Dbca.execute(Dbca.java:116)
 at oracle.sysman.assistants.dbca.Dbca.main(Dbca.java:180)
[main] [ 2011-03-14 14:15:55.420 GMT ] [OracleHome.initOptions:1250]  executing: select parameter from v$option where value='TRUE'
[main] [ 2011-03-14 14:15:55.420 GMT ] [SQLEngine.reInitialize:735]  Reinitializing SQLEngine...

Interesting - the ORA-304 error sticks out. (The DBCA logs are in $ORACLE_BASE/cfgtoollogs/dbca/dbName/ in 11.2, by the way.) Further down the logfile DBCA determines that the RAC option is not available. That isn’t true, and I checked on each node:

$ cd $ORACLE_HOME/rdbms/lib
$ nm -r libknlopt.a | grep -c kcsm.o
1

That was identical on all nodes, so we definitely had RAC compiled into the oracle binary. I also compared the size and timestamp of the oracle binaries in $ORACLE_HOME, only to find them identical. However, dbca wasn’t impressed by my contradicting evidence and went on creating single instance databases. That was becoming a little inconvenient.

I then tried relocating one of the successfully created RAC One databases to the nodes where we had problems building them, hoping to find out more about the problem. At this stage I was convinced there was a problem with semaphores or other SysV IPC.

I certainly didn’t want to use the Windows Fix and reboot!

Moving On

So to recap, we should be able to build RAC (One Node) databases as the option is compiled into the binary, and yet it doesn’t work. From the trace I gathered that Oracle builds an auxiliary instance first and uses initDBUA0.ora in $ORACLE_HOME/dbs to start it. So where are its logs, where’s its diagnostic destination? It turns out to be $ORACLE_HOME/log/ - simply set your ADR base to this location and use the familiar commands.
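In adrci that’s a quick exercise; a sketch, with the ADR base from above and an assumed diag home name:

$ adrci
adrci> set base /u01/app/oracle/product/11.2.0.2/log
adrci> show homes
adrci> set home diag/rdbms/dbua0/DBUA0
adrci> show alert -tail 50

And this finally gave me a clue: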

*** 2011-03-14 12:06:40.104
2011-03-14 12:06:40.104: [ CSSCLNT]clssgsGroupJoin: member in use group(0/DBDBUA0)
kgxgnreg: error: status 14
kgxgnreg: error: member number 0 is already in use
kjxgmjoin: can not join the group (DBDBUA0) with id 0 (inst 1)
kjxgmjoin: kgxgn error 3

So somewhere else in the cluster there had to be a DBUA0 instance that prevented my new instance from starting. A quick trawl through the process table on all nodes revealed that DBUA was indeed still active on node6. Shutting that down solved the problem!
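The trawl itself is quick if user equivalence is set up between the nodes; a sketch using this cluster’s node names:

for h in node1 node2 node3 node4 node5 node6 node7 node8; do
  echo "== $h =="
  ssh $h 'ps -ef | grep -i [d]bua'
done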

Summary

DBCA is a nice tool to create databases, and together with user definable templates it is really flexible. From a technical point of view it works as follows:

  • For RAC and RAC One Node it tries to create an auxiliary instance, called DBUA0, as a cluster database. If DBUA0 is used on the same node, it will use DBUA1 etc.
  • Next it will rename the database to what we assign on the command line
  • It then performs a lot more actions which are not of relevance here.

In my case, one of these DBUA0 auxiliary instances was still present on a different node in the cluster as the result of a crashed database creation. When a subsequent call to dbca created another auxiliary (cluster!) DBUA0 instance on a different node, it wasn’t aware that a DBUA0 already existed, and LMON refused to start it. This is expected behaviour - instance names have to be unique across the cluster. The DBUA0 on node2 for example clashed with the one on node6.

Why did it work on node6 then, I hear you ask? DBCA seems to have logic to establish that DBUA0 on the local node is in use, and uses DBUA1 next.

Update

I got this update from Oracle Support, who acknowledged this as a bug:

Notes 16-Mar-2011 1:21:16 GMT+00:00 PM Oracle Support
Unscheduled
Generic Note
————————
Dear Martin,

Following bug is created for this particular issue.

Bug 11877668 – DBCA DOESN’T CREATE A RAC ONE DATABASE IN SILENT MODE

Cluster callouts to create blackouts in EM

Finally I got around to providing a useful example for a cluster callout script. It is actually on the verge of taking too long-remember that scripts in the $GRID_HOME/racg/usrco/ directory should execute quickly. Before deploying this, you should definitely ensure that the script executes quickly enough-the “time” utility can help you with this. Nevertheless this has been necessary to work around a limitation of Grid Control: RAC One Node databases are not supported in GC 11.1 (I complained about that earlier).

The Problem

To work around the problem I wrote a script which can alleviate one of the problems that arise: when using srvctl relocate database, another instance (usually called dbName_2) will be started to allow existing sessions to survive the failover operation if they use TAF or FAN/FCF.

This poses a big problem to Grid Control though: the second instance didn’t exist when you registered the database as a target, hence GC doesn’t know about it. Subsequently you may get paged that the database is down when in reality it is not. Receiving one of these “false positive” alarms is annoying at best at 02:00 in the morning. Strictly speaking, Grid Control is right in assuming that the database is down: although detected as a cluster database target, it only consists of one instance. If that’s down, it has to be assumed that the whole cluster database is down. In a perfect world we wouldn’t have this problem - GC would be aware that the RON database moved to another node in the cluster and would update its configuration accordingly. This is planned for the next major release sometime later in 2011; apparently dbconsole already has the ability to deal with such a situation.

With the background explained, management had to weigh the options: either don’t register the RAC One database in Grid Control and have no monitoring at all, or bite the bullet and have monitoring only while the initial instance is running on the primary node. The decision was made to have (limited) monitoring. To prevent the DBA from being woken up I developed the simple script below to automatically create a blackout in GC when the “_2” instance starts. The blackout is lifted again when the “_1” instance starts.

Room for improvement: the script assumes that a RON database has a maximum of two member servers. If your database can run on more than two nodes, you should use emcli relocate_target when the _1 instance comes up on a different node from the one GC expects.

The Script

My algorithm checks for cluster events, and if an instance dbName_2 starts, I create a blackout on the initial instance to prevent being paged, until Oracle comes up with a better solution (we are flying blind once the second database instance has started).

The script assumes that you have deployed emcli on each cluster node (or on ACFS). EMCLI is the Enterprise Manager Command Line Interface; it’s located on your OMS together with the installation instructions, at this default location: https://oms.example.com:7799/em/console/emcli/download - 7799 is the default port for Grid Control.

Let’s have a look at the script:

#!/bin/bash

# enable debugging if needed
set -x
exec >> /tmp/autoBlackout.log 2>&1

EVENTTYPE=$1

# only SERVICEMEMBER populates instance, database and service as needed
# for the blackout section below.
if [ "$EVENTTYPE" != "SERVICEMEMBER" ]; then
 exit 0
fi

# adjust to your needs or set to the empty string if you are not using db_domain
# assumes that both database and service have the same domain
DOMAIN=example.com

# bail out if there are too many instances of this script running. Inform the
# admin via email
ME=`basename $0`
RUNNING=`ps -ef | grep -v grep | grep $ME | wc -l`
if [ $RUNNING -ge 6 ]; then
 echo Too many instances of this script running, aborting
 echo Too many instances of $ME running, aborting |
 mail -s "$RUNNING instances of $ME detected on `hostname`" admin@example.com
 exit 1
fi

# set up for emcli (emcli requires jdk 1.6)
JAVA_HOME=/shared/acfs/emcli/jdk1.6.0_24
PATH=$JAVA_HOME/bin:$PATH
EMCLI=/shared/acfs/emcli/emcli
export JAVA_HOME PATH

# turn off debugging for a moment - the below parsing of the command line
# parameters is very verbose.
set +x

# read the parameters passed to us-modified version of a script
# found at rachelp.nl
for ARGS in $*;
 do
 PROPERTY=`echo $ARGS | /bin/awk -F"=|[ ]" '{print $1}'`
 VALUE=`echo $ARGS | /bin/awk -F"=|[ ]" '{print $2}'`
 case $PROPERTY in
 VERSION|version)    VERSION=$VALUE ;;
 SERVICE|service)    SERVICE=$VALUE ;;
 DATABASE|database)    DATABASE=$VALUE ;;
 INSTANCE|instance)    INSTANCE=$VALUE ;;
 HOST|host)        HOST=$VALUE ;;
 STATUS|status)        STATUS=$VALUE ;;
 REASON|reason)        REASON=$VALUE ;;
 CARD|card)        CARDINALITY=$VALUE ;;
 TIMESTAMP|timestamp)    LOGDATE=$VALUE ;;
 ??:??:??)        LOGTIME=$VALUE ;;
 esac
 done

# and turn debugging on again
set -x

# targets are reported in lower case :( Someone please suggest a better
# way to get a lower case string to upper case
DATABASE=`echo $DATABASE | tr "[a-z]" "[A-Z]"`
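# (a more robust alternative, suggested here as a sketch:
#  DATABASE=`echo $DATABASE | tr '[:lower:]' '[:upper:]'`
#  or, with bash 4 and later, simply DATABASE=${DATABASE^^})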

# targets affected are rac_database and the oracle_database (instance)
# not using emcli here as it has to be quick. A rac_database target is a
# composite target, consisting of multiple oracle_database targets. In
# RAC One Node there is only one instance - see output from GC below:
# $ emcli get_targets | grep "RAC"
# 0       Down           oracle_database  RAC.example.com_RAC_1
# 0       Down           rac_database     RAC.example.com

# define what we want to black out (only ever the primary instance!)
BLACKOUT_NAME=blackout_${DATABASE}
BLACKOUT_TARGETS="$DATABASE.${DOMAIN}:rac_database;${DATABASE}.${DOMAIN}_${DATABASE}_1:oracle_database"

# create a blackout if the secondary instance is up (we only ever register the _1 instance)
# the blackout duration is indefinite-it will be stopped and lifted automatically. You may
# want to limit this to a few hours to raise visibility.
if [[ $STATUS == "up" && ${INSTANCE: -2} == "_2" ]]; then
 echo create blackout
 $EMCLI login -username=user -password=supersecretpassword
 $EMCLI create_blackout -name=${BLACKOUT_NAME} -add_targets=${BLACKOUT_TARGETS} \
 -reason="auto blackout" -schedule="frequency:once;duration:-1"
fi

# disable the blackout if instance *_1 starts
# this is where the script could be improved if the RON database can run on more
# than 2 nodes. You could use emcli relocate_target to relocate the target to another
# node
if [[ $STATUS == "up" && ${INSTANCE: -2} == "_1" ]]; then
 echo remove blackout
 $EMCLI login -username=user -password=supersecretpassword
 $EMCLI stop_blackout -name=${BLACKOUT_NAME}
 $EMCLI delete_blackout -name=${BLACKOUT_NAME}
fi

I tried to add a lot of comments to the script, which should make it easy for you to adjust. I recommend storing it in ACFS and mounting that directory on all cluster nodes, with a symbolic link from the ACFS location to $GRID_HOME/racg/usrco/ to make maintenance easier. You could enable log rotation for the logfile in /tmp if you like; otherwise keep an eye on it so it doesn’t grow to gigabytes.
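For deployment and a quick smoke test, something along these lines works - the script name and ACFS path are assumptions, and the argument list simply mirrors what the script parses:

# on each node: link the shared copy into the callout directory
ln -s /shared/acfs/scripts/autoBlackout.sh $GRID_HOME/racg/usrco/autoBlackout.sh

# manual test with made-up values; check /tmp/autoBlackout.log for the outcome
$GRID_HOME/racg/usrco/autoBlackout.sh SERVICEMEMBER service=RON_APP.example.com \
  database=ron instance=RON_2 host=node2 status=up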

Quis custodiet ipsos custodes? Nagios monitoring for Grid Control

I have a strange problem with my Grid Control 11.1.0.1 Management Server in a Solaris 10 zone. When restarted, the OMS will serve requests fine for about 2 to 4 hours and then “hang”. Checking the Admin Server console I can see that there are stuck threads; the same information is also recorded in the logs.

NB: the really confusing part about Grid Control 11.1 is the use of Weblogic - you thought you knew where the Grid Control logs were? Forget what you knew about 10.2 and enter a different dimension :)

So to be able to react quicker to a hang of the OMS (or EMGC_OMS1 to be more precise) I set up nagios to periodically poll the login page.

I’m using a VM with OEL 5.5 64bit to deploy nagios to; the requirements are very moderate. The install process is well documented in the quickstart guide - I’m using the Fedora variant as a basis. OEL 5.5 doesn’t have nagios 3 RPMs available, so I decided to build from the source downloaded from nagios.org. The tarballs you need are nagios-3.2.3.tar.gz and nagios-plugins-1.4.15.tar.gz at the time of writing. If you haven’t got a development environment, build one:

  • # yum install httpd
  • # yum install php
  • # yum install gcc glibc glibc-common
  • # yum install gd gd-devel
  • # yum install openssl-devel

From then on it’s as simple as copy-pasting from the quickstart guide. The only problem I had with the check_http plugin was the lack of openssl-devel: I initially built the plugins without the "--with-openssl=/usr/include/openssl" flag. After executing the configure command again the build didn’t work for check_http (undefined symbol foo), but that could be fixed with a "make clean; make".

For the remainder of this article I assume you built nagios with these arguments to configure:

./configure --with-command-group=nagcmd

The plugins have been built with these options:

./configure --with-openssl=/usr/include/openssl --with-nagios-user=nagios --with-nagios-group=nagios

This will install nagios to /usr/local/nagios, which is fine by me - you’d obviously choose a different prefix when configuring a production nagios server. Start the nagios service as per the quickstart guide using "service nagios start".

With nagios up and running you can connect to the dashboard: http://nagios.example.com/nagios

You authenticate yourself using the nagiosadmin account and the password you supplied earlier to the htpasswd command.
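In case you skipped it, this is the command from the quickstart guide that creates the account:

htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin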

Great! Your nagios environment is up and running. Next you need to add the OMS to the configuration. I opted to modify the example configuration; three steps are to be performed:

  • Modify the contacts
  • Create a check_oms command
  • Add the OMS to the list of monitored targets

Again, I should note that this setup is for monitoring 2 OMS hosts only, nothing else. I’m saying this because the way I add the targets is not the most elegant one. If you intend to add more targets to the nagios setup you should opt for a better approach which is commented out in the nagios.cfg file.

Modifying contact information

I would like to be informed in case something goes wrong. Nagios offers a wealth of notification methods; I’m limiting myself to email.

The file you’d like to modify with your favourite text editor is /usr/local/nagios/etc/objects/contacts.cfg

The most basic (but sufficient) way is to edit the nagiosadmin contact. Simply change the email address to your own and save the file. NB: you may have to configure your local MTA and add a mail relay - ask your friendly sys admin how to do so.

Create the check_oms command

Before we can define it as a target in nagios, we need to tell nagios how to monitor the OMS. Nagios comes with a basic set of plugins, amongst which the check_http seems the most suitable. It needs to be compiled with the openssl-devel package (see above) since the OMS logon requires the https protocol.

Open /usr/local/nagios/etc/objects/commands.cfg with your favourite text editor and add a command such as this one:

define command{
command_name    check_oms
command_line    $USER1$/check_http -H $HOSTALIAS$ -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon
}

Translated back to English this means: when the check_oms command is used as a service check in nagios, the check_http script is called against the host defined by the host alias variable (we’ll define that in a minute). Furthermore, if we receive a HTTP 302 code (moved temporarily) I want the check to return a critical error instead of an OK. If the response time exceeds 5 seconds I want the service to emit a “warning” reply, and if it takes longer than 10 seconds that’s critical. The remaining options specify that I need to use SSL against port 7799 (the default Grid Control port - change it if yours is different) and that the URL is /em/console/logon/logon. Don’t simply specify /em as the URL, as that will silently redirect you to /em/console/logon/logon after a HTTP 302 message, which doesn’t help in this case. You can run the command interactively on the nagios host. The check is in /usr/local/nagios/libexec; the "-v" option displays the HTTP traffic:

./check_http -H oms.example.com -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon -v

[root@nagios libexec]# ./check_http -H oms.example.com -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon -v
GET /em/console/logon/logon HTTP/1.1
User-Agent: check_http/v1.4.15 (nagios-plugins 1.4.15)
Connection: close
Host: oms.example.com:7799
https://oms.example.com:7799/em/console/logon/logon is 8671 characters
STATUS: HTTP/1.1 200 OK
**** HEADER ****
Date: Mon, 28 Feb 2011 10:27:14 GMT
Server: Oracle-Application-Server-11g
Set-Cookie: JSESSIONID=tJ0yNr4CgGf4gyTPJR4kKTzL2WBg1SFLQvh0ytrpC3Kgv9xqkDsF!-2069537441; path=/em; HttpOnly
X-ORACLE-DMS-ECID: 00074S^kp_dF8Dd_Tdd9ic0000B4000DkI
X-Powered-By: Servlet/2.5 JSP/2.1
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Content-Language: en
**** CONTENT ****
fe8
[content skipped...]
HTTP OK: HTTP/1.1 200 OK - 8671 bytes in 0.067 second response time |time=0.067383s;5.000000;10.000000;0.000000 size=8671B;;;0
[root@nagios libexec]#

Right - HTTP 200 and sub-second response time: I’m happy.

Create a new monitored target

This is the final step to be completed. I started off by copying the localhost.cfg file to oms.cfg and edited it. Below is a sample file with all comments and other verbose information removed:

define host{
use               linux-server
host_name         OMS_pilot
alias             oms.example.com
address           192.168.99.13
}

define service{
use                     generic-service
host_name               OMS_pilot
service_description     HTTP
is_volatile             0
check_period            24x7
max_check_attempts      10
normal_check_interval   1
retry_check_interval    1
contact_groups          admins
notification_options    w,u,c,r
notification_interval   960
notification_period     workhours
check_command           check_oms
}

I’m also using the check_ping command but that’s the default and not shown here. How does nagios know what server to execute the check against? That’s back in the command definition. Remember the -H $HOSTALIAS$ directive? Upon execution of the check, the value of the host’s alias configuration variable will be passed to the check_oms command. You should therefore ensure that the nagios host can resolve that host name, and I’d recommend using the FQDN as well.

The service check will execute the check_oms command against the host every minute, 24x7. In case the service is critical, it will notify the contact group admins (which you edited in step 1) and send email during work hours (09:00 - 17:00 by default, defined in timeperiods.cfg).

The final bit where everything is tied together is the nagios.cfg file: add the definition for your host as in this example:

cfg_file=/usr/local/nagios/etc/objects/oms.cfg

Alternatively, if you would like to logically group your objects, you could create /usr/local/nagios/etc/servers and put all your server configuration files in there. Regardless of which option you choose, the next step is to reload the nagios service to reflect the current configuration.
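A short sketch of both steps, assuming the default install prefix:

# nagios.cfg: scan a whole directory instead of listing files individually
# cfg_dir=/usr/local/nagios/etc/servers

# verify the configuration, then reload
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
service nagios reload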


Happy monitoring!

RAC One Node and Database Protection

An email from fellow Oak Table Member James Morle about RAC One Node and failover got me thinking about the capabilities of the product.

I have written about RON (RAC One Node) in earlier posts, but haven’t really explored what happens with session failover during a database relocation.

Overview

So to clarify what happens in these two scenarios I devised a simple test. Taking a RON database plus a service, I modified both to suit my test needs. Connected to the service, I performed a database relocation to see what happens. Next I killed the instance (I wasn’t able to reboot the node) to simulate what happens when the node crashes.

Setup

The setup used an existing database, “RON”. It also had a service defined already, but that needed tweaking. The database was defined as follows:

$ srvctl config database -d RON
Database unique name: RON
Database name: RON
Oracle home: /u01/app/oracle/product/11.2.0.2
Oracle user: oracle
Spfile: +DATA/RON/spfileRON.ora
Domain: example.com
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: RON
Database instances:
Disk Groups: DATA
Mount point paths:
Services: RON_APP.example.com
Type: RACOneNode
Online relocation timeout: 30
Instance name prefix: RON
Candidate servers: node1,node2
Database is administrator managed

The service was initially defined as follows:

srvctl config service -d RON
Service name: RON_APP.example.com
Service is enabled
Server pool: RON
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: RON_1
Available instances:
[rac]oracle@node1.example.com $

Some of these attributes require special attention; we have three categories to deal with: preferred instances for TAF, the TAF setup itself, and the runtime load balancing (RLB) advisory.

Transparent Application Failover with “real” RAC only works if there are two preferred instances. As you can see, the service has only one preferred instance. Not sure if that can be changed though… Let’s try:

srvctl modify service -s RON_APP -d RON -i RON_1,RON_2
PRKO-2007 : Invalid instance name: RON_2

I wasn’t surprised - RON is not a RAC database, so it has only one active instance. When registering a RAC One Node database you don’t add instances as you would with a RAC database; instead you set the database type to RACONENODE (srvctl add database -d name -c RACONENODE …).

I recommend setting TAF properties at the service level - that way you don’t miss crucial parameters in your tnsnames.ora file. This is the preferred way of doing it, at least since 11.2. Changing the service is straightforward:

srvctl modify service -d RON -s RON_APP.example.com \
> -P BASIC -e SESSION -m BASIC

This instructs the service to use a BASIC TAF policy, a failover type of SESSION and the BASIC failover method. These parameters would normally be configured in the CONNECT_DATA section of your tnsnames.ora file.
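For comparison, the client-side equivalent would look roughly like this - an illustrative sketch only, since with server-side TAF settings the FAILOVER_MODE clause can be left out of the alias altogether:

RON =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = scan1.example.com)(PORT = 1825))
    (CONNECT_DATA =
      (SERVICE_NAME = RON_APP.example.com)
      (FAILOVER_MODE = (TYPE = SESSION)(METHOD = BASIC))
    )
  )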

With these changes made, the service configuration has changed to the below:

$ srvctl config service -d RON -s RON_APP.example.com
Service name: RON_APP.example.com
Service is enabled
Server pool: RON
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SESSION
Failover method: BASIC
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: RON_1
Available instances:
[rac]oracle@node1.example.com $

I also wanted to change the defaults to a more suitable RLB configuration. Instead of a CLB goal of “LONG” I wanted to set it up for “SHORT”. And I needed emphasis on SERVICE_TIME as well (my intention was to run swingbench). The change and resulting service configuration are shown below:

$ srvctl modify service -d RON -s RON_APP.example.com \
> -j short -B service_time

$ srvctl config service -d RON -s RON_APP.example.com
Service name: RON_APP.example.com
Service is enabled
Server pool: RON
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SESSION
Failover method: BASIC
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: SHORT
Runtime Load Balancing Goal: SERVICE_TIME
TAF policy specification: BASIC
Edition:
Preferred instances: RON_1
Available instances:
[rac]oracle@node1.example.com $

My SCAN name was scan1.example.com, and I ensured that local_listener was pointing to my non-default port of 1821 and that remote_listener was using the EZConnect syntax (“scan1.example.com:1825”). It is very important to set the local_listener parameter if you are not using the default port of 1521!

On my client system I defined a local TNS alias “RON” to connect to the RON database. Note that it doesn’t use any TAF parameters.

C:\oracle\product\11.2.0\client_1\network\admin>tnsping ron

TNS Ping Utility for 32-bit Windows: Version 11.2.0.2.0 - Production on 15-FEB-2011 17:07:44

Copyright (c) 1997, 2010, Oracle.  All rights reserved.

Used parameter files:
C:\oracle\product\11.2.0\client_1\network\admin\sqlnet.ora

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = scan1.example.com)(PORT = 1825)
)) (CONNECT_DATA = (SERVICE_NAME = RON_APP.example.com)))
OK (10 msec)

Database Relocation

Oracle’s promise is that you don’t lose your session during a database relocation. Let’s see if that is actually true. Using my setup I connected to the RON database:

C:\oracle\product\11.2.0\client_1\network\admin>sqlplus martin/test@ron

SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 15 16:50:08 2011

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL> select username,failover_method,failover_type,failed_over from gv$session where username='MARTIN';

USERNAME                       FAILOVER_M FAILOVER_TYPE FAI
------------------------------ ---------- ------------- ---
MARTIN                         BASIC      SESSION       NO

OK, so TAF is working. At this time I started the relocate command:

[RON_1]oracle@node1.example.com $ srvctl relocate database -d RON -n node2 -w 1 -v
Configuration updated to two instances
Instance RON_2 started

I then repeatedly ran my SQL statement from above while the instance was relocating. The output is shown below.

SQL> select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN';

 INST_ID USERNAME                       FAILOVER_M FAILOVER_TYPE FAI
---------- ------------------------------ ---------- ------------- ---
 1 MARTIN                         NONE       NONE          NO
 1 MARTIN                         BASIC      SESSION       NO
 2 MARTIN                         NONE       NONE          NO

SQL> select * from v$active_instances;

INST_NUMBER INST_NAME
----------- ------------------------------------------------------------
 1 node1.example.com:RON_1
 2 node2.example.com:RON_2

SQL> select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN';

 INST_ID USERNAME                       FAILOVER_M FAILOVER_TYPE FAI
---------- ------------------------------ ---------- ------------- ---
 1 MARTIN                         NONE       NONE          NO
 1 MARTIN                         BASIC      SESSION       NO
 2 MARTIN                         NONE       NONE          NO

As you can see, the configuration change to two instances is reflected in the output of my query against v$active_instances. You can also see the number of sessions increasing; pay attention to the inst_id column: a new session has been created on the second instance.

In my other session I saw the relocation completing:

Services relocated
Waiting for 1 minutes for instance RON_1 to stop.....
Instance RON_1 stopped
Configuration updated to one instance

I ran my query one more time to see what happened - would my session survive?

SQL> /
select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN'
*
ERROR at line 1:
ORA-25408: can not safely replay call

SQL> /

 INST_ID USERNAME                       FAILOVER_M FAILOVER_TYPE FAI
---------- ------------------------------ ---------- ------------- ---
 2 MARTIN                         BASIC      SESSION       YES

SQL>
SQL> select * from v$active_instances;

INST_NUMBER INST_NAME
----------- ------------------------------------------------------------
 2 node2.example.com:RON_2

Well, it did survive. Note the ORA-25408 error: that’s expected, since with the SQL*Plus client I don’t have the opportunity to trap the error and replay my OCI call. You should capture this exception in Java or your preferred development environment - I have provided an example in chapter 11 of Pro Oracle Database 11g RAC on Linux.

Node failure

I couldn’t see how a session would survive in the case of a node failure… I said this in my email to James:

> I cannot see how TAF or FCF work in case of a node failure. In my
> tests I did for the book TAF only worked if there was a service with
> at least 2 preferred instances. And FCF requires a FAN aware
> connection pool which is rare to find. RAC one however only has only 1
> active node (unless you relocate it).

But better test before jumping to conclusions!

I could only kill an instance rather than the server, which would have been a better test. I assumed Clusterware would try to restart the failed instance on the same node a few times and then relocate the resource if the start was not successful.

I knew the database was now running on node 2 so I killed the SMON process. That sure results in an instance crash.

oracle@node2.example.com $ ps -ef | grep RON | grep smon
oracle   14208     1  0 16:52 ?        00:00:00 ora_smon_RON_2
$ kill -9 14208

Just as you would expect, the session didn’t survive this (how could it? There is no active second instance!):

SQL> /
select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN'
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 15649
Session ID: 183 Serial number: 3

And just as expected, Clusterware restarted the failed instance the second it detected the failure. The node’s alert log showed this:

System State dumped to trace file /u01/app/oracle/product/admin/RON/admin/diag/rdbms/RON/RON_2/trace/RON_2_diag_14174.trc
ORA-1092 : opitsk aborting process
2011-02-15 16:58:29.255000 +00:00
ORA-1092 : opitsk aborting process
License high water mark = 6
2011-02-15 16:58:32.569000 +00:00
Instance terminated by PMON, pid = 14164
USER (ospid: 16101): terminating the instance
Instance terminated by USER, pid = 16101
2011-02-15 16:58:35.286000 +00:00
Starting ORACLE instance (normal)
2011-02-15 16:58:36.477000 +00:00
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
2011-02-15 16:58:42.687000 +00:00
Private Interface 'bond1:1' configured from GPnP for use as a private interconnect.
[name='bond1:1',...
..., use=public/1]

Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/11.2.0.2/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production

Summary

That concludes my testing. RON works better than a “classic” active/passive cluster and allows sessions to stay connected while the database migrates to a different host. It also makes it easier to convert the database to a full RAC database (I have an example, but it needs a bit of tidying up before posting). On the other hand, virtualisation technology has allowed us to do the same for quite some time now: Xen and OracleVM can relocate domUs, and vmotion is the commercial alternative. Whatever suits your needs.

Error message of the day: OUI-25023 and the FQDN

It’s been a long day with many problems around a Grid Control installation, including (but not limited to) corruption of the repository database, bugs in OUI when it comes to deinstalling the Oracle Management Server, lots of files left over by the weblogic “uninstall.sh” script, and many more. Some of the error messages were quite misleading, and OUI-25023 was just one too many. What happened?

Earlier today I was trying to install the 64bit 11.1.0.1 agent on an 8 node cluster. After an initial headache (see below) it worked ok. However, I couldn’t resist mentioning OUI-25023. Here’s the complete story.

I downloaded the 11.1 agent for Linux x86-64 as per the GC 11.1 documentation and deployed it to my freshly installed management server. The OMS is on Solaris SPARC, and Grid Control doesn’t supply agents for a different platform. However, the security experts had locked down the oracle account on the cluster, which ruled out the “agent push” scenario. I then opted for the installation via a response file, as described in the documentation.

The idea is that you set a number of variables in a response file, “additional_agent.rsp”, to inform the installer about your desired configuration. If memory serves me right there was no way to run OUI in GUI mode; it had to be a silent installation.

Amongst the variables you need to specify are the cluster nodes. I dutifully filled that information in using the format {“node1.example.com”, “node2.example.com”, …, “node8.example.com”} - i.e. the fully qualified domain names, as per the documentation.

Starting the installer with the -silent and -responseFile /path/to/additional_agent.rsp options failed:

$ ./linux_x64/agent/runInstaller -silent -responseFile /u01/app/oracle/stage/agent_11.1.0.1/linux....
Starting Oracle Universal Installer...

Checking Temp space: must be greater than 150 MB.   Actual 1662 MB    Passed
Checking swap space: must be greater than 150 MB.   Actual 16415 MB    Passed
Preparing to launch Oracle Universal Installer from /tmp/OraInstall2011-02-15_03-15-10PM. Please wait ...
*** Check for updates ***
*** Select Installation Type ***
*** Check Prerequisites ***
*** Specify Oracle Management Service Location ***
*** Customize Ports ***
*** Review ***

ERROR: OUI-25023: The local node is not selected for installing this product. Include the local node in the cluster list or perform the installation on the nodes on which the install is to be performed.Also note that the cluster nodes should be specified in non FQDN format.

$

I was pretty sure that my node_list included my first node… Pinging the node and DNS access also worked (I wrote about that in a previous post). So a little research on Metalink revealed this gem:

OUI-25023 When Trying To Install A Patchset On RAC [ID 394868.1]

So, contradicting the documentation (and heeding the error message’s “non FQDN format” hint), I needed to specify the node_list as recorded in the inventory: without the “example.com” bit. Once that was changed in the response file, the installation completed ok.
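The working entry in the response file then looked something like this - an illustrative sketch, as the exact variable name can differ between OUI versions:

CLUSTER_NODES={"node1","node2","node3","node4","node5","node6","node7","node8"}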

11.1 GC agent refuses to start

This is a follow-up to my previous tale of how not to move networks for RAC. After having successfully restarted the cluster as described in a previous post, I went on to install a Grid Control 11.1 system. This was to be on Solaris 10 SPARC. Why SPARC? Not my platform of choice when it comes to Oracle software, but my customer has a huge SPARC estate and wants to make the most of it.

After the OMS had been built (hopefully I’ll find time to document this, as it can be quite tricky on SPARC) I wanted to secure the agents on my cluster against it. That worked ok for the first node:

  • emctl clearstate agent
  • emctl secure agent
  • emctl start agent

Five minutes later the agent appeared in my list of agents in the Grid Control console. With this success backing me I went to do the same on the next cluster node.

Here things were different-here’s the sequence of commands I used:

$ emctl stop agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved
$

I didn’t pay too much attention to the fact that there had been no acknowledgement of the completion of the stop command. I noticed something wasn’t quite right when I tried to get the agent’s status:

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
emctl stop agent
Error connecting to https://node2.example.com:3872/emd/main

Now that should have reported that the agent was down. Strange. I tried a few more commands, such as the following one to start the agent.

[agent]oracle@node2.example.com $ emctl start agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Agent is already running

That wasn’t the case - there was no agent process whatsoever in the process table. I also checked the emd.properties file. Note that in 11g the emd.properties file lives in $AGENT_HOME/<hostname>/sysman/config/ instead of $AGENT_HOME/sysman/config as it did in 10g.
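
A quick way to confirm you are looking at the right file on an 11.1 agent (the hostname directory below is illustrative):

$ ls $AGENT_HOME/node2.example.com/sysman/config/emd.properties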

Everything looked correct, and even a comparison with the first node didn’t reveal any discrepancy. So I scratched my head a little more until I found a MOS note stating that the agent cannot listen on multiple addresses. The note is for 10g only and has the rather clumsy title “Grid Control Agent Startup: “emctl start agent” Command Returns “Agent is already running” Although the Agent is Stopped” (Doc ID 1079424.1).

Although it is written for 10g and multiple NICs, it got me thinking. And indeed, the /etc/hosts file had not been updated, leaving the old cluster address in /etc/hosts while the new one was in DNS.

# grep node2 /etc/hosts
10.x.x4.42            node2.example.com node2
172.x.x.1x8          node2-priv.example.com node2-priv
# host node2.example.com
node2.example.com has address 10.x5.x8.3
[root@node2 ~]# grep ^hosts /etc/nsswitch.conf
hosts:      files dns

This also explained why the agent started on the first node - it had an updated /etc/hosts file. Why the other nodes didn’t have their hosts files updated will forever remain a mystery.
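
The fix was simply to bring /etc/hosts in line with DNS. A sketch of the corrected entry on node2, using the (masked) address DNS returned above:

# /etc/hosts after the fix - the public address now matches DNS
10.x5.x8.3            node2.example.com node2
172.x.x.1x8           node2-priv.example.com node2-priv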

Things then changed dramatically once the hosts file had been updated:

$ emctl status agent

Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running

Note how emctl now acknowledges that the agent is down. I then successfully secured and started the agent:

$ emctl secure agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Agent is already stopped...   Done.
Securing agent...   Started.
Enter Agent Registration Password :
Securing agent...   Successful.

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running
$ emctl start agent

Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Starting agent .............. started.

One smaller problem remained:

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 11.1.0.1.0
OMS Version       : 11.1.0.1.0
Protocol Version  : 11.1.0.0.0
Agent Home        : /u01/app/oracle/product/agent11g/node8.example.com
Agent binaries    : /u01/app/oracle/product/agent11g
Agent Process ID  : 14045
Parent Process ID : 14014
Agent URL         : https://node8.example.com:3872/emd/main
Repository URL    : https://oms.example.com:1159/em/upload
Started at        : 2011-02-14 09:59:03
Started by user   : oracle
Last Reload       : 2011-02-14 10:00:13
Last successful upload                       : 2011-02-14 10:00:19
Total Megabytes of XML files uploaded so far :    11.56
Number of XML files pending upload           :      188
Size of XML files pending upload(MB)         :    65.89
Available disk space on upload filesystem    :    60.11%
Collection Status                            : Disabled by Upload Manager
Last successful heartbeat to OMS             : 2011-02-14 10:00:17
---------------------------------------------------------------
Agent is Running and Ready

The line to note is “Collection Status: Disabled by Upload Manager”. That’s because a lot of data hasn’t been transferred yet. Let’s force an upload - I know the communication between agent and OMS is working, so that should resolve the issue.

$ emctl upload
$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 11.1.0.1.0
OMS Version       : 11.1.0.1.0
Protocol Version  : 11.1.0.0.0
Agent Home        : /u01/app/oracle/product/agent11g/node8.example.com
Agent binaries    : /u01/app/oracle/product/agent11g
Agent Process ID  : 14045
Parent Process ID : 14014
Agent URL         : https://node8.example.com:3872/emd/main
Repository URL    : https://oms.example.com:1159/em/upload
Started at        : 2011-02-14 09:59:03
Started by user   : oracle
Last Reload       : 2011-02-14 10:02:12
Last successful upload                       : 2011-02-14 10:02:53
Total Megabytes of XML files uploaded so far :    91.12
Number of XML files pending upload           :       22
Size of XML files pending upload(MB)         :     1.50
Available disk space on upload filesystem    :    60.30%
Last successful heartbeat to OMS             : 2011-02-14 10:02:19
---------------------------------------------------------------
Agent is Running and Ready

That’s about it - a few minutes later the agent was visible in the console. Now that only had to be repeated on the remaining six nodes…

NB: for the reasons shown in this article I don’t endorse duplicating host information between /etc/hosts and DNS - a resilient DNS infrastructure should always be used to store this kind of information.
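
With “hosts: files dns” in /etc/nsswitch.conf, a stale /etc/hosts entry silently overrides DNS. A quick consistency check (a sketch, not from the original post):

$ getent hosts node2.example.com    # what the resolver (files first) actually uses
$ host node2.example.com            # what DNS says - the two should match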

Oracle 11gR2 on Oracle Linux 6…

With the arrival of Oracle Linux 6 comes the inevitable installation articles.

The Oracle installation on Oracle Linux 6 is certainly smoother than the recent Fedora installations have been. Even Enterprise Manager works fine with no meddling.

The official 11gR2 installation guide has not been updated to include Oracle Linux 6 and I can’t see any notes on MOS about it, so I’ve essentially followed the installation for Oracle Linux 5 and adjusted where necessary. I’m guessing that when the official notes are released they will be pretty close to this. I can’t see any certifications against Oracle Linux 6 either, so I would avoid it for production Oracle installations at the moment.

Cheers

Tim…

Oracle Linux 6 has landed…

Oracle Linux 6 has landed. It has been available for Unbreakable Linux Network (ULN) subscribers since the 6th, but arrived for regular folk yesterday. Download the DVD here.

Cheers

Tim…

Troubleshooting ora.net1.network on an 8 node cluster

It seems I am doing a lot of fixing of broken stuff recently. This time I was asked to repair a broken 8 node RAC cluster on OEL 5.5 with Oracle RAC 11.2.0.2. The system had been moved into a different, more secure network, and its firewalls prevented all access to the machines except for ILO - another case of “security through obscurity”. The new network didn’t allow any clients to connect to the 8 node RAC, which meant some quite expensive kit was sitting idle. The cluster is not in production; it is still being built to specification, but this accessibility problem had been holding the project up for a little while. Yesterday brought a breakthrough: the netops team found an error in their configuration and for the first time the hosts could be accessed via ssh. Unfortunately for me that access is only possible via audited gateways using PowerBroker, to which I don’t have access.

The alternative was the ILO interface, which has not yet been hardened to production standards, so after some internal discussion I was given the ILO access credentials. This is good and bad: good, because it was a thoroughly broken system, and bad because there is no copy and paste with a Java based console. And if that wasn’t bad enough, I had to content myself with 80x24 characters on the console (although in very big letters) - I pretty much needed all of my 24″ screen to display it. But I digress.

When logging on, I found the following situation:

  • Only 1 out of 8 nodes had OHAS/CRSD started. The others were still down: a kernel upgrade had taken place, but the ASMLib kernel module hadn’t been upgraded at the same time (see the quick check after this list). The first node had the correct RPM installed, and ASMLib had done its magic on that node
  • Clusterware’s lower stack was up on that node. However, ora.net1.network and all resources depending on it (listener, SCAN, SCAN listener, etc.) were down. Not a single byte went over the public network. That was strange.
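
A quick way to spot the ASMLib/kernel mismatch on the affected nodes (a sketch; the exact package names depend on the kernel flavour):

$ uname -r                          # the running (upgraded) kernel
$ rpm -qa | grep oracleasm          # must include a module built for that kernel
$ /etc/init.d/oracleasm status      # reports failure if the module doesn't load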

Running /sbin/ifconfig on this machine was quite an experience - it listed all 3 SCAN IPs and all 8 node virtual IP addresses. On top of that the machine has 6 NICs for Oracle, bonded into pairs. And this is exactly where the confusion starts. I found the following bonded interfaces defined:

  • bond0
  • bond1.251
  • bond0.212

It took a while to figure out why these interfaces were named as they were: the suffix is a VLAN tag. It also transpired that one of my colleagues had tried to replace the previously used bond0.212 with bond0 as the public network. He was however not successful in doing so, leaving the cluster in the state it was in.
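
For reference, a VLAN-tagged interface on RHEL/OEL 5 uses the VLAN ID as suffix and VLAN=yes in its ifcfg file. A minimal, purely illustrative sketch:

# /etc/sysconfig/network-scripts/ifcfg-bond0.212 (values illustrative)
DEVICE=bond0.212
VLAN=yes
BOOTPROTO=none
ONBOOT=yes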

He had used oifcfg to update the public interface definition; querying it afterwards showed:

$ oifcfg getif
bond1.251  172.xxx.0  global  cluster_interconnect
bond0  10.2xxx8.0  global  public
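
The change itself would have been made with oifcfg setif; a sketch of the command typically used for this, with the subnet masked as in the output above:

$ oifcfg setif -global bond0/10.2xxx8.0:public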

He had also changed the VIP configuration, with the result shown here:

srvctl config vip -n node1
VIP exists: /node1-vip/10.2xx8.13/10.2xx8.0/255.255.255.0/bond0, hosting node node1

The VIP, however, remained unimpressed:

srvctl start vip -n node1
PRCR-1079 : Failed to start resource ora.node1.vip
CRS-2674: Start of 'ora.net1.network' on 'node1' failed
CRS-2632: There are no more servers to try to place resource 'ora.node1.vip' on that would satisfy its placement policy

That’s when I was asked to cast a keen eye over the installation.

The Investigation

First of all, I could find nothing wrong with what had been done so far. Starting my investigation, I initially suspected something wrong with the public network, so I decided to shut it down:

# ifdown bond0

I then checked the network configuration in /etc/sysconfig/network-scripts. The settings are shown here:

ifcfg-bond0

DEVICE=bond0
BONDING_OPTS="use_carrier=0 miimon=0 mode=1 arp_interval=10000 arp_ip_target=10.xxx.4 primary=eth0"
BOOTPROTO=none
ONBOOT=yes
NETWORK=10.2xxx.0
NETMASK=255.255.254.0
IPADDR=10.xxx.2
USERCTL=no

ifcfg-eth0

DEVICE=eth0
HWADDR=f4:ce:46:87:fa:d0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

ifcfg-eth1

DEVICE=eth1
HWADDR=f4:ce:46:87:fa:d4
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

The MAC addresses in the ifcfg-eth* files matched the output of the ifconfig command. In the lab I occasionally have the problem that my configuration files don’t match the real MAC addresses, and therefore my NICs don’t come up - but that wasn’t the case here.

I then checked whether the kernel module was loaded correctly. Usually you’d find that in /etc/modprobe.conf, but there was no entry. I added these lines as per the documentation:

alias bond0 bonding
alias bond1 bonding
alias bond1.251 bonding
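
After adding the aliases it’s worth confirming that the bonding module actually loads - a quick, illustrative check:

$ modprobe bonding
$ lsmod | grep bonding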

With that all done I brought the bond0 interface back up (don’t ever try to bring down the private interconnect - it will cause a node eviction!). Still nothing: the output of crsctl status resource -t remained “OFFLINE” for resource ora.net1.network. By the way, you cannot manually start a network resource using srvctl (and it’s an ora.* resource, so don’t even think about trying crsctl start resource ora.net1.network :). All you can do with a network resource is query its configuration (srvctl config network -k 1 …) and modify it (srvctl modify network -k 1 …), as sketched below.
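
In other words, after bringing the interface back up, inspecting state and configuration is the best you can do - a sketch:

$ ifup bond0
$ crsctl status resource ora.net1.network    # state only; you cannot start it this way
$ srvctl config network -k 1                 # the stored subnet/netmask/interface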

ORAROOTAGENT is responsible for starting the network, and it retries roughly every second. That’s CRSD’s ORAROOTAGENT, by the way; the log file is $GRID_HOME/log/`hostname -s`/agent/crsd/orarootagent_root/orarootagent_root.log.
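
Watching this log while making changes gives the quickest feedback - for example:

$ tail -f $GRID_HOME/log/$(hostname -s)/agent/crsd/orarootagent_root/orarootagent_root.log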

After the modification to bond0 I could now ping the IP address associated with it, so at least that was a success. One thing I learned that day is that the MAC address of the bonded NIC matches that of the primary eth* interface - in my case eth0, i.e. f4:ce:46:87:fa:d0. If the primary NIC failed, the bond would probably assume the failover NIC’s MAC address. So in summary:

  • the network bonding was correctly configured
  • I could ping bond0

At this point I could see no reason why starting the network failed. Maybe a typo in the configuration? The network configuration can be queried with two commands: oifcfg and srvctl config network. I tried oifcfg first; oifcfg getif returned:

bond0      10.xx.x2.0           "good"
bond0      10.xx.x8.0           "old/bad"
bond1.251  172.xx.xx.160        interconnect
bond1.251  169.254.0.0

Hmmm, where did that second bond0 entry come from? The bond1.251 interface is in use and working; the 172.xx IP matches the address assigned in ifcfg-bond1.251. The second entry for bond1.251 is created by the HAIP resource and has to do with the highly available cluster interconnect, which uses multicasting for communication (to the frustration of many users who upgraded to 11.2.0.2 only to find that the lower stack wouldn’t start on the second and subsequent nodes).

So to be sure I was seeing something unusual, I compared the output with another node in the cluster. There I found only three entries: bond0, bond1.251 and the 169.254.0.0 HAIP entry. I initially tried to remove the bad network with oifcfg delif, but that didn’t work. I then verified the output of srvctl config network to see whether it matched what I expected, and here was a surprise: it listed a wrong subnet mask. Instead of 255.255.254.0 (note the “254”!) I found 255.255.255.0. That was easy to fix, and while I was once more trying to delete the old network using oifcfg, I suddenly realised that the cluster had sprung back into life. Small typo, big consequences! Finally all the resources depending on ora.net1.network were started, including SCAN VIPs, SCAN listeners, listeners, VIPs…
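
For reference, a sketch of the two commands involved in the fix, using the masked values from above (treat the exact values as illustrative):

# correct the subnet mask on network 1 (255.255.254.0, not 255.255.255.0)
$ srvctl modify network -k 1 -S 10.xx.x2.0/255.255.254.0/bond0
# then remove the stale interface definition
$ oifcfg delif -global bond0/10.xx.x8.0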

References for NIC bonding on RHEL5