This is a weird problem I ran in today. As part of an automation project the code deploys RAC One databases across a cluster, depending on the capacity available of the node. These are 128G RAM BL685c G6 currently but will be upgraded to G7 later.
Now, my problem was that after the weekend we couldn’t deploy any more RAC One databases, except for 1 node. DBCA simply created single instance databases instead. Newly created databases were properly registered in the OCR, and their build completed ok, but not as RAC One databases. Take for example this database:
$ srvctl config database -d MYDB Database unique name: MYDB Database name: MYDB Oracle home: /u01/app/oracle/product/22.214.171.124 Oracle user: oracle Spfile: +DATA/MYDB/spfileMYDB.ora Domain: example.com Start options: open Stop options: immediate Database role: PRIMARY Management policy: AUTOMATIC Server pools: MYDB Database instances: MYDB Disk Groups: DATA Mount point paths: Services: Type: SINGLE Database is administrator managed
How come? We are sure that we pass the RACOneNode flag to dbca, which can be found in the command line. Trying again I spotted these (alongside the sys and system passwords … you should change these as soon as DBCA completes!)
[rac]email@example.com $ ps -ef|grep MYDB oracle 14865 14854 65 11:22 ? 00:00:07 /u01/app/oracle/product/126.96.36.199/jdk/jre/bin/java -Doracle.installer.not_bootstrap=true -DORACLE_HOME=/u01/app/oracle/product/188.8.131.52 -DSET_LAF= -Dsun.java2d.font.DisableAlgorithmicStyles=true -Dice.pilots.html4.ignoreNonGenericFonts=true -DDISPLAY= -DJDBC_PROTOCOL=thin -mx128m -classpath ... oracle.sysman.assistants.dbca.Dbca -silent -createDatabase -templateName /u01/app/oracle/product/admin/templates/Default.dbc -gdbName MYDB.example.com -RACOneNode -RACOneNodeServiceName MYDB_APP.example.com -sid MYDB -sysPassword xxx -systemPassword xxx -emConfiguration NONE -totalMemory 4096 -storageType ASM -asmSysPassword xxx -diskGroupName DATA -initParams db_create_file_dest=+DATA,cpu_count=1 -nodelist node2 oracle 15415 5109 0 11:22 pts/0 00:00:00 grep MYDB
So why the problem? Looking at the dbca trace file I found these lines
[main] [ 2011-03-14 14:15:31.845 GMT ] [SQLEngine.initialize:363] Starting Reader Thread... [main] [ 2011-03-14 14:15:31.927 GMT ] [OracleHome.initOptions:1240] executing: startup nomount pfile='/u01/app/oracle/product/184.108.40.206/dbs/initDBUA0.ora' [main] [ 2011-03-14 14:15:55.417 GMT ] [SQLEngine.done:2167] Done called [main] [ 2011-03-14 14:15:55.418 GMT ] [OracleHome.initOptions:1247] ORA-00304: requested INSTANCE_NUMBER is busy oracle.sysman.assistants.util.sqlEngine.SQLFatalErrorException: ORA-00304: requested INSTANCE_NUMBER is busy at oracle.sysman.assistants.util.sqlEngine.SQLEngine.executeImpl(SQLEngine.java:1655) at oracle.sysman.assistants.util.sqlEngine.SQLEngine.executeSql(SQLEngine.java:1903) at oracle.sysman.assistants.util.OracleHome.initOptions(OracleHome.java:1241) at oracle.sysman.assistants.dbca.backend.SilentHost.initialize(SilentHost.java:179) at oracle.sysman.assistants.dbca.Dbca.execute(Dbca.java:116) at oracle.sysman.assistants.dbca.Dbca.main(Dbca.java:180) [main] [ 2011-03-14 14:15:55.420 GMT ] [OracleHome.initOptions:1250] executing: select parameter from v$option where value='TRUE' [main] [ 2011-03-14 14:15:55.420 GMT ] [SQLEngine.reInitialize:735] Reinitializing SQLEngine...
Interesting-the ORA-304 error sticks out. The DBCA logs are in $ORACLE_BASE/cfgtoollogs/dbca/dbName/ in 11.2 btw. Further down the logfile it then determines that the RAC option is not available. This isn’t true-and I checked on each node:
$ cd $ORACLE_HOME/rdbms/lib $ nm -r libknlopt.a | grep -c kcsm.o 1
That was identical on all nodes. So we definitely had RAC compiled into the oracle binary. I also compared the size and timestamp of all oracle binaries in $ORACLE_HOME only to find them identical. However dbca didn’t seem impressed with my contradiction and went on creating single instance databases. That now became a little inconvenient.
I then tried relocating one of the succesfully created RAC One databases to the nodes where we had problems building them, hoping to find out more about the problem. At this stage I was convinced there was a problem with semophores or other SysV IPC.
I ceratainly didn’t want to use the Windows Fix and reboot!
So to recap, we should be able to build RAC (One Node) databases as the option is compiled into the binary, and yet it doesn’t work. From the trace I gathered that Oracle builds an auxiliary instance first, and uses initDBUA0.ora in $ORACLE_HOME to start it. So where are it’s logs/where’s the diagnostic dest? Turns out it is in $ORACLE_HOME/log/ – simply set your ADR base to this location and use the familiar commands. And this finally give me a clue:
*** 2011-03-14 12:06:40.104 2011-03-14 12:06:40.104: [ CSSCLNT]clssgsGroupJoin: member in use group(0/DBDBUA0) kgxgnreg: error: status 14 kgxgnreg: error: member number 0 is already in use kjxgmjoin: can not join the group (DBDBUA0) with id 0 (inst 1) kjxgmjoin: kgxgn error 3
So somewhere else in the cluster had to be a DBUA0 instance that prevented my new instance from starting. A quick trawl through the process table on all nodes revealed that DBUA was active on node6. Shutting that down solved the problem!
DBCA is a nice tool to create databases, together with user definable templates it is really flexible. From a technical point of view it works as follows:
In my case, one of these DBUA0s aux instances was still present on a different node in the cluster as a result of a crashed database creation. When subsequent calls to dbca created another auxiliary (cluster!) DBUA0 instance on a different node, it wasn’t aware that there was a DBUA0 already and LMON refused to create it. This is expected behaviour- instance names have to be unique across the cluster. The DBUA0 of node2 for example clashed with the one on node6.
Why did it work on p6 then I hear you ask? DBCA seems to have code logic to establish that DBUA0 on a node is in use, and uses DBUA1 next.
I got this update from Oracle Support who acknowledge this as a bug:
|Notes||16-Mar-2011 1:21:16 GMT+00:00 PM||Oracle Support|
Following bug is created for this particular issue.
Bug 11877668 – DBCA DOESN’T CREATE A RAC ONE DATABASE IN SILENT MODE
Recently I encountered a situation that demonstrates how, in an Oracle database, an error may stem from a very unrelated cause. A DBA was building a physical standby database for an upcominmg training. The two servers he was using were part of a RAC cluster; so the Oracle binaries were already there. He decided to use the same ORACLE_HOME for the new databases as well - a quite logical decision to savbe on space and administration issues. He created a primary database on server n1 and a standby database on the server n2. Follooing the usual manual procedure in building the standby database, he copied the pfile from primary database, modified the parameters and brought up the standby instance in nomount mode on the server n2.
SQL> startup nomount pfile=initSTBY.ora
But it refused to come up, with the following error:
ORA-00304: requested INSTANCE_NUMBER is busy
Alert log showed:
USER (ospid: 14210): terminating the instance due to error 304
Instance terminated by USER, pid = 14210
This was highly unusual. The primary and standby both were non-RAC; there was no instance number concept in a non-RAC database. By the way, the RAC instance on that server (or on the server n1) was not running; so there was no question of any conflict with the RAC instances either. The primary database was called PRIM while the standby was called STBY - eliminating the possibility of an instance name clash as well. And this error came while merely trying to start the instance, not even while mounting - eliminating the standby controlfile as a cause.
The error 304 showed:
00304, 00000, "requested INSTANCE_NUMBER is busy"
// *Cause: An instance tried to start by using a value of the
// initialization parameter INSTANCE_NUMBER that is already in use.
// *Action: Either
// a) specify another INSTANCE_NUMBER,
// b) shut down the running instance with this number
// c) wait for instance recovery to complete on the instance with
// this number.
Needless to say, being for a non-RAC database there was no "instance_number" parameter in the initialization parameter file of primary or the standby. So, the suggestions for the resolution seemed odd. MetaLink provided no help. All the ORA-304 errors were related to RAC with the instance_number mismatch.
As it always happens, it fell on my lap at this time. With just days to go live, I had to find a solution quickly. Long hours of troubleshooting, tracing the processes and examination of occasional trace files did not yield any clue. All the clue seemed to point to RAC, which this database was not. The Oracle Home was a RAC home, which meant the oracle binary was linked with the "rac" option.
So, the next logical step was to install a new Oracle Home without the rac option. After doing so, I tried to bring up the instance, using the new ORACLE_HOME and LD_LIBRARY_PATH variable; but, alas, the same error popped up.
Just to eliminate the possibility of some unknown bug, I decided to put an instance_number parameter, setting it to "1", from the default "0". The same error. I changed it to "2", again, the result was the same error.
Although this didn't help, at least it gave a clue that the error was not related to instance_number. The error message was clearly wrong. With this in mind, I went back to the basics. I went through the alert log with a fine toothed comb, scanning and analyzing each line.
The following line drew my attention:
DB_UNQIUE_NAME STBY is not in the Data Guard configuration
This was odd; the db_unique_name STBY was not defined in DG configuration. [BTW, note the spelling of "unique" in the message above. That is not what I typed; it was a copy and paste from the actual message in the alert log. Someone in Oracle Development should really pay atytention to typos in messages. This is clearly more than a nuisance; what if some process scans for db_unique_name related errors? It will not find the message at all!]
Checking the dg configuration, I found that the DBA has correctly defined the primary and standby names. In any case, Data Guard has not been started yet; this is merely at the instance startup - why is it complaining for data guard configuration at this time.
Perplexed, I resorted to a different approach. I renamed the pfile and all other relevant files. Then I built the standby myself, from scratch - using the same names - PRIM and STBY. And this time, everything worked fine. The instance STBY did come up.
While this solvbed the urgency problem, everyone, inclduing myself, wanted to know what the issue was in the earlier case where the DBA had failed to bring up the instance. To get the answer, I compared the files I created with the DBA created when tried and failed. Voila! The cause was immediately clear - the DBA forgot to put a vital parameter in the pfile of the standby instance:
db_unique_name = 'STBY'
This parameter was absent; so it took the default value as the db_name, which was "PRIM". This caused the instance to fail with a seemingly unrelated message - "ORA-304 Instance_number is busy"!
Hope this was helpful. Happy New Year, everybody.