Search

OakieTags

Who's online

There are currently 0 users and 20 guests online.

Recent comments

Affiliations

11g Release 2

A more user friendly multipath.conf

During some recent work I did involving a stretched 11.2.0.2 RAC for a SAP implementation at a customer site I researched TimeFinder/Clone backups. As part of this exercise I have been able to experiment with RHEL (OEL) 5.6 and the new device mapper multipath package on the mount host. I have been very pleasantly surprise about this new feature which I’d like to share.

Background of this article

Device Mapper Multipath is the “native” Linux multipathing software, as opposed to vendor-supplied multipathing such as EMC’s Power Path or Hitachi’s HDLM.

My customer’s setup is rather unique for a SAP environment as it uses Oracle Enterprise Linux and not Solaris/SPARC or AIX on the Power platform with an active/passive solution. Well if that doesn’t make it sound unique, the fact that there is a plan to run Oracle 11.2.0.2 RAC potentially across sites using ASM and ACFS certainly makes this deployment stand out from the rest.

The reason I mention this is simple-it requires a lot of engineering effort to certify components for this combination. For example: it was not trivial to get vendor support for Solutions Enabler the storage engineering uses for connectivity with the VMAX arrays, and so on. After all, Oracle Enterprise is a fairly new platform and Red Hat certainly has an advantage when it comes to vendor certification.

What hasn’t been achieved was a certification of the EMC Power Path software for use with SAP on Linux, for reasons unknown to me. The result was simple: the setup will use the device-mapper multipath package that comes with the Linux distribution.

Configuring multipath

Now with this said I started looking at MOS to get relevant support notes about Linux’s native multipath and found some. The summary of this research is available on this blog, I have written about it in these articles:

What I didn’t know up to date was the fact that the multipath.conf file allows you to define the ownership and mode of a device. As an ASM user this is very important to me. Experience taught me that incorrect permissions are one of the main reasons for ASM failing to start. Remember that root owns block devices by default.

Consider this example:

multipaths {
...
multipath {
wwid        360a98000486e58526c34515944703277
alias       ocr01
mode        660
uid         501
gid         502
}
...
}

The above entry in the multipaths section translates into English as follows:

  • If you encounter a device with the WWID 36a…277
  • Give it an alias name of “OCR01” in /dev/mapper
  • Set the mode to 660 (i.e. rw-rw—)
  • Assign device ownership to the user with UID 501 (maps to “grid” on my system)
  • Assign the group of the device to 502 (maps to “asmdba” on my system)

The path settings are defined globally and don’t need to be mentioned explicitly for each device unless you prefer to override them. I like to use an alias although it isn’t really necessary since ASM relies on a 4k header in a block device to store its identity. If you don’t chose to alias a device I recommend you use the user friendly name instead, mainly for aesthetic reasons.

Why is this really cool? For me it implies two things:

  • I can now be independent of ASMLib which provided device name stability and set the permissions on the block devices correctly for ASM
  • I don’t need to create udev rules to set the permissions (or /etc/rc.local or whichever other way you chose before to set permissions)

Nice! One less headache, as I have to say that I didn’t really like udev…

How to use the RAC FAN Java API

I researched an interesting new feature available with Oracle 11g R2, the so called RAC FAN API when writing the workload management chapter for the RAC book. The RAC FAN API is documented in Oracle® Database JDBC Developer’s Guide, 11g Release 2 (11.2) available online, but when it came to the initial documentation following the 11.2.0.1 release on Linux it was pretty useless. The good news is that it improved!

The RAC FAN Java API

The aim of this API is to allow a Java application to listen to FAN events by creating a subscription to the RAC nodes’ ONS processes. The application then registers a FANListener, based on the subscription, which can pick up instances of the following events:

  • ServiceDownEvent
  • NodeDownEvent
  • LoadAdvisoryEvent

All of these are in the oracle.simplefan namespace, the javadoc reference of which you can find in the official documenation.

When it was initially released in 11.2.0.1 I tried to get the FAN subscription to work without any luck, the registration with the ONS didn’t work. Schematically, this is how it should work:

The application (shown on the top) requires an Oracle client for the ONS configuration and libraries, and the setup is very similar to how you’d set up FCF. With the ons.config file pointing to the RAC nodes’ ONS processes you should be able to read FAN events as they occur on the source system. The ons.config in $CLIENT_HOME/opmn/conf/ons.config has to contain the local and remote ONS port, as well as all the RAC nodes with their remote ONS port. When subscribing to the events, you specify the service name you’d like to receive events for.

My 4 node RAC cluster is comprised of node rac11gr2drnode{1,2,3}, with local ONS listing on port 6200 for external, and port 6100 for internal requests. This can be checked using the onsctl debug command on any cluster node.. The client’s ons.config contains these lines:

localport:6100
remoteport:6200
nodes=rac11gr2drnode1:6200, rac11gr2drnode2:6200,rac11gr2drnode3:6200, rac11gr2drnode4:6200

Unfortunately I couldn’t subscribe to the remote ONS-I know that my setup is valid because of my UCP/FCF testing I did previously (check my earlier blog postings about how to get started with UCP and Tomcat 6). Or maybe it doesn’t work on Windows, who knows? Trying to subscribe to the remote ONS my application bails out with the following error:

Exception in thread “main” oracle.ons.SubscriptionException: Subscription request timed out after 30000 millseconds.  Possible causes: OPMN may not be running, you may have an OPMN running in an alternate ORACLE_HOME using duplicate port values, or OPMN may be misconfigured.
at oracle.ons.SubscriptionNotification.waitForReply(SubscriptionNotification.java:83)
at oracle.ons.ONS.addSubscriber(ONS.java:956)
at oracle.ons.Subscriber.realStartup(Subscriber.java:103)
at oracle.ons.Subscriber.(Subscriber.java:79)
at oracle.simplefan.impl.FanSubscription.subscribe(FanSubscription.java:228)
at oracle.simplefan.impl.FanSubscription.createFanSubscription(FanSubscription.java:46)
at oracle.simplefan.impl.FanManager.subscribe(FanManager.java:120)
at de.mbh.TestClass.(TestClass.java:21)
at de.mbh.TestClass.main(TestClass.java:53)

I tried my favourite search engine on the Exception but couldn’t find any useful information. This is not a big deal for me, I can start a local ONS on the client. This is shown here:

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl ping
ons is not running ...

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl start
onsctl start: ons started

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl debug
HTTP/1.1 200 OK
Content-Length: 3627
Content-Type: text/html
Response:

== client:6200 5872 11/05/26 11:47:16 ==
Home: c:\oracle\product\11.2.0\client_1

======== ONS ========
IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
10.xxx.xx.98  6200 4dde2fb0 00000006 00000008

Listener:

TYPE                BIND ADDRESS               PORT  SOCKET
-------- --------------------------------------- ----- ------
Local                                  127.0.0.1  6100    344
Remote                                       any  6200    352

Servers: (3)

INSTANCE NAME                  TIME   SEQUENCE  FLAGS     DEFER
---------------------------------------- -------- -------- -------- ----------
dbInstance_rac11gr2drnode1_6200          4d53d6a4 00029496 00000002          0
10.xxx.xx.155 6200
dbInstance_rac11gr2drnode2_6200          4d3eb9e2 0006cbeb 00000002          0
10.xxx.xx157 6200
dbInstance_rac11gr2drnode3_6200 4d5937d6 0001990f 00000002          0
10.xxx.xx.158 6200

Connection Topology: (4)

IP                      PORT   VERS  TIME
--------------------------------------- ----- ----- --------
10.xxx.xx.158  6200     4 4dde2fb2
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.157 6200
**                            10.xxx.xx.98 6200
10.xxx.xx.157  6200     4 4dde2fb2
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.158 6200
**                            10.xxx.xx.98 6200
10.xxx.xx.155  6200     4 4dde2fb2
**                            10.xxx.xx.98 6200
**                           10.xxx.xx.157 6200
**                           10.xxx.xx.158 6200
10.xxx.xx.98  6200     4 4dde2fb2=
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.158 6200
**                           10.xxx.xx.157 6200

Server connections:

ID            CONNECTION ADDRESS              PORT  FLAGS  SENDQ REF WSAQ
-------- --------------------------------------- ----- ------ ----- --- ----
0                           10.xxx.xx.155  6200 010405 00000 001  ---
1                           10.xxx.xx.156  6200 002405 00000 001
2                           10.xxx.xx.157  6200 010405 00000 001  --
3                           10.xxx.xx.158  6200 010405 00000 001  ---

Client connections:

ID            CONNECTION ADDRESS              PORT  FLAGS  SENDQ REF SUB W
-------- --------------------------------------- ----- ------ ----- --- --- -
4                                internal     0 01008a 00000 001 002
request                               127.0.0.1  6100 03201a 00000 001 000

Worker Ticket: 28/28, Last: 11/05/26 11:47:15

THREAD   FLAGS
-------- --------
120 00000012
124 00000012
128 00000012

Resources:

Notifications:
Received: Total 12 (Internal 6), in Receive Q: 0
Processed: Total 12, in Process Q: 0

Pool Counts:
Message: 1, Link: 1, Ack: 1, Match: 1

C:\oracle\product\11.2.0\client_1\opmn\bin>

With a local ONS started on my client, I can actually subscribe to the ONS and make use of the events. The easiest way is to simply decode the load balancing events, as I did in my code, shown below (modified version of the code in the Oracle documentation to make it work):

package de.mbh;

import oracle.simplefan.FanSubscription;
import oracle.simplefan.FanEventListener;
import oracle.simplefan.FanManager;
import oracle.simplefan.LoadAdvisoryEvent;
import oracle.simplefan.NodeDownEvent;
import oracle.simplefan.ServiceDownEvent;
import java.util.Properties;

public class TestClass {

TestClass() {
System.out.println("Hello");

Properties p = new Properties();
p.put("serviceName", "OEMSRV");

System.setProperty("oracle.ons.oraclehome", "c:\\oracle\\product\\11.2.0\\client_1");
System.out.println(System.getProperty("oracle.ons.oraclehome"));

FanSubscription sub = FanManager.getInstance().subscribe(p);

System.out.println("I'm subscribed!");

sub.addListener(new FanEventListener() {

public void handleEvent(ServiceDownEvent arg0) {
System.out.println("Service Down registered!");
}

public void handleEvent(NodeDownEvent arg0) {
System.out.println("Node Down Event Registered");
}

public void handleEvent(LoadAdvisoryEvent arg0) {
System.out.println("Just got a Load Advisory event");

System.out.println("originating database: " + arg0.getDatabaseUniqueName());
System.out.println("originating instance: " + arg0.getInstanceName());
System.out.println("Service Quality     : " + arg0.getServiceQuality());
System.out.println("Percent             : " + arg0.getPercent());
System.out.println("Service Name        : " + arg0.getServiceName());
System.out.println("Service Quality     : " + arg0.getServiceQuality());
System.out.println("Observed at         : " + arg0.getTimestamp() + "\n\n");
} } );
}

/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
TestClass tc = new TestClass();

int i = 0;
while ( i < 100000)  {
try {
Thread.sleep(100);
i++;
} catch (Exception e) {
System.out.println(e);
}
}

System.out.println("execution ended");
}

}

When compiling or executing the code, you need the simplefan.jar and ons.jar files in your classpath, and chances are that you need ojdbc6.jar as well.

Starting the application reveals Load Balancing events being read:

Hello
c:\oracle\product\11.2.0\client_1
I'm subscribed!
Just got a Load Advisory event
originating database: LNGDS1RD
originating instance: dbInstance_rac11gr2drnode3_6200
Service Quality     : -1
Percent             : 51
Service Name        : OEMSRV
Service Quality     : -1
Observed at         : Thu May 26 11:56:03 BST 2011

Just got a Load Advisory event
originating database: LNGDS1RD
originating instance: dbInstance_rac11gr2drnode3_6200
Service Quality     : -1
Percent             : 49
Service Name        : OEMSRV
Service Quality     : -1
Observed at         : Thu May 26 11:56:03 BST 2011

This proves that you can write your own connection cache, which allows you to react to *down events and rebalance your session differently.

Offloading production backups to a different storage array

For quite a while Oracle DBAs have performed split mirror backups using special devices called “Business Continuance Volumes” or BCVs for short. A BCV is a special mirror copy of a LUN on the same storage array as the primary copy.

In a BCV backup scenario, the storage administrator (usually) “splits” the mirror after putting the database into hot backup mode. After the mirror is split, the database is taken out of hot backup mode and resumes normal operation. A new Oracle instance on a different host can be mounted using the split mirror copy of the database for backups. The use of this technology for refreshing a test environment is out of scope of this article. The below figure demonstrates the idea:

The advantage of such an approach is that the backup operation, initiated from the mount host, should not impact the performance of the production database. Once the backup is complete, the BCV for the ARCH diskgroup should be re-synchronised with the source LUN, whereas the DATA disk group should not. This allows us to quickly recover from problems with the primary LUNs-more on that on a later post.

One Step Further

On my current site this process has been refined. One of the requirements was that SRDF should be used as a means for disaster recovery. I should say that SRDF, or Symmetrix Remote Data Facility is my customer’s preferred method for DR, and I do by no means want to advertise for EMC here; it just so happened that I was working on EMC storage for this project.

I should also note that Data Guard cannot be used due to an application constraint (an ACFS file system is integral part of the application and must be backed up together with the database).

All storage is presented by ASM, which in many ways makes life easier for me. The ASM LUNs or “disks” will have all the required information in the “disk” header. So after the cloned LUNs have been presented to the mount host, all I need to do is make them available to the OS, and optionally run an “oracleasm scandisks” as root to detect them. From then on I should be able to simply mount the disk group (either via sql*plus or srvctl in 11.2). The actual backup requires a few more steps, these are shown below.

Before going further into detail let’s have a look at the architecture first:

Two data centres are in use: the local one is used for production in normal operations, including backups. One of the design requirements was that backups can be taken on either data centre, in respect to the DR situation.

Split mirror backups as shown in the above figure are taken on the local data centre in normal operations. In case of DR, the remote data centre will be configured to take backups. For this to happen, it is necessary to clone the replicated LUNs (that would be activated in the DR event) much as it’s done for the local data centre’s split mirror backups. As an added advantage, the clones can be used to create pre- and post batch “backups” that would be activated in case of a horrible failure of the batch/end-of-year processing.

Taking the Backups

To be able to take a backup of the split mirror, a few things are necessary. Most of these are documented in MOS note “RMAN and Split Mirror Disk Backups [ID 302615.1]”. You certainly require a recovery catalogue database in the first place. As a first step you register the database in the catalog. You perform this step connect to the production database and recovery catalog, as shown in the below example:

$ rman target / catalog=rman/rmanPwd@rccat

Recovery Manager: Release 11.2.0.2.0 - Production on Fri May 20 10:39:00 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: PROD (DBID=1796248120)
connected to recovery catalog database

RMAN> register database;

database registered in recovery catalog
starting full resync of recovery catalog
full resync complete

RMAN> exit

If you like, you can configure defaults at this stage, I opted for the following (still connected to the primary database and recovery catalog):

RMAN> configure retention policy to redundancy 3;

new RMAN configuration parameters:
CONFIGURE RETENTION POLICY TO REDUNDANCY 3;
new RMAN configuration parameters are successfully stored

RMAN> CONFIGURE DEFAULT DEVICE TYPE TO SBT_TAPE;

new RMAN configuration parameters:
CONFIGURE DEFAULT DEVICE TYPE TO 'SBT_TAPE';
new RMAN configuration parameters are successfully stored

RMAN> configure controlfile autobackup off;

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP OFF;
new RMAN configuration parameters are successfully stored

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE sbt_tape to '%F';

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE 'SBT_TAPE' TO '%F';
new RMAN configuration parameters are successfully stored

RMAN> configure device type sbt_tape parallelism 4 backup type to backupset;

new RMAN configuration parameters:
CONFIGURE DEVICE TYPE 'SBT_TAPE' PARALLELISM 4 BACKUP TYPE TO BACKUPSET;
new RMAN configuration parameters are successfully stored

RMAN> exit

Now switch over to the mount host for some real work.

The high level steps for taking a backup of the “split mirror” are:

  1. Ensure that the cloned LUNs are presented to the mount host’s operating system, including appropriate zoning on the fabric
  2. Ensure that the multi-pathing solution of choice is correctly configured for all paths to the cloned LUNs
  3. Ensure that the ASM disks are known to the local ASM instance. That may including running an /etc/init.d/oracleasm scandisks as root, or putting the relevant rules into /etc/udev/rules.d/

These steps have been quite generic and will depend on your OS and storage stack. I won’t go into detail, but you might find some relevant bits and pieces on my blog if you are interested.

The next step is to make the ASM disks known to ASM. Initially that has to be done via the command line, from 11.1 onwards you have to connect as SYSASM:

SQL> alter  diskgroup DGName mount;

For Oracle 11.2 this automatically creates a resource for diskgroup DGName in the OCR which is very convenient (especially in RAC environments). Next time, all you need to do is to execute “srvctl start diskgroup -g DGName” as the grid software owner.

For pre 11.2 environments you might want to consider updating the “asm_diskstring” initialisation parameter accordingly.

Once the ASM disk groups your database requires are mounted in ASM, it’s time to mount the database. If not done so already, register the database in the OCR and ensure that you add the spfile option as well, as in this example (Oracle RAC users would add instances additionally):

$ srvctl add database -d PROD -o $ORACLE_HOME -c SINGLE -p ‘+data/PROD/spfilePROD.ora’ \
> -s mount -a DATA,ARCH

As per said MOS document, it is required to start the database using a BACKUP controlfile on the mount host. Otherwise you’d end up with these RMAN errors after the first backup:

RMAN-3014: Implicit resync of recovery catalog failed
RMAN-6038: Recovery catalog package detected an error
RMAN-20035: Invalid high RECID error

To do so, create a backup controlfile on the primary database before splitting the mirror. The process is very much the same you would for a physical standby database:

SQL> alter database backup controlfile to ‘/tmp/backup.ctl’;

Database altered.

This controlfile now needs to be made available to the mount host-an elegant way would be to use DBMS_FILETRANSFER to perform this task, or asmcmd’s copy command.

Once it’s on the mount host, say in /tmp//backup.ctl, it needs to be made available to Oracle. The easiest way is to use RMAN for this:

Connect to the mirror instance as SYSDBA; do not connect to the recovery catalog.

$ rman target /

Recovery Manager: Release 11.2.0.2.0 - Production on Fri May 20 10:53:40 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: PROD (DBID=1796248120, not open)

RMAN> shutdown immediate

using target database control file instead of recovery catalog
database dismounted
Oracle instance shut down

RMAN> startup nomount
connected to target database (not started)
Oracle instance started

Total System Global Area    9219969024 bytes

Fixed Size                     2234056 bytes
Variable Size               4630513976 bytes
Database Buffers            4563402752 bytes
Redo Buffers                  23818240 bytes

RMAN> restore controlfile from '/tmp/PROD/ backup.ctl';

Starting restore at 20-MAY-11
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=200 instance=PROD device type=DISK

channel ORA_DISK_1: copied control file copy
output file name=+ARCH/prod/controlfile/current.256.737725307
output file name=+DATA/prod/controlfile/current.256.737725307
Finished restore at 20-MAY-11

If you were really desperate to back the CURRENT controlfile (which is part of the clone) up, you could have done this prior to the restore of the backup controlfile. You must not connect to the recovery catalog in this case-see MOS note 1111369.1 for more information.

Querying V$DATABASE.CONTROLFILE_TYPE should now return “BACKUP”. With the setup completed, you are ready to back the database up from the cloned LUNs. Connect to RMAN again, using the backup instance and the recovery catalog and initiate the backup. For example:

run {
allocate channel t1 type sbt_tape parms 'ENV=(TDPO_OPTFILE=/u01/app/oracle/product/admin/PROD/tdp/tdpo_PROD.opt)';
allocate channel t2 type sbt_tape parms 'ENV=(TDPO_OPTFILE=/u01/app/oracle/product/admin/PROD/tdp/tdpo_PROD.opt)';
backup database;
}

This completes the backup of the database. One caveat exists: the control file is not backed up as part of this process-RMAN doesn’t back a “backup” controlfile up:

$ rman target / catalog rman/rmanPwd@rccat

Recovery Manager: Release 11.2.0.2.0 - Production on Tue May 24 12:51:22 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: GB1 (DBID=1796248120, not open)
connected to recovery catalog database

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP ON;

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP ON;
new RMAN configuration parameters are successfully stored

RMAN> run {
2> allocate channel t1 type sbt_tape parms
3> 'ENV=(TDPO_OPTFILE=/u01/app/oracle/product/admin/PROD/tdp/tdpo_PROD.opt)';
3> backup tablespace USERS;
4> }

allocated channel: t1
channel t1: SID=199 instance=PROD device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.1.0

Starting backup at 24-MAY-11
channel t1: starting full datafile backup set
channel t1: specifying datafile(s) in backup set
input datafile file number=00006 name=+DATA/prod/datafile/users.264.737727697
channel t1: starting piece 1 at 24-MAY-11
channel t1: finished piece 1 at 24-MAY-11
piece handle=1cmd4ott_1_1 tag=TAG20110524T125316
comment=API Version 2.0,MMS Version 5.5.1.0
channel t1: backup set complete, elapsed time: 00:00:01
Finished backup at 24-MAY-11

RMAN-06497: WARNING: control file is not current, control file AUTOBACKUP skipped
released channel: t1

RMAN>  exit

It’s not difficult at all to get around this problem. As part of the regular archive log backups you are performing on the production database anyway, you add a “backup current controlfile” and a command to resync the catalog (“resync catalog”).

At this stage I should note that the backup process wouldn’t be different in principle, if additional backups were to be taken on the DR site (licensing questions aside). Instead of having to clone the primary LUNs on the local data centre, the storage admins would clone the replicated LUNs (the “R2s” in EMC-talk) and bring these clones up on the DR mount host.

Summary

The process described above approach was interesting from my personal point of view as I haven’t used this concept before. I’m an Oracle DBA, and I feel most comfortable when I have things in hand, relying on another team for doing BAU database tasks is a new experience.

The process description deliberately left product names out unless they were part of the command output. The concept is quite universal and is by no means tied down to a specific vendor.

One of the requirements is that your tape solution can talk to the primary host(s) and the mount host in the local and remote data centres. Although I have seen sites (including rather large ones ) where the DR site was not configured to take backups. Although you could argue that you shouldn’t operate from your DR site for long, that’s by no means an excuse for not having backups. Let’s suggest you run into a block corruption, how would you recover from that without a backup? But then the storage team usually argue that block corruption doesn’t happen on their high end arrays.

References

  • RMAN and Split Mirror Disk Backups [ID 302615.1]
  • RMAN Backup of Controlfile fails from Split Mirror / BCV Copy. Is there any way to take controlfile backup from Split Mirror / BCV so that the Production DB is not used for Controlfile backup? [ID 1111369.1]
  • Supported Backup, Restore and Recovery Operations using Third Party Snapshot Technologies [ID 604683.1]

A quick word of warning about database PSU 11.2.0.2.2

I am playing around with the Grid Infrastructure 11.2.0.2 PSU 2 and found an interesting note on My Oracle Support regarding the Patch Set Update. This reminds me that it’s always a good idea to search for a patch number on Metalink before applying a PSU. It also seems to be a good idea to wait for a few days before trying a PSU (or maybe CPU) on your DEV environment for the first time (and don’t even think about applying a PSU on production without thorough testing!)

OK, back to the story: there is a known issue with the patchset which has to do with the change in the Mutex behaviour which the PSU was intended to fix. To quote MOS note “Oracle Database Patch Set Update 11.2.0.2.2 Known Issues (Doc ID 1291879.1)”, Patch 12431716 Is a Recommended Patch for 11.2.0.2.2. In fact, Oracle strongly recommends you to apply the patch to fix Bug 12431716 – Unexpected change in mutex wait behavior in 11.2.0.2.2 PSU (higher CPU possible).

In a nutshell, not applying the patch can cause your system to suffer from excessive CPU usage and more than expected mutex contention. More information can be found in the description of Bug 12431716  Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU which is worth reading.

Besides this, the PSU was applied without any problems to my four node cluster, I just wish there was a way to roll out a new version of opatch to all cluster node’s $GRID_HOME and $ORACLE_HOME in one command. The overall process for the PSU is the same as already described in my previous post about Bundle Patch 3:

  1. Get the latest version of OPatch
  2. Deploy OPatch to $GRID_HOME and $ORACLE_HOME (ensure permissions are set correctly for the OPatch in $GRID_HOME!)
  3. Unzip the PSU (Bug 11724916 – 11.2.0.2.2 Patch Set Update (PSU) (Doc ID 11724916.8)), for example to /tmp/PSU
  4. Change directory to where you unzipped (/tmp/PSU) and become root
  5. Ensure that $GRID_HOME/OPatch is part of the path
  6. Read the readme
  7. Create an OCM response file and save it to say, /tmp/ocm.rsp
  8. Start the patch as root: opatch auto and supply the full path to the OCM response file (/tmp/ocm.rsp)
  9. Apply the beforementioned one-off patch

Then wait, and after a little while you spend trailing the logfile in $GRID_HOME/cfgtoollogs/ and having a coffee the process eventually finishes. Repeat on each node and you’re done. I’m really happy there aren’t these long readme files anymore with 8 steps to be performed, partially as root, partially as CRS owner/RDBMS owner. It reduces tge tune ut takes to apply a PSU significantly.

Happy patching!

Pro Oracle Database 11g RAC on Linux is available for Kindle

Addmittedly I haven’t checked for a little while, but an email by my co-author Steve Show prompted me to go to the Amazon website and look it up.

And yes, it’s reality! Our book is now finally available as a kindle version, how great is that?!?

There isn’t really a lot more to say about this subject. I’ll wonder how many techies are intersted in the kindle version after the PDF has been out for quite a while. If you read this and decide to get the kindle version, could you please let me know how you liked it?  Personally I think the book is well suited for the Amazon reader as it’s mostly text which suits the device well.

Happy reading!

Slight 11.2.0.2 caveat with TTS and missing XMLDB

A small post about a problem I encountered a little while ago, quite interesting nevertheless.

The background story is that-for security reasons-my customer’s databases are all created with “Oracle Database Catalog Views” and “Oracle Database Packages and Types” only. Anything else will have to be requested by the application team. I like the idea in principle, and it is certainly a lot better than all those databases with UltraSearch, Spatial and Text installed which open the door to vulnerabilities and exploits. Have a look at the problems fixed by Critical Patch Updates and you will know what I am talking about.

As part of a project I have been asked to transport a tablespace from such a database to another host. Everything is fine until you try to export the tablespace metadata with expdp. As you know, anyone using traditional exp is a bit backward ;)

[oracle@node1 ~] $ expdp / dumpfile=tts_metadata.dmp directory=data_pump_dir logfile=tts_metadata.log \
> transport_tablespaces='TTS_TEST'
Export: Release 11.2.0.2.0 - Production on Tue May 10 16:05:49 2011
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
Starting "ORACLE"."SYS_EXPORT_TRANSPORTABLE_01":  /******** dumpfile=tts_metadata.dmp logfile=tts_metadata.log
  directory=data_pump_dir transport_tablespaces=TTS_TEST
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
ORA-39126: Worker unexpected fatal error in KUPW$WORKER.UNLOAD_METADATA [PLUGTS_BLK]
ORA-00904: "Q"."TABOBJ_NUM": invalid identifier
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 95
ORA-06512: at "SYS.KUPW$WORKER", line 8358
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x15e53a900     19208  package body SYS.KUPW$WORKER
0x15e53a900      8385  package body SYS.KUPW$WORKER
0x15e53a900      2935  package body SYS.KUPW$WORKER
0x15e53a900      9054  package body SYS.KUPW$WORKER
0x15e53a900      1688  package body SYS.KUPW$WORKER
0x155a4e638         2  anonymous block
Job "ORACLE"."SYS_EXPORT_TRANSPORTABLE_01" stopped due to fatal error at 16:05:57
[oracle@node1 ~]

This is due to bug 10185688, which is fixed in 12.1, 11.2.0.3.0, or in one-off patch 10185688. Oracle 12.1 and 11.2.0.3 are not available at the time of this writing.

You will only encounter this problem if your database doesn’t have XML and dependent options compiled in. I wonder how many databases out there have options in DBA_SERVER_REGISTRY which aren’t used. When I was working at IGEFI we only introduced XMLDB to Multifonds after customers requested the ability to load XML files, prior to the move from 9i to 10g ( in 2006 that was). I still don’t understand why you’d want to prefer XML input files rather than CSV for external tables but that wasn’t my call either.

By the way, the old backward exp command doesn’t have this restriction!

[oracle@node1 ~] $ exp file=tts.dmp log=tts.log TRANSPORT_TABLESPACE=y TABLESPACES=TTS_TEST
Export: Release 11.2.0.2.0 - Production on Tue May 10 16:20:56 2011
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
Username: / as sysdba
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Tes
Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)
Note: table data (rows) will not be exported
About to export transportable tablespace metadata...
For tablespace TTS_TEST ...
. exporting cluster definitions
. exporting table definitions
. . exporting table                           TEST
. exporting referential integrity constraints
. exporting triggers
. end transportable tablespace metadata export
Export terminated successfully without warnings.

Not a problem then, but quite interesting.

ASM normal redundancy and failure groups spanning SANs

Julian Dyke has started an interesting thread on the Oak Table mailing list after the latest UKOUG RAC and HA SIG. Unfortunately I couldn’t attend that event, I wish I had, and I knew it would be great.

Anyway, the question revolved around an ASM disk group created with normal redundancy spanning two storage arrays. This should in theory protect against the failure of an array, although at a high price. All ASM disks exported from an array would be 1 failure group. Remember that disks in a failure group all fail if the supporting infrastructure (network, HBA, controller etc) fails. So what would happen with such a setup, if you followed these steps:

  • Shutdown the array for failure group 2
  • Stop the database
  • Shutdown the second array – failure group 1
  • Do some more maintenance…
  • Startup failgroup B SAN
  • Start the database
  • Startup failgroup A SAN

ASM can tolerate the failure of one failgroup (capacity permitting), so the failure of failgroup 2 should not bring the disk group down, which would result in immediate loss of service. But what happens if it comes up after the data in the other failure group has been modified? Will there be data corruption?

Replaying

To simulate two storage arrays my distinguished filer01 and filer02 OpenFiler appliances have been used, each exporting 2 approx. 4G “LUNS” to my database host. At this time I only had access to my 11.1.0.7 2 node RAC system, if time permits I’ll repeat this with 10.2.0.5 and 11.2.0.2. The RAC cluster in the SIG presentation was 10.2. I am skipping the bit about the LUN creation and presentation to the hosts, and assume the following setup:

[root@rac11gr1node1 ~]# iscsiadm -m session
tcp: [1] 192.168.99.51:3260,1 iqn.2006-01.com.openfiler:filer02DiskB
tcp: [2] 192.168.99.50:3260,1 iqn.2006-01.com.openfiler:filer01DiskA
tcp: [3] 192.168.99.51:3260,1 iqn.2006-01.com.openfiler:filer02DiskA
tcp: [4] 192.168.99.50:3260,1 iqn.2006-01.com.openfiler:filer01DiskB

192.168.99.50 is my first openfiler instance, .192.168.99.51 the second. As you can see each export DISKA and DISKB. Mapped to the hosts, this is the target mapping (use iscsiadm –mode session –print 3 to find out):

  • filer02DiskB: /dev/sda
  • filer01DiskA: /dev/sdb
  • filer02DiskA: /dev/sdc
  • filer01DiskB: /dev/sdd

I am using ASMLib (as always on the lab) to label these disks:

[root@rac11gr1node1 ~]# oracleasm listdisks
DATA1
DATA2
FILER01DISKA
FILER01DISKB
FILER02DISKA
FILER02DISKB

DATA1 and DATA2 will not play a role in this article, I’m interested in the other disks. Assuming that the scandisks command completed on all nodes, I can add the disks to the new diskgroup:

SQL> select path from v$asm_disk

PATH
--------------------------------------------------------------------------------
ORCL:FILER01DISKA
ORCL:FILER01DISKB
ORCL:FILER02DISKA
ORCL:FILER02DISKB
ORCL:DATA1
ORCL:DATA2

Let’s create the diskgroup. The important part is to create failure groups per storage array. By the way this is not different from extended distance RAC!

SQL> create diskgroup fgtest normal redundancy
 2  failgroup filer01 disk 'ORCL:FILER01DISKA', 'ORCL:FILER01DISKB'
 3  failgroup filer02 disk 'ORCL:FILER02DISKA', 'ORCL:FILER02DISKB'
 4  attribute 'compatible.asm'='11.1';

Diskgroup created.

With that done let’s have a look at the asm disk information:

SQL> select MOUNT_STATUS,HEADER_STATUS,STATE,REDUNDANCY,FAILGROUP,PATH from v$asm_disk where group_number=2;

MOUNT_S HEADER_STATU STATE    REDUNDA FAILGROUP                      PATH
------- ------------ -------- ------- ------------------------------ --------------------
CACHED  MEMBER       NORMAL   UNKNOWN FILER01                        ORCL:FILER01DISKA
CACHED  MEMBER       NORMAL   UNKNOWN FILER01                        ORCL:FILER01DISKB
CACHED  MEMBER       NORMAL   UNKNOWN FILER02                        ORCL:FILER02DISKA
CACHED  MEMBER       NORMAL   UNKNOWN FILER02                        ORCL:FILER02DISKB

I have set the disk repair time to 24 hours and raised compatible parameters for RDBMS and ASM to 11.1, resulting in these attributes:

SQL> select * from v$asm_attribute

NAME                           VALUE                GROUP_NUMBER ATTRIBUTE_INDEX ATTRIBUTE_INCARNATION READ_ON SYSTEM_
------------------------------ -------------------- ------------ --------------- --------------------- ------- -------
disk_repair_time               3.6h                            2               0                     1 N       Y
au_size                        1048576                         2               8                     1 Y       Y
compatible.asm                 11.1.0.0.0                      2              20                     1 N       Y
compatible.rdbms               11.1.0.0.0                      2              21                     1 N       Y

Unlike 11.2 where disk groups are managed as Clusterware resource, 11.1 requires you to manually start them or append the new disk group to ASM_DISKGORUPS. You should query gv$asm_diskgroup.state to ensure the new diskgroup is mounted on all cluster nodes.

I need some data! A small demo database can be restored to the new failure group to provide some experimental playground. This is quite easily done by using an RMAN duplicate with the correct {db|log}_file_name_convert parameter set.

Mirror

The diskgroup is created with normal redundancy, which means that ASM will create a mirror for every primary extent, taking failure groups into consideration. I wanted to ensure that the data is actually mirrored on the new disk group, which has group number 2.I need to get this information from V$ASM_FILE and V$ASM_ALIAS:

SQL> select * from v$asm_file where group_number = 2

GROUP_NUMBER FILE_NUMBER COMPOUND_INDEX INCARNATION BLOCK_SIZE     BLOCKS      BYTES      SPACE TYPE                 REDUND STRIPE CREATION_ MODIFICAT R
------------ ----------- -------------- ----------- ---------- ---------- ---------- ---------- -------------------- ------ ------ --------- --------- -
 2         256       33554688   747669775      16384       1129   18497536   78643200 CONTROLFILE          HIGH   FINE   05-APR-11 05-APR-11 U
 2         257       33554689   747669829       8192      69769  571547648 1148190720 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         258       33554690   747669829       8192      60161  492838912  990904320 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         259       33554691   747669829       8192      44801  367009792  739246080 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         260       33554692   747669831       8192      25601  209723392  424673280 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         261       33554693   747669831       8192        641    5251072   12582912 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         262       33554694   747670409        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         263       33554695   747670409        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         264       33554696   747670417        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         265       33554697   747670417        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         266       33554698   747670419       8192       2561   20979712   44040192 TEMPFILE             MIRROR COARSE 05-APR-11 05-APR-11 U

11 rows selected.

SQL> select * from v$asm_alias where group_NUMBER=2

NAME                           GROUP_NUMBER FILE_NUMBER FILE_INCARNATION ALIAS_INDEX ALIAS_INCARNATION PARENT_INDEX REFERENCE_INDEX A S
------------------------------ ------------ ----------- ---------------- ----------- ----------------- ------------ --------------- - -
RAC11G                                    2  4294967295       4294967295           0                 3     33554432        33554485 Y Y
CONTROLFILE                               2  4294967295       4294967295          53                 3     33554485        33554538 Y Y
current.256.747669775                     2         256        747669775         106                 3     33554538        50331647 N Y
DATAFILE                                  2  4294967295       4294967295          54                 1     33554485        33554591 Y Y
SYSAUX.257.747669829                      2         257        747669829         159                 1     33554591        50331647 N Y
SYSTEM.258.747669829                      2         258        747669829         160                 1     33554591        50331647 N Y
UNDOTBS1.259.747669829                    2         259        747669829         161                 1     33554591        50331647 N Y
UNDOTBS2.260.747669831                    2         260        747669831         162                 1     33554591        50331647 N Y
USERS.261.747669831                       2         261        747669831         163                 1     33554591        50331647 N Y
ONLINELOG                                 2  4294967295       4294967295          55                 1     33554485        33554644 Y Y
group_1.262.747670409                     2         262        747670409         212                 1     33554644        50331647 N Y
group_2.263.747670409                     2         263        747670409         213                 1     33554644        50331647 N Y
group_3.264.747670417                     2         264        747670417         214                 1     33554644        50331647 N Y
group_4.265.747670417                     2         265        747670417         215                 1     33554644        50331647 N Y
TEMPFILE                                  2  4294967295       4294967295          56                 1     33554485        33554697 Y Y
TEMP.266.747670419                        2         266        747670419         265                 1     33554697        50331647 N Y

My USERS tablespace which I am interested in most has file number 261-I chose it for this example as it’s only 5M in size. Taking my 1 MB allocation unit into account, it means I don’t have to trawl through thousands of line of output when getting the extent map.

Credit where credit is due-the next queries are partly based on the excellent work by Luca Canali from CERN, who has looked at ASM internals for a while. Make sure you have a look at the excellent reference available here: https://twiki.cern.ch/twiki/bin/view/PDBService/ASM_Internals. So to answer the question if the extents making up my users tablespace we need to have a look at the X$KFFXP, i.e. file extent pointers view:

SQL> select GROUP_KFFXP,DISK_KFFXP,AU_KFFXP from x$kffxp where number_kffxp=261 and group_kffxp=2 order by disk_kffxp;

GROUP_KFFXP DISK_KFFXP   AU_KFFXP
----------- ---------- ----------
 2          0        864
 2          0        865
 2          0        866
 2          1        832
 2          1        831
 2          1        833
 2          2        864
 2          2        866
 2          2        865
 2          3        832
 2          3        833
 2          3        831

12 rows selected.

As you can see, I have a number of extents, all evenly spread over my disks. I can verify that this information is correct by querying the X$KFDAT view as well which contains similar information, but more related to the disk  to AU mapping

SQL> select GROUP_KFDAT,NUMBER_KFDAT,AUNUM_KFDAT from x$kfdat where fnum_kfdat = 261 and group_kfdat=2

GROUP_KFDAT NUMBER_KFDAT AUNUM_KFDAT
----------- ------------ -----------
 2            0         864
 2            0         865
 2            0         866
 2            1         831
 2            1         832
 2            1         833
 2            2         864
 2            2         865
 2            2         866
 2            3         831
 2            3         832
 2            3         833

12 rows selected.

OK so I am confident that my data is actually mirrored-otherwise the following test would not make any sense. I have double checked that the disks in failgroup FILER01 actually belong to my OpenFiler “filer01″, and the same for filer02. Going back to the original scenario:

Shut down Filer02

This will take down all the disks of failure group B. Two minutes after taking the filer down I checked if it was indeed shut down:

martin@dom0:~> sudo xm list | grep filer
filer01                                    183   512     1     -b----   1159.6
filer02                                          512     1              1179.6
filer03                                    185   512     1     -b----   1044.4

Yes, no doubt about it-it’s down. What would the effect be? Surely I/O errors, but I wanted to enforce a check. Connected to +ASM2 I issued the “select * from v$asm_disk” command. This caused quite significant logging in the instance’s alert.log:

NOTE: ASMB process exiting due to lack of ASM file activity for 5 seconds
Wed Apr 06 17:17:39 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61330 bufp:0x2b29977b3e00 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61220 bufp:0x2b29977b0200 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:17:58 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61440 bufp:0x2b29977a6400 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:18:03 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (86400) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (86400) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.
Wed Apr 06 17:19:26 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61440 bufp:0x2b29977b0200 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:20:10 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61000 bufp:0x2b29977a6400 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:21:07 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (86217) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (86217) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.
Wed Apr 06 17:27:15 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (85849) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (85849) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.

The interesting lines are “all mirror sides found readable, no repair required”. So taking down the failgroup didn’t cause an outage. The other ASM instance complained as well a little later:

2011-04-06 17:16:58.393000 +01:00
NOTE: initiating PST update: grp = 2, dsk = 2, mode = 0x15
NOTE: initiating PST update: grp = 2, dsk = 3, mode = 0x15
kfdp_updateDsk(): 24
kfdp_updateDskBg(): 24
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:2, diskname:ORCL:FILER02DISKA disk:0x2.0x97a6d9f7 au:1
 iop:0x2b9ea4855e70 bufp:0x2b9ea4850a00 offset(bytes):1052672 iosz:4096 operation:2(Write) synchronous:1
 result: 4 osderr:0x3 osderr1:0x2e pid:870
NOTE: group FGTEST: updated PST location: disk 0000 (PST copy 0)
2011-04-06 17:17:03.508000 +01:00
NOTE: ASMB process exiting due to lack of ASM file activity for 5 seconds
NOTE: PST update grp = 2 completed successfully
NOTE: initiating PST update: grp = 2, dsk = 2, mode = 0x1
NOTE: initiating PST update: grp = 2, dsk = 3, mode = 0x1
kfdp_updateDsk(): 25
kfdp_updateDskBg(): 25
2011-04-06 17:17:07.454000 +01:00
NOTE: group FGTEST: updated PST location: disk 0000 (PST copy 0)
NOTE: PST update grp = 2 completed successfully
NOTE: cache closing disk 2 of grp 2: FILER02DISKA
NOTE: cache closing disk 3 of grp 2: FILER02DISKB
SUCCESS: extent 0 of file 267 group 2 repaired by offlining the disk
NOTE: repairing group 2 file 267 extent 0
SUCCESS: extent 0 of file 267 group 2 repaired - all mirror sides found readable, no repair required
2011-04-06 17:19:04.526000 +01:00
GMON SlaveB: Deferred DG Ops completed.
2011-04-06 17:22:07.487000 +01:00
GMON SlaveB: Deferred DG Ops completed.

No interruption of service though, which is good-the GV$ASM_CLIENT view reported all database instances still connected.

SQL> select * from gv$asm_client;

 INST_ID GROUP_NUMBER INSTANCE_NAME                                                    DB_NAME  STATUS       SOFTWARE_VERSIO COMPATIBLE_VERS
---------- ------------ ---------------------------------------------------------------- -------- ------------ --------------- ---------------
 2            2 rac11g2                                                          rac11g   CONNECTED    11.1.0.7.0      11.1.0.0.0
 1            2 rac11g1                                                          rac11g   CONNECTED    11.1.0.7.0      11.1.0.0.0

The result in the V$ASM_DISK view was as follows:

SQL> select name,state,header_status,path from v$asm_disk;

NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 NORMAL   UNKNOWN      ORCL:FILER02DISKA
 NORMAL   UNKNOWN      ORCL:FILER02DISKB
DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2
FILER01DISKA                   NORMAL   MEMBER       ORCL:FILER01DISKA                                  FILER01
FILER01DISKB                   NORMAL   MEMBER       ORCL:FILER01DISKB                                  FILER01
FILER02DISKA                   NORMAL   UNKNOWN                                                         FILER02
FILER02DISKB                   NORMAL   UNKNOWN                                                         FILER02

8 rows selected.

As I expected the disks for failgroup filer02 are gone, and so is the information about the failure group. My disk repair time should be high enough to protect me from having to rebuild the whole disk group. Now I’m really curious if my database can become corrupted-I’ll increase the SCN.

[oracle@rac11gr1node1 ~]$ . setsid.sh rac11g
[oracle@rac11gr1node1 ~]$ sq

SQL*Plus: Release 11.1.0.7.0 - Production on Wed Apr 6 17:24:18 2011

Copyright (c) 1982, 2008, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
 5999304

SQL> begin
 2   for i in 1..5 loop
 3    execute immediate 'alter system switch logfile';
 4   end loop;
 5  end;
 6  /

PL/SQL procedure successfully completed.

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
 5999378

SQL>

Back to the test case.

Stop the Database

[oracle@rac11gr1node1 ~]$ srvctl stop database -d rac11g
[oracle@rac11gr1node1 ~]$ srvctl status database -d rac11g
Instance rac11g2 is not running on node rac11gr1node1
Instance rac11g1 is not running on node rac11gr1node2

Done-this part was simple. Next they stopped their first filer. To prevent bad things from happening I’ll shut down ASM on all nodes. I hope that doesn’t invalidate the test but I can’t see how ASM would not get a problem if the other failgroup went down as well.

Shut down Filer01 and start Filer02

Also quite simple. Shutting down this filer will allow me to follow the story. After filer01 was down I started filer02. I’m curious as to how ASM will react. I have deliberately NOT put disk group FGTEST into the ASM_DISKSTRING, I want to start it manually to get a better understanding of what happens.

After having started ASM on both nodes, I queried V$ASM_DISK and tried to mount the disk group:

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0                                NORMAL   MEMBER       ORCL:FILER02DISKA
 1                                NORMAL   MEMBER       ORCL:FILER02DISKB
 2                                NORMAL   UNKNOWN      ORCL:FILER01DISKA
 3                                NORMAL   UNKNOWN      ORCL:FILER01DISKB
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2

6 rows selected.

Ooops, now they are both gone….

SQL> alter diskgroup fgtest mount;
alter diskgroup fgtest mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing
ORA-15042: ASM disk "0" is missing
ORA-15080: synchronous I/O operation to a disk failed
ORA-15080: synchronous I/O operation to a disk failed

OK, I have a problem here. Both ASM instances report I/O errors with the FGTEST diskgroup, and I can’t mount it. That means I can’t mount the database either-in a way it proves I won’t have corruption. But neither will I have a database, what is worse?

Can I get around this problem?

I think I’ll have to start filer01 and see if that makes a difference. Hopefully I can recover my system with the information in failgroup filer01. Soon after filer01 came online I tried the query against v$asmdisk again and tried to mount it.

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0                                NORMAL   MEMBER       ORCL:FILER02DISKA
 1                                NORMAL   MEMBER       ORCL:FILER02DISKB
 2                                NORMAL   MEMBER       ORCL:FILER01DISKA
 3                                NORMAL   MEMBER       ORCL:FILER01DISKB
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2

6 rows selected.

That worked!

Wed Apr 06 17:45:32 2011
SQL> alter diskgroup fgtest mount
NOTE: cache registered group FGTEST number=2 incarn=0x72c150d7
NOTE: cache began mount (first) of group FGTEST number=2 incarn=0x72c150d7
NOTE: Assigning number (2,0) to disk (ORCL:FILER01DISKA)
NOTE: Assigning number (2,1) to disk (ORCL:FILER01DISKB)
NOTE: Assigning number (2,2) to disk (ORCL:FILER02DISKA)
NOTE: Assigning number (2,3) to disk (ORCL:FILER02DISKB)
Wed Apr 06 17:45:33 2011
NOTE: start heartbeating (grp 2)
kfdp_query(): 12
kfdp_queryBg(): 12
NOTE: cache opening disk 0 of grp 2: FILER01DISKA label:FILER01DISKA
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 2: FILER01DISKB label:FILER01DISKB
NOTE: cache opening disk 2 of grp 2: FILER02DISKA label:FILER02DISKA
NOTE: F1X0 found on disk 2 fcn 0.0
NOTE: cache opening disk 3 of grp 2: FILER02DISKB label:FILER02DISKB
NOTE: cache mounting (first) group 2/0x72C150D7 (FGTEST)
Wed Apr 06 17:45:33 2011
* allocate domain 2, invalid = TRUE
kjbdomatt send to node 0
Wed Apr 06 17:45:33 2011
NOTE: attached to recovery domain 2
NOTE: cache recovered group 2 to fcn 0.7252
Wed Apr 06 17:45:33 2011
NOTE: LGWR attempting to mount thread 1 for diskgroup 2
NOTE: LGWR mounted thread 1 for disk group 2
NOTE: opening chunk 1 at fcn 0.7252 ABA
NOTE: seq=3 blk=337
NOTE: cache mounting group 2/0x72C150D7 (FGTEST) succeeded
NOTE: cache ending mount (success) of group FGTEST number=2 incarn=0x72c150d7
Wed Apr 06 17:45:33 2011
kfdp_query(): 13
kfdp_queryBg(): 13
NOTE: Instance updated compatible.asm to 11.1.0.0.0 for grp 2
SUCCESS: diskgroup FGTEST was mounted
SUCCESS: alter diskgroup fgtest mount

The V$ASM_DISK view is nicely updated and everything seems to be green:

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2
 0 FILER01DISKA                   NORMAL   MEMBER       ORCL:FILER01DISKA                                  FILER01
 1 FILER01DISKB                   NORMAL   MEMBER       ORCL:FILER01DISKB                                  FILER01
 2 FILER02DISKA                   NORMAL   MEMBER       ORCL:FILER02DISKA                                  FILER02
 3 FILER02DISKB                   NORMAL   MEMBER       ORCL:FILER02DISKB                                  FILER02

6 rows selected.

Brilliant-will it have an effect on the database?

Starting the Database

Even though things looked ok, they weren’t! I didn’t expect this to happen:

[oracle@rac11gr1node1 ~]$ srvctl start database -d rac11g
PRKP-1001 : Error starting instance rac11g2 on node rac11gr1node1
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL*Plus: Release 11.1.0.7.0 - Production on Wed Apr 6 17:48:58 2011
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Copyright (c) 1982, 2008, Oracle.  All rights reserved.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Enter user-name: Connected to an idle instance.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL> ORACLE instance started.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Total System Global Area 1720328192 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Fixed Size                    2160392 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Variable Size              1291847928 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Database Buffers    419430400 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Redo Buffers                  6889472 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [9572],
rac11gr1node1:ora.rac11g.rac11g2.inst:[9533], [0x000000000], [], [], [], [], [], [], [], []
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
rac11gr1node1:ora.rac11g.rac11g2.inst:With the Partitioning, Real Application Clusters, OLAP, Data Mining
rac11gr1node1:ora.rac11g.rac11g2.inst:and Real Application Testing options
rac11gr1node1:ora.rac11g.rac11g2.inst:
CRS-0215: Could not start resource 'ora.rac11g.rac11g2.inst'.
PRKP-1001 : Error starting instance rac11g1 on node rac11gr1node2
CRS-0215: Could not start resource 'ora.rac11g.rac11g1.inst'.

Oops. A quick search on Metalink revealed note Ora-00600: [Kccpb_sanity_check_2], [3621501],[3621462] On Startup (Doc ID 435436.1). The explanation for the ORA-600 is that “the seq# of the last read block is higher than the seq# of the control file header block.” Oracle Support explains it with a lost write, but here the situation is quite different. Interesting! I have to leave that for another blog post.

Summer Seminars

I am doing a couple of one day seminars with Oracle University, currently planned for Austria and Switzerland. They go by the title “Grid Intrastructure and Database High Availability Deep Dive”, and can be accessed via these links.

To save you from having to get the abstract, I copied it from the Oracle University website:

Providing a highly available database architecture fit for today’s fast changing requirements can be a complex task. Many technologies are available to provide resilience, each with its own advantages and possible disadvantages. This seminar begins with an overview of available HA technologies (hard and soft partitioning of servers, cold failover clusters, RAC and RAC One Node) and complementary tools and techniques to provide recovery from site failure (Data Guard or storage replication).

In the second part of the seminar, we look at Grid Infrastructure in great detail. Oracle Grid Infrastructure is the latest incarnation of the Clusterware HA framework which successfully powers every single 10g and 11g RAC installation. Despite its widespread implementation, many of its features are still not well understood by its users. We focus on Grid Infrastructure, what it is, what it does and how it can be put to best use, including the creation of an active/passive cold failover cluster for web and database resources. Special focus will be placed on the various storage options (Cluster File System, ASM, etc), the cluster interconnect and other implementation choices and on troubleshooting Grid Infrastructure. In the final part of the seminar, we explore Real Application Clusters and its various uses, from HA to scalability to consolidation. We discuss patching and workload management, coding for RAC and other techniques that will allow users to maximise the full potential of the package.

See you there if you are interested!

Copying the password file for RAC databases

This post is inspired by a recent thread on the oracle-l mailing list. In post “11g RAC orapw file issue- RAC nodes not updated” the fact that the password file is local to the instance has been brought up. In fact, all users with the SYSOPER or SYSDBA role granted are stored in the password file, and changing the account for the SYS user on one instance doesn’t mean the password change is reflected on the other RAC instances. Furthermore, your Data Guard configuration will break as well, since the SYS account is used to log in to the standby database.

On a related note, the change of the sys password for the ASM instance in GRID_HOME will propagate to all cluster nodes automatically, a fact I have first seen mentioned on the Dutch Prutser’s weblog, Harald van Breederode.

Now to get over the annoyance of having to manually copy the new password file to all cluster nodes I have written a small shell script, which I use for all my Linux clusters. It takes the ORACLE_SID of the local instance for input, then works out the corresponding ORACLE_HOME and copies the password file to all instances in the cluster, as listed in the output of olsnodes. The script can deal with separation of duty, i.e. Systems where GRID_HOME is owned by a different owner then the RDBMS ORACLE_HOME. The script is by no means perfect, and could be extended to deal with a more general setup. My assumption is that all cluster nodes have a 1:1 mapping of Oracle instance and ORACLE_SID, for example instance PROD1 will be hosted on the first cluster node, prodnode1.

The script is shown below, it’s been written and tested on Linux:

#!/bin/bash

# A small and simple script to copy a password file
# to all nodes of a cluster
# This works for me, it doesn't necessarily work for you,
# and the script is provided "as is"-I will not take
# responsibility for its operation and it comes with no
# warrenty of any sorts
#
# Martin Bach 2011
#
# You are free to use the script as you feel fit, but please
# retain the reference to the author.
#
# Usage: requires the local ORACLE_SID as a parameter.
# requires the ORACLE_SID or DBNAME to be in oratab

ORACLE_SID=$1
[[ $ORACLE_SID == "" ]] && {
 echo usage `basename $0` ORACLE_SID
 exit 1
}

#### TUNEABLES

# change to /var/opt/oracle/oratab for Solaris
ORATAB=/etc/oratab
GRID_HOME=/u01/crs/11.2.0.2

#### this section doesn't normally have to be changed

DBNAME=${ORACLE_SID%*[0-9]}
ORACLE_HOME=`grep $DBNAME $ORATAB | awk -F":" '{print $2}'`
[[ $ORACLE_HOME == "" ]] && {
 echo cannot find ORACLE_HOME for database $DBNAME in $ORATAB
 exit 2
}

cd $ORACLE_HOME/dbs
cp -v orapw$ORACLE_SID /tmp
INST=1

echo starting copy of passwordfile
for NODE in `$GRID_HOME/bin/olsnodes`; do
 echo copying orapw$ORACLE_SID to $NODE as orapw${DBNAME}${INST}
 scp orapw$ORACLE_SID $NODE:${ORACLE_HOME}/dbs/orapw${DBNAME}${INST}
 INST=$(( $INST + 1))
done

It’s fairly straight forward, we first get the ORACLE_SID and use this to get the ORACLE_HOME for the database.  The GRID_HOME has to be hard coded to keep it compatible with < 11.2 database where you could have a CRS_HOME different from the ASM_HOME. For Oracle < 11.2, you need to set the GRID_HOME variable to your Clusterware home.

The DBNAME is the $ORACLE_SID without trailing number, which I need to work out the SIDs of the other cluster nodes. Before copying the password file from the local node to all cluster nodes a copy is taken to /tmp, just in case.

The main logic is in the loop provided by the output of olsnodes, and the local password file is copied across all cluster nodes.

Feel free to use at your own risk, and modify/distribute as needed. This works well for me, especially across the 8 node cluster.

Troubleshooting Grid Infrastructure startup

This has been an interesting story today when one of my blades decided to reboot after an EXT3 journal error. The hard facts first:

  • Oracle Linux 5.5 with kernel 2.6.18-194.11.4.0.1.el5
  • Oracle 11.2.0.2 RAC
  • Bonded NICs for private and public networks
  • BL685-G6 with 128G RAM

First I noticed the node had problems when I tried to get all databases configured on the cluster. I got the dreaded “cannot communicate with the CRSD”

[oracle@node1.example.com] $ srvctl config database
PRCR-1119 : Failed to look up CRS resources of database type
PRCR-1115 : Failed to find entities of type resource that match filters (TYPE ==ora.database.type) and contain attributes DB_UNIQUE_NAME,ORACLE_HOME,VERSION
Cannot communicate with crsd

Not too great, especially since everything worked when I left yesterday. What could have gone wrong?An obvious reason for this could be a reboot, and fair enough, there has been one:

[grid@node1.example.com] $ uptime
09:09:22 up  2:40,  1 user,  load average: 1.47, 1.46, 1.42

The next step was to check if the local CRS stack was up, or better, to check what was down. Sometimes it’s only crsd which has a problem. In my case everything was down:

[grid@node1.example.com] $ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[grid@node1.example.com] $ crsctl check cluster -all
**************************************************************
node1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
CRS-4404: The following nodes did not reply within the allotted time:
node2,node3, node4, node5, node6, node7, node8

The CRS-4404 was slightly misleading, I assumed all cluster nodes were down after a clusterwide reboot. Sometimes a single node reboot triggers worse things. However, logging on to node 2 I saw that all but the first node were ok.

CRSD really needs CSSD to be up and running, and CSSD requires the OCR to be there. I wanted to know if the OCR was impacted in any way:

[grid@node1.example.com] $ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
ORA-29701: unable to connect to Cluster Synchronization Service

Well it seemed that the OCR location was unavailable. I know that on this cluster, the OCR is stored on ASM. Common reasons for the PROC-26 error are

  • Unix admin upgrades the kernel but forgets to upgrade the ASMLib kernel module (common grief with ASMLib!)
  • Storage is not visible on the host, i.e. SAN connectivity broken/taken away (happens quite frequently with storage/sys admin unaware of ASM)
  • Permissions not set correctly on the block devices (not an issue when using asmlib)

I checked ASMLib and it reported a working status:

[oracle@node1.example.com] $ /etc/init.d/oracleasm status
Checking if ASM is loaded: yes
Checking if /dev/oracleasm is mounted: yes

That was promising, /dev/oracleasm/ was populated and the matching kernel modules loaded. /etc/init.d/oracleasm listdisks listed all my disks as well. Physical storage not accessible (PROC-26) seemed a bit unlikely now.

I could rule out permission problems since ASMLib was working fine, and I also rule out the kernel upgrade/missing libs problem by comparing the RPM with the kernel version: they matched. So maybe it’s storage related?

Why did the node go down?

Good question, usually to be asked towards the unix administration team. Luckily I have a good contact placed right inside that team and I could get the following excerpt from /var/log/messages arond the time of the crash (6:31 this morning):

Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192116, count = 1
Mar 17 06:26:06 node1 kernel: Aborting journal on device dm-2.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192216, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192166, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192122, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192140, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192174, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 10 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_truncate: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_orphan_del: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_delete_inode: Journal has aborted
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 kernel: ext3_abort called.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Mar 17 06:26:06 node1 kernel: Remounting filesystem read-only
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:41076
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:41076
Mar 17 06:26:09 node1 kernel: SysRq : Resetting
Mar 17 06:31:15 node1 syslogd 1.4.1: restart.

So it looks like a file system error triggered the reboot-I’m glad the box came back up ok on it’s own. The $GRID_HOME/log/hostname/alerthostname.log didn’t show anything specific to storage. Normally you would see that it starts counting a node down if it lost contact to the voting disks (in this case OCR and voting disks share the same diskgroup).

And why does Clusteware not start?

After some more investigation it seems there was no underlying problem with the storage, so I tried to manually start the cluster, traililng the ocssd.log file for possible clues.

[root@node1 ~]# crsctl start cluster
CRS-2672: Attempting to start ‘ora.cssd’ on ‘node1′
CRS-2674: Start of ‘ora.cssd’ on ‘node1′ failed
CRS-2679: Attempting to clean ‘ora.cssd’ on ‘node1′
CRS-2681: Clean of ‘ora.cssd’ on ‘node1′ succeeded
CRS-5804: Communication error with agent process
CRS-2672: Attempting to start ‘ora.cssdmonitor’ on ‘node1′
CRS-2676: Start of ‘ora.cssdmonitor’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.cssd’ on ‘node1′

… the command eventually failed. The ocssd.log file showed this:

...
2011-03-17 09:47:49.073: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996354 ms, node 0x2aaab008a260 { host 'node4', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-a7491097, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.084: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996364 ms, node 0x2aaab008a630 { host 'node6', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-2f6ece1c, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.113: [    CSSD][1113332032]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846440, LATS 10996434, lastSeqNo 30846437, uniqueness 1300108895, timestamp 1300355268/3605443434
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355257, LATS 10996434, lastSeqNo 31355254, uniqueness 1300344405, timestamp 1300355268/10388584
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372473, LATS 10996434, lastSeqNo 31372470, uniqueness 1297097908, timestamp 1300355268/3605182454
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384686, LATS 10996434, lastSeqNo 31384683, uniqueness 1297098093, timestamp 1300355268/3604696294
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388819, LATS 10996434, lastSeqNo 31388816, uniqueness 1297098327, timestamp 1300355268/3604712934
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612975, LATS 10996434, lastSeqNo 29612972, uniqueness 1297685443, timestamp 1300355268/3603054884
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203293, LATS 10996434, lastSeqNo 31203290, uniqueness 1297156000, timestamp 1300355268/3604855704
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 3, node33, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355258, LATS 10996434, lastSeqNo 31355255, uniqueness 1300344405, timestamp 1300355268/10388624
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372474, LATS 10996434, lastSeqNo 31372471, uniqueness 1297097908, timestamp 1300355268/3605182494
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384687, LATS 10996434, lastSeqNo 31384684, uniqueness 1297098093, timestamp 1300355268/3604696304
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388821, LATS 10996434, lastSeqNo 31388818, uniqueness 1297098327, timestamp 1300355268/3604713224
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612977, LATS 10996434, lastSeqNo 29612974, uniqueness 1297685443, timestamp 1300355268/3603055224
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846441, LATS 10996474, lastSeqNo 30846438, uniqueness 1300108895, timestamp 1300355269/3605443654
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355259, LATS 10996474, lastSeqNo 31355256, uniqueness 1300344405, timestamp 1300355268/10389264
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203294, LATS 10996474, lastSeqNo 31203291, uniqueness 1297156000, timestamp 1300355269/3604855914
2011-03-17 09:47:49.619: [    CSSD][1116485952]clssnmSendingThread: sending join msg to all nodes
...

The interesting bit is the “BUT NO NETWORK HB”, i.e. something must be wrong with the network configuration. I quickly checked the output of ifconfig and found a missing entry for my private interconnect. This is defined in the GPnP profile if you are unsure:


 
 
 

Now that’s a starting point! If tried to bring up bond1.251, but that failed:

[root@node1 network-scripts]# ifup bond1.251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
ERROR: could not add vlan 251 as bond1.251 on dev bond1

The “invalid argument” didn’t mean too much to me, so I ran ifup with the “-x” flag to get more information about which argument was invalid:

[root@node1 network-scripts]# which ifup
/sbin/ifup
[root@node1 network-scripts]# view /sbin/ifup
# turned out it's a shell script! Let's run with debug output enabled
[root@node1 network-scripts]# bash -x /sbin/ifup bond1.251
+ unset WINDOW
...
+ MATCH='^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$'
+ [[ bond1.251 =~ ^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$ ]]
++ echo bond1.251
++ LC_ALL=C
++ sed 's/^[a-z0-9]*\.0*//'
+ VID=251
+ PHYSDEV=bond1
+ [[ bond1.251 =~ ^vlan[0-9]{1,4}? ]]
+ '[' -n 251 ']'
+ '[' '!' -d /proc/net/vlan ']'
+ test -z ''
+ VLAN_NAME_TYPE=DEV_PLUS_VID_NO_PAD
+ /sbin/vconfig set_name_type DEV_PLUS_VID_NO_PAD
+ is_available bond1
+ LC_ALL=
+ LANG=
+ ip -o link
+ grep -q bond1
+ '[' 0 = 1 ']'
+ return 0
+ check_device_down bond1
+ echo bond1
+ grep -q :
+ LC_ALL=C
+ ip -o link
+ grep -q 'bond1[:@].*,UP'
+ return 1
+ '[' '!' -f /proc/net/vlan/bond1.251 ']'
+ /sbin/vconfig add bond1 251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
+ /usr/bin/logger -p daemon.info -t ifup 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
+ echo 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
ERROR: could not add vlan 251 as bond1.251 on dev bond1
+ exit 1

Hmmm so it seemed that the underlying interface bond1 was missing-which was true. The output of ifconfig didn’t show it as configured, and trying to start it manually using ifup bond1 failed as well. It turned out that the ifcfg-bond1 file was missing and had to be recreated from the documentation. All network configuration files in Red Hat based systems belong into /etc/sysconfig/network-scripts/ifcfg-interfaceName. With the recreated file in place, I was back in the running:

[root@node1 network-scripts]# ll *bond1*
-rw-r–r– 1 root root 129 Mar 17 10:07 ifcfg-bond1
-rw-r–r– 1 root root 168 May 19  2010 ifcfg-bond1.251
[root@node1 network-scripts]# ifup bond1
[root@node1 network-scripts]# ifup bond1.251
Added VLAN with VID == 251 to IF -:bond1:-
[root@node1 network-scripts]#

Now I could try to start the lower stack again:

CRS-2672: Attempting to start ‘ora.cssdmonitor’ on ‘node1′
CRS-2676: Start of ‘ora.cssdmonitor’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.cssd’ on ‘node1′
CRS-2676: Start of ‘ora.cssd’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.cluster_interconnect.haip’ on ‘node1′
CRS-2672: Attempting to start ‘ora.ctssd’ on ‘node1′
CRS-2676: Start of ‘ora.ctssd’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.evmd’ on ‘node1′
CRS-2676: Start of ‘ora.evmd’ on ‘node1′ succeeded
CRS-2676: Start of ‘ora.cluster_interconnect.haip’ on ‘node1′ succeeded
CRS-2679: Attempting to clean ‘ora.asm’ on ‘node1′
CRS-2681: Clean of ‘ora.asm’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.asm’ on ‘node1′
CRS-2676: Start of ‘ora.asm’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.crsd’ on ‘node1′
CRS-2676: Start of ‘ora.crsd’ on ‘node1′ succeeded
[root@node1 network-scripts]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Brilliant-problem solved. This is actually the first time that an incorrect network config prevented a cluster I looked after from starting. The best indication in this case is in the gipcd log file, but it didn’t occur to me to have a look at is as the error was clearly related to storage.