11g Release 2

First contact with Oracle 11.2.0.2 RAC

As you may know, Oracle has released the first patchset on top of 11g Release 2. At the time of writing, the patchset is out for 32-bit and 64-bit Linux, and for 32-bit and 64-bit Solaris on SPARC and Intel. What an interesting combination of platforms… I thought there was no 32-bit Solaris on Intel anymore.

Upgrade

Oracle has come up with a fundamentally different approach to patching with this patchset. The long version of this can be found in MOS document 1189783.1 “Important Changes to Oracle Database Patch Sets Starting With 11.2.0.2”. The short version is that new patchsets will be supplied as full releases. This is really cool, and some people have asked why that wasn’t always the case. In 10g Release 2, to get to the latest version with all the patches, you had to:

  • Install the base release for Clusterware, ASM and at least one RDBMS home
  • Install the latest patchset on Clusterware, ASM and the RDBMS home
  • Apply the latest PSU for Clusterware/RDBMS, ASM and RDBMS

Applying the PSUs for Clusterware in particular was very labour intensive. In fact, for a fresh install it was usually easier to install and patch everything on only one node and then extend the patched software homes to the other nodes of the cluster.

Now in 11.2.0.2 things are different. You no longer have to apply any of the interim releases: the patchset contains everything you need, already at the correct version. The above process is shortened to:

  • Install Grid Infrastructure 11.2.0.2
  • Install RDBMS home 11.2.0.2

Optionally, apply PSUs or other patches when they become available. Currently, MOS note 756671.1 doesn’t list any patch as recommended on top of 11.2.0.2.

Interestingly, upgrading from 11.2.0.1 to 11.2.0.2 is more painful than upgrading from Oracle 10g, at least on the Linux platform. Before you can run rootupgrade.sh, the script tests whether you have applied the Grid Infrastructure PSU 11.2.0.1.2. OUI didn’t perform this test when it checked the prerequisites, which caught me off-guard. The casual observer may now ask: why do I have to apply a PSU when the bug fixes should be rolled up into the patchset anyway? I honestly don’t have an answer, other than that if you are not on Linux you should be fine.

Grid Infrastructure will be an out-of-place upgrade, which means you have to manage your local disk space very carefully from now on. I would not use anything less than 50-75G on my Grid Infrastructure mount point. This takes the new cluster health monitor facility (see below) into account, as well as the fact that Oracle performs log rotation for most logs in $GRID_HOME/log.
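
To keep an eye on that mount point, the free space can be read from df output. This is only a minimal sketch: the helper names free_gb and check_grid_space are made up for illustration, /u01 is an assumed mount point, and the 50G threshold is simply the lower bound mentioned above.

```shell
# free_gb: read "df -Pk" output on stdin and print the available space of the
# first listed file system in whole GiB (hypothetical helper, not an Oracle tool).
free_gb() {
    awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# check_grid_space MIN_GB: compare the free space against a minimum in GiB.
check_grid_space() {
    free=$(free_gb)
    if [ "$free" -ge "$1" ]; then
        echo "OK: ${free}G free"
    else
        echo "LOW: ${free}G free, want ${1}G"
    fi
}

# Example call, assuming Grid Infrastructure lives under /u01:
#   df -Pk /u01 | check_grid_space 50
```

The functions read df output from stdin rather than calling df themselves, so they can be tried against saved output from another node.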

The RDBMS binaries can be patched either in-place or out-of-place. I’d say that the out-of-place upgrade for RDBMS binaries is wholeheartedly recommended as it makes backing out a change so much easier. As I said, you don’t have a choice for Grid Infrastructure which is always out-of-place.

And then there is the multicast issue Julian Dyke (http://juliandyke.wordpress.com/) has written about. I couldn’t reproduce the test case, and my lab and real-life clusters run with 11.2.0.2 happily.

Changes to Grid Infrastructure

After the successful upgrade you’d be surprised to find new resources in Grid Infrastructure. Have a look at these:

[grid@node1] $ crsctl stat res -t -init
-----------------------------------------------------------------
NAME           TARGET  STATE        SERVER          STATE_DETAILS
-----------------------------------------------------------------
Cluster Resources
-----------------------------------------------------------------
ora.asm
 1        ONLINE  ONLINE       node1           Started
ora.cluster_interconnect.haip
 1        ONLINE  ONLINE       node1
ora.crf
 1        ONLINE  ONLINE       node1
ora.crsd
 1        ONLINE  ONLINE       node1
ora.cssd
 1        ONLINE  ONLINE       node1
ora.cssdmonitor
 1        ONLINE  ONLINE       node1
ora.ctssd
 1        ONLINE  ONLINE       node1           OBSERVER
ora.diskmon
 1        ONLINE  ONLINE       node1
ora.drivers.acfs
 1        ONLINE  ONLINE       node1
ora.evmd
 1        ONLINE  ONLINE       node1
ora.gipcd
 1        ONLINE  ONLINE       node1
ora.gpnpd
 1        ONLINE  ONLINE       node1
ora.mdnsd
 1        ONLINE  ONLINE       node1
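
With this many resources in the lower stack, a quick filter for anything that is not ONLINE can be handy. The following is a sketch under the assumption that the output keeps the two-line layout shown above (resource name, then one line per instance); not_online is a hypothetical helper name.

```shell
# not_online: read "crsctl stat res -t -init" output on stdin and print
# "<resource> <target> <state>" for every instance whose STATE is not ONLINE.
# Hypothetical helper; it relies on the two-line layout shown above.
not_online() {
    awk '
        /^ora\./ { name = $1; next }              # resource name line
        name != "" && $1 ~ /^[0-9]+$/ {           # instance line under it
            if ($3 != "ONLINE") print name, $2, $3
        }'
}

# Example call:
#   crsctl stat res -t -init | not_online
```

No output means every listed resource instance reported ONLINE.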

The cluster_interconnect.haip resource is yet another step towards the self-contained system. The Grid Infrastructure installation guide for Linux states:

“With Redundant Interconnect Usage, you can identify multiple interfaces to use for the cluster private network, without the need of using bonding or other technologies. This functionality is available starting with Oracle Database 11g Release 2 (11.2.0.2).”

So this is good news for anyone relying on third-party software, for example HP ServiceGuard, for network bonding. Linux has always done this for you, even in the days of the 2.4 kernel, and Linux network bonding is actually quite simple to set up as well. But anyway, when I have time I’ll run a few tests in the lab with this new feature enabled, deliberately taking down NICs to see if it does what it says on the tin.

The documentation states that you don’t need to bond your NICs for the private interconnect: simply leave the ethx devices (or whatever your NICs are called on your OS) as they are, and mark the ones you would like to use for the private interconnect as private during the installation. If you decide to add a NIC to the cluster for use with the private interconnect later, use oifcfg as root to add the new interface (or watch this space for a later blog post on this). Oracle states that if one of the private interconnects fails, it will transparently use another one. In addition to the high availability benefit, Oracle apparently also performs load balancing across the configured interconnects.

To learn more about the redundant interconnect feature I had a glance at its profile. As with any resource in the lower stack (or HA stack), you need to append the “-init” argument to crsctl.

[oracle@node1] $ crsctl stat res ora.cluster_interconnect.haip -p -init
NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:grid:r-x
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
AUTO_START=always
CARDINALITY=1
CHECK_INTERVAL=30
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION="Resource type for a Highly Available network IP"
ENABLED=1
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=balanced
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_DEPENDENCIES=hard(ora.gpnpd,ora.cssd)pullup(ora.cssd)
START_TIMEOUT=60
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(ora.cssd)
STOP_TIMEOUT=0
UPTIME_THRESHOLD=1m
USR_ORA_AUTO=
USR_ORA_IF=
USR_ORA_IF_GROUP=cluster_interconnect
USR_ORA_IF_THRESHOLD=20
USR_ORA_NETMASK=
USR_ORA_SUBNET=

With this information at hand, we see that the resource is controlled through ORAROOTAGENT, and judging from the start sequence position and the fact that we queried crsctl with the “-init” flag, it must be OHASD’s ORAROOTAGENT.
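
Picking single attributes out of such a profile is easy to script. The sketch below assumes the plain NAME=value layout shown above; profile_attr and agent_of are hypothetical helper names, not part of crsctl.

```shell
# profile_attr NAME: read "crsctl stat res <resource> -p" output on stdin and
# print the value of attribute NAME (hypothetical helper).
profile_attr() {
    awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }'
}

# agent_of: reduce AGENT_FILENAME to the bare agent name, dropping the
# directory part and the %CRS_EXE_SUFFIX% placeholder.
agent_of() {
    profile_attr AGENT_FILENAME | sed -e 's|.*/||' -e 's|%.*||'
}

# Example calls:
#   crsctl stat res ora.cluster_interconnect.haip -p -init | profile_attr START_DEPENDENCIES
#   crsctl stat res ora.cluster_interconnect.haip -p -init | agent_of
```

Using substr() instead of the second awk field keeps values intact even if they contain an equals sign themselves.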

Indeed, there are references to it in the $GRID_HOME/log/`hostname -s`/agent/ohasd/orarootagent_root/ directory. Further reference to the resource was found in cssd.log which makes perfect sense: it will use it for many things, last but not least fencing.

[ USRTHRD][1122056512] {0:0:2} HAIP: configured to use 1 interfaces
...
[ USRTHRD][1122056512] {0:0:2} HAIP:  Updating member info HAIP1;192.168.52.0#0
[ USRTHRD][1122056512] {0:0:2} InitializeHaIps[ 0]  infList 'inf bond1, ip 192.168.52.155, sub 192.168.52.0'
[ USRTHRD][1122056512] {0:0:2} HAIP:  starting inf 'bond1', suggestedIp '169.254.79.209', assignedIp ''
[ USRTHRD][1122056512] {0:0:2} Thread:[NetHAWork]start {
[ USRTHRD][1122056512] {0:0:2} Thread:[NetHAWork]start }
[ USRTHRD][1089194304] {0:0:2} [NetHAWork] thread started
[ USRTHRD][1089194304] {0:0:2}  Arp::sCreateSocket {
[ USRTHRD][1089194304] {0:0:2}  Arp::sCreateSocket }
[ USRTHRD][1089194304] {0:0:2} Starting Probe for ip 169.254.79.209
[ USRTHRD][1089194304] {0:0:2} Transitioning to Probe State
[ USRTHRD][1089194304] {0:0:2}  Arp::sProbe {
[ USRTHRD][1089194304] {0:0:2} Arp::sSend:  sending type 1
[ USRTHRD][1089194304] {0:0:2}  Arp::sProbe }
...
[ USRTHRD][1122056512] {0:0:2} Completed 1 HAIP assignment, start complete
[ USRTHRD][1122056512] {0:0:2} USING HAIP[  0 ]:  bond1 - 169.254.79.209
[ora.cluster_interconnect.haip][1117854016] {0:0:2} [start] clsn_agent::start }
[    AGFW][1117854016] {0:0:2} Command: start for resource: ora.cluster_interconnect.haip 1 1 completed with status: SUCCESS
[    AGFW][1119955264] {0:0:2} Agent sending reply for: RESOURCE_START[ora.cluster_interconnect.haip 1 1] ID 4098:343
[    AGFW][1119955264] {0:0:2} ora.cluster_interconnect.haip 1 1 state changed from: STARTING to: ONLINE
[    AGFW][1119955264] {0:0:2} Started implicit monitor for:ora.cluster_interconnect.haip 1 1
[    AGFW][1119955264] {0:0:2} Agent sending last reply for: RESOURCE_START[ora.cluster_interconnect.haip 1 1] ID 4098:343
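
The interesting bit of this log is the final interface-to-address assignment. A small sketch to pull it out, assuming the "USING HAIP" line keeps the format shown above (haip_assignments is a made-up helper name):

```shell
# haip_assignments: read an orarootagent log on stdin and print
# "<interface> <ip>" for every "USING HAIP[ n ]: <interface> - <ip>" line.
# Hypothetical helper; it depends on the log format shown above.
haip_assignments() {
    awk '/USING HAIP\[/ { print $(NF - 2), $NF }'
}

# Example call (the log file name inside the directory is an assumption):
#   haip_assignments < $GRID_HOME/log/`hostname -s`/agent/ohasd/orarootagent_root/orarootagent_root.log
```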

OK, I now understand this a bit better. But the log information mentioned something else as well: an IP address that I haven’t assigned to the cluster. It turns out that this IP address is another virtual IP on the private interconnect, called bond1:1

[grid]grid@node1 $ /sbin/ifconfig
bond1     Link encap:Ethernet  HWaddr 00:23:7D:3d:1E:77
 inet addr:192.168.52.155  Bcast:192.168.52.255  Mask:255.255.255.0
 inet6 addr: fe80::223:7dff:fe3c:1e74/64 Scope:Link
 UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
 RX packets:33155040 errors:0 dropped:0 overruns:0 frame:0
 TX packets:20677269 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:0
 RX bytes:21234994775 (19.7 GiB)  TX bytes:10988689751 (10.2 GiB)
bond1:1   Link encap:Ethernet  HWaddr 00:23:7D:3d:1E:77
 inet addr:169.254.79.209  Bcast:169.254.255.255  Mask:255.255.0.0
 UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
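
The 169.254.0.0/16 range used for bond1:1 is the IPv4 link-local range, which explains why the address was never assigned by hand. A tiny sketch to tell such addresses apart from regular ones; addr_scope is a hypothetical helper and does no validation of the address itself.

```shell
# addr_scope IP: print "link-local" for addresses in 169.254.0.0/16 and
# "other" for everything else. Pattern match only, no syntax validation.
addr_scope() {
    case "$1" in
        169.254.*.*) echo "link-local" ;;
        *)           echo "other" ;;
    esac
}

# Example calls with the two addresses from the ifconfig output above:
#   addr_scope 192.168.52.155
#   addr_scope 169.254.79.209
```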

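Ah, something running multicast. I tried to sniff that traffic but couldn’t make any sense of it. There is UDP (not TCP) traffic on that interface, although the destinations in the capture are other 169.254.x.x addresses rather than addresses in the multicast range (224.0.0.0/4). This can be checked with tcpdump: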

[root@node1 ~]# tcpdump src 169.254.79.209 -i bond1:1 -c 10  -s 1514
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond1:1, link-type EN10MB (Ethernet), capture size 1514 bytes
14:30:18.704688 IP 169.254.79.209.55310 > 169.254.228.144.31112: UDP, length 252
14:30:18.704943 IP 169.254.79.209.55310 > 169.254.169.62.20057: UDP, length 252
14:30:18.705155 IP 169.254.79.209.55310 > 169.254.45.135.30040: UDP, length 252
14:30:18.895764 IP 169.254.79.209.51227 > 169.254.228.144.57323: UDP, length 192
14:30:18.895976 IP 169.254.79.209.51227 > 169.254.228.144.21319: UDP, length 296
14:30:18.897109 IP 169.254.79.209.48094 > 169.254.45.135.40464: UDP, length 192
14:30:18.897633 IP 169.254.79.209.48094 > 169.254.45.135.40464: UDP, length 192
14:30:18.897998 IP 169.254.79.209.48094 > 169.254.169.62.48215: UDP, length 192
14:30:18.902325 IP 169.254.79.209.51227 > 169.254.228.144.57323: UDP, length 192
14:30:18.902422 IP 169.254.79.209.51227 > 169.254.228.144.21319: UDP, length 296
10 packets captured
14 packets received by filter
0 packets dropped by kernel

If you are interested in the actual messages, use this command instead to capture a single packet:

[root@node1 ~]# tcpdump src 169.254.79.209 -i bond1:1 -c 1 -X -s 1514
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond1:1, link-type EN10MB (Ethernet), capture size 1514 bytes
14:31:43.396614 IP 169.254.79.209.58803 > 169.254.169.62.16178: UDP, length 192
 0x0000:  4500 00dc 0000 4000 4011 ed04 a9fe 4fd1  E.....@.@.....O.
 0x0010:  a9fe a93e e5b3 3f32 00c8 4de6 0403 0201  ...>..?2..M.....
 0x0020:  e403 0000 0000 0000 4d52 4f4e 0003 0000  ........MRON....
 0x0030:  0000 0000 4d4a 9c63 0000 0000 0000 0000  ....MJ.c........
 0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
 0x0050:  a9fe 4fd1 4d39 0000 0000 0000 0000 0000  ..O.M9..........
 0x0060:  e403 0000 0000 0000 0100 0000 0000 0000  ................
 0x0070:  5800 0000 ff7f 0000 d0ff b42e 0f2b 0000  X............+..
 0x0080:  a01e 770d 0403 0201 0b00 0000 67f2 434c  ..w.........g.CL
 0x0090:  0000 0000 b1aa 0500 0000 0000 cf0f 3813  ..............8.
 0x00a0:  0000 0000 0400 0000 0000 0000 a1aa 0500  ................
 0x00b0:  0000 0000 0000 ae2a 644d 6026 0000 0000  .......*dM`&....
 0x00c0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
 0x00d0:  0000 0000 0000 0000 0000 0000            ............
1 packets captured
10 packets received by filter
0 packets dropped by kernel

Of course, substitute the correct values for interface and source address.
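
To see which peers a node talks to over the link-local addresses, the one-line-per-packet tcpdump output can be summarised per destination. This is a sketch that assumes the default output format shown in the first capture; udp_peers is a hypothetical helper name.

```shell
# udp_peers: read tcpdump one-line output on stdin and print
# "<packet count> <destination ip>" per destination, sorted by address text.
# The trailing ".<port>:" is stripped from the destination field.
udp_peers() {
    awk '$2 == "IP" && $4 == ">" {
             dst = $5
             sub(/\.[0-9]+:$/, "", dst)
             n[dst]++
         }
         END { for (d in n) print n[d], d }' | sort -k2
}

# Example call, reusing the capture command from above:
#   tcpdump src 169.254.79.209 -i bond1:1 -c 10 -s 1514 | udp_peers
```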

Oracle CRF resources

Another interesting new feature is the CRF resource, which seems to be an implementation of the IPD/OS Cluster Health Monitor on the servers. I need to dig a little deeper into this feature; currently I can’t get any configuration data from the cluster:

[grid@node1] $ oclumon showobjects

 Following nodes are attached to the loggerd
[grid@node1] $

You will see some additional background processes now, namely ologgerd and osysmond.bin, which are started through the CRF resource. The resource profile (shown below) suggests that this resource is started through OHASD’s ORAROOTAGENT and can take custom logging levels.

[grid]grid@node1 $ crsctl stat res ora.crf -p -init
NAME=ora.crf
TYPE=ora.crf.type
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:grid:r-x
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
AUTO_START=always
CARDINALITY=1
CHECK_ARGS=
CHECK_COMMAND=
CHECK_INTERVAL=30
CLEAN_ARGS=
CLEAN_COMMAND=
DAEMON_LOGGING_LEVELS=CRFMOND=0,CRFLDREP=0,...,CRFM=0
DAEMON_TRACING_LEVELS=CRFMOND=0,CRFLDREP=0,...,CRFM=0
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION="Resource type for Crf Agents"
DETACHED=true
ENABLED=1
FAILOVER_DELAY=0
FAILURE_INTERVAL=3
FAILURE_THRESHOLD=5
HOSTING_MEMBERS=
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
ORA_VERSION=11.2.0.2.0
PID_FILE=
PLACEMENT=balanced
PROCESS_TO_MONITOR=
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_ARGS=
START_COMMAND=
START_DEPENDENCIES=hard(ora.gpnpd)
START_TIMEOUT=120
STATE_CHANGE_TEMPLATE=
STOP_ARGS=
STOP_COMMAND=
STOP_DEPENDENCIES=hard(shutdown:ora.gipcd)
STOP_TIMEOUT=120
UPTIME_THRESHOLD=1m
USR_ORA_ENV=

An investigation of orarootagent_root.log revealed that the rootagent indeed starts the CRF resource. This resource will start the ologgerd and osysmond processes, which then write their log files into $GRID_HOME/log/`hostname -s`/crf{logd,mond}.

Configuration of the daemons can be found in $GRID_HOME/ologgerd/init and $GRID_HOME/osysmond/init. Except for the PID file for the daemons there didn’t seem to be anything of value in the directory.

The command line of the ologgerd process shows its configuration options:

root 13984 1 0 Oct15 ? 00:04:00 /u01/crs/11.2.0.2/bin/ologgerd -M -d /u01/crs/11.2.0.2/crf/db/node1

The directory specified by the “-d” flag is where the process stores its logging information. The files are in BDB format, or Berkeley DB (now Oracle too). The oclumon tool should be able to read these files, but until I can persuade it to connect to the host there is no output.

CVU

Unlike the previous resources, the cvu resource is actually cluster aware. It’s the Cluster Verification Utility we all know from installing RAC. Going by the profile (shown below), I conclude that the utility is run through the grid software owner’s scriptagent and has exactly one incarnation on the cluster. The check is executed only every 6 hours (CHECK_INTERVAL=21600 seconds), and the resource is restarted if it fails. If you would like to run a manual check, simply execute the action script with the command line argument “check”.

[root@node1 tmp]# crsctl stat res ora.cvu -p
NAME=ora.cvu
TYPE=ora.cvu.type
ACL=owner:grid:rwx,pgrp:oinstall:rwx,other::r--
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=%CRS_HOME%/bin/cvures%CRS_SCRIPT_SUFFIX%
ACTIVE_PLACEMENT=1
AGENT_FILENAME=%CRS_HOME%/bin/scriptagent
AUTO_START=restore
CARDINALITY=1
CHECK_INTERVAL=21600
CHECK_RESULTS=
CHECK_TIMEOUT=600
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION=Oracle CVU resource
ENABLED=1
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
LOAD=1
LOGGING_LEVEL=1
NLS_LANG=
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=balanced
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
SERVER_POOLS=*
START_DEPENDENCIES=hard(ora.net1.network)
START_TIMEOUT=0
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(ora.net1.network)
STOP_TIMEOUT=0
TYPE_VERSION=1.1
UPTIME_THRESHOLD=1h
USR_ORA_ENV=
VERSION=11.2.0.2.0
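
The intervals and timeouts in these profiles are given in seconds, which is not always easy to read at a glance. A minimal sketch for the conversion; seconds_to_hm is a hypothetical helper name.

```shell
# seconds_to_hm SECONDS: print a duration given in seconds as "<h>h <m>m",
# using integer arithmetic (leftover seconds are dropped).
seconds_to_hm() {
    echo "$(( $1 / 3600 ))h $(( $1 % 3600 / 60 ))m"
}

# Example calls with values from the ora.cvu profile:
#   seconds_to_hm 21600   # CHECK_INTERVAL
#   seconds_to_hm 600     # CHECK_TIMEOUT and SCRIPT_TIMEOUT
```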

The action script $GRID_HOME/bin/cvures implements the usual callbacks required by scriptagent: start(), stop(), check(), clean(), abort(). All log information goes into $GRID_HOME/log/`hostname -s`/cvu.

The actual check performed is this one: $GRID_HOME/bin/cluvfy comp health -_format & > /dev/null 2>&1

Summary

Enough for now; this has become a far longer post than I initially anticipated. There are so many more new things around that need exploring, like Quality of Service, which makes it very difficult to keep up.

Build your own 11.2.0.2 stretched RAC

Finally time for a new series! With the arrival of the new 11.2.0.2 patchset I thought it was about time to try and set up a virtual 11.2.0.2 extended distance or stretched RAC. So, it’s virtual, fair enough. It doesn’t allow me to test things like the impact of latency on the inter-SAN communication, but it allowed me to test the general setup. Think of this series as a guide for after all the tedious work has been done and the SANs happily talk to each other. The example requires some understanding of how Xen virtualisation works, and it’s tailored to openSuSE 11.2 as the dom0 or “host”. I have tried OracleVM in the past, but back then a domU (or virtual machine) could not mount an iSCSI target without a kernel panic and reboot. Clearly not what I needed at the time. OpenSuSE has another advantage: it uses a recent kernel, not the three-year-old 2.6.18 you find in enterprise distributions. Also, Xen is recent (openSuSE 11.3 even features Xen 4.0!) and so is libvirt.

The Setup

The general idea follows the design you find in the field, but with fewer cluster nodes. I am thinking of 2 nodes for the cluster, and 2 iSCSI target providers. I wouldn’t use iSCSI in the real world, but my lab isn’t connected to an EVA or similar. A third site will provide quorum via a voting disk on NFS.

Site A will consist of filer01 for the storage part, and edcnode1 as the RAC node. Site B will consist of filer02 and edcnode2. The iSCSI targets are going to be provided by openFiler installed in a domU, and the cluster nodes will make use of Oracle Enterprise Linux 5 update 5. To make it more realistic, site C will consist of another openFiler instance, filer03, to provide the NFS export for the third voting disk. Note that openFiler seems to support NFS v3 only at the time of writing. All systems are 64-bit.

The network connectivity will go through 3 virtual switches, all “host only” on my dom0.

  • Public network: 192.168.99.0/24
  • Private network: 192.168.100.0/24
  • Storage network: 192.168.101.0/24

As in the real world, the private and storage networks have to be separated to prevent iSCSI packets clashing with Cache Fusion traffic. Also, I increased the MTU for the private and storage networks to 9000 instead of the default 1500. If you would like to use jumbo frames, you should check that your switch supports them.
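
One common way to check that jumbo frames really pass end to end is a ping with fragmentation prohibited and a payload that exactly fills the MTU. The payload size is the MTU minus 20 bytes of IPv4 header (without options) and 8 bytes of ICMP header. A small sketch of that arithmetic; icmp_payload_for_mtu is a hypothetical helper, and the peer address in the usage comment is an assumption.

```shell
# icmp_payload_for_mtu MTU: print the largest ICMP echo payload that fits
# into one unfragmented IPv4 packet of the given MTU
# (MTU - 20 bytes IPv4 header without options - 8 bytes ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Example call on Linux, assuming 192.168.100.2 is a peer on the private
# network; "-M do" prohibits fragmentation:
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" 192.168.100.2
```

If the switch or one of the interfaces does not handle the larger frames, such a ping fails while a ping with the default size still works.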

Grid Infrastructure will use ASM to store OCR and voting disks, and the inter-SAN replication will also be performed by ASM in normal redundancy. I am planning on using preferred mirror read and intelligent data placement to see if that makes a difference.

Known limitations

This setup has some limitations, such as the following ones:

  • You cannot test inter-site SAN connectivity problems
  • You cannot make use of udev for the ASM devices: a Xen domU doesn’t report anything back from /sbin/scsi_id, which makes the mapping to /dev/mapper impossible (maybe someone knows a workaround?)
  • Network interfaces are not bonded, whereas you certainly would use bonded NICs in real life
  • No “real” fibre channel connectivity between the cluster nodes

So much for the introduction. I’ll post the setup step-by-step. The intended series will consist of these articles:

  1. Introduction to XEN on openSuSE 11.2 and dom0 setup
  2. Introduction to openFiler and its installation as a virtual machine
  3. Setting up the cluster nodes
  4. Installing Grid Infrastructure 11.2.0.2
  5. Adding third voting disk on NFS
  6. Installing RDBMS binaries
  7. Creating a database

That’s it for today, I hope I got you interested and following the series. It’s been real fun doing it; now it’s about writing it all up.

UKOUG RAC&HA SIG September 2010

Just a quick one to announce that I’ll present at said event. Here’s the short synopsis of my talk:

Upgrading to Oracle Real Application Cluster 11.2

With the end of premier support in sight in mid-2011, many businesses are starting to look at possible upgrade paths. With the majority of RAC systems deployed on Oracle 10g, there is a strong demand to upgrade these systems to 11.2. The presentation focuses on different upgrade paths, including Grid Infrastructure and the RDBMS. Alternative approaches to upgrading the software will be discussed as well. Experience from migrations performed at a large financial institution rounds off the presentation.

The renamedg command revisited: ASMLib

I have already written about the renamedg command, but since then fell in love with ASMLib. The use of ASMLib introduces a few caveats you should be aware of.

USAGE NOTES

This document presents research I performed with ASM in a lab environment. It should be applicable to any environment, but you should NOT use this for production: the renamedg command is still buggy, and you should not mess with ASM disk headers in an important system such as production or staging/UAT. You set the importance here! The recommended setup for cloning disk groups is to use a Data Guard physical standby database on a different storage array to create a real-time copy of your production database on that array. Again, do not use your production array for this!

Walking through a renamedg session

Oracle ASMLib introduces a new value to the ASM header, called the provider string, as the following example shows:

[root@asmtest ~]# kfed read /dev/oracleasm/disks/VOL1 | grep prov
kfdhdb.driver.provstr:     ORCLDISKVOL1 ; 0x000: length=12

This can be verified with ASMLib:

[root@asmtest ~]# /etc/init.d/oracleasm querydisk /dev/xvdc1
Device "/dev/xvdc1" is marked an ASM disk with the label "VOL1"

The prefix “ORCLDISK” is automatically added by ASMLib and cannot easily be changed.
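
Going by the two outputs above, the ASMLib label is simply the provider string with the “ORCLDISK” prefix removed. A minimal sketch to derive the label from a header dump, assuming the kfed output format shown above; asmlib_label is a hypothetical helper name.

```shell
# asmlib_label: read "kfed read <device>" output on stdin and print the
# ASMLib label, i.e. the provider string without its ORCLDISK prefix.
# Hypothetical helper based on the kfed output format shown above.
asmlib_label() {
    awk '$1 == "kfdhdb.driver.provstr:" {
             label = $2
             sub(/^ORCLDISK/, "", label)
             print label
         }'
}

# Example call:
#   kfed read /dev/oracleasm/disks/VOL1 | asmlib_label
```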

The problem with ASMLib is that the renamedg command does NOT update the provider string, which I’ll illustrate by walking through an example session. Disk group “DATA”, set up with external redundancy and two disks, DATA1 and DATA2, is to be cloned to “DATACLONE”.

The renamedg command requires the disk group to be cloned to be stopped. To prevent nasty surprises, you should stop the databases using that diskgroup manually.

[grid@rac11gr2drnode1 ~]$ srvctl stop database -d dev
[grid@rac11gr2drnode1 ~]$ ps -ef | grep smon
grid      3424     1  0 Aug07 ?        00:00:00 asm_smon_+ASM1
grid     17909 17619  0 15:13 pts/0    00:00:00 grep smon
[grid@rac11gr2drnode1 ~]$ srvctl stop diskgroup -g data
[grid@rac11gr2drnode1 ~]$

You can use the new “lsof” command of asmcmd to check for open files:

ASMCMD> lsof
DB_Name  Instance_Name  Path
+ASM     +ASM1          +ocrvote.255.4294967295
asmvol   +ASM1          +acfsdg/APACHEVOL.256.724157197
asmvol   +ASM1          +acfsdg/DRL.257.724157197
ASMCMD>

So apart from files from other disk groups no files are open, especially not referring to disk group DATA.

Now comes the part where you copy the LUNs, and this entirely depends on your system. The EVA series of storage arrays I worked with in this particular project offered a “snapclone” function, which used COW to create an identical copy of the source LUN, with a new WWID (which can be an input parameter to the snapclone call). When you are using device-mapper-multipath then ensure that your sys admins add the newly created LUNs to the /etc/multipath.conf file on all cluster nodes!

I am using Xen in my lab, which makes it simpler: all I need to do is to copy the disk containers on the dom0 and then add the new block devices to the running domU (“virtual machine” in Xen language). This can be done easily as the following example shows:

Usage: xm block-attach <Domain> <BackDev> <FrontDev> <Mode> [BackDomain]

xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/rac11gr2drShared/oradata1.clone xvdg w!
xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/rac11gr2drShared/oradata1.clone xvdg w!

xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/rac11gr2drShared/oradata2.clone xvdh w!
xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/rac11gr2drShared/oradata2.clone xvdh w!

In the example, rac11gr2drnode{1,2} are the domUs, the backend device is the copied file on the file system, the front end device in the domU is xvd{g,h}, and the mode is read/write, shareable. The exclamation mark here is crucial, or else the second domU can’t mount the new block device because it is already exclusively mounted to another domU.

The fdisk command in my example immediately “sees” the new LUNs. With device mapper multipathing you might have to go through iterations of restarting multipathd and discovering partitions using kpartx. It is again very important to have all disks presented to all cluster nodes!

Here’s the sample output from my system:

[root@rac11gr2drnode1 ~]# fdisk -l | grep Disk | sort
Disk /dev/xvda: 4294 MB, 4294967296 bytes
Disk /dev/xvdb: 16.1 GB, 16106127360 bytes
Disk /dev/xvdc: 5368 MB, 5368709120 bytes
Disk /dev/xvdd: 16.1 GB, 16106127360 bytes
Disk /dev/xvde: 16.1 GB, 16106127360 bytes
Disk /dev/xvdf: 10.7 GB, 10737418240 bytes
Disk /dev/xvdg: 16.1 GB, 16106127360 bytes
Disk /dev/xvdh: 16.1 GB, 16106127360 bytes

I cloned /dev/xvdd and /dev/xvde to /dev/xvdg and /dev/xvdh.

Do NOT run /etc/init.d/oracleasm scandisks yet! Otherwise the renamedg command will complain about duplicate disk names, which is entirely reasonable.

I dumped all headers for disks /dev/xvd{d,e,g,h}1 to /tmp to be able to compare.

[root@rac11gr2drnode1 ~]# kfed read /dev/xvdd1 > /tmp/xvdd1.header
# repeat with the other disks

Start with phase one of the renamedg command:

[root@rac11gr2drnode1 ~]# renamedg phase=one dgname=DATA newdgname=DATACLONE \
> confirm=true verbose=true config=/tmp/cfg

Parsing parameters..

Parameters in effect:

 Old DG name       : DATA
 New DG name          : DATACLONE
 Phases               :
 Phase 1
 Discovery str        : (null)
 Confirm            : TRUE
 Clean              : TRUE
 Raw only           : TRUE
renamedg operation: phase=one dgname=DATA newdgname=DATACLONE confirm=true
  verbose=true config=/tmp/cfg
Executing phase 1
Discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with
  disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with
  disk number:1 and timestamp (32940276 1937075200)
Checking for hearbeat...
Re-discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with
  disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with
  disk number:1 and timestamp (32940276 1937075200)
Checking if the diskgroup is mounted
Checking disk number:0
Checking disk number:1
Checking if diskgroup is used by CSS
Generating configuration file..
Completed phase 1
Terminating kgfd context 0x2b7a2fbac0a0
[root@rac11gr2drnode1 ~]#

You should always check “$?” for errors. The message “Terminating kgfd context” sounds bad, but isn’t. At the end of phase one there is no change to the header; only phase two changes it:

[root@rac11gr2drnode1 ~]# renamedg phase=two dgname=DATA newdgname=DATACLONE config=/tmp/cfg

Parsing parameters..
renamedg operation: phase=two dgname=DATA newdgname=DATACLONE config=/tmp/cfg
Executing phase 2
Completed phase 2

Now there are changes:

[root@rac11gr2drnode1 tmp]# grep DATA *header
xvdd1.header:kfdhdb.driver.provstr:    ORCLDISKDATA1 ; 0x000: length=13
xvdd1.header:kfdhdb.dskname:                   DATA1 ; 0x028: length=5
xvdd1.header:kfdhdb.grpname:               DATACLONE ; 0x048: length=9
xvdd1.header:kfdhdb.fgname:                    DATA1 ; 0x068: length=5
xvde1.header:kfdhdb.driver.provstr:    ORCLDISKDATA2 ; 0x000: length=13
xvde1.header:kfdhdb.dskname:                   DATA2 ; 0x028: length=5
xvde1.header:kfdhdb.grpname:               DATACLONE ; 0x048: length=9
xvde1.header:kfdhdb.fgname:                    DATA2 ; 0x068: length=5
xvdg1.header:kfdhdb.driver.provstr:    ORCLDISKDATA1 ; 0x000: length=13
xvdg1.header:kfdhdb.dskname:                   DATA1 ; 0x028: length=5
xvdg1.header:kfdhdb.grpname:                    DATA ; 0x048: length=4
xvdg1.header:kfdhdb.fgname:                    DATA1 ; 0x068: length=5
xvdh1.header:kfdhdb.driver.provstr:    ORCLDISKDATA2 ; 0x000: length=13
xvdh1.header:kfdhdb.dskname:                   DATA2 ; 0x028: length=5
xvdh1.header:kfdhdb.grpname:                    DATA ; 0x048: length=4
xvdh1.header:kfdhdb.fgname:                    DATA2 ; 0x068: length=5

Although the original disks (/dev/xvdd1 and /dev/xvde1) had their disk group name changed, the provider string remained untouched. So if we were to issue a scandisks command now through /etc/init.d/oracleasm, there’d still be duplicate disk names. This is a bug in my opinion, and a bad thing.
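
The duplicates can be spotted mechanically from the header dumps. The sketch below assumes the "file:field: value" layout produced by grep over the *header files as shown above; dup_provstr is a hypothetical helper name.

```shell
# dup_provstr: read "grep provstr *header" style output on stdin and print
# every provider string found in more than one header dump, followed by the
# files it was found in. Hypothetical helper for the dumps shown above.
dup_provstr() {
    awk '/kfdhdb\.driver\.provstr:/ {
             split($1, p, ":")                 # p[1] is the dump file name
             files[$2] = files[$2] " " p[1]
             n[$2]++
         }
         END { for (s in n) if (n[s] > 1) print s ":" files[s] }' | sort
}

# Example call:
#   grep provstr /tmp/*header | dup_provstr
```

Any line of output names a provider string that ASMLib would see twice on the next scandisks.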

Renaming the disks is straightforward; the difficult bit is finding out which ones have to be renamed. Again, you can use kfed to figure that out. After consulting the header information I knew the disks to be renamed were /dev/xvdd1 and /dev/xvde1.

[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvdd1 DATACLONE1
Renaming disk "/dev/xvdd1" to "DATACLONE1":                [  OK  ]
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvde1 DATACLONE2
Renaming disk "/dev/xvde1" to "DATACLONE2":                [  OK  ]

I then performed a scandisks operation on all nodes just to be sure… I had corruption of the disk group before :)

[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]
[root@rac11gr2drnode1 tmp]#

[root@rac11gr2drnode2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]
[root@rac11gr2drnode2 ~]#

The output on all cluster nodes should be identical. On my system I found the following disks:

[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm listdisks
ACFS1
ACFS2
ACFS3
ACFS4
DATA1
DATA2
DATACLONE1
DATACLONE2
VOL1
VOL2
VOL3
VOL4
VOL5

Sure enough, the cloned disks were present. Although everything seemed ok at this point, I could not start disk group DATA and had to reboot the cluster nodes to rectify that problem. Maybe there is some not so transient information stored somewhere about ASM disks. After the reboot, CRS started my database correctly, and with all dependent resources:

[oracle@rac11gr2drnode1 ~]$ srvctl status database -d dev
Instance dev1 is running on node rac11gr2drnode1
Instance dev2 is running on node rac11gr2drnode2
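
Going back to the listdisks step: checking by eye that every node reports the same disks gets tedious with more than two nodes. A minimal sketch of how the comparison could be automated is below. The helper name same_disks and the temporary file names are made up for illustration, and I am assuming root ssh access between the nodes.

```shell
#!/bin/sh
# Sketch only: same_disks is a hypothetical helper.
# Succeeds when two files list the same ASMLib labels, regardless of order.
same_disks() {
    [ "$(sort "$1")" = "$(sort "$2")" ]
}

# Possible use, as root on the first node:
#   /etc/init.d/oracleasm listdisks > /tmp/node1.disks
#   ssh rac11gr2drnode2 /etc/init.d/oracleasm listdisks > /tmp/node2.disks
#   same_disks /tmp/node1.disks /tmp/node2.disks || echo "disk lists differ"
```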

Where are the logs for the SCAN listeners?

Quick post and note to self. Where are the SCAN listener log files? A little bit of troubleshooting was required, but I guess I could have read the manuals too. In the end it turned out to be quite simple!

First of all, I needed to find out where on my four node cluster I had a SCAN listener. This is done quite easily by asking Clusterware:

[grid@rac11gr2node2 ~]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node rac11gr2node2
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node rac11gr2node4
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node rac11gr2node3
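
If you only want the listener-to-node mapping, the srvctl output can be boiled down with a one-line awk filter. This is a sketch that assumes the "is running on node" wording shown above; the function name scan_listener_nodes is mine.

```shell
#!/bin/sh
# Sketch only: scan_listener_nodes is a hypothetical helper.
# Reads `srvctl status scan_listener` output from stdin and prints
# one "listener node" pair per running SCAN listener.
scan_listener_nodes() {
    awk '/is running on node/ { print $3, $NF }'
}

# Possible use:
#   srvctl status scan_listener | scan_listener_nodes
```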

I was initially on the first node, so had to ssh to the second. From there on I thought that the proc file system might have the answer. I needed to get the PID of the SCAN listener first:

[grid@rac11gr2node2 ~]$ ps -ef | grep -i scan
grid      4738     1  0 Jun03 ?        00:00:13 /u01/app/grid/product/11.2.0/crs/bin/tnslsnr LISTENER_SCAN1 -inherit
grid     24694 24147  0 20:55 pts/0    00:00:00 grep -i scan

Now /proc/4738/fd lists all the open file descriptors used by the SCAN listener. Surely the log.xml file would be there somewhere:

[grid@rac11gr2node2 ~]$ ll /proc/4738/fd
total 0
lrwx------ 1 grid oinstall 64 Jun 16 20:46 0 -> /dev/null
lrwx------ 1 grid oinstall 64 Jun 16 20:46 1 -> /dev/null
lrwx------ 1 grid oinstall 64 Jun 16 20:46 10 -> socket:[20906]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 11 -> socket:[20908]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 12 -> socket:[20927]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 13 -> socket:[20957]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 14 -> socket:[20958]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 15 -> socket:[22991]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 16 -> socket:[10712179]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 17 -> socket:[10173760]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 18 -> socket:[10176036]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 19 -> socket:[9106216]
lrwx------ 1 grid oinstall 64 Jun 16 20:46 2 -> /dev/null
lr-x------ 1 grid oinstall 64 Jun 16 20:46 3 -> /u01/app/grid/product/11.2.0/crs/rdbms/mesg/diaus.msb
lr-x------ 1 grid oinstall 64 Jun 16 20:46 4 -> /proc/4738/fd
lr-x------ 1 grid oinstall 64 Jun 16 20:46 5 -> /u01/app/grid/product/11.2.0/crs/network/mesg/nlus.msb
lr-x------ 1 grid oinstall 64 Jun 16 20:46 6 -> pipe:[20893]
lr-x------ 1 grid oinstall 64 Jun 16 20:46 7 -> /u01/app/grid/product/11.2.0/crs/network/mesg/tnsus.msb
lrwx------ 1 grid oinstall 64 Jun 16 20:46 8 -> socket:[20904]
l-wx------ 1 grid oinstall 64 Jun 16 20:46 9 -> pipe:[20894]

Well maybe not. Next option is to query the listener itself via lsnrctl. Nothing easier than that:

LSNRCTL> set current_listener LISTENER_SCAN1
Current Listener is LISTENER_SCAN1
LSNRCTL> show log_file
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1)))
LISTENER_SCAN1 parameter "log_file" set to /u01/app/grid/product/11.2.0/crs/log/diag/tnslsnr/rac11gr2node2/listener_scan1/alert/log.xml
The command completed successfully
LSNRCTL>

Aha, it uses the ADR as well. So back there, change the base and query the file:

[grid@rac11gr2node2 ~]$ adrci

ADRCI: Release 11.2.0.1.0 - Production on Wed Jun 16 20:58:17 2010

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

ADR base = "/u01/app/oracle"
adrci> set base /u01/app/grid/product/11.2.0/crs/log
adrci> show home
ADR Homes:
diag/tnslsnr/rac11gr2node2/listener_scan1
diag/tnslsnr/rac11gr2node2/listener_scan3
diag/tnslsnr/rac11gr2node2/listener_scan2
adrci> set home diag/tnslsnr/rac11gr2node2/listener_scan1
adrci> show alert -tail
2010-06-16 20:58:25.021000 +01:00
16-JUN-2010 20:58:25 * service_update * polstdby_1 * 0
2010-06-16 20:58:27.441000 +01:00
16-JUN-2010 20:58:27 * service_update * poldb_2 * 0
2010-06-16 20:58:30.444000 +01:00
16-JUN-2010 20:58:30 * service_update * poldb_2 * 0
16-JUN-2010 20:58:30 * service_update * poldb_1 * 0
2010-06-16 20:58:33.442000 +01:00
16-JUN-2010 20:58:33 * service_update * poldb_2 * 0
2010-06-16 20:58:35.784000 +01:00
16-JUN-2010 20:58:35 * service_update * prod1 * 0
16-JUN-2010 20:58:36 * service_update * poldb_2 * 0
16-JUN-2010 20:58:36 * service_update * poldb_1 * 0
2010-06-16 20:58:39.546000 +01:00
16-JUN-2010 20:58:39 * service_update * poldb_2 * 0
16-JUN-2010 20:58:39 * service_update * poldb_1 * 0
2010-06-16 20:58:42.574000 +01:00
16-JUN-2010 20:58:42 * service_update * poldb_2 * 0
2010-06-16 20:58:45.574000 +01:00
16-JUN-2010 20:58:45 * service_update * poldb_2 * 0
2010-06-16 20:58:48.576000 +01:00
16-JUN-2010 20:58:48 * service_update * poldb_2 * 0
16-JUN-2010 20:58:48 * service_update * poldb_1 * 0
2010-06-16 20:58:51.575000 +01:00
16-JUN-2010 20:58:51 * service_update * poldb_2 * 0
16-JUN-2010 20:58:51 * service_update * poldb_1 * 0
2010-06-16 20:58:54.578000 +01:00
16-JUN-2010 20:58:54 * service_update * poldb_2 * 0

Job done.
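
The step from the log_file value to the adrci settings is plain path arithmetic: everything before /diag/ is the ADR base, and everything from diag/ up to the alert directory is the ADR home. A small sketch of that, with helper names of my own choosing, assuming the log path always has the .../diag/.../alert/log.xml layout seen above:

```shell
#!/bin/sh
# Sketch only: all three helper names are hypothetical.

# Pull the log file path out of `lsnrctl show log_file` output (stdin).
listener_log_path() {
    sed -n 's/^.* parameter "log_file" set to //p'
}

# ADR base: the part of the path before the first /diag/ component.
adr_base() {
    printf '%s\n' "${1%%/diag/*}"
}

# ADR home: from diag/ up to, but not including, the alert directory.
adr_home() {
    p=${1#*/diag/}
    printf 'diag/%s\n' "${p%/alert/*}"
}

# Possible use, with the lsnrctl output saved to a file first:
#   log=$(listener_log_path < /tmp/lsnrctl.out)
#   adrci exec="set base $(adr_base "$log"); set home $(adr_home "$log"); show alert -tail 20"
```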

crsctl status resource – state details are really useful

A very short post about a cool new feature I noticed today. RAC 11.2 has moved a lot of commands previously having their own syntax into crsctl. One of the cool new things is the fact that crsctl status resource -t (“tabular”) reports state details. Here I could see that my lab environment had a stuck archiver. Other state details include information about the cluster time synchronisation daemon ctss, or ASM instances. Have a look at my 4 node cluster:

[oracle@rac11gr2node2 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
 ONLINE  ONLINE       rac11gr2node1                                
 ONLINE  ONLINE       rac11gr2node2                                
 ONLINE  ONLINE       rac11gr2node3                                
 ONLINE  INTERMEDIATE rac11gr2node4                                
ora.LISTENER.lsnr
 ONLINE  ONLINE       rac11gr2node1                                
 ONLINE  ONLINE       rac11gr2node2                                
 ONLINE  ONLINE       rac11gr2node3                                
 ONLINE  ONLINE       rac11gr2node4                                
ora.OCRVOTE.dg
 ONLINE  ONLINE       rac11gr2node1                                
 ONLINE  ONLINE       rac11gr2node2                                
 ONLINE  ONLINE       rac11gr2node3                                
 ONLINE  INTERMEDIATE rac11gr2node4                                
ora.asm
 ONLINE  ONLINE       rac11gr2node1            Started             
 ONLINE  ONLINE       rac11gr2node2                                
 ONLINE  ONLINE       rac11gr2node3                                
 ONLINE  INTERMEDIATE rac11gr2node4                                
ora.eons
 ONLINE  ONLINE       rac11gr2node1                                
 ONLINE  ONLINE       rac11gr2node2                                
 ONLINE  ONLINE       rac11gr2node3                                
 ONLINE  ONLINE       rac11gr2node4                                
ora.gsd
 OFFLINE OFFLINE      rac11gr2node1                                
 OFFLINE OFFLINE      rac11gr2node2                                
 OFFLINE OFFLINE      rac11gr2node3                                
 OFFLINE OFFLINE      rac11gr2node4                                
ora.net1.network
 ONLINE  ONLINE       rac11gr2node1                                
 ONLINE  ONLINE       rac11gr2node2                                
 ONLINE  ONLINE       rac11gr2node3                                
 ONLINE  ONLINE       rac11gr2node4                                
ora.ons
 ONLINE  ONLINE       rac11gr2node1                                
 ONLINE  ONLINE       rac11gr2node2                                
 ONLINE  ONLINE       rac11gr2node3                                
 ONLINE  ONLINE       rac11gr2node4                                
ora.registry.acfs
 ONLINE  ONLINE       rac11gr2node1                                
 ONLINE  ONLINE       rac11gr2node2                                
 ONLINE  ONLINE       rac11gr2node3                                
 ONLINE  ONLINE       rac11gr2node4                                
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
 1        ONLINE  ONLINE       rac11gr2node2                                
ora.LISTENER_SCAN2.lsnr
 1        ONLINE  ONLINE       rac11gr2node4                                
ora.LISTENER_SCAN3.lsnr
 1        ONLINE  ONLINE       rac11gr2node3                                
ora.oc4j
 1        OFFLINE OFFLINE                                                   
ora.poldb.db
 1        ONLINE  INTERMEDIATE rac11gr2node3            Stuck Archiver      
 2        ONLINE  INTERMEDIATE rac11gr2node4            Stuck Archiver      
ora.poldb.drcp.svc
 1        ONLINE  ONLINE       rac11gr2node3                                
 2        ONLINE  INTERMEDIATE rac11gr2node4                                
ora.poldb.nondrcp.svc
 1        ONLINE  INTERMEDIATE rac11gr2node4                                
ora.polstdby.db
 1        ONLINE  INTERMEDIATE rac11gr2node4            Stuck Archiver      
 2        OFFLINE OFFLINE                                                   
ora.prod.batchserv.svc
 1        ONLINE  ONLINE       rac11gr2node2                                
 2        ONLINE  ONLINE       rac11gr2node1                                
ora.prod.db
 1        ONLINE  ONLINE       rac11gr2node1            Open                
 2        ONLINE  ONLINE       rac11gr2node2                                
ora.prod.reporting.svc
 1        ONLINE  ONLINE       rac11gr2node2                                
 2        ONLINE  ONLINE       rac11gr2node1                                
ora.rac11gr2node1.vip
 1        ONLINE  ONLINE       rac11gr2node1                                
ora.rac11gr2node2.vip
 1        ONLINE  ONLINE       rac11gr2node2                                
ora.rac11gr2node3.vip
 1        ONLINE  ONLINE       rac11gr2node3                                
ora.rac11gr2node4.vip
 1        ONLINE  ONLINE       rac11gr2node4                                
ora.scan1.vip
 1        ONLINE  ONLINE       rac11gr2node2                                
ora.scan2.vip
 1        ONLINE  ONLINE       rac11gr2node4                                
ora.scan3.vip
 1        ONLINE  ONLINE       rac11gr2node3                                

Nice!
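
On a bigger cluster the listing gets long, so here is a sketch of a filter that keeps only the rows in INTERMEDIATE state together with their state details. It assumes the tabular layout shown above and resource names starting with "ora."; the function name intermediate_resources is made up.

```shell
#!/bin/sh
# Sketch only: intermediate_resources is a hypothetical helper.
# Reads `crsctl stat res -t` output from stdin and prints
# "resource server details" for every row in INTERMEDIATE state.
# A dash is printed when the STATE_DETAILS column is empty.
intermediate_resources() {
    awk '
        /^ora\./ { res = $1; next }        # remember the current resource name
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "INTERMEDIATE") {
                    details = ""
                    for (j = i + 2; j <= NF; j++)
                        details = details (details == "" ? "" : " ") $j
                    printf "%s %s %s\n", res, $(i + 1), (details == "" ? "-" : details)
                }
            }
        }'
}

# Possible use:
#   crsctl stat res -t | intermediate_resources
```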

Display scheduler class for a process in Linux

The ps command in the ways I use it most (ps -ef and ps auxwww) doesn’t display the scheduling class for a process. Oracle have cunningly released a patchset to update Grid Infrastructure that changes the scheduling class of the VKTM and LMSn ASM processes to “Timeshare” instead of Realtime.

So far so good, but I had no idea how to display the scheduling class of a process, so some man page reading and Internet research were in order. After some digging around I found out that using the BSD command line syntax combined with the “--format” option does the trick. The difficult bit was figuring out which format identifiers to use. All the information ps can get about a process is recorded in /proc/pid/stat. Parsing this with a keen eye, however, proves difficult due to the sheer number of fields in the file. So back to using ps(1).

Here’s the example. Before applying the workaround to the patch, Oracle ASM’s VKTM (virtual keeper of time) and LMSn (global cache services process) run in the TS (timeshare) scheduling class:

[oracle@rac11gr2node2 ~]$ ps ax --format uname,pid,ppid,tty,cmd,cls,pri,rtprio | egrep "(vktm|lms)" | grep asm
grid      4296     1 ?        asm_vktm_+ASM2               TS  24      -
grid      4318     1 ?        asm_lms0_+ASM2               TS  24      -

After applying the workaround the scheduling class changed:

[oracle@rac11gr2node1 ~]$ ps ax --format uname,pid,ppid,tty,cmd,cls,pri,rtprio | egrep "(vktm|lms)" | grep asm
grid      2352     1 ?        asm_vktm_+ASM1               RR  41      1
grid      2374     1 ?        asm_lms0_+ASM1               RR  41      1

Notice how the cls field changed, and also that the rtprio is now populated. I have learned something new today.
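
For completeness, the same two values can be dug out of /proc/pid/stat after all. According to proc(5), field 40 is rt_priority and field 41 is the scheduling policy. The sketch below strips the leading "pid (comm)" part first, because the command name may contain spaces, and then maps the numeric policy to the class codes ps prints. The function name sched_from_stat is my own, and the mapping of policy numbers to codes is my reading of the proc and ps man pages.

```shell
#!/bin/sh
# Sketch only: sched_from_stat is a hypothetical helper.
# Reads one /proc/<pid>/stat line from stdin and prints "class rtprio",
# using the same class codes as the cls column of ps.
sched_from_stat() {
    # Drop everything up to the last ") " so that field 3 (state) becomes $1;
    # original fields 40 and 41 are then $38 and $39.
    sed 's/^.*) //' | awk '{
        split("TS FF RR B ISO IDL DLN", name, " ")   # policy numbers 0..6
        rtprio = $38
        policy = $39
        cls = ((policy + 1) in name) ? name[policy + 1] : "?"
        printf "%s %s\n", cls, (rtprio == 0 ? "-" : rtprio)
    }'
}

# Possible use:
#   sched_from_stat < /proc/4296/stat
```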