Top 60 Oracle Blogs


First contact with Oracle RAC

As you may know, Oracle released the first patchset on top of 11g Release 2. At the time of this writing, the patchset is out for 32bit and 64bit Linux, 32bit and 64bit Solaris SPARC and Intel. What an interesting combination of platforms… I thought there was no Solaris 32bit on Intel anymore.


Oracle has come up with a fundamentally different approach to patching with this patchset. The long version of this can be found in MOS document 1189783.1 “Important Changes to Oracle Database Patch Sets Starting With”. The short version is that new patches will be supplied as full releases. This is really cool, and some people have asked why that wasn’t always the case. In 10g Release 2, to get to the latest version with all the patches, you had to

  • Install the base release for Clusterware, ASM and at least one RDBMS home
  • Install the latest patchset on Clusterware, ASM and the RDBMS home
  • Apply the latest PSU for Clusterware/RDBMS, ASM and RDBMS

Applying the PSUs for Clusterware was especially labour intensive. In fact, for a fresh install it was usually easier to install and patch everything on only one node and then extend the patched software homes to the other nodes of the cluster.

Now things are different. You no longer have to apply any of the interim releases: the patch contains everything you need, already on the correct version. The above process is shortened to:

  • Install Grid Infrastructure
  • Install RDBMS home

Optionally, apply PSUs or other patches when they become available. Currently, MOS note 756671.1 doesn’t list any patch as recommended on top of this patchset.

Interestingly, upgrading from the 11g Release 2 base release is more painful than upgrading from Oracle 10g, at least on the Linux platform. Before it runs, the upgrade script tests whether you have applied the Grid Infrastructure PSU. OUI hadn’t performed this test when it checked for prerequisites, which caught me off guard. The casual observer may now ask: why do I have to apply a PSU when the bug fixes should be rolled up into the patchset anyway? I honestly don’t have an answer, other than that if you are not on Linux you should be fine.

Grid Infrastructure will be an out-of-place upgrade, which means you have to manage your local disk space very carefully from now on. I would not use anything less than 50-75G on my Grid Infrastructure mount point. This takes the new cluster health monitor facility (see below) into account, as well as the fact that Oracle performs log rotation for most logs in $GRID_HOME/log.
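The space recommendation above can be turned into a quick check before starting an out-of-place upgrade. This is only a minimal sketch: the helper name has_free_gb and the mount point /u01 are assumptions for illustration, not anything Oracle ships.

```shell
#!/bin/sh
# Sketch: succeed if the file system holding a path has at least N GB free.
# has_free_gb is a hypothetical helper name, not an Oracle tool.
has_free_gb() {
    target=$1
    want_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$target" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((want_gb * 1024 * 1024)) ]
}

# /u01 is an assumed Grid Infrastructure mount point; override via GRID_MOUNT.
if has_free_gb "${GRID_MOUNT:-/u01}" 50 2>/dev/null; then
    echo "at least 50G free for the new Grid Infrastructure home"
else
    echo "less than 50G free (or mount point not found)"
fi
```

Adjust the threshold to the upper end of the range if the cluster health monitor repository lives on the same file system.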

The RDBMS binaries can be patched either in-place or out-of-place. I’d say that the out-of-place upgrade for RDBMS binaries is wholeheartedly recommended as it makes backing out a change so much easier. As I said, you don’t have a choice for Grid Infrastructure which is always out-of-place.

And then there is the multicast issue Julian Dyke has written about. I couldn’t reproduce the test case, and my lab and real-life clusters run happily with the patchset.

Changes to Grid Infrastructure

After the successful upgrade you’d be surprised to find new resources in Grid Infrastructure. Have a look at these:

[grid@node1] $ crsctl stat res -t -init
Cluster Resources
 1        ONLINE  ONLINE       node1           Started
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1           OBSERVER
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1
 1        ONLINE  ONLINE       node1

The cluster_interconnect.haip resource is yet another step towards the self-contained system. The Grid Infrastructure installation guide for Linux states:

“With Redundant Interconnect Usage, you can identify multiple interfaces to use for the cluster private network, without the need of using bonding or other technologies. This functionality is available starting with Oracle Database 11g Release 2 (”

So this is good news for anyone relying on third-party software, for example HP ServiceGuard, for network bonding. Linux has always done this for you, even in the times of the 2.4 kernel, and Linux network bonding is actually quite simple to set up as well. But anyway, when I have time I’ll run a few tests in the lab with this new feature enabled, deliberately taking down NICs to see if it works as labelled on the tin. The documentation states that you don’t need to bond your NICs for the private interconnect: simply leave the ethx devices (or whatever names your NICs have on your OS) as they are, and mark the ones you would like to use for the private interconnect as private during the installation. If you decide to add a NIC to the cluster for use with the private interconnect later, use oifcfg as root to add the new interface (or watch this space for a later blog post on this). Oracle states that if one of the private interconnects fails, it will transparently use another one. In addition to the high availability benefit, Oracle apparently also performs load balancing across the configured interconnects.

To learn more about the redundant interconnect feature I had a glance at its profile. As with any resource in the lower stack (or HA stack), you need to append the “-init” argument to crsctl.

[oracle@node1] $ crsctl stat res ora.cluster_interconnect.haip -p -init
DESCRIPTION="Resource type for a Highly Available network IP"

With this information at hand, we see that the resource is controlled through ORAROOTAGENT, and judging from the start sequence position and the fact that we queried crsctl with the “-init” flag, it must be OHASD’s ORAROOTAGENT.

Indeed, there are references to it in the $GRID_HOME/log/`hostname -s`/agent/ohasd/orarootagent_root/ directory. Further reference to the resource was found in cssd.log which makes perfect sense: it will use it for many things, last but not least fencing.

[ USRTHRD][1122056512] {0:0:2} HAIP: configured to use 1 interfaces
[ USRTHRD][1122056512] {0:0:2} HAIP:  Updating member info HAIP1;
[ USRTHRD][1122056512] {0:0:2} InitializeHaIps[ 0]  infList 'inf bond1, ip, sub'
[ USRTHRD][1122056512] {0:0:2} HAIP:  starting inf 'bond1', suggestedIp '', assignedIp ''
[ USRTHRD][1122056512] {0:0:2} Thread:[NetHAWork]start {
[ USRTHRD][1122056512] {0:0:2} Thread:[NetHAWork]start }
[ USRTHRD][1089194304] {0:0:2} [NetHAWork] thread started
[ USRTHRD][1089194304] {0:0:2}  Arp::sCreateSocket {
[ USRTHRD][1089194304] {0:0:2}  Arp::sCreateSocket }
[ USRTHRD][1089194304] {0:0:2} Starting Probe for ip
[ USRTHRD][1089194304] {0:0:2} Transitioning to Probe State
[ USRTHRD][1089194304] {0:0:2}  Arp::sProbe {
[ USRTHRD][1089194304] {0:0:2} Arp::sSend:  sending type 1
[ USRTHRD][1089194304] {0:0:2}  Arp::sProbe }
[ USRTHRD][1122056512] {0:0:2} Completed 1 HAIP assignment, start complete
[ USRTHRD][1122056512] {0:0:2} USING HAIP[  0 ]:  bond1 -
[ora.cluster_interconnect.haip][1117854016] {0:0:2} [start] clsn_agent::start }
[    AGFW][1117854016] {0:0:2} Command: start for resource: ora.cluster_interconnect.haip 1 1 completed with status: SUCCESS
[    AGFW][1119955264] {0:0:2} Agent sending reply for: RESOURCE_START[ora.cluster_interconnect.haip 1 1] ID 4098:343
[    AGFW][1119955264] {0:0:2} ora.cluster_interconnect.haip 1 1 state changed from: STARTING to: ONLINE
[    AGFW][1119955264] {0:0:2} Started implicit monitor for:ora.cluster_interconnect.haip 1 1
[    AGFW][1119955264] {0:0:2} Agent sending last reply for: RESOURCE_START[ora.cluster_interconnect.haip 1 1] ID 4098:343

OK, I now understand this a bit better. But the log information mentioned something else as well, an IP address that I haven’t assigned to the cluster. It turns out that this IP address is another virtual IP on the private interconnect, called bond1:1.

[grid]grid@node1 $ /sbin/ifconfig
bond1     Link encap:Ethernet  HWaddr 00:23:7D:3d:1E:77
 inet addr:  Bcast:  Mask:
 inet6 addr: fe80::223:7dff:fe3c:1e74/64 Scope:Link
 RX packets:33155040 errors:0 dropped:0 overruns:0 frame:0
 TX packets:20677269 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:0
 RX bytes:21234994775 (19.7 GiB)  TX bytes:10988689751 (10.2 GiB)
bond1:1   Link encap:Ethernet  HWaddr 00:23:7D:3d:1E:77
 inet addr:  Bcast:  Mask:

Ah, something running multicast. I tried to sniff that traffic but couldn’t make any sense of it. There is UDP (not TCP) multicast traffic on that interface. This can be checked with tcpdump:

[root@node1 ~]# tcpdump src -i bond1:1 -c 10  -s 1514
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond1:1, link-type EN10MB (Ethernet), capture size 1514 bytes
14:30:18.704688 IP > UDP, length 252
14:30:18.704943 IP > UDP, length 252
14:30:18.705155 IP > UDP, length 252
14:30:18.895764 IP > UDP, length 192
14:30:18.895976 IP > UDP, length 296
14:30:18.897109 IP > UDP, length 192
14:30:18.897633 IP > UDP, length 192
14:30:18.897998 IP > UDP, length 192
14:30:18.902325 IP > UDP, length 192
14:30:18.902422 IP > UDP, length 296
10 packets captured
14 packets received by filter
0 packets dropped by kernel

If you are interested in the actual messages, use this command instead to capture a single packet:

[root@node1 ~]# tcpdump src -i bond1:1 -c 1 -X -s 1514
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond1:1, link-type EN10MB (Ethernet), capture size 1514 bytes
14:31:43.396614 IP > UDP, length 192
 0x0000:  4500 00dc 0000 4000 4011 ed04 a9fe 4fd1  E.....@.@.....O.
 0x0010:  a9fe a93e e5b3 3f32 00c8 4de6 0403 0201  ...>..?2..M.....
 0x0020:  e403 0000 0000 0000 4d52 4f4e 0003 0000  ........MRON....
 0x0030:  0000 0000 4d4a 9c63 0000 0000 0000 0000  ....MJ.c........
 0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
 0x0050:  a9fe 4fd1 4d39 0000 0000 0000 0000 0000  ..O.M9..........
 0x0060:  e403 0000 0000 0000 0100 0000 0000 0000  ................
 0x0070:  5800 0000 ff7f 0000 d0ff b42e 0f2b 0000  X............+..
 0x0080:  a01e 770d 0403 0201 0b00 0000 67f2 434c  ..w.........g.CL
 0x0090:  0000 0000 b1aa 0500 0000 0000 cf0f 3813  ..............8.
 0x00a0:  0000 0000 0400 0000 0000 0000 a1aa 0500  ................
 0x00b0:  0000 0000 0000 ae2a 644d 6026 0000 0000  .......*dM`&....
 0x00c0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
 0x00d0:  0000 0000 0000 0000 0000 0000            ............
1 packets captured
10 packets received by filter
0 packets dropped by kernel

Substitute the correct values for interface and source address, of course.
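To make sense of the first lines of the dump, the IPv4 and UDP headers can be decoded by hand. The sketch below does this for the first 28 bytes copied from the capture above, assuming a 20-byte IPv4 header without options (the leading byte 45 says so). The helper names are made up for illustration.

```shell
#!/bin/sh
# Sketch: decode IPv4/UDP header fields from the first 28 bytes of the
# hex dump above (offsets 0x0000 to 0x001b). Helper names are hypothetical.

# First 28 bytes of the captured packet, copied from the dump.
PACKET_HEX="450000dc000040004011ed04a9fe4fd1a9fea93ee5b33f3200c84de6"

# field STRING OFFSET LENGTH: substring by 0-based offset, in hex digits.
field() {
    printf '%s\n' "$1" | cut -c "$(($2 + 1))-$(($2 + $3))"
}

# hex_to_int HEX: hex digits to decimal.
hex_to_int() {
    printf '%d\n' "0x$1"
}

# hex_to_ip HEX: 8 hex digits to a dotted quad.
hex_to_ip() {
    printf '%d.%d.%d.%d\n' \
        "0x$(field "$1" 0 2)" "0x$(field "$1" 2 2)" \
        "0x$(field "$1" 4 2)" "0x$(field "$1" 6 2)"
}

# decode_ipv4_udp HEX: print the interesting header fields.
decode_ipv4_udp() {
    echo "total_length=$(hex_to_int "$(field "$1" 4 4)")"
    echo "protocol=$(hex_to_int "$(field "$1" 18 2)")"
    echo "src=$(hex_to_ip "$(field "$1" 24 8)")"
    echo "dst=$(hex_to_ip "$(field "$1" 32 8)")"
    echo "sport=$(hex_to_int "$(field "$1" 40 4)")"
    echo "dport=$(hex_to_int "$(field "$1" 44 4)")"
    echo "udp_length=$(hex_to_int "$(field "$1" 48 4)")"
}

decode_ipv4_udp "$PACKET_HEX"
```

For this packet it reports protocol 17 (UDP) and a UDP length of 200 bytes, which is the 8-byte UDP header plus the 192 bytes of payload tcpdump printed. Source and destination both decode to addresses in the 169.254.0.0/16 link-local range.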

Oracle CRF resources

Another interesting new feature is the CRF resource, which seems to be an implementation of IPD/OS Cluster Health Monitor on the servers. I need to dig a little deeper into this feature; currently I can’t get any configuration data from the cluster:

[grid@node1] $ oclumon showobjects

 Following nodes are attached to the loggerd
[grid@node1] $

You will see some additional background processes now, namely ologgerd and osysmond.bin, which are started through the CRF resource. The resource profile (shown below) suggests that this resource is started through OHASD’s ORAROOTAGENT and can take custom logging levels.

[grid]grid@node1 $ crsctl stat res ora.crf -p -init
DESCRIPTION="Resource type for Crf Agents"

An investigation of orarootagent_root.log revealed that the rootagent indeed starts the CRF resource. This resource will start the ologgerd and osysmond processes, which then write their log files into $GRID_HOME/log/`hostname -s`/crf{logd,mond}.

Configuration of the daemons can be found in $GRID_HOME/ologgerd/init and $GRID_HOME/osysmond/init. Except for the PID file for the daemons there didn’t seem to be anything of value in the directory.

The command line of the ologgerd process shows its configuration options:

root 13984 1 0 Oct15 ? 00:04:00 /u01/crs/ -M -d /u01/crs/

The directory specified by the “-d” flag is where the process stores its logging information. The files are in BDB format, or Berkeley DB (now Oracle too). The oclumon tool should be able to read these files, but until I can persuade it to connect to the host there is no output.


Unlike the previous resources, the cvu resource is actually cluster aware. It’s the Cluster Verification Utility we all know from installing RAC. Going by the profile (shown below), I conclude that the utility is run through the grid software owner’s scriptagent and has exactly 1 incarnation on the cluster. It is only executed every 6 hours and restarted if it fails. If you would like to run a manual check, simply execute the action script with the command line argument “check”.

[root@node1 tmp]# crsctl stat res ora.cvu -p
DESCRIPTION=Oracle CVU resource

The action script $GRID_HOME/bin/cvures implements the usual callbacks required by scriptagent: start(), stop(), check(), clean(), abort(). All log information goes into $GRID_HOME/log/`hostname -s`/cvu.

The actual check performed is this one: $GRID_HOME/bin/cluvfy comp health -_format & > /dev/null 2>&1


Enough for now, this has become a far longer post than I initially anticipated. There are so many more new things around that need exploring, like Quality of Service, which makes it very difficult to keep up.

After OOW, my laptop broke down – data rescue scenario

I just got back to the office from a 2-week conference + vacation (SFO, WAS, NY). I was finally back in shape to work and do the usual geek stuff again when my Neo laptop (the one I mentioned here, but it’s now on Fedora) suddenly stopped working!

It can’t even boot to BIOS, certainly a case worse than BSOD.

So after fiddling with the laptop and systematically ruling out other component failures (power cable, monitor, memory, HD), which, yes, is much like troubleshooting an Oracle database, we decided to bring it to the service center.

Build your own stretched RAC part III

On to the next part in the series. This time I am showing how I prepared the iSCSI openFiler “appliances” on my host. This is quite straightforward, if one knows how it works :)

Setting up the openFiler appliance on the dom0

OpenFiler 2.3 has a special download option suitable for paravirtualised Xen hosts. Proceed by downloading the file from your favourite mirror. The file name I am using is “openfiler-2.3-x86_64.tar.gz”; you might have to pick another one if you don’t want a 64bit system.

All my domUs go to /var/lib/xen/images/vm-name, and so do the openFiler ones. I am not using LVM to present storage to the domUs because my system came without free space I could have turned into a physical volume. Here are the steps to create the openFiler appliance; remember to repeat them 3 times, once for each storage provider.

Begin with the first openFiler appliance. Whenever you see numbers in {} then that implies that the operation has to be repeated for each of the numbers in the curly braces.

# cd /var/lib/xen/images/
# mkdir filer0{1,2,3}
# cd filer0{1,2}

Next create the virtual disks for the appliance. I use 4G for the root file system and one 5G + 2 10G disks. The 5G disk will later on be part of the OCR and voting files disk group, whereas the other two are going to be the local ASM disks. These steps are for filer01 and filer02, the iSCSI target providers.

# dd if=/dev/zero of=disk01 bs=1 count=0 seek=4G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 1.3296e-05 s, 0.0 kB/s  

# dd if=/dev/zero of=disk02 bs=1 count=0 seek=5G
# dd if=/dev/zero of=disk03 bs=1 count=0 seek=10G
# dd if=/dev/zero of=disk04 bs=1 count=0 seek=10G
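The "repeat for each number in the braces" convention and the dd calls above can be combined into one loop. The following is a minimal sketch rather than the exact procedure used on the lab host: it writes into a scratch directory by default so it is safe to try, and the helper name make_sparse_disk and the IMAGE_ROOT variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: create the sparse disk images for each iSCSI filer in a loop.
# IMAGE_ROOT defaults to a scratch directory so the sketch is safe to try;
# on the dom0 described in the text it would be /var/lib/xen/images.
IMAGE_ROOT=${IMAGE_ROOT:-$(mktemp -d)}

# make_sparse_disk FILE SIZE: hypothetical helper wrapping the dd call above.
# count=0 writes no data; seek= only sets the file length, so the image
# occupies almost no space until the guest writes to it.
make_sparse_disk() {
    dd if=/dev/zero of="$1" bs=1 count=0 seek="$2" 2>/dev/null
}

# "filer0{1,2}" in the text means: repeat for filer01 and filer02.
for filer in filer01 filer02; do
    mkdir -p "$IMAGE_ROOT/$filer"
    make_sparse_disk "$IMAGE_ROOT/$filer/disk01" 4G    # root file system
    make_sparse_disk "$IMAGE_ROOT/$filer/disk02" 5G    # OCR/voting files
    make_sparse_disk "$IMAGE_ROOT/$filer/disk03" 10G   # ASM disk
    make_sparse_disk "$IMAGE_ROOT/$filer/disk04" 10G   # ASM disk
done

# Apparent size (ls) versus blocks actually allocated (du).
ls -l "$IMAGE_ROOT/filer01"
du -k "$IMAGE_ROOT/filer01"/disk0*
```

Comparing the ls and du output shows why the dd commands return instantly: the files report their full size but allocate next to nothing on a file system that supports sparse files.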

For the NFS filer03, you only need two 4G disks, disk01 and disk02. For all filers, a root partition has to be created. You also have to create a file system on the “root” volume:

# mkfs.ext3 disk01
mke2fs 1.41.9 (22-Aug-2009)
disk01 is not a block special device.
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
262144 inodes, 1048576 blocks
52428 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1073741824
32 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 21 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
openSUSE-112-64-minimal:/var/lib/xen/images/filer01 #

Prepare to mount the root volume as a loop device, and also label the disk. Once mounted, copy the contents of the downloaded openfiler tarball into it as shown in this example:

# e2label disk01 root
# mkdir tmpmnt/
# mount -o loop disk01 tmpmnt/
# cd tmpmnt
# tar --gzip -xvf /m/downloads/openfiler-2.3-x86_64.tar.gz

With this done, we need to extract the kernel and the initial RAMdisk for later use in the xen config file. I have not experimented with pygrub for the openfiler appliances, someone with more knowledge may correct me here. This in any case works for this demonstration:

# mkdir /m/xenkernels/openfiler
# cp -a /var/lib/xen/images/filer01/tmpmnt/boot /m/xenkernels/openfiler
# cd ..
# umount tmpmnt

Here are the files now stored inside the kernel directory on the dom0:

# ls -l /m/xenkernels/openfiler/
total 9276
-rw-r--r-- 1 root root  770924 May 30  2008
-rw-r--r-- 1 root root   32220 Jun 28  2008 config-
drwxr-xr-x 2 root root    4096 Jul  1  2008 grub
-rw-r--r-- 1 root root 1112062 Jul  1  2008 initrd-
-rw-r--r-- 1 root root 5986208 May 14 18:01 vmlinux
-rw-r--r-- 1 root root 1558259 Jun 28  2008 vmlinuz-

With this information at hand, we can construct a xen configuration file, such as the following:

# cat filer01.xml

    root=/dev/xvda1 ro 

In plain English, this verbose XML file describes the VM as a paravirtualised Linux system with 4 hard disks and 2 network interfaces. The MAC must be static, otherwise you’ll end up with network problems each time you boot. The MAC also has to be unique among all currently started domUs! Change the UUID, name, paths to the disks (“source file”) and MAC addresses for filer02. The same applies for filer03, but this one only uses 2 disks, xvda and xvdb, so please remove the disk tags for disk03 and disk04.

Define the VM in xenstore and start it, while staying attached to the console:

# virsh define filer0{1,2,3}.xml
# xm start filer01 -c

Repeat this for filer02.xml and filer03.xml in separate terminal sessions.

Eventually, you are going to be presented with the welcome screen:

 Welcome to Openfiler NAS/SAN Appliance, version 2.3

You do not appear to have networking. Please login to start networking.

Configuring the OpenFiler domU

Log in as root (which doesn’t have a password, you should change this now!) and correct the missing network information. We have 2 virtual NICs, eth0 for the public network, and eth1 for the storage network. As root, navigate to /etc/sysconfig/network-scripts/ and edit ifcfg-eth{0,1}. In our example, we need 2 static interfaces. For eth0 for example, the existing file has the following contents:

[root@localhost network-scripts]# vi ifcfg-eth0
# Device file installed by rBuilder

Change this to:

[root@localhost network-scripts]# cat ifcfg-eth0
# Device file installed by rBuilder
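As a rough illustration of what a static configuration in this Red Hat style network-scripts format looks like, the sketch below writes one into a scratch directory. The address and netmask are placeholders from the range reserved for documentation, not the values used in this lab, and the helper name write_static_ifcfg is made up.

```shell
#!/bin/sh
# Sketch: write a static ifcfg file in the Red Hat style network-scripts
# format. All addresses are placeholders (192.0.2.0/24 is reserved for
# documentation); substitute the values for your own networks.

# write_static_ifcfg DIR DEVICE IPADDR NETMASK  (hypothetical helper)
write_static_ifcfg() {
    cat > "$1/ifcfg-$2" <<EOF
DEVICE=$2
BOOTPROTO=static
IPADDR=$3
NETMASK=$4
ONBOOT=yes
EOF
}

# Write to a scratch directory rather than /etc/sysconfig/network-scripts.
OUT_DIR=$(mktemp -d)
write_static_ifcfg "$OUT_DIR" eth0 192.0.2.10 255.255.255.0
cat "$OUT_DIR/ifcfg-eth0"
```

The same helper can be called a second time for eth1 with the storage network address.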

Similarly, change ifcfg-eth1 for address and restart the network:

[root@localhost network-scripts]# service network restart

After this, ifconfig should report the correct interfaces and you are ready to access the web console.

The network for filer02 uses for eth0 and for eth1. Similarly, filer03 uses for eth0 and for eth1.

All domUs are on the internal network, so you have to set up some port forwarding rules. The easiest way to do this is in your $HOME/.ssh/config file. For my server, I set up the following options:

martin@linux-itgi:~> cat .ssh/config
Host *eq8
HostName eq8
User martin
Compression yes
# note the white space
LocalForward 4460
LocalForward 4470
LocalForward 4480
LocalForward 5902

# other hosts
Host *
PasswordAuthentication yes
 FallBackToRsh no

I am forwarding the local ports 4460, 4470, 4480 on my PC to the openfiler appliances. This way, I can enter https://localhost:44{6,7,8}0 to access the web frontend for the openFiler appliance. This is needed, as you can’t really administer them otherwise. When using Firefox, you’ll get a warning about certificates. I have added security exceptions because I know the web server is not conducting a man-in-the-middle attack on me. You should always be careful adding unknown certificates to your browser in other cases.

Administering OpenFiler

NOTE: The following steps are for filer01 and filer02 only!

Once logged in as user “openfiler” (the default password is “password”), you might want to secure that password. Click on Accounts -> Admin Password and make the changes you like.

Next I recommend you verify the system setup. Click on System and review the settings. You should see the network configured correctly; if not, you can update the settings here. You can also change the hostname to filer0{1,2}.localdomain. Save your changes.

Next we need to partition our block devices. Previously unknown to me, openFiler uses the “gpt” format to partition disks. Click on Volumes -> Block devices to see all the block devices. Since you are running a domU, you can’t see the root device /dev/xvda. For each device (xvd{b,c,d}), create one partition spanning the whole of the “disk”. You can do so by clicking on the device name. Scroll down to the “Create partition in /dev/xvdx” section and fill in the data. Click “create” to create the partition. Note that you can’t see the partitions in fdisk should you log in to the appliance as root.

Once the partitions are created, it’s time to create volumes to be exported as iSCSI targets. Still in “Volumes”, click on “Volume Groups”. I chose to create the following volume groups:

  • ASM_VG with member PVs xvdc1 and xvdd1
  • OCRVOTE_VG with member PV xvdb1

Once the volume groups are created, you should proceed by creating logical volumes within these. Click on “Add Volume” to access this screen. You have a drop-down menu to select your volume group. For OCRVOTE_VG I opted to create the following logical volumes (you have to set the type to iSCSI rather than XFS):

  • ocrvote01_lv, about 2.5G in size, type iSCSI
  • ocrvote02_lv, about 2.5G in size, type iSCSI

For volume group ASM_VG, I created these logical volumes:

  • asmdata01_lv, about 10G in size, type iSCSI
  • asmdata02_lv, about 10G in size, type iSCSI

We are almost there! The storage has been carved out of the pool of available storage, and what remains to be done is the definition of the iSCSI targets and ACLs. You can define very fine grained access to iSCSI targets, and even for iSCSI discovery! This example tries to keep it simple and doesn’t use any CHAP authentication for iSCSI targets and discovery. In the real world you’d very much want to implement these security features though.

Preparing the iSCSI part

We are done for now on the Volumes tab. First, we need to enable the iSCSI target server. In “Services”, ensure that the “iSCSI target server” is enabled. If not, click on the link next to it. Before we can export any LUNs, we need to define who is eligible to mount them. In openFiler, this is configured via ACLs. Go to the “System” tab and scroll down to the “Network access configuration” section. Fill in the details of our cluster nodes here as shown below. These are the settings for edcnode1:

  • Name: edcnode1
  • Network/Host:
  • Netmask: (IMPORTANT: it has to be, NOT
  • Type: share

The settings for edcnode2 are identical, except for the IP address which is, we are configuring the “STORAGE” network here! Click on “Update” to make the changes permanent. You are now ready to create the iSCSI targets, of which there will be 2: one for the OCR/Voting Disk, and another one for the ASM LUNs.

Back on the Volumes tab, click on “iSCSI targets”. You will be notified that no targets have been defined yet. You will have to define the following targets for filer01:


Leave the default settings, they will do for our example. You simply add the name to the “Target IQN” field and then click on “Add”. The targets currently don’t support any LUNs yet, something that needs addressing in this step.

Switch to target and then use the tab “LUN mapping” to map a LUN. In the list of available LUNs add ocrvote01_lv and ocrvote02_lv to the target. Click on “network ACL” and allow access to the LUN from edcnode1 and edcnode2. For the first ASM target, map asmdata01_lv and set the permissions, then repeat for the last target with asmdata02_lv.

Create the following targets for filer02:


The mappings and settings for the ASM targets are identical to filer01, but for the OCRVOTE target only export the first logical volume, i.e. ocrvote01_lv.

NFS export

The third filer, filer03, is a little bit different in that it only exports an NFS share to the cluster. It only has one data disk, disk02. In a nutshell, create the filer as described to the point where it’s accessible via its web interface. The high level steps for it are:

  1. Partition /dev/xvdb into 1 partition spanning the whole disk
  2. Create a volume group ocrvotenfs_vg from /dev/xvdb1
  3. Create a logical volume nfsvol_lv, approx 1G in size with ext3 as its file system
  4. Enable the NFS v3 server (Services tab)

From there on the procedure is slightly different. Click on “Shares” to access the network shares available from the filer. You should see your volume group with the logical volume nfsvol_lv. Click on the link “nfsvol_lv” and enter “ocrvote” as subfolder name. A new folder icon with the name ocrvote will appear. Click on this one, and in the pop-up dialog click on “Make share”. You should set the following in the lengthy configuration dialog that now opens:

  • Public guest access
  • Host access for edcnode1 and edcnode2 for NFS RW (select the radio button)
  • Click on edit to access special options for edcnode1 and edcnode2. Ensure that the anonymous UID and GID match the one for the grid software owner. The UID/GID mapping has to be “all_squash”, IO mode has to be “sync”. You can ignore the write delay and origin port for this example
  • Leave all other protocols deselected
  • Click update to make the changes permanent
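For orientation, the settings chosen in the web form correspond to standard NFS export options (rw, sync, all_squash, anonuid, anongid). The sketch below prints what such an entry would roughly look like in /etc/exports syntax. OpenFiler maintains its own exports file, so this is for understanding only; the share path and the UID/GID 501 are assumptions, and the helper name exports_line is made up.

```shell
#!/bin/sh
# Sketch: print an /etc/exports style line matching the options selected
# in the web form. Path and numeric IDs are placeholders; use the UID/GID
# of your grid software owner (check with "id" on a cluster node).

# exports_line PATH HOST UID GID  (hypothetical helper)
exports_line() {
    printf '%s %s(rw,sync,all_squash,anonuid=%s,anongid=%s)\n' \
        "$1" "$2" "$3" "$4"
}

for node in edcnode1 edcnode2; do
    exports_line /mnt/ocrvotenfs_vg/nfsvol_lv/ocrvote "$node" 501 501
done
```

With all_squash every client user is mapped to the anonymous UID and GID, which is why these have to match the grid software owner.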

That was it! The storage layer is now perfectly set up for the cluster nodes which I’ll discuss in a follow-on post.

Build your own stretched RAC part II

I promised in the introduction to introduce my lab environment in the first part of the series. So here we go…


Similar to the Fedora project, SuSE (now Novell) have come up with a community distribution some time ago which can be freely downloaded from the Internet. All these community editions give the users a glimpse at the new and upcoming Enterprise distribution, such as RHEL or SLES.

I have chosen the OpenSuSE 11.2 distribution for the host operating system. It has been updated to xen 3.4.1, kernel and libvirt 0.7.2. These packages provide a stable execution environment for the virtual machines we are going to build. Alternative xen-based solutions have not been considered. During initial testing I found that Oracle VM 2.1.x virtual machines could not mount iSCSI targets without kernel-panicking and crashing. Citrix’s xenserver is too commercial and its community edition is lacking needed features. Finally, Virtual Iron had already been purchased by Oracle.

All kernel 2.6.18-x based distributions such as Red Hat 5.x and clones were discarded for lack of features and their age. After all, 2.6.18 was introduced three years ago, and although features were back-ported to it, xen support is way behind what I needed. The final argument in favour of OpenSuSE was the fact that SuSE provide a xen-capable 2.6.31 kernel out of the box. Although it is perfectly possible to build one’s own xen-kernel, this is an advanced topic and not covered here. OpenSuSE also makes configuring the networking bridges very straightforward through good integration into yast, the distribution’s setup and configuration tool.

The host system uses the following components:

  • Single Intel Core i7 processor
  • 24GB RAM
  • 1.5 TB hard disk space in RAID 1

The whole configuration can be rented from hosting providers, something I have chosen to do. The host has run a four node 11.2 cluster plus 2 additional virtual machines for Enterprise Manager Grid Control 11.1 without problems. In my experience the huge amount of memory is the greatest benefit of the above configuration. Allocating four GB of RAM to each VM helped a lot.


You should be roughly familiar with the concepts behind XEN virtualisation. The following list explains the most important terminology.

  • Hypervisor The enabling technology to run virtual machines. The hypervisor used in this document is the xen hypervisor.
  • dom0 The dom(ain) 0 is the name for the host. The dom0 has full access to all the system’s peripherals
  • domU In Xen parlance, the domU is a virtual machine. Xen differentiates between paravirtualised and fully virtualised machines. Paravirtualisation broadly speaking offers superior performance, but requires a modified operating system. I am going to use paravirtualised domUs
  • Bridge: A (virtual) network device used for IP communication between virtual machines and the


Start off by installing the openSuSE 11.2 distribution, choosing either the GNOME or KDE desktop. Long years of exposure to Red Hat based systems made me choose the GNOME desktop. Once the installation has completed, start the yast administration tool and click on the “install hypervisor and tools” button. This will install the xen-aware kernel and add the necessary entry to the GRUB boot loader. Once completed, reboot the server and boot the xen kernel. You don’t need to configure any network bridges at this stage, even though yast prompts you to do so.

Networking on the dom0

RAC requires at least 2 NICs per cluster node, along with storage connectivity such as fibre channel. In our example I am going to use iSCSI targets for storage, provided by the OpenFiler community edition. It is good practice to separate storage communication from any other communication, the same as with the cluster interconnect. Therefore, a third bridge will be used. Production setups would of course use a different setup, but as iSCSI serves the purpose quite well I decided to implement it. Also, a production cluster would feature redundancy everywhere, including NICs and HBAs. Remember that redundancy can prevent outages!

The communication between the cluster nodes will be channeled over virtual switches, so-called bridges. It used to be quite difficult to set up a network bridge for XEN, but openSuSE’s yast configuration tool makes this quite simple. My host has the following bridges configured:

  • br0 This is the only bridge that has a physical interface bridged, normally eth0 or bond0. It won’t be used for the cluster and is used purely to allow my ssh traffic coming in. If not yet configured,
  • br1 I used br1 as a host only network for the public cluster communication. It does not have a bridged physical interface
  • br2 This is in use for the private cluster interconnect. This bridge doesn’t have a physical NIC configured
  • br3 Finally this bridge will be used to allow iSCSI communication between the filers and the cluster nodes. Neither does this have a physical NIC configured

I said a number of times that configuring a bridge was quite tedious, and for some other distributions it still is. It requires quite a bit of knowledge of the bridge-utils package and the naming conventions for virtual and physical network interfaces in XEN. To configure a bridge in openSuSE, start yast and click on the “Network Settings” icon to start the network configuration.

The configuration tool will load the current network configuration. Bridge br0 should be configured to bridge the public interface name, usually eth0. All other bridges should not bridge physical devices, effectively making them host-only. If you haven’t configured a network bridge when you installed the xen hypervisor and tools, it’s time to do so now. Identify your external networking device in the list of devices shown on the “Overview” page. Take note of all settings such as IP address, netmask, gateway, MTU, routes, etc. You can get this information by selecting your external NIC and clicking on the “Edit” button.

You should see a Network Bridge entry in the list of interfaces, which probably uses DHCP. Select it and click on “Edit”. Enter all the details you just copied from your physical NIC and ensure that the NIC is listed under the “Bridged Devices” tab. Click on “Next”. Confirm the warning that a device is already configured with these settings. This will effectively deconfigure the physical device and replace it with the bridge.

Adding the host-only bridges is easier. Select the “Add” option next, and on the following screen make sure that “Bridge” is selected as the device type. The configuration name will be set correctly; don’t change it unless you know what you are doing. In the following Network Card Setup screen, assign a static IP address, a subnet mask, and optionally a hostname. I left the hostname blank for all but the public bridge br0.

Finish the configuration assistant. Before restarting the network ensure you have an alternative means of getting to your machine, for example using a console. If the network is badly configured, you might be locked out.

The network setup

The following IP addresses are used for the example cluster:

IP Address Range Used For The IP addresses for the web interfaces of the openFiler iSCSI “SAN” The Single Client Access Name for the cluster Node virtual IP addresses and 58 Private cluster interconnect and 58 Storage subnet for the cluster nodes The IP addresses for the iSCSI interfaces of the openFiler “SAN”

Some of these addresses need to go into DNS. Edit your DNS server’s zone files and add the following to the zone’s forward lookup file:

; extended distance cluster
filer01                 IN A
filer02                 IN A
filer03                 IN A

edc-scan                IN A
edc-scan                IN A
edc-scan                IN A

edcnode1                IN A
edcnode1-vip            IN A

edcnode2                IN A
edcnode2-vip            IN A

The reverse lookup looks as follows:

; extended distance cluster
50              IN PTR          filer01.localdomain.
51              IN PTR          filer02.localdomain.
52              IN PTR          filer03.localdomain.

53              IN PTR          edc-scan.localdomain.
54              IN PTR          edc-scan.localdomain.
55              IN PTR          edc-scan.localdomain.

56              IN PTR          edcnode1.localdomain.
57              IN PTR          edcnode1-vip.localdomain.

58              IN PTR          edcnode2.localdomain.
59              IN PTR          edcnode2-vip.localdomain.

The public network maps to bridge br1 on network 192.168.99/24, the private network is supported through br2 in the 192.168.100/24 subnet, and the storage will go through br3, using the 192.168.101/24 subnet.

Reload the DNS service now to make these changes active.

You should use the “host” utility to check if the SCAN resolves in DNS to be sure it all works.
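If you prefer to script the check rather than run “host” by hand, the lookup could be sketched in Python as follows. This is only a rough sketch: it assumes the SCAN is registered as edc-scan.localdomain (the domain is taken from the reverse zone above) and that three addresses are expected. The function names are made up for illustration. Note that it goes through the operating system’s resolver, so an entry in /etc/hosts would be picked up as well.

```python
import socket


def scan_addresses(name, resolver=socket.gethostbyname_ex):
    """Return the sorted IPv4 addresses the resolver reports for name."""
    _canonical, _aliases, addresses = resolver(name)
    return sorted(addresses)


def scan_looks_ok(name, expected=3, resolver=socket.gethostbyname_ex):
    """True if name resolves to exactly `expected` distinct addresses."""
    return len(set(scan_addresses(name, resolver))) == expected


# Example call (the host name is an assumption based on the zone files):
#     scan_looks_ok("edc-scan.localdomain")
```

The resolver is passed in as a parameter so that the check can be tried out without a working DNS server.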

That’s it-you successfully set up the dom0 for working with the virtual machines. Continue with the next part of the series, which is going to introduce openFiler and how to install it as a domU with minimal effort.

Build your own stretched RAC

Finally time for a new series! With the arrival of the new patchset I thought it was about time to try and set up a virtual extended distance or stretched RAC. So, it’s virtual, fair enough. It doesn’t allow me to test things like the impact of latency on the inter-SAN communication, but it allowed me to test the general setup. Think of this series as a guide after all the tedious work has been done, and SANs happily talk to each other. The example requires some understanding of how XEN virtualisation works, and it’s tailored to openSuSE 11.2 as the dom0 or “host”. I have tried OracleVM in the past but back then a domU (or virtual machine) could not mount an iSCSI target without a kernel panic and reboot. Clearly not what I needed at the time. OpenSuSE has another advantage: it uses a new kernel-not the 3 year old 2.6.18 you find in Enterprise distributions. Also, xen is recent (openSuSE 11.3 even features xen 4.0!) and so is libvirt.

The Setup

The general idea follows the design you find in the field, but with fewer cluster nodes. I am thinking of 2 nodes for the cluster, and 2 iSCSI target providers. I wouldn’t use iSCSI in the real world, but my lab isn’t connected to an EVA or similar. A third site will provide quorum via an NFS-provided voting disk.

Site A will consist of filer01 for the storage part, and edcnode1 as the RAC node. Site B will consist of filer02 and edcnode2. The iSCSI targets are going to be provided by openFiler’s domU installation, and the cluster nodes will make use of Oracle Enterprise Linux 5 update 5. To make it more realistic, site C will consist of another openFiler instance, filer03, to provide the NFS export for the 3rd voting disk. Note that openFiler seems to support NFS v3 only at the time of this writing. All systems are 64bit.
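Why the third site matters can be illustrated with a bit of arithmetic. The sketch below assumes the usual Clusterware rule that a node has to be able to access a strict majority of the voting disks to stay in the cluster. The function names and the single-letter site names are purely illustrative.

```python
def has_quorum(reachable, total):
    """A strict majority of the voting disks must be accessible."""
    return 2 * reachable > total


def survives_site_loss(sites, failed_site):
    """Check quorum after one site (and its voting disk) has gone."""
    remaining = [site for site in sites if site != failed_site]
    return has_quorum(len(remaining), len(sites))


for layout in (["A", "B", "C"], ["A", "B"]):
    for failed in layout:
        state = "kept" if survives_site_loss(layout, failed) else "lost"
        print(len(layout), "voting disks, site", failed, "down: quorum", state)
```

With one voting disk per site, losing any one of three sites still leaves two of three disks. With only two sites, losing either one leaves a single disk out of two, which is not a majority.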

The network connectivity will go through 3 virtual switches, all “host only” on my dom0.

  • Public network: 192.168.99/24
  • Private network: 192.168.100/24
  • Storage network: 192.168.101/24
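To make the layout a little more tangible, here is a minimal Python sketch that maps an address to the bridge it belongs to. It assumes the bridge names br1 to br3 from the dom0 setup part of this series, and it writes the networks in the full 192.168.x.0/24 form that the ipaddress module expects. The sample address is hypothetical.

```python
import ipaddress

# Bridge to network mapping; bridge names follow the dom0 setup
NETWORKS = {
    "br1": ipaddress.ip_network("192.168.99.0/24"),   # public
    "br2": ipaddress.ip_network("192.168.100.0/24"),  # private interconnect
    "br3": ipaddress.ip_network("192.168.101.0/24"),  # storage (iSCSI)
}


def bridge_for(address):
    """Return the bridge whose network contains address, or None."""
    ip = ipaddress.ip_address(address)
    for bridge, network in NETWORKS.items():
        if ip in network:
            return bridge
    return None


# 192.168.100.56 is a made-up example address
print(bridge_for("192.168.100.56"))
```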

As in the real world, private and storage network have to be separated to prevent iSCSI packets clashing with Cache Fusion traffic. Also, I increased the MTU for the private and storage networks to 9000 instead of the default 1500. If you want to use jumbo frames, you should check that your switch supports them.
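The effect of the larger MTU can be estimated with a quick back-of-the-envelope calculation. The sketch below is a simplification under stated assumptions: it counts only 20 bytes of IPv4 header and 20 bytes of TCP header per packet, ignores iSCSI and Ethernet overhead, and uses 8 KB and 1 MB payloads as arbitrary examples.

```python
IP_HEADER = 20   # bytes, IPv4 without options (assumption)
TCP_HEADER = 20  # bytes, TCP without options (assumption)


def packets_needed(payload_bytes, mtu):
    """Number of full-size TCP/IP packets needed to carry payload_bytes."""
    per_packet = mtu - IP_HEADER - TCP_HEADER
    return -(-payload_bytes // per_packet)  # ceiling division


for size in (8192, 1024 * 1024):
    print(size, "bytes:",
          packets_needed(size, 1500), "packets at MTU 1500,",
          packets_needed(size, 9000), "packets at MTU 9000")
```

Under these assumptions an 8 KB payload fits into a single jumbo frame, while it needs six packets at the default MTU.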

Grid Infrastructure will use ASM to store OCR and voting disks, and the inter-SAN replication will also be performed by ASM in normal redundancy. I am planning on using preferred mirror read and intelligent data placement to see if that makes a difference.

Known limitations

This setup has some limitations, such as the following ones:

  • You cannot test inter-site SAN connectivity problems
  • You cannot make use of udev for the ASM devices-a xen domU doesn’t report anything back from /sbin/scsi_id which makes the mapping to /dev/mapper impossible (maybe someone knows a workaround?)
  • Network interfaces are not bonded-you certainly would use bonded NICs in real life
  • No “real” fibre channel connectivity between the cluster nodes

So much for the introduction-I’ll post the setup step-by-step. The intended series will consist of these articles:

  1. Introduction to XEN on openSuSE 11.2 and dom0 setup
  2. Introduction to openFiler and its installation as a virtual machine
  3. Setting up the cluster nodes
  4. Installing Grid Infrastructure
  5. Adding third voting disk on NFS
  6. Installing RDBMS binaries
  7. Creating a database

That’s it for today, I hope I got you interested and following the series. It’s been real fun doing it; now it’s about writing it all up.

Oracle Exadata Database Machine v2 vs x2-2 vs x2-8 Deathmatch

This post has been updated live from Oracle OpenWorld as I’m learning what’s new. Last update done on 28-Sep-2010.

Oracle Exadata v2 has been transformed into x2-2 and x2-8. x2-2 is just slightly updated while x2-8 is a much more high-end platform. Please note that Exadata x2-2 is not just an old Exadata v2 — it’s a fully refreshed model. This is a huge confusion here at the OOW and even at the Oracle web site.

The new Exadata pricing list is released and Exadata x2-2 costs exactly the same as old Exadata v2. Exadata x2-8 Full Rack (that’s the only x2-8 configuration, see below why) is priced 50% higher than Full Rack x2-2. To clear up the confusion: this is the hardware price only (updated 18-Oct-2010).

Exadata Storage Server Software pricing is the same, and licensing costs per storage server and per full rack are the same as for Exadata v2 because the number of disks didn’t change. Note that the storage cells got upgraded but are priced the same when it comes to Exadata Server software and hardware. Nice touch, but see the implications for database licensing below.

This comparison is for Full-Rack models Exadata x2-2 and x2-8 and existing v2 model.

Finally, data-sheets are available for both x2-2 (Thx Dan Norris for the pointers):

and x2-8:

It means that live update of this post is probably over (27-Sep-2010).

| Component | v2 Full Rack | x2-2 Full Rack | x2-8 Full Rack |
|---|---|---|---|
| Database servers | 8 x Sun Fire x4170 1U | 8 x Sun Fire x4170 M2 1U | 2 x Sun Fire x4800 5U |
| Database CPUs | Xeon E5540 quad core 2.53GHz | Xeon X5670 six cores 2.93GHz | Xeon X7560 eight cores 2.26GHz |
| Database cores | 64 | 96 | 128 |
| Database RAM | 576GB | 768GB | 2TB |
| Storage cells | 14 x SunFire X4275 | 14 x SunFire X4270 M2 | 14 x SunFire X4270 M2 |
| Storage cell CPUs | Xeon E5540 quad core 2.53GHz | Xeon L5640 six cores 2.26GHz | Xeon L5640 six cores 2.26GHz |
| Storage cells CPU cores | 112 | 168 | 168 |
| IO performance & capacity | 15K RPM 600GB SAS or 2TB SATA 7.2K RPM disks | 15K RPM 600GB SAS (HP model – high performance) or 2TB SAS 7.2K RPM disks (HC model – high capacity) | 15K RPM 600GB SAS (HP model – high performance) or 2TB SAS 7.2K RPM disks (HC model – high capacity) |
| Flash Cache | 5.3TB | 5.3TB | 5.3TB |
| Database servers networking | 4 x 1GbE x 8 servers = 32 x 1GbE | 4 x 1GbE x 8 servers + 2 x 10GbE x 8 servers = 32 x 1GbE + 16 x 10GbE | 8 x 1GbE x 2 servers + 8 x 10GbE x 2 servers = 16 x 1GbE + 16 x 10GbE |
| InfiniBand switches | QDR 40Gbit/s wire | QDR 40Gbit/s wire | QDR 40Gbit/s wire |
| InfiniBand ports on database servers (total) | 2 ports x 8 servers = 16 ports | 2 ports x 8 servers = 16 ports | 8 ports x 2 servers = 16 ports |
| Database servers OS | Oracle Linux only | Oracle Linux (possible Solaris later, still unclear) | Oracle Linux or Solaris x86 |

Note for x2-2 and x2-8: the 2TB SAS disks are the same old 2 TB drives with new SAS electronics. (Thanks Kevin Closson for ref)

x2-8 has fewer but way bigger database servers. That means that x2-8 will scale better, with less RAC overhead for the databases. The bad news is that if one database server fails or is down for maintenance, 50% of capacity is gone. What does that mean? It means that Exadata x2-8 is designed more for multi-rack deployments so that you can go beyond “simple” 2 node RAC. Some folks argue that two node RAC is less reliable because of evictions etc., but you probably don’t know that Exadata has a special IO fencing mechanism that makes it much more reliable.

Because there is about 3.5 times as much RAM in Exadata x2-8 as in v2 (2TB versus 576GB), more and more operations can be done fully in memory without even going to the storage cells. This is why the boost in number of cores / CPU performance is important: since InfiniBand bandwidth stays the same, you need some other way to access more data, so having more data in the buffer cache will keep more CPU cores busy.
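The totals in the comparison table are easy to re-derive. The following Python sketch does the arithmetic. The number of CPU sockets per database server (2 for the x4170 models, 8 for the x4800) does not appear in the table and is an assumption here, as is treating 2TB as 2048GB.

```python
# Database tier figures from the table; "sockets" is an assumption
RACKS = {
    "v2":   {"servers": 8, "sockets": 2, "cores_per_cpu": 4, "ram_gb": 576},
    "x2-2": {"servers": 8, "sockets": 2, "cores_per_cpu": 6, "ram_gb": 768},
    "x2-8": {"servers": 2, "sockets": 8, "cores_per_cpu": 8, "ram_gb": 2048},
}


def total_cores(model):
    """Database cores in a full rack."""
    rack = RACKS[model]
    return rack["servers"] * rack["sockets"] * rack["cores_per_cpu"]


def ram_ratio(model, baseline):
    """How many times the RAM of baseline does model have?"""
    return RACKS[model]["ram_gb"] / RACKS[baseline]["ram_gb"]


def share_lost_with_one_server(model):
    """Fraction of database capacity gone when one server is down."""
    return 1 / RACKS[model]["servers"]


for model in RACKS:
    print(model, total_cores(model), "cores,",
          f"{share_lost_with_one_server(model):.1%} lost per failed server")
print("x2-8 vs v2 RAM:", round(ram_ratio("x2-8", "v2"), 2))
```

Under these assumptions the core counts come out as 64, 96 and 128, matching the table, and a single failed server costs 12.5% of the database tier on v2 and x2-2 but 50% on x2-8.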

With Exadata x2-2, processing capacity on database servers increased and the RAM increase is insignificant. So how does it impact “well-balanced” Exadata v2? Well, if more and more operations are offloaded to storage cells then database servers could have more “useful” data pumped in over InfiniBand and actually spend CPU cycles processing the data rather than filtering it. With Exadata v2, depending on the compression level, CPU was often a bottleneck on data loads so having some more CPU capacity on the database tier won’t harm.

The old v2 configuration will no longer be available, so be ready to spend more on Oracle database licenses unless you are licensed under a ULA or something.

Both Exadata x2-8 and x2-2 will run updated Oracle Linux 5.5 with Oracle Enterprise Kernel. x2-8 can also run Solaris x86 on database servers as expected. This confirms my assumption that if Oracle adds Solaris x86 into Exadata, it will prove that Oracle is fully committed to the Solaris Operating System. Rather pleasant news to me! However, Solaris 11 Express is not available right now and will probably become available towards the end of this calendar year.

If you look at x2-2 and x2-8 side by side physically, you will see that four 1U database servers of x2-2 are basically replaced by one 5U database server in x2-8 in terms of rack space. There are also more internal disks in those bigger servers and more power supplies, so they are more redundant.

More processing power on storage servers in x2-8 and x2-2 (not dramatically more but definitely noticeable) will speed up smart scans accessing data compressed at a high level. As more and more operations can be offloaded to the storage cells, the boost in CPU capacity there is quite handy. Note that this doesn’t impact licensing in any way: Exadata Storage Server Software uses the number of physical disk spindles as the licensing metric.

Regarding claims of the full database encryption — need to understand how it works and what are the improvements. Oracle Transparent Data Encryption was available on Exadata v2 but had many limitations when using with other Exadata features. I assume that Exadata x2-x addresses those but need to follow up on details so stay tuned. I believe that customers of Exadata v2 will be able to take advantage of all new Exadata software features – the platform architecture hasn’t changed.

Wish List of Oracle OpenWorld 2010 Announcements: Exadata v3 x2-8, Linux, Solaris, Fusion Apps, Mark Hurd, Exalogic Elastic Cloud, Cloud Computing

It’s Sunday morning early in San Francisco and the biggest ever Oracle OpenWorld is about to start. It looks like it’s also going to be the busiest ever OpenWorld for me — my schedule looks crazy and I still need to do the slides for my Thursday sessions (one on ASM and one on cloud computing). Fortunately, my slides for today’s presentation are all ready to go.

OK. Don’t let me get carried away. I started this post with the intention to write about what I expect Oracle to announce at this OpenWorld, and it seems like the most important announcements will happen at tonight’s keynote. I haven’t been at the Oracle ACE Directors briefing so, unlike them, all I can offer is pure speculation and my wishes of what should be covered. Then again, unlike them, I actually CAN say at least something. :)

  1. Oracle Exadata Database Machine v3 (x2-8): well, that shouldn’t come as a surprise to anybody by now. I fully expect an upgrade of the hardware: new Intel CPUs (probably with more cores), more memory, possibly more flash (this technology moves really quickly these days). Maybe a 10GbE network will be introduced to address some customers’ demands, but I don’t think it’s needed that much. InfiniBand might just stay as it is. I think there is enough throughput, but Marc Fielding noted that moving InfiniBand to the next speed level shouldn’t be very expensive. Other than the cosmetic upgrade, I believe that the hardware architecture will largely stay the same: it works very well, it’s proven and very successful. Maybe something should be done to let customers integrate Exadata better into their data-centers. Folks keep complaining of inflexibility (and I think Oracle should stay firm on this and not let customers screw themselves up, but who knows).
    On the software side, I expect a new Exadata Storage Software release announcement that will be able to offload more and more to the storage side. The concept of moving data intensive operations closer to the disks has proven to be very effective. I also expect to have more Exadata features for consolidation. If you didn’t notice, the database release of a few days ago has an Exadata specific QoS (Quality of Service) feature. I think this is what’s going to make Exadata a killer consolidation platform for databases: a true private cloud for Oracle databases, or a true grid as Oracle insists on calling its private cloud idea. Speaking about software… hm, see Linux and Solaris below.
    And back to consolidation, there must be a new platform similar to Exadata that integrates Oracle hardware and software and that fills the gap as a consolidation offering for anything else but databases: Fusion Middleware, Fusion Apps and the whole lineup of Oracle software. Whether it’s going to have Exadata in its name, I don’t know. It’s going to be named Exalogic Elastic Cloud. It would make sense to support both Solaris and Linux virtualization technologies on that new platform.
    Oh, and I hope to see Oracle start offering vertical focused solutions based on Exadata. Like Retail Database Machine. Maybe it won’t come at the OpenWorld but I think it would be a good move by Oracle.
  2. Solaris and Linux: I’ve been preaching for a while that, having acquired the Solaris engineering team, it would be insane not to take over the Linux distribution from RedHat and start providing a truly Oracle Linux. I was expecting Oracle to do that for a while. Either that or change Oracle’s commitment from Linux to Solaris on the x86 platform. If Oracle is serious about Solaris now then the best indication of that would be a Solaris x86 powered Exadata. In other words, the future of Linux and Solaris at Oracle should be made clear during this OpenWorld.
  3. Fusion Apps: god, I really hope something will be out. After all those years talking about it, I can’t stand hearing any more about the ghost product (or line of products). I think it’s also confirmed by Debra Lilley’s increased activity over the past year: she is buzzing unusually strongly about it. ;-) Of course, Fusion Apps will be all about integration of a zillion Oracle products into one system (which is a very difficult task). Oh, and if Fusion Apps are announced then they will run best on Exadata, of course. Oracle Fusion Apps Machine?
  4. Mark Hurd: finally, I’d be very keen to see the first serious public appearance of Mark Hurd as Oracle’s co-president. I think he will set the tone for the future of Oracle’s hardware business. So far it’s been all about profitability, which is probably the best thing Oracle could do with the otherwise dead Sun hardware business.

That’s all. I’m sure there will be more. I didn’t mention SPARC and that’s not because I forgot.

This OpenWorld promises to be very interesting!

Where can I download CentOS 2.1?

I need to download a copy of CentOS 2.1 (x86), but I can’t find it anywhere. I’ve been down the list of mirrors and they all list 2.1, but then have an empty tree below it.

If anyone knows how I can get hold of it please drop me a line.



An investigation into exadata

This is an investigation into a half rack database machine (the half rack database machine at VX Company). It’s an exadata/database V2, which means SUN hardware and database and cell (storage) software version 11.2.

I built a table (called ‘CG_VAR’), which consists of:
- bytes: 50787188736 (47.30 GB)
- extents: 6194
- blocks: 6199608

The table doesn’t have a primary key, nor any other constraints, nor any indexes. (of course this is not a real life situation)
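The three figures above are consistent with each other, which is easy to check. The following Python sketch uses only the numbers quoted. The 8 KB block size it arrives at is an inference from the arithmetic, not something stated explicitly, and the function names are made up.

```python
TABLE_BYTES = 50787188736
TABLE_BLOCKS = 6199608
TABLE_EXTENTS = 6194


def block_size(total_bytes, blocks):
    """Bytes per block; raises if the figures do not divide evenly."""
    size, remainder = divmod(total_bytes, blocks)
    if remainder:
        raise ValueError("bytes is not a whole multiple of blocks")
    return size


def size_in_gb(total_bytes):
    """Size in binary gigabytes (2**30 bytes)."""
    return total_bytes / 2**30


print("block size:", block_size(TABLE_BYTES, TABLE_BLOCKS))
print("size in GB:", round(size_in_gb(TABLE_BYTES), 2))
print("average blocks per extent:", round(TABLE_BLOCKS / TABLE_EXTENTS, 1))
```

The average extent works out to roughly a thousand blocks, or a little under 8 MB.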

No exadata optimisation

At first I disabled the Oracle storage optimisation using the session parameter ‘CELL_OFFLOAD_PROCESSING’:
alter session set cell_offload_processing=false;

Then executed: select count(*) from cg_var where sample_id=1;
The value ‘1’ accounts for roughly 25% of the rows in the table ‘CG_VAR’.

Execution plan:

Fedora 13 and Oracle…

Until a couple of days ago I hadn’t even realized that Fedora 13 was out. I guess that shows how interested I am in Fedora these days. :)

Anyway, I had a play around with it.