Oakies Blog Aggregator

Reintroducing SLB – The Silly Little Benchmark.

A few years ago I was working on a series about AMD Opteron, Hypertransport, NUMA and what that meant to Oracle. Along the way I put out the Silly Little Benchmark (SLB) as discussed in my post entitled Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing The Silly Little Benchmark. I've had a lot of requests recently for copies of SLB since all my posts with links to the old OakTable Network site are dead-ends. If you are looking for SLB, please download it at the following link:

Download Silly Little Benchmark (a.k.a. SLB) Here

New SLB Kit
I’d like to point out a couple of things about the new SLB tar archive.

  1. The code has changed so results from this kit are not comparable to prior kits.
  2. The kit now performs 30 seconds of random memory reads followed by 30 seconds of random memory writes.
  3. The kit includes a wrapper script called runit.sh that runs SLB processes, each with 512MB of physical memory. The argument to runit.sh controls how many SLB processes to run on each invocation of the benchmark.
  4. The kit includes a README that shows how to compile the kit and also offers further explanation of item #3 in this list.

Previous SLB Blog Posts
The following are a few pointers to prior content that dealt with SLB in one way or another.

Some Recent SLB Results
The following are a few updated SLB results using the new kit.

The first recent result is from a two-socket Westmere EP (Xeon 5600) system. I passed "6" into runit.sh to see what one socket's worth of performance looks like.

$ ./runit.sh 6
Users: 6
Buffer area size  524288 KB  ADDR 0x2AD3ABA02010
Waiting for semaphore...
Total wops 239999724  Secs 30.1  Avg nsec/op 125 TPUT ops/sec 7978921.92
Total rops 399999540  Secs 30.1  Avg nsec/op 75 TPUT ops/sec 13300920.06
Buffer area size  524288 KB  ADDR 0x2B56949FA010
Waiting for semaphore...
Total wops 379999563  Secs 30.6  Avg nsec/op 80 TPUT ops/sec 12416873.45
Total rops 619999287  Secs 30.1  Avg nsec/op 48 TPUT ops/sec 20623014.23
Buffer area size  524288 KB  ADDR 0x2AAF02293010
Waiting for semaphore...
Total wops 239999724  Secs 30.2  Avg nsec/op 125 TPUT ops/sec 7939962.30
Total rops 459999471  Secs 30.1  Avg nsec/op 65 TPUT ops/sec 15257316.97
Buffer area size  524288 KB  ADDR 0x2AFADC4C9010
Waiting for semaphore...
Total wops 379999563  Secs 31.1  Avg nsec/op 81 TPUT ops/sec 12216920.78
Total rops 599999310  Secs 30.2  Avg nsec/op 50 TPUT ops/sec 19873638.29
Buffer area size  524288 KB  ADDR 0x2AEB7B430010
Waiting for semaphore...
Total wops 379999563  Secs 31.2  Avg nsec/op 82 TPUT ops/sec 12174302.22
Total rops 599999310  Secs 30.1  Avg nsec/op 50 TPUT ops/sec 19941302.38
Buffer area size  524288 KB  ADDR 0x2B6A80F63010
Waiting for semaphore...
Total wops 239999724  Secs 30.2  Avg nsec/op 125 TPUT ops/sec 7938049.67
Total rops 479999448  Secs 31.0  Avg nsec/op 64 TPUT ops/sec 15474601.85

Test Summary: Total wops 1859997861  Total rops  3159996366 Runtime seconds: 31 wops/s 59615316 rops/s 101281934

That was a bit bumpy. I re-ran it with affinity (taskset) and collected the following results:

$ taskset -pc 0-5 $$
pid 15320's current affinity list: 0-23
pid 15320's new affinity list: 0-5
$ sh ./runit.sh 6
Users: 6
Buffer area size  524288 KB  ADDR 0x2B28784C4010
Waiting for semaphore...
Total wops 379999563  Secs 31.0  Avg nsec/op 81 TPUT ops/sec 12238869.35
Total rops 499999425  Secs 30.2  Avg nsec/op 60 TPUT ops/sec 16580155.46
Buffer area size  524288 KB  ADDR 0x2B1241B67010
Waiting for semaphore...
Total wops 379999563  Secs 31.4  Avg nsec/op 82 TPUT ops/sec 12118541.38
Total rops 499999425  Secs 30.4  Avg nsec/op 60 TPUT ops/sec 16446948.61
Buffer area size  524288 KB  ADDR 0x2B4893BFD010
Waiting for semaphore...
Total wops 379999563  Secs 31.3  Avg nsec/op 82 TPUT ops/sec 12136661.49
Total rops 499999425  Secs 30.5  Avg nsec/op 60 TPUT ops/sec 16403891.60
Buffer area size  524288 KB  ADDR 0x2B94FD5AA010
Waiting for semaphore...
Total wops 379999563  Secs 31.0  Avg nsec/op 81 TPUT ops/sec 12272774.98
Total rops 519999402  Secs 30.9  Avg nsec/op 59 TPUT ops/sec 16820126.30
Buffer area size  524288 KB  ADDR 0x2B0D09454010
Waiting for semaphore...
Total wops 379999563  Secs 31.4  Avg nsec/op 82 TPUT ops/sec 12107983.29
Total rops 499999425  Secs 30.5  Avg nsec/op 61 TPUT ops/sec 16368642.72
Buffer area size  524288 KB  ADDR 0x2AAD4513E010
Waiting for semaphore...
Total wops 379999563  Secs 31.4  Avg nsec/op 82 TPUT ops/sec 12097160.14
Total rops 499999425  Secs 30.6  Avg nsec/op 61 TPUT ops/sec 16354937.65

Test Summary: Total wops 2279997378  Total rops  3019996527 Runtime seconds: 31 wops/s 72611381 rops/s 96178233

That result was a lot smoother and the wops (write ops per second) improved by 22%. The rops (read ops per second), on the other hand, suffered a small 5% degradation. I'll blog further about that in another post.

Other Results?
It sure would be nice if folks could try this out on other platforms. I’ve compiled and run it on Power6 so I know that it works on AIX 5L.
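
For anyone who wants to try it, the workflow is simply unpack, build as described in the README, and run the wrapper script. The archive and directory names below are assumptions, so adjust them to whatever the kit actually unpacks to:

# archive and directory names are assumptions; see the README in the kit for the real build steps
tar xzf SLB.tar.gz
cd SLB
make                 # build as per the README
./runit.sh 2         # run 2 SLB processes, each hammering a 512MB buffer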

Configuring device mapper multipath on OEL5 update 5

I have always wondered how to configure the device mapper multipath package for a Linux system. I knew how to do it in principle, but had never been involved in a configuration from the ground up. Today I got the chance to work on this. The system is used for lab testing rather than production (otherwise I probably wouldn't have been allowed on it), and it's actually part of a two-node cluster.

So the first step is to find out which partitions are visible to the system. The Linux kernel presents this information in the /proc/partitions table, as in the following example:


[root@node1 ~]# cat /proc/partitions
major minor  #blocks  name

 104     0   71652960 cciss/c0d0
 104     1     152586 cciss/c0d0p1
 104     2   71497282 cciss/c0d0p2
 8     0       2880 sda
 8    16  190479360 sdb
 8    32   23809920 sdc
 8    48   23809920 sdd
 8    64   23809920 sde
 8    80   23809920 sdf
 8    96   23809920 sdg
 8   112    1048320 sdh
 8   128    1048320 sdi
 8   144    1048320 sdj
 8   160       2880 sdk
 8   176  190479360 sdl
 8   192   23809920 sdm
 8   208   23809920 sdn
 8   224   23809920 sdo
 8   240   23809920 sdp
 65     0   23809920 sdq
 65    16    1048320 sdr
 65    32    1048320 sds
 65    48    1048320 sdt
 253     0    5111808 dm-0
 253     1   25591808 dm-1
 253     2   10223616 dm-2
 253     3    1015808 dm-3
 253     4   16777216 dm-4
[root@node1 ~]#

With a keen eye you can see that sdk is the same size as sda, which probably means we have two paths to each of sda through sdj. We'll confirm this later. The more HBAs and paths you have, the more partitions you are going to see. This is where the multipathing software comes into play: it abstracts away the physical paths and presents a single logical device, and it offers some additional goodies such as path failover and limited load balancing.

Before proceeding I checked the status of the multipath daemon:

[root@node1 ~]# service multipathd status
multipathd is stopped
[root@node1 ~]# chkconfig --list multipathd
multipathd      0:off   1:off   2:off   3:off   4:off   5:off   6:off
[root@node1 ~]# chkconfig multipathd on

As you can see it was not started, and would not have started after a reboot either, so it was necessary to enable the service at boot time using the chkconfig command. This automatically creates links in /etc/rc.d/rcX.d to start and stop the service. As an additional benefit, chkconfig respects the dependencies the authors of the startup script have defined and creates the {K,S}xxmultipathd links accordingly.
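
If you want to verify what chkconfig did, something along these lines shows the service enabled for the usual runlevels and the generated start/kill links (runlevel 3 is just an example):

# verify the service is now enabled at boot
chkconfig --list multipathd

# inspect the generated start/kill symlinks for one runlevel (runlevel 3 as an example)
ls -l /etc/rc.d/rc3.d/ | grep multipathd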

I next loaded the necessary kernel modules, dm-multipath and dm-round-robin:

[root@node1 ~]# modprobe dm-multipath
[root@node1 ~]# modprobe dm-round-robin

With the multipathing setup nearly done, I need to get the WWIDs of all attached devices. At some point the WWIDs start repeating, and that is where you stop creating meta devices. Let's have a look at the output first. You have to change directory to /sys, as the scsi_id commands are relative to it.

[node1 sys]# for i in `cat /proc/partitions | awk '{print $4}' |grep sd`; do echo "### $i: `scsi_id -g -u -s /block/$i`"; done
### sda: 360000970000294900664533030303238
### sdb: 360000970000294900664533030344133
### sdc: 360000970000294900664533030344142
### sdd: 360000970000294900664533030344143
### sde: 360000970000294900664533030344144
### sdf: 360000970000294900664533030344239
### sdg: 360000970000294900664533030344241
### sdh: 360000970000294900664533030344244
### sdi: 360000970000294900664533030344245
### sdj: 360000970000294900664533030344246
### sdk: 360000970000294900664533030303238
### sdl: 360000970000294900664533030344133
### sdm: 360000970000294900664533030344142
### sdn: 360000970000294900664533030344143
### sdo: 360000970000294900664533030344144
### sdp: 360000970000294900664533030344239
### sdq: 360000970000294900664533030344241
### sdr: 360000970000294900664533030344244
### sds: 360000970000294900664533030344245
### sdt: 360000970000294900664533030344246
[node1 sys]#

Here you see again that sda and sdk have the same WWID. I like to assign alias names to the multipath devices; that makes it easier to find out what they are used for. I now have to get the disk sizes and map them to their intended use.

Getting disk sizes:

[node1 sys]# fdisk -l 2>/dev/null | grep ^Disk
Disk /dev/cciss/c0d0: 73.3 GB, 73372631040 bytes        local
Disk /dev/sda: 2 MB, 2949120 bytes                ignore
Disk /dev/sdb: 195.0 GB, 195050864640 bytes
Disk /dev/sdc: 24.3 GB, 24381358080 bytes
Disk /dev/sdd: 24.3 GB, 24381358080 bytes
Disk /dev/sde: 24.3 GB, 24381358080 bytes
Disk /dev/sdf: 24.3 GB, 24381358080 bytes
Disk /dev/sdg: 24.3 GB, 24381358080 bytes
Disk /dev/sdh: 1073 MB, 1073479680 bytes
Disk /dev/sdi: 1073 MB, 1073479680 bytes
Disk /dev/sdj: 1073 MB, 1073479680 bytes
Disk /dev/sdk: 2 MB, 2949120 bytes                ignore
Disk /dev/sdl: 195.0 GB, 195050864640 bytes
Disk /dev/sdm: 24.3 GB, 24381358080 bytes
Disk /dev/sdn: 24.3 GB, 24381358080 bytes
Disk /dev/sdo: 24.3 GB, 24381358080 bytes
Disk /dev/sdp: 24.3 GB, 24381358080 bytes
Disk /dev/sdq: 24.3 GB, 24381358080 bytes
Disk /dev/sdr: 1073 MB, 1073479680 bytes
Disk /dev/sds: 1073 MB, 1073479680 bytes
Disk /dev/sdt: 1073 MB, 1073479680 bytes

The cleared, consolidated view on the storage:

### sdb: 360000970000294900664533030344133    195G
### sdc: 360000970000294900664533030344142    24.3G
### sdd: 360000970000294900664533030344143    24.3G   
### sde: 360000970000294900664533030344144    24.3G   
### sdf: 360000970000294900664533030344239    24.3G   
### sdg: 360000970000294900664533030344241    24.3G   
### sdh: 360000970000294900664533030344244    1G
### sdi: 360000970000294900664533030344245    1G
### sdj: 360000970000294900664533030344246    1G

### sdl: 360000970000294900664533030344133    repeat - second path
### sdm: 360000970000294900664533030344142
### sdn: 360000970000294900664533030344143
### sdo: 360000970000294900664533030344144
### sdp: 360000970000294900664533030344239
### sdq: 360000970000294900664533030344241
### sdr: 360000970000294900664533030344244
### sds: 360000970000294900664533030344245
### sdt: 360000970000294900664533030344246

Finally here's the mapping I will use:

  • sdb    DATA001
  • sdc    REDO001
  • sdd     FRA001
  • sde    FRA002
  • sdf    ACFS001
  • sdg    ACFS002
  • sdh, sdi, sdj    VOTINGOCR{1,2,3}

The mapping between WWID and alias happens in the /etc/multipath.conf file. The defaults section has been taken from MOS note 555603.1. The devnode_blacklist section has to be set up according to your storage configuration; in my case I ignore IDE devices and the internal RAID adapter.

[root@node1 ~]# cat /etc/multipath.conf
defaults {
 udev_dir                /dev
 polling_interval        10
 selector                "round-robin 0"
 path_grouping_policy    multibus
 getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
 prio_callout            /bin/true
 path_checker            readsector0
 rr_min_io               100
 rr_weight               priorities
 failback                immediate
 no_path_retry           fail
 user_friendly_names     no
}

devnode_blacklist {
 devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
 devnode "^hd[a-z]"
 devnode "^cciss!c[0-9]d[0-9]*"
}

multipaths {
 multipath {
 wwid 360000970000294900664533030344133
 alias data001
 path_grouping_policy failover
 }
 multipath {
 wwid 360000970000294900664533030344142
 alias redo001
 path_grouping_policy failover
 }
 multipath {
 wwid 360000970000294900664533030344143
 alias fra001
 path_grouping_policy failover
 }
 multipath {
 wwid 360000970000294900664533030344144
 alias fra002
 path_grouping_policy failover
 }
 multipath {
 wwid 360000970000294900664533030344239
 alias acfs001
 path_grouping_policy failover
 }
 multipath {
 wwid 360000970000294900664533030344241
 alias acfs002
 path_grouping_policy failover
 }
 multipath {
 wwid 360000970000294900664533030344244
 alias votingocr001
 path_grouping_policy failover
 }
 multipath {
 wwid 360000970000294900664533030344245
 alias votingocr002
 path_grouping_policy failover
 }
 multipath {
 wwid 360000970000294900664533030344246
 alias votingocr003
 path_grouping_policy failover
 }
}

The mapping is really simple: for each device you use, create a "multipath" section and enter the WWID, an alias and a path grouping policy. Done! See if that worked by starting the multipath daemon:

[root@node1 ~]# service multipathd start

As always, /var/log/messages is a good place to check:

Nov 16 16:34:58 loninengblc204 kernel: device-mapper: table: 253:5: multipath: error getting device
Nov 16 16:34:58 loninengblc204 kernel: device-mapper: ioctl: error adding target to table
Nov 16 16:34:58 loninengblc204 multipathd: 360000970000294900664533030303238: load table [0 5760 multipath 0 0 1 1 round-robin 0 2 1 8:0 1000 8:160 1000]
Nov 16 16:34:58 loninengblc204 multipathd: data001: load table [0 380958720 multipath 0 0 2 1 round-robin 0 1 1 8:16 1000 round-robin 0 1 1 8:176 1000]
Nov 16 16:34:58 loninengblc204 multipathd: redo001: load table [0 47619840 multipath 0 0 2 1 round-robin 0 1 1 8:32 1000 round-robin 0 1 1 8:192 1000]
Nov 16 16:34:58 loninengblc204 multipathd: fra001: load table [0 47619840 multipath 0 0 2 1 round-robin 0 1 1 8:48 1000 round-robin 0 1 1 8:208 1000]
Nov 16 16:34:58 loninengblc204 multipathd: fra002: load table [0 47619840 multipath 0 0 2 1 round-robin 0 1 1 8:64 1000 round-robin 0 1 1 8:224 1000]
Nov 16 16:34:58 loninengblc204 multipathd: acfs001: load table [0 47619840 multipath 0 0 2 1 round-robin 0 1 1 8:80 1000 round-robin 0 1 1 8:240 1000]
Nov 16 16:34:58 loninengblc204 multipathd: acfs002: load table [0 47619840 multipath 0 0 2 1 round-robin 0 1 1 8:96 1000 round-robin 0 1 1 65:0 1000]
Nov 16 16:34:58 loninengblc204 multipathd: votingocr001: load table [0 2096640 multipath 0 0 2 1 round-robin 0 1 1 8:112 1000 round-robin 0 1 1 65:16 1000]
Nov 16 16:34:58 loninengblc204 multipathd: votingocr002: load table [0 2096640 multipath 0 0 2 1 round-robin 0 1 1 8:128 1000 round-robin 0 1 1 65:32 1000]
Nov 16 16:34:58 loninengblc204 multipathd: votingocr003: load table [0 2096640 multipath 0 0 2 1 round-robin 0 1 1 8:144 1000 round-robin 0 1 1 65:48 1000]
Nov 16 16:34:58 loninengblc204 multipathd: 360000970000294900664533030303238: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: data001: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: redo001: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: fra001: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: fra002: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: acfs001: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: acfs002: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: votingocr001: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: votingocr002: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: votingocr003: event checker started
Nov 16 16:34:58 loninengblc204 multipathd: path checkers start u

Great - are all paths working?

[root@node1 ~]# multipath -ll | head
fra002 (360000970000294900664533030344144) dm-9 EMC,SYMMETRIX
[size=23G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:4 sde 8:64  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:4 sdo 8:224 [active][ready]
fra001 (360000970000294900664533030344143) dm-8 EMC,SYMMETRIX
[size=23G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:3 sdd 8:48  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:3 sdn 8:208 [active][ready]
acfs002 (360000970000294900664533030344241) dm-11 EMC,SYMMETRIX
[size=23G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:6 sdg 8:96  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:6 sdq 65:0  [active][ready]

Congratulations - distribute the working multipath.conf to all cluster nodes and start multipathd.
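
A small loop over scp/ssh does the distribution; the node name below is a placeholder for the other member(s) of your cluster:

# node2 is the only other member of this two-node cluster; adjust the list for yours
for node in node2; do
    scp /etc/multipath.conf ${node}:/etc/multipath.conf
    ssh ${node} "chkconfig multipathd on && service multipathd start"
done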

The beauty of this over a solution such as PowerPath is that the device names are consistent across the cluster. With PowerPath I have come across a situation where /dev/rdsk/emcpower1a on node1 was /dev/rdsk/emcpower4a on node2 and a different device again on the other nodes. Not really user friendly, but not a big issue with ASM either: it reads the information from the disk headers anyway. It was more of a problem before 11.2, when you had to use block devices to store the OCR and voting files.

The Law of Logical Argument…

I received a mail today containing a list of laws (in the scientific sense), all of which were silly observations just for amusement, but this one stood out.

“The Law of Logical Argument – Anything is possible if you don’t know what you are talking about.”

Although not directly relevant, that got me thinking… I answer a lot of questions on forums and one of the things that strikes me time and again is that people find it really difficult to define and communicate their problem. In some cases there's a language barrier to contend with, but most of the time it comes down to a basic lack of thought by the poster. How many times have you spent days following a forum thread only to find the question you were answering was not the question the original poster "thought" they had asked? This is not just a feature of forum posts; it happens in everyday interactions too. People will spend hours skating around issues, but never formally outline what the issue is. In many cases, the process of clearly defining the issue actually allows them to answer the question for themselves, or at least highlights a possible path of investigation.

Unless you can clearly define a problem, there is no way you can communicate it to anyone else. What’s more, unless you have an understanding of your problem, you can’t judge the value of answers, making you susceptible to suggested voodoo solutions.

Over the years I've had the good fortune of interacting with a lot of people I consider Uber Geeks, but what continually strikes me is the simplicity of their approach to solving problems. They may sometimes look like they are making leaps, but that's just because experience allows them to do the steps quickly in their heads rather than having to draft them out on paper. Most of the time they are following a fairly simple recipe to solve the problem, the first stage of which is actually defining the problem itself.

Is it surprising that database performance tuning is still considered by many to be some form of black magic? Not at all, because it is one of those areas that requires thought, reason and a meticulous approach.

Enough ranting for today. I’m off to run a database on a single 1TB disk and wonder why my I/O throughput is so bad. Must need a faster CPU. :)

Cheers

Tim…

PS. Someone on Twitter recently linked to an article called The Lost Tools of Learning. It's quite long and a bit heavy at times, but some of the points it makes tie in with this post and I think it's worth a read.

UltraEdit Beta on Mac…

I signed up for the beta program, so now I’m rockin’ UltraEdit (beta 5) on Mac. It’s not perfect yet, but it’s already so much nicer than TextWrangler.

Cheers

Tim…

Oracle RAC One Node revisited – 11.2.0.2

Since we published the RAC book, Oracle has released patchset 11.2.0.2. Amongst other things, it improved the RAC One Node option in exactly the way we expected.

How it was

A quick recap on the product as it was in 11.2.0.1: RAC One Node is part of Oracle Enterprise Edition; other software editions are explicitly not allowed. Another restriction concerns 3rd party clusterware: using one is not allowed. RAC One Node is a hybrid between full blown RAC and an active/passive cluster. The option uses Grid Infrastructure for cluster management and storage provisioning via ASM. The RAC One instance starts its life as a RAC database limited to a single cluster node. It only ever runs on one node at a time, but that node can change. It is strongly recommended to create a service for that RAC database. Utilities such as raconeinit provide a text-based command line interface to transform that database into a "RAC One Node" instance; in the process, the administrator can elect which nodes should be allowed to run the instance. The omotion utility allowed the DBA to move the RAC One Node instance from the current node to another one. Optionally, a time threshold could be set after which all ongoing transactions had to move to the new node; this feature required TAF or FAN to be set up correctly. The raconestatus utility allowed you to view the status of your RAC One Node instances, and conversion to full RAC was made possible by the racone2rac utility.

If you were after a Data Guard setup you’d be disappointed: that wasn’t (and AFAIK still is not) supported.

So all in all, that seemed a little premature. A patch to be downloaded and applied, no Data Guard support and a new set of utilities are not really user friendly. Plus, initially the patch was available for Linux only. But at least a MOS note (which I didn't find until after I had finished writing this!) exists: RAC One — Changes in 11.2.0.2 [ID 1232802.1].

Changes

Instead of having to apply patch 9004119 to your environment, RAC One Node is available "out of the box" with 11.2.0.2. Sadly, the Oracle RAC One Node manual has not been updated, and searches on Metalink reveal no new information. One interesting detail: the patch for RAC One Node is listed under the "undocumented Oracle Server" section.

The creation of a RAC One Node instance has been greatly simplified: dbca has added support for it, both from the command line for silent installations and in the interactive GUI. Consider these options for dbca:

$ dbca -help
dbca  [-silent | -progressOnly | -customCreate] {  }  |
 { [ [options] ] -responseFile   }
 [-continueOnNonFatalErrors ]
Please refer to the manual for details.
You can enter one of the following command:

Create a database by specifying the following parameters:
-createDatabase
 -templateName 
 [-cloneTemplate]
 -gdbName 
 [-RACOneNode
 -RACOneNodeServiceName  ]
 [-policyManaged | -adminManaged ]
 [-createServerPool ]
 [-force ]
 -serverPoolName 
 -[cardinality ]
 [-sid ]
...

With RAC One Node you will most likely end up with a policy-managed database in the end; I can't see how an admin-managed database would make sense.
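
For illustration, a silent creation pieced together from the options shown above might look roughly like the following. The template, passwords, server pool and service names are placeholders I've invented, so check everything against your own dbca help output before using it:

# sketch only - names and passwords are placeholders, not taken from the original post
dbca -silent -createDatabase \
     -templateName General_Purpose.dbc \
     -gdbName rontest \
     -sysPassword change_me -systemPassword change_me \
     -RACOneNode \
     -RACOneNodeServiceName rontestsrv \
     -policyManaged \
     -createServerPool \
     -serverPoolName rontest \
     -cardinality 1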

The srvctl command line tool has been improved to deal with RAC One Node. The most important operations are add, remove, config and status. The nice thing about dbca is that it actually registers the database in the OCR. Immediately after the installation, you see this status information:

$ srvctl status database -d rontest
Instance rontest_1 is running on node node2
Online relocation: INACTIVE

$ srvctl config database -d rontest
Database unique name: rontest
Database name:
Oracle home: /data/oracle/product/11.2.0.2
Oracle user: oracle
Spfile: +DATA/rontest/spfilerontest.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: rontest
Database instances:
Disk Groups: DATA
Mount point paths:
Services: rontestsrv
Type: RACOneNode
Online relocation timeout: 30
Instance name prefix: rontest
Candidate servers: node2,node3
Database is administrator managed

Note that the instance name, even though the database is administrator managed, changed to ${ORACLE_SID}_1. Relocation now works with the srvctl relocate database command, as in this example:

$ srvctl relocate database -d rontest -n node2

You’ll get feedback about this in the output of the “status” command:

$ srvctl status database -d rontest
Instance rontest_1 is running on node node2
Online relocation: ACTIVE
Source instance: rontest_1 on node2
Destination instance: rontest_2 on node3

After the command completed, check the status again:

srvctl status database -d rontest
Instance rontest_2 is running on node node2
Online relocation: INACTIVE

The important difference between an admin-managed and a policy-managed database is that with admin management you are responsible for the undo tablespaces yourself. If you don't create and configure undo tablespaces for each instance, the relocate command will fail:

$ srvctl relocate database -d rontest -n node3
PRCD-1222 : Online relocation of database rontest failed but database was restored to its original state
PRCD-1129 : Failed to start instance rontest_2 for database rontest
PRCR-1064 : Failed to start resource ora.rontest.db on node node3
CRS-5017: The resource action "ora.rontest.db start" encountered the following error:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-30013: undo tablespace 'UNDOTBS1' is currently in use
Process ID: 1587
Session ID: 35 Serial number: 1

CRS-2674: Start of 'ora.rontest.db' on 'node3' failed

In this case the database keeps running on its original node. Check the ORACLE_SID of the destination instance (rontest_2 in my case) and correct the initialisation parameter.

SQL> select tablespace_name from dba_data_files where tablespace_name like '%UNDO%';

TABLESPACE_NAME
------------------------------
UNDOTBS1
UNDOTBS2

So the tablespace was there, but the initialisation parameter was wrong! Let’s correct this:

SQL> alter system set undo_tablespace='UNDOTBS1' sid='rontest_1';

System altered.

SQL> alter system set undo_tablespace='UNDOTBS2' sid='rontest_2';

System altered.

Now the relocate will succeed.

To wrap this article up, the srvctl convert database command will convert between single instance, RAC One Node and RAC databases.
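
The post doesn't show the conversion syntax, but based on the 11.2.0.2 srvctl options a conversion might look roughly like this; the -c, -i and -w values are my assumptions, so verify with srvctl convert database -h on your system:

# convert the RAC One Node database to a full RAC database (sketch only)
srvctl convert database -d rontest -c RAC

# ...and back to RAC One Node (instance prefix and relocation timeout are illustrative)
srvctl convert database -d rontest -c RACONENODE -i rontest -w 30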

An investigation into Exadata, part 2

It's been a while since I posted the first part of this series.

Considerations

I created a little tool, snap, which is more or less a combination of Tanel Poder's snapper and a script from Tom Kyte. I used this in the first part to give me all the details of the database process I was using. The next step is invoking Oracle's Parallel Query feature. Parallel Query means multiple processes work together to execute a query.

When measuring the session with snap, I only get the statistics of "my" database session, not the statistics of all the Parallel Query slaves working for me. That means I get too little detail when measuring only my own session. The only easy way I could think of to measure all relevant statistics is to gather the same statistics as my snap tool, but globally (global meaning all processes in all instances of the database). The result is gsnap: global snap. In my case I am the only user in the database, so the global stats cover my session, the parallel query slaves I have invoked and the background processes. If multiple sessions are working in the database, you cannot use gsnap statistics to determine what your session has done, because it's unknown which session contributed what to the statistics.

Another interesting thing came up during the investigation: when looking in the SQL area (v$sqlarea for example), the statistics stored with the SQL are the statistics of all the sessions involved (including the parallel query slaves), accumulated for that instance. While this sounds logical, it means that the elapsed time stored with the SQL is not the same as the elapsed time the requesting session experienced. To get the average(!) 'real' elapsed time, you need to divide the elapsed time by the number of PX slaves used. The number of slaves can be found in the 'PX_SERVERS_EXECUTIONS' column. To get the number of slaves used per execution (on average, since it is accumulated), divide by 'EXECUTIONS'. In the case of RAC, the number of executions is only available in the instance in which the SQL was executed; the other instances report '0'.
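
That calculation translates into a simple query against v$sqlarea; a minimal sketch run from the shell (the sql_id is a placeholder):

sqlplus -s "/ as sysdba" <<'EOF'
-- sql_id below is a placeholder; substitute the statement you are investigating
select sql_id,
       elapsed_time/1e6                                    total_elapsed_seconds,
       px_servers_executions/nullif(executions,0)          avg_px_slaves_per_exec,
       (elapsed_time/1e6)/nullif(px_servers_executions,0)  approx_real_elapsed_seconds
from   v$sqlarea
where  sql_id = 'abcd1234efgh5';
EOF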

In this part I investigate the execution of parallel query on a half rack Exadata, which means 4 database instances. When measuring statistics across 4 instances there is a lot of information. A real lot. For the sake of readability I do not include all statistics, only the ones I deem important or interesting.

Parallel query slave usage 'spills' to other instances in a RAC database at a certain point. When the parallel degree is set to 2-16 the slaves remain local; when it is set to 17, all the instances are used. There are no other sessions in any of the instances. The parameter 'PARALLEL_DEGREE_LIMIT' is set to CPU, and 16 is the number of CPU threads on an Exadata V2 database node (2 sockets, 8 cores, 16 threads per node).

In the first part I got a response time of 37 seconds. In fact the table has changed a bit since the first post, but consider it roughly the same (the same query now takes 33 seconds).

Parallel 16 – local node

As Kevin Closson points out in his comment, the query in the first part was CPU bound, and could be made faster if I could find extra CPU cycles or a faster CPU. Well, I am not going to change the hardware. But wait! I have 16 CPU threads, and the query only uses one. So the obvious next step is to use these threads. To do this, I am going to use Oracle's Parallel Query functionality. This can be done in several ways; I chose to use the parallel hint in the SQL:

select /*+parallel(16)*/ count(*) from cg_var where sample_id=1;

Wow, the response time now is 13.32 seconds!

What is the result of this parallelisation?

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       |   307K(100)|          |        |      |            |
|   1 |  SORT AGGREGATE                |          |     1 |     3 |            |          |        |      |            |
|   2 |   PX COORDINATOR               |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000 |     1 |     3 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE             |          |     1 |     3 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR         |          |   399M|  1141M|   307K  (1)| 01:01:26 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS STORAGE FULL| CG_VAR   |   399M|  1141M|   307K  (1)| 01:01:26 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------                                                                                                                                                                                     

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                                                                                                                                                                                          

   6 - storage(:Z>=:Z AND :Z<=:Z)
       filter("SAMPLE_ID"=1)                                                                                                                                                                                                                                                                                 

Note
-----
   - Degree of Parallelism is 16 because of hint

What do the statistics look like?


i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
1,STAT physical read total IO requests			   127,557		    7,645.497
1,STAT physical read total multi block requests 	   127,445		    7,638.784
1,STAT physical read requests optimized 			14			 .839
1,STAT physical read total bytes		   133,276,368,896	    7,988,304,014.076
1,STAT physical write total IO requests 			24			1.439
1,STAT physical write total bytes			   372,736		   22,341.008
1,STAT cell physical IO interconnect bytes	     4,678,771,904	      280,435,704.328
1,STAT cell physical IO bytes eligible for pred    133,275,516,928	    7,988,252,948.914
1,STAT cell physical IO bytes saved by storage		14,680,064		  879,892.025
1,STAT cell smart IO session cache lookups		       192		       11.508
1,STAT cell smart IO session cache hits 		       192		       11.508
1,STAT cell physical IO interconnect bytes retu      4,676,846,784	      280,320,316.702
1,STAT cell session smart scan efficiency		       447		       26.792
1,STAT table scans (short tables)				 1			 .060
1,STAT table scans (long tables)			       208		       12.467
1,STAT table scans (rowid ranges)			       208		       12.467
1,STAT table scans (direct read)			       208		       12.467
1,STAT table scan rows gotten			       391,553,915	       23,468,914.533
1,STAT table scan blocks gotten 			 3,995,742		  239,496.335
1,STAT cell scans					       208		       12.467
1,STAT cell blocks processed by cache layer		16,270,709		  975,231.927
1,STAT cell blocks processed by txn layer		16,270,709		  975,231.927
1,STAT cell blocks processed by data layer		16,267,256		  975,024.961
1,STAT cell blocks helped by minscn optimizatio 	16,270,709		  975,231.927
1,STAT cell simulated session smart scan effici    133,261,361,152	    7,987,404,481.604
1,STAT cell IO uncompressed bytes		   133,261,361,152	    7,987,404,481.604
1,WAIT cell smart table scan				       185.188		       11.100
1,WAIT PX Idle Wait					       426.043		       25.536
1,WAIT PX Deq: Join ACK 					  .007			 .000
1,WAIT PX Deq Credit: send blkd 				  .003			 .000
1,WAIT PX Deq: Parse Reply					  .006			 .000
1,WAIT PX Deq: Execute Reply					13.104			 .785
1,WAIT enq: CF - contention					  .000			 .000
1,WAIT PX Deq: Signal ACK RSG					  .000			 .000
1,WAIT PX Deq: Signal ACK EXT					  .011			 .001
1,WAIT PX Deq: reap credit					  .003			 .000
1,WAIT PX Nsq: PQ load info query				  .200			 .012
1,WAIT PX Deq: Slave Session Stats				  .053			 .003
1,WAIT enq: PS - contention					  .009			 .001
1,WAIT enq: TO - contention					  .000			 .000
1,TIME repeated bind elapsed time				  .000			 .000
1,TIME parse time elapsed					  .009			 .001
1,TIME background cpu time					  .037			 .002
1,TIME PL/SQL execution elapsed time				  .001			 .000
1,TIME DB CPU							22.180			1.329
1,TIME sql execute elapsed time 			       207.183		       12.418
1,TIME DB time						       207.355		       12.428
1,TIME background elapsed time					  .305			 .018

As I’ve stated above, the query response time was 13.32 seconds. Simplistic profile:

 

elapsed time                        : 207
CPU                                 :  22
wait / cell smart table scan        : 185

 

Apparently, I've spent 207 seconds of total elapsed time (foreground and PX slaves). 207/16 = 12.94, a little less than the response time (13.32).
The profile shows that by enlarging the CPU capacity with PX, the bottleneck shifts, probably towards storage. Are we pushing the storage? Let's see: total amount processed: 133,276,368,896 bytes; IO eliminated by storage indexes: 14,680,064 bytes; that leaves 133,261,688,832 bytes.
We have 7 storage servers, so divide by 7: 133,261,688,832/7 = 19,037,384,119 bytes, the average amount of IO done by every cell for this query. Dividing that by the total time the query took (13.32 seconds) gives 19,037,384,119/13.32 = 1,429,233,042 bytes per second, so approximately 1.4GB per second per cell is physically read. I managed to get 1.5GB/s out of a cell during other tests, so we are near the maximum.
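
For what it's worth, the same arithmetic as a tiny awk snippet so it can be re-run for other tests (the byte counts and elapsed time are the figures quoted above, and 7 is the number of cells in a half rack):

TOTAL_BYTES=133276368896   # physical read total bytes
SAVED_BYTES=14680064       # cell physical IO bytes saved by storage index
CELLS=7                    # storage cells in a half rack
ELAPSED=13.32              # query response time in seconds

awk -v t=$TOTAL_BYTES -v s=$SAVED_BYTES -v c=$CELLS -v e=$ELAPSED \
    'BEGIN { per_cell = (t - s) / c;
             printf "per cell: %.0f bytes, %.2f GB/s\n", per_cell, per_cell / e / 1e9 }'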

This would mean that if I increase the degree of parallelism further, which means multiple instances are used, the response time will not decrease significantly further. Well, let's see:

Parallel 64 – the entire cluster

select /*+parallel(64)*/ count(*) from cg_var where sample_id=1;

The response time is 12.88 seconds.

 

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       | 76776 (100)|          |        |      |            |
|   1 |  SORT AGGREGATE                |          |     1 |     3 |            |          |        |      |            |
|   2 |   PX COORDINATOR               |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000 |     1 |     3 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE             |          |     1 |     3 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR         |          |   399M|  1141M| 76776   (1)| 00:15:22 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS STORAGE FULL| CG_VAR   |   399M|  1141M| 76776   (1)| 00:15:22 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------                                                                                                                                                                                     

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                                                                                                                                                                                          

   6 - storage(:Z>=:Z AND :Z<=:Z)
       filter("SAMPLE_ID"=1)                                                                                                                                                                                                                                                                                 

Note
-----
   - Degree of Parallelism is 64 because of hint

 

Statistics:


---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
1,STAT physical read total IO requests			    33,306		    1,977.687
1,STAT physical read total multi block requests 	    33,197		    1,971.214
1,STAT physical read requests optimized 			 3			 .178
1,STAT physical read total bytes		    34,602,745,856	    2,054,686,537.670
1,STAT physical write total IO requests 			14			 .831
1,STAT physical write total bytes			   240,640		   14,289.033
1,STAT cell physical IO interconnect bytes	     1,208,841,744	       71,780,166.462
1,STAT cell physical IO bytes eligible for pred     34,602,024,960	    2,054,643,731.376
1,STAT cell physical IO bytes saved by storage		 3,145,728		  186,791.100
1,STAT cell smart IO session cache lookups		       200		       11.876
1,STAT cell smart IO session cache hits 		       200		       11.876
1,STAT cell physical IO interconnect bytes retu      1,207,442,960	       71,697,107.658
1,STAT cell session smart scan efficiency		       450		       26.721
1,STAT table scans (short tables)				 1			 .059
1,STAT table scans (long tables)			       216		       12.826
1,STAT table scans (rowid ranges)			       216		       12.826
1,STAT table scans (direct read)			       216		       12.826
1,STAT table scan rows gotten			       101,106,534		6,003,634.369
1,STAT table scan blocks gotten 			 1,031,921		   61,274.738
1,STAT cell scans					       216		       12.826
1,STAT cell blocks processed by cache layer		 4,223,803		  250,806.430
1,STAT cell blocks processed by txn layer		 4,223,803		  250,806.430
1,STAT cell blocks processed by data layer		 4,223,503		  250,788.616
1,STAT cell blocks helped by minscn optimizatio 	 4,223,803		  250,806.430
1,STAT cell simulated session smart scan effici     34,598,936,576	    2,054,460,345.321
1,STAT cell IO uncompressed bytes		    34,598,936,576	    2,054,460,345.321
1,WAIT cell smart table scan				       192.931		       11.456
1,WAIT library cache pin					  .006			 .000
1,WAIT library cache lock					  .002			 .000
1,WAIT library cache: mutex X					  .000			 .000
1,WAIT PX Idle Wait					       448.150		       26.611
1,WAIT PX Deq: Join ACK 					  .007			 .000
1,WAIT PX Deq Credit: send blkd 				  .003			 .000
1,WAIT PX Deq: Parse Reply					  .006			 .000
1,WAIT PX Deq: Execute Reply					12.591			 .748
1,WAIT PX Deq: Execution Msg					 3.475			 .206
1,WAIT PX Deq: Signal ACK RSG					  .000			 .000
1,WAIT PX Deq: Signal ACK EXT					  .014			 .001
1,WAIT PX Deq: reap credit					  .016			 .001
1,WAIT PX Nsq: PQ load info query				  .200			 .012
1,WAIT PX Deq: Slave Session Stats				  .087			 .005
1,WAIT enq: PS - contention					  .024			 .001
1,WAIT enq: TO - contention					  .000			 .000
1,TIME repeated bind elapsed time				  .000			 .000
1,TIME parse time elapsed					  .013			 .001
1,TIME background cpu time					  .039			 .002
1,TIME PL/SQL execution elapsed time				  .001			 .000
1,TIME DB CPU							 6.539			 .388
1,TIME sql execute elapsed time 			       199.272		       11.833
1,TIME DB time						       199.499		       11.846
1,TIME background elapsed time					  .582			 .035
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
2,STAT physical read total IO requests			    31,750		    1,885.293
2,STAT physical read total multi block requests 	    31,666		    1,880.305
2,STAT physical read requests optimized 			10			 .594
2,STAT physical read total bytes		    32,994,410,496	    1,959,184,723.276
2,STAT physical write total IO requests 			12			 .713
2,STAT physical write total bytes			   196,608		   11,674.444
2,STAT cell physical IO interconnect bytes	     1,153,927,168	       68,519,377.838
2,STAT cell physical IO bytes eligible for pred     32,993,722,368	    1,959,143,862.723
2,STAT cell physical IO bytes saved by storage		10,485,760		  622,637.001
2,STAT cell smart IO session cache lookups		       190		       11.282
2,STAT cell smart IO session cache hits 		       190		       11.282
2,STAT cell physical IO interconnect bytes retu      1,152,649,216	       68,443,493.954
2,STAT cell session smart scan efficiency		       451		       26.780
2,STAT cell scans					       206		       12.232
2,STAT cell blocks processed by cache layer		 4,026,750		  239,105.563
2,STAT cell blocks processed by txn layer		 4,026,750		  239,105.563
2,STAT cell blocks processed by data layer		 4,026,283		  239,077.832
2,STAT cell blocks helped by minscn optimizatio 	 4,026,750		  239,105.563
2,STAT cell simulated session smart scan effici     32,983,310,336	    1,958,525,603.638
2,STAT cell IO uncompressed bytes		    32,983,310,336	    1,958,525,603.638
2,WAIT cell smart table scan				       193.646		       11.499
2,WAIT library cache pin					  .006			 .000
2,WAIT library cache lock					  .006			 .000
2,WAIT library cache: mutex X					  .000			 .000
2,WAIT PX Idle Wait					       460.706		       27.356
2,WAIT PX Deq Credit: send blkd 				  .006			 .000
2,WAIT PX Deq: Execution Msg					 2.922			 .173
2,WAIT enq: CO - master slave det				  .001			 .000
2,WAIT PX Deq: reap credit					  .007			 .000
2,WAIT PX Deq: Slave Session Stats				  .114			 .007
2,WAIT enq: PS - contention					  .012			 .001
2,TIME repeated bind elapsed time				  .000			 .000
2,TIME parse time elapsed					  .020			 .001
2,TIME background cpu time					  .062			 .004
2,TIME DB CPU							 6.462			 .384
2,TIME sql execute elapsed time 			       199.424		       11.842
2,TIME DB time						       199.637		       11.854
2,TIME background elapsed time					  .282			 .017
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
3,STAT physical read total IO requests			    33,001		    1,959.576
3,STAT physical read total multi block requests 	    32,902		    1,953.697
3,STAT physical read requests optimized 			 1			 .059
3,STAT physical read total bytes		    34,282,340,352	    2,035,661,085.805
3,STAT physical write total IO requests 			12			 .713
3,STAT physical write total bytes			   196,608		   11,674.444
3,STAT cell physical IO interconnect bytes	     1,201,407,024	       71,338,698.054
3,STAT cell physical IO bytes eligible for pred     34,281,635,840	    2,035,619,252.381
3,STAT cell physical IO bytes saved by storage		 1,048,576		   62,263.700
3,STAT cell smart IO session cache lookups		       198		       11.757
3,STAT cell smart IO session cache hits 		       198		       11.757
3,STAT cell physical IO interconnect bytes retu      1,200,112,688	       71,261,841.300
3,STAT cell session smart scan efficiency		       451		       26.780
3,STAT cell scans					       214		       12.707
3,STAT cell blocks processed by cache layer		 4,185,106		  248,508.630
3,STAT cell blocks processed by txn layer		 4,185,106		  248,508.630
3,STAT cell blocks processed by data layer		 4,184,655		  248,481.850
3,STAT cell blocks helped by minscn optimizatio 	 4,185,106		  248,508.630
3,STAT cell simulated session smart scan effici     34,280,693,760	    2,035,563,312.338
3,STAT cell IO uncompressed bytes		    34,280,693,760	    2,035,563,312.338
3,WAIT cell smart table scan				       193.330		       11.480
3,WAIT library cache pin					  .002			 .000
3,WAIT library cache lock					  .003			 .000
3,WAIT library cache: mutex X					  .000			 .000
3,WAIT PX Idle Wait					       456.274		       27.093
3,WAIT PX Deq Credit: send blkd 				  .005			 .000
3,WAIT PX Deq: Execution Msg					 3.006			 .179
3,WAIT enq: CO - master slave det				  .004			 .000
3,WAIT PX Deq: reap credit					  .007			 .000
3,WAIT PX Deq: Slave Session Stats				  .086			 .005
3,WAIT enq: PS - contention					  .009			 .001
3,TIME repeated bind elapsed time				  .000			 .000
3,TIME parse time elapsed					  .011			 .001
3,TIME background cpu time					  .038			 .002
3,TIME DB CPU							 7.078			 .420
3,TIME sql execute elapsed time 			       199.347		       11.837
3,TIME DB time						       199.570		       11.850
3,TIME background elapsed time					  .449			 .027
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
4,STAT physical read total IO requests			    30,213		    1,794.027
4,STAT physical read total multi block requests 	    30,129		    1,789.039
4,STAT physical read total bytes		    31,398,838,272	    1,864,440,774.857
4,STAT physical write total IO requests 			12			 .713
4,STAT physical write total bytes			   196,608		   11,674.444
4,STAT cell physical IO interconnect bytes	     1,118,129,280	       66,393,724.606
4,STAT cell physical IO bytes eligible for pred     31,398,133,760	    1,864,398,941.433
4,STAT cell smart IO session cache lookups		       180		       10.688
4,STAT cell smart IO session cache hits 		       180		       10.688
4,STAT cell physical IO interconnect bytes retu      1,116,834,944	       66,316,867.852
4,STAT cell session smart scan efficiency		       442		       26.246
4,STAT cell scans					       196		       11.638
4,STAT cell blocks processed by cache layer		 3,833,202		  227,612.819
4,STAT cell blocks processed by txn layer		 3,833,202		  227,612.819
4,STAT cell blocks processed by data layer		 3,832,789		  227,588.296
4,STAT cell blocks helped by minscn optimizatio 	 3,833,202		  227,612.819
4,STAT cell simulated session smart scan effici     31,398,207,488	    1,864,403,319.350
4,STAT cell IO uncompressed bytes		    31,398,207,488	    1,864,403,319.350
4,WAIT cell smart table scan				       192.808		       11.449
4,WAIT library cache pin					  .007			 .000
4,WAIT library cache lock					  .002			 .000
4,WAIT library cache: mutex X					  .000			 .000
4,WAIT PX Idle Wait					       444.140		       26.373
4,WAIT PX Deq Credit: send blkd 				  .005			 .000
4,WAIT PX Deq: Execution Msg					 4.306			 .256
4,WAIT enq: CO - master slave det				  .004			 .000
4,WAIT PX Deq: reap credit					  .005			 .000
4,WAIT PX Deq: Slave Session Stats				  .082			 .005
4,WAIT enq: PS - contention					  .008			 .000
4,TIME repeated bind elapsed time				  .000			 .000
4,TIME parse time elapsed					  .019			 .001
4,TIME background cpu time					  .042			 .002
4,TIME DB CPU							 6.086			 .361
4,TIME sql execute elapsed time 			       198.045		       11.760
4,TIME DB time						       198.275		       11.773
4,TIME background elapsed time					  .538			 .032

Here we see the requests are spread over the available instances. Also, response time is only marginally lower (from 13.32 to 12.88), because the current limitation (physical disk access) hasn’t changed. The average amount of IO becomes:
Total amount of IO: 34,602,745,856+32,994,410,496+34,282,340,352+31,398,838,272=133,278,334,976
Storage index saved: 3,145,728+10,485,760+1,048,576+0=14,680,064
Amount of physical IO per cell: 133,263,654,912 bytes (the total with the storage index savings subtracted), so 133,263,654,912/7/12.88 = 1,478,079,579 bytes per second, roughly 1.4GB/s, which is approximately the same as the previous SQL.

Let’s build the same profile (four times):

 

1- sql execute elapsed: 199, cpu time: 7, cell smart table scan: 193
2- sql execute elapsed: 199, cpu time: 6, cell smart table scan: 194
3- sql execute elapsed: 199, cpu time: 7, cell smart table scan: 193
4- sql execute elapsed: 198, cpu time: 6, cell smart table scan: 193

 

As expected: we are totally IO bound.

Can we push this further? Yes! Stay tuned for part 3!

Local Indexes – 2

In the previous note on local indexes I raised a couple of questions about the problems of different partitions holding different volumes of data, and supplied a script to build some sample data that produced the following values for blevel across the partitions of a list-partitioned table.

INDEX_NAME           PARTITION_NAME           BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------------- -------------------- ---------- ----------- ----------
T1_ID                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          33       9000
                     P4                            1         339      90000
                     P5                            2        3384     900000

T1_N1                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          32       9000
                     P4                            1         314      90000
                     P5                            2        3136     900000

Q1: What do you think Oracle will record as the blevel at the global level for the two indexes ?
A1: As one of the commentators suggested, it seems to be the highest blevel recorded for any partition – in this case 2. (It’s possible that this assumption is wrong, of course, there may be some subtle weighting calculation involved – but I haven’t yet tested that hypothesis.)

Q2: If you have a query with a where clause like "id between 100 and 400 and n1 != 5" – which is designed very precisely to exclude the last, very big, partition – what value of blevel is Oracle going to use when considering the cost of using the index t1_id to access the data?
A2: As I pointed out in an earlier note on list partitioned tables, Oracle doesn’t recognise the (obvious to the human eye) option for partition pruning in this predicate, so it uses the global blevel in the calculations.

The second answer is the one that is causing me a problem – because I have a client system where almost all the data is in a "dead" partition: it has a status, stored as the partition key of a list-partitioned table, of "COMPLETE", and lots of their code includes the predicate status != 'COMPLETE'. This can make the optimizer take the wrong execution path because it uses a global blevel that has been dictated by the huge volume of data that we know we don't want to see.

The client queries are fairly complex, of course, but here’s a very trivial example demonstrating the basic problem (using the data generated by the code in the previous note – running under 11.1.0.6):

set autotrace traceonly explain

select	*
from	t1
where
	id = 99
and	n1 != 5
;

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |    14 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |    14 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |    14 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |    13 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

From Oracle's perspective it has to visit all six partitions because it can't use the most appropriate index and do partition pruning – and the final cost of this simple query is 14 because the value used (six times, in effect) for the blevel in the calculations is two; but we have inside information that tells us that this is essentially an unreasonable cost.

If Oracle were to believe that a more appropriate blevel for this query was just one then the cost would drop significantly (although in this case the plan wouldn’t change):

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |     8 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |     8 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |     8 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |     7 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

Of course for a really big system, where the “dead” partition was 200 Million rows, we might have a blevel of three:

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |    20 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |    20 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |    20 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |    19 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

Note how changing the global blevel by one makes the cost change by six – a consequence of the fact that we have six partitions with no pruning. If you're having trouble with queries against partitioned tables that don't use the right index, take a close look at the data volumes and values recorded for blevel at the global, partition and subpartition levels – it's possible that you're suffering from a bias introduced by one partition being much larger than all the rest.

If you’re wondering how I got these plans (without simply editing them) it was by using dbms_stats.set_index_stats() to change the stored statistics – see “Copy Stats” for an example of the type of code needed. In cases like this, where I have better information about the data and the intent of the code than the optimizer has, I am perfectly happy to give a “more truthful” picture of the data to the optimizer by writing scripts to adjust statistics.
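
As a rough illustration (not the code from "Copy Stats"): a call to dbms_stats.set_index_stats() that overrides just the global blevel might look like the sketch below. The schema name is a placeholder, the index name is the one from the example above, and only the parameters passed are changed; everything else keeps its stored value.

sqlplus -s "/ as sysdba" <<'EOF'
begin
        dbms_stats.set_index_stats(
                ownname       => 'TEST_USER',    -- placeholder schema owner
                indname       => 'T1_ID',        -- the index from the example above
                indlevel      => 1,              -- the "more truthful" global blevel
                no_invalidate => false           -- invalidate cursors so plans are re-costed
        );
end;
/
EOF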

There are three drawbacks to such an approach, of course. First: on the next upgrade the optimizer might get smarter and make my clever little hack a liability rather than a benefit; secondly, there may be examples of application code that I haven’t noticed that might go wrong because of my hack; finally, and more importantly in the short term, I have to make sure that my code runs every time the statistics on the index are modified by any other program (such as the automatic stats collection job).

But the principle is sound – if we understand the system better than the optimizer then it’s positively important to help the optimizer in the most truthful way possible. List partitions (in a way similar to frequency histograms) are an obvious target for this type of treatment.

Returning the Days in a Month

How do you write a query to return all the days in a month as individual
rows? Given a single date, you want one row to be returned for each day of
the month into which the date falls. 



Read the full post at www.gennick.com/database.

Skyline…

Just got back from watching Skyline. I read some reviews before I went, most of which said it was terrible and something to avoid at all cost. With that message firmly embedded in my brain I went with zero positive expectations and quite enjoyed it.

Most of the reviews said the acting was really bad. It wasn’t wonderful, but it certainly wasn’t Twilight bad. The reviews said it lacked plot. Well yeah, but Cloverfield and District 9 weren’t exactly brimming with plot and they were cool. Now don’t get me wrong, I’m not heaping praise on this film, it’s certainly a B movie, but it’s by no means the worst film I’ve ever seen. It beats the hell out of any Twilight movie, but then so does a blank screen. :)

The film wears its influences very much on its sleeve. A lot of the visuals look like they were stolen heavily inspired by The Matrix and Independence Day (I really hate that film). There is a District 9 and Cloverfield feel to the film, with some of the visuals kinda stolen inspired by Cloverfield too, along with a bit of War of the Worlds for good measure.

One of the most interesting things about the film is it allegedly cost less than $10 Million to make. That’s like one episode of Friends (final season). I think the visuals are pretty sweet for that price. It also means it might make enough money, even with half empty cinemas, for a sequel to happen, which they definitely want judging by the ham-fisted hook into a sequel at the end. If one happens I will go to see it just out of curiosity.

So go expecting the worst and it won’t be quite as bad as you expected, possibly… :)

Cheers

Tim…