Oakies Blog Aggregator

EM13c Upgrade Tips- Part II

There is a flurry of excitement about Enterprise Manager 13c and I’m blown away by how many customers are looking not just to install and test out the new release, but have plans already underway to upgrade to it.  At Oracle, we’re covering a lot of information that will help make that upgrade easier and I want to make sure I share what we learn with the public so you can be as cool about upgrades as we are.


Your Repository is Going to be a Single Tenant Database 12c

  • You MUST upgrade your repository database (OMR) to 12.1.0.2.
  • Also plan on applying the latest bundle patch to the 12.1.0.2 database; don't just take the base installation and think you're done.  There were important updates in the bundle patch (one way to check what has been applied is sketched after this list).
  • Consider a software-only upgrade rather than upgrading both the software and the repository database at once, then upgrade the repository later, which decreases downtime.  The repository upgrade is done by running $OMR_HOME/sysman/install/ConfigureGC.sh at the time of the actual repository upgrade.
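If you want a quick sanity check of the repository database before starting, something along these lines shows the version and what datapatch has recorded (standard 12c dictionary views; your output will obviously differ):

-- confirm the repository (OMR) database version
select version from v$instance;

-- confirm the bundle patch / patch set updates recorded by datapatch
select patch_id, action, status, description
from   dba_registry_sqlpatch
order  by action_time;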

I Have a LOT of Agents to Upgrade!

  • Consider retaining your agents at 12.1.0.4 and then upgrading them to the 13c agent after your initial upgrade process.
  • This can save you downtime, and you can even automate it if you have provisioning and patching with DBLM.
  • Utilize the Gold Agent Image to simplify the process and ensure a consistent agent environment.  It allows a mass update of all agents to the "gold image", which is very cool!

New 13c Plugins, Too

  • With the infrastructure changes in EM13c, plugin rewrites were also required.  New plugins for 13c are being released almost daily, so verify the ones you need are available before upgrading.
  • Remember to remove any plugins that are decommissioned (obsolete).  This includes Exalogic Elastic Cloud Infrastructure and EMC Symmetrix Array.

I’m going to have a lot more tips and tricks, but these are three topics to add to your list as you plan your own Enterprise Manager 13c upgrade!

 





Amazon Web Services (AWS) : Relational Database Services (RDS) for MySQL

Here’s another video on my YouTube channel. This one is a quick run through of RDS for MySQL, a DBaaS offering from Amazon Web Services.

The video was based on this article.

If you watch the little outtake at the end you will hear me cracking up with the goofiest laugh while filming Brian 'Bex' Huff's clip. :)

Cheers

Tim…



Bitwise operations

The long-existing BITAND function is now covered in the documentation, letting you do a logical AND of two numbers, and it is also available from PL/SQL.


 

If you need other bit operations, a little Boolean math should suffice. :) Just make sure you stay within the limits of BINARY_INTEGER.


CREATE OR replace FUNCTION bitor( x IN binary_integer, y IN binary_integer ) RETURN binary_integer  AS
BEGIN
    RETURN x - bitand(x,y) + y;
END;
/

CREATE OR replace FUNCTION bitxor( x IN binary_integer, y IN binary_integer ) RETURN binary_integer  AS
BEGIN
    RETURN bitor(x,y) - bitand(x,y);
END;
/
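A quick sanity check of the helpers: with 6 (binary 110) and 3 (binary 011) you should get 2, 7 and 5 respectively.

select bitand(6,3) b_and,   -- expect 2
       bitor(6,3)  b_or,    -- expect 7
       bitxor(6,3) b_xor    -- expect 5
from   dual;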



Annoying your user base never pays off!

This post is heavily inspired by the events of #RIPTwitter and the recent Fine Brothers fiasco, but it could apply to just about any company, product or person. When I say user base, I could easily mean customers or fan base.

There is a tendency for success to breed a certain level of arrogance. I think many of us have fallen victim to that in a small way from time to time. Now magnify the level of success you might have encountered by several orders of magnitude and I think you will start to realise how disconnected most successful people and companies can become. As you become more disconnected, the normal feedback mechanisms start to break down. You are surrounded by hangers-on who act like everything you say is the word of God. Without those feedback mechanisms holding you in check, it's easy to spiral out of control. We see it again and again with popstars and actors. Recently we've seen examples of this in social media, as mentioned earlier.

In addition, companies are in the difficult position of having to be seen to grow and develop. If a company stands still, everyone, including the shareholders, believes it is dying. Balancing the needs of the shareholders and the user base is not an easy thing. Having said that, annoying your user base never pays off. Even if you think you've got away with it, it's filed away ready to be resurrected the next time you annoy them. Twitter and the Fine Brothers have had the equivalent of an extramarital affair. As a result, some will choose to leave. Some will stay, pretending they can forgive and forget, but in reality it's all just been stored in the bank for use later.

As I said at the start, this could be a post about any number of products, people or companies. It doesn’t matter who it is targeted at, the message is the same.

You can only kick a dog so many times before it bites back!

Cheers

Tim…



Big Nodes, Concurrent Parallel Execution And High System/Kernel Time

The following is probably only relevant for customers that run Oracle on big servers with lots of cores in single instance mode (this specific problem doesn't reproduce in a RAC environment; see below for an explanation why), like one of my clients that makes use of the Exadata Xn-8 servers, for example an X4-8 with 120 cores / 240 CPUs per node (but the problem also reproduced on older and smaller boxes with 64 cores / 128 CPUs per node).

They recently came up with a re-write of a core piece of application functionality. Part of this code started the same code path for different data sets, potentially several times concurrently, ending up with many sessions making use of Parallel Execution. In addition, a significant part of the queries used by this code made questionable use of Parallel Execution, in the sense that queries of very short duration were parallelized, ending up with several Parallel Execution starts per second. You could see this pattern in the AWR reports, which showed several "DFO trees" parallelized per second on average over an hour period.
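If you don't have an AWR report at hand, the underlying instance statistics tell the same story; something along these lines returns the cumulative counters that AWR turns into the per-second rates mentioned above (sample twice and divide the delta by the elapsed seconds):

select name, value
from   v$sysstat
where  lower(name) like '%parallelized%';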

When the new code was tested with production-like data volumes and patterns, in the beginning the CPU profile of such a big node (running in single instance mode) looked like this, when nothing else was running on that box:

As you can see, the node was completely CPU bound, spending most of the time in System/Kernel time. The AWR reports showed some pretty unusual PX wait events as significant:

"PX Deq: Slave Session Stats" shouldn't be a relevant wait event since it is about the PX slaves at the end of a PX execution passing an array of session statistics to the PX coordinator for aggregating the statistics on coordinator level. So obviously something was slowing down this PX communication significantly (and the excessive usage of Parallel Execution was required to see this happen).

Also some of the more complex Parallel Execution queries performing many joins and ending up with a significant number of data redistributions ran as if in slow motion, although claiming to spend 100% of their time on CPU; according to Active Session History almost 90% of that time was spent on the redistribution operations:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            98|             86|             87|

Running the same query with the same execution plan on the same data and the same box during idle times showed an almost 20 times better performance, with less than 40% of the time spent on redistribution:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            96|             38|             37|


So it looked like those queries ran into some kind of contention that wasn't instrumented in Oracle but happened outside on O/S level, showing up as CPU Kernel time - similar to what could be seen in previous versions of Oracle when spinning on mutexes.

Reducing the excessive usage of Parallel Execution showed a significant reduction in CPU time, but still the high System/Kernel time was rather questionable:

So the big question was: where was that time spent in the kernel? Answering that might give further clues.

Analysis

Running "perf top" on the node during such a run showed this profile:

  PerfTop:  129074 irqs/sec  kernel:76.4%  exact:  0.0% [1000Hz cycles],  (all, 128 CPUs)
-------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                 DSO
             _______ _____ ________________________ ___________________________________________________________

          1889395.00 67.8% __ticket_spin_lock       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            27746.00  1.0% ktime_get                /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            24622.00  0.9% weighted_cpuload         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            23169.00  0.8% find_busiest_group       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            17243.00  0.6% pfrfd1_init_locals       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            16961.00  0.6% sxorchk                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            15434.00  0.6% kafger                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            11531.00  0.4% try_atomic_semop         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            11006.00  0.4% __intel_new_memcpy       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10557.00  0.4% kaf_typed_stuff          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10380.00  0.4% idle_cpu                 /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             9977.00  0.4% kxfqfprFastPackRow       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             9070.00  0.3% pfrinstr_INHFA1          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8905.00  0.3% kcbgtcr                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8757.00  0.3% ktime_get_update_offsets /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             8641.00  0.3% kgxSharedExamine         /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             7487.00  0.3% update_queue             /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             7233.00  0.3% kxhrPack                 /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6809.00  0.2% rworofprFastUnpackRow    /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6581.00  0.2% ksliwat                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6242.00  0.2% kdiss_fetch              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6126.00  0.2% audit_filter_syscall     /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5860.00  0.2% cpumask_next_and         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5618.00  0.2% kaf4reasrp1km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5482.00  0.2% kaf4reasrp0km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5314.00  0.2% kopp2upic                /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5129.00  0.2% find_next_bit            /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             4991.00  0.2% kdstf01001000000km       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4842.00  0.2% ktrgcm                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4762.00  0.2% evadcd                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4580.00  0.2% kdiss_mf_sc              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle

Running "perf" on a number of Parallel Slaves being busy on CPU showed this profile:

     0.36%     ora_xxxx  [kernel.kallsyms]             [k] 

__ticket_spin_lock
               |
               --- __ticket_spin_lock
                  |          
                  |--99.98%-- _raw_spin_lock
                  |          |          
                  |          |--100.00%-- ipc_lock
                  |          |          ipc_lock_check
                  |          |          |          
                  |          |          |--99.83%-- semctl_main
                  |          |          |          sys_semctl
                  |          |          |          system_call
                  |          |          |          __semctl
                  |          |          |          |          
                  |          |          |           --100.00%-- skgpwpost
                  |          |          |                     kslpsprns
                  |          |          |                     kskpthr
                  |          |          |                     ksl_post_proc
                  |          |          |                     kxfprienq
                  |          |          |                     kxfpqrenq
                  |          |          |                     |          
                  |          |          |                     |--98.41%-- kxfqeqb
                  |          |          |                     |          kxfqfprFastPackRow
                  |          |          |                     |          kxfqenq
                  |          |          |                     |          qertqoRop
                  |          |          |                     |          kdstf01001010000100km
                  |          |          |                     |          kdsttgr
                  |          |          |                     |          qertbFetch
                  |          |          |                     |          qergiFetch
                  |          |          |                     |          rwsfcd
                  |          |          |                     |          qertqoFetch
                  |          |          |                     |          qerpxSlaveFetch
                  |          |          |                     |          qerpxFetch
                  |          |          |                     |          opiexe
                  |          |          |                     |          kpoal8

Running "strace" on those Parallel Slaves showed this:

.
.
.
semctl(1347842, 397, SETVAL, 0x1)       = 0
semctl(1347842, 388, SETVAL, 0x1)       = 0
semctl(1347842, 347, SETVAL, 0x1)       = 0
semctl(1347842, 394, SETVAL, 0x1)       = 0
semctl(1347842, 393, SETVAL, 0x1)       = 0
semctl(1347842, 392, SETVAL, 0x1)       = 0
semctl(1347842, 383, SETVAL, 0x1)       = 0
semctl(1347842, 406, SETVAL, 0x1)       = 0
semctl(1347842, 389, SETVAL, 0x1)       = 0
semctl(1347842, 380, SETVAL, 0x1)       = 0
semctl(1347842, 395, SETVAL, 0x1)       = 0
semctl(1347842, 386, SETVAL, 0x1)       = 0
semctl(1347842, 385, SETVAL, 0x1)       = 0
semctl(1347842, 384, SETVAL, 0x1)       = 0
semctl(1347842, 375, SETVAL, 0x1)       = 0
semctl(1347842, 398, SETVAL, 0x1)       = 0
semctl(1347842, 381, SETVAL, 0x1)       = 0
semctl(1347842, 372, SETVAL, 0x1)       = 0
semctl(1347842, 387, SETVAL, 0x1)       = 0
semctl(1347842, 378, SETVAL, 0x1)       = 0
semctl(1347842, 377, SETVAL, 0x1)       = 0
semctl(1347842, 376, SETVAL, 0x1)       = 0
semctl(1347842, 367, SETVAL, 0x1)       = 0
semctl(1347842, 390, SETVAL, 0x1)       = 0
semctl(1347842, 373, SETVAL, 0x1)       = 0
semctl(1347842, 332, SETVAL, 0x1)       = 0
semctl(1347842, 379, SETVAL, 0x1)       = 0
semctl(1347842, 346, SETVAL, 0x1)       = 0
semctl(1347842, 369, SETVAL, 0x1)       = 0
semctl(1347842, 368, SETVAL, 0x1)       = 0
semctl(1347842, 359, SETVAL, 0x1)       = 0
.
.
.

So the conclusion was: A lot of CPU time is spent spinning on the "spin lock" (critical code section) - caused by calls to "semctl" (semaphores), which are part of the PX code path and come from "ipc_lock"->"raw_lock". "strace" shows that all of the calls to "semctl" make use of the same semaphore set (first parameter), which explains the contention on that particular semaphore set (indicating that the locking granule is the semaphore set, not the semaphore).

Solution

Based on the "perf" results an Oracle engineer found a suitable, unfortunately unpublished and closed bug from 2013 for 12.1.0.2 that comes up with three different ways how to address the problem:

- Run with "cluster_database" = true: This will take a different code path which simply reduces the number of semaphore calls by two orders of magnitude. We tested this approach and it showed immediate relief on kernel time - that is the explanation why in a RAC environment this specific issue doesn't reproduce.

- Run with different "kernel.sem" settings: The Exadata boxes came with the following predefined semaphore configuration:

kernel.sem = 2048 262144 256 256

"ipcs" showed the following semaphore arrays with this configuration when starting the Oracle instance:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 12941057   oracle    640        1502     
0xd87a8935 12973826   oracle    640        1502     
0xd87a8936 12006595   oracle    640        1502    

By reducing the number of semaphores per set and increasing the number of sets, like this:

kernel.sem = 100 262144 256 4096

the following "ipcs" output could be seen:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 13137665   oracle    640        93       
0xd87a8935 13170434   oracle    640        93       
0xd87a8936 13203203   oracle    640        93       
0xd87a8937 13235972   oracle    640        93       
0xd87a8938 13268741   oracle    640        93       
0xd87a8939 13301510   oracle    640        93       
0xd87a893a 13334279   oracle    640        93       
0xd87a893b 13367048   oracle    640        93       
0xd87a893c 13399817   oracle    640        93       
0xd87a893d 13432586   oracle    640        93       
0xd87a893e 13465355   oracle    640        93       
0xd87a893f 13498124   oracle    640        93       
0xd87a8940 13530893   oracle    640        93       
0xd87a8941 13563662   oracle    640        93       
0xd87a8942 13596431   oracle    640        93       
0xd87a8943 13629200   oracle    640        93       
0xd87a8944 13661969   oracle    640        93       
0xd87a8945 13694738   oracle    640        93       
0xd87a8946 13727507   oracle    640        93       
0xd87a8947 13760276   oracle    640        93       
0xd87a8948 13793045   oracle    640        93       
0xd87a8949 13825814   oracle    640        93       
0xd87a894a 13858583   oracle    640        93       
0xd87a894b 13891352   oracle    640        93       
0xd87a894c 13924121   oracle    640        93       
0xd87a894d 13956890   oracle    640        93       
0xd87a894e 13989659   oracle    640        93       
0xd87a894f 14022428   oracle    640        93       
0xd87a8950 14055197   oracle    640        93       
0xd87a8951 14087966   oracle    640        93       
0xd87a8952 14120735   oracle    640        93       
0xd87a8953 14153504   oracle    640        93       
0xd87a8954 14186273   oracle    640        93       
0xd87a8955 14219042   oracle    640        93

So Oracle now allocated a lot more sets with fewer semaphores per set. We tested this configuration instead of using "cluster_database = TRUE" and got the same low kernel CPU times.

- The bug describes a third option for addressing this, which has the advantage that the host configuration doesn't need to be changed, and the configuration can be done per instance: there is an undocumented parameter "_sem_per_sem_id" that defines the upper limit of semaphores to allocate per set. By setting this parameter to a comparable value like 100 or 128 the net result ought to be the same - Oracle allocates more sets with fewer semaphores per set - but we haven't tested this option.
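For reference, the parameter changes for the first and third options would look something like this. Both parameters are static, so an instance restart is needed, and as said we only verified the first and second approaches ourselves:

-- option 1: take the RAC code path, drastically reducing the semctl calls
alter system set cluster_database = true scope = spfile;

-- option 3 (untested): cap the number of semaphores Oracle allocates per set
alter system set "_sem_per_sem_id" = 128 scope = spfile;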

Conclusion

So the bottom line was this: certain usage patterns of the Oracle instance lead to contention on spin locks at the Linux O/S level if Oracle runs in single instance mode and uses the semaphore settings recommended so far, because all semaphore calls then go for the same semaphore set. By having Oracle allocate more semaphore sets, the calls are spread over more sets, significantly reducing the contention.

There is probably some internal note available at Oracle that indicates that the default semaphore settings recommended for big nodes are not optimal for running single instance mode under certain circumstances, but I don't know if there is a definitive, official guide available yet.

This is the CPU profile of exactly the same test workload as before using the changed "kernel.sem" settings:

Also in the AWR report the unusual PX related wait events went away and performance improved significantly, in particular also for those complex queries mentioned above.

The Oracle wait interface granularity of measurement

The intention of this blogpost is to show the Oracle wait time granularity and the Oracle database time measurement granularity. One of the reasons for doing this is that the Oracle database switched from using the gettimeofday() function (up to version 11.2) to clock_gettime() to measure time.

This switch is understandable, as gettimeofday() is the kernel's best guess of the wall clock time, while clock_gettime(CLOCK_MONOTONIC,…) is a monotonically increasing timer, which means it is more precise and cannot drift backward, which gettimeofday() can do in certain circumstances, like time adjustments via NTP.

The first thing I wanted to prove was the switch from the gettimeofday() call to the clock_gettime() call. This turned out not to be as simple as I thought.

Because a lot of applications and executables need timing information, which is traditionally obtained via the gettimeofday() call, the Linux kernel maintainers understood the importance of making this call as performant as possible. Reading a (hardware) clock means you request information from a resource on your computer system, and a request like that normally requires the program to switch to kernel mode first. In order to lower the resources and time needed to get time information, the Linux kernel includes a "trick" to get that information, which is known as a virtual system call or vsyscall. Essentially this means the information is provided in userspace, so fewer resources are needed and there is no need to switch to kernel mode. James Morle has an excellent article describing this. By staying in userspace, the gettimeofday() and clock_gettime() calls are "userland" calls, and do not show up when using "strace" to look at the system calls of an executing process.

However, I said it wasn't as easy as I thought. I was looking into this and thought I could make the vsyscalls visible by echoing "0" into /proc/sys/kernel/vsyscall64. However, I am working with kernel version 3.8.13 for this part of the research, which does not have /proc/sys/kernel/vsyscall64, which means I can't turn off the vsyscall optimisation and make both gettimeofday() and clock_gettime() visible as system calls.

Searching for kernel.vsyscall64 on the internet, I found out that vsyscall64 was removed in early versions of the Linux 3 kernel. This means I can't use a simple switch to make the calls visible. So, instead of going straight to the thing I wanted to research, I got stuck doing the necessary preparation and essential preliminary investigation for it.

Can I do it in another way? Yes, this can be done using gdb, the GNU debugger. Start up a foreground (Oracle database) session, and fetch the process ID of that session and attach to it with gdb:

gdb -p PID
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
...etc...
(gdb)

Now simply break on gettimeofday and clock_gettime, and make gdb notify you it encountered the call. First 11.2.0.4:

(gdb) break gettimeofday
Breakpoint 1 at 0x332fc9c7c0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>printf "gettimeofday\n"
>continue
>end
(gdb) break clock_gettime
Breakpoint 2 at 0x3330803e10
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "clock_gettime\n"
>continue
>end
(gdb) 

You can save this gdb macro to time.gdb by executing “save breakpoints time.gdb”. Now execute “c” (continue) and enter, to make the process you attached to running again. Execute something very simple, like:

SQL> select * from dual;

This results in Oracle 11.2.0.4:

(gdb) c
Continuing.
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
...etc...

That is expected; we already knew the Oracle database executes the gettimeofday function a lot. Now let's do exactly the same, but with Oracle version 12.1.0.2. If you saved the breakpoints and macros, you can attach to an Oracle 12.1.0.2 foreground process with gdb and execute 'source time.gdb' to set the breakpoints and macros. When the 'select * from dual' is executed in this version of the database, it looks like this:

(gdb) c
Continuing.
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
...etc...

It is clearly (mostly) executing the clock_gettime() function.

The clock_gettime() function can use a variety of time sources. If you read the manpage of clock_gettime you will see that the first argument is the clock source. You can see the clock sources in the kernel source file Linux/include/uapi/linux/time.h, which shows:

/*
 * The IDs of the various system clocks (for POSIX.1b interval timers):
 */
 #define CLOCK_REALTIME                  0
 #define CLOCK_MONOTONIC                 1
 #define CLOCK_PROCESS_CPUTIME_ID        2
 #define CLOCK_THREAD_CPUTIME_ID         3
 #define CLOCK_MONOTONIC_RAW             4
 ...

The first argument of clock_gettime is the type of clock, so if I remove the commands attached to the clock_gettime breakpoint, execution stops when clock_gettime is called:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 
	breakpoint already hit 23 times
        silent
        printf "clock_gettime\n"
        c
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 
	breakpoint already hit 23 times
(gdb) c
Continuing.

Now execute something in the sqlplus session. What will happen in the gdb session is:

Breakpoint 2, 0x0000003330803e10 in clock_gettime () from /lib64/librt.so.1
(gdb)

Now look up the first argument of the call:

(gdb) print $rdi
$1 = 1

So Oracle is using CLOCK_MONOTONIC. That is not the point of this article, but it means Oracle database time measurement is granular at the microsecond level.
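That granularity is also what the wait interface itself externalises; the *_MICRO columns of the wait views show it, for example:

-- wait times are exposed with microsecond resolution
select event, total_waits, time_waited_micro
from   v$system_event
order  by time_waited_micro desc;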

Now let’s look how much time the Oracle wait interface takes itself. The Oracle wait interface is using the functions kslwtbctx() (kernel service layer wait begin context) and kslwtectx() (kernel service layer wait end context) to start and stop measuring a wait event. Please mind that instead of looking at the time the wait interface provides, this means looking at the time that is taken executing in the kslwtbctx() and kslwtectx() functions. This can be done using systemtap:

global kslwtbctx_time, kslwtectx_time, kslwtbctx_count=0, kslwtbctx_tot=0, kslwtectx_count=0, kslwtectx_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	if ( pid() == target() ) {
		kslwtbctx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx").return {
	if ( pid() == target() ) {
		printf("kslwtbctx: %12d\n", local_clock_ns()-kslwtbctx_time)
		kslwtbctx_tot+=local_clock_ns()-kslwtbctx_time
		kslwtbctx_count++
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	if ( pid() == target() ) {
		kslwtectx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx").return {
	if ( pid() == target() ) {
		printf("kslwtectx: %12d\n", local_clock_ns()-kslwtectx_time)
		kslwtectx_tot+=local_clock_ns()-kslwtectx_time
		kslwtectx_count++
	}
}
probe end {
	printf("\nkslwtbctx: avg %12d\nkslwtectx: avg %12d\n",kslwtbctx_tot/kslwtbctx_count,kslwtectx_tot/kslwtectx_count)
}

This systemtap script has a ‘begin probe’, which executes once the systemtap script starts running. I simply print ‘Begin.’ with a newline. The idea is that it prompts me once the systemtap script is actually running.

Then there is a (userspace) process based probe for the oracle process. There are two probes for both the kslwtbctx and the kslwtectx function in the oracle process. The first one (.function(“kslwtbctx”)) fires when the function is entered, the second one (.function(“kslwtbctx”).return) fires when the function has ended.

The ‘if ( pid() == target() )’ function filters all the invocations and returns of the probed functions for the PID set by “-x PID” parameter. Otherwise any invocation of the probed function by any process would be picked up.

The entering probe simply records the time in nanoseconds in a variable. The returning probe subtracts the previously recorded time from the current time, which means the time between entering and returning is shown. Also, the returning probe adds the time the function took to another variable, and counts the number of times the return probe has fired.

The end probe shows the total time spent in each of the two functions, divided by the number of times the return probe was fired. This way the average time spent in the two functions is calculated. As you will see, the time spent in the functions varies.

When this is executed against an Oracle foreground session, this is how it looks:

# stap -x 2914 wait_interface.stap
Begin.
kslwtectx:         9806
kslwtbctx:         3182
kslwtectx:         1605
kslwtbctx:         1311
kslwtectx:         4200
kslwtbctx:         1126
kslwtectx:         1014
kslwtbctx:          840
kslwtectx:         4402
kslwtbctx:         2636
kslwtectx:         2023
kslwtbctx:         1586
^C
kslwtbctx: avg         2165
kslwtectx: avg         4305

The time measured is in nanoseconds. The average wait interface overhead is roughly 6 microseconds including systemtap overhead on my system.

The obvious thought you might have is: "why is this important?". Well, this is actually important, because those 6us dictate what the wait interface should measure and what it should not measure. What I mean to say is that anything called inside the Oracle database for which the time spent is in the same ballpark as the wait interface overhead or lower should not be measured by the wait interface, because the measurement would influence overall performance in a negative way.

A good example of this are latch gets. The Oracle database does not instrument individual latch gets via the wait interface, but rather exposes waiting for a latch via the wait interface when a process has spun for it (a latch is a spinlock), and decides to go to sleep (on a semaphore) waiting to get woken once the latch becomes available.
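What you do get for latches are the aggregate counters; something like the following shows which latches actually had to sleep (and therefore produced wait events), while the vast majority of gets never touch the wait interface:

-- gets = all acquisitions, misses = had to spin, sleeps = went to sleep
-- on a semaphore (only the sleeps surface as "latch: ..." wait events)
select name, gets, misses, sleeps
from   v$latch
order  by sleeps desc;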

However, using systemtap we can actually measure latch gets! Here is a simple script to measure the latch gets for non-shared latches via the kslgetl() function:

global kslgetl_time, kslgetl_count=0, kslgetl_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl") {
	if ( pid() == target() )
		kslgetl_time=local_clock_ns()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl").return {
	if ( pid() == target() ) {
		printf("kslgetl: %12d\n", local_clock_ns()-kslgetl_time)
		kslgetl_tot+=local_clock_ns()-kslgetl_time
		kslgetl_count++
	}
}
probe end {
	printf("\nkslgetl: avg %12d\n",kslgetl_tot/kslgetl_count)
}

This systemtap script follows the exact same structure and logic of the previous systemtap script.

This is how it looks on my system when executed against a database foreground session executing 'select * from dual':

# stap -x 2914 kslgetl.stap
Begin.
kslgetl:         3363
kslgetl:          786
kslgetl:          744
kslgetl:          782
kslgetl:          721
kslgetl:          707
kslgetl:         1037
kslgetl:          728
kslgetl:          711
kslgetl:          736
kslgetl:          719
kslgetl:          714
kslgetl:         1671
kslgetl:          929
kslgetl:          968
kslgetl:          919
kslgetl:          883
kslgetl:          869
kslgetl:         3030
kslgetl:          750
^C
kslgetl: avg         1362

As you can see, the measured average time spent inside the kslgetl function is 1.3us on my system, which includes systemtap overhead. Clearly the time for taking a latch is less than the wait interface itself takes, which means not instrumenting the kslgetl() function in the wait interface is a sensible thing.

This means that with the current state of the wait interface, it does not make sense to measure a lot more very fine-grained events, even though the timer can time at microsecond granularity. Please mind I am not saying that it does not make sense to detail the response time; I think with modern computer systems with a lot of memory the Oracle database can run more and more without needing to wait for things like disk IOs. This means modern database servers can spend a lot of time just running on CPU, making it hard to understand in what routines the time is spent. Tuning CPU execution requires insight into where time is spent. The only option to understand how CPU time in Oracle is composed is by using external (operating system based) tools like perf and flame graphs to detail CPU time. It would be great if an option existed in the database to detail time spent on CPU.

Tagged: clock_gettime, debug, gdb, gettimeofday, kernel, oracle, performance, syscall, syscallv64

No Data Loss without Sync Data Guard

A few months ago I wrote about an exciting product that allows you to achieve no data loss in a disaster even without using the maximum protection Data Guard that relies on super expensive synchronous networks. Recently I had a chance to actually work on that system and replicate the scenarios. Here I explain how that actually works, with commands and actual outputs.

Background

First, you may want to refresh your memory on the concept of that product in the earlier blog http://arup.blogspot.com/2015/04/no-data-loss-without-synchronous-network.html. The product is the Phoenix system from Axxana. To reiterate, I am not an employee of that company nor on its payroll; I just want to make others aware of this innovative out-of-the-box solution to a problem we are all familiar with. I usually don't write about specific products unless I feel strongly about their effectiveness for the user community.

Let's examine the issue we are facing:
  1. In a Data Guard environment, some amount of redo information may be left behind at the primary site.
  2. If the primary site meets with a disaster where nothing is left, this data is lost forever.
  3. If this data loss is not acceptable then you have to run the Data Guard in maximum protection mode, which guarantees that the updates occurring on the primary are reflected on the standby database immediately. That way the data is updated at the standby site and is not affected by the loss of the primary site.
  4. However, this also means that the network has to be super fast and super reliable to ensure the updates reach the standby site. Unfortunately this also means the network is super expensive. And it can also become technically challenging for networks across very large distances.
  5. This is why many corporations that can't afford the expensive network infrastructure do not implement the maximum protection Data Guard and merely accept the data loss as an unavoidable problem.
Well, until now. Axxana's product addresses the issue in a non-traditional manner. The only thing that stands between full recovery at the standby site in case of a disaster and an expensive network is the redo information that has not yet been transmitted to the standby server. If this information were available to the standby host, you would be able to recover completely. The Phoenix system from Axxana is a solid state drive enclosed inside a case protected from common disaster elements such as flood, fire, bombs, etc. You put this special box at the primary site and create one member of the redo log group on its storage, as shown in the figure below (a sketch of the corresponding SQL follows the figure). This box has a protected network port that allows you to access its contents even if the normal ports are all destroyed by the disaster. It even has a cellular modem that allows it to be connected remotely. In fact the Axxana software can pull the necessary data from this special box remotely from the standby host and apply it to the standby database. And the best thing about it is that it is the plain vanilla Data Guard you are familiar with; there is no special magic going on under the hood.
Fig 1. Axxana Blackbox Internals
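Putting a redo log member on the Blackbox storage is plain Oracle administration, nothing Axxana-specific; with made-up names it would look something like this:

-- hypothetical example: +AXXBOX is an assumed disk group presented by the Blackbox
alter database add logfile member '+AXXBOX/adb/onlinelog/group_1_bbx.log' to group 1;

-- verify that each online log group has a member on the Blackbox storage
select group#, member from v$logfile order by group#;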
Let's see how it works with a hands-on approach inside a lab. I have two hosts:

Role        Host        Database
Primary     e02lora     adb
Standby     e02rora1    adbsb

The storage is ASM, not a filesystem. Replication on a filesystem is relatively simple, so I deliberately chose ASM to see if any ASM nuances come up. The Data Guard is configured with maximum performance, so there is no need for an expensive, fast, low latency network infrastructure. The setup has an Axxana Phoenix system which consists of three components:
  1. Blackbox - the "special" protected storage system.
  2. Collector - a small server at the primary site that collects the changes to the files under the Axxana protection. In this case, they are controlfile, redo logs and archived logs. I don't need to place any database files there because they are useless for recovery. Keeping this small set of files also makes it possible to put the protected files of many databases--not just one--inside a single blackbox.
  3. Recoverer - a small server at the standby site that receives the changes from the collector and keeps the files up to date when disaster occurs.

Setting up

First I ensured that the managed recovery process is running and the standby database is in MOUNTED mode.

Then I created a test schema in the production database.

e02lora$ sqlplus / as sysdba
SQL> grant create session, create table, create procedure, unlimited tablespace to arup identified by arup;
Then I created some objects and data inside this schema.
SQL> connect arup/arup
SQL> create table T (c1 number, c2 timestamp);
SQL> create index IN_T ON T (C1);
SQL> create or replace procedure INS_T (p_c1 in number, p_commit in number) as
2 begin
3 for i in 1..p_c1 loop
4 insert into T values (i,current_timestamp);
5 if mod(i,p_commit)=0 then
6 commit;
7 end if;
8 end loop;
9 end;
10 /

SQL> create or replace procedure DEL_T (p_commit in number) as
2 begin
3 delete T where mod(C1, p_commit)=5;
4 commit;
5 end;
6 /
SQL> create or replace procedure UPD_T (p_commit number) as
2 begin
3 update T set C1 =-1 where mod(C1, p_commit)=3;
4 commit;
5 end;
6 /

SQL> exec INS_T (400000, 5000);

SQL> commit;
After creating the test data, I switched a few log files and waited a bit to make sure the changes are replicated to the standby.

Simulating a Failure

Now to simulate a disconnected network. I didn't have any special tools with me. So, I did the next best thing: I disabled the ethernet interface.
e02rora$ su -
Password:
e02rora$ ifdown eth0
At this point the standby host will not receive the updates; all the changes made to the primary database will be located at the primary site only. First, I canceled the recovery process on the standby.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
I made a few changes to the primary data.

SQL> connect arup/arup
SQL> exec DEL_T (500);
SQL> exec UPD_T (500);

SQL> conn / as sysdba
SQL> alter system switch logfile;

SQL> conn arup/arup
SQL> select count(*) from T

COUNT(1)
----------
399200
The table has only 399,200 records. Since the Managed Recovery Process is stopped these final changes will not be replicated to the standby. I can confirm that by opening the standby in read only mode and checking the actual tables.

On the standby host:

SQL> conn / as sysdba
SQL> ALTER DATABASE OPEN READ ONLY;

Database altered.

SQL> SELECT COUNT (1) FROM ARUP.T;

COUNT(1)
----------
400000
The standby has all 400,000 rows, compared to 399,200 rows in primary. The discrepancy is due to the unpropagated changes to the standby. At this point I simulate a failure in the primary by killing the pmon process.

e02lora$ ps -aef|grep pmon
UID PID PPID TTY STIME COMMAND
oracle 10184 10564 pty0 16:26:17 pmon
e02lora$ kill -9 10184

The primary database is down. The data in the redo logs is lost as well since they have not been propagated to the standby system yet.

Performing Recovery

Now that we have simulated a disaster, let's see how to recover from it. Remember, since the Data Guard configuration is Maximum Performance, the database can only be recovered up to the most recent log entry received. With the Axxana software, however, there is additional data that can be pulled from the Blackbox (the storage unit at the primary site that would not be destroyed).  But how do you--the DBA--know what files are available at the standby site, which are still left at the primary site, and, most important, how to get those to the standby host? It gets even more complicated since the Blackbox is just a storage device; you have to mount the volumes, mount the ASM disks, etc. These may be fun when creating a brand new database; definitely not so when you are under the gun to recover the database and bring your business online.
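Under normal circumstances the first step would be a gap analysis using the standard views, something along these lines, to see which sequences made it across and which are stranded at the primary:

-- on the primary (if it were still reachable): the current online redo
select group#, thread#, sequence#, status, archived from v$log;

-- on the standby: the highest sequence applied per thread
select thread#, max(sequence#) as last_applied
from   v$archived_log
where  applied = 'YES'
group  by thread#;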

No worries. This is where the Axxana software comes to the rescue. I ran the provided shell script at the standby site. This script contacts the blackbox at the primary site, pulls the needed data and completes the creation of the necessary files at the standby site. Once the files are at the standby site, all you have to do is perform the typical managed standby database recovery to complete the recovery process. The best part of all? The script even gives you a step-by-step instruction sheet along with specific file names which you can copy and paste when Oracle prompts for them. Here is how I call the script and the resultant output.

root@e02rora1 AxxRecovery# ./AxxRecovery.sh
Logging to '/home/oracle/AxxRecovery/logs/axxana.debug.log'
Calculating Optimization Parameters [done]
Attempting to connect to Axxana's Recoverer [done]

Perform Failover?

Warning!
This operation terminates the Axxana protection for all DBs.

1) OK
2) Cancel & Exit
At this point the script pauses and asks me for confirmation. I enter "1" and the script continues:
Requesting Axxana's Recoverer to 'FAILOVER' [done]

Recoverer communicates with the Black Box (ENTER_FAILOVER_START) [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_BBX) [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_CONNECT) [done]

Recoverer is in FAILOVER mode

Transferring required logs files of the database 'adb'.

The script shows me the names of the files along with their respective transfer status and percentage progress.

FILE NAME | Size in MBytes | Stage | Progress
==================================|=================|==============|===================
ADB-group_3.258.902686913 | 50 (0) | Completed | 100%
ADB-thread_1_seq_111.266.902852033| 46 (46) | Completed | 100%
ADB-group_2.257.902686911 | 50 (50) | Completed | 100%
ADB-1454516139530A12435456XT_112.C| 11 (11) | Completed | 100%
ADB-group_1.256.902686901 | 50 (50) | Completed | 100%
==================================|=================|==============|===================
Total: | 208 (159) | 5 of 5 files recovered

All required log files exist at '/home/oracle/AxxRecovery/axx_recovery_files'

Preparing user recovery scripts ...

You can safely perform the Oracle RPO=0 recovery process

Please follow the RPO=0 recovery instructions that are described in the file '/home/oracle/AxxRecovery/axx_required_files/recovery_instructions.txt'
As I mentioned, the script creates a detailed step-by-step instruction sheet to be followed for the standby recovery. I am actually glad that it does not perform the recovery automatically. That is one process you want to watch, proceeding with caution. You probably have only one chance at it, and rushing through it may force you into missteps. But at the same time you want to have to think as little as possible under those stressful conditions, so the detailed instruction sheet comes in handy. The last line shows the location of the instructions file. Here is how the file looks in my case, after removing some banner items:
|---------------------------------------------------------------------------------------------------------------|
| RECOVERY INSTRUCTIONS (03/02/2016 16:26:19) |
|===============================================================================================================|
===============================================================================================================

Step 1) Output setup
====================

Please log into the standby database (as sys user)
and then run the following commands at the SQL prompt:

SQL>
SET SERVEROUTPUT ON
SET LINES 999
SET PAGES 0

Step 2) Recovery using archive log files
========================================

Please run the following commands at the SQL prompt:
* (You can safely ignore the following ORA Error 'Managed Standby Recovery not active')

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

SQL> RECOVER STANDBY DATABASE UNTIL CANCEL;

-- when requested, feed the following file(s):

+DATA/adbsb/archivelog/2016_02_03/thread_1_seq_110.382.902852027
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ARCHIVELOG/2016_02_03/thread_1_seq_111.266.902852033

-- finally enter 'Cancel'

SQL> CANCEL

Step 3) Switching to primary control file
=========================================
Please run the following commands at the SQL prompt:

SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP NOMOUNT
SQL> ALTER SYSTEM SET CONTROL_FILES='/home/oracle/AxxRecovery/axx_required_files/1454516139530A12435456XT_112.C' SCOPE=SPFILE;
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT

Step 4) Renaming data and redo log file names
=============================================
Please run the following SQL statement

SQL> @/home/oracle/AxxRecovery/axx_required_files/logAndDateFileRename.sql

Step 5) Recovery using primary control file
===========================================
Please run the following command at the SQL prompt:

SQL> RECOVER DATABASE UNTIL CANCEL USING BACKUP CONTROLFILE
-- when requested, feed the following file(s):

/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_1.256.902686901
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_3.258.902686913

-- You should now see a message saying 'Media Recovery complete'

Step 6) Open the Standby Database
=====================================================
Please run the following commands at the SQL prompt:

SQL> ALTER DATABASE OPEN RESETLOGS;
Well, it's dumbed down enough for those stressful moments associated with a standby database recovery, down to even the set pagesize commands in SQL*Plus--quite easy to forget in stressful situations. Note it shows the time it was generated at the very beginning, in a non-US date format (dd/mm/yyyy). Anyway, I followed the instructions step by step. Step 4 requires some attention: it shows how to change the names of the redo logs and data files after the switch to the primary control file.

For the sake of brevity I don't want to show the entire output. Here is an excerpt from the tail end of the activity:
...
... output truncated ...
...
Specify log: {=suggested | filename | AUTO | CANCEL}
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911
ORA-00279: change 5172890 generated at 02/03/2016 16:16:56 needed for thread 1
ORA-00289: suggestion : +DATA
ORA-00280: change 5172890 for thread 1 is in sequence #114
ORA-00278: log file '/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_3.258.902686913
Log applied.
Media recovery complete.

SQL> ALTER DATABASE OPEN RESETLOGS;

Database altered.

SQL> SELECT COUNT (1) FROM ARUP.T;

COUNT(1)
----------
399200
Voila! The output shows "Media recovery complete". The count is 399,200--the same as the number in the production database. The recovery process got those last changes. I accomplished my objective without a synchronous replication.

Summary

Recovery without data loss has always been dependent on a high throughput, low latency, ultra-reliable network. Unfortunately the cost of this infrastructure often precludes the use of no data loss setups in organizations. The Axxana solution is innovative in the sense that it addresses the issue with a completely out-of-the-box solution, slashing costs dramatically, in fact so dramatically that most corporations will be able to accomplish no data loss in Oracle Data Guard setups. At the heart of the setup is the "blackbox", which is just solid state storage with protection to withstand water, fire and bombing. When a disaster obliterates the primary site, this blackbox likely survives, and through its cellular modem it allows its content to be pulled to the standby site to reconstruct the important files that have not been replicated yet. Once those files are retrieved, it's quite simple to recover the standby database completely. In this blog you saw an example of how I did exactly that as a user.

Hope this helps in making some decisions about no data loss setup in Data Guard with respect to cost.

Friday Philosophy – Database Performance is In My Jeans

Database performance is in my jeans. Not my genes, I really do mean my jeans – an old pair of denim trousers. I look at my tatty attire keeping my legs warm and it reminds me of Oracle database performance.

comfortable, baggy, old, DW jeans

You can buy jeans in a range of styles & sizes. Just as you can set up your database in a number of standard ways. When you create a database with the install wizard or the DBCA (database configuration assistant) you get to pick from a few options. OLTP databases are like skinny, butt-tight jeans that fashionably young things might wear. I'm more of a Data Warehouse type. I like loose, baggy jeans with lots of space. However, no matter how good the initial setup, performance will degrade. Your jeans will get stretched, stained, more baggy and generally tatty. But you also get used to the performance of your database, its oddities and how to live with them. Your baggy, saggy jeans become comfortable.

I’m a dab-hand at doing turn-ups and SQL tuning

Of course, you probably need to alter your database somewhat to suit your performance requirements. You could go to a tailor to get them done (pay a consultant) and make your jeans a top-notch fit, but it's expensive. Or ask the shop to alter them when you buy them (get some Oracle consultancy as part of the purchase deal, to do a pretty average job of changing things). Or, if your requirements are specific (I can never get trousers with a leg length to suit me for some odd reason) and you have some skills (I can drive a sewing machine and, if needed and I have time, I can hand-sew) then you can tailor your jeans to your needs yourself. Little changes like this are like a bit of SQL tuning. Hand sewing is messing about with trace files.

You fix one performance bottle neck only to find the next one

Of course, over time more major performance issues will occur and the cracks will show. Well, tears. Bits of the system will give way and you'll have to patch them. Sometimes the patch is a bit of an obvious cludge, but heck, it does the job. The other option is to just live with the gaping knee, which is like not fixing your performance issue and just letting your knee get cold. My business requirements don't allow for this; I need my knee covered and protected from the brambles and spiky stuff around the garden. And just like performance tuning, you fix one performance problem only to reveal the next point of weakness. The point of most stress in my jeans is the knees, what with all the gardening, crawling through hedges, kneeling in the dirt and grovelling to the wife. I patched that big tear across the knee – and within 2 weeks a new one started, just a little lower. You fix your critical batch load that is doing too much physical IO and now your problem is redo generation in the next step! I did not fix my performance bottleneck, I just moved it down the damned leg!

All those little tears needed a lot of fixing

Many of us get tears in the knees of our jeans; it's a common performance problem. But some performance problems are more esoteric. Not many people have had to patch the bottom of their jeans due to doing battle with barbed wire (and losing). I could do with self-healing jeans to match the self-healing leg. I suppose with the latest dynamic performance tricks in the optimiser, we sort-of have self-healing databases. I tried patching it with just the sewing machine but the damage was too great, so a swatch of fabric behind the area and a craze of zig-zag stitch is holding it all together. Maybe that's like using row-level security to allow different customers to see just their own set of data. It works but it was a tad over-engineered.

Of course, over the years the requirements for your database and its performance are likely to vary, and you might need to do more than a bit of SQL tuning or tweaking of indexes. The sewing machine can't fix all the problems with my tatty old jeans, especially as the workload first grew, shrank, and grew again. I needed a new performance enhancement tool. A belt. It's stopped them falling down around my knees and also stopped them cutting off the blood to my legs, depending on how well I've done at archiving off excess calories I no longer need.

Addition of a Modifiable Girth Control device

The sad thing is, despite all my hard work, I think I’ll have to pension off these jeans soon. Just like computer systems I’ve looked after for a while, I know where I am with them and I’ll miss them when I do a hardware refresh.

So there you go. How many of you thought that you could be reading about a tatty pair of jeans this week? I’m good to you lot.

Parallel DML

A recent posting on OTN presented a performance anomaly when comparing a parallel “insert /*+ append */” with a parallel “create table as select”.  The CTAS statement took about 4 minutes, the insert about 45 minutes. Since the process of getting the data into the data blocks would be the same in both cases, something was clearly not working properly. Following Occam’s razor, the first check had to be the execution plans – when two statements that “ought” to do the same amount of work take very different times it’s probably something to do with the execution plans – so here are the two statements with their plans:

First the insert, which took 45 minutes:

insert  /*+ append parallel(a,16) */ into    
        dg.tiz_irdm_g02_cc  a
select
        /*+ parallel (a,16) parallel (b,16) */ 
        *
from    tgarstg.tst_irdm_g02_f01 a, 
        tgarstg.tst_irdm_g02_f02 b
where   a.ip_id = b.ip_id
;

------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |        |      |            |
|   2 |   PX COORDINATOR                 |                  |       |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->S | QC (RAND)  |
|*  4 |     HASH JOIN BUFFERED           |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |
|   5 |      PX RECEIVE                  |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   6 |       PX SEND HASH               | :TQ10000         |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   7 |        PX BLOCK ITERATOR         |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |      PX RECEIVE                  |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |       PX SEND HASH               | :TQ10001         |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  11 |        PX BLOCK ITERATOR         |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWC |            |
|  12 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

And here’s the ‘create table’, which took about 4 minutes:

create table dg.tiz_irdm_g02_cc 
nologging 
parallel 16 
compress for query high 
as
select
        /*+ parallel (a,16) parallel (b,16) */ 
        *
from    tgarstg.tst_irdm_g02_f01 a , 
        tgarstg.tst_irdm_g02_f02 b 
where
        a.ip_id = b.ip_id

------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |                  |    13M|    36G|       |   397K  (1)| 00:00:14 |        |      |            |
|   1 |  PX COORDINATOR                  |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10002         |    13M|    36G|       |   255K  (1)| 00:00:09 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,02 | PCWP |            |
|*  4 |     HASH JOIN                    |                  |    13M|    36G|  1842M|   255K  (1)| 00:00:09 |  Q1,02 | PCWP |            |
|   5 |      PX RECEIVE                  |                  |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   6 |       PX SEND HASH               | :TQ10000         |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   7 |        PX BLOCK ITERATOR         |                  |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |      PX RECEIVE                  |                  |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,02 | PCWP |            |
|  10 |       PX SEND HASH               | :TQ10001         |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | P->P | HASH       |
|  11 |        PX BLOCK ITERATOR         |                  |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | PCWC |            |
|  12 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

As you can see, the statements are supposed to operate with degree of parallelism 16, and we were assured that the pre-existing table had been declared as nologging with the same level of compression as that given in the CTAS so, assuming the queries did run with the degree expected, they should take virtually the same amount of time.

But there’s an important clue in the plan about why there was a difference, and why the difference could be so great. The first statement is DML, the second is DDL. Parallel DDL is enabled automatically; parallel DML has to be enabled explicitly, otherwise the select will run in parallel but the insert will be serialized. Look at operations 1 – 4 of the insert – the query co-ordinator does the “load as select” of the rowsource sent to it by the parallel execution slaves. Not only does this mean that one process (rather than 16) does the insert, but you also pay for all the extra messaging, and the hash join (at line 4) has to be buffered – which means a HUGE amount of data could have been dumped to disc by each slave prior to the join actually taking place and then been read back from disc, joined, and forwarded.
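
For illustration only, here’s a sketch of the missing step, re-using the statement from above: parallel DML is switched on at session level before the insert runs (from 12c the enable_parallel_dml hint is an alternative), and the direct-path load has to be committed before the session can query the table again.

-- Sketch: enable parallel DML before the direct-path insert.
-- Without this the select runs in parallel but the load is done
-- serially by the query co-ordinator.
alter session enable parallel dml;

insert  /*+ append parallel(a,16) */ into
        dg.tiz_irdm_g02_cc  a
select
        /*+ parallel (a,16) parallel (b,16) */
        *
from    tgarstg.tst_irdm_g02_f01 a,
        tgarstg.tst_irdm_g02_f02 b
where   a.ip_id = b.ip_id
;

commit;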

Note that the hash join in the CTAS is not buffered – each slave does the join as the data arrives and writes the result directly to its local segment. Basically the insert could be doing something like twice the I/O of the CTAS (and this is Exadata, so reads from temp can be MUCH slower than the tablescans that supply the data to be joined).

So the OP checked, and found that (although he thought he had enabled parallel DML) he hadn’t actually done so. And after enabling parallel DML the timing was … just as bad. Ooops!! Something else must have gone wrong. Here’s the plan after enabling parallel DML:


--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  PX COORDINATOR                    |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10003         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,03 | PCWP |            |
|   4 |     PX RECEIVE                     |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,03 | PCWP |            |
|   5 |      PX SEND RANDOM LOCAL          | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->P | RANDOM LOCA|
|*  6 |       HASH JOIN BUFFERED           |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE                  |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HASH               | :TQ10000         |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR         |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |        PX RECEIVE                  |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|  12 |         PX SEND HASH               | :TQ10001         |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  13 |          PX BLOCK ITERATOR         |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWC |            |
|  14 |           TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------------

As you can see, line 3 has the LOAD AS SELECT, after which the slaves message the query co-ordinator – so the DML certainly was parallel even though it wasn’t any faster. But why is the hash join (line 6) still buffered, and why is there an extra data flow (lines 5 and 4 – PX SEND RANDOM LOCAL / PX RECEIVE)? The hash join has to be buffered because of that extra data flow (which suggests that the buffering and messaging could still be the big problem) – but WHY is the data flow there at all? It shouldn’t be.

At this point I remembered that the first message in the thread had mentioned testing partitioned tables as well as non-partitioned tables – and if you do a parallel insert to a partitioned table, the data is going to be spread across several partitions, and the number of partitions is not a good match for the degree of parallelism, then you’re likely to see an extra stage of data distribution as Oracle tries to share the data and the partitions as efficiently as possible across slaves. One of the possible distribution methods is “local random” – which is fairly likely to appear if the number of slaves is larger than the number of partitions. This behaviour can be modified with the newer “single distribution” version of the pq_distribute hint. So I asked the OP if their latest test was on a partitioned table, and suggested they insert the hint /*+ pq_distribute(a none) */ just after the parallel hint.
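
For clarity, here’s a sketch of what the suggested statement would look like, with the extra hint placed straight after the parallel hint on the insert (the statement is the one from the start of this note):

-- Sketch: pq_distribute(a none) on the insert asks Oracle to skip the
-- extra (random local) distribution between the join and the load.
insert  /*+ append parallel(a,16) pq_distribute(a none) */ into
        dg.tiz_irdm_g02_cc  a
select
        /*+ parallel (a,16) parallel (b,16) */
        *
from    tgarstg.tst_irdm_g02_f01 a,
        tgarstg.tst_irdm_g02_f02 b
where   a.ip_id = b.ip_id
;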

The answer was yes, and the hint had the effect of dropping the run time down to 7 minutes – still not as good as the CTAS, but then the CTAS wasn’t creating a partitioned table so it’s still not a completely fair test. Here’s the (start of the) final plan:

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  PX COORDINATOR                    |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,02 | PCWP |            |
|*  4 |     HASH JOIN                      |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |

As you can see, we have a hash join that is NOT buffered; we don’t have a third distribution, and the slaves do the data load and then message the query co-ordinator.

It would be interesting to know if there was a significant skew in the data volumes that went into each partition of the partitioned table, and check where the time was spent for both the partitioned insert and the non-partitioned CTAS (and compare with a non-partitioned insert) – but real-world DBAs don’t necessarily have all the time for investigations that I do.
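
If you did want a quick feel for the skew, something like the following (a sketch, assuming statistics on the target table are reasonably fresh) would show how the rows ended up spread across the partitions:

-- Rough skew check: stats-based row counts per partition of the target table.
select  partition_name, num_rows
from    dba_tab_partitions
where   table_owner = 'DG'
and     table_name  = 'TIZ_IRDM_G02_CC'
order by
        num_rows desc;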

My reference: parallel_dml.sql

OGh DBA / SQL Celebration Day 2016 Keynote Speaker

I am very happy and honored to announce that Mr. Graham Wood (Oracle US) agreed…