Oakies Blog Aggregator

Index Splits – 2

In yesterday’s article I described the mechanism that Oracle uses for an index leaf block split when you try to insert a new entry into a leaf block that is already full, and I demonstrated that the “50-50” split and the “90-10” split work in the same way, namely:

  • save the old block into the undo
  • prepare a new leaf block
  • “share” the data between the old and new leaf blocks
  • sort out pointers

The obvious question to ask about this process is: “Why does Oracle save and rewrite the whole of the old leaf block during a 90-10 split when the data in the block doesn’t appear to change?” The “sharing” in the 90-10 split is most uneven, and it appears that Oracle simply attaches a new leaf block to the index structure and writes the new index entry into it, leaving the existing index entries unchanged in the current leaf block.

The answer to that question can be found by doing block dumps – except you won’t see the answer if you use my original test data. So here’s a follow-on script to the previous test (written 11 years after the previous script):


rem
rem     Script:         index_splits3a.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem     Purpose:
rem

drop table t2 purge;

create table t2 as select * from t1 where id <= 148;
alter table t2 add constraint t2_pk primary key(id, idx_pad) using index pctfree 0;

column object_id new_value m_index_id
select object_id from user_objects where object_name = 'T2_PK' and object_type = 'INDEX';

begin
        for r in (select * from t1 where id between 149 and 292 order by dbms_random.value) loop
                insert into t2 values(r.id, r.idx_pad, r.padding);
        end loop;
        commit;
end;
/

alter session set events 'immediate trace name treedump level &m_index_id';
alter system flush buffer_cache;

prompt  check the trace file for the last block of the index
prompt  then do a block dump for it.
prompt  then press return

pause

insert into t2 values(293, rpad('x',40,'x'), rpad('y',50,'y'));
commit;

alter session set events 'immediate trace name treedump level &m_index_id';
alter system flush buffer_cache;


This test depends on the number of rows used for the previous test – and I have four hard-coded values (148, 149, 292, 293) in it that matter. If you’ve had to use a different number of rows in your version of the first test you will need to adjust these values to match.

I’ve created a clone of the t1 table copying only the first 148 rows – this is just enough rows that when I create a unique (PK) index on the table the index will have two leaf blocks, the first holding 147 entries and the second holding one entry. I’ve then inserted the next 144 rows from t1 into t2 in random order, so that I end up with two full leaf blocks.

Once the data is ready the code issues a treedump command (so that we can check the index is as I’ve described it) and flushes the buffer_cache, then prompts you with some instructions and waits for you to press return. At this point you need some manual intervention from another session – you can examine the treedump to work out the file and block addresses of the two leaf blocks and dump the second leaf block (‘alter system dump datafile N block M;’).

After you’ve done the block dump press return and my code resumes and inserts a new row that will cause a 90-10 split to happen, then it does another treedump (to let you check the block addresses and see that the split was 90-10), and flushes the buffer cache again. This is where you can check the block address of the second leaf block (in case it has managed to change – which it shouldn’t) and dump the block again.
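
If you want to avoid converting the hexadecimal block address from the treedump by hand, a little sketch like the following (my addition, not part of the original script – it assumes you have execute privilege on dbms_utility) turns the decimal address printed on the “leaf:” line of the treedump into a file number and block number ready for the dump command:

rem     Hypothetical helper: convert the decimal DBA from the treedump
rem     "leaf:" line into file/block numbers, then dump that block.

define m_leaf_dba = 20971663            -- replace with the address from your treedump

column file_no  new_value m_file_no
column block_no new_value m_block_no

select
        dbms_utility.data_block_address_file(&m_leaf_dba)      file_no,
        dbms_utility.data_block_address_block(&m_leaf_dba)     block_no
from
        dual
;

alter system dump datafile &m_file_no block &m_block_no;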

Here, with a huge chunk removed from the middle, are the results I got from searching for the expression “row#” in the two block dumps that I generated in my test.


Before 90/10 block split:
-------------------------
row#0[7979] flag: -------, lock: 0, len=53, data:(6):  01 40 01 84 00 03
row#1[1885] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 2b
row#2[1938] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 2a
row#3[5595] flag: -------, lock: 2, len=53, data:(6):  01 40 01 f9 00 2c
row#4[3581] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 0b
...
row#142[1408] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 34
row#143[2150] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 26
row#144[878] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 3e


After 90/10 block split
-----------------------
row#0[348] flag: -------, lock: 0, len=53, data:(6):  01 40 01 84 00 03
row#1[401] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 2b
row#2[454] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 2a
row#3[507] flag: -------, lock: 0, len=53, data:(6):  01 40 01 f9 00 2c
row#4[560] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 0b
...
row#142[7873] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 34
row#143[7926] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 26
row#144[7979] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 3e

The “row#” is in ascending order – these lines in an index leaf block dump show Oracle walking through the block’s “row directory”; the number in square brackets following the row number is the offset into the block where the corresponding index entry will be found. When Oracle inserts an index entry into a leaf block it adjusts the row directory to make a gap in the right place so that walking the directory in row# order allows Oracle to jump around the block and find the index entries in key order.

When Oracle rewrites the block it first sorts the index entries into key order, so that the actual index entries are written into the block in key order. As a result, a range scan that moves a pointer smoothly through the row directory will also be moving another pointer smoothly down the block, rather than making that pointer jump all over the place. Presumably this has (or maybe had) a benefit as far as the CPU cache and cache lines are concerned.

So there is a method in the madness of “copy the whole block even when the content doesn’t change”. The content doesn’t change but the order does, and paying the cost of sorting once may return a benefit in efficiency many times in the future.

 

Using colplot to visualise performance data

Back in 2011 I wrote a blog post about colplot, but at that time I focused on running the plot engine backed by a web server. However some people might not want to take this approach, and thinking about security it might not be the best idea in the world anyway. A port that isn’t opened can’t be scanned for vulnerabilities…

So what is colplot anyway? And why this follow-up to a 7 year old post?

Some background

Back in the day I learned about collectl: a small, light-weight performance monitoring tool that’s easily portable to many platforms. Collectl is very versatile and has capabilities to record a lot of detail, especially on Linux. Collectl comes with a companion, colplot, which I introduced in said post in 2011.

A typical use case – at least for me – is to start collectl recording, do some work, stop recording, and plot the results. This can be handy if you don’t get permission to install TFA (think “OSWatcher”) at short notice, and still need performance data!

In this blog post I’d like to share how you could go about doing this. I am using Debian as my workstation O/S and Oracle Linux to capture data.

The workload

I wanted a really simple workload, so for lack of ingenuity I came up with the idea of running fio to drive storage and a tiny bit of CPU.

I am running a very basic random read I/O test and recording it with collectl. It is important to understand that it’s your responsibility to define what you want to capture with regard to the plots you want to generate. I’ll explain this in a bit, but for now it’ll have to suffice that I need detail data for CPU and disk.
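
For reference, a fio invocation along the following lines would generate this kind of random-read load – the file name, size and runtime here are purely illustrative, not the exact job I ran:

[oracle@server1 ~]$ fio --name=randread --filename=/u01/fio/randread.dat --size=2g \
>       --rw=randread --bs=8k --direct=1 --ioengine=libaio --iodepth=16 \
>       --runtime=300 --time_based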

Before kicking off the load generator I start collectl to record data in (P)lot mode, saving data to my data subdirectory:

[oracle@server1 ~]$ collectl -sCD -f /home/oracle/data -P

(Another, probably cleverer option is to record the data without the -P switch, then replay the recording and writing the output to a file in plot format. That way you can have a lot more control over the process and as an added advantage have the raw data)
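
That alternative would look roughly like this – the playback file name depends on your hostname and date, and I’m assuming collectl’s default raw-file naming here:

[oracle@server1 ~]$ collectl -sCD -f /home/oracle/data                    # record raw data
[oracle@server1 ~]$ collectl -p /home/oracle/data/server1-20181030.raw.gz \
>       -sCD -P -f /home/oracle/data                                      # replay in plot format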

Once the workload has completed, I can transfer the data captured to my workstation for analysis with colplot. For each detail switch you select (C and D in the example) you’ll get a (compressed) file:

$ ls -l data
total 84
-rw-r--r-- 1 oracle oinstall 21790 Oct 30 08:26 server1-20181030.cpu
-rw-r--r-- 1 oracle oinstall 59357 Oct 30 08:26 server1-20181030.dsk

Plotting details

I imagine most people use the web-frontend to colplot, but that doesn’t mean there aren’t other ways of generating visualisations of your performance data. But first of all, what plots can you create? This depends on the version of colplot in use, for me it’s this:

$ colplot --version
colplot: V5.1.0, gnuplot: V:5.0[png,x11]

Copyright 2004-2014 Hewlett-Packard Development Company, L.P.
colplot may be copied only under the terms of either the Artistic License
or the GNU General Public License, which may be found in the source kit

Getting back to the list of plots supported in my version, it’s super impressive!

$ colplot -showplots
cpu         cpu     s  Cpu Utilization. Other
cpu20       cpu     s  Cpu Utilization, 0-20%
cpu80       cpu     s  Cpu Utilization, 80-100%
cpudold     cpu     s  Old format ony reports individual fields
cpumid      cpu     s  Cpu Utilization, 40-60%
cpumore     cpu     s  Additional types of use
cpuold      cpu     s  Old format ony reports individual fields
loadavg     cpu     s  Load Averages for 1,5,15 min
disk        disk    s  Disk Summary
diskio      disk    s  Disk I/O
disksize    disk    s  Bandwidth and transfer sizes
elan        elan    s  Quadrics ELAN Bandwidth
elanio      elan    s  Quadrics ELAN Packet Rates
ib          ib      s  Infiniband Bandwidth
ibio        ib      s  Infiniband Packet Rates
ibsize      ib      s  Bandwidth and transfer sizes
nvidia      import  s  nvidia GPU stats
inode       inode   s  Inode Summary
luscltio    lustre  s  Lustre Client Summary, I/O only
cltmeta     lustre  s  Lustre Client Meta Summary
cltreada    lustre  s  Lustre Client Read-Ahead Summary
lusmds      lustre  s  lustre Lustre MDS Summary
lusoss      lustre  s  Lustre OSS Data Rates
ossio       lustre  s  Lustre OSS I/Os
faults      mem     s  Page Faults
mem         mem     s  Memory
memanon     mem     s  Anon Memory
membuf      mem     s  Buffered Memory
memcache    mem     s  Cached Memory
memdirty    mem     s  Dirty Memory
memmap      mem     s  Mapped Memory
memslab     mem     s  Slab Memory
paging      mem     s  Paging
swap        mem     s  Swap Utilization
misc1       misc    s  Miscellaneous ctrs from '--import misc'
misc2       misc    s  CPU Frequency from '--import misc'
net         net     s  Network Summary
netpkt      net     s  Network packets
netsize     net     s  Bandwidth and transfer sizes
nfsV2c      nfs     s  older NFS V2 Client Summary
nfsV2s      nfs     s  older NFS V2 Server Summary
nfsV3c      nfs     s  older NFS V3 Client Summary
nfsV3s      nfs     s  older NFS V3 Server Summary
nfsV4c      nfs     s  older NFS V4 Client Summary
nfsV4s      nfs     s  older NFS V4 Server Summary
nfsmeta     nfs     s   NFS Metadata and Commits
nfsrpc      nfs     s  NFS RPC Summary
nfssum      nfs     s   NFS Aggregate Summary Data
ctxint      proc    s  Context and Interruputs
proc        proc    s  Processes
sock        sock    s  Socket Usage
accaudt     swift   s  Account Auditor
accreap     swift   s  Account Reaper
accrepl     swift   s  Account Replicator
accsrvr     swift   s  Account Server
conaudt     swift   s  Container Auditor
conrepl     swift   s  Container Replicator
consrvr     swift   s  Container Server
consync     swift   s  Container Sync
conupdt     swift   s  Container Updater
objaudt     swift   s  Object Auditor
objexpr     swift   s  Object Expirer
objrepl     swift   s  Object Replicator
objsrv2     swift   s  Object Server2
objsrvr     swift   s  Object Server
objupdt     swift   s  Object Updater
prxyacc     swift   s  Proxy Account
prxycon     swift   s  Proxy Container
prxyobj     swift   s  Proxy Object
tcp         tcp     s  TCP errors count summary
tcpold      tcp     s  old TCP acks & packet failures
cpudet      cpu     d  Cpu Details, Other
cpuint      cpu     d  Interrupts by Cpu
cpumored    cpu     d  Additional types of use
diskdet     disk    d  Disk Details
diskdsize   disk    d  Disk I/O Size Details
diskque     disk    d  Disk request queue depth
disktimes   disk    d  Disk wait/service times
diskutil    disk    d  Disk utilization
fans        env     d  Fan Speeds
power       env     d  Power Sensor
temps       env     d  Temperature Sensors
ibdet       inter   d  IB interconnect detail data
ibdsize     inter   d  IB packet size detail
elandio     inter   d  Elan interconnect IOs (get/put/comp)
elandmb     inter   d  Elan interconnect MBs (get/put/comp)
cltdet      lustre  d  Lustre Client FS I/O Detail
cltdetL     lustre  d  Lustre Client OST I/O Detail
ossdet      lustre  d  Lustre OSS Detail
netdet      net     d  Network Details
netdsize    net     d  Network Packet Size Details
nfsV2cd     nfs     d  NFS Version 2 Client Detail
nfsV2sd     nfs     d  NFS Version 2 Server Detail
nfsV3cd     nfs     d  NFS Version 3 Client Detail
nfsV3sd     nfs     d  NFS Version 3 Server Detail
nfsV4cd     nfs     d  NFS Version 4 Client Detail
nfsV4sd     nfs     d  NFS Version 4 Server Detail
cltbrw      macro      Lustre Client BRW stats
cltbrwD     macro      Lustre Client BRW detail stats
detall      macro      All detail plots except nfs and lustre
detlus      macro      Lustre detail plots (there can be a LOT of these!)
detnfs      macro      NFS detail plots, colplot only
inter       macro      Interconnect summary plots
interdet    macro      Interconnect detail plots
lusblkDR    macro      Lustre Block I/O read detail stats (there can be LOTS of these!)
lusblkDW    macro      Lustre Block I/O write detail stats (there can be LOTS of these!)
lusblkR     macro      Lustre Block I/O read summary stats
lusblkW     macro      Lustre Block I/O write summary stats
misc        macro      All misc counters from '--import misc'
ossbrw      macro      Lustre OSS BRW stats
ossbrwD     macro      Lustre OSS BRW detail stats
sumall      macro      All summary plots, except nfs client/server stats
sumlus      macro      Lustre summary plots for clients, mds and oss
summin      macro      Minimal set of summary plots (cpu, disk, mem and disk
sumnfs      macro      NFS summary plots, colplot only

I don’t think I promised too much!

One thing to keep in mind though: you cannot plot charts for which you don’t have data. For example, if I wanted to plot CPU summary data, colplot would tell me that it can’t:

$ colplot -dir /home/vagrant/data -plots cpu
No plottable files match your selection criteria.
Are your dir, selection dates and/or file protections right?

I am certain I have set the permissions correctly, but I also know that I didn’t capture CPU summary information (this would be lower-case “c”, as opposed to upper case “C” for the detailed recording). I suggest you run a few tests until you are comfortable with collectl’s command line switches to avoid later disappointment when trying to plot :)
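
If in doubt, it is cheap to record both the summary and the detail subsystems (lower-case letters for summary, upper-case for detail) so that both kinds of plot are available later – something along these lines, which is my suggestion rather than the command from the original capture:

[oracle@server1 ~]$ collectl -scdCD -f /home/oracle/data -P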

With the collected performance data transferred to ~/data I can now plot some CPU and disk details:

$ colplot -dir /home/vagrant/data -plots cpudet,diskdet \
> -filedir /home/vagrant/data --time 08:15:00-08:25:00
Your Plot(s) have been written to /home/vagrant/data/5376-colplot.pdf

The resulting file created by this particular command is a PDF. I like this format simply because it’s easy to store it for later reference. I also wanted to limit the plots to a specific time, otherwise my little 5m test would have been hardly noticeable.

This is what it looks like. Please don’t try to read anything from the charts – they are included for illustration purposes only, taken from a lab VM without any resemblance to a real-world system.

[Image: colplot example – https://martincarstenbach.files.wordpress.com/2018/10/colplot-example.png]

It’s also possible to plot interactively by omitting the -filedir switch in colplot. Output is generated in your X-session and you can export it in different formats.
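
For example, dropping -filedir from the earlier command should bring the charts up on screen instead of writing a PDF (assuming a working X display):

$ colplot -dir /home/vagrant/data -plots cpudet,diskdet --time 08:15:00-08:25:00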

There is of course more to colplot than I could show in a single post, but I hope I have managed to give you a first impression.

Happy benchmarking!

Index splits

After writing this note I came to the conclusion that it will be of no practical benefit to anyone …  but I’m publishing it anyway because it’s just an interesting little observation about the thought processes of some Oracle designer/developer. (Or maybe it’s an indication of how it’s sensible to re-use existing code rather than coding for a particular boundary case, or maybe it’s an example of how to take advantage of “dead time” to add a little icing to the cake when the extra time required might not get noticed). Anyway, the topic came up in a recent thread on the OTN/ODC database forum and since the description given there wasn’t quite right I thought I’d write up a correction and a few extra notes.

When an index leaf block is full and a new row has to be inserted in the block Oracle will usually allocate a new leaf block, split the contents of the full block fairly evenly between two leaf blocks, then update various pointers to bring the index structure up to date. At various moments in this process the branch block above the leaf block and the leaf blocks either side of the splitting block have to be pinned. The number of times this happens is reported under the statistic “leaf node splits” but there is a special subset of leaf node splits that handles the case when the key in the new row is greater than the current high value in the block and the block is the “rightmost” (i.e. high values) block in the index. In this case Oracle adds a new leaf block to the end of the index and inserts the new value in the new block; it doesn’t share the data at all between the old and new leaf blocks. This special case is reported under the statistic “leaf node 90-10 splits”, even though “100-0” would be a more accurate description than “90-10”.
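
(For reference – this isn’t part of the original note – you can watch these statistics for your own session with a query along the following lines:)

select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in ('leaf node splits', 'leaf node 90-10 splits')
;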

This note is a description of the work done by  a leaf node split and compares the work for a “50-50” split (as the general case is often called) and a 90-10 split. You might think that the latter would be less resource-intensive than the former but, in fact, that’s not the case. Here’s a little script to get things going – I’m using an 8KB block size and ASSM (automatic segment space management); if your default tablespace definition is different the number of rows you have to use will need to be changed.


rem
rem     Script:         index_splits3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          September 2007
rem

start setenv
set timing off
set feedback off

define m_limit = 292

drop table t1 purge;

create table t1 (id number, idx_pad varchar2(40), padding varchar2(50));
alter table t1 add constraint t1_pk primary key(id, idx_pad);

column object_id new_value m_index_id

select
        object_id
from
        user_objects
where
        object_name = 'T1_PK'
and     object_type = 'INDEX'
;

begin
        for i in 1..&m_limit loop
                insert into t1 values(
                        i, rpad('x',40,'x'), rpad(' ',50)
                );
                commit;
        end loop;
end;
/


I’ve created a table with a two-column primary key and inserted “m_limit” rows into that table in an order that matches the primary key. The limit of 292 rows (which I’ve declared at the top of the program) means that the index entries for the data set will exactly fill two leaf blocks. I’ve captured the object_id of the index because I want to do a “treedump” of the index before and after inserting the next row.

I now need to run one of two tests inserting a single row. Either insert a row that is above the current highest value to force a 90-10 index leaf node split, or insert a row a little below the current high value to force a 50-50 index node split in the 2nd of the two index leaf blocks.


alter session set events 'immediate trace name treedump level &m_index_id';

alter system switch logfile;
execute snap_my_stats.start_snap;

begin
        for i in &m_limit + 1 .. &m_limit + 1  loop
                insert into t1 values(
                        i, rpad('x',40,'x'), rpad(' ',50)
--                      i - 2 , rpad('y',40,'y'), rpad(' ',50)
                );
                commit;
        end loop;
end;
/

execute snap_my_stats.end_snap

alter session set events 'immediate trace name treedump level &m_index_id';

execute dump_log

The calls to snap_my_stats are using a package I wrote a long time ago to report a delta in my session’s stats. The call to dump_log uses another little package to identify the current log file and issue an “alter system dump logfile …” command. Of the two possible sets of values for the row being inserted, the first will cause a 90-10 split and the second (commented out) will cause a 50-50 split.
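
I haven’t published the dump_log package here, but the essence of it is probably no more than the following sketch – find the current online log member and dump it (you need suitable privileges, and the column/variable names are just my choice):

column member new_value m_logfile

select
        f.member
from
        v$log           l,
        v$logfile       f
where
        f.group# = l.group#
and     l.status = 'CURRENT'
and     rownum   = 1
;

alter system dump logfile '&m_logfile';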

Here are the results from the calls to treedump – first the dump taken before the insert then the dump after a 90-10 split, finally the dump after re-running the test and forcing a 50-50 split. These results came from a database running 11.2.0.4, but the results are the same for 12.1.0.2 and 18.3.0.0:


----- begin tree dump
branch: 0x140008b 20971659 (0: nrow: 2, level: 1)
   leaf: 0x140008e 20971662 (-1: nrow: 147 rrow: 147)
   leaf: 0x140008f 20971663 (0: nrow: 145 rrow: 145)
----- end tree dump

----- begin tree dump
branch: 0x140008b 20971659 (0: nrow: 3, level: 1)
   leaf: 0x140008e 20971662 (-1: nrow: 147 rrow: 147)
   leaf: 0x140008f 20971663 (0: nrow: 145 rrow: 145)
   leaf: 0x140008c 20971660 (1: nrow: 1 rrow: 1)
----- end tree dump


----- begin tree dump
branch: 0x140008b 20971659 (0: nrow: 3, level: 1)
   leaf: 0x140008e 20971662 (-1: nrow: 147 rrow: 147)
   leaf: 0x140008f 20971663 (0: nrow: 78 rrow: 78)
   leaf: 0x140008c 20971660 (1: nrow: 68 rrow: 68)
----- end tree dump


As you can see the extra row in the first case has been inserted into a new leaf block leaving the 2nd leaf block (apparently) unchanged; in the second case the 145 initial rows plus the one extra row have been shared fairly evenly between two leaf blocks. I can’t explain the imbalance in this case; it doesn’t affect the length of the branch entry. (If you’re wondering why the first leaf block held 147 entries while the original 2nd leaf block held 145 it’s because the first 100 entries in the first leaf block had a value for the id column that was 2 bytes long, after which the id needed 3 bytes storage for Oracle’s internal representation.)

Having examined the treedumps to see that the splits are 90-10 and 50-50 respectively we can now look at the undo and redo generated by the different cases. Here are the relevant values extracted from the snapshots of the session stats. Again the first set comes from the 90-10 split, the second from the 50-50 split.


Redo/Undo stats 90/10 split
--------------------------------------------
redo entries                               9
redo size                             18,500
undo change vector size                8,736

Redo/Undo stats 50/50 split
--------------------------------------------
redo entries                               9
redo size                             18,520
undo change vector size                8,736

In both cases the volume of undo and redo is the same (plus or minus a tiny bit – with tiny variations across versions). The undo is equivalent to roughly a whole block plus a few percent (and that will be copied into the redo, of course) and the “pure” redo is also equivalent to a whole block plus a few percent for a total of two data blocks worth plus a couple of thousand bytes. (The extra percentage is mostly about the old and new pointers as we break and make links in the leaf blocks and update and insert links from the branch block above.)

So why does a 90/10 split, which appears simply to add a leaf block and insert one row, do so much work? The answer lies (to start with) in the dump of the redo log file. The session statistics show 9 redo entries (redo change records) generated in both cases, so I’m going to start by picking out a summary of those records from the log file dumps using egrep to identify the lines showing the redo change record length (LEN:) and the redo change vector op codes (OP:). Here’s the output, with a little cosmetic modification, for the 90-10 split.


egrep -e "LEN:" -e"OP:" test_ora_20690.trc

REDO RECORD - Thread:1 RBA: 0x00314b.00000002.0010 LEN: 0x0074 VLD: 0x05
(LWN RBA: 0x00314b.00000002.0010 LEN: 0038 NST: 0001 SCN: 0x0b86.0cfca74e)
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x0140008f OBJ:349950 SCN:0x0b86.0cfca74a SEQ:2 OP:4.1 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000002.0084 LEN: 0x0144 VLD: 0x01
CHANGE #1 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfca746 SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:74 AFN:3 DBA:0x00c0465f OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349950 SCN:0x0b86.0cfca74e SEQ:1 OP:10.6 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000002.01c8 LEN: 0x20a4 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:1 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:74 AFN:3 DBA:0x00c04660 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349950 SCN:0x0b86.0cfca74e SEQ:2 OP:10.9 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000013.017c LEN: 0x0044 VLD: 0x01
CHANGE #1 TYP:0 CLS:8 AFN:5 DBA:0x01400088 OBJ:349950 SCN:0x0b86.0cfca638 SEQ:3 OP:13.22 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000013.01c0 LEN: 0x01ac VLD: 0x01
CHANGE #1 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:74 AFN:3 DBA:0x00c04661 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008c OBJ:349950 SCN:0x0b86.0cfca638 SEQ:2 OP:10.8 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000014.017c LEN: 0x0048 VLD: 0x01
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x0140008b OBJ:349950 SCN:0x0b86.0cfca639 SEQ:2 OP:4.1 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000014.01c4 LEN: 0x00e0 VLD: 0x01
CHANGE #1 TYP:0 CLS:74 AFN:3 DBA:0x00c04661 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:1 AFN:5 DBA:0x0140008b OBJ:349950 SCN:0x0b86.0cfca74e SEQ:1 OP:10.15 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000015.00b4 LEN: 0x1fb0 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:3 OP:5.4 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349950 SCN:0x0b86.0cfca74e SEQ:3 OP:10.8 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000025.0164 LEN: 0x0320 VLD: 0x09
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x01400084 OBJ:349949 SCN:0x0b86.0cfca74a SEQ:2 OP:11.2 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:83 AFN:3 DBA:0x00c004a8 OBJ:4294967295 SCN:0x0b86.0cfca738 SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008c OBJ:349950 SCN:0x0b86.0cfca74e SEQ:1 OP:10.5 ENC:0 RBL:0
CHANGE #4 TYP:0 CLS:83 AFN:3 DBA:0x00c004a8 OBJ:4294967295 SCN:0x0b86.0cfca750 SEQ:1 OP:5.4 ENC:0 RBL:0
CHANGE #5 TYP:0 CLS:84 AFN:3 DBA:0x00c04b0f OBJ:4294967295 SCN:0x0b86.0cfca738 SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #6 TYP:0 CLS:84 AFN:3 DBA:0x00c04b0f OBJ:4294967295 SCN:0x0b86.0cfca750 SEQ:1 OP:5.1 ENC:0 RBL:0

I’ve highlighted two redo records with ‘***’ at the end of the line. One of these records has length 0x20a4, the other has length 0x1fb0, i.e. roughly a whole data block each. We’ll look at those in more detail in a moment. Here, for comparison, is the result from the 50-50 split – again with a few highlighted lines:

REDO RECORD - Thread:1 RBA: 0x00314f.00000002.0010 LEN: 0x0074 VLD: 0x05
(LWN RBA: 0x00314f.00000002.0010 LEN: 0038 NST: 0001 SCN: 0x0b86.0cfcbc25)
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x0140008f OBJ:349962 SCN:0x0b86.0cfcbc21 SEQ:2 OP:4.1 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000002.0084 LEN: 0x0144 VLD: 0x01
CHANGE #1 TYP:0 CLS:69 AFN:3 DBA:0x00c000e8 OBJ:4294967295 SCN:0x0b86.0cfcbc15 SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:70 AFN:3 DBA:0x00c10c43 OBJ:4294967295 SCN:0x0b86.0cfcbc15 SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:1 OP:10.6 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000002.01c8 LEN: 0x20a4 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:69 AFN:3 DBA:0x00c000e8 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:1 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:70 AFN:3 DBA:0x00c10c44 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:2 OP:10.9 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000013.017c LEN: 0x0044 VLD: 0x01
CHANGE #1 TYP:0 CLS:8 AFN:5 DBA:0x01400088 OBJ:349962 SCN:0x0b86.0cfcbb24 SEQ:3 OP:13.22 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000013.01c0 LEN: 0x1010 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:69 AFN:3 DBA:0x00c000e8 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:70 AFN:3 DBA:0x00c10c45 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008c OBJ:349962 SCN:0x0b86.0cfcbb24 SEQ:2 OP:10.8 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.0000001c.0060 LEN: 0x0048 VLD: 0x01
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x0140008b OBJ:349962 SCN:0x0b86.0cfcbb25 SEQ:2 OP:4.1 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.0000001c.00a8 LEN: 0x00e0 VLD: 0x01
CHANGE #1 TYP:0 CLS:70 AFN:3 DBA:0x00c10c45 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:1 AFN:5 DBA:0x0140008b OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:1 OP:10.15 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.0000001c.0188 LEN: 0x1150 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:69 AFN:3 DBA:0x00c000e8 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:3 OP:5.4 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:3 OP:10.8 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000025.0168 LEN: 0x0330 VLD: 0x09
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x01400084 OBJ:349961 SCN:0x0b86.0cfcbc21 SEQ:2 OP:11.2 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfcbc1a SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008c OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:1 OP:10.5 ENC:0 RBL:0
CHANGE #4 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfcbc27 SEQ:1 OP:5.4 ENC:0 RBL:0
CHANGE #5 TYP:0 CLS:74 AFN:3 DBA:0x00c04c64 OBJ:4294967295 SCN:0x0b86.0cfcbc1a SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #6 TYP:0 CLS:74 AFN:3 DBA:0x00c04c64 OBJ:4294967295 SCN:0x0b86.0cfcbc27 SEQ:1 OP:5.1 ENC:0 RBL:0

There are three interesting records in the 50-50 split with lengths 0x20a4 (the same as the 90-10 split), 0x1010, 0x1150. So we seem to start the same way with a “full block” record, and follow up with two “half block” records. The numbers allow you to make a reasonable guess – Oracle copies the original leaf block into the undo, then writes the two new leaf blocks as “pure” redo; in one case the two new leaf block redo records constitute a whole block and a tiny fraction of a block; in the other case the two new leaf block redo records constitute two half blocks.

I won’t show you the full detail that I checked in the log file dump, but the big 0x20a4 record in the 90-10 split is mostly made up of an “OP:5.1” change vector labelled “restore block before image (8032)”, while the 5th and 8th records in both dumps hold “OP:10.8” change vectors labelled “(kdxlne) … new block has NNN rows”. In the case of the 90-10 split the values for NNN are 1 and 145, in the case of the 50-50 split the values for NNN are 68 and 78 – in that (higher values leaf block first) order.

The 90-10 split and the 50-50 split act in exactly the same way – save the old block, allocate a new block, populate two blocks. It really looks as if code re-use has missed an easy opportunity for some optimisation – why save and rewrite a block when the data content is not going to change?

Before assuming there’s a bug (or defect, or sub-optimal implementation) though it’s wise to consider whether there might be something else going on – Oracle developers (especially at the deeper levels) tend to have good reasons for what they do so maybe the rewrite is deliberate and serves a useful purpose.

If you do anything with my current test you won’t spot the extra little feature because my test is running a very special edge case – but I had a thought that would justify the cost (and timing) of the rewrite, and I’ll be publishing the idea, the test, and the results tomorrow.

Footnote

It is possible that a leaf node split means Oracle has to insert a pointer into a level 1 branch node that is already full – in which case Oracle will have to allocate a new branch node, share the branch data (including the new leaf pointer) between the two nodes, and insert a new branch pointer into the relevant level 2 branch block … and that may split etc. all the way up to the root. When the root node splits Oracle allocates two new blocks, increasing the branch level by one and keeping the original root block in place (immediately after all the space management blocks) but now pointing to just 2 “branch N-1” level blocks. Oracle will update the statistics “branch node splits” and “root node splits”.

In certain situations (typically relating to very large deletes followed by highly concurrent small inserts) Oracle may run into problems identifying a suitable “free” block while trying to do a split, and this can result in very slow splits that generate a lot of undo and redo while pinning index leaf blocks exclusively (leading to a couple of the rarer “enq – TX:” waits). In this case you may see the statistics “failed probes on index block reclamation” and “recursive aborts on index block reclamation” starting to climb.  In theory I think you shouldn’t see more than a handful of “failed probes” per “recursive abort” – but I’ve never been able to model the problem to check that.

 

 

OpenWorld is done!

The OpenWorld conference is over for another year. It’s always a thrill, albeit a hard-working thrill, to visit San Francisco and be a part of such a huge event.

I’ll have a full wrap up post and video soon, but I wanted to get a quick blog post out there so people could get a link to see the slides from my sessions.

See my OpenWorld 2018 content here

To all those people that attended my sessions, thanks for giving up your time. I hope you got lots of benefit out of the content. And as always, if you have any feedback on what you liked, or how I can make the content better for you, please drop me a comment below.

If you’d like me to speak at your local event, please reach out to me on Twitter. I can’t make any promises – these things all depend on scheduling and budgeting, but I’ll always try my best.

See you at another event soon!

What’s new in Oracle Database 19c and other #OOW18 feedback

Back from Oracle Open World #OOW18, here are the quick notes I’ve taken about what’s new. I focus on what was publicly announced, but with the ‘Safe Harbor’ statement: all this can be true, wrong or different (three-valued logic;) when 19c reaches General Availability. I’m rushing to publish these notes before entering the beta program, as afterwards it will be hard to remember what was public and what was not.

Long Term Support and Beta Program

19c is confirmed as Long Term Support version, with primary support at least to 2022 and extended support to 2025. Note that in Oracle Cloud, extended support is without additional fees.

If you are still on 11.2.0.4 because you don’t want frequent upgrades, 19c will be the release to go for (but keep it up to date with Release Updates). If you want to be sure you can do this upgrade as soon as 19c is available on premises, then it is better to test it as soon as possible and report any bug you encounter. This means:

  • enter the beta program. Having a beta program for a release focused more on stability than new features is an awesome decision from Oracle Development. Sign-up for beta: https://pdpm.oracle.com
  • test 19c in Oracle Cloud as soon as available. You may use more options than what you have on-premises for those tests, such as the Performance Analyzer.

Oracle Code One, Ground Breakers and XE

The big new thing, in my opinion, is that the JavaOne conference has been merged into the Oracle Open World conference and renamed Oracle Code One. Interleaving Dev and Ops sessions is a really good idea. The Oracle Technology Network (OTN), which became the Oracle Developer Community (ODC), is now becoming the Oracle Groundbreakers. What I like here is that there is no mention of technology or development. We are all in the same project: improve the user experience with all the application components, from system to code.

This was also the occasion to give a talk more focused on developers than on DBAs. In the age of microservices, I wanted to share my experience in Dev and Ops teams on monolithic, client/server, 3-tier,… architectures. Segregation of responsibility is great at the logical level, but the path to performance and scalability must often go through colocation in the same server, process, or engine…

Presenting at a developer event was new to me, and it seems that the goal was reached, judging by the kind of feedback I got – which I really appreciate.

Oracle XE 18c is out (18.4)

In this talk, I mentioned Oracle XE, which can be embedded for free. The 18c version of XE was released on Saturday during the ACED Briefing. It is a lot larger than the 11g version, but comes with more options. Compression is an important one because the storage is limited to 12 GB.

19c new Optimizer features

The best thing you can do for the optimizer is to provide accurate statistics. Daily gathering on stale tables is not sufficient, and dynamic statistics add too much overhead to parse activity. We will have two new features:

  • High Frequency Automatic Statistics Task: some statistics gathering will be done during the day, when the system is idle, controlled by the Resource Manager.
  • Real-Time Statistics: some statistics (like high/low value) are maintained during DML. This can be great to avoid Out-of-Range problems with linear decay from obsolete statistics.

We should not rely on hints, but sometimes we need to. And we want to be sure that they are used (because no error is raised when they are not).

  • Hint Usage Reporting: dbms_xplan reports the hints per operation and alias name, with an indication for those that were not used (and reason)

Talking about DBMS_XPLAN, the comparison of plans has improved:

  • Compare SQL Execution Plans: shows directly where the plan diverges

At execution time, when a query never ends, the worst that can happen is that the impatient run it again. That’s not a good idea, but we all do it on web applications: no answer -> hit F5 several times. The query will not run faster and this only increases the contention. So it’s better to quarantine those statements for a while:

  • Automatic Diagnostic and Repair and Quarantine for Runaway Statements

Of course, the big feature for the 19c Autonomous Transaction Processing cloud service is Automatic Indexing. This deserves a longer description. The short idea is that a background job will capture and analyze the workload over the short term (the last 15 minutes) to add better indexes, remove them, or blacklist them (SQL Patch) for some queries.

And the feature I like the most:

  • Automatic SQL Plan Management: the capture will be automatic

Enabling the capture for all statements executed at least twice is not easy, even with the new filters introduced in the latest release. Here, the capture relies on AWR. When a query is long, the execution plan will be compared with past AWR snapshots. And if a better plan was used in the past, then that one will become the accepted SQL Plan Baseline. This is what we do manually, isn’t it?

Other Database new features

  • Fast ingest support for IoT: in conjunction with middle tier to stream data continuously to the database. The inserts are buffered in memory to benefit from asynchronous bulk inserts.
  • Partitioned Hybrid Tables: the idea is to allow different organizations (Heap/External) for different partitions. Maybe have some old partitions in a read-only tablespace on the Object Store?
  • The multitenant main focus is bug fixes as the goal is to abandon non-CDB for the next main release (20c). 19c should allow RAT and ADDM at PDB level.
  • Data Guard: the re-creation of broker configuration is a long-awaited feature. FSFO will be able to allow data loss, and dynamically change the target

One problem when offloading a read-only application to an Active Data Guard standby is that the application may not run in a read-only instance. An example is updating a table when the user logs in. Some DML will be possible:

  • Data Guard: DML redirect on ADG for read-mostly applications

RAC is a key component for Oracle Cloud Autonomous Database, because of online patching:

  • RAC: Zero downtime Infrastructure Patching (including JVM) and Read-Only instances

And talking about Cloud, after the public cloud in the private cloud, now here is the private cloud in the public cloud:

  • dedicated ATP (Autonomous Transaction Processing): a quarter rack Exadata with limited access to CDB dedicated to you and where you can create multiple PDBs.

Upgrades – again

I’ve got a data set which I’ve recreated in 11.2.0.4 and 12.2.0.1.

I’ve generated stats on the data set, and the stats are identical.

I don’t have any indexes or extended stats, or SQL Plan directives or SQL Plan Profiles, or SQL Plan Baselines, or SQL Patches to worry about.

I’m joining two tables, and the join column on one table has a frequency histogram while the join column on the other table has a height-balanced histogram. The histograms were created with estimate_percent => 100% (which explains why I’ve got a height-balanced histogram in 12c rather than a hybrid histogram).
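
For reference, the statistics were gathered with a call of roughly this shape – my sketch rather than the exact script, with the method_opt shown only as an illustration of forcing a histogram at a 100% sample:

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                estimate_percent => 100,
                method_opt       => 'for columns j1 size 254'
        );
end;
/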

Here are the two execution plans, 11.2.0.4 first, pulled from memory by dbms_xplan.display_cursor():


SQL_ID  f8wj7karu0hhs, child number 0
-------------------------------------
select         count(*) from         t1, t2 where         t1.j1 = t2.j2

Plan hash value: 906334482

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |      12 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |      12 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |   1855 |   1327 |00:00:00.01 |      12 |  2440K|  2440K| 1357K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100 |    100 |00:00:00.01 |       6 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    800 |    800 |00:00:00.01 |       6 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."J1"="T2"."J2")



SQL_ID  f8wj7karu0hhs, child number 0
-------------------------------------
select         count(*) from         t1, t2 where         t1.j1 = t2.j2

Plan hash value: 906334482

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |      41 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |      41 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |   1893 |   1327 |00:00:00.01 |      41 |  2545K|  2545K| 1367K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100 |    100 |00:00:00.01 |       7 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    800 |    800 |00:00:00.01 |       7 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."J1"="T2"."J2")

The key point is the difference between the two cardinality estimates. Why has that appeared, and what might the optimizer do in a more complex plan when a cardinality estimate changes?

The difference is only 2%, but that was on a couple of data sets I just happened to run up to check something completely different – I wasn’t trying to break something, so who knows how big the variation can get. Of course if you’re switching from 11g to 12c then Oracle (Corp.) expects you to be using auto_sample_size anyway, so you shouldn’t be producing height-balanced histograms.

So does this difference really matter? Maybe not, but if you (like many sites I’ve seen) are still using fixed percentage sample sizes and are generating histograms it’s another reason (on top of the usual instability effects of height-balanced and hybrid histograms) why you might see plans change as you upgrade from 11g to 12c.

Footnote

It looks as if the difference comes mostly from a coding error in 11g that has been fixed in 12c – I couldn’t find an official bug or fix_control that matched, though. More on that later in the week.

Update

Chinar Aliyev has pointed out that there are three fix-controls that may be associated with this (and other) changes. From v$system_fix_control these are:

14033181 1 QKSFM_CARDINALITY_14033181   correct ndv for non-popular values in join cardinality comp.         (12.1.0.1)
19230097 1 QKSFM_CARDINALITY_19230097   correct join card when popular value compared to non popular         (12.2.0.1)
22159570 1 QKSFM_CARDINALITY_22159570   correct non-popular region cardinality for hybrid histogram          (12.2.0.1)

I haven’t tested them yet, but with the code easily available in the article it won’t take long to see what the effects are when I have a few minutes. The first fix may also be why I had a final small discrepancy between 11g and 12c on the join on two columns with frequency histograms.
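
For anyone who wants to experiment, checking and toggling one of these fix controls at the session level is straightforward – a sketch (run it on a scratch system, of course):

select
        bugno, value, description
from
        v$system_fix_control
where
        bugno in (14033181, 19230097, 22159570)
;

-- switch the first fix off for this session, re-run the test, then switch it back on
alter session set "_fix_control" = '14033181:0';
alter session set "_fix_control" = '14033181:1';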

Oak Table World 2018

Oak Table World 2018 (OTW) just completed at the Children’s Creativity Museum in San Francisco.  The event website is “https://otw18.org“.

This year, it was my turn to organize this remarkable little un-conference on behalf of the Oak Table Network, which is often described as a drinking society with an Oracle problem.

What is an “un-conference”?  The dictionary definition is…

un·con·fer·ence   /ˈənkänf(ə)rəns/

noun: unconference; plural noun: unconferences

a loosely structured conference emphasizing the
informal exchange of information and ideas between
participants, rather than following a conventionally
structured program of events. Generally this manifests
as a "participant-driven" event.

OTW started out in 2010 with Mogens Norgaard holding a set of tables in the back room of the Chevy’s Tex-Mex Bar & Grill in San Francisco (since closed), just across 3rd St from the Moscone Center where Oracle OpenWorld was being held.

Rather than drinking from the flood pouring from the corporate marketing machines across the street, friends joined Mogens for beer and Tex-Mex to discuss and argue technical topics and drink beer.

Thus was born “Oracle ClosedWorld” as a true unconference.  The following year in 2011, Oracle ClosedWorld was held in the upstairs room at The Thirsty Bear, a few doors down from Chevy’s on Howard St, this time with an agenda and a more structured format.

However, the ever-vigilant legal department at Oracle Corporation was not in the least amused by the gentle gibe of the name “Oracle ClosedWorld” from a bunch of drunk geeks, so after the event they quickly regurgitated a letter offering detailed demands and dire consequences; of course we sweetly complied, and the unconference was renamed “Oak Table World” (OTW).  To date, Oracle’s legal team has not imposed its will on our continued use of the word “World”, perhaps at least until Oracle achieves actual world domination.  So we wait with bated breath for their next move.

The following year, in 2012, Oak Table World found a new home at the Children’s Creativity Museum, which is located at the corner of 4th Street and Howard Street, smack dab in the middle of the Moscone complex.  The unconference was then organized by Kyle Hailey, who continued to do so in 2013, 2014, and 2015.  In 2016, it was organized by Kellyn Pot’vin-Gorman, in 2017 by Jeremiah Wilton, and in 2018 by Tim Gorman.

The event seems to be slipping away from its original un-conference format, becoming more professionalized, becoming more of a conference-within-a-bigger-conference.  I’m not sure if that is a good thing or a bad thing, but once you start laying out money for a venue and catering, things get more conventional pretty quickly.

Join Cardinality – 4

In previous installments of this series I’ve been describing how Oracle estimates the join cardinality for single column joins with equality where the columns have histograms defined. So far I’ve  covered two options for the types of histogram involved: frequency to frequency, and frequency to top-frequency. Today it’s time to examine frequency to hybrid.

My first thought about this combination was that it was likely to be very similar to frequency to top-frequency because a hybrid histogram has a list of values with “repeat counts” (which is rather like a simple frequency histogram), and a set of buckets with variable sizes that could allow us to work out an “average selectivity” of the rest of the data.

I was nearly right but the arithmetic didn’t quite work out the way I expected.  Fortunately Chinar Aliyev’s document highlighted my error – the optimizer doesn’t use all the repeat counts, it uses only those repeat counts that identify popular values, and a popular value is one where the endpoint_repeat_count is not less than the average number of rows in a bucket. Let’s work through an example – first the data (which repeats an earlier article, but is included here for ease of reference):

rem
rem     Script:         freq_hist_join_06.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem

set linesize 156
set pagesize 60
set trimspool on

execute dbms_random.seed(0)

create table t1 (
        id              number(6),
        n04             number(6),
        n05             number(6),
        n20             number(6),
        j1              number(6)
)
;

create table t2(
        id              number(8,0),
        n20             number(6,0),
        n30             number(6,0),
        n50             number(6,0),
        j2              number(6,0)
)
;

insert into t1
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   4) + 1                    n04,
        mod(rownum,   5) + 1                    n05,
        mod(rownum,  20) + 1                    n20,
        trunc(2.5 * trunc(sqrt(v1.id*v2.id)))   j1
from
        generator       v1,
        generator       v2
where
        v1.id <= 10 -- > comment to avoid WordPress format issue
and     v2.id <= 10 -- > comment to avoid WordPress format issue
;

insert into t2
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   20) + 1                   n20,
        mod(rownum,   30) + 1                   n30,
        mod(rownum,   50) + 1                   n50,
        28 - round(abs(7*dbms_random.normal))        j2
from
        generator       v1
where
        rownum <= 800 -- > comment to avoid WordPress format issue
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname          => null,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1 for columns j1 size 254'
        );
        dbms_stats.gather_table_stats(
                ownname          => null,
                tabname          => 'T2',
                method_opt       => 'for all columns size 1 for columns j2 size 13'
        );
end;
/

As before I’ve got a table with 100 rows using the sqrt() function to generate column j1, and a table with 800 rows using the dbms_random.normal function to generate column j2. So the two columns have skewed patterns of data distribution, with a small number of low values and larger numbers of higher values – but the two patterns are different.

I’ve generated a histogram with 254 buckets (which dropped to 10) for the t1.j1 column, and generated a histogram with 13 buckets for the t2.j2 column as I knew (after a little trial and error) that this would give me a hybrid histogram.

Here’s a simple query, with its result set, to report the two histograms – using a full outer join to line up matching values and show the gaps where (endpoint) values in one histogram do not appear in the other:


define m_popular = 62

break on report skip 1

compute sum of product on report
compute sum of product_rp on report

compute sum of t1_count on report
compute sum of t2_count on report
compute sum of t2_repeats on report
compute sum of t2_pop_count on report

with f1 as (
select
        table_name,
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
        endpoint_number,
        endpoint_repeat_count,
        to_number(null)
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'J1'
order by
        endpoint_value
),
f2 as (
select
        table_name,
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
        endpoint_number,
        endpoint_repeat_count,
        case when endpoint_repeat_count >= &m_popular
                        then endpoint_repeat_count
                        else null
        end     pop_count
from
        user_tab_histograms
where
        table_name  = 'T2'
and     column_name = 'J2'
order by
        endpoint_value
)
select
        f1.value t1_value,
        f2.value t2_value,
        f1.row_or_bucket_count t1_count,
        f2.row_or_bucket_count t2_count,
        f1.endpoint_repeat_count t1_repeats,
        f2.endpoint_repeat_count t2_repeats,
        f2.pop_count t2_pop_count
from
        f1
full outer join
        f2
on
        f2.value = f1.value
order by
        coalesce(f1.value, f2.value)
;


  T1_VALUE   T2_VALUE   T1_COUNT   T2_COUNT T1_REPEATS T2_REPEATS T2_POP_COUNT
---------- ---------- ---------- ---------- ---------- ---------- ------------
                    1                     1                     1
         2                     5                     0
         5                    15                     0
         7                    15                     0
        10                    17                     0
        12                    13                     0
        15         15         13         55          0         11
        17         17         11         56          0         34
                   19                    67                    36
        20         20          7         57          0         57
                   21                    44                    44
        22         22          3         45          0         45
                   23                    72                    72           72
                   24                    70                    70           70
        25         25          1         87          0         87           87
                   26                   109                   109          109
                   27                    96                    96           96
                   28                    41                    41
---------- ---------- ---------- ----------            ---------- ------------
                             100        800                   703          434

You’ll notice that there’s a substitution variable (m_popular) in this script that I use to identify the “popular values” in the hybrid histogram so that I can report them separately. I’ve set this value to 62 for this example because a quick check of user_tables and user_tab_cols tells me I have 800 rows in the table (user_tables.num_rows) and 13 buckets (user_tab_cols.num_buckets) in the histogram: 800/13 = 61.54, and a value is popular only if its repeat count exceeds that ratio, i.e. if it is 62 or more.
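For convenience (this isn't part of the original workflow, just a sketch) you can let the dictionary do that arithmetic for you: floor(num_rows/num_buckets) + 1 gives the smallest repeat count that qualifies as popular, which is 62 for this data:

select
        t.num_rows,
        c.num_buckets,
        floor(t.num_rows / c.num_buckets) + 1   popularity_threshold
from
        user_tables     t,
        user_tab_cols   c
where
        t.table_name  = 'T2'
and     c.table_name  = 'T2'
and     c.column_name = 'J2'
;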

This is where you may hit a problem – I certainly did when I switched from testing 18c to testing 12c (which I just knew was going to work – but I tested anyway). Although my data has been engineered so that I get the same “random” data in both versions of Oracle, I got different hybrid histograms (hence my complaint in a recent post). The rest of this article covers 18c in detail, but if you’re running 12c there are a couple of defined values, noted below, that you can change to get the right results.

At this point I need to “top and tail” the output because the arithmetic only applies where the histograms overlap, so I need to pick the range from 2 to 25. Then I need to inject a “representative” or “average” count/frequency into all the gaps, then cross-multiply. The average frequency for the frequency histogram is “half the frequency of the least frequently occurring value” (which seems to be identical to new_density * num_rows), and the representative frequency for the hybrid histogram is (“number of non-popular rows” / “number of non-popular values”). There are 800 rows in the table with 22 distinct values in the column, and the output above shows us that we have 5 popular values totalling 434 rows, so the representative frequency is (800 – 434) / (22 – 5) = 21.5294. (Alternatively we could say that the average selectivities (which is what I’ve used in the next query) are 0.5/100 and 21.5294/800.)

[Note for 12c, you’ll get 4 popular values covering 338 rows, so your figures will be: (800 – 338) / (22 – 4) = 25.6666… and 0.0320833]
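As a cross-check (again, just a hedged sketch, not something the original test depends on) both “average” selectivities can be derived from the dictionary using the rules quoted above – half the lowest frequency in the frequency histogram for t1.j1, and (non-popular rows / non-popular values) for t2.j2 – and the result should echo the 0.005 and (to within rounding) the 0.0269118 used below, or the 12c equivalents:

with t1_freq as (
        -- per-value frequencies from the t1.j1 frequency histogram
        select
                endpoint_number
              - lag(endpoint_number,1,0) over (order by endpoint_number) frequency
        from
                user_tab_histograms
        where
                table_name  = 'T1'
        and     column_name = 'J1'
),
t2_pop as (
        -- popular values and rows from the t2.j2 hybrid histogram
        select
                count(*)                        pop_values,
                sum(endpoint_repeat_count)      pop_rows
        from
                user_tab_histograms
        where
                table_name  = 'T2'
        and     column_name = 'J2'
        and     endpoint_repeat_count >= &m_popular
)
select
        (select min(frequency) from t1_freq) / 2 / t1.num_rows         t1_avg_sel,
        (t2.num_rows - p.pop_rows)
              / (c2.num_distinct - p.pop_values) / t2.num_rows         t2_avg_sel
from
        user_tables     t1,
        user_tables     t2,
        user_tab_cols   c2,
        t2_pop          p
where
        t1.table_name  = 'T1'
and     t2.table_name  = 'T2'
and     c2.table_name  = 'T2'
and     c2.column_name = 'J2'
;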

So here’s a query that restricts the output to the rows we want from the histograms, discards a couple of columns, and does the arithmetic:


rem     the first m_t2_sel is the 12c value; the second (18c) value overrides it

define m_t2_sel = 0.0320833
define m_t2_sel = 0.0269118
define m_t1_sel = 0.005

break on table_name skip 1 on report skip 1

with f1 as (
select
        table_name,
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
        endpoint_number,
        endpoint_repeat_count,
        to_number(null) pop_count
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'J1'
order by
        endpoint_value
),
f2 as (
select
        table_name,
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
        endpoint_number,
        endpoint_repeat_count,
        case when endpoint_repeat_count >= &m_popular
                        then endpoint_repeat_count
                        else null
        end     pop_count
from
        user_tab_histograms
where
        table_name  = 'T2'
and     column_name = 'J2'
order by
        endpoint_value
)
select
        f1.value f1_value,
        f2.value f2_value,
        nvl(f1.row_or_bucket_count,100 * &m_t1_sel) t1_count,
        nvl(f2.pop_count,          800 * &m_t2_sel) t2_count,
        case when (   f1.row_or_bucket_count is not null
                   or f2.pop_count is not null
        )    then
                nvl(f1.row_or_bucket_count,100 * &m_t1_sel) *
                nvl(f2.pop_count,          800 * &m_t2_sel)
        end      product_rp
from
        f1
full outer join
        f2
on
        f2.value = f1.value
where coalesce(f1.value, f2.value) between 2 and 25
order by
        coalesce(f1.value, f2.value)
;


 F1_VALUE   F2_VALUE   T1_COUNT   T2_COUNT PRODUCT_RP
---------- ---------- ---------- ---------- ----------
         2                     5   21.52944   107.6472
         5                    15   21.52944   322.9416
         7                    15   21.52944   322.9416
        10                    17   21.52944  366.00048
        12                    13   21.52944  279.88272
        15         15         13   21.52944  279.88272
        17         17         11   21.52944  236.82384
                   19         .5   21.52944
        20         20          7   21.52944  150.70608
                   21         .5   21.52944
        22         22          3   21.52944   64.58832
                   23         .5         72         36
                   24         .5         70         35
        25         25          1         87         87
                      ---------- ---------- ----------
sum                          102  465.82384 2289.41456

There’s an important detail that I haven’t mentioned so far. In the output above you can see that some rows show “product_rp” as blank. While we cross-multiply the frequencies from t1.j1 and t2.j2, filling in average frequencies where necessary, we exclude from the final result any rows where average frequencies have been used for both histograms. That’s why values 19 and 21 contribute nothing, while value 23 (where the 72 is a genuine popular count from the hybrid histogram) still contributes 0.5 * 72 = 36.

[Note for 12c, you’ll get the result 2698.99736 for the query, and 2699 for the execution plan]

Of course we now have to check that the predicted cardinality for a simple join between these two tables really is 2,289 (i.e. the 2,289.41456 above, rounded). So let’s run a suitable query and see what the optimizer predicts:


set serveroutput off

alter session set statistics_level = all;
alter session set events '10053 trace name context forever';

select
        count(*)
from
        t1, t2
where
        t1.j1 = t2.j2
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

alter session set statistics_level = typical;
alter session set events '10053 trace name context off';

SQL_ID  cf4r52yj2hyd2, child number 0
-------------------------------------
select  count(*) from  t1, t2 where  t1.j1 = t2.j2

Plan hash value: 906334482

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |     108 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |     108 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |   2289 |   1327 |00:00:00.01 |     108 |  2546K|  2546K| 1194K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100 |    100 |00:00:00.01 |      18 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    800 |    800 |00:00:00.01 |      34 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."J1"="T2"."J2")

As you can see, the E-Rows for the join is 2,289, as required.
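Because the test wraps the query in a 10053 trace, you can also go one step further (not something the argument depends on, just an extra check) and read the optimizer’s own join cardinality calculation from the trace file, where the unrounded 2,289.41456 should appear on the “Join Card” line. If you’re not sure which trace file to open, this query (11g onwards) tells you:

select  value
from    v$diag_info
where   name = 'Default Trace File'
;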

I can’t claim that the model I’ve produced is definitely what Oracle does, but it looks fairly promising. No doubt, though, there are some variations on the theme that I haven’t considered – even when sticking to a simple (non-partitioned) join on equality on a single column.

“Hidden” Efficiencies of Non-Partitioned Indexes on Partitioned Tables Part III” (Ricochet)

In Part I and Part II of this series, we looked at how Global Indexes can effectively perform “Partition Pruning” when the partition keys are specified in SQL predicates, by only using those index entries that have a Data Object of interest stored within the index Rowids. In this piece, I’ll cover the key performance […]

OTW18 Sessions: Data Engineering and Latch Contentions

Thank you to all who came to my sessions at OakTable World. Here are the presentation materials:

Think Data; not Database
Data Engineering for Data Science
Latch Contentions Presentation Scripts