Search

OakieTags

Who's online

There are currently 0 users and 21 guests online.

Recent comments

Affiliations

Infrastructure

GTTs

That’s Global Temporary Table, of course.

A request appeared on the OTN database forum recently asking if it was possible to allocate a global temporary table to a particular temporary tablespace because the programmer didn’t want typical users to be at risk of running out of temporary space when a particular global temporary table suddenly got loaded with millions of rows.

If you’re running 11.1 or above, go straight to the footnote, otherwise read on.

At first sight the answer seems to be no – but an important point in the question was the comment: “I have a procedure which …”. This may make it possible to isolate this table in the required way.

When someone inserts data into a global temporary table the tablespace used to instantiate their version of the table is the temporary tablespace of the schema that is inserting the data. But remember, if you call a procedure to insert the data it’s (usually) the schema that owns the procedure that is doing the insert. (I say “usually” because you could, of course, have a procedure defined as “authid current_user”)

Consider the following (running on 11.1.0.6):

select
	username, temporary_tablespace
from
	dba_users
where
	username in ('U1','TEST_USER')
;

USERNAME        TEMPORARY_TABLESPACE
--------------- --------------------
U1              TEMP
TEST_USER       TEMP2

I have two users with different temporary tablespaces. In schema test_user we do the following:

create global temporary table t
as
select	*
from	all_objects
where	rownum = 0;

grant all on t to u1;

create or replace procedure pt
as
begin
	insert into t select * from all_objects where rownum <= 1000;
end;
/

grant execute on pt to u1;

Now we connect as user u1 and run the following:


SQL> insert into test_user.t select * from all_objects where rownum < = 10;

10 rows created.

SQL> commit;

Commit complete.

SQL> execute test_user.pt

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

The first commit it very important, by the way. In another session logged in through a suitably privileged account, execute the following before each commit:

select username, tablespace from v$sort_usage;

The results after the direct call to insert into the GTT will be:

USERNAME TABLESPACE
-------- ----------------
U1       TEMP

The results from the call to the procedure will be:

USERNAME TABLESPACE
-------- ----------------
U1       TEMP2

The direct insert has used the temporary tablespace of schema U1, but the insert from the procedure call has used the temporary tablespace of schema test_user. So there are ways to allocate a global temporary table to a particularly temporary tablespace if you always use a procedure to populate it. I did point out, though, that the first commit in the demonstration was important. If you don’t do this commit the global temporary table will have been instantiated in the “wrong” temporary tablespace at the point you call the procedure, so the data from the procedure call would be added to the existing data – i.e. where you didn’t want it to go.

Footnote: Everything changes – you’ll see in a response to the thread from Uwe Hesse that from 11g onwards you can specify a tablespace for a global temporary tablespace, and that’s where the data will go. (But I’d drafted this note before I got back to the thread – and it’s still valid for earlier versions of Oracle.)

Footnote 2: I had to create a spare temporary tablespace for this demo. It’s funny little detail that although you: “create TEMPORARY tablespace … TEMPfile…”, when it comes to dropping it’s: “drop tablespace … including … datafiles”.

Quiz Night

Here’s an execution plan pulled from v$sql_plan in 10.2.0.3 with the “gather_plan_statistics” option enabled:

----------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Starts | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   1 |  NESTED LOOPS                |        |      1 |      0 |00:00:00.01 |     608 |
|   2 |   TABLE ACCESS BY INDEX ROWID| PARENT |      1 |    200 |00:00:00.01 |       6 |
|   3 |    INDEX RANGE SCAN          | PAR_I1 |      1 |    200 |00:00:00.01 |       2 |
|   4 |   TABLE ACCESS BY INDEX ROWID| CHILD  |    200 |      0 |00:00:00.01 |     602 |
|   5 |    INDEX UNIQUE SCAN         | CHI_I1 |    200 |    200 |00:00:00.01 |     402 |
----------------------------------------------------------------------------------------

The code defining the two tables is the same:

create table &m_table_name
as
select
	rownum					id,
	trunc(dbms_random.value(1,&m_rowcount))	n1,
	lpad(rownum,10,'0')			small_vc,
	rpad('x',100)				padding
from
	big_row_source
where
	rownum <= &m_rowcount
;

The indexes par_i1 and chi_i1 are on the id column and, as you can see for chi_i1, have been created as unique indexes. Unlike the table, though, the code defining them isn’t necessarily identical.

Please state, with justification, what you think the blevel is for index chi_i1 ?

(Note – to avoid any risk of confusion, remember that height = blevel + 1; and just in case anyone thinks it makes any difference, the tables and indexes were all created in the same tablespace which is my usual setup of 1MB uniform extents, locally managed, 8K block size and freelist management.)

The quiz isn’t intended to be cunning, devious and tricksy, by the way. It’s all about observing and understanding a fairly ordinary situation. On the other hand I’m sure there are interesting variations with strange side effects that could be introduced if you really want to get clever.

Answer 23rd Jan:
Rather than writing my answer into a new post I’ve added it below – but to give new visitors a chance to work out the answer before seeing mine I’ve left a big gap before the answer.

Reading the comments I think all the interesting points have been covered. We have comments covering special treatment of root blocks in nested loop joins, the “buffer is pinned count” statistic, possible changes in 11g, the hypothesis that I had set a large pctfree for the child index,and the link between “A-rows” in one line of the plan and “Starts” in another line.

For reference, my query looked like this:

select
	par.small_vc,
	chi.small_vc
from
	parent	par,
	child	chi
where
	par.id between 301 and 500
and	chi.id = par.n1
and	chi.n1 = 0
;

The mechanics of the plan are:
Line 3 – index range scan of par_i1 (which finds ultimately 200 matching rowids).
Line 2 – for each rowid from line 3 check the matching table row (which ultimately returns 200 rows)

Then two hundred calls to:
Line 5 – index unique scan of chi_i1, which always finds it single row – hence returning 200 rowids in total
Line 4- check the matching row in the table – which always fails – hence returning zero rows in total

The critical counts that allow you to answer my question are the 402 and 602 buffer gets in lines 5 and 4. Ignoring, temporarily, the odd 2 these numbers are clearly 200 times something – which is rather nice given that we know that we are doing something 200 times. The simplest solution, of course is that the (602 – 402 =) 200 gets due to line 4 represents the visits to the 200 table blocks, leaving us with 400 (plus an odd bit) to account for.

Now consider the possibilities:

    Blevel = 0 – we would visit the only index block (which would be the root, but look like a leaf) then visit the table.
    Blevel = 1 – we would visit the root, the leaf which is the next layer down, then the table
    Blevel = 2 – we would visit the root, the branch level, the leaf, then the table.

So, at first sight we might decide the only way to get 400 buffer visits from the index is to have Blevel = 1, visit the root and the leaf. But that’s not the way it works (apart from a couple of versions which had a bug).

When running a nested loop join, Oracle “pins” the root block of the inner (second table) index, which means it attaches a structure to the “users’ list” in the buffer header that links the session’s state object directly to the buffer header. So, in my nested loop join, Oracle gets the root block once and keeps it pinnned, then gets branch and leaf blocks 200 times. The blevel on the child index is 2. (The blevel of the parent index is only 1 – I had set the pctfree on the child index to 90 to make it much larger than it needed to be.)

This “root block pin” isn’t the only pinning optimisation in the query, though. In detail, the steps are as follows – and this is something you would be table to see in the trace file if you enabled event 10200 (one of the ‘consistent reads’ traces).

    1) Get the root block of parent index and pin it
    2) Get the first relevant leaf block of the parent index and pin it
    3) Get the first relevant table block of the parent table and pin it.
    4) Get the root block of the child index and pin it
    5) Get the relevant branch block of the child index (no pin)
    6) Get the relevant leaf block of the child index (no pin)
    7) Get the table block of the child table (no pin) — row does not match final predicate and is rejected
    8) Revisit the parent index leaf block through the pin (add one to “buffer is pinned count”)
    9) Revisit the parent table block through the pin (add one to “buffer is pinned count”)
    10) Revisit the child index root block through the pin (add one to “buffer is pinned count”)
    11) Get a new child branch
    12) Get a new child leaf
    13) Get a new table block
    14) Repeat for a total of 200 cycles from step 8

There a couple of deviations from this cycle, of course. The rows I needed from parent were spread across 4 consecutive blocks in the table so, roughly every 50 rows from parent, line (9) above would become “release current pinned parent block, get new parent block and pin it”. The other little oddity that I can’t explain is that Oracle does “get” the child root block on the second visit to the index as well as the first visit – and then pins it from that moment onwards. So the counts are:

    Gets on the parent index 2 (root and leaf)
    Gets on the parent table 4 (one for each table block visited)
    Gets on the child index 402 (two on the root block, 200 on branch blocks, 200 on leaf blocks)
    Gets on the child table (200 for each row/block accessed)

Inevitably, things change – Oracle keeps getting smarter about things like ‘consistent gets – examination’, ‘buffer is pinned count’ and, in 11g, “fastpath” access to buffered blocks. This is a clue to the difference in gets that Charles Hooper recorded in 11.2.0.2 – and explains why I chose to use 10.2.0.3 with a unique index for my example. If you want to investigate other variations all it takes is snapshots of v$mystat, calls to “alter system flush buffer_cache”, and event 10200; the treedump can also be very helpful for identifying block addresses.

[Some notes which say a little more about gets and pins]

ASSM Again

While checking my backlog of drafts (currently 75 articles in note form) I came across this one from August 2009 and was a little upset that I hadn’t finished it sooner – it’s a nice example of geek stuff that has the benefit of being useful.

From the Oracle newsgroup comp.databases.oracle.server, here’s an example of how to recreate a performance problem due to maintenance on ASSM bitmaps in 10.2.0.4.

Create a table in a tablespace with an 8KB block size, locally managed tablespace with uniform 1MB extents, and automatic segment space management (ASSM). Check the newsgroup thread if you want complete details on reproducing the test:

Session 1: Insert 100,000 rows of about 490 bytes into a table using a pl/sql loop and commit at end.
Session 1: Insert 1,000 rows into the table with single SQL inserts and no commits
Session 1: delete all data from the table with a single statement – but do not commit

Session 2: Insert 1,000 rows into the table with single SQL inserts and no commits – it’s very slow.

As one person on the thread pointed out – it looks as if Oracle is doing a full tablescan of the table, one block at a time showing “db file sequential read” waits all the way through the table. (If your db_cache_size is large enough you might not see this symptom).

I simplified the test – inserting just 100,000 of the rows (with the commit), then deleting them all (without the commit), then inserting one row from another session. Taking a snapshot of x$kcbsw and x$kcbwh, I got the following figures for the activity that took place inserting that one extra row (this was on Oracle 10.2.0.3):

---------------------------------
Buffer Cache - 23-Dec 11:50:36
Interval:-  0 seconds
---------------------------------
          Why0          Why1          Why2    Other Wait
          ----          ----          ----    ----------
         1,457             0             0             0 ktspfwh10: ktspscan_bmb
             8             0             0             0 ktspswh12: ktspffc
             1             0             0             0 ktsphwh39: ktspisc
         7,061             0             0             0 ktspbwh1: ktspfsrch
             1             0             0             0 ktuwh01: ktugus
         7,060             0             0             0 ktuwh05: ktugct
             1             0             0             0 ktuwh09: ktugfb
             2             0             0             0 kdswh02: kdsgrp
             2             0             0             0 kdiwh06: kdifbk
             2             0             0             0 kdiwh07: kdifbk
          ----          ----          ----    ----------
        15,595             0             0             0 Total: 10 rows

The figures tell us how much work Oracle has to do to find a table block that could accept a new row. The idea is simple – Oracle checks the first “level 3″ bitmap block (which is actually part of the segment header block) to find a pointer to the current “level 2″ bitmap block; it checks this level 2 bitmap block to find a pointer to the current “level 1″ bitmap block; and finally it checks the “level 1″ bitmap block to find a pointer to a data block that shows some free space.

Unfortunately every block in our table is apparently empty – but that’s only going to be true once session 1 commits. In this version of Oracle the blocks are all visible as “x% free” in the level 1 bitmaps – but when Oracle visits each block (“ktspbwh1: ktspfsrch”) it checks the ITL entry, which points it to the transaction table slot in the related undo segment header block to Get the Commit Time for the transaction (“ktuwh05: ktugct”) and finds that the transaction is not committed so the space is not really free. So Oracle has to visit the next block shown as free in the bitmap.

In our “bulk delete / no commit” case, we end up visitng every (or nearly every) block in the entire table before we find a block we can actually use – and, given the nature of the ASSM bitmap implementation, the order of the block visits is the “natural” table order, so we see something that looks like a full tablescan operating through single blocks reads (apart, perhaps, from a few blocks that are still cached).

I can’t explain why we do 1,457 visits to bitmap blocks (“ktspfwh10: ktspscan_bmb”) in this version of Oracle, but perhaps it’s simply an indication that Oracle picks (say) five “free block” entries from the bitmap block each time it visits it and therefore has to visit each bitmap block about 12 times if it doesn’t find a data block with space that really is free in it search.

Note – these results will be hugely dependent on version of Oracle – in an earlier version of Oracle the bitmap blocks were not updated by the delete until some time after the commit – and this variation of delayed block cleanout produced other unpleasant anomalies; and, just to make like difficult in later versions of Oracle, the x$kcbsw / x$kcbwh objects are not populated properly in 11g.

Footnote: In case you hadn’t realised, ASSM is a mechanism aimed at OLTP systems with a reasonable degree of concurrency – so it’s not too surprising that you can find problems and weak spots if you hit it with processing which is biased towards the DW and batch processing end of the spectrum.

clonedb

I’ve lost count of the number of times I’ve said something like: “Ideally you really need to do your testing on the production system”. Kevin Closson has recently written some interesting notes (with more to come) about clonedb – an 11.2 feature that may make it possible to get pretty close to this ideal.

Footnote: for purposes of removing any confusion, please note that actually testing on your production system is not a good idea

Redo

A couple of days ago I published a link to some comments I had made on OTN about differences in redo generation between 10g and earlier versions of Oracle. This raised a few questions that suggested a follow-up (or perhaps “prequel”) note might be a little helpful. So I’ve created a simple SQL script to help demonstrate the differences and show how some of the pieces hang together. The whole posting is a little lengthy, but a large fraction of it is just the continuous output from a couple of trace files from which I’ll be highlighting just a few lines.

drop table t1;

create table t1 (
	n1	number,
	n2	number,
	n3	number,
	n4	number
);

create index i1 on t1(n1);
create index i2 on t1(n2);

insert into t1 values (1,1,1,1);
commit;

alter system switch logfile;

insert into t1 values (2,2,2,2);
insert into t1 values (3,3,3,3);

commit;

execute dump_log

All I’ve done is create a simple table with two indexes. I insert a row and commit because the first insert of a row into a table results in all sorts of funny bits of recursive activity (mostly related space management) and I want to get that out of the way so that it doesn’t obscure the more typical activity. Then I “switch logfile”, do a little work, and use a little procedure I’ve written to dump the current log file. The log file switch is just a convenient way to keep the dump small – but there are strategies for being selective about the parts of the redo log file you dump.

The work I’ve done is just two inserts into the table – this means I expect to find something like the following bits of activity listed in the redo log file:

    start transaction
    modify block in table
    modify block in index 1
    modify block in index 2
    modify block in table
    modify block in index 1
    modify block in index 2
    end transaction

What I’m going to show you next is the redo log file dump from Oracle 9.2.0.8 and then the corresponding redo log file dump from 10.2.0.3.

In the 9.2.0.8 dump we’ll see the “traditional” sequence of redo generation, which follows very closely to the steps listed above and also shows that Oracle pairs a “redo change vector” for a table or index block with the matching “redo change vector” for an undo record that describes how to reverse the table (or index) change – an undo record, of course, is just a piece of information that gets stored in an undo block. In general, each pair of redo change vectors will be gathered into a single “redo record”.

In the 10.2.0.3 dump we’ll see that the same redo change vectors still exist but they’ve all been lumped into a single redo record, and their ordering in that record is completely different from the 9.2.0.8 ordering, demonstrating (or at least supporting) the point that the “private redo/in-memory undo” mechanism uses two separate buffers, one for redo change vectors related to table/index blocks (the private redo bit) and the other for redo change vectors related to undo records (the in-memory undo bit).

The key expressions to watch out for in the dump are: “REDO RECORD” and “CHANGE #” (I’ve inserted a few blank lines  into the trace file to give a little white space between records and vectors):

Redo log file dump from 9i

REDO RECORD - Thread:1 RBA: 0x000360.00000002.0010 LEN: 0x0170 VLD: 0x01
SCN: 0x0000.03ec99f5 SUBSCN:  1 12/30/2010 14:36:06

CHANGE #1 TYP:0 CLS:23 AFN:2 DBA:0x00800039 SCN:0x0000.03ec99bf SEQ:  1 OP:5.2
ktudh redo: slt: 0x0016 sqn: 0x00001a69 flg: 0x0012 siz: 80 fbi: 0
            uba: 0x00800040.075d.26    pxid:  0x0000.000.00000000

CHANGE #2 TYP:0 CLS:24 AFN:2 DBA:0x00800040 SCN:0x0000.03ec99be SEQ:  1 OP:5.1
ktudb redo: siz: 80 spc: 2708 flg: 0x0012 seq: 0x075d rec: 0x26
            xid:  0x0004.016.00001a69
ktubl redo: slt: 22 rci: 0 opc: 11.1 objn: 45648 objd: 45648 tsn: 12
Undo type:  Regular undo        Begin trans    Last buffer split:  No
Temp Object:  No
Tablespace Undo:  No
             0x00000000  prev ctl uba: 0x00800040.075d.25
prev ctl max cmt scn:  0x0000.03ebf084  prev tx cmt scn:  0x0000.03ebf106
KDO undo record:
KTB Redo
op: 0x03  ver: 0x01
op: Z
KDO Op code: DRP row dependencies Disabled
  xtype: XA  bdba: 0x02c0000a  hdba: 0x02c00009
itli: 2  ispac: 0  maxfr: 4863
tabn: 0 slot: 1(0x1)

CHANGE #3 TYP:2 CLS: 1 AFN:11 DBA:0x02c0000a SCN:0x0000.03ec99f2 SEQ:  1 OP:11.2
KTB Redo
op: 0x01  ver: 0x01
op: F  xid:  0x0004.016.00001a69    uba: 0x00800040.075d.26
KDO Op code: IRP row dependencies Disabled
  xtype: XA  bdba: 0x02c0000a  hdba: 0x02c00009
itli: 2  ispac: 0  maxfr: 4863
tabn: 0 slot: 1(0x1) size/delt: 15
fb: --H-FL-- lb: 0x2  cc: 4
null: ----
col  0: [ 2]  c1 03
col  1: [ 2]  c1 03
col  2: [ 2]  c1 03
col  3: [ 2]  c1 03

CHANGE #4 MEDIA RECOVERY MARKER SCN:0x0000.00000000 SEQ:  0 OP:5.20
session number   = 9
serial  number   = 6
transaction name =

--------------

REDO RECORD - Thread:1 RBA: 0x000360.00000002.0180 LEN: 0x0108 VLD: 0x01
SCN: 0x0000.03ec99f5 SUBSCN:  1 12/30/2010 14:36:06

CHANGE #1 TYP:0 CLS:24 AFN:2 DBA:0x00800040 SCN:0x0000.03ec99f5 SEQ:  1 OP:5.1
ktudb redo: siz: 96 spc: 2626 flg: 0x0022 seq: 0x075d rec: 0x27
            xid:  0x0004.016.00001a69
ktubu redo: slt: 22 rci: 38 opc: 10.22 objn: 45649 objd: 45649 tsn: 12
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
index undo for leaf key operations
KTB Redo
op: 0x04  ver: 0x01
op: L  itl: xid:  0x000a.012.00002552 uba: 0x0080009c.09cb.0c
                      flg: C---    lkc:  0     scn: 0x0000.03ec99f2
Dump kdilk : itl=2, kdxlkflg=0x1 sdc=96880916 indexid=0x2c00409 block=0x02c0040a
purge leaf row
key: (10):  02 c1 03 06 02 c0 00 0a 00 01

CHANGE #2 TYP:2 CLS: 1 AFN:11 DBA:0x02c0040a SCN:0x0000.03ec99f2 SEQ:  1 OP:10.2
index redo (kdxlin):  insert leaf row
KTB Redo
op: 0x11  ver: 0x01
op: F  xid:  0x0004.016.00001a69    uba: 0x00800040.075d.27
Block cleanout record, scn:  0x0000.03ec99f5 ver: 0x01 opt: 0x02, entries follow...
  itli: 2  flg: 2  scn: 0x0000.03ec99f2
REDO: SINGLE / -- / --
itl: 2, sno: 1, row size 14
insert key: (10):  02 c1 03 06 02 c0 00 0a 00 01

--------------

REDO RECORD - Thread:1 RBA: 0x000360.00000003.0098 LEN: 0x0108 VLD: 0x01
SCN: 0x0000.03ec99f5 SUBSCN:  1 12/30/2010 14:36:06

CHANGE #1 TYP:0 CLS:24 AFN:2 DBA:0x00800040 SCN:0x0000.03ec99f5 SEQ:  2 OP:5.1
ktudb redo: siz: 96 spc: 2528 flg: 0x0022 seq: 0x075d rec: 0x28
            xid:  0x0004.016.00001a69
ktubu redo: slt: 22 rci: 39 opc: 10.22 objn: 45650 objd: 45650 tsn: 12
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
index undo for leaf key operations
KTB Redo
op: 0x04  ver: 0x01
op: L  itl: xid:  0x000a.012.00002552 uba: 0x0080009c.09cb.0d
                      flg: C---    lkc:  0     scn: 0x0000.03ec99f2
Dump kdilk : itl=2, kdxlkflg=0x1 sdc=96880916 indexid=0x2c00489 block=0x02c0048a
purge leaf row
key: (10):  02 c1 03 06 02 c0 00 0a 00 01

CHANGE #2 TYP:2 CLS: 1 AFN:11 DBA:0x02c0048a SCN:0x0000.03ec99f2 SEQ:  1 OP:10.2
index redo (kdxlin):  insert leaf row
KTB Redo
op: 0x11  ver: 0x01
op: F  xid:  0x0004.016.00001a69    uba: 0x00800040.075d.28
Block cleanout record, scn:  0x0000.03ec99f5 ver: 0x01 opt: 0x02, entries follow...
  itli: 2  flg: 2  scn: 0x0000.03ec99f2
REDO: SINGLE / -- / --
itl: 2, sno: 1, row size 14
insert key: (10):  02 c1 03 06 02 c0 00 0a 00 01

--------------

REDO RECORD - Thread:1 RBA: 0x000360.00000003.01a0 LEN: 0x00f4 VLD: 0x01
SCN: 0x0000.03ec99f5 SUBSCN:  1 12/30/2010 14:36:06

CHANGE #1 TYP:0 CLS:24 AFN:2 DBA:0x00800040 SCN:0x0000.03ec99f5 SEQ:  3 OP:5.1
ktudb redo: siz: 64 spc: 2430 flg: 0x0022 seq: 0x075d rec: 0x29
            xid:  0x0004.016.00001a69
ktubu redo: slt: 22 rci: 40 opc: 11.1 objn: 45648 objd: 45648 tsn: 12
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
KDO undo record:
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x00800040.075d.26
KDO Op code: DRP row dependencies Disabled
  xtype: XA  bdba: 0x02c0000a  hdba: 0x02c00009
itli: 2  ispac: 0  maxfr: 4863
tabn: 0 slot: 2(0x2)

CHANGE #2 TYP:0 CLS: 1 AFN:11 DBA:0x02c0000a SCN:0x0000.03ec99f5 SEQ:  1 OP:11.2
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x00800040.075d.29
KDO Op code: IRP row dependencies Disabled
  xtype: XA  bdba: 0x02c0000a  hdba: 0x02c00009
itli: 2  ispac: 0  maxfr: 4863
tabn: 0 slot: 2(0x2) size/delt: 15
fb: --H-FL-- lb: 0x2  cc: 4
null: ----
col  0: [ 2]  c1 04
col  1: [ 2]  c1 04
col  2: [ 2]  c1 04
col  3: [ 2]  c1 04

--------------

REDO RECORD - Thread:1 RBA: 0x000360.00000004.00a4 LEN: 0x00c8 VLD: 0x01
SCN: 0x0000.03ec99f5 SUBSCN:  1 12/30/2010 14:36:06

CHANGE #1 TYP:0 CLS:24 AFN:2 DBA:0x00800040 SCN:0x0000.03ec99f5 SEQ:  4 OP:5.1
ktudb redo: siz: 80 spc: 2364 flg: 0x0022 seq: 0x075d rec: 0x2a
            xid:  0x0004.016.00001a69
ktubu redo: slt: 22 rci: 41 opc: 10.22 objn: 45649 objd: 45649 tsn: 12
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
index undo for leaf key operations
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x00800040.075d.27
Dump kdilk : itl=2, kdxlkflg=0x1 sdc=96880916 indexid=0x2c00409 block=0x02c0040a
purge leaf row
key: (10):  02 c1 04 06 02 c0 00 0a 00 02

CHANGE #2 TYP:0 CLS: 1 AFN:11 DBA:0x02c0040a SCN:0x0000.03ec99f5 SEQ:  1 OP:10.2
index redo (kdxlin):  insert leaf row
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x00800040.075d.2a
REDO: SINGLE / -- / --
itl: 2, sno: 2, row size 14
insert key: (10):  02 c1 04 06 02 c0 00 0a 00 02

--------------

REDO RECORD - Thread:1 RBA: 0x000360.00000004.016c LEN: 0x00c8 VLD: 0x01
SCN: 0x0000.03ec99f5 SUBSCN:  1 12/30/2010 14:36:06

CHANGE #1 TYP:0 CLS:24 AFN:2 DBA:0x00800040 SCN:0x0000.03ec99f5 SEQ:  5 OP:5.1
ktudb redo: siz: 80 spc: 2282 flg: 0x0022 seq: 0x075d rec: 0x2b
            xid:  0x0004.016.00001a69
ktubu redo: slt: 22 rci: 42 opc: 10.22 objn: 45650 objd: 45650 tsn: 12
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
index undo for leaf key operations
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x00800040.075d.28
Dump kdilk : itl=2, kdxlkflg=0x1 sdc=96880916 indexid=0x2c00489 block=0x02c0048a
purge leaf row
key: (10):  02 c1 04 06 02 c0 00 0a 00 02

CHANGE #2 TYP:0 CLS: 1 AFN:11 DBA:0x02c0048a SCN:0x0000.03ec99f5 SEQ:  1 OP:10.2
index redo (kdxlin):  insert leaf row
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x00800040.075d.2b
REDO: SINGLE / -- / --
itl: 2, sno: 2, row size 14
insert key: (10):  02 c1 04 06 02 c0 00 0a 00 02

--------------

REDO RECORD - Thread:1 RBA: 0x000360.00000005.0044 LEN: 0x0070 VLD: 0x01
SCN: 0x0000.03ec99f6 SUBSCN:  1 12/30/2010 14:36:06

CHANGE #1 TYP:0 CLS:23 AFN:2 DBA:0x00800039 SCN:0x0000.03ec99f5 SEQ:  1 OP:5.4
ktucm redo: slt: 0x0016 sqn: 0x00001a69 srt: 0 sta: 9 flg: 0x12
ktucf redo: uba: 0x00800040.075d.2b ext: 0 spc: 2200 fbi: 0

The dump from 9.2.0.8 consists of seven redo records – and most of those records consist of two redo change vectors. We’ll start by looking at the fourth redo record (starting at line 112) which is a typical “old-style” redo record. It consists of two redo change vectors.

In the chosen record, change #1 (line 115) describes how to change an undo block at address (DBA) 0×00800040 which is operating at sequence number (seq: ) 0x075d. The change is to insert a new undo record, with record number (rec: ) 0×29. If you look at lines 126 – 129 you see that this undo record is an instruction to Delete a Row Piece (DRP) from data block (bdba) 0x02c0000a; the row piece is identified as the pointed to by slot 2 of table 0 in the block’s row directory.

Change # 2 (line 131) describes the change needed to create that row piece (note the IRP – Insert Row Piece at line 135). Again we see references to the block, table number and slot number.

The next two redo records are very similar: they describe how to insert and delete a key entry for each of the two indexes that have to be maintained as the row is inserted into the table – again the redo change vector for the “undo” appears before the redo change vector for the “apply”.

The first three redo records form a similar group – the first one is about inserting/removing a table row, the second and third are about the insert/remove pairs for the two indexes – but the first record of the three is more complex, consisting of four redo change vectors rather than just two.

The first redo record happens to be the first record of the transaction – so it contains (for 9i and below) two extra redo change vectors. The standard pair for the table row appear as change #2 and change #3, but in change #1 we see the redo change vector describing a change to the undo segment header block (line 5: ktudh) as we acquire a “transaction table slot” (see my glossary, and “cleanout”) specifically slot 0×0016, which we stamp with sequence number 0x00001a69. The final change vector in this record (change #4 at line 41) is marked as a “Media Recovery Marker” – it’s actually a type of audit label – and the example here is the short form; there is a much longer version used for the first transaction of a session’s lifetime, which is also used after each log file switch that takes place in a session’s lifetime (and in 8i there was a bug that meant the full vector appeared for every transaction start – with could result in a nasty volume of excess redo).

The last redo record is the so-called “commit record”. It consists of just one change vector (line 209). This change vector describes the update to the undo segment header as we commit (ktucm) the transaction; the status (sta: ) of transaction table slot 0×0016 is set to 9 to show that the slot is now free, and the “free pool” list (ktucf) is updated to show that the last undo block we were using has some space available in it that could be used by the next transaction that acquires a slot in this undo segment.

So that’s the redo generated by 9i. The basic pattern is that a typical redo record consists of a pair of redo change vectors, the first describing an undo record and the second describing how to make the required data change.

An important scalability detail (or threat, rather) that goes with this implementation is that the session has to allocate memory in the redo log buffer and copy each redo record in turn so, for our table with two indexes, every row we insert requires us to acquire the redo allocation latch and redo copy latch three times.

Now we move on to see what shows up in the log file when we run our code on 10.2.0.3

Redo log file dump from 10g

REDO RECORD - Thread:1 RBA: 0x00034b.00000002.0010 LEN: 0x0638 VLD: 0x0d
SCN: 0x0000.00d60a85 SUBSCN:  1 12/30/2010 14:24:01

CHANGE #1 TYP:2 CLS: 1 AFN:6 DBA:0x01804d8a OBJ:70577 SCN:0x0000.00d60a81 SEQ:  5 OP:11.2
KTB Redo
op: 0x01  ver: 0x01
op: F  xid:  0x0008.025.00003632    uba: 0x008031d3.07f1.27
KDO Op code: IRP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x01804d8a  hdba: 0x01804d89
itli: 2  ispac: 0  maxfr: 4863
tabn: 0 slot: 1(0x1) size/delt: 15
fb: --H-FL-- lb: 0x2  cc: 4
null: ----
col  0: [ 2]  c1 03
col  1: [ 2]  c1 03
col  2: [ 2]  c1 03
col  3: [ 2]  c1 03

CHANGE #2 TYP:0 CLS:31 AFN:2 DBA:0x00800079 OBJ:4294967295 SCN:0x0000.00d60a5f SEQ:  1 OP:5.2
ktudh redo: slt: 0x0025 sqn: 0x00003632 flg: 0x0012 siz: 104 fbi: 0
            uba: 0x008031d3.07f1.27    pxid:  0x0000.000.00000000

CHANGE #3 TYP:2 CLS: 1 AFN:6 DBA:0x01804e0a OBJ:70578 SCN:0x0000.00d60a81 SEQ:  2 OP:10.2
index redo (kdxlin):  insert leaf row
KTB Redo
op: 0x11  ver: 0x01
op: F  xid:  0x0008.025.00003632    uba: 0x008031d3.07f1.28
Block cleanout record, scn:  0x0000.00d60a84 ver: 0x01 opt: 0x02, entries follow...
  itli: 2  flg: 2  scn: 0x0000.00d60a81
REDO: SINGLE / -- / --
itl: 2, sno: 1, row size 14
insert key: (10):  02 c1 03 06 01 80 4d 8a 00 01

CHANGE #4 TYP:2 CLS: 1 AFN:6 DBA:0x01804e8a OBJ:70579 SCN:0x0000.00d60a81 SEQ:  2 OP:10.2
index redo (kdxlin):  insert leaf row
KTB Redo
op: 0x11  ver: 0x01
op: F  xid:  0x0008.025.00003632    uba: 0x008031d3.07f1.29
Block cleanout record, scn:  0x0000.00d60a84 ver: 0x01 opt: 0x02, entries follow...
  itli: 2  flg: 2  scn: 0x0000.00d60a81
REDO: SINGLE / -- / --
itl: 2, sno: 1, row size 14
insert key: (10):  02 c1 03 06 01 80 4d 8a 00 01

CHANGE #5 TYP:0 CLS: 1 AFN:6 DBA:0x01804d8a OBJ:70577 SCN:0x0000.00d60a85 SEQ:  1 OP:11.2
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x008031d3.07f1.2a
KDO Op code: IRP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x01804d8a  hdba: 0x01804d89
itli: 2  ispac: 0  maxfr: 4863
tabn: 0 slot: 2(0x2) size/delt: 15
fb: --H-FL-- lb: 0x2  cc: 4
null: ----
col  0: [ 2]  c1 04
col  1: [ 2]  c1 04
col  2: [ 2]  c1 04
col  3: [ 2]  c1 04

CHANGE #6 TYP:0 CLS: 1 AFN:6 DBA:0x01804e0a OBJ:70578 SCN:0x0000.00d60a85 SEQ:  1 OP:10.2
index redo (kdxlin):  insert leaf row
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x008031d3.07f1.2b
REDO: SINGLE / -- / --
itl: 2, sno: 2, row size 14
insert key: (10):  02 c1 04 06 01 80 4d 8a 00 02

CHANGE #7 TYP:0 CLS: 1 AFN:6 DBA:0x01804e8a OBJ:70579 SCN:0x0000.00d60a85 SEQ:  1 OP:10.2
index redo (kdxlin):  insert leaf row
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x008031d3.07f1.2c
REDO: SINGLE / -- / --
itl: 2, sno: 2, row size 14
insert key: (10):  02 c1 04 06 01 80 4d 8a 00 02

CHANGE #8 TYP:0 CLS:31 AFN:2 DBA:0x00800079 OBJ:4294967295 SCN:0x0000.00d60a85 SEQ:  1 OP:5.4
ktucm redo: slt: 0x0025 sqn: 0x00003632 srt: 0 sta: 9 flg: 0x2
ktucf redo: uba: 0x008031d3.07f1.2c ext: 38 spc: 1718 fbi: 0

CHANGE #9 TYP:0 CLS:32 AFN:2 DBA:0x008031d3 OBJ:4294967295 SCN:0x0000.00d60a5d SEQ:  1 OP:5.1
ktudb redo: siz: 104 spc: 2250 flg: 0x0012 seq: 0x07f1 rec: 0x27
            xid:  0x0008.025.00003632
ktubl redo: slt: 37 rci: 0 opc: 11.1 objn: 70577 objd: 70577 tsn: 5
Undo type:  Regular undo        Begin trans    Last buffer split:  No
Temp Object:  No
Tablespace Undo:  No
             0x00000000  prev ctl uba: 0x008031d3.07f1.26
prev ctl max cmt scn:  0x0000.00d5fa91  prev tx cmt scn:  0x0000.00d5fb64
txn start scn:  0x0000.00d60a81  logon user: 52  prev brb: 8401354  prev bcl: 0 KDO undo record:
KTB Redo
op: 0x03  ver: 0x01
op: Z
KDO Op code: DRP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x01804d8a  hdba: 0x01804d89
itli: 2  ispac: 0  maxfr: 4863
tabn: 0 slot: 1(0x1)

CHANGE #10 TYP:0 CLS:32 AFN:2 DBA:0x008031d3 OBJ:4294967295 SCN:0x0000.00d60a85 SEQ:  1 OP:5.1
ktudb redo: siz: 96 spc: 2144 flg: 0x0022 seq: 0x07f1 rec: 0x28
            xid:  0x0008.025.00003632
ktubu redo: slt: 37 rci: 39 opc: 10.22 objn: 70578 objd: 70578 tsn: 5
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
index undo for leaf key operations
KTB Redo
op: 0x04  ver: 0x01
op: L  itl: xid:  0x0005.01e.000035e3 uba: 0x008034a5.083a.1d
                      flg: C---    lkc:  0     scn: 0x0000.00d60a81
Dump kdilk : itl=2, kdxlkflg=0x1 sdc=0 indexid=0x1804e09 block=0x01804e0a
(kdxlpu): purge leaf row
key: (10):  02 c1 03 06 01 80 4d 8a 00 01

CHANGE #11 TYP:0 CLS:32 AFN:2 DBA:0x008031d3 OBJ:4294967295 SCN:0x0000.00d60a85 SEQ:  2 OP:5.1
ktudb redo: siz: 96 spc: 2046 flg: 0x0022 seq: 0x07f1 rec: 0x29
            xid:  0x0008.025.00003632
ktubu redo: slt: 37 rci: 40 opc: 10.22 objn: 70579 objd: 70579 tsn: 5
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
index undo for leaf key operations
KTB Redo
op: 0x04  ver: 0x01
op: L  itl: xid:  0x0005.01e.000035e3 uba: 0x008034a5.083a.1e
                      flg: C---    lkc:  0     scn: 0x0000.00d60a81
Dump kdilk : itl=2, kdxlkflg=0x1 sdc=0 indexid=0x1804e89 block=0x01804e8a
(kdxlpu): purge leaf row
key: (10):  02 c1 03 06 01 80 4d 8a 00 01

CHANGE #12 TYP:0 CLS:32 AFN:2 DBA:0x008031d3 OBJ:4294967295 SCN:0x0000.00d60a85 SEQ:  3 OP:5.1
ktudb redo: siz: 64 spc: 1948 flg: 0x0022 seq: 0x07f1 rec: 0x2a
            xid:  0x0008.025.00003632
ktubu redo: slt: 37 rci: 41 opc: 11.1 objn: 70577 objd: 70577 tsn: 5
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
KDO undo record:
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x008031d3.07f1.27
KDO Op code: DRP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x01804d8a  hdba: 0x01804d89
itli: 2  ispac: 0  maxfr: 4863
tabn: 0 slot: 2(0x2)

CHANGE #13 TYP:0 CLS:32 AFN:2 DBA:0x008031d3 OBJ:4294967295 SCN:0x0000.00d60a85 SEQ:  4 OP:5.1
ktudb redo: siz: 80 spc: 1882 flg: 0x0022 seq: 0x07f1 rec: 0x2b
            xid:  0x0008.025.00003632
ktubu redo: slt: 37 rci: 42 opc: 10.22 objn: 70578 objd: 70578 tsn: 5
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
index undo for leaf key operations
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x008031d3.07f1.28
Dump kdilk : itl=2, kdxlkflg=0x1 sdc=0 indexid=0x1804e09 block=0x01804e0a
(kdxlpu): purge leaf row
key: (10):  02 c1 04 06 01 80 4d 8a 00 02

CHANGE #14 TYP:0 CLS:32 AFN:2 DBA:0x008031d3 OBJ:4294967295 SCN:0x0000.00d60a85 SEQ:  5 OP:5.1
ktudb redo: siz: 80 spc: 1800 flg: 0x0022 seq: 0x07f1 rec: 0x2c
            xid:  0x0008.025.00003632
ktubu redo: slt: 37 rci: 43 opc: 10.22 objn: 70579 objd: 70579 tsn: 5
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
index undo for leaf key operations
KTB Redo
op: 0x02  ver: 0x01
op: C  uba: 0x008031d3.07f1.29
Dump kdilk : itl=2, kdxlkflg=0x1 sdc=0 indexid=0x1804e89 block=0x01804e8a
(kdxlpu): purge leaf row
key: (10):  02 c1 04 06 01 80 4d 8a 00 02

The entire transaction is wrapped up into a single redo record with fourteen redo change vectors. (The 9i dump showed a total of fifteen redo change vectors – we’ve lost the “audit” vector in our 10g setup.) In order, the vectors describe how to…

    insert first row into table
    acquire transaction table slot (start transaction – slightly odd order)
    insert first row into first index
    insert first row into second index
    insert second row into table
    insert second row into first index
    insert second row into second index
    release transaction table slot (commit transaction)
    create undo record showing how to remove first row from table
    create undo record showing how to remove first row from first index
    create undo record showing how to remove first row from second index
    create undo record showing how to remove second row from table
    create unto record showing how to remove second row from first index
    create undo record showing how to remove second row from second index

It is reasonable to infer from this that Oracle is generating a list of “forward” changes (i.e. the private redo) in one buffer and a list of “reverse” changes (i.e. the in-memory undo) in another buffer – and simply concatenates the two lists on the commit.

On the plus side, if we’ve only got one redo record we only have to allocate and copy into the (public) redo buffer once – which reduces contention on a single important hot spot in the Oracle database; on the other hand we have to keep a number of data blocks pinned in memory until the commit, and we have to modify all those data blocks at the same time as we copy the two private buffers into the (public) redo log buffer, and this requirement introduces a different concurrency threat.

Because of this alternative threat there is a fairly tight limit on how much change a single transaction is allowed to make using this new mechanism . For 32-bit Oracle the buffers are about 64Kb each, and for 64-bit Oracle they are about 128KB – and if a session fills either of the two buffers it has acquired then Oracle flushes the contents to the public log buffer and the session then carries on using the public log buffer in the “traditional” way (while still keeping “ownership” of the pair of private buffers that it had allocated.)

A final thought – if you issue a rollback instead of a commit, the “commit record” (change #8 in the above) disappears from the redo record, and the redo stream gets a further six redo records showing how each of the previous undo records have been applied at the user’s request; these records are followed by one last redo record which describes the release of the transaction table slot i.e. a “commit record”.

Quiz Night

Many people in the UK like to greet the new year with some sort of party, and in this household the type of party we like includes old-fashioned party games (can you draw “Brief Encounters” in 10 seconds ?); so, for all those members of the Oracle community who haven’t prepared their party games this year, here’s a little quiz:

Which processes could report waits for “log file sequential read”, and why ?

    1.

    10.

Warning: the number of answers I have may, or may not, be the 10 suggested by the list.

Update Jan 5th: There were a couple of answers I wasn’t expecting (and would need to check), but some things are fairly easy to check. Here, for example, is a simple script that will identify sessions that regularly wait for ‘log file sequential read’.

select
	ses.program
from
	v$session_event sev,
	V$session ses
where
	sev.event = 'log file sequential read'
and	ses.sid = sev.sid
;

Answers 7th Jan:

The obvious one, of course, is ARCH (or ARCn as they tend to be) the archiver process(es). Archiving requires something to read the online redo and copy them to some other location. Here’s a brief extract from tracing an archive process – showing reads of 1MB (2048 x 512 bytes per block) -just after a log file switch:

WAIT #0: nam='log file sequential read' ela= 38410 log#=0 block#=1 blocks=2048 obj#=-1 tim=101515708285
WAIT #0: nam='Log archive I/O' ela= 183580 count=1 intr=256 timeout=-1 obj#=-1 tim=101515892589
WAIT #0: nam='log file sequential read' ela= 84409 log#=0 block#=2049 blocks=2048 obj#=-1 tim=101515977375
WAIT #0: nam='Log archive I/O' ela= 116641 count=1 intr=256 timeout=-1 obj#=-1 tim=101516094684
WAIT #0: nam='log file sequential read' ela= 31139 log#=0 block#=4097 blocks=2048 obj#=-1 tim=101516126157
WAIT #0: nam='Log archive I/O' ela= 122492 count=1 intr=256 timeout=-1 obj#=-1 tim=101516250399
WAIT #0: nam='log file sequential read' ela= 32197 log#=0 block#=6145 blocks=2048 obj#=-1 tim=101516282965
WAIT #0: nam='Log archive I/O' ela= 122779 count=1 intr=256 timeout=-1 obj#=-1 tim=101516552698
WAIT #0: nam='Log archive I/O' ela= 4 count=0 intr=32 timeout=2147483647 obj#=-1 tim=101516552996
WAIT #0: nam='Log archive I/O' ela= 25 count=1 intr=32 timeout=2147483647 obj#=-1 tim=101516553297

Slightly less obvious is LGWR, the log writer itself. Apart from reading the log files at (normal) startup to check that they’re there, LGWR reads the start of the current and next log files on every log switch. Here’s the trace file of LGWR on a log file switch where there are two files in each group:

WAIT #0: nam='log file sequential read' ela= 32788 log#=0 block#=1 blocks=1 obj#=-1 tim=100016453547
WAIT #0: nam='log file sequential read' ela= 18309 log#=1 block#=1 blocks=1 obj#=-1 tim=100016471921
WAIT #0: nam='log file single write' ela= 518 log#=0 block#=1 blocks=1 obj#=-1 tim=100016472654
WAIT #0: nam='log file single write' ela= 516 log#=1 block#=1 blocks=1 obj#=-1 tim=100016473225
Media recovery not enabled or manual archival only 0x10000
WAIT #0: nam='log file sequential read' ela= 90 log#=0 block#=1 blocks=1 obj#=-1 tim=100016473376
WAIT #0: nam='log file sequential read' ela= 73 log#=1 block#=1 blocks=1 obj#=-1 tim=100016473474
WAIT #0: nam='log file single write' ela= 477 log#=0 block#=1 blocks=1 obj#=-1 tim=100016473981
WAIT #0: nam='log file single write' ela= 856 log#=1 block#=1 blocks=1 obj#=-1 tim=100016474885

By playing around with files per group we can work out that the order of activity is:

    read the current
    write the next
    read the current
    write the current

Next up: any process which is involved in (asynchronous) Change Data Capture (CDC) or Streams will be reading the log files, as will any process running a Log Miner session.

A couple of people suggested the SMON and PX slaves could read the log file on database startup – but as another poster pointed out, it’s the startup process that reads the log files – and if parallel recovery is invoked it passes change vectors to the PX slaves, the slaves don’t read the log files directly. Apart from that, I don’t really know which processes read the log file – I don’t want to say “process X doesn’t” read the log file, because it may do at times that I haven’t considered. (And there are lots of processes to look into in 11g.)

However, there is one (generic) process that can read the log files – and that’s ANY process that has to handle data blocks. And that’s really why I wrote this note in the first place. If a process finds at any time that the buffer it’s looking at in memory is corrupt it may do real-time block recovery if it thinks the corruption has occurred between the moment the block was read from disk and the present. There are various flags and details in the buffer header that tell it that the block is dirty (differs from what’s on disk) and the redo block address (RBA) where the earliest known change to the in-memory block can be found in the log files.

So any process may, in conditions that shouldn’t happen often (we hope), identify an in-memory corrupt, read a block from disc, then go backwards in time in the redo logs and start walking the redo logs to find the redo needed to bring the block up to date. So if you see excessive amounts of “log file sequential read” coming from your foreground session, check what’s going on – it might be an indication of a hardware fault.

Thanks to anyone who participated – and if anyone runs the little script and finds some new examples of log file sequential reads, please add them to the list.

Update Jan 2011:
Thanks to Pavol Babel in comment 8 and his reply to comment 7 for adding MRP0 and RFS to the list of processes that are expected to read the log file.

System Statistics

I wrote an article about system statistics / CPU Costing for Oracle magazine a few years ago – and last week I realised that I’ve never supplied a link to it in the notes and comments I’ve made about system statistics. So I’ve just run a search through the Oracle website trying to find it – and discovered that it’s no longer available. Apparently the editors have decided that any technical articles over a certain age should be withdrawn in case they are out of date and misleading. (Clearly they’ve read my blog on trust – I wish the people maintaining Metalink would do the same as the magazine editors – but they probably have a much larger volume to worry about).

However, I have discovered translations of the article in Russian, Korean and Chinese – so if you can read any of these languages, you might want to take a look at them before they disappear too.

If you want an original English version – dated April 2004, which is when I sent it in to Oracle Magazine, and before it underwent some editing – I’ve posted it as a pdf file.

[More on System Statistics]

Local Indexes

In a recent article about list partitioned tables I raised some questions about the cases where the optimizer can’t (yet) do partitioning pruning even when the opportunity is clearly visible to the human eye. The most important example was the case where each partition was defined to hold rows for just one partition key value – but the optimizer could not prune out the redundant partition for a query like: “partition_key != {constant}”.

I recently came across a situation where this really made a big difference. The system had a huge table that had been list partitioned as follows (with some camouflage):


partition by list (status) (
	partition p_state01 values  ('STATE01'),
	partition p_state02 values  ('STATE02'),
	    ....
	partition p_state25 values  ('STATE25'),
	partition p_handled values  ('Completed')
)

The table was defined to allow row movement, and every day there would be tens of thousands of rows moving through various states until they reached the “Completed” state.

There are various pros and cons to this setup. The most significant con is that when you update the status of a row Oracle actually has to update the row “in situ”, then delete it from what is now the wrong partition and insert it into the right partition. The most dramatic pro is that if the rows you’re interested in are (almost always) the ones that haven’t got to the “Completed” you’ve put all the boring old garbage out of the way where it doesn’t cause any problems. (In fact, if you’re running 11.2 you might choose to declare some of the “Completed” partitions of any local indexes as unusable and save yourself a lot of space – and by the time I’ve finished this article you might think this is a truly wonderful idea.) In the case of the client, there were about 200 million rows in the completed partition, and barely 2 million spread over the other partitions.

There was a bit of a problem, though. Some of the indexes on this table had been created as local indexes (arguably they should all have been local)and this resulted in some odd optimisation side effects. Here’s a little bit of code to build a table that demonstrates an interesting issue:

create table t1 (
	id,
	n1,
	small_vc,
	padding
)
partition by list (n1) (
	partition p0 values(0),
	partition p1 values(1),
	partition p2 values(2),
	partition p3 values(3),
	partition p4 values(4),
	partition p5 values(5)
)
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 10000
)
select
	rownum			id,
	trunc(log(10,rownum))	n1,
	lpad(rownum,10,'0')	small_vc,
	rpad('x',100)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 999999
;

create index t1_n1 on t1(n1, small_vc) local nologging;
create index t1_id on t1(id, small_vc) local nologging;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		estimate_percent => 100,
		method_opt 	 => 'for all columns size 1'
	);
end;
/

break on index_name skip 1

select
	index_name, partition_name, blevel, num_rows, leaf_blocks
from
	user_ind_partitions    -- but see comment #1 below from Tony Sleight
order by
	index_name, partition_name
;

Thanks to the log definition of column n1, you will see a very skewed distribution of data across the partitions, and the output from the query against the index partitions shows this quite dramatically. Since the code sample uses a 100% sample on the stats, you should get the following figures for the indexes (with a little variation in leaf blocks, perhaps, depending on your version and tablespace definitions. I was using 11.1.0.6 with 8KB blocks, locally managed tablespaces, freelists, and 1MB uniform extents.)

INDEX_NAME           PARTITION_NAME           BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------------- -------------------- ---------- ----------- ----------
T1_ID                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          33       9000
                     P4                            1         339      90000
                     P5                            2        3384     900000

T1_N1                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          32       9000
                     P4                            1         314      90000
                     P5                            2        3136     900000

So here’s important question number 1: What do you think the blevel will be at the global level for the two indexes ?

Important question number 2: If you have query with a where clause like “id between 100 and 400 and n1 != 5″ – which is designed very precisely to exclude the last partition – what value of blevel is Oracle going to use when considering the cost of using the index t1_id to access the data ?

My answers are in this follow-up post.

Not NULL

Here’s a little detail that I’ve known for years – but keep forgetting until I embarrass myself by rediscovering it (usually in front of a client). I’ll start with a demonstration of a useful little feature of mandatory columns:


drop table t1 purge;

create table t1
as
select
	*
from
	all_objects
where
	rownum <= 10000
;

execute dbms_stats.gather_table_stats(user,'t1')

create index t1_i1 on t1(object_name);

set autotrace traceonly explain

select count(*) from t1;

/*

--------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Cost (%CPU)| Time
--------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    13   (0)| 00:00:0
|   1 |  SORT AGGREGATE       |       |     1 |            |
|   2 |   INDEX FAST FULL SCAN| T1_I1 | 10000 |    13   (0)| 00:00:0
--------------------------------------------------------------------

*/

Oracle can use the index on column object_name to count the number of rows in the table because the column has been declared NOT NULL, so every row in the table also has to appear in the index. Let’s just demonstrate that by changing the column definition:


alter table t1 modify object_name null;
select count(*) from t1;

/*

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    40   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 10000 |    40   (0)| 00:00:01 |
-------------------------------------------------------------------

*/

Now let’s make the column mandatory again – by adding a constraint:

Last Modified

Have you ever wondered if there’s a way to see when the data in a table was last modified ? Depending on what you’re really trying to achieve, how accurate you want the answer to be, what resources you’re prepared to use, and how you’ve configured your database, you may be able to get an answer that’s good enough very easily.

If all you want is a rough indication that the table hasn’t changed over the last few days, or weeks, or even months, you may be able to run a simple, but potentially brutal, query against the table to find out. Here’s an example that get’s there in three steps – using the table sys.link$ as a target. Steps 1 and 2 are for explanatory purposes only.


select
	ora_rowscn
from
	link$
;

select
	first_change#, first_time
from
	v$log_history
order by
	first_change#;

select
	min(first_time)
from
	v$log_history
where
	first_change# >= (
		select max(ora_rowscn) from link$
	)
;

The pseudo-column ora_rowscn is an upper bound for the SCN at which the last change to a row committed. If you have enabled rowdependencies this will generally be the actual commit SCN for the row, otherwise Oracle has various algorithms for working out the largest SCN avaiable that is not less than the SCN at which the transaction committed. If all you’re interested in is an indication of when the table last changed even this is upper bound SCN may be good enough.

Step 2 is there to remind you that v$log_history records the SCN and timestamp for the first change in the file. This gives us a very crude conversion between SCNs and timestamps.

The final step gives us the answer we want. We find the approximate SCN of the most recent change to the table, and then report the timestamp of the first log file that started at a higher SCN.