Oakies Blog Aggregator

Quick lists when generating test data using plsql

One of the outstanding issues I have had with the testdata_ninja library was the ability to do quick lists:
a short list of possible values, such as a status column or open/closed kind of values.

18c new Lost Write Protection

There are many layers between the Oracle Database pwrite() calls and the physical sector written on disk: filesystem, logical volume, SAN or NAS, with a lot of smart software running for virtualisation, compression, snapshotting, synchronisation… Are you sure that the changes you made to your data are actually persisted on disk, completely and without any corruption? In case of a bug or crash in the storage layer, it may happen that only part of the changes were written. In the case of a crash, Oracle ensures that the datafile headers are written at the end, so that recovery can kick in after the crash. Then, a partially written block can be detected and restored. With different checksum settings, you can also check block integrity while writing or reading. But that protects only against fractured blocks. What if a block write just did not occur? An old version of the block remains, and is then perfectly correct for checksum, RMAN, and DBV.
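
As a reminder, those checksum settings are controlled by two instance parameters (a sketch of the documented options; both trade protection against CPU overhead):

SQL> alter system set db_block_checksum = typical;  -- checksum computed on write, verified on read (the default)
SQL> alter system set db_block_checking = medium;   -- deeper logical validation of block contents on change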

You may be 100% sure that you have never experienced lost writes. But then I’ll ask you: how do you know? You don’t. Except if you enable Lost Write Protection.

In 11g, Oracle introduced this feature for Data Guard configurations. Data Guard is the best protection because the synchronization is done at the highest level: the change vector, generated before any I/O or block modification occurs. Do not use SAN synchronization for your database: Data Guard is less expensive (no option needed), more efficient (only the persistent change information is shipped), and protects across all layers. It protects from lost writes because blocks are written on both sites by different servers, instances and storage. And Data Guard can detect lost writes by shipping the block SCN for each read, to compare it with the standby.
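
For reference, this 11g detection is enabled with the DB_LOST_WRITE_PROTECT parameter on the primary and the standby (a sketch; TYPICAL covers read-write tablespaces, FULL adds read-only ones):

SQL> alter system set db_lost_write_protect = typical;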

However, this has an overhead: redo generation for reads. Oracle 18c comes with a new solution that needs no standby database: a new LOST WRITE PROTECTION tablespace is created to store the SCN of each block modified.

Do you have lost writes?

First I’ll show what happens without this feature. I create a table filled with a number, a timestamp and the SCN:

SQL> create table DEMO.DEMO pctfree 99 as select rownum id,1 n, current_timestamp ts , (select current_scn from v$database) scn from xmltable('1 to 10');
Table DEMO.DEMO created.
 
SQL> select owner,segment_name,segment_type,block_id,blocks,sum(blocks)over(partition by owner,segment_name,segment_type) from dba_extents where owner='DEMO' and segment_name='DEMO' order by 1,2,3,4;
 
OWNER SEGMENT_NAME SEGMENT_TYPE BLOCK_ID BLOCKS SUM(BLOCKS)OVER(PARTITIONBYOWNER,SEGMENT_NAME,SEGMENT_TYPE)
----- ------------ ------------ -------- ------ -----------------------------------------------------------
DEMO DEMO TABLE 3128 8 8
 
SQL> column block_id new_value last_block_id
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
 
BLOCK_ID ID N TS SCN
-------- -- - -- ---
3131 1 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3131 2 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3132 3 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3132 4 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3133 5 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3133 6 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3134 7 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3134 8 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3135 9 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3135 10 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
 
SQL> column block_id clear

I’ve recorded the last BLOCK_ID in &last_block_id and ensured that all those modifications are written to disk:

SQL> alter system checkpoint;
System CHECKPOINT altered.
 
SQL> alter system flush buffer_cache;
System FLUSH altered.

I save the block 3133 to keep an old version of it. This will be my way to simulate a lost write.

SQL> host dd if=/u01/oradata/CDB1/PDB1/users01.dbf of=/var/tmp/lwp.blk skip=$(( &last_block_id - 2 )) bs=8k count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000263416 s, 31.1 MB/s
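
A quick sanity check on the dd arithmetic (a sketch: with an 8KB block size, the first 8KB of the file belong to the datafile header, so Oracle block n starts at byte offset n × 8192 and skipping &last_block_id - 2 blocks lands exactly on block 3133):

SQL> select (&last_block_id - 2) saved_block, (&last_block_id - 2) * 8192 byte_offset from dual;
 
SAVED_BLOCK BYTE_OFFSET
----------- -----------
       3133    25665536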

Now, I update all rows, setting a new timestamp and SCN, and changing the number from 1 to 3 (n=2+1).

SQL> update DEMO.DEMO set n=2+1, ts=current_timestamp, scn=(select current_scn from v$database);
10 rows updated.
SQL> commit;
Commit complete.
 
SQL> alter system checkpoint;
System CHECKPOINT altered.
SQL> alter system flush buffer_cache;
System FLUSH altered.

Here is how I reproduce a lost write. All changed blocks were written to disk, but I restore the old version of block 3133, as if that write had not happened:

SQL> host dd if=/var/tmp/lwp.blk of=/u01/oradata/CDB1/PDB1/users01.dbf seek=$(( &last_block_id - 2 )) bs=8k count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000111582 s, 73.4 MB/s

This is what you can see if one of your storage layers missed a write and nevertheless acknowledged the I/O call.
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
BLOCK_ID ID N TS SCN
-------- -- - --------------------------------------------- -------
3131 1 3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH 4619806
3131 2 3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH 4619806
3132 3 3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH 4619806
3132 4 3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH 4619806
3133 5 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3133 6 1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH 4619734
3134 7 3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH 4619806
3134 8 3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH 4619806
3135 9 3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH 4619806
3135 10 3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH 4619806

No errors. No corruption. Just old values for the rows in this block. And there’s no way to detect it: physical reads, RMAN and DBV will all see a perfectly correct block. Only if those changes are inconsistent with other tables (through constraints) or indexes might you detect a logical corruption.
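
For instance, this is the kind of check that comes back clean here, because the stale block is internally consistent (a sketch of the command, not run in this demo):

SQL> host dbv file=/u01/oradata/CDB1/PDB1/users01.dbf blocksize=8192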

The probability that this problem occurs is very low (a storage bug affecting exactly one 8k block, or a multiple of it). But it is critical because it cannot be detected. To detect it, you need to compare the full block, or a checksum, or simply the SCN, with another copy, such as the one in a standby database. Or, with this new feature, store the SCN of each data block in a new structure: the 18c LOST WRITE PROTECTION tablespace.

Enabling Lost Write Protection

You need to create a tablespace to store those SCNs. There’s no choice here: it must be a bigfile tablespace, but you can create multiple small ones if you want.

SQL> create bigfile tablespace SHADOW
2 datafile '/u01/oradata/CDB1/PDB1/shadow.dbf'
3 size 5M
4 lost write protection;
 
Tablespace SHADOW created.
 
SQL> select tablespace_name,status,bigfile,contents,logging,allocation_type,encrypted,lost_write_protect,chunk_tablespace from dba_tablespaces;
TABLESPACE_NAME STATUS BIGFILE CONTENTS LOGGING ALLOCATION_TYPE ENCRYPTED LOST_WRITE_PROTECT CHUNK_TABLESPACE
--------------- ------ ------- -------- ------- --------------- --------- ------------------ ----------------
SYSTEM ONLINE NO PERMANENT LOGGING SYSTEM NO OFF N
SYSAUX ONLINE NO PERMANENT LOGGING SYSTEM NO OFF N
UNDOTBS1 ONLINE NO UNDO LOGGING SYSTEM NO OFF N
TEMP ONLINE NO TEMPORARY NOLOGGING UNIFORM NO OFF N
SHADOW ONLINE YES LOST WRITE PROTECTION LOGGING SYSTEM NO OFF N
USERS ONLINE NO PERMANENT LOGGING SYSTEM NO OFF N
 
SQL> select * from dba_data_files;
 
FILE_NAME FILE_ID TABLESPACE_NAME BYTES BLOCKS STATUS RELATIVE_FNO AUTOEXTENSIBLE MAXBYTES MAXBLOCKS INCREMENT_BY USER_BYTES USER_BLOCKS ONLINE_STATUS LOST_WRITE_PROTECT
--------- ------- --------------- ----- ------ ------ ------------ -------------- -------- --------- ------------ ---------- ----------- ------------- ------------------
/u01/oradata/CDB1/PDB1/undotbs01.dbf 164 UNDOTBS1 104857600 12800 AVAILABLE 9 YES 34359721984 4194302 640 103809024 12672 ONLINE OFF
/u01/oradata/CDB1/PDB1/sysaux01.dbf 163 SYSAUX 408944640 49920 AVAILABLE 4 YES 34359721984 4194302 1280 407896064 49792 ONLINE OFF
/u01/oradata/CDB1/PDB1/system01.dbf 162 SYSTEM 272629760 33280 AVAILABLE 1 YES 34359721984 4194302 1280 271581184 33152 SYSTEM OFF
/u01/oradata/CDB1/PDB1/users01.dbf 169 USERS 104857600 12800 AVAILABLE 169 NO 0 0 0 103809024 12672 ONLINE OFF
/u01/oradata/CDB1/PDB1/shadow.dbf 58 SHADOW 5242880 640 AVAILABLE 1024 NO 0 0 0 4194304 512 ONLINE OFF

Then you enable this feature for the database (or pluggable database) and for the tablespaces you want to protect:

SQL> alter pluggable database enable lost write protection;
Pluggable database ENABLE altered.
 
SQL> alter tablespace USERS enable lost write protection;
Tablespace USERS altered.

Here are the new columns in DBA_TABLESPACES and DBA_DATA_FILES:

SQL> select tablespace_name,status,bigfile,contents,logging,allocation_type,encrypted,lost_write_protect,chunk_tablespace from dba_tablespaces;
 
TABLESPACE_NAME STATUS BIGFILE CONTENTS LOGGING ALLOCATION_TYPE ENCRYPTED LOST_WRITE_PROTECT CHUNK_TABLESPACE
--------------- ------ ------- -------- ------- --------------- --------- ------------------ ----------------
SYSTEM ONLINE NO PERMANENT LOGGING SYSTEM NO OFF N
SYSAUX ONLINE NO PERMANENT LOGGING SYSTEM NO OFF N
UNDOTBS1 ONLINE NO UNDO LOGGING SYSTEM NO OFF N
TEMP ONLINE NO TEMPORARY NOLOGGING UNIFORM NO OFF N
SHADOW ONLINE YES LOST WRITE PROTECTION LOGGING SYSTEM NO OFF N
USERS ONLINE NO PERMANENT LOGGING SYSTEM NO ENABLED N
 
SQL> select * from dba_data_files;
 
FILE_NAME FILE_ID TABLESPACE_NAME BYTES BLOCKS STATUS RELATIVE_FNO AUTOEXTENSIBLE MAXBYTES MAXBLOCKS INCREMENT_BY USER_BYTES USER_BLOCKS ONLINE_STATUS LOST_WRITE_PROTECT
--------- ------- --------------- ----- ------ ------ ------------ -------------- -------- --------- ------------ ---------- ----------- ------------- ------------------
/u01/oradata/CDB1/PDB1/undotbs01.dbf 164 UNDOTBS1 104857600 12800 AVAILABLE 9 YES 34359721984 4194302 640 103809024 12672 ONLINE OFF
/u01/oradata/CDB1/PDB1/sysaux01.dbf 163 SYSAUX 408944640 49920 AVAILABLE 4 YES 34359721984 4194302 1280 407896064 49792 ONLINE OFF
/u01/oradata/CDB1/PDB1/system01.dbf 162 SYSTEM 272629760 33280 AVAILABLE 1 YES 34359721984 4194302 1280 271581184 33152 SYSTEM OFF
/u01/oradata/CDB1/PDB1/users01.dbf 169 USERS 104857600 12800 AVAILABLE 169 NO 0 0 0 103809024 12672 ONLINE ENABLED
/u01/oradata/CDB1/PDB1/shadow.dbf 58 SHADOW 5242880 640 AVAILABLE 1024 NO 0 0 0 4194304 512 ONLINE OFF

Note that we are on the Oracle Cloud here, where all tablespaces must be encrypted, and this also applies to the LOST WRITE PROTECTION one. You can see ENCRYPTED at NO here only because I decrypted the tablespaces to look at what is inside the files.

Here are some internal tables giving information about the storage. Note that tablespace TSID=7 here is USERS, the protected one, and TSID=6 is the LOST WRITE PROTECTION one.

SQL> select * from new_lost_write_datafiles$;
 
DATAFILE_TSID_TRACKED DATAFILE_RFN_TRACKED SHADOW_DATAFILE_TSID SHADOW_DATAFILE_RFN SHADOW_DATAFILE_OFFSET NUMBER_OF_BLOCKS_ALLOCATED DATAFILE_CURRENT_STATUS
--------------------- -------------------- -------------------- ------------------- ---------------------- -------------------------- -----------------------
7 169 6 1024 128 184 enabled
 
SQL> select * from new_lost_write_extents$;
 
EXTENT_DATAFILE_RFN EXTENT_DATAFILE_TSID EXTENT_START EXTENT_LENGTH_BLOCKS_2K EXTENT_NEXT_BLOCK
------------------- -------------------- ------------ ----------------------- -----------------
1024 6 312 1312 641
 
SQL> select * from new_lost_write_shadows$;
 
SHADOW_DATAFILE_RFN SHADOW_DATAFILE_TSID SHADOW_NUMBER_BLOCKS_ALLOC SHADOW_FIRST_FREE_BLOCK SHADOW_BLOCK_SIZE_BYTES SHADOW_RECS_PER_BLOCK
------------------- -------------------- -------------------------- ----------------------- ----------------------- ---------------------
1024 6 640 128 8192 136

Because there was already a ‘lost write’ feature, this one is called ‘new lost write’.

NEW_LOST_WRITE_DATAFILES$ lists all protected (aka tracked) datafiles along with the LOST WRITE PROTECTION (aka shadow) tablespace protecting them. The status can be ‘enabled’ or ‘suspended’. The row is deleted if the protection is removed.

NEW_LOST_WRITE_SHADOWS$ lists the LOST WRITE PROTECTION tablespaces. Each contains a 1MB bitmap in the first 128 blocks, and extents start after that.

NEW_LOST_WRITE_EXTENTS$ maps the free extents in the shadow tablespace. Here, the lost write protection for my USERS tablespace (100MB) takes 312 – 128 = 184 blocks (1.4 MB). The extent is 4MB. The 1312 in EXTENT_LENGTH_BLOCKS_2K is the remaining free space in the extent, in 2KB blocks: 4MB – 1.4MB = 2.6MB, which is 1312 x 2k blocks.
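
If you want to confirm the mapping of those TSIDs to tablespace names, something like this should do it (a sketch, assuming the TSIDs above match TS# in V$TABLESPACE):

SQL> select ts#, name from v$tablespace where ts# in (6,7);
 
       TS# NAME
---------- ------
         6 SHADOW
         7 USERS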

Simulate Lost Write

As I did before, I create a DEMO table:

SQL> create table DEMO.DEMO pctfree 99 as select rownum id,1 n, current_timestamp ts , (select current_scn from v$database) scn from xmltable('1 to 10');
Table DEMO.DEMO created.
 
SQL> column block_id new_value last_block_id
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
BLOCK_ID ID N TS SCN
-------- -- - --------------------------------------------- -------
3123 1 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3123 2 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3124 3 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3124 4 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3125 5 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3125 6 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3126 7 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3126 8 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3127 9 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079
3127 10 1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH 4578079

I save one block:

SQL> column block_id clear
SQL> alter system checkpoint;
System CHECKPOINT altered.
SQL> alter system flush buffer_cache;
System FLUSH altered.
SQL> host dd if=/u01/oradata/CDB1/PDB1/users01.dbf of=/var/tmp/lwp.blk skip=$(( &last_block_id - 2 )) bs=8k count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000325309 s, 25.2 MB/s

Update the table:

SQL> update DEMO.DEMO set n=2+1, ts=current_timestamp, scn=(select current_scn from v$database);
10 rows updated.
SQL> commit;
Commit complete.
SQL> alter system checkpoint;
System CHECKPOINT altered.
SQL> alter system flush buffer_cache;
System FLUSH altered.

Write back the block I saved, to simulate a lost write:

SQL> host dd if=/var/tmp/lwp.blk of=/u01/oradata/CDB1/PDB1/users01.dbf seek=$(( &last_block_id - 2 )) bs=8k count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000104103 s, 78.7 MB/s

Lost Write detection

Now, when I query the table, the lost write is detected and an error is raised:

SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
 
Error starting at line : 93 File @ /media/sf_share/18c/lost_write_protection.sql
In command -
select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO
Error report -
ORA-65478: shadow lost write protection - found lost write

In the alert.log I have the mention of the block that failed:

2018-03-03 16:21:06.842000 +01:00
ERROR - I/O type:buffered I/O found lost write in block with file#:169 rdba:0x2a400c35, Expected SCN:0x000000000045db54 SCN in block:0x000000000045db23, approx current SCN:0x000000000045dbbb, RAC instance:1 pdb:5
*****************************************************************
An internal routine has requested a dump of selected redo.
This usually happens following a specific internal error, when
analysis of the redo logs will help Oracle Support with the
diagnosis.
It is recommended that you retain all the redo logs generated (by
all the instances) during the past 12 hours, in case additional
redo dumps are required to help with the diagnosis.
*****************************************************************

The block defined by the RDBA in alert.log is the one I have manually corrupted:

SQL> select dbms_utility.data_block_address_file(to_number('2a400c35','XXXXXXXX'))file#,dbms_utility.data_block_address_block( to_number('2a400c35','XXXXXXXX'))block# from dual;
 
FILE# BLOCK#
---------- ----------
169 3125

As mentioned in the alert.log the session has dumped the redo, as found in the session trace file:

ALTER SYSTEM DUMP REDO DBA MIN 169 3125 DBA MAX 169 3125 SCN MIN 4578132 SCN MAX 4578235 CON_ID 5
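
As a side check, the hexadecimal SCNs from the alert.log convert back to the decimal bounds used in this dump command (4578083 being the stale SCN found in the block):

SQL> select to_number('45db54','xxxxxx') expected, to_number('45db23','xxxxxx') in_block, to_number('45dbbb','xxxxxx') current_scn from dual;
 
  EXPECTED  IN_BLOCK CURRENT_SCN
---------- --------- -----------
   4578132   4578083     4578235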

This SCN 4578132 is the commit SCN for my update, and 4578235 is the current one. I can see the change that was lost here:

CHANGE #10 CON_ID:5 TYP:0 CLS:1 AFN:169 DBA:0x2a400c35 OBJ:73527 SCN:0x000000000045db23 SEQ:2 OP:11.4 ENC:0 RBL:0 FLG:0x0000
KTB Redo
op: 0x01 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: F xid: 0x0008.019.000002d6 uba: 0x02401257.01a2.24
KDO Op code: LKR row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x2a400c35 hdba: 0x2a400c32
itli: 2 ispac: 0 maxfr: 4858
tabn: 0 slot: 0 to: 2
CHANGE #11 CON_ID:5 TYP:0 CLS:1 AFN:169 DBA:0x2a400c35 OBJ:73527 SCN:0x000000000045db54 SEQ:1 OP:11.5 ENC:0 RBL:0 FLG:0x0000
KTB Redo
op: 0x02 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C uba: 0x02401257.01a2.25
KDO Op code: URP row dependencies Disabled
xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x2a400c35 hdba: 0x2a400c32
itli: 2 ispac: 0 maxfr: 4858
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 2 ckix: 0
ncol: 4 nnew: 3 size: 0
Vector content:
col 1: [ 2] c1 04
col 2: [13] 78 76 03 03 10 16 02 0d 73 cd 48 86 58
col 3: [ 5] c4 05 3a 52 20

Then we need to recover…

However, unfortunately, this block is not marked as corrupt:

SQL> select * from V$DATABASE_BLOCK_CORRUPTION;
no rows selected

This means that I cannot use RMAN block recovery:

SQL> host rman target / <<
Starting recover at 03-MAR-18
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=152 device type=DISK
allocated channel: ORA_DISK_2
 
starting media recovery
media recovery complete, elapsed time: 00:00:00
 
Finished recover at 03-MAR-18

And the RMAN recovery advisor is not aware of the problem:

RMAN> list failure;
 
using target database control file instead of recovery catalog
Database Role: PRIMARY
 
no failures found that match specification

So the solution is to mark the block as corrupt, or to restore the whole datafile (or a section of it). And maybe first ensure that the write was lost on the data block, and not on the lost write tracking block itself: the redo dump may help with that.

strace: additional I/Os

I traced the server process to see what files are read during my query:

SQL> column spid new_value pid
SQL> select spid from v$process join v$session on v$session.paddr=v$process.addr where sid=sys_context('userenv','sid');
SPID
9360
SQL> column spid clear
SQL> define bg=&:
SQL> host strace -p &pid -o strace.txt &bg

This small awk script filters the I/O calls on users01.dbf and shadow.dbf, and displays the system calls on the corresponding file handles:

awk '/open[(]["].*(shadow.dbf|users01.dbf).*["],/{l=$0;gsub(/[(,)]/," ");h[$NF]=$2" "$NF;print $l;$0=$l}/^[a-zA-Z0-9]+[(][0-9]+[,)]/{l=$0;gsub(/[(,)]/," "); if ( $1 == "close" ) h[$2]="" ; if ( h[$2]!="" ) printf "%-130s \t%80s\n",l,h[$2]}/F_DUPFD/{if ( h[$2]!="" ) h[$NF]=h[$2]" "$NF;h[$2]=""}' strace.txt | grep --color=auto -E "^|^.*users01.*"

Here are the open() and pread() calls:

open "/u01/oradata/CDB1/PDB1/shadow.dbf" O_RDWR|O_DSYNC = 8
fcntl(8, F_SETFD, FD_CLOEXEC) = 0 "/u01/oradata/CDB1/PDB1/shadow.dbf" 8
fcntl(8, F_DUPFD, 256) = 256 "/u01/oradata/CDB1/PDB1/shadow.dbf" 8
fcntl(256, F_SETFD, FD_CLOEXEC) = 0 "/u01/oradata/CDB1/PDB1/shadow.dbf" 8 256
pread64(256, "X\242\226V\333E\1\f\372\366"..., 8192, 1056768) = 8192 "/u01/oradata/CDB1/PDB1/shadow.dbf" 8 256
open "/u01/oradata/CDB1/PDB1/users01.dbf" O_RDWR|O_DSYNC = 8
fcntl(8, F_SETFD, FD_CLOEXEC) = 0 "/u01/oradata/CDB1/PDB1/users01.dbf" 8
fcntl(8, F_DUPFD, 256) = 257 "/u01/oradata/CDB1/PDB1/users01.dbf" 8
fcntl(257, F_SETFD, FD_CLOEXEC) = 0 "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "#\242002\f@*%\333E\1\4\370\264"..., 8192, 25575424) = 8192 "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "\6\242003\f@*T\333E\5\6\317H\1007\37\1!\333E"..., 40960, 25583616) = 40960 "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "\6\242005\f@*#\333E\2\4zo\1007\37\1!\333E"..., 8192, 25600000) = 8192 "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "\6\242003\f@*T\333E\5\6\317H\1007\37\1!\333E"..., 40960, 25583616) = 40960 "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "\6\242005\f@*#\333E\2\4zo\1007\37\1!\333E"..., 8192, 25600000) = 8192 "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "\6\242003\f@*T\333E\5\6\317H\1007\37\1!\333E"..., 40960, 25583616) = 40960 "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "\6\242005\f@*#\333E\2\4zo\1007\37\1!\333E"..., 8192, 25600000) = 8192 "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257

We can see the lost write protection file read first (1 block at offset 1056768, which is block 129, the first one after the 1MB header), holding the SCNs for all 5 blocks of my table. Then the table blocks are read. Note that all those blocks (lost write protection and data) go into the buffer cache, and so do not have to be re-read each time: I ran my failing select 3 times and only the first one had to read the shadow datafile.
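
The block numbers fall out of those offsets directly (a sketch: offset divided by the 8KB block size; 25583616 is the start of the five-block multiblock read on the table, and 25575424 the single block read just before it):

SQL> select 1056768/8192 shadow_block, 25575424/8192 block_before, 25583616/8192 first_data_block from dual;
 
SHADOW_BLOCK BLOCK_BEFORE FIRST_DATA_BLOCK
------------ ------------ ----------------
         129         3122             3123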

X$BH: additional buffer gets

As those blocks are read through the buffer cache during consistent reads, I checked the buffer cache headers after each of the 3 times I ran the query. I identified them from the function that reads them: kcbr_lost_get_lost_write_scns.

SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
ORA-65478: shadow lost write protection - found lost write
 
SQL> select obj,state,tch,fp_whr from x$bh where fp_whr like 'kr_gcur_4: kcbr_lost_get_lost_w%';
 
OBJ STATE TCH FP_WHR
--- ----- --- ------
4294967295 1 1 kr_gcur_4: kcbr_lost_get_lost_w
 
 
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
ORA-65478: shadow lost write protection - found lost write
 
SQL> select obj,state,tch,fp_whr from x$bh where fp_whr like 'kr_gcur_4: kcbr_lost_get_lost_w%';
 
OBJ STATE TCH FP_WHR
--- ----- --- ------
4294967295 1 2 kr_gcur_4: kcbr_lost_get_lost_w
 
 
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
ORA-65478: shadow lost write protection - found lost write
 
SQL> select obj,state,tch,fp_whr from x$bh where fp_whr like 'kr_gcur_4: kcbr_lost_get_lost_w%';
 
OBJ STATE TCH FP_WHR
--- ----- --- ------
4294967295 1 3 kr_gcur_4: kcbr_lost_get_lost_w

Here we can see the touch count increasing. It seems that kcbr_lost_get_lost_write_scns is called for each query, even when there was no modification and no new read from disk.

While we’re there, let’s set a breakpoint on this function to see when it is called:

(gdb) break kcbr_lost_get_lost_write_scns
Breakpoint 1 at 0x85a9140
(gdb) c
Continuing.
 
Breakpoint 1, 0x00000000085a9140 in kcbr_lost_get_lost_write_scns ()
(gdb) bt
#0 0x00000000085a9140 in kcbr_lost_get_lost_write_scns ()
#1 0x0000000001cf9c01 in kcbzibmlt ()
#2 0x0000000001ce2f29 in kcbzib ()
#3 0x0000000011e7f6e9 in kcbgtcr ()
#4 0x0000000011e366bd in ktrget2 ()
#5 0x00000000121d5ca7 in kdst_fetch0 ()
#6 0x00000000121e4f5a in kdstf000110100000000km ()
#7 0x00000000121d398e in kdsttgr ()
#8 0x000000001224d28f in qertbFetch ()
#9 0x00000000120340ef in opifch2 ()
#10 0x0000000002d8d033 in kpoal8 ()
#11 0x000000001203af9c in opiodr ()
#12 0x000000001230acf7 in ttcpip ()
#13 0x00000000026a5667 in opitsk ()
#14 0x00000000026aa27d in opiino ()
#15 0x000000001203af9c in opiodr ()
#16 0x00000000026a10a3 in opidrv ()
#17 0x00000000032a58af in sou2o ()
#18 0x0000000000d68047 in opimai_real ()
#19 0x00000000032b2667 in ssthrdmain ()
#20 0x0000000000d67e53 in main ()

Look at Frits Hoogland’s annotations for the meaning of these functions, and you will see that this is called during consistent reads (“input buffer”).

So what?

This feature is interesting. Of course, we need to measure the overhead of this detection, but because this additional storage of the SCNs is implemented as ordinary data blocks, it benefits from all their efficiency: buffer cache, background writes by DBWR, protection by redo, backups… These days, I see more and more databases installed on storage with fancy features, and admins playing with snapshots without really knowing whether they are consistent or not. This is the opposite of the ‘reliable’ and ‘keep it simple’ properties that we want for our data. For these environments, when I cannot convince the storage admins to forget about those features and rely on Data Guard on top of the simplest storage, at least we now have a way to protect ourselves from failures in those layers.

 

The post 18c new Lost Write Protection appeared first on Blog dbi services.

Friday Philosophy – Criticism is Critical for Believable Endorsement

If you had a friend who always told you that you were the best, that you had no faults, and that everything you did was great – would you trust them? I know I would not. I am fully aware that I am not perfect(*). I used to see this sometimes in relationships too, especially when I was younger. One partner would be so desperate for their boyfriend/girlfriend to like them that they would never criticise the light of their life. The relationship never lasted as it was, well, creepy and false.

For your opinion of someone to be seen as honest, there has to be space for criticism. I love my wife very much, but she’s crap at loading the dishwasher. Joking aside, I believe my wife when she says she likes some aspect of my character as she will tell me about the ones she does not like. Thankfully, not at length.

In exactly the same way, for your opinion on a technology or application to be accepted as honest & worthwhile, there has to be space for criticism. I’m not saying that there has to be some criticism within any given endorsement of a product, I’m saying you need to be confident that the person would mention any faults or drawback they are aware of for you to believe that endorsement. I’m really hoping you are all agreeing with me on this!

So why do Marketing departments so often not get this? What is so fundamentally broken – OK, let’s be nice and say different – about their view of the world that any criticism is not acceptable? I just don’t understand either their belief that their product is perfect or that people will be fooled by an uncritical opinion of that product.

I can see how this would work in social media reviews like TripAdvisor though. I recently reviewed several places I had visited, and a couple of companies then contacted me to ask me to remove the bits where I had said anything negative. They fundamentally wanted me to lie for them, or at least implicitly (and complicitly) give a better review by omission. I don’t take it well when I am asked to lie. In the case of social media I can see how “cleaning up” the reviews might help, as most people look at the sum of all reviews and not at the reviewers.

But when you are actually a known person giving a review or endorsement, your reputation is critical to how the review is perceived.

What triggered this post was I recently discovered a friend of mine had been asked by a marketing person to remove some negative things they had said. They refused and made the point I have just made – if they were to be seen as believable when they said something else that the company produced was great, they had to be seen to be honest in criticising anything that was less-than-perfect. And let’s all be grown up about this, I’d say no software or application is perfect! However, the marketing person found this concept alien to them.

I wonder if people who work in marketing departments have difficulty maintaining long-term relationships? Based on my experience, quite a few of them are willing to ask you to lie for them and they don’t understand honesty is core to trust. Or am I just being very, very naive?

For me to trust the opinion of someone commenting on software & applications, in fact on anything really, I want to see some proof of their integrity by them being critical as well as complimentary. If it is all positive, I will assume that their opinion is bought in some way or that they are not particularly discerning. So, Marketing People asking for negative comments to be removed: you are doing your employer a disservice. Please keep that in mind.

(*)I’m not perfect – but sometimes I’m so close it hurts. My wife got me a T-shirt with this on, so it must be true… Or she was being sarcastic?

Conditional SQL – 5

Here’s a note that has been sitting around for more than 3 years (the draft date is Jan 2015), waiting for me to finish it off; and in that time we’ve got a new version of Oracle that changes the solution to the problem it presented. (I also managed to write “Conditional SQL –  6” in the intervening period !)

This posting started with a question on the OTN (now ODC) database forum about an execution plan used by 11.2.0.3.  Here’s a model to represent the data and the query:

rem
rem     Script:         null_plan_4.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2015
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

create table catentry as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment here to avoid format issue
)
select
        rownum  catentry_id,
        case
                when mod(rownum-1,100) > 0 then mod(rownum-1,100)
        end     member_id,
        case
                when trunc((rownum-1)/100) > 0 then trunc((rownum-1)/100)
        end     partnumber,
        rpad('x',100)   padding
from
        generator,
        generator
where
        rownum <= 100 * 100 -- > comment here to avoid format issue
;

execute dbms_stats.gather_table_stats(user,'catentry');

create unique index cat_i0 on catentry(member_id, partnumber) compress 1;
--  create        index cat_i1 on catentry(member_id, partnumber, 0) compress 1;
--  create        index cat_i2 on catentry(partnumber, member_id, 0) compress 1;

variable b1 number
variable b2 number
variable b3 number
variable b4 number

begin
        :b1 := 22;
        :b2 := 1;
        :b3 := 44;
        :b4 := 1;
end;
/

select
        catentry_id
from
        catentry
where
        (   partnumber= :b1
         or (0 = :b2 and partnumber is null)
        )
and     (    member_id= :b3
         or (0 = :b4 and member_id is null)
        )
;

select * from table(dbms_xplan.display_cursor);

==============================================================================

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |       |       |    23 (100)|          |
|*  1 |  TABLE ACCESS FULL| CATENTRY |     1 |    10 |    23   (5)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((("PARTNUMBER"=:B1 OR ("PARTNUMBER" IS NULL AND 0=:B2))
              AND ("MEMBER_ID"=:B3 OR ("MEMBER_ID" IS NULL AND 0=:B4))))

The question this example raised was, effectively: “Why didn’t Oracle use bind peeking to work out that the best plan for this query – with these input values – was an index range scan?”

The basic answer to this question is this: “No matter how clever Oracle can be with bind peeking and execution plans, it has to produce an execution plan that will give the right answer whatever the supplied values might be.”

The OP was hoping that the optimizer would see :b2 and :b4 arriving with the value 1, infer that “0 = 1” is always false, and reduce the query predicate to “partnumber = 22 and member_id = 44”, producing the following plan:


----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |       |       |     2 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | CAT_I0   |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("PARTNUMBER"=22 AND "MEMBER_ID"=44)

But this plan could produce the wrong results if the next execution of the query supplied zeros for :b2 or :b4, so Oracle has to do something more generic. (Bear in mind that adaptive cursor sharing keeps reusing the same execution plan until it detects that the performance for some input values is bad; it doesn’t pre-emptively create new plans based on the incoming values – though in principle it might be possible for the Oracle developers to introduce code that can recognise special cases for predicates of the form “constant1 = constant2”).

If you review the SQL statement you can see that it’s clearly trying to allow the user to find data about member_ids and partnumbers where both, neither, or either value is allowed to be null: a couple of “if – then – else” conditions that should have been handled in the client code have been embedded in the SQL. As we have seen several times before, if you can’t change the client code then you have to hope that Oracle will use some clever transformation to handle the query in sections.

We can infer from various details of the posting that the member_id and partnumber columns were both allowed to be null, so if we want to make sure that Oracle always uses an indexed access path to acquire data for this query we need to have an index which starts with those two columns and then has at least one column which is guaranteed to be non-null so, for example, we could simply drop the current index and replace it with one that has a fixed zero on the end:

create index cat_i1 on catentry(member_id, partnumber, 0) compress 1;

With my particular data set, query, and version of Oracle this didn’t make any difference to the plan. But then I thought about the data definition and realised (and checked) that the index had a terrible clustering_factor, so I dropped it and created it with the first two columns in the opposite order:

create index cat_i2 on catentry(partnumber, member_id, 0) compress 1;

Side note:
You’ll notice that I’ve replaced the original unique index with a non-unique index. This was necessary because there were many rows where both partnumber and member_id were null, so if I want to maintain the logic of the previous unique index I’ll need to add a unique constraint on (member_id, partnumber). It’s possible, of course, that in similar circumstances I might want both indexes – one for the uniqueness and to access the data using only one of the columns, the other to access the data using only the other column.
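 
As a sketch of what that could look like, the uniqueness rule could be reinstated with a constraint policed by the new non-unique index (the constraint name here is made up for illustration):
 
alter table catentry add constraint catentry_uk
        unique (member_id, partnumber)
        using index cat_i1
;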

With this index in place, and unhinted, the plan I got from 11.2.0.4 changed to use concatenation with an impressive four-way split:


------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |       |       |    12 (100)|          |
|   1 |  CONCATENATION                |          |       |       |            |          |
|*  2 |   FILTER                      |          |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | CAT_I2   |     1 |       |     2   (0)| 00:00:01 |
|*  5 |   FILTER                      |          |       |       |            |          |
|*  6 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN          | CAT_I2   |     1 |       |     2   (0)| 00:00:01 |
|*  8 |   FILTER                      |          |       |       |            |          |
|*  9 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN          | CAT_I2   |     1 |       |     2   (0)| 00:00:01 |
|* 11 |   TABLE ACCESS BY INDEX ROWID | CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|* 12 |    INDEX RANGE SCAN           | CAT_I2   |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter((0=:B2 AND 0=:B4))
   4 - access("PARTNUMBER" IS NULL AND "MEMBER_ID" IS NULL)
       filter("MEMBER_ID" IS NULL)
   5 - filter(0=:B2)
   6 - filter((LNNVL("MEMBER_ID" IS NULL) OR LNNVL(0=:B4)))
   7 - access("PARTNUMBER" IS NULL AND "MEMBER_ID"=:B3)
       filter("MEMBER_ID"=:B3)
   8 - filter(0=:B4)
   9 - filter((LNNVL("PARTNUMBER" IS NULL) OR LNNVL(0=:B2)))
  10 - access("PARTNUMBER"=:B1 AND "MEMBER_ID" IS NULL)
  11 - filter(((LNNVL("MEMBER_ID" IS NULL) OR LNNVL(0=:B4)) AND
              (LNNVL("PARTNUMBER" IS NULL) OR LNNVL(0=:B2))))
  12 - access("PARTNUMBER"=:B1 AND "MEMBER_ID"=:B3)


To execute this plan the run-time engine works as follows:

  • Operation 2: If :b2 and :b4 are both zero we use the index to find the rows where member_id and partnumber are null (the filter “member_id is null” seems to be redundant)
  • Operation 5: if :b2 is zero we use the index to find rows where the partnumber is null and the member_id is the supplied value (and if that’s null the access will immediately return zero rows because of the equality predicate), and discard any rows that have already been returned by operation 2
  • Operation 8: if :b4 is zero we will use the index to find rows where the partnumber is the supplied value (and if the partnumber is null, that access will immediately return zero rows because of the equality predicate) and the member_id is null, and discard any rows that have already been returned by operation 2.
  • Operations 11 and 12 will always run – using the index to find rows that match with equality on both the incoming member_id and partnumber, discarding any rows already returned by the previous operations, and obviously not matching any rows where either column “IS” null.
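 
To make the logic concrete, here is a rough manual rewrite of what the concatenation effectively does, derived from the Predicate Information above (an illustration, not the optimizer’s literal internal form):
 
select catentry_id from catentry                        -- operations 2 - 4
where   partnumber is null and member_id is null
and     0 = :b2 and 0 = :b4
union all
select catentry_id from catentry                        -- operations 5 - 7
where   partnumber is null and member_id = :b3
and     0 = :b2
and     (lnnvl(member_id is null) or lnnvl(0 = :b4))
union all
select catentry_id from catentry                        -- operations 8 - 10
where   partnumber = :b1 and member_id is null
and     0 = :b4
and     (lnnvl(partnumber is null) or lnnvl(0 = :b2))
union all
select catentry_id from catentry                        -- operations 11 - 12
where   partnumber = :b1 and member_id = :b3
and     (lnnvl(member_id is null) or lnnvl(0 = :b4))
and     (lnnvl(partnumber is null) or lnnvl(0 = :b2))
;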

The critical feature of this plan, of course, is that we got it because we have given Oracle an efficient option to find the rows where both member_id and partnumber are null – and that allows the rest of the concatenation options to take place.

Hints and Upgrades

Interestingly, after the clue that 11g would happily use concatenation with a “good enough” index, I went back to the example where I’d just added a zero to the existing index and checked what would happen if I added a /*+ use_concat */ hint (without any of the qualifying parameters that the hint can now take) – and got the same concatenated plan. The fact that the path appeared isn’t the interesting bit; see if you can spot what is:

------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |       |       |    12 (100)|          |
|   1 |  CONCATENATION                |          |       |       |            |          |
|*  2 |   FILTER                      |          |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | CAT_I1   |     1 |       |     2   (0)| 00:00:01 |
|*  5 |   FILTER                      |          |       |       |            |          |
|*  6 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN          | CAT_I1   |     1 |       |     2   (0)| 00:00:01 |
|*  8 |   FILTER                      |          |       |       |            |          |
|*  9 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN          | CAT_I1   |     1 |       |     2   (0)| 00:00:01 |
|* 11 |   TABLE ACCESS BY INDEX ROWID | CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|* 12 |    INDEX RANGE SCAN           | CAT_I1   |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Check the cost, and compare it with the cost of the full tablescan. The hinted path has a lower cost than the default path. I think this may be another case of an “unknowable” range scan being ignored in favour of a known alternative.

Finally, we get to today – when I tested the code against 12.1.0.2 and 12.2.0.1. Nothing exciting happened in 12.1.0.2 – the plans were just like the 11g plans, but here’s the plan I got in 12.2 with the “bad” indexing (original column order with added zero column – index cat_i1) without any hints in the SQL:


-----------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                 |       |       |     9 (100)|          |
|   1 |  VIEW                                   | VW_ORE_37A7142B |     4 |    52 |     9   (0)| 00:00:01 |
|   2 |   UNION-ALL                             |                 |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID          | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|*  4 |     INDEX UNIQUE SCAN                   | CAT_I1          |     1 |       |     1   (0)| 00:00:01 |
|*  5 |    FILTER                               |                 |       |       |            |          |
|   6 |     TABLE ACCESS BY INDEX ROWID BATCHED | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                   | CAT_I1          |     1 |       |     1   (0)| 00:00:01 |
|*  8 |    FILTER                               |                 |       |       |            |          |
|   9 |     TABLE ACCESS BY INDEX ROWID BATCHED | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|* 10 |      INDEX RANGE SCAN                   | CAT_I1          |     1 |       |     1   (0)| 00:00:01 |
|* 11 |    FILTER                               |                 |       |       |            |          |
|  12 |     TABLE ACCESS BY INDEX ROWID BATCHED | CATENTRY        |     1 |    10 |     3   (0)| 00:00:01 |
|* 13 |      INDEX RANGE SCAN                   | CAT_I1          |     1 |       |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      ...
      OR_EXPAND(@"SEL$1" (1) (2) (3) (4))
      ...
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("MEMBER_ID"=:B3 AND "PARTNUMBER"=:B1)
   5 - filter(0=:B4)
   7 - access("MEMBER_ID" IS NULL AND "PARTNUMBER"=:B1)
       filter(("PARTNUMBER"=:B1 AND LNNVL("MEMBER_ID"=:B3)))
   8 - filter(0=:B2)
  10 - access("MEMBER_ID"=:B3 AND "PARTNUMBER" IS NULL)
       filter(LNNVL("PARTNUMBER"=:B1))
  11 - filter((0=:B4 AND 0=:B2))
  13 - access("MEMBER_ID" IS NULL AND "PARTNUMBER" IS NULL)
       filter(("PARTNUMBER" IS NULL AND LNNVL("PARTNUMBER"=:B1) AND LNNVL("MEMBER_ID"=:B3)))

The full tablescan didn’t appear – but it wasn’t eliminated by concatenation, it was eliminated by the “new” 12.2 variant: “OR expansion”. The net effect is remarkably similar – we still have filter operations comparing :b2 and :b4 with zero, and we still have a scattering of lnnvl() function calls being used to discard rows we’ve already accessed – but the pattern is slightly different and we have a union all rather than concatenation.

This change prompted me to go back to testing with just the original index (member_id, partnumber – index cat_i0) … which took me back to the full tablescan until I added the hint /*+ or_expand */ to the query to get the following plan:


-----------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                 |       |       |    29 (100)|          |
|   1 |  VIEW                                   | VW_ORE_37A7142B |     4 |    52 |    29   (4)| 00:00:01 |
|   2 |   UNION-ALL                             |                 |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID          | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|*  4 |     INDEX UNIQUE SCAN                   | CAT_I0          |     1 |       |     1   (0)| 00:00:01 |
|*  5 |    FILTER                               |                 |       |       |            |          |
|   6 |     TABLE ACCESS BY INDEX ROWID BATCHED | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                   | CAT_I0          |     1 |       |     1   (0)| 00:00:01 |
|*  8 |    FILTER                               |                 |       |       |            |          |
|   9 |     TABLE ACCESS BY INDEX ROWID BATCHED | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|* 10 |      INDEX RANGE SCAN                   | CAT_I0          |     1 |       |     1   (0)| 00:00:01 |
|* 11 |    FILTER                               |                 |       |       |            |          |
|* 12 |     TABLE ACCESS FULL                   | CATENTRY        |     1 |    10 |    23   (5)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("MEMBER_ID"=:B3 AND "PARTNUMBER"=:B1)
   5 - filter(0=:B4)
   7 - access("MEMBER_ID" IS NULL AND "PARTNUMBER"=:B1)
       filter(("PARTNUMBER"=:B1 AND LNNVL("MEMBER_ID"=:B3)))
   8 - filter(0=:B2)
  10 - access("MEMBER_ID"=:B3 AND "PARTNUMBER" IS NULL)
       filter(LNNVL("PARTNUMBER"=:B1))
  11 - filter((0=:B4 AND 0=:B2))
  12 - filter(("PARTNUMBER" IS NULL AND "MEMBER_ID" IS NULL AND LNNVL("PARTNUMBER"=:B1) AND
	      LNNVL("MEMBER_ID"=:B3)))

The plan shows “or expansion”, and highlights the most significant difference between concatenation and expansion – concatenation requires indexed access paths in all branches, or-expansion doesn’t.

At first sight this plan, with its full tablescan at operation 12, might seem like a bad idea; but check operation 11, the guarding filter: the tablescan occurs only if both :b2 and :b4 are zero. Perhaps that special condition is never supposed to appear; and if it does, the query may have to do a lot of work whether or not it can use an index. The fact that you can now handle the original problem without adding or altering existing indexes – provided you can inject this or_expand hint – may be of significant benefit. (Of course, being able to recreate the original index with the columns in the reverse order would avoid the necessity of worrying about the hint at all.)

tl;dr

Applications that push “if-then-else” decisions into the SQL and down to the optimizer are a pain in the backside; the performance problems they produce can sometimes be bypassed by the addition of extra indexes that might give you plans (possibly hinted) that report the use of the concatenation operation. In 12cR2 the optimizer has an improved strategy for damage limitation, “cost-based OR expansion”, that can produce very similar effects without the addition of extra indexes. These plans will report union all operations, referencing views with names like VW_ORE_xxxxxxxxx.

When you next upgrade you may find a few cases where you can get rid of indexes that were only created to work around defective application coding strategies. You may also want to hunt down any code where you’ve added use_concat hints and see if they can be removed, or whether they should be replaced by or_expand, since the former hint will disable the latter feature.

 

Setting Oracle Session Parameters for Specific Process Scheduler Processes

This note describes a mechanism for setting initialisation parameters for specific processes run on the process scheduler. I will demonstrate it in relation to nVision, but it has general application in PeopleSoft.
A table is used to hold metadata that describes which setting is applied to which processes. A trigger on the process scheduler request table PSPRCSRQST reads that data and alters the session settings. This approach is easier to adjust and understand than static PL/SQL code in a trigger.
NB: The PSAESRV server process is used by default to run application engine programs in the process scheduler. It should only be used for very short-lived programs and should usually be deconfigured. It maintains a persistent connection to the database; therefore, session parameters set for one application engine program will carry forward into subsequent programs run on the same server process, with the potential for unintended and unpredictable results. All other processes establish their own database session that ends when the process terminates. This trigger mechanism can still be used to set parameters for some processes that run on PSAESRV, but the default parameter value should then be set explicitly for all other application engine processes.
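
For example, a hypothetical fallback row, following the metadata format defined in the next section, could pin the Oracle default behaviour for every other Application Engine process:

INSERT INTO sysadm.PS_PRCS_SESS_PARM (prcstype, prcsname, oprid, runcntlid, keyword, param_name, parmvalue)
VALUES ('Application Engine',' ',' ',' ', 'SET', 'parallel_degree_policy','manual');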

Metadata 

The table that holds the metadata should be defined in Application Designer.

CREATE TABLE PS_PRCS_SESS_PARM
(PRCSTYPE   VARCHAR2(30)  NOT NULL
,PRCSNAME   VARCHAR2(12)  NOT NULL
,OPRID      VARCHAR2(30)  NOT NULL
,RUNCNTLID  VARCHAR2(30)  NOT NULL
,KEYWORD    VARCHAR2(8)   NOT NULL
,PARAM_NAME VARCHAR2(50)  NOT NULL
,PARMVALUE  VARCHAR2(128) NOT NULL
) TABLESPACE PTTBL
STORAGE (INITIAL 40000 NEXT 100000 MAXEXTENTS UNLIMITED PCTINCREASE 0)
PCTFREE 10 PCTUSED 80
/
CREATE UNIQUE INDEX PS_PRCS_SESS_PARM ON PS_PRCS_SESS_PARM
(PRCSTYPE, PRCSNAME, OPRID, RUNCNTLID, PARAM_NAME)
TABLESPACE PSINDEX
STORAGE (INITIAL 40000 NEXT 100000 MAXEXTENTS UNLIMITED PCTINCREASE 0)
PCTFREE 10 PARALLEL NOLOGGING
/
ALTER INDEX PS_PRCS_SESS_PARM NOPARALLEL LOGGING
/

I will demonstrate this mechanism for nVision. The metadata is simply inserted into the table by script:

INSERT INTO sysadm.PS_PRCS_SESS_PARM (prcstype, prcsname, oprid, runcntlid, keyword, param_name, parmvalue)
VALUES ('nVision-ReportBook','RPTBOOK',' ',' ', 'SET', 'parallel_degree_policy','auto');
INSERT INTO sysadm.PS_PRCS_SESS_PARM (prcstype, prcsname, oprid, runcntlid, keyword, param_name, parmvalue)
VALUES ('nVision-ReportBook','RPTBOOK',' ',' ', 'SET', 'parallel_degree_limit','4');
INSERT INTO sysadm.PS_PRCS_SESS_PARM (prcstype, prcsname, oprid, runcntlid, keyword, param_name, parmvalue)
VALUES ('nVision-ReportBook','RPTBOOK',' ',' ', 'SET', 'parallel_degree_level','150');
INSERT INTO sysadm.PS_PRCS_SESS_PARM (prcstype, prcsname, oprid, runcntlid, keyword, param_name, parmvalue)
VALUES ('nVision-ReportBook','RPTBOOK',' ',' ', 'SET', 'parallel_min_time_threshold','1');
INSERT INTO sysadm.PS_PRCS_SESS_PARM (prcstype, prcsname, oprid, runcntlid, keyword, param_name, parmvalue)
VALUES ('nVision-ReportBook','RPTBOOK',' ',' ', 'SET', '_optimizer_skip_scan_enabled','FALSE');

Here we have 5 session parameters that will apply to all nVision reportbooks, but which I don't want to apply to the rest of the system.

PRCSTYPE             PRCSNAME     OPRID      RUNCNTLID                KEYWORD  PARAM_NAME                     PARMVALUE
-------------------- ------------ ---------- ------------------------ -------- ------------------------------ --------------------
nVision-ReportBook   RPTBOOK                                          SET      parallel_degree_policy         auto
nVision-ReportBook   RPTBOOK                                          SET      parallel_degree_limit          4
nVision-ReportBook   RPTBOOK                                          SET      _optimizer_skip_scan_enabled   FALSE
nVision-ReportBook   RPTBOOK                                          SET      parallel_min_time_threshold    1
nVision-ReportBook   RPTBOOK                                          SET      parallel_degree_level          150

The objective is to use limited parallelism only in the nVision reporting, without having to decorate the underlying ledger tables with a parallel degree.

  • parallel_degree_policy=auto enables the 12c automatic degree of parallelism (auto DOP) calculation and statement queuing.
  • parallel_min_time_threshold is set to 1 second (the default is 10). Statements whose estimated runtime is greater than or equal to this value are considered for automatic degree of parallelism.
  • parallel_degree_limit=4 restricts the automatic degree of parallelism to 4, to prevent any one statement using excessive parallelism.
  • parallel_degree_level=150 scales the automatic degree of parallelism calculation, but within parallel_degree_limit. This parameter is not officially documented; see Kerry Osborne's blog 12c – parallel_degree_level (control for auto DOP).
  • _optimizer_skip_scan_enabled=FALSE disables index skip scan to promote the use of smart full scan and Bloom filtering. It is recommended for engineered systems in Advice for the PeopleSoft DBA. Skip scan can prevent the optimizer choosing a smart full scan, so it makes sense to limit the setting to just nVision.

I can also specify a parameter that will only be set when the ReportBook is run by a particular operator with a particular run control. The specific setting for one particular operator ID and run control takes precedence over the generic setting for all ReportBooks.

INSERT INTO sysadm.PS_PRCS_SESS_PARM (prcstype, prcsname, oprid, runcntlid, keyword, param_name, parmvalue)
VALUES ('nVision-ReportBook','RPTBOOK','NVOPR','NVSRBK_2', 'SET', 'parallel_degree_level','200');

In this case, I will scale the degree of parallelism further for one particular ReportBook by setting parallel_degree_level to 200.

PRCSTYPE             PRCSNAME     OPRID      RUNCNTLID                KEYWORD  PARAM_NAME                     PARMVALUE
-------------------- ------------ ---------- ------------------------ -------- ------------------------------ --------------------
nVision-ReportBook   RPTBOOK                                          SET      parallel_degree_level          150
nVision-ReportBook   RPTBOOK      NVOPR      NVSRBK_2                 SET      parallel_degree_level          200

Trigger 

When a process starts, the first thing it does is update its own status to 7 to indicate that it is processing. This is another example of a trigger created on that transition to inject behaviour at the start of a PeopleSoft process. The trigger reads the metadata and applies the settings with an ALTER SESSION command. The process type, name, operator and run control attributes must exactly match the process request, but a blank value is treated as a wildcard. Underscore (hidden) parameters must be delimited in double quotes.

CREATE OR REPLACE TRIGGER sysadm.set_prcs_sess_parm
BEFORE UPDATE OF runstatus ON sysadm.psprcsrqst
FOR EACH ROW
WHEN (new.runstatus = 7 AND old.runstatus != 7 AND new.prcstype != 'PSJob')
DECLARE
  l_cmd   VARCHAR2(100 CHAR);
  l_delim VARCHAR2(1 CHAR) := '';
  l_op    VARCHAR2(1 CHAR) := '=';
BEGIN
  dbms_output.put_line('Row:'||:new.prcstype||'.'||:new.prcsname||':'||:new.oprid||'.'||:new.runcntlid);

  FOR i IN (
    WITH x AS (
      SELECT p.*
      ,      row_number() OVER (PARTITION BY param_name
                                ORDER BY NULLIF(prcstype, ' ') NULLS LAST,
                                         NULLIF(prcsname, ' ') NULLS LAST,
                                         NULLIF(oprid,    ' ') NULLS LAST,
                                         NULLIF(runcntlid,' ') NULLS LAST
                               ) priority
      FROM   sysadm.PS_PRCS_SESS_PARM p
      WHERE  (p.prcstype  = :new.prcstype  OR p.prcstype  = ' ')
      AND    (p.prcsname  = :new.prcsname  OR p.prcsname  = ' ')
      AND    (p.oprid     = :new.oprid     OR p.oprid     = ' ')
      AND    (p.runcntlid = :new.runcntlid OR p.runcntlid = ' ')
    )
    SELECT * FROM x WHERE priority = 1
  ) LOOP

    IF UPPER(i.keyword) = 'SET' THEN
      l_op := '=';
      IF SUBSTR(i.param_name,1,1) = '_' THEN
        l_delim := '"';
      ELSE
        l_delim := '';
      END IF;
    ELSE
      l_op := ' ';
      l_delim := '';
    END IF;

    IF NULLIF(i.parmvalue,' ') IS NOT NULL THEN
      dbms_output.put_line('Rule:'||NVL(NULLIF(i.prcstype,' '),'*')
        ||'.'||NVL(NULLIF(i.prcsname,' '),'*')
        ||':'||NVL(NULLIF(i.oprid,' '),'*')
        ||'.'||NVL(NULLIF(i.runcntlid,' '),'*')
        ||':'||i.keyword||':'||i.param_name||l_op||i.parmvalue);

      l_cmd := 'ALTER SESSION '||i.keyword||' '||l_delim||i.param_name||l_delim||l_op||i.parmvalue;
      dbms_output.put_line('PI='||:new.prcsinstance||':'||:new.prcstype||'.'||:new.prcsname||':'
        ||:new.oprid||'.'||:new.runcntlid||':'||l_cmd);
      EXECUTE IMMEDIATE l_cmd;
    END IF;
  END LOOP;
EXCEPTION
  WHEN OTHERS THEN
    dbms_output.put_line(sqlerrm);
END;
/

The trigger script set_prcs_sess_parm.sql can be downloaded from my miscellaneous PeopleSoft scripts repository on GitHub.

Testing 

The trigger can be tested by updating the process scheduler request table in SQL*Plus, but be careful to roll back the update afterwards rather than committing. The trigger writes debug information to the server output that can be seen in SQL*Plus showing the rule being used and the ALTER SESSION command generated. However, this output is not captured in any PeopleSoft log when the trigger is fired by a PeopleSoft process.

set serveroutput on
update psprcsrqst set runstatus = 7 where runstatus != 7
and prcsname = 'RPTBOOK' and runcntlid = 'NVSRBK_2' and oprid = 'NVOPR' and rownum = 1;
rollback;

Rule:nVision-ReportBook.RPTBOOK:*.*:_optimizer_skip_scan_enabled=FALSE
PI=3543986:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_2:ALTER SESSION SET "_optimizer_skip_scan_enabled"=FALSE
Rule:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_2:parallel_degree_level=200
PI=3543986:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_2:ALTER SESSION SET parallel_degree_level=200
Rule:nVision-ReportBook.RPTBOOK:*.*:parallel_degree_limit=4
PI=3543986:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_2:ALTER SESSION SET parallel_degree_limit=4
Rule:nVision-ReportBook.RPTBOOK:*.*:parallel_degree_policy=auto
PI=3543986:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_2:ALTER SESSION SET parallel_degree_policy=auto
Rule:nVision-ReportBook.RPTBOOK:*.*:parallel_min_time_threshold=1
PI=3543986:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_2:ALTER SESSION SET parallel_min_time_threshold=1

In the above example, the specific rule for NVSRBK_2 was applied setting PARALLEL_DEGREE_LEVEL to 200, whereas in the next example the generic setting of 150 is applied to NVSRBK_1.

update psprcsrqst set runstatus = 7 where runstatus != 7
and prcsname = 'RPTBOOK' and runcntlid = 'NVSRBK_1' and oprid = 'NVOPR' and rownum = 1;
rollback;

Rule:nVision-ReportBook.RPTBOOK:*.*:_optimizer_skip_scan_enabled=FALSE
PI=3543964:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_1:ALTER SESSION SET "_optimizer_skip_scan_enabled"=FALSE
Rule:nVision-ReportBook.RPTBOOK:*.*:parallel_degree_level=150
PI=3543964:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_1:ALTER SESSION SET parallel_degree_level=150
Rule:nVision-ReportBook.RPTBOOK:*.*:parallel_degree_limit=4
PI=3543964:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_1:ALTER SESSION SET parallel_degree_limit=4
Rule:nVision-ReportBook.RPTBOOK:*.*:parallel_degree_policy=auto
PI=3543964:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_1:ALTER SESSION SET parallel_degree_policy=auto
Rule:nVision-ReportBook.RPTBOOK:*.*:parallel_min_time_threshold=1
PI=3543964:nVision-ReportBook.RPTBOOK:NVOPR.NVSRBK_1:ALTER SESSION SET parallel_min_time_threshold=1

How to Identify Active Lines in an Adaptive Execution Plan in DBA_HIST_SQL_PLAN

When investigating performance problems I spend a lot of my time profiling ASH and AWR data. I sometimes want to join back to the captured plans in DBA_HIST_SQL_PLAN to see how an object was accessed, which index was used, or how many rows the optimizer thought it would get.
From Oracle 12c we can get adaptive execution plans, where the optimizer considers different plans at runtime. Mostly it is a choice between a nested loop join and a hash join.
For example, in the execution plan below, DBMS_XPLAN includes a note to say that inactive lines are marked with a dash. Thus we can see that the nested loop lookup of PS_LEDGER via the PSCLEDGER index was rejected in favour of a Bloom filter on a full scan of the ledger table.

SQL_ID bk26thygs9c8n

Plan hash value: 38031479
--------------------------------------------------------------------------------------------------------------------------------
|  Id | Operation                                | Name               | Rows | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                    |      |       | 18275 (100)|          |       |       |
|   1 |  SORT GROUP BY                           |                    |    1 |   135 | 18275   (1)| 00:00:01 |       |       |
|   2 |  HASH JOIN                               |                    |    1 |   135 | 18274   (1)| 00:00:01 |       |       |
|-  3 |   NESTED LOOPS                           |                    |    1 |   135 | 18274   (1)| 00:00:01 |       |       |
|-  4 |    STATISTICS COLLECTOR                  |                    |      |       |            |          |       |       |
|   5 |     TABLE ACCESS STORAGE FULL            | PS_ELIM_CF_SEL2002 |    1 |    59 |    25   (0)| 00:00:01 |       |       |
|-  6 |    PARTITION RANGE OR                    |                    |    1 |    76 | 18249   (1)| 00:00:01 |KEY(OR)|KEY(OR)|
|-  7 |     PARTITION LIST ITERATOR              |                    |    1 |    76 | 18249   (1)| 00:00:01 |  KEY  |  KEY  |
|-  8 |      TABLE ACCESS BY LOCAL INDEX ROWID   | PS_LEDGER          |    1 |    76 | 18249   (1)| 00:00:01 |  KEY  |  KEY  |
|-  9 |       INDEX RANGE SCAN                   | PSCLEDGER          |    1 |       | 18228   (1)| 00:00:01 |  KEY  |  KEY  |
|  10 |   PARTITION RANGE OR                     |                    |    1 |    76 | 18249   (1)| 00:00:01 |KEY(OR)|KEY(OR)|
|  11 |    PARTITION LIST JOIN-FILTER            |                    |    1 |    76 | 18249   (1)| 00:00:01 |:BF0000|:BF0000|
|  12 |     TABLE ACCESS STORAGE FULL            | PS_LEDGER          |    1 |    76 | 18249   (1)| 00:00:01 |  KEY  |  KEY  |
--------------------------------------------------------------------------------------------------------------------------------

Note
-----

- this is an adaptive plan (rows marked '-' are inactive)

When I come to look at this plan in DBA_HIST_SQL_PLAN I will find all 14 lines, but there is no column in the view that indicates whether a line is active in the adaptive plan.

select id, operation, options, object_name, cardinality, cost from dba_hist_sql_plan
where sql_id = 'bk26thygs9c8n' and plan_hash_value = 38031479

        ID OPERATION                      OPTIONS                        OBJECT_NAME        CARDINALITY       COST
---------- ------------------------------ ------------------------------ ------------------ ----------- ----------
         0 SELECT STATEMENT                                                                                 1529564
         1 SORT                           GROUP BY                                                     1    1529564
         2 HASH JOIN                                                                                 359    1529563
         3 PART JOIN FILTER               CREATE                         :BF0000                     359    1529563
         4 NESTED LOOPS                                                                              359    1529563
         5 STATISTICS COLLECTOR
         6 TABLE ACCESS                   STORAGE FULL                   PS_ELIM_CF_SEL2002          360        172
         7 PARTITION RANGE                OR                                                           1    1529289
         8 PARTITION LIST                 ITERATOR                                                     1    1529289
         9 TABLE ACCESS                   BY LOCAL INDEX ROWID           PS_LEDGER                     1    1529289
        10 INDEX                          RANGE SCAN                     PSCLEDGER
        11 PARTITION RANGE                OR                                                    16845834    1529289
        12 PARTITION LIST                 JOIN-FILTER                                           16845834    1529289
        13 TABLE ACCESS                   STORAGE FULL                   PS_LEDGER             16845834    1529289

The information about whether a line in the execution plan is active, and whether it should be displayed by DBMS_XPLAN, is encoded in a display map inside the XML structure stored in the OTHER_XML column on line 1 of the plan.

<other_xml>
  ...
  <info type="adaptive_plan" note="y">yes</info>
  ...
  <display_map>
    <row op="1"  dis="1" par="0" dep="1" skp="0"/>
    <row op="2"  dis="2" par="1" dep="2" skp="0"/>
    <row op="3"  dis="3" par="2" dep="3" skp="0"/>
    <row op="4"  ... skp="1"/>
    <row op="5"  ... skp="1"/>
    <row op="6"  dis="4" par="3" dep="4" skp="0"/>
    <row op="7"  ... skp="1"/>
    <row op="8"  ... skp="1"/>
    <row op="9"  ... skp="1"/>
    <row op="10" ... skp="1"/>
    <row op="11" dis="5" par="2" dep="3" skp="0"/>
    <row op="12" dis="6" par="5" dep="4" skp="0"/>
    <row op="13" dis="7" par="6" dep="5" skp="0"/>
  </display_map>
</other_xml>
OP is the plan line ID, and SKP=0 indicates an active line; the attributes of the skipped (SKP=1) rows are elided above. This data can be extracted as a set of rows and columns with the XMLTABLE function.
My thanks to Tim Hall: I was only able to work this out with the aid of his excellent article XMLTABLE: Convert XML Data into Rows and Columns using SQL and the associated video.
(Update 5.3.2018): Since publishing this I have also found Martin Bach's blog Adaptive plans and v$sql_plan and related views.

column plan_hash_value format 9999999999 heading 'PHV'
column op format 999
column dis format 999
column par format 999
column dep format 999
column skp format 999
column object_name format a20
column operation format a20
column options format a20
column partition_start format a10 heading 'PStart'

with x as (
select sql_id, plan_hash_value
, xmltype(other_xml) xml_data
from dba_hist_sql_plan p
where sql_id IN ('bk26thygs9c8n')
and id = 1
and other_xml IS NOT NULL
), y as (
select sql_id, plan_hash_value
, xt.*
from x
, xmltable('/other_xml/display_map/row' PASSING x.xml_data
COLUMNS
"OP" NUMBER path '@op',
"DIS" NUMBER path '@dis',
"PAR" NUMBER path '@par',
"DEP" NUMBER path '@dep',
"SKP" NUMBER path '@skp'
) xt
where xt.skp = 0
)
select y.*
, p.object_name, p.operation, p.options, p.partition_start
from y
, dba_hist_sql_plan p
where p.sql_id = y.sql_id
and p.plan_hash_value = y.plan_hash_value
and y.op = p.id
/
  • OTHER_XML is a CLOB, so it must be converted to an XMLTYPE before being passed to XMLTABLE().
  • We effectively take a Cartesian product between the row source of SQL plans in sub-query x and XMLTABLE().
  • The XML structure uses attributes rather than tags, so you need the '@' in the path in the COLUMNS clause.
  • Inactive plan lines are filtered out with a criterion on SKP.

Now I have a SQL query that returns only the active rows in the execution plan. I can use this technique in other places to profile statements by aspects of the execution plan.

SQL_ID                PHV   OP  DIS  PAR  DEP  SKP OBJECT_NAME          OPERATION            OPTIONS              PStart
------------- ----------- ---- ---- ---- ---- ---- -------------------- -------------------- -------------------- ----------
bk26thygs9c8n    38031479    1    1    0    1    0                      SORT                 GROUP BY
bk26thygs9c8n    38031479    2    2    1    2    0                      HASH JOIN
bk26thygs9c8n    38031479    3    3    2    3    0 :BF0000              PART JOIN FILTER     CREATE
bk26thygs9c8n    38031479    6    4    3    4    0 PS_ELIM_CF_SEL2002   TABLE ACCESS         STORAGE FULL
bk26thygs9c8n    38031479   11    5    2    3    0                      PARTITION RANGE      OR                   KEY(OR)
bk26thygs9c8n    38031479   12    6    5    4    0                      PARTITION LIST       JOIN-FILTER          :BF0000
bk26thygs9c8n    38031479   13    7    6    5    0 PS_LEDGER            TABLE ACCESS         STORAGE FULL         KEY
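As a sketch of where this can go (my own illustration, not from the original post), the same inline views can be joined to DBA_HIST_ACTIVE_SESS_HISTORY on its SQL_PLAN_LINE_ID column to count how many ASH samples landed on each active line of the plan:

with x as (
  select sql_id, plan_hash_value, xmltype(other_xml) xml_data
  from   dba_hist_sql_plan
  where  sql_id = 'bk26thygs9c8n' and id = 1 and other_xml is not null
), y as (
  select x.sql_id, x.plan_hash_value, xt.op
  from   x
  ,      xmltable('/other_xml/display_map/row' passing x.xml_data
           columns "OP" number path '@op', "SKP" number path '@skp') xt
  where  xt.skp = 0                               -- active lines only
)
select p.id, p.operation, p.options, p.object_name, count(*) ash_samples
from   y
,      dba_hist_sql_plan p
,      dba_hist_active_sess_history h
where  p.sql_id = y.sql_id
and    p.plan_hash_value = y.plan_hash_value
and    p.id = y.op
and    h.sql_id = p.sql_id
and    h.sql_plan_hash_value = p.plan_hash_value
and    h.sql_plan_line_id = p.id                  -- ASH samples on this plan line
group by p.id, p.operation, p.options, p.object_name
order by p.id
/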

I demonstrate a practical use of this technique in my next blog.

Predictable random test data using plsql

One of the problems with randomly generated test data is that it is random, and sometimes what you need
is test data that is the same every time you generate it. This entry will look at how you can achieve this
using the seed functionality of the dbms_random package.
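As a minimal sketch of the idea (my own illustration, not the post's code; the seed value and string format are arbitrary), re-seeding dbms_random before generation makes the "random" output repeatable:

set serveroutput on
declare
  l_val varchar2(10);
begin
  dbms_random.seed(42);                      -- same seed => same sequence on every run
  for i in 1 .. 10 loop
    l_val := dbms_random.string('U', 10);    -- 10 upper-case characters
    dbms_output.put_line(l_val);
  end loop;
end;
/

Run the block twice and it prints the same ten strings; change the seed and you get a different, but equally repeatable, set.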


Hotsos Symposium 2018

Just a quick note to say I’m doing a couple of talks at this year’s Hotsos Symposium, which starts next Monday at 8:30AM. It’s not too late to register.

As an enticement – here is one of my favorite pictures from last year’s event: (Come join the fun!)

[Photo: IMG_0379 — http://kerryosborne.oracle-guy.com/files/2018/03/IMG_0379-e1519959814127...]

255 Columns

It’s the gift that keeps on giving – no matter how many problems you find there are always more waiting to be found. It’s been some time since I last wrote about tables with more than 255 columns, and I said then that there was more to come. In the last article I described how adding a few columns to a table, or updating a trailing column in a way that made the table’s used column count exceed 255, could result in some strange row-splitting behaviour – in this article I’m going to look at a critical side effect of that behaviour.

We’ll start with a simple model and a question – I’ll create a table with a thousand rows of data, then I’ll add a few columns to that table and update the last column in every row and check the undo and redo generated by the update.  Eventually I’m going to need a table with more than 255 columns so the script to create the table is rather long and I’ve posted it at the end of the article in case you want to experiment – in the following text I’ve omitted a few hundred lines of column declarations and values.


rem
rem     Script:         wide_table_5.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1(
        col000,
        col001,
        col002,
        col003,
...
        col199,
/*
        col200,
        col201,
        col202,
        col203,
        col204,
        col205,
        col206,
        col207,
        col208,
        col209,
*/
        col210,
        col220,
...
        col247,
        col248,
        col249
)
as
with generator as (
        select 
                rownum id
        from dual
        connect by
                level <= 1e3 -- > comment to avoid WordPress format issue
)
select
        lpad(000,10,'0'),
        lpad(001,10,'0'),
        lpad(002,10,'0'),
        lpad(003,10,'0'),
...
        lpad(247,10,'0'),
        lpad(248,10,'0'),
        lpad(249,10,'0')
from
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add(
        col250 varchar2(10),
        col251 varchar2(10),
        col252 varchar2(10),
        col253 varchar2(10),
        col254 varchar2(10),
        col255 varchar2(10),
        col256 varchar2(10),
        col257 varchar2(10),
        col258 varchar2(10),
        col259 varchar2(10)
)
;

update t1 set col259 = lpad('259',10,'0');
commit;


The table I’ve created has columns named col000 to col249, populated with text matching each column’s numeric id – a total of 250 columns, except that for the first part of the test I’ve commented out the creation and population of 10 of those columns, giving me a total of 240. Then I’ve added 10 more columns and updated the last of those 10. So, for the first part of the test, I’ve grown my table from 240 used columns to 250 used columns. Here are a few critical stats for that update from the session statistics (v$sesstat joined to v$statname):

Name                                                 Value
----                                                 -----
db block gets                                        1,023
consistent gets                                        555
db block changes                                     2,012
redo entries                                           907
redo size                                          262,692
undo change vector size                             76,052
table scan rows gotten                               1,002
table scan blocks gotten                               501
HSC Heap Segment Block Changes                       1,000
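The exact capture query isn’t shown in the post; a sketch of the sort of join involved (my assumption: run from the updating session, before and after the update, and differenced – v$mystat restricts v$sesstat to the current session) might be:

select sn.name, st.value
from   v$mystat st, v$statname sn
where  st.statistic# = sn.statistic#
and    sn.name in (
         'db block gets', 'consistent gets', 'db block changes',
         'redo entries', 'redo size', 'undo change vector size',
         'table scan rows gotten', 'table scan blocks gotten',
         'HSC Heap Segment Block Changes'
       )
order by sn.name
;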

That’s pretty much the sort of thing we might expect. For a small update to a row it’s reasonable to see 250 – 300 bytes of redo, of which about half is the redo for the undo. We can see that we’ve scanned 1,000 rows and made 2,000 block changes (one to the table block and one to an undo block for each row in the table). The table was 500 blocks of data (the avg_row_len is about 2,640 bytes from 240 columns at 10 bytes + a length byte each, which makes two rows per block with lots of spare space at the default 10 pctfree). You might have been expecting the number of redo entries to be a close match to the number of rows, but it’s a little short because the redo records for the first few updates would have been stored in private redo and turned into a single large redo entry.

So what do the stats look like when we start with 250 columns and grow to 260 columns, breaking through the dreaded 255 barrier?


Name                                                 Value
----                                                 -----
db block gets                                        9,503
consistent gets                                      1,894
db block changes                                     9,384
redo size                                        8,110,584
redo entries                                         5,778
undo change vector size                          3,780,260
table scan rows gotten                               1,002
table scan blocks gotten                               501
HSC Heap Segment Block Changes                       3,000

Such a simple change – with such a large impact.

The average undo is now nearly 3.5KB per row (and the rows were only about 2,860 bytes each anyway), the number of redo entries is up to 6 times the original, we’re averaging 3 “HSC Heap Segment Block Changes” per row instead of 1 and in total we’ve managed to do an extra 7,000 db block changes overall.

To get an idea of what was going on behind the scenes I dumped the redo log file. For the first part of the test most of the redo entries consisted of a pair of redo change vectors with OP codes 5.1 (modify undo block) and 11.5 (update row piece). The 5.1 corresponded to the undo needed to reverse out the effect of its paired 11.5 and both change vectors were pretty small.
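The author’s dump_log procedure isn’t shown (it’s referenced at the end of the footnote script); as a hedged sketch, you can dump the current online redo log to your session trace file along these lines, given suitable privileges:

column member new_value m_logfile
select f.member
from   v$log l, v$logfile f
where  l.group# = f.group#        -- pick one member of the CURRENT log group
and    l.status = 'CURRENT'
and    rownum = 1
;

alter system dump logfile '&m_logfile';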

For the second part of the test I found a frequently repeated sequence of three consecutive redo records of paired redo vectors: (5.1, 11.4), (5.1, 11.2) and (5.1, 11.6). Again each 5.1 corresponds to the undo needed to reverse out the effect of its associated 11.x, and in this case the three “table” (level 11) OP codes are, in order: “lock row piece”, “insert row piece”, “overwrite row piece”. These 3 pairs occurred 1,000 times each, accounting for roughly 3,000 of the redo entries reported.

On top of this there were 2,500 redo records holding redo change vectors with op code 13.22 (“State change on level 1 bitmap block”), and a few hundred with op code 13.21 (“Format page table block”), with just a few others bringing the total up to the 5,800 redo entries reported. (You might note that the last OP code is a clue that we added quite a lot of new blocks to the table as we executed the update – in fact the number of used table blocks grew by about 50%.)

We also had 500 redo change vectors of type 5.2 (“get undo header”). This number was significantly more than in the first part of the test because we had a lot more undo block changes (with bigger undo change vectors) which means we used a lot more undo blocks, and each time we move to a new undo block we update our transaction slot in the undo header with the current undo block address. I mentioned the pairing of 5.1 and 11.6 above – in fact 50% of those records recorded just two changes (5.1, 11.6) the other 50% recorded three changes (5.2, 5.1, 11.6) – in effect every other row update resulted in Oracle demanding a new undo block.

I’ll explain in a moment why we have a very large number of “State change on level 1 bitmap block”; first let’s examine the key steps of how Oracle is processing a single row update – the sequence of 11.4, 11.2, 11.6:

  • 11.4: lock the head row piece – this seems to happen when the row is in (or going to turn into) multiple pieces; presumably because the piece that is going to change might not be the head piece. This is a tiny operation that generates undo and redo in the order of tens of bytes.
  • 11.2: our code extends the row to 260 columns, which means Oracle has to split it into two pieces of 5 columns and 255 columns respectively – so one of those row-pieces has to cause an insert to take place. Inserting a row may require a lot of redo, of course, but the undo for a (table-only) insert is, again, tiny.
  • 11.6: When Oracle has to split a wide row (>255 columns) it counts columns from the end, so the first row piece of our row is 5 columns and the second row piece (the one inserted by the 11.2 operation) is 255 columns. This means we have to overwrite the original row piece with a much shorter row piece. So we’re replacing 2,750 bytes (250 columns) with 55 bytes (5 columns), which means we have to write the contents of the “deleted” 250 columns to the undo tablespace – and that’s where most of the huge volume of undo comes from (the arithmetic is sketched below).
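A back-of-envelope check of that arithmetic (my own sketch, using the example’s 11 bytes per column – 10 bytes of data plus a length byte):

select
        260 - 255        head_piece_cols,       -- Oracle counts 255 columns from the end
        250 * 11         old_head_piece_bytes,  -- ~2,750 bytes overwritten ...
        (260 - 255) * 11 new_head_piece_bytes   -- ... by a ~55 byte row piece
from    dual
;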

There are two extra points to note about the way Oracle handles the insert/overwrite steps. The length of our row exaggerates the effect, of course, but when we insert the trailing 255 columns the block they go to is probably going to change from nearly empty to half full, or half full to full – which means its bitmap entry has to be updated; similarly, when the initial 250 columns are overwritten with just 5 columns a huge amount of free space appears in the block which, again, means that the block’s bitmap entry has to be updated. This gives us a good idea of why we see so many 13.22 (“L1 bitmap state change”) redo change vectors.

The second point is that the numbers still don’t add up. Allowing a couple of hundred bytes of undo per row for the lock row and insert row, then 2,750 plus a few more for the overwrite, I’ve accounted for about 3,000 bytes per row updated – which leaves me short by about 800 bytes per row.  If I dump undo blocks I can see that the undo change vector for the overwrite is actually 3,628 bytes long rather than being “slightly over” the 2,750 for the before image of the overwritten row. Strangely I can see a couple of hundred bytes of what looks like damage in the row image, and there’s also a couple of hundred bytes of what looks like garbage (but probably isn’t) after the row image, but I’ve got no idea why there’s so much excess data in the record.

One day I (or someone else) may get around to looking very closely at why that particular undo record is a lot bigger than an observer might expect – but at least we can see that the space is there, and even if some of that volume could be made to disappear the problem of the massive undo relating to the overwrite and Oracle’s odd strategy of counting columns from the end is still going to be present, and there are probably some occasions when you need to know about this.

tl;dr

Once the number of columns in a table goes over 255 then a simple update to a “trailing” null column (i.e. one past the current end of the row) is likely to generate a much larger volume of undo and redo than you might expect. In particular the size of the undo record is likely to be equivalent to the volume of the last 255 columns of the table – and then a large fraction more.

The reason why this particular anomaly came to my attention is because a client had a problem upgrading an application that required them to add a few columns to a table and then update a couple of them. The size of the undo tablespace was 300 GB, the update (the only thing running at the time) should have added about 30 bytes to the length of each row, and the update should have affected 250 million rows. The process crashed after a few hours with “ORA-30036: unable to extend segment by %s in undo tablespace ‘%s'”. Even allowing for a “reasonable” overhead it seemed rather surprising that Oracle needed more than 1,200 bytes of undo space per row – but then I found the table was defined with 350 columns.

Solutions for this user: it’s an upgrade that’s allowed to take quite a long time, and the nature of the update is such that it would be possible to update in batches, committing after each batch (a sketch follows). It would also be nice to review how the table is used to see if it could be rebuilt with a different column order to move all the unused columns to the end of the row – with a little luck the resulting table might have almost all its rows fitting into a single row piece, even after the upgrade.
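A minimal sketch of that batch-and-commit approach (my own illustration; the table name, driving predicate and batch size are assumptions):

declare
        l_done  boolean := false;
begin
        loop
                update t1 set col259 = lpad('259',10,'0')
                where   col259 is null
                and     rownum <= 1e4;          -- batch size: an assumption
                l_done := (sql%rowcount = 0);   -- read before the commit resets it
                commit;                         -- caps the undo held by any one transaction
                exit when l_done;
        end loop;
end;
/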

Footnote

If you want to experiment, here’s the whole script to create the table, insert some rows, then add a few more columns and update one of them. It’s very long, and not in the least bit exciting, but it may save you a little typing time if you want to use it.


create table t1(
        col000,
        col001,
        col002,
        col003,
        col004,
        col005,
        col006,
        col007,
        col008,
        col009,
        col010,
        col011,
        col012,
        col013,
        col014,
        col015,
        col016,
        col017,
        col018,
        col019,
        col020,
        col021,
        col022,
        col023,
        col024,
        col025,
        col026,
        col027,
        col028,
        col029,
        col030,
        col031,
        col032,
        col033,
        col034,
        col035,
        col036,
        col037,
        col038,
        col039,
        col040,
        col041,
        col042,
        col043,
        col044,
        col045,
        col046,
        col047,
        col048,
        col049,
        col050,
        col051,
        col052,
        col053,
        col054,
        col055,
        col056,
        col057,
        col058,
        col059,
        col060,
        col061,
        col062,
        col063,
        col064,
        col065,
        col066,
        col067,
        col068,
        col069,
        col070,
        col071,
        col072,
        col073,
        col074,
        col075,
        col076,
        col077,
        col078,
        col079,
        col080,
        col081,
        col082,
        col083,
        col084,
        col085,
        col086,
        col087,
        col088,
        col089,
        col090,
        col091,
        col092,
        col093,
        col094,
        col095,
        col096,
        col097,
        col098,
        col099,
        col100,
        col101,
        col102,
        col103,
        col104,
        col105,
        col106,
        col107,
        col108,
        col109,
        col110,
        col111,
        col112,
        col113,
        col114,
        col115,
        col116,
        col117,
        col118,
        col119,
        col120,
        col121,
        col122,
        col123,
        col124,
        col125,
        col126,
        col127,
        col128,
        col129,
        col130,
        col131,
        col132,
        col133,
        col134,
        col135,
        col136,
        col137,
        col138,
        col139,
        col140,
        col141,
        col142,
        col143,
        col144,
        col145,
        col146,
        col147,
        col148,
        col149,
        col150,
        col151,
        col152,
        col153,
        col154,
        col155,
        col156,
        col157,
        col158,
        col159,
        col160,
        col161,
        col162,
        col163,
        col164,
        col165,
        col166,
        col167,
        col168,
        col169,
        col170,
        col171,
        col172,
        col173,
        col174,
        col175,
        col176,
        col177,
        col178,
        col179,
        col180,
        col181,
        col182,
        col183,
        col184,
        col185,
        col186,
        col187,
        col188,
        col189,
        col190,
        col191,
        col192,
        col193,
        col194,
        col195,
        col196,
        col197,
        col198,
        col199,
        col200,
        col201,
        col202,
        col203,
        col204,
        col205,
        col206,
        col207,
        col208,
        col209,
        col210,
        col211,
        col212,
        col213,
        col214,
        col215,
        col216,
        col217,
        col218,
        col219,
        col220,
        col221,
        col222,
        col223,
        col224,
        col225,
        col226,
        col227,
        col228,
        col229,
        col230,
        col231,
        col232,
        col233,
        col234,
        col235,
        col236,
        col237,
        col238,
        col239,
        col240,
        col241,
        col242,
        col243,
        col244,
        col245,
        col246,
        col247,
        col248,
        col249
)
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e3 -- > comment to avoid WordPress format issue
)
select
        lpad(000,10,'0'),
        lpad(001,10,'0'),
        lpad(002,10,'0'),
        lpad(003,10,'0'),
        lpad(004,10,'0'),
        lpad(005,10,'0'),
        lpad(006,10,'0'),
        lpad(007,10,'0'),
        lpad(008,10,'0'),
        lpad(009,10,'0'),
        lpad(010,10,'0'),
        lpad(011,10,'0'),
        lpad(012,10,'0'),
        lpad(013,10,'0'),
        lpad(014,10,'0'),
        lpad(015,10,'0'),
        lpad(016,10,'0'),
        lpad(017,10,'0'),
        lpad(018,10,'0'),
        lpad(019,10,'0'),
        lpad(020,10,'0'),
        lpad(021,10,'0'),
        lpad(022,10,'0'),
        lpad(023,10,'0'),
        lpad(024,10,'0'),
        lpad(025,10,'0'),
        lpad(026,10,'0'),
        lpad(027,10,'0'),
        lpad(028,10,'0'),
        lpad(029,10,'0'),
        lpad(030,10,'0'),
        lpad(031,10,'0'),
        lpad(032,10,'0'),
        lpad(033,10,'0'),
        lpad(034,10,'0'),
        lpad(035,10,'0'),
        lpad(036,10,'0'),
        lpad(037,10,'0'),
        lpad(038,10,'0'),
        lpad(039,10,'0'),
        lpad(040,10,'0'),
        lpad(041,10,'0'),
        lpad(042,10,'0'),
        lpad(043,10,'0'),
        lpad(044,10,'0'),
        lpad(045,10,'0'),
        lpad(046,10,'0'),
        lpad(047,10,'0'),
        lpad(048,10,'0'),
        lpad(049,10,'0'),
        lpad(050,10,'0'),
        lpad(051,10,'0'),
        lpad(052,10,'0'),
        lpad(053,10,'0'),
        lpad(054,10,'0'),
        lpad(055,10,'0'),
        lpad(056,10,'0'),
        lpad(057,10,'0'),
        lpad(058,10,'0'),
        lpad(059,10,'0'),
        lpad(060,10,'0'),
        lpad(061,10,'0'),
        lpad(062,10,'0'),
        lpad(063,10,'0'),
        lpad(064,10,'0'),
        lpad(065,10,'0'),
        lpad(066,10,'0'),
        lpad(067,10,'0'),
        lpad(068,10,'0'),
        lpad(069,10,'0'),
        lpad(070,10,'0'),
        lpad(071,10,'0'),
        lpad(072,10,'0'),
        lpad(073,10,'0'),
        lpad(074,10,'0'),
        lpad(075,10,'0'),
        lpad(076,10,'0'),
        lpad(077,10,'0'),
        lpad(078,10,'0'),
        lpad(079,10,'0'),
        lpad(080,10,'0'),
        lpad(081,10,'0'),
        lpad(082,10,'0'),
        lpad(083,10,'0'),
        lpad(084,10,'0'),
        lpad(085,10,'0'),
        lpad(086,10,'0'),
        lpad(087,10,'0'),
        lpad(088,10,'0'),
        lpad(089,10,'0'),
        lpad(090,10,'0'),
        lpad(091,10,'0'),
        lpad(092,10,'0'),
        lpad(093,10,'0'),
        lpad(094,10,'0'),
        lpad(095,10,'0'),
        lpad(096,10,'0'),
        lpad(097,10,'0'),
        lpad(098,10,'0'),
        lpad(099,10,'0'),
        lpad(100,10,'0'),
        lpad(101,10,'0'),
        lpad(102,10,'0'),
        lpad(103,10,'0'),
        lpad(104,10,'0'),
        lpad(105,10,'0'),
        lpad(106,10,'0'),
        lpad(107,10,'0'),
        lpad(108,10,'0'),
        lpad(109,10,'0'),
        lpad(110,10,'0'),
        lpad(111,10,'0'),
        lpad(112,10,'0'),
        lpad(113,10,'0'),
        lpad(114,10,'0'),
        lpad(115,10,'0'),
        lpad(116,10,'0'),
        lpad(117,10,'0'),
        lpad(118,10,'0'),
        lpad(119,10,'0'),
        lpad(120,10,'0'),
        lpad(121,10,'0'),
        lpad(122,10,'0'),
        lpad(123,10,'0'),
        lpad(124,10,'0'),
        lpad(125,10,'0'),
        lpad(126,10,'0'),
        lpad(127,10,'0'),
        lpad(128,10,'0'),
        lpad(129,10,'0'),
        lpad(130,10,'0'),
        lpad(131,10,'0'),
        lpad(132,10,'0'),
        lpad(133,10,'0'),
        lpad(134,10,'0'),
        lpad(135,10,'0'),
        lpad(136,10,'0'),
        lpad(137,10,'0'),
        lpad(138,10,'0'),
        lpad(139,10,'0'),
        lpad(140,10,'0'),
        lpad(141,10,'0'),
        lpad(142,10,'0'),
        lpad(143,10,'0'),
        lpad(144,10,'0'),
        lpad(145,10,'0'),
        lpad(146,10,'0'),
        lpad(147,10,'0'),
        lpad(148,10,'0'),
        lpad(149,10,'0'),
        lpad(150,10,'0'),
        lpad(151,10,'0'),
        lpad(152,10,'0'),
        lpad(153,10,'0'),
        lpad(154,10,'0'),
        lpad(155,10,'0'),
        lpad(156,10,'0'),
        lpad(157,10,'0'),
        lpad(158,10,'0'),
        lpad(159,10,'0'),
        lpad(160,10,'0'),
        lpad(161,10,'0'),
        lpad(162,10,'0'),
        lpad(163,10,'0'),
        lpad(164,10,'0'),
        lpad(165,10,'0'),
        lpad(166,10,'0'),
        lpad(167,10,'0'),
        lpad(168,10,'0'),
        lpad(169,10,'0'),
        lpad(170,10,'0'),
        lpad(171,10,'0'),
        lpad(172,10,'0'),
        lpad(173,10,'0'),
        lpad(174,10,'0'),
        lpad(175,10,'0'),
        lpad(176,10,'0'),
        lpad(177,10,'0'),
        lpad(178,10,'0'),
        lpad(179,10,'0'),
        lpad(180,10,'0'),
        lpad(181,10,'0'),
        lpad(182,10,'0'),
        lpad(183,10,'0'),
        lpad(184,10,'0'),
        lpad(185,10,'0'),
        lpad(186,10,'0'),
        lpad(187,10,'0'),
        lpad(188,10,'0'),
        lpad(189,10,'0'),
        lpad(190,10,'0'),
        lpad(191,10,'0'),
        lpad(192,10,'0'),
        lpad(193,10,'0'),
        lpad(194,10,'0'),
        lpad(195,10,'0'),
        lpad(196,10,'0'),
        lpad(197,10,'0'),
        lpad(198,10,'0'),
        lpad(199,10,'0'),
        lpad(200,10,'0'),
        lpad(201,10,'0'),
        lpad(202,10,'0'),
        lpad(203,10,'0'),
        lpad(204,10,'0'),
        lpad(205,10,'0'),
        lpad(206,10,'0'),
        lpad(207,10,'0'),
        lpad(208,10,'0'),
        lpad(209,10,'0'),
        lpad(210,10,'0'),
        lpad(211,10,'0'),
        lpad(212,10,'0'),
        lpad(213,10,'0'),
        lpad(214,10,'0'),
        lpad(215,10,'0'),
        lpad(216,10,'0'),
        lpad(217,10,'0'),
        lpad(218,10,'0'),
        lpad(219,10,'0'),
        lpad(220,10,'0'),
        lpad(221,10,'0'),
        lpad(222,10,'0'),
        lpad(223,10,'0'),
        lpad(224,10,'0'),
        lpad(225,10,'0'),
        lpad(226,10,'0'),
        lpad(227,10,'0'),
        lpad(228,10,'0'),
        lpad(229,10,'0'),
        lpad(230,10,'0'),
        lpad(231,10,'0'),
        lpad(232,10,'0'),
        lpad(233,10,'0'),
        lpad(234,10,'0'),
        lpad(235,10,'0'),
        lpad(236,10,'0'),
        lpad(237,10,'0'),
        lpad(238,10,'0'),
        lpad(239,10,'0'),
        lpad(240,10,'0'),
        lpad(241,10,'0'),
        lpad(242,10,'0'),
        lpad(243,10,'0'),
        lpad(244,10,'0'),
        lpad(245,10,'0'),
        lpad(246,10,'0'),
        lpad(247,10,'0'),
        lpad(248,10,'0'),
        lpad(249,10,'0')
from
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select
        avg_row_len, num_rows,  blocks,
        num_rows / trunc(8000/avg_row_len) estimated_blocks
from
        user_tables
where
        table_name = 'T1'
;

prompt  =================
prompt  Add a few columns
prompt  =================

alter table t1 add(
        col250 varchar2(10),
        col251 varchar2(10),
        col252 varchar2(10),
        col253 varchar2(10),
        col254 varchar2(10),
        col255 varchar2(10),
        col256 varchar2(10),
        col257 varchar2(10),
        col258 varchar2(10),
        col259 varchar2(10)
)
;

-- alter system switch logfile;

update t1 set col259 = lpad('259',10,'0');
commit;

-- execute dump_log

P.S. If you do investigate and solve the question of the excess space in the undo record, and the odd content in the row “before image”, then do let me know. (Hint: part of the excess may be a “null columns” map – but that still leaves plenty to account for.)