Here’s a quick demo to make a point about consistent reads (prompted by a question on the Oracle-L mailing list):
SQL> drop table t1; Table dropped. SQL> create table t1 (n1 number); Table created. SQL> insert into t1 values(0); 1 row created. SQL> begin 2 for i in 1..1000 loop 3 update t1 set n1 = i; 4 end loop; 5 end; 6 / PL/SQL procedure successfully completed.
Note that I haven’t issued a commit in this session, and all I’ve got is a single row in the table (and because it’s my usual demo setup of locally managed tablespaces with uniform extents of 1MB using freelist management I know that that one row is in the first available block of the table).
How much work is a second session going to do to scan that table ?
SQL> alter system flush buffer_cache; SQL> execute snap_my_stats.start_snap SQL> select * from t1; SQL> set serveroutput on size 1000000 format wrapped SQL> execute snap_my_stats.end_snap --------------------------------- Session stats - 18-Apr 11:33:01 Interval:- 2 seconds --------------------------------- Name Value ---- ----- session logical reads 967 consistent gets 967 consistent gets from cache 967 consistent gets - examination 964 consistent changes 1,001 CR blocks created 1 data blocks consistent reads - undo records applied 1,001 IMU CR rollbacks 41
The snap_my_stats package is similar in concept to Tom Kyte’s “runstats” or Tanel Poder’s “snapper” program to capture changes in values in the dynamic performance views over short time periods. In this case I’ve deleted all but a few of the larger changes, and a couple of small changes.
The figure that stands out (probably) is the “session logical reads” – we’ve done 967 logical I/Os to scan a tables of just one block. The reason for this is that we’ve created a read-consistent copy of that one block (“CR blocks created” = 1), and it has taken a lot of work to create that copy. We’ve had to apply 1,001 undo records (“data blocks consistent reads – undo records applied” = 1001).
Most of those undo records come from individual accesses (which are of the cheaper “consistent gets – examination” type that only need a single get on the “cache buffers chains” latch) to undo blocks, following the “UBA (undo block address)” pointer in the relevant ITL entry of the table block, but since this is a 10g database the last few undo records come out of the “In-memory Undo” of the other session. Basically the cloning operation is something like this:
It is an interesting point that as the first session created undo records it would pin and fill undo blocks – so would only do a few current gets (one for each block) on the undo blocks it was using. As another process reverses out the changes in a CR clone it has to get and release each undo block every time it wants a single undo record … applying undo records introduces far more latch and buffer activity that the original generation of the undo.
It’s worth knowing that there are three statistics relating to applying undo records:
transaction tables consistent reads - undo records applied Estimating "old" commit SCNs during delayed block cleanout data blocks consistent reads - undo records applied Creating CR clones rollback changes - undo records applied The result of a real "rollback;"
The second step in the list of actions is: “Notice uncommitted transaction”. It’s probably worth pointing out that another part of the ITL entry holds the transaction id (“xid”) which implicitly identifies the undo segment and transaction table slot in that segment that has been used to hold the transaction state. The current contents of that slot allow Oracle to determine whether or not (and when, if necessary) the transaction was committed.
I have some broken links in my old blog entries right now, so if you’re looking for something, then download the whole zip file from here:
I have uploaded a .zip file (for Windows) and a .tar.gz file (for Unix/Mac). The scripts are all the same with differences in the CR/LF bytes in the files and the init.sql and i.sql which have some OS specific commands in them.
I also uploaded the latest PerfSheet there where I fixed an annoying bug which complained about some missing reference files when opening the file.
I plan to fix the broken links some time between now and my retirement.
Here’s an example of a slightly less common data deadlock graph (dumped from 11gR2, in this case):
[Transaction Deadlock] The following deadlock is not an ORACLE error. It is a deadlock due to user error in the design of an application or from issuing incorrect ad-hoc SQL. The following information may aid in determining the deadlock: Deadlock graph: ---------Blocker(s)-------- ---------Waiter(s)--------- Resource Name process session holds waits process session holds waits TX-00010006-00002ade 16 34 X 12 50 S TX-00050029-000037ab 12 50 X 16 34 S session 34: DID 0001-0010-00000021 session 50: DID 0001-000C-00000024 session 50: DID 0001-000C-00000024 session 34: DID 0001-0010-00000021 Rows waited on: Session 50: no row Session 34: no row Information on the OTHER waiting sessions: Session 50: pid=12 serial=71 audsid=1560855 user: 52/TEST_USER O/S info: user: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX program: sqlplus.exe application name: SQL*Plus, hash value=3669949024 Current SQL Statement: update t1 set n3 = 99 where id = 100 End of information on OTHER waiting sessions. Current SQL statement for this session: update t1 set n3 = 99 where id = 200
The anomaly is that the waiters are both waiting on S (share) mode locks for a TX enqueue.
It’s fairly well known that Share (and Share sub exclusive, SSX) lock waits for TM locks are almost a guarantee of a missing “foreign key index”; and it’s also fairly well known that Share lock waits for TX locks can be due to bitmap collisions, issues with ITL (interested transaction list) waits, various RI (referential integrity) collisions including simultaneous inserts of the same primary key.
A cause for TX/4 waits that is less well known or overlooked (because a featrure is less-well used) is simple data collisions on IOTs (index organized tables). In the example above t1 is an IOT with a primary key of id. Session 34 and 50 have both tried to update the rows with ids 100 and 200 – in the opposite order. If this were a normal heap table the deadlock graph would be showing waits for eXclusive TX locks, because it’s an IOT (and therefore similar in some respects to a primary key wait) we see waits for Share TX locks.
First, a reminder – my Advanced Oracle Troubleshooting v2.0 online seminar starts next week already. Last chance to sign up, I can accept registrations until Sunday :-)
I won’t do another AOT seminar before Oct (or Nov) this year. More details and sign-up here:
I have rescheduled my Advanced SQL Tuning and Partitioning & Parallel Execution for Performance seminars too. I will do them in September/October. Unfortunately I’m too busy right now to do them before the summer.
And that’s all the travel I will do this year…
I’ll soon announce the 2nd EsSN virtual conference too ;-)
Free online stuff:
Perhaps in a month or so I will do another hacking session (I’ll plan 2 hours this time, 1 hour isn’t nearly enough for going deep). The topic will probably be about low-level details of SQL plan execution internals… stay tuned!
A recent question on the OTN database forum:
Can any one please point to me a document or a way to calculate the average number of rows per block in oralce 10.2.0.3
One answer would be to collect stats and then approximate as block / avg_row_len – although you have to worry about things like row overheads, the row directory, and block overheads before you can be sure you’ve got it right. On top of this, the average might not be too helpful anyway. So here’s another (not necessarily fast) option that gives you more information about the blocks that have any rows in them (I picked the source$ table from a 10g system because source$ is often good for some extreme behaviour).
break on report compute sum of tot_blocks on report compute sum of tot_rows on report column avg_rows format 999.99 select twentieth, min(rows_per_block) min_rows, max(rows_per_block) max_rows, sum(block_ct) tot_blocks, sum(row_total) tot_rows, round(sum(row_total)/sum(block_ct),2) avg_rows from ( select ntile(20) over (order by rows_per_block) twentieth, rows_per_block, count(*) block_ct, rows_per_block * count(*) row_total from ( select fno, bno, count(*) rows_per_block from ( select dbms_rowid.rowid_relative_fno(rowid) as fno, dbms_rOwId.rowid_block_number(rowid) as bno from source$ ) group by fno, bno ) group by rows_per_block order by rows_per_block ) group by twentieth order by twentieth ;
I’ve used the ntile() function to split the results into 20 lines, obviously you might want to change this according to the expected variation in rowcounts for your target table. In my case the results looked like this:
TWENTIETH MIN_ROWS MAX_ROWS TOT_BLOCKS TOT_ROWS AVG_ROWS ---------- ---------- ---------- ---------- ---------- -------- 1 1 11 2706 3470 1.28 2 12 22 31 492 15.87 3 23 34 30 868 28.93 4 35 45 20 813 40.65 5 46 57 13 664 51.08 6 59 70 18 1144 63.56 7 71 81 23 1751 76.13 8 82 91 47 4095 87.13 9 92 101 79 7737 97.94 10 102 111 140 14976 106.97 11 112 121 281 32799 116.72 12 122 131 326 41184 126.33 13 132 141 384 52370 136.38 14 142 151 325 47479 146.09 15 152 161 225 35125 156.11 16 162 171 110 18260 166.00 17 172 181 58 10207 175.98 18 182 191 18 3352 186.22 19 193 205 22 4377 198.95 20 206 222 16 3375 210.94 ---------- ---------- sum 4872 284538
Of course, the moment you see a result like this it prompts you to ask more questions.
Is the “bell curve” effect that you can see centred around the 13th ntile indicative of a normal distribution of row lengths – if so why is the first ntile such an extreme outlier – is that indicative of a number of peculiarly long rows, did time of arrival have a special effect, is it the result of a particular pattern of delete activity … and so on.
Averages are generally very bad indicators if you’re worried about the behaviour of an Oracle system.
Steve Bamber has written up a case study of library cache latch contention troubleshooting of an Apex application with LatchProf. I’m happy that others also see the value and have had success with my new LatchProf based latch contention troubleshooting approach which takes into account both sides of the contention story (latch waiters and latch holders/blockers) as opposed to the guesswork used previously (hey if it’s shared pool latch contention – is must be about bad SQL not using bind variables …. NOT always…)
Anyway, I’m happy. If you have success stories with LatchProf, please let me know!
As a second topic of interest, Laimutis Nedzinskas has written some good notes about the effect and overhead of Flashback Database option when you are using and modifying (nocache) LOBs. We’ve exchanged some mails on this topic and yeah, my clients have sure seen some problems with this combination as well. You basically want to keep your LOBs cached when using FB database…
Ok, it’s official – the first and only Oracle Troubleshooting TV show is live now!
The first show is almost 2 hours about the ORA-4031 errors and shared pool hacking. It’s a recording of the US/EMEA timezone online hacking session I did some days ago.
There are a couple of things to note:
View the embedded video below or go to my official Oracle Troubleshooting TV show channel:
Yesterday I introduced a little framework I use to avoid the traps inherent in writing PL/SQL loops when modelling a session that does lots of simple calls to the database. I decided to publish the framework because I had recently come across an example where a series of SQL statements gives a very different result from a single PL/SQL block.
The model starts with a simple data set – which in this case is created in a tablespace using ASSM (automatic segment space management), an 8KB block size and 1MB uniform extents (in a locally management tablespace).
create table t1 tablespace test_8k_assm as select trunc((rownum-1)/100) n1, lpad('x',40) v1, rpad('x',100) padding from dual connect by rownum <= 20000 ; create index t1_i1 on t1(n1, v1) tablespace test_8k_assm ; validate index t1_i1; execute print_table('select * from index_stats');
You can see that the n1 column is defined to have 200 rows for each of 100 different values, and that each set of two hundreds rows is stored (at least initially) in a very small cluster of blocks.
With the data set in place I am now going to pick a set of two hundred rows at random, delete it, re-insert it, and commit; and I’m going to repeat that process 1,000 times.
declare rand number(3); begin for i in 1..1000 loop rand := trunc(dbms_random.value(0,200)); delete from t1 where n1 = rand ; insert into t1 select rand, lpad('x',40), rpad('x',100) from dual connect by rownum <= 100 ; commit; end loop; end; / validate index t1_i1; execute print_table('select * from index_stats');
You might think that this piece of code is a little strange – but it is a model of some processing that I’ve recently seen on a client site, and it has crossed my mind that it might appear in a number of systems hidden underneath the covers of dbms_job. So what does it do to the index ?
Given the delay that usually appears between the time an index entry is marked as deleted and the time that the space can be reused, and given the way I’ve engineered my date so that the space needed for the 200 rows for each key value is little more than a block (an important feature of this case), I wouldn’t be too surprised if the index had stabilised at nearly twice its original size. But that’s not what happened to my example running under ASSM. Here are the “before” and “after” results from my test:
Before After LF_ROWS 20,000 70,327 LF_BLKS 156 811 LF_ROWS_LEN 1,109,800 3,877,785 BR_ROWS 155 810 BR_BLKS 3 10 BR_ROWS_LEN 8,903 45,732 DEL_LF_ROWS 0 50,327 DEL_LF_ROWS_LEN 0 2,767,985 DISTINCT_KEYS 200 190 MOST_REPEATED_KEY 100 1,685 BTREE_SPACE 1,272,096 6,568,320 USED_SPACE 1,118,703 3,923,517 PCT_USED 88 60 ROWS_PER_KEY 100 370 BLKS_GETS_PER_ACCESS 54 189
It’s a small disaster – our index has grown in size by a factor of about five, and we have more deleted rows than “real” rows. (Note, by the way, that the awfulness of the index is NOT really indicated by the PCT_USED figure – one which is often suggested as an indicator of the state of an index).
Unfortunately this is the type of problem that doesn’t surprise me when using ASSM; it’s supposed to help with highly concurrent OLTP activity (typified by a large number of very small transactions) but runs into problems updating free space bitmaps whenever you get into “batch-like” activity.
However, there is a special consideration in play here – I’ve run the entire operation as a single pl/sql loop. Would the same problem appear if I ran each delete/insert cycle as a completely independent SQL script using the “start_1000.sql” script from my previous note ?
To test the effect of running 1,000 separate tasks, rather than executing a single pl/sql loop, I wrote the following code into the start_1.sql script that I described in the article before running start_1000.sql:
declare rand number(3); begin rand := trunc(dbms_random.value(0,200)); delete from t1 where n1 = rand ; insert into t1 select rand, lpad('x',40), rpad('x',100) from dual connect by rownum <= 100 ; commit; end; /
The impact was dramatically different. (Still very wasteful, but quite a lot closer to the scale of the results that you might expect from freelist management).
Before After --------- --------- LF_ROWS 20,000 39,571 LF_BLKS 156 479 LF_ROWS_LEN 1,109,800 2,196,047 BR_ROWS 155 478 BR_BLKS 3 6 BR_ROWS_LEN 8,903 26,654 DEL_LF_ROWS 0 19,571 DEL_LF_ROWS_LEN 0 1,086,247 DISTINCT_KEYS 200 199 MOST_REPEATED_KEY 100 422 BTREE_SPACE 1,272,096 3,880,192 USED_SPACE 1,118,703 2,222,701 PCT_USED 88 58 ROWS_PER_KEY 100 199 BLKS_GETS_PER_ACCESS 54 102
I haven’t yet investigated why the pl/sql loop should have produced such a damaging effect – although I suspect that it might be a side effect of the pinning of bitmap blocks (amongst others, of course) that takes place within a single database call. It’s possible that the repeated database calls from SQL*Plus keep “rediscovering” bitmap blocks that show free space while the pinning effects stop the pl/sql from “going back” to bitmap blocks that have recently acquired free space.
Interestingly the impact of using ASSM was dramatically reduced if one object used freelists and the other used ASSM – and with my specific example the combination of a freelist table with an ASSM index even did better than the expected 50% usage from the “traditional” option of using freelists for both the table and index.
Note – the purpose of this note is NOT to suggest that you should avoid using ASSM in general; but if you can identify code in your system that is doing something similar to the model then it’s worth checking the related indexes (see my index efficiency note) to see if any of them are displaying the same problem as this test case. If they are you may want to do one of two things: think about a schedule for coalescing or even rebuilding problem indexes on a regular basis, or see if you can move the table, index, or both, into a tablespace using freelist management.
People talk about the Oracle SQL Developer 3 being out, which is cool, but I have something even cooler for you today ;-)
I finally figured out how to convert my screen-recordings to uploadable videos, so that the text wouldn’t get unreadable and blurry.
So, here’s the first video, about a tool called MOATS, which we have built together with fellow OakTable Network member and a PL/SQL wizard Adrian Billington (of oracle-developer.net).
Here’s the video, it’s under 3 minutes long. Play the video in full screen for best results (and if it’s too slow loading, change it to lower resolution from HD mode):
Check it out and if you like MOATS, you can download it from Adrian’s website site (current version 1.05) and make sure you read the README.txt file in the zip!
Also thanks to Randolf Geist for finding and fixing some bugs in our alpha code… Note that MOATS is still kind of beta right now…
P.S. I will post my ORA-4031 and shared pool hacking video real soon now, too! :-)
P.P.S. Have you already figured out how it works?! ;-)
Update: Now you can suggest new features and improvement requests here:
Here’s a note I’ve been meaning to research and write up for more than 18 months – every since Dion Cho pinged a note I’d written about the effects of partitioning because of a comment it made about the “2% small table threshold”.
It has long been an item of common knowledge that Oracle has a “small table threshold” that allows for special treatment of data segments that are smaller than two percent of the size of the buffer cache, viz:
If a table is shorter than the threshold then it is read to the midpoint of the cache (just like any other block read) but – whether by accident or design – the touch count (x$bh.tch) is not set and the table will fall off the LRU end of the buffer cache fairly promptly as other objects are read into the buffer. Such a tablescan would be recorded under the statistic “table scans (short tables)”.
Then, in July 2009, Dion Cho decided to check this description before repeating it, and set about testing it on Oracle 10gR2 – producing some surprising results and adding another item to my to-do list. Since then I have wanted to check his conclusions, check whether the original description had ever been true and when (or if) it had changed.
As a simple starting point, of course, it was easy to check the description of the relevant (hidden) parameter to see when it changed:
220.127.116.11 _small_table_threshold threshold level of table size for forget-bit enabled during scan 18.104.22.168 _small_table_threshold threshold level of table size for direct reads 22.214.171.124 _small_table_threshold lower threshold level of table size for direct reads
This suggests that the behaviour might have changed some time in 9i (126.96.36.199 happened to be the earliest 9i listing of x$ksppi I had on file) – so I clearly had at least three major versions to check.
The behaviour of the cache isn’t an easy thing to test, though, because there are a number of special cases to consider – in particular the results could be affected by the positioning of the “mid-point” marker (x$kcbwds.cold_hd) that separates the “cold” buffers from the “hot” buffers. By default the hot portion of the default buffer is 50% of the total cache (set by hidden parameter _db_percent_hot_default) but on instance startup or after a “flush buffer cache” there are no used buffers so the behaviour can show some anomalies.
So here’s the basic strategy:
Here’s some sample code:
create table t_15400 pctfree 99 pctused 1 as with generator as ( select --+ materialize rownum id from dual connect by rownum <= 10000 ) select rownum id, lpad(rownum,10,'0') small_vc, rpad('x',100) padding from generator v1, generator v2 where rownum <= 15400 ; create index t_15400_id on t_15400(id); begin dbms_stats.gather_table_stats( ownname => user, tabname =>'T_15400', estimate_percent => 100, method_opt => 'for all columns size 1' ); end; / select object_name, object_id, data_object_id from user_objects where object_name in ( 'T_300', 'T_770', 'T_1540', 'T_3750', 'T_7700', 'T_15400', 'T_15400_ID' ) order by object_id ; select /*+ index(t) */ max(small_vc) from t_15400 t where id > 0 ;
The extract shows the creation of just the last and largest table I created and collected statistics for – and it was the only one with an index. I chose the number of blocks (I’ve rigged one row per block) because I had set up a db_cache_size of 128MB on my 10.2.0.3 Oracle instance and this had given me 15,460 buffers.
As you can see from the query against user_objects my test case included tables with 7,700 rows (50%), 3,750 rows (25%), 1,540 rows (10%), 770 rows (5%) and 300 rows (2%). (The number in brackets are the approximate sizes of the tables – all slightly undersized – relative to the number of buffers in the default cache).
Here’s the query that I then ran against x$bh (connected as sys from another session) to see what was in the cache (the range of values needs to be adjusted to cover the range of object_id reported from user_objects):
select obj, tch, count(*) from x$bh where obj between 77710 and 77720 group by obj, tch order by count(*) ;
After executing the first index range scan of t_15400 to fill the cache three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75855 0 1 75854 0 1 75853 0 1 75851 0 1 75850 0 1 75849 0 1 75852 0 1 75855 2 9 -- Index blocks, touch count incremented 75855 1 18 -- Index blocks, touch count incremented 75854 1 11521 -- Table blocks, touch count incremented
Then after three tablescans, at 4 second intervals, of the 7,700 block table:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 -- segment header of 7700 table, touch count incremented each time 75855 0 1 75854 0 1 75852 0 1 75849 0 1 75850 0 1 75851 0 1 75855 2 9 75855 1 10 75853 0 3991 -- lots of blocks from 7700 table, no touch count increment 75854 1 7538
Then repeating the tablescan of the 3,750 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75851 0 1 75852 3 1 -- segment header block, touch count incremented each time 75849 0 1 75850 0 1 75855 2 9 75855 1 10 75853 0 240 75852 0 3750 -- table completely cached - touch count not incremented 75854 1 7538
Then repeating the tablescan of the 1,540 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75851 3 1 -- segment header block, touch count incremented each time 75849 0 1 75850 0 1 75852 3 1 75855 2 9 75855 1 10 75853 0 149 75851 2 1540 -- Table fully cached, touch count incremented but only to 2 75852 0 2430 75854 1 7538
Then executing the tablescan of the 770 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75850 3 1 -- segment header block, touch count incremented each time 75849 0 1 75851 3 1 75852 3 1 75854 0 1 75855 2 9 75855 1 10 75851 0 69 75853 0 149 75850 2 770 -- Table fully cached, touch count incremented but only to 2 75851 2 1471 75852 0 1642 75854 1 7538
Finally executing the tablescan of the 300 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75850 3 1 75852 3 1 75851 3 1 75855 2 9 75855 1 10 75851 0 69 75850 0 131 75853 0 149 75849 3 301 -- Table, and segment header, cached and touch count incremented 3 times 75850 2 639 75852 0 1342 75851 2 1471 75854 1 7538
This set of results on its own isn’t conclusive, of course, but the indications for 10.2.0.3 are:
I can’t state with any certainty where the used and recycled buffers might be, but since blocks from the 3750 tablescan removed the blocks from the 7700 tablescan, it’s possible that “large” tablescans do somehow go “to the bottom quarter” of the LRU.
There also some benefit in checking the statistics “table scans (short)” and “table scans (long)” as the tests run. For the 2% (300 block) table I recorded 3 short tablescans; for the tables in the 2% to 10% range (770 and 1540) I recorded one long and two short (which is consistent with the touch count increment of 2 – the first scan was expected to be long, but the 2nd and 3rd were deemed to be short based on some internal algorithm about the tables being fully cached); finally for the tables above 10% we always got 3 long tablescans.
But as it says in the original note on small partitions – there are plenty of questions still to answer:
I’ve quoted the 2% as the fraction of the db_cache_size – but we have automatic SGA management in 10g, automatic memory management in 11g, and up to eight different cache sizing parameters in every version from 9i onwards. What figure is used as the basis for the 2%, and is that 2% of the blocks or 2% of the bytes, and if you have multiple block sizes does each cache perhaps allow 2% of its own size.
And then, in 11g we have to worry about automatic direct path serial tablescans – and it would be easy to think that the “_small_table_threshold” may have been describing that feature since (at least) 188.8.131.52 if its description hadn’t changed slightly for 11.2 !
So much to do, so little time — but at least you know that there’s something that needs careful investigation if you’re planning to do lots of tablescans.
Footnote: Having written some tests, it’s easy to change versions. Running on 184.108.40.206 and 220.127.116.11, with similar sized caches, I could see that the “traditional” description of the “small_table_threshold” was true – a short tablescan was anything less 2% of the buffer cache, long tablescans were (in effect) done using just a window of “db_file_multiblock_read_count” buffers, and in both cases the touch count was never set (except for the segment header block).