April 14, 2011 I find interesting details from time to time about Oracle Database. I am in the process of organizing material for a presentation, and I encountered something interesting when testing Oracle Database 22.214.171.124 with patch 3 applied. I traced the execution of an application and then processed the 660MB trace file using a trace file [...]
This one is long overdue. There is a partition example of binary xml on this website based on Range, Hash and List partitioning, but this is, seen from a XML view, a incorrect way to do it due to the fact that the partition column is a regular one and not a virtual column as described in the Oracle XMLDB Developers Guide for 11.2. The examples given in that ppost will partition the table on the ID column but does not partition the XML based on a value/element IN the XML document. The following will.
So here, a small example, of how it can be done based on a virtual column. Be aware that you should support these virtual columns with at least an index or XMLIndex structure for performance reasons.
#993333; font-weight: bold;">CREATE #993333; font-weight: bold;">TABLE binary_part_xml #993333; font-weight: bold;">OF XMLType XMLTYPE STORE #993333; font-weight: bold;">AS SECUREFILE #993333; font-weight: bold;">BINARY XML VIRTUAL #993333; font-weight: bold;">COLUMNS #66cc66;">( LISTING_TYPE #993333; font-weight: bold;">AS #66cc66;">(XMLCast#66cc66;">(XMLQuery#66cc66;">(#ff0000;">'/LISTING/@TYPE' PASSING OBJECT_VALUE RETURNING CONTENT#66cc66;">) #993333; font-weight: bold;">AS VARCHAR2#66cc66;">(#cc66cc;">100#66cc66;">)#66cc66;">)#66cc66;">) #66cc66;">) PARTITION #993333; font-weight: bold;">BY LIST #66cc66;">(LISTING_TYPE#66cc66;">) #66cc66;">( PARTITION health #993333; font-weight: bold;">VALUES #66cc66;">(#ff0000;">'Health'#66cc66;">)#66cc66;">, PARTITION law_firms #993333; font-weight: bold;">VALUES #66cc66;">(#ff0000;">'Law Firm'#66cc66;">) #66cc66;">);
First, a reminder – my Advanced Oracle Troubleshooting v2.0 online seminar starts next week already. Last chance to sign up, I can accept registrations until Sunday :-)
I won’t do another AOT seminar before Oct (or Nov) this year. More details and sign-up here:
I have rescheduled my Advanced SQL Tuning and Partitioning & Parallel Execution for Performance seminars too. I will do them in September/October. Unfortunately I’m too busy right now to do them before the summer.
And that’s all the travel I will do this year…
I’ll soon announce the 2nd EsSN virtual conference too ;-)
Free online stuff:
Perhaps in a month or so I will do another hacking session (I’ll plan 2 hours this time, 1 hour isn’t nearly enough for going deep). The topic will probably be about low-level details of SQL plan execution internals… stay tuned!
Steve Bamber has written up a case study of library cache latch contention troubleshooting of an Apex application with LatchProf. I’m happy that others also see the value and have had success with my new LatchProf based latch contention troubleshooting approach which takes into account both sides of the contention story (latch waiters and latch holders/blockers) as opposed to the guesswork used previously (hey if it’s shared pool latch contention – is must be about bad SQL not using bind variables …. NOT always…)
Anyway, I’m happy. If you have success stories with LatchProf, please let me know!
As a second topic of interest, Laimutis Nedzinskas has written some good notes about the effect and overhead of Flashback Database option when you are using and modifying (nocache) LOBs. We’ve exchanged some mails on this topic and yeah, my clients have sure seen some problems with this combination as well. You basically want to keep your LOBs cached when using FB database…
Ok, it’s official – the first and only Oracle Troubleshooting TV show is live now!
The first show is almost 2 hours about the ORA-4031 errors and shared pool hacking. It’s a recording of the US/EMEA timezone online hacking session I did some days ago.
There are a couple of things to note:
View the embedded video below or go to my official Oracle Troubleshooting TV show channel:
I know I have mentioned it before, but I am a big fan of the OEM performance screens that are derived from the ASH/AWR information. One of the things I really like about it is the immediate information it gives you, in one glance, that things are “not normal”. Once you notice that things are not normal you can then, within a few seconds, get a feel for what is still probably OK and where you have something that has changed.
As an example of the immediate information, I recently came back to my desk and glanced at my OEM performance screen. It was showing the below:
“data load has just ran” I said to my comrade-in-arms. “which one?” he asked. “The Delta – It ran the quick plan. But it started a bit late, 12:15. Oh, and looks like the transaction view code has swapped back to the full table scan plan and the summary code is not playing up at the moment.”
“you’re turning into Neo you are – can you see a lady in a red dress???” he asked.
That was of course a reference to the “Matrix” films where at times you see the virtual world displayed on a screen as a stream of characters running down the screen – but once you get used to it you can apparently “see” what is going.
The screen shot above is not even actually a very good example of what the performance screens can show you. One of my minor complaints about the performance screens is that it scales to show the greatest of the largest peak or a number of sessions to match the number of CPUs (real or fake) that are available to you. So if you have more CPU available than you need, you can’t see much detail in the graph. And if you have had a nasty peak of activity, again, all detail is squeezed out. In my case, the box is sized to cope in 12 months and the system is new, so activity is scuttling along the bottom of the graph.
However, “poor” though the example is, it told me what was going on across my system at a glance, something about the major tasks we are running, that one problem is currently occurring and that several of the other issues I need to keep an eye out for are not occurring.
That is why I love these screens – I recognise “my” activity patterns from the graph, I now recognise the SQL IDs for my key statements. If I see a pattern in the graph I don’t recognise, I need to check things out immediately. Three or four times over the last 2 weeks I have spotted an issues, started investigating and found out the cause before the Operations desk has even noticed an issue.
Oh, and what is SQL type 189? It is a merge statement. Our implementation of OEM is a little old, it does not correctly interpret that SQL command type. It might be a little old, it is still a lot useful.
People talk about the Oracle SQL Developer 3 being out, which is cool, but I have something even cooler for you today ;-)
I finally figured out how to convert my screen-recordings to uploadable videos, so that the text wouldn’t get unreadable and blurry.
So, here’s the first video, about a tool called MOATS, which we have built together with fellow OakTable Network member and a PL/SQL wizard Adrian Billington (of oracle-developer.net).
Here’s the video, it’s under 3 minutes long. Play the video in full screen for best results (and if it’s too slow loading, change it to lower resolution from HD mode):
Check it out and if you like MOATS, you can download it from Adrian’s website site (current version 1.05) and make sure you read the README.txt file in the zip!
Also thanks to Randolf Geist for finding and fixing some bugs in our alpha code… Note that MOATS is still kind of beta right now…
P.S. I will post my ORA-4031 and shared pool hacking video real soon now, too! :-)
P.P.S. Have you already figured out how it works?! ;-)
Update: Now you can suggest new features and improvement requests here:
Here’s a note I’ve been meaning to research and write up for more than 18 months – every since Dion Cho pinged a note I’d written about the effects of partitioning because of a comment it made about the “2% small table threshold”.
It has long been an item of common knowledge that Oracle has a “small table threshold” that allows for special treatment of data segments that are smaller than two percent of the size of the buffer cache, viz:
If a table is shorter than the threshold then it is read to the midpoint of the cache (just like any other block read) but – whether by accident or design – the touch count (x$bh.tch) is not set and the table will fall off the LRU end of the buffer cache fairly promptly as other objects are read into the buffer. Such a tablescan would be recorded under the statistic “table scans (short tables)”.
Then, in July 2009, Dion Cho decided to check this description before repeating it, and set about testing it on Oracle 10gR2 – producing some surprising results and adding another item to my to-do list. Since then I have wanted to check his conclusions, check whether the original description had ever been true and when (or if) it had changed.
As a simple starting point, of course, it was easy to check the description of the relevant (hidden) parameter to see when it changed:
126.96.36.199 _small_table_threshold threshold level of table size for forget-bit enabled during scan 188.8.131.52 _small_table_threshold threshold level of table size for direct reads 184.108.40.206 _small_table_threshold lower threshold level of table size for direct reads
This suggests that the behaviour might have changed some time in 9i (220.127.116.11 happened to be the earliest 9i listing of x$ksppi I had on file) – so I clearly had at least three major versions to check.
The behaviour of the cache isn’t an easy thing to test, though, because there are a number of special cases to consider – in particular the results could be affected by the positioning of the “mid-point” marker (x$kcbwds.cold_hd) that separates the “cold” buffers from the “hot” buffers. By default the hot portion of the default buffer is 50% of the total cache (set by hidden parameter _db_percent_hot_default) but on instance startup or after a “flush buffer cache” there are no used buffers so the behaviour can show some anomalies.
So here’s the basic strategy:
Here’s some sample code:
create table t_15400 pctfree 99 pctused 1 as with generator as ( select --+ materialize rownum id from dual connect by rownum <= 10000 ) select rownum id, lpad(rownum,10,'0') small_vc, rpad('x',100) padding from generator v1, generator v2 where rownum <= 15400 ; create index t_15400_id on t_15400(id); begin dbms_stats.gather_table_stats( ownname => user, tabname =>'T_15400', estimate_percent => 100, method_opt => 'for all columns size 1' ); end; / select object_name, object_id, data_object_id from user_objects where object_name in ( 'T_300', 'T_770', 'T_1540', 'T_3750', 'T_7700', 'T_15400', 'T_15400_ID' ) order by object_id ; select /*+ index(t) */ max(small_vc) from t_15400 t where id > 0 ;
The extract shows the creation of just the last and largest table I created and collected statistics for – and it was the only one with an index. I chose the number of blocks (I’ve rigged one row per block) because I had set up a db_cache_size of 128MB on my 10.2.0.3 Oracle instance and this had given me 15,460 buffers.
As you can see from the query against user_objects my test case included tables with 7,700 rows (50%), 3,750 rows (25%), 1,540 rows (10%), 770 rows (5%) and 300 rows (2%). (The number in brackets are the approximate sizes of the tables – all slightly undersized – relative to the number of buffers in the default cache).
Here’s the query that I then ran against x$bh (connected as sys from another session) to see what was in the cache (the range of values needs to be adjusted to cover the range of object_id reported from user_objects):
select obj, tch, count(*) from x$bh where obj between 77710 and 77720 group by obj, tch order by count(*) ;
After executing the first index range scan of t_15400 to fill the cache three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75855 0 1 75854 0 1 75853 0 1 75851 0 1 75850 0 1 75849 0 1 75852 0 1 75855 2 9 -- Index blocks, touch count incremented 75855 1 18 -- Index blocks, touch count incremented 75854 1 11521 -- Table blocks, touch count incremented
Then after three tablescans, at 4 second intervals, of the 7,700 block table:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 -- segment header of 7700 table, touch count incremented each time 75855 0 1 75854 0 1 75852 0 1 75849 0 1 75850 0 1 75851 0 1 75855 2 9 75855 1 10 75853 0 3991 -- lots of blocks from 7700 table, no touch count increment 75854 1 7538
Then repeating the tablescan of the 3,750 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75851 0 1 75852 3 1 -- segment header block, touch count incremented each time 75849 0 1 75850 0 1 75855 2 9 75855 1 10 75853 0 240 75852 0 3750 -- table completely cached - touch count not incremented 75854 1 7538
Then repeating the tablescan of the 1,540 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75851 3 1 -- segment header block, touch count incremented each time 75849 0 1 75850 0 1 75852 3 1 75855 2 9 75855 1 10 75853 0 149 75851 2 1540 -- Table fully cached, touch count incremented but only to 2 75852 0 2430 75854 1 7538
Then executing the tablescan of the 770 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75850 3 1 -- segment header block, touch count incremented each time 75849 0 1 75851 3 1 75852 3 1 75854 0 1 75855 2 9 75855 1 10 75851 0 69 75853 0 149 75850 2 770 -- Table fully cached, touch count incremented but only to 2 75851 2 1471 75852 0 1642 75854 1 7538
Finally executing the tablescan of the 300 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75850 3 1 75852 3 1 75851 3 1 75855 2 9 75855 1 10 75851 0 69 75850 0 131 75853 0 149 75849 3 301 -- Table, and segment header, cached and touch count incremented 3 times 75850 2 639 75852 0 1342 75851 2 1471 75854 1 7538
This set of results on its own isn’t conclusive, of course, but the indications for 10.2.0.3 are:
I can’t state with any certainty where the used and recycled buffers might be, but since blocks from the 3750 tablescan removed the blocks from the 7700 tablescan, it’s possible that “large” tablescans do somehow go “to the bottom quarter” of the LRU.
There also some benefit in checking the statistics “table scans (short)” and “table scans (long)” as the tests run. For the 2% (300 block) table I recorded 3 short tablescans; for the tables in the 2% to 10% range (770 and 1540) I recorded one long and two short (which is consistent with the touch count increment of 2 – the first scan was expected to be long, but the 2nd and 3rd were deemed to be short based on some internal algorithm about the tables being fully cached); finally for the tables above 10% we always got 3 long tablescans.
But as it says in the original note on small partitions – there are plenty of questions still to answer:
I’ve quoted the 2% as the fraction of the db_cache_size – but we have automatic SGA management in 10g, automatic memory management in 11g, and up to eight different cache sizing parameters in every version from 9i onwards. What figure is used as the basis for the 2%, and is that 2% of the blocks or 2% of the bytes, and if you have multiple block sizes does each cache perhaps allow 2% of its own size.
And then, in 11g we have to worry about automatic direct path serial tablescans – and it would be easy to think that the “_small_table_threshold” may have been describing that feature since (at least) 18.104.22.168 if its description hadn’t changed slightly for 11.2 !
So much to do, so little time — but at least you know that there’s something that needs careful investigation if you’re planning to do lots of tablescans.
Footnote: Having written some tests, it’s easy to change versions. Running on 22.214.171.124 and 126.96.36.199, with similar sized caches, I could see that the “traditional” description of the “small_table_threshold” was true – a short tablescan was anything less 2% of the buffer cache, long tablescans were (in effect) done using just a window of “db_file_multiblock_read_count” buffers, and in both cases the touch count was never set (except for the segment header block).
A few days ago I looked into a SQL Tracefile of some LOB access code and saw a LOBREAD entry there. This is a really welcome improvement (or should I say, bugfix of a lacking feature) for understanding resource consumption by LOB access OPI calls. Check the bottom of the output below:
*** 2011-03-17 14:34:37.242 WAIT #47112801352808: nam='SQL*Net message from client' ela= 189021 driver id=1413697536 #bytes=1 p3=0 obj#=99584 tim=1300390477242725 WAIT #0: nam='gc cr multi block request' ela= 309 file#=10 block#=20447903 class#=1 obj#=99585 tim=1300390477243368 WAIT #0: nam='cell multiblock physical read' ela= 283 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477243790 WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390477243865 [...snipped...] WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244205 WAIT #0: nam='SQL*Net more data to client' ela= 4 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244221 WAIT #0: nam='gc cr multi block request' ela= 232 file#=10 block#=20447911 class#=1 obj#=99585 tim=1300390477244560 WAIT #0: nam='cell multiblock physical read' ela= 882 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477245579 WAIT #0: nam='SQL*Net more data to client' ela= 16 driver id=1413697536 #bytes=2020 p3=0 obj#=99585 tim=1300390477245685 WAIT #0: nam='SQL*Net more data to client' ela= 6 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477245706 WAIT #0: nam='SQL*Net more data to client' ela= 5 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390477245720 #ff0000;">LOBREAD: c=1000,e=2915,p=8,cr=5,cu=0,tim=1300390477245735
In past versions of Oracle the CPU (c=) usage figures and other stats like number of physical/logical reads of the LOB chunk read OPI call were just lost – they were never reported in the tracefile. In past only the most common OPI calls, like PARSE, EXEC, BIND, FETCH (and recently CLOSE cursor) were instrumented with SQL Tracing. But since 11.2(.0.2?) the LOBREAD’s are printed out too. This is good, as it reduces the amount of guesswork needed to figure out what are those WAITs for cursor #0 – which is really a pseudocursor.
Why cursor#0? It’s because normally, with PARSE/EXEC/BIND/FETCH, you always had to specify a cursor slot number you operated on (if you fetch from cursor #5, it means that Oracle process went to slot #5 in the open cursor array in your session’s UGA and followed the pointers to shared cursor’s executable parts in library cache from there). But LOB interface works differently – if you select a LOB column using your query (cursor), then all your application gets is a LOB LOCATOR (sort of a pointer with LOB item ID and consistent read/version SCN). Then it’s your application which must issue another OPI call (LOBREAD) to read the chunks of that LOB out from the database. And the LOB locator is independent from any cursors, it doesn’t follow the same cursor API as regular SQL statements (as it requires way different functionality compared to a regular select or update statement).
So, whenever a wait happened in your session due to an access using a LOB locator, then there’s no specific cursor responsible for it (as far as Oracle sees internally) and that’s why a fake, pseudocursor #0 is used.
Note that on versions earlier than 11.2(.0.2?) when the LOBREAD wasn’t printed out to trace – you can use OPI call tracing (OPI stands for Oracle Program Interface and is the server-side counterpart to OCI API in the client side) using event 10051. First enable SQL Trace and then the event 10051 (or the other way around if you like):
SQL> @oerr 10051 ORA-10051: trace OPI calls SQL> alter session set events '10051 trace name context forever, level 1'; Session altered.
Now run some LOB access code and check the tracefile:
*** 2011-03-17 14:37:07.178 WAIT #47112806168696: nam='SQL*Net message from client' ela= 6491763 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627178602 OPI CALL: type=105 argc= 2 cursor= 0 name=Cursor close all CLOSE #47112806168696:c=0,e=45,dep=0,type=1,tim=1300390627178731 OPI CALL: type=94 argc=28 cursor= 0 name=V8 Bundled Exec ===================== PARSING IN CURSOR #47112802701552 len=19 dep=0 uid=93 oct=3 lid=93 tim=1300390627179807 hv=1918872834 ad='271cc1480' sqlid='3wg0udjt5zb82' select * from t_lob END OF STMT PARSE #47112802701552:c=1000,e=1027,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179805 EXEC #47112802701552:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179884 WAIT #47112802701552: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627179939 WAIT #47112802701552: nam='SQL*Net message from client' ela= 238812 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418785 OPI CALL: type= 5 argc= 2 cursor= 26 name=FETCH WAIT #47112802701552: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418945 FETCH #47112802701552:c=0,e=93,p=0,cr=5,cu=0,mis=0,r=1,dep=0,og=1,plh=3547887701,tim=1300390627418963 WAIT #47112802701552: nam='SQL*Net message from client' ela= 257633 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676629 #ff0000;">OPI CALL: type=96 argc=21 cursor= 0 name=#ff0000;">LOB/FILE operations WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676788 [...snip...] WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390627677054 LOBREAD: c=0,e=321,p=0,cr=5,cu=0,tim=1300390627677064
Check the bold and especially the red string above. Tracing OPI calls gives you some extra details of what kind of tasks are executed in the session. The “LOB/FILE operations” call indicates that whatever lines come after it (unlike SQL trace call lines where all the activity happens before a call line is printed (with some exceptions of course)) are done for this OPI call (until a next OPI call is printed out). OPI call tracing should work even on ancient database versions…
By the way, if you are wondering, what’s the cursor number 47112801352808 in the “WAIT #47112801352808″ above? Shouldn’t the cursor numbers be small numbers?
Well, in 188.8.131.52 this was also changed. Before that, the X in CURSOR #X (and PARSE #X, BIND #X, EXEC #X, FETCH #X) represented the slot number in your open cursor array (controlled by open_cursors) in your session’s UGA. Now, the tracefile dumps out the actual address of that cursor. 47112801352808 in HEX is 2AD94DC9FC68 and it happens to reside in the UGA of my session.
Naturally I asked Cary Millsap about whether he had spotted this LOBREAD already and yes, Cary’s way ahead of me – he said that Method-R’s mrskew tool v2.0, which will be out soon, will support it too.
It’s hard to not end up talking about Cary’s work when talking about performance profiling and especially Oracle SQL trace, so here are a few very useful bits which you should know about:
If you want to understand the SQL trace & profiling stuff more, then the absolute must document is Cary’s paper on the subject – Mastering Performance with Extended SQL Trace:
Also, if you like to optimize your work like me (in other words: you’re proactively lazy ;-) and you want to avoid some boring “where-the-heck-is-this-tracefile-now” and “scp-copy-it-over-to-my-pc-for-analysis” work then check out Cary’s MrTrace plugin (costs ~50 bucks and has a 30-day trial) for SQL Developer. I’ve ended up using it myself regularly although I still tend to avoid GUIs:
Greg Rahn points out that the improved NDV estimation arrives in 11.1 not 11.2 (which my article distinctly odd implies) see http://structureddata.org/2007/09/17/oracle-11g-enhancements-to-dbms_stats/ for more on this.