A few days ago I looked into a SQL Tracefile of some LOB access code and saw a LOBREAD entry there. This is a really welcome improvement (or should I say, bugfix of a lacking feature) for understanding resource consumption by LOB access OPI calls. Check the bottom of the output below:
*** 2011-03-17 14:34:37.242
WAIT #47112801352808: nam='SQL*Net message from client' ela= 189021 driver id=1413697536 #bytes=1 p3=0 obj#=99584 tim=1300390477242725
WAIT #0: nam='gc cr multi block request' ela= 309 file#=10 block#=20447903 class#=1 obj#=99585 tim=1300390477243368
WAIT #0: nam='cell multiblock physical read' ela= 283 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477243790
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390477243865
[...snipped...]
WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244205
WAIT #0: nam='SQL*Net more data to client' ela= 4 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244221
WAIT #0: nam='gc cr multi block request' ela= 232 file#=10 block#=20447911 class#=1 obj#=99585 tim=1300390477244560
WAIT #0: nam='cell multiblock physical read' ela= 882 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477245579
WAIT #0: nam='SQL*Net more data to client' ela= 16 driver id=1413697536 #bytes=2020 p3=0 obj#=99585 tim=1300390477245685
WAIT #0: nam='SQL*Net more data to client' ela= 6 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477245706
WAIT #0: nam='SQL*Net more data to client' ela= 5 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390477245720
LOBREAD: c=1000,e=2915,p=8,cr=5,cu=0,tim=1300390477245735
In past versions of Oracle the CPU usage (c=) and the other statistics of the LOB chunk read OPI call, such as the number of physical and logical reads, were simply lost – they were never reported in the tracefile. In the past only the most common OPI calls, like PARSE, EXEC, BIND, FETCH (and recently CLOSE cursor), were instrumented with SQL Tracing. But since 11.2(.0.2?) the LOBREADs are printed out too. This is good, as it reduces the amount of guesswork needed to figure out what those WAITs for cursor #0 (which is really a pseudocursor) are for.
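If you want to pull these figures out of a tracefile yourself, here is a minimal Python sketch of how the LOBREAD line shown above could be parsed. The function name parse_lobread is my own invention for illustration, and the only line format it understands is the one visible in the trace above.

```python
import re

# Matches a LOBREAD line in the format shown in the trace above.
LOBREAD_RE = re.compile(r"^LOBREAD:\s*(?P<stats>.+)$")


def parse_lobread(line):
    """Return the key=value statistics of a LOBREAD line as a dict of ints.

    Returns None when the line is not a LOBREAD line.
    (parse_lobread is a hypothetical helper, not part of any Oracle tool.)
    """
    match = LOBREAD_RE.match(line.strip())
    if match is None:
        return None
    stats = {}
    for pair in match.group("stats").split(","):
        key, _, value = pair.partition("=")
        stats[key.strip()] = int(value)
    return stats


sample = "LOBREAD: c=1000,e=2915,p=8,cr=5,cu=0,tim=1300390477245735"
stats = parse_lobread(sample)
print(stats["c"], stats["e"], stats["p"], stats["cr"])  # 1000 2915 8 5
```

With the sample line this gives c=1000 (CPU), e=2915 (elapsed), p=8 (physical reads) and cr=5 (consistent reads) for the single LOB read call.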
Why cursor #0? It’s because normally, with PARSE/EXEC/BIND/FETCH, you always had to specify a cursor slot number you operated on (if you fetch from cursor #5, it means that the Oracle process went to slot #5 in the open cursor array in your session’s UGA and followed the pointers to the shared cursor’s executable parts in the library cache from there). But the LOB interface works differently – if you select a LOB column using your query (cursor), then all your application gets is a LOB LOCATOR (sort of a pointer with a LOB item ID and consistent read/version SCN). Then it’s your application which must issue another OPI call (LOBREAD) to read the chunks of that LOB out from the database. And the LOB locator is independent of any cursors; it doesn’t follow the same cursor API as regular SQL statements (as it requires very different functionality compared to a regular select or update statement).
So, whenever a wait happened in your session due to an access using a LOB locator, then there’s no specific cursor responsible for it (as far as Oracle sees internally) and that’s why a fake, pseudocursor #0 is used.
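To see how much time the pseudocursor accounts for, you could sum the WAIT lines reported against cursor #0 by event name. This is only a rough sketch: the helper name pseudocursor_waits is made up, and the regular expression assumes the WAIT line layout shown in the trace earlier in this post.

```python
import re
from collections import defaultdict

# Layout assumed from the trace above: WAIT #<cursor>: nam='<event>' ela= <microseconds> ...
WAIT_RE = re.compile(r"^WAIT #(?P<cursor>\d+): nam='(?P<event>[^']+)' ela=\s*(?P<ela>\d+)")


def pseudocursor_waits(lines):
    """Sum the ela= values of WAIT lines that belong to pseudocursor #0, per event."""
    totals = defaultdict(int)
    for line in lines:
        match = WAIT_RE.match(line.strip())
        if match and match.group("cursor") == "0":
            totals[match.group("event")] += int(match.group("ela"))
    return dict(totals)


sample_lines = [
    "WAIT #47112801352808: nam='SQL*Net message from client' ela= 189021 driver id=1413697536 #bytes=1 p3=0 obj#=99584 tim=1300390477242725",
    "WAIT #0: nam='gc cr multi block request' ela= 309 file#=10 block#=20447903 class#=1 obj#=99585 tim=1300390477243368",
    "WAIT #0: nam='cell multiblock physical read' ela= 283 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477243790",
    "WAIT #0: nam='gc cr multi block request' ela= 232 file#=10 block#=20447911 class#=1 obj#=99585 tim=1300390477244560",
    "WAIT #0: nam='cell multiblock physical read' ela= 882 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477245579",
]

for event, microseconds in sorted(pseudocursor_waits(sample_lines).items()):
    print(event, microseconds)
```

The wait on the real cursor (#47112801352808) is ignored, so only the LOB-related waits are counted.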
Note that on versions earlier than 11.2(.0.2?), where the LOBREAD wasn’t printed out to trace, you can use OPI call tracing (OPI stands for Oracle Program Interface and is the server-side counterpart to the OCI API on the client side) using event 10051. First enable SQL Trace and then the event 10051 (or the other way around if you like):
SQL> @oerr 10051
ORA-10051: trace OPI calls

SQL> alter session set events '10051 trace name context forever, level 1';

Session altered.
Now run some LOB access code and check the tracefile:
*** 2011-03-17 14:37:07.178
WAIT #47112806168696: nam='SQL*Net message from client' ela= 6491763 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627178602
OPI CALL: type=105 argc= 2 cursor= 0 name=Cursor close all
CLOSE #47112806168696:c=0,e=45,dep=0,type=1,tim=1300390627178731
OPI CALL: type=94 argc=28 cursor= 0 name=V8 Bundled Exec
=====================
PARSING IN CURSOR #47112802701552 len=19 dep=0 uid=93 oct=3 lid=93 tim=1300390627179807 hv=1918872834 ad='271cc1480' sqlid='3wg0udjt5zb82'
select * from t_lob
END OF STMT
PARSE #47112802701552:c=1000,e=1027,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179805
EXEC #47112802701552:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179884
WAIT #47112802701552: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627179939
WAIT #47112802701552: nam='SQL*Net message from client' ela= 238812 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418785
OPI CALL: type= 5 argc= 2 cursor= 26 name=FETCH
WAIT #47112802701552: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418945
FETCH #47112802701552:c=0,e=93,p=0,cr=5,cu=0,mis=0,r=1,dep=0,og=1,plh=3547887701,tim=1300390627418963
WAIT #47112802701552: nam='SQL*Net message from client' ela= 257633 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676629
OPI CALL: type=96 argc=21 cursor= 0 name=LOB/FILE operations
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676788
[...snip...]
WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390627677054
LOBREAD: c=0,e=321,p=0,cr=5,cu=0,tim=1300390627677064
Check the OPI CALL lines above, especially the one named “LOB/FILE operations”. Tracing OPI calls gives you some extra details of what kind of tasks are executed in the session. The “LOB/FILE operations” call indicates that whatever lines come after it are done for this OPI call, until the next OPI call is printed out. This is unlike SQL trace call lines, where all the activity happens before the call line is printed (with some exceptions of course). OPI call tracing should work even on ancient database versions…
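Since the lines that follow an OPI CALL line belong to that call, attributing trace lines to OPI calls is a simple walk through the file. Here is a small sketch of that idea; group_by_opi_call is a hypothetical helper and it only relies on the "OPI CALL: ... name=" layout visible in the trace above.

```python
import re

# Only the name= part of the OPI CALL line is needed for grouping.
OPI_RE = re.compile(r"^OPI CALL:.*\bname=(?P<name>.*)$")


def group_by_opi_call(lines):
    """Return a list of (opi_call_name, [trace lines that follow it]) tuples.

    Lines seen before the first OPI CALL line are skipped, because we
    cannot tell which call they belong to.
    """
    groups = []
    for raw in lines:
        line = raw.strip()
        match = OPI_RE.match(line)
        if match:
            groups.append((match.group("name").strip(), []))
        elif groups:
            groups[-1][1].append(line)
    return groups


trace = [
    "OPI CALL: type= 5 argc= 2 cursor= 26 name=FETCH",
    "FETCH #47112802701552:c=0,e=93,p=0,cr=5,cu=0,mis=0,r=1,dep=0,og=1,plh=3547887701,tim=1300390627418963",
    "OPI CALL: type=96 argc=21 cursor= 0 name=LOB/FILE operations",
    "WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676788",
    "LOBREAD: c=0,e=321,p=0,cr=5,cu=0,tim=1300390627677064",
]

for name, owned in group_by_opi_call(trace):
    print(name, len(owned))
```

In this sample the WAIT #0 line and the LOBREAD line both end up under the “LOB/FILE operations” call.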
By the way, if you are wondering what the cursor number 47112801352808 in “WAIT #47112801352808” above is: shouldn’t the cursor numbers be small numbers?
Well, this was also changed in 11.2. Before that, the X in CURSOR #X (and PARSE #X, BIND #X, EXEC #X, FETCH #X) represented the slot number in your open cursor array (controlled by open_cursors) in your session’s UGA. Now, the tracefile dumps out the actual address of that cursor. 47112801352808 in HEX is 2AD94DC9FC68 and it happens to reside in the UGA of my session.
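If you want to check the conversion yourself, a couple of lines of Python will do (the variable names are just for illustration):

```python
# Cursor number as printed in the WAIT line of the tracefile.
cursor_number = 47112801352808

# Decimal to hexadecimal, upper case, no 0x prefix.
cursor_address = format(cursor_number, "X")
print(cursor_address)  # 2AD94DC9FC68

# And back again, to confirm the round trip.
print(int(cursor_address, 16) == cursor_number)  # True
```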
Naturally I asked Cary Millsap about whether he had spotted this LOBREAD already and yes, Cary’s way ahead of me – he said that Method-R’s mrskew tool v2.0, which will be out soon, will support it too.
It’s hard to not end up talking about Cary’s work when talking about performance profiling and especially Oracle SQL trace, so here are a few very useful bits which you should know about:
If you want to understand the SQL trace & profiling stuff more, then the absolute must document is Cary’s paper on the subject – Mastering Performance with Extended SQL Trace:
Also, if you like to optimize your work like me (in other words: you’re proactively lazy ;-) and you want to avoid some boring “where-the-heck-is-this-tracefile-now” and “scp-copy-it-over-to-my-pc-for-analysis” work then check out Cary’s MrTrace plugin (costs ~50 bucks and has a 30-day trial) for SQL Developer. I’ve ended up using it myself regularly although I still tend to avoid GUIs:
This is part 2 of an article on the KEEP cache. If you haven’t got here from part one you should read that first for an explanation of the STATE and CUR columns of the output.
Here’s a little code to demonstrate some of the problems with setting a KEEP cache – I’ve set up a 16MB cache, which gives me 1,996 buffers of 8KB in 10.2.0.3, and then created a table that doesn’t quite fill that cache. The table is 1,900 data blocks plus one block for the segment header (I’ve used freelist management to make the test as predictable as possible, and fixed the pctfree to get one row per block).
create table t1
pctfree 90
pctused 10
storage (buffer_pool keep)
as
with generator as (
        select  --+ materialize
                rownum id
        from    dual
        connect by
                rownum <= 10000
)
select
        rownum                  id,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',1000)          padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1900
;

alter system flush buffer_cache;

-- scan the table to load it into memory

select
        /*+ full(t1) */
        count(small_vc)
from    t1
where
        id > 0
;

-- check the content of x$bh from another connection

-- update every fifth row (380 in total)

update t1
        set small_vc = upper(small_vc)
where
        mod(id,5) = 0
;

-- check the content of x$bh from another connection
The query I ran under the SYS account was this:
select
        bpd.bp_blksz,
        bpd.bp_name,
        wds.set_id,
        bh.state,
        cur,
        ct
from
        (
        select
                set_ds,
                state,
                bitand(flag, power(2,13))       cur,
                count(*)                        ct
        from
                x$bh
        group by
                set_ds,
                state,
                bitand(flag, power(2,13))
        )               bh,
        x$kcbwds        wds,
        x$kcbwbpd       bpd
where
        wds.addr = bh.set_ds
and     bpd.bp_lo_sid <= wds.set_id
and     bpd.bp_hi_sid >= wds.set_id
and     bpd.bp_size != 0
order by
        bpd.bp_blksz,
        bpd.bp_name,
        wds.set_id,
        bh.state,
        bh.cur
;
In my test case this produced two sets of figures, one for the DEFAULT cache, and one for the KEEP cache but I’ve only copied out the results from the KEEP cache, first after the initial tablescan, then after the update that affected 380 blocks:
  BP_BLKSZ BP_NAME     SET_ID STATE   CUR         CT
---------- ------- ---------- ----- ----- ----------
      8192 KEEP             1     0     0         95
                                  1     0       1901
           ******* **********             ----------
           sum                                  1996

  BP_BLKSZ BP_NAME     SET_ID STATE   CUR         CT
---------- ------- ---------- ----- ----- ----------
      8192 KEEP             1     1     0       1462
                                  1  8192        380
                                  3     0        323
           ******* **********             ----------
           sum                                  1996
In the first output you see the 1901 buffers holding blocks from the table (1,900 data plus one segment header), with the remaining 95 buffers still “free” (state 0). The table blocks are all shown as XCUR (state 1, exclusive current).
In the second output you see 380 buffers holding blocks with state ‘XCUR’ with bit 13 of the flag column set, i.e. “gotten in current mode”. These are the 380 blocks that have been updated – but there are also 323 blocks shown as CR (state 3, “only valid for consistent read”). A detailed check of the file# and dbablk for these buffers shows that they are clones of (most of) the 380 blocks in the XCUR buffers.
Do a bit of arithmetic – we have 1462 blocks left from the original tablescan, plus 380 blocks in CUR mode (of which there are 323 clones) for a total of 1,842 blocks – which means that 59 blocks from the table are no longer in the cache. As we clone blocks we can lose some of the blocks we want to KEEP.
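For anyone who wants to repeat that arithmetic with their own figures, here it is as a few lines of Python. The numbers are the ones quoted from the second output above; the variable names are mine.

```python
# Figures taken from the text and the second x$bh output above.
table_blocks = 1900 + 1       # data blocks plus the segment header
tablescan_buffers = 1462      # state 1, CUR = 0: left over from the tablescan
updated_buffers = 380         # state 1, CUR = 8192: updated blocks

# Blocks of the table that still have a current copy in the KEEP cache.
still_cached = tablescan_buffers + updated_buffers

# Blocks of the table that have been pushed out of the cache.
lost_blocks = table_blocks - still_cached

print(still_cached, lost_blocks)  # 1842 59
```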
Unfortunately for us, Oracle has not given any preferential treatment to buffers which hold blocks in the XCUR state – any buffer which reaches the end of LRU chain and hasn’t been accessed since it was first loaded will be dumped so that the buffer can be used to create a clone (but see footnote). This means that a constant stream of inserts, updates, deletes, and queries could result in lots of clones being created in your KEEP cache, pushing out the data you want to keep.
If you want to size your KEEP cache to minimise this effect, you probably need to start by making it somewhat larger than the objects it is supposed to KEEP, and then checking to see how many clones you have in the cache – because that will give you an idea of how many extra buffers you need to stop the clones from pushing out the important data.
When I wrote and ran the test cases in this note the client was running Oracle 10.2 – while writing up my notes I happened to run the test on an 11g release (still using freelists rather than ASSM) and got the following output from my scan of the KEEP cache:
  BP_BLKSZ BP_NAME     SET_ID STATE   CUR         CT
---------- ------- ---------- ----- ----- ----------
      8192 KEEP             1     1     0       1901
                                  3     0         91
           ******* **********             ----------
           sum                                  1992
Apart from the fact that you get slightly fewer buffers per granule in 11g (the x$bh structure has become slightly larger – and x$bh is a segmented array where each segment shares the granule with the buffers it points to) you can see that we only have 91 clones in the KEEP cache, and apparently we’ve managed to update our 380 blocks without changing their flag to “gotten in current mode”. Doing an update is, of course, just one way of making clones appear – but perhaps 11g will generally have more success in keeping current versions of blocks in memory than earlier versions.
There is, unfortunately, a very special feature to this test case – it’s using a single tablescan to update the table. So having said in part 1 that I was going to write a two-part article, I’ve got to this point, done a few more tests, and decided I need to write part three as well. Stay tuned.
Footnote: Technically there are a couple of circumstances where Oracle will bypass the buffer and walk along the LRU chain looking for another block – but I’ve excluded them from this demonstration.
Free stuff! Free stuff! Free stuff! :-)
The awesome dudes at E2SN have done it again! (and yes, Tom, this time the “we at E2SN Ltd” doesn’t mean only me alone ;-)
On Tuesday 22nd March I’ll hold two (yes two) Secret Oracle Hacking Sessions – about ORA-04031: unable to allocate x bytes of shared memory errors, cursor management issues and other shared pool related problems (like fragmentation). This event is free for all! You’ll just need to be fast enough to register, as both events have a 100 attendee limit (due to my GotoWebinar account limitations).
I am going to run this online event twice, so total 200 people can attend (don’t register for both events, please). One event is in the morning (my time) to cater for APAC/EMEA region and the other session is for EMEA/US/Americas audience.
The content will be the same in both sessions. There will be no slides (you can’t fix your shared pool problems with slides!) but there will be demos, scripts, live examples and fun (for the geeks among us anyway – others go and read some slides instead ;-)!
As my frequent readers know, I have promised to not travel anymore as it’s just too much hassle compared to the benefit of being “there”. This is why I’m going to fly to London on Monday, 18th April to speak at the UKOUG Exadata Special Event. This event is just too sexy to be missed, so I made an exception (the last one, I promise!)… and it’s probably going to be warmer there as well compared to where I am now :-)
I will be talking about what’s been my focus area for the last year or so – Oracle Exadata Performance.
Dan Norris and Alex Gorbachev will be speaking there too, so it should end up being a pretty awesome event!
More details here:
My abstract follows:
In order to systematically troubleshoot and optimize Exadata performance, one must understand the meaning of its performance metrics.
This session provides a deep technical walkthrough of how Exadata IO and smart scans work and how to use relevant metrics for troubleshooting related performance issues. We will review both Exadata database and cell-level metrics, cell wait events and tools useful for troubleshooting. We will also look into metrics related to Exadata Hybrid Columnar Compression and the cell Flash Cache usage.
P.S. The reason why I called this post “Exadata Training” is that you’ll learn some real world practical stuff there… as opposed to the marketing material (and marketing material copy material) overdose out there… ;-)
Here’s a bit of geek stuff that I’ve been meaning to write up for nearly a year – to the day, more or less – and I’ve finally been prompted to finish the job off by the re-appearance on the OTN database forum of the standard “keep cache” question:
This is a two-part note – and in the first part I’m just going to run a query and talk about the results. The query is one that has to be run by SYS because it references a couple of x$ structures, and this particular version of the query was engineered specifically for a particular client.
select
        obj,
        state,
        bitand(bh.flag,power(2,13))     cur,
        count(*)                        ct
from
        x$bh            bh,
        x$kcbwds        wds
where
        wds.addr = bh.set_ds
and     wds.set_id between 1 and 24
group by
        obj,
        state,
        bitand(bh.flag,power(2,13))
order by
        obj,
        state,
        bitand(bh.flag,power(2,13))
;
You’ll notice I’m joining x$bh (the “buffer header” array) to x$kcbwds (the “working data set” array) where I’ve picked sets 1 to 24. On this particular system these were the sets for the KEEP cache. (If you want a generic query to isolate a particular cache then there’s an example here that identifies the RECYCLE cache by reference – but I wanted the query in this note to run as efficiently as possible against this production system, so I did a preliminary lookup against x$kcbwbpd and then used the literal set ids).
Here are a few lines from the resulting output:
       OBJ      STATE        CUR         CT
---------- ---------- ---------- ----------
     40158          1       8192          1

     40189          1          0      87233
                            8192     272789
                    3          0      69804
                            8192     393868

     40192          1          0         87
                            8192      12197
                    3          0      30763
                            8192       1994
...
    117291          1          0        498
                            8192       4419
                    3          0       3001
                            8192         15

    117294          1          0        243
                            8192       3544
                    3          0       1245
                            8192         23

4294967294          3          0          2
**********                       ----------
sum                                 1216072
Since we’re looking at x$ structures – which rarely have any official documentation – the rest of this note isn’t guaranteed to be correct – and things do change with version so I need to stress that this specific example comes from 188.8.131.52. This is what I think the results show:
The state column is instance-related and is essentially something that’s useful in a RAC environment. State 1 translates to ‘XCUR’ (exclusive current) which means that this instance has exclusive rights to the most recent version of the block; state 3 translates to ‘CR’ (only valid for consistent read).
Bit 13 of the flag column is set if the buffer has been “gotten in current mode”. (If you’re interested in the other bits there’s a page on my old website that might keep you entertained for a while – I haven’t yet updated it to 11g, though.)
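The bitand(flag, power(2,13)) expression in the query simply isolates that one bit, which is why the CUR column only ever shows 0 or 8192. A tiny Python sketch of the same masking (the names are mine, and the extra bits in the examples are arbitrary test values, not documented flag meanings):

```python
# Same value as power(2,13) in the query: 8192.
GOTTEN_IN_CURRENT_MODE = 1 << 13


def cur_value(flag):
    """Mimic bitand(flag, power(2,13)): returns 8192 if bit 13 is set, else 0."""
    return flag & GOTTEN_IN_CURRENT_MODE


print(cur_value(0))            # 0
print(cur_value(8192))         # 8192
print(cur_value(8192 | 5))     # 8192, other bits are masked away
print(cur_value(4096))         # 0, bit 12 alone does not count
```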
The problem for the client was this – the total size of all the data segments in the KEEP cache was about 6GB and the total size of the KEEP cache was about 10GB, yet the database was still reporting a constant trickle of physical reads to the objects and, when the code to “re-load” the cache was executed at a quiet period at the start of the day, some 60,000 physical blocks had to be read. With a 10GB cache for 6GB of data would you really expect to see this I/O?
Take a look at the figures for object 40189:
There are 272,789 buffers for blocks that were “gotten in current mode” (bit 13 is set) and are also “exclusive current” (state 1) to the instance, but there are also 393,868 buffers that were originally “gotten in current mode” but are now “only valid for consistent read”.
Similarly there are 87,233 buffers for blocks that weren’t “gotten in current mode” but are “exclusive current” to the instance – in other words they are the most up to date version of the block but weren’t fetched with a “db block get”, and again there are 69,804 buffers holding blocks that were not “gotten in current mode” but which are now “only valid for consistent read”.
Buffers that are “only valid for consistent read” are buffers holding blocks that have been generated through one of Oracle’s mechanisms for creating CR (consistent read) clones. As you can see, then, a block that is a CR clone may still be flagged as “gotten in current mode”. In fact, in line with Oracle’s generally “lazy” approach to work you can even find (in some versions of Oracle, at least) CR clones that still have the “dirty” bit set in the flag, even though CR clones can never really be dirty and are NEVER written to disc.
Take another look at the buffer counts – this KEEP cache is sized at 1.2M buffers (10GB), but object 40189 alone has taken out 460,000 of those buffers (3.6GB) in block clones, and for this object there are more clones than originals (at 360,000, which happens to be just a few thousand blocks less than the size of the table). So, when you’re thinking about creating a KEEP cache, remember that you have to allow for block cloning – simply setting the db_keep_cache_size to something “a bit bigger” than the object you want to keep cached may not even be close to adequate.
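The rounded figures in that paragraph come straight from the four rows for object 40189 in the output above. As a quick check, here is the addition in Python (the dictionary layout and names are just my way of holding the figures):

```python
# (state, cur) -> buffer count for object 40189, copied from the output above.
obj_40189 = {
    (1, 0): 87233,
    (1, 8192): 272789,
    (3, 0): 69804,
    (3, 8192): 393868,
}
total_keep_buffers = 1216072   # the "sum" line of the output

# State 3 = CR clones, state 1 = exclusive current ("originals").
clones = sum(ct for (state, _), ct in obj_40189.items() if state == 3)
originals = sum(ct for (state, _), ct in obj_40189.items() if state == 1)

print(clones, originals)   # 463672 360022
print(f"{clones / total_keep_buffers:.0%} of the KEEP cache holds clones of this one object")
```

So the "460,000 clones" and "360,000 originals" are 463,672 and 360,022 respectively.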
Part 2 to follow soon.
For all who love Oracle performance.. Hotsos is truly the best conference: all the speakers are performance geeks, all the attendees talk about performance. Everything is about performance! There is a lot more stuff that I like about my first Hotsos experience; the following are some of them..
I like that presenters and attendees are curious about what each performance geek has to say..
I like that whenever their brains are already fried, they consume as much coffee/soda as they can.. and just lay their butts on this couch.. and still.. talk about performance.
A couple of weeks ago I listed a number of scripts from 184.108.40.206 relating to AWR reports – it seems only sensible to publish a corresponding list for Statspack. In fact, there are two such lists – one for “traditional” Statspack, and one for “standby statspack” – a version of statspack you can pre-install so that you can run statspack reports against a standby database.
Statspack – the basic source of information is $ORACLE_HOME/rdbms/admin/spdoc.txt
spreport.sql -> Generates a Statspack Instance report
sprepins.sql -> Generates a Statspack Instance report for the database and instance specified
sprepsql.sql -> Generates a Statspack SQL report for the SQL Hash Value specified
sprsqins.sql -> Generates a Statspack SQL report for the SQL Hash Value specified, for the database and instance specified
sppurge.sql  -> Purges a limited range of Snapshot Id's for a given database instance
sptrunc.sql  -> Truncates all Performance data in Statspack tables
spuexp.par   -> An export parameter file supplied for exporting the whole PERFSTAT user
Standby Statspack – the basic source of information is $ORACLE_HOME/rdbms/admin/sbdoc.txt
sbreport.sql - Create a report
sbaddins.sql - Add a standby database instance to the configuration
sblisins.sql - List instances in the standby configuration
sbdelins.sql - Delete an instance from the standby configuration
sbpurge.sql  - Purge a set of snapshots
A warning note about standby statspack – there is a significant structural change from 220.127.116.11 to 18.104.22.168 allowing support of multiple standby databases. I haven’t looked closely at it yet, but it did cross my mind that it might be possible to install the 22.214.171.124 version in an 126.96.36.199 database to avoid late upgrade issues – I may get around to testing the idea one day.
If you’re running a version of Oracle older than 10g (where v$active_session_history appeared), or if you’re not using Enterprise Edition, or if you just don’t want to pay for the Diagnostic Pack licence, here are some links relating to a free Java program that emulates the “Top Sessions” output of the Enterprise Manager screen:
If you want to query v$active_session_history (or any of the AWR objects) then you need to purchase the licence for the Diagnostic Pack. If you aren’t licensed you should only use the program in emulation mode.
Browsing through the archive for the Oracle-L listserver a couple of days ago I came across this item dated Feb 2011 where the author was puzzled by Oracle’s choice of index for a query.
He was using 10.2.0.3, and running with the optimizer_mode set to first_rows – which you shouldn’t really be doing with that version of Oracle since Oracle Corp. told us about 10 years ago that “first_rows is available only for backwards compatibility”.
I’ve created a model of their problem to demonstrate the effect. As usual, to make it easier to get a reproducible result, I’ve used locally managed tablespaces with 1MB uniform extents, freelist management, and CPU costing disabled:
create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from    dual
        connect by
                rownum <= 10000
)
select
        to_char(
                to_date('01-Jan-2011','dd-mon-yyyy') +
                        trunc((rownum-1)/317),
                'yyyymmdd'
        )                       a,
        mod(rownum,317) + 1     b,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',100)           padding
from
        generator       v1,
        generator       v2
where
        rownum <= 100000
;

alter table t1 add constraint t1_pk primary key(a,b);
create index t1_ba on t1(b,a);
The SQL creates 317 rows for a list of dates which have been stored as eight character strings in the form YYYYMMDD. The 317 rows are numbered from 1 to 317, and the data is stored in order of date and number. I’ve created a primary key on (date, number), and I’ve also created an index on (number, date) – the PK has a very good clustering_factor and the other index has a very bad one because of the way I generated the data.
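To make the shape of the data easier to picture, here is a rough Python model of the same row generation. It is only a sketch of the expressions in the create table statement (the function name make_rows and the tuple layout are my own), but it shows why the rows for one date sit next to each other while the rows for one value of b are scattered.

```python
from datetime import date, timedelta


def make_rows(row_count=100000):
    """Model the a and b columns of t1: returns (a, b, rownum) tuples in insert order."""
    start = date(2011, 1, 1)
    rows = []
    for rownum in range(1, row_count + 1):
        a = (start + timedelta(days=(rownum - 1) // 317)).strftime("%Y%m%d")
        b = rownum % 317 + 1
        rows.append((a, b, rownum))
    return rows


rows = make_rows()

# Rows matching the predicates a = '20110401' and b > 10.
wanted = [rownum for a, b, rownum in rows if a == "20110401" and b > 10]
print(len(wanted))                      # 307
print(wanted[-1] - wanted[0] + 1)       # 307, i.e. they are consecutive rows

# Rows for a single value of b are spread through the whole table.
b_11 = [rownum for a, b, rownum in rows if b == 11]
print(b_11[1] - b_11[0])                # 317 rows apart
```

Consecutive rows for a date mean the (a,b) index points at a handful of adjacent table blocks, while walking the (b,a) index jumps 317 rows at a time, which is what the good and bad clustering_factor figures reflect.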
With this data in hand, and after collecting statistics (compute, no histograms), I run the following SQL (and like the OP I am using 10.2.0.3):
alter session set optimizer_mode = first_rows;
set autotrace traceonly explain

select
        small_vc
from
        t1
where
        a = '20110401'
and     b > 10
order by
        a, b
;

select
        /*+ index(t1(a,b)) */
        small_vc
from
        t1
where
        a = '20110401'
and     b > 10
order by
        a, b
;
I’m after 307 consecutive rows of one date – and I want the data sorted by the date and number. With first_rows optimization the default plan is a little surprising. Here are two execution plans for the query – first the plan that the optimizer chose by default, the second when I hinted the SQL to use the primary key – note that neither plan shows a sort operation:
Default execution plan
---------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   307 |  7368 |   617 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |   307 |  7368 |   617 |
|*  2 |   INDEX SKIP SCAN           | T1_BA |   307 |       |   309 |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("B">10 AND "A"='20110401')
       filter("A"='20110401' AND "B">10)

Hinted execution plan
---------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   307 |  7368 |    10 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |   307 |  7368 |    10 |
|*  2 |   INDEX RANGE SCAN          | T1_PK |   307 |       |     2 |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("A"='20110401' AND "B">10)
How strange – it is clearly better to use the primary key index for this query, yet the optimizer doesn’t do it under first_rows optimisation. (It does if you use the slightly more appropriate first_rows(1) – the “new” improved option from 9i).
The first thought you might have when looking at this example is that first_rows has a heuristic (i.e. rule) that says “use an index to avoid sorting at all costs if possible (unless the hidden parameter _sort_elimination_cost_ratio is non-zero)”. But that shouldn’t apply here because both indexes will allow Oracle to avoid sorting.
And here’s an even stranger detail: notice that the “order by” clause includes column “a”, which is obviously constant because of the “where” clause. Since it’s constant removing it won’t make any difference to the final output - but look what happens:
select
        small_vc
from
        t1
where
        a = '20110401'
and     b > 10
order by
        b
;

---------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   307 |  7368 |    10 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |   307 |  7368 |    10 |
|*  2 |   INDEX RANGE SCAN          | T1_PK |   307 |       |     2 |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("A"='20110401' AND "B">10)
       filter("A"='20110401' AND "B">10)
For no obvious reason the optimizer now picks the right index. What’s going on ? Unfortunately I have to say that I don’t know.
When I checked the 10053 trace file there were a few differences for the two “order by” clauses but I couldn’t see anything that gave me any reasonable ideas. The most significant difference was the choice of indexes examined when the optimizer was doing the “recost for order by” bit. When we ordered by a,b the optimizer considered only the t1_ba index (note – the final costs are slightly higher here because in this run I enabled CPU costing to see if that was having an effect, so there’s a little extra for the CPU):
  Access Path: index (skip-scan)
    SS sel: 0.0030744  ANDV (#skips): 308
    SS io: 308.00 vs. index scan io: 321.00
    Skip Scan chosen
  Access Path: index (SkipScan)
    Index: T1_BA
    resc_io: 617.00  resc_cpu: 23882848
    ix_sel: 0.0030744  ix_sel_with_filters: 0.0030744
    Cost: 618.60  Resp: 618.60  Degree: 1
  Best:: AccessPath: IndexRange  Index: T1_BA
         Cost: 618.60  Degree: 1  Resp: 618.60  Card: 307.44  Bytes: 24
When we ordered by b alone the optimizer considered only the t1_pk index:
  Access Path: index (RangeScan)
    Index: T1_PK
    resc_io: 10.00  resc_cpu: 191334
    ix_sel: 0.0030744  ix_sel_with_filters: 0.0030744
    Cost: 10.01  Resp: 10.01  Degree: 1
  Best:: AccessPath: IndexRange  Index: T1_PK
         Cost: 10.01  Degree: 1  Resp: 10.01  Card: 307.44  Bytes: 24
There really seems to be a flaw in the logic behind the choice of index – and there’s an important point to think about here: if it’s a bug it’s probably not going to be fixed. The first_rows option only exists for “backwards compatibility” and things stop being compatible if you change them.
Footnote: Because the cost of the skip scan path in the original run was 617 and the cost of the primary key range scan path was 10, I could make Oracle choose the primary key by setting the parameter _sort_elimination_cost_ratio to a value just less than 617/10 (say 60); but I mention that only as an idle curiosity. You shouldn’t be using first_rows, and if you do use it you shouldn’t be hacking with undocumented parameters to work around the problems it produces.
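The arithmetic behind the "say 60" in that footnote is nothing more than the ratio of the two costs taken from the execution plans above. As a trivial Python check (variable names are mine, and this says nothing about how Oracle uses the parameter internally):

```python
# Costs taken from the two execution plans earlier in this note.
skip_scan_cost = 617
pk_range_scan_cost = 10

cost_ratio = skip_scan_cost / pk_range_scan_cost
print(cost_ratio)        # 61.7

# The value suggested in the footnote sits just below that ratio.
suggested_setting = 60
print(suggested_setting < cost_ratio)   # True
```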
Apress has made the draft versions of our Expert Oracle Exadata book available for purchase.
How this works is:
This is an awesome deal if you can’t wait until the final launch and want to get ahead of the curve with your Exadata skills ;-)
Buy the alpha version of our Expert Oracle Exadata book from Apress here!
If you haven’t heard about this book earlier – I’m one of the 3 authors, writing it together with Kerry Osborne and Randy Johnson from Enkitec. Our official tech reviewer is none other than THE Kevin Closson, and we are also getting some (unofficial) feedback from Oracle database junkie Arup Nanda.
So this book will absolutely rock and if you want a piece of it now, order the alpha book above!
P.S. This hopefully also explains why I’ve been so quiet with my blogging lately – can’t write a book and do many other things at the same time… (at least if you want to do it well…)