I know I have mentioned it before, but I am a big fan of the OEM performance screens that are derived from the ASH/AWR information. One of the things I really like about it is the immediate information it gives you, in one glance, that things are “not normal”. Once you notice that things are not normal you can then, within a few seconds, get a feel for what is still probably OK and where you have something that has changed.
As an example of the immediate information, I recently came back to my desk and glanced at my OEM performance screen. It was showing the below:
“data load has just ran” I said to my comrade-in-arms. “which one?” he asked. “The Delta – It ran the quick plan. But it started a bit late, 12:15. Oh, and looks like the transaction view code has swapped back to the full table scan plan and the summary code is not playing up at the moment.”
“you’re turning into Neo you are – can you see a lady in a red dress???” he asked.
That was of course a reference to the “Matrix” films where at times you see the virtual world displayed on a screen as a stream of characters running down the screen – but once you get used to it you can apparently “see” what is going.
The screen shot above is not even actually a very good example of what the performance screens can show you. One of my minor complaints about the performance screens is that it scales to show the greatest of the largest peak or a number of sessions to match the number of CPUs (real or fake) that are available to you. So if you have more CPU available than you need, you can’t see much detail in the graph. And if you have had a nasty peak of activity, again, all detail is squeezed out. In my case, the box is sized to cope in 12 months and the system is new, so activity is scuttling along the bottom of the graph.
However, “poor” though the example is, it told me what was going on across my system at a glance, something about the major tasks we are running, that one problem is currently occurring and that several of the other issues I need to keep an eye out for are not occurring.
That is why I love these screens – I recognise “my” activity patterns from the graph, I now recognise the SQL IDs for my key statements. If I see a pattern in the graph I don’t recognise, I need to check things out immediately. Three or four times over the last 2 weeks I have spotted an issues, started investigating and found out the cause before the Operations desk has even noticed an issue.
Oh, and what is SQL type 189? It is a merge statement. Our implementation of OEM is a little old, it does not correctly interpret that SQL command type. It might be a little old, it is still a lot useful.
People talk about the Oracle SQL Developer 3 being out, which is cool, but I have something even cooler for you today ;-)
I finally figured out how to convert my screen-recordings to uploadable videos, so that the text wouldn’t get unreadable and blurry.
So, here’s the first video, about a tool called MOATS, which we have built together with fellow OakTable Network member and a PL/SQL wizard Adrian Billington (of oracle-developer.net).
Here’s the video, it’s under 3 minutes long. Play the video in full screen for best results (and if it’s too slow loading, change it to lower resolution from HD mode):
Check it out and if you like MOATS, you can download it from Adrian’s website site (current version 1.05) and make sure you read the README.txt file in the zip!
Also thanks to Randolf Geist for finding and fixing some bugs in our alpha code… Note that MOATS is still kind of beta right now…
P.S. I will post my ORA-4031 and shared pool hacking video real soon now, too! :-)
P.P.S. Have you already figured out how it works?! ;-)
Update: Now you can suggest new features and improvement requests here:
Here’s a note I’ve been meaning to research and write up for more than 18 months – every since Dion Cho pinged a note I’d written about the effects of partitioning because of a comment it made about the “2% small table threshold”.
It has long been an item of common knowledge that Oracle has a “small table threshold” that allows for special treatment of data segments that are smaller than two percent of the size of the buffer cache, viz:
If a table is shorter than the threshold then it is read to the midpoint of the cache (just like any other block read) but – whether by accident or design – the touch count (x$bh.tch) is not set and the table will fall off the LRU end of the buffer cache fairly promptly as other objects are read into the buffer. Such a tablescan would be recorded under the statistic “table scans (short tables)”.
Then, in July 2009, Dion Cho decided to check this description before repeating it, and set about testing it on Oracle 10gR2 – producing some surprising results and adding another item to my to-do list. Since then I have wanted to check his conclusions, check whether the original description had ever been true and when (or if) it had changed.
As a simple starting point, of course, it was easy to check the description of the relevant (hidden) parameter to see when it changed:
184.108.40.206 _small_table_threshold threshold level of table size for forget-bit enabled during scan 220.127.116.11 _small_table_threshold threshold level of table size for direct reads 18.104.22.168 _small_table_threshold lower threshold level of table size for direct reads
This suggests that the behaviour might have changed some time in 9i (22.214.171.124 happened to be the earliest 9i listing of x$ksppi I had on file) – so I clearly had at least three major versions to check.
The behaviour of the cache isn’t an easy thing to test, though, because there are a number of special cases to consider – in particular the results could be affected by the positioning of the “mid-point” marker (x$kcbwds.cold_hd) that separates the “cold” buffers from the “hot” buffers. By default the hot portion of the default buffer is 50% of the total cache (set by hidden parameter _db_percent_hot_default) but on instance startup or after a “flush buffer cache” there are no used buffers so the behaviour can show some anomalies.
So here’s the basic strategy:
Here’s some sample code:
create table t_15400 pctfree 99 pctused 1 as with generator as ( select --+ materialize rownum id from dual connect by rownum <= 10000 ) select rownum id, lpad(rownum,10,'0') small_vc, rpad('x',100) padding from generator v1, generator v2 where rownum <= 15400 ; create index t_15400_id on t_15400(id); begin dbms_stats.gather_table_stats( ownname => user, tabname =>'T_15400', estimate_percent => 100, method_opt => 'for all columns size 1' ); end; / select object_name, object_id, data_object_id from user_objects where object_name in ( 'T_300', 'T_770', 'T_1540', 'T_3750', 'T_7700', 'T_15400', 'T_15400_ID' ) order by object_id ; select /*+ index(t) */ max(small_vc) from t_15400 t where id > 0 ;
The extract shows the creation of just the last and largest table I created and collected statistics for – and it was the only one with an index. I chose the number of blocks (I’ve rigged one row per block) because I had set up a db_cache_size of 128MB on my 10.2.0.3 Oracle instance and this had given me 15,460 buffers.
As you can see from the query against user_objects my test case included tables with 7,700 rows (50%), 3,750 rows (25%), 1,540 rows (10%), 770 rows (5%) and 300 rows (2%). (The number in brackets are the approximate sizes of the tables – all slightly undersized – relative to the number of buffers in the default cache).
Here’s the query that I then ran against x$bh (connected as sys from another session) to see what was in the cache (the range of values needs to be adjusted to cover the range of object_id reported from user_objects):
select obj, tch, count(*) from x$bh where obj between 77710 and 77720 group by obj, tch order by count(*) ;
After executing the first index range scan of t_15400 to fill the cache three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75855 0 1 75854 0 1 75853 0 1 75851 0 1 75850 0 1 75849 0 1 75852 0 1 75855 2 9 -- Index blocks, touch count incremented 75855 1 18 -- Index blocks, touch count incremented 75854 1 11521 -- Table blocks, touch count incremented
Then after three tablescans, at 4 second intervals, of the 7,700 block table:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 -- segment header of 7700 table, touch count incremented each time 75855 0 1 75854 0 1 75852 0 1 75849 0 1 75850 0 1 75851 0 1 75855 2 9 75855 1 10 75853 0 3991 -- lots of blocks from 7700 table, no touch count increment 75854 1 7538
Then repeating the tablescan of the 3,750 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75851 0 1 75852 3 1 -- segment header block, touch count incremented each time 75849 0 1 75850 0 1 75855 2 9 75855 1 10 75853 0 240 75852 0 3750 -- table completely cached - touch count not incremented 75854 1 7538
Then repeating the tablescan of the 1,540 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75851 3 1 -- segment header block, touch count incremented each time 75849 0 1 75850 0 1 75852 3 1 75855 2 9 75855 1 10 75853 0 149 75851 2 1540 -- Table fully cached, touch count incremented but only to 2 75852 0 2430 75854 1 7538
Then executing the tablescan of the 770 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75850 3 1 -- segment header block, touch count incremented each time 75849 0 1 75851 3 1 75852 3 1 75854 0 1 75855 2 9 75855 1 10 75851 0 69 75853 0 149 75850 2 770 -- Table fully cached, touch count incremented but only to 2 75851 2 1471 75852 0 1642 75854 1 7538
Finally executing the tablescan of the 300 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75850 3 1 75852 3 1 75851 3 1 75855 2 9 75855 1 10 75851 0 69 75850 0 131 75853 0 149 75849 3 301 -- Table, and segment header, cached and touch count incremented 3 times 75850 2 639 75852 0 1342 75851 2 1471 75854 1 7538
This set of results on its own isn’t conclusive, of course, but the indications for 10.2.0.3 are:
I can’t state with any certainty where the used and recycled buffers might be, but since blocks from the 3750 tablescan removed the blocks from the 7700 tablescan, it’s possible that “large” tablescans do somehow go “to the bottom quarter” of the LRU.
There also some benefit in checking the statistics “table scans (short)” and “table scans (long)” as the tests run. For the 2% (300 block) table I recorded 3 short tablescans; for the tables in the 2% to 10% range (770 and 1540) I recorded one long and two short (which is consistent with the touch count increment of 2 – the first scan was expected to be long, but the 2nd and 3rd were deemed to be short based on some internal algorithm about the tables being fully cached); finally for the tables above 10% we always got 3 long tablescans.
But as it says in the original note on small partitions – there are plenty of questions still to answer:
I’ve quoted the 2% as the fraction of the db_cache_size – but we have automatic SGA management in 10g, automatic memory management in 11g, and up to eight different cache sizing parameters in every version from 9i onwards. What figure is used as the basis for the 2%, and is that 2% of the blocks or 2% of the bytes, and if you have multiple block sizes does each cache perhaps allow 2% of its own size.
And then, in 11g we have to worry about automatic direct path serial tablescans – and it would be easy to think that the “_small_table_threshold” may have been describing that feature since (at least) 126.96.36.199 if its description hadn’t changed slightly for 11.2 !
So much to do, so little time — but at least you know that there’s something that needs careful investigation if you’re planning to do lots of tablescans.
Footnote: Having written some tests, it’s easy to change versions. Running on 188.8.131.52 and 184.108.40.206, with similar sized caches, I could see that the “traditional” description of the “small_table_threshold” was true – a short tablescan was anything less 2% of the buffer cache, long tablescans were (in effect) done using just a window of “db_file_multiblock_read_count” buffers, and in both cases the touch count was never set (except for the segment header block).
A few days ago I looked into a SQL Tracefile of some LOB access code and saw a LOBREAD entry there. This is a really welcome improvement (or should I say, bugfix of a lacking feature) for understanding resource consumption by LOB access OPI calls. Check the bottom of the output below:
*** 2011-03-17 14:34:37.242 WAIT #47112801352808: nam='SQL*Net message from client' ela= 189021 driver id=1413697536 #bytes=1 p3=0 obj#=99584 tim=1300390477242725 WAIT #0: nam='gc cr multi block request' ela= 309 file#=10 block#=20447903 class#=1 obj#=99585 tim=1300390477243368 WAIT #0: nam='cell multiblock physical read' ela= 283 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477243790 WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390477243865 [...snipped...] WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244205 WAIT #0: nam='SQL*Net more data to client' ela= 4 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244221 WAIT #0: nam='gc cr multi block request' ela= 232 file#=10 block#=20447911 class#=1 obj#=99585 tim=1300390477244560 WAIT #0: nam='cell multiblock physical read' ela= 882 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477245579 WAIT #0: nam='SQL*Net more data to client' ela= 16 driver id=1413697536 #bytes=2020 p3=0 obj#=99585 tim=1300390477245685 WAIT #0: nam='SQL*Net more data to client' ela= 6 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477245706 WAIT #0: nam='SQL*Net more data to client' ela= 5 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390477245720 #ff0000;">LOBREAD: c=1000,e=2915,p=8,cr=5,cu=0,tim=1300390477245735
In past versions of Oracle the CPU (c=) usage figures and other stats like number of physical/logical reads of the LOB chunk read OPI call were just lost – they were never reported in the tracefile. In past only the most common OPI calls, like PARSE, EXEC, BIND, FETCH (and recently CLOSE cursor) were instrumented with SQL Tracing. But since 11.2(.0.2?) the LOBREAD’s are printed out too. This is good, as it reduces the amount of guesswork needed to figure out what are those WAITs for cursor #0 – which is really a pseudocursor.
Why cursor#0? It’s because normally, with PARSE/EXEC/BIND/FETCH, you always had to specify a cursor slot number you operated on (if you fetch from cursor #5, it means that Oracle process went to slot #5 in the open cursor array in your session’s UGA and followed the pointers to shared cursor’s executable parts in library cache from there). But LOB interface works differently – if you select a LOB column using your query (cursor), then all your application gets is a LOB LOCATOR (sort of a pointer with LOB item ID and consistent read/version SCN). Then it’s your application which must issue another OPI call (LOBREAD) to read the chunks of that LOB out from the database. And the LOB locator is independent from any cursors, it doesn’t follow the same cursor API as regular SQL statements (as it requires way different functionality compared to a regular select or update statement).
So, whenever a wait happened in your session due to an access using a LOB locator, then there’s no specific cursor responsible for it (as far as Oracle sees internally) and that’s why a fake, pseudocursor #0 is used.
Note that on versions earlier than 11.2(.0.2?) when the LOBREAD wasn’t printed out to trace – you can use OPI call tracing (OPI stands for Oracle Program Interface and is the server-side counterpart to OCI API in the client side) using event 10051. First enable SQL Trace and then the event 10051 (or the other way around if you like):
SQL> @oerr 10051 ORA-10051: trace OPI calls SQL> alter session set events '10051 trace name context forever, level 1'; Session altered.
Now run some LOB access code and check the tracefile:
*** 2011-03-17 14:37:07.178 WAIT #47112806168696: nam='SQL*Net message from client' ela= 6491763 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627178602 OPI CALL: type=105 argc= 2 cursor= 0 name=Cursor close all CLOSE #47112806168696:c=0,e=45,dep=0,type=1,tim=1300390627178731 OPI CALL: type=94 argc=28 cursor= 0 name=V8 Bundled Exec ===================== PARSING IN CURSOR #47112802701552 len=19 dep=0 uid=93 oct=3 lid=93 tim=1300390627179807 hv=1918872834 ad='271cc1480' sqlid='3wg0udjt5zb82' select * from t_lob END OF STMT PARSE #47112802701552:c=1000,e=1027,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179805 EXEC #47112802701552:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179884 WAIT #47112802701552: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627179939 WAIT #47112802701552: nam='SQL*Net message from client' ela= 238812 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418785 OPI CALL: type= 5 argc= 2 cursor= 26 name=FETCH WAIT #47112802701552: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418945 FETCH #47112802701552:c=0,e=93,p=0,cr=5,cu=0,mis=0,r=1,dep=0,og=1,plh=3547887701,tim=1300390627418963 WAIT #47112802701552: nam='SQL*Net message from client' ela= 257633 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676629 #ff0000;">OPI CALL: type=96 argc=21 cursor= 0 name=#ff0000;">LOB/FILE operations WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676788 [...snip...] WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390627677054 LOBREAD: c=0,e=321,p=0,cr=5,cu=0,tim=1300390627677064
Check the bold and especially the red string above. Tracing OPI calls gives you some extra details of what kind of tasks are executed in the session. The “LOB/FILE operations” call indicates that whatever lines come after it (unlike SQL trace call lines where all the activity happens before a call line is printed (with some exceptions of course)) are done for this OPI call (until a next OPI call is printed out). OPI call tracing should work even on ancient database versions…
By the way, if you are wondering, what’s the cursor number 47112801352808 in the “WAIT #47112801352808″ above? Shouldn’t the cursor numbers be small numbers?
Well, in 220.127.116.11 this was also changed. Before that, the X in CURSOR #X (and PARSE #X, BIND #X, EXEC #X, FETCH #X) represented the slot number in your open cursor array (controlled by open_cursors) in your session’s UGA. Now, the tracefile dumps out the actual address of that cursor. 47112801352808 in HEX is 2AD94DC9FC68 and it happens to reside in the UGA of my session.
Naturally I asked Cary Millsap about whether he had spotted this LOBREAD already and yes, Cary’s way ahead of me – he said that Method-R’s mrskew tool v2.0, which will be out soon, will support it too.
It’s hard to not end up talking about Cary’s work when talking about performance profiling and especially Oracle SQL trace, so here are a few very useful bits which you should know about:
If you want to understand the SQL trace & profiling stuff more, then the absolute must document is Cary’s paper on the subject – Mastering Performance with Extended SQL Trace:
Also, if you like to optimize your work like me (in other words: you’re proactively lazy ;-) and you want to avoid some boring “where-the-heck-is-this-tracefile-now” and “scp-copy-it-over-to-my-pc-for-analysis” work then check out Cary’s MrTrace plugin (costs ~50 bucks and has a 30-day trial) for SQL Developer. I’ve ended up using it myself regularly although I still tend to avoid GUIs:
Greg Rahn points out that the improved NDV estimation arrives in 11.1 not 11.2 (which my article distinctly odd implies) see http://structureddata.org/2007/09/17/oracle-11g-enhancements-to-dbms_stats/ for more on this.
Free stuff! Free stuff! Free stuff! :-)
The awesome dudes at E2SN have done it again! (and yes, Tom, this time the “we at E2SN Ltd” doesn’t mean only me alone ;-)
On Tuesday 22nd March I’ll hold two (yes two) Secret Oracle Hacking Sessions – about ORA-04031: unable to allocate x bytes of shared memory errors, cursor management issues and other shared pool related problems (like fragmentation). This event is free for all! You’ll just need to be fast enough to register, both events have 100 attendee limit (due to my GotoWebinar accont limitations).
I am going to run this online event twice, so total 200 people can attend (don’t register for both events, please). One event is in the morning (my time) to cater for APAC/EMEA region and the other session is for EMEA/US/Americas audience.
The content will be the same in both sessions. There will be no slides (you cant fix your shared pool problems with slides!) but there will be demos, scripts, live examples and fun (for the geeks among us anyway – others go and read some slides instead ;-)!
March 16, 2011 (Back to the Previous Post in the Series) In the previous articles in this series we looked at ways to analyze the Oracle time model data at the system-wide level with drill-down into the session level detail, with cross references to a handful of statistics found in V$OSSTAT and V$SYSSTAT, and the [...]
As I’m finishing up a performance chapter for the Exadata book (a lot of work!), I thought to take a quick break and write a blog entry.
This is not really worth putting into my Oracle Exadata Performance series (which so far has only 1 article in it anyway) .. so this is a little stand-alone article …
Everybody knows that the Exadata smart scan can be used when scanning tables (and table partitions). You should also know that smart scan can be used with fast full scan on Oracle B-tree indexes (a fast full scan on an index segment is just like a full table scan, only on the index segment (and ignoring branch blocks)).
For some reason there’s a (little) myth circulating that smart scans aren’t used for scanning bitmap indexes.
So, here’s evidence, that smart scan can be used when scanning bitmap indexes:
SQL> select /*+ tanel3 */ count(*) from t1 where owner like '%XYZXYZ%'; ... Plan hash value: 39555139 ----------------------------------------------------------------------------------- | Id | Operation | Name | E-Rows | Cost (%CPU)| ----------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 505 (100)| | 1 | SORT AGGREGATE | | 1 | | | 2 | BITMAP CONVERSION COUNT | | 400K| 505 (0)| |* 3 | BITMAP INDEX STORAGE FAST FULL SCAN| BI_T1_OWNER | | | ----------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - storage(("OWNER" LIKE '%XYZXYZ%' AND "OWNER" IS NOT NULL)) filter(("OWNER" LIKE '%XYZXYZ%' AND "OWNER" IS NOT NULL))
So, as you see the execution plan sure shows a FAST FULL SCAN on a BITMAP INDEX segment, which happens to be on Exadata STORAGE.
Also, you see a storage() predicate applied on the line 3 of the execution plan, which means that Oracle will attempt to use a smart scan predicate offload – but this can’t always be done!
So, you can’t really determine whether a smart scan happened during execution just by looking into the execution plan, you should really check some V$SESSION statistics too. That’s where my Snapper script becomes handy.
I started Snapper on my session just before running the above query. The “smart table scan” and “smart index scan” performance counters are updated right after Oracle has opened the segment header and determines, from the number of blocks in the segment, whether to call the smart scan codepath or not. In other words, the smart scan counters are inremented in the beginning of the segment scan.
The output is following (some irrelevant counters are stripped for brevity):
@snapper all 5 1 "301" Sampling SID 301 with interval 5 seconds, taking 1 snapshots... setting stats to all due to option = all -- Session Snapper v3.52 by Tanel Poder @ E2SN ( http://tech.e2sn.com ) ------------------------------------------------------------------------------------------------------------------------------------- SID, USERNAME , TYPE, STATISTIC , HDELTA, HDELTA/SEC, %TIME, GRAPH ------------------------------------------------------------------------------------------------------------------------------------- 301, TANEL , STAT, physical read total IO requests , 13, 2.6, 301, TANEL , STAT, physical read total multi block requests , 4, .8, 301, TANEL , STAT, physical read requests optimized , 1, .2, 301, TANEL , STAT, physical read total bytes optimized , 8.19k, 1.64k, 301, TANEL , STAT, physical read total bytes , 4.63M, 925.7k, 301, TANEL , STAT, cell physical IO interconnect bytes , 10.02k, 2k, 301, TANEL , STAT, physical reads , 565, 113, 301, TANEL , STAT, physical reads cache , 1, .2, 301, TANEL , STAT, physical reads direct , 564, 112.8, 301, TANEL , STAT, physical read IO requests , 13, 2.6, 301, TANEL , STAT, physical read bytes , 4.63M, 925.7k, 301, TANEL , STAT, db block changes , 1, .2, 301, TANEL , STAT, cell physical IO bytes eligible for predicate offload , 4.62M, 924.06k, 301, TANEL , STAT, cell physical IO interconnect bytes returned by smart scan, 1.82k, 364.8, 301, TANEL , STAT, cell blocks processed by cache layer , 564, 112.8, 301, TANEL , STAT, cell blocks processed by txn layer , 564, 112.8, 301, TANEL , STAT, cell blocks processed by index layer , 564, 112.8, 301, TANEL , STAT, cell blocks helped by minscn optimization , 564, 112.8, 301, TANEL , STAT, cell index scans , 1, .2, 301, TANEL , STAT, index fast full scans (full) , 1, .2, 301, TANEL , STAT, index fast full scans (direct read) , 1, .2, 301, TANEL , STAT, bytes sent via SQL*Net to client , 334, 66.8, 301, TANEL , STAT, bytes received via SQL*Net from client , 298, 59.6, 301, TANEL , STAT, SQL*Net roundtrips to/from client , 2, .4, 301, TANEL , STAT, cell flash cache read hits , 1, .2, 301, TANEL , TIME, hard parse elapsed time , 1.17ms, 233.8us, .0%, | | 301, TANEL , TIME, parse time elapsed , 1.5ms, 300.2us, .0%, | | 301, TANEL , TIME, DB CPU , 11ms, 2.2ms, .2%, | | 301, TANEL , TIME, sql execute elapsed time , 82.2ms, 16.44ms, 1.6%, |@ | 301, TANEL , TIME, DB time , 84.36ms, 16.87ms, 1.7%, |@ | 301, TANEL , WAIT, enq: KO - fast object checkpoint , 16.18ms, 3.24ms, .3%, | | 301, TANEL , WAIT, gc cr grant 2-way , 223us, 44.6us, .0%, | | 301, TANEL , WAIT, gc current grant 2-way , 136us, 27.2us, .0%, | | 301, TANEL , WAIT, cell smart index scan , 56.04ms, 11.21ms, 1.1%, |@ | 301, TANEL , WAIT, SQL*Net message to client , 7us, 1.4us, .0%, | | 301, TANEL , WAIT, SQL*Net message from client , 4.42s, 884.47ms, 88.4%, |@@@@@@@@@ | 301, TANEL , WAIT, cell single block physical read , 541us, 108.2us, .0%, | | 301, TANEL , WAIT, events in waitclass Other , 2.22ms, 443.2us, .0%, | | -- End of Stats snap 1, end=2011-03-13 19:36:31, seconds=5
As you see from the above “cell index scans” statistic – indeed one index segment was scanned using the cell smart scan method.
So, I would rather call this feature “smart segment scan” to reflect that smart scan can scan more than just tables…
I guess one of the reasons why few people have seen smart bitmap index scans in action is that (single-column) bitmap indexes tend to be small. Smaller than corresponding table segments and B-tree index segments. On partitioned tables they’re much more likely going to be under the “_small_table_threshold” calculation which is used for determining whether to do a direct path full segment scan or not (yes, the _small_table_threshold applies to fast full index scan and fast full bitmap index scan too, not just table scans). So, it’s likely that Oracle chooses to do a regular, buffered full bitmap segment scan and thus won’t even consider using smart scan (as smart scans require direct path reads).
By the way – the direct path read (or not) decision is done per segment – not per object (like a table or index). So if you have 10 partitions in a table (or index), half of them are large, half are smaller, then Oracle may end up using direct path reads (and smart scan) on 5 of them and buffered (dumb) scan on the other 5. If you run something like Snapper on the session, then you’d see the smart scan counters go up by 5 only. As written above, Oracle decides whether to do direct path reads (and smart scan) right after opening the header block of a segment (partition) and reading out how many blocks this partition’s segment has below HWM.
The above applied to serial direct path reads – the Parallel Execution slaves should always read using direct path mode, right? …. Wrong :)
Well, partially wrong… In 18.104.22.168, if the parallel_degree_policy = manual, then yes, PX slaves behave like usual and always force a direct path read (and try to use a smart scan). However, with parallel_degree_policy = AUTO, which is the future of PX auto-management, Oracle can decide to do a buffered parallel scan instead, again disabling the use of smart scan…
One more note – I didn’t say anything about whether you should or should not use (bitmap) indexes on Exadata, it’s an entirely different discussion. I just brought out that the smart scan is used for scanning table segments, B-tree index segments and bitmap index segments if conditions are right.
And in the end I have to say…. that even with this evidence you can’t be fully sure that a smart scan was used throughout the entire segment, but more about this in the book and perhaps in a later blog article. We have interesting times ahead ;-)
As my frequent readers know, I have promised to not travel anymore as it’s just too much hassle compared to the benefit of being “there”. This is why I’m going to fly to London on Monday, 18th April to speak at the UKOUG Exadata Special Event. This event is just too sexy to be missed, so I made an exception (the last one, I promise!)… and it’s probably going to be warmer there as well compared to where I am now :-)
I will be talking about what’s been my focus area for last year or so – Oracle Exadata Performance.
Dan Norris and Alex Gorbachev will be speaking there too, so it should end up being a pretty awesome event!
More details here:
My abstract is following:
In order to systematically troubleshoot and optimize Exadata performance, one must understand the meaning of its performance metrics.
This session provides a deep technical walkthrough of how Exadata IO and smart scans work and how to use relevant metrics for troubleshooting related performance issues. We will review both Exadata database and cell-level metrics, cell wait events and tools useful for troubleshooting. We will also look into metrics related to Exadata Hybrid Columnar Compression and the cell Flash Cache usage.
P.S. The reason why I called this post “Exadata Training” is that you’ll learn some real world practical stuff there… as opposed to the marketing material (and marketing material copy material) overdose out there… ;-)
There was a question in LinkedIn forum about whether Linux Hugepages should be used in Oracle Exadata Database layer, as they aren’t enabled by default during ACS install. I’m putting my answer into this blog entry – apparently LinkedIn forums have a limit of 4000 characters per reply… (interestingly familiar number, by the way…:)
So, I thought that it’s time to start writing my Oracle Exadata Performance series articles what I’ve planned for a while… with some war stories from the field, some stuff what I’ve overcome when researching for writing the Expert Oracle Exadata book etc.
Here’s the first article (initially planned as a short response in LinkedIn, but it turned out much longer though):
As far as I’ve heard, the initial decision to not enable hugepages by default was that the hugepages aren’t flexible & dynamic enough – you’ll have to always configure the hugepages at OS level to match your desired SGA size (to avoid wastage). So, different shops may want radically different SGA sizes (larger SGA for single-block read oriented databases like transactional/OLTP or OLAP cubes), but smaller SGA for smart scan/parallel scan oriented DWs. If you configure 40GB of hugepages on a node, but only use 1GB of SGA, then 39GB memory is just reserved, not used, wasted – as hugepages are pre-allocated. AMM, using regular pages, will only use the pages what it touches, so there’s no memory wastage due to any pre-allocation issues…
So, Oracle chose to use an approach which is more universal and doesn’t require extra OS level configuration (which isn’t hard at all though if you pay attention, but not all people do). So, less people will end up in trouble with their first deployments although they might not be getting the most out of their hardware.
However, before enabling hugepages “because it makes things faster” you should ask yourself what exact benefit would they bring you?
There are 3 main reasons why hugepages may be useful in Linux:
1) Smaller kernel memory usage thanks to less PTEs thanks to larger pagesizes
This means less pagetable entries (PTEs) and less kernel memory usage. The bigger your SGA and the more processes you have logged on, the bigger the memory usage.
You can measure this in your case – just “grep Page /proc/meminfo” and see how big portion of your RAM has been used by “PageTables”. Many people have blogged about this, but Kevin Closson’s blog is probably the best source to read about this:
2) Lower CPU usage due to less TLB misses in CPU and soft page-fault processing when accessing SGA.
It’s harder to measure this on Linux with standard tools, although it is sure possible (on Solaris you can just run prstat -m to get microstate accounting and look into TFL,DFL,TRP stats).
Anyway, the catch here is that if you are running parallel scans and smart scans, then you don’t access that much of buffer cache in SGA at all, all IOs or smart scan result-sets are read directly to PGAs of server processes – which don’t use large pages at all, regardless of whether hugepages for SGA have been configured or not. There are some special cases, like when a block clone has to be rolled back for read consistency, you’ll have to access some undo blocks via buffer cache… but again this should be a small part of total workload.
So, in a DW, which using mostly smarts scans or direct path reads, there won’t be much CPU efficiency win from large pages as you bypass buffer cache anyway and use small pages of private process memory. All the sorting, hashing etc all happens using small pages anyway. Again I have to mention that on (my favorite OS) Solaris it is possible to configure even PGAs to use large pages (via _realfree_heap_pagesize_hint parameter) … so it’ll be interesting to see how this would help DW workloads on the Exadata X2-8 monsters which can run Solaris 11.
3) Lock SGA pages into RAM so they won’t be paged out when memory shortage happens (for whatever reason).
Hugepages are pre-allocated and never paged out. So, when you have extreme memory shortage, your SGAs won’t be paged out “by accident”. Of course it’s better to ensure that such memory shortages won’t happen – configure the SGA/PGA_AGGREGATE_TARGET sizes properly and don’t allow third party programs consume crazy amounts of memory etc. Of course there’s the lock_sga parameter in Oracle which should allow to do this on Linux with small pages too, but first I have never used it on Linux so I don’t know whether it works ok at all and also in 11g AMM perhaps the mlock() calls aren’t supported on the /dev/shm files at all (haven’t checked and don’t care – it’s better to stay away from extreme memory shortages). Read more about how the AMM MEMORY_TARGET (/dev/shm) works from my article written back in 2007 when 11g came out ( Oracle 11g internals – Automatic Memory Management ).
So, the only realistic win (for DW workload) would be the reduction of kernel pagetables structure size – and you can measure this using PageTables statistic in /proc/meminfo. Kevin demonistrated in his article that 500 connections to an instance with ~8 GB SGA consisting of small pages resulted in 7 GB of kernel pagetables usage, while the usage with large pages (still 500 connections, 8 GB SGA) was about 265 MB. So you could win over 6 GB of RAM, which you can then give to PGA_AGGREGATE_TARGET or to further inrease SGA. The more processes you have connected to Oracle, the more pagetable space is used… Similarly, the bigger the SGA is, the more pagetable space is used…
This is great, but the tradeoff here is manageability and some extra effort you have to put in to always check whether the large pages actually got used or not. After starting up your instance, you should really check whether the HugePages_Free in /proc/meminfo shrank and HugePages_Rsvd increased (when instance has just started up and Oracle hasn’t touched all the SGA pages yet, some pages will show up as Rsvd – reserved).
With a single instance per node this is trivial – you know how much SGA you want and pre-allocate the amount of hugepages for that. If you want to increase the SGA, you’ll have to shut down the instance and increase the Linux hugepages setting too. This can be done dynamically by issuing a command like echo N > /proc/sys/vm/nr_hugepages (where N is the number of huge pages), BUT in real life this may not work out well as if Linux kernel can’t free enough small pages from right physical RAM locations to consolidate 2 or 4 MB contiguous pages, the above command may fail to create the requested amount of new hugepages.
And this means you should restart the whole node to do the change. Note that if you increase your SGA larger to the number of hugepages (or you forget to increase the memlock setting in /etc/security/limits.conf accordingly) then your instance will silently just use the small pages, while all the memory pre-allocated for hugepages stays reserved for hugepages and is not usable for anything else!).
So, this may become more of a problem when you have multiple database instances per cluster node or you expect to start up and shut down instances on different nodes based on demand (or when some cluster nodes fail).
Long story short – I do configure hugepages in “static” production environments, to save kernel memory (and some CPU time for OLTP type environments using buffer cache heavily), also on Exadata. However for various test and development environments with lots of instances per server and constant action, I don’t bother myself (and the client) with hugepages and make everyone’s life easier… Small instances with small number of connections won’t use that many PTEs anyway…
For production environments with multiple database instances per node (and where failovers are expected) I would take the extra effort to ensure that whatever hugepages I have preallocated, won’t get silently wasted because an instance wants more SGA than the available hugepages can accommodate. You can do this by monitoring /proc/meminfo’s HugePage entries as explained above. And remember, the ASM instance (which is started before DB instances) will also grab itself some hugepages when it starts!