Search

OakieTags

Who's online

There are currently 1 user and 50 guests online.

Online users

Recent comments

Affiliations

Troubleshooting

ASSM wreck

Yesterday I introduced a little framework I use to avoid the traps inherent in writing PL/SQL loops when modelling a session that does lots of simple calls to the database. I decided to publish the framework because I had recently come across an example where a series of SQL statements gives a very different result from a single PL/SQL block.

The model starts with a simple data set – which in this case is created in a tablespace using ASSM (automatic segment space management), an 8KB block size and 1MB uniform extents (in a locally management tablespace).


create table t1
tablespace test_8k_assm
as
select
	trunc((rownum-1)/100)	n1,
	lpad('x',40)		v1,
	rpad('x',100)		padding
from
	dual
connect by
	rownum <= 20000
;

create index t1_i1 on t1(n1, v1)
tablespace test_8k_assm
;

validate index t1_i1;
execute print_table('select * from index_stats');

You can see that the n1 column is defined to have 200 rows for each of 100 different values, and that each set of two hundreds rows is stored (at least initially) in a very small cluster of blocks.

With the data set in place I am now going to pick a set of two hundred rows at random, delete it, re-insert it, and commit; and I’m going to repeat that process 1,000 times.

declare
	rand	number(3);
begin
	for i in 1..1000 loop

		rand := trunc(dbms_random.value(0,200));

		delete from t1
		where n1 = rand
		;

		insert into t1
		select
			rand,
			lpad('x',40),
			rpad('x',100)
		from
			dual
		connect by
			rownum <= 100
		;

		commit;

	end loop;
end;
/

validate index t1_i1;
execute print_table('select * from index_stats');

You might think that this piece of code is a little strange – but it is a model of some processing that I’ve recently seen on a client site, and it has crossed my mind that it might appear in a number of systems hidden underneath the covers of dbms_job. So what does it do to the index ?

Given the delay that usually appears between the time an index entry is marked as deleted and the time that the space can be reused, and given the way I’ve engineered my date so that the space needed for the 200 rows for each key value is little more than a block (an important feature of this case), I wouldn’t be too surprised if the index had stabilised at nearly twice its original size. But that’s not what happened to my example running under ASSM. Here are the “before” and “after” results from my test:


                       Before         After
LF_ROWS                20,000        70,327
LF_BLKS                   156           811
LF_ROWS_LEN         1,109,800     3,877,785
BR_ROWS                   155           810
BR_BLKS                     3            10
BR_ROWS_LEN             8,903        45,732
DEL_LF_ROWS                 0        50,327
DEL_LF_ROWS_LEN             0     2,767,985
DISTINCT_KEYS             200           190
MOST_REPEATED_KEY         100         1,685
BTREE_SPACE         1,272,096     6,568,320
USED_SPACE          1,118,703     3,923,517
PCT_USED                   88            60
ROWS_PER_KEY              100           370
BLKS_GETS_PER_ACCESS       54           189

It’s a small disaster – our index has grown in size by a factor of about five, and we have more deleted rows than “real” rows. (Note, by the way, that the awfulness of the index is NOT really indicated by the PCT_USED figure – one which is often suggested as an indicator of the state of an index).

Unfortunately this is the type of problem that doesn’t surprise me when using ASSM; it’s supposed to help with highly concurrent OLTP activity (typified by a large number of very small transactions) but runs into problems updating free space bitmaps whenever you get into “batch-like” activity.

However, there is a special consideration in play here – I’ve run the entire operation as a single pl/sql loop. Would the same problem appear if I ran each delete/insert cycle as a completely independent SQL script using the “start_1000.sql” script from my previous note ?

To test the effect of running 1,000 separate tasks, rather than executing a single pl/sql loop, I wrote the following code into the start_1.sql script that I described in the article before running start_1000.sql:


declare
	rand	number(3);
begin

	rand := trunc(dbms_random.value(0,200));

	delete from t1
	where n1 = rand
	;

	insert into t1
	select
		rand,
		lpad('x',40),
		rpad('x',100)
	from
		dual
	connect by
		rownum <= 100
	;

	commit;

end;
/

The impact was dramatically different. (Still very wasteful, but quite a lot closer to the scale of the results that you might expect from freelist management).


                       Before         After
                    ---------     ---------
LF_ROWS                20,000        39,571
LF_BLKS                   156           479
LF_ROWS_LEN         1,109,800     2,196,047
BR_ROWS                   155           478
BR_BLKS                     3             6
BR_ROWS_LEN             8,903        26,654
DEL_LF_ROWS                 0        19,571
DEL_LF_ROWS_LEN             0     1,086,247
DISTINCT_KEYS             200           199
MOST_REPEATED_KEY         100           422
BTREE_SPACE         1,272,096     3,880,192
USED_SPACE          1,118,703     2,222,701
PCT_USED                   88            58
ROWS_PER_KEY              100           199
BLKS_GETS_PER_ACCESS       54           102

I haven’t yet investigated why the pl/sql loop should have produced such a damaging effect – although I suspect that it might be a side effect of the pinning of bitmap blocks (amongst others, of course) that takes place within a single database call. It’s possible that the repeated database calls from SQL*Plus keep “rediscovering” bitmap blocks that show free space while the pinning effects stop the pl/sql from “going back” to bitmap blocks that have recently acquired free space.

Interestingly the impact of using ASSM was dramatically reduced if one object used freelists and the other used ASSM – and with my specific example the combination of a freelist table with an ASSM index even did better than the expected 50% usage from the “traditional” option of using freelists for both the table and index.

Note – the purpose of this note is NOT to suggest that you should avoid using ASSM in general; but if you can identify code in your system that is doing something similar to the model then it’s worth checking the related indexes (see my index efficiency note) to see if any of them are displaying the same problem as this test case. If they are you may want to do one of two things: think about a schedule for coalescing or even rebuilding problem indexes on a regular basis, or see if you can move the table, index, or both, into a tablespace using freelist management.

MOATS: The Mother of All Tuning Scripts!

People talk about the Oracle SQL Developer 3 being out, which is cool, but I have something even cooler for you today ;-)

I finally figured out how to convert my screen-recordings to uploadable videos, so that the text wouldn’t get unreadable and blurry.

So, here’s the first video, about a tool called MOATS, which we have built together with fellow OakTable Network member and a PL/SQL wizard Adrian Billington (of oracle-developer.net).

Here’s the video, it’s under 3 minutes long. Play the video in full screen for best results (and if it’s too slow loading, change it to lower resolution from HD mode):

Check it out and if you like MOATS, you can download it from Adrian’s website site (current version 1.05) and make sure you read the README.txt file in the zip!

Also thanks to Randolf Geist for finding and fixing some bugs in our alpha code… Note that MOATS is still kind of beta right now…

P.S. I will post my ORA-4031 and shared pool hacking video real soon now, too! :-)

P.P.S. Have you already figured out how it works?! ;-)

Update: Now you can suggest new features and improvement requests here:

Small Tables

Here’s a note I’ve been meaning to research and write up for more than 18 months – every since Dion Cho pinged a note I’d written about the effects of partitioning because of a comment it made about the “2% small table threshold”.

It has long been an item of common knowledge that Oracle has a “small table threshold” that allows for special treatment of data segments that are smaller than two percent of the size of the buffer cache, viz:

    If a table is longer than the threshold then a full tablescan of the table will only use db_file_multiblock_read_count buffers at the end of the LRU (least recently used) chain to read the table and (allowing a little inaccuracy for multi-user systems, pinning and so on) keeps recycling the same few buffers to read the table thus protecting the bulk of the buffer cache from being wiped out by a single large tablescan. Such a tablescan would be recorded under the statistic “table scans (long tables)”.

    If a table is shorter than the threshold then it is read to the midpoint of the cache (just like any other block read) but – whether by accident or design – the touch count (x$bh.tch) is not set and the table will fall off the LRU end of the buffer cache fairly promptly as other objects are read into the buffer. Such a tablescan would be recorded under the statistic “table scans (short tables)”.

Then, in July 2009, Dion Cho decided to check this description before repeating it, and set about testing it on Oracle 10gR2 – producing some surprising results and adding another item to my to-do list. Since then I have wanted to check his conclusions, check whether the original description had ever been true and when (or if) it had changed.

As a simple starting point, of course, it was easy to check the description of the relevant (hidden) parameter to see when it changed:

8.1.7.4     _small_table_threshold        threshold level of table size for forget-bit enabled during scan
9.2.0.4     _small_table_threshold        threshold level of table size for direct reads
11.2.0.1    _small_table_threshold        lower threshold level of table size for direct reads

This suggests that the behaviour might have changed some time in 9i (9.2.0.4 happened to be the earliest 9i listing of x$ksppi I had on file) – so I clearly had at least three major versions to check.

The behaviour of the cache isn’t an easy thing to test, though, because there are a number of special cases to consider – in particular the results could be affected by the positioning of the “mid-point” marker (x$kcbwds.cold_hd) that separates the “cold” buffers from the “hot” buffers. By default the hot portion of the default buffer is 50% of the total cache (set by hidden parameter _db_percent_hot_default) but on instance startup or after a “flush buffer cache” there are no used buffers so the behaviour can show some anomalies.

So here’s the basic strategy:

    Start the instance
    Create a number of relatively small tables with no indexes
    Create a table large enough to come close to filling the cache, with an index to allow indexed access
    Collect stats on the table – largest last.
    Query the large table through an index range scan to fill the cache
    Repeat a couple of times with at least 3 second pauses to allow for incrementing the touch count
    Check x$bh for buffer usage
    Run repeated tablescans for the smaller tables to see how many blocks end up in the cache at different sizes.

Here’s some sample code:

create table t_15400
pctfree 99
pctused 1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 10000
)
select
	rownum			id,
	lpad(rownum,10,'0')	small_vc,
	rpad('x',100)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 15400
;

create index t_15400_id on t_15400(id);

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T_15400',
		estimate_percent => 100,
		method_opt 	 => 'for all columns size 1'
	);
end;
/

select
	object_name, object_id, data_object_id
from
	user_objects
where
	object_name in  (
		'T_300',
		'T_770',
		'T_1540',
		'T_3750',
		'T_7700',
		'T_15400',
		'T_15400_ID'
	)
order by
	object_id
;

select
	/*+ index(t) */
	max(small_vc)
from
	t_15400 t
where
	id > 0
;

The extract shows the creation of just the last and largest table I created and collected statistics for – and it was the only one with an index. I chose the number of blocks (I’ve rigged one row per block) because I had set up a db_cache_size of 128MB on my 10.2.0.3 Oracle instance and this had given me 15,460 buffers.

As you can see from the query against user_objects my test case included tables with 7,700 rows (50%), 3,750 rows (25%), 1,540 rows (10%), 770 rows (5%) and 300 rows (2%). (The number in brackets are the approximate sizes of the tables – all slightly undersized – relative to the number of buffers in the default cache).

Here’s the query that I then ran against x$bh (connected as sys from another session) to see what was in the cache (the range of values needs to be adjusted to cover the range of object_id reported from user_objects):

select
	obj, tch, count(*)
from	x$bh
where
	obj between 77710 and 77720
group by
	obj, tch
order by
	count(*)
;

Typical results from 10.2.0.3

After executing the first index range scan of t_15400 to fill the cache three times:

       OBJ        TCH   COUNT(*)
---------- ---------- ----------
     75855          0          1
     75854          0          1
     75853          0          1
     75851          0          1
     75850          0          1
     75849          0          1
     75852          0          1
     75855          2          9    -- Index blocks, touch count incremented
     75855          1         18    -- Index blocks, touch count incremented
     75854          1      11521    -- Table blocks, touch count incremented

Then after three tablescans, at 4 second intervals, of the 7,700 block table:

       OBJ        TCH   COUNT(*)
---------- ---------- ----------
     75853          3          1    -- segment header of 7700 table, touch count incremented each time
     75855          0          1
     75854          0          1
     75852          0          1
     75849          0          1
     75850          0          1
     75851          0          1
     75855          2          9
     75855          1         10
     75853          0       3991    -- lots of blocks from 7700 table, no touch count increment
     75854          1       7538

Then repeating the tablescan of the 3,750 block table three times:


       OBJ        TCH   COUNT(*)
---------- ---------- ----------
     75853          3          1
     75855          0          1
     75854          0          1
     75851          0          1
     75852          3          1    -- segment header block, touch count incremented each time
     75849          0          1
     75850          0          1
     75855          2          9
     75855          1         10
     75853          0        240
     75852          0       3750    -- table completely cached - touch count not incremented
     75854          1       7538

Then repeating the tablescan of the 1,540 block table three times:

       OBJ        TCH   COUNT(*)
---------- ---------- ----------
     75853          3          1
     75855          0          1
     75854          0          1
     75851          3          1    -- segment header block, touch count incremented each time
     75849          0          1
     75850          0          1
     75852          3          1
     75855          2          9
     75855          1         10
     75853          0        149
     75851          2       1540    -- Table fully cached, touch count incremented but only to 2
     75852          0       2430
     75854          1       7538

Then executing the tablescan of the 770 block table three times:

       OBJ        TCH   COUNT(*)
---------- ---------- ----------
     75853          3          1
     75855          0          1
     75850          3          1    -- segment header block, touch count incremented each time
     75849          0          1
     75851          3          1
     75852          3          1
     75854          0          1
     75855          2          9
     75855          1         10
     75851          0         69
     75853          0        149
     75850          2        770    -- Table fully cached, touch count incremented but only to 2
     75851          2       1471
     75852          0       1642
     75854          1       7538

Finally executing the tablescan of the 300 block table three times:

       OBJ        TCH   COUNT(*)
---------- ---------- ----------
     75853          3          1
     75855          0          1
     75854          0          1
     75850          3          1
     75852          3          1
     75851          3          1
     75855          2          9
     75855          1         10
     75851          0         69
     75850          0        131
     75853          0        149
     75849          3        301	-- Table, and segment header, cached and touch count incremented 3 times
     75850          2        639
     75852          0       1342
     75851          2       1471
     75854          1       7538

This set of results on its own isn’t conclusive, of course, but the indications for 10.2.0.3 are:

    a) “Large” tablescans don’t increment the touch count – so avoiding promotion to the hot portion of the buffer
    b) There is a 25% boundary (ca. 3750 in this case) above which a tablescan will start to recycle the buffer it has used
    c) There is a 10% boundary (ca. 1540 in this case) below which repeating a scan WILL increment the touch count
    d) There is a 2% boundary (ca. 300 in this case) below which tablescans will always increment the touch count.

I can’t state with any certainty where the used and recycled buffers might be, but since blocks from the 3750 tablescan removed the blocks from the 7700 tablescan, it’s possible that “large” tablescans do somehow go “to the bottom quarter” of the LRU.

There also some benefit in checking the statistics “table scans (short)” and “table scans (long)” as the tests run. For the 2% (300 block) table I recorded 3 short tablescans; for the tables in the 2% to 10% range (770 and 1540) I recorded one long and two short (which is consistent with the touch count increment of 2 – the first scan was expected to be long, but the 2nd and 3rd were deemed to be short based on some internal algorithm about the tables being fully cached); finally for the tables above 10% we always got 3 long tablescans.

But as it says in the original note on small partitions – there are plenty of questions still to answer:

    I’ve cited 2%, 10%, and 25% and only one of these is set by a parameter (_small_table_threshold is derived as 2% of the db_cache_size – in simple cases) are the other figures derived, settable, or hard-coded.

    I’ve quoted the 2% as the fraction of the db_cache_size – but we have automatic SGA management in 10g, automatic memory management in 11g, and up to eight different cache sizing parameters in every version from 9i onwards. What figure is used as the basis for the 2%, and is that 2% of the blocks or 2% of the bytes, and if you have multiple block sizes does each cache perhaps allow 2% of its own size.

And then, in 11g we have to worry about automatic direct path serial tablescans – and it would be easy to think that the “_small_table_threshold” may have been describing that feature since (at least) 9.2.0.4 if its description hadn’t changed slightly for 11.2 !

So much to do, so little time — but at least you know that there’s something that needs careful investigation if you’re planning to do lots of tablescans.

Footnote: Having written some tests, it’s easy to change versions. Running on 8.1.7.4 and 9.2.0.8, with similar sized caches, I could see that the “traditional” description of the “small_table_threshold” was true – a short tablescan was anything less 2% of the buffer cache, long tablescans were (in effect) done using just a window of “db_file_multiblock_read_count” buffers, and in both cases the touch count was never set (except for the segment header block).

An index of my TPT scripts

A lot of people have asked me whether there’s some sort of index or “table of contents” of my TPT scripts (there’s over 500 scripts in the tpt_public.zip file – http://tech.e2sn.com/oracle-scripts-and-tools )

I have planned to create such index for years, but never got to it. I probably never will :) So a good way to extract the descriptions of some scripts is this (run the command in the directory where you extracted my scripts to):

$ grep -i Purpose: *.sql | awk -F: ‘{ printf(“%20s %-50s\n”, $1, $3) }’
            bhla.sql      Report which blocks are in buffer cache, protected by a cache
         bufprof.sql      Display buffer gets done by a session and their reason
            calc.sql      Basic calculator and dec/hex converter       
        channels.sql      Report KSR channel message counts by channel endpoints
        curheaps.sql      Show main cursor data block heap sizes and their contents
             dba.sql      Convert Data Block Address (a 6 byte hex number) to file#, block#
             ddl.sql      Extracts DDL statements for specified objects
              df.sql  Show Oracle tablespace free space in Unix df style
        diag_sid.sql      Display current Session Wait info            
        diag_sid.sql      An easy to use Oracle session-level performance snapshot utility
           disco.sql      Generates commands for disconnecting selected sessions
     getplusparm.sql      get sqlplus parameter value (such linesize, pagesize, sqlcode,
            grpn.sql      Quick group by query for aggregating Numeric columns
            hash.sql      Show the hash value, SQL_ID and child number of previously
             i2h.sql      Advanced Oracle Troubleshooting Seminar demo script
              im.sql      Display In-Memory Undo (IMU) buffer usage    
            init.sql  Initializes sqlplus variables for 156 character terminal
       kglbroken.sql                                                   
       kglbroken.sql      Report broken kgl locks for an object this can be used for 
            kill.sql      Generates commands for killing selected sessions
              la.sql      Show which latch occupies a given memory address and its stats
     lastchanged.sql      Detect when a datablock in table was last changed
       latchprof.sql      Perform high-frequency sampling on V$LATCHHOLDER
   latchprof_old.sql      Perform high-frequency sampling on V$LATCHHOLDER
      latchprofx.sql      Perform high-frequency sampling on V$LATCHHOLDER
              lh.sql      Show latch holding SIDs and latch details from V$LATCHHOLDER
             lhp.sql      Perform high-frequency sampling on V$LATCHHOLDER
            lhpx.sql      Perform high-frequency sampling on V$LATCHHOLDER
     lotshparses.sql      Generate Lots of hard parses and shared pool activity 
    lotshparses2.sql      Generate Lots of hard parses and shared pool activity 
        lotslios.sql      Generate Lots of Logical IOs for testing purposes
        lotspios.sql      Generate Lots of Physical IOs for testing purposes
     lotssparses.sql      Generate Lots of soft parses and library cache/mutex activity 
    lotssparses2.sql      Generate Lots of soft parses and library cache/mutex activity 
       mutexprof.sql      Display KGX mutex sleep history from v$mutex_sleep_history
       nonshared.sql      Print reasons for non-shared child cursors from v$sql_shared_cursor
      nonshared2.sql      Show the reasons why more child cursors were created instead of
      ostackprof.sql      Take target process stack samples and show an execution profile
            pmem.sql      Show process memory usage breakdown – lookup by process SPID
     pmem_detail.sql      Show process memory usage breakdown details – lookup by process SPID
        prefetch.sql      Show KCB layer prefetch                      
          pvalid.sql      Show valid parameter values from V$PARAMETER_VALID_VALUES
        rowcache.sql      Show parent rowcache entries mathcing an object name
              rs.sql      Display available Redo Strands               
               s.sql      Display current Session Wait and SQL_ID info (10g+)
          sample.sql      Sample any V$ view or X$ table and display aggregated results
      sampleaddr.sql      High-frequency sampling of contents of a SGA memory address
             ses.sql      Display Session statistics for given sessions, filter by
            ses2.sql      Display Session statistics for given sessions, filter by
        sgastatx.sql      Show shared pool stats by sub-pool from X$KSMSS
            smem.sql      Show process memory usage breakdown – lookup by session ID
     smem_detail.sql      Show process memory usage breakdown details – lookup by session ID
         snapper.sql      An easy to use Oracle session-level performance snapshot utility
     snapper3.15.sql      An easy to use Oracle session-level performance snapshot utility
      snapper_v1.sql      An easy to use Oracle session-level performance snapshot utility
      snapper_v2.sql      An easy to use Oracle session-level performance snapshot utility
            stat.sql      Execute SQL statement in script argument and report basic
              sw.sql      Display current Session Wait info            
             sw2.sql      Display current Session Wait info            
             swg.sql      Display given Session Wait info grouped by state and event
             swo.sql      Display current Session Wait info            
          topsql.sql      Show TOP SQL ordered by user-provided criteria
            usql.sql      Show another session’s SQL directly from library cache
           usqlx.sql      Show another session’s SQL directly from library cache
        waitprof.sql      Sample V$SESSION_WAIT at high frequency and show resulting 
              xb.sql      Explain a SQL statements execution plan with execution 
            xde2.sql      Describe X$ tables, column offsets and report indexed fixed table
              xm.sql      Explain a SQL statements execution plan directly from library cache
             xma.sql      Explain a SQL statements execution plan directly from library cache
            xmai.sql      Explain a SQL statements execution plan with execution 
             xms.sql      Explain your last SQL statements execution plan with execution 
            xmsh.sql      Explain a SQL statements execution plan with execution 
            xmsi.sql      Explain a SQL statements execution plan with execution

Share

Update, things of interest and a couple of blogs to check out

tech.E2SN secret hacking session on Tuesday 22nd March:

Just in case you missed it – there’s still chance to sign up to my tomorrow’s ORA-4031 and shared pool hacking session. I initially planned to limit the attendees to 100 per event (as the limited GotoWebinar package is cheaper that way) but over 100 people had signed up for the US event on the day of announcement, even before it was 8am in California, so I figured I should invest a bit more and allow more people attend. So far over 500 people have signed up (total for both events). If you haven’t done so, you can sign up here:

Advanced Oracle Troubleshooting online seminar Deep Dives 1-5  on 11-15 April:

The next AOT deep dives (1-5) will start in 3 weeks, on 11-15 April. (and 6-10 will be on 9-13 May).

Check the details here:

Blogs to check out:

Andrey Nikolaev has done some serious low-level research on Oracle latches and KGX mutexes and he also presented his work this year at Hotsos Symposium (I missed his session as I was stuck in JFK instead of attending the conference on that day):

Porus Havewala is quite a Grid Control and OEM enthusiast. If you are into OEM & GC, check out his blog:

Future events:

  • I will be speaking at the UKOUG Exadata Special Event on 18th April
  • I will announce some more Virtual Conferences pretty soon!!! Very interesting topics and good speakers – including (but not limited to) some serious Exadata technical contents!

Share

Upgrade Whoa

(The title’s a pun, by the way – an English form of humour that is not considered good unless it’s really bad.)

Very few people try to email me or call me with private problems – which is the way it should be, and I am grateful to my audience for realizing that this blog isn’t trying to compete with AskTom – but I do get the occasional communication and sometimes it’s an interesting oddity that’s worth a little time.

Today’s blog item is one such oddity – it was a surprise, it looked like a nasty change in behaviour, and it came complete with a description of environment, and a neatly formatted, complete, demonstration. For a discussion of the problem in Spanish you can visit the blog of John Ospino Rivas, who sent me the original email and has written his own blog post on the problem.

We start with a simple table, and then query it with a ‘select for update from two different sessions:


drop table tab1 purge;

create table tab1(
	id	number,
	info	varchar2(10),
	constraint tab1_pk primary key (id)
		using index (create index idx_tab1_pk on tab1(id))
);

insert into tab1 values(1,'a');
insert into tab1 values(2,'a');
insert into tab1 values(3,'a');
commit;

execute dbms_stats.gather_table_stats(user,'tab1',cascade=>true)

column id new_value m_id

set autotrace on explain

select  id
from    tab1
where   id = (
            select  min(id)
            from    tab1
        )
for update
;

set autotrace off

prompt	=============================================================
prompt  Now repeat the query in another session and watch it lock
prompt	And use a third session to check v$lock
prompt  Then delete here, commit and see what the second session does
prompt	=============================================================

accept X prompt 'Press return to delete and commit'

set verify on
delete from tab1 where id = &m_id;
commit;

The fact that the primary key index is created as a non-unique index isn’t a factor that affects this demonstration.

Given the query and the data in the table, you won’t be surprised by the result of the query from the first session (for convenience I’ve captured the selected value using the ‘column new_value’ option). Here’s the result of the query and its execution plan:


        ID
----------
         1

--------------------------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |     1 |     3 |     1   (0)| 00:00:01 |
|   1 |  FOR UPDATE                  |             |       |       |            |          |
|*  2 |   INDEX RANGE SCAN           | IDX_TAB1_PK |     1 |     3 |     0   (0)| 00:00:01 |
|   3 |    SORT AGGREGATE            |             |     1 |     3 |            |          |
|   4 |     INDEX FULL SCAN (MIN/MAX)| IDX_TAB1_PK |     3 |     9 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID"= (SELECT MIN("ID") FROM "TAB1" "TAB1"))

At this point the program issues instructions to repeat the query from a second session, then waits for you to press Return. When you run the same query from another session it’s going to see the data in read-consistent mode and try to select and lock the row where ID = 1, so the second session is going to hang waiting for the first session to commit or rollback.

Here’s the key question: what’s the second session going to return when you allow the first session to continue, delete the row it has selected, and commit ? Here’s the answer if you’re running 10.2.0.3 or 11.1.0.6 (which is what I happen to have easily available):

SQL> select  id
  2  from    tab1
  3  where   id = (
  4              select  min(id)
  5              from    tab1
  6          )
  7  for update
  8  ;

        ID
----------
         2

1 row selected.

Now, this seems perfectly reasonable to me – especially since I’ve read Tom Kyte’s notes on “write consistency” and seen the “rollback and restart” mechanism that kicks in when updates have to deal with data that’s changed since the start of the update. Session 2 had a (select for) update, and when it finally got to a point where it could lock the data it found that the read-consistent version of the data didn’t match the current version of the data so it restarted the statement at a new SCN. At the new SCN the current highest value was 2.

Now here’s what happened when I ran the test under 11.2.0.2:

SQL> select  id
  2  from    tab1
  3  where   id = (
  4              select  min(id)
  5              from    tab1
  6          )
  7  for update
  8  ;

no rows selected

The upgrade produces a different answer !

At first sight (or guess) it looks as if the query has run in two parts – the first part producing the min(id) of 1 using a read-consistent query block, with the second part then using the resulting “known value” to execute the outer select (shades of “precompute_subquery”) and restarting only the second part when it discovers that the row it has been waiting for has gone away.

It doesn’t really matter whether you think the old behaviour or the new behaviour is correct – the problem is that the behaviour has changed in a way that could silently produce unexpected results. Be careful if any of your code uses select for update with subqueries.

As a defensive measure you might want to change the code to use the serializable isolation level – that way the upgraded code will crash with Oracle error ORA-08177 instead of silently giving different answers:

SQL> alter session set isolation_level = serializable;

Session altered.

SQL> get afiedt.buf
  1  select  /*+ gather_plan_statistics */
  2          id
  3  from    tab1
  4  where   id = (
  5              select  min(id)
  6              from    tab1
  7          )
  8* for update
  9  /
from    tab1
        *
ERROR at line 3:
ORA-08177: can't serialize access for this transaction

It might be a way of avoiding this specific problem, of course, but it’s not a frequently used feature (the first pages of hits on Google are mostly about SQL Server) so who knows what other anomalies this change in isolation level might introduce.

LOBREAD SQL Trace entry in Oracle 11.2 (and tracing OPI calls with event 10051)

A few days ago I looked into a SQL Tracefile of some LOB access code and saw a LOBREAD entry there. This is a really welcome improvement (or should I say, bugfix of a lacking feature) for understanding resource consumption by LOB access OPI calls. Check the bottom of the output below:

*** 2011-03-17 14:34:37.242
WAIT #47112801352808: nam='SQL*Net message from client' ela= 189021 driver id=1413697536 #bytes=1 p3=0 obj#=99584 tim=1300390477242725
WAIT #0: nam='gc cr multi block request' ela= 309 file#=10 block#=20447903 class#=1 obj#=99585 tim=1300390477243368
WAIT #0: nam='cell multiblock physical read' ela= 283 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477243790
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390477243865
[...snipped...]
WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244205
WAIT #0: nam='SQL*Net more data to client' ela= 4 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244221
WAIT #0: nam='gc cr multi block request' ela= 232 file#=10 block#=20447911 class#=1 obj#=99585 tim=1300390477244560
WAIT #0: nam='cell multiblock physical read' ela= 882 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477245579
WAIT #0: nam='SQL*Net more data to client' ela= 16 driver id=1413697536 #bytes=2020 p3=0 obj#=99585 tim=1300390477245685
WAIT #0: nam='SQL*Net more data to client' ela= 6 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477245706
WAIT #0: nam='SQL*Net more data to client' ela= 5 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390477245720
#ff0000;">LOBREAD: c=1000,e=2915,p=8,cr=5,cu=0,tim=1300390477245735

In past versions of Oracle the CPU (c=) usage figures and other stats like number of physical/logical reads of the LOB chunk read OPI call were just lost – they were never reported in the tracefile. In past only the most common OPI calls, like PARSE, EXEC, BIND, FETCH (and recently CLOSE cursor) were instrumented with SQL Tracing. But since 11.2(.0.2?) the LOBREAD’s are printed out too. This is good, as it reduces the amount of guesswork needed to figure out what are those WAITs for cursor #0 – which is really a pseudocursor.

Why cursor#0? It’s because normally, with PARSE/EXEC/BIND/FETCH, you always had to specify a cursor slot number you operated on (if you fetch from cursor #5, it means that Oracle process went to slot #5 in the open cursor array in your session’s UGA and followed the pointers to shared cursor’s executable parts in library cache from there). But LOB interface works differently – if you select a LOB column using your query (cursor), then all your application gets is a LOB LOCATOR (sort of a pointer with LOB item ID and consistent read/version SCN). Then it’s your application which must issue another OPI call (LOBREAD) to read the chunks of that LOB out from the database. And the LOB locator is independent from any cursors, it doesn’t follow the same cursor API as regular SQL statements (as it requires way different functionality compared to a regular select or update statement).

So, whenever a wait happened in your session due to an access using a LOB locator, then there’s no specific cursor responsible for it (as far as Oracle sees internally) and that’s why a fake, pseudocursor #0 is used.

Note that on versions earlier than 11.2(.0.2?) when the LOBREAD wasn’t printed out to trace – you can use OPI call tracing (OPI stands for Oracle Program Interface and is the server-side counterpart to OCI API in the client side) using event 10051. First enable SQL Trace and then the event 10051 (or the other way around if you like):

SQL> @oerr 10051

ORA-10051: trace OPI calls

SQL> alter session set events '10051 trace name context forever, level 1';

Session altered.

Now run some LOB access code and check the tracefile:

*** 2011-03-17 14:37:07.178
WAIT #47112806168696: nam='SQL*Net message from client' ela= 6491763 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627178602
OPI CALL: type=105 argc= 2 cursor=  0 name=Cursor close all
CLOSE #47112806168696:c=0,e=45,dep=0,type=1,tim=1300390627178731
OPI CALL: type=94 argc=28 cursor=  0 name=V8 Bundled Exec
=====================
PARSING IN CURSOR #47112802701552 len=19 dep=0 uid=93 oct=3 lid=93 tim=1300390627179807 hv=1918872834 ad='271cc1480' sqlid='3wg0udjt5zb82'
select * from t_lob
END OF STMT
PARSE #47112802701552:c=1000,e=1027,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179805
EXEC #47112802701552:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179884
WAIT #47112802701552: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627179939
WAIT #47112802701552: nam='SQL*Net message from client' ela= 238812 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418785
OPI CALL: type= 5 argc= 2 cursor= 26 name=FETCH
WAIT #47112802701552: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418945
FETCH #47112802701552:c=0,e=93,p=0,cr=5,cu=0,mis=0,r=1,dep=0,og=1,plh=3547887701,tim=1300390627418963
WAIT #47112802701552: nam='SQL*Net message from client' ela= 257633 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676629
#ff0000;">OPI CALL: type=96 argc=21 cursor=  0 name=#ff0000;">LOB/FILE operations
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676788
[...snip...]
WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390627677054
LOBREAD: c=0,e=321,p=0,cr=5,cu=0,tim=1300390627677064

Check the bold and especially the red string above.  Tracing OPI calls gives you some extra details of what kind of tasks are executed in the session. The “LOB/FILE operations” call indicates that whatever lines come after it (unlike SQL trace call lines where all the activity happens before a call line is printed (with some exceptions of course)) are done for this OPI call (until a next OPI call is printed out). OPI call tracing should work even on ancient database versions…

By the way, if you are wondering, what’s the cursor number 47112801352808 in the “WAIT #47112801352808″ above? Shouldn’t the cursor numbers be small numbers?

Well, in 11.2.0.2 this was also changed. Before that, the X in CURSOR #X (and PARSE #X, BIND #X, EXEC #X, FETCH #X) represented the slot number in your open cursor array (controlled by open_cursors) in your session’s UGA. Now, the tracefile dumps out the actual address of that cursor. 47112801352808 in HEX is 2AD94DC9FC68 and it happens to reside in the UGA of my session.

Naturally I asked Cary Millsap about whether he had spotted this LOBREAD already and yes, Cary’s way ahead of me – he said that Method-R’s mrskew tool v2.0, which will be out soon, will support it too.

It’s hard to not end up talking about Cary’s work when talking about performance profiling and especially Oracle SQL trace, so here are a few very useful bits which you should know about:

If you want to understand the SQL trace & profiling stuff more, then the absolute must document is Cary’s paper on the subject – Mastering Performance with Extended SQL Trace:

Also, if you like to optimize your work like me (in other words: you’re proactively lazy ;-) and you want to avoid some boring “where-the-heck-is-this-tracefile-now” and “scp-copy-it-over-to-my-pc-for-analysis” work then check out Cary’s MrTrace plugin (costs ~50 bucks and has a 30-day trial) for SQL Developer. I’ve ended up using it myself regularly although I still tend to avoid GUIs:

Share

buffer flush

This is part 2 of an article on the KEEP cache. If you haven’t got here from part one you should read that first for an explanation of the STATE and CUR columns of the output.

Here’s a little code to demonstrate some of the problems with setting a KEEP cache – I’ve set up a 16MB cache, which gives me 1,996 buffers of 8KB in 10.2.0.3, and then created a table that doesn’t quite fill that cache. The table is 1,900 data blocks plus one block for the segment header (I’ve used freelist management to make the test as predictable as possible, and fixed the pctfree to get one row per block).

create table t1
pctfree 90
pctused 10
storage (buffer_pool keep)
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 10000
)
select
	rownum			id,
	lpad(rownum,10,'0')	small_vc,
	rpad('x',1000)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1900
;

alter system flush buffer_cache;

-- scan the table to load it into memory

select
	/*+ full(t1) */
	count(small_vc)
from	t1
where
	id > 0
;

-- check the content of x$bh from another connection

-- update every fifth row (380 in total)

update t1
	set small_vc = upper(small_vc)
where
	mod(id,5) = 0
;

-- check the content of x$bh from another connection

The query I ran under the SYS account was this:

select
	bpd.bp_blksz,
	bpd.bp_name,
	wds.set_id,
	bh.state,
	cur,
	ct
from
	(
	select
		set_ds,
		state,
		bitand(flag, power(2,13))	cur,
		count(*)  ct
	from
		x$bh
	group by
		set_ds,
		state,
		bitand(flag, power(2,13))
	)		bh,
	x$kcbwds	wds,
	x$kcbwbpd	bpd
where
	wds.addr = bh.set_ds
and	bpd.bp_lo_sid <= wds.set_id
and	bpd.bp_hi_sid >= wds.set_id
and	bpd.bp_size != 0
order by
	bpd.bp_blksz,
	bpd.bp_name,
	wds.set_id,
	bh.state,
	bh.cur
;

In my test case this produced two sets of figures, one for the DEFAULT cache, and one for the KEEP cache but I’ve only copied out the results from the KEEP cache, first after the initial tablescan, then after the update that affected 380 blocks:


  BP_BLKSZ BP_NAME     SET_ID STATE   CUR         CT
---------- ------- ---------- ----- ----- ----------
      8192 KEEP             1     0     0         95
                                  1     0       1901
           ******* **********             ----------
           sum                                  1996

  BP_BLKSZ BP_NAME     SET_ID STATE   CUR         CT
---------- ------- ---------- ----- ----- ----------
      8192 KEEP             1     1     0       1462
                                  1  8192        380
                                  3     0        323
           ******* **********             ----------
           sum                                  1996

In the first output you see the 1901 buffers holding blocks from the table (1,900 data plus one segment header), with the remaining 95 buffers still “free” (state 0). The table blocks are all shown as XCUR (state 1, exclusive current)

In the second output you see 380 buffers holding blocks with state ‘XCUR’ with bit 13 of the flag column set, i.e. “gotten in current mode”. These are the 380 blocks that have been updated – but there are also 323 blocks shown as CR (state 3, “only valid for consistent read”). A detailed check of the file# and dbablk for these buffers shows that they are clones of (most of) the 380 blocks in the XCUR buffers.

Do a bit of arithmetic – we have 1462 blocks left from the original tablescan, plus 380 blocks in CUR mode (of which there are 323 clones) for a total of 1,842 blocks – which means that 59 blocks from the table are no longer in the cache. As we clone blocks we can lose some of the blocks we want to KEEP.

Unfortunately for us, Oracle has not given any preferential treatment to buffers which hold blocks in the XCUR state – any buffer which reaches the end of LRU chain and hasn’t been accessed since it was first loaded will be dumped so that the buffer can be used to create a clone (but see footnote). This means that a constant stream of inserts, updates, deletes, and queries could result in lots of clones being created in your KEEP cache, pushing out the data you want to keep.

If you want to size your KEEP cache to minimise this effect, you probably need to start by making it somewhat larger than the objects it is supposed to KEEP, and then checking to see how many clones you have in the cache – because that will give you an idea of how many extra buffers you need to stop the clones from pushing out the important data.

Upgrades:

When I wrote and ran the test cases in this note the client was running Oracle 10.2 – while writing up my notes I happened to run the test on 11.1.0.6 (still using freelists rather than ASSM) and got the following output from my scan of the KEEP cache:

  BP_BLKSZ BP_NAME     SET_ID STATE   CUR         CT
---------- ------- ---------- ----- ----- ----------
      8192 KEEP             1     1     0       1901
                                  3     0         91
           ******* **********             ----------
           sum                                  1992

Apart from the fact that you get slightly fewer buffers per granule in 11g (the x$bh structure has become slightly larger – and x$bh is a segmented array where each segment shares the granule with the buffers it points to) you can see that we only have 91 clones in the KEEP cache, and apparently we’ve managed to update our 380 blocks without changing their flag to “gotten in current mode”. Doing an update is, of course, just one way of making clones appear – but perhaps 11g will generally have more success in keeping current versions of blocks in memory than earlier versions.

There is , unfortunately, a very special feature to this test case – it’s using a single tablescan to update the table. So having said in part 1 that I was going to write a two-part article, I’ve got to this point, done a few more tests, and decided I need to write part three as well.  Stay tuned.

Footnote: Technically there are a couple of circumstances where Oracle will bypass the buffer and walk along the LRU chain looking for another block – but I’ve excluded them from this demonstration.

ORA-4031 errors, contention, cursor management issues and shared pool fragmentation – free secret seminar!

Free stuff! Free stuff! Free stuff! :-)

The awesome dudes at E2SN have done it again! (and yes, Tom, this time the “we at E2SN Ltd” doesn’t mean only me alone ;-)

On Tuesday 22nd March I’ll hold two (yes two) Secret Oracle Hacking Sessions – about ORA-04031: unable to allocate x bytes of shared memory errors, cursor management issues and other shared pool related problems (like fragmentation). This event is free for all! You’ll just need to be fast enough to register, both events have 100 attendee limit (due to my GotoWebinar accont limitations).

I am going to run this online event twice, so total 200 people can attend (don’t register for both events, please). One event is in the morning (my time) to cater for APAC/EMEA region and the other session is for EMEA/US/Americas audience.

The content will be the same in both sessions. There will be no slides (you cant fix your shared pool problems with slides!) but there will be demos, scripts, live examples and fun (for the geeks among us anyway – others go and read some slides instead ;-)!

Share

Exadata Training – I’ll be speaking at the 1-day UKOUG Exadata Special Event on 18th April

Hi all,

As my frequent readers know, I have promised to not travel anymore as it’s just too much hassle compared to the benefit of being “there”. This is why I’m going to fly to London on Monday, 18th April to speak at the UKOUG Exadata Special Event. This event is just too sexy to be missed, so I made an exception (the last one, I promise!)… and it’s probably going to be warmer there as well compared to where I am now :-)

I will be talking about what’s been my focus area for last year or so – Oracle Exadata Performance.

Dan Norris and Alex Gorbachev will be speaking there too, so it should end up being a pretty awesome event!

More details here:

My abstract is following:

#000000; font-family: Times; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; font-size: medium;">#333333; font-family: arial,sans-serif; text-align: left; font-size: small;">Understanding Exadata Performance: Metrics and Wait Events
#000000; font-family: Times; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; font-size: medium;">#333333; font-family: arial,sans-serif; text-align: left; font-size: small;">

In order to systematically troubleshoot and optimize Exadata performance, one must understand the meaning of its performance metrics.

This session provides a deep technical walkthrough of how Exadata IO and smart scans work and how to use relevant metrics for troubleshooting related performance issues. We will review both Exadata database and cell-level metrics, cell wait events and tools useful for troubleshooting. We will also look into metrics related to Exadata Hybrid Columnar Compression and the cell Flash Cache usage.

P.S. The reason why I called this post “Exadata Training” is that you’ll learn some real world practical stuff there… as opposed to the marketing material (and marketing material copy material) overdose out there… ;-)

Share