Search

OakieTags

Who's online

There are currently 1 user and 26 guests online.

Online users

Recent comments

Affiliations

16K CR gets for UQ index scan? Easy. Kind of.

I planned to write on this for quite some time, but failed to do so. Sorry about that. Today I finally got time and desire to describe a situation from the title. It was observed on an Oracle 9.2.0.8 running Solaris SPARC; manifested itself as a severe CPU burning at 100% utilization with ‘latch free’ on the first place of the Top 5 wait events in Statspack report.

Here is part of a 1 hour report:

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:            -72,280.51            -18,601.83
              Logical reads:            411,881.06            106,000.09
              Block changes:              2,515.53                647.39
             Physical reads:                665.17                171.18
            Physical writes:                218.18                 56.15
                 User calls:                394.87                101.62
                     Parses:                 79.46                 20.45
                Hard parses:                  0.94                  0.24
                      Sorts:                885.81                227.97
                     Logons:                  0.05                  0.01
                   Executes:              1,160.92                298.77
               Transactions:                  3.89

  % Blocks changed per Read:    0.61    Recursive Call %:     75.88
 Rollback per transaction %:    5.17       Rows per Sort:      5.95

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.98       Redo NoWait %:    100.00
            Buffer  Hit   %:   99.84    In-memory Sort %:    100.00
            Library Hit   %:   99.93        Soft Parse %:     98.82
         Execute to Parse %:   93.16         Latch Hit %:     89.32
Parse CPU to Parse Elapsd %:   84.20     % Non-Parse CPU:     99.57

 Shared Pool Statistics        Begin   End
                               ------  ------
             Memory Usage %:   90.41   91.42
    % SQL with executions>1:   31.20   35.28
  % Memory for SQL w/exec>1:   30.16   30.05

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free                                      2,529,903      49,710    48.22
CPU time                                                       31,694    30.75
db file sequential read                         1,228,775      13,164    12.77
buffer busy waits                                 269,229       3,076     2.98
enqueue                                             2,276       2,567     2.49 

And this is Latch sleep breakdown:

                                      Get                            Spin &
Latch Name                       Requests      Misses      Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
cache buffers chains        4,190,999,464 ###########   2,524,798 467719813/20
                                                                  75503/172064
                                                                  /31324/0
library cache                  27,510,901      44,028         931 43116/895/15
                                                                  /2/0
row cache enqueue latch        10,692,790      27,738          16 27722/16/0/0
                                                                  /0
cache buffer handles            6,478,263       7,240          66 7175/64/1/0/
                                                                  0
shared pool                     2,657,115       5,706         412 5300/400/6/0
                                                                  /0
...

As you can see, ‘latch free’ waits are almost entirely accounted to the infamous ‘cache buffers chains’ latch gets. You can easily find information on what it is and why it could happen, and how to diagnose such contention. Here I’ll just show you what I’ve found by accident (just staring at the report). I noticed that two queries – #2 and #3 in the SQL Ordered by Gets section – which are nothing but a unique index access to a table, had ~16K gets per execution:

                                                     CPU      Elapsd
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
    333,759,439       21,538       15,496.3   22.4  7028.63  24546.46 4091048759
Module: JDBC Thin Client
SELECT 1 FROM T WHERE ID = :B1 FOR UPDATE

    259,164,764       16,137       16,060.3   17.4  5391.01  18934.74  307775861
Module: JDBC Thin Client
select ID from T where ID=:1

How that could possibly happen? At first I thought the index used by queries was missing, but that was not the case. Second thought – the index has exploded due to modifications (I found one query that probably had inserted lots of rows to the table, but rows_processed for the statement was zero) and its height was increased, which caused other sessions to do much more work when accessing both index and table blocks. That was just an idea which I didn’t know how to confirm or dismiss so I asked wise guys and they were kind enough to point out to me that index height increase is not a necessity to have a spike in CR gets for concurrent SQL access. All that is needed is just an uncommitted transaction that inserts rows to the table somewhere nearby the normally accessed data. Yep, it sounded reasonable to me and I was OK with this explanation. I did confirm that the long running INSERT statement was indeed adding millions of rows to the table due to an application bug (missing condition), but failed to do so because of transaction timeout on the WebLogic set to half an hour. So the statement actually was rolled back and that’s why rows_processed for the query was zero.
But when I tried to reproduce this situation to prepare test case for this blog entry, I was not able to do so. A simple access by index to the table with an in-flight transaction inserting huge amount of data is definitely not enough to reproduce such situation. So it must have been one of those scenarios when a (table or index) block has been updated many times, and subsequent transactions had to apply undo to reverse changes to a block made by other transactions. It’s very easy to setup so I won’t do that here. Also I think it’s impossible to identify which block caused the issue having just a standard Statspack report in hands. Anyway, here is statistics comparison that shows undo access was the cause for increased LIO and ‘cache buffers chain’ latch contention:

Statname Normal R1 R2 Comments
branch node splits

29

329

421

consistent changes

44,192,320

779,500,412

1,269,485,409

number of times undo was applied to construct a consistent block version
consistent gets

209,640,205

1,035,154,852

1,466,065,189

consistent gets – examination

142,031,460

889,574,780

1,366,583,799

number of times a block (such as undo or index root) was accessed under protection of ‘cache buffers chain’ latch
data blocks consistent reads – undo records applied

43,973,030

779,361,353

1,269,187,599

number of undo records applied while constructing CR block copy
leaf node splits

6,200

90,367

113,873

transaction tables consistent read – undo records applied

1,824

80,503

4,863

user calls

6,526,934

2,277,792

1,429,810

Filed under: Oracle, Performance Tagged: indexes, read consistency