Oakies Blog Aggregator

Index Skip Scan: Potential Use Case or Maybe Not ? (Shine On You Crazy Diamond)

While answering a recent question on a LinkedIn forum, I got to thinking about a potential use case for an INDEX SKIP SCAN that I hadn’t previously considered. I’ve discussed Index Skip Scans previously (as I did here), a feature introduced around Oracle9i that allows an index to be considered by the CBO even […]

Histogram Threat

Have you ever seen a result like this:


SQL> select sql_id, count(*) from V$sql group by sql_id having count(*) > 1000;

SQL_ID		COUNT(*)
------------- ----------
1dbzmt8gpg8x7	   30516

A client of mine, who had recently upgraded to 12.2.0.1 RAC and was using DRCP (database resident connection pooling) for a PHP application, was seeing exactly this type of behaviour for a small number of very simple SQL statements and wanted to find out what was going on, because they were also seeing an undesirable level of contention in the library cache as the system load increased.

In this note I just want to highlight a particular detail of their problem – with an example – showing how easily histograms can introduce problems if you don’t keep an eye out for the dangers.

One of their queries really was as simple as this:

select count(*), sum(skew2) from t1 where skew = :b1;

And I’m going to use this query to model the problem. All I have to do is arrange for a data set that results in a hybrid (or height-balanced) histogram being created on the skew column, and then run the query lots of times with different input bind values. In the case of the client there were around 18,000 possible values for the column, and the number of rows per value varied from 1 to about 20,000 – but whatever the number of rows selected the optimum execution plan was always going to be an indexed access.


rem
rem     Script:         acs_child_cursors.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2018
rem

create table t1 (
        id, skew, skew2, padding
)
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 3e3
)
select
        rownum  id,
        g1.id   id1,
        g2.id   id2,
        rpad('x',100)
from
        generator       g1,
        generator       g2
where
        g2.id <= g1.id     -- > comment to avoid WordPress format issue
order by
        g2.id, g1.id
;

alter table t1 modify skew not null;
alter table t1 modify skew2 not null;

create index t1_skew on t1(skew);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1 for columns skew size 254'
--              method_opt       => 'for all columns size 1'
        );
end;
/

variable b1 number
exec :b1 := 0;

set termout off
@start_1000
@start_1000
@start_1000
set termout on

set linesize 120
set trimspool on

column sql_text format a55

select
        child_number, plan_hash_value, executions,
        is_bind_sensitive,
        is_bind_aware,
        is_shareable,
        sql_text
from
        v$sql
where   sql_id = 'b82my582cnvut'
;

The data set contains 3,000 distinct values for skew and the way I’ve generated the rows means that the value N will appear N times – so there’s one row with the value 1 and 3,000 rows with the value 3,000 and so on for a total of 4,501,500 rows. If you want to run the test, the code is likely to take a couple of minutes to complete and requires roughly 700 MB of disk space.

The mechanism of the script start_1000 is something I published a few years ago, and essentially it executes a script called start_1.sql 1,000 times which, for this test, contains the following two lines:


exec :b1 := :b1 + 1

select count(*), sum(skew2) from t1 where skew = :b1;
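
The start_1000.sql script itself isn’t reproduced here; as a rough stand-in (my sketch, not the published version), you could generate and run it like this:

rem     Hypothetical stand-in for start_1000.sql: spool a script that
rem     calls start_1.sql 1,000 times, then run it.

set pagesize 0 feedback off verify off termout off
spool start_1000_gen.sql
select '@start_1' from dual connect by level <= 1000;
spool off
set termout on feedback on

@start_1000_gen.sql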

The net effect of the 3 calls to start_1000.sql is that my simple SQL statement is called once in turn for each value of skew from 1 to 3,000. The SQL_ID of the statement is ‘b82my582cnvut’ which I’ve used to query v$sql when the run is complete, with the following result:


CHILD_NUMBER PLAN_HASH_VALUE EXECUTIONS I I I SQL_TEXT
------------ --------------- ---------- - - - -------------------------------------------------------
	   0	  1041516234	    498 Y N N select count(*), sum(skew2) from t1 where skew = :b1
	   1	  1041516234	     25 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   2	  1041516234	    104 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   3	  1041516234	    308 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   4	  1041516234	    429 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   5	  1041516234	    640 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   6	  1041516234	     31 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   7	  1041516234	    305 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   8	  1041516234	    660 Y Y Y select count(*), sum(skew2) from t1 where skew = :b1

9 rows selected.

I’ve got 9 child cursors, all with the same execution plan, all except the last labelled as not shareable (you may find that you don’t get exactly the same result; on repeated tests I got between 5 and 9 cursors). Given enough time all these cursors except the last (shareable) one would be aged out of the library cache. In the case of the client, who had a shared pool that was probably quite a bit larger than needed, the number of non-shareable cursors could get very large and they were hanging around for ages. Typically most of the cursors would report an execution count less than 30, with many showing just one or two executions and a handful showing execution counts in the region of several hundred (and that handful were the ones that were still marked as shareable).

After eliminating the critical histogram (using dbms_stats.delete_column_stats()) and eliminating the redundant child cursors (using dbms_shared_pool.purge()) the massive proliferation stopped happening and the performance threat disappeared. The only issue then was to change the table preferences for stats collection on this table to add the clause “for columns size 1 skew” so that the histogram would not be recreated on the next gather.
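
A minimal sketch of those clean-up steps, using the table and column from my test case rather than the client’s (the col_stat_type option and the exact method_opt preference are my assumptions, not the client’s actual code):

begin
        -- remove just the histogram on SKEW (col_stat_type is available in recent releases;
        -- otherwise delete the column stats and re-gather without the histogram)
        dbms_stats.delete_column_stats(
                ownname       => user,
                tabname       => 'T1',
                colname       => 'SKEW',
                col_stat_type => 'HISTOGRAM'
        );

        -- make sure the next gather doesn't recreate the histogram
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 'T1',
                pname   => 'METHOD_OPT',
                pvalue  => 'for all columns size auto for columns size 1 skew'
        );
end;
/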

Further Observations.

I suspect that part of the client’s problem – something that exaggerated the count rather than causing it – could be attributed to using DRCP (database resident connection pooling), which probably still has some quirky bits of behaviour. It was also true that the client’s connection pool was probably much bigger than it needed to be, so if there were any funny little bits of emergent behaviour at scale the client would probably have hit them.

The problem of escalating child cursors is one that Oracle has been working on for quite a long time, and there’s a (hidden) parameter that was introduced late in 11gR2 (though I think that the 11g mechanism first appeared through a fix control) to allow Oracle to mark a parent cursor obsolete if it acquired too many child cursors.  There’s a note on MoS that the client had read on this topic – Doc ID 2298504.1: “Cursor Mutex X Wait Events: After Upgrading To 12.2” – which looked as if it was describing their symptoms, so they had set this parameter (_cursor_obsolete_threshold) from 8192 (the 12.2 default) down to 1024 (the default for 12.1 and earlier versions). This had been of some help with the library cache problem.  When the sql_id at the top of this article reported 30,516 child cursors that would be 29 “obsolete” parent cursors with 1,024 child cursors each and one “live” parent cursor with 820 child cursors.
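
As a hedged illustration only (this is a hidden parameter, so it should only be changed on the advice of Oracle Support), the client’s change would have looked something like this, followed by an instance restart:

alter system set "_cursor_obsolete_threshold" = 1024 scope = spfile;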

You can appreciate that if Oracle has to find a parent cursor and pin it while walking a chain of 30,516 child cursors, that’s likely to take a lot more time than walking a chain of 30 parent cursors (while holding a library cache mutex, perhaps) to find the one non-obsolete parent, then acquiring the parent mutex to walk a chain of 820 child cursors.

I don’t know the exact details of the mechanisms involved with manipulating the various chains – but there are likely to be times when one process needs exclusive mutexes/latches to change a chain while other processes are holding shared mutexes/latches to search the chain. When you’ve got 30,000 child cursors in total the important question becomes: “where’s the trade-off between making the child chains shorter and making the parent search longer?” (I’d expect – or hope – that the Oracle developers had actually found a way to code the obsolescence so that the new parent was first in the search and the rest were never visited, of course.)

One of the suggestions I made to try to alleviate the problem – which I had assumed was due in part to the degree of concurrent execution of the statement – was to mark the cursor as “hot”. This resulted in 36 different sql_ids for the same statement (the client machine had 72 CPUs). This had some effect but ultimately meant that there were 36 chains of parents that would eventually end up with lots of child cursors – the only difference was the rate at which the total cursor count was growing (a lot slower), balanced against the threat that they might all manage to get to 30,000+ child cursors! Instead, as a quick and dirty workaround, I supplied the client with a script that could be run at quieter moments during the day to call dbms_shared_pool.purge() for the sql_id to flush all its cursors from the library cache.
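
A hedged sketch of that quick and dirty purge script (it assumes the SQL_ID reported at the top of this article; dbms_shared_pool.purge() takes the string ‘address,hash_value’ and the flag ‘C’ for a cursor):

begin
        for c in (
                select address, hash_value
                from   v$sqlarea
                where  sql_id = '1dbzmt8gpg8x7'
        ) loop
                dbms_shared_pool.purge(c.address || ',' || c.hash_value, 'C');
        end loop;
end;
/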

One final oddity – which isn’t from the client site. When I changed my test above to avoid generating the histogram (using the commented out method_opt “for all columns size 1”) I found that I still got two child cursors; the first ended up marked as bind-aware but insensitive and non-shareable, the second appeared (time and time again) when my bind value got to 499 and was marked bind-sensitive, bind-aware and shareable.  I still have no idea why this happened.

Footnote:

When repeating the test I started with a “flush shared_pool” – but I’ve removed this line from the code above in case anyone ran it without considering the possible side effects. You do need to clear previous copies of the key statement from the library cache, though, if you want to avoid confusing the issue on repeated runs of the test.

 

Result Cache: when *not* to use it

I recently encountered a case where the result cache was used incorrectly, leading to high contention when the application hit a peak of load. It was no surprise to see that the function was called with an ‘ID’ as argument, which may have thousands of values in this system. I explained to the software vendor that the result cache must be used only for functions frequently called with the same arguments, not for essentially random values, even if each value has 2 or 3 identical calls. And, to back this up, I looked at the Oracle Documentation to link to the part which explains when the result cache can be used and when it should be avoided.

But I found nothing relevant. This is another(*) case where the Oracle Documentation is completely useless. Without explaining how a feature works, you completely fail to get the feature used properly: most people will not take the risk of using it, and a few will use it in the wrong place before blacklisting the feature for good.

(*) By ‘another case’ I’m thinking of Kamil Stawiarski’s presentation about Pragma UDF and the lack of useful documentation about it.

Oracle documentation

So this is what I found in the Database Performance Tuning Guide about the Benefits of Using the Server Result Cache:

  1. The benefits of using the server result cache depend on the application
  2. OLAP applications can benefit significantly from its use.
  3. Good candidates for caching are queries that access a high number of rows but return a small number, such as those in a data warehouse.

So, this is vague (‘depends’, ‘can benefit’, ‘good candidates’) and doesn’t help you decide when the feature can be used.
The ‘access a high number of rows but return a small number’ is an indication of why cache hits can be beneficial. However, there is no mention of the most important things, which are:

  • The cached result is invalidated by any DML on the tables the result relies on.
  • A cache miss, when the result has been invalidated, is expensive.
  • A cache miss, when the result is not in the result cache, is expensive.
  • ‘Expensive’ here means a scalability issue: not detected in unit tests, but big contention when the load increases.

Real things to know

The first thing to know is that the Result Cache memory is protected by a latch:

SQL> select addr,name,gets,misses,sleeps,spin_gets,wait_time from v$latch where name like 'Result Cache%';
 
ADDR NAME GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
---------------- ------------------------- ---------- ---------- ---------- ---------- ----------
00000000600477D0 Result Cache: RC Latch 2 0 0 0 0
0000000060047870 Result Cache: SO Latch 0 0 0 0 0
0000000060047910 Result Cache: MB Latch 0 0 0 0 0

This latch has no children:

SQL> select * from v$latch_children where name like '%Result Cache%';
 
no rows selected

Only one latch to protect the whole result cache: concurrent sessions – even for different functions – have to serialize their access on the same latch.

This latch is acquired in exclusive mode when the session has to write to the result cache (cache miss, invalidation, …) or – since 11gR2 – in shared mode when only reading. This has been explained by Alex Fatkulin: http://afatkulin.blogspot.ch/2012/05/result-cache-latch-in-11gr2-shared-mode.html.

This means that, whatever the Oracle Documentation says, the benefit of the result cache comes only on a cache hit: when the result of the function is already there and has not been invalidated. If you frequently call the same function with the same parameter, and with no changes in the related tables, then you are in the good case.

But if one of the tables has been modified, even rows that have nothing to do with the result, then you will have an overhead: an exclusive latch get. And if you call the function with new values for the arguments, that is also a cache miss which has to take this exclusive latch. And if you have multiple sessions experiencing cache misses, then they will spin on CPU to get the exclusive latch. This can be disastrous with a large number of sessions. I have seen this kind of contention for hours with connection pools set to 100 sessions, when the function is called frequently with different values.

To show it, I create a demo table (just to have a dependency) and a result_cache function:

SQL> create table DEMO as select rownum n from xmltable('1 to 1000');
Table created.
 
SQL> create or replace function F(n number) return number result_cache as begin for i in (select * from DEMO where DEMO.n=F.n) loop return i.n; end loop; end;
2 /
Function created.

I have just restarted the instance and my latch statistics are reset:

SQL> select addr,name,gets,misses,sleeps,spin_gets,wait_time from v$latch where name like 'Result Cache%';
 
ADDR NAME GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
---------------- ------------------------- ---------- ---------- ---------- ---------- ----------
00000000600477D0 Result Cache: RC Latch 2 0 0 0 0
0000000060047870 Result Cache: SO Latch 0 0 0 0 0
0000000060047910 Result Cache: MB Latch 0 0 0 0 0

Result Cache Hit

This will call the function always with the same argument, and no change in the table it relies on:
SQL> declare n number; begin for i in 1..1e3 loop n:=n+f(1); end loop; end;
2 /
PL/SQL procedure successfully completed.

So, the first call is a cache miss and the 999 next calls are cache hits. This is the perfect case for Result Cache.

SQL> select addr,name,gets,misses,sleeps,spin_gets,wait_time from v$latch where name like 'Result Cache%';
 
ADDR NAME GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
---------------- ------------------------- ---------- ---------- ---------- ---------- ----------
00000000600477D0 Result Cache: RC Latch 1009 0 0 0 0
0000000060047870 Result Cache: SO Latch 1 0 0 0 0
0000000060047910 Result Cache: MB Latch 0 0 0 0 0

So, that’s about 1000 latch gets. With cache hits you get the latch once per execution, and this is a shared latch, so no contention here.
Do you want to check that it is a shared latch? Just set a breakpoint with gdb on the ksl_get_shared_latch function (up to 12.1, because 12.2 uses ksl_get_shared_latch_int) and print the arguments (as explained by Stefan Koehler and Frits Hoogland):

As my RC latch is at address 00000000600477D0, I set a breakpoint on ksl_get_shared_latch where the first argument is 0x600477d0 and display the other arguments:

break ksl_get_shared_latch
condition 1 $rdi == 0x600477d0
commands
silent
printf "ksl_get_shared_latch laddr:%x, willing:%d, where:%d, why:%d, mode:%d\n", $rdi, $rsi, $rdx, $rcx, $r8
c
end

Then one call with cache hit displays:

ksl_get_shared_latch laddr:600477d0, willing:1, where:1, why:5358, mode:8

Mode 8 is shared: many concurrent sessions can do the same without waiting. Shared is scalable: cache hits are scalable.

Cache miss – result not in cache

Here each call will have a different value for the argument, so that they are all cache misses (except the first one):

SQL> declare n number; begin for i in 1..1e3 loop n:=n+f(i); end loop; end;
2 /
PL/SQL procedure successfully completed.

Now the ‘RC latch’ statistics have increased further:

SQL> select addr,name,gets,misses,sleeps,spin_gets,wait_time from v$latch where name like 'Result Cache%';
 
ADDR NAME GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
---------------- ------------------------- ---------- ---------- ---------- ---------- ----------
00000000600477D0 Result Cache: RC Latch 6005 0 0 0 0
0000000060047870 Result Cache: SO Latch 1 0 0 0 0
0000000060047910 Result Cache: MB Latch 0 0 0 0 0

This is about 5000 additional latch gets, which means 5 per execution. And, because it writes, you can expect them to be exclusive.

Here is my gdb script output when I call the function with a value that is not already in cache:

ksl_get_shared_latch laddr:600477d0, willing:1, where:1, why:5358, mode:8
ksl_get_shared_latch laddr:600477d0, willing:1, where:1, why:5347, mode:16
ksl_get_shared_latch laddr:600477d0, willing:1, where:1, why:5358, mode:16
ksl_get_shared_latch laddr:600477d0, willing:1, where:1, why:5374, mode:16

Mode 16 is exclusive. And we have 3 of them in addition to the shared one. You can imagine what happens when several sessions are running this: spin and wait, all sessions on the same resource.

Cache miss – result in cache but invalid

I run the same again, where all values are in cache now:

SQL> declare n number; begin for i in 1..1e3 loop n:=n+f(i); end loop; end;
2 /
PL/SQL procedure successfully completed.

So this is only 1000 additional gets:

SQL> select addr,name,gets,misses,sleeps,spin_gets,wait_time from v$latch where name like 'Result Cache%';
 
ADDR NAME GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
---------------- ------------------------- ---------- ---------- ---------- ---------- ----------
00000000600477D0 Result Cache: RC Latch 7005 0 0 0 0
0000000060047870 Result Cache: SO Latch 1 0 0 0 0
0000000060047910 Result Cache: MB Latch 0 0 0 0 0

The function depends on DEMO table, and I do some modifications on it:

SQL> insert into DEMO values (0)
1 row created.
SQL> commit;
Commit complete.

This has invalidated all previous results. A new run will have all cache miss:

SQL> declare n number; begin for i in 1..1e3 loop n:=n+f(i); end loop; end;
2 /
PL/SQL procedure successfully completed.

And this is 5000 additional gets:

SQL> select addr,name,gets,misses,sleeps,spin_gets,wait_time from v$latch where name like 'Result Cache%';
 
ADDR NAME GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
---------------- ------------------------- ---------- ---------- ---------- ---------- ----------
00000000600477D0 Result Cache: RC Latch 12007 0 0 0 0
0000000060047870 Result Cache: SO Latch 1 0 0 0 0
0000000060047910 Result Cache: MB Latch 0 0 0 0 0

So what?

The important thing to know is that each cache miss requires exclusive access to the Result Cache, multiple times. Those must be avoided. The Result Cache is good for a static set of results. It is not a short-term cache to work around an application design where the function is called two or three times with the same values. This is, unfortunately, not explained in the Oracle Documentation. But it becomes obvious when we look at the implementation, or when we load test it with multiple sessions. The consequence can be this kind of high contention lasting for minutes or hours:

Top 5 Timed Events                                                Avg %Total
~~~~~~~~~~~~~~~~~~                                               wait   Call
Event                                            Waits    Time (s)  (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
latch free                                     858,094   1,598,387   1863   78.8
enq: RC - Result Cache: Contention             192,855     259,563   1346   12.8

Without either knowledge of the implementation or relevant load tests, the risk is that a developer relies on good results from unit testing and implements Result Cache in every function. The consequence will be seen too late, in production, at a peak of load. If this happens to you, you can disable the result cache (DBMS_RESULT_CACHE.BYPASS(TRUE);), but then you risk performance degradation in the ‘good cases’. Or you can recompile the procedures with RESULT_CACHE removed, but you may then introduce new contention on the library cache.
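
As a minimal sketch of that emergency switch (the parameter name is from my reading of the DBMS_RESULT_CACHE package and should be checked against your version):

-- bypass the result cache instance-wide (call it again with false to re-enable)
exec dbms_result_cache.bypass(bypass_mode => true)

-- check the current state of the result cache
select dbms_result_cache.status from dual;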

 

The post Result Cache: when *not* to use it appeared first on Blog dbi services.

A look into Oracle redo, part 1: redo allocation latches

This will be a series of posts about Oracle database redo handling. The database in use is Oracle version 12.2.0.1, with PSU 170814 applied. The operating system version is Oracle Linux Server release 7.4. In order to look into the internals of the Oracle database, I use multiple tools; very simple ones like the X$ views and oradebug, but also advanced ones, specifically the Intel Pin tools (https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool). One of these tools is ‘debugtrace’, whose output is pretty usable on its own (an indented list of function calls and returns), from which I essentially filter out some data. Another one is ‘pinatrace’, which does not produce directly usable output, because it provides instruction pointers and memory addresses. However, the information it provides is extremely useful (and detailed!) once you translate the addresses to Oracle function names and memory region names. I have written a tool to do that: https://gitlab.com/FritsHoogland/pinatrace_annotate.git

The blog posts are meant for investigating Oracle database internals. This is a microscopic view into how the database works. I do believe that by understanding the mechanics you can assess and resolve issues very easily and quickly, because you have a fundamental understanding of how the database works.

First of all, to immediately show the value of looking into the internals, here is a clear documentation bug: this entry from the online documentation of 12.2: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/cncpt/memory-architecture.html#GUID-C2AD1BF6-A5AE-42E9-9677-0AA08126864B (not sure how long this link will be valid, Oracle is notorious for changing it) states that the log buffer is a single buffer called the log buffer. In fact, the log buffer is not a single buffer, but an area that contains at least two public redo strands, which has been the case since at least Oracle 10g. These strands are externalised in x$kcrfstrand:

SQL> select indx,strand_size_kcrfa, first_buf_kcrfa, last_buf_kcrfa from x$kcrfstrand where last_buf_kcrfa != '00';

      INDX STRAND_SIZE_KCRFA FIRST_BUF_KCRFA  LAST_BUF_KCRFA
---------- ----------------- ---------------- ----------------
         0          67108864 00000000D8000000 00000000DBFFFE00
         1          67108864 00000000DC000000 00000000DFFFFE00

The number of public redo buffers is dependent on the undocumented parameter “_log_parallelism_max”. A quick question out on Twitter revealed that systems with a high processor count could have a higher public redo strand count.
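
A hedged way to check the current value of this hidden parameter is the usual x$ksppi/x$ksppcv query (run as SYS):

select i.ksppinm, v.ksppstvl
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm = '_log_parallelism_max';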

The public strands are allocated in the ‘Redo Buffers’ memory allocation:

SQL> select "AREA NAME", "START ADDR", to_char((to_number(rawtohex("START ADDR"),'xxxxxxxxxxxxxxxx')+"SEGMENT SIZE"),'XXXXXXXXXXXXXXXX'),pagesize from x$ksmssinfo; 

AREA NAME                        START ADDR       TO_CHAR((TO_NUMBE   PAGESIZE
-------------------------------- ---------------- ----------------- ----------
Variable Size                    0000000061000000          D8000000    2097152
Redo Buffers                     00000000D8000000          E07C6000    2097152
Fixed Size                       0000000060000000          60A00000    2097152
skgm overhead                    00000000E1000000          E1003000       4096

Each public strand is protected by a redo allocation latch:

SQL> select addr, child#, name, gets, misses, immediate_gets, immediate_misses from v$latch_children where name = 'redo allocation';

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE17F0         20 redo allocation               5          0              0                0
0000000076FE1750         19 redo allocation               5          0              0                0
0000000076FE16B0         18 redo allocation               5          0              0                0
0000000076FE1610         17 redo allocation               5          0              0                0
0000000076FE1570         16 redo allocation               5          0              0                0
0000000076FE14D0         15 redo allocation               5          0              0                0
0000000076FE1430         14 redo allocation               5          0              0                0
0000000076FE1390         13 redo allocation               5          0              0                0
0000000076FE12F0         12 redo allocation               5          0              0                0
0000000076FE1250         11 redo allocation               5          0              0                0
0000000076FE11B0         10 redo allocation               5          0              0                0

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE1110          9 redo allocation               5          0              0                0
0000000076FE1070          8 redo allocation               5          0              0                0
0000000076FE0FD0          7 redo allocation               5          0              0                0
0000000076FE0F30          6 redo allocation               5          0              0                0
0000000076FE0E90          5 redo allocation               5          0              0                0
0000000076FE0DF0          4 redo allocation               5          0              0                0
0000000076FE0D50          3 redo allocation               5          0              0                0
0000000076FE0CB0          2 redo allocation            6168          1           1708                2
0000000076FE0C10          1 redo allocation            7267         20          34804               15

Redo allocation latch child# 1 is the latch protecting allocations in public redo buffer 0, child# 2 is the latch protecting allocations in public redo buffer 1. Two questions that you would probably ask yourself:
1: why are there 20 redo allocation latches whilst there are only two public redo buffers?
2: how can you tell child# 1 is bound to public redo buffer 0, and child# 2 is bound to public redo buffer 1?

Answers to these questions:
1: the 18 other redo allocation latches are because of a mechanism called ‘private redo strands’, see https://blog.tanelpoder.com/files/Oracle_Performance_Scalability_10g_11g.pdf, https://fritshoogland.wordpress.com/2016/11/15/redo-a-blogpost/
2: in order to be sure about what that redo allocation latch protects, you need to look at the actual function calls.

For this, I am using the Intel Pin tools mentioned above – a suite of tools, one of which is called ‘debugtrace’: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

First I am going to look into how a latch is chosen.

Attach a debugtrace to a normal foreground session that executes an insert and a commit, and filter the output to make it more readable using a custom filter: https://gitlab.com/snippets/1686145

Now search for the latch address (in lowercase) 0x76fe0c10 (the address of the redo allocation latch, see above query from v$latch_children):

 | | | | | | | | | | | | | | | > kcrfw_redo_gen_ext(0x3, 0x7ffd52f513e0, ...)
 | | | | | | | | | | | | | | | | > kcobrh(0x7ffd52f513e0, 0x3, ...)
 | | | | | | | | | | | | | | | | < kcobrh+0x00000000014b returns: 0x1b4
 | | | | | | | | | | | | | | | | > kcb_get_redo_version(0x3, 0x7ffd52f513a0, ...)
 | | | | | | | | | | | | | | | | < kcb_get_redo_version+0x0000000000c4 returns: 0x1
 | | | | | | | | | | | | | | | | > kcmglscn8(0x7ffd52f508d0, 0, ...)
 | | | | | | | | | | | | | | | | | > kcscu8(0x600113b8, 0x7ffd52f508d0, ...)
 | | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | | | | | | | | | | | | | < kcmglscn8+0x000000000054 returns: 0x756239
 | | | | | | | | | | | | | | | | > kscn_to_ub8_impl(0x7ffd52f50838, 0x7ffd52f508d0, ...)
 | | | | | | | | | | | | | | | | < kscn_to_ub8_impl+0x00000000003e returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fdebe0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)                             <<<<<<<<
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > ksvgcls(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < ksvgcls+0x000000000021 returns: 0
 | | | | | | | | | | | | | | | | > kcscu8(0x60016260, 0x7ffd52f508d8, ...)
 | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0xd8eab838, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0xd8eab848, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0x7ffd52f50848, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > kscnbur3_int_impl(0x756239, 0xd8eab824, ...)
 | | | | | | | | | | | | | | | | < kscnbur3_int_impl+0x000000000030 returns: 0x756239
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756231, 0xd8eab840, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > kslfre(0x76fe0c10, 0xd8eab840, ...)
 | | | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0

If we look at the work of Andrey Nikolaev (https://andreynikolaev.files.wordpress.com/2016/03/latches_and_mutexes_in_oracle_12c.pdf),
we know that the first argument of kslgetl is the latch address, and the second argument is 0 for no wait or 1 for willing to wait mode. What becomes clear is that a foreground process gets the redo allocation latch in nowait mode. Of course this is widely known and documented. The return code of 0x1 means the latch was successfully taken. The latch that is fetched just before the redo allocation latch is, of course, the redo copy latch:

SQL> @flba 76fdebe0
ADDR             NAME
---------------- ------------------------
0000000076FDEBE0 redo copy
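
(flba.sql is a small ‘find latch by address’ helper that isn’t included in the post; a hedged stand-in that does the same lookup against both parent and child latches might look like this, with &1 being the address without the 0x prefix:)

select addr, name from v$latch_parent   where lower(rawtohex(addr)) like '%'||lower('&1')
union all
select addr, name from v$latch_children where lower(rawtohex(addr)) like '%'||lower('&1');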

If we look again at the figures of the two redo allocation latches protecting the public redo strands:

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE0CB0          2 redo allocation            6168          1           1708                2
0000000076FE0C10          1 redo allocation            7267         20          34804               15

It becomes apparent that child# 1 is used way more than child# 2, which is visible in the column ‘immediate_gets’. How does that work? Let’s take the first redo allocation child latch, and issue an insert again with debugtrace on it.

session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x76FE0C10 0 0 2442
Function returned 1

session 2; debugtrace output:

 | | | | | | | | | | | | | | | | > kslgetl(0x76fdebe0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1

First child# 1 is tried but, because the latch is already taken (indicated by return code 0), the session turns to the second latch.

What will happen when both latches are taken?

session 1:

SQL> oradebug call kslgetl 0x76FE0C10 0 0 2442
Function returned 1
SQL> oradebug call kslgetl 0x76FE0CB0 0 0 2442
Function returned 1

This is what it looks like in the debugtrace output:

 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | | | | | | | | | | | | | | | | > kslges(0x76fe0cb0, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | > ksuSdiInProgress(0x2, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | < ksuSdiInProgress+0x000000000035 returns: 0
 | | | | | | | | | | | | | | | | | | > kskthbwt(0, 0, ...)
 | | | | | | | | | | | | | | | | | | < kskthbwt+0x0000000000a1 returns: 0x10
 | | | | | | | | | | | | | | | | | | > kslwlmod(0x7ffc45d6ab00, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | | > kstmgetsectick(0x7ffc45d6ab00, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | | < kstmgetsectick+0x00000000003a returns: 0x5a4bd4b9
 | | | | | | | | | | | | | | | | | | < kslwlmod+0x0000000000cb returns: 0xffffffff
 | | | | | | | | | | | | | | | | | | > kslwtbctx(0x7ffc45d6a988, 0x7a63c588, ...)

What this shows is the foreground session trying to get one redo allocation child latch in immediate mode, which failed, then the other redo allocation child latch in immediate mode, which failed too, after which it had no other option than to get a redo allocation latch in willing to wait mode. For that it performed a redo allocation latch get again (for the child at 0x76fe0cb0) with the second argument set to 1, indicating willing to wait mode. Because the latch was still taken, kslgetl called kslges in order to spin for it (which means a series of retries of getting the latch; on Linux the spinning happens inside kslges, so it is not visible in a debugtrace). After it had spun 20000 times, the process registers a wait (kslwtbctx), and then goes to sleep on a semaphore in the wait:

 | | | | | | | | | | | | | | | | | | < kslwtbctx+0x00000000050b returns: 0x77c000a4
 | | | | | | | | | | | | | | | | | | > skgpwwait(0x7ffc45d6abf8, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | > kslwait_conv_wait_time(0xffffffff, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | < kslwait_conv_wait_time+0x000000000027 returns: 0xffffffff
 | | | | | | | | | | | | | | | | | | | > sskgpwwait(0x7ffc45d6abf8, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | | > semop@plt(0x98000, 0x7ffc45d6a888, ...)

This is the modern latch implementation where a session goes to sleep on a semaphore after spinning for a latch in willing to wait mode. In order to learn more about that, look at the work of Andrey Nikolaev. The semaphore implementation means that as soon as the latch becomes available, the session is posted and can do its allocation in the public redo strand.

However, this does not tie the child latch to a public strand; it only shows how a session bunny-hops the public redo allocation latches in nowait mode until it hits a blind wall (if all of them have been taken), after which it starts waiting in willing to wait mode, waiting to be posted.

In order to see the connection between the redo allocation latches and the actual public strand memory areas, we need to look one level deeper, which means looking at the (main) memory accesses of a process. This can be done using the Pin tools pinatrace tool, and adding annotations to the pinatrace output in order to understand what is going on. For this, I created a tool: https://gitlab.com/FritsHoogland/pinatrace_annotate

First, create a pinatrace of a normal insert on a test database. A word of warning: if you think a function call trace generates a lot of detailed data, the pinatrace, alias memory access trace, generates *WAY* more. Even something modest like ‘select count(*) from dual’ generates heaps of data.

Back to the pinatrace of the normal insert: generate the trace, and then annotate it using the pinatrace_annotate tool. This should show that the insert takes redo allocation latch child #1. To validate, search for the latch address again:

kcrfw_redo_gen_ext+1479:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1486:0x00007fff5d88efe8():W:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1493:0x00007fff5d88efc8():R:4:0/0()
kcrfw_redo_gen_ext+1516:0x00007fff5d88efe8():R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )

What this shows is the kcrfw_redo_gen_ext function reading address 0x60016570, which is in the fixed SGA, which belongs to a ‘fixed sga variable’ called kcrfsg_ at offset 840, reading the value 0x76fe0c10, which we already know is the address of redo allocation latch child #1. The latch address is written into address 0x7fff5d88efe8, which is in the process’ PGA, but we got no annotation of it, because there is nothing between the round brackets. Later, the latch address in PGA is read again.

We know the kcrfsg_ variable/struct from my earlier investigation into how the logwriter works. This is a struct in fixed sga that holds data important to kcrf (kernel cache redo file). My current guess is sg means ‘SGA’.
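
As an aside, the fixed SGA variables (and their addresses) are externalised in x$ksmfsv, so a quick way to list the kcrf-related ones – a hedged example, run as SYS – is:

select ksmfsnam, ksmfsadr, ksmfssiz
from   x$ksmfsv
where  ksmfsnam like 'kcrf%';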

Only a little further in the trace, we see that the kslgetl function is invoked, and indeed it is trying to get this latch:

kslgetl+69:0x0000000076fe0c1f(Variable Size(pgsz:2048k)|+368970783 shared pool|(child)latch#1:redo allocation+15 shared pool|permanent memor,duration 1,cls perm+16120863 ):R:1:0x37/55()

This is what a successful latch get looks like in the trace:

kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):R:8:0/0()
kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):W:8:0x1a/26()

The work of Andrey Nikolaev describes that when a latch is taken, the Oracle pid of the holder is written at offset 0. Here we see that at offset 0 the kslgetl function reads a 0, indicating the latch is free, after which the value 26, the Oracle pid of the process, is written into it, which means this process has now taken the latch.
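
If you want to check the Oracle pid of your own session (the number you would expect to see written into a latch you hold), a simple query – a hedged aside, not part of the original trace – is:

select p.pid
from   v$process p, v$session s
where  p.addr = s.paddr
and    s.sid  = sys_context('userenv', 'sid');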

The next step is to see if there is any indication of it pointing to the public redo strand. It should be close to the latch get, because a latch should be held for as little time as possible. Essentially, what I did next was search for ‘Redo Buffers’. I found this a little further on, still in the kcrfw_redo_gen_ext function:

kcrfw_redo_gen_ext+11277:0x0000000076fdf3bc(shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf3d8(shared pool|permanent memor,duration 1,cls perm+16114648 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf430(shared pool|permanent memor,duration 1,cls perm+16114736 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3d0(shared pool|permanent memor,duration 1,cls perm+16114640 ):R:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11312:0x0000000076fdf428(shared pool|permanent memor,duration 1,cls perm+16114728 ):W:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3c8(shared pool|permanent memor,duration 1,cls perm+16114632 ):R:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11322:0x0000000076fdf420(shared pool|permanent memor,duration 1,cls perm+16114720 ):W:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11327:0x0000000076fdf3c0(shared pool|permanent memor,duration 1,cls perm+16114624 ):R:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11332:0x0000000076fdf418(shared pool|permanent memor,duration 1,cls perm+16114712 ):W:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11337:0x0000000076fdf3b8(shared pool|permanent memor,duration 1,cls perm+16114616 ):R:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11342:0x0000000076fdf410(shared pool|permanent memor,duration 1,cls perm+16114704 ):W:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11347:0x0000000076fdf3b0(shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf408(shared pool|permanent memor,duration 1,cls perm+16114696 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:8:0x7270/29296()
kcrfw_redo_gen_ext+11312:0x0000000076fdf400(shared pool|permanent memor,duration 1,cls perm+16114688 ):W:8:0x7270/29296()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf398(shared pool|permanent memor,duration 1,cls perm+16114584 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf3f0(shared pool|permanent memor,duration 1,cls perm+16114672 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf390(shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11342:0x0000000076fdf3e8(shared pool|permanent memor,duration 1,cls perm+16114664 ):W:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11347:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11352:0x0000000076fdf3e0(shared pool|permanent memor,duration 1,cls perm+16114656 ):W:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x7270/29296()
kcrfw_redo_gen_ext+11386:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):W:4:0x7271/29297()
kcrfw_redo_gen_ext+11392:0x0000000076fdf468(shared pool|permanent memor,duration 1,cls perm+16114792 ):R:8:0xe0000000/3758096384(Redo Buffers(pgsz:2048k)|+134217728 )

There’s a lot to say about this. The first thing to point out is that these are all memory locations in a shared pool permanent memory allocation somewhere around 0x76fdf300 / 0x76fdf400. The first public strand is explicitly mentioned (by my annotation of the memory structures):

kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )

As well as the second public strand:

kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )

I am confident that the apparent reference to the second public strand here is just a coincidence: the value being written is a sort of ‘not used’ filler, as you can see from its hexadecimal representation, 0xdeadbeef, which just happens to fall inside the PUB_REDO_1 address range.

The second thing to notice is that there is a lot of information written very close together (0x76fdf300-0x76fdf400). The only location I know that holds this kind of redo-related information is the previously mentioned kcrfsg_ struct, but this seems to be another location that holds critical information about redo, of which all we know is that it sits in the shared pool (not in the fixed SGA) and is allocated as permanent memory.

I tried various things to get more information about this memory location, like dumping the SGA heap, inspecting this memory location in a heap dump to see if there is a sub-heap that I can dump for more information (there isn’t), and asking on Twitter (which sometimes helps), but I got no information. I spent a considerable amount of time searching the bug database, and still got no answers.

I decided to prove usage of the second public strand by manually obtaining the latch for the first strand, so that an insert would take the second latch, and then checking whether that indeed means the second public redo strand is used. This means taking the first latch using oradebug in one session, then starting another database session, attaching a pinatrace to it, and executing an insert.

Once the pinatrace is annotated, search for 0x76fe0c10 (the first redo allocation latch) again. The start is exactly the same: the latch address is fetched from kcrfsg_ at offset 840.

kcrfw_redo_gen_ext+1479:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1486:0x00007ffcca22e6e8():W:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1493:0x00007ffcca22e6c8():R:4:0/0()
kcrfw_redo_gen_ext+1516:0x00007ffcca22e6e8():R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )

Now look at the same offsets in the function kslgetl (+138) again:

kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):R:8:0x27/39()
kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):W:8:0x27/39()

Now we see that the latch already has a value at offset zero, 39, indicating the latch is taken. Because the session executing the insert found the first redo allocation latch busy, this happens first:

kcrfw_redo_gen_ext+14267:0x00000000600169c4(Fixed Size(pgsz:2048k)|+92612 fixed sga|var:kcrf_actv_strands_+0 ):R:4:0x1/1()
kcrfw_redo_gen_ext+14273:0x00007ffcca22e6d8():R:4:0x2/2()
kcrfw_redo_gen_ext+14283:0x00000000600169c4(Fixed Size(pgsz:2048k)|+92612 fixed sga|var:kcrf_actv_strands_+0 ):W:4:0x2/2()

Because the latch for the first public redo strand was taken, an additional strand is enabled by setting the fixed sga variable kcrf_actv_strands_ to 2, indicating the second strand is now active too.

Then kcrfsg_ at offset 840 is read again, but a higher value is saved in PGA memory. In fact, the latch size is known (160 bytes), so in order to get the next latch, the session picks up the latch address of child #1 and adds 160 to get to child #2.

kcrfw_redo_gen_ext+14335:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+14342:0x00007ffcca22e6e8():W:8:0x76fe0cb0/1996360880(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 )
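
A quick sanity check of that arithmetic (160 = 0xa0, the size of a latch structure), using nothing more than a conversion in SQL:

-- 0x76fe0c10 + 160 = 0x76fe0cb0, the address of redo allocation child #2
select to_char(to_number('76fe0c10','xxxxxxxxxx') + 160, 'fmxxxxxxxxxx') as next_latch from dual;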

This latch is free, and is taken by this session:

kslgetl+138:0x0000000076fe0cb0(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 ):R:8:0/0()
kslgetl+138:0x0000000076fe0cb0(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 ):W:8:0x1a/26()

Now let’s look at the same function locations with the first redo allocation latch taken, and the second redo allocation latch acquired:

kcrfw_redo_gen_ext+11277:0x0000000076fdf4e4(shared pool|permanent memor,duration 1,cls perm+16114916 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf500(shared pool|permanent memor,duration 1,cls perm+16114944 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf558(shared pool|permanent memor,duration 1,cls perm+16115032 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4f8(shared pool|permanent memor,duration 1,cls perm+16114936 ):R:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11312:0x0000000076fdf550(shared pool|permanent memor,duration 1,cls perm+16115024 ):W:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4f0(shared pool|permanent memor,duration 1,cls perm+16114928 ):R:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11322:0x0000000076fdf548(shared pool|permanent memor,duration 1,cls perm+16115016 ):W:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11327:0x0000000076fdf4e8(shared pool|permanent memor,duration 1,cls perm+16114920 ):R:8:0x800/2048()
kcrfw_redo_gen_ext+11332:0x0000000076fdf540(shared pool|permanent memor,duration 1,cls perm+16115008 ):W:8:0x800/2048()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4e0(shared pool|permanent memor,duration 1,cls perm+16114912 ):R:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11342:0x0000000076fdf538(shared pool|permanent memor,duration 1,cls perm+16115000 ):W:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4d8(shared pool|permanent memor,duration 1,cls perm+16114904 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf530(shared pool|permanent memor,duration 1,cls perm+16114992 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:8:0xe6/230()
kcrfw_redo_gen_ext+11312:0x0000000076fdf528(shared pool|permanent memor,duration 1,cls perm+16114984 ):W:8:0xe6/230()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4c8(shared pool|permanent memor,duration 1,cls perm+16114888 ):R:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf4c0(shared pool|permanent memor,duration 1,cls perm+16114880 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf518(shared pool|permanent memor,duration 1,cls perm+16114968 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4b8(shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11342:0x0000000076fdf510(shared pool|permanent memor,duration 1,cls perm+16114960 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4b0(shared pool|permanent memor,duration 1,cls perm+16114864 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11352:0x0000000076fdf508(shared pool|permanent memor,duration 1,cls perm+16114952 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11366:0x0000000076fdf560(shared pool|permanent memor,duration 1,cls perm+16115040 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0xe6/230()
kcrfw_redo_gen_ext+11386:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):W:4:0xe7/231()
kcrfw_redo_gen_ext+11392:0x0000000076fdf590(shared pool|permanent memor,duration 1,cls perm+16115088 ):R:8:0xe0200000/3760193536(Redo Buffers(pgsz:2048k)|+136314880 )

Here we see reads and writes in the same fashion, including the 0xdeadbeef write, but now we see PUB_REDO_1, alias the second public redo strand. This proves the second redo allocation child latch is really protecting the second public redo strand!

Case Study – 1

It has been some time since I wrote an article walking through the analysis of information on an AWR report, but a nice example appeared a few weeks ago on Twitter that broke a big AWR picture into a sequence of bite-sized chunks that made a little story, so here it is, replayed in sync with my ongoing thoughts. The problem started with the (highly paraphrased) question – “How could I get these headline figures when all the ‘SQL ordered by’ sections of the report show captured SQL account for 0.0% of Total?”. The report was a 30 minute snapshot from 11.2.0.3, and here’s the Load Profile (image: https://jonathanlewis.files.wordpress.com/2018/01/load_profile.jpg – the Load Profile section of the report). As you can see, the database was doing quite a lot of work as measured by the physical reads and writes, the number of block changes and size of redo, and the fact that the average CPU usage by the foreground sessions in this instance accounts for 9.5 CPUs. Generally speaking the 49 seconds per second of DB time is also a fairly good hint, when combined with the other numbers, that the instance is busy but, in the absence of any other information, that could be one session holding a blocking lock with 48 other sessions waiting for the lock to be released.

There are a couple of unusual (though not impossible) features to this profile. Given the amount of work, the figures for user calls and executes look very small – again not technically impossible, just unlikely in “normal” database processing given the apparent workload – and bear in mind that the “SQL ordered by ..” figures seem to account for none of the workload. Similarly the figures for logical reads and block changes are a little unusual (for “normal” processing) – on average this instance seems to have modified every block it visited (without any work showing up in the captured SQL).

Next in the set of images was the Time Model (image: https://jonathanlewis.files.wordpress.com/2018/01/time_model.jpg – the Time Model section of the report). As you can see, the foreground time (DB time) is 85,944 seconds, of which foreground CPU usage (DB CPU) is 16,735 seconds – with about 69,000 seconds unaccounted for! There’s virtually no time spent on PL/SQL or SQL, and RMAN doesn’t even make an appearance (I mention RMAN specifically because there was at least one version of Oracle where the RMAN time was accidentally lost from this summary).

So what do the Top Timed Events look like? (image: https://jonathanlewis.files.wordpress.com/2018/01/top_timed.jpg – the Top 5 Timed Events section of the report)

It’s nice to see that this is consistent with the Load Profile: the DB CPU matches, and there are a lot of physical reads (a quick check says that 6,560,642/1,800 = 3,644 … which is pretty close to the 3,746 physical reads per second in the Load Profile).  There’s one major anomaly here, though: the huge number of (and time spent on) latch: row cache objects waits. Even though it’s not the single largest component of time it’s the most obvious one to pursue, so the next bit of the report to check is the Dictionary Cache Statistics, with the Tablespace IO Stats and Segments by Physical Reads to follow. I don’t have an image for the dictionary cache stats, but the answer to the question “What were all the rowcache object gets for?” was: “dc_tablespaces (214,796,434)” – which (probably) told me everything I needed to know.

I could show you the specific Instance Activity statistic that I wanted to see next, but I’ll postpone that for a moment and jump to the Tablespace IO Stats – which we were planning to look at anyway, and might have gone to next if we hadn’t known the rocket science behind the massive number of gets on dc_tablespaces.

[Tablespace IO Stats: https://jonathanlewis.files.wordpress.com/2018/01/tablespace_io.jpg]

That’s a horrendous number of (single block) reads of the undo tablespace – and why would they be happening? The instance has been engaged in some massive rollback activity (and the transactions being rolled back are on objects in the GIRO tablespace – which is why it is also suffering a lot of single block reads), and this is the point where we jump to the relevant Instance Activity statistic to confirm the claim:

[Instance Activity – undo records applied: https://jonathanlewis.files.wordpress.com/2018/01/undo_applied.jpg]

There are two other “undo records applied” statistics, but we don’t need to think about them – the match between the count of records applied and the number of gets on dc_tablespaces is nearly perfect. Almost everything that this instance is doing is rolling back – there must have been some enormous data change (possibly just one, possibly parallel-enabled) that failed in the previous half hour and now the entire mess is being cleaned up.

One little detail to note – the “undo records applied” per second is 122,355, but the Load Profile reported 247,885 “Block changes” per second. The two figures are consistent with each other. Each application of an undo record is two block changes – the first when you modify the original data block, the second when you update the undo record itself to flag it as “applied”: 122,355 * 2 = 244,710, which is a good match for 247,885.
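
(On a live system the same cross-check can be done straight from v$sysstat – run a query like the following twice, a known interval apart, and compare the deltas. A minimal sketch, using the statistic names as they appear in recent versions – check v$statname on yours:)

select
        name, value
from
        v$sysstat
where
        name in (
                'rollback changes - undo records applied',
                'db block changes'
        )
;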

Final Thoughts

There is a second reason why you could see lots of gets on dc_tablespaces – but I think that was a bug in 9i relating to temporary tablespace groups. I discovered the phenomenon as a side effect of rolling back in the 8i timeline and I’ve not actually checked what an AWR report really would look like if I forced a massive rollback to take place as the only workload across the interval – so here’s a quick test I constructed and ran to finish the thread off:


set timing on

create table t1 as
with g as (select rownum id from dual connect by level <= 1000)
select rownum id, rownum n1, rpad ('x',150) padding from g,g
;

create index t1_i1 on t1(id);

begin
        for i in 1..1000000 loop
                update t1 set n1 = n1 where id = i;
        end loop;
end;
/

alter system flush buffer_cache;

pause take AWR snapshot from another session

rollback;

prompt take AWR snapshot from another session

On the small VM I used to run the test it took a little over 60 seconds to run the update, and the same again to complete the rollback. The “DB time” shown in the Time Model section of the AWR report was 62.0 seconds, while the “sql execute elapsed time” was 3.9 seconds (which was the SQL executed while capturing the AWR data).
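
(If you repeat the test you don’t have to wait for the AWR report to see the rollback at work: while the rollback is running you could poll the session’s statistics from another session with something like the following sketch – the SID is a placeholder you have to supply:)

select
        sn.name, st.value
from
        v$sesstat       st,
        v$statname      sn
where
        st.statistic#   = sn.statistic#
and     st.sid          = &sid
and     sn.name in (
                'rollback changes - undo records applied',
                'db block changes',
                'physical reads'
        )
;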

Conclusion

This was a problem where the database seemed to have done a lot of work that couldn’t be attributed to any SQL. While I had a little rocket science up my sleeve that may have allowed me to identify the source more quickly and with more confidence than the average DBA, all I’ve done in this note is highlight a couple of oddities and big numbers that anyone could have spotted, and follow a couple of simple guesses:

a) DB time was large, but sql (and pl/sql) time was non-existent

b) Top timed events were I/O and latches – so identify the latch

c) The specific latch pointed towards tablespaces – so check the Tablespace I/O and note the UNDO tablespace

d) Look at any Instance Activity statistics about “undo”.

e) Job done – but a quick test case is the icing on the cake.

 

 

Oracle database block checksum XOR algorithm explained

Recently I’ve started to write my own clone of BBED to have something handy and useful in extreme cases when you have to go deep and fix stuff at a low level (I have only like 2 such cases a year but each time it is really fun and a nice money …

Testing Oracle SQL online

Want to test some DDL, a query, or check an execution plan? You need only a browser. And you can copy-paste, or simply link, your test-case in a forum, an e-mail, or a tweet. Here is a small list (which I expect to grow from your comments) of free online services which can run your SQL against an Oracle Database: SQL Fiddle, Rextester, db<>fiddle and Oracle Live SQL.
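
The kind of thing you paste into any of these services is a small, self-contained test case – DDL, data and the query you have a question about, all together. A deliberately trivial sketch, just to show the shape of it:

create table demo ( id number primary key, val varchar2(10) );
insert into demo select rownum, 'x' from dual connect by level <= 10;
commit;
select count(*) from demo where id > 5;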

SQL Fiddle

SQL Fiddle lets you build a schema and run DDL on the following databases:

  • Oracle 11gR2
  • Microsoft SQL Server 2014
  • MySQL 5.6
  • Postgres 9.6 and 9.3
  • SQLite (WebSQL and SQL.js)

For an Oracle user, the Oracle 11gR2 choice is not very useful as it is a version from 2010. But there’s a simple reason for that: it’s the latest free version – Oracle XE – and a free online service can run only free software. Now that Oracle plans to release an XE version every year, this should improve soon.

Example: http://sqlfiddle.com/#!4/42960/1/0


Rextester

Rextester is a service to compile and run code online, in a lot of languages, and also to run queries against the following databases:

  • Oracle 11gR2
  • Microsoft SQL Server 2014
  • MySQL 5.7
  • PostgreSQL 9.6

Example: http://rextester.com/QCYJF41984

Rextester also has an API where you can run a query and get a JSON answer:

$ curl -s --request POST --data 'LanguageChoice=35&Program=select * from dual' http://rextester.com/rundotnet/api
{"Warnings":null,"Errors":null,"Result":"\u003ctable class=\"sqloutput\"\u003e\u003ctbody\u003e\u003ctr\u003e\u003cth\u003e\u0026nbsp;\u0026nbsp;\u003c/th\u003e\r\n\u003cth\u003eDUMMY\u003c/th\u003e\r\n\u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003e1\u003c/td\u003e\r\n\u003ctd\u003eX\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003c/tbody\u003e\u003c/table\u003e\r\n","Stats":"absolute service time: 1,37 sec","Files":null}

The answer has the result as an HTML table:

$ curl -s --request POST --data 'LanguageChoice=35&Program=select * from dual' http://rextester.com/rundotnet/api | jq -r .Result

<table class="sqloutput"><tbody><tr><th>&nbsp;&nbsp;</th>
<th>DUMMY</th>
</tr>
<tr><td>1</td>
<td>X</td>
</tr>
</tbody></table>


Here is my SELECT * FROM DUAL:

$ curl -s --request POST --data 'LanguageChoice=35&Program=select * from dual' http://rextester.com/rundotnet/api | jq -r .Result | lynx -dump -stdin
DUMMY
1 X


db<>fiddle

db<>fiddle has a very nice interface, easy to link and easy to paste to StackOverflow (click on ‘markdown’)

  • Oracle 11gR2
  • SQL Server 2014, 2016 and 2017, and even the 2017 Linux version
  • MariaDB 10.2
  • SQLite 3.8
  • PostgreSQL 8.4, 9.4, 9.6 and 10

Example: http://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=948a067dd17780ca65b01243751c2cb0


Oracle Live SQL

Finally, you can also run on the latest release of Oracle, with a service provided by Oracle itself: Live SQL.

  • Oracle 12cR2 (an early build from October 2016)

Example: https://livesql.oracle.com/apex/livesql/s/f6ydueahcslf66dlynagw9s3w


 

The article Testing Oracle SQL online appeared first on Blog dbi services.

Have Dog, Will Travel

So I went on a business trip to visit my company headquarters this last week and I took my dog with me.  Yes, you heard that right, my dog.

My 10lb Chihuahua/Basenji mix dog, Esme, is actually a service dog, which means she can go almost anywhere with me.  She’s gone through a good amount of training to do so and honestly, she travels better than most adults.  Although she’s had all the training, the fact can’t be ignored that she was a severely abused puppy and still suffers some residual effects of this.  The task she performs is at night for a sleep disorder, but as I travel, it’s important that she travels and is socialized as part of her training, and she’s in that odd category of service animal that doesn’t fit most people’s idea of what a service dog is.

She’s an adorable animal and as loving as she is to me, it can be difficult for other people to understand that it’s not acceptable to come into her space and try to pet her, or that, as friendly as she may seem, she still isn’t too keen on new people.  Her connection to me is part of her job and it’s necessary for her to perform her job, but it will also result in her being more protective of me if you come into my space.  I’ve had to get a “Do Not Pet” label for her Service Dog harness due to this.

The How

Now, with that said, I’ve been asked how to best travel with a dog.  I’ve never found it that difficult to still travel with Esme, a small backpack and my 20 inch suitcase, but there are some fast rules and tips to know that can make the best of the trip.

  1. Dogs travel rather inexpensively on flights.  Most airlines have become quite pet friendly, but the animal must remain in a small kennel that will fit under the seat in front of your own unless it’s an ESA (Emotional Support Animal) or service animal.
  2. Train your dog.  If you plan on traveling with them, train them not to bark and travel with them in a car extensively before air travel to get them accustomed to sitting still and being content with the travel arrangements.
  3. Make them as comfortable as possible.
    1. Get a thundershirt or similar to help ease the stress of the travel.
    2. Use a pheromone spray that will help to relax the dog.  You spray it on a blanket, or on the thundershirt – there’s a special patch for it.
    3. If the dog is a small service or ESA dog, consider a pouch.  I have people ask me all the time where I got Esme’s Cozy and Dozy Sling.  She looks like a little wallaby in it, but she even sleeps in it while I work-  typing on my computer, attending meetings.  People are just amazed that she’s completely content to snuggle up and snooze in it.  It also comes with pockets for my cell phone when I’m getting on the plane, has a short leash with hook to secure her and a couple zip pockets to keep necessities.
    4. Collapsible food and water bowl.  Often made of silicone, they collapse flat and fit in a zipper pocket.  I keep a collapsible water bottle with me and often give her water in her bowl from it as we travel.
    5. Take your dog’s customary food with you in a Tupperware, along with their favorite toys.  Esme’s is a Nylabone that she’s happy to chew on when we relax for the evening.
    6. And take a roll of dog poop bags.  Yeah, you’re going to need those and often at the most inopportune times.
  4. Know where the relief areas are at the departure and arrival airports.  Most airports now have them and it’s important to know, so look online.
  5. Make sure they are up to date on all their vaccinations and have all their tags present on their collar.
  6. If they are an ESA or Service dog, you should have a letter or documentation showing this.
  7. Make sure to call the airlines BEFORE checking in for your flight so they can add your pet to your ticket.  Then check in, otherwise they may have to “uncheck” you in to add the pet information.
  8. Book with a pet-friendly hotel, like aLoft, and call beforehand to find out what accommodations the hotel has for pets.  I’ve had ones that just accepted pets and ones that are genuinely pet-friendly, and there’s a big difference.  Ones that are pet-friendly will often have quiet yards with receptacles for pet refuse, walking areas and rooms set aside that have easy access to the outdoors for early morning outings, etc.  aLoft actually has dog beds, bowls and pet yards.  When trying to make a pet comfortable in a new surrounding, this can make a huge difference.

The day of a trip, I feed her a little earlier, ensure she has at least two hours afterwards to relieve herself, and then give her only a little bit of water until the flight.  I won’t feed her again until we’re at the hotel, which gives her a reason to be happy to be in the new place, too.  This will keep her from getting ill if there is turbulence, etc.

I’m sure there are other things I’m doing to help make Esme a good traveler, but these are some of the tips I can offer to make it easier for everyone involved.  I love that people are astounded at what a great traveling companion she is.  I’d do anything to take away the abuse she went through as a puppy before I rescued her at 5 months old, but she’s got it pretty good now.  She’s earned it and she earns her keep.


 



Copyright © DBA Kevlar [Have Dog, Will Travel], All Right Reserved. 2018.

The post Have Dog, Will Travel appeared first on DBA Kevlar.

Getting started…adding an account to use

If you’ve read my previous post about getting started with the Oracle database, then hopefully you now have your very own database installed and running, and you have explored a little with the sample schemas using SQL Developer.  Perhaps now you want to venture out into your own database development, and for that, you will want to create your own user account and create your own tables.  Here’s another video which will guide you through the process.

The video shows the process of creating a user via SQL Developer. If you are a command line addict, then here are the commands you will need to run to achieve the same. You can use SQL*Plus or SQLcl to run these commands. If you’ve never done that, simply:

  • open up a terminal or MS-DOS command window,
  • type sqlplus and hit Enter
  • Type SYSTEM for the username and hit Enter
  • Type the password you specified when you created the database as per my previous blog post.

You will see the prompt “SQL>”.  Enter the following commands to create your first user account.


SQL> create user DEVELOPER identified by MyCaseSensitivePassword;

User created.

SQL> grant connect, resource to DEVELOPER;

Grant succeeded.

SQL> alter user DEVELOPER quota unlimited on USERS;

User altered.

And you are now good to go. Fire up SQL Developer as shown in the video, and you are ready to connect as your new user account as shown at this moment in the video.
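
If you prefer to stay on the command line for the next step as well, here is a sketch of what a first session as the new account might look like – the table is purely an illustration:


SQL> connect DEVELOPER/MyCaseSensitivePassword

Connected.

SQL> create table my_first_table ( id number, created date default sysdate );

Table created.

SQL> insert into my_first_table ( id ) values ( 1 );

1 row created.

SQL> commit;

Commit complete.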

Remember – we’re doing this to learn and explore Oracle. It is very easy to see all of the functionality available to you, and to leap in and start building a fully-fledged application.  But don’t forget – for that, you need more than just a play-thing at home. You need backups; you need security; you need scalability; you need redundancy and failover.  So if you’re ready to get serious, you probably need to think about using a cloud account or similar.

Enjoy your database !

Is NFS on ZFS slowing you down?

If you think so, check out shell script “ioh.sh” from github at  https://github.com/khailey/ioh

Introduction and Goals

The goal of ioh.sh is to measure both the throughput and latency of the different code layers when using NFS mounts on a ZFS appliance. The ZFS appliance code layers inspected with the script are I/O from the disks, ZFS layer and the NFS layer. For each of these layers the script measures the throughput, latency and average I/O size. Some of the layers are further broken down into other layers. For example NFS writes are broken down into data sync, file sync and non-sync operations and NFS reads are broken down into cached data reads and reads that have to go to disk.

The three primary questions ioh is used to answer are:

  • Is I/O latency from the I/O subsystem to the ZFS appliance sufficiently fast?
  • Is NFS latency from the ZFS appliance to the consumer sufficiently fast?
  • Is ZFS adding unusual latency?

One: If the latency from the I/O subsystem is not adequate, then look into supplying a better performing I/O subsystem for the ZFS appliance. For example, if the goal is a 3ms write time per 1K redo write but the underlying I/O subsystem is taking 6ms, then it will be impossible for the ZFS appliance to meet those expectations.

Two: If the latency for the NFS response from the ZFS appliance is adequate and yet the NFS client reports latencies that are much slower (more than 2ms slower), then one should look instead at problems in the NIC, network or NFS client host – see network tracing, for example http://datavirtualizer.com/tcp-trace-analysis-for-nfs/

Three: If the I/O latency is sufficiently fast but ZFS latency is slow, then this could indicate a problem in the ZFS layer.

The answer to the question “what is adequate I/O latency” depends. In general a single random 8 KB block read on Oracle is expected to take 3-12 ms on average, thus a typical latency is around 7.5 ms. NOTE: when measuring I/O latency on the source system it’s important to use a tool like “iostat” that will show the actual I/Os to the subsystem. The I/O latency measured by the Oracle database will include both I/Os satisfied from the host file system cache as well as those from the I/O subsystem, unless the database is running with direct I/O.

The ioh tool can also give insight into other useful information such as

  • Are IOPs getting near the supported IOPs of the underlying I/O subsystem?
  • Is NFS throughput getting near the maximum bandwidth of the NIC?

For example if the NIC is 1GbE then the maximum bandwidth is about 115MB/s, and generally 100MB/s is a good rule of thumb for the maximum. If throughput is consistently near the NIC maximum, then demand is probably going above the maximum and thus increasing latency.

$ ioh.sh -h

usage: ./ioh.sh options

collects I/O related dtrace information into file "ioh.out"
and displays the

OPTIONS:
  -h              Show this message
  -t  seconds     runtime in seconds, defaults to forever
  -c  seconds     cycle time ie time between collections, defaults to 1 second
  -f  filename    change the output file name [defautls to ioh.out]
  -p              parse the data from output file only,  don't run collection
  -d  display     optional extra data to show: [hist|histsz|histszw|topfile|all]
                    hist    - latency histogram for NFS,ZFS and IO both reads and writes
                    histsz  - latency histogram by size for NFS reads
                    histszw - latency histogram by size for NFS writes
                    topfile - top files accessed by NFS
                    all     - all the above options
                  example
                    ioh.sh -d "histsz topfile"

There are two optional environment variables: CWIDTH – histogram column width; PAD – character between columns in the histograms (null by default).

Running

$ sudo ./ioh.sh 

This outputs to the screen and puts the raw output into the default file name “ioh.out.[date]”. The default output file name can be changed with the “-o filename” option. The raw output can later be formatted with:

$ ./ioh.sh -p  ioh.out.2012_10_30_10:49:27

By default it will look for “ioh.out”. If the raw data is in a different file name it can be specified with “-o filename”.

The output looks like

date: 1335282287 , 24/3/2012 15:44:47
TCP out:  8.107 MB/s, in:  5.239 MB/s, retrans:        MB/s  ip discards:
----------------
            |       MB/s|    avg_ms| avg_sz_kb|     count
------------|-----------|----------|----------|--------------------
R |      io:|     0.005 |    24.01 |    4.899 |        1
R |     zfs:|     7.916 |     0.05 |    7.947 |     1020
C |   nfs_c:|           |          |          |        .
R |     nfs:|     7.916 |     0.09 |    8.017 |     1011
- 
W |      io:|     9.921 |    11.26 |   32.562 |      312
W | zfssync:|     5.246 |    19.81 |   11.405 |      471
W |     zfs:|     0.001 |     0.05 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     5.215 |    19.94 |   11.410 |      468
W |nfssyncF:|     0.031 |    11.48 |   16.000 |        2

The sections are broken down into

  • Header with date and TCP throughput
  • Reads
  • Writes

Reads and Writes are further broken down into

  • io
  • zfs
  • nfs

For writes, the non-stable storage writes are separated from the writes to stable storage, which are marked as “sync” writes. For NFS the sync writes are further broken down into “data” and “file” sync writes.

examples:

The following will refresh the display every 10 seconds and display an extra four sections of data

$ sudo ./ioh.sh -c 10 -d "hist histsz histszw topfile"   

date: 1335282287 , 24/3/2012 15:44:47
TCP out:  8.107 MB/s, in:  5.239 MB/s, retrans:        MB/s  ip discards:
----------------
            |       MB/s|    avg_ms| avg_sz_kb|     count
------------|-----------|----------|----------|--------------------
R |      io:|     0.005 |    24.01 |    4.899 |        1
R |     zfs:|     7.916 |     0.05 |    7.947 |     1020
R |     nfs:|     7.916 |     0.09 |    8.017 |     1011
- 
W |      io:|     9.921 |    11.26 |   32.562 |      312
W | zfssync:|     5.246 |    19.81 |   11.405 |      471
W |     zfs:|     0.001 |     0.05 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     5.215 |    19.94 |   11.410 |      468
W |nfssyncF:|     0.031 |    11.48 |   16.000 |        2
---- histograms  -------
    area r_w   32u   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .5s    1s    2s    2s+
R        io      .     .     .     .     .     .     .     .     .     1     3     1
R       zfs   4743   287    44    16     4     3     .     .     .     1     2     2
R       nfs      .  2913  2028    89    17     3     .     .     .     1     2     2
-
W        io      .     .     .    58   249   236    50    63   161   381   261    84    20     1
W       zfs      3    12     2
W   zfssync      .     .     .     .    26   162   258   129   228   562   636   250    75    29
W       nfs      .     .     .     .    12   164   265   134   222   567   637   250    75    29
--- NFS latency by size ---------
    ms   size_kb
R   0.1    8     .  2909  2023    87    17     3     .     .     .     1     2     2
R   0.1   16     .     4     5     2
-
W   5.0    4     .     .     .     .     8    49    10     3     4    11     4     2     1
W  21.4    8     .     .     .     .     4    55   196    99   152   414   490   199    60    23
W  18.3   16     .     .     .     .     .    34    29    25    43    91    84    28     8     5
W  16.1   32     .     .     .     .     .    19    16     7    14    38    36    15     3
W  19.3   64     .     .     .     .     .     6    11     .     9    11    19     6     2     1
W  20.4  128     .     .     .     .     .     1     3     .     .     2     4     .     1
---- top files ----
   MB/s                  IP  PORT filename
W  0.01MB/s  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
W  0.02MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.57MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
W  0.70MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/redo3.log
W  3.93MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/10.2.0.4/db_1/dbs/soe.dbf
-
R  0.01MB/s  172.16.100.102 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
R  0.01MB/s  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
R  0.02MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.05MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
R  7.84MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/10.2.0.4/db_1/dbs/soe.dbf
IOPs         313

Sections

The first line is the date. The second line is TCP MB per second in, out and retransmitted. The last value is “ip discards”.

The three parts are all related and form a drill down, starting with coarse grained data at the top and moving to finer grained data at the bottom.

  • averages – default
  • histograms [hist]
  • histograms by size reads [histsz] writes [histszw] for NFS

The first section is a quick overview.

The second section breaks out the latency into a histogram so one can get an indication of the amount of I/O coming from memory (i.e. the counts in the microsecond ranges) as well as how far out the outliers are. (Are the outliers on the VDBs matching up to the outliers seen on the ZFS appliance?)

The third section differentiates between the latency of single block random reads (typically the 8K size) and the latency of multi-block sequential reads (32K and higher). The differentiation is important when comparing to Oracle stats, which are grouped into single block random reads (called “db file sequential read”) and sequential multi-block reads (called “db file scattered read”).
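
(For reference, the matching figures on the Oracle side come from v$system_event; a minimal sketch is below. Remember, as noted earlier, that the Oracle-side numbers can include reads satisfied from a host file system cache unless direct I/O is in use.)

select
        event,
        total_waits,
        round(time_waited_micro / 1000 / nullif(total_waits, 0), 2)    avg_ms
from
        v$system_event
where
        event in ('db file sequential read', 'db file scattered read')
;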

The final section, top files read and written [topfile], is sort of a sanity check: there are periods where there is supposed to be little to no NFS I/O and yet there is some, so the top file section tells which file and which host the NFS I/O is coming from.

The last line after all the sections is total IOPs for reads plus writes. (Note these IOPs could get converted to higher values at the storage layer if using RAID5 which will cause each write to be two reads plus two writes.)

The first section, shows up by default. The other sections require command line arguments.

To see just the first section, which is the default, run ioh.sh without any arguments:

sudo ./ioh.sh

To show non-default sections, add them to the command line

sudo ./ioh.sh -d "hist histsz histszw topfile"

A shortcut for all sections is “all”

sudo ./ioh.sh  -d all

Collecting in the background

nohup sudo ./ioh.sh -c 60 -t 86400 &

This runs the collection for 1 day (86400 seconds), collecting every 60 seconds, and puts the raw output into the default file name “ioh.out”. The default output file name can be changed with the “-o filename” option.

1. Averages:

This displays I/O, ZFS and NFS data for both reads and writes. The data is grouped to help correlate these different layers easily. The first line is the date in epoch format.

columns

  • MB/s – MB transferred a second
  • avg_ms – average operation time
  • avg_sz_kb – average operation size in kb
  • count – number of operations

example

.
             |      MB/s|     mx_ms| avg_sz_kb|     count
 ------------|----------|----------|----------|--------------------
 R |      io:|  through | average  |  average |      count of operations
 R |     zfs:|  put     |latency   | I/O size | 
 R |     nfs:|          |millisec  | in KB    |   
 - 
 W |      io:|          |          |          |
 W | zfssync:|          |          |          |                                         
 W |     zfs:|          |          |          |                                         
 W |     nfs:|          |          |          |                                         
 W |nfssyncD:|          |          |          |                                         
 W |nfssyncF:|          |          |          |                                         

For writes

  • zfssync – these are synchronous writes. These should mainly be Oracle redo writes.
  • nfs – unstable storage writes
  • nfssyncD – data sync writes
  • nfssyncF – file sync writes

DTrace probes used

  • io:::start/done check for read or write
  • nfs:::op-read-start/op-read-done , nfs:::op-write-start/op-write-done
  • zfs_read:entry/return, zfs_write:entry/return

2. Histograms

latency distribution for io, zfs and nfs, for both reads and writes. These distributions are not normalized by time, i.e. if ioh.d outputs once a second then these counts will be equal to the counts in the first section. If ioh.d outputs every 10 seconds, then these values will be 10x higher.

  3. Histograms by size for reads and writes

The first column is the average latency for the size of I/O on that line. The second column is the size. Each size bucket includes this size and every size lower, down to the previous bucket. The goal here is to show the sizes of I/Os and the different latency for different sizes. For an Oracle database with an 8K block size, 8K reads will tend to be random whereas larger read sizes will typically be multiblock requests and represent sequential reads. It’s common to see the 8K reads running slower than the larger reads.

4. Top files

This shows the top 5 files for reads and writes. The first column is MB/s, then R or W, then IP, then port, then filename.

Examples and Usage

Idle system

The first thing to look at is the MB/s in and out, which answers:

  • “how busy is the system?”
  • “is NFS throughput approaching the limits of the NIC?”

In the following example, there is less than 50KB/s of total NFS throughput (in plus out), thus the system isn’t doing much, and there must be no database activity other than the regular maintenance processes which are always running on a database. To confirm this, one can look at the top files at the bottom and see that the only activity is on the control files, which are read and written as part of database system maintenance. Otherwise there is no activity to speak of, so there is no reason to look at I/O latency in this case. Additionally, the majority of what little I/O there is comes in 16K sizes, which is typical of control file activity, whereas default database data block activity is in 8K sizes. Most read I/O is coming from the ZFS appliance cache as it is taking around 64 microseconds.

date: 1335282646 , 24/3/2012 15:50:46
TCP  out:  0.016 MB/s, in:  0.030 MB/s, retrans:        MB/s  ip discards:
----------------
            |       MB/s|    avg_ms| avg_sz_kb|     count
------------|-----------|----------|----------|--------------------
R |      io:|           |          |          |        .
R |     zfs:|     0.016 |     0.01 |    1.298 |       13
R |     nfs:|     0.016 |     0.10 |   16.000 |        1
- 
W |      io:|     0.365 |     4.59 |    9.590 |       39
W | zfssync:|     0.031 |    14.49 |   16.000 |        2
W |     zfs:|     0.001 |     0.07 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     0.003 |          |          |        .
W |nfssyncF:|     0.028 |    14.33 |   14.400 |        2
---- histograms  -------
    area r_w   32u   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .5s    .5s+
R        io      .
R       zfs     60     5
R       nfs      .     .     5
-
W        io      .     .     .    20    43    60    11    11     8    28    17     1
W       zfs      2     8     5     2
W   zfssync      .     .     .     .     .     .     2     .     2     5     1     1
W       nfs      .     .     .     .     .     .     2     .     2     5     1     1
--- NFS latency by size ---------
    ms   size_kb
R   0.1   16     .     .     5
-
W          8     .     .     .     .     .     .     .     .     .     1     .     1
W  16.0   16     .     .     .     .     .     .     2     .     2     4     1
---- top files ----
   MB/s                  IP  PORT filename
W  0.00MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.01MB/s  172.16.100.102 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s  172.16.103.133 59394 /domain0/group0/vdb13/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s   172.16.100.69 39682 /domain0/group0/vdb14/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
-
R  0.00MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.01MB/s  172.16.100.102 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
R  0.01MB/s  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
IOPs          39

Active system

Below is an example of an active system. Looking at TCP bytes in and out, there is a fair bit of activity: 3MB/s out and 2MB/s in. These rates are a long way from saturating 1GbE, but there is activity going on.

Reads: all reads are coming out of the cache. How do we know? For one thing, the average latency is 0.07 ms, or 70 microseconds. Does this 70us include slower reads that might be off disk? Looking at the histogram, one can see that the slowest ZFS I/O is under 100us, and looking just above at the I/O histogram there are no I/Os at all, thus all the I/O is coming from cache.

Writes: writes are pretty slow. Oracle redo writes on good systems are typically 3ms or less for small redo. Unfortunately most of the I/O is coming from datafile writes so it’s difficult to tell what the redo write times are (maybe worth enhancing ioh.d to show average latency by file). Typically the redo is written with “nfssyncD” writes while datafile writes are simply unstable storage “nfs” writes that get synced later. This particular database is using the Oracle parameter filesystemio_options=setall, which implements direct I/O. Direct I/O can work without sync writes but the implementation depends on the OS; this OS implementation, OpenSolaris, causes all direct I/O writes to be sync writes.
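
(To check what a database is requesting, the parameter is visible from SQL*Plus; a minimal sketch – note that changing it requires an spfile change and an instance restart:)

show parameter filesystemio_options

rem     to change it (restart required):
rem     alter system set filesystemio_options = setall scope = spfile;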

date: 1335284387 , 24/3/2012 16:19:47
TCP out:  3.469 MB/s, in:  2.185 MB/s, retrans:        MB/s  ip discards:
----------------
            |       MB/s|    avg_ms| avg_sz_kb|     count
------------|-----------|----------|----------|--------------------
R |      io:|           |          |          |        .
R |     zfs:|     3.387 |     0.03 |    7.793 |      445
R |     nfs:|     3.384 |     0.07 |    8.022 |      432
- 
W |      io:|     4.821 |    12.08 |   24.198 |      204
W | zfssync:|     1.935 |    38.50 |   11.385 |      174
W |     zfs:|     0.001 |     0.06 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     1.906 |    39.06 |   11.416 |      171
W |nfssyncF:|     0.028 |    14.59 |   14.400 |        2
---- histograms  -------
    area r_w   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .3s+
R        io      .
R       zfs   2185    34     5     .     1
R       nfs    903  1201    47     8     1
-
W        io      .     .    19   142   143    46    42   108   240   212    57    12     1
W       zfs     13     3     1
W   zfssync      .     .     .     .    10     6     .    21    60   384   287    86    16
W       nfs      .     .     .     .    10     5     .    21    60   384   287    86    16
--- NFS latency by size ---------
    ms   size_kb
R   0.1    8   900  1199    47     7     1
R   0.1   16     3     2     .     1
-
W  17.7    4     .     .     .     .     3     1     .     2     5     3     3
W  41.1    8     .     .     .     .     3     .     .    13    35   292   231    76    13
W  34.0   16     .     .     .     .     3     3     .     4    13    61    30     8     2
W  39.0   32     .     .     .     .     .     1     .     .     2    16    14     2     1
W  28.3   64     .     .     .     .     1     .     .     .     2     9     8
W  26.2  128     .     .     .     .     .     .     .     2     3     2     1
W        256     .     .     .     .     .     .     .     .     .     1
---- top files ----
   MB/s             IP  PORT filename
R  0.01   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.01  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
R  0.02   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/system.dbf
R  0.02   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
R  3.33   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/product/dbs/soe.dbf 
-
W  0.01   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.01  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
W  0.15   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
W  0.30   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/redo1.log
W  1.46   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/product/dbs/soe.dbf 
IOPs         204

ZFS read layer problem

          |      MB/s|    avg_ms|  avg_sz_kb
----------|----------|----------|-----------
R |   io :|    88.480|      4.60|     17.648 
R |  zfs :|    19.740|      8.51|     12.689 
R |  nfs :|    16.562|     22.67|     30.394 

In this case the ZFS I/O at 19MB/s is higher than NFS at 16MB/s. Now that could be because something is accessing the file system locally on the ZFS appliance, or because ZFS is doing read-ahead, so there are possible explanations, but it’s interesting. Second, subsystem I/O at 88MB/s is much greater than ZFS I/O at 19MB/s. Again that is notable. It could be because there is a scrub going on (to check for a scrub, run “zpool status”; to turn off the scrub run “zpool scrub -s domain0”, though the scrub has to be run at some point). Both are interesting observations.

Now the more interesting parts. The NFS response time 22ms is almost 3x the average ZFS response time 8ms. On the other hand the average size of NFS I/O is 2.5x the average ZFS I/O size so that might be understandable. The hard part to understand is that the ZFS latency 8ms is twice the latency of subsystem I/O at 4ms yet the average size of the I/O sub-system reads is bigger than the average ZFS read. This doesn’t make any sense.

In this case, to home in on the data a bit, it would be worth turning off the scrub if it was running and seeing what the stats are, to eliminate a factor that could be muddying the waters.

But in this case, even without a scrub going, the ZFS latency was 2-3x slower than the I/O subsystem latency.

It turned out ZFS wasn’t caching and was spending a lot of time trying to keep the ARC clean.

ZFS write layer problem

           |       MB/s|    avg_ms| avg_sz_kb|     count
-----------|-----------|----------|----------|--------------------
W |     io:|     10.921|     23.26|    32.562|       380
W |    zfs:|    127.001|     37.95|     0.199|      8141
W |    nfs:|     0.000 |     0.00 |    0.000 |        0

NFS is 0 MB/s because this traffic was coming in over HTTP. The current version of ioh would show the TCP MB/s. This version also mixed up ZFS sync and non-sync writes into one bucket, but most of the ZFS writes have to be non-sync because the write rate is 127MB/s whereas the I/O subsystem writes are only 10MB/s; thus at least 117MB/s is not sync, and if they are not sync they are just memory writes and should be blindingly fast – but they aren’t. The average latency for the ZFS writes is 37ms. Even more shockingly, the average size is only 0.199K, whereas the I/O subsystem writes 32K in 23ms. The case here was that, because of disk errors, the ZFS layer was self-throttling way too much. This was a bug.