Oakies Blog Aggregator

Oracle Enterprise Manager 12c Command Line Interface is Available!

The new Oracle Enterprise Manager 12c Command Line Interface book is available from a number of locations, including Amazon and directly from Apress.  If you are an EM12c fanatic, or just learning and want to know more, consider the new book, which will show you why the command line returns the DBA to the golden age, empowering you to script Enterprise Manager tasks and enable them at a global level!







Filtering PeopleTools SQL from Performance Monitor Traces

I have been doing some online performance tuning on a PeopleSoft Financials system using PeopleSoft Performance Monitor (PPM).  End-users have collected verbose PPM traces. Usually, when I use PPM in a production system, all the components are fully cached by the normal activity of the users (except when the application server caches have recently been cleared).  However, when working in a user test environment it is common to find that the components are not fully cached. This presents two problems.

  • The application servers spend quite a lot of time executing queries on the PeopleTools tables to load the components, pages and PeopleCode into their caches. We can see in the screenshot of the component trace that there is a warning message that component objects are not fully cached, and that these cache misses skew timings.
  • In verbose mode, the PPM traces collect a lot of additional transactions capturing executions and fetches against PeopleTools tables. The PPM analytic components cannot always manage the resultant volume of transactions.
    Figure 1. Component trace as collected by PPM

    If I go further down the same page and look in the SQL Summary, I can see SQL operations against PeopleTools tables (they are easily identifiable in that they generally do not have an underscore in the third character). Not only are 5 of the top 8 SQL operations related to PeopleTools tables, but they also account for over 13,000 executions, which means there are at least 13,000 rows of additional data to be read from PSPMTRANSHIST.

    Figure 2. SQL Summary of PPM trace with PeopleTools SQL

    When I open the longest running server round trip (this is also referred to as a Performance Monitoring Unit or PMU), I can only load 1001 rows before I get a message warning that the maximum row limit has been reached. The duration summary and the number of executions and fetches cannot be calculated and hence 0 is displayed.

    Figure 3: Details of longest PMU with PeopleTools SQL

    Another consequence of the PeopleTools data is that it can take a long time to open the PMU tree. There is no screenshot of the PMU tree here because in this case I had so much data that I couldn't open it before the transaction timed out!

    Solution 

    My solution to this problem is to delete the transactions that relate to PeopleTools SQL, and to correct the durations and the numbers of executions and fetches held in summary transactions. The rationale is that these transactions would not normally occur in significant quantities in a real production system, and there is not much I can do about them when they do.
    The first step is to clone the trace. I could work on the trace directly, but I want to preserve the original data.
    PPM transactions are held in the table PSPMTRANSHIST. They have a unique identifier PM_INSTANCE_ID. A single server round trip, also called a Performance Monitoring Unit (PMU), will consist of many transactions. They can be shown as a tree and each transaction has another field PM_PARENT_INST_ID which holds the instance of the parent. This links the data together and we can use hierarchical queries in Oracle SQL to walk the tree. Another field PM_TOP_INST_ID identifies the root transaction in the tree.
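    For example, here is a minimal sketch of walking one PMU tree with a hierarchical query; the bind variable :top_inst stands in for a real PM_TOP_INST_ID value:

    SELECT LEVEL, h.pm_instance_id, h.pm_parent_inst_id, h.pm_trans_defn_id, h.pm_trans_duration
    FROM   pspmtranshist h
    START WITH h.pm_instance_id = :top_inst                 /*root transaction of the PMU*/
    CONNECT BY PRIOR h.pm_instance_id = h.pm_parent_inst_id /*walk down from parent to child*/
    ORDER SIBLINGS BY h.pm_instance_id;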
    Cloning a PPM trace is simply a matter of inserting data into PSPMTRANSHIST. However, when I clone a PPM trace I have to make sure that the instance numbers are distinct but still link correctly. In my system I can take a very simple approach. All the instance numbers actually collected by PPM are greater than 1E16. So I will simply use the modulus function to consistently alter the instances to be different. This approach may break down in future, but it will do for now.
    On an Oracle database, PL/SQL is a simple and effective way to write procedural processes.  I have written two anonymous blocks of code.
    Note that the cloned trace will be purged from PPM like any other data by the delivered PPM archive process.

    REM xPT.sql
    BEGIN --duplicate PPM traces
      FOR i IN (
        SELECT h.*
        FROM   pspmtranshist h
        WHERE  pm_perf_trace != ' ' /*rows must have a trace name*/
    --  AND    pm_perf_trace = '9b. XXXXXXXXXX' /*I could specify a specific trace by name*/
        AND    pm_instance_id > 1E16 /*only look at instance > 1e16 so I do not clone cloned traces*/
      ) LOOP
        INSERT INTO pspmtranshist
              (PM_INSTANCE_ID, PM_TRANS_DEFN_SET, PM_TRANS_DEFN_ID, PM_AGENTID, PM_TRANS_STATUS,
               OPRID, PM_PERF_TRACE, PM_CONTEXT_VALUE1, PM_CONTEXT_VALUE2, PM_CONTEXT_VALUE3,
               PM_CONTEXTID_1, PM_CONTEXTID_2, PM_CONTEXTID_3, PM_PROCESS_ID, PM_AGENT_STRT_DTTM,
               PM_MON_STRT_DTTM, PM_TRANS_DURATION, PM_PARENT_INST_ID, PM_TOP_INST_ID, PM_METRIC_VALUE1,
               PM_METRIC_VALUE2, PM_METRIC_VALUE3, PM_METRIC_VALUE4, PM_METRIC_VALUE5, PM_METRIC_VALUE6,
               PM_METRIC_VALUE7, PM_ADDTNL_DESCR)
        VALUES
              (MOD(i.PM_INSTANCE_ID,1E16) /*apply modulus to instance number*/
              ,i.PM_TRANS_DEFN_SET, i.PM_TRANS_DEFN_ID, i.PM_AGENTID, i.PM_TRANS_STATUS,
               i.OPRID,
               SUBSTR('xPT'||i.PM_PERF_TRACE,1,30) /*adjust trace name*/,
               i.PM_CONTEXT_VALUE1, i.PM_CONTEXT_VALUE2, i.PM_CONTEXT_VALUE3,
               i.PM_CONTEXTID_1, i.PM_CONTEXTID_2, i.PM_CONTEXTID_3, i.PM_PROCESS_ID, i.PM_AGENT_STRT_DTTM,
               i.PM_MON_STRT_DTTM, i.PM_TRANS_DURATION,
               MOD(i.PM_PARENT_INST_ID,1E16), MOD(i.PM_TOP_INST_ID,1E16), /*apply modulus to parent and top instance number*/
               i.PM_METRIC_VALUE1, i.PM_METRIC_VALUE2, i.PM_METRIC_VALUE3, i.PM_METRIC_VALUE4, i.PM_METRIC_VALUE5,
               i.PM_METRIC_VALUE6, i.PM_METRIC_VALUE7, i.PM_ADDTNL_DESCR);
      END LOOP;
      COMMIT;
    END;
    /

    Now I will work on the cloned trace. I want to remove certain transactions:

    • PeopleTools SQL. Metric value 7 reports the SQL operation and SQL table name. So if the first word is SELECT and the second word is a PeopleTools table name, then it is a PeopleTools SQL operation. A list of PeopleTools tables can be obtained from the object security table PSOBJGROUP (see the query sketch after this list).
    • Implicit Commit transactions. This is easy - it is just transaction type 425. 
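
    The list of PeopleTools record names can be obtained with a query like this; the same subquery appears in the script below:

    SELECT o.entname recname
    FROM   psobjgroup o
    WHERE  o.objgroupid = 'PEOPLETOOLS'
    AND    o.enttype = 'R' /*records only*/;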

    Having deleted the PeopleTools transactions, I must also

    • Correct the transaction duration for any parents of the transaction. I work up the hierarchy of transactions and deduct the duration of the transaction that I am deleting from all of the parents.
    • Transaction types 400, 427 and 428 all record PeopleTools SQL time (metric 66). When I come to such a transaction, I also deduct the duration of the deleted transaction from the PeopleTools SQL time metric in any parent transaction.
    • Delete any children of the transactions that I delete. 
    • I must also count each PeopleTools SQL Execution transaction (type 408) and each PeopleTools SQL Fetch transaction (type 414) that I delete. These counts are also deducted from the summaries on the parent transaction 400. 

    The summaries in transaction 400 are used on the 'Round Trip Details' components, and if they are not adjusted you can get misleading results. Without the adjustments, I have encountered PMUs where more than 100% of the total duration is spent in SQL - which is obviously impossible.
    Although this technique of first cloning the whole trace and then deleting the PeopleTools operations can be quite slow, it is not something that you are going to do very often. 

    REM xPT.sql
    REM (c)Go-Faster Consultancy Ltd. 2014
    set serveroutput on echo on
    DECLARE
      l_pm_instance_id_m4 INTEGER;
      l_fetch_count       INTEGER;
      l_exec_count        INTEGER;
    BEGIN /*now remove PeopleTools SQL transaction and any children and adjust trans durations*/
      FOR i IN (
        WITH x AS ( /*returns PeopleTools tables as defined in Object security*/
          SELECT o.entname recname
          FROM   psobjgroup o
          WHERE  o.objgroupid = 'PEOPLETOOLS'
          AND    o.enttype = 'R'
        )
        SELECT h.pm_instance_id, h.pm_parent_inst_id, h.pm_trans_duration, h.pm_trans_defn_id
        FROM   pspmtranshist h
          LEFT OUTER JOIN x
            ON  h.pm_metric_value7 LIKE 'SELECT '||x.recname||'%'
            AND x.recname = upper(regexp_substr(pm_metric_value7,'[^ ,]+',8,1)) /*first word after select*/
        WHERE  pm_perf_trace LIKE 'xPT%' /*restrict to cloned traces*/
    --  AND    pm_perf_trace = 'xPT9b. XXXXXXXXXX' /*work on a specific trace*/
        AND    pm_instance_id < 1E16 /*restrict to cloned traces*/
        AND   (x.recname IS NOT NULL
           OR  h.pm_trans_defn_id IN (425 /*Implicit Commit*/))
        ORDER BY pm_instance_id DESC
      ) LOOP
        l_pm_instance_id_m4 := TO_NUMBER(NULL);

        IF i.pm_parent_inst_id > 0 AND i.pm_trans_duration > 0 THEN
          FOR j IN (
            SELECT h.pm_instance_id, h.pm_parent_inst_id, h.pm_top_inst_id, h.pm_trans_defn_id
            ,      d.pm_metricid_3, d.pm_metricid_4
            FROM   pspmtranshist h
              INNER JOIN pspmtransdefn d
                ON  d.pm_trans_defn_set = h.pm_trans_defn_set
                AND d.pm_trans_defn_id  = h.pm_trans_defn_id
            START WITH h.pm_instance_id = i.pm_parent_inst_id
            CONNECT BY PRIOR h.pm_parent_inst_id = h.pm_instance_id
          ) LOOP
            /*decrement parent transaction times*/
            IF j.pm_metricid_4 = 66 /*PeopleTools SQL Time (ms)*/ THEN --decrement metric 4 on transaction 400
              --dbms_output.put_line('ID:'||i.pm_instance_id||' Type:'||i.pm_trans_defn_id||' decrement metric_value4 by '||i.pm_trans_duration);
              UPDATE pspmtranshist
              SET    pm_metric_value4 = pm_metric_value4 - i.pm_trans_duration
              WHERE  pm_instance_id = j.pm_instance_id
              AND    pm_trans_defn_id = j.pm_trans_defn_id
              AND    pm_metric_value4 >= i.pm_trans_duration
              RETURNING pm_instance_id INTO l_pm_instance_id_m4;
            ELSIF j.pm_metricid_3 = 66 /*PeopleTools SQL Time (ms)*/ THEN --SQL time on serialisation
              --dbms_output.put_line('ID:'||i.pm_instance_id||' Type:'||i.pm_trans_defn_id||' decrement metric_value3 by '||i.pm_trans_duration);
              UPDATE pspmtranshist
              SET    pm_metric_value3 = pm_metric_value3 - i.pm_trans_duration
              WHERE  pm_instance_id = j.pm_instance_id
              AND    pm_trans_defn_id = j.pm_trans_defn_id
              AND    pm_metric_value3 >= i.pm_trans_duration;
            END IF;

            UPDATE pspmtranshist
            SET    pm_trans_duration = pm_trans_duration - i.pm_trans_duration
            WHERE  pm_instance_id = j.pm_instance_id
            AND    pm_trans_duration >= i.pm_trans_duration;
          END LOOP;
        END IF;

        l_fetch_count := 0;
        l_exec_count  := 0;
        FOR j IN ( /*identify transaction to be deleted and any children*/
          SELECT pm_instance_id, pm_parent_inst_id, pm_top_inst_id, pm_trans_defn_id, pm_metric_value3
          FROM   pspmtranshist
          START WITH pm_instance_id = i.pm_instance_id
          CONNECT BY PRIOR pm_instance_id = pm_parent_inst_id
        ) LOOP
          IF j.pm_trans_defn_id = 408 THEN /*if PeopleTools SQL*/
            l_exec_count := l_exec_count + 1;
          ELSIF j.pm_trans_defn_id = 414 THEN /*if PeopleTools SQL Fetch*/
            l_fetch_count := l_fetch_count + j.pm_metric_value3;
          END IF;
          DELETE FROM pspmtranshist h /*delete tools transaction*/
          WHERE  h.pm_instance_id = j.pm_instance_id;
        END LOOP;

        IF l_pm_instance_id_m4 > 0 THEN
          --dbms_output.put_line('ID:'||l_pm_instance_id_m4||' Decrement '||l_exec_count||' executions, '||l_fetch_count||' fetches');
          UPDATE pspmtranshist
          SET    pm_metric_value5 = pm_metric_value5 - l_exec_count
          ,      pm_metric_value6 = pm_metric_value6 - l_fetch_count
          WHERE  pm_instance_id = l_pm_instance_id_m4;
          l_fetch_count := 0;
          l_exec_count  := 0;
        END IF;

      END LOOP;
    END;
    /

    Now, I have a second PPM trace that I can open in the analytic component.

    Figure 4: Original and Cloned PPM traces

    When I open the cloned trace, both timings in the duration summary have reduced as have the number of executions and fetches.  The durations of the individual server round trips have also reduced.

    Figure 5: Component Trace without PeopleTools transactions

    All of the PeopleTools SQL operations have disappeared from the SQL summary.

    Figure 6: SQL Summary of PPM trace after removing PeopleTools SQL transactions

    The SQL summary now only has 125 rows of data.

    Figure 7: SQL Summary of PMU without PeopleTools SQL

    Now, the PPM tree component opens quickly and without error.

    Figure 8: PMU Tree after removing PeopleTools SQL

    There may still be more transactions in a PMU than I can show in a screenshot, but I can now find the statement that took the most time quite quickly.

    Figure 9: Long SQL transaction further down same PMU tree

    Conclusions 

    I think that it is reasonable and useful to remove PeopleTools SQL operations from a PPM trace.
    In normal production operation, components will mostly be cached, and this approach renders traces collected in non-production environments both usable in the PPM analytic components and more realistic for performance tuning. However, when deleting transactions from a PMU, it is essential that the summary data held in other transactions in the same PMU is also corrected, so that the metrics remain consistent.

    Interesting Oracle Syntax Error

    As shared by a well-known Oracle and Big Data performance geek!

    SQL> ALTER SYSTEM SET inmemory_size = 5T SCOPE=spfile;
    ALTER SYSTEM SET inmemory_size = 5T SCOPE=spfile
    *
    ERROR at line 1:
    ORA-32005: error while parsing size specification [5T]
    
    
    SQL> ALTER SYSTEM SET inmemory_size = 5120G SCOPE=spfile;
    
    System altered.
    

    :)

    Upgrades

    One of the worst problems with upgrades is that things sometimes stop working. A particular nuisance is the execution plan that suddenly stops appearing, to be replaced by an alternative plan that is much less efficient.

    Apart from the nuisance of the time spent trying to force the old plan to re-appear, plus the time spent working out a way of rewriting the query when you finally decide the old plan simply isn’t going to re-appear, there’s also the worry about WHY the old plan won’t appear. Is it some sort of bug, is it that some new optimizer feature has disabled some older optimizer feature, or is it that someone in the optimizer group realised that the old plan was capable of producing the wrong results in some circumstances … it’s that last possibility that I find most worrying.

    Here’s an example that appeared recently on OTN that’s still got me wondering about the possibility of wrong results (in the general case). We start with a couple of tables, a view, and a pipelined function. This example is a simple model of the problem that showed up on OTN; it’s based on generated data so that anyone who wants to can play around with it to see if they can bypass the problem without making any significant changes to the shape of the code:

    
    create table t1
    as
    with generator as (
    	select	--+ materialize
    		rownum id
    	from dual
    	connect by
    		level <= 1e4
    )
    select
    	rownum			id,
    	rownum			n1,
    	mod(rownum,100)		n_100,
    	rpad('x',100)		padding
    from
    	generator	v1
    ;
    
    create table t2
    as
    with generator as (
    	select	--+ materialize
    		rownum id
    	from dual
    	connect by
    		level <= 1e4
    )
    select
    	rownum			id,
    	rownum			n1,
    	mod(rownum,100)		n_100,
    	rpad('x',100)		padding
    from
    	generator	v1
    ;
    
    alter table t2 add constraint t2_pk primary key(id);
    
    begin
    	dbms_stats.gather_table_stats(
    		ownname		 => user,
    		tabname		 =>'T1',
    		method_opt	 => 'for all columns size 1'
    	);
    
    	dbms_stats.gather_table_stats(
    		ownname		 => user,
    		tabname		 =>'T2',
    		method_opt	 => 'for all columns size 1'
    	);
    
    end;
    /
    
    create or replace type myScalarType as object (
            x int,
            y varchar2(15),
            d date
    )
    /
    
    create or replace type myArrayType as table of myScalarType
    /
    
    create or replace function t_fun1(i_in number)
    return myArrayType
    pipelined
    as
    begin
    	pipe row (myscalartype(i_in,     lpad(i_in,15),     trunc(sysdate) + i_in    ));
    	pipe row (myscalartype(i_in + 1, lpad(i_in + 1,15), trunc(sysdate) + i_in + 1));
    	return;
    end;
    /
    
    create or replace view v1
    as
    select
    	--+ leading(t2 x) index(t2)
    	x.x, x.y, x.d,
    	t2.id, t2.n1
    from
    	t2,
    	table(t_fun1(t2.n_100)) x
    where
    	mod(t2.n1,3) = 1
    union all
    select
    	--+ leading(t2 x) index(t2)
    	x.x, x.y, x.d,
    	t2.id, t2.n1
    from
    	t2,
    	table(t_fun1(t2.n_100)) x
    where
    	mod(t2.n1,3) = 2
    ;
    
    

    A key part of the problem is the UNION ALL view, where each subquery holds a join to a pipeline function. We’re about to write a query that joins to this view, and wants to push a join predicate into the view. Here’s the SQL:

    
    select
    	/*+ leading(t1 v1) use_nl(v1) */
    	v1.x, v1.y, v1.d,
    	v1.n1,
    	t1.n1
    from
    	t1,
    	v1
    where
    	t1.n_100 = 0
    and	v1.id = t1.n1
    ;
    
    

    You’ll notice that the join v1.id = t1.n1 could (in principle) be pushed inside the view to become t2.id = t1.n1 in the two branches of the UNION ALL; this would make it possible for the nested loop that I’ve hinted between t1 and v1 to operate efficiently – and in 11.1.0.7 this is exactly what happens:

    
    ------------------------------------------------------------------------------------------------
    | Id  | Operation                             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                      |        | 16336 |   733K|   123   (1)| 00:00:01 |
    |   1 |  NESTED LOOPS                         |        | 16336 |   733K|   123   (1)| 00:00:01 |
    |*  2 |   TABLE ACCESS FULL                   | T1     |   100 |   700 |    23   (5)| 00:00:01 |
    |   3 |   VIEW                                | V1     |   163 |  6357 |     1   (0)| 00:00:01 |
    |   4 |    UNION-ALL PARTITION                |        |       |       |            |          |
    |   5 |     NESTED LOOPS                      |        |  8168 |   103K|    16   (0)| 00:00:01 |
    |*  6 |      TABLE ACCESS BY INDEX ROWID      | T2     |     1 |    11 |     2   (0)| 00:00:01 |
    |*  7 |       INDEX UNIQUE SCAN               | T2_PK  |     1 |       |     1   (0)| 00:00:01 |
    |   8 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |       |       |            |          |
    |   9 |     NESTED LOOPS                      |        |  8168 |   103K|    16   (0)| 00:00:01 |
    |* 10 |      TABLE ACCESS BY INDEX ROWID      | T2     |     1 |    11 |     2   (0)| 00:00:01 |
    |* 11 |       INDEX UNIQUE SCAN               | T2_PK  |     1 |       |     1   (0)| 00:00:01 |
    |  12 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |       |       |            |          |
    ------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - filter("T1"."N_100"=0)
       6 - filter(MOD("T2"."N1",3)=1)
       7 - access("T2"."ID"="T1"."N1")
      10 - filter(MOD("T2"."N1",3)=2)
      11 - access("T2"."ID"="T1"."N1")
    
    

    For each row returned by the tablescan at line 2 we call the view operator at line 3 to generate a rowsource, but we can see in the predicate sections for lines 7 and 11 that the join value has been pushed inside the view, allowing us to access t2 through its primary key index. Depending on the data definitions, constraints, view definition, and version of Oracle, you might see the UNION ALL operator displaying the PARTITION option or the PUSHED PREDICATE option in cases of this type.

    So now we upgrade to 11.2.0.4 (probably any 11.2.x.x version) and get the following plan:

    
    ------------------------------------------------------------------------------------------------
    | Id  | Operation                             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                      |        |  1633K|    99M|   296K  (4)| 00:24:43 |
    |   1 |  NESTED LOOPS                         |        |  1633K|    99M|   296K  (4)| 00:24:43 |
    |*  2 |   TABLE ACCESS FULL                   | T1     |   100 |   700 |    23   (5)| 00:00:01 |
    |*  3 |   VIEW                                | V1     | 16336 |   909K|  2966   (4)| 00:00:15 |
    |   4 |    UNION-ALL                          |        |       |       |            |          |
    |   5 |     NESTED LOOPS                      |        |   816K|    10M|  1483   (4)| 00:00:08 |
    |*  6 |      TABLE ACCESS BY INDEX ROWID      | T2     |   100 |  1100 |   187   (2)| 00:00:01 |
    |   7 |       INDEX FULL SCAN                 | T2_PK  | 10000 |       |    21   (0)| 00:00:01 |
    |   8 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |  8168 | 16336 |    13   (0)| 00:00:01 |
    |   9 |     NESTED LOOPS                      |        |   816K|    10M|  1483   (4)| 00:00:08 |
    |* 10 |      TABLE ACCESS BY INDEX ROWID      | T2     |   100 |  1100 |   187   (2)| 00:00:01 |
    |  11 |       INDEX FULL SCAN                 | T2_PK  | 10000 |       |    21   (0)| 00:00:01 |
    |  12 |      COLLECTION ITERATOR PICKLER FETCH| T_FUN1 |  8168 | 16336 |    13   (0)| 00:00:01 |
    ------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - filter("T1"."N_100"=0)
       3 - filter("V1"."ID"="T1"."N1")
       6 - filter(MOD("T2"."N1",3)=1)
      10 - filter(MOD("T2"."N1",3)=2)
    
    

    In this plan the critical join predicate appears at line 3; the predicate hasn’t been pushed. On the other hand the index() hints in the view have, inevitably, been obeyed (resulting in index full scans), as has the use_nl() hint in the main query – leading to a rather more expensive and time-consuming execution plan.

    The first, quick, debugging step is simply to set the optimizer_features_enable back to 11.1.0.7 – with no effect; the second is to try adding the push_pred() hint to the query – with no effect; the third is to generate the outline section of the execution plans and copy the entire set of hints from the good plan into the bad plan, noting as we do so that the good plan actually uses the hint OLD_PUSH_PRED(@"SEL$1" "V1"@"SEL$1" ("T2"."ID")) – still no effect.

    Since I happen to know a few things about what is likely to appear in the 10053 (optimizer) trace file, my next step would be to flush the shared pool, enable the trace, and then check the trace file (using grep or find depending on whether I was running UNIX or Windows) for the phrase “JPPD bypassed”; this is what I got:

    
    test_ora_9897.trc:OJPPD:     OJPPD bypassed: View contains TABLE expression.
    test_ora_9897.trc:JPPD:     JPPD bypassed: View not on right-side of outer-join.
    test_ora_9897.trc:JPPD:     JPPD bypassed: View not on right-side of outer-join.
    
    

    So 11.1.0.7 had a plan that used the old_push_pred() hint, but 11.2.0.4 explicitly bypassed the option (the rubric near the top of the trace file translates OJPPD to “old-style (non-cost-based) JPPD”, where JPPD translates to “join predicate push-down”). It looks like the plan we got from 11.1.0.7 has been deliberately blocked in 11.2.0.4. So now it’s time to worry whether or not that means I could have been getting wrong results from 11.1.0.7.
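
    For anyone who wants to repeat the exercise, here's a minimal sketch of that sequence (the tracefile_identifier is just an optional convenience to make the file easier to find):

    alter system flush shared_pool;
    alter session set tracefile_identifier = 'jppd_test';
    alter session set events '10053 trace name context forever, level 1';
    -- hard-parse (run) the problem query here, then:
    alter session set events '10053 trace name context off';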

    In my test case, of course, I can bypass the problem by explicitly rewriting the query – but I’ll have to move the join with t1 inside the view for both subqueries; alternatively, given the trivial nature of the pipeline function, I could replace the table() operator with a join to another union all view. In real life such changes are not always so easy to implement.
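
    For illustration, here's one possible shape of that rewrite – a sketch only, with the view hints omitted, and column aliases would need adjusting to whatever the application expects:

    select
    	x.x, x.y, x.d, t2.n1, t1.n1
    from
    	t1, t2, table(t_fun1(t2.n_100)) x
    where
    	t1.n_100 = 0
    and	t2.id = t1.n1
    and	mod(t2.n1,3) = 1
    union all
    select
    	x.x, x.y, x.d, t2.n1, t1.n1
    from
    	t1, t2, table(t_fun1(t2.n_100)) x
    where
    	t1.n_100 = 0
    and	t2.id = t1.n1
    and	mod(t2.n1,3) = 2
    ;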

    Footnote: the restriction is still in place on 12.1.0.2.

    Footnote 2: somewhere I've probably published a short note explaining that one of my standard pre-emptive strikes on an upgrade is to run the following command to extract useful information from the executable: "strings -a oracle | grep bypass": it can be very helpful to have a list of situations in which some query transformation is bypassed.

     

    Digital Learning: LVC customer quotes

    LVC

    Here’s a collection of customer quotes as a follow-up to my last post about the importance of attitude towards Live Virtual Classes (LVCs). They are from courses that I have taught personally this year with an average delivery score of about 96%:

    Oracle Database 12c: Managing Multitenant Architecture

    Oracle Grid Infrastructure 11g: Manage Clusterware and ASM

    Oracle Database 12c: New Features for Administrators

    Exadata Database Machine Administration Workshop

    Oracle 11g: RAC and Grid Infrastructure Administration Accelerated

    “My first experience of an LVC. Pleased to say it was very positive. Introduction and start on Monday morning was smooth. I would recommend to my colleagues.”

    Always important to make a good first impression!

    “The whole LVC package just worked. From the comfort of my own environment with a great instructor makes for happy learning :)”

    And that is exactly what we strive to deliver.

    “Both, host and producer were very professional and guided the students through the course.”

    An LVC producer takes care of all the technical aspects apart from the course itself, like access to the learning platform. The instructor appears as “host” on the learning platform.

    “Instructor professionally answered students’ questions and kept up a positive mood in the community!”

    LVCs can be funny too :-)

    “I appreciate the way how the course was presented. Very well controlled time, organization of presentation, exercises. Interaction with us was great. Always ready to answer a question, give an examples to difficult topic, illustrating topics.”

    So much for the allegedly missing interaction in LVCs.

    “I work few years on RAC databases, my knowledge was not so clear regarding some topic on RAC and Grid after completing this training I’m sure that I will handle our RAC and Grid environment differently and for sure will have positive impact in our production environment. Great thank!”

    You cannot top that with a classroom course either :-)

    “LVC is offering great and flexible way to gain knowledge without travel or hotels etc.”

    “LVCs reduce travel costs and help the students to manage their time on their own, i.e. to join the classes from home and focus on the presented content.”

    Trust me, I didn’t make up the last two although they may sound like your manager talking – or mine, for that matter ;-)

    Tagged: Digital Learning, LVC

    Mobile Front and Center at Oracle Open World 2014

    Wow! Sorry I have not posted anything since Oracle Open World. Lots of great new stuff is coming from Oracle that you should look into, including some pretty cool stuff for mobile development.

    1. Oracle’s Mobile Application Framework (MAF) is more than just a rework of the ADF Mobile package (all existing ADF Mobile apps will be migrated to MAF); MAF includes new and updated components and a complete ecosystem to support it. See OTN for more: http://www.oracle.com/technetwork/developer-tools/maf/overview/index.html
    2. Oracle is sponsoring the Oracle Mobile Framework Developer Challenge to allow developers to show their MAF capabilities: https://blogs.oracle.com/imc/entry/maf12c_developer_challenge
    3. Oracle’s Mobile Suite and Oracle Mobile Security suite provide a comprehensive way to secure web applications:  http://www.oracle.com/technetwork/developer-tools/maf/overview/index.html
    4. Oracle’s User Experience (UX) gets a new UI facelift named “Alta” that goes beyond mere skinning to include a new design philosophy and uses proven patterns in a “mobile first” responsive and whitespace-heavy interface: http://www.oracle.com/webfolder/ux/middleware/alta/index.html
    5. Oracle Enterprise Mobile Platform allows quick and easy development of mobile applications using simple drag-and-drop wizards – no programming required! Resourceful users can build their own applications, and developers can use it to prototype and create templates for more comprehensive MAF applications. http://www.oracle.com/us/technologies/mobile/overview/index.html
    6. Oracle Database 12c provides native REST (http://www.oracle.com/technetwork/developer-tools/rest-data-services/overview/index.html) and JSON (http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246) support to better serve mobile applications; a small example sketch follows this list.
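
    As a taste of the native JSON support in item 6, here's a minimal sketch; the mobile_events table and its columns are invented for illustration:

    create table mobile_events (
      id      number primary key,
      payload varchar2(4000) check (payload is json)  -- native JSON validation
    );

    insert into mobile_events values (1, '{"device":"phone","os":"Android"}');

    select json_value(payload, '$.os') os             -- native JSON query functions
    from   mobile_events
    where  json_exists(payload, '$.device');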

    I hope this article interests you in learning more about Oracle’s mobility efforts. If you’re in Michigan or Reston, please come see my MAF demos!

    EM12c Release 4, Health Overview

    As part of the projects I work on at Oracle, I often ensure that customers who wish to deploy Enterprise Manager (EM12c) to large environments have the correct settings and are tuned to offer the best performance, from the Oracle Management Repository database, through WebLogic, up to the console URLs accessed by users.  This means that these large environments often receive recommendations from our experts that differ from what EM12c comes with “out of the box”.

    For those who aren’t receiving internal Oracle training on what to look for and how to tune the EM12c tiers, there are some new features in release 4 that should be checked out by anyone using EM12c.

    EM12c Health Overview

    The Health Overview is accessible via the Setup menu (right side of the console): Manage Cloud Control, then Health Overview.


    We’ll go over each of these new monitoring tools, but the Health Overview includes valuable information about the health of both the Oracle Management Repository (OMR) and the Oracle Management Service (OMS).

    The overview page breaks down into easy-to-understand sections.  The first covers basic information and workload on the OMS:


    From here you can see all pertinent, high-level information about the OMS/OMR, including OMS information, the number of agents with status counts on availability, and whether a load balancer is used in the EM12c configuration.  Important target status availability is posted, along with how many administrators have been given access to the EM console.

    Below this we can view the backlog graph on the right hand side of the page:


    That right-hand graph is important since, along with the upload rate, you can see if there is a consistent backlog of XML files to be uploaded, which can signal performance trouble.  A loader backlog can delay receipt of critical alerts and information about a target.  If the backlog becomes too extensive, an agent can reach a threshold on how many backlogged files it can handle and stop collecting, which is a significant issue.  If serious backlog issues are noted, it’s important to consider tuning options to address them (like adding a load balancer to assist in managing the workload, or a second OMS).

    Repository Details

    The next section includes connection information, which also shows the service name, the database name and type, the port, and the job system status.  On the right is a graph showing whether any backoff requests have been sent.  These occur when the OMS is busy processing an XML upload and requests that the agent hold off on sending any more files until it has finished.


    Notification Information

    Scanning down from the backoff graph in the Health Overview displays the notification backlog graph.  Knowing how backlogged your time-sensitive notifications are is crucial: when someone asks why a notification wasn’t received in a timely manner, you can quickly assess whether it is an issue with EM12c or whether the problem resides elsewhere.


    Alerts

    The last section in the Health Overview covers incident management.  You can easily see if there are any Metric Collection Errors (correlating these to the backlog data above if necessary) and access related Metric Errors and Related Alerts.


    You can also launch the Incident Manager from the Health Overview console if you wish to get more details about all incidents currently in the queue.  This view really just gives you a very high-level account of the incidents currently open, together with related alerts and metric errors; use the Incident Manager button if you want to see what the alerts are all about.

    We’ll dig into the deeper diagnostic data offered in EM12c, release 4 for the OMR, OMS and Agents in subsequent posts, so until next time!




    The First PASS Summit Bloggers’ Meetup

    We are stoked to announce the first ever PASS Summit Bloggers’ Meetup!

    What: PASS Summit Bloggers’ Meetup 2014
    When: Thursday, November 6th, 5pm – 7pm
    Where: Sports Restaurant & Bar, 140 4th Ave North, Suite 130, Seattle, WA 98109
    How: Please comment with “COUNT ME IN” if coming — we need to know attendance numbers.

    We’re excited to meet old friends, and make new ones in Seattle this year. Pythian will sponsor munchies and drinks. There will be a networking contest with some cool prizes, plus everyone will get their very own PASS Summit Bloggers Meetup shirt! Meetups are a great way to make new friends in the community, so come on out — all bloggers are welcome!

    If you are planning to attend, please comment here with the phrase “COUNT ME IN” so we can get the attendance numbers right and make sure we have enough room, food, and drinks. Please provide your blog URL with your comment — it’s a Bloggers Meetup after all!

    The location is perfect for getting ready for the appreciation event — it’s a few minutes’ walk from the EMP Museum! Snacks and drinks before the big event, and a chance to mingle with fellow bloggers. What could be better?

    Of course, do not forget to blog and tweet about this year’s bloggers meetup using #Summit14 #sqlpass. See you there!

    first_rows(10)

    No, not the 10th posting about first_rows() this week – whatever it may seem like – just an example that happens to use the “calculate costs for fetching the first 10 rows” optimizer strategy and does it badly. I think it’s a bug, but it’s certainly a defect that is a poster case for the inherent risk of using anything other than all_rows optimisation.  Here’s some code to build a couple of sample tables:

    
    begin
    	dbms_stats.set_system_stats('MBRC',16);
    	dbms_stats.set_system_stats('MREADTIM',12);
    	dbms_stats.set_system_stats('SREADTIM',5);
    	dbms_stats.set_system_stats('CPUSPEED',1000);
    end;
    /
    
    create table t1
    as
    with generator as (
    	select	--+ materialize
    		rownum id 
    	from dual 
    	connect by 
    		level <= 1e4
    )
    select
    	rownum					id,
    	trunc(dbms_random.value(1,1000))	n1,
    	lpad(rownum,10,'0')	small_vc,
    	rpad('x',100)		padding
    from
    	generator	v1,
    	generator	v2
    where
    	rownum <= 1e6
    ;
    
    create index t1_n1 on t1(id, n1);
    
    create table t2
    as
    with generator as (
    	select	--+ materialize
    		rownum id 
    	from dual 
    	connect by 
    		level <= 1e4
    )
    select
    	rownum					id,
    	trunc(dbms_random.value(10001,20001))	x1,
    	lpad(rownum,10,'0')	small_vc,
    	rpad('x',100)		padding
    from
    	generator	v1,
    	generator	v2
    where
    	rownum <= 1e6
    ;
    
    create index t2_i1 on t2(x1);
    
    begin
    	dbms_stats.gather_table_stats(
    		ownname		 => user,
    		tabname		 =>'T1',
    		estimate_percent => 100,
    		method_opt	 => 'for all columns size 1'
    	);
    
    	dbms_stats.gather_table_stats(
    		ownname		 => user,
    		tabname		 =>'T2',
    		estimate_percent => 100,
    		method_opt	 => 'for all columns size 1'
    	);
    
    end;
    /
    
    
    create or replace view  v1
    as
    select 
    	id, n1, small_vc, padding
    from	t1 
    where	n1 between 101 and 300
    union all
    select 
    	id, n1, small_vc, padding
    from	t1 
    where	n1 between 501 and 700
    ;
    
    

    The key feature of this demonstration is the UNION ALL view and what the optimizer does with it when we have first_rows_N optimisation – this is a simplified model of a production problem I was shown a couple of years ago, so nothing special, nothing invented. Here’s a query that behaves badly:

    
    select
    	/*+ gather_plan_statistics */
    	v1.small_vc,
    	v1.n1
    from
    	v1,
    	t2
    where
    	t2.id = v1.id
    and	t2.x1 = 15000
    ;
    
    

    I’m going to execute this query in three different ways – as is, using all_rows optimisation; as is, using first_rows_10 optimisation, then using all_rows optimisation but with the necessary hints to make it follow the first_rows_10 execution path. Here are the resulting plans from an instance of 12.1.0.2 (the same thing happens in 11.2.0.4):

    
    first_rows_10 plan
    ----------------------------------------------------------------------------------------------
    | Id  | Operation                            | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
    ----------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                     |       |     1 |    35 |   107   (0)| 00:00:01 |
    |*  1 |  HASH JOIN                           |       |     1 |    35 |   107   (0)| 00:00:01 |
    |   2 |   VIEW                               | V1    |    24 |   600 |     4   (0)| 00:00:01 |
    |   3 |    UNION-ALL                         |       |       |       |            |          |
    |*  4 |     TABLE ACCESS FULL                | T1    |    12 |   240 |     2   (0)| 00:00:01 |
    |*  5 |     TABLE ACCESS FULL                | T1    |    12 |   240 |     2   (0)| 00:00:01 |
    |   6 |   TABLE ACCESS BY INDEX ROWID BATCHED| T2    |   100 |  1000 |   103   (0)| 00:00:01 |
    |*  7 |    INDEX RANGE SCAN                  | T2_I1 |   100 |       |     3   (0)| 00:00:01 |
    ----------------------------------------------------------------------------------------------
    
    all_rows plan
    ------------------------------------------------------------------------------------------------
    | Id  | Operation                              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                       |       |    40 |  1400 |   904   (1)| 00:00:01 |
    |   1 |  NESTED LOOPS                          |       |    40 |  1400 |   904   (1)| 00:00:01 |
    |   2 |   TABLE ACCESS BY INDEX ROWID BATCHED  | T2    |   100 |  1000 |   103   (0)| 00:00:01 |
    |*  3 |    INDEX RANGE SCAN                    | T2_I1 |   100 |       |     3   (0)| 00:00:01 |
    |   4 |   VIEW                                 | V1    |     1 |    25 |     8   (0)| 00:00:01 |
    |   5 |    UNION ALL PUSHED PREDICATE          |       |       |       |            |          |
    |   6 |     TABLE ACCESS BY INDEX ROWID BATCHED| T1    |     1 |    20 |     4   (0)| 00:00:01 |
    |*  7 |      INDEX RANGE SCAN                  | T1_N1 |     1 |       |     3   (0)| 00:00:01 |
    |   8 |     TABLE ACCESS BY INDEX ROWID BATCHED| T1    |     1 |    20 |     4   (0)| 00:00:01 |
    |*  9 |      INDEX RANGE SCAN                  | T1_N1 |     1 |       |     3   (0)| 00:00:01 |
    ------------------------------------------------------------------------------------------------
    
    first_rows_10 plan hinted under all_rows optimisation
    ---------------------------------------------------------------------------------------------- 
    | Id  | Operation                    | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | 
    ---------------------------------------------------------------------------------------------- 
    |   0 | SELECT STATEMENT             |       |   200 |  8600 |       |  6124   (3)| 00:00:01 | 
    |*  1 |  HASH JOIN                   |       |   200 |  8600 |    17M|  6124   (3)| 00:00:01 |
    |   2 |   VIEW                       | V1    |   402K|    12M|       |  5464   (3)| 00:00:01 | 
    |   3 |    UNION-ALL                 |       |       |       |       |            |          | 
    |*  4 |     TABLE ACCESS FULL        | T1    |   201K|  3933K|       |  2731   (3)| 00:00:01 | 
    |*  5 |     TABLE ACCESS FULL        | T1    |   201K|  3933K|       |  2733   (3)| 00:00:01 | 
    |   6 |   TABLE ACCESS BY INDEX ROWID| T2    |   100 |  1000 |       |   103   (0)| 00:00:01 |
    |*  7 |    INDEX RANGE SCAN          | T2_I1 |   100 |       |       |     3   (0)| 00:00:01 |
    ----------------------------------------------------------------------------------------------
    
    

    I’m not sure why the first_rows_10 plan uses “table access by rowid batched”, but I’d guess it’s because the optimiser calculates that sorting the index rowids before visiting the table may have a small benefit on the speed of getting the first 10 rows – eventually I’ll get around to checking the 10053 trace file. The important thing, though, is the big mistake in the strategy, not the little difference in table access.

    In the first_rows_10 plan the optimizer has decided that building an in-memory hash table from the UNION ALL of the rows fetched from the two copies of the t1 table will be fast and efficient; but it has made that decision based on the assumption that it will only get 10 rows from each copy of the table – and at run-time it HAS to get all the relevant t1 rows to build the hash table before it can get any t2 rows. We can get some idea of the scale of this error when we look at the hinted plan under all_rows optimisation – it’s a lot of redundant data and a very expensive hash table build.

    In contrast the all_rows plan does an efficient indexed access into the t2 table then, for each row, does a join predicate pushdown into the union all view using an indexed access path. If we only wanted to fetch 10 rows we could stop after doing a minimum amount of work. To demonstrate the error more clearly I’ve re-run the experiment for the first two plans from SQL*Plus, setting the arraysize to 11, the pagesize to 5, and stopping after the first 10 rows. Here are the plans showing the rowsource execution stats:

    
    first_rows_10 plan
    ------------------------------------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
    ------------------------------------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                     |       |      1 |        |   107 (100)|     12 |00:00:00.43 |   35150 |       |       |          |
    |*  1 |  HASH JOIN                           |       |      1 |      1 |   107   (0)|     12 |00:00:00.43 |   35150 |    24M|  3582K|   23M (0)|
    |   2 |   VIEW                               | V1    |      1 |     24 |     4   (0)|    400K|00:00:06.48 |   35118 |       |       |          |
    |   3 |    UNION-ALL                         |       |      1 |        |            |    400K|00:00:04.20 |   35118 |       |       |          |
    |*  4 |     TABLE ACCESS FULL                | T1    |      1 |     12 |     2   (0)|    200K|00:00:00.71 |   17559 |       |       |          |
    |*  5 |     TABLE ACCESS FULL                | T1    |      1 |     12 |     2   (0)|    200K|00:00:00.63 |   17559 |       |       |          |
    |   6 |   TABLE ACCESS BY INDEX ROWID BATCHED| T2    |      1 |    100 |   103   (0)|     28 |00:00:00.01 |      32 |       |       |          |
    |*  7 |    INDEX RANGE SCAN                  | T2_I1 |      1 |    100 |     3   (0)|     28 |00:00:00.01 |       4 |       |       |          |
    ------------------------------------------------------------------------------------------------------------------------------------------------
    
    all_rows plan
    -----------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                              | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
    -----------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                       |       |      1 |        |   904 (100)|     12 |00:00:00.01 |     213 |
    |   1 |  NESTED LOOPS                          |       |      1 |     43 |   904   (1)|     12 |00:00:00.01 |     213 |
    |   2 |   TABLE ACCESS BY INDEX ROWID BATCHED  | T2    |      1 |    100 |   103   (0)|     28 |00:00:00.01 |      32 |
    |*  3 |    INDEX RANGE SCAN                    | T2_I1 |      1 |    100 |     3   (0)|     28 |00:00:00.01 |       4 |
    |   4 |   VIEW                                 | V1    |     28 |      1 |     8   (0)|     12 |00:00:00.01 |     181 |
    |   5 |    UNION ALL PUSHED PREDICATE          |       |     28 |        |            |     12 |00:00:00.01 |     181 |
    |   6 |     TABLE ACCESS BY INDEX ROWID BATCHED| T1    |     28 |    212K|     4   (0)|      8 |00:00:00.01 |      93 |
    |*  7 |      INDEX RANGE SCAN                  | T1_N1 |     28 |      1 |     3   (0)|      8 |00:00:00.01 |      85 |
    |   8 |     TABLE ACCESS BY INDEX ROWID BATCHED| T1    |     28 |    213K|     4   (0)|      4 |00:00:00.01 |      88 |
    |*  9 |      INDEX RANGE SCAN                  | T1_N1 |     28 |      1 |     3   (0)|      4 |00:00:00.01 |      84 |
    -----------------------------------------------------------------------------------------------------------------------
    
    

    If I had set the optimizer_mode to first_rows_10 because I really only wanted to fetch (about) 10 rows then I’ve managed to pay a huge overhead in buffer visits, memory and CPU for the privilege – the all_rows plan was much more efficient.

    Remember – we often see cases where the first_rows(n) plan will do more work to get the whole data set in order to be able to get the first few rows more quickly (the simplest example is when the optimizer uses a particular index to get the first few rows of a result set in order without sorting, rather than doing a (faster) full tablescan with sort). This case, though, is different: the optimizer is choosing to build a hash table as if it only has to put 10 rows into that hash table when it actually HAS to build the whole hash table before it can take any further steps – we don’t get 10 rows quickly and the rest more slowly; we just get 10 very slow rows.

    Footnote:

    It’s possible that this is an example of bug 9633142 (FIRST_ROWS OPTIMIZER DOES NOT PUSH PREDICATES INTO UNION INLINE VIEW), but that’s reported as fixed in 12c, with a couple of patches for 11.2.0.2/3. However, setting "_fix_control"='4887636:off' does bypass the problem. (The fix control, introduced in 11.1.0.6, has the description: “remove restriction from first K row optimization”.)
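
    For reference, a sketch of testing that bypass at session level, plus a lookup of the fix control’s description:

    alter session set "_fix_control" = '4887636:off';

    select bugno, value, description
    from   v$session_fix_control
    where  bugno = 4887636;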

    Interesting observations executing SLOB2 with ext4 and xfs on SSD

    My lab server has 2 SSDs, one connected using SATA 2 and another connected using SATA 3. I’d expect the SATA 3-connected device to be equally well equipped, or even better, to do work than the one on the “old” interface. I ran SLOB on these devices to find out if there was a difference. To my great surprise the SATA 2-connected SSD performed a lot better than the SATA 3 device, as shown in the AWR report! Initially I was not entirely sure why, since the FIO results on both devices are roughly equal. You will see why, though, when reading this post. In summary: use XFS for any concurrent writes. Or maybe ASM.

    Let’s investigate

    Let’s do a little I/O investigation because a) it’s cool and b) you can.

    This is Oracle Linux 6.5 with UEK 3. A word of warning though: ext4 has been, let’s say, a little flaky. I first found out about that in 2011 (http://martincarstenbach.wordpress.com/2011/11/04/an-interesting-problem-with-ext4-on-oracle-linux-5-5/) and Oracle has now published a MOS note: ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1). I am on a reasonably recent version, so the observations in this post are probably not related to the above notes. The device in /dev/sda uses SATA 2 and the file system /u01/app/oracle/oradata is XFS-formatted. The other SSD, /dev/sde, uses ext4 but is connected via SATA 3 provided by a PCIe v2 x4 card. I wish I had PCIe v3, but my Opteron 6200 system doesn’t support native SATA 6G.

    SLOB

    I created a 12c non-CDB database for SLOB with the IOPS tablespace on my SATA 2 SSD that’s mounted on XFS. Every test has been repeated four times. Should you want to follow along, here is my slob.conf:

    [/m/kvm/oracle]$ cat ~/SLOB2/SLOB/slob.conf 
    
    UPDATE_PCT=20
    RUN_TIME=120
    WORK_LOOP=0
    SCALE=10000
    WORK_UNIT=256
    REDO_STRESS=HEAVY
    LOAD_PARALLEL_DEGREE=8
    SHARED_DATA_MODULUS=0
    
    # Settings for SQL*Net connectivity:
    #ADMIN_SQLNET_SERVICE=slob
    #SQLNET_SERVICE_BASE=slob
    #SQLNET_SERVICE_MAX=2
    #SYSDBA_PASSWD="change_on_install"
    
    export UPDATE_PCT RUN_TIME WORK_LOOP SCALE WORK_UNIT LOAD_PARALLEL_DEGREE REDO_STRESS SHARED_DATA_MODULUS
    
    [oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
    

    Notice the UPDATE_PCT: that will become important later. The SLOB configuration has been created using a call to ./setup.sh with IOPS as the target tablespace and 128 users, which is the default. My database uses the initialisation file shown below, which is more or less the same as the default (and my buffer cache is indeed 16M).

    [oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
    [/m/kvm/oracle]$ cat /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/initSLOB12.ora 
    
    #set the following to paths that make sense on your system:
    db_create_file_dest = '/u01/app/oracle/oradata'
    control_files='/u01/app/oracle/oradata/SLOB12/controlfile/o1_mf_b52b4lvk_.ctl'
    
    db_name = SLOB12
    compatible = 12.1.0.2.0
    undo_management = auto
    db_block_size = 8192
    db_files = 2000
    processes = 500
    shared_pool_size = 1G
    db_cache_size = 10M     # Will force maximum physical I/O
    #db_cache_size = 15G    # Sufficient for cached runs with reference scale (10,000 rows) and 128 sessions
    filesystemio_options = setall
    parallel_max_servers = 0
    log_buffer = 134217728
    cpu_count = 2
    pga_aggregate_target = 1G
    remote_login_passwordfile = exclusive
    workarea_size_policy = auto
    recyclebin = off
    
    # The following are needed to force db file sequential read. Omit for direct path short scans.
    # If you would like to investigate a blend of short scans mixed with single-block random I/O
    # then comment out these three parameters :
    _db_block_prefetch_limit = 0
    _db_block_prefetch_quota = 0
    _db_file_noncontig_mblock_read_count = 0
    
    # Uncomment the following if suffering library cache contention (at larger numbers of sessions) or high log file sync waits
    # at CPU-saturated levels.
    #_cursor_obsolete_threshold=130
    #_high_priority_processes=LGWR
    
    [oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
    

    After the fourth execution of ./runit.sh 16 I moved the IOPS data file from /dev/sda to /dev/sde and ran another 4 tests.
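
    For the curious: in 12c the move can be done online with ALTER DATABASE MOVE DATAFILE. A sketch with placeholder paths, not necessarily the exact commands I used:

    select file_id, file_name
    from   dba_data_files
    where  tablespace_name = 'IOPS';

    alter database move datafile '/u01/app/oracle/oradata/SLOB12/datafile/iops.dbf'
      to '/u02/oradata/SLOB12/iops.dbf';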

    Observations

    I ran an AWR difference report to compare the performance of the XFS and ext4 SSDs. The most interesting bit is right at the top. Apologies for the very wide output; that’s the way the text report lays it out.

    WORKLOAD REPOSITORY COMPARE PERIOD REPORT
    
    Snapshot Set    DB Id    Instance     Inst Num Release     Cluster Host          Std Block Size
    ------------ ----------- ------------ -------- ----------- ------- ------------ ---------------
    First (1st)   1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192
    Second (2nd)  1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192
    
    Snapshot Set  Begin Snap Id Begin Snap Time            End Snap Id End Snap Time                  Avg Active Users           Elapsed Time (min)            DB time (min)
    ------------ -------------- ------------------------- ------------ ------------------------- -------------------------- -------------------------- --------------------------
    1st                      41 31-Oct-14 12:18:14 (Fri)           42 31-Oct-14 12:20:17 (Fri)                      15.8                       2.0                      32.1
    2nd                      49 31-Oct-14 12:38:18 (Fri)           50 31-Oct-14 12:40:20 (Fri)                      15.8                       2.0                      32.3
                                                                                          ~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                          %Diff:                     0.2%                       0.0%                       0.5%
    
    Host Configuration Comparison
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                         1st                  2nd                 Diff     %Diff
    ----------------------------------- -------------------- -------------------- -------------------- ---------
    Number of CPUs:                                       24                   24                    0       0.0
    Number of CPU Cores:                                  24                   24                    0       0.0
    Number of CPU Sockets:                                 2                    2                    0       0.0
    Physical Memory:                                64413.7M             64413.7M                   0M       0.0
    Load at Start Snapshot:                             9.13                 9.57                  .44       4.8
    Load at End Snapshot:                              15.38                 15.6                  .22       1.4
    %User Time:                                         1.24                  .64                  -.6     -48.4
    %System Time:                                        .75                  .74                 -.02      -1.3
    %Idle Time:                                        97.89                 98.6                  .71       0.7
    %IO Wait Time:                                     35.21                 9.06               -26.15     -74.3
    Cache Sizes
    ~~~~~~~~~~~
                              1st (M)    2nd (M)   Diff (M)    %Diff
    ---------------------- ---------- ---------- ---------- --------
    Memory Target
    .....SGA Target
    ..........Buffer Cache       16.0       16.0        0.0      0.0
    ..........Shared Pool     1,024.0    1,024.0        0.0      0.0
    ..........Large Pool
    ..........Java Pool          16.0       16.0        0.0      0.0
    ..........Streams Pool
    .....PGA Target           1,024.0    1,024.0        0.0      0.0
    Log Buffer                  128.0      128.0        0.0      0.0
    In-Memory Area
    
    Workload Comparison
    ~~~~~~~~~~~~~~~~~~~                      1st Per Sec          2nd Per Sec      %Diff              1st Per Txn          2nd Per Txn      %Diff
                                         ---------------      ---------------     ------          ---------------      ---------------     ------
                           DB time:                 15.8                 15.8        0.2                      2.5                  4.4       78.7
                          CPU time:                  0.5                  0.3      -29.8                      0.1                  0.1       28.6
               Background CPU time:                  0.1                  0.1      -40.0                      0.0                  0.0        0.0
                 Redo size (bytes):          8,721,293.5          4,652,147.0      -46.7              1,379,498.8          1,311,306.6       -4.9
             Logical read (blocks):             10,403.9              5,231.1      -49.7                  1,645.6              1,474.5      -10.4
                     Block changes:              5,216.5              2,640.2      -49.4                    825.1                744.2       -9.8
            Physical read (blocks):              7,412.5              3,750.0      -49.4                  1,172.5              1,057.0       -9.8
           Physical write (blocks):              3,369.0              1,729.0      -48.7                    532.9                487.4       -8.5
                  Read IO requests:              7,411.8              3,749.3      -49.4                  1,172.4              1,056.8       -9.9
                 Write IO requests:              3,018.0              1,538.2      -49.0                    477.4                433.6       -9.2
                      Read IO (MB):                 57.9                 29.3      -49.4                      9.2                  8.3       -9.8
                     Write IO (MB):                 26.3                 13.5      -48.7                      4.2                  3.8       -8.4
                      IM scan rows:                  0.0                  0.0        0.0                      0.0                  0.0        0.0
           Session Logical Read IM:
                        User calls:                  1.4                  1.4       -1.4                      0.2                  0.4       77.3
                      Parses (SQL):                  5.9                  6.3        7.5                      0.9                  1.8       91.4
                 Hard parses (SQL):                  0.1                  0.3      107.1                      0.0                  0.1      300.0
                SQL Work Area (MB):                  0.6                  0.4      -25.9                      0.1                  0.1      -25.9
                            Logons:                  0.1                  0.1        0.0                      0.0                  0.0      100.0
                    Executes (SQL):                 38.1                 23.7      -37.6                      6.0                  6.7       11.1
                      Transactions:                  6.3                  3.5      -43.8
    
                                                   First               Second       Diff
                                         ---------------      ---------------     ------
         % Blocks changed per Read:                 50.1                 50.5        0.3
                  Recursive Call %:                 98.1                 97.6       -0.6
        Rollback per transaction %:                  0.0                  0.0        0.0
                     Rows per Sort:                 68.3                 35.4      -32.9
        Avg DB time per Call (sec):                 11.3                 11.5        0.2
    ...
    

    I copied the header from the report to show you that nothing really changed between the test executions, and that I'm going to do a lot of I/O since the buffer cache is only 16 MB.

    What is striking is that the physical reads and writes are roughly half of what I saw during the first batch of executions. The culprit is found quickly: in the scenario I chose, SLOB primarily performs single-block random reads, which Oracle reports as "db file sequential read":

    Top Timed Events   First DB/Inst: SLOB12/SLOB12 Snaps: 41-42 (Elapsed time: 122.27 sec  DB time: 1925.83 sec),  Second DB/Inst: SLOB12/SLOB12 Snaps: 49-50 (Elapsed time: 122.614 sec  DB time: 1935.33 sec)
    -> Events with a "-" did not make the Top list in this set of snapshots, but are displayed for comparison purposes
    
                                                   1st                                                                                                2nd
    ------------------------------------------------------------------------------------------------   ------------------------------------------------------------------------------------------------
    Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time   Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time
    ------------------------------ ------------- ------------ ------------ ------------- -----------   ------------------------------ ------------- ------------ ------------ ------------- -----------
     db file sequential read       User I/O           906,150      1,899.1           2.1        98.6    db file sequential read       User I/O           459,637      1,920.9           4.2        99.3
     db file parallel write        System I/O         116,318         60.0           0.5         3.1    db file async I/O submit      System I/O           2,975         97.2          32.7         5.0
     CPU time                                             N/A         57.7           N/A         3.0    CPU time                                             N/A         40.2           N/A         2.1
     log file parallel write       System I/O           3,485         47.4          13.6         2.5    log file parallel write       System I/O           1,881          3.2           1.7         0.2
     db file async I/O submit      System I/O           7,425          3.7           0.5         0.2    db file parallel write        System I/O           4,670          0.9           0.2         0.0
    ...
    

    The I/O times for the second interval (the EXT4 case) are twice what they are in the first. But why? I initially suspected that the difference in interface (PCIe versus SATA3) added the extra latency, and I came up with a few other ideas that all remained just that: ideas. Ideas are nice, but proof is better. Maybe there is indeed something odd with the device? Or maybe there is something in the way Oracle uses it? To find out I used FIO, my favourite I/O benchmark when it comes to measuring raw performance. If the FIO results match the Oracle results then there must indeed be something wrong with the SSD or the way it is addressed. I spent a bit of time compiling FIO and ran a set of tests.
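    Building FIO from source is straightforward; roughly what this amounts to (a sketch, assuming the canonical upstream repository and that gcc and libaio-devel are installed, the latter being required for the libaio ioengine):

    # fetch and build fio 2.1.13 from the upstream repository
    git clone git://git.kernel.dk/fio.git
    cd fio
    git checkout fio-2.1.13
    ./configure
    make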

    FIO

    I was really quite curious to see what would happen during the FIO runs: was the SATA3 SSD really slower? So I started FIO tests, using FIO 2.1.13. The test harness is the same for both devices; only the destination directory differs:

    [~/fio-2.1.13]$ cat rand-read-sd*
    [rand-read]
    numjobs=4
    rw=randread
    directory=/u01/fio/data
    size=5G
    bs=8k
    ioengine=libaio
    buffered=0
    iodepth=8
    
    [rand-read]
    numjobs=4
    rw=randread
    directory=/m/kvm/oracle
    size=5G
    bs=8k
    ioengine=libaio
    buffered=0
    iodepth=8
    
    [oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
    [~/fio-2.1.13]$
    

    The /u01 file system sits on /dev/sda and uses XFS; /m/kvm/oracle is on /dev/sde, using EXT4.
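    If you want to verify such a mapping on your own system, df reports the backing device and file system type per mount point; a quick sketch:

    # print device, file system type and usage for both mount points
    df -hT /u01 /m/kvm/oracle

    Here is the output from the tests: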

    [~/fio-2.1.13]$ ./fio rand-read-sde.fio
    rand-read: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
    ...
    fio-2.1.13
    Starting 4 processes
    rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
    rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
    rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
    Jobs: 3 (f=3): [r(3),_(1)] [99.2% done] [192.8MB/0KB/0KB /s] [24.7K/0/0 iops] [eta 00m:01s]
    rand-read: (groupid=0, jobs=1): err= 0: pid=6124: Thu Oct 30 15:41:15 2014
      read : io=5120.0MB, bw=44174KB/s, iops=5521, runt=118686msec
        slat (usec): min=4, max=109, avg=13.06, stdev= 6.14
        clat (usec): min=128, max=43082, avg=1434.23, stdev=586.65
         lat (usec): min=140, max=43094, avg=1447.47, stdev=586.41
        clat percentiles (usec):
         |  1.00th=[  516],  5.00th=[  708], 10.00th=[  820], 20.00th=[  988],
         | 30.00th=[ 1112], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1480],
         | 70.00th=[ 1608], 80.00th=[ 1816], 90.00th=[ 2160], 95.00th=[ 2480],
         | 99.00th=[ 3184], 99.50th=[ 3472], 99.90th=[ 4192], 99.95th=[ 4576],
         | 99.99th=[ 6752]
        bw (KB  /s): min=27088, max=70928, per=24.99%, avg=44119.50, stdev=6314.88
        lat (usec) : 250=0.01%, 500=0.78%, 750=6.02%, 1000=14.12%
        lat (msec) : 2=65.54%, 4=13.37%, 10=0.15%, 20=0.01%, 50=0.01%
      cpu          : usr=1.89%, sys=11.05%, ctx=421189, majf=0, minf=44
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=8
    rand-read: (groupid=0, jobs=1): err= 0: pid=6125: Thu Oct 30 15:41:15 2014
      read : io=5120.0MB, bw=44138KB/s, iops=5517, runt=118784msec
        slat (usec): min=4, max=108, avg=13.14, stdev= 6.14
        clat (usec): min=117, max=43252, avg=1435.31, stdev=587.66
         lat (usec): min=124, max=43264, avg=1448.64, stdev=587.40
        clat percentiles (usec):
         |  1.00th=[  502],  5.00th=[  700], 10.00th=[  820], 20.00th=[  988],
         | 30.00th=[ 1128], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1480],
         | 70.00th=[ 1624], 80.00th=[ 1816], 90.00th=[ 2160], 95.00th=[ 2480],
         | 99.00th=[ 3152], 99.50th=[ 3472], 99.90th=[ 4192], 99.95th=[ 4640],
         | 99.99th=[ 6688]
        bw (KB  /s): min=27504, max=69760, per=24.91%, avg=43972.22, stdev=6169.69
        lat (usec) : 250=0.07%, 500=0.92%, 750=5.84%, 1000=13.94%
        lat (msec) : 2=65.68%, 4=13.40%, 10=0.16%, 20=0.01%, 50=0.01%
      cpu          : usr=1.89%, sys=11.10%, ctx=421379, majf=0, minf=46
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=8
    rand-read: (groupid=0, jobs=1): err= 0: pid=6126: Thu Oct 30 15:41:15 2014
      read : io=5120.0MB, bw=44230KB/s, iops=5528, runt=118538msec
        slat (usec): min=4, max=120, avg=12.59, stdev= 5.61
        clat (usec): min=185, max=42871, avg=1433.09, stdev=581.75
         lat (usec): min=196, max=42881, avg=1445.85, stdev=581.45
        clat percentiles (usec):
         |  1.00th=[  532],  5.00th=[  716], 10.00th=[  828], 20.00th=[  988],
         | 30.00th=[ 1128], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1464],
         | 70.00th=[ 1608], 80.00th=[ 1800], 90.00th=[ 2128], 95.00th=[ 2480],
         | 99.00th=[ 3152], 99.50th=[ 3440], 99.90th=[ 4192], 99.95th=[ 4576],
         | 99.99th=[ 7008]
        bw (KB  /s): min=27600, max=65952, per=25.01%, avg=44149.65, stdev=5945.98
        lat (usec) : 250=0.01%, 500=0.63%, 750=5.90%, 1000=14.17%
        lat (msec) : 2=65.89%, 4=13.25%, 10=0.15%, 20=0.01%, 50=0.01%
      cpu          : usr=1.74%, sys=11.20%, ctx=426219, majf=0, minf=44
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=8
    rand-read: (groupid=0, jobs=1): err= 0: pid=6127: Thu Oct 30 15:41:15 2014
      read : io=5120.0MB, bw=47359KB/s, iops=5919, runt=110704msec
        slat (usec): min=4, max=114, avg=12.83, stdev= 5.90
        clat (usec): min=188, max=42988, avg=1337.06, stdev=571.06
         lat (usec): min=198, max=43000, avg=1350.07, stdev=570.63
        clat percentiles (usec):
         |  1.00th=[  446],  5.00th=[  620], 10.00th=[  748], 20.00th=[  900],
         | 30.00th=[ 1020], 40.00th=[ 1144], 50.00th=[ 1256], 60.00th=[ 1368],
         | 70.00th=[ 1512], 80.00th=[ 1704], 90.00th=[ 2040], 95.00th=[ 2352],
         | 99.00th=[ 3024], 99.50th=[ 3280], 99.90th=[ 3984], 99.95th=[ 4384],
         | 99.99th=[ 6624]
        bw (KB  /s): min=33664, max=51008, per=26.83%, avg=47369.34, stdev=4051.34
        lat (usec) : 250=0.01%, 500=1.45%, 750=8.72%, 1000=17.95%
        lat (msec) : 2=61.12%, 4=10.65%, 10=0.09%, 20=0.01%, 50=0.01%
      cpu          : usr=1.98%, sys=11.60%, ctx=410433, majf=0, minf=44
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=8
    
    Run status group 0 (all jobs):
       READ: io=20480MB, aggrb=176551KB/s, minb=44137KB/s, maxb=47359KB/s, mint=110704msec, maxt=118784msec
    
    Disk stats (read/write):
      sde: ios=2619049/4, merge=0/1, ticks=3663514/6, in_queue=3662827, util=100.00%
    

    In my opinion, the best tool to look at I/O without tracing block devices is collectl. I ran it with extended disk statistics in parallel with the FIO test:

    [root@ol62 queue]# collectl -sD -i 5 --iosize --dskfilt sde
    waiting for 5 second sample...
    
    # DISK STATISTICS (/sec)
    #          <---------reads---------><---------writes---------><--------averages--------> Pct
    #Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
    sde              0      0    0    0  125340    129  123 1021    1020   219  1749      8   99
    sde              0      0    0    0  119608     58  118 1017    1017   156  1666      7   88
    sde         146968      0  18K    8       0      0    0    0       8    31     1      0   89
    sde         165435      0  20K    8       2      0    0    6       7    31     1      0   99
    sde         187097      0  23K    8       0      0    0    0       8    31     1      0  100
    sde         187113      0  23K    8       0      0    0    0       8    31     1      0   99
    sde         157684      0  19K    8       0      0    0    0       8    31     1      0   99
    sde         186542      0  23K    8       1      0    0    4       7    31     1      0  100
    sde         186721      0  23K    8       0      0    0    0       8    31     1      0   99
    sde         158416      0  19K    8       0      0    0    0       8    31     1      0  100
    sde         187168      0  23K    8       0      0    0    0       8    31     1      0   99
    sde         186027      0  23K    8       0      0    0    0       8    31     1      0  100
    sde         158366      0  19K    8       0      0    0    0       8    31     1      0   99
    sde         186028      0  23K    8       0      0    0    0       8    31     1      0   99
    sde         186416      0  23K    8       0      0    0    0       8    31     1      0  100
    sde         159102      0  19K    8       0      0    0    0       8    31     1      0  100
    sde         186348      0  23K    8       0      0    0    0       8    31     1      0  100
    sde         186859      0  23K    8       0      0    0    0       8    31     1      0   99
    sde         153996      0  19K    8       0      0    0    0       8    31     1      0   99
    sde         186043      0  23K    8       0      0    0    0       8    31     1      0  100
    sde         185348      0  23K    8       0      0    0    0       8    31     1      0   99
    sde         158150      0  19K    8       0      0    0    0       8    31     1      0   99
    sde         186148      0  23K    8       0      0    0    0       8    31     1      0   99
    sde         185900      0  23K    8       0      0    0    0       8    31     1      0   99
    sde         159353      0  19K    8       0      0    0    0       8    25     1      0   99
    sde         167065      0  20K    8       0      0    0    0       8    22     0      0   86
    sde              0      0    0    0       0      0    0    0       0     0     0      0    0
    

    This matches the FIO output; the device can indeed sustain around 23k IOPS at an 8k I/O size (about 186,000 KB/s divided by 8 KB per request). What about the XFS mount?

    [~/fio-2.1.13]$ ./fio rand-read-sda.fio
    rand-read: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
    ...
    fio-2.1.13
    Starting 4 processes
    rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
    rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
    rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
    Jobs: 4 (f=4): [r(4)] [100.0% done] [191.3MB/0KB/0KB /s] [24.5K/0/0 iops] [eta 00m:00s]
    rand-read: (groupid=0, jobs=1): err= 0: pid=6329: Thu Oct 30 15:52:03 2014
      read : io=5120.0MB, bw=49087KB/s, iops=6135, runt=106809msec
        slat (usec): min=3, max=93, avg= 7.90, stdev= 6.40
        clat (usec): min=280, max=35113, avg=1294.26, stdev=152.91
         lat (usec): min=285, max=35130, avg=1302.35, stdev=152.60
        clat percentiles (usec):
         |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1192], 20.00th=[ 1240],
         | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
         | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
         | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1672], 99.95th=[ 2224],
         | 99.99th=[ 3024]
        bw (KB  /s): min=46000, max=49536, per=25.02%, avg=49124.10, stdev=291.98
        lat (usec) : 500=0.01%, 750=0.01%, 1000=0.03%
        lat (msec) : 2=99.89%, 4=0.07%, 10=0.01%, 20=0.01%, 50=0.01%
      cpu          : usr=2.39%, sys=8.05%, ctx=505195, majf=0, minf=43
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=8
    rand-read: (groupid=0, jobs=1): err= 0: pid=6330: Thu Oct 30 15:52:03 2014
      read : io=5120.0MB, bw=49093KB/s, iops=6136, runt=106795msec
        slat (usec): min=3, max=114, avg= 7.55, stdev= 5.74
        clat (usec): min=263, max=35353, avg=1294.46, stdev=155.68
         lat (usec): min=271, max=35369, avg=1302.21, stdev=155.46
        clat percentiles (usec):
         |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1192], 20.00th=[ 1256],
         | 30.00th=[ 1272], 40.00th=[ 1288], 50.00th=[ 1288], 60.00th=[ 1304],
         | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
         | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1672], 99.95th=[ 2224],
         | 99.99th=[ 3184]
        bw (KB  /s): min=45728, max=49472, per=25.02%, avg=49123.20, stdev=300.30
        lat (usec) : 500=0.01%, 750=0.01%, 1000=0.02%
        lat (msec) : 2=99.89%, 4=0.07%, 10=0.01%, 20=0.01%, 50=0.01%
      cpu          : usr=2.42%, sys=7.88%, ctx=549312, majf=0, minf=46
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=8
    rand-read: (groupid=0, jobs=1): err= 0: pid=6331: Thu Oct 30 15:52:03 2014
      read : io=5120.0MB, bw=49080KB/s, iops=6134, runt=106824msec
        slat (usec): min=3, max=137, avg= 7.70, stdev= 5.87
        clat (usec): min=266, max=35102, avg=1294.64, stdev=147.64
         lat (usec): min=274, max=35121, avg=1302.52, stdev=147.39
        clat percentiles (usec):
         |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1208], 20.00th=[ 1256],
         | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
         | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
         | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 2064], 99.95th=[ 2256],
         | 99.99th=[ 3600]
        bw (KB  /s): min=45888, max=49536, per=25.02%, avg=49110.62, stdev=295.81
        lat (usec) : 500=0.01%, 750=0.01%, 1000=0.04%
        lat (msec) : 2=99.84%, 4=0.10%, 10=0.01%, 20=0.01%, 50=0.01%
      cpu          : usr=2.32%, sys=8.07%, ctx=536841, majf=0, minf=44
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=8
    rand-read: (groupid=0, jobs=1): err= 0: pid=6332: Thu Oct 30 15:52:03 2014
      read : io=5120.0MB, bw=49080KB/s, iops=6134, runt=106824msec
        slat (usec): min=3, max=106, avg= 7.66, stdev= 5.93
        clat (usec): min=274, max=35268, avg=1294.67, stdev=157.16
         lat (usec): min=281, max=35286, avg=1302.53, stdev=156.91
        clat percentiles (usec):
         |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1208], 20.00th=[ 1256],
         | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
         | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
         | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1944], 99.95th=[ 2224],
         | 99.99th=[ 3536]
        bw (KB  /s): min=45840, max=49536, per=25.02%, avg=49109.56, stdev=303.79
        lat (usec) : 500=0.01%, 750=0.01%, 1000=0.05%
        lat (msec) : 2=99.84%, 4=0.09%, 10=0.01%, 20=0.01%, 50=0.01%
      cpu          : usr=2.50%, sys=7.83%, ctx=533469, majf=0, minf=44
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=8
    
    Run status group 0 (all jobs):
       READ: io=20480MB, aggrb=196318KB/s, minb=49079KB/s, maxb=49092KB/s, mint=106795msec, maxt=106824msec
    
    Disk stats (read/write):
      sda: ios=2620077/60, merge=0/2, ticks=3348979/94, in_queue=3348631, util=100.00%
    [oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
    [~/fio-2.1.13]$
    

    Here is some of the corresponding collectl output:

    [root@ol62 queue]# collectl -sD -i 5 --iosize --dskfilt sda
    waiting for 5 second sample...
    
    # DISK STATISTICS (/sec)
    #          <---------reads---------><---------writes---------><--------averages--------> Pct
    #Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
    sda         196313      0  24K    8       5      0    1    9       8    31     1      0   99
    sda         196681      0  24K    8       6      0    0   16       8    31     1      0  100
    sda         196369      0  24K    8       7      0    1   12       8    31     1      0   99
    sda         196217      0  24K    8       4      0    0   10       8    31     1      0  100
    sda         196300      0  24K    8      14      0    2    6       8    31     1      0  100
    sda         196699      0  24K    8       6      0    0   16       8    31     1      0   99
    sda         196340      0  24K    8       3      0    0   16       8    31     1      0   99
    sda         196350      0  24K    8       6      0    0   16       8    31     1      0  100
    sda         196379      0  24K    8       6      0    0   16       8    31     1      0   99
    sda         196560      0  24K    8       3      0    0   16       8    31     1      0   99
    sda         196259      0  24K    8       8      0    1    8       8    31     1      0   99
    

    Summary Number 1

    So it would appear that the SSDs are more or less equal from FIO's point of view. If you followed closely (hurray if you did; I appreciate this is a long post!) you will have noticed the following:

    • The SLOB test included updates (20%). Furthermore, there is just one data file for the IOPS tablespace, so all sessions hit it concurrently.
    • The FIO test performed random reads only, no writes. Its 4 jobs also worked on 4 distinct files (a job file that gets closer to the SLOB profile is sketched after this list).
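    A fio job closer to the SLOB profile would mix writes in and point all jobs at one file; a sketch I did not run for this post (rwmixread=80 approximates the 20% update percentage, and the shared filename mimics the single IOPS datafile):

    [rand-rw]
    # sketch only: 80/20 random read/write mix against a single shared file
    numjobs=4
    rw=randrw
    rwmixread=80
    filename=/m/kvm/oracle/slob-like.dat
    size=5G
    bs=8k
    ioengine=libaio
    buffered=0
    iodepth=8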

    I tried to avoid the effect of file system buffering by setting buffered=0, and I used asynchronous I/O as provided by libaio (just as Oracle does).
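    On the database side, the usual way to request the same behaviour on a file system is the filesystemio_options initialisation parameter; a sketch (whether both flags actually take effect depends on platform and file system):

    # init.ora: request direct (unbuffered) plus asynchronous I/O,
    # the database-side counterpart of buffered=0 with ioengine=libaio
    filesystemio_options = setall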

    Theory Number 2

    It would appear that XFS is better suited to concurrent writes. Let's put that theory to the test: I reformatted the SATA3 SSD with XFS and ran another set of tests.
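    The reformat itself is quick; a sketch of what it boils down to (run as root, and note that mkfs.xfs -f destroys the existing EXT4 file system on the device):

    # re-create the file system on the SATA3 SSD as XFS and remount it
    umount /m/kvm/oracle
    mkfs.xfs -f /dev/sde
    mount -t xfs /dev/sde /m/kvm/oracle

    Feast your eyes on the comparison of the AWR reports: the EXT4 test execution versus XFS, both on /dev/sde: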

    WORKLOAD REPOSITORY COMPARE PERIOD REPORT
    
    Snapshot Set    DB Id    Instance     Inst Num Release     Cluster Host          Std Block Size
    ------------ ----------- ------------ -------- ----------- ------- ------------ ---------------
    First (1st)   1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192
    Second (2nd)  1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192
    
    Snapshot Set  Begin Snap Id Begin Snap Time            End Snap Id End Snap Time                  Avg Active Users           Elapsed Time (min)            DB time (min)
    ------------ -------------- ------------------------- ------------ ------------------------- -------------------------- -------------------------- --------------------------
    1st                      49 31-Oct-14 12:38:18 (Fri)           50 31-Oct-14 12:40:20 (Fri)                      15.8                       2.0                      32.3
    2nd                      55 31-Oct-14 15:35:23 (Fri)           56 31-Oct-14 15:37:25 (Fri)                      15.7                       2.0                      32.1
                                                                                          ~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                          %Diff:                     0.2%                      -0.5%                      -0.6%
    
    Host Configuration Comparison
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                         1st                  2nd                 Diff     %Diff
    ----------------------------------- -------------------- -------------------- -------------------- ---------
    Number of CPUs:                                       24                   24                    0       0.0
    Number of CPU Cores:                                  24                   24                    0       0.0
    Number of CPU Sockets:                                 2                    2                    0       0.0
    Physical Memory:                                64413.7M             64413.7M                   0M       0.0
    Load at Start Snapshot:                             9.57                 1.77                 -7.8     -81.5
    Load at End Snapshot:                               15.6                14.15                -1.45      -9.3
    %User Time:                                          .64                 2.07                 1.43     223.4
    %System Time:                                        .74                 1.48                  .75     100.0
    %Idle Time:                                         98.6                 96.4                 -2.2      -2.2
    %IO Wait Time:                                      9.06                53.29                44.23     488.2
    Cache Sizes
    ~~~~~~~~~~~
                              1st (M)    2nd (M)   Diff (M)    %Diff
    ---------------------- ---------- ---------- ---------- --------
    Memory Target
    .....SGA Target
    ..........Buffer Cache       16.0       16.0        0.0      0.0
    ..........Shared Pool     1,024.0    1,024.0        0.0      0.0
    ..........Large Pool
    ..........Java Pool          16.0       16.0        0.0      0.0
    ..........Streams Pool
    .....PGA Target           1,024.0    1,024.0        0.0      0.0
    Log Buffer                  128.0      128.0        0.0      0.0
    In-Memory Area
    
    Workload Comparison
    ~~~~~~~~~~~~~~~~~~~                      1st Per Sec          2nd Per Sec      %Diff              1st Per Txn          2nd Per Txn      %Diff
                                         ---------------      ---------------     ------          ---------------      ---------------     ------
                           DB time:                 15.8                 15.8        0.2                      4.4                  1.5      -66.7
                          CPU time:                  0.3                  0.9      175.8                      0.1                  0.1        0.0
               Background CPU time:                  0.1                  0.1      133.3                      0.0                  0.0      -50.0
                 Redo size (bytes):          4,652,147.0         15,186,494.1      226.4              1,311,306.6          1,426,136.0        8.8
             Logical read (blocks):              5,231.1             16,957.8      224.2                  1,474.5              1,592.5        8.0
                     Block changes:              2,640.2              8,922.1      237.9                    744.2                837.9       12.6
            Physical read (blocks):              3,750.0             12,560.9      235.0                  1,057.0              1,179.6       11.6
           Physical write (blocks):              1,729.0              5,855.0      238.6                    487.4                549.8       12.8
                  Read IO requests:              3,749.3             12,560.8      235.0                  1,056.8              1,179.6       11.6
                 Write IO requests:              1,538.2              5,219.5      239.3                    433.6                490.2       13.0
                      Read IO (MB):                 29.3                 98.1      234.9                      8.3                  9.2       11.6
                     Write IO (MB):                 13.5                 45.7      238.6                      3.8                  4.3       12.9
                      IM scan rows:                  0.0                  0.0        0.0                      0.0                  0.0        0.0
           Session Logical Read IM:
                        User calls:                  1.4                  1.4        2.9                      0.4                  0.1      -66.7
                      Parses (SQL):                  6.3                 10.3       62.7                      1.8                  1.0      -45.5
                 Hard parses (SQL):                  0.3                  0.0      -86.2                      0.1                  0.0     -100.0
                SQL Work Area (MB):                  0.4                  0.2      -48.8                      0.1                  0.0      -48.8
                            Logons:                  0.1                  0.1        0.0                      0.0                  0.0      -50.0
                    Executes (SQL):                 23.7                 58.0      144.3                      6.7                  5.4      -18.5
                      Transactions:                  3.5                 10.6      200.0
    
                                                   First               Second       Diff
                                         ---------------      ---------------     ------
         % Blocks changed per Read:                 50.5                 52.6        2.1
                  Recursive Call %:                 97.6                 99.0        1.5
        Rollback per transaction %:                  0.0                  0.0        0.0
                     Rows per Sort:                 35.4                 49.6       14.2
        Avg DB time per Call (sec):                 11.5                 11.3       -0.3
    
    Top Timed Events   First DB/Inst: SLOB12/SLOB12 Snaps: 49-50 (Elapsed time: 122.614 sec  DB time: 1935.33 sec),  Second DB/Inst: SLOB12/SLOB12 Snaps: 55-56 (Elapsed time: 121.705 sec  DB time: 1923.77 sec)
    -> Events with a "-" did not make the Top list in this set of snapshots, but are displayed for comparison purposes
    
                                                   1st                                                                                                2nd
    ------------------------------------------------------------------------------------------------   ------------------------------------------------------------------------------------------------
    Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time   Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time
    ------------------------------ ------------- ------------ ------------ ------------- -----------   ------------------------------ ------------- ------------ ------------ ------------- -----------
     db file sequential read       User I/O           459,637      1,920.9           4.2        99.3    db file sequential read       User I/O         1,528,725      1,840.4           1.2        95.7
     db file async I/O submit      System I/O           2,975         97.2          32.7         5.0    CPU time                                             N/A        110.3           N/A         5.7
     CPU time                                             N/A         40.2           N/A         2.1    db file parallel write        System I/O         114,997         34.3           0.3         1.8
     log file parallel write       System I/O           1,881          3.2           1.7         0.2    free buffer waits             Configuration        2,318         28.0          12.1         1.5
     db file parallel write        System I/O           4,670          0.9           0.2         0.0    log file parallel write       System I/O           6,897         14.3           2.1         0.7
     control file sequential read  System I/O             499          0.2           0.3         0.0    db file async I/O submit      System I/O          12,819          5.2           0.4         0.3
     oracle thread bootstrap       Other                    7          0.1          18.1         0.0    buffer busy waits             Concurrency            407          2.7           6.7         0.1
     Disk file operations I/O      User I/O                77          0.1           0.9         0.0    Data file init write          User I/O               152          0.9           5.7         0.0
     library cache: mutex X        Concurrency             22          0.1           2.7         0.0    read by other session         User I/O                60          0.7          11.9         0.0
     Data file init write          User I/O                 5          0.0           5.6         0.0    enq: HW - contention          Configuration           18          0.7          39.3         0.0
    -read by other session         User I/O                18          0.0           0.6         0.0   -control file sequential read  System I/O             841          0.5           0.5         0.0
    -buffer busy waits             Concurrency             82          0.0           0.0         0.0   -oracle thread bootstrap       Other                    7          0.1          17.2         0.0
    -                                                     N/A          N/A           N/A         N/A   -Disk file operations I/O      User I/O               136          0.1           0.8         0.0
    -                                                     N/A          N/A           N/A         N/A   -library cache: mutex X        Concurrency             21          0.0           1.4         0.0
                              --------------------------------------------------------------------------------------------------------------------
    

    Everything is just better: the I/O times, the redo size, the number of IOPS and so on. That's what I would have expected from the outset. Good to know; I'll no longer use EXT4 for Oracle databases. When I discussed this with @fritshoogland, he showed me a post by @kevinclosson that hadn't come to mind at the time:

    http://kevinclosson.net/2012/03/06/yes-file-systems-still-need-to-support-concurrent-writes-yet-another-look-at-xfs-versus-ext4/