Oakies Blog Aggregator


I have a simple script that creates two identical tables, collects stats (with no histograms) on the pair of them, then executes a join. Here’s the SQL to create the first table:

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	trunc(dbms_random.value(0,1000))	n_1000,
	trunc(dbms_random.value(0,750))		n_750,
	trunc(dbms_random.value(0,600))		n_600,
	trunc(dbms_random.value(0,400))		n_400,
	trunc(dbms_random.value(0,90))		n_90,
	trunc(dbms_random.value(0,72))		n_72,
	trunc(dbms_random.value(0,40))		n_40,
	trunc(dbms_random.value(0,3))		n_3
from
	generator	v1,
	generator	v2
where
	rownum <= 1e6
;

-- gather stats: no histograms

The two tables have 1,000,000 rows each and t2 is created from t1 with a simple “create as select”. The columns are all defined to be integers, and the naming convention is simple – n_400 holds 400 distinct values with uniform distribution from 0 – 399, n_750 holds 750 values from 0 – 749, and so on.

Here’s the simple query:

select
        t1.*, t2.*
from
        t1, t2
where
        t1.n_400 = 0
and     t2.n_72  = t1.n_90
and     t2.n_750 = t1.n_600
and     t2.n_400 = 1
;

Since I’ve created no indexes you might expect the query to do a couple of full tablescans and a hash join to get its result – and you’d be right; but what do you think the predicted cardinality would be?

Here are the results from running explain plan on the query and then reporting the execution plan – for three different versions of Oracle:
| Id  | Operation            |  Name       | Rows  | Bytes | Cost (%CPU)|
|   0 | SELECT STATEMENT     |             |    96 |  4992 |  1230  (10)|
|*  1 |  HASH JOIN           |             |    96 |  4992 |  1230  (10)|
|*  2 |   TABLE ACCESS FULL  | T1          |  2500 | 65000 |   617  (11)|
|*  3 |   TABLE ACCESS FULL  | T2          |  2500 | 65000 |   613  (10)|

Predicate Information (identified by operation id):
   1 - access("T2"."N_750"="T1"."N_600" AND "T2"."N_72"="T1"."N_90")
   2 - filter("T1"."N_400"=0)
   3 - filter("T2"."N_400"=1)

| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT   |      |   116 |  6032 |  1229  (10)| 00:00:07 |
|*  1 |  HASH JOIN         |      |   116 |  6032 |  1229  (10)| 00:00:07 |
|*  2 |   TABLE ACCESS FULL| T1   |  2500 | 65000 |   616  (11)| 00:00:04 |
|*  3 |   TABLE ACCESS FULL| T2   |  2500 | 65000 |   612  (10)| 00:00:04 |

Predicate Information (identified by operation id):
   1 - access("T2"."N_750"="T1"."N_600" AND "T2"."N_72"="T1"."N_90")
   2 - filter("T1"."N_400"=0)
   3 - filter("T2"."N_400"=1)

| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT   |      |  2554 |   139K|  1225  (10)| 00:00:07 |
|*  1 |  HASH JOIN         |      |  2554 |   139K|  1225  (10)| 00:00:07 |
|*  2 |   TABLE ACCESS FULL| T1   |  2500 | 70000 |   612  (10)| 00:00:04 |
|*  3 |   TABLE ACCESS FULL| T2   |  2500 | 70000 |   612  (10)| 00:00:04 |

Predicate Information (identified by operation id):
   1 - access("T2"."N_72"="T1"."N_90" AND "T2"."N_750"="T1"."N_600")
   2 - filter("T1"."N_400"=0)
   3 - filter("T2"."N_400"=1)

The change for (which is still there for I didn’t check to see if it also appears in is particularly worrying. When you see a simple query like this changing cardinality on the upgrade you can be fairly confident that some of your more complex queries will change their plans – even if there are no clever new optimizer transformations coming into play.

I’ll write up an explanation of how the optimizer has produced three different estimates some time over the next couple of weeks; but if you want an earlier answer this is one of the things I’ll be covering in my presentation on calculating selectivity at “Super Sunday” at UKOUG Tech 14.
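As a rough baseline while waiting for that explanation, here is a minimal Python sketch of the textbook arithmetic. It assumes uniform data, no histograms, and the commonly quoted join formula (divide by the larger number of distinct values for each join predicate); the function names are made up for illustration, and this is not a description of what any of the three versions actually does.

```python
# Textbook cardinality arithmetic for the test case.
# Assumption: uniform data, no histograms, and the "divide by the greater
# num_distinct" join rule. This is a baseline, not the optimizer's real code path.

NUM_ROWS = 1_000_000


def filter_cardinality(num_rows, num_distinct):
    """Rows expected from 'col = constant' when every value is equally likely."""
    return num_rows / num_distinct


def textbook_join_cardinality(card_1, card_2, ndv_pairs):
    """Multiply the filtered cardinalities, then divide by the larger
    num_distinct of each pair of join columns."""
    cardinality = card_1 * card_2
    for ndv_a, ndv_b in ndv_pairs:
        cardinality /= max(ndv_a, ndv_b)
    return cardinality


t1_rows = filter_cardinality(NUM_ROWS, 400)  # t1.n_400 = 0
t2_rows = filter_cardinality(NUM_ROWS, 400)  # t2.n_400 = 1

# Join predicates: t2.n_72 = t1.n_90 and t2.n_750 = t1.n_600
join_rows = textbook_join_cardinality(t1_rows, t2_rows, [(72, 90), (750, 600)])

print(t1_rows, t2_rows, round(join_rows, 1))  # 2500.0 2500.0 92.6
```

The 2,500 rows per table agree with the Rows column for operations 2 and 3 in all three plans. The baseline join figure of roughly 93 matches none of the three reported estimates (96, 116 and 2554), which is exactly the gap the promised explanation has to fill.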

UKOUG 2014

So it’s that time of year when I have to decide on my timetable for the UKOUG annual conference. Of course, I never manage to stick to it, but in principle here are the sessions I’ve highlighted:


Sunday

  • 12:30 – How to Avoid a Salted Banana – Lothar Flatz
  • 13:30 – Calculating Selectivity  – Me
  • 15:00 – Advanced Diagnostics Revisited – Julian Dyke
  • 16:00 – Testing Jumbo Frames for RAC – Neil Johnson


Monday

  • 9:00 – Oracle Indexes Q & A session – Richard Foote
  • 10:00 – How Oracle works in 50 minutes – Martin Widlake
  • 11:30 – Predictive Queries in 12c – Brendan Tierney
  • 14:30 – Oracle Database In-Memory DB and the Query Optimizer – Christian Antognini
  • 16:00 – Instrumenting, Analysing, & Tuning the Performance of Oracle ADF Applications – Frank Houweling
  • 17:00 – Techniques for Strategic Testing – Clive King


Tuesday

  • 9:30 – Top Five Things You Need To Know About Oracle Database In-Memory Option – Maria Colgan
  • 10:30 – How to Write Better PL/SQL – Andrew Clarke
  • 12:00 – Optimizer Round Table – Tony Hasler
  • 14:00 – What we had to Unlearn & Learn when Moving from M9000 to Super Cluster – Philippe Fierens
  • 15:00 – Maximum Availability Architecture: A Recipe for Disaster? – Julian Dyke
  • 16:30 – Chasing the Holy Grail of HA – Implementing Transaction Guard & Application Continuity in Oracle Database 12c – Mark Bobak
  • 17:30 – Five Hints for Efficient SQL – Me


Wednesday

  • 9:00 – Fundamentals of Troubleshooting (without graphics) pt.1 – Me
  • 10:00 – Fundamentals of Troubleshooting (without graphics) pt.2 – Me
  • 11:30 – Indexing in Exadata – Richard Foote



How do you Explain Oracle in 50 Minutes?

I’ve done a very “brave”* thing. I’ve put forward a talk to this year’s UKOUG Tech14 conference titled “How Oracle Works – in under 50 minutes”. Yes, I really was suggesting I could explain to people how the core of Oracle functions in that time. Not only that, but the talk is aimed at those new to Oracle technology. And it got accepted, so I have to present it. I can’t complain about that too much, I was on the paper selection committee…

* – “brave”, of course, means “stupid” in this context.

As a result I am now strapped to the chair in front of my desk, preparing an attempt to explain the overall structure of an Oracle instance, how data moves in and out of storage, how ACID works and a few other things. Writing this blog is just avoidance behaviour on my part as I delay going back to it.

Is it possible? I’m convinced it is.

If you ignore all the additional bits, the things that not all sites use, such as Partitioning, RAC, Resource Manager, Materialized Views etc, etc, etc, then that removes a lot. And if not everyone uses it, then it is not core.
There is no need or intention on my part to talk about details of the core – for example, how the Cost Based aspect of the optimizer works, Oracle permissions or the steps needed for instance recovery. We all use those but the details are ignored by some people for their whole career {not usually people who I would deem competent, despite them holding down jobs as Oracle technicians, but they do}.

You are left with a relatively small set of things going on. Don’t get me wrong, it is still a lot of stuff to talk about and is almost certainly too much for someone to fully take in and digest in the time I have. I’m going to have to present this material as if I am possessed. But my intention is to describe a whole picture that makes sense and will allow people to understand the flow. Then, when they see presentations on aspects of it later in the conference, there is more chance it will stick. I find I need to be taught something 3 or 4 times. The first time simply opens my mind to the general idea, the second time I retain some of the details and the third or fourth time I start integrating it into what I already knew.

My challenge is to say enough so that it makes sense and *no more*. I have developed a very bad habit of trying to cram too much into a presentation and of course this is a real danger here. I’m trying to make it all visual. There will be slides of text, but they are more for if you want to download the talk after the conference. However, drawing pictures takes much, much, much longer than banging down a half dozen bullet points.

One glimmer in the dark is that there is a coffee break after my session. I can go right up to the wire and then take questions after I officially stop, if I am not wrestled to the ground and thrown out the room.

If anyone has any suggestions or comments about what I should or should not include, I’d love to hear them.

This is all part of my intention to provide more conference content for those new to Oracle. As such, this “overview” talk is at the start of the first day of the main conference, 10am Monday. I have to thank my fellow content organisers for allowing me to stick it in where I wanted it. If you are coming to the conference and don’t know much Oracle yet – then I am amazed you read my blog (or any other blog other than maybe AskTom). But if you have colleagues or friends coming who are still relatively new to the tech, tell them to look out for my talk. I really hope it will help them get that initial understanding.

I had hoped to create a fully fledged thread of intro talks running through all of Monday and Tuesday, but I brought the idea up too late. We really needed to promote the idea at the call for papers and then maybe source a couple of talks. However, using the talks that were accepted we did manage to get a good stab at a flow of intro talks through Monday. I would suggest:

  • 08:50 – Welcome and Introduction
    • Get there in time for the intro if you can, as if you are newish to the tech you are probably newish to a conference
  • 09:00 – RMAN the basics, by Michael Abbey.
    • If you are a DBA type, backup/recovery is your number one concern.
  • 10:00 – How Oracle Works in 50 Minutes
    • I think I have said enough!
  • 11:30 – All about Joins by Tony Hasler
    • Top presenter, always good content
  • 12:30 – Lunch. Go and talk to people, lots of people, find some people you might like to talk with again. *don’t stalk anyone*
  • 13:20 – Go to the Oracle Keynote.
    • Personally, I hate whole-audience keynotes, I am sick of being told every year how “there has never been a better time to invest in oracle technology” – but this one is short and after it there is a panel discussion by technical experts.
  • 14:30 is a bit tricky. Tim Hall on Analytical Functions is maybe a bit advanced, but Tim is a brilliant teacher and it is an intro to the subject. Failing that, I’d suggest the Oracle Enterprise Manager round table hosted by Dev Nayak, as database-centric Oracle people should know OEM.
  • 16:00 – Again a bit tricky for someone new but I’d plump for The role of Privileges and Roles in Oracle 12C by Carl Dudley. He lectures (lectured?) in database technology and knows his stuff, but this is a New Feature talk…
  • 17:00 – Tuning by Explain Plan by Arian Stijf
    • This is a step-by-step guide to understanding the most common tool used for performance tuning
  • 17:50 onwards – go to the exhibition drinks, the community drinks and just make friends. One of the best things to come out of conferences is meeting people and swapping stories.

I better get back to drawing pictures. Each one takes me a day and I need about 8 of them. Whoops!

Little things worth knowing: creating the Swingbench order entry schema “lights out”

This is a little note, primarily to myself I guess, about the creation of the order entry schema (part of Swingbench, written by Dominic Giles) when no VNC sessions are available (although you can almost always use port-forwarding :). Instead, you can create the schema on the command line. I always execute commands on remote systems in screen for increased peace of mind. Should the network drop, the order entry generation will continue as if nothing ever happened.

Like many others I use Swingbench during trainings and presentations to have some activity on a system. Very useful for demonstrating ASH and OEM, and many other things too!

This article references swingbench 2.5, the latest version at the time of writing. It requires a JDK to be present; I simply used the one bundled with Oracle which is my target database version. You start by downloading Swingbench from Dominic’s website and unzipping it in the directory from which you want to run it. It is usually a good idea to keep the load generator on a different (set of) machine(s). As long as you can connect to the database you want to use to house the SOE schema you’re fine.

Creating the SOE schema

Before you can run swingbench (or minibench/charbench) you need to create the SOE schema. The oewizard is the tool of choice. If you haven’t done so, add java to your path. Then you can have a look at the available options:

$ JAVA_HOME=/u01/app/oracle/product/
$ export PATH=$PATH:$JAVA_HOME/bin

$ ./oewizard -h
usage: parameters:
 -allindexes             build all indexes for schema
 -bigfile                use big file tablespaces
 -c            wizard config file
 -cl                     run in character mode
 -compositepart          use a composite paritioning model if it exisits
 -compress               use default compression model if it exists
 -create                 create benchmarks schema
 -cs      connectring for database
 -dba          dba username for schema creation
 -dbap         password for schema creation
 -debug                  turn on debugging output
 -debugf                 turn on debugging output to file (debug.log)
 -df           datafile name used to create schema in
 -drop                   drop benchmarks schema
 -dt         driver type (oci|thin)
 -g                      run in graphical mode (default)
 -generate               generate data for benchmark if available
 -h,--help               print this message
 -hashpart               use hash paritioning model if it exists
 -hcccompress            use HCC compression if it exisits
 -nocompress             don't use any database compression
 -noindexes              don't build any indexes for schema
 -nopart                 don't use any database partitioning
 -normalfile             use normal file tablespaces
 -oltpcompress           use OLTP compression if it exisits
 -p            password for benchmark schema
 -part                   use default paritioning model if it exists
 -pkindexes              only create primary keys for schema
 -rangepart              use a range paritioning model if it exisits
 -s                      run in silent mode
 -scale           mulitiplier for default config
 -sp    the number of softparitions used. Defaults to cpu
 -tc       the number of threads(parallelism) used to
                         generate data. Defaults to cpus*2
 -ts         tablespace to create schema in
 -u            username for benchmark schema
 -v                      run in verbose mode when running from command
 -version       version of the benchmark to run

Note that not all partitioning schemes are valid in this case, and neither are all compression methods. I found the below command to create the schema the way I want it.

./oewizard -allindexes -cl -create -cs //cluster-scan/sbench -dba martin \
> -dbap notTellingYouHere -ts soe_tbs -part -p soe -scale 8 -tc 8 -u soe -v

SwingBench Wizard
Author  :        Dominic Giles
Version :

Running in Lights Out Mode using config file : oewizard.xml
Connecting to : jdbc:oracle:thin:@//cluster-scan/sbench
Starting script ../sql/soedgcreateuser.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 104 millisecond(s)
Starting run
Starting script ../sql/soedgdrop2.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 55 millisecond(s)
Starting script ../sql/soedgcreatetableshash2.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 520 millisecond(s)
Starting script ../sql/soedgviews.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 19 millisecond(s)
Starting script ../sql/soedgsqlset.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 84 millisecond(s)
Inserting data into table ADDRESSES_11499978
Inserting data into table ADDRESSES_3999993

-- more data

Completed processing table LOGON_7943271 in 0:00:25
Connection cache closed
Starting script ../sql/soedganalyzeschema2.sql
Script completed in 0 hour(s) 1 minute(s) 20 second(s) 701 millisecond(s)
Starting script ../sql/soedgconstraints2.sql
Script completed in 0 hour(s) 1 minute(s) 41 second(s) 520 millisecond(s)
Starting script ../sql/soedgindexes2.sql
Script completed in 0 hour(s) 0 minute(s) 44 second(s) 685 millisecond(s)
Starting script ../sql/soedgsequences2.sql
Script completed in 0 hour(s) 0 minute(s) 3 second(s) 53 millisecond(s)
Starting script ../sql/soedgpackage2_header.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 215 millisecond(s)
Starting script ../sql/soedgpackage2_body.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 127 millisecond(s)
Starting script ../sql/soedgsetupmetadata.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 614 millisecond(s)

|           Datagenerator Run Stats        |
Connection Time                        0:00:00.005
Data Generation Time                   0:10:38.185
DDL Creation Time                      0:03:51.630
Total Run Time                         0:14:29.823
Rows Inserted per sec                      151,713
Data Generated (MB) per sec                   12.1
Actual Rows Generated                   97,718,288

Connecting to : jdbc:oracle:thin:@//cluster-scan/sbench

Post Creation Validation Report
The creation of the schema appears to have been unsuccessful. See the following sections for further details.

Valid Objects
Valid Code :

Invalid Objects (1)
Invalid Tables :
Invalid Indexes :
Invalid Views :
Invalid Sequences :
Invalid Code : 'ORDERENTRY'

Missing Objects (0)
Missing Tables :
Missing Indexes : 
Missing Views :
Missing Sequences :
Missing Code :
Schema Created

The validation report flags the ORDERENTRY package as invalid. To compile the package, SYS needs to grant execute on DBMS_LOCK to SOE.

This results in a schema with the following number of rows:

SQL> declare
  2   v_cnt number;
  3  begin
  4   for i in (select table_name from dba_tables where owner='SOE') loop
  5    execute immediate 'select count(*) from soe.' || i.table_name into v_cnt;
  6    dbms_output.put_line(i.table_name || ' - ' || to_char(v_cnt, '999,999,999'));
  7   end loop;
  8  end;
  9  /
ORDER_ITEMS -   34,316,243
ORDERS -   11,438,320
INVENTORIES -      896,853
LOGON -   19,063,872
CUSTOMERS -    8,000,000
ADDRESSES -   12,000,000
CARD_DETAILS -   12,000,000
WAREHOUSES -        1,000

PL/SQL procedure successfully completed.

Elapsed: 00:00:02.70

SQL> compute sum of m on report
SQL> break on report

SQL> select sum(bytes)/power(1024,2) m, segment_name, segment_type
  2   from dba_segments where owner = 'SOE'
  3  group by segment_name, segment_type
  4  order by 1;

              M SEGMENT_NAME                   SEGMENT_TYPE
--------------- ------------------------------ ------------------
            .06 WAREHOUSES                     TABLE
            .06 PRODUCT_INFORMATION_PK         INDEX
            .06 ORDERENTRY_METADATA            TABLE
            .13 WHS_LOCATION_IX                INDEX
            .13 PROD_CATEGORY_IX               INDEX
            .13 WAREHOUSES_PK                  INDEX
            .13 PROD_SUPPLIER_IX               INDEX
            .13 PROD_NAME_IX                   INDEX
            .13 PRD_DESC_PK                    INDEX
            .25 PRODUCT_INFORMATION            TABLE
            .31 PRODUCT_DESCRIPTIONS           TABLE
          16.69 INV_WAREHOUSE_IX               INDEX
          16.69 INV_PRODUCT_IX                 INDEX
          18.00 INVENTORY_PK                   INDEX
          88.00 INVENTORIES                    TABLE
         169.00 CUST_ACCOUNT_MANAGER_IX        INDEX
         177.06 CUSTOMERS_PK                   INDEX
         204.06 CUST_DOB_IX                    INDEX
         242.19 ORD_SALES_REP_IX               INDEX
         253.38 ORDER_PK                       INDEX
         254.31 CUST_FUNC_LOWER_NAME_IX        INDEX
         265.06 ORD_CUSTOMER_IX                INDEX
         265.69 ADDRESS_PK                     INDEX
         265.75 CARD_DETAILS_PK                INDEX
         276.88 CARDDETAILS_CUST_IX            INDEX
         278.13 ORD_WAREHOUSE_IX               INDEX
         278.50 ADDRESS_CUST_IX                INDEX
         291.88 ORD_ORDER_DATE_IX              INDEX
         347.81 CUST_EMAIL_IX                  INDEX
         717.88 ITEM_PRODUCT_IX                INDEX
         788.94 ITEM_ORDER_IX                  INDEX
         864.00 LOGON                          TABLE PARTITION
         968.00 CARD_DETAILS                   TABLE PARTITION
       1,352.00 CUSTOMERS                      TABLE PARTITION
       1,408.00 ADDRESSES                      TABLE PARTITION
       1,544.00 ORDERS                         TABLE PARTITION
       2,368.00 ORDER_ITEMS                    TABLE PARTITION
37 rows selected.

Happy benchmarking!

truncated ddl in 12c (and

Just curious if anyone else is seeing this behaviour..

SQL> set long 500000
SQL> @pt "select * from v$sql where sql_id = '1km492z723vpu'"
SQL_TEXT                      : alter table scott.emp
SQL_FULLTEXT                  : alter table scott.emp
SQL_ID                        : 1km492z723vpu

Suffice to say, there’s a lot more to that SQL statement, but it’s lost in v$sql. Once complete, it does appear in full in v$sqlstats, but it’s a pain when you can’t see the whole statement.

Physical IO on Linux

I have posted a fair amount of material on how Oracle generates IOs, and especially large IOs, meaning more than one Oracle block, so > 8KB. This is typically what happens when the Oracle database is executing a row source which does a full segment scan. Let’s start off with a quiz: what do you think is the maximum IO size the Oracle engine is capable of requesting from the operating system (that is, the IO size as seen at the SCI (system call interface) layer)? Once you have settled on an answer, remember it, and read on!

The real intention of this blogpost is to describe what is going on in the Oracle database kernel, but also what is being done in the Linux kernel. Being a performance specialised Oracle DBA means you have to understand what the operating system does. I often see that it’s of the utmost importance to understand how an IO ends up as a request at the NAS or SAN head, so you understand what a storage admin is talking about.

Many people (including myself in the past) would state that the maximum IO size on Linux is 1MB. For the Linux 2.6 kernel and higher this statement is incorrect because there is no such thing as a single maximum IO size on Linux. There used to be one in the Linux 2.4 era, which was set with the maxphys parameter, but that time is long gone. In order to find out what it is now, let’s test and see!

First let’s get a Linux system and a big table!
The system I got is a VMWare Fusion VM, running Linux 3.8.13-44.1.5.el6uek.x86_64 (UEK3) on Oracle Linux 6u6. On top of that I am using the Oracle database and grid infrastructure version This system is using udev for providing disk access (as opposed to asmlib)
The redundancy mode of ASM is external, although for reading (what I will be covering) this doesn’t matter.

Now let’s get a normal database session, and use a combination of sql_trace with waits (10046/8) and strace to see how the Oracle database interfaces with the kernel. Please mind I’ve prepared a heap table with no indexes on it, so a count(*) on it always will result in a full table scan. Also, the buffercache is sized small enough (or the table is created large enough, it depends on how you look at it) to have the session make the decision to do a direct path read, instead of a buffered read. If you don’t know what that means: please search this blog on direct path reads, or even better, download my presentation ‘about multiblock reads’.

The direct path read decision is visible via the ‘direct path read’ wait event. If you get a full table scan operation and see ‘db file scattered read’ waits, you are doing a buffered scan.

In most cases, you will get a maximum value of 1MB if possible, which seems to support the generally assumed 1MB maximum operating system induced IO size. Why? Well, because you probably set the DB_FILE_MULTIBLOCK_READ_COUNT parameter to 128, which means you have explicitly told the Oracle process not to do IOs larger than 1MB (8192*128). In this blogpost, I explain that Oracle can request IOs bigger than 1MB.

In the blogpost series on extra huge database IOs, I show that Oracle can do huge (1MB+) IOs, but the physical request size (what Oracle actually requests at the SCI layer, visible with the pread/pwrite/io_submit/io_getevents functions) still is 1MB. This limit is imposed by the physical storage structure which the database uses with ASM, called allocation unit (often called ‘AU’). The default size of an allocation unit is 1MB. The allocation unit can be seen in both the database and the ASM instance with the following query:

SYS@+ASM AS SYSASM> select name, allocation_unit_size from v$asm_diskgroup;

NAME                           ALLOCATION_UNIT_SIZE
------------------------------ --------------------
DATA					    1048576

How about doing an unbuffered read on a filesystem? I’ve created a database on an (XFS, but this doesn’t matter AFAIK) filesystem, and tried to set DB_FILE_MULTIBLOCK_READ_COUNT to its maximum value. I’ve done this by setting DB_FILE_MULTIBLOCK_READ_COUNT to 10000 (ten thousand), and then bouncing the database to see what the number has become. In my case, the value became 4096. I think this is the limit for Oracle on Linux x86_64, but I’d love to hear if you have gotten different results:

I set 10000:

SYS@fv12102 AS SYSDBA> select name, value from v$spparameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   10000

But Oracle limits this to 4096:

SYS@fv12102 AS SYSDBA> select name, value from v$parameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   4096

Okay. Let’s start our investigation at that point: a database which is set up with DB_FILE_MULTIBLOCK_READ_COUNT set to 4096, that is 32MB (with a block size of 8KB), and a table which has extents large enough to accommodate huge (32MB) IOs.
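The arithmetic behind the 1MB and 32MB figures can be sketched in a few lines of Python. The function name is made up for illustration, and the 8KB block size is the one assumed throughout this post.

```python
# Largest multiblock read Oracle will request for a given
# DB_FILE_MULTIBLOCK_READ_COUNT, assuming an 8KB database block size.

BLOCK_SIZE = 8192  # bytes


def max_multiblock_io_bytes(mbrc, block_size=BLOCK_SIZE):
    """Upper bound on a single multiblock read request, in bytes."""
    return mbrc * block_size


for mbrc in (128, 4096):
    size = max_multiblock_io_bytes(mbrc)
    print(f"MBRC {mbrc:>4}: {size} bytes = {size // (1024 * 1024)} MB")
```

With 128 this gives 1048576 bytes (1MB), and with 4096 it gives 33554432 bytes (32MB), which is the number to look for in the strace output.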

Fire up a regular database session, and enable sql trace at level 8:

$ sqlplus ts/ts@//localhost/v11204
SQL> alter session set events 'sql_trace level 8';

Now start another session as root on the database server, and find the PID of the server process of the sqlplus process we just created above. Issue strace with verbose writing setting:

# strace -e write=all -e all -p PID
Process PID attached - interrupt to quit

Okay, we are set up and ready to go, but there is one additional thing: because of the way direct path reads work, they would probably show few waits on a system with fast IO capabilities. One way to get the waits back is to limit the IO capabilities of the process. Doing so is documented in this article.

Now issue the full table scan on a large table in sqlplus while strace is attached:

SQL> select count(*) from bigtab;

Now take a peek at the strace output!
The output first shows IOs as we expect:

io_getevents(139717184229376, 1, 128, {{0x7f126dd3d780, 0x7f126dd3d780, 33554432, 0}}, {600, 0}) = 1
times(NULL)                             = 431386800
write(7, "\n*** 2014-11-24 13:09:28.028\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  34 2d 31 31 2d 32 34 20  .*** 201 4-11-24  |
 | 00010  31 33 3a 30 39 3a 32 38  2e 30 32 38 0a           13:09:28 .028.    |
lseek(7, 0, SEEK_CUR)                   = 31181
write(7, "WAIT #139717129509840: nam='dire"..., 130) = 130
 | 00000  57 41 49 54 20 23 31 33  39 37 31 37 31 32 39 35  WAIT #13 97171295 |
 | 00010  30 39 38 34 30 3a 20 6e  61 6d 3d 27 64 69 72 65  09840: n am='dire |
 | 00020  63 74 20 70 61 74 68 20  72 65 61 64 27 20 65 6c  ct path  read' el |
 | 00030  61 3d 20 33 39 30 37 33  30 20 66 69 6c 65 20 6e  a= 39073 0 file n |
 | 00040  75 6d 62 65 72 3d 34 20  66 69 72 73 74 20 64 62  umber=4  first db |
 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |
 | 00070  37 34 20 74 69 6d 3d 31  39 32 30 30 37 31 30 31  74 tim=1 92007101 |
 | 00080  39 39                                             99                |

What is visible here, is first the reap of an I/O request (with asynchronous IO on Linux this is typically the io_getevents() call). If you take a close look at the arguments of the io_getevents() call (taken from the manpage of io_getevents):

int io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);

And then focus on the struct io_event:

struct io_event {
         __u64           data;           /* the data field from the iocb */
         __u64           obj;            /* what iocb this event came from */
         __s64           res;            /* result code for this event */
         __s64           res2;           /* secondary result */
};

The above description is taken from an annotated copy of the Linux kernel source; I use that site for navigating the Linux kernel source. What is important is that the third field (io_event.res) holds the result of the IO, which for a successful read is the number of bytes transferred, so it tells us the size of the IO request. Having learned this, now look again at the io_getevents call. The size of the IO reaped above is 33554432, which is 33554432/1024/1024=32 MB. Yes, that’s a single IO of 32MB! Also, this is consistent with the wait line a little lower:

 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |

Block count = 4096 * 8192 (block size) = 33554432
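The same cross-check can be scripted. Below is a minimal Python sketch that pulls the res field out of the io_getevents() line and the block count out of the WAIT line shown above. The regular expressions assume the strace output format of this particular capture (other strace versions may print the io_event fields differently), and the helper names are made up for illustration.

```python
import re

# Lines copied from the strace output above (the WAIT line is the text
# that strace dumped in hex for the write() to the trace file).
STRACE_LINE = ("io_getevents(139717184229376, 1, 128, "
               "{{0x7f126dd3d780, 0x7f126dd3d780, 33554432, 0}}, {600, 0}) = 1")
WAIT_LINE = ("WAIT #139717129509840: nam='direct path read' ela= 390730 "
             "file number=4 first dba=741376 block cnt=4096 obj#=20474 "
             "tim=19200710199")

BLOCK_SIZE = 8192  # bytes, the database block size used in this post

# One io_event as printed here: {data, obj, res, res2}
EVENT_RE = re.compile(r"\{(0x[0-9a-f]+), (0x[0-9a-f]+), (-?\d+), (-?\d+)\}")


def reaped_io_sizes(strace_line):
    """Return the res field (bytes transferred) of every reaped io_event."""
    return [int(m.group(3)) for m in EVENT_RE.finditer(strace_line)]


def wait_block_count(wait_line):
    """Return the 'block cnt' value of a direct path read WAIT line."""
    return int(re.search(r"block cnt=(\d+)", wait_line).group(1))


sizes = reaped_io_sizes(STRACE_LINE)
blocks = wait_block_count(WAIT_LINE)
print(sizes, blocks * BLOCK_SIZE)
```

Both numbers come out as 33554432, so the size reported by the kernel and the size reported in the Oracle wait event agree.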

I wonder what you thought was possible; the correct answer on my operating system (Linux x86_64) with Oracle is 32MB. It turned out the big IOs in the ASM case were limited by the allocation unit size of 1MB.

The next thing I’ve wondered is how this matches with the maximum IO size of the disk devices as visible by the Operating System. You can request 32MB, but a normal SCSI disk doesn’t do 32MB IOs. Of course in my case the SCSI disk really is a VMWare virtual disk device.

Let’s keep the 32MB IO in mind, now dive from the top layer, the SCI (system call interface) where an IO enters the kernel to the bottom of the kernel from an IO perspective, to the block device. The block device settings are found in /sys/block/DEVICE/queue. The maximum IO size the device is capable of is found in max_hw_sectors_kb. This is in kilobytes, and read only (can’t change hardware, right?). In my case this is:

[root@bigmachine queue]# cat max_hw_sectors_kb

My disk supports a maximum of 4M for an IO size! But this is not what is used, the actual setting is in max_sectors_kb:

[root@bigmachine queue]# cat max_sectors_kb

That’s half a megabyte!

So… we got (up to) 32MB sized IO requests coming in, and a device that is set to 512KB IOs. This means that somewhere between the SCI and the device, there is a mechanism that splits (scatters) the request into pieces of at most the device’s maximum IO size and, once those IOs are done, gathers the results back into the original request.
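To get a feel for the amount of splitting involved, here is a minimal Python sketch. It only does the division; the real number of device IOs can be higher, because the kernel also has to respect other queue limits. The helper names and the device name in the usage note are hypothetical.

```python
from pathlib import Path


def read_queue_kb(device, setting):
    """Read a kilobyte-valued queue setting (for example max_sectors_kb
    or max_hw_sectors_kb) for a block device from sysfs."""
    return int(Path(f"/sys/block/{device}/queue/{setting}").read_text())


def device_ios_needed(request_bytes, max_sectors_kb):
    """Minimum number of device IOs needed for one request,
    using ceiling division on the device's maximum IO size."""
    max_bytes = max_sectors_kb * 1024
    return -(-request_bytes // max_bytes)


# One 32MB request against a device limited to 512KB per IO.
print(device_ios_needed(32 * 1024 * 1024, 512))  # 64
```

On a Linux machine the limit can be read live with something like read_queue_kb("sda", "max_sectors_kb"), where "sda" is only an example device name. A single 32MB read therefore turns into at least 64 separate 512KB requests to the device.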

There are a couple of layers in the Linux kernel through which the call travels (including common functions):

-SCI/system call interface: system_call, sys_io_submit…. (io_submit, do_io_submit, io_submit_one; these seem to be in the VFS layer)
-VFS/virtual filesystem: aio_run_iocb, do_aio_read, xfs_file_read_iter, generic_file_read_iter, xfs_vm_direct_IO, bio_*, kiocb_batch_refill
-Block layer: blk_finish_plug, blk_flush_plug_list, queue_unplugged, __blk_run_queue, blk_run_queue
-SCSI layer: scsi_*
-Device driver: mptspi_qcmd, mptscsih_qcmd, mpt_put_msg_frame

(note: there seems to be consensus that the layers mentioned above exist, although different sources use different wording and different numbers of layers. Also, there doesn’t seem to be a very clear description of what is done by which layer, and what typically defines a kernel layer. For some functions it is clear they belong to a certain layer (for example aio_run_iocb in Linux/fs/aio.c, blk_finish_plug in Linux/block/blk-core.c, etc.); for some layers, like the SCI layer, there doesn’t seem to be a clear layer definition when looking at where the function is defined. Also please mind the SCSI layer is implemented as a driver, just like the actual device driver for the hardware. This is very understandable, but makes it a bit harder to see it in a layered way)

System Call Interface (SCI)
The request enters kernel space via the SCI. The function of the SCI is to elevate a process to system priority to perform a kernel mode task, like (but not limited to) doing I/O. The system call implementation on Linux makes use of a wrapper function in glibc, which executes the system call on behalf of the calling program. The reason for mentioning this is that sometimes the glibc wrapper “hides” the real system call, for example when calling the semtimedop() function:

(gdb) break semtimedop
Breakpoint 1 at 0x3bb38eb090: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) c

Breakpoint 1, semtimedop () at ../sysdeps/unix/syscall-template.S:82

Above is a gdb (GNU debugger) session which attaches to an Oracle background process, which I know is sleeping in the system call semtimedop() when idle. A breakpoint is set on the semtimedop function, and the execution of the attached process is resumed. It then breaks on the function, showing the source location at which the break happened. Instead of showing the actual semtimedop function, it shows the pseudo function in glibc which wraps this system call. This hides the arguments of the semtimedop() call. My current workaround is to read the CPU registers which “carry” the arguments (RDI, RSI, RDX, RCX, R8, R9 for the first 6 arguments in most cases).

Virtual File System (VFS)
The next layer is the virtual filesystem. Here we see functions specific to asynchronous IO or synchronous IO, and to doing direct IO or not, and also actual filesystem specific functions (in my case xfs; when ext4 is used, you will see specific functions for that. I highly recommend XFS!). This layer also uses a structure called ‘request_queue’, which keeps track of the actual IO requests for a block device. Each individual request is a struct ‘request’, which contains one or more structs called ‘bio’ holding a description of the request; a ‘bio’ points to a structure called ‘bio_vec’, which in turn points to the pages that store the disk request contents. This is all set up and created in kernel memory by the user process in system mode. It’s my assumption that the properties of the disk device (=maximum advertised IO size) are taken into account when the VFS filesystem implementation creates requests and all necessary structs and memory areas. Please mind it’s important that enough memory is available to set up the necessary structures, and enough CPU to make this happen. Also some of the crucial structures for doing IO (request, bio, bio_vec) seem to be set up in this layer. An IO can’t be done without a memory area for the IO request to hold the data for sending it to the device (i.e. a write), or a memory area for the IO request to hold the data which is fetched from the device (i.e. a read).

The funny thing is that when you use ASM (the simple version 11.2 ASM with a local ASM instance and local disk devices), you will still see some functions of the VFS layer, because you use a disk device which is opened using the local filesystem. Examples of these functions are: aio_run_iocb, do_aio_read.

Block Layer
The next layer is the block layer. Here the request queue is handled, and I/O scheduling is done. Oracle advises the deadline scheduler in all cases. The scheduler works by plugging a request queue, much like a plug in your bathtub, letting the requests enter the queue. Having multiple requests in a queue means it can be optimised by reordering the requests, and merging adjacent requests up to the device’s advertised maximum IO size. Once a request’s timeout expires, or the requesting process finishes submitting IO, the queue is unplugged.

SCSI layer
The SCSI layer is responsible for communicating with SCSI devices to do IOs.

Device driver
The device driver layer is the layer that truly physically communicates with a device, and implements the device specific communication. In my case the functions start with mpt, which is the driver for LSI PCI adapters.

To see how IO flows through the block layer, there is a tool called blktrace. Actually this is a mini-suite of tools consisting of blktrace (tracing the IO requests through the block layer), blkparse (parsing the output of blktrace to make it human readable), btrace (a script that combines blktrace and blkparse), and btt (a blktrace output post processing tool), among others.

In order to use blktrace, the debug file system of the Linux kernel needs to be mounted. Here is how that is done:

# mount -t debugfs debugfs /sys/kernel/debug

If the kernel debugfs is not mounted, you get the following message:

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
Invalid debug path /sys/kernel/debug: 0/Success

I use blktrace in this article for looking at the IO requests to understand what is going on. The workflow for this use of blktrace is:
– create a trace file of the block flow using blktrace
– make the trace file human readable via blkparse or analyse via btt (block trace times)

Actually, you can parse the output of blktrace directly via blkparse using ‘blktrace -d DEVICE -o - | blkparse -i -’. To make that even simpler, the script ‘btrace’ exists to do exactly that.

Here’s what that looks like (depending on the number of processes using the device, the output can be huge; this is only a snippet):

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
  8,16   0       57     0.260669503  2421  Q  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       58     0.260672502  2421  G  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       59     0.260673231  2421  P   N [asm_gmon_+asm]
  8,16   0       60     0.260674895  2421  I  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       61     0.260675745  2421  U   N [asm_gmon_+asm] 1
  8,16   0       62     0.260677119  2421  D  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       63     0.260882884     0  C  WS 4088 + 8 [0]

What is shown here, is the typical flow of an IO in the block layer:
Q – Queue. A request starts off sending a notification on the intent to queue at the given location.
G – Get request. A struct request is allocated.
P – Plug. When the block device queue is empty, the queue is plugged in order to receive further IOs and have the ability to optimise (merge and/or reorder) them before the data is sent to the device.
I – Insert. A request is sent to the IO scheduler for addition to the internal queue and later service by the driver. The request is fully allocated at this time.
U – Unplug. The start of sending requests to the driver.
D – Driver. A request has been sent to the driver and removed from the queue.
C – Complete. A previously issued request to the driver has been completed.

The main point here is that you can truly see how the IO requests flow through the block layer and are issued to the storage device. In other words, you can see how the block layer receives the IOs, and exactly what is submitted to the driver as a request for the physical storage layer.
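If you want to post-process this output yourself, the lines are easy to pick apart. Below is a minimal Python sketch that assumes the default blkparse line layout seen in the snippet above (device, CPU, sequence, timestamp, PID, action, RWBS, then optionally "sector + blocks"); the names ACTIONS and parse_btrace_line are made up for illustration.

```python
# Action letters as described in the list above.
ACTIONS = {
    "Q": "queue",
    "G": "get request",
    "P": "plug",
    "I": "insert",
    "U": "unplug",
    "D": "driver",
    "C": "complete",
}


def parse_btrace_line(line):
    """Parse one line of default btrace/blkparse output into a dict."""
    fields = line.split()
    record = {
        "device": fields[0],
        "cpu": int(fields[1]),
        "sequence": int(fields[2]),
        "timestamp": fields[3],  # kept as text to preserve all nine decimals
        "pid": int(fields[4]),
        "action": fields[5],
        "rwbs": fields[6],
        "sector": None,
        "blocks": None,
    }
    # Lines that carry an IO look like "... WS 4088 + 8 [process]";
    # plug and unplug lines have no sector information.
    if len(fields) >= 10 and fields[8] == "+":
        record["sector"] = int(fields[7])
        record["blocks"] = int(fields[9])
    return record


rec = parse_btrace_line("  8,16   0       57     0.260669503  2421  Q  WS 4088 + 8 [asm_gmon_+asm]")
print(ACTIONS[rec["action"]], rec["sector"], rec["blocks"])  # queue 4088 8
```

The block count is in 512-byte sectors, so "4088 + 8" describes a 4KB IO starting at sector 4088.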

This is a microscopic view of the disk IOs. In most cases, when you want to gain information on block layer IO processing, another view on it is provided by processing blktrace output with btt. This is an example output of btt:

First capture IO events using blktrace:

[root@bigmachine ~]# blktrace -w 60 -d /dev/oracleasm/disk1 -o - | blkparse -d sdb.blkparse -i -

In this example I captured IOs for 60 seconds. You can leave out ‘-w 60’ and press interrupt (ctrl-c) when you deem enough IO has been recorded. This produces a binary file ‘sdb.blkparse’, which can be used as input for btt.

This is the first part, the flow through the block layer until IO completion:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.239795347   3.002829973         238
Q2G               0.000000001   0.159337842   3.011192142         264
G2I               0.000000679   0.000001724   0.000011618         264
I2D               0.000000764   0.000007633   0.000153436         264
D2C               0.000000001   0.103328167   3.012509148         233
Q2C               0.000000001   0.270961298   3.012516496         233

Note: time is in seconds.
Q2Q – Time between IO requests.
Q2G – Time it takes for a request struct to be allocated.
G2I – Time it takes for the request to be inserted in the device’s queue.
I2D – Time spent in the device queue waiting to be issued to the driver.
D2C – Time spent between issuing to the driver and completion of the request. This includes controller and storage. This is the same figure as the ‘svctm’ column with iostat -x.
Q2C – Total time spent in block layer and physical IO. This is the same figure as the ‘await’ column with iostat -x.
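To make these definitions concrete, the following Python sketch applies them to the single request shown in the btrace snippet earlier (the Q, G, I, D and C lines for sector 4088). The helper names to_ns and phase_times_ns are hypothetical; the timestamps are copied from that snippet and are in seconds with nanosecond resolution.

```python
def to_ns(timestamp):
    """Convert a 'seconds.nanoseconds' string to integer nanoseconds."""
    seconds, fraction = timestamp.split(".")
    return int(seconds) * 1_000_000_000 + int(fraction.ljust(9, "0"))


def phase_times_ns(events):
    """events maps an action letter to its timestamp for ONE request."""
    t = {action: to_ns(ts) for action, ts in events.items()}
    return {
        "Q2G": t["G"] - t["Q"],
        "G2I": t["I"] - t["G"],
        "I2D": t["D"] - t["I"],
        "D2C": t["C"] - t["D"],
        "Q2C": t["C"] - t["Q"],
    }


sample = {
    "Q": "0.260669503",
    "G": "0.260672502",
    "I": "0.260674895",
    "D": "0.260677119",
    "C": "0.260882884",
}
times = phase_times_ns(sample)
print(times["D2C"], times["Q2C"])  # 205765 213381
```

For this one request almost all of the roughly 213 microseconds is D2C, the time between handing the request to the driver and its completion.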

The second part is the device overhead section:

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%
---------- | --------- --------- --------- --------- ---------
   Overall |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%

This is partly the same as the IO flow table above, expressed as a percentage of where the total time of the IO is spent.
Q2G – Request struct allocation.
G2I – Insertion in the device queue.
Q2M – Total time until merge.
I2D – Time spent in the queue until it was dispatched to the driver.
D2C – Time spent on doing the IO after submitting the request to the driver.


Line Numbers

One of the presentations I went to at the DOAG conference earlier this month was called “PL/SQL Tuning, finding the perf. bottleneck with hierarchical profiler” by Radu Parvu from Finland. If you do a lot of PL/SQL programming and haven’t noticed the dbms_hprof package yet, make sure you take a good look at it.

A peripheral question that came up at the end of the session asked about problems with line numbers in pl/sql procedures; why, when you get a run-time error, does the reported line number sometimes look wrong, and how do you find the right line. I can answer (or give at least one reason for) the first part, but not the second part; Julian Dontcheff had an answer for the second bit, but unfortunately I failed to take a note of it.

Here’s the SQL to create, run and list a very simple (and silly) procedure.

define m_value = 3
set timing on

create or replace procedure silly
as
        m_n number;
begin
        for i in 1..1000000 loop
                m_n := exp(sqrt(ln(&m_value)));
        end loop;
end;
/

execute silly

list

Here’s the output I got from running this, and the thing I want you to note is the time it takes to run, and the line number reported for the assignment:

Procedure created.

Elapsed: 00:00:00.01

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.00
  1  create or replace procedure silly
  2  as
  3          m_n number;
  4  begin
  5          for i in 1..1000000 loop
  6                  m_n := exp(sqrt(ln(&m_value)));
  7          end loop;
  8* end;

It seems to take my instance 0.00 seconds to perform 1,000,000 evaluations of the exponent of the square root of the natural logarithm of 3 at line 6 of the code. But let’s make two changes; first, let’s try that with the value -1 (which is a little careless if you know your logarithms).

Procedure created.

Elapsed: 00:00:00.02
BEGIN silly; END;

ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error
ORA-06512: at "TEST_USER.SILLY", line 5
ORA-06512: at line 1

Notice that the 6502 error is reported at line 5, not line 6.

Now let’s keep the value -1, but start the script with the command: alter session set plsql_optimize_level = 0; (the default level is 2, the range is 0 to 3):

Session altered.

Elapsed: 00:00:00.00

Procedure created.

Elapsed: 00:00:00.01
BEGIN silly; END;

ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error
ORA-06512: at "TEST_USER.SILLY", line 6
ORA-06512: at line 1

Reducing the pl/sql optimisation level to zero (or one) results in the error message reporting the problem at line 6 – which matches our original code. So let’s check the effect of running the code at level zero with a valid number for the input.

Session altered.

Elapsed: 00:00:00.00

Procedure created.

Elapsed: 00:00:00.02

PL/SQL procedure successfully completed.

Elapsed: 00:00:24.56

The run time for our pl/sql call has gone up from 0.00 seconds to 24.56 seconds.

Like all good compilers the pl/sql compiler had recognised (at level 2) that we were assigning a constant inside a loop, so it had (effectively) rewritten our code to move the assignment outside the loop, effectively swapping lines 6 and 5, storing the change in the “object” level code, but not in the database copy of the source. The benefit we get (from the default settings) is a saving of 25 seconds of CPU, the (small) penalty we pay is that the lines reported for run-time errors aren’t always going to identify the text that caused the problem.
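The transformation described here is usually called loop-invariant code motion. As an analogy only (this is Python, not what the pl/sql compiler actually generates, and the function names are made up), the two forms of the loop look like this:

```python
import math


def unoptimised(value, iterations):
    """Recompute the constant expression on every pass through the loop."""
    result = None
    for _ in range(iterations):
        result = math.exp(math.sqrt(math.log(value)))
    return result


def hoisted(value, iterations):
    """Evaluate the loop-invariant expression once, before the loop starts."""
    constant = math.exp(math.sqrt(math.log(value)))
    result = None
    for _ in range(iterations):
        result = constant
    return result
```

Both functions return the same value for a valid input, and both fail for -1; the difference is that in the hoisted form the failing expression no longer sits on the line inside the loop, which is the same kind of line-number shift seen in the error stack above.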

HOWTO: Create a Structured XMLIndex, using the In-Memory Column Store

In Oracle database version, Oracle introduced the Oracle In-Memory Database option. It is possible…

Agents Management Via EM12c Release 4 Console

I’m going to take a break from the OMS/OMR Performance pages and jump over to the Agents page in release  You can access this page from the Setup menu drop down in EMCC:


Main Agent Management Page

The Agents Management page will default to show you all agents and looks very similar to the All Targets page when filtered by a target type, but this page focuses on Agents, something the All Targets page just doesn’t do well no matter how you filter.


A few things are going to stand out when you come to the Agents Management page-

  • The names of all the agents deployed.
  • The status of the agents, (which often is the first place you should look when you have an issue with a target.)
  • Counts of how many targets it monitors and if there are any broken targets.  (I haven’t taken into consideration how this count may be affected by composite targets, still looking into this…)
  • Status, incident and when the last successful upload from the agent occurred.

At the top of the graph are actions that can be used to manage the agents from the list.

  • Block and Unblock
  • Start up and Shut down
  • Restart
  • Secure and Unsecure
  • Properties
  • Query by Example
  • Detach

Depending on the current status of the agent, different actions will be available.  To choose an agent to perform an action on, you must move your cursor to the box to the very left of the agent or to a spot on the agent line that does not have a link.  If a link exists, it’s going to take you to the corresponding page, (i.e. Agent Home page, Incident Manager, etc.)


One of the cool things is you can perform actions on multiple agents by highlighting more than one agent in the list like you see above.  There’s a couple of ways to do this-

  1. Bring your cursor to the left side box of the agent name and then left click, hold and move your cursor down to the last agent you wish to perform an action on.
  2. The standard known way of choosing multiple lines via office programs, hold down the ctrl key and left click on the individual agents or…
  3. Left click on the first, hold down the shift key and then click on the last one to select a succession of agents from the list.

Once you have chosen the agents you are interested in performing an action on, click on what you wish to do from the menu above the graph and follow through with any required information to complete the action.

Agents with a Down Status

Being able to quickly see which agents are down is valuable, as I’ve seen way too often that administrators have become accustomed to not having a 100% green status and give up on correcting the issue.  Focusing on your agents first is a solid path to resolving this.


You can quickly choose an agent (and I recommend doing this one at a time so you can see if anything is amiss, troubleshoot and proceed) and attempt to start it.  Once the agent is started, expect some “noise” in the form of EM notifications as Cloud Control gets a handle on what it’s been missing while the agent’s been down.

Agent Unreachable

An unreachable status has always bothered me-  It’s not just down, the OMS can’t reach it due to some issue and you should expect to have some type of troubleshooting to correct the problem.  Luckily, you can first hover your cursor over the status and the Agent Management page will show you the high level reason behind the agent being in an unreachable status.  The one we’ve highlighted below shows that it can’t write to the file system, so we are already way ahead in what will need to be corrected for this agent to resolve to an up, (green) status.


Agents that are blocked are also shown clearly in the status section, so keep in mind that these are “double” and also shown in their own filtering in the Blocked view, which we discuss more in depth below.  The rest of the agents you will need to double click on the agent name and investigate the logs to see what has caused the issue.  If data isn’t up to date enough to explain the issue, then the best course of action is to follow my instructions in this post to correct the problems.

Agent Blocked


Blocked agents are quite straight forward and if you hover your cursor over the Blocked Reason, the full description and how to correct it will even be shown.  I had to laugh when it said to have the EM administrator do this, but for many, the separation of duties is clear and this message will be very helpful.  For those of you that do need to perform this yourself, you can resync the agent by following this blog post from Adeesh Fulay.

Misconfigured Agents

Now I don’t have any misconfigured agents to show you in our environment:


What I can tell you is that the Misconfigured status, like the Agent Unreachable, will offer you high level information if you hover over the status section for the agent in question.  As this is a misconfigured agent, the best advice I will offer is to go to the host where the agent resides and attempt the following command:

emctl status agent -details

This should give you information on why or what it believes is the OMS upload URL/port, where the agent believes the AGENT_HOME is, etc.  It’s a good place to start, but gather information from the emctl.log, the emctl*.out and other logs to find out what is causing the issue.

Advanced Performance Data for Agents

The default view is not the only data that can be displayed.  Note that there are a number of columns that could be added or removed from the graph display, allowing you to answer any number of questions regarding EM12c agents.


As you can see in the next screenshot, you can even change the Agents Management page to view performance data, which can help you if you are experiencing backlog or agent resource usage issues.  I’ve changed my view to display the following:

  • Agent Name, Incident information
  • Backlog
  • CPU Usage
  • Restarts


I can now sort by backlog data and display which agents have a backlog of collections to upload, and view the higher CPU usage (but as you can see by this graph, it’s pretty light considering… :)). This gives me an opportunity to optimize these agents’ collections, upload intervals and the data they are collecting to give them better performance.

Locating All Targets for an Agent

Last but not least, at the bottom left of the Agents Management page you’ll see a link that will take you to the search page to view which agents support which targets in Cloud Control.


By typing in a filter for a target type, (or you can choose to leave this blank) you can then type in a name of an agent or partial agent and all targets and agents with that naming convention will be listed.  You will see each target and the agent that supports it.

As unimportant as some may think the Agents Management page is to the Manage Cloud Control offering in the console, I know how often the agent is the first place one should look for a solution to upload issues, incidents and collection errors.  I hope administrators will come to appreciate this valuable console page and how much time it should save them when issues arise.








Copyright © DBA Kevlar [Agents Management Via EM12c Release 4 Console], All Right Reserved. 2014.

Lunchtime quiz

There was a question on OTN a few days ago asking the following question:

Here’s a query that ran okay on 11g, but crashed with Oracle error “ORA-01843: not a valid month” after upgrade to 12c; why ?

The generically correct answer, of course, is that the OP had been lucky (or unlucky, depending on your point of view) on 11g – and I’ll explain that answer in another blog posting.

That isn’t the point of this posting, though. This posting is a test of observation and deduction. One of the respondents in the thread had conveniently supplied a little bit of SQL that I copied and fiddled about with to demonstrate a point regarding CPU costing, but as I did so I thought I’d show you the following and ask a simple question.

drop table T;

Create Table T
as
with
periods as (
                  Select 'January' period, 1 cal  From Dual
        union all Select 'February' period , 2 cal From Dual
        union all Select 'March' period , 3 cal From Dual
        union all Select 'April' period , 4 cal From Dual
        union all Select 'May'  period, 5 cal From Dual
        union all Select 'June' period, 6 cal From Dual
        union all Select 'July' period, 7 cal From Dual
        union all Select 'August' period, 8 cal From Dual
        union all Select 'September' period, 9 cal  From Dual
        union all Select 'October' period, 10 cal From Dual
        union all Select 'November' period, 11 cal From Dual
        Union All Select 'December' Period, 12 Cal From Dual
        Union All Select '13 Series' Period, Null Cal  From Dual
)
Select  Period,Cal
from periods;

prompt  ==================================
prompt  When we invoke below SQL it works.
prompt  ==================================

set autotrace on explain

select *
from    (
        select
                T.*,
                to_date(Period || ', ' || 2014,'Month, YYYY') col1 ,
                to_date('November, 2014','Month, YYYY') col2
        From  T
        Where  Cal > 0
        )
;

prompt  ================================================
prompt  But when we add comparison operations , it fails
prompt  ================================================

select *
from    (
        select
                T.*,
                to_date(Period || ', ' || 2014,'Month, YYYY')   col1,
                to_date('November, 2014','Month, YYYY')         col2
        From  T
        Where  Cal > 0
        )
where
        col1 >= col2
;

set autotrace off

All I’ve done is create a table then run and generate the execution plans for two queries – with a comment that if you try to run one query it will succeed but if you try to run the other it will fail (and raise ORA-01843). As far as the original supplier was concerned, both queries succeeded in 11g and the failure of the second one appeared only in 12c. In fact, for reasons that I won’t discuss here, it is POSSIBLE for the failure to appear in 11g as well, though not necessarily with this exact data set.

Here’s the COMPLETE output I got from running the code above on an instance:

Table dropped.

Table created.

When we invoke below SQL it works.

PERIOD           CAL COL1      COL2
--------- ---------- --------- ---------
January            1 01-JAN-14 01-NOV-14
February           2 01-FEB-14 01-NOV-14
March              3 01-MAR-14 01-NOV-14
April              4 01-APR-14 01-NOV-14
May                5 01-MAY-14 01-NOV-14
June               6 01-JUN-14 01-NOV-14
July               7 01-JUL-14 01-NOV-14
August             8 01-AUG-14 01-NOV-14
September          9 01-SEP-14 01-NOV-14
October           10 01-OCT-14 01-NOV-14
November          11 01-NOV-14 01-NOV-14
December          12 01-DEC-14 01-NOV-14

12 rows selected.

Execution Plan
Plan hash value: 1601196873

| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT  |      |    12 |   228 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T    |    12 |   228 |     2   (0)| 00:00:01 |

Predicate Information (identified by operation id):

   1 - filter("CAL">0)

   - dynamic sampling used for this statement (level=2)

But when we add comparison operations , it fails

PERIOD           CAL COL1      COL2
--------- ---------- --------- ---------
November          11 01-NOV-14 01-NOV-14
December          12 01-DEC-14 01-NOV-14

2 rows selected.

Execution Plan
Plan hash value: 1601196873

| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT  |      |     1 |    19 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T    |     1 |    19 |     2   (0)| 00:00:01 |

Predicate Information (identified by operation id):

   1 - filter("CAL">0 AND TO_DATE("PERIOD"||', '||'2014','Month,
              YYYY')>=TO_DATE(' 2014-11-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

So this is the question. What’s the anomaly in this output ?

Bonus question: What’s the explanation for the anomaly ?


If I had asked why the query might, or might not, crash – the answer would be about the order of predicate evaluation, and simply collecting stats (or not) might have made a difference. Ever since “system stats” and “CPU costing” appeared the optimizer has been able to change the order in which it applies filter predicates to a table (there’s a pdf of an article of mine from Oracle magazine in the 9i / 10g timeline linked at this URL). In this case, applying the “cal > 0” predicate first luckily eliminates the row that would make the second predicate raise an error. Since the effect is driven by the optimizer’s stats this type of failure could occur ANY TIME you have a predicate that requires coercion between types to take place – which is one reason why you see the injunctions to use the correct data types; and why, if you need coercion to work around incorrect data types, you have to consider writing your own functions to trap and resolve the necessary errors raised by Oracle’s conversion mechanisms.
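As an illustration of the “write your own function to trap the error” idea, here is a minimal sketch in Python rather than PL/SQL; in the database the equivalent would be a stored function with an exception handler. The name safe_month_to_date is made up, and the sketch assumes English month names (the default C locale).

```python
from datetime import datetime


def safe_month_to_date(period, year):
    """Return the first day of the named month, or None if `period` is not a month.

    Mirrors the 'Month, YYYY' conversion from the example, but traps the
    conversion error instead of letting it escape to the caller.
    """
    try:
        return datetime.strptime(f"{period}, {year}", "%B, %Y").date()
    except ValueError:
        return None


print(safe_month_to_date("November", 2014))   # 2014-11-01
print(safe_month_to_date("13 Series", 2014))  # None
```

With the failure turned into a null-like result, the outcome no longer depends on which predicate happens to be evaluated first.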

For a quick sketch of the optimizer strategy, the arithmetic is roughly:  predicate A costs c1 and predicate B costs c2; if I apply predicate A to every row I have to apply predicate B to only N surviving rows; if I apply predicate B to every row I have to apply predicate A to M surviving rows; which is smaller: (input_rows * c1 + N * c2) or (input_rows * c2 + M * c1).
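The same arithmetic as a small Python sketch. The function name cheaper_order and all the numbers in the example call are invented for illustration; they are not taken from an optimizer trace.

```python
def cheaper_order(input_rows, cost_a, cost_b, survivors_a, survivors_b):
    """Compare the two evaluation orders for predicates A and B.

    survivors_a is N, the rows left after applying A to every row;
    survivors_b is M, the rows left after applying B to every row.
    Returns the cheaper order together with its total cost.
    """
    a_first = input_rows * cost_a + survivors_a * cost_b
    b_first = input_rows * cost_b + survivors_b * cost_a
    if a_first <= b_first:
        return "A, B", a_first
    return "B, A", b_first


# Hypothetical figures: 13 rows, a cheap numeric test (A) that keeps 12 rows,
# and an expensive date conversion (B) that keeps 2 rows.
print(cheaper_order(13, 1, 20, 12, 2))  # ('A, B', 253)
```

With different statistics (say the optimizer believed A kept all 13 rows and B kept only 1) the comparison can tip the other way, which is why gathering stats can change whether the problem row is ever passed to the conversion.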

The answer to the question I actually asked is this, though: I stressed the fact that this was the COMPLETE output because, as Narenda highlighted in comment 7 below – the first query shows a note about dynamic sampling and the second query does not. This is a little surprising; we don’t have stats on the table, and the two queries are different so we have to optimize both of them.  In 12c, of course, it’s possible that the optimizer may have done something clever with statistics feedback (formerly cardinality feedback) and created an SQL directive – but even then we should have seen a note about that.

For the bonus question: given the second output doesn’t report dynamic sampling we should be curious why not – did the optimizer simply decide not to try, did it try then decide not to use the results for some reason, or is there some other reason.  The obvious next step is to look at the 10053 (optimizer) trace – where you find that the optimizer DID do dynamic sampling or rather, it tried to do dynamic sampling but the query generated to take the sample failed with Oracle error ORA-01843, as suggested by Chinar Aliyev in comment 9  and expanded by Mohamed Houri in comment 11.

The irony of the sampling problem (hinted by Chinar Aliyev in comment 10) is that you could be in a position where you have a large table and Oracle picks a small sample which happens to miss all of the problem rows and so returns a sample that persuades the optimizer to pick an execution plan that is bound to find a problem row; alternatively the SQL used to generate the sample might apply the predicates in an order that manages to eliminate the problem rows, while the final plan derived after sampling persuades the optimizer to use the predicates in the order B, A.