
Oakies Blog Aggregator

PK Problem

I have been saying for a very long time (probably since some time in the last century) that if you want to add a primary key (or unique) constraint to a table where there might be some duplicate data then one of the best strategies for doing so might be to create a non-unique index (with the online option), then add the constraint in the state enable novalidate, and then validate the constraint. For example:

create table t1
as
select
	*
from
	all_objects
;

create index t1_i1 on t1(object_id) online;
-- collect stats
alter table t1 add constraint t1_pk primary key(object_id) enable novalidate;
alter table t1 modify constraint t1_pk validate;

The theory was simple – the online build would require only a brief period of locking, the addition of the constraint would require only a brief period of locking, after which all new (or modified) data would be checked for uniqueness; and then the validation would simply walk the index in order checking for duplicates. If you wanted you could always choose the option to list the rowids of duplicates into an exceptions table.
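For the record, here’s a sketch of how that exceptions option might be used. It assumes you have created the standard EXCEPTIONS table (for example by running $ORACLE_HOME/rdbms/admin/utlexcpt.sql), and uses the table and constraint names from the example above; if duplicates exist the statement fails, but the offending rowids are left behind in the exceptions table:

-- sketch only: capture the duplicate rowids while validating
-- assumes the standard EXCEPTIONS table exists (utlexcpt.sql)
alter table t1
	enable validate constraint t1_pk
	exceptions into exceptions
;

-- the statement raises an error if there are duplicates, but the
-- rowids (plus owner, table and constraint name) are left here:
select * from exceptions;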

In the last 48 hours it has been brought to my attention that that last assumption is, and perhaps has always been, wrong. I’ve just been running some tests of the mechanism and checked what Oracle does about the validation step all the way back to 8.1.7.4.

Depending on whether the constraint is single column or multi-column, depending on whether you’re adding a unique or primary key constraint, depending on whether the column(s) are pre-declared as NOT NULL, and depending on the exact version of Oracle, there are several slightly different scenarios to review if you want to do an exhaustive analysis; but here’s the SQL, with execution plan, that Oracle ran for the validate step in the example above on 11.2.0.3:

select /*+ all_rows ordered */
	A.rowid, :1, :2, :3
from
	TEST_USER.T1 A,
	(
	select /*+ all_rows */
		OBJECT_ID
	from
		TEST_USER.T1 A
	where	(OBJECT_ID is not null)
	group by
		OBJECT_ID
	having
		count(1) > 1
	) B
where
	(A.OBJECT_ID = B.OBJECT_ID)
union all
select
	/*+ all_rows ordered */
	A.rowid, :1, :2, :3
from
	TEST_USER.T1 A
where
	(OBJECT_ID is null)
;

-------------------------------------------------------------------------------------------
| Id  | Operation                 | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |       |  2706 | 81167 |       |   149   (3)| 00:00:02 |
|   1 |  UNION-ALL                |       |       |       |       |            |          |
|*  2 |   HASH JOIN               |       |  2705 | 81150 |  1536K|   149   (3)| 00:00:02 |
|   3 |    INDEX FAST FULL SCAN   | T1_I1 | 54081 |   897K|       |    34   (0)| 00:00:01 |
|   4 |    VIEW                   |       |  2705 | 35165 |       |    36   (6)| 00:00:01 |
|*  5 |     FILTER                |       |       |       |       |            |          |
|   6 |      HASH GROUP BY        |       |  2705 | 13525 |       |    36   (6)| 00:00:01 |
|   7 |       INDEX FAST FULL SCAN| T1_I1 | 54081 |   264K|       |    34   (0)| 00:00:01 |
|*  8 |   FILTER                  |       |       |       |       |            |          |
|   9 |    INDEX FAST FULL SCAN   | T1_I1 | 54081 |   897K|       |    34   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("A"."OBJECT_ID"="B"."OBJECT_ID")
   5 - filter(COUNT(*)>1)
   8 - filter(NULL IS NOT NULL)

A couple of background points to note:

  • This execution plan came from “explain plan” and a call to dbms_xplan.display, but it does match the run-time plan as reported in the trace file.
  • The “intent” of the code is to identify all the rowids for rows where the same object_id has been used more than once (if there are none then the PK is valid). Because of the standard optimizer arithmetic for “having count(*) > {const}”, the optimizer assumes that duplicates account for 5% of the data.
  • The second query block in the union all allows Oracle to capture rows where the primary key can’t be valid because the column is null – but this is a redundant check in my example because the column is defined to be not null; if you check line 8 of the execution plan you can see that there is a FILTER operation that means that part of the query will be short-circuited and do no work.
  • The three bind variables in the main select hold the table owner, table name, and constraint name – as required by the exceptions table (not that we’ve asked Oracle to capture exceptions in this example).

I think we can easily explain the redundancy in the code by assuming that the developer who wrote it decided to maximise code reuse; but why is there an ordered hint in it that (surely) maximises the inefficiency of the code ? The resulting execution plan for any large index is almost fixed by that hint. Oracle is going to build an in-memory (if it fits) hash table of every entry in the index, then do a massive aggregation job to generate the data to probe the hash table. As a consequence, it’s quite likely that we will end up doing the equivalent of dumping the whole index into the temporary tablespace twice – once for the hash table, once to do the aggregation, and this might be something we’d like to avoid.

But if we’re hoping to enable a primary key constraint aren’t we expecting a very small volume of data (i.e. small number of duplicates) in the aggregate, and wouldn’t it make sense to do a nested loop driven by that small number, rather than a hash join; and since we have a suitable index in place could we generate the aggregate without doing any sorting ?  In other words, can we avoid doing any I/O to the temporary tablespace and produce a (pseudo-)plan like:

NESTED LOOP
    Aggregated object_id having count(*) > 1
    Index range scan t1_i1

Since the code is embedded in a package we can’t change it but, for very special (extremely large) cases, we might decide that we need to change the execution plan for the query – so we might look at the option for creating an SQL Plan Baseline for a modified version of the query, and then attaching that plan to this query. Let’s work on that idea in steps – first we’ll just try aggregating the object ids by walking the index in order:

select * from (
	select /*+ index(t1(object_id)) */
		object_id
	from
		t1
	where
		object_id is not null
	group by
		object_id
	having
		count(1) > 1
)
;
--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |  2705 | 35165 |   121   (0)| 00:00:02 |
|   1 |  VIEW                  |       |  2705 | 35165 |   121   (0)| 00:00:02 |
|*  2 |   FILTER               |       |       |       |            |          |
|   3 |    SORT GROUP BY NOSORT|       |  2705 | 13525 |   121   (0)| 00:00:02 |
|   4 |     INDEX FULL SCAN    | T1_I1 | 54081 |   264K|   121   (0)| 00:00:02 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(COUNT(*)>1)

In this plan we walk the index in order, keeping a running count for each object_id, discarding any ids where the count is just 1. Note particularly that the aggregation step is “sort group by nosort” – we DON’T sort the data. In the absence of the index() hint my system chose to do an “index fast full scan” with “hash group by” – it’s up to you to decide which of those two plans is likely to be the more efficient way to generate the small number of exceptions in your case.

Once we have this starting step we can try to use it as the first step of a nested loop (since we believe it will return a handful of rows rather than the 5% indicated by the optimizer).

select
	--+ leading(b a) use_nl(a) index(a(object_id)) no_merge(b)
	a.rowid
from
	(
	select /*+ index(t1 (object_id)) */
		object_id
	from
		t1
	where
		object_id is not null
	group by
		object_id
	having
		count(1) > 1
	)
	b ,
	t1	a
where
	a.object_id = b.object_id
;

---------------------------------------------------------------------------------
| Id  | Operation               | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |       |  2705 | 81150 |   121   (0)| 00:00:02 |
|   1 |  NESTED LOOPS           |       |  2705 | 81150 |   121   (0)| 00:00:02 |
|   2 |   VIEW                  |       |  2705 | 35165 |   121   (0)| 00:00:02 |
|*  3 |    FILTER               |       |       |       |            |          |
|   4 |     SORT GROUP BY NOSORT|       |  2705 | 13525 |   121   (0)| 00:00:02 |
|   5 |      INDEX FULL SCAN    | T1_I1 | 54081 |   264K|   121   (0)| 00:00:02 |
|*  6 |   INDEX RANGE SCAN      | T1_I1 |     1 |    17 |     0   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(COUNT(*)>1)
   6 - access("A"."OBJECT_ID"="B"."OBJECT_ID")

This is exactly what we want – as each row comes out of the aggregate rowsource at line 2, we access the index to get the details that we need of the multiple rows (just the rowids) – and the relevant index leaf block is the one we’ve just walked to generate the aggregate – so no further random I/O needed. This HAS to be pretty efficient.

There’s just one problem – I lied.

I’ve hacked the plan by hand; when you embed the previous query as an in-line view the nosort option disappears. Line 4 should read “SORT GROUP BY” – I edited in the nosort by hand – Oracle actually sorts the entire result set before starting the nested loop; in fact it’s slightly worse than that because by default the optimizer chooses a “HASH GROUP BY”, which I blocked by setting the hidden parameter _gby_hash_aggregation_enabled to false. Looking at the costing, by the way, it struck me that it was just possible that Oracle was doing a nosort sort – the incremental cost was zero – so I enabled the 10046 and 10032 trace to show that Oracle wrote the result set to the temporary tablespace, then read it back, and did actually record the expected number of records sorted.
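For reference, the block on hash aggregation was simply a setting of that hidden parameter – something like the following at the session level is the sort of thing I mean, and strictly a test-system exercise:

-- test system only: stop the optimizer from using HASH GROUP BY so
-- that it falls back to SORT GROUP BY (hidden parameter - use with care)
alter session set "_gby_hash_aggregation_enabled" = false;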

I couldn’t find any way to avoid this undesirable sort by hinting – so that seems to have scuppered our plans for using an SQL Plan Baseline to eliminate the large volume of writes to the temporary tablespace (note, we might still be happy to do an index fast full scan at this point with a hash group by – it should be better than the default plan because it will only dump the content of the index to the temporary tablespace once.)

But our plans might not be completely wrecked. It is possible to get the intermediate result set with a nosort and then use it as follows:

with b as (
	select /*+ materialize index(t1(object_id))  */
		object_id
	from
		t1
	where
		object_id is not null
	group by
		object_id
	having
		count(1) > 1
)
select
	--+ leading(b a) use_nl(a) index(a(object_id))
	a.rowid
from
	b ,
	t1	a
where
	a.object_id = b.object_id
;

----------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                        | Rows  | Bytes | Cost (%CPU)| Time  |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                             |  2705 | 81150 |   123   (0)| 00:00:02 |
|   1 |  TEMP TABLE TRANSFORMATION |                             |       |       |            |       |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D6608_4017FEF5 |       |       |            |       |
|*  3 |    FILTER                  |                             |       |       |            |       |
|   4 |     SORT GROUP BY NOSORT   |                             |  2705 | 13525 |   121   (0)| 00:00:02 |
|   5 |      INDEX FULL SCAN       | T1_I1                       | 54081 |   264K|   121   (0)| 00:00:02 |
|   6 |   NESTED LOOPS             |                             |  2705 | 81150 |     2   (0)| 00:00:01 |
|   7 |    VIEW                    |                             |  2705 | 35165 |     2   (0)| 00:00:01 |
|   8 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D6608_4017FEF5 |  2705 | 13525 |     2   (0)| 00:00:01 |
|*  9 |    INDEX RANGE SCAN        | T1_I1                       |     1 |    17 |     0   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(COUNT(*)>1)
   9 - access("A"."OBJECT_ID"="B"."OBJECT_ID")

If we can rewrite the SQL to use subquery factoring and then materialize the subquery (which should only produce a small temporary table) we can get the nosort back and get the nested loop – and I didn’t even have to play with a hidden parameter to do it. But how do we rewrite the SQL that’s generated by code embedded in a package ? First of all, here’s the SQL (with plan) I would like to use – I’ve eliminated the original hints, introduced a subquery factoring clause with materialization and an index hint:


with b as (
	select /*+ materialize index(t1 (object_id))  */
		object_id
	from
		t1
	where
		object_id is not null
	group by
		object_id
	having
		count(1) > 1
)
select
	A.rowid, :1, :2, :3
from
	t1 A,
	B
where
	(a.object_id = b.object_id)
union all
select
	a.rowid, :1, :2, :3
from
	t1 A
where
	object_id is null
;

----------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                        | Rows  | Bytes | Cost (%CPU)| Time  |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                             |  2706 | 81167 |     2   (0)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION |                             |       |       |            |       |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D660D_4017FEF5 |       |       |            |       |
|*  3 |    FILTER                  |                             |       |       |            |       |
|   4 |     SORT GROUP BY NOSORT   |                             |  2705 | 13525 |   121   (0)| 00:00:02 |
|   5 |      INDEX FULL SCAN       | T1_I1                       | 54081 |   264K|   121   (0)| 00:00:02 |
|   6 |   UNION-ALL                |                             |       |       |            |       |
|   7 |    NESTED LOOPS            |                             |  2705 | 81150 |     2   (0)| 00:00:01 |
|   8 |     VIEW                   |                             |  2705 | 35165 |     2   (0)| 00:00:01 |
|   9 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D660D_4017FEF5 |  2705 | 13525 |     2   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN       | T1_I1                       |     1 |    17 |     0   (0)| 00:00:01 |
|* 11 |    FILTER                  |                             |       |       |            |       |
|  12 |     INDEX FAST FULL SCAN   | T1_I1                       | 54081 |   897K|    34   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(COUNT(*)>1)
  10 - access("A"."OBJECT_ID"="B"."OBJECT_ID")
  11 - filter(NULL IS NOT NULL)

As you can see, we load a small (we hope) temporary table by doing an index walk with running count; we use that result set in a nested loop back into the index, and then we do a union all with a result set that we don’t attempt to create because the “null is not null” filter terminates that branch of the plan. All we have to do now is make Oracle run OUR query instead of its own.

Plan B – dbms_advanced_rewrite

The package dbms_advanced_rewrite basically allows you to tell Oracle that when it sees a particular piece of SQL it should run some other piece of SQL. You do have to be a little careful to make it work, of course – the text should be “sufficiently similar”, and the number and type of values used and returned has to be the same. Here’s an example of setting up a “rewrite equivalence”:

begin
        sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
                name             => 'TEST_HINT',
                source_stmt      => 'select * from t1 where n1 = 15',
                destination_stmt =>
                        'select /*+ index(v1.t1) */ * from v1 where n1 = 15',
                validate         => false,
                rewrite_mode     => 'general'
        );
end;
/

In this example, whenever the statement “select * from t1 where n1 = 15” reaches the database, it will execute the hinted SQL query against the view.

Following this example, all we have to do (and this is a technique that became available in 10g) is let Oracle generate the SQL that’s giving us the bad plan, and set up an equivalence so that what actually runs is our version with the subquery factoring clause instead – except we run into another problem when we try to declare the equivalence:
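In outline the call would look something like this – the equivalence name is made up and the two statement texts are abbreviated here; the source would have to be the exact text of the recursive SQL generated for the validate step, and the destination our version with the “with” clause:

begin
        sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
                name             => 'PK_VALIDATE_REWRITE',
                source_stmt      => '... exact text of the recursive validation SQL ...',
                destination_stmt => '... our version with the subquery factoring clause ...',
                validate         => false,
                rewrite_mode     => 'general'
        );
end;
/

The attempt fails immediately: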

ERROR at line 1:
ORA-30389: the source statement is not compatible with the destination statement
ORA-32034: unsupported use of WITH clause
ORA-06512: at "SYS.DBMS_ADVANCED_REWRITE", line 29
ORA-06512: at "SYS.DBMS_ADVANCED_REWRITE", line 185
ORA-06512: at line 11

Brilliant idea number 2 isn’t going to work – subquery factoring isn’t supported with advanced rewrite. So I guess we’re just going to have to wait for Oracle to improve the code.

Summary

After many years of believing that Oracle did something sensible when validating a primary key or unique constraint against an existing index, I finally discovered that it was doing something weirdly inefficient that invariably (probably) will result in the content of a very large index being dumped to the temporary tablespace twice as the validation takes place.

I tried to find a way of making the existing code avoid any large joins and aggregations by working out how to construct an SQL Plan Baseline for the query that would make it use a different join order and walk the index in order to produce an optimal nested loop approach. A limitation in the optimizer made the plan unreachable. (A variation on the plan that did an index fast full scan and hash aggregation followed by a nested loop is still viable, and could be a better strategy than the default.)

Because I couldn’t use an SQL Plan Baseline to get the plan I wanted, I looked at the possibility of a rewrite that would allow me to aggregate without using any temporary space, and it turned out that materializing a subquery factoring clause would work. The only way to impose this on Oracle, though, would be through the package dbms_advanced_rewrite – and it is a current limitation of that package that it won’t allow you to use subquery factoring in the rewrite process.

It is worth noting that the work I’ve done was triggered by a question relating to a table of 1.8 Billion (with a B) rows and the way that the default validation process needed more temporary space than was available. Although the general principle of what to do isn’t going to change, the actual code you need would have to be tailored to the specific table and index every time you tried to apply the method. In general it’s the sort of task which you don’t do unless you’ve got a very good reason. I went to the trouble of working out a few possibilities (a) because I was curious and (b) because I thought it might make an interesting little study for the blog.

OTN Tour of Latin America: Guatemala…

The Guatemala event is over. I will write a proper post about it on the plane tomorrow, but I just wanted to share three photos from the speakers room.

To set the scene, Graham Wood and Debra Lilley are on my left and Tom Kyte is on my right. Debra was describing something that happened earlier in the week and I just completely lost it. I was laughing so hard, tears were streaming down my face. These tours are tiring and really hard work, but they are also extremely good fun!

Cheers

Tim…






The linux ‘perf’ utility with EL 6 – part 2

In my previous post about perf I showed how it can be used to determine the function which is used the most systemwide.

One thing it helped me with is determining the impact of swapping on a system. The testcase I built for that is something of a no-brainer (setting up a 3GB SGA with 2GB of physical memory in a test VM), but consider a few things: I regularly encounter systems that have some amount of swap allocated. This can mean memory is simply over-allocated, but it can also mean that older pages have just been swapped out by Linux “just in case”. This will probably happen more often with the new Oracle recommendation for “swappiness” (the eagerness of Linux to swap), which is to set it to 100 (it used to be 60). The amount of swap transfers can be measured, but the regular tools cannot measure the impact. With perf, this is possible.

I ran another SLOB run (10 readers) against the 3GB SGA/2GB physical memory system, and collected perf data with ‘perf record -e cpu-clock -g -a’. After the run, I printed a summary of the collected data using ‘perf report -g none’:

# Events: 33K cpu-clock
#
# Overhead          Command                       Shared Object                                                                                                                     
# ........  ...............  ..................................  
#
    24.65%          swapper  [kernel.kallsyms]                   [k] native_safe_halt
    17.99%           oracle  [kernel.kallsyms]                   [k] _raw_spin_unlock_irqrestore
    10.82%           oracle  oracle                              [.] kcbgtcr
     2.94%           oracle  oracle                              [.] qetlbr
     2.66%           oracle  oracle                              [.] kcbz_fp_buf
     2.22%           oracle  oracle                              [.] kdsgrp
     1.98%           oracle  oracle                              [.] kafger

I guess this speaks for itself. The highest number of measurements of a function (24.65%) was a kernel-mode function ([k]) ‘native_safe_halt’, which was issued by ‘swapper’. The next most active function (17.99%) was a kernel-mode function ([k]) ‘_raw_spin_unlock_irqrestore’ which was issued by oracle. This function represents waiting for I/O (read on if you want to be able to determine that). The next functions are user-land functions ([.]). These are issued by oracle, with the executable oracle. Because we are using SLOB, which is made to issue 8KB block reads, it makes sense that the user-land function with the most impact is the ‘kcbgtcr’ function (kernel cache buffer get consistent read).

The example above shows the summary of the perf recording. The “normal” report shows the complete stack from the active function down to main. Here is an example with the data from which the summary above has been made:

# Events: 33K cpu-clock
#
# Overhead          Command                       Shared Object                                                                                                                     
# ........  ...............  ..................................  ...................................................................................................................
#
    24.65%          swapper  [kernel.kallsyms]                   [k] native_safe_halt
                    |
                    --- native_safe_halt
                        cpu_idle
                        rest_init
                        start_kernel
                        x86_64_start_reservations
                        x86_64_start_kernel

    17.99%           oracle  [kernel.kallsyms]                   [k] _raw_spin_unlock_irqrestore
                    |
                    --- _raw_spin_unlock_irqrestore
                       |          
                       |--97.51%-- scsi_dispatch_cmd
                       |          scsi_request_fn
                       |          __blk_run_queue
                       |          |          
                       |          |--99.69%-- queue_unplugged
                       |          |          blk_flush_plug_list
                       |          |          |          
                       |          |          |--94.22%-- blk_finish_plug
                       |          |          |          |          
                       |          |          |          |--99.85%-- generic_file_read_iter
                       |          |          |          |          generic_file_aio_read
                       |          |          |          |          aio_rw_vect_retry
                       |          |          |          |          aio_run_iocb
                       |          |          |          |          io_submit_one
                       |          |          |          |          do_io_submit
                       |          |          |          |          sys_io_submit
                       |          |          |          |          system_call_fastpath
                       |          |          |          |          io_submit
                       |          |          |          |          |          
                       |          |          |          |           --100.00%-- skgfqio
                       |          |          |          |                     ksfd_skgfqio

Here you see the stack of the swapper and, as a tree, the different stacks that lead to _raw_spin_unlock_irqrestore, with percentages showing how often each of the different paths leads to the same executable/function combination.

Using the percentages, the function calls that happen on the system can be investigated. The sample above shows the last two oracle database functions prior to the io_submit call (io_submit is the function that issues an asynchronous IO call), ksfd_skgfqio and skgfqio. Next we see the io_submit call, and everything which is called by the kernel.

The report contains all the stacks, not a partial one like my example.

AllThingsOracle.com - Dynamic Sampling (III) Part II

Just another short note: the final part of the third installment of the series on Dynamic Sampling has been published on AllThingsOracle.com.

I show there how indexes can be used to get Dynamic Sampling working with rare / clustered values, and expand on further ideas about how to deal with tricky situations for the optimizer.

Proactive Tuning

There’s no such thing as proactive tuning. Either you do it right before it goes into production (in which case it’s called “design”) or you react to the nasty surprises you get after it’s gone live.

Even if you do something about the problems before anyone has complained that’s still “reactive” not “proactive” – and the trick is to be reactive as soon as possible. For most companies the best strategy is probably to stick a big screen on the wall with the “Top Activity” page from OEM (or equivalent) so that the key players (DBAs and Developers) can keep a casual eye out for resource hogs and prioritise dealing with them as they appear.

Footnote: if you fall into that rare group of people who can look at what’s currently happening, identify the design errors, and explain how to fix them before the performance starts to nose-dive, then I’ll let you say you’re doing proactive tuning; but personally I tend to call that process “eliminating design errors”.

Hate the player, not the game — or my case for Data Engineers

Has “database” become a dirty word within your organization lately? If you’re someone who has been a data technologist for the better part of your career, you may be wondering why the technologies you work with every day seem to be acquiring such a bad rap. From NoSQL to No DB, the current influx of brogrammers seems to take extreme pride in describing how they’re able to write code while avoiding any kind of database technology whatsoever.

The impetus for this post actually started with something I read on the ‘Net the other day about Command Query Responsibility Segregation (CQRS), and how I was initially excited about the concept.

Martin Fowler has a nice, gentle introduction to the topic here.

Before I get into the post, however, I think it’s useful for me to describe some of my attitudes toward data management. What’s really odd is that while I rather strongly disagree with the tone of Uncle Bob Martin’s Rant, I actually strongly agree with his assertion about the high value of use-case driven development.

I’ve had gentle debates about the meaning of “data as truth” with several people, and the age-old debate of whether data is more “important” than application code. Generally I’ve found that such debates end up as religious arguments instead of attempting to get to the value of acting on data / or data in action. Because in the end it’s hard for data to have value unless it’s acted on by a set of processing directives (applications), and while it’s possible to have valuable applications that don’t require knowledge about the past (basic rule engine codifications), in general they need each other.

Why I call myself a data engineer

I’ve been impressed with EMC’s attempt to define a Data Science curriculum. In particular, I like how they describe the different skills and roles necessary for a successful data science team, including the hot new title of data scientist. The data science team often includes a data architect, a data engineer, and a database administrator. So, what is a data engineer? In a blog post, Steve Todd, Director of EMC’s Global Research and Innovation Portfolio, offers the following characterizations:

The “Database Administrator” provisions and configures the database environment to support the analytical needs of the working team. The “Data Engineer” tends to have deep technical skills to assist with tuning SQL queries for data management and extraction. They also support data ingest to the analytic sandbox. These people can be one and the same, but many times the data engineer is an expert on queries and data manipulation (and not necessarily analytics as such). The DBA may be good at this too, but many times they may simply be someone who is primarily skilled at setting up and deploying a large database schema, or product, or stack.

Many, many DBAs wear both hats, but I think it’s not a good idea — in general I think that DBA is to data engineer as system administrator is to software engineer, but the lack of data engineers has forced DBAs into dual roles for which they are often not well suited. While I have basic DBA skills, I’m much better at the skills listed under the data engineer — and I enjoy working with the data scientists or application developers who have questions about the data and/or how they’d like it structured to support their use cases.

This is one of the reasons why I agree with Uncle Bob’s rant in which he also rails against frameworks in addition to the database — I just wish frameworks had received equal billing in the rant and title, but I’m guessing that the No DB vitriol resonated more highly with readers. In general I like making sure data is organized in such a way as to support as many use cases as possible. That includes being performant for each use case — which may mean taking advantage of techniques to denormalize, duplicate and synchronize, cache and distribute data.

I suppose I could write a similar rant on No Data Frameworks, but then I’d probably step into the ORM battle, which really isn’t the focus of this post. But just to follow on to Uncle Bob’s rant — the reason I dislike many ORM Data Frameworks is that they tightly bind the application to a particular physical implementation of a data layout, which then limits and constrains my ability to adapt the data layout for new use cases, and leads to “persistence avoidance” in application code.

True story — on a recent Agile project, I was providing guidance on the data layer when I noticed that a bit of information for a new use case wasn’t being captured. I suggested to the team that it would be easy to extend the data layer in order to retain the additional information and I was met with groans: “But that means touching the persistence framework — that’s a big change!” — I was flabbergasted. Isn’t the data layer usually blamed for being inflexible? Are you telling me that it’s actually the framework causing the inflexibility?

Again I point back to Uncle Bob on Clean and Screaming Architecture.

If you’re still reading this, I’m sure you’re wondering how this ties in to CQRS and the original blog title.

When I first read about CQRS in Martin Fowler’s post, I became really interested — the idea that you would use different models for commands (“change” the data) and queries (“read” the data) made me think that frameworks that directly map models into applications could be retired in favor of messages related to use cases instead of model objects. To me, this means a data service API or set of virtual data layers which provide interfaces to data for applications, regardless of how the data is physically stored or organized. Huzzah! This would free me as a data engineer to ensure that I organized the data in ways which efficiently supported use cases. Since I tend to work in full-featured RDBMS systems, that meant I could wrap data using a data service API using whatever works, including things like stored procedures or RESTful web APIs using something like Oracle’s APEX listener.

So imagine my dismay when reading about CQRS and coming upon a whole series of blog posts about implementing CQRS expressly to “get rid of the database”. I intently read through the entire series trying to figure out what was wrong with the database that necessitated “getting rid of it” to implement CQRS. All to no avail. I’ve left a comment asking for that information, because I’m generally curious about it, but I have a guess.

It’s not about technology — it’s about the organization and its associated personalities that foster such an attitude.

Really now. In an organization with responsive data engineers there shouldn’t be a need to “get rid of the database”. One of the best reasons to have a database is that it provides so many ways to build the different kinds of models and transform the data between them with minimal need for additional frameworks or mountains of custom code.

In the end, I’m guessing that after years of hearing “No, we can’t do that” from the DBA’s-designated-as-data-engineers, the application team had come to equate the people with the technology. The implication is that the technology is the constraint instead of the people responsible for it.

So, what’s a way out? If your existing technology is the game, make sure you get the best players for every role and responsibility — don’t make your DBAs “play out of position” or else they’ll become hated representations of barriers to progress. If your organizational structure is the game, hate the game, change the game and advocate for skilled data engineers who can make your data more responsive to your business’s use cases. If you believe in “data as truth”, then invest in people who can make that data as useful as possible to as many use cases as you have.

OTN Tour of Latin America: Trinidad & Tobago…

We arrived at Trinidad & Tobago in the evening and it was hot and humid. I decided to wash some clothes in the hotel room sink, which was a mistake. It was so humid that even after about 36 hours my clothes were still damp and smelled a bit like wet dog when I packed them. Asking around it sounds like Guatemala will be just as humid, so I might have to use a hotel laundry service to get my clothes clean. :)

The timetable was such that I really didn’t get to see anything of Trinidad & Tobago. The whole of the first day was taken up with the conference. It was a single track event, so every presentation was to the whole audience. I was the first speaker, so I did a quick poll of the audience and decided to switch from my intended presentation to a more general one, so as not to alienate all but 5 of the audience. :)

The previous events were using translation services, so I had been constantly trying to speak slower. Presenting to an English speaking audience should have made it easier, but I found myself rushing a little. You gotta laugh. I’m never satisfied. :)

When the event was over a few of us went next door to the Hyatt Hotel and sat on the terrace chatting for a couple of hours. I had a Hyatt Sunrise and a Toasted Almond Joy, both of which were very sweet and a little alcoholic. :) Sheeri and I had agreed to present at a separate event the following day, so we were discussing the event with Loren (the organiser). When I heard the audience was made up of students I decided none of my presentations were suitable, so it was back to the hotel to write something new. I got most of the outline done before bed and (almost) finished it in the morning. I spoke to Debra Lilley at breakfast and she gave me a couple of ideas, which I added when I got to the event.

I like to rehearse my presentations a lot, so doing an on-the-fly presentation like this was pretty scary for me. I was really pleased with the way it went. I feel like it flowed well and the response from the audience was good. If anything, I feel like it went better than my presentation from the previous day. During the presentation I made some comments about documenting code and systems, so when I finished Loren asked if I could do a short presentation about documentation. I figured “in for a penny, in for a pound”, so while Sheeri was presenting I wrote some slides. As soon as Sheeri was finished I did a 20 minute spot on documentation. Two unplanned presentations in one day is about as much as my tiny brain could cope with. :)

Soon after our sessions were done, we got a taxi to the airport for our flights to Guatemala. The crammed timetable meant all I really saw of Trinidad & Tobago was:

  • The airport.
  • The Hilton hotel.
  • The government building hosting the two events.
  • The terrace of the Hyatt hotel.
  • The scenes on the two fridge magnets I bought at the airport.
  • A guy selling coconuts from a cart near a park (thanks to Debra for reminding me :) ).

I’m not sure I can count this as a visit to the Caribbean. It’s a bit like saying you’ve been to a country when all you’ve done is wait at an airport for a connecting flight. :)

The timetable for Guatemala suggests it’s going to be a similar story there. Such is life!

Thanks to Loren and the gang for organising the event and of course, thanks to the Oracle ACE program for making it all happen.

Cheers

Tim…

PS. I’m looking forward to getting some sleep. I’ve kinda forgotten what it’s like…

PPS. I don’t have many photos, but what I have are here.






The linux ‘perf’ utility with EL 6

Some of you might have experimented with, or used Oprofile, ltrace/strace, Systemtap, and wished you could look further into the userland process, like dtrace does, because dtrace allows you to profile and look into a process’ function calls.

If you’ve used oprofile/strace/systemtap/etc., you probably found out that you can see all the kernel functions being processed, but they do not get into userspace, or only with very limited information. Perhaps the only tool which is different is gdb, which enables you to see userspace function names, but gdb is not a profiler, it’s a debugger. And it works best with (I assume it was made for) debugging symbols in the executable. Most (all I presume) commercial executables, like the oracle database executable, do not have debugging symbols.

Some time ago, a tweet from Kevin Closson caught my attention: it showed a profile of a Linux system running Oracle, including the functions inside Oracle. This totally shocked me! That is something I want to be able to do on Linux too!!

It was done using the perf utility. The perf utility is delivered with the kernel (rpm) package on EL Linuxes, and since I am using Oracle Linux, I can tell you it’s delivered with the ‘UEK’ kernels (the 2.6.32+ ones). I eventually used the latest version of the UEK kernel (2.6.39-200.24.1.el6uek.x86_64), which you get when you subscribe to the ol6_UEK_latest yum channel. Newer versions of the kernel have more perf functions.

The ‘perf top’ sub-utility shows the kernel and userland functions which are active (by sampling active processes). It can be used in a variety of ways, some of them being: system-wide, per processor or per process. This allows you to look beyond the Oracle wait interface. As Tanel Poder often states in his performance classes: the Oracle wait interface is a good starting point for performance investigation, but sometimes you need to look deeper. You always need to use the appropriate layer to look at when investigating issues.

Note when using virtualization: virtualization does not allow direct access to the performance monitoring hardware. So if you try to use perf in a virtualized environment (or oprofile for that matter), you get no output. You can get around that with perf by using a software event. I use the software event cpu-clock. This means I invoke perf with the option ‘-e cpu-clock’.

For example: this statement uses a lot of CPU, and might (depending on Oracle version, operating system type and version) never return an answer. But it is perfect for this purpose because it consumes approximately 1 CPU, and keeps doing that for some time:

select count(*) from dba_extents, dba_extents, dba_extents

A quick glimpse at the statistics profile using sql_monitor reveals:

Global Stats
=========================================
| Elapsed |   Cpu   |  Other   | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets  |
=========================================
|     109 |     103 |     6.23 |   9755 |
=========================================
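(For reference, a text report like this can be pulled from SQL Monitor with something along the following lines – the sql_id is just a placeholder that you’d look up in v$sql_monitor or v$session, and the feature needs the relevant pack licence:)

-- sketch: fetch the (text) SQL Monitor report for a given statement;
-- replace the sql_id with the one found in v$sql_monitor
select
	dbms_sqltune.report_sql_monitor(
		sql_id => '&sql_id',
		type   => 'TEXT'
	)
from
	dual
;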

Of course I could look in the sql monitoring plan details, which would reveal what row sources are active at the moment, but let’s see what ‘perf top -e cpu-clock -p 12727’ shows:

   PerfTop:     108 irqs/sec  kernel: 0.9%  exact:  0.0% [1000Hz cpu-clock],  (target_pid: 12727)
--------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ ____________________________________________________

             1414.00 58.8% kxhrHash                  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              389.00 16.2% qerhjSplitProbe           /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              263.00 10.9% rworofprVeryFastUnpackRow /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              122.00  5.1% smbget                    /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               93.00  3.9% sorgetqbf                 /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               86.00  3.6% qersoFetch                /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               31.00  1.3% qerhjInnerProbeHashTable  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

This shows that 58.8% of the samples taken of this process are busy in the function ‘kxhrHash’. So it’s the hashing which this process is busy with. Another important and interesting aspect is that there is very little time spent in kernel mode (kernel: 0.9%). But…the information about where the time is spent could have been read from the sql monitoring report: it reports CPU and waits per row source.

Where perf comes in to fulfill something which could not be done differently is when you have multiple processes and cannot set up perf for every distinct process, or when you want to understand system-wide performance. To mimic a system doing OLTP, I’ve set up Kevin Closson’s SLOB in PIO mode, and used perf top -e cpu-clock without specifying a process ID, so I measure the entire system. I get the following profile:

  PerfTop:     984 irqs/sec  kernel:92.4%  exact:  0.0% [1000Hz cpu-clock],  (all, 1 CPU)
--------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                           DSO
             _______ _____ __________________________________ ______________________________________________________________________

            13102.00 88.3% _raw_spin_unlock_irqrestore        [kernel.kallsyms]                                                     
              772.00  5.2% copy_page                          [kernel.kallsyms]                                                     
              298.00  2.0% __do_softirq                       [kernel.kallsyms]                                                     
              114.00  0.8% scsi_request_fn                    [kernel.kallsyms]                                                     
               29.00  0.2% kcbgtcr                            /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle                  
               25.00  0.2% kmem_cache_alloc                   [kernel.kallsyms]                                                     
               25.00  0.2% finish_task_switch                 [kernel.kallsyms]                                                     
               22.00  0.1% kcbzib                             /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle                  
               17.00  0.1% sxorchk                            /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

Conclusion: the majority of the processes spend time in kernel calls (kernel.kallsyms). This is not too surprising: I've set up SLOB in 'PIO mode', which means I've reduced the Oracle database buffer cache as much as possible. With Oracle version 11.2.0.3, the smallest buffer cache I can create is 4MB. Next I've let 10 readers do index range scans (that is one of the functions of SLOB: swamping your system with single block (8KB) IO's). The majority of the processes are spending time in the kernel function _raw_spin_unlock_irqrestore. This is what a system which is swamped with physical IO looks like in perf.

This all changes when I revert SLOB to LIO mode, and issue the same test (SLOB run with 10 readers):

   PerfTop:    1004 irqs/sec  kernel: 3.1%  exact:  0.0% [1000Hz cpu-clock],  (all, 1 CPU)
---------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________________________________________

              478.00 23.9% kcbgtcr                     /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              154.00  7.7% qetlbr                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              149.00  7.4% kcbz_fp_buf                 /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              133.00  6.6% kdsgrp                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              104.00  5.2% kafger                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               94.00  4.7% kcbrls                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               60.00  3.0% ktrgcm                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               60.00  3.0% ktrget3                     /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

Ah, we now spend only 3.1% in kernel mode, and the majority of the time is spent in kcbgtcr (quite probably meaning 'kernel cache buffer get consistent read')! I've modified the buffer cache to be able to contain all blocks (also known as SLOB 'LIO mode'), so once they are read, only a logical IO is needed to fetch the block.

Expanding SQL

Here’s a little thing that Dan Morgan mentioned to me some time ago. It’s a little routine from a package (owned by sys) that appeared in 11.2.0.3 that gives you some idea of the mess hidden behind a query that uses views. The procedure is dbms_sql2.expand_sql_text and takes two (CLOB) parameters, an IN parameter that is the text you want expanded, and an OUT parameter that is likely to be a long and messy piece of text. Rather than say any more, here’s an example of usage:

set linesize 70
set pagesize 0
set feedback off

declare
	m_sql_in	clob :='select * from dba_synonyms where owner = ''TEST_USER''';
	m_sql_out	clob := empty_clob();

begin

	dbms_sql2.expand_sql_text(
		m_sql_in,
		m_sql_out
	);

	dbms_output.put_line(m_sql_out);
end;
/

Notice how I’ve doubled up all the single quotes.
Here’s the output – you might want to try this only after setting linesize to 32767, and then you can paste the result into an SQL formatter:

SELECT "A1"."OWNER" "OWNER","A1"."SYNONYM_NAME" "SYNONYM_NAME","A1"."T
ABLE_OWNER" "TABLE_OWNER","A1"."TABLE_NAME" "TABLE_NAME","A1"."DB_LINK
" "DB_LINK" FROM  (SELECT "A4"."NAME" "OWNER","A2"."NAME" "SYNONYM_NAM
E","A3"."OWNER" "TABLE_OWNER","A3"."NAME" "TABLE_NAME","A3"."NODE" "DB
_LINK" FROM "SYS"."USER$" "A4","SYS"."SYN$" "A3", (SELECT "A6"."OBJ#"
"OBJ#","A6"."DATAOBJ#" "DATAOBJ#","A6"."OWNER#" "DEFINING_OWNER#","A6"
."NAME" "NAME","A6"."NAMESPACE" "NAMESPACE","A6"."SUBNAME" "SUBNAME","
A6"."TYPE#" "TYPE#","A6"."CTIME" "CTIME","A6"."MTIME" "MTIME","A6"."ST
IME" "STIME","A6"."STATUS" "STATUS","A6"."REMOTEOWNER" "REMOTEOWNER","
A6"."LINKNAME" "LINKNAME","A6"."FLAGS" "FLAGS","A6"."OID$" "OID$","A6"
."SPARE1" "SPARE1","A6"."SPARE2" "SPARE2","A6"."SPARE3" "SPARE3","A6".
"SPARE4" "SPARE4","A6"."SPARE5" "SPARE5","A6"."SPARE6" "SPARE6","A6"."
SPARE3" "OWNER#",CASE  WHEN (("A6"."TYPE#"<>4 AND "A6"."TYPE#"<>5 AND
"A6"."TYPE#"<>7 AND "A6"."TYPE#"<>8 AND "A6"."TYPE#"<>9 AND "A6"."TYPE
#"<>10 AND "A6"."TYPE#"<>11 AND "A6"."TYPE#"<>12 AND "A6"."TYPE#"<>13
AND "A6"."TYPE#"<>14 AND "A6"."TYPE#"<>22 AND "A6"."TYPE#"<>87) OR BIT
AND("A5"."SPARE1",16)=0) THEN NULL WHEN "A5"."TYPE#"=2 THEN  (SELECT "
A9"."NAME" "NAME" FROM SYS."OBJ$" "A9" WHERE "A9"."OBJ#"="A5"."SPARE2"
) ELSE 'ORA$BASE' END  "DEFINING_EDITION" FROM SYS."OBJ$" "A6",SYS."US
ER$" "A5" WHERE "A6"."OWNER#"="A5"."USER#" AND ("A6"."TYPE#"<>4 AND "A
6"."TYPE#"<>5 AND "A6"."TYPE#"<>7 AND "A6"."TYPE#"<>8 AND "A6"."TYPE#"
<>9 AND "A6"."TYPE#"<>10 AND "A6"."TYPE#"<>11 AND "A6"."TYPE#"<>12 AND
 "A6"."TYPE#"<>13 AND "A6"."TYPE#"<>14 AND "A6"."TYPE#"<>22 AND "A6"."
TYPE#"<>87 AND "A6"."TYPE#"<>88 OR BITAND("A5"."SPARE1",16)=0 OR ("A6"
."TYPE#"=4 OR "A6"."TYPE#"=5 OR "A6"."TYPE#"=7 OR "A6"."TYPE#"=8 OR "A
6"."TYPE#"=9 OR "A6"."TYPE#"=10 OR "A6"."TYPE#"=11 OR "A6"."TYPE#"=12
OR "A6"."TYPE#"=13 OR "A6"."TYPE#"=14 OR "A6"."TYPE#"=22 OR "A6"."TYPE
#"=87) AND ("A5"."TYPE#"<>2 AND SYS_CONTEXT('userenv','current_edition
_name')='ORA$BASE' OR "A5"."TYPE#"=2 AND "A5"."SPARE2"=SYS_CONTEXT('us
erenv','current_edition_id') OR  EXISTS (SELECT 0 FROM SYS."OBJ$" "A8"
,SYS."USER$" "A7" WHERE "A8"."TYPE#"=88 AND "A8"."DATAOBJ#"="A6"."OBJ#
" AND "A8"."OWNER#"="A7"."USER#" AND "A7"."TYPE#"=2 AND "A7"."SPARE2"=
SYS_CONTEXT('userenv','current_edition_id'))))) "A2" WHERE "A2"."OBJ#"
="A3"."OBJ#" AND "A2"."TYPE#"=5 AND "A2"."OWNER#"="A4"."USER#") "A1" W
HERE "A1"."OWNER"='TEST_USER'

This expansion is probably the first step the optimizer takes in handling your code – but it’s still not the transformed text that is ultimately the “unparsed” version of your statement.

Random Plans

Have you ever felt that the optimizer was persecuting you by picking plans at random ? Perhaps you’re not paranoid, perhaps it’s part of Oracle Corp.’s master plan to take over the world. If you look closely at the list of hidden parameters you’ll find that some details of this cunning plan have leaked. In 10.1.0.2 Oracle created a new parameter _optimizer_random_plan with the description “optimizer seed value for random plans”. Who knows what terrible effects we may see when the default value of this parameter changes.
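If you want to check the parameter and its description for yourself, a query like the following against the x$ structures (run as SYS, with the usual caveats about poking around in x$) is the sort of thing that will show it:

-- run as SYS: show the hidden parameter, its current value and description
select
	n.ksppinm	parameter_name,
	v.ksppstvl	session_value,
	n.ksppdesc	description
from
	x$ksppi		n,
	x$ksppcv	v
where
	n.indx = v.indx
and	n.ksppinm = '_optimizer_random_plan'
;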