CBO

Join Surprise

Imagine I have a simple SQL statement with a “where clause” that looks like this:


	t2.id1(+) = t1.id1
and	t2.id2(+) = t1.id2

Would you expect it to run several times faster (25 minutes instead of a few hours) when the only change you made was to swap the order of the join predicates to:


	t2.id2(+) = t1.id2
and	t2.id1(+) = t1.id1

You may recall that a couple of years ago I wrote about some bugs in the optimizer, and pointed you to a blog article by Alberto Dell’Era that demonstrated an anomaly in cardinality calculations that made this type of thing possible. But here’s an example which has nothing to do with cardinality errors. We start with a suitable dataset – running on 11.1.0.6.


create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 10000
)
select
	trunc(dbms_random.value(1,1000))	id1,
	trunc(dbms_random.value(1,1000))	id2,
	lpad(rownum,10,'0')	small_vc,
	rpad('x',1000)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 10000
;

create table t2
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 7
)
select
	t1.id1,
	t1.id2,
	v1.id,
	lpad(rownum,10,'0')	small_vc,
	rpad('x',70)		padding
from
	t1		t1,
	generator	v1
;

-- collect stats, compute, no histograms
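
The comment above stands in for the statistics call. A minimal sketch of what it might look like, assuming both tables are in the current schema; estimate_percent => 100 is one reading of “compute”, and 'for all columns size 1' is the usual way of asking for no histograms:

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 => 'T1',
		estimate_percent => 100,			-- "compute": read every row
		method_opt	 => 'for all columns size 1'	-- one bucket per column, i.e. no histograms
	);
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 => 'T2',
		estimate_percent => 100,
		method_opt	 => 'for all columns size 1'
	);
end;
/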

This data set models a problem – stripped to the bare essentials – that I came across at a client site some time ago. We have a “parent/child” relationship between the tables (although I haven’t declared the referential integrity), with roughly seven child rows per parent. The parent rows are quite long, the child rows are quite short. Some parents may not have children (although in this data set they do).

We now run a “report” that generates data for a number-crunching tool that extracts all the data from the tables – using an outer join so that parent rows don’t get lost. For various reasons the tool wanted the data sorted in a certain order – so there’s also an order by clause in the query. I’m going to show you the original query – first unhinted, and then hinted to use a merge join:


select
	t1.padding,
	t2.padding
from
	t1, t2
where
	t2.id1(+) = t1.id1
and	t2.id2(+) = t1.id2
order by
	t1.id2,
	t1.id1
;

---------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      | 10000 |    10M|       |  3720   (1)| 00:00:45 |
|   1 |  SORT ORDER BY         |      | 10000 |    10M|    22M|  3720   (1)| 00:00:45 |
|*  2 |   HASH JOIN RIGHT OUTER|      | 10000 |    10M|  6224K|  1436   (1)| 00:00:18 |
|   3 |    TABLE ACCESS FULL   | T2   | 70000 |  5400K|       |   260   (1)| 00:00:04 |
|   4 |    TABLE ACCESS FULL   | T1   | 10000 |  9853K|       |   390   (1)| 00:00:05 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T2"."ID1"(+)="T1"."ID1" AND "T2"."ID2"(+)="T1"."ID2")

select
	/*+ leading(t1 t2) use_merge(t2) */
	t1.padding,
	t2.padding
from
	t1, t2
where
	t2.id1(+) = t1.id1
and	t2.id2(+) = t1.id2
order by
	t1.id2,
	t1.id1
;

-------------------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      | 10000 |    10M|       |  6343   (1)| 00:01:17 |
|   1 |  SORT ORDER BY       |      | 10000 |    10M|    22M|  6343   (1)| 00:01:17 |
|   2 |   MERGE JOIN OUTER   |      | 10000 |    10M|       |  4059   (1)| 00:00:49 |
|   3 |    SORT JOIN         |      | 10000 |  9853K|    19M|  2509   (1)| 00:00:31 |
|   4 |     TABLE ACCESS FULL| T1   | 10000 |  9853K|       |   390   (1)| 00:00:05 |
|*  5 |    SORT JOIN         |      | 70000 |  5400K|    12M|  1549   (1)| 00:00:19 |
|   6 |     TABLE ACCESS FULL| T2   | 70000 |  5400K|       |   260   (1)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T2"."ID1"(+)="T1"."ID1" AND "T2"."ID2"(+)="T1"."ID2")
       filter("T2"."ID2"(+)="T1"."ID2" AND "T2"."ID1"(+)="T1"."ID1")

But there’s something a little odd about how the optimizer has chosen to do the merge join. Although our join condition references the join columns in the order (id1, id2) our final sort order is on (id2, id1) – and the optimizer hasn’t taken advantage of the fact that it could do the “sort join” operations in the order (id2, id1) and avoid the final “sort order by” at line 1.

So let’s rewrite the query to make the order of the join predicates match the order of the order by clause, and see what happens to the plan:


select
	/*+ leading(t1 t2) use_merge(t2) */
	t1.padding,
	t2.padding
from
	t1, t2
where
	t2.id2(+) = t1.id2
and	t2.id1(+) = t1.id1
order by
	t1.id2,
	t1.id1
;

------------------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      | 10000 |    10M|       |  4059   (1)| 00:00:49 |
|   1 |  MERGE JOIN OUTER   |      | 10000 |    10M|       |  4059   (1)| 00:00:49 |
|   2 |   SORT JOIN         |      | 10000 |  9853K|    19M|  2509   (1)| 00:00:31 |
|   3 |    TABLE ACCESS FULL| T1   | 10000 |  9853K|       |   390   (1)| 00:00:05 |
|*  4 |   SORT JOIN         |      | 70000 |  5400K|    12M|  1549   (1)| 00:00:19 |
|   5 |    TABLE ACCESS FULL| T2   | 70000 |  5400K|       |   260   (1)| 00:00:04 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T2"."ID2"(+)="T1"."ID2" AND "T2"."ID1"(+)="T1"."ID1")
       filter("T2"."ID1"(+)="T1"."ID1" AND "T2"."ID2"(+)="T1"."ID2")

The plan no longer has the final “sort order by” operation – and the cost of the plan is much lower as a consequence. You’ll also notice that the predicate sections (always check the predicate section) are a little different – the order of evaluation has been reversed.

In my test case the cost of the merge join still hasn’t fallen below the cost of the hash join – but in the client’s case, changing the order of the predicates (without adding any hints) made the merge join much cheaper than the hash join. Fortunately this was a case where the cost was a realistic indication of run time, and avoiding a sort operation of some 35GB of join result was a very good move.

So watch out – with multi-column joins, the order of the join predicates can make a big difference to the way Oracle operates a merge join.
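
A quick way to check whether a given predicate order lets the optimizer drop the final sort is to explain both versions and compare the plans. This is a sketch using the t1/t2 test tables above; explain plan and dbms_xplan.display are standard, but the costs reported will depend on version and statistics:

explain plan for
select
	/*+ leading(t1 t2) use_merge(t2) */
	t1.padding,
	t2.padding
from
	t1, t2
where
	t2.id2(+) = t1.id2	-- predicate order matches the order by clause
and	t2.id1(+) = t1.id1
order by
	t1.id2,
	t1.id1
;

select * from table(dbms_xplan.display);

On the 11.1.0.6 system described above, this version shows no SORT ORDER BY above the MERGE JOIN OUTER; swap the two predicates and re-run the pair of statements to see the extra sort reappear.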

ANSI – argh

I’m not keen on ANSI standard SQL – even though it is, technically, the strategic option and even though you have to use it for full outer joins and partitioned outer joins.

One reason for disliking it is that it “separates join predicates from filter predicates” – a reason often given in praise of the syntax which, to my mind, claims a spurious distinction and introduces a mechanism that makes it harder to keep mental track of what’s going to happen as you walk through the join order.

The other reason for disliking ANSI SQL in Oracle databases is that sometimes it really is necessary to add hints to the SQL to make the optimizer do what needs to be done – and ANSI makes it so much harder and messier to add hints to code. Here’s a wonderful example that Tony Hasler presented in our recent debate “Does Oracle Ignore Hints” at the UKOUG annual conference:

WITH q1 as (
	SELECT /*+ qb_name(q1block) */
		*
	FROM	t1
	JOIN	t2
	ON	t1_i1 = t2_i1
	AND	t1_i1 < 10
),
q2 AS (
	SELECT
		/*+ qb_name(q2block) */
		*
	FROM
		t3
	JOIN	t4
	ON	t3_i1 = t4_i1
	AND	t3_i1 < 10
)
SELECT
	/*+
		no_merge(@q1block)
		no_merge(@q2block)
		leading (@q1block t2)
		use_nl  (@q1block t1)
	*/
	*
FROM
	q1
JOIN
	q2
ON	t1_i1 + t2_i1 = t3_i1 + t4_i1
;

Just to make life really hard, he’s included a couple of “factored subqueries” – and there are a few outstanding optimizer defects with handling subquery factoring – so when he claimed that this was an example of Oracle ignoring hints I had two different directions of investigation to worry about.

Here’s the execution plan (from my 10.2.0.3 system with the data generation, constraints and indexing that Tony supplied):

------------------------------------------------------------------------
| Id  | Operation                      | Name  | Rows  | Bytes | Cost  |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |       |   250K|    33M|    54 |
|*  1 |  HASH JOIN                     |       |   250K|    33M|    54 |
|   2 |   VIEW                         |       |  5000 |   327K|    11 |
|*  3 |    HASH JOIN                   |       |  5000 |   581K|    11 |
|   4 |     TABLE ACCESS BY INDEX ROWID| T4    |  5000 |   253K|     3 |
|*  5 |      INDEX RANGE SCAN          | T4_I1 |   900 |       |     2 |
|   6 |     TABLE ACCESS BY INDEX ROWID| T3    |  5000 |   327K|     3 |
|*  7 |      INDEX RANGE SCAN          | T3_I1 |   900 |       |     2 |
|   8 |   VIEW                         |       |  5000 |   361K|    12 |
|*  9 |    HASH JOIN                   |       |  5000 |   615K|    12 |
|  10 |     TABLE ACCESS BY INDEX ROWID| T1    |  5000 |   297K|     3 |
|* 11 |      INDEX RANGE SCAN          | T1_I1 |   900 |       |     2 |
|  12 |     TABLE ACCESS BY INDEX ROWID| T2    |  5000 |   317K|     3 |
|* 13 |      INDEX RANGE SCAN          | T2_I1 |   900 |       |     2 |
------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$16C51A37
   2 - SEL$4C69CCA2 / Q2@SEL$1
   3 - SEL$4C69CCA2
   4 - SEL$4C69CCA2 / T4@SEL$2
   5 - SEL$4C69CCA2 / T4@SEL$2
   6 - SEL$4C69CCA2 / T3@SEL$2
   7 - SEL$4C69CCA2 / T3@SEL$2
   8 - SEL$7939585E / Q1@SEL$1
   9 - SEL$7939585E
  10 - SEL$7939585E / T1@SEL$3
  11 - SEL$7939585E / T1@SEL$3
  12 - SEL$7939585E / T2@SEL$3
  13 - SEL$7939585E / T2@SEL$3

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1_I1"+"T2_I1"="T3_I1"+"T4_I1")

As you can see, Oracle has copied the two factored subqueries inline (they appear just once each in the body of the query so this is – probably – inevitable). Then Oracle has obeyed the no_merge() hints – which I could check by deleting the hints and watching the plan change. So why, in lines 10 through 13, has Oracle not obeyed the leading() and use_nl() hints ?

By changing the ANSI syntax to traditional Oracle syntax, I got a different plan:

-------------------------------------------------------------------------
| Id  | Operation                       | Name  | Rows  | Bytes | Cost  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |       |   250K|    33M| 10045 |
|*  1 |  HASH JOIN                      |       |   250K|    33M| 10045 |
|   2 |   VIEW                          |       |  5000 |   327K|    11 |
|*  3 |    HASH JOIN                    |       |  5000 |   581K|    11 |
|   4 |     TABLE ACCESS BY INDEX ROWID | T4    |  5000 |   253K|     3 |
|*  5 |      INDEX RANGE SCAN           | T4_I1 |   900 |       |     2 |
|   6 |     TABLE ACCESS BY INDEX ROWID | T3    |  5000 |   327K|     3 |
|*  7 |      INDEX RANGE SCAN           | T3_I1 |   900 |       |     2 |
|   8 |   VIEW                          |       |  5000 |   361K| 10003 |
|   9 |    TABLE ACCESS BY INDEX ROWID  | T1    |     1 |    61 |     2 |
|  10 |     NESTED LOOPS                |       |  5000 |   615K| 10003 |
|  11 |      TABLE ACCESS BY INDEX ROWID| T2    |  5000 |   317K|     3 |
|* 12 |       INDEX RANGE SCAN          | T2_I1 |   900 |       |     2 |
|* 13 |      INDEX RANGE SCAN           | T1_I1 |     1 |       |     1 |
-------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   2 - Q2BLOCK / Q2@SEL$1
   3 - Q2BLOCK
   4 - Q2BLOCK / T4@Q2BLOCK
   5 - Q2BLOCK / T4@Q2BLOCK
   6 - Q2BLOCK / T3@Q2BLOCK
   7 - Q2BLOCK / T3@Q2BLOCK
   8 - Q1BLOCK / Q1@SEL$1
   9 - Q1BLOCK / T1@Q1BLOCK
  11 - Q1BLOCK / T2@Q1BLOCK
  12 - Q1BLOCK / T2@Q1BLOCK
  13 - Q1BLOCK / T1@Q1BLOCK

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1_I1"+"T2_I1"="T3_I1"+"T4_I1")
   3 - access("T3_I1"="T4_I1")
   5 - access("T4_I1"<10)
   7 - access("T3_I1"<10)
  12 - access("T2_I1"<10)
  13 - access("T1_I1"="T2_I1")
       filter("T1_I1"<10)

Notice how the optimizer is now obeying the leading() and use_nl() hints.
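
The traditional-syntax version of the query isn’t reproduced above. A plausible reconstruction, assuming the only change is to move the join conditions from the ON clauses into WHERE clauses (table and column names as in Tony’s original), would be:

WITH q1 as (
	SELECT /*+ qb_name(q1block) */
		*
	FROM	t1, t2
	WHERE	t1_i1 = t2_i1
	AND	t1_i1 < 10
),
q2 AS (
	SELECT
		/*+ qb_name(q2block) */
		*
	FROM
		t3, t4
	WHERE	t3_i1 = t4_i1
	AND	t3_i1 < 10
)
SELECT
	/*+
		no_merge(@q1block)
		no_merge(@q2block)
		leading (@q1block t2)
		use_nl  (@q1block t1)
	*/
	*
FROM
	q1, q2
WHERE	t1_i1 + t2_i1 = t3_i1 + t4_i1
;

With no ANSI transformation involved, the query block names in the hints are the ones the optimizer actually sees, which matches the Q1BLOCK and Q2BLOCK entries in the plan’s alias section.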

The problem is this: Oracle doesn’t optimise ANSI SQL, it transforms it then optimises it. Transformation can change query blocks, and Tony’s hints apply to specific query blocks. After a little testing and checking I worked out what the SQL looked like AFTER transformation and BEFORE optimisation; and it’s this:

select
	/*+ qb_name(sel$4) */
	*
from
	(
	SELECT
		/*+ qb_name(sel$1) */
		Q1.T1_I1 T1_I1, Q1.T1_I2 T1_I2, Q1.T1_D1 T1_D1, Q1.T2_I1 T2_I1, Q1.T2_I2 T2_I2, Q1.T2_TS T2_TS,
		Q2.T3_I1 T3_I1, Q2.T3_I2 T3_I2, Q2.T3_TSTZ T3_TSTZ, Q2.T4_I1 T4_I1, Q2.T4_I2 T4_I2
	FROM
		(
		SELECT
			/*+ NO_MERGE QB_NAME (Q1BLOCK) */
			from$_subquery$_003.T1_I1_0 T1_I1, from$_subquery$_003.T1_I2_1 T1_I2,
			from$_subquery$_003.T1_D1_2 T1_D1, from$_subquery$_003.T2_I1_3 T2_I1,
			from$_subquery$_003.T2_I2_4 T2_I2, from$_subquery$_003.T2_TS_5 T2_TS
		FROM	(
			SELECT
				/*+ qb_name(sel$3) */
				T1.T1_I1 T1_I1_0, T1.T1_I2 T1_I2_1, T1.T1_D1 T1_D1_2,
				T2.T2_I1 T2_I1_3, T2.T2_I2 T2_I2_4, T2.T2_TS T2_TS_5
			FROM
				TEST_USER.T1 T1,
				TEST_USER.T2 T2
			WHERE
				T1.T1_I1 = T2.T2_I1
			AND	T1.T1_I1 < 10
			)	from$_subquery$_003
		)	Q1,
		(
		SELECT
			/*+ NO_MERGE QB_NAME (Q2BLOCK) */
			from$_subquery$_006.T3_I1_0 T3_I1, from$_subquery$_006.T3_I2_1 T3_I2,
			from$_subquery$_006.T3_TSTZ_2 T3_TSTZ, from$_subquery$_006.T4_I1_3 T4_I1,
			from$_subquery$_006.T4_I2_4 T4_I2
		FROM
			(
			SELECT
				/*+ qb_name(sel$2) */
				T3.T3_I1 T3_I1_0, T3.T3_I2 T3_I2_1, T3.T3_TSTZ T3_TSTZ_2,
				T4.T4_I1 T4_I1_3, T4.T4_I2 T4_I2_4 FROM TEST_USER.T3 T3, TEST_USER.T4 T4
			WHERE
				T3.T3_I1 = T4.T4_I1
			AND	T3.T3_I1 < 10
			)	from$_subquery$_006
		)	Q2
	WHERE
		Q1.T1_I1 + Q1.T2_I1 = Q2.T3_I1 + Q2.T4_I1
	)
;

I got most of this from the “Query Block Name / Object Alias” section of the ANSI execution plan (there are some important clues there, like ‘T1@SEL$3’) and the “unparsed” SQL from the 10053 trace.
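
For anyone wanting to repeat the exercise, the two sources of information can be generated roughly as follows. This is a sketch: the statement being explained is only a stand-in (the qb_name main is arbitrary), and the location and exact content of the trace file vary with version.

-- Write an optimizer (10053) trace for statements parsed in this session
alter session set events '10053 trace name context forever, level 1';

explain plan for
select /*+ qb_name(main) */ count(*) from t1;

alter session set events '10053 trace name context off';

-- Report the plan with the "Query Block Name / Object Alias" section included
select * from table(dbms_xplan.display(null, null, 'typical +alias'));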

Notice how the query blocks q1block and q2block still exist – that’s why the no_merge() hints can survive the transformation. Notice, though, that the transformation engine has introduced a layer of inline views inside q1block and q2block – which is why the leading(@q1block t2) and use_nl(@q1block t1) hints are no longer valid: they reference objects which are not in q1block. To get his hints to work at the global level, Tony would have to change the last two hints to reference sel$3 rather than q1block.
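
Putting that last point into code, the ANSI version with retargeted hints would look something like this. It is a sketch that has not been verified against Tony’s data, and the sel$3 name comes from the 10.2.0.3 transformation shown above, so it could change with version:

WITH q1 as (
	SELECT /*+ qb_name(q1block) */
		*
	FROM	t1
	JOIN	t2
	ON	t1_i1 = t2_i1
	AND	t1_i1 < 10
),
q2 AS (
	SELECT
		/*+ qb_name(q2block) */
		*
	FROM
		t3
	JOIN	t4
	ON	t3_i1 = t4_i1
	AND	t3_i1 < 10
)
SELECT
	/*+
		no_merge(@q1block)
		no_merge(@q2block)
		leading (@sel$3 t2)
		use_nl  (@sel$3 t1)
	*/
	*
FROM
	q1
JOIN
	q2
ON	t1_i1 + t2_i1 = t3_i1 + t4_i1
;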

So, next time you write a complicated piece of ANSI, make sure you think carefully about what you’re going to have to do if you subsequently have to add hints to force a particular execution plan. (And bear in mind that one day the transformation engine might be modified to transform the query differently.)

[Further reading on "ignoring hints"]

Distributed Queries – 2

I have often said that the optimizer “forgets” that it is dealing with a distributed query once it has collected the stats that it can about the objects in the query, and that as a consequence the driving site for a distributed query will be the local database unless you use the /*+ driving_site */ hint to change it.
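
For reference, the hint takes the alias of a table that lives at the site where you want the join to run. A minimal sketch, in which the database link name remote_db, the local table t2 and the join columns are all hypothetical:

select
	/*+ driving_site(away) */	-- run the join where the table aliased "away" lives
	home.small_vc,
	away.small_vc
from
	t2		home,		-- local table
	t1@remote_db	away		-- remote table, reached through a database link
where
	away.id = home.id
;

With the hint in place the remote database becomes the driving site, and it is the local table’s data that has to travel across the link.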

While investigating an oddity with a distributed query between two 11.1.0.7 databases a few days ago, I noticed something in the 10053 trace file that made me change my mind, and go back to look at earlier versions of Oracle.

Here are two sections extracted from a 10053 trace file running under 10.2.0.3 with CPU costing (system statistics) enabled:

SINGLE TABLE ACCESS PATH
  Table: T1  Alias: AWAY
    Card: Original: 3240  Rounded: 41  Computed: 40.50  Non Adjusted: 40.50
  Access Path: TableScan
    Cost:  53.22  Resp: 53.22  Degree: 0
      Cost_io: 53.00  Cost_cpu: 2073815
      Resp_io: 53.00  Resp_cpu: 2073815
  Access Path: index (AllEqRange)
    Index: 0
    resc_io: 4.00  resc_cpu: 29536
    ix_sel: 0.0125  ix_sel_with_filters: 0.0125
    Cost: 4.00  Resp: 4.00  Degree: 1
  Remote table cost added, new values: cost 4.00  resc 4.00  resp .2f  Best:: AccessPath: IndexRange  Index: 0        <<===
         Cost: 4.00  Degree: 1  Resp: 4.00  Card: 40.50  Bytes: 0

...

  HA cost: 50.54
     resc: 50.54 resc_io: 50.00 resc_cpu: 5187262
     resp: 50.54 resp_io: 50.00 resp_cpu: 5187262
  Cost adjustment for NL join with remote table: 0.72             <<===
Join order aborted: cost > best plan cost
***********************

Note the two lines with the reference to “remote” (I’d highlight them properly, but you can’t do highlighting and code in the same text). Notice also that one of the programmers made a bit of a mistake with their printf() call in the first of the lines (the “resp .2f” where a number should be) – a bug that is still there in 11.1.0.6.

Clearly Oracle is doing some arithmetic relating to the costs of accessing distributed data from at least 10.2.0.3 (there was nothing similar in the equivalent trace file for 9.2.0.8, and I don’t have a 10.1 available for testing). Unfortunately I have yet to see a single distributed execution plan where it does the right thing – but that might be a problem related to histograms (and the failure to use them) rather than a defect in the algorithms for distributed cost.

I’ll have to spend some time looking at what it does before I can write any more about it – but given the number of times I’ve said the optimizer doesn’t do any arithmetic I thought it was important to point out that I was wrong as soon as I discovered the change.

Footnote: I have added a category “distributed” to my list of categories – and added a link to it at the bottom of every article I’ve written about distributed SQL. That’s a pattern that I may copy across other articles in the future – especially if I can find out how to order the articles by date (ascending).

[Further reading on distributed databases]

Index Join

One of the less well known access paths available to the optimizer is the “index join” also known as the “index hash join” path. It’s an access path that can be used when the optimizer decides that it doesn’t need to visit a table to supply the select list because there are indexes on the table that, between them, hold all the required columns. A simple example might look something like the following:


create table indjoin
as
select
	rownum	id,
	rownum	val1,
	rownum	val2,
	rpad('x',500) padding
from
	all_objects
where
	rownum <= 3000
;

create unique index ij_v1 on indjoin(id, val1);
create unique index ij_v2 on indjoin(id, val2);

-- collect stats on the table and indexes

select
	ij.id
from
	indjoin		ij
where
	ij.val1 between 100 and 200
and	ij.val2 between 50 and 150
;

Note that the columns in the where clause appear in (some) indexes, and the column(s) in the select list exist in (at least) some indexes. Under these circumstances the optimizer can produce the following plan (the test script was one I wrote for 8i – but this plan comes from an 11.1 instance):


---------------------------------------------------------------------------
| Id  | Operation              | Name             | Rows  | Bytes | Cost  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                  |     3 |    36 |    24 |
|   1 |  VIEW                  | index$_join$_001 |     3 |    36 |    24 |
|*  2 |   HASH JOIN            |                  |       |       |       |
|*  3 |    INDEX FAST FULL SCAN| IJ_V1            |     3 |    36 |    11 |
|*  4 |    INDEX FAST FULL SCAN| IJ_V2            |     3 |    36 |    11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(ROWID=ROWID)
   3 - filter("VAL1"<=200 AND "VAL1">=100)
   4 - filter("VAL2"<=150 AND "VAL2">=50)

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "ID"[NUMBER,22]
   2 - (#keys=1) "ID"[NUMBER,22]
   3 - ROWID[ROWID,10], "ID"[NUMBER,22]
   4 - ROWID[ROWID,10]

We do a fast full scan of the two indexes extracting the rowid and id from index ij_v1 and just the rowid from index ij_v2. We can then get the result we want by doing a hash join between these two result sets on the rowid values because any time the two rowsources have a rowid in common, it’s a rowid for a row where val1 is between 100 and 200, and val2 is between 50 and 150 and the first rowsource is carrying the id - which is the thing we need to report.
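
If the optimizer doesn’t choose this path unprompted, the index_join hint asks for it. A sketch against the indjoin table above; the hint names the table alias followed by the indexes to combine, and (as with any hint) the path still has to be legal for the query:

select
	/*+ index_join(ij ij_v1 ij_v2) */
	ij.id
from
	indjoin		ij
where
	ij.val1 between 100 and 200
and	ij.val2 between 50 and 150
;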

There are a couple of little observations that we can make about this example.

    First, although I’ve only used two indexes in this example Oracle is not limited to just two indexes. The number of indexes that could be used is effectively unlimited.
    Second, the index_join path is strictly limited to cases where the optimizer can see that every column in the query can be found in indexes on the table.
    Third, although my example uses index fast full scans that’s not a necessary feature of the plan. Just like any other hash join, Oracle could use an index range (or full) scan to get some of the data.
    Finally, there are clearly a couple of bugs in the code.

Bugs:

If you check the rows/bytes columns in the plan you’ll see that the predicted number of rows selected is the same for both indexes (lines 3 and 4) – but we extract the rowid and the id from the first index (projection detail for line 3), so the total data volume expected from line 3 is slightly larger than the total data volume from line 4 where we extract only the rowid; theoretically, therefore, the optimizer has used the tables (indexes) in the wrong order – the one supplying the smaller volume of data should have been used as the first (build) rowsource.

More significantly, though, a quick check of the code that generates the data tells you that each index will supply 101 rows to the hash join – and you can even show that for other query execution plans the optimizer will calculate this cardinality (nearly) correctly. In the case of the index join the optimizer seems to have lost the correct individual cardinalities and has decided to use the size of the final result set as the cardinality of the two driving index scans.
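
The row counts are easy to confirm directly. Since id, val1 and val2 are all copies of rownum, each range predicate on its own should match 101 rows, and the two together should match the 51 rows from 100 to 150 (assuming all_objects supplied the full 3,000 rows when the table was built):

select
	count(case when val1 between 100 and 200 then 1 end)	v1_rows,	-- rows supplied by ij_v1
	count(case when val2 between 50 and 150 then 1 end)	v2_rows,	-- rows supplied by ij_v2
	count(case when val1 between 100 and 200
		    and val2 between 50 and 150 then 1 end)	both_rows	-- rows surviving the hash join
from
	indjoin
;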

There’s more, of course – one of the strangest things about the index join is that if your select list includes the table’s rowid, the optimizer doesn’t consider that to be a column in the index. So even though the predicate section of the plan shows the join on rowid, and the projection section shows the rowids being passed up to the hash join, Oracle won’t use an index join for a query returning the rowid!
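
A variation on the earlier query shows the restriction. This is a sketch; going by the description above, adding the rowid to the select list should stop the index join appearing even with the hint in place, though other versions may behave differently:

select
	/*+ index_join(ij ij_v1 ij_v2) */
	ij.rowid,	-- the extra column that blocks the index join
	ij.id
from
	indjoin		ij
where
	ij.val1 between 100 and 200
and	ij.val2 between 50 and 150
;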

Footnote: The reason I’ve written this brief introduction to the index join is because an interesting question came up at the first E2SN virtual conference.

“If you hint an index hash join, is there any way of telling Oracle the order in which it should use the indexes?”

The answer is no – but there are ways of creating code that will do what you want, and that will be the topic of my next blog.

[Further reading on Index Joins]

Local Indexes – 2

In the previous note on local indexes I raised a couple of questions about the problems of different partitions holding different volumes of data, and supplied a script to build some sample data that produced the following values for blevel across the partitions of a list-partitioned table.

INDEX_NAME           PARTITION_NAME           BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------------- -------------------- ---------- ----------- ----------
T1_ID                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          33       9000
                     P4                            1         339      90000
                     P5                            2        3384     900000

T1_N1                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          32       9000
                     P4                            1         314      90000
                     P5                            2        3136     900000

Q1: What do you think Oracle will record as the blevel at the global level for the two indexes ?
A1: As one of the commentators suggested, it seems to be the highest blevel recorded for any partition – in this case 2. (It’s possible that this assumption is wrong, of course, there may be some subtle weighting calculation involved – but I haven’t yet tested that hypothesis.)

Q2: If you have a query with a where clause like “id between 100 and 400 and n1 != 5” – which is designed very precisely to exclude the last, very big, partition – what value of blevel is Oracle going to use when considering the cost of using the index t1_id to access the data ?
A2: As I pointed out in an earlier note on list partitioned tables, Oracle doesn’t recognise the (obvious to the human eye) option for partition pruning in this predicate, so it uses the global blevel in the calculations.
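
The first answer can be checked against the data dictionary. A minimal sketch, assuming the t1 table and its local indexes from the previous note are in the current schema; it puts the global blevel next to the highest partition-level blevel for each index:

select
	ind.index_name,
	ind.blevel		global_blevel,
	max(prt.blevel)		max_partition_blevel
from
	user_indexes		ind,
	user_ind_partitions	prt
where
	ind.table_name = 'T1'
and	prt.index_name = ind.index_name
group by
	ind.index_name,
	ind.blevel
order by
	ind.index_name
;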

The second answer is the one that is causing me a problem, because I have a client system where almost all the data is in a “dead” partition: it has a status, stored as the partition key in a list-partitioned table, of “COMPLETE”. Lots of their code includes the predicate status != ‘COMPLETE’, but this can make the optimizer take the wrong execution path because it uses a global blevel that has been dictated by the huge volume of data that we know we don’t want to see.

The client queries are fairly complex, of course, but here’s a very trivial example demonstrating the basic problem (using the data generated by the code in the previous note – running under 11.1.0.6):

set autotrace traceonly explain

select	*
from	t1
where
	id = 99
and	n1 != 5
;

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |    14 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |    14 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |    14 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |    13 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

From Oracle’s perspective it has to visit all six partitions because it can’t use the most appropriate index and do partition pruning – and the final cost of this simple query is 14 because the value used (six times, in effect) for the blevel in the calculations is two; but we have inside information that tells us that this is essentially an unreasonable cost.

If Oracle were to believe that a more appropriate blevel for this query was just one then the cost would drop significantly (although in this case the plan wouldn’t change):

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |     8 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |     8 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |     8 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |     7 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

Of course for a really big system, where the “dead” partition was 200 Million rows, we might have a blevel of three:

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |    20 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |    20 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |    20 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |    19 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

Note how changing the global blevel by one makes the cost change by six – a consequence of the fact that we have six partitions with no pruning. If you’re having trouble with queries against partitioned tables that don’t use the right index, take a close look at the data volumes and values recorded for blevel at the global, partition and subpartition levels – it’s possible that you’re suffering from a bias introduced by one partition being much larger than all the rest.
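
The three index costs above (7, 13 and 19) are consistent with a very simple model: one block visit per level of blevel in each of the six partitions, plus one for the leaf block, with one more added for the table visit. That model is an inference from the plans rather than a documented formula, but it reproduces the figures:

select
	blevel,
	6 * blevel + 1		index_cost,	-- six partitions, no pruning
	6 * blevel + 1 + 1	total_cost	-- plus one table block
from	(
	select rownum blevel from dual connect by rownum <= 3
	)
;

For blevel 1, 2 and 3 this gives index costs of 7, 13 and 19 and total costs of 8, 14 and 20, matching the three plans.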

If you’re wondering how I got these plans (without simply editing them) it was by using dbms_stats.set_index_stats() to change the stored statistics – see “Copy Stats” for an example of the type of code needed. In cases like this, where I have better information about the data and the intent of the code than the optimizer has, I am perfectly happy to give a “more truthful” picture of the data to the optimizer by writing scripts to adjust statistics.
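
A minimal sketch of the kind of call involved, assuming the index is in the current schema. Only the global-level blevel is changed; indlevel is the dbms_stats parameter that corresponds to blevel, and no_invalidate => false is an addition here so that existing cursors get re-optimised:

begin
	dbms_stats.set_index_stats(
		ownname		=> user,
		indname		=> 'T1_ID',
		indlevel	=> 1,		-- pretend the index is one level shallower
		no_invalidate	=> false	-- invalidate dependent cursors immediately
	);
end;
/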

There are three drawbacks to such an approach, of course. First: on the next upgrade the optimizer might get smarter and make my clever little hack a liability rather than a benefit; secondly, there may be examples of application code that I haven’t noticed that might go wrong because of my hack; finally, and more importantly in the short term, I have to make sure that my code runs every time the statistics on the index are modified by any other program (such as the automatic stats collection job).

But the principle is sound – if we understand the system better than the optimizer then it’s positively important to help the optimizer in the most truthful way possible. List partitions (in a way similar to frequency histograms) are an obvious target for this type of treatment.

Local Indexes

In a recent article about list partitioned tables I raised some questions about the cases where the optimizer can’t (yet) do partition pruning even when the opportunity is clearly visible to the human eye. The most important example was the case where each partition was defined to hold rows for just one partition key value – but the optimizer could not prune out the redundant partition for a query like: “partition_key != {constant}”.

I recently came across a situation where this really made a big difference. The system had a huge table that had been list partitioned as follows (with some camouflage):


partition by list (status) (
	partition p_state01 values  ('STATE01'),
	partition p_state02 values  ('STATE02'),
	    ....
	partition p_state25 values  ('STATE25'),
	partition p_handled values  ('Completed')
)

The table was defined to allow row movement, and every day there would be tens of thousands of rows moving through various states until they reached the “Completed” state.

There are various pros and cons to this setup. The most significant con is that when you update the status of a row Oracle actually has to update the row “in situ”, then delete it from what is now the wrong partition and insert it into the right partition. The most dramatic pro is that if the rows you’re interested in are (almost always) the ones that haven’t got to the “Completed” state you’ve put all the boring old garbage out of the way where it doesn’t cause any problems. (In fact, if you’re running 11.2 you might choose to declare some of the “Completed” partitions of any local indexes as unusable and save yourself a lot of space – and by the time I’ve finished this article you might think this is a truly wonderful idea.) In the case of the client, there were about 200 million rows in the completed partition, and barely 2 million spread over the other partitions.

There was a bit of a problem, though. Some of the indexes on this table had been created as local indexes (arguably they should all have been local) and this resulted in some odd optimisation side effects. Here’s a little bit of code to build a table that demonstrates an interesting issue:

create table t1 (
	id,
	n1,
	small_vc,
	padding
)
partition by list (n1) (
	partition p0 values(0),
	partition p1 values(1),
	partition p2 values(2),
	partition p3 values(3),
	partition p4 values(4),
	partition p5 values(5)
)
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 10000
)
select
	rownum			id,
	trunc(log(10,rownum))	n1,
	lpad(rownum,10,'0')	small_vc,
	rpad('x',100)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 999999
;

create index t1_n1 on t1(n1, small_vc) local nologging;
create index t1_id on t1(id, small_vc) local nologging;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		estimate_percent => 100,
		method_opt 	 => 'for all columns size 1'
	);
end;
/

break on index_name skip 1

select
	index_name, partition_name, blevel, leaf_blocks, num_rows
from
	user_ind_partitions    -- but see comment #1 below from Tony Sleight
order by
	index_name, partition_name
;

Thanks to the log definition of column n1, you will see a very skewed distribution of data across the partitions, and the output from the query against the index partitions shows this quite dramatically. Since the code sample uses a 100% sample on the stats, you should get the following figures for the indexes, with a little variation in leaf blocks, perhaps, depending on your version and tablespace definitions. (I was using 11.1.0.6 with 8KB blocks, locally managed tablespaces, freelists, and 1MB uniform extents.)

INDEX_NAME           PARTITION_NAME           BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------------- -------------------- ---------- ----------- ----------
T1_ID                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          33       9000
                     P4                            1         339      90000
                     P5                            2        3384     900000

T1_N1                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          32       9000
                     P4                            1         314      90000
                     P5                            2        3136     900000
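A quick way to confirm the skew directly from the table, rather than from the index statistics, is to count the rows per partition key. This is only a sanity check against the t1 table built above; the counts should line up with the NUM_ROWS column in the listing.

```sql
-- Count rows per list-partition key; each value of n1 maps to one partition (p0 .. p5).
select
	n1, count(*)	row_count
from
	t1
group by
	n1
order by
	n1
;
```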

So here’s important question number 1: What do you think the blevel will be at the global level for the two indexes ?

Important question number 2: If you have a query with a where clause like “id between 100 and 400 and n1 != 5” – which is designed very precisely to exclude the last partition – what value of blevel is Oracle going to use when considering the cost of using the index t1_id to access the data ?

My answers are in this follow-up post.
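If you want to try the two questions on your own system before reading the answers, something like the following should do, assuming the t1 table and indexes built above. The select list in the second statement is an arbitrary choice; only the where clause comes from the question.

```sql
-- Question 1: the global-level statistics live in user_indexes.
select
	index_name, blevel, leaf_blocks, num_rows
from
	user_indexes
where
	table_name = 'T1'
order by
	index_name
;

-- Question 2: see which plan and cost the optimizer produces.
explain plan for
select
	small_vc
from
	t1
where
	id between 100 and 400
and	n1 != 5
;

select * from table(dbms_xplan.display);
```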

Good Nulls

I’ve often been heard to warn people of the accidents that can happen when they forget about the traps that appear when you start allowing columns to be NULL – but sometimes NULLs are good, especially when they help Oracle understand where the important (i.e. not null) data might be.

An interesting example of this came up on OTN a few months ago where someone was testing the effects of changing a YES/NO column into a YES/NULL column (which is a nice idea because it allows you to create a very small index on the YESes, and avoid creating a histogram to tell the optimizer that the number of YESes is small).
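To see why the YES/NULL index is so small, here is a minimal sketch using made-up names (t_flag, flag_idx) and an assumed ratio of one YES per thousand rows. A single-column B-tree index has no entries for rows where the column is null, so only the YES rows should appear in it.

```sql
-- Hypothetical table and index names, for illustration only.
create table t_flag
as
select
	rownum			id,
	case
		when mod(rownum,1000) = 0 then 'YES'	-- 1 row in 1,000 (assumed ratio)
	end			flag,			-- every other row is NULL
	rpad('x',100)		padding
from dual
connect by
	rownum <= 100000
;

create index flag_idx on t_flag(flag);

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 => 'T_FLAG',
		estimate_percent => 100,
		method_opt	 => 'for all columns size 1',
		cascade		 => true
	);
end;
/

-- The index should report only the non-null (YES) rows.
select
	index_name, num_rows, leaf_blocks
from
	user_indexes
where
	index_name = 'FLAG_IDX'
;
```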

They were a little puzzled, though, about why their tests showed Oracle using an index to find data in the YES/NO case, but not using the index in the YES/NULL case. I supplied a short explanation on the thread, and was planning to post a description on the blog, but someone on the thread supplied a link to AskTom where Tom Kyte had already answered the question, so I’m just going to leave you with a link to his explanation.

DOUG 2010 Presentation

You can find a pdf version of the presentation here: Performance tuning – for developers. Feel free to ask questions in the comments.
Thank you, Mary Elizabeth Mcneely, for the opportunity to talk at the Dallas Oracle User Group technology meetup.

_connect_by_use_union_all

This is just a short note on the parameter introduced in 11gR2 called _connect_by_use_union_all. I first noticed it in Doc ID 7210630.8, which gives a brief overview of the changes made to the way the CBO generates plans for hierarchical queries. As usually happens, the change helps with one problem, but produces [...]

Group by Hash aggregation

So, here I was, merrily enjoying OpenWorld 2010 presentations in SFO, when I got a call from a client about a performance issue. The client had recently upgraded from version 9i to version 10g in an E-Business environment. I had the privilege of consulting before the upgrade, so we set up the environment optimally, and the upgrade itself was seamless. The client did not see much regression except for one query: that query was running for hours in 10g compared to 15 minutes in 9i.

Review and Analysis

I reviewed the execution plan in the development database and did not see any issues with it. The execution plans in development and production looked decent enough. I wasn’t able to reproduce the issue in the development database either. So, the client allowed me to trace the SQL statement in the production database. Since the size of the data in a few tables differs between the production and development databases, we had to analyze the problem in the production environment.

I had to collect as much data as possible, as the tracing was a one-time thing. I set up a small script to capture the process stack and process memory map of that Unix dedicated server process to collect more details, in addition to tracing the process with waits => true.

The execution plan from the production database is printed below. [Review the execution plan carefully; it gives away the problem immediately.] One execution of this statement took 13,445 seconds of elapsed time, almost all of it spent on CPU. Why would the process consume 13,719 seconds of CPU time? The same process completed in just 15 minutes in 9i, as confirmed by Statspack reports. [As a side note, we collected an enormous amount of performance data in 9i in the production environment before upgrading to 10g, just so that we could quickly resolve any performance issues, and you should probably follow that guideline too.] That collection came in handy: it is clear that the SQL statement was completing in 15 minutes in 9i and took nearly 3.75 hours after upgrading the database to version 10g.

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch       10  13719.71   13445.94         27    5086407          0       99938
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       12  13719.71   13445.94         27    5086407          0       99938

     24   HASH GROUP BY (cr=4904031 pr=27 pw=0 time=13240600266 us)
     24    NESTED LOOPS OUTER (cr=4904031 pr=27 pw=0 time=136204709 us)
     24     NESTED LOOPS  (cr=4903935 pr=27 pw=0 time=133347961 us)
 489983      NESTED LOOPS  (cr=3432044 pr=27 pw=0 time=104239982 us)
 489983       NESTED LOOPS  (cr=2452078 pr=27 pw=0 time=91156653 us)
 489983        TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=1472112 pr=27 pw=0 time=70907109 us)
 489983         INDEX RANGE SCAN HR_LOCATIONS_UK2 (cr=981232 pr=0 pw=0 time=54338789 us)(object id 43397)
 489983        INDEX UNIQUE SCAN MTL_PARAMETERS_U1 (cr=979966 pr=0 pw=0 time=17972426 us)(object id 37657)
 489983       INDEX UNIQUE SCAN HR_ORGANIZATION_UNITS_PK (cr=979966 pr=0 pw=0 time=10876601 us)(object id 43498)
     24      INDEX RANGE SCAN UXPP_FA_LOCATIONS_N3 (cr=1471891 pr=0 pw=0 time=27325172 us)(object id 316461)
     24     TABLE ACCESS BY INDEX ROWID PER_ALL_PEOPLE_F (cr=96 pr=0 pw=0 time=2191 us)
     24      INDEX RANGE SCAN PER_PEOPLE_F_PK (cr=72 pr=0 pw=0 time=1543 us)(object id 44403)

pstack, pmap, and truss

Reviewing the pstack output generated by the script shows many calls to functions such as kghfrempty, kghfrempty_ex and qerghFreeHashTable, implying hash table operations. Is something to do with a hash table consuming the time?

 ( Only partial entries shown )
 0000000103f41528 kghfrempty
 0000000103f466ec kghfrempty_ex
 0000000103191f1c qerghFreeHashTable
 000000010318e080 qerghFetch
 00000001030b1b3c qerstFetch
...
 0000000103f41558 kghfrempty
 0000000103f466ec kghfrempty_ex
 0000000103191f1c qerghFreeHashTable
 000000010318e080 qerghFetch
 00000001030b1b3c qerstFetch

Truss of the process also showed quite a few mmap calls. So, is the process allocating more memory to a hash table?

...
mmap(0xFFFFFFFF231C0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xFFFFFFFF231C0000
...
pollsys(0xFFFFFFFF7FFF7EC8, 1, 0xFFFFFFFF7FFF7E00, 0x00000000) = 0
mmap(0xFFFFFFFF231D0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xFFFFFFFF231D0000
...

Execution plan again ..

Reviewing the execution plan again showed an interesting issue. I am going to post only two relevant lines from the execution plan below. As you can see, the elapsed time at the NESTED LOOPS OUTER step is 136 seconds. But the elapsed time at the next step, HASH GROUP BY, is 13,240 seconds, meaning roughly 13,100 seconds were spent in the HASH GROUP BY step alone! Why would the process spend 13,100 seconds in a group by operation? Retrieving the rows took only 136 seconds, but the group by operation took 13,100 seconds. That doesn’t make sense, does it?

     24   HASH GROUP BY (cr=4904031 pr=27 pw=0 time=13240600266 us)
     24    NESTED LOOPS OUTER (cr=4904031 pr=27 pw=0 time=136204709 us)
...
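The time= figures in these row source lines are cumulative, so each line includes the time of its children. The time attributable to the HASH GROUP BY step itself is therefore the difference between the two lines. As a small worked check, using only the two figures shown above (the column aliases are my own):

```sql
-- time= values are in microseconds; divide by 1,000,000 to get seconds.
select
	round(13240600266 / 1000000)			hash_gby_total_s,
	round(136204709 / 1000000)			nl_outer_total_s,
	round((13240600266 - 136204709) / 1000000)	hash_gby_self_s
from dual;
```

This returns 13241, 136 and 13104 seconds respectively.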

OFE = 9i

Knowing that the time is spent in the group by operation and that the new 10g hash grouping method is in use, I decided to test this SQL statement with the 9i optimizer. The SQL completed in 908 seconds with OFE (optimizer_features_enabled) set to 9.2.0.8 (the data is a little different since production is an active environment). You can also see that the SORT technique is used to group the data.

alter session set optimizer_features_enabled='9.2.0.8';

Explain plan :
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch   106985    887.41     908.25     282379    3344916        158     1604754
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   106987    887.41     908.25     282379    3344916        158     1604754

      4   SORT GROUP BY (cr=2863428 pr=0 pw=0 time=37934456 us)
      4    NESTED LOOPS OUTER (cr=2863428 pr=0 pw=0 time=34902519 us)
      4     NESTED LOOPS  (cr=2863412 pr=0 pw=0 time=34198726 us)
 286067      NESTED LOOPS  (cr=2003916 pr=0 pw=0 time=24285794 us)
 286067       NESTED LOOPS  (cr=1431782 pr=0 pw=0 time=19288024 us)
 286067        TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=859648 pr=0 pw=0 time=13568456 us)
 286067         INDEX RANGE SCAN HR_LOCATIONS_UK2 (cr=572969 pr=0 pw=0 time=9271380 us)(object id 43397)
 286067        INDEX UNIQUE SCAN MTL_PARAMETERS_U1 (cr=572134 pr=0 pw=0 time=4663154 us)(object id 37657)
...

Knowing the problem is in the GROUP BY step, we set up a SQL profile with _gby_hash_aggregation_enabled set to FALSE, disabling the new 10g feature for that SQL statement. With the SQL profile in place, performance of the SQL statement is comparable to the pre-upgrade timing.
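The SQL profile itself is not reproduced here. For testing in a single session, two simpler switches may be worth trying before building a profile. This is a sketch of those alternatives, not the method used at the client; some_table and some_col are placeholder names, and hidden parameters should normally be changed only with guidance from Oracle Support.

```sql
-- Option 1: session-level switch (a hidden parameter, so it must be double-quoted).
alter session set "_gby_hash_aggregation_enabled" = false;

-- Option 2: statement-level hint asking for sort-based aggregation.
-- some_table and some_col are hypothetical placeholders.
select	/*+ no_use_hash_aggregation */
	some_col, count(*)
from
	some_table
group by
	some_col
;
```

With either switch in effect, the plan should show SORT GROUP BY in place of HASH GROUP BY.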

This almost sounds like a bug! Bug 8223928 matches this stack, but it is the opposite. Well, the client will work with Oracle Support to get a bug fix for this issue.

Summary

In summary, you can use scientific methods to debug performance issues. Revealing the details underneath will enable you to come up with a workaround quickly, leading to a faster resolution.
Note that I am not saying the hash group by feature is bad. Rather, we seem to have encountered an unfortunate bug which caused performance issues at this client. I think hash grouping is a good feature, as the efficiency of grouping operations can be improved if you have an ample amount of memory. That’s the reason why we disabled this feature at the statement level, NOT at the instance level.
This blog in a traditional format: hash_group_by_orainternals

Update 1:

I am adding a script to capture pmap and pstack output in a loop, 1000 times at 10-second intervals. Tested on Oracle Solaris.

#! /bin/ksh
 pid=$1
 (( cnt=1000 ))
 while  [[ $cnt -gt 0 ]];
  do
        date
        pmap -x $pid
        pstack $pid
        echo $cnt
        (( cnt=cnt-1 ))
        sleep 10
  done

To call the script: assuming 7887 is the UNIX pid of the process.
nohup ./pmap_loop.ksh 7887 >> /tmp/a1.lst 2>>/tmp/a1.lst &

Syntax for the truss command is given below. Please remember, you can’t use pmap, pstack and truss concurrently. These commands stop the process (however briefly!) to inspect it, so use them sparingly. [I had a client who used to run truss on the LGWR process on a continuous(!) basis, and the database used to crash randomly!] I realize that pmap/pstack/truss can be scripted to work together, but that would involve submitting a background process for the truss command and killing that process after a small timeout window. That would be a risky approach in a production environment, so I prefer to run the truss command manually and CTRL+C it after a few seconds.

truss -d -E -o /tmp/truss.lst -p 7887

I cannot stress enough: do not overuse these commands in a production environment. strace (Linux) and tusc (HP) are the commands comparable to truss (Solaris).