Search

OakieTags

Who's online

There are currently 0 users and 48 guests online.

Recent comments

Oakies Blog Aggregator

Column Groups - Edge Cases

Oracle 11g added Extended Statistics support for column groups in order to detect correlated columns for filter predicates using an equal comparison.

Note that Oracle 11g also added the ability to use the number of distinct keys of a composite index as an upper limit for the cardinality estimates for matching column predicates, which means that the optimizer is now capable of detecting correlated columns without the explicit addition of Extended Statistics / Column Groups.

The "composite index" feature was also backported to 10.2.0.4 and 10.2.0.5 but needs to be activated explicitly via a FIX_CONTROL. You can read more about that on Jonathan Lewis' blog.

The "Column Group" feature (either by explicit Extended Statistics or via a matching composite index) however has a weakness in a very special case: If there is only a single distinct value in the statistics then the "out-of-range" detection of the optimizer is not working correctly.

This can be easily demonstrated with a very simple test case:

set echo on linesize 200 pagesize 999 tab off trimspool on trimout on

drop table t;

purge table t;

create table t
as
select
rownum as id
, mod(rownum, 1) + 1 as attr1
, mod(rownum, 1) + 1 as attr2
, rpad('x', 100) as filler
from
dual
connect by
level <= 1000000
;

select
attr1
, attr2
, count(*) as cnt
from
t
group by
attr1
, attr2
;

exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1')

create index t_idx on t (attr1, attr2);

explain plan for
select * from t where attr1 = 1 and attr2 = 1;

select * from table(dbms_xplan.display(null, null, 'BASIC PREDICATE ROWS'));

explain plan for
select * from t where attr1 = 0 and attr2 = 0;

select * from table(dbms_xplan.display(null, null, 'BASIC PREDICATE ROWS'));

explain plan for
select * from t where attr1 = 2 and attr2 = 2;

select * from table(dbms_xplan.display(null, null, 'BASIC PREDICATE ROWS'));

explain plan for
select * from t where attr1 = 100000 and attr2 = 100000;

select * from table(dbms_xplan.display(null, null, 'BASIC PROJECTION ROWS'));

explain plan for
select * from t where attr1 = -100000 and attr2 = 100000;

select * from table(dbms_xplan.display(null, null, 'BASIC PROJECTION ROWS'));

The relevant output is:

SQL> select
2 attr1
3 , attr2
4 , count(*) as cnt
5 from
6 t
7 group by
8 attr1
9 , attr2
10 ;

ATTR1 ATTR2 CNT
---------- ---------- ----------
1 1 1000000

select * from t where attr1 = 1 and attr2 = 1;

------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------
| 0 | SELECT STATEMENT | | 1000K|
|* 1 | TABLE ACCESS FULL| T | 1000K|
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=1 AND "ATTR2"=1)

select * from t where attr1 = 0 and attr2 = 0;

------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------
| 0 | SELECT STATEMENT | | 1000K|
|* 1 | TABLE ACCESS FULL| T | 1000K|
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=0 AND "ATTR2"=0)

select * from t where attr1 = 2 and attr2 = 2;

------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------
| 0 | SELECT STATEMENT | | 1000K|
|* 1 | TABLE ACCESS FULL| T | 1000K|
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=2 AND "ATTR2"=2)

select * from t where attr1 = 100000 and attr2 = 100000;

------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------
| 0 | SELECT STATEMENT | | 1000K|
|* 1 | TABLE ACCESS FULL| T | 1000K|
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=100000 AND "ATTR2"=100000)

select * from t where attr1 = -100000 and attr2 = 100000;

------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------
| 0 | SELECT STATEMENT | | 1000K|
|* 1 | TABLE ACCESS FULL| T | 1000K|
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=(-100000) AND "ATTR2"=100000)

As you can see the optimizer still thinks to return "all" rows even in obvious cases where no rows will be returned.

In 11g the problem does not appear if there is more than a single distinct value, however the 10g backport seems to produce similar silly estimates even with more than a single distinct value.

One possible workaround is to generate a histogram on the column group in 11g. This way the out-of-range detection works again as expected for the single distinct value case.

It's also possible to generate a histogram on one or more of the affected columns each to avoid the problem, but this "breaks" the column group and correlation detection and hence might lead to worse cardinality estimates in the case of more than a single distinct value.

Note that when generating histograms you need to carefully check if you run into the problems described here: If you end up with a histogram containing only a single bucket but the underlying basic column statistics show more than a single distinct value then in principle unfortunately again the same issue will show up as demonstrated in the other blog post.

Note that this inconsistency between basic column statistics (Number Of Distinct Values, NDV) and the corresponding histogram is a consequence of how 11g gathers statistics when using the AUTO_SAMPLE_SIZE default.

Maria Colgan has just published a note about that.

Friendly Reminder - Free Webinar

This is just a friendly reminder that tomorrow my free Webinar on Cost-Based Optimizer Basics will take place at AllThingsOracle.com.

Repairman Jack : Hosts

Hosts is the fifth book in the Repairman Jack series by F. Paul Wilson.

Jack is reunited with his sister, just as a new virus threatens to link all human brains into a single collective hive mind, turning the human race into a docile petri dish living for the greater good of the virus.

Perhaps it’s my mood, but this story is just a little too depressing for my tastes. I’m hoping the rest of the series won’t degenerate into this type of doom and gloom. It needs a little more wit and humor to lift it.

Cheers

Tim…




Stats Collection

This is just a temporary note to point you to a poll set up by Timur Akhmadeev to get an idea of how people are handling stats collection in newer versions of Oracle.

 

METHOD_OPT

Here is a poll prompted by today twitter talks on the default METHOD_OPT value (which is ‘for all columns size auto’ since 10g) and automatic statistics gathering job (auto-task since 11g). CBO development team suggests to use the default job to gather statistics with default options. Do you use the job exactly this way with real production databases? I’m very interested to see some numbers. Please share the poll link http://poll.fm/3n7pn so that more people vote. Thanks.

Filed under: Oracle Tagged: defaults, statistics

Do you think Dropbox are getting nervous about Google’s GDrive?

Just got this message today…

Hi Timothy,

You can now earn twice as much free space by inviting your friends!

For each friend that installs Dropbox, you’ll both get 500 MB of free space. You can earn up to 16 GB.

Invite your friends now!

Thanks for spreading the Dropbox love,
- The Dropbox Team

P.S. Already invited a bunch of people? Don’t worry! You’ll get credited for all of them.

Do you think Dropbox are getting nervous about Google’s GDrive? :)

Cheers

Tim…




2012年4月6日 Oracle OpenWorld Unconference presented by JPOUG 開催

おもしろった!

・EC2上でRAC構築を無理やりやる。 無意味なのがおもしろい!
・OracleとMySQLのベンチマーク比較。エバンジェリスト(少しアングラ系)がいました。最高でした!
・StatsPackをつまらない!といいながらStatsPack愛を感じるセミナー。
・IndexOnlyAccessだけに特化したパフォーマンスデモだけをボクトツとやり続けながら、何か言い続けているデモ。
・バインド変数を使うな! Oコンサルとのバトル。
・FullScanのIO、3種類。OS側から見ると、、、突っ込みどころ満載だけど、突っ込ませない技ありデモ。
・途中でハングさせる(する?)、よく分からないけど、課題を与えて終わるデモ。

JPOUGのセミナーから、目を離せない。http://www.jpoug.org/

The latest website move…

Yesterdays move of the website went pretty smoothly from a technical perspective. I had done a trial run using a different domain name so I didn’t foresee any problems on that front. During the trial, the DNS propagation took less than an hour, but the real run took a little longer. Typical. :)

Let’s hope this marks the end of me harping on about hosting companies and website outages…

Cheers

Tim…




The proof is out there

I’d like to thank Jonathan Lewis for taking up my challenge and writing up a proof that there is indeed a formula that solves Cary Millsap‘s string problem for not just circles, but for any regular polygon.

Like Jonathan, I found the problem intriguing, and “wasted” a few hours on a Saturday afternoon discovering the formula.

This blog entry isn’t about the formula or proof — it’s rather about the process I used to discover it.

You see, I didn’t know that such a formula actually existed before I set out — but I was convinced that there might be. After all, it would be neat if there was one, wouldn’t it?

This is a common way that I approach problems, especially in the Oracle programming and tuning world — usually I assume the existence of a solution and then I go about trying to find it. One of the things I’m constantly surprised at is how many people “give up” on these kinds of problems, start guessing, guess wrong, and then decide that such a solution either doesn’t exist, or that they’re not smart enough to find it. This post is an attempt to debunk the idea that finding such solutions requires any special smarts — that today there are so many ways to learn and discover that anyone can do it. It just takes perseverance and a bit of dedication.

I started out on my quest to find the formula for regular polygons by searching Wikipedia for “Regular Polygon”

http://en.wikipedia.org/wiki/Regular_polygon

About half-way down the page, I found the following image:

The little letter a looked exactly like what I was looking for — something that resembled the radius of a circle. Before reading this article I had no idea was it was called — from this article I learned it was called the apothem.

From there, I clicked on the link to the definition of the apothem, trying to see if there was a way to calculate its length somehow and relate it to the perimeter of the polygon.

http://en.wikipedia.org/wiki/Apothem

In the article, a way to calculate the apothem based on the length of the sides and the number of them is presented:

a = s / (2 tan (pi / n))

with the comment that the formula can be used when only the perimeter (p) and the number of sides (n) is known because s = p / n.

From here, the rest becomes relatively simple, as I substituted s = p / n for s in the formula for the apothem.

a = (p / n) / (2 tan (pi / n))

Given a new apothem of length a’ (where a’ = a + h); we can work through the formulas to determine the difference between our new perimeter p’ and the original perimeter p.

We start by isolating the perimeter:

a = p / (2n tan (pi/n))

2an tan (pi/n) = p

or

p = 2an tan(pi/n)

Looking at our new perimeter (p’), using the new apothem (a’):

p’ = 2a’n tan(pi/n)

And we want p’ – p (the difference between the perimeters)

p’ – p = 2a’n tan(pi/n) – 2an tan(pi/n)

Factor out the 2n tan (pi/n) which is common to both terms on the right:

p’ – p = (2n tan(pi/n)) (a’ – a)

And since we know that a’ simply equals a + h (the “height” we’re raising the string):

p’ – p = (2n tan(pi/n))(a + h – a)

(This was my aha! moment :-) , sorry, bad joke)

p’ – p = (2n tan(pi/n))(h)

or

p’ – p = 2nh tan(pi/n)

Voila, a solution that only requires the number of sides and the “height”.

I was satisfied that this solution matched Cary’s just by simply trying out the formula with 1,000-sided polygon and seeing how it matched — Jonathan went beyond and showed how the approximation of the tan function using a Taylor series (actually a MacLaurin series) gets you to 2hpi.

Why all this focus on a simple problem?

Mainly because it’s been interesting to see how people approach solving it and — in my mind, more interesting — how they attempt to generalize it in a way that makes it really applicable.

When I’m confronted with a particularly challenging Oracle problem, I generally assume there’s a pretty easy way out — I can’t be the first person to ever have encountered such a problem, and the database developers have probably thought of it too, so I usually start with the “Big 3″ manuals:

The Concepts Guide, the SQL Reference Guide, and the PL/SQL Packages and Types Reference.

  • Want to tokenize a string? Check out CTX_DOC.TOKENS
  • Want to know the “distance” between 2 strings? Check out UTL_MATCH — complete with Levenshtein and Jaro-Winkler differences
  • Want to do linear algebra? Check out UTL_NLA
  • Want to “smash” an XML document into something you can select against like a table? Check out XMLTABLE
  • Want to implement a k-means clustering algorithm? Check out DBMS_DATA_MINING

Don’t re-invent the wheel, and don’t assume that the problem is unsolvable — while we might not always stand on the shoulders of giants, we should start by assuming our problem isn’t unique, and that the proof or truth is out there…

Thanks for attending the webinar!

Thanks to everyone who attended my two Embarcadero sponsored webinar sessions. I promised to make the slides and some scripts demonstrated in the slides available so here they are:

Slide deck (pdf format)
Scripts (zip format)

Keep an eye out for more webinars I'll be conducting in the coming months!