A recent comment on a note I wrote some time ago about faking histograms asked about the calculations of selectivity in the latest versions of Oracle. As I read the question, I realised that I had originally supplied a formula for calculating cardinality, rather than selectivity, so I thought I’d supply a proper example.
We’ll start with a script to create some data and stats – and I’m going to start with a script I wrote in Jan 2001 (which is why it happens to use the analyze command rather than dbms_stats.gather_table_stats, even though this example comes from an instance of 18.104.22.168).
create table t1 (
	skew,	skew2,	padding
)
as
select r1, r2, rpad('x',400)
from
(
	select /*+ no_merge */
		rownum r1
	from all_objects
	where rownum <= 80
) v1,
(
	select /*+ no_merge */
		rownum r2
	from all_objects
	where rownum <= 80
) v2
where r2 <= r1
order by r2,r1
;

alter table t1 modify skew not null;
alter table t1 modify skew2 not null;

create index t1_skew on t1(skew);

analyze table t1 compute statistics
	for all indexed columns size 75;
The way I’ve created the data set, the column skew has one row with the value 1, two rows with the value 2, and so on up to 80 rows with the value 80. I’ve put in a bid to collect a histogram of 75 buckets (which is the default, by the way, for the analyze command) on any indexed columns. (Interestingly, the resulting histogram on column skew held only 74 buckets.)
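As a quick sanity check on that description, here is a small Python sketch (not part of the original script) that mimics the shape of the data. It assumes each inline view returns the numbers 1 to 80 and that the join keeps only the pairs where r2 <= r1, which is what the description of the skew column implies.

```python
from collections import Counter

# Assumption: each inline view yields 1..80 and the join keeps r2 <= r1,
# so that the value k appears k times in the skew column.
N = 80
rows = [(r1, r2) for r1 in range(1, N + 1) for r2 in range(1, N + 1) if r2 <= r1]

# Count how many rows carry each value of skew (the r1 column).
skew_counts = Counter(r1 for r1, _ in rows)

total_rows = len(rows)
print(total_rows)                                        # 3240
print(skew_counts[1], skew_counts[2], skew_counts[80])   # 1 2 80
```

The total of 3,240 matches the original cardinality that the optimizer reports for the table.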
To demonstrate the calculation of selectivity, I then enabled the 10053 trace and ran a query to select one of the “non-popular” values (i.e. a value with a fairly small number of duplicates). The section of the trace file I want to talk about appears as the “Single Table Access Path”.
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T1[T1]
  Column (#1):
    NewDensity:0.008940, OldDensity:0.006757 BktCnt:74, PopBktCnt:31, PopValCnt:15, NDV:80
  Column (#1): SKEW(
    AvgLen: 2 NDV: 80 Nulls: 0 Density: 0.008940 Min: 1 Max: 80
    Histogram: HtBal  #Bkts: 74  UncompBkts: 74  EndPtVals: 59
  Table: T1  Alias: T1
    Card: Original: 3240.000000  Rounded: 29  Computed: 28.96  Non Adjusted: 28.96
  Access Path: TableScan
    Cost:  32.00  Resp: 32.00  Degree: 0
      Cost_io: 32.00  Cost_cpu: 0
      Resp_io: 32.00  Resp_cpu: 0
  Access Path: index (AllEqRange)
    Index: T1_SKEW
    resc_io: 30.00  resc_cpu: 0
    ix_sel: 0.008940  ix_sel_with_filters: 0.008940
    Cost: 30.00  Resp: 30.00  Degree: 1
  Best:: AccessPath: IndexRange
  Index: T1_SKEW
         Cost: 30.00  Degree: 1  Resp: 30.00  Card: 28.96  Bytes: 0
A few points to notice on line 4:
The NDV (number of distinct values) is 80; this means that the Density, in the absence of a histogram, would be 1/80 = 0.0125
We have a NewDensity and an OldDensity. The OldDensity is the value that would have been reported simply as Density in 10g or 9i, and is derived using the mechanism I described in “Cost Based Oracle – Fundamentals” a few years ago. Informally this was:
sum of the square of the frequency of the non-popular values /
(number of non-null rows * number of non-popular non-null rows)
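To make the informal formula concrete, here is a minimal Python sketch. The function name, the toy frequencies and the choice of popular value are all hypothetical; it simply transcribes the formula as written above and makes no attempt to reproduce the OldDensity figure in the trace, since that would require knowing exactly which values the histogram treats as popular.

```python
def old_density(frequencies, popular_values):
    """Informal OldDensity formula, transcribed literally.

    frequencies    -- dict mapping each value to its number of non-null rows
    popular_values -- set of values treated as popular (an input here,
                      not something this sketch works out for itself)
    """
    non_null_rows = sum(frequencies.values())
    non_popular = [n for v, n in frequencies.items() if v not in popular_values]
    non_popular_rows = sum(non_popular)
    if non_null_rows == 0 or non_popular_rows == 0:
        # Guard for the sketch only; says nothing about what Oracle does.
        raise ValueError("need at least one non-popular, non-null row")
    sum_of_squares = sum(n * n for n in non_popular)
    return sum_of_squares / (non_null_rows * non_popular_rows)


# Hypothetical toy data: three rare values and one dominant (popular) value.
toy = {"A": 1, "B": 2, "C": 3, "D": 94}
print(round(old_density(toy, {"D"}), 6))   # (1 + 4 + 9) / (100 * 6) = 0.023333
```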
This formula for OldDensity is (I assume) a fairly subtle formula based on the expected number of rows for a randomly selected non-popular value in the presence of popular values. The NewDensity, however, seems to take the much simpler approach of “factoring out” the popular values. There are two ways you can approach the arithmetic: one is by thinking of the number of rows you expect for the query “column = ‘non-popular value’”, the other is by thinking of the number of non-popular values and then adjusting for the relative volume of non-popular values in the table.
Line 9 tells us there are 3240 rows in the table.
Line 4 tells us there are 80 (NDV) distinct values of which 15 (PopValCnt) are popular, and 74 (BktCnt) buckets of which 31 (PopBktCnt) contain popular values.
From this we determine that there are (80 – 15 = ) 65 non-popular values and (3240 * (74-31)/74 = ) 1883 non-popular rows.
Hence we infer that a typical non-popular value will report (1883 / 65 = ) 29 rows – which is the rounded cardinality we see in line 9.
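The same arithmetic as a short Python sketch, using only the figures reported in the trace (the variable names are mine, not Oracle's):

```python
# Figures taken from the trace file.
num_rows = 3240      # Card: Original
ndv = 80             # NDV
pop_val_cnt = 15     # PopValCnt
bkt_cnt = 74         # BktCnt
pop_bkt_cnt = 31     # PopBktCnt

# Values and rows left over once the popular values are factored out.
non_popular_values = ndv - pop_val_cnt
non_popular_rows = num_rows * (bkt_cnt - pop_bkt_cnt) / bkt_cnt

# Expected rows for a typical non-popular value.
rows_per_value = non_popular_rows / non_popular_values

print(non_popular_values)          # 65
print(round(non_popular_rows))     # 1883
print(round(rows_per_value))       # 29
```

Without the intermediate rounding the result is roughly 28.96, which agrees with the Computed figure in the trace.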
If we consider only non-popular values, then the selectivity is 1/(number of non-popular values) = 1/65.
But this selectivity applies to only 43 buckets of the 74 total bucket count.
To generate a selectivity that can be applied to the original cardinality of the table we have to scale it accordingly.
The selectivity, labelled the NewDensity, is (1/65) * (43/74) = 0.00894 – which is the value we see in line 4.
(Following this, of course, the cardinality for ‘column = constant’ would be 3,240 * 0.00894 = 29.)
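The selectivity version of the calculation can be sketched in Python as follows. The function name and parameter names are hypothetical; the inputs are the four histogram figures from the trace, and the arithmetic is my reading of how NewDensity appears to be derived rather than anything documented.

```python
def new_density(ndv, pop_val_cnt, bkt_cnt, pop_bkt_cnt):
    """Selectivity for 'column = non-popular value', scaled to the whole table."""
    non_popular_values = ndv - pop_val_cnt
    # Fraction of the buckets (and so, roughly, of the rows) that is non-popular.
    non_popular_fraction = (bkt_cnt - pop_bkt_cnt) / bkt_cnt
    return non_popular_fraction / non_popular_values


density = new_density(ndv=80, pop_val_cnt=15, bkt_cnt=74, pop_bkt_cnt=31)
print(f"{density:.6f}")          # 0.008940
print(round(3240 * density))     # 29
```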
I have been a little cavalier with rounding throughout the example, just to keep the numbers looking a little tidier.
If the column is allowed to be null then our calculation of cardinality would use (3,240 – number of nulls) instead of 3,240. The method for calculating the selectivity would not change, but the resulting figure would be applied to (3,240 – number of nulls).
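As a sketch of that adjustment in Python: the function name is hypothetical, and the figure of 240 nulls is invented purely for illustration. In a real table the nulls would also change the histogram, and so the selectivity, but it is held fixed here to isolate the effect of the null count.

```python
def equality_cardinality(num_rows, num_nulls, selectivity):
    """Apply the selectivity to the non-null rows only."""
    return (num_rows - num_nulls) * selectivity


sel = (1 / 65) * (43 / 74)   # the NewDensity derived above

print(round(equality_cardinality(3240, 0, sel), 2))     # 28.96
print(round(equality_cardinality(3240, 240, sel), 2))   # 26.82 (hypothetical 240 nulls)
```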