In the * previous posting* I listed the order of precision of histograms as:

- Frequency
- Top-Frequency
- Hybrid
- Height-balanced
- None

Having covered the * Frequency/Frequency* join (for a single column, no nulls, equijoin) in the previous posting I’ve decided to work down the list and address

rem rem Script: freq_hist_join_05.sql rem Author: Jonathan Lewis rem Dated: Oct 2018 rem Purpose: rem rem Last tested rem 18.3.0.0 rem 12.2.0.1 rem execute dbms_random.seed(0) create table t1 ( id number(6), n04 number(6), n05 number(6), n20 number(6), j1 number(6) ) ; create table t2( id number(8,0), n20 number(6,0), n30 number(6,0), n50 number(6,0), j2 number(6,0) ) ; insert into t1 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 4) + 1 n04, mod(rownum, 5) + 1 n05, mod(rownum, 20) + 1 n20, trunc(2.5 * trunc(sqrt(v1.id*v2.id))) j1 from generator v1, generator v2 where v1.id <= 10 -- > comment to avoid WordPress format issue and v2.id <= 10 -- > comment to avoid WordPress format issue ; insert into t2 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 20) + 1 n20, mod(rownum, 30) + 1 n30, mod(rownum, 50) + 1 n50, 28 - round(abs(7*dbms_random.normal)) j2 from generator v1 where rownum <= 800 -- > comment to avoid WordPress format issue ; begin dbms_stats.gather_table_stats( ownname => null, tabname => 'T1', method_opt => 'for all columns size 1 for columns j1 size 254' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T2', method_opt => 'for all columns size 1 for columns j2 size 16' ); end; /

In this example I’ve used the * sqrt()* function and the

Here are the resulting stats, preceded by the code that reported them:

select table_name, column_name, histogram, num_distinct, num_buckets, density from user_tab_cols where table_name in ('T1','T2') and column_name in ('J1','J2') order by table_name ; select table_name, num_rows from user_tables where table_name in ('T1','T2') order by table_name ; break on table_name skip 1 on report skip 1 with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' order by endpoint_value ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' order by endpoint_value ) select f1.* from f1 union all select f2.* from f2 order by 1,2 ; TABLE_NAME COLUMN_NAME HISTOGRAM NUM_DISTINCT NUM_BUCKETS DENSITY -------------------- -------------------- --------------- ------------ ----------- ---------- T1 J1 FREQUENCY 10 10 .005 T2 J2 TOP-FREQUENCY 22 16 .000625 TABLE_NAME NUM_ROWS -------------------- ---------- T1 100 T2 800 TABLE_NAME VALUE ROW_OR_BUCKET_COUNT ENDPOINT_NUMBER -------------------- ---------- ------------------- --------------- T1 2 5 5 5 15 20 7 15 35 10 17 52 12 13 65 15 13 78 17 11 89 20 7 96 22 3 99 25 1 100 T2 1 1 1 13 14 15 15 11 26 16 22 48 17 34 82 18 31 113 19 36 149 20 57 206 21 44 250 22 45 295 23 72 367 24 70 437 25 87 524 26 109 633 27 96 729 28 41 770

Table * t1* reports 100 rows, 10 distinct values and a

Table

Things we notice from the histograms are: * t1* has a range from 2 to 25, while

SQL> select j2, count(*) from t2 group by j2 order by count(*), j2; J2 COUNT(*) ---------- ---------- 1 1 9 1 * 8 3 * 11 4 * 10 5 * 12 8 * 14 9 * 15 11 ...

The reason why the total number of rows accounted for is less than the total number of rows in the table comes in two parts. The Top-Frequency histogram is designed to hold the Top N most popular entries in the table, so there will be some entries that don’t make an appearance in the histogram despite contributing rows to the total table count; the number of “lost” rows can then be increased because the Top N popular values may not include the column low and high values, and these two values * must *appear in the histogram. Looking at the output above we can see that we could have reported 14 as the 16th most popular value, instead we have to record 1, losing a further 9 rows and regaining 1.

Let’s test the pure join query on the two tables to see what the optimizer is predicting as the join cardinality, and then try to re-create that cardinality from the histogram data:

alter session set statistics_level = all; alter session set events '10053 trace name context forever'; alter session set tracefile_identifier='BASELINE'; select count(*) from t1, t2 where t1.j1 = t2.j2 ; select * from table(dbms_xplan.display_cursor(null,null,'allstats last')); alter session set statistics_level = typical; alter session set events '10053 trace name context off'; ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 41 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 41 | | | | |* 2 | HASH JOIN | | 1 | 1608 | 1327 |00:00:00.01 | 41 | 2545K| 2545K| 1355K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 7 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 7 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")

Our target is to work out how we can query the histogram data in a way that gets the result 1,608. Ideally we’ll also think of a rationale for justifying our method, and then we’ll apply the same method with 15 buckets and 17 buckets, and with a couple of variations to the data (e.g. update all rows where j1 = 25 to set j1 = 28), to see if the method still gets the right result.

All we did with the ** frequency/frequency** join was to join the two histograms on matching values, multiply the frequencies on each resulting row , then sum down the set, and this automatically eliminated rows which were outside the “highest low” and “lowest high” (i.e. we only examined rows where the histograms overlapped). We might hope that things shouldn’t be too different when one of the histograms is a

There is an important difference, though, between frequency and top-frequency histograms – in the latter case there are values in the table which will not be in the histogram, so we ought to make some allowance for these (even though it’s only *“one bucket’s worth”*). It’s possible that some of these values might match values in the frequency histogram so we need to include a mechanism for adding in a factor to allow for them. So as a first step let’s work out the *“average number of rows per value”* for the missing values.

We have 22 distinct values and 16 end points so there are 6 missing values. We have 800 rows in the table but only 770 rows reported in the histogram so there are 30 missing rows. So let’s say the missing values have an average cardinality of 30/6 = 5 (and we might extend that to say they have an average selectivity of 5/800 = * 0.00625*).

Let’s bring that value into the query we wrote for the frequency/frequency case by using an outer join (which I’ll write as an “ANSI” Full Outer Join”) with a predicate in place that restricts the result to just the overlapping range, which is [2,25], the *“higher low value”* and *“lower high value”* across the two histograms. Here’s some code – with an odd little detail included:

column product format 999,999,999.99 compute sum of product on report compute sum of t1_count on report compute sum of t1_value on report compute sum of t2_count on report compute sum of t2_value on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' order by endpoint_value ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' order by endpoint_value ) select f1.value f1_value, f2.value f2_value, nvl(f1.row_or_bucket_count,0.00) t1_count, nvl(f2.row_or_bucket_count,800*0.00625) t2_count, nvl(f1.row_or_bucket_count,0.00) * nvl(f2.row_or_bucket_count,800*0.006250) product from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between 2 and 25 order by coalesce(f1.value, f2.value) ;

I’ve included an * nvl()* on the columns for the top-frequency histograms that convert nulls (i.e. the preserved rows derived from the frequency histogram) into the average frequency we’ve just calculated, using the

T1_VALUE T2_VALUE T1_COUNT T2_COUNT PRODUCT ---------- ---------- ---------- ---------- --------------- 2 5 5 25.00 5 15 5 75.00 7 15 5 75.00 10 17 5 85.00 12 13 5 65.00 13 0 14 .00 15 15 13 11 143.00 16 0 22 .00 17 17 11 34 374.00 18 0 31 .00 19 0 36 .00 20 20 7 57 399.00 21 0 44 .00 22 22 3 45 135.00 23 0 72 .00 24 0 70 .00 25 25 1 87 87.00 ---------- ---------- ---------- ---------- --------------- 135 233 100 548 1,463.00

The figure is too low, so there has to be an adjustment. What if the code is allowing for the *“maybe there are other values”* algorithm that the optimizer uses with fequency histograms ? If you’ve gathered a frequency histogram on a column but query it with a value that isn’t in the histogram than Oracle applies an algorithm that looks like: *“if you’re asking for something that isn’t in the histogram I’ll assume that there must be some data there and use a frequency that’s half the lowest frequency I have recorded”*

T1_VALUE T2_VALUE T1_COUNT T2_COUNT PRODUCT ---------- ---------- ---------- ---------- --------------- 2 5 5 25.00 5 15 5 75.00 7 15 5 75.00 10 17 5 85.00 12 13 5 65.00 13 .5 14 7.00 15 15 13 11 143.00 16 .5 22 11.00 17 17 11 34 374.00 18 .5 31 15.50 19 .5 36 18.00 20 20 7 57 399.00 21 .5 44 22.00 22 22 3 45 135.00 23 .5 72 36.00 24 .5 70 35.00 25 25 1 87 87.00 ---------- ---------- ---------- ---------- --------------- 135 233 103.5 548 1,607.50

Since we were looking for 1,608 I’m going to call that a success. I can check precision, of course, by looking at the 10053 trace file. Extracting a few critical lines:

egrep -e"Density" -e"Join Card" orcl12c_ora_6520_BASELINE.trc AvgLen: 3 NDV: 22 Nulls: 0 Density: 0.006250 Min: 1.000000 Max: 28.000000 AvgLen: 3 NDV: 10 Nulls: 0 Density: 0.005000 Min: 2.000000 Max: 25.000000 Join Card: 1607.500000 = outer (100.000000) * inner (800.000000) * sel (0.020094)

The * “Density”* lines come from the column statistics – note the 0.00625 that matches the

The * “Join Card”* line is exactly what it says – the join cardinality calculation showing that the plan’s prediction of 1,608 rows was actually a rounded 1607.5

There is one more important thing to check before I start tweaking the data to see if there are any other factors involved. Is the 0.5 I stuck into the query really the value of *“half the least common frequency”* or is it a fixed value in all cases. A nice easy way of testing this is to update the * t1* table to change one row from 22 to 25 (22 will still be present in the table and histogram before and after this test, so it’s a minimal and safe change). Making this change and re-running the calculation query leaving the 0.5 unchanged gives the following:

update t1 set j1 = 25 where j1 = 22 and rownum = 1; ... 21 .5 44 22.00 22 22 2 45 90.00 23 .5 72 36.00 24 .5 70 35.00 25 25 2 87 174.00 ---------- ---------- --------------- sum 103.5 548 1,649.50

Without reporting all the details:

- the estimate in the plan went up from 1,608 to 1,794
- leaving 0.5 in the query the derived result was 1,649.5 (last few lines of output above)
- changing the 0.5 to 1.0 the derived result was 1,794.0

Conclusion – the “fudge factor” is consistent with the model the optimizer uses with frequency histogram calculations. The optimizer models “missing” rows in the join calculation as *“half the number of the least frequently occuring value**“*^{**Important footnote}

After a dozen tests varying the number of buckets in the top-frequency histogram (and checking it really was still a top-frequency histogram), and tweaking the * t1* (frequency histogram) data to use values on the boundaries of, or outside, the range of the

So the next step is to check what happens when you add filter predicates on one, or both, sides. I listed a fragment of code earlier on to execute the pure join and count the number of rows it produced, enabling the 10053 trace and pulling the actual plan from memory at the same time. I repeated this code with 3 variations and checked the *“Join Card”* lines from the resulting trace files:

select count(*) from t1, t2 where t1.j1 = t2.j2 select count(*) from t1, t2 where t1.j1 = t2.j2 and t1.n04 = 2 select count(*) from t1, t2 where t1.j1 = t2.j2 and t2.n30 = 25 select count(*) from t1, t2 where t1.j1 = t2.j2 and t1.n04 = 2 and t2.n30 = 25 egrep -e"Join Card" orcl12c_ora_10447*.trc orcl12c_ora_10447_BASELINE.trc:Join Card: 1607.500000 = outer (800.000000) * inner (100.000000) * sel (0.020094) orcl12c_ora_10447_FILTERJ1.trc:Join Card: 401.875000 = outer (800.000000) * inner (25.000000) * sel (0.020094) orcl12c_ora_10447_FILTERJ2.trc:Join Card: 53.583333 = outer (100.000000) * inner (26.666667) * sel (0.020094) orcl12c_ora_10447_FILTJ1J2.trc:Join Card: 13.395833 = outer (26.666667) * inner (25.000000) * sel (0.020094)

As you can see in all 4 cases, Oracle reports an inner and outer cardinality estimate and a join selectivity. The join selectivity **remains unchanged** throughout; it’s the value we can derive from our pure join test (0.020094 = 1607.5 / (100 * 800)). All that changes is that the individual table predicates are applied to the base tables before the join selectivity is applied to the product of the filtered base table cardinalities:

- Column n04 has 4 distinct values in 100 rows – filter cardinality = 100/4 = 25
- Column n30 has 30 distinct values in 800 rows – filter cardinality = 800/30 = 26.66666…

For a single column equijoin on columns with no nulls where one column has a * frequency* histogram and the other has a

^{**Important Footnote} Until Chinar Aliyev questioned what I had written, I had never noticed that the *“half the lowest frequency”* that I describe at various point in the arithmetic was anything other than a fixed fudge factor. In fact, in perfect symmetry with the expression used for the average selectivity in the top-frequency part of the calculcation, this “fudge factor” is simple *“num_rows * column_density”* for the column with the frequency histogram. (Whether the *“half the lowest frequency”* drops out as a side effect of the density calculation, or whether the column density is derived from half the lowest frequency is another matter.)

- October 2018 (40)
- September 2018 (36)
- August 2018 (64)
- July 2018 (58)
- June 2018 (57)
- May 2018 (62)
- April 2018 (40)
- March 2018 (61)
- February 2018 (67)
- January 2018 (53)
- December 2017 (37)
- November 2017 (43)
- October 2017 (57)
- September 2017 (46)
- August 2017 (61)
- July 2017 (52)
- June 2017 (42)
- May 2017 (53)
- April 2017 (53)
- March 2017 (68)
- February 2017 (40)
- January 2017 (64)
- December 2016 (56)
- November 2016 (76)
- October 2016 (71)
- September 2016 (55)
- August 2016 (71)
- July 2016 (41)
- June 2016 (59)
- May 2016 (68)

## Recent comments

38 weeks 3 days ago

50 weeks 4 days ago

1 year 2 weeks ago

1 year 3 weeks ago

1 year 8 weeks ago

1 year 29 weeks ago

1 year 49 weeks ago

2 years 27 weeks ago

3 years 11 weeks ago

3 years 11 weeks ago