So far in this series I’ve written about the way that the optimizer estimates cardinality for an equijoin where one end of the join has a frequency histogram and the other end has a histogram of type:

It’s now time to look at a join where the other end has a * height-balanced* histogram. Arguably it’s not sensible to spend time writing about this since you shouldn’t be creating them in 12c (depending, instead, on the hybrid histogram that goes with the

rem rem Script: freq_hist_join_04a.sql rem Author: Jonathan Lewis rem Dated: Oct 2018 rem Purpose: rem rem Last tested rem 18.3.0.0 rem 12.2.0.1 rem 12.1.0.2 rem 11.2.0.4 Different results rem drop table t2 purge; drop table t1 purge; set linesize 156 set trimspool on set pagesize 60 set feedback off execute dbms_random.seed(0) create table t1( id number(6), n04 number(6), n05 number(6), n20 number(6), j1 number(6) ) ; create table t2( id number(8,0), n20 number(6,0), n30 number(6,0), n50 number(6,0), j2 number(6,0) ) ; insert into t1 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 4) + 1 n04, mod(rownum, 5) + 1 n05, mod(rownum, 20) + 1 n20, trunc(2.5 * trunc(sqrt(v1.id*v2.id))) j1 from generator v1, generator v2 where v1.id <= 10 -- > comment to avoid WordPress format issue and v2.id <= 10 -- > comment to avoid WordPress format issue ; insert into t2 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 20) + 1 n20, mod(rownum, 30) + 1 n30, mod(rownum, 50) + 1 n50, 28 - round(abs(7*dbms_random.normal)) j2 from generator v1 where rownum <= 800 -- > comment to avoid WordPress format issue ; commit; prompt ========================================================== prompt Using estimate_percent => 100 to get height-balanced in t2 prompt ========================================================== begin dbms_stats.gather_table_stats( ownname => null, tabname => 'T1', method_opt => 'for all columns size 1 for columns j1 size 254' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T2', estimate_percent => 100, method_opt => 'for all columns size 1 for columns j2 size 20' ); end; /

As in earlier examples I’ve created some empty tables, then inserted randomly generated data (after calling the * dbms_random.seed(0)* function to make the data reproducible). Then I’ve gathered stats, knowing that there will be 22 distinct values in

When we try to calculate the join cardinality we’re going to need various details from the histogram information, such as bucket sizes, number of distinct values, and so on, so in the next few queries to display the histogram information I’ve captured a few values into SQL*Plus variables. Here’s the basic information about the histograms on the join columns * t1.j1* and

column num_distinct new_value m_t2_distinct column num_rows new_value m_t2_rows column num_buckets new_value m_t2_buckets column bucket_size new_value m_t2_bucket_size select table_name, column_name, histogram, num_distinct, num_buckets, density from user_tab_cols where table_name in ('T1','T2') and column_name in ('J1','J2') order by table_name ; select table_name, num_rows, decode(table_name, 'T2', num_rows/&m_t2_buckets, null) bucket_size from user_tables where table_name in ('T1','T2') order by table_name ; column table_name format a3 heading "Tab" break on table_name skip 1 on report skip 1 with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ) select f1.* from f1 union all select f2.* from f2 order by 1,2 ; Tab COLUMN_NAME HISTOGRAM NUM_DISTINCT NUM_BUCKETS DENSITY -------------------- -------------------- --------------- ------------ ----------- ---------- T1 J1 FREQUENCY 10 10 .005 T2 J2 HEIGHT BALANCED 22 20 .052652266 Tab NUM_ROWS BUCKET_SIZE -------------------- ---------- ----------- T1 100 T2 800 40 Tab VALUE ROW_OR_BUCKET_COUNT ENDPOINT_NUMBER --- ---------- ------------------- --------------- T1 2 5 5 5 15 20 7 15 35 10 17 52 12 13 65 15 13 78 17 11 89 20 7 96 22 3 99 25 1 100 T2 1 0 0 14 1 1 17 1 2 18 1 3 19 1 4 20 1 5 21 2 7 22 1 8 23 1 9 24 2 11 25 2 13 26 3 16 27 2 18 28 2 20

As you can see, there is a * frequency* histogram on

Let’s modify the code to join the two sets of hisogram data on data value – using a full outer join so we don’t lose any data but restricting ourselves to values where the histograms overlap. We’re going to follow the idea we’ve developed in earlier postings and multiply frequencies together to derive a join frequency, so we’ll start with a simple full outer join and assume that when we find a real match value we should behave as if the height-balanced buckets (* t2*) where the bucket count is 2 or greater represent completely full buckets and are popular values..

I’ve also included in this query (because it had a convenient full outer join) a column selection that counts how many rows there are in * t1* with values that fall inside the range of the

column unmatch_ct new_value m_unmatch_ct column product format 999,999.99 break on report skip 1 compute sum of product on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ), join1 as ( select f1.value t1_value, f2.value t2_value, f1.frequency t1_frequency, f2.frequency t2_frequency, sum( case when f2.frequency > 1 and f1.frequency is not null then 0 else f1.frequency end ) over() unmatch_ct, f2.frequency * &m_t2_bucket_size * case when f2.frequency > 1 and f1.frequency is not null then f1.frequency end product from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between 2 and 25 -- coalesce(f1.value, f2.value) between &m_low and &m_high order by coalesce(f1.value, f2.value) ) select * from join1 ; T1_VALUE T2_VALUE T1_FREQUENCY T2_FREQUENCY UNMATCH_CT PRODUCT ---------- ---------- ------------ ------------ ---------- ----------- 2 5 99 5 15 99 7 15 99 10 17 99 12 13 99 14 1 99 15 13 99 17 17 11 1 99 18 1 99 19 1 99 20 20 7 1 99 21 2 99 22 22 3 1 99 23 1 99 24 2 99 25 25 1 2 99 80.00 ----------- sum 80.00

We captured the bucket size (* &m_bucket_size*) for the

Our code so far only acounts for items that are popular in both tables. Previous experience tells us that when a popular value exists only at one end of the join predicate we need to derive a contribution to the total prediction through an *“average selectivity”* calculated for the other end of the join predicate. For frequency histograms we’ve seen that *“half the number of the least frequently occuring value”* seems to be the appropriate frequency estimate, and if we add that in we’ll get two more contributions to the total from the values 21 and 24 which appear in the height-balanced (* t2*) histogram as popular but don’t appear in the frequency (

To calculate the *“average selectivity”* of a non-popular row in * t2* I need a few numbers (some of which I’ve already acquired above). The total number of rows in the table (NR), the number of distinct values (NDV), and the number of popular values (NPV), from which we can derive the the number of distinct non-popular values and the number of rows for the non-popular values. The model that Oracle uses to derive these numbers is simply to assume that a value is popular if its frequency in the histogram is greater than one and the number of rows for that value is

The first query we ran against the * t2* histogram showed 6 popular values, accounting for 13 buckets of 40 rows each. We reported 22 distinct values for the column and 800 rows for the table so the optimizer assumes the non-popular values account for (22 – 6) = 16 distinct values and (800 – 13 * 40) = 280 rows. So the selectivity of non-popular values is (280/800) * (1/16) = 0.021875. This needs to be multiplied by the 99 rows in

We could enhance our earlier full outer join and slot 0.5, 99, and 0.021875 into it as “magic” constants. Rather than do that though I’m going to write a couple of messy queries to derive the values (and the low/high range we’re interested in) so that I will be able to tweak the data later on and see if the formula still produces the right answer.

column range_low new_value m_low column range_high new_value m_high column avg_t1_freq new_value m_avg_t1_freq column new_density new_value m_avg_t2_dens with f1 as ( select endpoint_value ep_val, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select endpoint_value ep_val, endpoint_number ep_num, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ) select max(min_v) range_low, min(max_v) range_high, min(min_f)/2 avg_t1_freq, max(new_density) new_density from ( select min(ep_val) min_v, max(ep_val) max_v, min(frequency) min_f, to_number(null) new_density from f1 union all select min(ep_val) min_v, max(ep_val) max_v, null min_f, (max(ep_num) - sum(case when frequency > 1 then frequency end)) / ( max(ep_num) * (&m_t2_distinct - count(case when frequency > 1 then 1 end)) ) new_density from f2 ) ; RANGE_LOW RANGE_HIGH AVG_T1_FREQ NEW_DENSITY ---------- ---------- ----------- ----------- 2 25 .5 .021875

This query finds the overlap by querying the two histograms and reporting the lower high value and higher low value. It also reports the minimum frequency from the frequency histogram and divides by 2, and calculates the number of non-popular rows divided by the total number of rows and the number of distinct non-popular values. (Note that I’ve picked up the number of distinct values in * t2.j2* as a substituion variable generated by one of my earlier queries.) In my full script this messy piece of code runs before the query that showed I showed earlier on that told us how well (or badly) the two histograms matched.

Finally we can use the various values we’ve picked up in a slightly more complex version of the full outer join – with a special row added through a * union all* to give us our the estimate:

break on report skip 1 compute sum of product on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ), join1 as ( select f1.value t1_value, f2.value t2_value, f1.frequency t1_frequency, f2.frequency t2_frequency, f2.frequency * case when f2.frequency > 1 and f1.frequency is not null then f1.frequency when f2.frequency > 1 and f1.frequency is null then &m_avg_t1_freq end * &m_t2_bucket_size product from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between &m_low and &m_high order by coalesce(f1.value, f2.value) ) select * from join1 union all select null, &m_avg_t2_dens, &m_unmatch_ct, &m_t2_rows * &m_avg_t2_dens, &m_t2_rows * &m_avg_t2_dens * &m_unmatch_ct from dual ; T1_VALUE T2_VALUE T1_FREQUENCY T2_FREQUENCY PRODUCT ---------- ---------- ------------ ------------ ----------- 2 5 5 15 7 15 10 17 12 13 14 1 15 13 17 17 11 1 18 1 19 1 20 20 7 1 21 2 40.00 22 22 3 1 23 1 24 2 40.00 25 25 1 2 80.00 .021875 99 17.5 1,732.50 ----------- sum 1,892.50

It remains only to check what the optimizer thinks the cardinality will be on a simple join, and then modify the data slightly to see if the string of queries continues to produce the right result. Here’s a starting test:

set serveroutput off alter session set statistics_level = all; alter session set events '10053 trace name context forever'; alter session set tracefile_identifier='BASELINE'; select count(*) from t1, t2 where t1.j1 = t2.j2 ; select * from table(dbms_xplan.display_cursor(null,null,'allstats last')); alter session set statistics_level = typical; alter session set events '10053 trace name context off'; COUNT(*) ---------- 1327 PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------------------------------------------------------------ SQL_ID f8wj7karu0hhs, child number 0 ------------------------------------- select count(*) from t1, t2 where t1.j1 = t2.j2 Plan hash value: 906334482 ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 41 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 41 | | | | |* 2 | HASH JOIN | | 1 | 1893 | 1327 |00:00:00.01 | 41 | 2545K| 2545K| 1367K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 7 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 7 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")

The * E-rows* for the hash join operation reports 1893 – and a quick check of the 10053 trace file shows that this is 1892.500000 rounded – a perfect match for the result from my query. I’ve modified the data in various ways (notably updating the

Upgrades are a pain. With the same data set and same statistics on 11.2.0.4, running the same join query between * t1* and

SQL_ID f8wj7karu0hhs, child number 0 ------------------------------------- select count(*) from t1, t2 where t1.j1 = t2.j2 Plan hash value: 906334482 ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 12 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 12 | | | | |* 2 | HASH JOIN | | 1 | 1855 | 1327 |00:00:00.01 | 12 | 2440K| 2440K| 1357K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 6 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 6 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")

Notice that the * E-rows* value is different. The join cardinality algorithm seems to have changed in the upgrade from 11.2.0.4 to 12c. I haven’t quite figured out how to get to the 11g result, but I seem to get quite close

case when f2.frequency > 1 and f1.frequency is not null -- then f1.frequency -- 12c then &m_avg_t1_freq -- 11g when f2.frequency > 1 and f1.frequency is null then &m_avg_t1_freq end *

As I modified the * t1* row with the value 25 to hold other values this change kept producing results that were exactly 2, 2.5, or 3.0 different from the execution plan

In some email I exchanged with Chinar Aliyev, he suggested three fix-controls that might be relevant. I’ve added these to * an earlier posting* I did when I first hit the anomaly a few days ago but I’ll repeat them here. I will be testing their effects at some point in the not too distant future:

14033181 1 QKSFM_CARDINALITY_14033181 correct ndv for non-popular values in join cardinality comp. (12.1.0.1) 19230097 1 QKSFM_CARDINALITY_19230097 correct join card when popular value compared to non popular (12.2.0.1) 22159570 1 QKSFM_CARDINALITY_22159570 correct non-popular region cardinality for hybrid histogram (12.2.0.1)

- January 2019 (16)
- December 2018 (35)
- November 2018 (48)
- October 2018 (63)
- September 2018 (36)
- August 2018 (64)
- July 2018 (58)
- June 2018 (57)
- May 2018 (62)
- April 2018 (40)
- March 2018 (61)
- February 2018 (67)
- January 2018 (53)
- December 2017 (37)
- November 2017 (43)
- October 2017 (57)
- September 2017 (46)
- August 2017 (61)
- July 2017 (52)
- June 2017 (42)
- May 2017 (53)
- April 2017 (53)
- March 2017 (68)
- February 2017 (38)
- January 2017 (66)
- December 2016 (53)
- November 2016 (79)
- October 2016 (71)
- September 2016 (55)
- August 2016 (71)

## Recent comments

51 weeks 11 hours ago

1 year 11 weeks ago

1 year 15 weeks ago

1 year 16 weeks ago

1 year 20 weeks ago

1 year 41 weeks ago

2 years 10 weeks ago

2 years 39 weeks ago

3 years 24 weeks ago

3 years 24 weeks ago