# Join Cardinality – 4

In previous installments of this series I’ve been describing how Oracle estimates the join cardinality for single-column joins with equality where the columns have histograms defined. So far I’ve covered two options for the types of histogram involved: frequency to frequency, and frequency to top-frequency. Today it’s time to examine frequency to hybrid.

My first thought about this combination was that it was likely to be very similar to frequency to top-frequency because a hybrid histogram has a list of values with “repeat counts” (which is rather like a simple frequency histogram), and a set of buckets with variable sizes that could allow us to work out an “average selectivity” of the rest of the data.

I was nearly right, but the arithmetic didn’t quite work out the way I expected. Fortunately Chinar Aliyev’s document highlighted my error: the optimizer doesn’t use all the repeat counts, it uses only those repeat counts that identify popular values, and a popular value is one whose endpoint_repeat_count is not less than the average number of rows per bucket. Let’s work through an example – first the data (which repeats an earlier article, but is included here for ease of reference):

```
rem
rem     Script:         freq_hist_join_06.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem

set linesize 156
set pagesize 60
set trimspool on

execute dbms_random.seed(0)

create table t1 (
        id              number(6),
        n04             number(6),
        n05             number(6),
        n20             number(6),
        j1              number(6)
)
;

create table t2(
        id              number(8,0),
        n20             number(6,0),
        n30             number(6,0),
        n50             number(6,0),
        j2              number(6,0)
)
;

insert into t1
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   4) + 1                    n04,
        mod(rownum,   5) + 1                    n05,
        mod(rownum,  20) + 1                    n20,
        trunc(2.5 * trunc(sqrt(v1.id*v2.id)))   j1
from
        generator       v1,
        generator       v2
where
        v1.id <= 10 -- > comment to avoid WordPress format issue
and     v2.id <= 10 -- > comment to avoid WordPress format issue
;

insert into t2
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   20) + 1                   n20,
        mod(rownum,   30) + 1                   n30,
        mod(rownum,   50) + 1                   n50,
        28 - round(abs(7*dbms_random.normal))   j2
from
        generator       v1
where
        rownum <= 800 -- > comment to avoid WordPress format issue
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname          => null,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1 for columns j1 size 254'
        );
        dbms_stats.gather_table_stats(
                ownname          => null,
                tabname          => 'T2',
                method_opt       => 'for all columns size 1 for columns j2 size 13'
        );
end;
/

```

As before, I’ve got a table with 100 rows using the sqrt() function to generate column j1, and a table with 800 rows using the dbms_random.normal function to generate column j2. So the two columns have skewed patterns of data distribution, with a small number of low values and larger numbers of higher values – but the two patterns are different.
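As a quick cross-check on the shape of the t1 data, here’s a small Python sketch (the names are mine, purely for illustration) that reproduces the j1 formula and counts the resulting values; t2 isn’t reproduced because dbms_random.normal has no exact Python counterpart:

```python
from collections import Counter
from math import floor, isqrt

# j1 = trunc(2.5 * trunc(sqrt(v1.id * v2.id))) for v1.id, v2.id in 1..10;
# isqrt() is exactly trunc(sqrt()) for a non-negative integer.
j1_counts = Counter(
    floor(2.5 * isqrt(v1 * v2))
    for v1 in range(1, 11)
    for v2 in range(1, 11)
)

# Ten distinct values with a skewed distribution - these counts match the
# t1_count column of the histogram report further down the page.
print(sorted(j1_counts.items()))
```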

I’ve requested a histogram with 254 buckets for the t1.j1 column (which dropped to 10 buckets, since j1 has only 10 distinct values, giving a frequency histogram), and a histogram with 13 buckets for the t2.j2 column, as I knew (after a little trial and error) that this would give me a hybrid histogram.

Here’s a simple query, with its result set, to report the two histograms – using a full outer join to line up matching values and show the gaps where (endpoint) values in one histogram do not appear in the other:

```
define m_popular = 62

break on report skip 1

compute sum of product on report
compute sum of product_rp on report

compute sum of t1_count on report
compute sum of t2_count on report
compute sum of t2_repeats on report
compute sum of t2_pop_count on report

with f1 as (
        select
                table_name,
                endpoint_value                                                            value,
                endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
                endpoint_number,
                endpoint_repeat_count,
                to_number(null) pop_count
        from
                user_tab_histograms
        where
                table_name  = 'T1'
        and     column_name = 'J1'
        order by
                endpoint_value
),
f2 as (
        select
                table_name,
                endpoint_value                                                            value,
                endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
                endpoint_number,
                endpoint_repeat_count,
                case when endpoint_repeat_count >= &m_popular
                        then endpoint_repeat_count
                        else null
                end     pop_count
        from
                user_tab_histograms
        where
                table_name  = 'T2'
        and     column_name = 'J2'
        order by
                endpoint_value
)
select
        f1.value t1_value,
        f2.value t2_value,
        f1.row_or_bucket_count t1_count,
        f2.row_or_bucket_count t2_count,
        f1.endpoint_repeat_count t1_repeats,
        f2.endpoint_repeat_count t2_repeats,
        f2.pop_count t2_pop_count
from
        f1
full outer join
        f2
on
        f2.value = f1.value
order by
        coalesce(f1.value, f2.value)
;

T1_VALUE   T2_VALUE   T1_COUNT   T2_COUNT T1_REPEATS T2_REPEATS T2_POP_COUNT
---------- ---------- ---------- ---------- ---------- ---------- ------------
1                     1                     1
2                     5                     0
5                    15                     0
7                    15                     0
10                    17                     0
12                    13                     0
15         15         13         55          0         11
17         17         11         56          0         34
19                    67                    36
20         20          7         57          0         57
21                    44                    44
22         22          3         45          0         45
23                    72                    72           72
24                    70                    70           70
25         25          1         87          0         87           87
26                   109                   109          109
27                    96                    96           96
28                    41                    41
---------- ---------- ---------- ----------            ---------- ------------
100        800                   703          434

```

You’ll notice that there’s a substitution variable (m_popular) in this script that I use to identify the “popular values” in the hybrid histogram so that I can report them separately. I’ve set this value to 62 for this example because a quick check of user_tables and user_tab_cols tells me I have 800 rows in the table (user_tables.num_rows) and 13 buckets (user_tab_cols.num_buckets) in the histogram: 800/13 = 61.54. A value is popular only if its repeat count is 62 or more.
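As a tiny sanity check on that threshold, here’s the arithmetic in Python, with the dictionary figures quoted above hard-coded:

```python
import math

num_rows    = 800   # user_tables.num_rows for T2
num_buckets = 13    # user_tab_cols.num_buckets for the J2 histogram

avg_rows_per_bucket = num_rows / num_buckets    # 61.538...

# A value is popular when its endpoint_repeat_count is not less than
# the average rows per bucket; for an integer repeat count that means
# at least the ceiling of the average.
m_popular = math.ceil(avg_rows_per_bucket)

print(m_popular)    # 62
```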

This is where you may hit a problem – I certainly did when I switched from testing 18c to testing 12c (which I just knew was going to work – but I tested anyway). Although my data has been engineered so that I get the same “random” data in both versions of Oracle, I got different hybrid histograms (hence my complaint in a recent post). The rest of this article covers 18c in detail, but if you’re running 12c there are a couple of defined values that you can change to get the right results.

At this point I need to “top and tail” the output because the arithmetic only applies where the histograms overlap, so I need to pick the range from 2 to 25. Then I need to inject a “representative” or “average” count/frequency into all the gaps, then cross-multiply. The average frequency for the frequency histogram is “half the frequency of the least frequently occurring value” (which seems to be identical to new_density * num_rows), and the representative frequency for the hybrid histogram is (“number of non-popular rows” / “number of non-popular values”). There are 800 rows in the table with 22 distinct values in the column, and the output above shows us that we have 5 popular values totalling 434 rows, so the average frequency is (800 – 434) / (22 – 5) = 21.5294. (Alternatively we could say that the average selectivities – which are what I’ve used in the next query – are 0.5/100 and 21.5294/800.)
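The same arithmetic as a Python sketch, with the 18c figures from the output hard-coded (the variable names are mine):

```python
t1_rows          = 100
t1_min_frequency = 1    # lowest frequency recorded in the t1.j1 histogram

t2_rows         = 800
t2_num_distinct = 22    # num_distinct for t2.j2
t2_pop_values   = 5     # popular values in the 18c hybrid histogram
t2_pop_rows     = 434   # rows covered by those popular values

# Frequency histogram: "average" frequency is half the lowest frequency.
t1_avg_freq = t1_min_frequency / 2

# Hybrid histogram: non-popular rows / non-popular values.
t2_avg_freq = (t2_rows - t2_pop_rows) / (t2_num_distinct - t2_pop_values)

print(t2_avg_freq)              # 21.5294...
print(t1_avg_freq / t1_rows)    # selectivity 0.005
print(t2_avg_freq / t2_rows)    # selectivity 0.026911..., i.e. 0.0269118 rounded
```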

[Note for 12c, you’ll get 4 popular values covering 338 rows, so your figures will be: (800 – 338) / (22 – 4) = 25.6666… and a selectivity of 25.6666…/800 = 0.0320833]

So here’s a query that restricts the output to the rows we want from the histograms, discards a couple of columns, and does the arithmetic:

```
rem     12c value (superseded by the 18c define that follows):
rem     define m_t2_sel = 0.0320833

define m_t2_sel = 0.0269118
define m_t1_sel = 0.005

break on table_name skip 1 on report skip 1

with f1 as (
        select
                table_name,
                endpoint_value                                                            value,
                endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
                endpoint_number,
                endpoint_repeat_count,
                to_number(null) pop_count
        from
                user_tab_histograms
        where
                table_name  = 'T1'
        and     column_name = 'J1'
        order by
                endpoint_value
),
f2 as (
        select
                table_name,
                endpoint_value                                                            value,
                endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
                endpoint_number,
                endpoint_repeat_count,
                case when endpoint_repeat_count >= &m_popular
                        then endpoint_repeat_count
                        else null
                end     pop_count
        from
                user_tab_histograms
        where
                table_name  = 'T2'
        and     column_name = 'J2'
        order by
                endpoint_value
)
select
        f1.value f1_value,
        f2.value f2_value,
        nvl(f1.row_or_bucket_count,100 * &m_t1_sel) t1_count,
        nvl(f2.pop_count,          800 * &m_t2_sel) t2_count,
        case when (   f1.row_or_bucket_count is not null
                   or f2.pop_count is not null
              )    then
                nvl(f1.row_or_bucket_count,100 * &m_t1_sel) *
                nvl(f2.pop_count,          800 * &m_t2_sel)
        end      product_rp
from
        f1
full outer join
        f2
on
        f2.value = f1.value
where   coalesce(f1.value, f2.value) between 2 and 25
order by
        coalesce(f1.value, f2.value)
;

F1_VALUE   F2_VALUE   T1_COUNT   T2_COUNT PRODUCT_RP
---------- ---------- ---------- ---------- ----------
2                     5   21.52944   107.6472
5                    15   21.52944   322.9416
7                    15   21.52944   322.9416
10                    17   21.52944  366.00048
12                    13   21.52944  279.88272
15         15         13   21.52944  279.88272
17         17         11   21.52944  236.82384
19         .5   21.52944
20         20          7   21.52944  150.70608
21         .5   21.52944
22         22          3   21.52944   64.58832
23         .5         72         36
24         .5         70         35
25         25          1         87         87
---------- ---------- ----------
sum                          102  465.82384 2289.41456

```

There’s an important detail that I haven’t mentioned so far. In the output above you can see that some rows show product_rp as blank: while we cross-multiply the frequencies from t1.j1 and t2.j2, filling in average frequencies where necessary, we exclude from the final result any row where average frequencies have been used for both histograms.

[Note for 12c, you’ll get the result 2698.99736 for the query, and 2699 for the execution plan]
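Pulling everything together, here’s a Python sketch of the 18c calculation; the histogram counts are hard-coded from the output above, the names and structure are mine, and this is only my model inferred from the figures, not Oracle’s published algorithm:

```python
t1_rows, t2_rows = 100, 800
t1_sel = 0.005        # 0.5 / 100        (frequency histogram average)
t2_sel = 0.0269118    # 21.5294... / 800 (hybrid histogram average, rounded)

# t1.j1 frequency histogram, restricted to the overlap 2..25
t1_freq = {2: 5, 5: 15, 7: 15, 10: 17, 12: 13,
           15: 13, 17: 11, 20: 7, 22: 3, 25: 1}

# Popular values of the t2.j2 hybrid histogram within the overlap
t2_pop = {23: 72, 24: 70, 25: 87}

# t2.j2 endpoint values in the overlap that are not popular
t2_other = {15, 17, 19, 20, 21, 22}

total = 0.0
for v in sorted(t1_freq.keys() | t2_pop.keys() | t2_other):
    # Rows where BOTH sides would use an average frequency are excluded
    if v not in t1_freq and v not in t2_pop:
        continue
    t1_count = t1_freq.get(v, t1_rows * t1_sel)   # 0.5 when absent
    t2_count = t2_pop.get(v, t2_rows * t2_sel)    # 21.52944 when not popular
    total += t1_count * t2_count

print(round(total, 5))   # 2289.41456, which rounds to the E-Rows figure 2289
```

Note that values 19 and 21 (t2 endpoints that aren’t popular and don’t appear in the t1 histogram) fall through the exclusion test, matching the blank product_rp rows above.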

Of course we now have to check that the predicted cardinality for a simple join between these two tables really is 2,289. So let’s run a suitable query and see what the optimizer predicts:

```
set serveroutput off

alter session set statistics_level = all;
alter session set events '10053 trace name context forever';

select
        count(*)
from
        t1, t2
where
        t1.j1 = t2.j2
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

alter session set statistics_level = typical;
alter session set events '10053 trace name context off';

SQL_ID  cf4r52yj2hyd2, child number 0
-------------------------------------
select  count(*) from  t1, t2 where  t1.j1 = t2.j2

Plan hash value: 906334482

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |     108 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |     108 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |   2289 |   1327 |00:00:00.01 |     108 |  2546K|  2546K| 1194K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100 |    100 |00:00:00.01 |      18 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    800 |    800 |00:00:00.01 |      34 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."J1"="T2"."J2")

```

As you can see, the E-Rows for the join is 2,289, as required.

I can’t claim that the model I’ve produced is definitely what Oracle does, but it looks fairly promising. No doubt, though, there are some variations on the theme that I haven’t considered – even when sticking to a simple (non-partitioned) join on equality on a single column.