Oakies Blog Aggregator

Random Upgrade

Here’s a problem that (probably) won’t affect the day to day running of most systems – but it could be a pain in the backside for people who write programs to generate repeatable test data. I’m not going to say much about the problem, just leave you with a test script.


rem
rem	Script	random_upgrade.sql
rem	Author:	Jonathan Lewis
rem	Dated:	Oct 2018
rem
rem	Last tested
rem		18.3.0.0
rem		12.2.0.1
rem	Notes
rem	In the upgrade from 12.2.0.1 something
rem	changed that meant
rem		create as select dbms_random
rem	gets different data from
rem		select dbms_random
rem

drop table t4 purge;
drop table t3 purge;
drop table t2 purge;
drop table t1 purge;
drop table t0 purge;

set feedback off

create table t0 as
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
;


execute dbms_random.seed(0);

create table t1
as
select dbms_random.normal
from
	t0
;

execute dbms_random.seed(0);

create table t2
as
with g1 as (
	select rownum id
	from dual
	connect by
		level <= 1e4 -- > comment to avoid WordPress format issue
)
select
	dbms_random.normal
from
	g1
;

prompt	=================
prompt	Diff the two CTAS
prompt	=================

select count(*)
from (
select * from t1
minus
select * from t2
union all
select * from t2
minus
select * from t1
)
;


create table t3 
as 
select * from t2 
where rownum < 1 -- > comment to avoid WordPress format issue
;

create table t4 
as 
select * from t2 
where rownum < 1 -- > comment to avoid WordPress format issue
;

execute dbms_random.seed(0)

insert into t3
select dbms_random.normal
from
	t0
;

execute dbms_random.seed(0)

insert into t4
with g1 as (
	select rownum id
	from dual
	connect by
		level <= 1e4 -- > comment to avoid WordPress format issue
)
select
	dbms_random.normal
from
	g1
;


prompt	===================
prompt	Diff the two Insert
prompt	===================

select count(*)
from (
select * from t3
minus
select * from t4
union all
select * from t4
minus
select * from t3
)
;


prompt	===========
prompt	Sum of CTAS
prompt	===========

select sum(normal) from t1;

prompt	=============
prompt	Sum of Insert
prompt	=============

select sum(normal) from t3;


execute dbms_random.seed(0)

prompt	=============
prompt	Sum of select
prompt	=============

with g1 as (
	select rownum id
	from dual
	connect by
		level <= 1e4 -- > comment to avoid WordPress format issue
)
select sum(n) from (
select
	dbms_random.normal n
from
	g1
)
;


I’m repeatedly using dbms_random.seed(0) to reset the random number generator and trying to generate 10,000 normally distributed numbers. (I’ve chosen the normal distribution because that happened to be the function in a script I sent someone with the comment that “this will recreate the data for the demonstration” – and they wrote back to say that it didn’t.)

I’ve got two “create as select”, and two “insert as select”. One of each pair selects from a real existing table to get 10,000 rows, the other uses the “select dual connect by” trick to generate rows. I’ve written SQL that shows whether or not the two pairs of tables end up with the same data (they do, pairwise), then I’ve summed one table from each pair to see if the different mechanisms produce the same data – and that depends on the version of Oracle you’re using. Finally I’ve reset the random number generator and summed across a pure select to see what that produces.

If you run this code on 12.2.0.1 or earlier you’ll see that the “diffs” report zeros and the “sums” report -160.39249. If you upgrade to 18.3 the diffs will still report zeros and some of the sums will still report -160.39249 but the sum of the CTAS will report -91.352172.

Bottom Line

If you’ve got code that you wrote to create reproducible test cases and the code uses: “create table … as select … dbms_random …” then it won’t produce the same data when you upgrade to 18.3. You’ll have to modify the code to do “create table (); insert as select …”.
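
For anyone facing that rewrite, here’s a minimal sketch of the pattern, based on the t0/t3 example in the script above (the table name t1a is purely illustrative):

create table t1a (normal number);

execute dbms_random.seed(0)

insert into t1a
select dbms_random.normal
from   t0
;

commit;

The dbms_random calls now happen at insert time rather than at “create as select” time which, according to the results above, is the mechanism that still matches the pre-18.3 output.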

As of this afternoon I have 1,209 test scripts on my laptop that use the dbms_random package to model data distribution patterns. It is almost certain that I will end up modifying every single one of them eventually.

There are words to express how I feel about this – but not ones that I would consider publishing.

Use Azure CLI…I Beg You…

Azure CLI made me feel right at home after working at Oracle with the Enterprise Manager CLI (EMCLI). The syntax is simple and powerful, and it provides an interface to manage Azure infrastructure from the command line, scripting out complex processing that would otherwise involve a lot of time in the user interface.

I’d love to start promoting it to more DBAs and infrastructure folks, and not just for creating databases and a few servers/containers/VMs, but for the entire STACK. With that in mind, there’s going to be a lot of follow-up blog posts on this one, but let’s just start with a few tips and tricks, along with a 101.

1. Download the Azure CLI Client

Download Azure CLI to your desktop. It’s really easy: just follow the defaults and install it. There’s no need to restart, and it’s readily available from the command prompt (cmd).

2.  Get a Real Script Editor

If you think you’ll get by with Notepad for your script editor, don’t even try it.  Get a proper text or script editor that tracks lines of code, can handle multiple scripting formats, etc.  If you need a suggestion, I am using Sublime Text and it does the trick.

3. Test Your Installation

Logging into Azure, if you’re using Azure Active Directory, is really easy. Just open up a Command Prompt (cmd from the Start menu) and type in the following:

az login

An authorization window will prompt you to choose which of your AD accounts you’d like to use for Azure, then it will authorize and proceed. You’ll see the following in the command prompt window once it’s finished.

"You have logged in. Now let us find all the subscriptions to which you have access..."

It will also show you the subscriptions you have access to in Azure and then return to the prompt. Congratulations, you’re now ready to deploy via the CLI!
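
If you want to double-check which subscription the CLI is currently pointed at before you start creating anything, a couple of standard commands will confirm it (shown here purely as a sanity check, they’re not part of the original login flow):

az account show --output table

az account list --output table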

4.  Perform a Few Test Deployments

Then get started with it by deploying a few test VMs, SQL Databases and maybe a container or two.

You can deploy a VM pretty easily with just a bit of information:

C:\>az vm create -n <vm-name> -g <resource-group> --image UbuntuLTS --generate-ssh-keys

SSH key files 'C:\Users\xxxxxxxxxxxx\.ssh\id_rsa' and 'C:\Users\xxxxxxxxxxxxxxxx\.ssh\id_rsa.pub' have been generated under ~/.ssh to allow SSH access to the VM. If using machines without permanent storage, back up your keys to a safe location.

- Running ..

C:\EDU_Docker>az vm list -n

C:\EDU_Docker>az vm delete -n <vm-name> -g <resource-group>  #I like to add the group, too.

The CLI will still ask you to verify if you want to delete the resource, but once you confirm, it will remove it and you’ll be back to clean.
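
If you’re scripting the clean-up and don’t want the interactive prompt at all, the delete command accepts a confirmation flag; as far as I recall it’s --yes, but run az vm delete --help to confirm on your version:

C:\EDU_Docker>az vm delete -n <vm-name> -g <resource-group> --yes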

The more people that use the CLI, the more robust it will become and the more powerful you become as an infrastructure specialist in Azure.  Come on, help a girl out here, I can’t blog about this all on my own…

Join Cardinality – 2

In the previous note I posted about Join Cardinality I described a method for calculating the figure that the optimizer would give for the special case where you had a query that:

  • joined two tables
  • used a single-column to join on equality
  • had no nulls in the join columns
  • had a perfect frequency histogram on the columns at the two ends of the join
  • had no filter predicates associated with either table

The method simply said: “Match up rows from the two frequency histograms, multiply the corresponding frequencies” and I supplied a simple SQL statement that would read and report the two sets of histogram data, doing the arithmetic and reporting the final cardinality for you. In an update I also added an adjustment needed in 11g (or, you might say, removed in 12c) where gaps in the histograms were replaced by “ghost rows” with a frequency that was half the lowest frequency in the histogram.

This is a nice place to start as the idea is very simple, and it’s likely that extensions of the basic idea will be used in all the other cases we have to consider. There are 25 possibilities that could need separate testing – though only 16 of them ought to be relevant from 12c onwards. Oracle allows for four kinds of histograms – in order of how precisely they describe the data they are:

  • Frequency – with a perfect description of the data
  • Top-N (a.k.a. Top-Frequency) – which describes all but a tiny fraction (ca. one bucket’s worth) of data perfectly
  • Hybrid – which can (but doesn’t usually, by default) describe up to 2,048 popular values perfectly and gives an approximate distribution for the rest
  • Height-balanced – which can (but doesn’t usually, by default) describe at most 1,024 popular values with some scope for misinformation.

Finally, of course, we have the general case of no histogram, using only 4 numbers (low value, high value, number of rows, number of distinct values) to give a rough picture of the data – and the need for histograms appears, of course, when the data doesn’t look anything like an even distribution of values between the low and high with close to “number of rows”/”number of distinct values” for each value.

So there are 5 possible statistical descriptions for the data in a column – which means there are 5 * 5 = 25 possible options to consider when we join two columns, or 4 * 4 = 16 if we label height-balanced histograms as obsolete and ignore them (which would be a pity because Chinar has done some very nice work explaining them).

Of course, once we’ve worked out a single-column equijoin between two tables there are plenty more options to consider:  multi-column joins, joins involving range-based predicates, joins involving more than 2 tables, and queries which (as so often happens) have predicates which aren’t involved in the joins.

For the moment I’m going to stick to the simplest case – two tables, one column, equality – and comment on the effects of filter predicates. It seems to be very straightforward, as I’ll demonstrate with a new model:

rem
rem     Script:         freq_hist_join_03.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem

execute dbms_random.seed(0)

create table t1(
        id      number(8,0),
        n0040   number(4,0),
        n0090   number(4,0),
        n0190   number(4,0),
        n0990   number(4,0),
        n1      number(4,0)
)
;

create table t2(
        id      number(8,0),
        n0050   number(4,0),
        n0110   number(4,0),
        n0230   number(4,0),
        n1150   number(4,0),
        n1      number(4,0)
)
;

insert into t1
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   40) + 1                   n0040,
        mod(rownum,   90) + 1                   n0090,
        mod(rownum,  190) + 1                   n0190,
        mod(rownum,  990) + 1                   n0990,
        trunc(30 * abs(dbms_random.normal))     n1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;

insert into t2
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   50) + 1                   n0050,
        mod(rownum,  110) + 1                   n0110,
        mod(rownum,  230) + 1                   n0230,
        mod(rownum, 1150) + 1                   n1150,
        trunc(30 * abs(dbms_random.normal))     n1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns n1 size 254'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1 for columns n1 size 254'
        );
end;
/

You’ll notice that in this script I’ve created empty tables and then populated them. This is because of an anomaly that appeared in 18.3 when I used “create as select”, and it should allow the results from 18.3 to be an exact match for 12c. You don’t need to pay much attention to the Nxxx columns; they were there so I could experiment with a few variations in the selectivity of filter predicates.

Given the purpose of the demonstration I’ve gathered histograms on the column I’m going to use to join the tables (called n1 in this case), and here are the summary results:


TABLE_NAME           COLUMN_NAME          HISTOGRAM       NUM_DISTINCT NUM_BUCKETS
-------------------- -------------------- --------------- ------------ -----------
T1                   N1                   FREQUENCY                119         119
T2                   N1                   FREQUENCY                124         124

     VALUE  FREQUENCY  FREQUENCY      PRODUCT
---------- ---------- ---------- ------------
         0       2488       2619    6,516,072
         1       2693       2599    6,999,107
         2       2635       2685    7,074,975
         3       2636       2654    6,995,944
...
       113          1          3            3
       115          1          2            2
       116          4          3           12
       117          1          1            1
       120          1          2            2
                                 ------------
sum                               188,114,543

We’ve got frequency histograms, and we can see that they don’t have a perfect overlap. I haven’t printed every single line from the cardinality query, just enough to show you the extreme skew, a few gaps, and the total. So here are three queries with execution plans:


set serveroutput off

alter session set statistics_level = all;
alter session set events '10053 trace name context forever';

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
and     t1.n0990 = 20
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));


select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
and     t1.n0990 = 20
and     t2.n1150 = 25
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

I’ve queried the pure join – the count was exactly the 188,114,543 predicted by the cardinality query, of course – then I’ve applied a filter to one table, then to both tables. The first filter n0990 = 20 will (given the mod(,990) definition) identify one row in 990 from the original 100,000 in t1; the second filter n1150 = 25 will identify one row in 1150 from t2. That’s filtering down to 101 rows and 87 rows respectively from the two tables. So what do we see in the plans?


-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:23.47 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:23.47 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    188M|    188M|00:00:23.36 |     748 |  6556K|  3619K| 8839K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")



-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.02 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.02 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    190K|    200K|00:00:00.02 |     748 |  2715K|  2715K| 1647K (0)|
|*  3 |    TABLE ACCESS FULL| T1   |      1 |    101 |    101 |00:00:00.01 |     374 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")
   3 - filter("T1"."N0990"=20)



-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    165 |    165 |00:00:00.01 |     748 |  2715K|  2715K| 1678K (0)|
|*  3 |    TABLE ACCESS FULL| T2   |      1 |     87 |     87 |00:00:00.01 |     374 |       |       |          |
|*  4 |    TABLE ACCESS FULL| T1   |      1 |    101 |    101 |00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")
   3 - filter("T2"."N1150"=25)
   4 - filter("T1"."N0990"=20)


The first execution plan shows an estimate of 188M rows – but we’ll have to check the trace file to confirm whether that’s only an approximate match to our calculation, or whether it’s an exact match. So here’s the relevant pair of lines:


Join Card:  188114543.000000 = outer (100000.000000) * inner (100000.000000) * sel (0.018811)
Join Card - Rounded: 188114543 Computed: 188114543.000000

Yes, the cardinality calculation and the execution plan estimates match perfectly. But there are a couple of interesting things to note. First, Oracle seems to be deriving the cardinality by multiplying the individual cardinalities of the two tables with a figure it calls “sel” – the thing that Chinar Aliyev has labelled Jsel, the “Join Selectivity”. Secondly, Oracle can’t do arithmetic (or, removing tongue from cheek, the value it reports for the join selectivity is shown to only 6 decimal places but stored to far more). What is the Join Selectivity, though? It’s the figure we derive from the cardinality SQL divided by the cardinality of the Cartesian join of the two tables – i.e. 188,114,543 / (100,000 * 100,000).
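
If you want to see that figure to full precision, the arithmetic is trivial to reproduce (this just re-uses the numbers already reported above):

select 188114543 / (100000 * 100000) join_selectivity from dual;

which comes to 0.0188114543, the value used in the calculations that follow.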

With the clue from the first trace file, can we work out why the second and third plans show 190K and 165 rows respectively? How about this – multiply the filtered cardinalities of the two separate tables, then multiply the result by the join selectivity:

  • 1a)   n0990 = 20: gives us 1 row in every 990.    100,000 / 990 = 101.010101…    (echoing the rounded execution plan estimate).
  • 1b)   100,000 * (100,000/990) * 0.0188114543 = 190,014.69898989…    (which is in the ballpark of the plan and needs confirmation from the trace file).

 

  • 2a)   n1150 = 25: gives us 1 row in every 1,150.    100,000 / 1,150 = 86.9565217…    (echoing the rounded execution plan estimate)
  • 2b)   (100,000/990) * (100,000/1,150) * 0.0188114543 = 165.2301651..    (echoing the rounded execution plan estimate).
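
Here’s the same arithmetic as the two bullet points above, written as a query so the full-precision figures are easy to compare with the 10053 extracts that follow:

select
        100000 * (100000/990)        * 0.0188114543    second_plan_card,
        (100000/990) * (100000/1150) * 0.0188114543    third_plan_card
from
        dual
;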

Cross-checking against extracts from the 10053 trace files:


Join Card:  190014.689899 = outer (101.010101) * inner (100000.000000) * sel (0.018811)
Join Card - Rounded: 190015 Computed: 190014.689899

Join Card:  165.230165 = outer (86.956522) * inner (101.010101) * sel (0.018811)
Join Card - Rounded: 165 Computed: 165.230165

Conclusion.

Remembering that we’re still looking at very simple examples with perfect frequency histograms: it looks as if we can work out a “Join Selectivity” (Jsel) – the selectivity of a “pure” unfiltered join of the two tables – by querying the histogram data then use the resulting value to calculate cardinalities for simple two-table equi-joins by multiplying together the individual (filtered) table cardinality estimates and scaling by the Join Selectivity.

Acknowledgements

Most of this work is based on a document written by Chinar Aliyev in 2016 and presented at the Hotsos Symposium the same year. I am most grateful to him for responding to a recent post of mine and getting me interested in spending some time to get re-acquainted with the topic. His original document is a 35 page pdf file, so there’s plenty more material to work through, experiment with, and write about.

 

Partial Indexes–Take Care With Truncate

Partial indexes are a very cool feature that came along with Oracle 12c. The capability at partition level to control index existence allows for a couple of obvious use cases:

1) You index the data in recent partitions only, because small amounts of data are aggressively searched by applications and/or users, but not the older data because the queries for older data are either less frequent or are more analytical in nature.

2) You index the data in older partitions only, because you are putting the recent data into the in-memory store so almost any kind of access is very fast, but you do not want to swamp the entire in-memory store with older data that people rarely query.  But you still want reasonable access performance on that old data.

Whatever your use-case, there is currently an outstanding issue with partial indexes that you need to be aware of. (Note: This is not correct functionality and will be fixed in a future release). If you truncate a table, then the partial index dictionary definition may not be correctly observed for partitions.

Here’s an example:


SQL> create table t ( x int, y int )
  2  partition by range ( x )
  3  (
  4    partition p1 values less than ( 1000 ) indexing on,
  5    partition p2 values less than ( 2000 ) indexing off
  6  );
 
Table created.
 
SQL> insert into t
  2  select rownum, rownum from dual connect by level < 2000;
 
1999 rows created.
 
SQL> create index ix on t ( x ) local indexing partial;
 
Index created.
 
SQL> select segment_name, partition_name
  2  from user_segments
  3  where segment_name = 'IX';
 
SEGMENT_NAME                   PARTITION_NAME
------------------------------ ------------------------------
IX                             P1
 
SQL> select partition_name, status
  2  from user_ind_partitions
  3  where index_name = 'IX';
 
PARTITION_NAME                 STATUS
------------------------------ --------
P1                             USABLE
P2                             UNUSABLE

So far so good… partial indexing is working as expected. Then I truncate the table:


SQL> truncate table t;
 
Table truncated.
 
SQL>
SQL> select segment_name, partition_name
  2  from user_segments
  3  where segment_name = 'IX';
 
SEGMENT_NAME                   PARTITION_NAME
------------------------------ ------------------------------
IX                             P1
IX                             P2
 
SQL> select partition_name, status
  2  from user_ind_partitions
  3  where index_name = 'IX';
 
PARTITION_NAME                 STATUS
------------------------------ --------
P1                             USABLE
P2                             USABLE
 

And now we seem to have dropped our knowledge of partial indexing and the second index partition springs into existence. The dictionary for the table remains unaffected:


SQL> select partition_name, indexing
  2  from   user_tab_partitions
  3  where  table_name = 'T';
 
PARTITION_NAME                 INDE
------------------------------ ----
P1                             ON
P2                             OFF

The workaround is easy. You simply re-issue the partial indexing command:


SQL> alter table t modify partition p2 indexing off;

Table altered.

SQL>
SQL> select segment_name, partition_name
  2  from user_segments
  3  where segment_name = 'IX';

SEGMENT_NAME                   PARTITION_NAME
------------------------------ ---------------------------
IX                             P1
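
If you want to be sure the dictionary status is back in step as well, re-running the earlier query against user_ind_partitions should (I would expect) show P2 as UNUSABLE again, matching the very first listing:

SQL> select partition_name, status
  2  from user_ind_partitions
  3  where index_name = 'IX';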

But obviously, you need to remember to do this.

Using Microsoft Flows to Automate RSS Feeds

Now everyone knows how I like to automate everything, and for those that have known me since I started sharing content, I pretty much cried a thousand tears when the personalized news source Prism disappeared.

I’ve been working with RSS feed aggregators to send me content to read each day, but I get frustrated having to dig the emails out of my spam folder or not being able to get to the links, so I wanted to try something new.

Microsoft Flow

I automated a lot of tasks at my previous jobs with IFTTT (If This, Then That) recipes, and after arriving at Microsoft I was thrilled when I saw my first demo of Microsoft Flow.  It’s a growing product that has connectors to a ton of applications and features, and a custom API to build out automated workflows for whatever your heart desires.  What I’m about to do isn’t going to require anything THAT custom, so no worries.

Change Flow

With my time in the Analytics and AI space, my content sources are going to change, so I have a few new blogs that I want to keep up to date with.  For our example, we’ll use one of my recent favorites, Chris Webb’s BI blog.  As we eat our own dogfood here at Microsoft, I’d like to be notified via email when he has a new blog post and have it provided to “my personal team of one” in Microsoft Teams. (I found Teams excellent as a self-project-management program, like Trello, Outlook Calendar and Slack all in one.) You’ll need a Microsoft Teams account to do one of the steps I’ve added, but it’s not necessary if you just want to email the notification of a new post to yourself.

After logging into my Microsoft Flow account, I choose a new template and type “RSS” in the Search bar.  I double-click on the RSS Feed Notification template and it displays the following page:


Feed the URL

If you don’t know how to get the feed URL from a blog you like, click on the browser settings and choose either “Developer Tools” or “Source Code” (or something similar, depending on the browser).  Do a search in the code for “RSS” and you’ll quickly come across the URL for the RSS feed.  By default, it’s commonly the blog URL followed by /feed/, but it’s always better to check.
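
For reference, the line you’re hunting for in the page source usually looks something like this (the href here is just a placeholder, use whatever the real source shows):

<link rel="alternate" type="application/rss+xml" title="RSS Feed" href="https://example.com/feed/" />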


The rest of the information is just my Team name, which shows in the drop down list.  I could send this to any of the Teams I’m part of, but this is for me, so my own little team of one will do.  I dynamically populate the blog summary and the blog post URL to make it easy to go from Teams to the blog post.


Add Email

After filling out the fields, I could save, but instead I click Next Step to add the “email to Outlook” step.  If you don’t have Outlook, there are connectors for Gmail and even Mailchimp!  As I stated, I’m eating my own dogfood, so I have Outlook and just need to fill in the pertinent information as requested.  I dynamically populate some of the fields with the blog summary and URL link so I can easily connect to the site when the email arrives.


Test and Savor

Now I click on Test in the upper right-hand corner.  It tests all the steps and connections, just to verify everything has been filled out correctly, and once that’s completed you can exit: the workflow creation is complete.


Now, when there’s a new blog post from Chris, I’ll know about it, both via email and Teams!  I may have lost Prism, but that doesn’t mean I can’t get the information that’s important to me without seeing all the rubbish that’s put out on most sites!  Yes, I’m adding a ton more workflows to automate more of my day, including adding contacts from coworker emails, automating refreshes on datasets, etc.  Don’t let the preview status of the product scare you: the way it comes up to speed fast is with people working with it and using it to its full potential.

 




ODBV3 and ASM

At Trivadis Performance Days 2018 (awesome event by the way) I promised to deliver ODBV3 with support for ASM – and here it is!

Understanding Distribution in #Exasol

Exasol doesn’t need much administration but getting distribution right matters

Exasol uses a clustered shared-nothing architecture with many sophisticated internal mechanisms to deliver outstanding performance without requiring much administration. Getting the distribution of rows between cluster nodes right is one of the few critical tasks left, though. To explain this, let’s say we have two tables t1 and t2:

[Figure: tablest1t2.png]

The two tables are joined on the column JoinCol, while WHERE conditions for filtering are done with the column WhereCol. Other columns are not shown to keep the sketches small and simple. Now say these two tables are stored on a three-node cluster. Again, for simplicity only active nodes are on the sketch – no reserve nodes or license nodes. We also ignore the fact that small tables will be replicated across all active nodes.

Distribution will be random if no distribution key is specified

Without specifying a distribution key, the rows of the tables are distributed randomly across the nodes like this:

[Figure: randomdistribution.png]

Absence of proper distribution keys: global joins

The two tables are then joined:

SELECT ... FROM t1 JOIN t2 ON t1.JoinCol = t2.JoinCol;

Internally, this is processed as a global join which means network communication between the nodes on behalf of the join is required. This is the case because some rows do not find local join partners on the same node:

[Figure: globaljoin.png]

Distribution on join columns: local joins

If the two tables were distributed on their join columns with statements like these

ALTER TABLE t1 DISTRIBUTE BY JoinCol;

ALTER TABLE t2 DISTRIBUTE BY JoinCol;

then the same query can be processed internally as a local join:

[Figure: localjoin.png]

Here every row finds a local join partner on the same node so no network communication between the nodes on behalf of the join is required. The performance with this local join is much better than with the global join although it’s the same statement as before.

Why you shouldn’t distribute on WHERE-columns

While it’s generally a good idea to distribute on JOIN-columns, it’s by contrast a bad idea to distribute on columns that are used for filtering with WHERE conditions. If both tables had been distributed on the WhereCol columns, it would look like this:

[Figure: wherecols.png]

This distribution is actually worse than the initial random distribution! Not only does this cause global joins between the two tables, as already explained, but statements like e.g.

 WHERE t2.WhereCol='A';

will utilize only one node (the first node, for this particular WHERE condition), and that effectively disables one of Exasol’s best strengths, the Massive Parallel Processing (MPP) functionality. This distribution leads to poor performance because all the other nodes in the cluster have to stand by idle while one node has to do all the work alone.

Examine existing distribution with iproc()

The function iproc() helps investigate the existing distribution of rows across cluster nodes. This statement shows the distribution of the table t1:

SELECT iproc(),COUNT(*) FROM t1 GROUP BY 1 ORDER BY 1;

Evaluate the effect of distribution keys with value2proc()

The function value2proc() can be used to display the effect that a (new) distribution key would have:

SELECT home_node,COUNT(*) FROM (SELECT value2proc(JoinCol) AS home_node FROM t1) GROUP BY 1 ORDER BY 1;
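
The same function is also handy for demonstrating why a WHERE-column would make a poor key, e.g. you could evaluate WhereCol on t2 before deciding (a sketch that re-uses only the columns and functions already shown above):

SELECT home_node,COUNT(*) FROM (SELECT value2proc(WhereCol) AS home_node FROM t2) GROUP BY 1 ORDER BY 1;

A candidate key that piles a heavily filtered value onto a single home node is exactly the situation described in the previous section.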

Conclusion

Distribution on JOIN-columns leads to local joins which perform better than global joins: Do that!

Distribution on WHERE-columns leads to global joins and disables the MPP functionality, both causing poor performance: Don’t do that!

“Hidden” Efficiencies of Non-Partitioned Indexes on Partitioned Tables Part I (The Jean Genie)

When it comes to indexing a partitioned table, many automatically opt for Local Indexes, as it’s often assumed they’re simply easier to manage and more efficient than a corresponding Global Index. Having smaller index structures that are aligned to how the table is partitioned certainly has various advantages. The focus in this little series is on […]

Ansible tips’n’tricks: a different output option

When running ansible scripts, occasionally you wonder why a given task has failed. I found out more than once that it’s commonly a problem with the script, not the engine ;) Finding out exactly where in the script I made the mistake can be more of a challenge.

With the default ansible settings, output can be a bit hard to read. Consider this example: I do quite a bit of patching in my lab, and this almost always requires an upgrade of OPatch (d’oh!). So instead of connecting to each of my hosts and performing the same unzip command over and over again, I thought of using something else. Why not use ansible for this task? It won’t get tired copying/unzipping OPatch to all the destinations I indicate in my configuration. And it won’t introduce a mistake when dealing with the fifth ORACLE_HOME on the third server…

Before replacing $ORACLE_HOME/OPatch with the new version, I want to take a backup of the current OPatch just in case. I don’t want to keep more than 1 backup around in this particular lab environment, so I decided to check for an existing backup first, before creating a new one. If one exists, I remove it. Or at least, that’s the plan.

So I was happily coding away and in my usual trial-and-error approach was ready to test the script I wrote for the first time. Here’s the result (as shown in my 80×24 terminal):

[martin@controller environment]$ ansible-playbook -i inventory.yml broken.yml

PLAY [blogpost] ****************************************************************

TASK [Gathering Facts] *********************************************************
ok: [server1]

TASK [check if there is an old backup] *****************************************
ok: [server1]

TASK [remove old OPatch backup] ************************************************
fatal: [server1]: FAILED! => {"msg": "The conditional check 'backup_present.exis
ts' failed. The error was: error while evaluating conditional (backup_present.ex
ists): 'dict object' has no attribute 'exists'\n\nThe error appears to have been
 in '/home/martin/ansible/blogpost/environment/broken.yml': line 20, column 11, 
but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe 
offending line appears to be:\n\n\n        - name: remove old OPatch backup\n  
        ^ here\n"}

PLAY RECAP *********************************************************************
server1                    : ok=2    changed=0    unreachable=0    failed=1   

[martin@controller environment]$ 

It doesn’t really matter what I was trying to do here, what matters though is the somewhat illegible formatting of the output. The listing above really shows how the error displayed in my terminal. I haven’t quite understood yet why there are linebreaks (\n) in the output that don’t result in a carriage return on screen.

So I did a little bit of digging around and found a global setting named stdout_callback. This is usually defined in /etc/ansible/ansible.cfg which would be bad news for developers if we couldn’t override it. Thankfully you can – using $HOME/.ansible.cfg or even an ansible.cfg file in your project directory. Setting stdout_callback to “debug” reveals a much more readable version of the error:

TASK [remove old OPatch backup] ************************************************
fatal: [server1]: FAILED! => {}

MSG:

The conditional check 'backup_present.exists' failed. The error was: error while
 evaluating conditional (backup_present.exists): 'dict object' has no attribute 
'exists'

The error appears to have been in '/home/martin/ansible/blogpost/environment/bro
ken.yml': line 20, column 11, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


        - name: remove old OPatch backup
          ^ here


PLAY RECAP *********************************************************************
server1                    : ok=2    changed=0    unreachable=0    failed=1

I find this much easier to read, and by setting stdout_callback to a non-default value in my project directory I don’t break anything inadvertently. It also immediately revealed the mistake: I should have been checking backup_present.stat.exists, but I used backup_present.exists. Pretty-printing the output helped me debug the mistake much quicker. Later on, when your script is ready to be deployed it’s probably a good idea not to use the debug callback ;)
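
For completeness, here is roughly what the two pieces look like, as minimal sketches rather than the exact files from my lab (the config file location is the project directory, and only the conditional of the task is shown):

# ./ansible.cfg in the project directory
[defaults]
stdout_callback = debug

and, in the playbook, the corrected conditional on the task becomes:

  when: backup_present.stat.exists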

The ansible version in this post is 2.6.4 by the way.

Happy scripting!

Join Cardinality

Following up my “Hacking for Skew” article from a couple of days ago, Chinar Aliyev has written an article about a method for persuading the optimizer to calculate the correct cardinality estimate without using any undocumented, or otherwise dubious, mechanisms. His method essentially relies on the optimizer’s mechanism for estimating join cardinality when there are histograms at both ends of the join, so I thought I’d write a short note describing the simplest possible example of the calculation – an example where the query is a single-column equi-join with no nulls in either column and a perfect frequency histogram at both ends of the join.  (For a detailed description of more general cases I always refer to the work done by Alberto Dell’Era a few years ago). We start with two data sets that exhibit a strong skew in their data distributions:

rem
rem     Script:         freq_hist_join_02.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem

execute dbms_random.seed(0)

create table t1
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        trunc(3 * abs(dbms_random.normal))      n1
from
        generator       v1
;

create table t2
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        trunc(3 * abs(dbms_random.normal))      n1
from
        generator       v1
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 254'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 254'
        );
end;
/


I’ve generated two tables of 10,000 randomly generated values using the dbms_random.normal() function, but I’ve scaled the value up by a factor of three and taken the absolute value – which has given me a range of 12 distinct integer values with a nicely skewed distribution. Then I’ve gathered stats requesting histograms of up to 254 buckets. Since I’ve tested this only on versions from 11.2.0.4 onwards this means I’ll get a perfect histogram on the n1 columns on both tables.

Now I’m going to run a query that reports the values and frequencies from the two tables by querying user_tab_histograms, using a variant of an analytic query I published a long time ago to convert the cumulative frequencies recorded as the endpoint values into simple frequencies. If, for some reason, this query doesn’t run very efficiently in your tests you could always /*+ materialize */ the two factored subqueries (CTEs – common table expressions):


prompt  =======================================================================
prompt  Multiply and sum matching frequencies. An outer join is NOT needed
prompt  because rows that don't match won't contribute to the join cardinality
prompt  =======================================================================

break on report skip 1
compute sum of product on report
column product format 999,999,999

with f1 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'N1'
order by
        endpoint_value
),
f2 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T2'
and     column_name = 'N1'
order by
        endpoint_value
)
select
        f1.value, f1.frequency, f2.frequency, f1.frequency * f2.frequency product
from
        f1, f2
where
        f2.value = f1.value
;


     VALUE  FREQUENCY  FREQUENCY      PRODUCT
---------- ---------- ---------- ------------
         0       2658       2532    6,730,056
         1       2341       2428    5,683,948
         2       1828       1968    3,597,504
         3       1305       1270    1,657,350
         4        856        845      723,320
         5        513        513      263,169
         6        294        249       73,206
         7        133        117       15,561
         8         40         54        2,160
         9         23         17          391
        10          5          5           25
        11          4          2            8
                                 ------------
sum                                18,746,698

As you can see, the two columns do have a highly skewed data distribution. The pattern of the two data sets is similar though the frequencies aren’t identical, of course. The total I get from this calculation is (I claim) the cardinality (rows) estimate that the optimizer will produce for doing an equi-join on these two tables – so let’s see the test:


set serveroutput off
alter session set statistics_level = all;
alter session set events '10053 trace name context forever';

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));
alter session set statistics_level = typical;
alter session set events '10053 trace name context off';

And the resulting output:

Session altered.
Session altered.


  COUNT(*)
----------
  18746698


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  0wxytnyqs4b5j, child number 0
-------------------------------------
select  count(*) from  t1, t2 where  t1.n1 = t2.n1

Plan hash value: 906334482
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:03.23 |      40 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:03.23 |      40 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |     18M|     18M|00:00:02.96 |      40 |  2616K|  2616K| 2098K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |  10000 |  10000 |00:00:00.01 |      20 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |  10000 |  10000 |00:00:00.01 |      20 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")

As we can see, the estimate for the hash join is “18M”, which is in the right ballpark but, in its current format, isn’t entirely helpful. That’s why I’ve enabled the 10053 trace to get an exact figure from the trace file, and this is what we see:


***********************
Best so far:  Table#: 0  cost: 4.352468  card: 9487.000000  bytes: 28461.000000
              Table#: 1  cost: 378.482370  card: 18467968.000000  bytes: 110807808.000000
***********************

The optimizer’s estimate is exactly the sum of the products of the frequencies of matching values from the (frequency) histogram data. There is a simple rationale for this – it gets the right answer. For each row in t1 with value ‘X’ the (frequency) histogram on t2 tells Oracle how many rows will appear in the join, so multiplying the frequency of ‘X’ in t1 by the frequency of ‘X’ in t2 tells Oracle how many rows the ‘X’s will contribute to the join. Repeat for every distinct value that appears in both (frequency) histograms and sum the results.

As a refinement on this (very simple) example, let’s delete data from the two tables so that we have rows in t1 that won’t join to anything in t2, and vice versa – then re-gather stats, query the histograms, and check the new prediction. We want to check whether a value that appears in the t1 histogram contributes to the join cardinality estimate even if there are no matching values in the t2 histogram (and vice versa):


delete from t1 where n1 = 4;
delete from t2 where n1 = 6;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 254', no_invalidate=>false)
execute dbms_stats.gather_table_stats(user,'t2',method_opt=>'for all columns size 254', no_invalidate=>false)

with f1 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'N1'
order by
        endpoint_value
),
f2 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T2'
and     column_name = 'N1'
order by
        endpoint_value
)
select
        f1.value, f1.frequency, f2.frequency, f1.frequency * f2.frequency product
from
        f1, f2
where
        f2.value = f1.value
;


set serveroutput off
alter session set statistics_level = all;
alter session set events '10053 trace name context forever';

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));
alter session set statistics_level = typical;
alter session set events '10053 trace name context off';

And the output – with a little cosmetic tidying:


856 rows deleted.
249 rows deleted.

PL/SQL procedure successfully completed.
PL/SQL procedure successfully completed.


     VALUE  FREQUENCY  FREQUENCY      PRODUCT
---------- ---------- ---------- ------------
         0       2658       2532    6,730,056
         1       2341       2428    5,683,948
         2       1828       1968    3,597,504
         3       1305       1270    1,657,350
         5        513        513      263,169
         7        133        117       15,561
         8         40         54        2,160
         9         23         17          391
        10          5          5           25
        11          4          2            8
                                 ------------
sum                                17,950,172


Session altered.
Session altered.


  COUNT(*)
----------
  17950172


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  0wxytnyqs4b5j, child number 0
-------------------------------------
select  count(*) from  t1, t2 where  t1.n1 = t2.n1

Plan hash value: 906334482
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:02.89 |      40 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:02.89 |      40 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |     17M|     17M|00:00:02.61 |      40 |  2616K|  2616K| 2134K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |   9144 |   9144 |00:00:00.01 |      20 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |   9751 |   9751 |00:00:00.01 |      20 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")


From the 10053 trace file:
***********************
Best so far:  Table#: 0  cost: 4.340806  card: 9144.000000  bytes: 27432.000000
              Table#: 1  cost: 368.100010  card: 17950172.000000  bytes: 107701032.000000
***********************

You can see from the frequency histogram report that we “lost” values 4 and 6 from the report; then the total from the report matches the actual number of rows returned by the query, and the cardinality estimate in the plan is again in the right ballpark – with the trace file showing an exact match.

I’ve run this test on 11.2.0.4,  12.1.0.2,  12.2.0.1 and  18.3.0.0 (which generated a different set of random values) – and there’s an anomaly that appears in 11.2.0.4 (though maybe that should be “disappeared from”): the optimizer’s estimate for the cardinality was a little larger than the value generated in the query against user_tab_histograms. [Now explained (probably)]

Conclusion:

For an incredibly simple class of queries with perfect frequency histograms there’s a very simple way to calculate the cardinality estimate that the optimizer will predict. Match up rows from the two frequency histograms, multiply the corresponding frequencies (making sure you don’t multiply the cumulative frequencies), and sum.

This is, of course, only a tiny step in the direction of seeing how Oracle uses histograms and covers only a type of query that is probably too simple to appear in a production system, but it’s a basis on which I may build in future notes over the next few weeks.

Update (5th Oct)

The “error” in the 11g calculation irritated me a little, and I woke up this morning with an idea about the solution. In 10.2.0.4 Oracle changed the way the optimizer calculated for a predicate that used a value which did not appear in the frequency histogram: it did the arithmetic using “half the frequency of the least frequently occurring value”. So I thought I’d run up a test where, for my “sum of products” query, I emulated this model. I had to change my query to an “ANSI”-style full outer join, and here it is:

with f1 as (
select 
        endpoint_value                                                            value, 
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from 
        user_tab_histograms 
where 
        table_name  = 'T1' 
and     column_name = 'N1'
),
f2 as (
select 
        endpoint_value                                                            value, 
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from 
        user_tab_histograms 
where 
        table_name  = 'T2' 
and     column_name = 'N1'
)
select
        f1.value, f2.value,
        nvl(f1.frequency, 0)                t1_frequency,
        nvl(f2.frequency, 0)                t2_frequency,
        nvl(f1.frequency, &t1_least / 2) *
        nvl(f2.frequency, &t2_least / 2)    product
from
        f1
full outer join
        f2
on
        f2.value = f1.value
order by
        coalesce(f1.value, f2.value)
;

Running this code, and noting that the lowest frequency in t1 was 4 while the lowest frequency in t2 was 2, I got the following results (with the 10053 trace file summary following the output):


     VALUE      VALUE T1_FREQUENCY T2_FREQUENCY      PRODUCT
---------- ---------- ------------ ------------ ------------
         0          0         2658         2532    6,730,056
         1          1         2341         2428    5,683,948
         2          2         1828         1968    3,597,504
         3          3         1305         1270    1,657,350
                    4            0          845        1,690
         5          5          513          513      263,169
         6                     294            0          294
         7          7          133          117       15,561
         8          8           40           54        2,160
         9          9           23           17          391
        10         10            5            5           25
        11         11            4            2            8
                      ------------ ------------ ------------
sum                           9144         9751   17,952,156


Join Card:  17952157.000000 = outer (9751.000000) * inner (9144.000000) * sel (0.201341)
Join Card - Rounded: 17952157 Computed: 17952157.00
 

That’s a pretty good match to the trace file result – and the difference of 1 may simply be a rounding error (despite the trace file’s text suggesting it is accurate to 6 d.p.).

Footnote

Following an exchange of email with Chinar Aliyev, it’s fairly clear that the “half the least frequency” can actually be derived as “table.num_rows * column.density”.
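
A quick way to look at that claim against this test case is to query the dictionary directly (a sketch using only the standard views user_tables and user_tab_cols; note that the optimizer may use an adjusted density reported in the 10053 trace rather than the dictionary figure):

select
        t.table_name,
        t.num_rows * c.density          derived_value
from
        user_tables     t,
        user_tab_cols   c
where
        c.table_name  = t.table_name
and     t.table_name  in ('T1','T2')
and     c.column_name = 'N1'
;

If the footnote holds for this data set, those products should come back close to the “&t1_least / 2” and “&t2_least / 2” values used in the query above, i.e. 2 for T1 and 1 for T2.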