Hi,
Apress has made the draft versions of our Expert Oracle Exadata book available for purchase.
How this works is:
This is an awesome deal if you can’t wait until the final launch and want to get ahead of the curve with your Exadata skills ;-)
Buy the alpha version of our Expert Oracle Exadata book from Apress here!
If you haven’t heard about this book earlier – I’m one of the 3 authors, writing it together with Kerry Osborne and Randy Johnson from Enkitec. Our official tech reviewer is none other than THE Kevin Closson, and we are also getting some (unofficial) feedback from Oracle database junkie Arup Nanda.
So this book will absolutely rock and if you want a piece of it now, order the alpha book above!
P.S. This hopefully also explains why I’ve been so quiet with my blogging lately – can’t write a book and do many other things at the same time… (at least if you want to do it well…)
When I read the recent post by the optimizer group about the new concurrent gather stats feature added in 11.2.0.2, it reminded me that I had intended, some time ago, to publish something based on the same idea.
The Problem
It was motivated by a client's regular need during a transition phase from non-Exadata to Exadata to create literally thousands of indexes with potentially a multitude of (sub-)partitions as fast as possible - as part of a full datapump import job of a multi-terabyte database running 11.1.0.7 and 11.2.0.1 (Exadata V2).
There are actually two issues regarding the index creation part of a large database import:
1. Datapump import performs the index creation using only a single worker thread, even when the PARALLEL worker thread import feature is used. Although an individual index can be created in parallel, with thousands of smaller index objects this single worker thread potentially does not make efficient use of the available hardware resources on high-end configurations, Exadata in particular.
2. A nasty bug, 8604502, was introduced in 11.1.0.7 and also affects 11.2.0.1 (it is fixed in 11.2.0.2, and a generic one-off patch for 11.1.0.7 and 11.2.0.1 is available on My Oracle Support): IMPDP creates all indexes serially, even those supposed to be created in parallel, and only after the creation ALTERs them to the defined PARALLEL degree. Note that the fix only corrects the behavior at actual execution time. Even with the fix installed (and in 11.2.0.2) the SQLFILE option of IMPDP still generates CREATE INDEX DDLs that always have the parallel degree set to PARALLEL 1 (see MOS document 1289032.1 and bug 10408313 - INDEXES ARE CREATED WITH PARALLEL DEGREE 1 DURING IMPORT, which has been closed as not being a bug). This "not-being-a-bug" also affects all other versions that support the datapump utility: the SQLFILE option always generates CREATE INDEX scripts with the parallel degree set to 1, no matter what the actual degree of the index is supposed to be. Only the ALTER INDEX DDL command following the CREATE INDEX command sets the parallel degree correctly.
For the client, these two issues in combination meant that a full database import job took ages to complete the index creation step after loading the vast amount of table data quite quickly in parallel.
In the case of partitioned indexes there is another complication, independent of the issues mentioned: Oracle uses only one parallel slave per partition for creation - with large and/or few partitions this again doesn't make efficient use of the available resources.
Oracle therefore provides several means to speed up index creation and rebuild tasks, in particular the documented DBMS_PCLXUTIL package, which has been around since the Oracle 8 days. It overcomes the above-mentioned limitation of partitioned index creation by spawning multiple jobs, each rebuilding an index partition in parallel.
Another, undocumented feature is the DBMS_INDEX_UTL package that is obviously used internally as part of several maintenance operations, for example those DDLs that include the "UPDATE INDEXES" clause. According to the spec it allows multiple indexes to be rebuilt concurrently by spawning multiple jobs. However, since it is undocumented it might not be safe to use in production-like configurations. It might also be changed in future releases without notice and is therefore potentially unreliable.
A Solution
Since the client wanted a quick solution that ideally addressed all of the above issues, I came up with a simple implementation that uses Advanced Queuing and background jobs to create as many indexes as desired concurrently.
The solution is targeted towards the client's scenario, so the following is assumed:
- There is a SQL file that contains the CREATE INDEX statements. This can easily be generated via IMPDP based on the dump files using the SQLFILE option.
- To address the CREATE INDEX (not-being-a-)bug (the bugfix for the bug 8604502 still generates incorrect CREATE INDEX DDLs with the SQLFILE option of IMPDP as mentioned above) I've created a combination of "sed" and "awk" Unix scripts that take the IMPDP SQLFILE, potentially including all DDL commands, as input and create an output file that consists solely of the CREATE INDEX commands with correct PARALLEL clauses, based on the ALTER INDEX command following the CREATE INDEX in the script
- To address the lengthy index creation process I've created a small PL/SQL package that sets up the required AQ infrastructure, takes the CREATE INDEX DDL file as input, populates a queue with the index creation commands and spawns as many worker threads as specified that will take care of the actual index creation (that in turn might be a parallel index creation)
As a side note, it is interesting that Oracle actually allows several indexes to be built concurrently on the same segment (which makes total sense but probably does not happen too often in practice).
Note that in principle this code could be used as a general template to execute arbitrary DDLs concurrently (of course with corresponding modifications).
The following link lets you download an archive that contains the following subdirectories:
- correct_parallel_clause: This directory contains the Unix scripts mentioned above. They process a SQLFILE generated by IMPDP and output a DDL file that consists solely of the CREATE INDEX commands contained in the SQLFILE. The generated CREATE INDEX statements also use a correct PARALLEL clause - the degree is taken from the ALTER INDEX DDL command following the CREATE INDEX in the SQLFILE. For further details refer to the README.txt in that directory. Note that the script at present does not handle Domain Indexes, only conventional and bitmap.
- source: Contains the package source for the concurrent index creation, a package that is required by the provided automated unit testing (see below for more details) and a script that prompts for the required details to initiate a concurrent index creation. The README.txt in that directory provides a quick start guide on how to use the concurrent index creation.
- test: Contains two flavours of test harnesses for automated unit testing of the package: one based on the unit testing feature implemented in SQLDeveloper 2.1.1, and another based on "dbunit", an open-source unit testing framework based on jUnit. The README.txt files in the respective subdirectories explain how to use these unit tests.
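To picture what the scripts in "correct_parallel_clause" do, here is a minimal Python sketch of the same idea. It is not the sed/awk code from the archive: the function name "fix_parallel_clauses" is made up for illustration, and it assumes a simplified SQLFILE in which every statement ends with a semicolon and the ALTER INDEX ... PARALLEL n statement immediately follows its CREATE INDEX.

```python
import re

# Hypothetical helper, not part of the archive: keep only CREATE INDEX
# statements and copy the degree from the ALTER INDEX that follows each one.
CREATE_RE = re.compile(r'^CREATE\s+(?:UNIQUE\s+|BITMAP\s+)?INDEX\s+(\S+)',
                       re.IGNORECASE)
ALTER_RE = re.compile(r'^ALTER\s+INDEX\s+(\S+)\s+PARALLEL\s+(\d+)\s*$',
                      re.IGNORECASE)


def fix_parallel_clauses(sqlfile_text):
    # Assumption: no semicolons occur inside the DDL text itself.
    statements = [s.strip() for s in sqlfile_text.split(";") if s.strip()]
    result = []
    for i, stmt in enumerate(statements):
        created = CREATE_RE.match(stmt)
        if not created:
            continue  # drop everything that is not a CREATE INDEX
        if i + 1 < len(statements):
            altered = ALTER_RE.match(statements[i + 1])
            if altered and altered.group(1) == created.group(1):
                # Replace the generated "PARALLEL 1" with the real degree.
                stmt = re.sub(r'\bPARALLEL\s+1\b',
                              "PARALLEL " + altered.group(2),
                              stmt, count=1, flags=re.IGNORECASE)
        result.append(stmt)
    return result


if __name__ == "__main__":
    sample = (
        'CREATE TABLE "S"."T1" ("N1" NUMBER);\n'
        'CREATE INDEX "S"."I1" ON "S"."T1" ("N1") PARALLEL 1;\n'
        'ALTER INDEX "S"."I1" PARALLEL 8;\n'
    )
    for ddl in fix_parallel_clauses(sample):
        print(ddl + ";")
```

A CREATE INDEX without a matching ALTER INDEX right after it is passed through unchanged in this sketch.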
How to use it
The usage is split into two parts: The first part deals with preparing a suitable text file that consists of the CREATE INDEX commands, the second part is about processing this text file with as many worker threads as desired.
Preparing the file is straightforward: You can use the "transform_all_sql.sh" script to generate the required CREATE INDEX script from a DDL script created via IMPDP SQLFILE.
The script has been tested primarily with bash, sed and awk under Cygwin 1.7.1 and OEL5. Different Unix flavors might have different versions of the shell, awk or sed and therefore might behave differently.
Simply put all four Unix scripts from the "correct_parallel_clause" directory into the same directory, mark them as executable and run "transform_all_sql.sh" like this:
where "input_file" is the file generated via IMPDP SQLFILE option and "output_file" will be the result.
In order to perform the parallel index creation, you need an account with suitable privileges. Since it is assumed that the indexes will have to be created in different schemas, this account needs extended privileges. The package is implemented using invoker's rights, so granting these privileges via roles is sufficient. A quick and dirty solution is to create a temporary account and simply grant the DBA role to it (this is what I did for testing). Note that the account also requires EXECUTE privileges on the DBMS_AQ and DBMS_AQADM packages for the AQ stuff. It also needs a simple logging table to which errors and progress are written, as well as a type that is used as the payload of the queue. Obviously the account also needs to be able to create jobs - in this version of the package this is done via DBMS_SCHEDULER. At execution time the package creates a queue plus a queue table that needs to be stored in a tablespace, so make sure that the account (or at least the database) that executes the index creation has an appropriate default tablespace defined.
You can simply run the "pk_create_index_concurrent.sql" script (located in the "source" directory) in such a suitable account which will deinstall/install all required objects.
The execution of the index creation is then straightforward (taken from the package specification):
Note that the "p_job_submit_delay" parameter is currently not used - there were some odd locking issues on the AQ table in case of a RAC environment when using that option so I have commented out its usage at present - I haven't had a chance yet to investigate further what the problem actually was.
So the only required input to the CREATE_INDEX_CONCURRENT procedure is the name of the directory object that points to the directory where the file to process resides and the name of the file itself.
You probably want to specify the number of worker threads for the two sets: The idea here is to distinguish between the creation of serial and parallel indexes. The first parameter specifies the number of worker threads used for serial indexes, the second one the number of concurrent threads for parallel indexes.
The default is CPU_COUNT * INSTANCES threads for serial indexes and a single thread for parallel indexes.
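The idea of two worker sets can be sketched in a few lines of Python. This is only an in-memory analogy using threads and queue.Queue, not the AQ/DBMS_SCHEDULER implementation of the package. The names "run_worker_sets" and "is_parallel_ddl" are hypothetical, and treating "PARALLEL n with n > 1" as the marker for a parallel index is an assumption of this sketch.

```python
import queue
import re
import threading


def is_parallel_ddl(ddl):
    # Assumption: a statement counts as "parallel" if it says PARALLEL n, n > 1.
    m = re.search(r'\bPARALLEL\s+(\d+)', ddl, re.IGNORECASE)
    return bool(m) and int(m.group(1)) > 1


def run_worker_sets(ddls, serial_workers, parallel_workers, execute):
    """Run `execute(ddl)` for every DDL using two independent worker sets."""
    serial_q, parallel_q = queue.Queue(), queue.Queue()
    for ddl in ddls:
        (parallel_q if is_parallel_ddl(ddl) else serial_q).put(ddl)

    log = []                      # stands in for the logging table
    log_lock = threading.Lock()

    def worker(q, set_name):
        while True:
            try:
                ddl = q.get_nowait()   # queues are fully populated up front
            except queue.Empty:
                return
            try:
                execute(ddl)
                status = "OK"
            except Exception as exc:   # keep going, but record the error
                status = "ERROR: %s" % exc
            with log_lock:
                log.append((set_name, ddl, status))

    threads = [threading.Thread(target=worker, args=(serial_q, "SERIAL"))
               for _ in range(serial_workers)]
    threads += [threading.Thread(target=worker, args=(parallel_q, "PARALLEL"))
                for _ in range(parallel_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                  # wait until all workers have finished
    return log
```

Many workers for the serial set and only one or a few for the parallel set mirrors the defaults described above, because each parallel index creation already uses several parallel slaves on its own.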
If you don't want/need this separation of serial and parallel indexes, simply use the same "worker_set_id" for both parameters "p_worker_set_id_1" and "p_worker_set_id_2", specify the desired total parallel degree in one of the degree parameters and set the other one to 0 (the 0 is required, otherwise one of the DBMS_SCHEDULER.CREATE_JOB calls will fail with a "duplicate job name/job name already exists").
The "p_sleep_seconds" parameter is only used to allow the jobs spawned to put a lock on the queue table - the teardown is then going to wait until all locks have been removed and therefore all queue processing has ended. The default of 10 seconds was sufficient in all cases I've encountered.
Since the package requires as prerequisite a directory where the file to process resides, I've prepared the script "create_index_concurrent.sql" that guides through the required inputs and takes care of that step as well.
It takes the full O/S path to the file and the file name as input, creates a directory object CREATE_INDEX_CONCURRENT_DIR pointing to that directory, and then prompts for the two degrees and the names of the two worker thread sets before calling the CREATE_INDEX_CONCURRENT stored procedure.
Caveats
Please note that you should double-check not to pass a non-transformed SQLFILE generated via IMPDP to the procedure - the results may be dire since the generated SQLFILE always contains much more than the bare CREATE INDEX commands, no matter what options you use for IMPDP. Always use the provided Unix scripts to post-process the SQLFILE before initiating the index creation.
Furthermore you need to be aware of the current limitation of the package that it does not attempt to tokenize the file contents. It simply uses a semicolon as delimiter to separate the DDL commands. This should be sufficient for most cases, but in case you have a function-based index using a string expression containing a semicolon as part of the index definition this will not work as expected. Also if you plan to use this package for other DDL execution activities like CTAS statements you might again hit this limitation if the DDL text contains semicolons.
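A tiny Python illustration of this limitation follows. The package itself is PL/SQL, and the function "split_on_semicolons" is only a stand-in for the simple delimiter logic described above, not the package code.

```python
def split_on_semicolons(text):
    """Naive statement splitter: every semicolon ends a statement."""
    return [part.strip() for part in text.split(";") if part.strip()]


# Two ordinary statements are separated as intended.
plain = "CREATE INDEX i1 ON t1 (n1);\nCREATE INDEX i2 ON t1 (n2);"

# A function-based index with a semicolon inside a string literal
# is cut into two useless fragments.
tricky = "CREATE INDEX i3 ON t1 (REPLACE(c1, ';', ','));"

if __name__ == "__main__":
    for fragment in split_on_semicolons(plain) + split_on_semicolons(tricky):
        print(repr(fragment))
```

The single statement in "tricky" ends up as two fragments, neither of which is valid DDL, which is exactly the failure mode to watch out for.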
Note that creating indexes using this tool results potentially in different index statistics than creating the indexes using IMPDP since IMPDP by default also imports the index statistics whereas the indexes created using this tool will end up with the current index statistics automatically generated during index creation (from 10g onwards, and the code requires at least 10.2). If you want to have the index statistics imported you can run IMPDP after the index creation using the INCLUDE=INDEX_STATISTICS option. This should complete fairly quickly and will import the index statistics only.
If you have SERVEROUTPUT enabled by default then you will very likely see some errors that are printed by the initial attempt to tear down the AQ infrastructure. These errors are expected if the previous run completed successfully or in case of the initial run and can be ignored (they are caught and ignored by the default implementation).
Note also that all provided scripts except for the Unix shell scripts use DOS file format - under OEL this isn't a problem but it might be on your platform.
Finally the inevitable disclaimer: Although this has been tested thoroughly it comes with absolutely no warranty. Use it at your own risk and test it in your environment before attempting any runs against anything important.
Monitoring the execution
The code logs errors and progress into the table CREATE_INDEX_CONCURRENT_LOG. At present the code logs every attempt to execute DDL into the table as well as any errors that are raised during that DDL execution.
So the table can be used both for monitoring the progress and for checking for errors. The code currently continues the execution when errors are encountered, using the dreaded WHEN OTHERS THEN NULL construct, but it is already prepared for more granular error handling if required - see the defined exceptions and the commented-out exception handler.
You can view the queue contents in the corresponding queue view created by the AQ setup (AQ$CREATE_INDEX_QUEUE) in order to see the data to process. Note that because the worker threads do not commit the queue transaction as they go, you won't be able to see the progress in the queue table until all worker threads have committed. If you don't like that, you can remove the wait and the "teardown_aq" call at the end of the main procedure "create_index_concurrent" and uncomment the dequeue option "visibility=immediate" in the "create_index_thread" procedure. You would then need to call "teardown_aq" in a separate step as desired. With this modification you can monitor the progress by monitoring the queue, but the provided automated unit testing won't work with that variant since it relies on the main call to wait for all worker threads to complete before validating the results.
However you can see the progress also in the log table using the following sample query:
If you want to perform more sophisticated queries on that table you might need to use some casts similar to the following, because the text columns are defined as CLOBs in order to be able to hold the complete DDLs and error messages in case of errors. The casts allow you to perform, for example, GROUP BYs etc.
The Unit Testing
Here we come to a completely different issue that is off-topic for this post, however in my experience so far it seems to be a very important one and I hopefully will have the time to cover it in the future with separate posts.
Generally speaking I've seen too many shops that don't follow best practice when it comes to database deployment and development, so here is what you should ideally know/do about it - in a nutshell:
- Treat your database like source code, which means put everything related to the database under version control. This includes not only the obvious database source code but also DDL and DML scripts for schema evolution
- Use unit testing to test database code. Automate this unit testing
- Automate the deployment of your database related changes
- Install a continuous integration environment that runs the automated deployment and unit tests regularly, for example every night
- Automate deployment everywhere - starting from the development databases up to the production environment
- Follow your guidelines strictly - for example any hotfix-like adhoc change should still go through the established processes - code changes, testing, deployment etc.
I've helped several clients in the past to set up corresponding tools and processes for implementing the above - if you are interested, get in touch with me.
So as a bonus, if you haven't spent much time yet with the above-mentioned topics, and in order to get you started at least with automated unit testing, I've included two different examples for the small source provided, one using the built-in unit test feature of SQLDeveloper and the other one using "dbunit". You can find both in the corresponding subdirectories of the "test" folder in the archive.
The unit testing is based on the "pk_create_index_concur_test.sql" package that is used to set up and tear down the environment for running the unit test. At present it assumes the existence of a directory "C:\app\oracle\admin\orcl112\dpdump" at O/S level. It will create a directory object for the path and attempt to create/write a file used for the unit test runs. You can pass any valid O/S directory path to the "pk_create_index_concur_test.setup" procedure if you want/need to use a different one.
All provided automated tests assume that both scripts, "pk_create_index_concurrent.sql" and "pk_create_index_concur_test.sql" have been run in the schema that should be used for test runs.
You can use the SQLDeveloper Unit Test feature to run the provided Unit Test. You can either use the GUI to import and run the test, or you can use a command line version that uses ANT to run the UTUTIL command line tool that comes with SQLDeveloper. The "README.txt" in the test/SQLDeveloper directory explains how to do so. Either way (GUI or UTUTIL command line), you'll need to set up a unit test repository initially if you want to use SQLDeveloper's unit testing feature. See the SQLDeveloper user's guide or online help for how to do that (Hint: Menu item "Extras->Unit Testing" gets you started).
If you don't like the SQLDeveloper unit test approach or you are simply too lazy to install the tool, the unit test repository etc., you can alternatively try the automated unit testing using "dbunit". Follow the instructions in the "README.txt" in the test/dbunit directory to run the unit tests using "dbunit".
This version of the package has successfully been tested using these unit tests on 10.2.0.4, 10.2.0.5, 11.1.0.7, 11.2.0.1 and 11.2.0.2 (after all it's dead easy with automated unit testing :-).
Summary
The provided tool set should represent a solid foundation for the given task of concurrent index creation. In particular it has been designed with the following in mind:
- Efficient use of privileges granted via roles: The package uses invoker's rights and most operations use dynamic SQL to avoid compilation issues, therefore granting the required privileges to the account used via roles should be sufficient
- The Unix scripts should be able to deal with table-, schema- and database-level datapump formats from Oracle 10g and 11g (all these variants use slightly different texts to identify the relevant sections of the generated SQLFILE by IMPDP)
- Optional use of two separate worker thread sets: This allows the concurrent creation of a multitude of indexes, be it serial or parallel, with clear distinction between the handling of serial (possibly many worker threads) and parallel indexes (usually only a few worker threads)
- Support for arbitrarily sized SQL: The DDL commands for (sub-)partitioned indexes can become quite large due to the way the Oracle meta data API generates the SQL. Therefore these generated SQLs can easily exceed the usual 32KB limit for PL/SQL character strings. The implementation uses CLOBs for the processed SQLs (and DBMS_SQL in versions lower than 11 to handle these piecewise) to support these potentially very large SQLs
- RAC/Grid/Cluster support via DBMS_SCHEDULER: The usage of DBMS_SCHEDULER allows fine-grained control of the resource consumption through the optional use of job classes (not implemented yet but easily added - it is a simple additional parameter to the CREATE_JOB procedure), which let you specify a resource consumer group and a specific service name for the spawned worker threads
- Automated Unit Testing support: The provided unit test harness allows for easy testing of modifications to the code
I've recently come across an interesting variation of a "famous" ASSM bug. Probably some of you will remember that ASSM bug that was caused by row migrations in larger block sizes (16K/32K).
If you don't remember or don't know what I'm talking about, you can have a look here where Greg Rahn provides a summary of the issue or check My Oracle Support bug description 6918210.
Greg also links to a script originally created by Jonathan Lewis that reproduces the issue at will.
So far the issue was only reproduced on block sizes greater than 8K - the variation I've encountered however reproduces the issue on 8K and 4K, possibly also on 2K, but I haven't tested 2K yet.
Below is my version of the script. If you compare it to Jonathan's version you'll notice that it is almost the same, except for additional optional instrumentation that you can simply un-comment if you've installed my Advanced Oracle Troubleshooting script package, which is based on Tanel Poder's awesome "tpt_public" tool set.
The SESSPACK tool can be found in Tanel's tool set (tools/sesspack_0.05_release) and the SNAP_KCBSW package has been developed by Jonathan a long time ago - it can be found here. Note that it only works for versions below 11g - this instrumentation has been "optimized away" in 11g, unfortunately.
In order to reduce the runtime, I've simply limited the number of rows in the table to 50,000 rows.
drop table t1;
purge table t1;
CREATE TABLE t1
(n1 NUMBER,
n2 NUMBER)
TABLESPACE &tblspace;
INSERT --+ append
INTO t1
SELECT TRUNC(dbms_random.VALUE(10000000,100000000)) n1,
TO_NUMBER(NULL) AS n2
FROM dual
CONNECT BY LEVEL <= 50000
/
BEGIN dbms_stats.gather_table_stats(
ownname => null,
tabname => 'T1');
END;
/
SELECT num_rows,blocks FROM user_tables WHERE table_name = 'T1';
/* Uncomment for instrumentation
@trci assm_bug
exec sesspack.snap_me
execute snap_kcbsw.start_snap
@46on 8
*/
alter session set events '10046 trace name context forever, level 8';
UPDATE t1 SET n2 = n1;
commit;
/* Uncomment for instrumentation
@trci assm_bug_off
@46off
*/
alter session set events '10046 trace name context off';
/* Uncomment for instrumentation
set serveroutput on size 1000000 format wrapped
set linesize 120
set trimspool on
execute snap_kcbsw.end_snap
exec sesspack.snap_me
*/
BEGIN dbms_stats.gather_table_stats(
ownname => null,
tabname => 'T1');
END;
/
SELECT num_rows,blocks FROM user_tables WHERE table_name = 'T1';
/* Uncomment for instrumentation
@trc_orasrp &trc_p &trc_f
@trc_tvdxtat &trc_p &trc_f
*/
Here is the task: You are allowed to modify the script at exactly one single location - the modification can take a maximum of four keywords, which means you can add/modify/remove at most four keywords.
With the correct modification you will be able to reproduce the bug even in 8K and lower block sizes.
So, what to modify and why?
If you want to actually run the script yourself you need to use a database version prior to 11.2 because the bug is obviously fixed there - this includes 10.2.0.5, which interestingly doesn't have the bug fixed.
I've used an 8K/4K ASSM tablespace with UNIFORM 1M extents for my tests, but I don't think that the extent management matters in this case. My test database uses 8K as default block size.
You'll notice the bug when checking the runtime and the trace file. If you encounter the bug, the runtime for the update will be several seconds (more than 10 seconds seen on my test system in some cases) and the number of current mode gets for the update will be in the millions.
If you've enabled the additional instrumentation it will tell you that the reasons for the buffer gets were "ktspfsrch" and "ktspscan_bmb" for most of the gets. If you use more than 50,000 rows to get a longer runtime of the update statement, you can also take stack traces (e.g. using Tanel's OStackProf tool), which will show you similar function names on the stack.
If you don't hit the bug, the update usually takes max. 1-2 seconds, and the current mode gets should be far less than one million when sticking to the 50,000 rows.
P.S.: There is more than one correct answer - and it is possible to hit the bug for 8K block sizes with a single keyword modification (full points!).
Update 24th Jan: P.P.S: No takers yet... So here's an additional hint: The issue is caused by row migration...
Update 26th Jan: OK, time to post a quick answer here. As pointed out by Narendra below, simply setting PCTFREE to 0 was already sufficient to reproduce the issue with smaller block sizes. However, there is much more to tell and therefore this deserves a separate post that I'll publish in the next couple of days.
For the time being here are the correct answers that I'm aware of at present:
- PCTFREE 0
- COMPRESS
- COMPRESS FOR ALL OPERATIONS
But as I already said, there is much more, in particular when partitioning comes into the picture - and I hope to cover all these details in the upcoming post.
Despite my attempts to not travel (much) anymore, I’m still going to speak at two events in the first half of 2011.
The first event is at the Suncoast Oracle User Group meeting in Tampa, FL on Thursday, 27th January. I’ll speak about Advanced Oracle Troubleshooting in 60 minutes.
I will deliver the second presentation at Hotsos Symposium 2011 near Dallas, TX on Wednesday 9th March. I will talk about Troubleshooting the Most Complex Oracle Performance Problem I’ve ever seen. If you think that this performance problem was complex, then be prepared to see an even worse one there!
I’m talking about this:
select-1from from dual;
Looks invalid, right? Well, let’s run it:
SQL> select-1from from dual;
ROM
----------
-1.0E+000
This is because:
The following funky-looking SQL statements are also valid:
SQL> select.1e2ffrom dual;
.1E2F
----------
1.0E+001
SQL> select.1e2fas"."from dual;
.
----------
1.0E+001
In the upper example, the “.1e2f” means number .1 * 10^2 (scientific notation) represented as a FLOAT internally and in the lower one I’ve just added a column alias with “AS” keyword just to make the SQL look a bit crazier.
:-)
And this time we have audio !!! (Wow!)
Following the huge success of my last hacking session, planned while drinking beer at Graham Wood’s OOW pre-party and delivered from Miracle’s massive Oracle Closed World event in Thirsty Bear (between drinking beers), I’m announcing another hacking session:
What: Using Session Snapper for flexible Oracle Performance Troubleshooting
When: Wednesday 27th Oct 9:00-10:00 AM PDT (US West coast / California time). Check what’s this in your time zone here
Where: Internet! -> Sign up here: http://tech.e2sn.com/secret
You’ll need to register fast and be “there” on time as my current GotoWebinar account only allows 100 attendees to log on… last time over 100 people signed up, but “luckily” fewer actually showed up, so nobody got left outside!
BTW, I have figured out what went wrong with the audio last time (and caused my voice to disappear at the end of the presentation). A program, which I accidentally launched via a keyboard shortcut, grabbed my Mic input to itself, so gotowebinar’s app couldn’t access it anymore.
See you soon!
I was running tests on some very large tables on an Exadata cluster and found an interesting bug.
Instead of having to query 4 billion row tables, I have reproduced this case with a cartesian join…
Check this. I’m generating 8 billion rows using a 3-way cartesian join of a set of 2000 rows. So, this results in 2000 * 2000 * 2000 rows, which is 8 billion rows.
SQL> with sq as (select null from dual connect by level <= 2000)
select count(*)
from sq a, sq b, sq c;
COUNT(*)
----------
8000000000
Everything worked well as you see. All 8 billion rows were nicely counted. Let’s modify this query a bit, by adding a WHERE rownum <= 8 000 000 000 predicate, which shouldn’t modify the outcome of my query as 8 billion rows is exactly what I have:
SQL> with sq as (select null from dual connect by level <= 2000)
select count(*)
from sq a, sq b, sq c
where rownum <= 8000000000;
COUNT(*)
----------
4294967295
Ding! (as I’m unable to come up with a better word to present this problem :)
Seems like we have a problem! The COUNT operation thinks that we have only a bit over 4 billion rows returned from the query plan branches below it.
We clearly have an (easily reproducible) bug here!
What happens here is that whenever you use ROWNUM < N operator in a query predicate like I have done, you will introduce an additional step to the query plan (COUNT STOPKEY).
Check this simple query:
SELECT * FROM t WHERE ROWNUM <= 1000

Plan hash value: 508354683

--------------------------------------------
| Id  | Operation          | Name | E-Rows |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |        |
|*  1 |  COUNT STOPKEY     |      |        |
|   2 |   TABLE ACCESS FULL| T    |     41 |
--------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=1000)
As the above plan explains, the “WHERE rownum <” predicate is applied in the execution plan line (row source) 1 – COUNT STOPKEY. You won’t see this line when you are just counting rows without any “ROWNUM <” predicate.
How this works is that the COUNT STOPKEY knows how many rows you want and will just keep calling its child function under it in the execution plan tree to get more and more rows, until the required number of rows has been returned. Then the COUNT STOPKEY function just stops calling its child row sources and declares the end-of-data condition.
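The row-source behaviour described above can be modelled as a small Python generator. This is just a toy model of the idea, not Oracle's implementation, and the names "count_stopkey" and "full_scan" are made up for the illustration.

```python
def count_stopkey(child_rows, limit):
    """Pull rows from the child row source until `limit` rows were returned."""
    returned = 0
    child = iter(child_rows)
    while returned < limit:
        try:
            row = next(child)      # call the child row source for one more row
        except StopIteration:
            return                 # child ran out of rows before the limit
        returned += 1
        yield row
    # Limit reached: stop calling the child and signal end-of-data.


def full_scan(n, calls):
    """Stand-in for TABLE ACCESS FULL; records every row it had to produce."""
    for i in range(n):
        calls.append(i)
        yield i


if __name__ == "__main__":
    calls = []
    rows = list(count_stopkey(full_scan(1000000, calls), 1000))
    print(len(rows), len(calls))
```

The child is asked for exactly as many rows as the limit requires, so the "full scan" never reads the rest of the table.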
And here’s the catch – due to a bug, the variable used by COUNT STOPKEY to keep track of the number of processed rows is only 4 bytes, even on 64-bit platforms. So it can only hold values up to 4294967295 (the count returned above), which is 2 ^ 32 – 1. That wouldn’t be much of a problem in practical real-world applications, but what I don’t like is that the operation silently returns wrong data: even though we haven’t reached the ROWNUM <= 8 000 000 000 limit yet and there are still rows to fetch from the child row source, COUNT STOPKEY stops fetching, declares the end-of-data condition and returns ~4B as the result.
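The numbers are easy to check. The following Python lines only model the observed outcome (a result capped at the largest value a 4-byte unsigned counter can hold). How the counter behaves inside Oracle is not something this sketch claims to know, and "rows_counted" is a hypothetical helper.

```python
UINT32_MAX = 2 ** 32 - 1          # largest value a 4-byte unsigned counter holds
TOTAL_ROWS = 2000 ** 3            # rows produced by the 3-way cartesian join
REQUESTED = 8000000000            # the ROWNUM <= 8 000 000 000 predicate


def rows_counted(available, requested, counter_max=UINT32_MAX):
    """Toy model: the stop key cannot count past what its counter can hold."""
    return min(available, requested, counter_max)


if __name__ == "__main__":
    print(UINT32_MAX)                            # 4294967295
    print(TOTAL_ROWS)                            # 8000000000
    print(rows_counted(TOTAL_ROWS, REQUESTED))   # 4294967295
```

With a counter_max of 2 ** 64 - 1 the same model returns all 8000000000 rows, which is the result one would expect once the counter is wide enough.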
This is (obviously) a bug and now there’s a bug number for that as well (thanks to Greg Rahn for letting me know) – Bug# 10214991 (unfortunately it’s not public in MOS).
Now, there’s a reason why this bug has gone unnoticed for so long even though 4+ billion-row tables have existed for a long time (I first worked with such an Oracle database in year 2000 – on Windows NT :).
Now we’ll just need to wait until the data volumes (and reports) get so big that someone actually hits this problem in real life!
So, better patch your database before the end of this decade! :-)
Yes, it’s official. I’m organizing a virtual conference with some of THE top speakers in the world. The topic is Systematic Oracle SQL Optimization (in the real world).
The dates are 18-19 November. The conference lasts for 4 hours on both days, so you’ll still be able to get some work done as well (and immediately apply the knowledge acquired!).
Well, none of the speakers needs an introduction, but just in case you’ve lived in space for the last 20 years, here are the links to their blogs :)
I can tell you, (at least the first 3) people in the above list ROCK!!!
And all of them are OakTable members too :)
This conference will have 4 x 1.5 hour sessions, each delivered by a separate speaker. We aim to systematically cover the path of:
And as this is the first (pilot) virtual conference, the price is awesome, especially if you get the early bird rate by signing up before 1 November!
So, check out the abstracts, details, agenda and sign up here!
P.S. I expect this event to be awesome!
Ok guys, thanks for waiting!
I ended up expanding the article quite a lot compared to what I had originally planned. In fact I have only written 50% of what I plan to write, I’ll add the rest… um… later… Instead of just stating the difference between the joins, I took a step back and elaborated on something that I often see people doing (and talking about in newsgroups and lists too).
Basically the most fundamental (or biggest or most important) difference between nested loop and hash joins is that:
In other words, when joining tables A and B (A is the driving table, B is the probed table), a nested loop join can take the first row from A and perform a lookup in B using that value (of the column(s) you join by). Then the nested loop takes the next row from A and performs another lookup in table B using the new value. And so on and so on and so on.
This opens up additional access paths to table B. For example, when joining ORDERS and ORDER_ITEMS by ORDER_ID (and ORDER_ID is the leading column of the primary key of the ORDER_ITEMS table), then for whatever orders are taken from the ORDERS table, we can perform a focused, narrow index range scan on ORDER_ITEMS for every ORDER_ID retrieved from the driving ORDERS table. A hash join can’t do that.
Of course this doesn’t mean that hash joins can’t use any indexes for the tables they read. Index range scans and unique lookups can still be used under a hash join, but only if there are constant values in the query text (in the form of literals or bind variables). If there are no such constant (filter) conditions under a hash join, then the other options for using that index are an INDEX FULL SCAN (which is a range scan from one end of the index to the other) or an INDEX FAST FULL SCAN (which is like a full table scan through the entire index segment). However, neither of these options gives the same benefit as nested loops looking up rows from row source B dynamically, based on what was retrieved from A at runtime.
Note that this nested loops benefit isn’t limited to indexes on table B; the difference is more fundamental than just a specific access path. For example, if table B happens to be a single table hash cluster or an indexed X$ table, then the nested loop is also able to do “optimized” lookups from these row sources, based on the values retrieved from table A.
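To make the difference concrete, here is a minimal Python sketch of the two join styles using the ORDERS / ORDER_ITEMS example. It is a toy model, not how Oracle implements either join: the sample data and the helper names (`build_index`, `nested_loop_join`, `hash_join`) are made up for illustration, and a Python dict stands in for the index on ORDER_ITEMS(ORDER_ID).

```python
from collections import defaultdict

# Hypothetical sample data: A = ORDERS (driving), B = ORDER_ITEMS (probed)
orders = [{"order_id": 1, "customer": "A"}, {"order_id": 3, "customer": "B"}]
order_items = [
    {"order_id": 1, "item": "disk"},
    {"order_id": 1, "item": "cable"},
    {"order_id": 2, "item": "rack"},
    {"order_id": 3, "item": "switch"},
]


def build_index(rows, key):
    """Group rows by key value (stand-in for an index or a hash table)."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


# Stand-in for a pre-existing index on ORDER_ITEMS(ORDER_ID)
items_by_order_id = build_index(order_items, "order_id")


def nested_loop_join(driving_rows, probe_index, key):
    """For each row from A, look up B using the value just retrieved from A."""
    joined, b_rows_read = [], 0
    for a in driving_rows:
        matches = probe_index.get(a[key], [])  # focused lookup driven by A's value
        b_rows_read += len(matches)
        joined.extend({**a, **b} for b in matches)
    return joined, b_rows_read


def hash_join(build_rows, probe_rows, key):
    """Hash A in memory, then read B without using A's values to narrow the access."""
    hash_table = build_index(build_rows, key)
    joined, b_rows_read = [], 0
    for b in probe_rows:  # every row of B is read, matching or not
        b_rows_read += 1
        joined.extend({**a, **b} for a in hash_table.get(b[key], []))
    return joined, b_rows_read


nl_rows, nl_read = nested_loop_join(orders, items_by_order_id, "order_id")
hj_rows, hj_read = hash_join(orders, order_items, "order_id")
print(len(nl_rows), nl_read)  # 3 3
print(len(hj_rows), hj_read)  # 3 4
```

Both joins return the same three rows, but the nested loop only touches the ORDER_ITEMS rows for the orders it was given, while the hash join reads all of ORDER_ITEMS and discards the non-matching row afterwards.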
So, my article with a lot of (loosely) related details is here:
In the comments section of my question, Tom, Bernard Polarski, Christian Antognini and Marc Musette got the closest to what I had in mind when I asked the question. However, of course your mileage may vary somewhat depending on what kind of problems you have experienced the most over the years. Also, Jonathan Lewis made a valid comment that the answer depends on what exactly one means by “fundamental”, and yeah, this was open to interpretation.
Nevertheless, I wanted to emphasize that there’s a more important difference between NL and hash joins than the usual stuff you see in training material, which talks about implementation details like hash tables and memory allocation…
Some day I will complete that article. I plan to add some design advice in there, like denormalization opportunities for getting the best of both worlds etc. But now I’m gonna get a beer instead.
Thanks for reading and answering my blog, I was quite impressed by the volume of comments & answers to my question. I must do this more often!
I’ve been tracking my business travel with Tripit.com’s awesome service for about 2 years now.
After getting back from my Tallinn->Helsinki->New York->Detroit->New York->San Francisco->New York->Helsinki->Tallinn trip yesterday, Tripit reported that I have flown 1 007 509 km during my business trips (American readers, that’s about 42 miles ;)
Check yourself below :)
Tripit says I’ve visited 71 different cities in 27 countries within the last two years.
Here’s the map of places where I’ve visited my clients, done training or spoken at conferences:
Actually there are probably a couple more cities where I’ve been in the last two years. For some reason Tripit doesn’t recognize my trip to Melbourne (but it does show the visit to Sydney, which I made during the same trip).
Anyway, the conclusion here is that I think I’ve done enough flying for now. Now I plan to stay at home for a loooong time (I mean at least 3-4, maybe even 5 weeks in a row!!! ;)
But seriously, what I’ve decided is that:
Ok, enough of self-promotion and advertising, back to work ;-)
P.S. I will publish my online seminar schedule “very soon now”!!!
P.P.S. I’m not affiliated with Tripit.com business-wise in any way, but if you travel frequently, I recommend you check out their awesome service (and iPhone app). The basic version is free, but I just decided to upgrade to Pro after a couple of years of using it!