While I was writing Brown Noise in Written Language, Part 2, twice I came across the word “agile.” First, the word “agility” was in the original sentence that I was criticizing. Joel Garry picked up on it and described it as “a code word for ‘sloppy programming.’” Second, if you read my final paragraph, you might have noticed that I used the term “waterfall” to describe one method for producing bad writing. Waterfall is a reliable method for producing bad computer software too, in my experience, and for exactly the same reason. Whenever I disparage “waterfall,” I’m usually thinking fondly of “agile,” which I consider to be “waterfall’s” opposite.
Finally, we look at the last portion of the index block dump, which refers to the actual 3 index entries in our demo that currently reside within the index leaf block we dumped previously.

row#0 flag: ------, lock: 0, len=15
col 0; len 5; (5):  42 4f 57 49 45
col 1; len 6; (6):  02 01 48 8a [...]
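The hex bytes in the dump can be decoded by hand. A minimal sketch in Python (assuming a single-byte character set; col 0 here is the indexed key value shown in the dump above):

```python
def decode_index_col(hex_bytes):
    """Decode the space-separated hex bytes of a dumped index column
    into a readable string (single-byte character set assumed)."""
    return ''.join(chr(int(b, 16)) for b in hex_bytes.split())

# col 0 of row#0 holds the indexed key value:
print(decode_index_col("42 4f 57 49 45"))   # → BOWIE
```

The second column (02 01 48 8a ...) is not character data, so the same decoding does not apply to it.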
Ok guys, thanks for waiting!
I ended up expanding the article quite a lot compared to what I had originally planned. In fact, I only wrote 50% of what I plan to write; I’ll update the rest… um… later… Instead of just stating the difference between the joins, I took a step back and elaborated on something I often see people doing (and talking about in newsgroups and lists too).
Basically the most fundamental (or biggest or most important) difference between nested loop and hash joins is that:
In other words, when joining tables A and B (A is the driving table, B is the probed table), a nested loop join can take the first row from A and perform a lookup into B using that row’s value (of the column(s) you join by). Then the nested loop takes the next row from A and performs another lookup into table B using the new value. And so on and so on.
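As a minimal sketch in Python – with dicts standing in for rows and a dict-of-lists standing in for an index on B; all names here are invented for illustration:

```python
def nested_loop_join(driving_rows, probe_index, key):
    """Nested loop join: for each row of the driving row source,
    do an indexed lookup into the probed row source using the
    join value taken from that row."""
    for a in driving_rows:
        # The lookup happens once per driving row, with a concrete value in hand.
        for b in probe_index.get(a[key], []):
            yield {**a, **b}

orders = [{"order_id": 1, "cust": "A"}, {"order_id": 2, "cust": "B"}]
# A dict-of-lists stands in for an index on ORDER_ITEMS(ORDER_ID).
items_index = {
    1: [{"order_id": 1, "item": "x"}],
    2: [{"order_id": 2, "item": "y"}, {"order_id": 2, "item": "z"}],
}
rows = list(nested_loop_join(orders, items_index, "order_id"))
# → three joined rows: (1, x), (2, y), (2, z)
```

The point is in the inner call: the probe happens per driving row, with the join value already known.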
This opens up additional access paths into table B. For example, when joining ORDERS and ORDER_ITEMS by ORDER_ID (and ORDER_ID is the leading column of the PK in the ORDER_ITEMS table), then for every ORDER_ID retrieved from the driving ORDERS table we can perform a focused, narrow index range scan on ORDER_ITEMS. A hash join can’t do that.
Of course this doesn’t mean that hash joins can’t use any indexes on the tables they read – index range scans and unique lookups can still be used under a hash join, but only if there are constant values in the query text (in the form of literals or bind variables). If there are no such constant (filter) conditions under a hash join, then the remaining options for using that index are an INDEX FULL SCAN (which is a range scan from one end of the index to the other) or an INDEX FAST FULL SCAN (which is like a full table scan through the entire index segment). However, none of these options gives the same benefit as nested loops looking up rows from row source B dynamically, based on what was retrieved from A at runtime.
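For contrast, here is a hash join sketched in the same toy style (dicts standing in for rows; all names invented for illustration). Note that it reads one entire row source up front to build its hash table – it never probes B with values as they arrive from A:

```python
def hash_join(build_rows, probe_rows, key):
    """Hash join: hash one entire row source up front (build phase),
    then scan the other row source and probe the in-memory hash table."""
    hash_table = {}
    for b in build_rows:                      # build phase: read ALL of the build side once
        hash_table.setdefault(b[key], []).append(b)
    for a in probe_rows:                      # probe phase: no index on the build side is used
        for b in hash_table.get(a[key], []):
            yield {**a, **b}

orders = [{"order_id": 1, "cust": "A"}, {"order_id": 2, "cust": "B"}]
items = [{"order_id": 1, "item": "x"}, {"order_id": 2, "item": "y"}]
rows = list(hash_join(items, orders, "order_id"))
```

The build phase consumes the whole build input before the first joined row can appear, which is exactly why runtime values from the other table can’t be used to narrow that read.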
Note that this nested loops benefit isn’t limited to indexes on table B – the difference is more fundamental than just a specific access path. For example, if table B happens to be a single-table hash cluster or an indexed X$ table, then the nested loop is also able to do “optimized” lookups from these row sources, based on the values retrieved from table A.
So, my article with a lot of (loosely) related details is here:
In the comments section of my question, Tom, Bernard Polarski, Christian Antognini and Marc Musette got closest to what I had in mind when I asked the question. However, of course, your mileage may vary somewhat depending on what kind of problems you have experienced the most over the years. Also, Jonathan Lewis had a valid comment that the answer depends on what exactly one means by “fundamental” – and yeah, this was open to interpretation.
Nevertheless, I wanted to emphasize that there’s a more important difference between NL and hash joins than the usual stuff you see in training materials, which talk about implementation details like hash tables and memory allocation…
Some day I will complete that article; I plan to add some design advice in there, like denormalization opportunities for getting the best of both worlds, etc. But now I’m gonna get a beer instead.
Thanks for reading my blog and answering – I was quite impressed by the volume of comments & answers to my question. I must do this more often!
I got emails from UKOUG last week to say my talk for the conference has been accepted. My talk / presentation is titled "Identifying Your Self In The Database" and is about the problems of identifying end users in the....[Read More]
Posted by Pete On 06/10/10 At 05:50 PM
As of 11.2.0.2, a new package, DBMS_AUTO_SQLTUNE, is available for accessing and configuring Automatic SQL Tuning. The package provides three features:
Execution of the Automatic SQL Tuning task (EXECUTE_AUTO_TUNING_TASK)
Generation of a report showing the output generated by the Automatic SQL Tuning tasks (REPORT_AUTO_TUNING_TASK)
Configuration of the Automatic SQL Tuning parameters (SET_AUTO_TUNING_TASK_PARAMETER)
In this post I would like to [...]
In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.
The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:
Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).
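In Python terms, the reversal looks roughly like this (single-byte character set assumed; the sample hex value is the stored value for ‘CLOSED’ that appears in the listing below):

```python
def decode_endpoint(hex_val):
    """Approximate reversal of a character-column endpoint_value:
    take the first six hex byte-pairs and turn each back into a character."""
    return ''.join(chr(int(hex_val[i:i + 2], 16)) for i in range(0, 12, 2))

# The stored value for 'CLOSED':
decode_endpoint("434C4F534543E9175A7D6A7DC00000")   # → 'CLOSEC'
```

The SQL version below does the same thing with substr() and chr(), starting at position 2 because to_char() leaves a leading space in front of the hex string.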
With a nice friendly single-byte character code, the first 5 characters will be extracted correctly, and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):
rem
rem     How to read a frequency histogram on a character column
rem

select
        endpoint_number,
        endpoint_number - nvl(prev_endpoint,0)          frequency,
        hex_val,
        chr(to_number(substr(hex_val, 2,2),'XX')) ||
        chr(to_number(substr(hex_val, 4,2),'XX')) ||
        chr(to_number(substr(hex_val, 6,2),'XX')) ||
        chr(to_number(substr(hex_val, 8,2),'XX')) ||
        chr(to_number(substr(hex_val,10,2),'XX')) ||
        chr(to_number(substr(hex_val,12,2),'XX')),
        endpoint_actual_value
from    (
        select
                endpoint_number,
                lag(endpoint_number,1) over(
                        order by endpoint_number
                )                                       prev_endpoint,
                to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') hex_val,
                endpoint_actual_value
        from
                dba_tab_histograms
        where
                owner = 'XXX'
        and     table_name = 'YYY'
        and     column_name = 'STATUS_COLUMN'
        )
order by
        endpoint_number
;

set doc off
doc

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ---------------------
          40254      40254 434C4F534543E9175A7D6A7DC00000  CLOSEC CLOSED
          40467        213 434F4E4649524E7E0D374A58200000  CONFIR CONFIRMED
          40592        125 44454C49564550D642CA2965000000  DELIVE DELIVERED
          41304        712 494E564F49432991BF41C99E800000  INVOIC INVOICED
          41336         32 4E4556FFFFFFF1D5FBDBC624E00000  NEVÿÿÿ NEW
          41434         98 5041494400000C08C1A415AD800000  PAID   PAID
          41435          1 5041594D454E5B08040F761BE00000  PAYMEN PAYMENT OVERDUE
          41478         43 5049434B4544013F0FF93F6EC00000  PICKED PICKED
          41479          1 524546554E4436441DE2A321000000  REFUND REFUND MADE
          41480          1 524546554E4436441DE2A321000000  REFUND REFUND PENDING
          41482          2 52455455524E2F6693F753B6C00000  RETURN RETURNED

11 rows selected.

#
You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.
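To see why only the first five or six bytes survive, one plausible model – an assumption on my part, not documented Oracle behaviour – is that the endpoint_value holds the first 15 bytes of the string as a base-256 number, rounded to roughly 15 significant decimal digits: about 50 bits, or a little over six bytes, of precision:

```python
def model_endpoint_value(s):
    """Model (assumed, not official): first 15 bytes of the string,
    zero-padded, treated as a base-256 integer and rounded to
    15 significant decimal digits."""
    n = int.from_bytes(s.encode('latin-1')[:15].ljust(15, b'\x00'), 'big')
    return round(n, 15 - len(str(n)))

def decode(n):
    """Turn the number back into its leading six bytes."""
    return n.to_bytes(15, 'big')[:6].decode('latin-1')

# The first five characters survive the rounding; the sixth may be off by one,
# which is consistent with CLOSED decoding back as CLOSEC.
decode(model_endpoint_value('CLOSED'))
```

Under this model the trailing “noise” bytes in the hex values are just the low-order digits left behind by the rounding.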
There are a number of side effects to the 6 bytes / 32 character limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.
Footnote: It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may simply be due to the passage of time, of course; on the other hand, the threat from synopses is largely limited to disc space, whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.
The free Oracle Security webinar I did with SkillBuilders last week has been put on line by them. The Oracle Security webinar is available here but you need to register with SkillBuilders first to hear it. I have not put....[Read More]
Posted by Pete On 05/10/10 At 06:05 PM