I learn or relearn something new every day about Oracle. Just about every day really!
Last week I was in Belgrade Serbia delivering a seminar and an attendee reminded me of something I knew once but had totally forgotten about. It had to do with foreign keys and the dreaded NULL value.
Many of you might think the following to be not possible, we'll start with the tables:
ops$tkyte%ORA11GR2> create table p
2 ( x int,
3 y int,
4 z int,
5 constraint p_pk primary key(x,y)
ops$tkyte%ORA11GR2> create table c
2 ( x int,
3 y int,
4 z int,
5 constraint c_fk_p foreign key (x,y) references p(x,y)
Here’s a model of a problem I solved quite recently at a client site. The client’s query was much more complex and the volume of data much larger, but this tiny, two table, example is sufficient to demonstrate the key principle. (Originally I thought I’d have to use three tables to model the problem, which is why you may find my choice of table names a little odd). I ran this example on 188.8.131.52 – which was the client version:
create table t2 as select rownum id2, trunc(dbms_random.value(0,1000)) n2a, trunc(dbms_random.value(0,1000)) n2b, lpad(rownum,6,'0') vc2, lpad('x',100,'x') padding from all_objects where rownum <= 5000; alter table t2 add constraint t2_pk primary key(id2, n2a); create table t3 as select rownum id3, trunc(dbms_random.value(0,1000)) n3a, trunc(dbms_random.value(0,1000)) n3b, lpad(rownum,6,'0') vc3, lpad('x',100,'x') padding from all_objects where rownum <= 5000; alter table t3 add constraint t3_pk primary key(n3a, n3b, id3); -- now collect stats on the table and execute this query (with autotrace enabled) select * from t2 where not exists ( select /*+ unnest */ null from t3 where n3a = n2a and n3b = n2b and (id3 = id2 or id3 = id2 + 1000) ) ; select * from table(dbms_xplan.display);
You’ll note that I’ve included the /*+ unnest */ hint in the subquery because I want Oracle to run this as a hash anti-join; and it would appear to be legal (and simple) to do this given the various not null contraints and primary key information. In fact the plan uses a filter subquery:
------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost | ------------------------------------------------------------ | 0 | SELECT STATEMENT | | 5000 | 585K| 5015 | |* 1 | FILTER | | | | | | 2 | TABLE ACCESS FULL| T2 | 5000 | 585K| 15 | |* 3 | INDEX RANGE SCAN | T3_PK | 1 | 12 | 2 | ------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter( NOT EXISTS (SELECT /*+ UNNEST */ 0 FROM "T3" "T3" WHERE "N3B"=:B1 AND "N3A"=:B2 AND ("ID3"=:B3 OR "ID3"=:B4+1000))) 3 - access("N3A"=:B1 AND "N3B"=:B2) filter("ID3"=:B1 OR "ID3"=:B2+1000)
I suspect that the optimizer code bypasses the anti-join because of the (carefully bracketed) disjunct (OR) predicate. On the client site this resulted in the subquery being executed 9 million times, reducing an intermediate data set from 9M rows to 2M rows at a cost of 27 million buffer visits and about 60 CPU seconds. Fortunately I was able to dredge up a little bit of propositional calculus and quote the following equivalence:
not( A or B ) <=> (not A and not B)
Equally fortunately I didn’t have to worry about three-valued logic (all relevant columns were declared not null), so I was able to rewrite the query in the form:
select * from t2 where not exists ( select null from t3 where n3a = n2a and n3b = n2b and id3 = id2 ) and not exists ( select null from t3 where n3a = n2a and n3b = n2b and id3 = id2 + 1000 ) ;
With this code Oracle did two unnests and converted to hash anti-joinsin both cases (at least, that’s what happened on the client site – my small sample switched to nested loop anti-joins):
------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | ------------------------------------------------------------- | 0 | SELECT STATEMENT | | 50 | 7200 | 15 | | 1 | NESTED LOOPS ANTI | | 50 | 7200 | 15 | | 2 | NESTED LOOPS ANTI | | 4999 | 644K| 15 | | 3 | TABLE ACCESS FULL| T2 | 5000 | 585K| 15 | |* 4 | INDEX UNIQUE SCAN| T3_PK | 1 | 12 | | |* 5 | INDEX UNIQUE SCAN | T3_PK | 5000 | 60000 | | ------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 4 - access("N3A"="N2A" AND "N3B"="N2B" AND "ID3"="ID2"+1000) 5 - access("N3A"="N2A" AND "N3B"="N2B" AND "ID3"="ID2")
The change to hash anti-joins was a huge benefit (the nested loop anti-join would have improved things for the client to a degree, but there’s not really an enormouse difference in some cases between a filter subquery and an equivalent nested loop anti/semi-join). In this case the query run time dropped from 95 seconds to 27 seconds – all of it CPU time.
Update: a check of the 10053 trace file for 10.2.0.3 shows the following:
SU: Checking validity of unnesting subquery SEL$2 (#2) SU: SU bypassed: Invalid correlated predicates. SU: Validity checks failed.
On the other hand, it is possible to get unnesting and a hash anti-join with the predicate: id3 between id2 and id2 + 1000. (But see comment 6 and its reply)
When my TV broke a few months ago I made the decision not to replace it. That means I only get to watch DVDs on the computer or stuff streamed on the web (BBC iPlayer, ITV Player or 4OD) using my iPad. I’m pretty happy with the situation as it prevents me wasting too much time in front of the TV. My only issue was being tied to the computer for DVDs. Yesterday I entered the 21st century and started streaming DVDs to my iPad.
A little Googling revealed HandBrake is about as simple as it gets, where DVD video transcoders on Linux are concerned. With that installed I saved a copy of a DVD (Alien) into the “movies” folder on my NAS, which is pre-configured for streaming videos. That’s nice and simple.
The next step was to get an iPad app to allow me to watch the streamed content. There are a bunch of them around, but I ended up with PlugPlayer. It automatically detected my NAS and worked immediately.
So now I can watch DVDs streamed to my iPad. All I need is a 60TB hard disk so I can store all my DVDs.
I was reading a story where Seagate were talking about 60TB disk drives. That’s all well and good, but how quick can I get data to and from them? If I need a certain number of spindles to get the performance I require, then I’m just going to end up with masses of wasted capacity.
I can picture the scene now. I have a database of “x” terabytes in size and I need “y” number of spindles to get the performance I require, so I end up having to buy disks amounting to “z” petabytes of space to meet my performance needs. Not only is it hard to justify, but you know the “spare” capacity will get used to store stuff that’s got nothing to do with my database.
Just think of those 60TB bad-boys in a RAID5 configuration. Shudder.
Feel free to insert a, “SSD/Flash will solve the worlds storage problems”, comment of your choice here.
I’ve downloaded the Fedora 17 build, ready for the next weeks final release of Fedora 17…
Let’s first discuss how RAC traffic works before continuing. Environment for the discussion is: 2 node cluster with 8K database block size, UDP protocol is used for cache fusion. (BTW, UDP and RDS protocols are supported in UNIX platform; whereas Windows uses TCP protocol).
UDP protocol, fragmentation, and assembly
UDP Protocol is an higher level protocol stack, and it is implemented over IP Protocol ( UDP/IP). Cache Fusion uses UDP protocol to send packets over the wire (Exadata uses RDS protocol though).
MTU defines the Maximum Transfer Unit of an IP packet. Let us consider an example of MTU set to 1500 in a network interface. One 8K block transfer can not be performed with just one IP packet as the IP packet size (1500 bytes) is less than 8K. So, one transfer of UDP packet of 8K size is fragmented to 6 IP packets and sent over the wire. In the receiving side, those 6 packets are reassembled to create one UDP buffer of size 8K. After the assembly, that UDP buffer is delivered to an UDP port of a UNIX process. Usually, a foreground process will listen on that port to receive the UDP buffer.
Consider what happens If MTU is set to 9000 in the network interface: Then 8K buffer can be transmitted over the wire with just one IP packet. There is no need for fragmentation or reassembly with MTU=9000 as long as the block size is less than 8K. MTU=9000 is also known as jumbo frame configuration. ( But, if the database block size is greater than jumbo frame then fragmentation and reassembly is still required. For example, for 32KB size, with MTU=9000, there will three 9K IP packets and one 5K IP packet to be transmitted).
Fragmentation and reassembly is performed at OS Kernel layer level and hence it is the responsibility of Kernel and the stack below to complete the fragmentation and assembly. Oracle code simply calls the send and receive system calls, passes the buffers to populate.
Few LMS system calls in Solaris platform:
0.6178 0.0001 sendmsg(30, 0xFFFFFFFF7FFF7060, 32768) = 8328 0.6183 0.0004 sendmsg(30, 0xFFFFFFFF7FFFABE0, 32768) = 8328 0.6187 0.0001 sendmsg(36, 0xFFFFFFFF7FFFBA10, 32768) = 144 ... 0.7241 0.0001 recvmsg(27, 0xFFFFFFFF7FFF9A10, 32768) = 192 0.7243 0.0001 recvmsg(27, 0xFFFFFFFF7FFF9A10, 32768) = 192
UDP vs TCP
If you talk to a network admin about use of UDP for cache fusion, usually, there will be few eyebrows raised about the use of UDP. From RAC point of view, UDP is the right choice over TCP for cache fusion traffic. With TCP/IP, for every packet transfer has overhead, connection need to be setup, packet sent, and the process must wait for TCP Acknowledgement before considering the packet send as complete. In a busy RAC systems, we are talking about 2-3 milli-seconds for packet transfer and with TCP/IP, we probably may not be able to achieve that level of performance. With UDP, packet transfer is considered complete, as soon as packet is sent and error handling is done by Oracle code itself. As you know, reliable network is a key to RAC stability, if much of packets (closer to 100%) are sent without any packet drops, UDP is a good choice over TCP/IP for performance reasons.
If there are reassembly failures, then it is a function of unreliable network or kernel or something else, but nothing to do with the choice of UDP protocol itself. Of course, RDS is better than UDP as the error handling is offloaded to the fabric, but usually require, infiniband fabric for a proper RDS setup. For that matter, VPN connections use UDP protocol too.
In a busy system, there will be thousands of IP packets traveling in the interface, in a given second. So, obviously, there will be many IP packets from different UDP buffers received by the interface. Also, because these ethernet frames can be delivered in any order, how does Kernel know how to assemble them properly? More critically, how does the kernel know that 6 IP packets from one UDP buffer belongs together and the order of those IP packets?
Each of these IP packet has an IP identification and fragment offset. Review the wireshark files uploaded in this blog entry, you will see that all 6 IP packets will have the same IP identification. That ID and the fragment offset is used by the kernel to assemble the IP packets together to create UDP buffer.
Identification: 0x533e (21310) .. Fragment offset: 0
What happens if an IP packet is lost, assuming MTU=1500 bytes?
From the wireshark files with mtu1500, you will see that each of the packet have a Fragment offset. That fragment offset and IP identification is used to reassemble the IP packets to create 8K UDP buffer. Consider that there are 6 puzzle pieces, each puzzle piece with markings, and Kernel uses those markings( offset and IP ID) to reassemble the packets. Let’s consider the case, one of 6 packet never arrived, then the kernel threads will keep those 5 IP packets in memory for 64 seconds( Linux kernel parameter ipfrag_time controls that time) before declaring reassembly failure. Without receiving the missing IP packet, kernel can not reassemble the UDP buffer, and so, reassembly failure is declared.
Oracle foreground process will wait for 30 seconds (it used to be 300 seconds or so in older version of RAC) and if the packet is not arrived within that timeout period, FG process will declare a ‘GC lost packet’ and re-request the block. Of course, kernel memory allocated for IP fragmentation and assembly is constrained by Kernel parameter ipfrag_high_thres and ipfrag_low_thres and lower values for these kernel parameters can lead to reassembly failures too (and that’s why it is important to follow all best practices from RAC installation guides).
BTW, there are few other reasons for ‘gc lost packets’ too. High CPU usage also can lead to ‘gc lost packets’ failures too, as the process may not have enough cpu time to drain the buffers, network buffers allocated for that process becomes full, and so, kernel will drop incoming packets.
It is probably better to explain these concepts visually. So, I created a video. When you watch this video, notice that there is HD button on the top of the video. Play this in HD mode so that you will have better learning experience.
You can get the presentation file from the video here: MTU
BTW, when you review the video, you will see that I had little bit trouble identifying the packet in the wireshark output initially. I understood the reason for not seeing the packets filled with DEADBEEF characters. Why do you think I didn’t see the packets initially?
Also, looks like, video quality is not that great when embedded. If you want actual mp4 files, let me know, may be I can upload to a drop box and let you download, email me.
The basic formula for calculating the costs of a Nested Loop Join is pretty straightforward and has been described and published several times.
In principle it is the cost of acquiring the driving row source plus the cost of acquiring the inner row source of the Nested Loop as many times as the driving row source dictates via the cardinality of the driving row source.
Cost (outer rowsource) + Cost (inner rowsource) * Card (outer rowsource)
Obviously there are cases where Oracle has introduced refinements to the above formula where this is no longer true. Here is one of these cases that is probably not uncommon.
Let's start with a simple two table join that shows above formula in action. It represents a parent-child relationship where the parent table has 10,000 rows with a unique identifier, and a child table with 100 rows each that map to a single parent row, having 1,000,000 rows in total.
which gives the following execution plan in 11.2:
There are a couple of noteworthy comments:
1. T2.FK and T.ID have the same number of distinct keys each (10,000), so assuming corresponding primary and foreign key constraints in place this means that there is at least one matching row for every parent row.
2. There is a filter on T that restricts the driving row source to 500 rows. It is interesting to note that this filter results in a "biased join" where Oracle picks the NUM_DISTINCT from the other table for the join selectivity calculation rather than using the greatest NUM_DISTINCT of both, in this case arriving at a correct cardinality estimate of approx. 50,000
3. The T2_IDX index has a worst-case clustering factor due to the scattering of T2.FK via the MOD function, hence the resulting cost of a single iteration of the Nested Loop join is 100 for picking up 100 rows from the table (plus 2 for the index access)
The overall cost for the loop is 45 for acquiring the driving row source plus 500 times 102, the remaining part being CPU costing overhead. This matches above formula.
Now let's re-create table T2 using a different distribution of the T2.FK column:
The only difference is now that the FK column no longer has 10,000 distinct keys but only 500.
Of course on average now 2,000 rows in T2 match a parent row in T, but obviously no longer all of them have a match in T2.
Let's review the execution plan for the same SQL statement as above:
Since now 2,000 rows match a single parent key on average, the cost per iteration has increased accordingly. The index range scan needs to visit some root / branch blocks plus four leaf blocks on average, resulting in a cost of 5. The table access needs to visit 2,000 rows on average, hence with the same worst-case clustering factor the cost per iteration is 2,000 plus CPU overhead.
Given the fact that we need to do this 500 times the final cost of the Nested Loop join ought to be something close to one million, but surprisingly it is only a little bit higher than the cost of the previous scenario.
So clearly the original formula doesn't apply here, although the cost for a single iteration in the execution plan seems to match the expectations.
It looks like Oracle a long way back introduced a refinement to the original formula in the case of less distinct keys of the inner row source join column than the driving row source join column.
The idea behind it seems to be that this is what Oracle calls a "sparse" join. Obviously not every row from the driving row source will find a match in the inner row source, hence some of the loop iterations should end with the index lookup, not finding a match and therefore no need to visit the inner table afterwards. The refinement hence calculates a lower average cost of the table visit per loop iteration.
This is of course true when looking at the join by itself. But if you are unlucky and a corresponding filter on the driving row source turns the "sparse" join into a (rather) "dense" join, then this refinement can lead to a significant cost underestimate of a Nested Loop join, potentially driving the optimizer towards favouring that join method over others.
And it is exactly that scenario what above query simulates: The filter on T results in a full match of all 500 driving rows, and the actual cost of the Nested Loop join ought to be closer to one million than 50,000 in this case.
The refinement to the cost calculation seems to be based on the ratio between the NUM_DISTINCT of two join coin columns: In my example the ratio is 10,000:500, so the overall cost is only approx. 1/20 of the original cost.
There are some further details how the formula deals with a small number of loop iterations. For example, the first iteration will get the full original cost, a number of further iterations (again seems to correspond to the ratio, here for example 20) get only the index cost added, and from then on the costing of the table access levels at the original cost downscaled by the ratio (2000 / 20 which is 100 in this case).
The refinement obviously has been added in release 10.1.0.3 and can be found in the Fix Control list as bug number 3120429. The text for this fix is: "account for join key sparsity in computing NL index access cost" and apparently only applies if the inner row source uses an index access.
This also means that the original costing can be activated by disabling the fix:
So if you happen to have such cases where the filter on a driving row source leads to an underestimate of the Nested Loop join cost as outlined here, using the fix control allows arriving at more reasonable costing figures.
I recently had such a case at a client where only a small part of a fully populated time dimension was used in the foreign key of a fact table, but the filter on the time dimension lead to exactly the scenario described here - the join wasn't really sparse and the Nested Loop join cost was significantly underestimated.
May 20, 2012 An interesting quirk was recently brought to my attention by Mich Talebzadeh. He generated a 10046 trace at level 8 for a session, executed some SQL statements, disabled the trace, and then processed the resulting trace file with TKPROF. His TKPROF output included the following: UPDATE TESTWRITES SET PADDING1 = RPAD('y',4000,'y') WHERE [...]
The last of Jack’s relatives are gone. Are his girlfriend (Gia), her daughter (Vicky) and Jack’s unborn child the next in line? Is there anything Jack can do to protect them?
This book focuses more on Jack’s relationship to “The Otherness” and “The Ally”. We see a more aggressive side of Jack, as well as the cold calm detachment when he’s doing his job. Dark, dark, dark, but also kinda exciting.
I've built up quite a list of upcoming events that I want to blog about, so I'll deal with them in a single post (... then the OUGN review and then *finally*, *maybe* a technical post or three ...)
Chris Date Seminar
I think I've seen this mentioned in a few blog posts already, but Peter Robson is organising a 2-day seminar presented by relational database luminary, C. J. Date. I was lucky enough to attend a single day listening to Chris in Edinburgh a few years ago now and, whilst I might not have agreed with everything he said that day (so sue me), the clarity and force of his arguments and the intellectual rigour of the discussion was a contrast to the rather more lightweight content of most conference presentations. Definitely recommended and you can find more details here Throw in a copy of his latest book and, if I wasn't taking so many days off work at the moment anyway, I would definitely be there.
Speaking of too many days off work ... It's been too long since I've been to the OUG Scotland Conference - at least over the past 3 years that I've been working in London. Otherwise I would be there this year and it's probably about time I was going back next year. It's a cracking event. Free registration; free beer; an absolutely stellar agenda that the Scottish user group types like Thomas Presslie always pull together; and, of course, it's in the absolutely best part of the world Given that it costs nothing to attend, I think that anyone who can't find a way of getting a day out of the office to hear some fantastic speakers needs their head examined. (Oh, a little bit of local information. Oracle's place in Linlithgow - where I used to work for a while when it was Sun's place - might look a little out of the way but the train service to Edinburgh is regular, fast and excellent.
UKOUG Call For Papers
There's only a few weeks left to submit an abstract for one of the finest Oracle conferences - the UKOUG conference in (erm) sunny Birmingham. It was the first Oracle conference I presented at so, like Tom Kyte, I would encourage as many people to give it a try as possible (as I have with many people over the years). I've discussed this with a few people recently and they seem a little frightened to submit an presentation abstract for such a well-known conference with big-name speakers, but they seem to be forgetting something. UKOUG is a User Group. The whole point is for other users to hear your experiences and some of the best presentations I've attended have been from inexperienced speakers sharing their Real World experiences of using Oracle. Sure, I want to hear deeply technical presentations based on experimentation, but a user group conference is about so much more than that and I dread the day when I see an agenda filled with the same old speakers I've heard many times before. Consider this, too ... How did those speakers become experienced and well-known? We all have to start somewhere and UKOUG is as good as anywhere. Of course, having an abstract accepted can be tough, but unless you give it a try, how will you know?
Having been so positive about the conference over the years and encouraged people to present, I can't help being honest in saying that if UKOUG had stuck to their original idea of giving speakers a one-day pass to the conference, rather than the standard full conference pass (and when I say standard, I mean like every other conference does!) then I wouldn't have recommended it. Fortunately they've performed a U-turn on what I always considered a ridiculous decision Personally, I found it a bit cheap and strangely hurtful so I'm still not sure whether I'll submit an abstract this year, but it shouldn't stop others and might lead to a wider and more varied agenda. Regardless, the agenda is always excellent.
Red Gate Seminars
There have been several times over the past few years when I've considered producing an online video of my OEM Performance Page presentations but it hasn't happened for various reasons. I recall Alex Gorbachev suggesting I repeat my (failed) Hotsos version for Pythian to post online but I never got round to it. So I was absolutely delighted when Red Gate Software asked me if I had any ideas for online webinars to supplement their existing series! I have two sessions planned in the near future. One based on the original presentation that covers Performance Page fundamentals and a second focussing more closely on the new OEM 12 features, including ASH Analytics.
You can see a list of upcoming seminars here. Even better, it includes archived copies of previous presentations by some of the best Oracle presenters around. Definitely worth a look. I'll post more details on by own webinars nearer the time.
Again, although there are some pre-selected speakers, there is also an open Call For Papers so if you feel you have something interesting to say about Exadata, the opportunity is there. I'm definitely planning on going and will hopefully be giving a new Parallel Query presentation.