Since I’m going to write a couple of articles dissecting parallel execution plans, I thought I’d put up a reference post describing the set of tables I used to generate the plan, and the query (with serial execution plan) that I’ll be looking at. The setup is a simple star schema arrangement – which I’ve generated by creating three identical tables and then doing a Cartesian join across the three of them.
create table t1
as
select
        rownum                  id,
        to_char(rownum)         small_vc,
        rpad('x',100)           padding
from
        all_objects
where
        rownum <= 70
;

alter table t1 add constraint t1_pk primary key(id);

create table t2
as
select
        rownum                  id,
        to_char(rownum)         small_vc,
        rpad('x',100)           padding
from
        all_objects
where
        rownum <= 70
;

alter table t2 add constraint t2_pk primary key(id);

create table t3
as
select
        rownum                  id,
        to_char(rownum)         small_vc,
        rpad('x',100)           padding
from
        all_objects
where
        rownum <= 70
;

alter table t3 add constraint t3_pk primary key(id);

create table t4
nologging
as
select
        t1.id                   id1,
        t2.id                   id2,
        t3.id                   id3,
        rpad(rownum,10)         small_vc,
        rpad('x',100)           padding
from
        t1, t2, t3
;

begin
        dbms_stats.gather_table_stats(user, 't1', method_opt => 'for all columns size 1');
        dbms_stats.gather_table_stats(user, 't2', method_opt => 'for all columns size 1');
        dbms_stats.gather_table_stats(user, 't3', method_opt => 'for all columns size 1');
        dbms_stats.gather_table_stats(user, 't4', method_opt => 'for all columns size 1');
end;
/
Since there are 70 rows in each of the “dimension” tables, there are 343,000 rows in the “fact” table.
Here’s the query, with serial execution plan – you may find that you have to apply a few hints to reproduce this plan, so I’ve included the minimum necessary set.
select
        /*+
                gather_plan_statistics
                leading(t4 t1 t2 t3)
                full(t4)
                use_hash(t1) full(t1) swap_join_inputs(t1)
                use_hash(t2) full(t2) swap_join_inputs(t2)
                use_hash(t3) full(t3) swap_join_inputs(t3)
        */
        count(t1.small_vc),
        count(t2.small_vc),
        count(t3.small_vc),
        count(t4.small_vc)
from
        t4, t1, t2, t3
where
        t1.id = t4.id1
and     t2.id = t4.id2
and     t3.id = t4.id3
and     t1.small_vc in (1,2,3)
and     t2.small_vc in (1,2,3)
and     t3.small_vc in (1,2,3)
;

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |      1 |        |      1 |00:00:00.19 |   10731 | 10729 |       |       |          |
|   1 |  SORT AGGREGATE        |      |      1 |      1 |      1 |00:00:00.19 |   10731 | 10729 |       |       |          |
|*  2 |   HASH JOIN            |      |      1 |     26 |     27 |00:00:00.01 |   10731 | 10729 |  2061K|  2061K|  832K (0)|
|*  3 |    TABLE ACCESS FULL   | T3   |      1 |      3 |      3 |00:00:00.01 |       3 |     3 |       |       |          |
|*  4 |    HASH JOIN           |      |      1 |    612 |    630 |00:00:00.01 |   10728 | 10726 |  2061K|  2061K|  959K (0)|
|*  5 |     TABLE ACCESS FULL  | T2   |      1 |      3 |      3 |00:00:00.01 |       3 |     3 |       |       |          |
|*  6 |     HASH JOIN          |      |      1 |  14491 |  14700 |00:00:00.03 |   10725 | 10723 |  2061K|  2061K|  980K (0)|
|*  7 |      TABLE ACCESS FULL | T1   |      1 |      3 |      3 |00:00:00.01 |       3 |     3 |       |       |          |
|   8 |      TABLE ACCESS FULL | T4   |      1 |   343K |   343K |00:00:00.32 |   10722 | 10720 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T3"."ID"="T4"."ID3")
   3 - filter((TO_NUMBER("T3"."SMALL_VC")=1 OR TO_NUMBER("T3"."SMALL_VC")=2 OR
              TO_NUMBER("T3"."SMALL_VC")=3))
   4 - access("T2"."ID"="T4"."ID2")
   5 - filter((TO_NUMBER("T2"."SMALL_VC")=1 OR TO_NUMBER("T2"."SMALL_VC")=2 OR
              TO_NUMBER("T2"."SMALL_VC")=3))
   6 - access("T1"."ID"="T4"."ID1")
   7 - filter((TO_NUMBER("T1"."SMALL_VC")=1 OR TO_NUMBER("T1"."SMALL_VC")=2 OR
              TO_NUMBER("T1"."SMALL_VC")=3))
As you can see I’ve pulled this plan from memory after enabling rowsource execution statistics – and the optimizer’s estimates are very good. (Given the way I generated the data and the predicates involved, each hash join retains 3/70ths of the incoming intermediate row set.)
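As a quick sanity check on those estimates, here's the arithmetic as a small Python sketch of my own (nothing Oracle-specific): starting from the 343,000 rows scanned from t4, each of the three hash joins keeps 3/70ths of what it receives.

```python
# Cardinality arithmetic for the three hash joins: 3 of the 70 values
# in each dimension table survive the small_vc filter, so each join
# keeps 3/70ths of its input. The figures match the A-Rows column.
rows = 343_000                      # rows scanned from t4 (70 * 70 * 70)
for join in ("join to t1", "join to t2", "join to t3"):
    rows = rows * 3 // 70           # 3/70ths survive each probe
    print(f"{join}: {rows} rows")   # 14700, then 630, then 27
```

The optimizer's E-Rows figures (14491, 612, 26) come from the same 3/70 selectivity applied to its own slightly lower starting estimates, which is why they sit just below the actual counts.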
The shape of the plan tells me that Oracle hashed the required rows from t3 into memory, then did the same with t2 and t1 before scanning t4 and probing t1, t2, and t3 in that order to see if a row should be allowed to survive and be counted.
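To make that shape concrete, here's a toy model in Python (my own sketch of the mechanism, not Oracle internals): build an in-memory hash table from the filtered rows of each dimension table, then scan the fact table once, probing t1, t2 and t3 in turn.

```python
# Toy model of the serial plan: hash the filtered rows of t1, t2, t3
# into memory (the build phase), then scan t4 once, probing the three
# hash tables in join order t1, t2, t3 (the probe phase).

def star_hash_join(fact, dims, keep=("1", "2", "3")):
    # Build phase: one hash table (here a set of ids) per dimension,
    # holding only the rows whose small_vc passes the filter predicate.
    builds = [{id_ for id_, small_vc in d if small_vc in keep} for d in dims]
    # Probe phase: a t4 row survives only if every probe finds a match.
    return sum(1 for ids in fact
               if all(ids[i] in builds[i] for i in range(3)))

# 70-row dimension tables and the 343,000-row Cartesian fact table,
# in the same shape as the test tables above.
dim = [(i, str(i)) for i in range(1, 71)]
fact = [(a, b, c) for a in range(1, 71)
                  for b in range(1, 71)
                  for c in range(1, 71)]

print(star_hash_join(fact, [dim, dim, dim]))   # 27 rows survive, as in the plan
```

The swap_join_inputs() hints are what make the three small tables the build inputs while t4 remains the (single-pass) probe input.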
There are various ways I could add weight to this claim, of course – for example, flushing the buffer cache and enabling extended trace so that I could see the order of physical disk access, then enabling event 10104 so that I could see the hash join traces and the order in which they took place. I hope you’ll take my word for it, though, that this is what happens in this serial path.
As I’ve pointed out a few times in the past, when you read the plan carefully you realise that it is following the specified join order (t4, t1, t2, t3) – even though at a casual glance the plan might suggest a join order of (t3, t2, t1, t4). This is an example of “what happens first” and “the join order” being very different from each other.
I’ll get started on the details of possible parallel activity on this path some time tomorrow.
Following on from my recent batch of “what I’m doing at the moment” style posts, I just thought I would mention some of the infrastructure I’ve been installing and configuring recently…
We are still part way through a migration from Oracle Application Server to WebLogic 11g. There are many applications to migrate and test (fortunately not by me), and they fit into two main categories.
Some of our high profile applications of each type are already running in production on WebLogic and the general feedback has been very positive. I guess most of this comes down to the hardware refresh.
There are still a few more apps to migrate, but everything is pretty close to the end of testing now, so hopefully it won’t be long before we can say a not-so-fond farewell to Oracle Application Server!
All of these WebLogic installations are running on top of Oracle Linux 6 inside VMware virtual machines. So far we’ve seen nothing untoward about this setup and I would have no reservations about recommending this approach to others.
If you have any questions/concerns about Oracle Linux, you might want to read my Oracle Linux : Frequently Asked Questions article. If you have any concerns about Oracle’s stance as far as VMware support goes, you might want to read this.
If you missed the interview with Jonathan Lewis or don’t want to spend 93 minutes watching it, here is its transcript. Thanks to Stanislav Osekin from Innova, who did most of the hard work. I have only corrected special Oracle terms. Also thanks to Jonathan, who helped me in several complex cases where I couldn’t understand the wording.
T: Hello and welcome! My name is Timur Akhmadeev and I’m sitting here at the Innova office with Jonathan Lewis. Hello, Jonathan!
J: Morning, Timur! Nice to be here!
T: Nice to see you too! Thank you for coming!
J: My pleasure!
T: We are going to record an interview and I’m going to ask you some questions related to Oracle, to your work and your history. I would like to start with the questions about user groups. There are a lot of user groups in the world. UKOUG is one of the biggest, one of the most famous. How do you describe the main purpose of the user groups?
J: Well, as you say, the UK user group is very large, very well-known and I have been involved in the user group for, I think, nearly 20 years in various capacities. I was a director for a little while on the board and so forth. I’ve contributed a great deal to the user group. I think from my perspective the most important thing about the user group idea is that individuals, who use the software, who have to deal with the software, who do their job everyday with the software, get an opportunity to talk to other people. And often it’s the conversation when you actually see someone eye to eye, when you start talking about what you are doing and where your problems are. You can start a conversation where you can all learn perhaps how to do something better or find a way of actually solving the problem. Because you’ve got time to present an idea, have someone argue about it, make a counter suggestion, make a suggestion to improve things and so on. And so despite the fact that you’ve got all things like the OTN Database Forums, you’ve got other forums, you’ve got My Oracle Support (I’ve nearly said Metalink, you know), you’ve got all these things, you got all these bloggers who can supply you with the information you can read, you’ve got these places where you can post a question and get five, ten or twenty different people giving you bits of answers. But of course if you post a question it’s quite hard to have a dialogue with someone. And they can’t really ask you “Well, tell me a little bit more about this, this and this”, because it’s quite hard to type enough information to give them the right idea. So there is still plenty of scope for actually meeting someone because in the space of a five minute conversation you can exchange ten times as much information as you could by going online and typing something perhaps into a blog or typing a question in the Technet. 
So it’s the personal physical face-to-face communication which I think is a tremendous benefit for the user group. There are others, of course, but that’s the very first one, I think, that really, really matters.
T: Do you enjoy working and participating in the UKOUG and what is your view on your work in this user group?
J: I’ve always enjoyed the time I spend with the user group. Mostly what I’ve done with the user group is just show up for the main conferences. The UK user group particularly has perhaps as many as a hundred little events through the course of the year – a huge number of events! They are all broken down by special interests, so there’s a group which is basically the database server group and they have three meetings per year and perhaps something between thirty and a hundred people who show up for each meeting. Then there will be a group, which is HEUG, Oracle’s Higher Education user group, and they might have two or three meetings, but perhaps only ten people show up, and they’ll just be sitting in one of the Oracle offices in London or Birmingham. We have a very large number of specialized subgroups which have a few meetings per year. And I tend to focus particularly on the group which is, perhaps, the database server. So, I did a lot of presentations. That was, I think, my biggest single contribution to the UK user group. I just loved going and standing up and saying “Look! This is what Oracle can do” and explaining how to solve problems and things like this. And then of course it’s just quite fun having people to talk to afterwards and the general chatter that goes on at these events. And obviously I was also involved with the board of directors for some time. It’s now changed the structure a bit. I was on a council for some time. And, of course, there is also a lot of hard work there: paperwork, you know, paperwork and all the organizational work is not really the thing that I do best. But we try very hard to balance the requirements of all the different communities that make up the UKOUG. So there is always a lot going on in terms of who needs what, how do you find out what people would really like to have, to see and so forth.
Luckily, of course, because we’ve been going for so long and because we’ve built up such a large community, we actually have a dedicated office. The user group in the UK is a company. And we have an office which employs 20-25 people who are full-time employees of the company. And they spend all their time preparing events for us, organizing events, finding the places to go and doing all the work which makes the conference possible. I mean, we have our own specialists and then it’s a company and this is absolutely brilliant: they do a huge amount of work to make the whole thing work.
T: Do you think that the whole user group thing influences Oracle or any of its development strategies?
J: I think they can certainly have an influence. One of the things that the UK user group did each year was carry out a survey through all its members, and the membership numbers were, I think, 2500-3000 in terms of large companies and maybe 7000 or 8000 in terms of individuals. So, there is a large body of people that we can actually survey. And in the survey (it’s quite a long survey, quite a lot of detailed questions and so forth) we can summarize this information and we do present it back to Oracle in the UK. And there have clearly been occasions when Oracle has been able to take a bunch of the information we’ve given them and they’ve been interested and followed some of the directions that we suggested would be helpful to our users. So, I think, because of our size, because we bring so many people together, we can act as a voice to Oracle and Oracle will listen. I suspect they will never listen as much as we would like them to listen, and they might think they are paying far too much attention to us sometimes, but there is certainly a channel between the user group and Oracle and constructive results do come out of that channel. If the group wasn’t there, I don’t think individuals or even large companies would be able to get quite so much communication going between themselves and Oracle. So, I think, that’s another important benefit of the user group. Once it’s got enough bulk, once there are enough people, you can turn to Oracle and say “Look. We’ve asked a thousand people and 80% of them said they really wanted to see something like this.” That’s a lot of people, and that’s enough people to make Oracle think “Ok, let’s investigate that” or “Let’s try and address something in that area”.
T: That’s really great. And I think it could be quite good if the user group could share their experience of what they did, I mean, maybe the survey: they get some feedback from users and they provide it to Oracle. And maybe in five years’ time or seven years’ time something is changed. I think it would be great to publish the whole results of the achievement of the user group to make other user groups continue this work.
J: Or even give them an incentive. Or let them see that something does happen. I think we probably do let people know, but I’m not sure. I mean, that’s a very good thought and maybe we don’t make it visible enough that we have this important voice from the end users. And, obviously, once we got this survey thing, we do publish summary information on the user group website about it. So, this was the broad picture we got from our survey. This is the message we passed to Oracle. But I don’t think… I’ll have to check this. I really have to check this. I don’t think we gonna have something which says, you know “a year and a half later, we sent this to Oracle, and if you will notice, Oracle has done this and this”. I think we go back and say to the users “we do make a difference”, but I don’t think we actually make it very very clear. I don’t think we document and enlist the things where we can say “well, because you told us you wanted this, look, Oracle has done it”. Perhaps, we have to make it much more obvious.
T: It sounds perfect.
J: Yeah. It helps not just us to encourage our membership, but it would be a beacon, a sign to other user groups as well.
T: Yes, and it could attract more people to the UK user group, as it’s such a strong group. That’s a very nice overview of the user groups, thank you. Moving on to another type of group which has been formed by some people from the UK. I mean, the OakTable Network. It’s a special kind of group. Could you tell us more about it?
J: Right, the OakTable network… It’s quite difficult to say what the OakTable Network is. In a way, it’s a bit like a club. We do technically have… I think it’s a trademark or it might be a copyright mark and we have some sort of a legal claim to the name “OakTable Network”, but really it’s just a, sort of, group of like-minded people and it started 10-12 years ago, I think. Mogens Nørgaard, who comes from Denmark, he is probably the focal point, if you like, of the OakTable, because he arranged a little conference in Denmark, just two or three days, I think. In fact, that might have been a conference, where I was the only speaker, that it was perhaps my first time I did my Oracle seminar, my internal seminar in Denmark. And the thing was, he said I’m gonna do this, and he started sending me mails like, well “Cary Millsap is coming over from America” and “Anjo Kolk is coming over from the Netherlands” and “Steve Adams is coming over all the way from Australia” and, of course, I was getting more and more panicked, because every famous name, every competent expert in the Oracle world seemed to be travelling to Denmark and I was supposed to stand up in front of them. We all are just enthusiasts, we all really enjoy what we are doing, and we are interested in each others’ problems and questions and so forth. It got us to a sort of stage where we were sitting around a kitchen table in Mogens’ room, and it was a huge oak table, something like more than 2 meters long, really-really big thing, you could sit 10-12 people around it very easily. So, we were sitting around it and typing in our laptops and saying “This is what happened”, and someone was trying to say: “I am trying to solve this problem” and another guy said “Let me just work on it, because I may know the answer” and it was this conversation going. And we stopped about 4 o’clock in the morning, you know. 
I was supposed to get up at 7 o’clock in the morning and start speaking at 9 o’clock in the morning, but at one point it was along the lines of “OK, pizza has just arrived” or “Dinner has just arrived”, so we shut the lids of the laptops, put the plates on top of the laptops, started eating, finished the dinner, opened the laptops … completely mad, and it was when we were 10 or 12 years younger. But it was just a very enjoyable meeting of lots of similar-minded people and basically one of us said: “We should do this again, we should form ourselves into a club, we should make sure we do this again. We are going to be the OakTable Network”. And that was it! It started off with about 10 people, I think. But gradually over time, individuals in the group said, you know, “I’ve been working with this person”, or “two or three of us have a history of working with this individual, shall we just invite them to be a member?”. It’s terribly undemocratic, it’s terribly unfair, there are no doubt lots of people who could say “Well, I’m good enough, I deserve to be a member”. Well, it’s really just that people meet people and they chat and they exchange e-mail and it’s just a club for friends who happen to be very very keen on Oracle and occasionally we would just say: “Hey, let’s just invite him as well.” There is nothing terribly formal about it. People tend to recognize that if you are actually a member of the OakTable, you must be pretty good at Oracle at some stage. And, of course, you are a member of the OakTable, aren’t you?
T: Yes, thanks to you!
T: What do you think will be the future of this group of people, of this club?
J: I think the club will just carry on. Anyway, it’s just a private mailing list, most of the time it’s just a private mailing list. But of course there are loads of conferences all over the world, OpenWorld, for example. I can’t think how many members of the OakTable went to the OpenWorld and we had a mini OakTable Conference in OpenWorld and RMOUG, the Rocky Mountain user group, quite a lot of OakTable users go to that one, and again that’s the sort of… not quite a mini-conference, but there’s certainly lots of them get together and they set up a “big dinner” type of thing. It’s really just a friendly club, an e-mail chat room almost, just for people who are fanatic and interested in Oracle. I think that’s pretty much, that’s all it would be. I don’t think we’ll ever try and organize a takeover of a user group, organize a takeover of the Oracle corporation. It’s just a way of keeping in touch with people who are very similar-minded. The group will probably grow a little bit, but not terribly dramatic. It’s just fun.
T: Indeed. About your public activity. Some years ago, I think it was 7 years ago, you published your notes and your findings, and your script on a special website and it was something like a living document or so. You continued to change it over time. And then you went to the blog, I think it was 7 years ago.
J: About 7 years ago. Seems a long time.
T: Absolutely. Do you think it was the right idea?
J: I think so. The original website… I don’t think it was a living document. Well, occasionally I corrected mistakes that I found. And I did try and set up a Frequently Asked Questions thing, where people could post, could e-mail me a question which they needed answered and I put it on the list and allowed anyone to write in a standardized format a response to that question. So, that was quite dynamic. The trouble with that thing was that I really didn’t get many people contributing articles; there were a few people who sent in several, but I didn’t get much contribution, much feedback on that. And then I thought eventually this blog thing: dialogue is important, feedback is important, people contributing more. If you get five people coming together, the combination of what they say can often be much more important than what any one individual can say; they can fill in the little gaps that other people leave. So, I thought the blog would be a good idea. I write something and perhaps this would prompt a few other people to say “Here is this extra little bit of information that fits” or “You haven’t asked or answered this particular detail, have you got some further comments that would address this particular part of it”. It never actually turned into that quite sort of cumulative thing. In a way, the blog is more the living document thing than the website was, but it never really turned into that. And in fact I have to say that one of the problems I have is when people do write comments and ask questions. This has particularly come up with the addendum, the errata pages for my books. When people do ask questions, it’s very hard to find time to work out good answers to those questions, so even if in principle I like the idea of publishing something in a blog and then talking about it afterwards – it’s actually very-very difficult to do the talking about it afterwards. I’ve never quite made it work. The principle is good, but it’s very hard keeping up with it. 
I think it is a better idea than the old website was.
T: Absolutely. And it appears that people just discuss some things in the comments, which is great.
J: Yes, from time to time when I disappear somewhere and don’t have time to look in my own blog and when I come back a week later and there are seven or eight articles going back and forth between other people saying “Yes, Jonathan, how did you do so and so?” And I haven’t seen that question, but someone else has answered. To me, that works perfectly. And, of course, I do still have a sort of overall view which says if something is actually right or something that I know is actually wrong, I can make sure it doesn’t stay there. And that’s of course one of the problems that by comparison the things like OTN and the database forums… You often see people coming there with a question and some of the responses to that question are actually very-very bad answers. And of course if someone is asking the question, they cannot see the difference between a good answer and a bad answer. And quite often you see things like someone gives a gold star to someone and says “Great, that was really helpful, thank you very much” and I think “No, you definitely don’t want to do that!” There’s no proper mechanism for eliminating the bad stuff and reinforcing the good stuff. If you’re going to have a completely open arena for dialogue, you need a mechanism which eliminates the stuff that’s going to be unhelpful and highlights the stuff that could be helpful. And of course the problem is there is so much stuff, the sheer volume of what you need to know is so great that it really is impossible for someone to be the moderator for it, even on the OTN database forum. No one knows enough to be able to go through and clean up and tidy all these things. And the software, just as an example, the software that Oracle has currently got doesn’t allow other people to read these threads and say “No, that is a bad answer”. 
You actually have to contribute to this thread, you have to know something, which is: here is a question, here are three bad answers, here is one good answer, here is a comment which says: the first bad answer is a bad answer, then there’s a good answer, another bad answer, then there’s another comment which says “The second bad answer is bad” and anyone who is reading that stuff probably stops after two or three entries, because they think they’ve got an answer and they never get down to the bit which says “Well the first thing read was not the right answer”, so this information is poor information, it’s bad information, just keeps reinforcing itself, it just keeps existing, when really you’d like the bad stuff to die.
T: But it’s impossible.
J: Yes, it’s impossible. The Internet saves everything. And it’s too big a job for any individual to be responsible for going and cleaning up the mess. At least on my blog, if someone puts something that is complete rubbish, I can go back and I might actually temporarily put in a reply which says “No, that’s not the right answer, because XYZ” and then, perhaps, six months later, when people have been able to examine the wrong answer and the explanation of why it is a wrong answer, I’ll just delete it, so that the wrong suggestion doesn’t persist. I delete the wrong answer and my response to the wrong answer. Luckily, that doesn’t happen very often, in fact. But at least the environment makes it possible for that to work.
T: I think it works great.
J: Thank you.
T: Comparing blogging and commenting on forums with something which is now considered old stuff, like old-fashioned magazines: Oracle publishes its magazine, Oracle Magazine [at the time of the interview, May 2013]. The UK Oracle User Group publishes a magazine called Oracle Scene. And there are a couple of other magazines, like the Northern California Oracle User Group magazine, which is quite good and available to all people around the world. Do you think it’s still relevant to publish a magazine?
J: I think that many people still like having a physical object in their hands to read, so there are quite a lot of people who will be happy to have the magazine. I think possibly though the more important feature of the magazine is that it reminds the users, the user group, that there are experts in the user group who can be helpful. That reminds people that it’s possible to have a dialogue, that it’s possible to come together in a group. If you look at Oracle’s magazine, for example, or if you look at any of those magazines, you’ve mentioned the Northern California User Group magazine, you’ve mentioned the Oracle Scene, if you look at any of them – it’s a small number of pages, it’s thirty-two pages or sixty pages. And it comes out, well, Oracle Scene, I think, 3 times a year at present, the Oracle magazine itself, I think, 4 times a year. It’s a tiny amount of information compared to the enormous amount you need to know. And it’s a little bit here, a little bit there. And the Oracle Scene has a section… We try to spread and cover the field, but we have a section which is database technology, we have a section which is particularly E-Business Suite, we have a section which is Developers, and if you think of all these things that Oracle has got as a product, you end up with one page for this group of people, one page for that group of people. You get a small number of articles, and it’s a very thin covering. So, no one gets a large amount of helpful information from the magazine, but what they do get is the sense which says “There are people out there who have this information and the user group can pull us together”. And it reminds people that it’s worth going to the user group’s meetings, it reminds people it’s worth going to a conference, where they can meet other people, and they can spend more time and learn much more by having conversations. 
It’s a good reminder for people, and in a user group perspective, it’s a good reminder that there is a user group, there are people who can help.
T: That’s great. Moving on to some technical questions and some interesting stuff: you are considered one of the major experts, one of the most well-known experts on the optimizer, and your book “Cost-Based Oracle Fundamentals” is a fascinating book which covers some specifics of the optimizer. What do you think about the future development of this piece of software, I mean, the Optimizer? What is the future of the Optimizer? How will it be developed?
J: Well, really the question is, what do I think Oracle ought to do rather than what I think Oracle is going to do.
T: Probably, yes.
J: That’s a very tricky question. There are days when I love the Oracle Optimizer and I think the optimization that Oracle does, the algorithms, the ideas built into it, the transformation capability… I think it’s absolutely fantastic. It’s brilliant! And then there are days I hate the Optimizer because of the silly little mistakes it makes, and the things that I can’t change when I know they could be changed. And what I’d really like Oracle to do is stop getting too clever. I think they are trying to do very sophisticated, very subtle things and there are a lot of little details left behind which could be corrected, could be enhanced and could be improved, but because there is a massive push, it seems, for the new features, new transformations, new sophisticated logic in the Optimizer, a lot of little details get left behind, they don’t get fixed. And one of the problems that follows on from the little details not getting fixed is when you get to doing some sophisticated transformation, you find that if Oracle does transformation A, some little features of the basic arithmetic say, well, the number of rows is going to be a hundred, but if you do transformation B, some odd quirks in some other part of the code say: “Because you do it this way, the number of rows is going to be five thousand”. And of course you get this massive capability of Oracle switching catastrophically between two different plans because of not the sophisticated bits, but because of the very simple bits, way way back in the processing, where two things that should say the same thing say different things. Now, in the talk I gave yesterday, I showed you that example which said “Hey, this is a query. All I have got is a single table and a 2-column index” and when I run the query “select where column 1 equals and column 2 equals”, here, in three consecutive versions of Oracle, are three completely different calculations as to how many rows you get and how much work you have to do. 
How much simpler can you get than “select from a table by index = constant”? And yet it wasn’t till 11.1 that Oracle came up with something which said: “Oh, I do actually have enough information and I have a very good estimate of both cardinality and the cost of that query”. Three phases: 10.1, 10.2.0.5, and 11.1, I think it was. Three completely different calculations. Now there are other oddities. Whenever you look very-very closely (because I’m solving the problems of people, I’m trying to find out why things go wrong and I often look very closely at some of the bits of the arithmetic that the Optimizer does), it’s astonishing that whenever I look really closely at how some of these details work, I end up thinking “Well, that’s strange. This part here is not consistent with that part there” or “We’ve lost a little bit of information”. And some of these things are incredibly silly. You find little bugs where you say “Well, Oracle has obviously created an array which lists all the indexes which are relevant, but for some reason it’s ignoring the first index”. And you think about it for a bit, you rename the indexes or create a new one or drop the first one and after a while you think “Well, it looks as if what’s happened is that some program has created this array and it started counting from zero when we created it, but it started counting from one when we were using it”. And you always lose the first index! Really strange things like this can happen. Or worse still: you can get something which says: Well, we’ve got 8 indexes on this particular table here and I’m going to do a bitmap conversion of these indexes, so I work out the cost of using the first one – that one seems good, and I work out the cost of using the second one as well – and that seems good, and so does the third and the fourth one… And you look at the arithmetic and it’s not used the cost of the eight consecutive indexes, it’s used the cost of the first index eight times! 
I forget which version it was and whether it was fixed, I’d have to go and check, but you see these things, you see that someone’s forgotten to increment a counter! Incredibly simple things like that which lead to odd inconsistencies. And you can look at this and think: well, how can something which says “should I unnest this subquery and do complex view merging at this point” possibly make the right decisions, if it can’t even work out what the cost is and how many rows it’s going to get when it’s just accessing the data to start with? So I’d like to see someone going back and looking in a lot more detail, to make sure, first, that these things are fixed, and also that if we’ve got transformation A or transformation B on a query and we don’t get the same costing and the same estimate of cardinality at the end of those transformations, someone asks: where did the inconsistency arise? That might highlight some strategy for getting better results. I would like to see that. Having said that, Mohamed Zait, whom I saw quite recently in Switzerland with Christian Antognini, did a presentation, and one of the wonderful things he showed was how Oracle had, in a very recent version, gone back and re-engineered star transformations. And you got a real sense from that of how incredibly difficult it is to change the Optimizer, because there are so many interwoven links between all the different little components. I think he said it took something like two and a half years to re-engineer, but by the time it was finished, the star transformation was something which fitted in, shall we say, orthogonally, independently of the other transformations. So you could do A, B, C and star transform, or you could star transform and do A, B, C. 
You could do some of this business of table elimination, or one of these partition tricks, in the middle of a star transformation, or you could do a star transformation and then eliminate a table; you could fit it in whenever it became appropriate. Whereas previously you had strange things like this: if you put in a subquery factoring clause (a WITH subquery), and what you put in it was something that could be star-transformed, Oracle couldn’t do the star transformation, because it was inside the WITH subquery. You see simple examples like this; people say “Why didn’t Oracle do XYZ?”, and the answer is “Well, because you’re doing A, XYZ is not legal”. In theory it ought to be legal, but it just happens that it’s not.
T: Or not possible yet.
J: Exactly. And this was the thing with star transformations. If it’s in a WITH subquery, you cannot star transform. There is no reason why not, except that you can’t. And then, after two and a half years of engineering, you can. So it’s immensely complex to go back and make these little changes. And, of course, when you make changes, a cardinality estimate changes or a cost estimate changes, right? Someone’s gonna sit down and say: “Why did my query plan change? It used to be good and now it’s bad!”. Anything you do, anything that changes the Optimizer, will make somebody unhappy. And the people who are unhappy will make a loud noise; the people who think it’s great and massively improved are not going to say a thing. They are not gonna stand up and say “Oracle! Fantastic! Wonderful! Thank you very much!”. The only people who say things are the people who complain. So Oracle Corporation cannot win on optimization. Whatever they do, they’re in the wrong. But it would be nice, I would like to see some of these funny little details cleaned up, the simple stuff. Most people end up with simple stuff, I think, and I think there could be some improvement there.
T: So, your thought on this is that the Optimizer will not be redesigned.
T: Well, at one point they did redesign it: they introduced the Cost Based Optimizer. So you don’t think at some point in the future they will need some…
J: Start it all over again? Well, I can’t imagine a complete redesign from the ground up of the whole Optimizer. I believe there was, somewhere around version 7, a big push in Oracle to do a massive redesign of some parts of Oracle, which, of course, took a lot of time and a lot of money. I can’t really imagine any company saying: well, we’ve got this product, it works fantastically well a huge fraction of the time, but we’re going to start all over again, and we’re gonna spend a huge amount of time and a huge amount of money not making any difference to all our customers. That just doesn’t happen. I think it’s extraordinary that they spent so much time working on the star transformation bit. We might see various pieces being re-engineered, but I can’t quite imagine anyone from Oracle Corporation saying: we will now fund a complete new Cost-Based Optimizer. It can do so many things that it would be very hard even to write a specification of what it’s got to do and roughly how it’s got to do it. Incremental changes, a little bit of evolution, a little bit of runaway growth, occasional bits of rationalization, but not a complete re-engineering. I would be very surprised.
T: Evolution, not revolution.
J: Absolutely right, yes.
T: That’s reasonable. One of Oracle’s options is Real Application Clusters. It’s definitely not a cheap option, but lately it has become quite common, especially with the introduction of the Engineered Systems, and it has become very popular. So, what is your view on this option? Do you think that in the future all Oracle installations will be RAC, for example, or will we still see some single-instance databases?
J: That’s a very difficult one. I think it comes down to what Oracle would like you to buy, or rather what they would like to sell to you, which is not necessarily the same thing. In a way, I think, RAC has outlived its usefulness. There was a time when something like RAC was necessary for scalability. Oracle also used to sell it to you in terms of high availability, but at some of the sites I saw, it was only highly available if you had a massive team of really competent experts who watched it carefully day and night to make sure it didn’t fall over: it was too fragile, there were too many dependent little pieces which could possibly go wrong. And I think, perhaps, in part, this is why Oracle allowed RAC to go down into Standard Edition, into a very cheap version of RAC. Because if you’ve got fifty customers using RAC, it takes a long time to find all the bits which are broken. If you can push out a cheap version of RAC and you’ve got ten thousand customers using RAC, you get lots of profit, all your field testing goes on at your cheap customers, and you can figure out what you’ve got to do better. So I think there was certainly a push to move RAC out as much as possible, because it helped to develop the product, making it more stable and easier to handle and so forth. But I think – well, this is speculation – with the Exadata construct: Exadata is RAC in a box, ODA is RAC in a box. Well, probably Grid in a box rather than RAC in a box, or “cloud” in a box, I suppose, is the right term these days. Exadata is RAC in a box; it’s using the technology which has now been field-tested, but it’s hiding the complexity by saying to the buyers: “Just buy the black box”. You don’t really need to know that it’s RAC underneath. And if you get Platinum Support, you don’t even have to worry about fixes and upgrades, because Oracle will just log on to your machine and patch the whole thing, from top to bottom, just like that. 
And that’s the sort of marketing strategy: make Oracle a black box. The RAC-type stuff is just a part of that black box. And I think I can see the day when Oracle says: if you’re buying Oracle, what you’re buying is a black box that runs Oracle. Oracle is everything: it’s the whole stack, from the database software through the network software, down to the cluster software, down to the disks; the whole thing is just Oracle, and you do not need to know that it’s RAC, that it’s a high-speed network, that it’s this layer of disks. So RAC was a step towards the complete black box solution, I think. And given enough time – but it won’t be a great deal of time – I think you’ll see Oracle saying: “Well, we’d prefer it if people did not buy a couple of big IBM boxes and link them together into a RAC system”. The only RAC systems should be RAC systems inside an Exadata or an Oracle Appliance, I think. And if you want anything else, you can buy single-instance Oracle and run it on a big machine, and that’s it. Perhaps RAC wasn’t seen this way in advance, but I think you can look backwards and say: “Well, RAC was an important step towards Exadata, towards ODA, towards the whole cloud thing”.
T: And considering one of the chapters of your latest book. It has RAC in its name… Well… “RAC and Ruin”.
J: Well, “RAC and ruin”, at least to the English ear, I think, has a sort of poetic flow; it’s for the sound, it’s just a phrase that comes nicely to mind. I did explain carefully in the book that this was for alliterative effect: RAC and ruin. I was not claiming that if you’ve got RAC, you are bound to ruin everything, OK? That wasn’t the intent. It just flowed nicely as a title. Although I did actually follow up on the fact that it is relatively easy to ruin things if you don’t think a little bit about how RAC is different from a single instance. And of course, if you have got multiple instances, multiple memory images of data and lock states and all the rest of it, you’ve got to get your machines to communicate, and essentially the idea with RAC is that you minimize the communication, because you don’t want to waste time talking about what you want to do, you just want to do it. So there was a little bit of ruin in the book, but the title is really there for the sound of it, not because I meant it literally.
T: Yes, I understand that. And considering your clients and your consulting work, do you suggest RAC to your clients often?
J: Um… very rarely. And this is not because I have a bias against RAC. As I said, in a sense RAC has almost had its day. I don’t think RAC is a strategic thing. I think if I were to say to someone “It looks like you need a RAC solution”, I’d probably end up saying “It looks like you need an Exadata solution”. But essentially RAC came out as a mechanism for saying: “We can’t get a machine which is big enough to do all the work we need to do, so let us tie several machines together”. Well, nowadays you have to have a horrendously complicated, enormous workload before you need more than one machine to do it. With something terribly simple, you know, single machine, single instance, you can do so much; the machines nowadays are just so powerful. I mean, the network cards you can plug in pump out stuff so fast, and the SANs, if configured correctly – and if you’re really, really busy, you have your own dedicated SAN – well, the disks you can connect to can cope with so much information that you don’t need to double up machines in order to get the performance and throughput. And when it comes to high availability: 15 years ago, I think, you could get a good high availability solution coming from your operating system, and it has only got better; you had things which would say, if machine #1 fails, we can just switch a logical machine over to machine #2 and it’ll be up and running within minutes. It gets faster and faster, of course, so simply something which says “fail over at the operating system level and restart” – that’s the sort of resilience which is almost invariably totally adequate for most customers. And of course we now have things like this on virtual machine systems anyway, with VMware… where… I forget what they call it…
J: vMotion, that’s it, yes. With that type of technology you can even say: well, the machine my instance is running on at the moment isn’t actually big enough, so let’s buy a new and bigger machine, and, oh yes, next Saturday, without stopping anything, we’ll just slide our virtual machine from the little box into the big box. It is ridiculous what you can do with hardware these days, with the layers of software and the management stuff that can sit underneath it. So RAC… I don’t really see any great point in suggesting to someone that they need to buy RAC. I’ve never come across a client, not for years, where I’ve said the only sensible solution is RAC. OK, RAC does require some complex management, complex configuration. If you’re doing it as a “home-grown” type of thing, it is still a bit fragile around the edges.
J: There are people who are very good with RAC, mind you. I can name you half a dozen OakTable members who are fantastic with RAC. But for the ordinary business people it’s fragile around the edges.
T: Yeah, great. And continuing to other developments: the upcoming Oracle Database 12c, announced at the previous Oracle Open World, almost a year ago. Since then a lot of people have started to talk about the new features. You’ve heard about some of them…
T: Which is your favourite?
J: New features in Oracle 12c. Well, my sources of information that I can mention are basically the presentations I went to at Open World, as you say, last year. And, of course, when you see those presentations, the first thing Oracle puts up, the first slide, is: “There is no guarantee we’re going to do this. This is what we would quite like to do and what we might actually do”. There is absolutely no guarantee, and they call it a “safe harbor” statement. So you can see all this stuff, but you don’t absolutely know if it is going to appear. The thing that struck me most significantly was that there are a number of enhancements to partitioning which allow you to use it in a more natural way. They let you do things which you would have thought you could perhaps have done 10 years ago: details to do with referential integrity, and how things behave when you want to drop old partitions – things cascade normally, and you don’t have problems with maintaining referential integrity if you declare partitioning at different levels. Interestingly, there’s even one little detail with partitioned tables: you can rebuild a table online, provided what you’re actually rebuilding is a partition of a partitioned table. So you can actually say things like “Every month I will make a partition redundant”, so 12 months after the day it appeared I will make this partition redundant; you can say (I might get the syntax wrong) “alter table, move this partition into a different tablespace, with compression as high as I would like, online”. So the application can still keep coming in quite happily, and that data will move and the indexes will be rebuilt online. And that’s a nice-looking touch, an important little touch. So that’s one of the nice things; I think there are features in partitioning which will be very helpful.
J: So you’ve got pluggable databases. SQL Server had this 20 years ago or something! I exaggerate, but, you know, it’s OK; I can appreciate that there is a benefit, and most particularly there is a benefit for Oracle Corporation, in having pluggable databases. Because right now, if you run up Oracle 11 with RAC and a few bits and pieces, features added on like a standby database, and you look at the number of processes running on your Unix box, there are something like 45 or 50. An instance needs 50 processes just to run. Now say you’ve got 12 instances, 12 databases, and you want to consolidate onto one machine. Well, you can’t actually just put 12 databases and 12 instances on one machine, because you’d have around 600 processes running on that machine before you even get your end users connected! And it has to be a very, very big machine to handle 600 processes, even when they’re idle. So Oracle would of course like to put it all into one database and only run 50 processes, so that the background process overhead, the simple operating system housekeeping, is much less significant. But actually taking an application and physically saying “let’s take this schema and put that schema into another database” is very hard work. Whereas if you can say “let’s create a container database which has the 50 processes, and then move a pluggable database in”, it’s easy. In principle nothing changes. The plugged database doesn’t know it’s inside a container; it just happens that the container is the only thing that needs to run processes. So you can see that, from Oracle’s perspective, it’s easier for them to say you can do consolidation with Oracle, because you just put all your databases into one, and you’re actually only running one database which is holding all these pluggable databases. 
And while it’s sold as a consolidation convenience for the customer, the people who really benefit are Oracle, because it lets them sell Oracle more easily.
T: Don’t you think that with container databases the change to the architecture is considerable?
J: I suspect that they don’t have to change the architecture very much. I don’t think it’s a whole re-engineering job; I think there are probably a few key points where they have to make very careful changes. One of the questions I asked one of the architects (I think it was a designer) when I was at Open World, chatting on one of the stands, was this: if you have 6 copies of E-Business Suite running on 6 separate databases and you now put them into 1 database, pick a very common SQL statement. There are now 6 times as many copies of that SQL statement running as there used to be; or rather, there are 6 times as many executions of that statement. Isn’t that going to introduce higher pressure on just a few mutexes (I would have said latches just a couple of years ago), because more people are running that one statement concurrently? And the basic answer to that question was: well, at present our hashing algorithms are based on the SQL text; if you go to pluggable databases, the hashing algorithms are based on the SQL text and the container ID. So two statements which are the same, but from different plugged databases, actually hash to different locations, and you don’t end up with hotspots becoming N times as hot. So you could imagine that there are a few critical places where Oracle says “we have to bring in this one extra value, the container ID, for each of the plugged databases”, but otherwise everything else is pretty much the same as it used to be. Obviously it’s one of those cases where someone says: “OK, we’ve got 10 million lines of code and we only need to change 5 concepts. So it’s easy: there are only 5 things we need to change. 
Now can we find all the occurrences of those 5 things?” And you can almost guarantee that one day someone will come along and say: if you do this and this and this, and you are using this feature, and you’ve got such-and-such going on at the same time… you’ve missed one. So, a huge amount of work to make the change, but conceptually it could be relatively easy. Conceptually straightforward; physically, hard work.
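The hashing idea described in that answer is easy to sketch. This is a hypothetical illustration in Python, not Oracle’s library cache code: the hash function, the bucket count and the SQL text are all invented, but it shows the principle that adding the container ID to the hash key spreads identical statements from different plugged databases across different locations instead of piling them onto one hot spot.

```python
import hashlib

def bucket(sql_text, con_id=None, n_buckets=131072):
    """Map a statement to a hash bucket. Hashing on the text alone models
    the pre-pluggable behaviour described; adding con_id to the key models
    the pluggable-database behaviour."""
    key = sql_text if con_id is None else f"{con_id}:{sql_text}"
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return digest % n_buckets

sql = "SELECT * FROM orders WHERE id = :1"

# The same statement from 6 plugged databases, hashed on text alone:
# everyone lands in one bucket, which becomes 6 times as hot.
print(len({bucket(sql) for _ in range(6)}))  # → 1

# Hashed on text plus container ID: the six copies almost certainly
# land in different buckets, so no single bucket gets hotter.
print(len({bucket(sql, con_id) for con_id in range(6)}))
```

With 131,072 buckets and only a handful of containers, collisions in the second case are possible but vanishingly unlikely, which is the whole point of folding the extra value into the key.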
T: Have you heard about the threaded database?
J: Did I hear about?…
T: Threaded database. They did something with the way they manage processes.
J: Alright, yes, yeah.
T: And there was an article on some site about the way Oracle will use processes in 12c.
T: And it seems like they will do some curious stuff which is similar to shared server, but dedicated.
J: I don’t think I’ve read that. One of the other thoughts that crossed my mind is that the single hottest point in any Oracle database ends up being the log writer. The rate at which the database can change is dependent on the log writer. Well, if you’ve got 6 busy databases with 6 busy log writers and you put them all into one container, and the container only has one log writer process, it will not keep up with 6 lots of log being generated. Therefore Oracle must have done something, some sort of threading, with the log writer: somehow, even if we only have one log writer, it has to be multiprocessing, it has to work a lot faster. So perhaps one of the things we may see, and perhaps the most critical and complicated bit that Oracle has to do, is working out how to handle the log writer. And yes, I think I’ve seen something about it; I think the expression was “threaded redo”. I can’t remember where I read it, but there was threaded redo, and something very different about what goes on with the log writer. So yes, there may be interesting things happening there. And certainly if I wanted to run up some tests – “how easily can I break it” type testing is my speciality: if you want me to break something, I’ll find a way of breaking it, before it goes to production; always break it as soon as you can, and know what’s going to break in production by trying to break it first in test – I think I’d try to put pressure on the log writer by running as many plugged databases as possible, all trying to generate lots of log, just to find out exactly what Oracle does about handling excessive amounts of log if it still has only one log writer process.
T: A question.
T: How much is “much”? You said an “excessive amount” of redo.
J: What’s an excessive amount of redo? Yeah, oh yeah. I often point out that if someone says “if this number is large…”, large doesn’t mean anything. It doesn’t tell you anything. And the problem, of course, is that sometimes it’s just not possible to put an absolute number in place. You know, if you’ve got 2 CPUs and 4 GB of memory, something you think is very big is completely ignorable for someone else if they’ve got 168 CPUs and 320 GB of memory. So “large” always turns into some sort of comparison between the amount of work you think you have to do and the amount of hardware you’ve got to do it with; you have to watch out for that one. But if you think about the log writer, the log writer basically says something like: every time you commit, or every megabyte – flush – I will actually write to disk, and I write continuously to disk at the end of a file, occasionally switching log files to keep going; I’m just streaming out continuously to disk. Well, ask something like: how fast can you actually do a dd? How fast can you literally, physically copy a file from one location to a different disk? What’s the fastest you can write to disk? And you might say something like: well, it’s a continuous stream, 300 megabytes per second? Maybe? Assuming you’ve got some SSD sitting there in the middle to help out a little bit, to accumulate lots of little writes into big writes and then pump the big writes down onto the physical disk. A figure like that.
J: “Big” depends on what you’ve actually got to use, but you can see there are simple mechanical limits at some stage, and you can say: if this number goes beyond X, that’s what my disk can do. That’s it. That’s the fastest I can change the database.
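That back-of-envelope argument can be written down in a few lines. This is a hedged sketch, not a sizing formula: the 300 MB/s streaming figure comes from the conversation above, and the per-database redo rate is an invented illustrative number.

```python
# Back-of-envelope version of the argument above: the rate at which the
# database can change is capped by how fast the log writer can stream
# redo to disk. All figures are illustrative, not measurements.
DISK_STREAMING_MB_S = 300   # assumed sequential write limit (from the talk)
REDO_PER_DB_MB_S = 80       # assumed redo rate of one busy plugged database

def databases_supported(disk_mb_s, redo_mb_s):
    """How many busy plugged databases a single log writer could keep up
    with before the disk's streaming limit becomes the bottleneck."""
    return disk_mb_s // redo_mb_s

print(databases_supported(DISK_STREAMING_MB_S, REDO_PER_DB_MB_S))  # → 3
```

So under these made-up numbers, a container with one conventional log writer would saturate its redo stream with only three busy databases, which is exactly why some form of threaded redo would be needed for consolidation.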
T: I’ve seen some AWR reports on Twitter which showed some incredible numbers for redo generation, like 600 megabytes per second or 1 gigabyte per second. And this is on simple 2-socket boxes – 2-socket servers which are also part of an Exadata, but with some flash-based storage.
J: Yeah, right, that’s where you’ve got to have some sort of… I mean, you don’t have disks, you have flash, you have some sort of solid state mechanism behind it. The trouble with that is that you can only go on for a certain amount of time. At some stage, and particularly if you’re in archive log mode, you’ve got to de-stage from the solid state to a physically cheaper medium, like spinning disks, or maybe even tapes. You get your high speed from solid state, but that’s quite expensive, and if you keep on going you’ll fill it up. Eventually you’ll run out of solid state, and this is an interesting thing about benchmarks: you engineer your benchmarks very carefully to go as fast as possible for as long as necessary. One of the old jokes about some of Oracle’s benchmarks, the TPC-C things, is that the benchmark says you’ve got to do this for this amount of time. You look at some of the ways in which Oracle has managed to go really, really quickly and you think: yeah, that works. But if you had to carry on for another 2.5 hours, it would stop. You look at some of these and you think: in real life you would not implement it this way, because it’s not actually scalable, it’s not a growing solution, it’s a fixed-size solution. Some of the things Oracle does with single table hash clusters are just a little bit naughty. You know, the fastest way to get to a single row of data is if it’s perfectly positioned in a single table hash cluster. But single table hash clusters have to be pre-declared, they are of fixed size, and if you get it wrong and you start overflowing, things slow down dramatically. So…
T: You have to be clever
J: You have to be very, very well-informed. And it’s not a continuously growing solution. It’s very often a benchmark special. Everyone does it.
T: But not in real life.
J: Not in real life, not in real life, no. I hope not.
T: Apart from the database, Oracle is a big business, and the latest trends in this business are the new words BigData and “the cloud”. What do you think about this new stuff that’s coming up, like NoSQL databases and some of the appliances Oracle has started selling, like Exalytics, or something like it?
J: I haven’t looked into this sort of area closely enough; I’ve been to a few presentations at conferences and things like that. My feeling about the whole big data thing… I think it was Gwen Shapira who summed it up quite nicely, when she put across the concept that you can often get very interesting information if you do something very simple to a very large amount of data, and you can get better information, better understanding, than if you do something very complex to a very small amount of data. I don’t know how well that captures the big data concept, but it seemed a very sensible observation. And my feeling about the BigData business is that it doesn’t matter if the data is wrong, or rather, not quite right. If you’ve got a vast amount of data and it’s not quite up-to-date and it’s not quite perfect, the sheer volume of what you’ve got that is correct means that when you do your BigData analysis, the conclusions you draw from it are appropriate. So big data doesn’t quite…
T: Statistically significant
J: … yes, exactly. What you’re after with big data is statistical significance. It’s relevance: if your data is pretty close to correct, and the volume is large, then your statistics will overall be appropriate. Whereas the relational database is the opposite. Relational theory is a theory about ensuring that you can represent data absolutely correctly; or rather, that you create a model and implement the model correctly, so that what you put into the model represents truth. It is self-consistent, it is correct within itself. Obviously you can create a model and then you can lie; you can say, OK, I can see I’ve got 10 boxes on the shelf, but I’m gonna say I’ve got 8 boxes when I put it in the database. In that sense relational theory doesn’t stop you from lying and getting the wrong answers, but relational theory says: if you actually take the truth from the real world and put it into a correct model in the database, then what you’ve got in the database is correct, is true, and the database will ensure that it stays correct. And that’s why billing systems use relational databases. They do not say: “Oh, statistically, you probably owe us about, you know, 6000 roubles”. Accurately, you owe 6042 roubles and, whatever, 10 kopecks. Is it kopecks? I’m sorry, I’m trying to speak local and I should have checked the currency before I did. The relational database is correct. So it may end up that, in general, most people handle small data in it rather than big data; but big data doesn’t have to be perfect, and relational systems exist because sometimes the data has to be perfect. So relational won’t go away; whatever anyone says, relational won’t go away, ’cos relational is a theoretical model. People may find mechanical ways of implementing it differently, but big data and NoSQL are not aimed at being implementations of relational theory.
T: As a replacement it’s not appropriate?
J: It’s not, that is correct. There may be things that people do at present with Oracle databases which have a vast amount of data in them, and maybe they will look at them and say: you know, I’m paying for all this overhead to guarantee correctness of the data, but I don’t actually need correct data. One of the clients I had some time back was one of the social networks. I won’t say which one, but it’s the social network’s database which is about who said what to whom, the little chat boxes and all the rest of it. Well, technically, you don’t need to make sure that you capture every little comment that every person has made to every other person. If the database crashed and you lost just a little bit of data, and the database came up again, and someone said: “I’m sure I told this person about my pet dog, but I can’t see it”, well, it doesn’t really matter terribly much. And you may find that people running things like social networks on Oracle might say: well, let’s move to a completely different technology, where I can have lots more people chatting and it’s cheaper to run the data set. And if I do lose a tiny bit, they are not gonna complain. They may be surprised, they may be puzzled, they may be a little bit confused that something’s gone missing. But, you know, as long as it comes pretty close, they won’t complain. Maybe something disappears for 24 hours because it was over on a different node and there was a funny failure on that node, and it cleans itself up a bit later – fine.
T: And there is redundancy also.
J: Exactly! The big data thing, the HDFS, this type of stuff allows for that; you know, the data protects itself. It’s the sort of internet database: you can destroy bits of it, but the rest of it will work around the hole in the middle and fill in the gaps, that type of thing. And that’s very much, in my mind anyway, the sort of feel it has, the big data thing. It gets there, it gets there in the end. And for some systems it’s perfectly OK to say: it’s not perfect all the time, but it’s good enough.
T: And another word which comes up very often with BigData is the Cloud. And everyone thinks differently, I think, about the Cloud.
J: I’m not convinced that there is any direct connection between BigData and the Cloud. To me, when people say “the Cloud”, my mind goes all the way back to about 1968–1969, when the PDP-11 was top of the range, absolutely the most expensive, fantastically powerful computer. And people used to set up computing bureaus where you got on to a telephone line, and you’d whistle down the telephone line and the computer understood the whistling. It was bureau computing. The Cloud, to me, is just a very modern equivalent of that. We don’t really need to know where the resources are. Things like vMotion let us move virtual computers from physical machine to physical machine without even stopping. Well, the Cloud simply says: I’ve got a load of computers here that are very closely tied together; I can create the virtual computer I need out of many of those real computers, or even blades, or bits in a single box. And I don’t really care whether it’s one smallish box sitting in my server room, or whether I actually slide the whole virtual machine over the Internet into an enormous great box which is sitting somewhere at Oracle. To me that’s what the Cloud is; it’s a way of saying: “I need computer resources, I don’t actually have to own them”. And it just happens that the technology has moved on so much, things like the VMware-type stuff. Look at RAC: RAC then went into Grid, where Grid said you’ll have a big RAC cluster – we’ll have 14-node RAC clusters, but 8 of those nodes will be my HR application, and 12 of those nodes will be my general ledger, and 22 of those nodes, if I’ve got enough left, will be my web application server, that type of thing. Oh, and on Saturdays this thing gets a bit busy, so we take some of the HR nodes off the HR cluster and a few off the GL cluster and move them to the web cluster, and on Sunday night we’ll shift them back again. So this was just Oracle moving nodes across clusters. 
Across the virtual database and into the right clusters. Well, Cloud is just more of the same. We can move the resources where we need them. Or put it another way around. If you want to buy the resources off me you don’t need to know exactly where I’ve put those resources. So, it’s nothing new and exciting, it’s just bigger. Bigger and necessarily a lot faster. But it’s just selling computer time. And we now have much better technology for selling much more computer time. And you can start small. You can start… and this is Oracle in ODA. You can start with one little ODA sitting in your office and there’s a nice little sort of management interface and you can set it up so end users can say, you know, I’m allowed to ask for 6 CPUs if I need them. You know, run up the interface which is, you know, build me a virtual system, build me a virtual web application server with database behind it. With just a few clicks of a button. That little box can migrate its virtual systems across the network and you can put them in a big box or you can put them out onto the Internet and onto a third party supplier. Well, third party supplier means Oracle Corporation ideally. That’s, you know.
T: Thank you very much for this wonderful overview of this new technology. And…
J: I should put up on these slides, in red lights: “I might be completely wrong. I might change my mind later. These are just opinions here. I’m not a technology forecasting guru. I could be quite off base on that sort of line of business”.
T: Still, you sound reasonable. And I think you are… can’t be mistaken too much.
J: I would not.
T: You are a book writer.
J: I have written books, yes.
T: So far it’s 3 books, I think.
J: Yes, 3 books I’ve written and 3 more I have contributed to.
T: Yes, and can you share with us your experience of how difficult it is to write a book?
J: Extremely difficult! My view on the books I’ve written, the thing I was trying to achieve by writing books, was to tell a story. You know, you look at the Oracle manuals and there is a huge amount of very useful information in them. And particularly over the last 3 or 4 years (I think possibly the best were the Oracle 10.2 manuals) the quality of information has been very, very good. But in many ways it’s fragmented. You have to know which bit to look at to understand what you need to do. So, what I wanted to do in all of my books was try to pull all the things together into a continuous story, so you can begin at the beginning and walk through it, gradually learning and making your understanding of how Oracle worked more complex and more advanced. So my aim is always to have a narrative. I could tell you 5000 little things about Oracle and it wouldn’t help you very much unless there were a couple of them you really needed to know. Lots of little facts, in my mind, is not helpful. I wanted to tell the whole story. From end to end. And that’s extremely difficult, to start with. It’s actually structuring the story so that you can produce the flow.
T: Absolutely. The question was about the books.
J: Oh, yeah. Yes, it was the books. You end up going through some of the really complicated tests and spending a lot of time. That’s where the problem comes with a book. You know what the basic idea is, but sometimes you’re going to have to spend a lot of time just to finish off a couple of sentences or one paragraph. And yes, I often found that I would get to the point where I’d say, OK, what I want to write is a paragraph, and 2 days later I could write that paragraph. Or sometimes not write that paragraph and write a completely different paragraph, because I’d discovered something new and different while creating more realistic tests. And the effort to get the last 5% is enormous. Occasionally it’s very frustrating.
T: Thank you. Any plans, maybe, for more books?
J: Not very strong ones. People do ask me from time to time. Well, there was this suggestion from the CBO book, the Cost-Based Optimizer book, that there would be 2 more volumes. I think I’ve said there might be, or that I have enough material for 2 more volumes, but the problem is: this is why Oracle should go back and fix the old optimizer and not bring in new bits! Then I can keep up with them and actually write a book about what happens. They change the optimizer faster than I can write about it. Plus of course there are a number of really highly competent people now on the Internet, who actually do write stuff about the optimizer, so it’s a lot easier for people to find out some of the stuff they need. I mean, my book gives you the basics, it gives you the core features that you need to know to recognize the sort of basic problems. The later books were about some of the more advanced stuff, some of the less common stuff. So I’ve decided there’s no way I can actually catch up and put in all the rest. Instead of the 5% of testing that I have to do for the new bits, it would be more like 50% of the work to sit at it and do all the tests. But I do think back to the time when I wrote Oracle Core, and I sit there and think it’s roughly the first 2 chapters of Practical Oracle 8i revisited and made into a whole book. There is scope for a book called Oracle Structures. I’ve actually designed a seminar, a 2-day seminar, on Oracle structures, which is pretty much the rest of Practical Oracle 8i but brought very much up to date. Because, well, you look at most database systems and you have simple heap tables, occasionally people use partitioning, of course, the big stuff, and B-Tree indexes, or if it’s a data warehouse you have bitmap indexes. And you can engineer a lot of performance into a system if you know what physical structures are most appropriate for what you want to do.
And that book is a possible book and I do have odd little thoughts and odd little notes about, well, I would quite like to write that book. And perhaps I could write it without having to do the 50% maybe it would still be a 95-5 type of job. So, maybe, maybe, but no promises! There may be a book Oracle Structures.
T: I hope so. And the closing question. Tell us your favourite story about your consulting, like, the best thing you did for a client.
J: I think the best story I can tell is the story that makes you realize how nobody is really, how should I put this, what you think is the best thing you’ve ever done isn’t what people want you for. Cause the best thing I ever did, or rather the thing I did at one company which I think has got me the loudest, wildest, most enthusiastic praise was to fix the coffee machine!
T: Oh, really?
J: Absolutely. And they called me because they had performance problems. That’s why I ever go anywhere: “we have performance [problems]…” or actually some people call me and say, well, could you do a day briefing on Oracle to the development team or to the DBAs or whatever. So sometimes my visits to companies are actually about training days and talking about Oracle. But mostly it’s “fix our performance problems” or at least tell us how to fix our performance problem. So I went to this company, quite a long time ago now, and they had some really unpleasant SQL, and it was really, really quite difficult to work out what it was trying to do. It would have been nice at that time to have all these little scripts that Carlos [Sierra] has got into his SQLT, because you needed to collect a huge amount of information about what they were trying to achieve with that SQL, to understand what it was doing. The DBAs didn’t know what the business was doing, we had to reverse engineer from the SQL to what the requirement was. Complicated stuff. I picked up the worst one they had, quite a long one, it was 3 or 4 pages. Some people will say that’s nothing, but it was difficult. It took me 2 or 3 hours to figure out what it was doing, find out what the logical flaws were in the way it was doing it, and write a piece of SQL that did the same thing much, much faster. It was a classic 2-hours-down-to-3-minutes sort of fix. So I got this done and I called the DBAs and said OK, here’s the query, this is what it’s trying to do, this is why this version of it went badly, here is my version of it, this is why it’s actually logically the same query, will give you the same results, watch it run: BANG, 3 minutes. And the DBA says: “Yeah, fine, good, thanks”. And off they went. And I thought, oh, I deserve a cup of coffee. I’ve done a really good job here, I’m gonna make some coffee.
I went to the kitchen, they had a wonderful espresso machine in the kitchen, and I put the cup on it, pressed the button and it went KADUNK and a red light came up. So I got a bit annoyed. So I took the machine apart and I discovered that it’s really made up of 4 pieces. I checked there was coffee in it, I checked there was water in it, I checked the reject container was empty, put it back together again, pressed the button, and the machine went chchchchchchc with all the noise of espresso coming out of the machine. The DBAs came rushing into the kitchen: “WOW YOU FIXED THAT COFFEE MACHINE!!!”. I thought, yeah, great, they are paying me to fix their database but what do they really care about? I fixed the coffee machine. And as someone once said, yes, of course the coffee machine is the most important piece of kit in the IT department. But yeah, the best thing I ever did for a company, the thing they appreciated most, was fixing the coffee machine. They were probably expecting me to fix their database, but fixing the coffee machine, that’s my proudest moment, that’s my happiest moment, shall we say. Obviously, I’ve fixed SQL and re-engineered databases, and I’ve had great fun explaining things to people. And it’s very satisfying when you explain to someone why this thing works badly, and they say, “Ah! Right! OK, oh I can do…” You give them an idea and they say, “I’ve got 6 or 7 places where I’m doing the same thing”. It’s very gratifying when you can do that. But the sheer excitement I recall from fixing a coffee machine is ridiculous. It’s a memory which will not go away.
T: So I think we can stop here. I’d like to thank you very much for this exciting talk!
J: My pleasure!
The purpose of this post is to show what the wait event ‘cell smart table scan’ means, based on reproducible investigation methods.
First of all, if you see the ‘cell smart table scan’ event: congratulations! This means you are using your Exadata the way it’s intended to be used, which means your full table scan is offloaded to the cells (storage layer), and potentially all kinds of optimisations are happening, like column filtering, predicate filtering, storage indexing, etc.
But what exactly is happening when you see the wait event ‘cell smart table scan’? Can we say anything based on these waits, like you can with other wait events?
If you have access to an Exadata database machine, and have a table that is big enough to be smartscanned by the database, issue the following statements:
(Please mind that I’ve found these statements to be harmless, but execute these statements at your own risk. As with any execution of SQL: this uses resources on the Exadata, and could influence the performance)
SQL> alter session set tracefile_identifier='frits'; --make it easy to find the tracefile
SQL> alter session set events 'trace[libcell.client_library.*]';
SQL> alter session set events 'sql_trace level 8';
SQL> --- now issue a full table scan on a table
SQL> select /*+ noparallel */ count(*) from bigtable1;

  COUNT(*)
----------
  63938560

SQL> exit
Now take the tracefile and look inside!
WAIT #139739525441360: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752114437651

*** 2013-10-14 07:02:01.516
WAIT #139739525441360: nam='SQL*Net message from client' ela= 7078360 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121516483
CLOSE #139739525441360:c=0,e=13,dep=0,type=1,tim=1381752121516628
=====================
PARSING IN CURSOR #139739525440176 len=57 dep=0 uid=0 oct=3 lid=0 tim=1381752121516762 hv=2246504030 ad='85b3e090' sqlid='gmm4k0y2ydvky'
select /*+ noparallel */ count(*) from bigtable1
END OF STMT
PARSE #139739525440176:c=0,e=83,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516760
EXEC #139739525440176:c=0,e=87,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516992
WAIT #139739525440176: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121517053
WAIT #139739525440176: nam='reliable message' ela= 2053 channel context=2700762248 channel handle=2700440152 broadcast message=2669087352 obj#=-1 tim=1381752121519402
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 504 name|mode=1263468550 2=65571 0=1 obj#=-1 tim=1381752121520099
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 330 name|mode=1263468545 2=65571 0=2 obj#=-1 tim=1381752121520594
WAIT #139739525440176: nam='Disk file operations I/O' ela= 4 FileOperation=2 fileno=6 filetype=2 obj#=-1 tim=1381752121520664

*** 2013-10-14 07:02:01.525
initializing OSS
ossnet_optip_trace: Host is not NUMA. Returning no IP preference
This is the start of my tracefile (but yours could be a bit different). First the usual SQL*Net waits, then the query which gets parsed and executed, then the wait for ‘enq: KO – fast object checkpoint’, which is an indication we are going to do direct path IO, because the KO enqueue indicates this object is checkpointed to make sure that the version of the blocks on disk is current.
Then there’s ‘initialising OSS’. This is where the communication with the Exadata storage servers is started. OSS probably means ‘Oracle Storage Server’ in this context. The next line tells us something nice: apparently, the code in the oracle database to behave as a client of the Exadata storage server is NUMA (Non Uniform Memory Access) aware. However, on Exadata versions “dash two” (X2-2, X3-2), NUMA is turned off on the compute layer (the kernel boot line in /etc/grub.conf explicitly disables NUMA with “numa=off”).
Now let’s look in the tracefile, but to get a better idea of what is happening, let’s filter the tracefile information. Issue the following command on the tracefile:
grep -e ^oss tracefile.trc | grep -v done | less
The first part looks something like this:
ossnet_optip_trace: Host is not NUMA. Returning no IP preference
oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_09_enkcel01 with guid 217e470e-d3f5298-1ffed28e-a00133d2
oss_open called for o/192.168.12.5/DATA_CD_00_enkcel03 with guid 78804872-173010d6-2c12c6d2-b7bc6780
oss_open called for o/192.168.12.5/DATA_CD_08_enkcel03 with guid fd1f4fce-15180efd-83fbde2f-8465401c
oss_open called for o/192.168.12.4/DATA_CD_00_enkcel02 with guid ead44d03-e64725e4-c25a79b5-bff05885
oss_open called for o/192.168.12.4//box/predicate
oss_open called for o/192.168.12.5/DATA_CD_05_enkcel03 with guid 20c344ef-53a0387b-c2dbd306-ad971476
oss_open called for o/192.168.12.4/DATA_CD_08_enkcel02 with guid e2294076-cce51fa-edaecc3f-a966ca8e
oss_open called for o/192.168.12.4/DATA_CD_09_enkcel02 with guid 3073459f-8f6928a0-c70f77dc-9ed9e840
oss_open called for o/192.168.12.5/DATA_CD_10_enkcel03 with guid 73b44c5a-a44544b2-82933066-8cf0410a
I’ve included the NUMA line again. Next we see ‘oss_open’ lines. For every griddisk there’s a line to indicate that the ‘disk’ is identified. Also, when a storage server is found for the “first time” for a smart scan, we see a line with “/box/predicate”, which probably sets up a communication channel to the storage server to push the predicate information, so the storage server can use that. It seems logical to me that information about the storage environment is gathered first by the foreground process that is going to use it.
The next phase looks like this:
oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 4, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 2, opcode: 99
oss_ioctl open file: 4, opcode: 99
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_wait called for request: 0xdcb52e0
oss_wait called for request: 0xdcb59f0
This phase seems to execute “oss_ioctl”, and after every ioctl there seems to be a corresponding “oss_wait” line. “ioctl” probably means Input Output Control. I am not sure what happens during the ioctl phase. The number of ioctl requests (19 in my case) is too low to have an ioctl per griddisk, and the number of extents is much higher too.
The next phase is executing “oss_cread”:
oss_cread open file: 2, offset: 0, size: 1048576
oss_cread open file: 4, offset: 0, size: 1048576
oss_cread open file: 9, offset: 0, size: 1048576
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38
My guess (again) is that “cread” means compressed read, and it seems to be the indication that the database is reading the results of an offloaded query. When a non-offloaded query is executed, the OSS read function is “oss_read”, but is referred to as “BREAD” in the tracefile (block(mode) read). The offset of oss_cread always seems to be “0”, and the size seems to be “1048576” (1MB) in all cases. The line with ‘failed’ does not seem to indicate a problem.
The IOCTL and CREAD’s seem to be mixed for a little while, after which the CREAD’s are the only ones left.
So, an offloaded query has a open, ioctl and cread phase. Now let’s add in the waits:
grep -e ^oss -e ^WAIT tracefile.trc | grep -v done | less
This is an example with different snippets to indicate what the ‘cell smart table scan’ means:
oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 384 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121550711
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 526 cellhash#=3176594409 p2=0 p3=0 obj#=15909 tim=1381752121558781
oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
WAIT #139739525440176: nam='cell smart table scan' ela= 898 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121631152
oss_cread open file: 2, offset: 0, size: 1048576
oss_cread open file: 4, offset: 0, size: 1048576
oss_cread open file: 9, offset: 0, size: 1048576
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38
WAIT #139739525440176: nam='cell smart table scan' ela= 9249 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122240642
oss_wait called for request: (nil)
oss_wait called for request: (nil)
WAIT #139739525440176: nam='cell smart table scan' ela= 197 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122250860
What we see, is that during the entire “life” of a smartscan, during the 3 phases the same wait event ‘cell smart table scan’ is used. This means that there’s little that can be derived from the wait event, except for the fact that a smartscan is used.
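Because the trace lines have such a regular shape, it's easy to do your own aggregation on them. Here is a minimal Python sketch (not an official tool; the sample lines are copied from the trace format shown above) that sums the elapsed microseconds (ela) per wait event name:

```python
import re
from collections import defaultdict

# Sample WAIT lines in the extended SQL trace format shown above.
trace_lines = [
    "WAIT #139739525440176: nam='cell smart table scan' ela= 384 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121550711",
    "WAIT #139739525440176: nam='cell smart table scan' ela= 898 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121631152",
    "WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 504 name|mode=1263468550 2=65571 0=1 obj#=-1 tim=1381752121520099",
]

wait_re = re.compile(r"nam='([^']+)' ela= (\d+)")

def total_wait_times(lines):
    """Sum the elapsed microseconds (ela) per wait event name."""
    totals = defaultdict(int)
    for line in lines:
        m = wait_re.search(line)
        if m:
            totals[m.group(1)] += int(m.group(2))
    return dict(totals)

print(total_wait_times(trace_lines))
# {'cell smart table scan': 1282, 'enq: KO - fast object checkpoint': 504}
```

In practice you would read the lines from the tracefile instead of a list; since all three phases post the same 'cell smart table scan' event, the totals confirm that the event name alone tells you nothing about which phase you were in.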
Last week, while working on a customer engagement, I learned a new method of quantifying the behavior of time-series data. The method is called “Control Chart” and credit goes to Josh Wills, our director of data science, for pointing it out. I thought I’d share it with my readers as it’s easy to understand, easy to implement, flexible and very useful in many situations.
The problem is ages old – you collect measurements over time and want to know when your measurements indicate abnormal behavior. “Abnormal” is not well defined, and that’s on purpose – we want our method to be flexible enough to match what you define as an issue.
For example, let’s say Facebook is interested in tracking the usage trend for each user, catching those whose use is decreasing.
There are a few steps to the Control Chart method:
- pick a period of “normal” behavior and use it to calculate a baseline (typically the mean of the measurements);
- calculate the standard deviation of the measurements over the same period;
- use the baseline and the standard deviation to mark out zones at one, two and three standard deviations on either side of the baseline.
Those zones can be used to define rules for normal and abnormal behaviors of the system. These rules are what make the system valuable.
Examples of rules that define abnormal behavior can be:
- a single measurement falling more than three standard deviations from the baseline;
- two out of three consecutive measurements falling more than two standard deviations from the baseline, on the same side;
- eight consecutive measurements falling on the same side of the baseline.
Note how flexible the method is – you can use any combination of rules that will highlight abnormalities you are interested in highlighting.
Also note that while the traditional use indeed involves charts, the values and rules are very easy to calculate programmatically, and visualization can be useful but is not mandatory.
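To show just how simple the programmatic version can be, here is a minimal Python sketch. The choice of the mean as a baseline, the zone widths and the two rules are illustrative assumptions only, not a complete implementation:

```python
from statistics import mean, stdev

def control_chart_flags(baseline_data, new_data):
    """Flag abnormal points in new_data against a baseline period.

    The baseline is the mean of baseline_data; zones are defined by
    multiples of the baseline standard deviation. Two example rules:
    a point beyond 3 sigma, and 8 consecutive points on one side.
    """
    centre = mean(baseline_data)
    sigma = stdev(baseline_data)
    flags = []
    for i, value in enumerate(new_data):
        # Rule 1: a single point more than 3 sigma from the baseline.
        if abs(value - centre) > 3 * sigma:
            flags.append((i, "beyond 3 sigma"))
        # Rule 2: eight consecutive points on the same side of the baseline.
        window = new_data[max(0, i - 7):i + 1]
        if len(window) == 8 and (all(v > centre for v in window)
                                 or all(v < centre for v in window)):
            flags.append((i, "8 points on one side"))
    return flags

baseline = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
measurements = [10, 11, 30, 11, 12, 11, 11, 12, 11, 12, 11]
print(control_chart_flags(baseline, measurements))
```

Swapping rules in and out is just a matter of adding or removing conditions in the loop, which is exactly the flexibility described above.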
If you measure CPU utilization on a few servers, visualizing the chart and actually seeing the server behavior can be useful. If you are Facebook and monitor user behavior, visualizing a time series for every one of millions of users is hopeless. Calculating baselines, standard deviations and rules for each user is trivial.
Also note how this problem is “embarrassingly parallel”. To calculate behavior for each user, you only need to look at the data for that particular user. A parallel, share-nothing platform like Hadoop can be used to scale the calculation indefinitely, simply by throwing an increasing number of servers at the problem. The only limit is the time it takes to calculate the rules for a single user.
Naturally, I didn’t dive into some of the complexities of using Control Charts, such as how to select a good baseline, how to calculate the standard deviation (or whether to use another statistic to define the zones) and how many measurements should be examined before a behavior signals a trend. If you think this tool is useful for you, I encourage you to investigate more deeply.
When you read an execution plan you’re probably trying to identify the steps that Oracle went through to acquire the final result set so that you can decide whether or not there is a more efficient way of getting the same result.
For a serial execution plan this typically means you have to identify the join order, join methods and access methods together with the point at which each predicate was applied. Generally this isn’t too difficult, although subquery pushing (and a couple of presentation bugs relating to other scalar subqueries) can cause a little confusion; and the difference between join order and the order of operation can be slightly obscured when considering hash joins.
Parallel execution plans are harder, though, because you really need to understand the impact of the order of operation, distribution mechanisms chosen, and (in recent versions of Oracle) the timing of the generation and use of Bloom filters. The topic is stunningly large and easy to describe badly; it’s also going to be easy to make generalisations that turn out to be untrue or (at least) sufficiently incomplete as to be misleading. Rather than attempting to cover the topic in one note, I think I’m going to end up writing two or three.
I’ll be basing my comments on the two execution plans I produced for a recent question on Bloom filters, so you may want to read that article and its subsequent comments first (the link will open in a new window).
The ideal target for parallel execution is linear scalability. In other words, if you execute a statement with parallel degree N then the workload should be shared N ways with no interference or duplication and the job should complete in one Nth of the time. There are immediately four caveats as far as Oracle implementation is concerned:
Having mentioned the caveats, I’m going to ignore them and stick to a simplified image of two sets of parallel query slaves engaged in some fairly simple query processing and ask how Oracle shares the work while avoiding contention and duplicated effort.
Key to doing “1/Nth” of the work is the attempt to get a PX slave to do 1/Nth of the necessary disc I/O – so for a tablescan the query co-ordinator (QC) cuts the table into N pieces (technically it actually uses N * 13 pieces so that it can share N ways while allowing for an imbalance in content – the 13 comes from a historic 9:3:1 strategy); for access to partitioned tables the query co-ordinator may share out the partitions. In either case the co-ordinator can arrange to share the disk I/O fairly without collision or duplication of effort. That’s a basic first step.
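As a toy illustration of the granule idea (this is a sketch of cutting a scan into N * 13 pieces, not Oracle's actual granule algorithm, and the block counts are arbitrary):

```python
def make_granules(total_blocks, degree):
    """Split a scan of total_blocks blocks into degree * 13 granules.

    Cutting the table into roughly 13 pieces per slave means a slave
    that finishes its granule early can pick up another one, so the
    work stays balanced even if some granules are cheaper than others.
    Returns a list of (first_block, last_block) ranges.
    """
    n_granules = degree * 13
    base, extra = divmod(total_blocks, n_granules)
    granules, start = [], 0
    for g in range(n_granules):
        size = base + (1 if g < extra else 0)
        granules.append((start, start + size - 1))
        start += size
    return granules

# e.g. a 10722-block table scanned at degree 4 gives 52 granules
granules = make_granules(total_blocks=10722, degree=4)
print(len(granules))
```

The key property is that the granules cover every block exactly once, so N slaves can pull granules on demand with no collision and no duplication.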
But if we have to join table X and table Y, how do we ensure that the 1/Nth of table Y that a slave reads is the bit associated with the 1/Nth of table X that it has read so far? The answer is that we can’t, and that’s why distribution mechanisms are important and why (for simplicity, probably) Oracle generally ends up using 2 sets of N parallel execution slaves when running with degree N. One set of parallel execution slaves reads the first table as efficiently as possible and then (typically) distributes the data to the other set of parallel execution slaves using a mechanism that tries to anticipate the most effective strategy for reading the second set of data.
So, for example, if the optimizer estimates that Table X is 100MB but we’re only going to find 100 relevant rows in it while table Y will return 100,000 rows in 10GB of data the plan is likely to be:
- slave set 1 scans table X and broadcasts every relevant row it finds to every slave in slave set 2 (the “broadcast”);
- each slave in set 2 builds a hash table from the complete set of 100 rows;
- slave set 2 then scans table Y, sharing the scan N ways with no further redistribution (the “none”), probing its hash table as it reads.
We generate the right answer with this “(broadcast, none)” distribution because every relevant row from table X is known to every slave in the second slave set, so there is no risk of this distribution method “losing” a join between a row in table X and table Y that should have happened. The plan (probably) shows good scalability – i.e. time to complete is inversely proportional to degree N – because our two tables are both split into N separate areas with N slaves scanning them concurrently. The performance threat comes from the cost of duplicating the 100 rows to each of the slaves in the second set – which is probably a small overhead in CPU and memory, with a little contention due to the concurrent “cross-chat” between the levels of slaves (every slave in set 1 spends a little time talking – or wanting to talk – to every slave in set 2).
Alternatively, if the optimizer estimates that Table X will return 100,000 rows and table Y will return 1,000,000, then it may decide that there is too much data coming from table X to follow the broadcast strategy – which would end up with every slave in the second set receiving, and building a hash table from, 100,000 rows. In this case we may see:
- slave set 1 scans table X, distributing each row to a slave in set 2 according to a hash of the join column (the first “hash”);
- each slave in set 2 builds a hash table from the rows it receives;
- slave set 1 then scans table Y, distributing each row to set 2 using the same hash function (the second “hash”), and the slaves in set 2 probe their hash tables as the rows arrive.
In this case we know we will get the right answer because if two rows are supposed to join then the nature of the “(hash, hash)” distribution means that both rows will have been sent to the same parallel execution slave in slave set 2. Note that in this case Slave set 1 has scanned both the tables – but we still only scan the tables once each, and each scan is shared evenly between N non-interfering slaves. The performance trade-off is about memory, CPU and (particularly) inter-process contention.
On the plus side, we have minimised the memory usage by sharing the data that goes into the hash table across N slaves, rather than having every slave build the whole hash table; we have minimised the (first pass of) messaging by sending each row from each slave in the first set to just one slave in the second set, rather than sending every row to every slave.
On the minus side, the first set of slaves has had to pass a large number of rows from table Y to the second set of slaves – which increases the CPU usage, and increases the scope for contention in the cross-chat between the two sets of slaves. This is where the Bloom filter can come into play. If, for example, the optimizer thinks table X will return 100,000 rows and table Y will (based on the non-join predicates) return 1M rows, but the join will eliminate 90% of those rows, it may direct the second set of slaves to create a (small) Bloom filter from the values each has received and pass this back to the first set of slaves which can then use the filter to eliminate a lot of the rows from table Y that would otherwise be passed to the second set of slaves. We do not expect Bloom filters to be perfect – they will eliminate only rows that will not be needed, but they may not manage to eliminate all of those rows – so the data that reaches the second set of slaves may still contain some rows that will disappear on the join, but it will certainly contain all the rows that should have survived.
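The "may keep some unnecessary rows, never loses a necessary one" property described above can be seen in a tiny, generic Bloom filter sketch (the bit size, hash count and key values here are arbitrary illustrative assumptions; Oracle's internal implementation is, of course, different):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: no false negatives, possible false positives."""

    def __init__(self, n_bits=64, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # bitmap held as one big integer

    def _positions(self, value):
        # Derive n_hashes bit positions from seeded hashes of the value.
        for seed in range(self.n_hashes):
            h = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        return all(self.bits & (1 << p) for p in self._positions(value))

# Slave set 2 builds the filter from the join keys it received from table X.
bf = BloomFilter()
x_keys = [1, 2, 3]
for k in x_keys:
    bf.add(k)

# Slave set 1 applies the filter while scanning table Y: every genuinely
# joinable row survives, but a few non-joinable rows may slip through.
y_keys = range(1, 71)
survivors = [k for k in y_keys if bf.might_contain(k)]
assert all(k in survivors for k in x_keys)  # no false negatives, ever
```

Because a set bit may have been set by more than one key, a probe can succeed for a key that was never added; that is exactly why the second set of slaves still has to do the real join on whatever survives the filter.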
That completes my brief outline of the absolute basics of the interaction of parallel execution slaves and what they are trying to achieve. In the next article I’ll talk about “table queues” and the order of operation.
I’ll be representing the Oracle ACE Program as part of the OTN Nordic Tour this year. I’ve just booked my flights, so in a little over a week I’ll be starting the three-date tour.
I put forward a few different papers and each event picked different ones, so it looks like I could be presenting up to 6 distinct sessions over the tour. This next week is going to be very busy. I’ve got to put the finishing touches to one presentation, then rehearse all 6 a few times…
This is the final part of a multi-part blog post on using OpenLDAP for Net Service Name Resolution. Part 1 can be found here.
Ok, now that you have installed and configured phpldapadmin, you’ll want to do some customization that will facilitate using phpldapadmin specifically for maintaining Net Service Names. Remember, phpldapadmin is a generic tool that can be used to manage any LDAP server, with virtually any type of content. At this point, you can use it to manage your LDAP server, but I’ll provide you with two new custom templates, for managing Net Service Names and Net Service Aliases.
Basic phpldapadmin Navigation
First, let’s look at a screen in phpldapadmin. If you point your browser to http://localhost/phpldapadmin (replace localhost with your appropriate hostname), and then login with the rootdn (cn=admin,dc=proquest,dc=com), and your admin password, you should see a screen that looks like this:
Now, if you look at the left side panel, you should see an icon that looks like a box with a plus sign inside of it, followed by your domain. If you click on the plus sign, you’ll see a screen that looks like this:
Now, after expanding the domain, you can see that there are two items listed, ‘cn=admin’, which is the admin user credentials that you defined while setting up OpenLDAP, and ‘cn=OracleContext’, which is where all the Net Service Names and Net Service Aliases are categorized. Now, click the plus sign next to the ‘cn=OracleContext’ node, and you’ll see something like this:
So, you can see all the Net Service Names and Aliases listed. Notice the star icon, followed by ‘Create new entry here’, immediately following the ‘cn=OracleContext’. If you want to add a new Net Service Name or Net Service Alias, that’s where you’ll want to click. This will create a new entry under ‘cn=OracleContext’, which is where Net Service Name and Net Service Alias objects must exist, or they will not be recognized by your Oracle client. So, go ahead and click the ‘Create new entry here’ link, and you’ll see this screen:
So, this shows all the default templates that phpldapadmin provides. Exactly none of these are useful for maintaining Oracle Net Service Names and Aliases. Furthermore, to add a Net Service Name or Alias, you’ll need to choose the ‘Default’ template. This is a difficult and cumbersome way to interface with the LDAP server. So, what to do?
Custom Templates for phpldapadmin
Well, I’ve managed to create two new custom templates, for use with phpldapadmin, which I think you’ll find make creating and editing Net Service Names and Aliases, much simpler. First, download these two files: custom_orclNetServiceAlias.xml, custom_orclNetService.xml to the root user’s home directory. Now, my assumption is that you’ll be using phpldapadmin strictly for managing your Oracle Net Service Names and Aliases. Of course, your OpenLDAP server could be used to serve a wide variety of information, well beyond just Oracle Net Service Names and Aliases, but, I’ll assume you’re not interested in any of that. So, you’ll want to disable the standard templates that are distributed with phpldapadmin and drop in the two templates you just downloaded. So, to copy the custom templates into place, execute the following command:
mv /root/custom_orclNetService*.xml /usr/share/phpldapadmin/templates/creation
So, now, to disable the standard templates, and display only the templates specific to Net Service Names and Net Service Aliases, you’ll need to edit the /etc/phpldapadmin/config.php file.
You’ll want to locate lines that look like this (should be around line #159):
/* Just show your custom templates. */
// $config->custom->appearance['custom_templates_only'] = false;
/* Disable the default template. */
// $config->custom->appearance['disable_default_template'] = false;
You’ll want to uncomment those directives, and change the ‘false’ to ‘true’. When you’re done, those lines should look like this:
/* Just show your custom templates. */
$config->custom->appearance['custom_templates_only'] = true;
/* Disable the default template. */
$config->custom->appearance['disable_default_template'] = true;
Finally, to make the Net Description String a multi-line field, you’ll need to add a directive that looks like this:
// Make orclNetDescString a multi-line attribute
$config->custom->appearance['multi_line_attributes'] = array('orclNetDescString');
I don’t think it matters where in this file this directive appears, but I put it at the end of the ‘Appearance’ section. (Around line #190, on my system.)
Ok, now that you’ve disabled the phpldapadmin-provided templates, as well as the default template, and supplied your own custom templates, let’s revisit the screen where you selected ‘Create new entry here’. (Be sure to select the ‘Create new entry here’ link which is directly under ‘cn=OracleContext’.) First, though, do a quick logout/login, so that the system reads in the new templates. Now, when you select ‘Create new entry here’ under ‘cn=OracleContext’, you’ll see a screen that looks like this:
So, now we only see the ‘Oracle: Net Service Name’ and ‘Oracle: Net Service Alias’ templates. This should greatly simplify the task of adding Net Service Names and Aliases, as it removes a bunch of stuff you’ll never use, and adds specific templates for the two types of objects you’re interested in.
Here’s a screenshot of the Net Service Name template:
And here is a screenshot of the Net Service Alias template:
So, for the Net Service Name, you’ll want to enter the Net Service Name and Net Description String, something like this:
And, for the Net Service Alias, you’ll want to specify an Alias name and Aliased Object Name, which looks like this:
Note the format of the Aliased Object Name. Also, note that you can use the icon with the magnifying glass on top of the file folder to open a window where you can browse and navigate through the Net Service Names that have been defined.
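Once you’ve saved an entry, you can also sanity-check it from the command line with ldapsearch, independently of the GUI. A sketch, with placeholder values: the base DN dc=example,dc=com, the bind DN, and the host are assumptions you’ll need to adjust for your own directory:

```
ldapsearch -x -H ldap://localhost \
  -D "cn=admin,dc=example,dc=com" -W \
  -b "cn=OracleContext,dc=example,dc=com" \
  "(objectclass=orclNetService)" cn orclNetDescString
```

Each Net Service Name defined under ‘cn=OracleContext’ should come back with its cn and its connect descriptor.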
So, this concludes the setup of OpenLDAP for Oracle Net Service Name resolution. You have installed OpenLDAP and configured it for Net Service Name resolution. Optionally, you have set up simple Master/Slave replication. You have the tns2ldif filter for large scale importation of your enterprise-wide tnsnames.ora file, and finally, you have installed the phpldapadmin tool for GUI-based editing of individual Net Service Names and Aliases. This should be a complete solution for enterprise-wide Net Service Names resolution. If you have any questions, comments, or problems, feel free to leave a comment below.
This is part 5 of a multi-part blog post on using OpenLDAP for Net Service Name Resolution. Part 1 can be found here.
Day to Day Management
The tns2ldif program is great for converting a large number of Net Service Names into LDIF format for subsequent bulk loading into the LDAP server. But for day-to-day operation, where you may want to add or delete individual Net Service Names or modify an existing one, hand editing LDIF files is probably not an optimal solution. So, is there a GUI solution that may help with this type of day-to-day work? There are a few free LDAP GUIs available; the one I have chosen is phpldapadmin.
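For reference, here is roughly what a single Net Service Name looks like in LDIF form, i.e. what you’d otherwise be hand editing. The base DN dc=example,dc=com, the name orcl, and the host are placeholders for illustration; the objectclass and attribute names are the ones used by the Oracle Net LDAP schema:

```
dn: cn=orcl,cn=OracleContext,dc=example,dc=com
objectClass: top
objectClass: orclNetService
cn: orcl
orclNetDescString: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=dbhost.example.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl)))
```

You could load such a file with `ldapadd -x -D "cn=admin,dc=example,dc=com" -W -f orcl.ldif`, but the point of the rest of this post is to avoid exactly this kind of hand editing.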
So, first, let’s install phpldapadmin. If you’ve set up the Master/Slave replication, you’ll want to install phpldapadmin on the Master server.
Start with installing a few prerequisites, thus:
yum -y install php php-ldap
Now, phpldapadmin is not available in the standard Oracle Linux 6 repositories. So, we’ll need to enable an alternate repository, to gain access to phpldapadmin. Executing the following commands will set up the alternate repository:
sudo rpm -Uvh remi-release-6*.rpm epel-release-6*.rpm
Finally, to actually install phpldapadmin, execute the following:
yum --enablerepo=epel -y install phpldapadmin
That’s it for installation!
Now, we just need to configure a few things, to get it working with your OpenLDAP server.
So, to configure phpldapadmin, there’s just a few quick things you’ll want to do. First, edit the /etc/phpldapadmin/config.php file in your favorite editor:
From the editor, look for two lines that look like this (on my system, they are lines 397 and 398 in the file):
In the example above, the first line is commented out, and the second is not. We just want to reverse that. Comment out the second line, and uncomment the first. So, make it look like this instead:
Ok, now save that file and exit the editor. Next, we need one more quick change, this time to the /etc/httpd/conf.d/phpldapadmin.conf file. So, edit the file:
This is a short file, comprising only a few lines. It should look something like this:
Alias /phpldapadmin /usr/share/phpldapadmin/htdocs
Alias /ldapadmin /usr/share/phpldapadmin/htdocs
Deny from all
Allow from 127.0.0.1
Allow from ::1
Now, identify the line that says “Allow from 127.0.0.1”, as that is the line you’ll need to modify. If you leave this file unmodified, then you’ll only be able to access phpldapadmin from the local host, since 127.0.0.1 is the local loopback address. So, if you’re not on the server that phpldapadmin is installed on, you won’t be able to access it. If you want to allow access from other hosts, their IP addresses or network addresses need to be added to this line. In my case, my master server and slave server are running on two different virtual machines on my Macbook Pro laptop. The VMWare software established a network on the 192.168.x.x subnet. So, in my case, I modified that line to look like this:
Allow from 127.0.0.1 192.168.0.0/16
You should adjust it as necessary for your network setup. Ok, time to save this file and exit the editor. Now, all you need to do is to start Apache, thus:
service httpd start
If you want it to restart automatically at boot time, also execute:
chkconfig httpd on
In the final installment of this series, I’ll talk a bit about using phpldapadmin to edit and update your LDAP server.
When the optimizer has to estimate the data volume (the BYTES column in the plan output), it usually bases this information on the column statistics, if applicable and available (think of complex expressions). However, whenever there is a VIEW operator in an execution plan that represents an unmerged view, the optimizer obviously "loses" this information and starts applying defaults that are based on the column definition. Depending on the actual content of the columns, this can lead to dramatic differences in data volume estimates. Both under- and overestimates are possible, because for character-based columns these defaults seem to be based on an assumed 50% fill grade, so a VARCHAR2(100 BYTE) column counts as 50 bytes of data volume. For multi-byte character sets the same rule applies based on the maximum width of a column using "char" semantics, so a VARCHAR2(1000 CHAR) column counts as 2000 bytes of data volume when using the AL32UTF8 character set, which is 50% of the 4000 bytes such a column could require at maximum. With a multi-byte character set this effect can therefore be exaggerated.
The cost calculation of data access operations like full table scans isn't influenced by these different data volume estimates, but for hash joins the decision of which of the two row sources will be used as hash and which as probe input is basically driven by the estimated data volume. Of course the cost estimates of other operations, like sorts or aggregates, are also based on the data volumes. But for hash joins in particular, the possible difference in data volume estimates can lead to bad decisions: using the effectively larger row source for building the hash table, and therefore slower, less efficient join processing with increased memory, TEMP and CPU usage.
Here is a simple sample demonstrating the point. First, create two tables, both using VARCHAR2(4000 BYTE) fields, but one has these fields populated with only a single character, whereas the other one fills them completely:
create table t1
as
select
        rownum as t1_id
      , cast('x' as varchar2(4000)) as large_vc1_not_filled
      , cast('x' as varchar2(4000)) as large_vc2_not_filled
      , cast('x' as varchar2(4000)) as large_vc3_not_filled
      , cast('x' as varchar2(4000)) as large_vc4_not_filled
from
        dual
connect by
        level <= 1e5
;
exec dbms_stats.gather_table_stats(null, 't1')
create table t2
as
select
        rownum as t2_id
      , rpad('x', 4000) as large_vc1_filled
      , rpad('x', 4000) as large_vc2_filled
      , rpad('x', 4000) as large_vc3_filled
      , rpad('x', 4000) as large_vc4_filled
from
        dual
connect by
        level <= 1e4
;
exec dbms_stats.gather_table_stats(null, 't2')
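Before looking at the plans, it can be instructive to compare the declared column length with the average column length found by the statistics gathering, since this is where the "with statistics" byte estimates come from. A quick sketch against the standard data dictionary view:

```
select
        table_name
      , column_name
      , data_length   -- declared maximum length in bytes
      , avg_col_len   -- average actual length from the gathered statistics
from
        user_tab_columns
where
        table_name in ('T1', 'T2')
order by
        table_name
      , column_id
;
```

For the T1 VARCHAR2 columns you should see an avg_col_len of just a couple of bytes, for T2 close to the full 4000, while data_length is 4000 for both.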
So what do we get if we simply join these two tables:
select * from t1, t2 where t1_id = t2_id;
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 10000 | 152M| | 13773 (1)| 00:02:46 |
|* 1 | HASH JOIN | | 10000 | 152M| 2448K| 13773 (1)| 00:02:46 |
| 2 | TABLE ACCESS FULL| T1 | 100K| 1269K| | 70 (2)| 00:00:01 |
| 3 | TABLE ACCESS FULL| T2 | 10000 | 152M| | 6011 (1)| 00:01:13 |
So we can see that the optimizer understands that the table with more rows actually results in a much smaller row source in terms of data volume, as the character columns hold only a single character. What happens if we now deliberately turn the tables into views?
select * from (select /*+ no_merge */ * from t1), (select /*+ no_merge */ * from t2) where t1_id = t2_id;
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 10000 | 152M| | 47850 (1)| 00:09:35 |
|* 1 | HASH JOIN | | 10000 | 152M| 76M| 47850 (1)| 00:09:35 |
| 2 | VIEW | | 10000 | 76M| | 6011 (1)| 00:01:13 |
| 3 | TABLE ACCESS FULL| T2 | 10000 | 152M| | 6011 (1)| 00:01:13 |
| 4 | VIEW | | 100K| 764M| | 70 (2)| 00:00:01 |
| 5 | TABLE ACCESS FULL| T1 | 100K| 1269K| | 70 (2)| 00:00:01 |
You can now spot what I've described above: the table T2 row source is actually underestimated by 50% at the VIEW operator (76M vs. 152M bytes), because its character columns are filled to their maximum size, whereas table T1 is now heavily overestimated in size (764M (!) vs. 1269K bytes), and these differences mean that the hash join now uses the actually much larger row source T2 to build the hash table. You can see the effect already in the optimizer's estimates: it now assumes 76M of TEMP space usage for the hash join, instead of 2448K when simply joining the tables.
As a side note, this is one of the areas where Dynamic Sampling has a severe shortcoming when comparing its estimates to those based on actual statistics. This is what I get when deleting the stats from both tables and running the simple join again:
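As a back-of-the-envelope check, the VIEW operator's byte figures above can be reproduced from the 50% fill-grade rule described earlier. A sketch; the small remainder per row (the numeric id column and row overhead) is ignored here, which is why the results are close to, but not exactly, the plan figures:

```python
# Reproduce the VIEW operator's default byte estimates, assuming
# each VARCHAR2 column is counted at 50% of its declared maximum.

MB = 1024 * 1024  # Oracle's "M" in the Bytes column of plan output

def view_bytes_estimate(rows, num_varchar_cols, declared_bytes):
    # Without usable column statistics, each VARCHAR2 column is
    # assumed to be filled to 50% of its declared maximum.
    return rows * num_varchar_cols * declared_bytes * 0.5

t1_view = view_bytes_estimate(100_000, 4, 4000)  # T1 behind a VIEW
t2_view = view_bytes_estimate(10_000, 4, 4000)   # T2 behind a VIEW
t2_actual = 10_000 * 4 * 4000                    # T2 columns really are full

print(round(t1_view / MB))    # 763 -> plan shows 764M (id column adds a few bytes/row)
print(round(t2_view / MB))    # 76  -> matches the 76M on the VIEW line for T2
print(round(t2_actual / MB))  # 153 -> plan shows 152M for the plain T2 scan
```

The same arithmetic explains the 152M vs. 76M halving for T2 and the three-orders-of-magnitude blow-up for T1.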
exec dbms_stats.delete_table_stats(null, 't1')
exec dbms_stats.delete_table_stats(null, 't2')
select * from t1, t2 where t1_id = t2_id;
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 10909 | 166M| | 49209 (1)| 00:09:51 |
|* 1 | HASH JOIN | | 10909 | 166M| 83M| 49209 (1)| 00:09:51 |
| 2 | TABLE ACCESS FULL| T2 | 10909 | 83M| | 6011 (1)| 00:01:13 |
| 3 | TABLE ACCESS FULL| T1 | 102K| 785M| | 70 (2)| 00:00:01 |
Since Dynamic Sampling doesn't evaluate the average row size, it uses a similar (but somewhat different) assumption to the VIEW operator, and again, due to these estimates, the hash join uses the "wrong" row source to build the hash table. And finally: it gets even worse when using the VIEW variant with Dynamic Sampling:
select * from (select /*+ no_merge */ * from t1), (select /*+ no_merge */ * from t2) where t1_id = t2_id;
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 102K| 1570M| | 49209 (1)| 00:09:51 |
|* 1 | HASH JOIN | | 102K| 1570M| 83M| 49209 (1)| 00:09:51 |
| 2 | VIEW | | 10909 | 83M| | 6011 (1)| 00:01:13 |
| 3 | TABLE ACCESS FULL| T2 | 10909 | 83M| | 6011 (1)| 00:01:13 |
| 4 | VIEW | | 102K| 785M| | 70 (2)| 00:00:01 |
| 5 | TABLE ACCESS FULL| T1 | 102K| 785M| | 70 (2)| 00:00:01 |
The VIEW operator now doesn't change the data volume estimate based on the Dynamic Sampling information, but the hash join suddenly changes the estimated data volume to 1570M (!) bytes, because the join cardinality estimate is now 102K rows instead of the more realistic 10,000. So the cardinality estimate is now screwed up by the VIEW operator as well.
If you happen to have a large discrepancy between the column definitions and the actual column usage, which is particularly relevant for character-based columns, the data volume estimates can vary significantly between merged and non-merged views. The use of multi-byte character sets can exaggerate this effect in the case of char semantics. Most significantly, this can lead to bad decisions regarding hash joins, using the larger row source as the hash table. Whether this effect of the VIEW operator is a feature or a bug I can't tell; there might be good reasons why the information about the column statistics gets lost, but it certainly can lead to performance problems, in particular with hash joins. The effect can be reproduced across all currently supported versions, including 12.1.