Search

OakieTags

Who's online

There are currently 0 users and 30 guests online.

Recent comments

Perceptions

If you Really Can’t Solve a “Simple” Problem..

Sometimes it can be very hard to solve what looks like a simple problem. Here I am going to cover a method that I almost guarantee will help you in such situations.

I recently had a performance issue with an Oracle database that had just gone live. This database is designed to scale to a few billion rows in two key tables, plus some “small” lookup tables of a few dozen to a couple of million rows. Designing a system of this scale with theory only is very dangerous, you need to test at something like the expected volumes. I was lucky, I was on a project where they were willing to put the effort and resource in and we did indeed create a test system with a few billion rows. Data structure and patterns were created to match the expected system, code was tested and we found issues. Root causes were identified, the code was altered and tested, fine work was done. Pleasingly soon the test system worked to SLAs and confidence was high. We had done this all the right way.

We went live. We ramped up the system to a million records. Performance was awful. Eyes swung my way… This was going to be easy, it would be the statistics, the database was 2 days old and I’d warned the client we would need to manage the object statistics. Stats were gathered.
The problem remained. Ohhh dear, that was not expected. Eyes stayed fixed upon me.

I looked at the plan and I quickly spotted what I knew was the problem. The below code is from the test system and line 15 is the key one, it is an index range scan on the primary key, within a nested loop:

   9 |          NESTED LOOPS                       |                           |     1 |   139 |    37   (3)| 00:00:01 |
* 10 |           HASH JOIN SEMI                    |                           |     1 |    50 |    11  (10)| 00:00:01 |
* 11 |            TABLE ACCESS BY INDEX ROWID      | PARTY_ABCDEFGHIJ          |     3 |   144 |     4   (0)| 00:00
* 12 |             INDEX RANGE SCAN                | PA_PK                     |     3 |       |     3   (0)| 00:00:01 |
  13 |            COLLECTION ITERATOR PICKLER FETCH|                           |       |       |            |          |
  14 |           PARTITION RANGE ITERATOR          |                           |    77 |  6853 |    26   (0)| 00:00:01 |
* 15 |            INDEX RANGE SCAN                 | EVEN_PK                   |    77 |  6853 |    26   (0)| 00:00:01 |

On the live system we had an index fast full scan (To be clear, the below is from when I had tried a few things already to fix the problem, but that index_fast_full_scan was the thing I was trying to avoid. Oh and, yes, the index has a different name).

|   9 |          NESTED LOOPS                 |                           |     1 |   125 |  1828   (3)| 00:00:16 |
|  10 |           NESTED LOOPS                |                           |     1 |    63 |     2   (0)| 00:00:01 |
|* 11 |            TABLE ACCESS BY INDEX ROWID| PARTY_ABCDEFGHIJ          |     1 |    45 |     2   (0)| 00:00:01 |
|* 12 |             INDEX UNIQUE SCAN         | PA_PK                     |     1 |       |     1   (0)| 00:00:01 |
|* 13 |            INDEX UNIQUE SCAN          | AGR_PK                    |     1 |    18 |     0   (0)| 00:00:01 |
|  14 |           PARTITION RANGE ITERATOR    |                           |     1 |    62 |  1826   (3)| 00:00:16 |
|* 15 |            INDEX FAST FULL SCAN       | EVE_PK                    |     1 |    62 |  1826   (3)| 00:00:16 |

Now I knew that Oracle would maybe pick that plan if it could get the data it wanted from the index and it felt that the cost was lower than doing multiple range scans. Many reasons could lead to that and I could fix them. This would not take long.

But I could not force the step I wanted. I could not get a set of hints that would force it. I could not get the stats gathered in a way that forced the nested loop range scan. I managed to alter the plan in many ways, fix the order of tables, the types of joins, but kept getting to the point where the access was via the index fast full scan but not by range scan. I thought I had it cracked when I came across a hint I had not known about before, namely the INDEX_RS_ASC {and INDEX_RS_DESC} hint to state do an ascending range scan. Nope, no joy.

By now, 8 hours had passed trying several things and we had a few other people looking at the problem, including Oracle Corp. Oracle Corp came up with a good idea – if the code on test runs fine, copy the stats over. Not as simple as it should be as the test system was not quite as-live but we did that. You guessed, it did not work.

So what now? I knew it was a simple problem but I could not fix it. So I tried a technique I knew had worked before. I’d long passed the point where I was concerned about my pride – I emailed friends and contacts and begged help.

Now, that is not the method of solving problems I am writing about – but it is a damned fine method and I have used it several times. I highly recommend it but only after you have put a lot of effort into doing your own work, if you are willing to give proper details of what you are trying to do – and, utterly crucially, if you are willing to put yourself out and help those you just asked for help on another day.

So, what is the silver bullet method? Well, it is what the person who mailed me back did and which I try to do myself – but struggle with.

Ask yourself, what are the most basic and fundamental things that could be wrong. What is so obvious you completely missed it? You’ve tried complex, you’ve been at this for ages, you are missing something. Sometimes it is that you are on the wrong system or you are changing code that is not the code being executed {I’ve done that a few times over the last 20 years}.

In this case, it was this:

Here is my primary key:

EVEN_PK EABCDE 1 AGR_EXT_SYS_ID
EVEN_PK EABCDE 2 EXT_PRD_HLD_ID_TX
EVEN_PK EABCDE 3 AAAMND_DT
EVEN_PK EABCDE 4 EVT_EXT_SYS_ID
EVEN_PK EABCDE 5 EABCDE_ID

Except, here is what it is on Live

EVE_PK EABCDE 1 EVT_EXT_SYS_ID
EVE_PK EABCDE 2 EABCDE
EVE_PK EABCDE 3 AGR_EXT_SYS_ID
EVE_PK EABCDE 4 EXT_PRD_HLD_ID_TX
EVE_PK EABCDE 5 AAAMND_DT

Ignore the difference in name, that was an artifact of the test environment creation, the key thing is the primary key has a different column order. The DBAs had implemented the table wrong {I’m not blaming them, sometimes stuff just happens OK?}.
Now, it did not alter logical functionality as the Primary Key is on the same columns, but as the access to the table is on only the “leading” three columns of the primary key, if the columns are indexed in the wrong order then Oracle cannot access the index via range scans on those values! Unit testing on the obligatory 6 records had worked fine, but any volume of data revealed the issue.

I could not force my access plan as it was not possible – I had missed the screaming obvious.

So, next time you just “know” you should be able to get your database (or code, or whatever) to do something and it won’t do it, go have a cup of tea, think about your last holiday for 5 minutes and then go back to the desk and ask yourself – did I check that the most fundamental and obvious things are correct.

That is what I think is the key to solving what look like simple problems where you just can’t work it out. Try and think even simpler.

Where is Sun?

First of all, may I wish everyone who comes by my blog a heartfelt Happy New Year.
Secondly, I promise I’ll blog more often and more on technical aspects this year than I have for most of 2010.
Thirdly, I’ll admit the title to this blog is nothing to do with the hardware company now owned by Mr Larry Ellison, but is about the huge glowing ball of fire in the sky (which we have not seen a lot of here in England and Wales for the last couple of weeks – not sure about Scotland but I suspect it has been the same). I apologise for the blatantly misleading (and syntactically poor) title.

A quick question for you – It is the depths of winter for most of us, and it has been unusually cold here in the UK and much of Europe. When are we, as a planet, furthest from the Sun during winter? January the 1st? The Shortest day (21st December)? The day the evening start drawing out (December 14th)?
I think many in the Northern Hemisphere will be surprised to learn that we are closest to the sun today (3rd Jan 2011). A mere 147.104 million kilometers from the centre of our solar system. I mentioned this to a few friends and they were all taken aback, thinking we would be furthest from the warmth of the sun at the depths of our winter.

Come the 4th July 2011 it is not only some strange celebration in the US about having made the terrible decision to go it alone in the world {Joke guys!}, but is the day in the year that the Earth is furthest from the sun – 152.097 million kilometers. That is about 3.39% further away and, as the energy we receive from the sun is equal to the square of the distance, does account for a bit of a drop in the energy received. {Surface of a sphere is 4*pi*(R{adius}squared), you can think of the energy from the sun as being spread over the sphere at any given distance}.

Some of you may be wondering why this furthest/closest to the sun does not match the longest/shortest day. As some of you may remember, I explained about the oddities of the shortest day not matching when the nights start drawing out about this time last year. It is because as we spin around our own pole and around the sun, things are complicated by the fact that the earth “leans over” in it’s orbit.

Check out this nice web site where you can state the location and month you want to see sunrise, sunset, day length and (of particular relevance here) the distance from the sun for each day.

I find it interesting that many of the things us most of us see as “common sense” are often not actually right (I always assumed that the shortest day coincided with both the evenings starting to draw out and mornings getting earlier until I stumbled across it when looking at sunset times – I had to go find a nice Astronomer friend to explain it all to me). I also like the fact that a very simple system – a regularly spinning ball circling a large big “fixed point” in a fixed way – throws up some oddities due to little extra considerations that often go overlooked. Isn’t that so like IT?

That lean in the Earth’s angle of spin compared to the plane we revolve around the sun is slowly rotating too, so in a few years (long, long, long after any of us will be around to care) then the furthest point in the orbit will indeed match the northern hemisphere winter. Again like IT, even the oddities keep shifting.

Friday Philosophy – Run Over by a Bus

I chaired a session at the UKOUG this week by Daniel Fink, titled “Stop Chasing your tail: Using a Disciplined Approach to Problem Diagnosis”. It was a very good talk, about having a process, an approach to solving your IT problems and that it should be a process that suits you and your system. All good stuff and I utterly agree with what he said.

But it was a passing comment Daniel made that really set me thinking. It was something like:

You should be considering how people will look after the system after you have gone, the classic ‘what will we do if you are hit by a bus’….. No, I don’t like thinking like that, that phrase… I prefer ‘after you win the lottery and retire to a great life’.

It just struck a chord with me. Mr Fink’s {and I do go all formal when I intend respect} take on this is a far more positive way of looking at the situation of leaving the system in a state that others can look after once you are no longer able to help. The “Bus” phrase is very, very common, at least in the UK and I suspect in the US, and it is a very negative connotation. “Make sure it all works as something nasty is going to happen to you, something sudden, like being smeared across the tarmac by 25 tons of Greyhound doing 50mph, something basically fatal so you can’t prepare and you can’t help any more”. So, not just moved on, but dead.

Daniel made me realise that we should be looking at this from totally the other perspective and that doing so is much, much, much better. “Make it work so that they love you, even when you have gone away to a happier situation – one involving no road-based unpleasantness at all”.

Everyone leaves their job eventually and I like to think it is often for more positive reasons. Like retiring, or a better job {better for you, but a real shame for your old company as they like you so much}, moving to a new area, attempting a dream. Yes, sometimes (depressingly often at present) it is because you get made redundant or things go bad with your managers, or HR take over the organisation. But even so, better to leave knowing you did so with your professional duty intact I think. It’s one way of winning in a losing situation.

If turning the “bus” metaphor into a “lottery” metaphor results in the response in your brain of “well, when I do leave rich and happy, I still want to leave a painful mess behind me” – then it may indicate that you better leave where you are working as soon as possible in any case? As it is not a good situation and you are deeply very unhappy about it.

Up until now I have sometimes used a far more gruesome but less fatal phrase for the concept of making sure things continue after you leave and can no longer help, which is “involved in a freak lawnmower accident”. As in, can’t type but not dead. I’m going to stop using it, I’ve decided that even with my macabre sense of humour, it really is not a good way to think about doing your job properly. Daniel, your attitude is better. Thank you.

Oh, if you went along to the conference you can get the latest version of Daniel’s talk slides from the UKOUG web site (try this link), otherwise, he has a copy here – pick “papers and presentations”. It has lots of notes on it explaining what the slides mean (ie, what he actually says), which I think is a very nice thing for him to have spent the time doing.