When preparing for the the IOUG Collaborate 12 deep dive on deploying Oracle Databases for high Availability, I wanted to provide some feedback on what hardware components are failing most frequently and which ones are less frequently. I believe I have reasonably good idea about that but I thought that providing some more objective data would be better. I couldn’t find and results of a more scientific research so I decided to organize a poll. This blog post shows the results and I promised to share it with several groups.
The results are also in the presentation material but it might be hidden deep into 100+ slides so here is the dedicated blog with some comments on the results.
I asked the following question — “Sort the hardware components in the descending order of failure rates.” The list must be fully ordered so there is only one component for each place — #1, #2 and so on so we basically have sorted list of 7 components. I didn’t collect failure rate estimate or anything like that — it’s already perception based so asking for ratios estimate would definitely be seeking for garbage data. Each component gets the weight based on the place it ends up on the list. Place #1 (most frequently failing) gives weight 7. Place #2 gives weight 6 and so on until place #7 that gives only weight 1.
The resulting weights for each component are averaged so the higher the resulting average weight, the more frequently respondents think this hardware component fails. Again, this is perception based and I’m sure our brains make it quite subjective but I hope having 50+ people on the survey would provide good indication. Please note that you can’t compare failure rates of each component by comparing the resulting average weight below — remember it’s just the placement on the list and we didn’t collect any failure rates so no conclusions about it.
I was satisfied that the results are quite close to what my estimates are except that I wouldn’t place failure of network cards and switches as high. I’ll let you comment on that — does the result match your estimates? Anything you found surprising?
The survey is coming from users of Oracle technology mostly. I think it should generally be applicable to any other platforms but if we compare with, say, MySQL deployments then Oracle infrastructure is usually built using more enterprise hardware class and MySQL infrastructure is usually running on lower grade servers. Still, I think it should be useful to anyone — just keep in mind the source of the feedback.
Question: What do Pokemon and When Others in PL/SQL have in common?
For many developers the answer is:
Asnwer: You gotta catch 'em all.
I saw this other blog post on "Dodgy Coder" and was greatly amused by the comparison of Pokemon and catching all exceptions and basically making them disappear.
I've written about this "worst" practice a lot. You can see some of them at:
You really DON'T gotta catch em' all! And if you do - you really gotta throw them again (re-raise them)
While doing my high availability deep dive at Collaborate 12 few weeks ago, I stated that storage replication qualifies for the cold failover licensing rules (see slide #128). During collaborate I spoke to one person at Oracle who definitely knows the rules. Simon Haslam also reached out to me by email pointing out that things...
During the course of recent twitter / RSS reading, I came across an interesting post related to the old FizzBuzz problem. The post intrigued me for 2 reasons: it talks about solving a basic problem in Scala, and it emphasizes the use of functional programming. And just for grins and chuckles it calls out OOP based on an old paper from 1984 by John Hughes on Why Functional Programming Matters.
I’m a big fan of functional programming, even if my language of choice, SQL, is a poor example of it. Although, I tend to think functional when I write SQL, I know that the language is limited in that respect.
However, when I think about problems that need to be solved in an Oracle database system, I always try to exhaust SQL-based solutions before looking to PL/SQL (I’ve always enjoyed Tom Kyte’s mantra on the subject). It’s one of the reasons that systems I have influence over have very little imperative PL/SQL constructs.
I also strive to write modular, relatively self-explanatory SQL to combat the reputation of SQL as a “difficult” language — all languages are difficult when you are confronted with them
Anyway, the post under discussion describes an inverse FizzBuzz problem with a description of working through various solutions using the function constructs of Scala, and many people have had a go at solving it.
One of my favorites is in F#, Microsoft’s functional language.
Anyway, here’s my attempt at it — I’m not entirely satisfied with it yet, but it was fun to write:
with fb0 as ( select 'fizz,buzz' input from dual ), fb1 as ( select level n, case when mod(level,3) = 0 and mod(level,5) <> 0 then 'fizz' when mod(level,5) = 0 and mod(level,3) <> 0 then 'buzz' when mod(level,3) = 0 and mod(level,5) = 0 then 'fizzbuzz' end fb, lead(level,1) over (order by level) next_n from dual where mod(level,3) = 0 or mod(level,5) = 0 connect by level <= 100 ), fb2 as ( select connect_by_root n n_root, level pos, fb1.n, fb1.fb from fb0, fb1 where level <= nvl(length(regexp_replace(fb0.input,'\w'))+1,1) start with fb1.fb = regexp_substr(fb0.input,'\w+') connect by prior fb1.next_n = fb1.n ), fb3 as ( select listagg(fb2.fb,',') within group (order by fb2.pos) fb_str, listagg(fb2.n,',') within group (order by fb2.pos) n_str, max(fb2.n) - min(fb2.n) n_length, min(fb2.n) n_min from fb0, fb2 group by fb2.n_root having listagg(fb2.fb,',') within group (order by fb2.pos) = fb0.input ), fb4 as ( select fb3.n_length, fb3.n_min, fb3.fb_str, fb3.n_str, min(fb3.n_length) over () min_length, min(fb3.n_min) over (partition by fb3.n_length) min_start from fb3 ), fb5 as ( select fb4.fb_str, fb4.n_str from fb4 where fb4.n_length = fb4.min_length and fb4.n_min = fb4.min_start ) select * from fb5;
I'm not entirely happy with it -- I'd like to combine fb2 with fb3, and fb4 with fb5. I'd also like to rewrite it with recursive subqueries instead of the connect by's and listagg's, but it's not too bad.
One of the things I'm starting to like with the WITH construct is how easy it is to debug the overall statement -- I can just replace the final SELECT with whichever subquery I want to test in order to see what results it's "passing" to the next subquery. Go ahead and try it to see!
There is a saying "correlation is not causation" meaning - just because you observe A and then observe B, it does not mean that A causes B - even if every single time you observe A - you see B as a 'side effect'.
Here is a great story to back that up - a debugging session where the end users had determined that they could not send email more than 500 miles. Read it and note how much work the end users had gone into 'proving' that email cannot be sent more than 500 miles...
I use a similar story when talking about bind variables and bind variable peeking. In the story - I make the claim that "when ever it rains on a Monday morning - you have to restart the database in the afternoon to make it perform correctly". That is - rain on Mondays implies poor performance, and the only fix is to reboot.
This conclusion was reached via empirical observation - every single time, *every single time*, it would rain on Monday mornings - the database would respond with really poor response times to certain critical queries in the afternoon. Every single time, without failure. It must be that rain affects database performance somehow.
But here is what really was happening. The database in question was a database that was backed up using cold backups (DBA's were of the archaic type). These cold backups took place Sunday nights - late at night. That meant that every Monday morning all of the caches would be virtually empty. No parsed SQL for the applications in particular would be present. So - Monday morning was the morning of hard parses every single week.
Now, in the critical application - there was a query that went against some very skewed data. 99.9999% of the time - the queries would be very selective, they would return 1 or 2 rows. The other times - they would be very non-selective returning much of the table - but these queries were rare.
Now, the main user of this application was a person that came in early most of the time - they would get into the office at 6am and start working. They always used bind inputs that returned very few rows - hence the problem queries would hard parse and tend to use indexes. When this happened - everyone was happy.
The other set of users - they came in later, usually around 9am or so. They would run the non-selective query and were OK with the index performance. It might not have been the *best* possible performance - but it worked OK for them.
So, what was the link to rain? Well, the office where everyone worked was located in Washington DC - when it rains in DC it doesn't matter if you leave home at 6am - it'll been noon before you get to work. If you leave at 7am - it'll be noon when you get to work. If you leave at 8am - it'll still be noon before you get to work. In short - it really didn't matter when you left for work - if you are driving into the city - it is going to take a long time. So, when it rained, the person that did the selective queries would just hit the snooze button on the alarm and go back to sleep. Not worth driving in yet.
However, the people that did the non-selective queries - they lived in an apartment across the street from the office. They were not affected by the rain. They came in at 9am regardless. Since they ran the non-selective queries and the cache was empty - they would hard parse the queries and result with full scan plans. This group of people was in fact a little convinced that rain *helped* performance a bit - when it rained on Mondays - they saw slightly better performance in the morning.
When the person that slept in finally got to work - and ran the application - they would definitely notice the full scans and how slow the database was. Since it was raining - and they had observed this hundreds of times before - they call the data center - tell them "It is Monday, it is raining, you know what to do", after the reboot, everything is running fast again.
But - did it have anything to do with rain - or was it something entirely different :)
Getting to the root cause (bind peeking issue) can lead you to an efficient, effective corrective action (perhaps query plan stability for a certain set of queries for example).
Going down the empirical path of "correlation must imply causation" will lead to never actually fixing the problem - and the invention of many myths...
Thanks to everyone who attended my webinar sponsored by Red Gate and All Things Oracle. My topic was "How to Gather SQL Resource Consumption Metrics in Oracle". The webinar recording has been posted at AllThingsOracle as of May 9 but I thought I'd also make the materials available here for anyone interested. In the zip file, you'll find a PDF of the presentation slides, several example reports (ASH, AWR, SQL Monitor) as well as several scripts I used and a text file containing all the demo queries and output.
I was also asked a couple of questions that I wanted to follow up on. First, someone asked about the use of dbms_sqltune. As it turns out, it does require a license for the Oracle Tuning Pack in order to be able to legally use that package. That means that SQL Monitor reports are off limits without that license. Sigh...
Another person asked about how to create a SQL Profile and I included a script I use that was created by Kerry Osborne called create_sql_profile_awr.sql. You can get that script and several others along with a great overview of SQL Profiles at Kerry's blog.
Again, thanks to everyone who attended and I hope you found it interesting and helpful. Thanks also to the folks at Red Gate and particularly James Murtaugh who leads the charge at AllThingsOracle.com. Keep your eyes on the site for upcoming webinars, great articles and more!
These are my personal rules that I’ve been following moderating the public forums on LinkedIn. I’ve posted on that topic in the discussion on IOUG Exadata SIG forum. As I’m passing RAC SIG group to the next folks on the board (I’m the RAC SIG president until end of August) I needed to hand over...
Back in the (really) old days, systemstate dumps had to be used for diagnosing hangs and finding blockers of hung databases. Basically you just identified which resource your waiting sessions were waiting for and then scanned through the rest of the system state dump to see which session held that particular resource (note that over time the instrumentation and systemstate dumps have evolved to resolve and dump the blocker information right at the waiting session section in the dump).
Diagnosing complex hangs was a tedious manual process (or required some scripting), so the hanganalyze was a really welcome addition to hang diagnosis. The hanganalyze basically walked through the blocker-waiter chains, found which session was the LEAF in the chain (the LEAF is the ultimate/final blocker in the wait chain), which everybody either directly or indirectly was waiting for.
Additionally the hanganalyze output allows you to draw the topology of the hang too, which may be useful in complex hang scenarios. In fact Enterprise Manager’s hang topology page also just calls ORADEBUG HANGANALYZE internally and formats the output into a visual hang graph (this is why the hang page requires OS credentials or sysdba privileges).
Starting from 11g, there is a welcome change though – there’s a new V$WAIT_CHAINS view, which lists all the hangs it has detected, it’s RAC-aware and is accessible via plain SQL. No need to run oradebug anymore, assuming that you are able to log on to query that V$ view :-)
And this is where we have a little problem – in rare circumstances an instance can get hung so that even a SYSDBA privilege holder can not log on, so how would you run the diagnostic queries & commands? This is why Oracle has introduced the preliminary connection option in sqlplus for diagnosing hung databases. With a prelim connection you will have a server process started for you, it attaches to the SGA shared memory segments, but it skips the last step of setting up that session/process-related structures in SGA. These operations on shared SGA structures have to be protected by latches/mutexes and would potentially get blocked if the related latches/mutexes are already held by someone else part of that hang.
I have already written about How to log on to Oracle when even sysdba cannot do so, but here’s an example:
$ sqlplus -prelim "/as sysdba" SQL*Plus: Release 188.8.131.52.0 Production on Sun Jan 29 12:48:03 2012 Copyright (c) 1982, 2011, Oracle. All rights reserved. SYS:dbm1> ORADEBUG HANGANALYZE 3 Statement processed.
The “problem” is though that prelim connections only allow you to run ORADEBUG commands as running regular SQL do require that all the session/process structures are set up properly, plus that the parsing, library cache access etc would get latches/mutexes which could again block.
But we just wanted to dump hang analysis and ORADEBUG HANGANALYZE above seems to have succeeded. However when looking into the resulting tracefile, we see this:
*** 2012-01-29 12:48:11.041 Processing Oradebug command 'HANGANALYZE 3' =============================================================================== HANG ANALYSIS: ERROR: Can not perform hang analysis dump without a process state object and a session state object. ( process=(nil), sess=(nil) ) ===============================================================================
It looks like hanganalyze isn’t able to work anymore without a fully logged on session where the process & session state objects (in v$process and v$session respectively) have been set up. This is a bit weird as one of the benefits of hanganalyze (and systemstate dumps) was that it did not require allocating any SGA structures nor take any latches & locks.
This behavior change seems to have appeared since Oracle 184.108.40.206 and it’s documented in MOS note 452358.1, but that note doesn’t offer a good alternative for diagnosing hangs in such situations.
The alternative is actually simple – you should attach to an existing fully-initialized & logged on process with oradebug and let that process do the hanganalyze for you. Whatever process you attach to with oradebug is the process which will be performing the work for you.
So if you ever see the error above when diagnosing extreme hangs, then just identify some least critical process logged in to the database (not a background process ideally and definitely not critical bg processes such DBWR & LGWR etc) and attach to that with oradebug (as opposed to “oradebug setmypid”) and the hang analysis should work ok.
However, the catch is that the above ORADEBUG HANGANALYZE command is a shortcut, which always tries to run hanganalyze in your (preliminary) connection, by calling the hanganalyze function directly in the process. So you have to use ORADEBUG DUMP HANGANALYZE 3 instead (the 3 is the detail level where I usually start from) as this syntax will send the “dump hanganalyze” command to the target (fully-initialized) process where you’ve attached to.
Update: I added this to clarify which commands should you use when you hit this problem:
Usually I'm pushing information out - now I need to pull some in.
We're evaluating the usage of XMLDB in the database community. We would like to know what components of XMLDB are most frequently used. Please let us know which of the following you use:
Any background on how you use the capability will be of interest. Just leave a comment and we'll definitely be reading them all.
Thanks for taking the time to respond!
In 1990, when Ken Jacobs hosted the RDBMS campground talks at the Anaheim International Oracle User Week appreciation event, one of the topic areas was whether we (some users representing the Very Large DataBases VLDB of the Oracle world which meant anything north of about 7 GB back then) thought that the rule based optimizer (RBO) was good enough, or whether we needed a cost based optimizer (CBO) for the real applications we were running at enterprise scale to work well. “Oracle’s optimizer is like Mary Poppins. It’s practically perfect in every way. But we do have some cases where it would be helpful for the optimizer to consider the relative sizes of tables and whether a table was local or remote when the plan for joining and filtering is constructed. And some of the tie breakers for which table should go first and which index should be used when two indexes with similar characteristics are present can change, so we would like to be able to specify them for important plans.” Fast forward 22 years. Now there is a bumpersticker slogan contest for Oracle’s CBO. To enter you fire off a tweet with the hashtag #optimizerbumperstickers. Now I’ve always thought that the goal of the CBO is good plans, not perfect stats (or even perfect plans). So in bumper sticker size, my entry is “The goal is good plans!”