Search

OakieTags

Who's online

There are currently 0 users and 36 guests online.

Recent comments

Affiliations

February 2010

How to CANCEL a query running in another session?

Here’s a treat for Oracle geeks, hackers and maniacs out there…

Update: As the beginning says, this article was meant as something interesting about Oracle’s internals and CTRL+C / OCICancel() handling. There’s a more practical way for canceling session calls if you are running Oracle Enterprise Edition and are currently using resource manager:

You can set the consumer group for a session to CANCEL_SQL to cancel its current call:

DBMS_RESOURCE_MANAGER.SWITCH_CONSUMER_GROUP_FOR_SESS (
session_id IN NUMBER,
session_serial IN NUMBER,
consumer_group IN VARCHAR2);

Thanks to commenter “null” for this info. Note that I haven’t tested how/whether this feature works correctly so there’s homework for you ;-)

I recently received a question about how to cancel queries running in another Oracle session, so that the session would not be killed, but would remain alive.

Well, there’s no supported way I can tell you, but thanks to how Oracle handles out-of-band breaks on Unix platforms, you can cancel database calls using an OS tool – kill.

Before we go on, here’s how query cancellation (pressing CTRL+C in sqlplus for example) works in Oracle:

RMOUG 2010: My presentations

It is very disappointing to me that I had to cancel my trip to RMOUG training days. I am sick and was not able to catch the flight due to that.

But, I can always share my presentations here. I had two presentations planned in this training day and can be accessed as below:

Advanced RAC troubleshooting
Riyaj_Advanced_rac_troubleshooting_RMOUG_2010_doc
Riyaj_Advanced_rac_troubleshooting_RMOUG_2010_ppt

Why optimizer hates my sql
Riyaj_Why_optimizer_hates_my_sql_2010

RMOUG training days audience: Please accept my sincere apologies.

A Free afternoon seminar in Singapore (24th Feb)

If you are in Singapore and have 24th Feb afternoon available then you can register and join a free Oracle performance troubleshooting seminar I’m doing in Singapore Management University’s (SMU) campus.

The seminar will be about:

  • Systematic Oracle Performance Troubleshooting
  • Identifying performance troublemakers
  • Understanding execution plans

The date is Wednesday, 24th Feb

The seminar time is from 15:30-19:00 (don’t be late)

Registration and more details are here:

Share/Bookmark

The Oracle Wait Interface Is Useless (sometimes) – part 3b

Welcome back for the concluding part of this series of blogs.
In the last part I went through a brief primer about code execution and stack frames in preparation for this posting. I hope that wasn’t too boring for everyone – I think it’s really important to make sure all the preliminary knowledge is covered and not make too many assumptions about the reader. In this part I will finally get to the point, and talk about some alternative techniques for determining the reasons for poor performance for our example user session.
To recap very quickly, we have a user session taking 100% CPU in top, nothing is emitted in SQL trace files, Oracle’s Time Model shows it is executing SQL, and that’s about all the information we have. As Tanel stated in his posting for Part Two, this is the stage where it makes good sense to dig into session-level statistics (not system-level statistics) to see if any of the counters are incrementing in a conspicuous way. If you don’t already have a copy of Tanel’s Snapper code, I suggest you grab it and get familiar with it – it’s a nice implementation of an essential tool that many DBAs already have some kind of script for.

Tanel found in Part Two that the SQL execution was revisiting the same small number of buffers very frequently, indicating that there might be a problem with one of the Nested Loop join methods used by the optimiser. Perhaps there is something wrong with that part of the plan (or with the whole plan)? Using counters can often reveal a potential problem this way, but they don’t necessarily  put a neon sign reading “Here’s Where All The Time Went” on top of the offending item. This is where we might want to dig further using some of our knowledge of code execution and stack traces, in combination with a dynamic tracing infrastructure such as Dtrace or Systemtap.

The power of Dtrace and friends lies in a few key areas:

  1. It can report time-based information
  2. It can show user and kernel information
  3. It is dynamic – unlike gprof and other code profilers, it is not necessary to run everything with a profiled binary

These are really key features which, when combined, allow for tremendously powerful on-demand tracing.

Consider the following sample output from a very small Dtrace script to look into what happens when a single printf() library call is made from a C program:

CPU FUNCTION
  0  -> my_function                            USER CODE
    (57 lines of dynamic library loading omitted)
  0    -> printf                               USER CODE
  0      -> _setorientation                    USER CODE
  0      <- _setorientation                    USER CODE
  0      -> _ndoprnt                           USER CODE
  0        -> _findbuf                         USER CODE
  0          -> isatty                         USER CODE
  0            -> ___errno                     USER CODE
  0            <- ___errno                     USER CODE
  0            -> ioctl                        USER CODE
  0              -> syscall_mstate             KERN
  0                -> gethrtime_unscaled       KERN
  0                  -> tsc_gethrtimeunscaled_delta  KERN
  0                    -> tsc_gethrtimeunscaled  KERN
  0                    <- tsc_gethrtimeunscaled  KERN
  0                  <- tsc_gethrtimeunscaled_delta  KERN
    (180 more lines of kernel functions for ioctl call omitted)
 0            <= ioctl                         SYSCALL: ioctl 1743296 ns
    (34 lines of kernel function (probably DTrace intrumentation!) omitted)
 0        <- ioctl                             USER CODE
 0  <= fstat64                                 SYSCALL: fstat64 352699 ns
    (~300 lines of kernel  and dynamic library loading omitted)
 0    -> _write                                USER CODE
    (~200 lines of kernel  and dynamic library loading omitted)
 0  <= write                                   SYSCALL: write 2154427 ns
    (123 lines of kernel code omitted)
 0  <- printf                                  USER CODE

In the above output, there are around 1000 functions called in User and Kernel domains. The output gives you some idea of the level of detail possible with Dtrace; we can see all of the following attributes:

  • All User mode function calls are logged
  • All Kernel mode function calls are logged
  • All system calls are timed

Note: The printf() itself is very much a user-mode call,part of the standard C library (libc). It actually makes a couple of system calls under the bonnet: Initially an ioctl() and eventually the write(2) system call.

Let’s look at something else – time spent in the Virtual Memory subsystem of the kernel doing Page Ins (page faults). Again, this is a very simple script, and produces output similar to the following:

WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 90 ms
WAIT: PAGE IN: 836 ms
WAIT: PAGE IN: 1098 ms
WAIT: PAGE IN: 853 ms
WAIT: PAGE IN: 816 ms
FINAL WAIT: PAGE IN: 263 ms

So now we can quantify how much time is spent servicing page faults, in this case showing approximately how much per second. Here we have very specific information about how much time our particular session is spending servicing page faults. Don’t assume that soft page faults are cheap/fast, either – with increasing virtualisation trends, soft page faults as observed by the guest operating system could actually be hard page faults in the underlying hypervisor.

These are just a few of the things that can be monitored in Dtrace. Now let’s combine the Oracle stack tracing that we introduced in part 3a.  In that posting I concluded that we could dump the stack trace once in a while, observe the name of the current function, and determine where we are spending the most time. With Dtrace we can make this much more concrete, and can quantitatively report the time spent in each function, and combine other contributing factors from the Operating System! I have put together a Dtrace script to demonstrate this concept. I won’t post the script here (yet) because it is still very much in development and still has a number of deficiencies. What I’m trying to get to here is a really quick way to see exactly what an Oracle session is doing by only looking at the running process using Dtrace. The script currently has the following features:

  • Shows all OWI wait events and durations
  • Shows all system calls and durations
  • Shows timing stats every second for: OWI wait times, page faults, latch acquisition, and a selection of common rowsource fetch operators

The rowsource operators are taken from tracing the function calls to the various rowsource operators defined in Metalink note 175982.1. Specifically, the script currently traces calls to the following rowsource operators:

  • GS – Group By Sort
  • JO – Join (Nested Loop)
  • UA – Union All
  • VW – View
  • HJ – Hash Join
  • TB – Table
  • JOT – Unknown. Possible relation to Join, observed frequently during stack tracing
  • SO – Sort
  • IXT – Index
  • PF – Prefetch

Running the same misbehaving SQL as Tanel in his last posting, we get the following output (snippet only):

LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 464:JO 0:UA 0:VW 0:HJ 0:TB 21:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=1 rsrc=(0:GS 430:JO 0:UA 0:VW 0:HJ 0:TB 9:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 471:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 467:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
SYSCALL: times 7104 ns
SYSCALL: times 1749 ns
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 22:JOT 0:SO)

Remember that this SQL example precisely fits our profile where the session is using 100% usermode CPU and emitting no wait information, so we cannot see any OWI waits detailed here – just summaries of where the time was spent every second and a couple of system calls. We also have no page faults taking place on this test, so no time is accounted for there. The big user of CPU (468ms per second) is the nested loop join rowsource operator, which agrees with Tanel’s findings using session-statistics. We are also spending a little time acquiring latches, but it’s not very significant. There are clearly some rowsource functions missing, given that we can only account for half the CPU time, so I did a bit more function tracing to find the names of the missing functions, and added them to the script. The output now looks as follows (scroll right to see the new functions):

LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 461:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 163:IXT 759:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 691:JO 0:UA 0:VW 0:HJ 0:TB 455:JOT 0:SO 304:IXT 843:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 460:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 162:IXT 759:PF)
SYSCALL: times 7810 ns
SYSCALL: times 1750 ns
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 686:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 301:IXT 842:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 436:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 177:IXT 721:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 708:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 286:IXT 880:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 357:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 228:IXT 588:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 788:JO 0:UA 0:VW 0:HJ 0:TB 445:JOT 0:SO 235:IXT 1013:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 277:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 276:IXT 458:PF)

Looks like we’ve found the missing time, and then some (the timers add up to more than 1000ms each second). The reason for this timing exaggeration is that there is a good deal of double accounting going on with the current version of the script – child function times are also accounted for in the parent. This is a known bug, hopefully I’ll get a chance to fix that one soon.

Well that’s about it for now. I hope that this journey has been an interesting one, I know it’s been a lot of reading to take in all the parts. We started with a session that was emitting no wait information, and now we have a simple Dtrace script that is easy to execute with no impact to other sessions, and easy to stop once the underlying cause is determined. The query does not need to be stopped and restarted, and no additional overhead is imposed from enabling additional Oracle diagnostics such as setting statistics_level=all. I intend to expand on this approach over time, with a particular focus on:

  • Multiplatform support – Systemtap version is required
  • Better design to support more function tracing
  • Fix double counting issue
  • Add identifiers for optimiser plan line number rather than rolling up by function name

How Does An Execution Plan Suddenly Change When The Statistics (And Everything Else) Remains The Same ? (In Limbo)

I’ve slipped this post in as there have been a number of discussions recently on how execution plans have changed while nothing else appears to have changed in the database. How can an execution plan suddenly change when no one has made any changes to the database ?   By no changes, it means that there [...]

New versions of LatchProf and LatchProfX for latch contention troubleshooting and tuning

The LatchProf and LatchProfX scripts allow you to be more systematic with latch contention troubleshooting and tuning. No more guesswork is needed as these scripts give you exact session IDs and in this version also SQLIDs of the troublemaking applications.

You can download the new versions here:

Example output (with SQLID info) is below:

Oracle Troubleshooting: How to read Oracle ERRORSTACK output?!

I have written the first article to the troubleshooting section of my new website tech.E2SN.com:

It’s about a very valuable Oracle troubleshooting tool -> ERRORSTACK trace.

I cover 4 frequently asked questions there:

  1. Reading the current executing SQL statement text from errorstack trace
  2. Reading the current executing PL/SQL package and PL/SQL source code line number from errorstack trace
  3. Reading the current bind variable values from errostack trace
  4. Identifying how much private (UGA) memory a cursor is using

You can read it here:

By the way, if you like my new website, feel free to link to it !!! ;-)

Share/Bookmark

Future appearances, conferences and seminars

Just to let you know where I’ll be in the near future :)

Seminars

Advanced Oracle Troubleshooting 2.0

I have rearranged and adjusted the material so it flows smoother, has even more practical tools and scripts and describes some internals even deeper ;-)

These dates are close, last chance to register ;-) Note that after these I won’t be doing an Advanced Oracle Troubleshooting class in US for a while…

Conferences

Hotsos Symposium, Dallas, TX, 8-11 March

I will deliver 2 presentations at the Hotsos Symposium 2010

  • Back to Basics: Choosing the Starting Point of Performance Tuning and Troubleshooting Wisely
  • Understanding LGWR, log file sync Waits and Commit Performance

Also, I will be the Training Day speaker after the conference (I feel quite honored about this btw ;)

  • Scripts and Tools for Oracle Troubleshooting and Advanced Performance Analysis

Note that I will show some very cool (and of course, useful) stuff at the Training Day! Some things are gonna be so cool that I almost want to sit in the audience myself! One of the things is called MOATS – the Mother Of All Tuning Scripts for example ;-)

Using Process Memory Matrix script for calculating Oracle process memory usage on Solaris

I just published a new script and article about calculating the real Oracle process memory usage on Solaris.

The problem with V$PROCESS* views (and the V$SESSTAT) is that they will tell you what Oracle thinks it’s using, not the real amount of memory used. There will be discrepancies due how memory is actually allocated in OS, libraries out of Oracle’s control, the static memory areas inside Oracle binary and of course bugs.

I was working on one of such problems and decided to make my life easier by writing the script. It’s not so much about calculating the exact figures (they will never be 100% correct), but more about presenting the memory usage data in a better and more convenient fashion.

The script is called procmm and stands for Process Memory Matrix as it shows the memory usage in a matrix grid.

Here’s an example output to show what I’m talking about:

Oracle Flash Cache on Linux with Exadata

cool stuff:
Guy Harrison explains how to set up flash caching on Linux for Oracle without Exatata
Late he give some analysis on performance of no flash, flash cache and flash tablespae

http://guyharrison.squarespace.com/blog/2010/1/24/flash-tablespace-vs-db-flash-cache.html