Search

OakieTags

Who's online

There are currently 0 users and 26 guests online.

Recent comments

Affiliations

ASH

New Version Of XPLAN_ASH Tool - Video Tutorial

A new major release (version 3.0) of my XPLAN_ASH tool is available for download.

You can download the latest version here.

In addition to many changes to the way the information is presented and many other smaller changes to functionality there is one major new feature: XPLAN_ASH now also supports S-ASH, the free ASH implementation.

If you run XPLAN_ASH in a S-ASH repository owner schema, it will automatically detect that and adjust accordingly.

XPLAN_ASH was tested against the latest stable version of S-ASH (2.3). There are some minor changes required to that S-ASH release in order to function properly with XPLAN_ASH. Most of them will be included in the next S-ASH release as they really are only minor and don't influence the general S-ASH functionality at all.

New Version Of XPLAN_ASH Utility

A new version 2.0 of the XPLAN_ASH utility introduced here is available for download.You can download the latest version here.The change log tracks the following changes:- Access check- Conditional compilation for different database versions- Additional activity summary- Concurrent activity information (what is/was going on at the same time)- Experimental stuff: Additional I/O summary- More pretty printing- Experimental stuff: I/O added to Average Active Session Graph (renamed to Activity Timeline)- Top Execution Plan Lines and Top Activities added to Activity Timeline- Activity Timeline is now also shown for serial execution when TIMELINE option is specified- From 11.2.0.2 on: We get the ACTUAL DOP from the undocumented PX_FLAGS colu

DST in Russia

Daylight Saving Time in Russia has been changed last year. Oracle published a FAQ on the support site about this: Russia abandons DST in 2011 – Impact on Oracle RDBMS [ID 1335999.1].

RuOUG in Saint Petersburg

On February 10th I was in Saint Petersburg on a seminar organized by Russian Oracle User Group. Actually, it was mostly prepared by company called Devexperts. Seminar took place in their office, with most presentations done by their people.

Nice Additions For Troubleshooting

This is just a short note that Oracle has added several nice details to 11.2.0.1 and 11.2.0.2 respectively that can be helpful for troubleshooting.

ASH, PGA Memory And TEMP Consumption

Since 11.2.0.1 the V$ACTIVE_SESSION_HISTORY view (that requires Enterprise Edition plus Diagnostic License) contains the PGA_ALLOCATED and TEMP_SPACE_ALLOCATED columns.

In particular the latter closes an instrumentation gap that always bothered me in the past: So far it wasn't easy to answer the question which session used to allocate TEMP space in the past. Of course it is easy to answer while the TEMP allocation was still held by a session by looking at the corresponding V$ views like V$SORT_USAGE, but once the allocation was released answering questions like why was my TEMP space exhausted three hours ago was something that couldn't be told by looking at the information provided by Oracle.

Visualizing Active Session History (ASH) Data With R

One of the easiest ways to understand something is to see a visualization. Looking at Active Session History (ASH) data is no exception and I’ll dive into how to do so with R and how I used R plots to visually present a problem and confirm a hypothesis. But first some background…

Background

Frequently DBAs use the Automatic Workload Repository (AWR) as an entry point for troubleshooting performance problems and in this case the adventure started the same way. In the AWR report Top 5 Timed Foreground Events, the log file sync event was showing up as the #3 event. This needed deeper investigation as often times the cause for longer log file sync times is related to longer log file parallel write times.

PeopleTools 8.50 uses DBMS_APPLICATION_INFO to Identify Database Sessions

I recently worked on a PeopleTools 8.50 system in production for the first time and was able to make use of the new Oracle specific instrumentation in PeopleTools.

PeopleTools now uses the DBMS_APPLICATION_INFO package to set module and action session attributes.  This data is then copied into the Active Session History (ASH).

PeopleTools 8.50 uses DBMS_APPLICATION_INFO to Identify Database Sessions

I recently worked on a PeopleTools 8.50 system in production for the first time and was able to make use of the new Oracle specific instrumentation in PeopleTools.

PeopleTools now uses the DBMS_APPLICATION_INFO package to set module and action session attributes.  This data is then copied into the Active Session History (ASH).

Network Events in ASH

Note - using ASH and the Top Activity screen require the use of the Diagnostics Pack License.

This post was prompted by yet another performance problem identified using
pretty pictures and Active Session History data. Although, as you'll see, some
pretty old-fashioned tools played their part too!

ASH entries only exist for certain SQL*Net events. As usual, I think the
design is very successful as long as you have a basic understanding of how ASH
works.

The events all fall into one of three wait classes - Application, Network
and Idle.

SQL> select wait_class, name from v$event_name where name like 'SQL*Net%' order by 1, 2;

WAIT_CLASS           NAME
-------------------- ----------------------------------------
Application          SQL*Net break/reset to client
Application          SQL*Net break/reset to dblink
Idle                 SQL*Net message from client
Idle                 SQL*Net vector message from client
Idle                 SQL*Net vector message from dblink
Network              SQL*Net message from dblink
Network              SQL*Net message to client
Network              SQL*Net message to dblink
Network              SQL*Net more data from client
Network              SQL*Net more data from dblink
Network              SQL*Net more data to client
Network              SQL*Net more data to dblink
Network              SQL*Net vector data from client
Network              SQL*Net vector data from dblink
Network              SQL*Net vector data to client
Network              SQL*Net vector data to dblink

16 rows selected.

Hopefully that immediately dismisses the suggestion I've heard a few times that ASH/Top Activity ignores SQL*Net events. However, there are some events that are classed as Idle here that will not be captured in ASH data and will therefore not appear in the Top Activity screen. (In fact, even if you were to set the _ash_sample_all parameter you would not see these events in the Top Activity screen, even though ASH would contain entries for them. There is an additional filter applied to the data before it's displayed. In any case, I would *not* recommend setting _ash_sample_all except for fun.)

The most contentious of these is probably
SQL*Net message from client. This event indicates that the Oracle server is
waiting for the client to  do something. If the client is a user session, Oracle
is waiting for the user to do something - perhaps someone has a sql*plus session open that they're doing nothing with or they are filling in an application screen with data? From an Oracle perspective that session
is Idle and there isn't much we can do to tune it, including the waits would heavily skew the data we're using for our analysis and, because events in the Idle
class are specifically excluded from both ASH data and the OEM performance
pages,  such activity won't appear. That seems sensible to me.

However, if the client is actually an application server,
those supposedly idle events can tell us something about end-to-end system
performance. This blog post describes one real life situation where that event
helped me identify the performance problem.

Back to the problem at hand. Sadly, in the midst of a a very chaotic work situation, I didn't manage to grab any of the graphs in question but we had a persistent problem with our critical batch processes spending time waiting on 'SQL*Net more data from client', which showed up very clearly as network class waits.

Now there are a number of reasons why this might be happening. Just a couple off the top of my head

1) There is a performance problem with the application server so Oracle is expecting more data to arrive more quickly than the app server can deliver it.

2) A network problem is affecting the delivery speed.

Although the issue here turned out to be number 2, I suppose the true underlying issue is that this application is insanely 'chatty' and uses very frequent round trips to deliver data. Maybe the application design could be improved first?

Ultimately we enabled Extended Tracing on the relevant sessions and confirmed that, yes, the sessions were spending most of their time waiting for the next batch of data and also how long these waits were. In practise, I'm not sure how much value the trace files added in this particular case. In fact, I think this is an excellent example of the underlying beauty of the sampling approach. As an event became *slower* it appeared *more often* in the samples and we had plenty of examples of how long some of the waits were. (But only some - probably the longest ones.)

How did we solve this mysterious problem? With two very useful tools

- ping/tracert. I simply went on to the app server, pinged the db server and confirmed that the response time (from my memory) was in the range of 70-80 milliseconds - far higher than I would expect. One of the Java developers then ran a trace route to the server which highlighted that network performance was great until we hit the db server. It turned out that one of the network interface cards had auto-negotiated down to 100M/s and this was the root cause - easily fixed. Sometimes old-fashioned and simple tools are all that you need, once you know where you should focus your attention (and ASH is great at focussing attention in the right place).

- The next tool is one of the reasons I wanted to write this post. What made me try ping? Whilst I would have got there eventually, I thought I would try my performance optimisation secret weapon! One just for the database wizards! Something I've been keeping an eye on is Oracle Support's attempt to give people a methodical approach to solving performance issues via the Oracle Performance Diagnostic Guide on My Oracle Support - Note ID 390374.1. There are several guides on that landing page but here is the link to the Slow Performance PDF. (You'll need an account to be able to access this.) This is an example of the kind of advice there. (Note that this is for SQL*Net message from client, but you would tend to get most events together anyway and there's similar advice for more data from client.)

"Cause Identified: Slow network limiting the response time between client and database.
The network is saturated and this is limiting the ability of the client and database to communicate with each other.
Cause Justification
TKProf:
1. SQL*Net message from client waits are a large part of the overall time (see the overall summary
section)
2. Array operations are used. This is seen when there are more than 5 rows per execution on average (divide total rows by total execution calls for both recursive and non-recursive calls)
3. The average time for a ping is about equal to twice the average time for a SQL*Net message from client wait and this time is more than a few milliseconds. This indicates that most of the client time is spent in the network."

I've been loathe to mention this document until now because it looks like a work in progress (albeit not updated since the start of 2009), isn't perfect, has gaps and I can imagine a number of my peers taking issue with some of the content. But for something to aid learning if you're not naturally great at solving performance problems, I can think of much worse starting places.

In the end, what could have been a really tricky performance problem (and might have taken a while to notice) was apparent with a brief glance at OEM and we could then follow up by applying the right analysis tools to the problem.

To finish off, when I searched Google for any examples of 'SQL*Net more data from client', one of the first results that cropped up was a chapter from one of my favourite books - 'Optimising Oracle Performance' by Cary Millsap and Jeff Holt. Well worth a read ....

Alternative Pictures Demo

Note - features in this post require the Diagnostics Pack license

Not long after I'd finished the last post, I realised I could reinforce the points I was making with a quick post showing another one of the example tests supplied with Swingbench - the Calling Circle (CC) application. Like the Sales Order Entry application, CC is a mixed read/write test consisting of small transactions. As always, there's more information at Dominic's website.

One of the main differences to the SOE test is that the CC test consumes data so you need to generate a new set of data before each test using the supplied ccwizard utility. I won't show you the entire workflow here but enough to give you a flavour of the process. The utility is the same one used to create the necessary CC schema in the first place but the option I'm looking for here is "Generate Data for Benchmark Run".

I'd already decided that my CC schema is populated with data for 1 million customers when I created it so I just need to specify the number of transactions the next test will be for. I happen to know that on my particular configuration, a 1000 transaction test will take around 5 minutes to run.

I ran the test twice. I've highlighted the first run here in the Top Activity page.

It should be clear that CC suffers significant log file sync waits on my particular test platform, just like the SOE test. Therefore I'll regenerate the test data set, enable asynchronous commits and re-run the test. Here I've highlighted the second test run.

As well as seeing a similar change in the activity profile according to the ASH samples (the log file sync activity has disappeared as has the LGWR System I/O), there's a significant difference to the SOE test. Because this test run is based on a specific workload volume, as defined by the size of the test data, rather than a fixed time period, the second test run completed more quickly than the first run. The activity only fills the 5 minute activity bar partially, rather than the first test which filled the whole bar.

If you test a specific and limited workload volume it is much clearer from the Top Activity page which test is processing transactions more quickly, based on the Time axis. That's why I didn't pick this example the first time - it's too obvious what's going on!