Oakies Blog Aggregator

Webinar Followup

Thanks everyone who attended my recent webinar at AllThingsOracle.com.

The link to the webinar recording can be found here.

The presentation PDF can be downloaded here. Note that this site uses a non-default HTTP port, so if you're behind a firewall this might be blocked.

Thanks again to AllThingsOracle.com and Amy Burrows for hosting the event.

Scrutinizing Exadata X5 Datasheet IOPS Claims…and Correcting Mistakes

I want to make these two points right out of the gate:

  1. I do not question Oracle’s IOPS claims in Exadata datasheets
  2. Everyone makes mistakes

Everyone Makes Mistakes

Like me. On January 21, 2015, Oracle announced the X5 generation of Exadata. I spent some time studying the datasheets from this product family and also compared the information to prior generations of Exadata, namely the X3 and X4. Yesterday I graphed some of the datasheet numbers from these Exadata products and tweeted the graphs. I'm sorry to report that two of the graphs were faulty–the result of hasty cut and paste. This post will clear up the mistakes, but I owe an apology to Oracle for incorrectly graphing their datasheet information. Everyone makes mistakes. I fess up when I do. I am posting the fixed slides but will link to the deprecated slides at the end of this post.

We’re Only Human

Wouldn’t IT be a more enjoyable industry if certain IT vendors stepped up and admitted when they’ve made little, tiny mistakes like the one I’m blogging about here? In fact, wouldn’t it be wonderful if some of the exceedingly gruesome mistakes certain IT vendors make would result in a little soul-searching and confession? Yes. It would be really nice! But it’ll never happen–well, not for certain IT companies anyway. Enough of that. I’ll move on to the meat of this post. The rest of this article covers:

  • Three Generations of Exadata IOPS Capability
  • Exadata IOPS Per Host CPU
  • Exadata IOPS Per Flash SSD
  • IOPS Per Exadata Storage Server License Cost

Three Generations of Exadata IOPS Capability

The following chart shows how Oracle has evolved Exadata from the X3 to the X5 EF model with regard to IOPS capability. As per Oracle's datasheets on the matter these are, of course, SQL-driven IOPS. Oracle would likely show you this chart and nothing else. Why? Because it shows favorable generational progress in IOPS capability. A quick glance shows that read IOPS improved just shy of 3x and write IOPS capability improved over 4x from the X3 to X5 product releases. These are good numbers. I should point out that the X3 and X4 numbers are the datasheet citations for 100% cached data in Exadata Smart Flash Cache. These models had 4 Exadata Smart Flash Cache PCIe cards in each storage server (aka, cell). The X5 numbers I'm focused on reflect the performance of the all-new Extreme Flash (EF) X5 model. It seems Oracle has started to investigate the value of all-flash technology and, indeed, the X5 EF is the top-dog in the Exadata line-up. For this reason I chose to graph X5 EF data as opposed to the more pedestrian High Capacity model, which has 12 4TB SATA drives fronted with PCI Flash cards (4 per storage server).

exadata-evolution-iops-gold-1

The tweets I hastily posted yesterday with the faulty data points aimed to normalize these performance numbers to important factors such as host CPU, SSD count and Exadata Storage Server Software licensing costs. The following set of charts is the error-free version of the tweeted charts.

Exadata IOPS Per Host CPU

Oracle's IOPS performance citations are based on SQL-driven workloads. This can be seen in every Exadata datasheet. All Exadata datasheets for generations prior to X4 clearly stated that Exadata IOPS are limited by host CPU. Indeed, anyone who studies Oracle Database with SLOB knows how all of that works. SQL-driven IOPS requires host CPU. Sadly, however, Oracle ceased stating the fact that IOPS are host-CPU bound in Exadata as of the advent of the X4 product family. I presume Oracle stopped correctly stating the factual correlation between host CPU and SQL-driven IOPS for only the most honorable of reasons with the best of customers' intentions in mind. In case anyone should doubt my assertion that Oracle historically associated Exadata IOPS limitations with host CPU, I submit the following screen shot of the pertinent section of the X3 datasheet:

X3-datasheet-truth

Now that the established relationship between SQL-driven IOPS and host CPU has been demystified, I'll offer the following chart which normalizes IOPS to host CPU core count:

exadata-evolution-iops-per-core-gold

I think the data speaks for itself but I'll add some commentary. Where Exadata is concerned, Oracle gives customers no choice of host CPU. If you adopt Exadata you will be forced to take the top-bin Xeon SKU with the most cores offered in the respective Intel CPU family. For example, the X3 product used 8-core Sandy Bridge Xeons, the X4 used 12-core Ivy Bridge Xeons and the X5 uses 18-core Haswell Xeons. In each of these CPU families there are other processors of varying core counts at the same TDP. For example, the Exadata X5 processor is the E5-2699v3, which is a 145w 18-core part. In the same line of Xeons there is also a 145w 14-core part (E5-2697v3), but that is not an option for Exadata customers.

All of this is important since Oracle customers must license Oracle Database software by the host CPU core. The chart shows us that read IOPS per core improved 18% from X3 to X4, but from X4 to X5 we see only a 3.6% increase. The chart also shows that write IOPS/core peaked at X4 and has actually dropped some 9% in the X5 product. These important trends suggest Oracle's balance between storage plumbing and I/O bandwidth in the Storage Servers is not keeping up with the rate at which Intel is packing cores into the Xeon EP family of CPUs. The nugget of truth that is missing here is whether the 145w 14-core E5-2697v3 might in fact be able to improve this IOPS/core ratio. While such information would be quite beneficial to Exadata-minded customers, the 22% reduction in expensive Oracle Database software licensing in such an 18-core versus 14-core scenario is not beneficial to Oracle–especially not while Oracle is struggling to subsidize its languishing hardware business with gains from traditional software.
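To spell out the arithmetic behind that 22% figure (it follows purely from the two core counts quoted above):

\[
\frac{18 - 14}{18} \approx 0.22
\]

i.e. roughly 22% fewer cores to license per socket if the 14-core part were an option.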

Exadata IOPS Per Flash SSD

Oracle uses their own branded Flash cards in all of the X3 through X5 products. While it may seem like an implementation detail, some technicians consider it important to scrutinize how well Oracle leverages their own components in their Engineered Systems. In fact, some customers expect that adding significant amounts of important performance components, like Flash cards, should pay commensurate dividends. So, before you let your eyes drift to the following graph, please be reminded that the X3 and X4 products came with 4 Gen3 PCI Flash cards per Exadata Storage Server whereas the X5 is fitted with 8 NVMe flash cards. And now, feel free to take a gander at how well Exadata architecture leverages a 100% increase in Flash componentry:

exadata-evolution-iops-per-SSD-gold

This chart helps us visualize the facts sort of hidden in the datasheet information. From Exadata X3 to Exadata X4 Oracle improved IOPS per Flash device by just shy of 100% for both read and write IOPS. On the other hand, Exadata X5 exhibits nearly flat (5%) write IOPS and a troubling drop in read IOPS per SSD device of 22%. Now, all I can do is share the facts. I cannot change people's belief system–this I know. That said, I can't imagine how anyone can spin a per-SSD drop of 22%–especially considering the NVMe SSD product is so significantly faster than the X4 PCIe Flash card. By significant I mean the NVMe SSD used in the X5 model is rated at 260,000 random 8KB IOPS whereas the X4 PCIe Flash card was only rated at 160,000 8KB read IOPS. So X5 has double the SSDs–each of which is rated at 63% more IOPS capacity–than the X4, yet IOPS per SSD dropped 22% from the X4 to the X5. That means an architectural imbalance–somewhere. However, since Exadata is a completely closed system, you are on your own to find out why doubling resources doesn't double your performance. All of that might sound like taking shots at implementation details. If that seems like the case then the next section of this article might be of interest.

IOPS Per Exadata Storage Server License Cost

As I wrote earlier in this article, both Exadata X3 and Exadata X4 used PCIe Flash cards for accelerating IOPS. Each X3 and X4 Exadata Storage Server came with 12 hard disk drives and 4 PCIe Flash cards. Oracle licenses Exadata Storage Server Software by the hard drive in X3/X4 and by the NVMe SSD in the X5 EF model. To that end the license "basis" is 12 units for X3/X4 and 8 for X5. Already readers are breathing a sigh of relief because fewer license units must surely mean a lower total license cost. Surely Not! Exadata X3 and X4 list price for Exadata Storage Server Software was $10,000 per disk drive for an extended price of $120,000 per storage server. The X5 EF model, on the other hand, prices Exadata Storage Server Software at $20,000 per NVMe SSD for an extended price of $160,000 per Exadata Storage Server. With these values in mind, feel free to direct your attention to the following chart which graphs the IOPS per Exadata Storage Server Software list price (IOPS/license$$):

exadata-evolution-iops-per-license-cost-gold

The trend in the X3 to X4 timeframe was a doubling of write IOPS/license$$ and just short of a 100% improvement in read IOPS/license$$. In stark contrast, however, the X5 EF product delivers only a 57% increase in write IOPS/license$$ and a troubling, tiny, 17% increase in read IOPS/license$$. Remember, X5 has 100% more SSD componentry when compared to the X3 and X4 products.

Summary

No summary needed. At least I don’t think so.

About Those Faulty Tweeted Graphs

As promised, I've left links to the faulty graphs I tweeted:

  • Faulty / Deleted Tweet Graph of Exadata IOPS/SSD: http://wp.me/a21zc-1ek
  • Faulty / Deleted Tweet Graph of Exadata IOPS/license$$: http://wp.me/a21zc-1ej

References

  • Exadata X3-2 datasheet: http://www.oracle.com/technetwork/server-storage/engineered-systems/exadata/exadata-dbmachine-x3-2-ds-1855384.pdf
  • Exadata X4-2 datasheet: http://www.oracle.com/technetwork/database/exadata/exadata-dbmachine-x4-2-ds-2076448.pdf
  • Exadata X5-2 datasheet: http://www.oracle.com/technetwork/database/exadata/exadata-x5-2-ds-2406241.pdf
  • X4 SSD info: http://www.oracle.com/us/products/servers-storage/storage/flash-storage/f80/overview/index.html
  • X5 SSD info: http://docs.oracle.com/cd/E54943_01/html/E54944/gokdw.html#scrolltoc
  • Engineered Systems Price List: http://www.oracle.com/us/corporate/pricing/exadata-pricelist-070598.pdf, http://www.ogs.state.ny.us/purchase/prices/7600020944pl_oracle.pdf

Filed under: oracle

A new blog post on using Snap Clone on EMC storage with ASM

Just a quick note here to say I’ve added a new blog post to the official Enterprise Manager blog site. You can find it here.

Working With AWR Reports From EM12c

I’ve had a recent request to write an update on working with AWR reports, so as promised, here it is!

The Automatic Workload Repository

The Automatic Workload Repository (AWR) was one of the best enhancements to Oracle back in release 10g.  There was quite a goal put in front of the development group when they were asked to develop a product that:

1.  Provided significant performance recommendation and wait event data enhancements over its predecessor statspack.

2.  Was always on, meaning that the data would continually collect without manual intervention from the database administrator.

3.  Wouldn’t impact the current processing, having its own background processes and memory buffer, designated tablespace, (SYSAUX).

4.  The memory buffer would be written to in the opposite direction to the direction users read from it, eliminating concurrency issues.

Along with many other requirements, all of the above was offered with the Automatic Workload Repository and we end up with an architecture that looks something like this:

pt5

 

Using AWR Data

AWR data is identified by the DBID (database identifier) and a SNAP_ID (snapshot identifier, which has a begin_interval_time and end_interval_time to isolate the date and time of the data collection). Information about what is currently retained in the database can be queried from DBA_HIST_SNAPSHOT.  AWR data also contains ASH (Active Session History) samples along with the snapshot data; by default, about 1 out of every 10 samples.
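If you want to see which snapshots you have to work with before running a report, a quick query against that view does the trick. This is just a sketch using the documented DBA_HIST_SNAPSHOT columns:

-- List the snapshots currently retained, newest first, so you can
-- pick begin/end snapshot IDs for an AWR report.
SELECT snap_id,
       dbid,
       begin_interval_time,
       end_interval_time
FROM   dba_hist_snapshot
ORDER  BY snap_id DESC;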

The goal of using AWR data effectively really comes down to the following questions:

1.  Have you identified a true performance issue as part of a performance review?

2.  Has there been a user complaint or a request to investigate a performance degradation?

3.  Is there a business challenge or question that needs to be answered that AWR can offer an answer to?  (we'll go over when to use AWR vs. other features…)

Performance Review

A performance review is where you have either identified a problem or have been assigned to investigate the environment for performance issues to solve.  I have a couple of Enterprise Manager environments available to me, but I chose to go out to one in particular, crossing my fingers and hoping I would find some heavy processing to fit the requirements of this post.

The quickest way to see workload in your database environment from EM12c is to click on Targets –> Databases.  Choose to view by load map and you will then see databases by workload.  Upon going to a specific Enterprise Manager environment, I found out it was my lucky day!

pt1

I really don't know who Kurt is that has a database monitored on this EM12c cloud control environment, but boy, is he my favorite person today! :)

Hovering my cursor over the database name (kurt), you can view the workload he has running on his test database currently:

pt2

 

Boy, is Kurt my favorite person today!

EM12c Database Home Page

Logging into the database, I can see the significant IO and resource usage for the database and host from the database home page:

pt3

 

If we move to Top Activity (Performance menu, Top Activity), I begin to view more details about the processing and different wait events:

pt4

 

Kurt is doing all KINDS of inserts (seen by the different SQL_IDs, all of SQL Type "INSERT").  I can drill down into the individual statements and investigate, but really, there are a TON of statements and SQL_IDs here, so wouldn't it just be easier to view the workload with an AWR report?

Running the AWR Report

I chose to click on Performance, AWR, AWR Report.  Now I have a choice.  I could request a new snapshot to be performed immediately, or I could wait till the top of the hour, as the interval is set hourly in this database.  I chose the latter for this demonstration, but if you want to create a snapshot immediately, you can do this easily from EM12c, or request a snapshot by executing the following from SQL*Plus as a user with execute privileges on the DBMS_WORKLOAD_REPOSITORY package:

BEGIN
DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT ();
END;
/
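In this post the report itself is pulled from EM12c, but if you would rather stay at the SQL*Plus prompt, the classic route is the awrrpt.sql script under $ORACLE_HOME/rdbms/admin, or a query along these lines (the DBID, instance number and snapshot IDs are placeholders you would look up in DBA_HIST_SNAPSHOT):

-- Generate the AWR report as plain text between two snapshots.
SELECT output
FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_TEXT(:dbid, 1, :begin_snap, :end_snap));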

For this example I simply waited, as there was no hurry or concern here, and requested the report for the previous hour and latest snapshot:

pt6

 

I always start at the Top Ten Foreground Events and commonly look at those with high wait percentages:

pt7

 

Direct Path Write, that’s it.  Nothing else to see here… :)

Direct path write involves the following:  inserts/updates, objects being written to, tablespaces being written to and those datafiles that make up the tablespace(s).

It’s also IO, which we quickly verify down in the Foreground Wait Class:

pt8

 

Looking at the Top SQL by Elapsed Time confirms that we are dealing with a workload consisting of all inserts:

pt9

 

Clicking on the SQL ID takes me to the Complete List of SQL Text and shows me just what Bad Boy Kurt is doing to produce his testing workload:

pt10

Wow, that Kurt is quite the rebel, eh? :)

Insert in a loop into one table from the same table, rollback and then end the loop, thanks for playing. He's kicking some tires and doing it with angst!  Don't worry people, like I said, Kurt is doing his job, using a module called "Load Generator".  I'd be a fool not to recognize this for what it is: generating workload to test something.  I just get the added benefit of having a workload to do a blog post on using AWR data… :)

Now, if this were a real issue and I was trying to find out what kind of performance impact this type of insert was creating on the environment, where would I go next in the AWR report?  The top SQL by elapsed time is important, as it should be where you focus your efforts.  Other sections broken down by SQL are nice to have, but always remember, "If you aren't tuning for time, you're wasting time."  Nothing can come of an optimization exercise if no time savings is seen after you've completed the work.  So by taking first the Top SQL by Elapsed Time, then looking at the statement, we can now see what objects are part of the statement (large_block149, 191, 194 and 145).
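If you prefer to pull the same "top SQL by elapsed time" ranking straight out of the AWR tables rather than reading it off the report, a query against DBA_HIST_SQLSTAT along these lines works; the snapshot IDs are placeholders:

-- Rank SQL by elapsed time (in seconds) between two snapshots.
-- ELAPSED_TIME_DELTA is stored in microseconds.
SELECT sql_id,
       SUM(elapsed_time_delta)/1000000 AS elapsed_secs,
       SUM(executions_delta)           AS execs
FROM   dba_hist_sqlstat
WHERE  snap_id >  :begin_snap
AND    snap_id <= :end_snap
GROUP  BY sql_id
ORDER  BY elapsed_secs DESC;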

We also know that the problem is IO, so we should jump down from the SQL detailed information and go to the object level information.  These sections are identified by Segments by xxx.  

  • Segments by Logical Reads
  • Segments by Physical Reads
  • Segments by Read Requests
  • Segments by Table Scans

so on and so forth….

These all show a very similar pattern and percentage for the objects we see in our top SQL. Remember, Kurt was reading each of these tables, then inserting those same rows back into the table again, then rolling back.  As this is a workload scenario, unlike most performance issues I see, there is no outstanding object showing with an over 10% impact in any area.

pt11

As this is an Exadata, there is a ton of information to help you understand offloading (smart scans), flash cache, etc. that will assist in making sure you are achieving the performance you desire with an engineered system, but I'd like to save that for another post and just touch on a few of the IO reports here.  Since we were performing table scans, we do want to make sure those were being offloaded to the cell nodes (smart scans) rather than being performed on a database node.

We can start by looking at Top Database IO Throughput:

pt14

And then view the Top Database Requests Per Cell Throughput, (sans the Cell node names) to see how they compare:

pt12

 

Then we add this last graph, Top IO Reasons by Request, which tells us whether we are offloading enough of the workload to each of the cell nodes:

pt15

 

This has been one scenario trip through an AWR Report.  Hopefully it’s been interesting and next time I’ll search for a different load type to take you through!

Happy Friday!
Copyright © DBA Kevlar [Working With AWR Reports From EM12c], All Right Reserved. 2015.

Data Center of the Future – now

9087977014_a5a2d4bc11_k

 

photo by youflavio

In a recent blog post Eric D. Brown defined an Agile Data Center as

An Agile Data Center is one that allows organizations to efficiently and effectively add, remove and change services at the speed of the business, not the speed of technology - Eric D. Brown

In a follow-up post he said that an Agile Data Center could be implemented by a Software Defined Data Center (SDDC), for example using machine virtualization to spin environments up and down.

With SDDC, it is possible for companies to replace their data center's infrastructure with a virtualized environment and then deliver services and software as a service - Eric D. Brown

The question arises: what technologies constitute an agile SDDC? What software should be leveraged to succeed at having an agile data center, an SDDC? The most important software to look at is software that addresses the top constraints in the data center. As the theory of constraints says, any improvement not made at the constraint is an illusion. So what are the top constraints in the data center? The top constraint, as found after working with 100s of companies and surveying 1000s of companies, is provisioning environments for development and QA. Why is that? It's because almost every industry now is finding itself to be more and more a software industry, from stock trading to booksellers to taxi companies to hotels. The competitive advantage is more and more about the software used to provide and sell the service. To build that software requires development and QA, and thus development and QA environments. The hard part of an environment to provision is no longer the machine, thanks to machine virtualization. The hardest part of the environment to provision is the data. Data that represents the production system is required to develop applications that use, display and manage that data. Data is typically kept in large complex databases such as Oracle, SQL Server, Sybase, Postgres and DB2.

Provisioning development and QA environments that rely on databases can be an expensive, slow endeavor. But, like machine virtualization, there is a new technology, data virtualization, that instead of making full physical copies makes one copy of each unique data block on the source, including a stream of changed blocks. With this "time flow" of unique blocks from the source database, data virtualization can provide copies in minutes, not by actually making copies but by providing pointers back to the existing blocks. These existing blocks are read/writeable thanks to a technology called redirect-on-write, which saves modified blocks in a different location than the original. It all sounds a bit complex, but that's the beauty of data virtualization solutions: they take the complexity, wrap it up in an automated software stack and provide a simple interface and APIs to provision full developer environments, from the binaries, to the code files, to the most complex and difficult part of environment provisioning, which is provisioning full running copies of the data. Included in most data virtualization solutions is masking as well, since sensitive data is often required to be masked in development environments. Software defined data centers (SDDC) depend on machine virtualization and data virtualization.

What other technologies are also required?

Check out Software defined everything: Throwing more software into the mix

Restrict the query optimizer through OPTIMIZER_FEATURES_ENABLE and Top-Frequency/Hybrid Histograms

To make upgrades easier, I regularly see people considering disabling query optimizer features by setting the OPTIMIZER_FEATURES_ENABLE initialization parameter to a non-default value. My general opinion about this “habit” is summarized in TOP with the following two sentences:

Changing the default value of the OPTIMIZER_FEATURES_ENABLE initialization parameter is only a short-term workaround. Sooner or later the application should be adapted (optimized) for the new database version.

The issue is that not all new features are disabled by the OPTIMIZER_FEATURES_ENABLE initialization parameter. For example, if you set it to 10.2.0.5 in version 12.1.0.2, you will not get exactly the 10.2.0.5 query optimizer. A particular case to be considered in 12.1 is related to the new types of histograms (top-frequency and hybrid). In fact, even though you set the OPTIMIZER_FEATURES_ENABLE initialization parameter to 10.2.0.5, the DBMS_STATS package can produce top-frequency and hybrid histograms. As a result, the estimations of the query optimizer can change. An example is provided by this script.

Note that setting the COMPATIBLE initialization parameter to, for example, 11.0.0 with the 12.1.0.2 binaries also does not prevent the DBMS_STATS package from gathering top-frequency and hybrid histograms.

So, if you are really planning to upgrade to 12.1 and to change the default value of the OPTIMIZER_FEATURES_ENABLE initialization parameter, in my opinion you should make sure that the DBMS_STATS package does not create top-frequency and hybrid histograms. Unfortunately, with the METHOD_OPT parameter, it is not possible to specify which types of histograms the DBMS_STATS package can create. The only input parameter of the GATHER_*_STATS procedures that affects this decision is ESTIMATE_PERCENT. When it is set to a numeric value greater than 0, the DBMS_STATS package does not create top-frequency and hybrid histograms. In other words, the DBMS_STATS package considers the creation of the new types of histograms only if the ESTIMATE_PERCENT parameter is set to the DBMS_STATS.AUTO_SAMPLE_SIZE constant (0).
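As a sketch of that workaround (the table name here is just an example), gathering statistics with an explicit sampling percentage looks like this:

BEGIN
  -- With a numeric ESTIMATE_PERCENT instead of AUTO_SAMPLE_SIZE,
  -- DBMS_STATS does not create top-frequency or hybrid histograms.
  dbms_stats.gather_table_stats(ownname          => user,
                                tabname          => 'T',
                                estimate_percent => 100,
                                method_opt       => 'FOR ALL COLUMNS SIZE 254');
END;
/

The obvious trade-off is that you give up the sampling behavior provided by DBMS_STATS.AUTO_SAMPLE_SIZE.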

Another possibility to prevent the DBMS_STATS package from creating top-frequency and hybrid histograms is to set the undocumented preferences ENABLE_TOP_FREQ_HISTOGRAMS and ENABLE_HYBRID_HISTOGRAMS. For example, to set them globally, you can use the following PL/SQL block:

BEGIN
  dbms_stats.set_global_prefs(pname => 'ENABLE_HYBRID_HISTOGRAMS', pvalue => 0);
  dbms_stats.set_global_prefs(pname => 'ENABLE_TOP_FREQ_HISTOGRAMS', pvalue => 0);
END;

WARNING: Even though according to my tests the preferences do what I expect (i.e. disable top-frequency and hybrid histograms), I never used them in a real case and I did not ask Oracle about their usage. Hence, if you want to use them, I would not only test them carefully, but also ask Oracle Support whether their utilization can lead to known problems.

Before setting them, I also advise you to get their (default) values with a query like the following one:

SELECT dbms_stats.get_prefs('ENABLE_HYBRID_HISTOGRAMS') AS hybrid,
       dbms_stats.get_prefs('ENABLE_TOP_FREQ_HISTOGRAMS') AS top_freq
FROM dual

Finding Oracle VM Manager Core Management Port After OVM Manager Installation

Recently I was working on integration of OVM Manager with Oracle EM Cloud Control 12cR4 and I had to enter the "Oracle VM Manager URL", a.k.a. the "Core management port", which is reported at the end of the OVM Manager installation. The default value for this port is 54321. As I was not in a position to get this post-installation report, because the installation was performed by another company, I had to find out the right port myself. The value should be entered in the form tcp://<hostname>:<port>.
EM Cloud Control and OVM Manager Integration Screen
OVM Manager Installation report sample

One of the first ideas was to check the local firewall configuration and see if there was anything for port 54321.

# iptables -L | grep 5432
ACCEPT     tcp  --  anywhere  anywhere   state NEW tcp dpt:54322

The search showed open port 54322, which could also be one of the possibilities.
Searching with netstat revealed no activity on port 54322:

# netstat -vatn | grep 54322

However, a lot of rows were returned when searching for port 54321:

# netstat -vatn | grep 54321
tcp        0      0 ::ffff:127.0.0.1:54321      :::*                        LISTEN
tcp        0      0 ::ffff:127.0.0.1:36797      ::ffff:127.0.0.1:54321      ESTABLISHED
tcp        0      0 ::ffff:127.0.0.1:51828      ::ffff:127.0.0.1:54321      ESTABLISHED
tcp        0      0 ::ffff:127.0.0.1:53096      ::ffff:127.0.0.1:54321      ESTABLISHED
tcp        0      0 ::ffff:127.0.0.1:55461      ::ffff:127.0.0.1:54321      ESTABLISHED
tcp        0      0 ::ffff:127.0.0.1:53017      ::ffff:127.0.0.1:54321      ESTABLISHED

But I was still not 100% sure whether iptables showed the right port, so I dug further through the documentation, which revealed it. The definition of the "Core management port" is hidden in the .config file, under the name COREPORT, in the OVM Manager home directory: /u01/app/oracle/ovm-manager-3/.config

The contents of this file in my case was:

# cat  /u01/app/oracle/ovm-manager-3/.config

DBTYPE=MySQL
DBHOST=localhost
SID=ovs
LSNR=49500
OVSSCHEMA=ovs
APEX=8080
WLSADMIN=weblogic
OVSADMIN=admin
COREPORT=54321
UUID=0004fb000001000075e54bae74172d82
BUILDID=3.3.1.1065

The explanation for the configuration items can be found in chapter "5.1 Backing up and Restoring Oracle VM Manager", and obviously in my case the port used was actually the default one, 54321.
The .config file should be part of the backup.

I hope this post will shorten the search time for this kind of information for my readers.

10046.pl by Clive Bostock

Note:

This is a reposting of an old blog post that was on dboptimizer.com but is no longer accessible.

More trace file analyzer tools at http://ba6.us/node/177

Related blog post: Oracle "Physical I/O" ? not always physical, with a 10046 parser specifically for I/O: parsetrc.pl and readme


Often when I have a 10046 trace file, especially when looking at I/O issues, I want a histogram of I/O response time.  To get I/O response time I've hacked out incomplete awk scripts from time to time, always meaning to write a more complete one. Well, now I don't have to. It's already been done!

Here is a cool perl script from Clive Bostock: README.TXT and 10046.pl

(also check out orasrp, which produces a more in-depth report in HTML. I like both. I like 10046.pl as a short, easy, portable script that I can modify, whereas orasrp is a binary and only works on some ports)

For example, if I trace a session with 10046 and retrieve the tracefile, then I can run:

$ 10046.pl -t mytrace.trc

and it will output a header and three sections:

Header

  • Summary of all events for tracefile
  • Events by object summary
  • Events by object histogram

This looks like:

Header

Trace file mytrace.trc
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u02/oracle
System name:    SunOS
Node name:      toto
Release:        5.10
Version:        Generic_142900-12
Machine:        sun4u
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 177
Unix process pid: 16553, image: oracle@toto
Trace input file : mytrace.trc

Wait summary

EVENT AGGREGATES
================
Wait Event              Count Elapsed(ms)   Avg Ela (ms)  %Total
~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~
db file sequential read  2715      11593              4    3.74
       direct path read  4484       4506              1    1.45
 db file scattered read   141        898              6    0.29
          log file sync     3          8              2    0.00
                              ~~~~~~~~~~~~
             Total Elapsed:       309821

Wait Summary by object

Object Id  : Wait Event                  Count Tot Ela (ms) %Total Avg Ela (ms)
~~~~~~~~~~ : ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~~
28581      : direct path read            4484         4506   1.45            1
1756433    : db file sequential read      725         1891   0.61            2
764699     : db file sequential read      332         1762   0.57            5
37840      : db file sequential read      200         1044   0.34            5
38018      : db file sequential read      108         1009   0.33            9
81596      : db file scattered read       140          887   0.29            6

Wait histogram by object

EVENT HISTOGRAM BREAKDOWN
===========================
This section splits the event counts into elapsed time
buckets so that we can see if there are any suspicious
or anomalous response time / frequency patterns.
Object Id : Wait Event              <1ms <2ms <4ms <8ms <16ms <32ms <64ms <128ms <256ms <512ms >=1024ms
~~~~~~~~~ : ~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~~ ~~~~~ ~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~~~
28581     : direct path read        7680   87  148  221  144    40     4     0      0       0        0
1756433   : db file sequential read  606  268   45   35   66     6     2     0      0       0        0
764699    : db file sequential read   74  119   11   78   78     9     0     0      0       0        0
37840     : db file sequential read   50   72    6   45   47     5     0     0      0       0        0
38018     : db file sequential read   12   38    7   10   30    12     5     0      0       0        0
81596     : db file scattered read    64    4   13   62   18     8     3     0      0       0        0
41995     : db file sequential read   20   39    0    7   16     8     4     0      0       0        0
108718    : db file sequential read   74   54    5   12   24     4     0     0      0       0        0
33490     : db file sequential read    0    5   11   25   19     4     0     0      0       0        0

An Oracle Instance is Like An Upmarket Restaurant

I recently did an Introduction to Oracle presentation, describing how the oracle instance worked – technically, but from a very high level. In it I used the analogy of a restaurant, which I was quite happy with. I am now looking at converting that talk into a set of short articles and it struck me that the restaurant analogy is rather good!

Here is a slide from the talk:

Simple partial overview of an Oracle Instance

As a user of the oracle instance, you are the little, red blob at the bottom left. You (well, your process, be it SQL*Plus, SQL*Developer, a Java app or whatever) do nothing to the database directly. It is all done for you by the Oracle Server Process – and this is your waiter.

Now, the waiter may wait on many tables (Multi-threaded server) but this is a very posh restaurant, you get your own waiter.

You ask the waiter for food and the waiter goes off and asks the restaurant to provide it. There are many people working in the restaurant, most of them doing specific jobs and they go off and do whatever they do. You, the customer, have no idea who they are or what they do and you don't really care. You don't see most of them. You just wait for your food (your SQL results) to turn up. And this is exactly how an Oracle Instance works. Lots of specific processes carry out their own tasks but they are coordinated and they do the job without most of us having much of an idea what each bit does. Finally, some of the food is ready and the waiter delivers the starter to you – the server process brings you the first rows of data.

Let’s expand the analogy a bit, see how far we can take it.

When you arrived at the restaurant, the Maître d' greets you and allocates you to your waiter. This is like the Listener process waiting for connection requests and allocating you a server process. The Listener process listens on a particular port, which is the front door to the restaurant. When you log onto an Oracle database your session is created, ie your table is laid. If someone has only just logged off the database, their session might get partially cleared and re-used for you (you can see this as the SID may well get re-used), as creating a session is a large task for the database. If someone had just left the restaurant, that table may get a quick brush down and the cutlery refreshed, but the tablecloth, candle and silly flower in a vase stay. Completely stripping a table and re-laying it takes more time.

The restaurant occupies a part of the building, the database occupies part of the server. Other things go in the server, the restaurant is in a hotel.

The PMON process is the restaurant manager or Head of House maybe and SMON is the kitchen manager, keeping an eye on the processes/staff and parts of the restaurant they are responsible for. To be candid, I don’t really know what PMON and SMON do in detail and I have no real idea how you run a large kitchen.

There are lots of other processes; these are equivalent to the Sous-chef, Saucier, Commis chef, Plongeur (washes up, the ARC processes maybe?), Rôtisseur, Poissonnier, Pâtissier etc. They just do stuff, let's not worry about the details, we just know there are lots of them making it all happen and we, the customer or end user, never see them.

The PGA is the table area in the restaurant, where all the dishes are arranged and provided to each customer? That does not quite work as the waiter does not sit at our table and feed us.

The SGA is the kitchen, where the ingredients are gathered together and converted into the dishes – the data blocks are gathered in the block buffer cache and processed. The Block Buffer Cache are the tables and kitchen surfaces, where all the ingredients sit. The Library cache is, yes, the recipes. They keep getting re-used as our kitchen only does certain recipes, it’s a database with a set of standard queries. It’s when some fool orders off-menu that it all goes to pot.

Food is kept in the larder and fridges – the tablespaces on disc. You do not prepare the dishes in the larder or fridge, let alone eat food out of them (well, some of the oracle processes might nick the odd piece of cooked chicken or chocolate). Everything is brought into the kitchen {the SGA} and processed there, on the kitchen tables.

The orders for food are the requests for change – the redo deltas. Nothing is considered ordered until it is on that board in the kitchen, that is the vital information. All the orders are preserved (so you know what was ordered, you can do the accounts and you can re-stock). The archived redo. You don’t have to keep this information but if you don’t, it’s a lot harder to run the restaurant and you can’t find out what was ordered last night.

The SCN is the clock on the wall and all orders get the time they were placed on them, so people get their food prepared in order.

When you alter the ingredients, eg grate some of the Parmesan cheese into a sauce, the rest of the cheese (which, being an ingredient, is in the SGA) is not put back into the fridge immediately, ie put back into storage. It will probably be used again soon. That'll push it up the LRU list. Eventually someone will put it back, probably the Garçon de cuisine (the kitchen boy). A big restaurant will have more than one Garçon de cuisine, all with DBW1 to x written on the back of their whites, and they take the ingredients back to the larder or fridge when they get around to it – or are ordered to do so by one of the chefs.

Can we pull in the idea of RAC? I think we can. We can think of it as a large hotel complex which will have several restaurants, or at least places to eat. They have their own kitchens but the food is all stored in the central store rooms of the hotel complex. I can't think what would be an analogy of block pinging, as only a badly designed or run restaurant would, for example, only have one block of Parmesan cheese – oh, maybe it IS a lot like some of the RAC implementations I have seen :-)

What is the Sommelier (wine waiter) in all of this? Suggestions on a post card please.

Does anyone have any enhancements to my analogy?

Enterprise Manager and Services, Part I

The term “Service” is used to represent numerous objects within Enterprise Manager 12c.  Although the most common reference to the term is related to the Oracle Management Service, (OMS), in this blog post, we are going to drill down and refer to all services identified in the drop down menu from Targets, Management Services and Repository.

The two main services we are going to cover here are set up as part of the Oracle Enterprise Manager 12.1.0.4 installation and shouldn’t be edited or reconfigured, but I think it’s good to know where everything is if you should ever need to investigate it or are working with Oracle Support.

By default the following services will display in the EM12c environment-

  • EM Console Service
  • EM Job Service

These two services are essential to successful Enterprise Manager task completion.  The EM Console Service is the service that keeps the console active and accessible and the EM Job Service controls not just the EM Jobs viewable in the EM Job Activity console, but also a ton of background processing for the Enterprise Manager.

The Main Console View

Once you’ve accessed the services from the target menu, as these are both monitored targets in the Enterprise Manager, you’ll see the following information:

services1

 

Along with the two default services we expect to see for a standard Enterprise Manager installation, we can also see the status, whether there are any open incidents and what system the services belong to.  You can create and remove services from here, but unless you know what you're doing, this is mostly the view to quickly filter service information.

Creating services from this console offers you three options: two generic service types and one advanced service option called an Aggregate service.

services3

Using our existing services as examples may give you a bit of an idea of how a System Based service is different from a Test Based one.

A System based service monitors one single system, whereas a Test based one monitors one or more test-based systems.  Notice that the EM Jobs Service does ONE thing-  it monitors the Enterprise Manager Job Service.  The EM Console Service has to monitor EVERYTHING that is involved in the console, which is much more complex and requires a constant test to ensure all aspects are available.

An Aggregate Service is a combination of sub-services.  If one of the many sub-services is experiencing an issue, the entire Aggregate service is impacted by that one failure.

There are a number of other services that Enterprise Manager 12c supports:

  • Business Application
  • Service Entry Point
  • Oracle Database Service
  • Oracle Applications Service
  • Siebel Services
  • Beehive Services
  • EM Service
  • And others….

You can create, manage and remove services via EM CLI commands, too.  The following commands will support the management of each:

  • create_service
  • assign_test_to_target
  • change_service_system_assoc
  • remove_service_test_assoc
  • set_availability
  • set_properties
  • And many, many others….

To the right of the console list of services is Key Components incident information.  This can assist the administrator when they are experiencing a problem but the culprit isn't displayed in the service; a related target may be the issue:

services2

 

The Key Tests also show current status and any monitoring beacons that are attached to the service.

EM Console Service

If we click on the EM Console Service, we are taken to the home page for the service.

service4

Centering in on the upper left section of the home page, you notice that there are a number of tabs:

  • Home
  • Charts
  • Test Performance
  • System
  • Monitoring Configuration
  • Topology

For any generic service, you are going to see these tabs, but we’ll go through each one of the tabs, (outside of the home tab) for the EM Console Service so that you get a good feel for what is available in each and what tab has control over the configuration of the service.

Charts

Upon clicking the charts menu, you’ll see the basic information about page response per millisecond and page hits per minute:

service5

Unless there is a serious performance issue in the console, I commonly don’t find a lot of useful data from the default view of “last 24 hrs” and recommend changing the view to [at least] the last 7 days.  You’ll be surprised how different the data will look when you take into account the nightly maintenance jobs that keep EM12c clean and effective:

service18

Test Performance

The next tab is for test performance.  All services have tests that check whether the service is actively running, along with one or more tests that verify it is properly functioning and available.

service6

The value for the last EM Console Service Test is shown above in the graph, but the real information lies lower, in the performance timestamp.  The test isn't run very often (it is a performance impact on the system, so full testing of the service is only run once every 24 hours).  You can see the Collection Timestamp of 6:55am in the right hand window, which will also match the extended Web Transaction used to verify that the console is accessible if you click on the EM Console Service Test at the bottom.

System

The System tab displays the latest information about the status and incidents of the components connected to the service.

service7

As you can see from the screen shot above, a recent test of a non-released patch has caused some grief, (this is why we test these things… :)) and there are a few application components that were impacted and need to be reviewed and addressed.  Each of the components is linked on this page, so they can be easily accessed and the incident investigated.  As this is the EM Console Service, there are a number of related components (16 total, as shown in the right hand totals) and you can change to the next page to display the rest of the components involved.

Monitoring Configuration

The Monitoring Configuration requires a section to itself, as this has some pretty impressive links in this tab.  We are going to go through each one of these, so you can get a solid understanding of what is available:

service8

System Association

First thing you need to remember, DON’T CLICK ON THE “REMOVE SYSTEM” BUTTON.  Although I’ve never seen anyone attempt this from the console, I can’t imagine the damage that would be done if you chose to remove a key component from the EM Console Service unless instructed by Oracle Support, so just don’t attempt it to see what happens… :)

service10

As discussed above, each component associated to the EM Console Service is displayed, along with the designation as a key component with a check mark.

Root Cause Analysis Configuration

By setting the analysis mode to manual, less stress is put on the system resources to collect root cause data for every issue, letting you control when the analysis is performed.  The recommendation is to leave this set to manual mode for analysis collections and only to change it [again] with the blessing of Oracle Support.

service11

If any root cause analysis has been performed, you would view it to the right of the list of components and types, as the bottom section is reporting info only, listing each component and any test counts.

Service Tests and Beacons

This is how cloud control knows that the console service and key components are up and available.  A set of tests is run at a regular interval in the form of a web transaction.  This means that the test logs in, verifies a result from the URL for cloud control and verifies each key component is responsive.  If all tests return a positive result, then the status is 'up'.

service12

A beacon is a target used to monitor service tests, primarily to measure performance of the service. At least one beacon belongs to each service test and is used to check connectivity between the client and the OMS host, or the OMS host and the OMR host, etc.  Adding multiple beacons for one service adds no value, so if for some reason you see multiple beacons assigned to a service, there's a problem right there.

Availability Definition

This console looks very similar to the one above, but it covers test information about the EM Console Service Tests and EM Management Beacon:

service13

By default, both of these require that ALL KEY SERVICE TESTS be SUCCESSFUL to be marked successful and return a status of 'up'.  As you can see, the EM Management Beacon has not done so, which results in an unavailable status.  The next step in trouble-shooting this would be to click on the Beacon name link and investigate what didn't return successfully.

Performance Metrics

Performance metric data is where you set your thresholds for EM Console Service response, (and the readers say “Aha!  That’s another location I can get to the valuable metric threshold setting!”)

service14

The metric value settings for the threshold are in milliseconds.  If you are receiving notifications that response time is beyond your threshold, look at the network connection, the network connection between data centers, etc., and if it's something you can't correct, bump up the warning and critical threshold values so they provide more value.

Usage Metrics

This setting isn't configured by default, but it's one area you can configure if you would like.  If you are concerned about page performance and want to report on this metric data, set it up with a logical value to start.  For a small web server, 12,000 page hits per minute would be a pretty good warning value, with 20,000 page hits per minute critical.  Use this as a beginning base, then test, rinse and repeat.

service15

Now that we've completed a high level review of the Monitoring Configuration, we can return to the last tab, Configuration Topology.

Topology

This last tab is for those of you that love topology charts.  This is the ultimate in configuration topology, allowing you to interact with all data, component dependency and connections to the EM Console Service.

service9

 

This is also where you come to realize how dependent everything is on your EM Console Service-  no website, no Enterprise Manager access.  By hovering my cursor over the service, I can quickly see the dependent HA servers, database instance, management agent and other essential hardware and software components that are involved with the EM12c hardware and framework.  To the right are filtering options to drill down into a more manageable view:

service19

High level information about each of the targets you want to drill down on is displayed, and you can use this section to create a much easier to read topology:

service20

 

The topology makes even more sense if you can view all the information on each of the targets vs. me hiding host/database names… :)

Hopefully this review of services helps explain a little bit more about this very, very important key component of Enterprise Manager 12c.  In the next post on Services, I'll go over the all-important EM Job Service.

 



Copyright © DBA Kevlar [Enterprise Manager and Services, Part I], All Right Reserved. 2015.