Oakies Blog Aggregator

Plan puzzle

I was in Munich a few weeks ago running a course on Designing Optimal SQL and Troubleshooting and Tuning, but just before I flew in to Munich one of the attendees emailed me with an example of a statement that behaved a little strangely and asked me if we could look at it during the course.  It displays an odd little feature, and I thought it might be interesting to write up what I did to find out what was going on. We’ll start with the problem query and execution plan:


select     s section_size,
           max(program_id) ,avg(program_id)
from       fullclones_test
cross join  (select distinct section_size s from fullclones_test)
where      section_size = (select min(section_size) from fullclones_test)
and        clone_size >= s
group by   s
order by   1; 

Since I found this a little difficult to follow (with repetitions of the same table name, and column aliases switching back and forth) I did a little cosmetic modification; all I’ve done is add table aliases and query block names, then arranged the text for my visual benefit. The aliases and query block names can help when dissecting the anomaly.


select
	/*+ qb_name(main) */
	max(ft1.program_id),
	avg(ft1.program_id),
	ftv.s	section_size
from
	fullclones_test	ft1
cross join
	(
	select	/*+ qb_name(inline) */
		distinct ft2.section_size s
	from	fullclones_test	ft2
	)	ftv
where	ft1.section_size = (
		select 	/*+ qb_name(subq) */
			min(section_size)
		from	fullclones_test	ft3
	)
and	ft1.clone_size >= ftv.s
group by
	ftv.s
order by
	ftv.s
; 

This query ran reasonably quickly (about half a second), producing the following execution plan:


--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                 |      1 |        |   404 (100)|      4 |00:00:00.05 |    1116 |       |       |          |
|   1 |  SORT GROUP BY NOSORT  |                 |      1 |      5 |   404   (6)|      4 |00:00:00.05 |    1116 |       |       |          |
|   2 |   MERGE JOIN           |                 |      1 |  48637 |   299   (7)|  50361 |00:00:00.58 |    1116 |       |       |          |
|   3 |    SORT JOIN           |                 |      1 |      5 |   114  (11)|      5 |00:00:00.02 |     372 |  2048 |  2048 | 2048  (0)|
|   4 |     VIEW               |                 |      1 |      5 |   114  (11)|      5 |00:00:00.02 |     372 |       |       |          |
|   5 |      HASH UNIQUE       |                 |      1 |      5 |   114  (11)|      5 |00:00:00.02 |     372 |  5894K|  3254K|  884K (0)|
|   6 |       TABLE ACCESS FULL| FULLCLONES_TEST |      1 |  99999 |   105   (3)|  99999 |00:00:00.31 |     372 |       |       |          |
|*  7 |    SORT JOIN           |                 |      5 |  20000 |   185   (4)|  50361 |00:00:00.16 |     744 |   619K|   472K|  550K (0)|
|*  8 |     TABLE ACCESS FULL  | FULLCLONES_TEST |      1 |  20000 |   106   (4)|  20076 |00:00:00.09 |     744 |       |       |          |
|   9 |      SORT AGGREGATE    |                 |      1 |      1 |            |      1 |00:00:00.01 |     372 |       |       |          |
|  10 |       TABLE ACCESS FULL| FULLCLONES_TEST |      1 |  99999 |   105   (3)|  99999 |00:00:00.29 |     372 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$071BB01A
   4 - INLINE       / FTV@SEL$1
   5 - INLINE   
   6 - INLINE       / FT2@INLINE   
   8 - SEL$071BB01A / FT1@SEL$1
   9 - SUBQ 
  10 - SUBQ         / FT3@SUBQ

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - access("FT1"."CLONE_SIZE">="FTV"."S")
       filter("FT1"."CLONE_SIZE">="FTV"."S")
   8 - filter("FT1"."SECTION_SIZE"=)

As you can see by comparing the block name / object alias information, we can identify a single full tablescan being executed at lines 9/10 to produce the min(section_size) in the subquery.

We can also see that the “select distinct” executes at lines 5/6 to produce 5 rows which are then joined, with a merge join, to the full tablescan of fullclones_test (ft1) at line 8.

If you’re wondering about the appearance of a sel$1 despite my efforts to name every query block, that’s the (almost) inevitable side effect of using ANSI syntax – virtually every join after the first two tables will introduce a new (unnameable) query block to introduce the next table.

Now here’s the anomaly: if we eliminate the avg() from the select list we’re going to produce a result that ought to require less work – but look what happens:

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                 |      1 |        | 10802 (100)|      4 |00:02:48.83 |    1116 |       |       |          |
|   1 |  SORT GROUP BY NOSORT  |                 |      1 |      5 | 10802  (94)|      4 |00:02:48.83 |    1116 |       |       |          |
|   2 |   MERGE JOIN           |                 |      1 |    972M| 10697  (95)|   1007M|03:21:28.41 |    1116 |       |       |          |
|   3 |    SORT JOIN           |                 |      1 |  99999 |   380   (4)|  80042 |00:00:00.39 |     372 |  2037K|   674K| 1810K (0)|
|   4 |     TABLE ACCESS FULL  | FULLCLONES_TEST |      1 |  99999 |   105   (3)|  99999 |00:00:00.26 |     372 |       |       |          |
|*  5 |    SORT JOIN           |                 |  80042 |  20000 |   185   (4)|   1007M|00:57:11.13 |     744 |   619K|   472K|  550K (0)|
|*  6 |     TABLE ACCESS FULL  | FULLCLONES_TEST |      1 |  20000 |   106   (4)|  20076 |00:00:00.11 |     744 |       |       |          |
|   7 |      SORT AGGREGATE    |                 |      1 |      1 |            |      1 |00:00:00.01 |     372 |       |       |          |
|   8 |       TABLE ACCESS FULL| FULLCLONES_TEST |      1 |  99999 |   105   (3)|  99999 |00:00:00.28 |     372 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$6B65F52B
   4 - SEL$6B65F52B / FT2@INLINE
   6 - SEL$6B65F52B / FT1@SEL$1
   7 - SUBQ
   8 - SUBQ         / FT3@SUBQ

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("FT1"."CLONE_SIZE">="FT2"."SECTION_SIZE")
       filter("FT1"."CLONE_SIZE">="FT2"."SECTION_SIZE")
   6 - filter("FT1"."SECTION_SIZE"=)

Ignore the timings from lines 2 and 5 – I was using the hint gather_plan_statistics to collect the rowsource execution stats, and those lines are showing a massive sampling error. The query took about 7 minutes 30 seconds to run. The key difference is that line 4 shows that the “select distinct” is NOT aggregated early – the optimizer has used complex view merging to “join then aggregate” rather than “aggregate then join”. As you can see, this was a bad choice and the join has produced over a billion (US) rows at line 2 which then have to be aggregated down to just 4 rows in line 1.
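
If you want to reproduce this style of report, here’s a minimal sketch of the mechanism (not my exact test script): run the statement with the gather_plan_statistics hint, then call dbms_xplan.display_cursor() with the “allstats last alias” format – that’s what produces the A-Rows figures and the Query Block Name / Object Alias section (the count(*) below is just a stand-in for the statement of interest):

set serveroutput off

select  /*+ gather_plan_statistics */
        count(*)
from    fullclones_test
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last alias'));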

The question then is: why? If I put a /*+ no_merge */ hint in the query block named “inline” the optimizer accepts the hint and goes back to the plan that aggregates early and runs very quickly – so it’s not a question of the optimizer bypassing some mechanism to avoid getting the wrong answer. I think the only option available to us for further investigation at this point is to examine the 10053 (optimizer) trace file to see what’s going on.
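
For reference, this is (roughly) where the hint went – it’s just the original inline view with no_merge added alongside the qb_name hint:

	(
	select	/*+ qb_name(inline) no_merge */
		distinct ft2.section_size s
	from	fullclones_test	ft2
	)	ftv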

From the (12c)  trace file where we select the avg() we see the following lines:

OJE: Begin: find best directive for query block INLINE (#0)
OJE: End: finding best directive for query block INLINE (#0)
CVM: CBQT Marking query block INLINE (#0) as valid for CVM.
CVM:   Not Merging INLINE (#0) into SEL$1 (#0) due to CBQT directive.

From the equivalent position in the trace file where we select only the max() we see the lines:

OJE: Begin: find best directive for query block INLINE (#0)
OJE: End: finding best directive for query block INLINE (#0)
CVM:   Merging SPJ view INLINE (#0) into SEL$1 (#0)

It’s always hard to decide exactly WHY things happen – but it looks as if the optimizer merges the view heuristically in the max() case “because it can”, but has a heuristic (internal directive) that stops it from merging in the avg() case. What a good job we have hints!
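
If you want to see the same trace lines for yourself, here’s a minimal sketch of switching the 10053 (optimizer) trace on and off around the statement – the tracefile_identifier is just a hypothetical tag to make the trace file easier to find:

alter session set tracefile_identifier = 'cvm_test';
alter session set events '10053 trace name context forever, level 1';

-- hard-parse the problem query here (an "explain plan for ..." is enough to force optimisation)

alter session set events '10053 trace name context off';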

Footnote:

In cases like this it’s always possible that there’s a generic case that might produce wrong results even though the specific case won’t – so it’s worth spending a little time thinking about how the wrong results might appear. It’s also worth keeping hold of the SQL as a test case, because if there’s a generic block in place to handle specific cases you may find that future enhancements allow the block to be lifted for some cases, and it’s nice to be able to check for such cases as you upgrade.

On the other hand, you can get back to the efficient plan if you change the inline view to be:


        (
        select  /*+
                        qb_name(inline)
                */
                ft2.section_size s , count(*) ct
        from    fullclones_test ft2
        group by
                ft2.section_size
        )       ftv


That’s just a little bit surprising – but I’m not going to pursue this one any further, beyond noting that there are some interesting anomalies available with inline aggregate views, even in 12.1.0.2.

Footnote 2:

If you want to experiment further, here’s the SQL to generate the data set:

create table fullclones_test (
        program_id      number(5,0),
        start_line      number,
        clone_id        number(5,0),
        clone_line      number,
        clone_size      number,
        range_start     number,
        section_size    number(4,0)
)
;

insert into fullclones_test (
        program_id, start_line, clone_id, clone_line,
        clone_size, range_start, section_size
)
Select
        1,
        trunc(dbms_random.value(1,1000)),
        trunc(dbms_random.value(1,10)),
        trunc(dbms_random.value(1,1000)),
        trunc(dbms_random.value(20,100)),
        0,
        20*trunc(dbms_random.value(1,6))
from
        dual
connect by
        level <100000
;

commit;

exec dbms_stats.gather_table_stats(user,'fullclones_test')

Finally, for consistently reproducible results I had engineered my system stats as follows:


        begin
                dbms_stats.set_system_stats('MBRC',16);
                dbms_stats.set_system_stats('MREADTIM',10);
                dbms_stats.set_system_stats('SREADTIM',5);
                dbms_stats.set_system_stats('CPUSPEED',1000);
        end;
        /

Will the real data virtualization please stand up?

There is a post from a good friend at Oracle entitled “Will the REAL SnapClone functionality please stand up?” and, as well-written and technically rich as the post is, I am particularly moved to comment on the very last and conclusive sentence in the post…

So with all of that, why would you look at a point solution that only covers one part of managing your Oracle infrastructure?

The post does not refer to Delphix by name, and it could in fact be referring to any number of companies, but Delphix is the market leader in this space, so it is reasonable to assume that the “Product X” mentioned throughout the post is Delphix.  The same holds true for any post commenting on relational database technology, which can reasonably be assumed to refer to Oracle.  Regardless, I was struck by the use of the phrase point solution in that final sentence of the post, and how it really is a matter of perspective, and how interesting is that perspective.

First of all, before we go any further, please let me say that, as an Oracle DBA for the past 20 years, I think that the current release of Oracle’s Enterprise Manager, EM12c, is the finest and most complete release of the product since I tested early versions of Oracle EM alongside the Oracle8i database in the late 1990s.  At that time, the product was full of promise, but it wasn’t something upon which an enterprise could truly rely.  That has certainly changed, and it has been a long time coming, starting with the advent of utilities like AWR and Active Session History (ASH).  If you have extensive Oracle technology in your organization, you should be using EM12c to manage it.  Not EM11g, or EM10g, but EM12c.  It really is that good, and it is getting better, and there are talented people behind it, and you simply need it if you want to maximize your investment in Oracle technology.

But just because EM12c is the center of the universe of Oracle technology, what about organizations for whom Oracle technology is merely a component?  Many organizations have diverse IT infrastructures comprising Microsoft, IBM, SAP, and open-source technologies, and all of those technology components share the need for the basic use-cases of quickly and economically cloning production to create non-production environments to support development, testing, reporting, archival, and training activities.

Should those diverse IT organizations employ a silo tool like EM12c just for cloning Oracle databases, and then find the same functionality separately for each of those other separate technologies?  Would doing so be a tactical or a strategic decision?

So in response to the final question in the SnapClone post, I ask another question in turn…

Why would one look at a point solution that covers only Oracle database?

Access to data is the biggest constraint on development and testing, so it doesn’t make sense not to enable data virtualization for all applications, regardless of whether or not they are built on Oracle technology.  IT agility is a strategic capability important to the entire business, not a technical challenge for a component silo.

But perhaps, in the interest of continuing the Oracle-only focus of the SnapClone post, we could stay inside the bounds of Oracle.  Fair enough, as a theoretical exercise…

So, even if we limit the discussion only to Oracle technology, it quickly becomes obvious that another important question looms…

Why would one look at a point solution that covers only the Oracle database, leaving the application software, database software, configuration files, and all the other necessary parts of an application as a further problem to be solved?

Anybody who has managed IT environments knows that the database is just one part of a complete application stack.  This is true for applications by Oracle (i.e. E-Business Suites, PeopleSoft, JDEdwards, Demantra, Retek, etc), as well as prominent applications like SAP, and every other application vendor on the planet, and beyond.

To do this, one needs a solution that virtualizes file-system directories with software, files, and everything that comprises the application, not just an Oracle database.

To provision those complete environments for developers and testers quickly and inexpensively, one needs both server virtualization and data virtualization.

Unless you’ve spent the past 10 years in deep space chasing a comet, you’ve already got server virtualization on board.  Check.

Now, for data virtualization, you need to virtualize Oracle databases, check.  And you also need to virtualize SQL Server databases, check.  And PostgreSQL and Sybase databases, check and check.  In the near future, Delphix will likely be virtualizing IBM DB2 and MySQL databases, not to mention MongoDB and Hadoop, ‘cuz that’s what we do.  Check, check, … check-a-mundo dudes and dudettes.

And beyond databases, even if you’re a single-vendor organization, you still need to virtualize directories and files, on UNIX/Linux platforms as well as Windows servers.

Delphix does all of the above, which is one reason why we’re the market leader in this space.  Check.

A substantial portion of the Fortune 500 already relies on data virtualization from Delphix today, across their entire technology portfolio, as the partial list online here shows.

Perhaps it is only a point solution from one perspective, but be sure that your perspective is aligned with those of your whole organization, and that you’re not just thinking of a strategic business capability as merely “functionality” within a silo.

Removing Redundant Startup/Restart for the OMS Service in Windows

I’ve been told many times that the OMS for EM12c can take quite some time to start on Windows.  Some told me it took anywhere from three to fifteen minutes and wanted to know why.  I’ve done some research on the challenge and it is a complex one.

Let’s start this post by stating that even though I’m focusing on the OMS service that is part of the Windows installation of EM12c from Oracle, the OMS is in no way to blame, nor is it the only application to have this problem (so this post may help many others).  The issue has more to do with over-engineering at MANY different non-Oracle levels, and it is in no way a bug.  At the same time, it can really impact the quality of the user experience with EM12c on Windows, and it helps if you know WHAT is causing the challenge rather than what will easily have fingers pointed at it as the blame.  We all know that Oracle is blamed until proven innocent, so it’s important that we understand what is happening and correct the problem rather than just pointing fingers.

As most DBAs aren’t as familiar with the Windows OS platform, let’s quickly review what a Windows service is and why it’s important:

A Microsoft Windows service, formerly known as an NT service, is a long-running executable application that runs in its own Windows session. Services can be started automatically when the computer boots, can be paused and restarted, and do not require a user interface.

When installing Enterprise Manager 12c, or even just the Oracle database, on the Microsoft Windows OS platform, a service is created to support the application.  This service can be created in a number of ways; for Oracle, the following are supported:

oradim -new -sid [sid] -intpwd [password] -maxusers [number] -startmode [auto|manual] -spfile [directory location of spfile]
emctl create service [-oms_svc_name  -user  -passwd ]

and then we have the native Windows method, the sc command:

sc create [service name] binPath= "[path to executable to start app and argument]" start= [auto|demand] DisplayName= [name to display]

Each of these options is supported to create many of the different services that are needed to support different features/targets in Enterprise Manager, and they are used as part of the installation process via the Database Configuration Assistant, the Network Configuration Assistant and the Oracle Installer.

One of the enhancements they are working on for EM12c is moving the java thread startup and stop from serial to multi-threaded processing.  This is going to speed up the start and stop of the OMS considerably (anyone tracing the startup of an OMS to see where time is being spent will undoubtedly see that over 80% of it is in the WebLogic tier).

Until this enhancement is made, the extended startup time trips a few safety measures that are built in at a number of levels to ensure services stay up.  If a service isn’t up, well, you aren’t going to be using the application, so unfortunately for us, this is where the OCD of the development world has come back to haunt us…. :)

Tracing and Advanced Logging

First, we need to get more info from our node manager to see what is starting the service, when it’s timing out, and what is restarting it.  We can do this by going to the following directory:

$GCINST_HOME\NodeManager\emnodemanager

Make a backup copy and then edit the original nodemanager.properties file.

By default, the log level is set to INFO (loglevel=info).

There are numerous log level settings:

  • SEVERE (highest value)
  • WARNING
  • INFO
  • CONFIG
  • FINE
  • FINER
  • FINEST (lowest value)

My recommendation is to set it to FINEST if you really want to log what’s going on, but don’t leave it there: it will produce a lot of logging, and unless you are troubleshooting something there just isn’t any need for this amount of fine detail.  Also remember that a restart of the OMS service is required for any change to the logging to take effect.

Update the log level, save the file and restart the service.  The data will be written to the following file:

$GCINST_HOME\NodeManager\emnodemanager\nodemanager.log

To understand more about tracing and logging, see the Oracle documentation, which can take you through it (as well as save me a lot of typing… :))

Trace and Log Files

  • em.start – Tells you if there were any time outs and at what step the timeout occurred.
  • OracleManagementServer_EMGC_OMS1_1srvc.log – This is the logged startup and shutdown of the actual service.
  • nodemanager.log – This is the log of the nodemanager’s interaction with the OMS service.
  • EMGC_OMS1.out – Steps of weblogic startup, java threads and times.
  • emctl.log – Also shows timeouts set by the emctl start process.
  • emoms_startup.trc – Shows timeouts by connections (including sqlnet timeouts).
  • emoms_pbs.trc – Shows actual timeouts at the java level.

There’s more data out there than this, especially if you use the EM Diagnostics kit, but just to start, it’s a good beginning.

Services

The OMS Service in Windows uses a standard naming convention, so it should look very similar to the one below:

[Image: oms_services – the OMS service as it appears in the Windows Services list]

 

Even though we are seeing one service, it can be controlled by many different mechanisms that ensure it is always running, as well as manage how long it has to start before timing out and what its restart options are.

1. Service Timeouts:

There are two timeout values in the registry, depending on the version of Windows Server that you have.  These are there to assist you but, due to redundancy, they can impact you as well.  The two values control how long to wait for a service to start before timing out, and how long to wait before killing a service that is unresponsive.

To see these, you will be working in the registry.  The registry is the nervous system of the OS, so take great care when working with it and always make a backup of the folder you are working in before making any changes.

Click on Start –> Run, enter “regedit” and click OK, then go to Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control.

Right click on the Control folder, choose Export and save off the registry file as Services_TO.reg.
In the right-hand “details” view, remove the following values (either or both may be present) or, even better, set them to a value that allows the OMS enough time to start before they kick in and try to restart it:
ServicesPipeTimeout
WaitToKillServiceTimeout

Remember, any changes you make here do not take effect until after you restart the computer.  You can revert any changes by importing the saved registry file backup you made beforehand and performing another restart.

2. Auto restart of the OMS by the Node Manager

The node manager’s job is to ensure that the Windows service for the OMS is up and running. It is there to check the OMS service and, if it sees it is down, restart it.  If this is attempting to restart the OMS service while the registry settings are also attempting to restart the OMS service, well, you are starting to see the issue here.

To stop the node manager from attempting to auto-restart the service upon timeout:

Go to $GCINST_HOME/user_projects/domains/GCDomain/servers/EMGC_OMS1/data/nodemanager/startup.properties

Create a backup of the startup.properties file and then open the file in an editor such as Notepad or WordPad.
Go to the following line: AutoRestart=true
Change the value to “false”.
Save the change, and the node manager will no longer attempt to auto-restart the service if it sees it down.

3.  Clustered services added to a failover cluster, Oracle Fail Safe or other clustering process (not RAC).

Clustering at the OS level is primarily for high availability, so redundant checks and restart options are built in everywhere for the Windows services that are added.  In the example of a failover cluster, the OMS service is added to the failover node.

[Image: fo_cluster_main – the OMS service added to the failover cluster]

This allows it to automatically fail over, along with the shared virtual server and shared storage, to the passive node and start up if there is a failure.  The clu2 virtual server has policy settings telling the OS what to do in case of failure and how to restart.  By default, this is applied to all dependent resources and shared storage allocated to it.

[Image: fo_cluster_prop – clu2 virtual server failover policy settings]

As you can see in the clu2 properties, the policies have been set if:

  • A failure occurs, restart services and storage on the original active node.
  • If the restart fails, then failover to the passive node.
  • If the service or resource doesn’t start within 15 minutes, timeout.

You’ll also notice there is an option to not restart, as well as how soon a restart should be attempted.

You can update this in the server-level properties, which will automatically propagate to the dependent resources (it is the master of all policy settings, so you should set them here).

[Image: fo_cluster_prop_2 – updated restart and timeout policy settings]

We have now asked that, in case of failure, it should not restart, and that it should not time out for 30 minutes.

Summary

I’ve shown you all the redundant settings that have been built in to ensure that the service is restarted: how long it can attempt to start before timing out, whether it should restart, and how long to wait between restarts.  The key to all this is knowing that only ONE of them should be managing this.  If you decide to let Oracle manage it, then use the node manager settings and disable options 1 and 3.  If you decide to let Microsoft handle it at the service level, then disable options 2 and 3, and so on.

Understand that if they are all left to manage on top of each other, you will have one timing out the startup while another is still attempting to start, and yet another notes it’s down and issues a restart.  If you wonder why it’s taking 15 minutes or more to start your OMS on Windows, I’ll bet money that if you trace out the session you’ll find more than one process attempting to start or restart the poor thing in your logs.

Honesty dictates that we shouldn’t just blame a complex issue on any one contributor, and we should realize that with added complexity comes the need for added skills to ensure that you have the best configuration to support the technology.  Taking the time to trace out and understand the issue will help make that happen.

 



Copyright © DBA Kevlar [Removing Redundant Startup/Restart for the OMS Service in Windows], All Right Reserved. 2014.

Shrink Tablespace

If you start moving objects around to try and reclaim space in a tablespace there are all sorts of little traps that make it harder than you might hope to get the maximum benefit with the minimum effort.  I’ve written a couple of notes in the past about how to proceed and, more recently, one of the difficulties involved. This is just a brief note about a couple of ideas to make life a little easier.

  • Check that you’ve emptied the recyclebin before you start
  • Before you try moving/rebuilding an object check that the total free space “below” that object is greater than  the size of the object or you’ll find that parts of the object move “up” the tablespace.
  • Before moving a table, mark its indexes as unusable. If you do this then (in recent versions of Oracle) the default behaviour is for the index space to be freed as the segment vanishes, and you may find that the extents of the table can move further “down” the tablespace.  (If you’ve kept tables and indexes in different tablespaces this is irrelevant, of course).
  • When you move an object, think a little carefully about whether specifying a minimum initial extent size would help or hinder the move.
  • Don’t assume that moving the “highest” object first is the best strategy – work out where you expect the final tablespace HWM to be, and you may find that it makes more sense to move other objects that are above that point first (see the query sketch at the end of this list for one way of finding each segment’s highest block).
  • Moving objects with several small (64KB) extents first may allow you to free up larger (1MB, 8MB) gaps that can be used by other larger objects.
  • Creating a new tablespace and moving objects to it from the old tablespace “top down” may be the quickest way to proceed.  Work towards dropping the old tablespace HWM regularly.
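
For the “where will the HWM end up” exercise, here’s a rough sketch of the sort of query I’d start from (MY_TBS is a hypothetical tablespace name, and since block_id only means anything within a file you need to keep file_id in the picture for multi-file tablespaces):

select
        owner, segment_name, segment_type, file_id,
        max(block_id + blocks) - 1      top_block
from
        dba_extents
where
        tablespace_name = 'MY_TBS'
group by
        owner, segment_name, segment_type, file_id
order by
        file_id, top_block desc
;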

 

HOWTO: Validate XML content against an XML Schema

Somewhere on this site is a post regarding how to validate against an XML schema…

Will the REAL Snap Clone functionality please stand up?

Every so often our competitors like to spread false information about our solutions so as to gain unfair advantage. I wanted to address this for the Enterprise Manager Snap Clone functionality, so this blog post describes the Snap Clone solution as it currently stands (November 2014).

Let’s start by introducing Snap Clone functionality. I blogged on that a few months back, but a few months can be an eternity in software development terms, so here’s an update on the salient points of what Snap Clone does for you over the various EM releases:

  • EM12cR2 provided Snap Clone for NAS storage (NetApp and Sun ZFSSA).  It provided RMAN backup based clones, and included the Snap Clone Analyzer to show you the storage savings you could make using Snap Clone
  • EM12cR3 added in support for Snap Clone using the Solaris File System (ZFS) and admin flows for Snap Clone for PDB’s (pluggable databases)
  • EM12cR4 added a lot more
    • Snap Clone using CloneDB – this is the biggie, as it means Snap Clone can now be used with ANY Oracle database release that supports CloneDB, regardless of what storage it’s on
    • Data Guard standby as a test master – allows offloading the impact of creating the test master from your Production environment
    • NetApp Ontap 8.x cluster mode support
    • Certification for engineered systems, with I/O over Infiniband
    • Support for NFSv4
  • And coming down the pipe, support for:
    • Integrated data lifecycle management
    • Snap Clone using EMC SAN and ASM
    • Admin flows for test master creation
    • Integration with masking, patching, upgrades etc.

Looking at the cloning options that are now supported, you can either provide full clones using RMAN Restore, RMAN Duplicate or Data Pump, or thin clones via either software solutions (ZFS and CloneDB) or hardware solutions (Sun ZFSSA, NetApp and soon EMC).  Let’s touch on some of these in a bit more detail.

Snap Clone using Solaris File System (ZFS)

Snap Clone using ZFS uses a single stock Solaris 11.1+ image which can be either physical or virtual (note: it doesn’t use the Sun ZS3 appliance).  It supports both NAS and SAN.  If you are using SAN, then mount the LUNs as raw disk and format with the ZFS filesystem.  It’s important to note here that this does NOT require any snapshot/cloning licenses from the storage vendor, as these features are available for free.

Additional features provided with this solution include compression, de-duplication, I/O caching and so on.  If you also need HA in this configuration, that can be handled externally either via Solaris Clusters, or by using the HA features of the underlying hypervisor.

Diagrammatically, the configuration looks like this:

[Diagram: ZFS – Snap Clone using the Solaris File System (ZFS) configuration]

CloneDB using dNFS

With CloneDB using dNFS, you get the ability to create thin copies of a database from RMAN image copies. This uses the NFS v3 client that has been embedded in the database since 11.2.0.3. Currently, this is supported for single instance databases, but only on file systems (i.e. ASM is not yet supported).

The advantages of this approach include:

  • It’s easy to setup
  • No special storage software is needed
  • It works on all platforms
  • It’s both time efficient (instantaneous cloning) and space efficient (you can create multiple clones based on one backup)
  • It uses dNFS to improve the performance, high availability and scalability of NFS storage
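
To make the “thin copies of a database from RMAN image copies” point a little more concrete, here’s a heavily simplified sketch of the key step, assuming the clone instance has already been started with clonedb = true and a control file that points at the image copies (all file names are hypothetical, and you’d repeat the call for each datafile):

begin
        dbms_dnfs.clonedb_renamefile(
                '/backup/prod/system01.dbf',    -- read-only RMAN image copy (the master)
                '/clones/dev1/system01.dbf'     -- sparse file created for this clone on the NFS mount
        );
end;
/

alter database open resetlogs;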

Snap Clone on ASM and EMC Storage

Using Snap Clone on ASM and EMC storage provides the ability to create ‘live’ thin clones of databases that are on ASM. A live clone is NOT snapshot based but rather a live clone of the database that can be within the same cluster or indeed another one. Both single instance and RAC are supported – supported versions are 10.2.0.5 or higher of the database, 11.2 and higher of the Grid Infrastructure code. This functionality works on both EMC VMAX (with Time Finder VPSnap) and VNX storage appliances.

Diagrammatically, the configuration looks like this:

[Diagram: EMC – Snap Clone on ASM and EMC storage configuration]

End to End Automation

Now let’s look at the data lifecycle and how data moves through this environment. To start off with, there are a few concepts you need to understand:

  • Production DB – Obviously, you need to identify the production database used for cloning.
  • Backups – As most (hopefully all!) of us currently do, you need to take regular backups (RMAN, datapump exports, etc.) These backups can then be used through this process as well.
  • Masking / Subsetting – When moving data from the Production database, clearly it’s important to mask sensitive data. Optionally, you may also want to (or indeed, with very large databases, have to) subset the data to reduce the storage footprint.
  • Test Master – the Test Master is a sanitized (see previous bullet) copy of production data for use in Dev / Test environments OR a Data Guard Standby database. It can then be used as the source for our snapshotting.
  • Clones – Depending on your needs, these may be full clones or snap (thin) clones. Full clones are often used for performance / stress testing; snap clones may often be used for functional testing. Which one you use is generally determined by the amount of storage you have available to you and the size of the Test Master.
  • Refresh – the Refresh process is what you use to keep your clone in sync with data changes in Production.

How these concepts all relate together is possibly best shown by the following:

[Diagram: Data movement1 – how production data moves through masking to the Test Master and clones]

A couple of points of explanation:

  1. Notice that the data in the Test Master has been masked but still remains in a format that looks similar to the original data. That’s important if you want to use the clones to examine performance plans that may have gone awry for some reason. One drawback of using the Data Guard standby approach is that because of its very nature, masking and subsetting is not possible in this scenario. You would need to take manual, discrete copies of the data from Production, which could of course be automated to occur at scheduled intervals.
  2. On the flip side, using the Data Guard Standby means that data refresh to the Test Master is both automatic and instantaneous, so your data can be much more up to date than it might be if you were using discrete copies.
  3. The refresh process can occur either against the Test Master or to backups of your Production database. If you have configured this as a self-service admin flow, self-service users can then refresh their existing clones with new data without you needing to be involved.

Full or Snap Clone: How It Works

With that in mind, let’s talk now about the details of how it works. In simple terms, the Test Master (or the Data Guard standby if you’re using that) is regularly refreshed with current data from Production. From the Test Master / Standby, you can make as many scheduled or manual storage snapshots or RMAN backups as you like. These are called “Profiles” and are indicated by the t0, t1, t2, … tN times in the diagram below. From each of these profiles, clones can be taken. Each user can have a personal read-write database clone, as the data was at the time the profile was created, and of course, can take as many private backups of their clone as they desire:

[Diagram: how it works – profiles (t0, t1, t2, … tN) and the clones taken from them]

Self Service Provisioning

Clearly, all of this is not much use to you as an administrator if you’re the one who has to keep building all of this, so it’s important to have a way of allowing users to build their own environments while ensuring resource usage is restricted based on roles. That’s where self service provisioning comes in. EM12c comes with an out of the box self service portal capability. You as the administrator create a catalog of different database configurations, possibly with a variety of datasets, which the self service user can then select. Have a look at the following diagram:

[Image: Self Service Portal – the Database Cloud Self Service Portal and service template list]

The larger box at the back left shows the standard Database Cloud Self Service Portal, as seen by a self service user. To the left, you can see this particular user has created 4 databases, along with the memory and storage consumption they have taken. This particular user has been granted permission to create 12 databases, using a maximum of 50 Gb of memory and 100 Gb of storage. These limits have been set by you as the self service administrator.

You are also responsible for building a number of templates to define the configurations they can choose from. When the self service user clicks on the “Request Database” button on the Database Cloud Self Service Portal, the box on the right appears, showing them the service templates they are allowed to choose from. In this example I’m showing you, they can choose from:

  • Full 1.5 TB Database Clone – this is a full copy of the Production Database
  • Generic Application Schema – a copy of an 11.2.0.3 database
  • HR Sample Schema – allows the user to create a copy of the HR sample schema with data
  • Small 200 Gb database from RMAN backup – provides a subset of an existing database for functional testing
  • StoreFront Application Schema – creates a copy of an in-house application called StoreFront, complete with data

(NOTE: These are just examples of the sort of thing you can add as templates. Obviously some of these would fail if the self service user tried to create them because the resource quotas you have given them would be exceeded. :) )

Now, let’s look at an example of the user interface you would see specifically for the Snap Clone functionality. As the self service user, when you select a template to create a thin clone, it takes you to a page that looks like this:

[Image: Image00003 – the Snap Clone request page]

The inputs you provide are:

  • Database SID – the SID for the database that will be created as part of this request
  • Database Service Name – the service name that you will use to connect to this database after creation
  • Optionally, a schedule – when will the database be created, and how long will it be available for. The defaults are to start the creation immediately, and a duration of indefinitely, but you can change these to meet your needs
  • User Name and Password – the username and password that will be assigned for you to manage the database that is being created
  • Service Instance Properties – again, these are optional but you can specify things like Lifecycle Status, contact name and so on
  • Snapshots – this is really the most important part, as it is here where you specify the snapshot time (i.e. the profile) that you will use to create the thin clone. In the example shown here, we are using a profile built on September 16, 2014

Once the database has been created, you will see it listed in the Services region of the Database Cloud Self Service Portal:

[Image: Image00009 – the new database listed in the Services region of the portal]

By clicking on the name of the service as shown above, you are taken to the database home page (note this is still all being done by the self service user, NOT the self service administrator):

[Image: Image00010 – the database home page as seen by the self service user]

It is from here that you can click on the “Refresh” button to refresh the data to a more recent profile. Clicking that button brings up a pop-up window allowing you to select the date you want to refresh your clone to. In this example, I can choose a snapshot that was taken on September 19:

[Image: Image00011 – selecting the snapshot to refresh the clone to]

That’s how straightforward the process is! And of course, in inimitable Oracle style, there is so much more coming just around the corner. You can soon

Addressing the Misinformation

Now that you’ve seen the capabilities of the Snap Clone product, let’s come back to the issue I raised at the beginning of this post – misinformation. I’m not really wanting to take aim at any particular company involved here, just the data, so I won’t name them here except as “Product X”. While some of what they are claiming is certainly correct, some of it is only partially true and some of it is just plain wrong. The claims that fall into these latter categories included:

  • Snap Clone allows customers to leverage existing NetApp or ZFS storage hardware more efficiently but Product X installs on commodity hardware – well, yes Snap Clone does allow that, but as I mentioned above it also supports CloneDB using dNFS, and ASM on EMC storage. Adding Clone DB using dNFS, which is functionality supported natively since the 11.2.0.3 release of the Oracle database, means that Snap Clone is supported on any hardware that Oracle Database 11.2.0.3 or later is supported on, not just on NetApp or ZFS. And of course, the addition of EMC storage just broadens that support base even further.
  • Product X is the only technology in the industry to provide “Live Archive” – archiving and providing point in time versions of a database. This is EXACTLY what Snap Clone provides, so please don’t say your product is the only one in the industry that does that!
  • Product X is the only technology in the industry to provide “Business Intelligence” – 24×7 ETL windows and fast refresh of new data in minutes. Again, not true. True business intelligence normally requires a summarized copy of your Production data, plus data from other sources as well, so any product that simply refreshes from your Production database would not have the capabilities needed by BI. If, however, your BI requirements are simple enough that they can be resolved by having a copy of just your Production data, then Snap Clone provides that capability as well.
  • Product X is the only technology in the industry to provide “Data Center Migration” – “Product X supports active replication between Product X appliances making cloud migration simple, efficient and continually synced between in house and cloud.” That functionality also exists in EM12c.
  • “Snap Clone is a feature that is a simple and nice enhancement for the usage of specialized existing hardware, either Netapp or ZFS storage to make static thin clones at one point in time.” – a competitor’s words. As already mentioned, Snap Clone is NOT restricted to using specialized hardware AND clones can be refreshed as needed, so this statement is just plain wrong.
  • Scale, scale, scale – With Snap Clone, you can scale from 1 to 1000’s of clones. Some competitive tools that are out there would require multiple instances of their product to be deployed to achieve that, all of which adds overhead.
  • Protection of your existing investments – using Snap Clone, you have the choice between hardware solutions that you might already have, as well as software solutions. We also use trusted technologies like Data Guard for test master refresh.

I could go on, but let’s leave it there and look more at another important area.

Post Provisioning

Once you have provisioned your data, there are a lot of other important areas that need to be looked at as well. It’s simply not enough to provision new databases and then leave the management of those environments alone. All that does is lead to database sprawl, creating management headaches. So what are the other areas you need to look at? They include:

  • Patching – Any computing environment will, over time, require patching, either as security and bug issues are found and addressed or as more major releases occur. EM12c provides fully integrated patch management functionality to address this space.
  • Compliance – EM 12c provides a rich and powerful compliance management framework that automatically tracks and reports conformance of managed targets to industry, Oracle, or internal standards. Enterprise Manager 12c ships with compliance standards for Oracle hardware and software including Database, Exadata Database Machine, Fusion Middleware, and more. These compliance standards validate conformance to Oracle configuration recommendations, best practices, and security recommendations.
  • Performance Management – EM12c includes a variety of tools to manage performance, from ASH Analytics, SQL Performance Analyzer and Database Replay at the database level to a complete Application Performance Management (APM) solution for custom applications and Oracle applications (including E-Business Suite, Siebel, PeopleSoft, JD Edwards and Fusion Applications).
  • Chargeback – Chargeback is used to allocate the costs of IT resources to the people or organizations who consume them. While it can be applied in situations where IT resources are dedicated, it is particularly relevant in situations where resources are shared, as without some way to meter and charge for consumption there will be a tendency for users to use more resources than they need. This problem is exacerbated in cloud environments where users are able to provision their own resources using self-service.

All of these areas and so many more are covered in EM12c, along with the Snap Clone functionality we started this post looking at. EM12c is Oracle’s flagship management product for all your database needs, and is in sync with database releases. We provided support for functionality such as RAC and the multi-tenant architecture from day 1 of the software being released, whereas competitor products can take months to catch up. In addition, EM12c provides a full security model, including role based access control, which is used by many Fortune 1000 customers. So with all of that, why would you look at a point solution that only covers one part of managing your Oracle infrastructure?

Foreign Archived Log in #Oracle – what does it mean?

When you look into V$RECOVERY_AREA_USAGE, you see a strange row at the bottom:

SQL> select * from v$recovery_area_usage;

FILE_TYPE               PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES     CON_ID
----------------------- ------------------ ------------------------- --------------- ----------
CONTROL FILE                             0                         0               0          0
REDO LOG                                 0                         0               0          0
ARCHIVED LOG                         10.18                         0              73          0
BACKUP PIECE                             0                         0               0          0
IMAGE COPY                               0                         0               0          0
FLASHBACK LOG                            0                         0               0          0
FOREIGN ARCHIVED LOG                     0                         0               0          0
AUXILIARY DATAFILE COPY                  0                         0               0          0

Curious what that could be? You will see values other than zero on a Logical Standby Database:

SQL> connect sys/oracle@logst as sysdba
Connected.
SQL> select database_role from v$database;

DATABASE_ROLE
----------------
LOGICAL STANDBY

SQL> select * from v$recovery_area_usage;

FILE_TYPE               PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES     CON_ID
----------------------- ------------------ ------------------------- --------------- ----------
CONTROL FILE                             0                         0               0          0
REDO LOG                                 0                         0               0          0
ARCHIVED LOG                         14.93                         0               9          0
BACKUP PIECE                             0                         0               0          0
IMAGE COPY                               0                         0               0          0
FLASHBACK LOG                            0                         0               0          0
FOREIGN ARCHIVED LOG                  2.03                         0              26          0
AUXILIARY DATAFILE COPY                  0                         0               0          0

In contrast to a Physical Standby Database, this one writes not only into standby logs but also into online logs while being in standby role. That leads to two different kinds of archive logs:
[Image: Foreign Archived Logs]

When DML (like insert and update) is done on the primary (1), that leads to redo entries into the online logs (2), which are simultaneously shipped to the standby and written there into standby logs (2) also. The online logs on the primary and the standby logs on the standby will eventually be archived (3). So far that is the same for both physical and logical standby. But now a difference: logical standby databases do SQL Apply (4) by log mining the standby or the archive logs that came from the primary. That generates similar DML on the standby, which in turn leads LGWR there to write redo into online logs (5) that will eventually get archived (6) as well.

A logical standby could do recovery only with its own archive logs (if there was a backup taken before) but not with the foreign archive logs. Therefore, those foreign archive logs can and do get deleted automatically. V$ARCHIVED_LOG and V$FOREIGN_ARCHIVED_LOG can be queried to monitor the two different kinds of logs.
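
As a quick sanity check on the logical standby, you can compare how many logs of each kind it currently knows about (just a sketch – counting is the simplest comparison):

select 'own'     kind, count(*) logs from v$archived_log
union all
select 'foreign' kind, count(*) logs from v$foreign_archived_log;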

That was one topic of the course Oracle Database 12c: Data Guard Administration that I’m delivering as an LVC this week, by the way. Hope you find it useful :-)

Tagged: Data Guard, High Availability

Parallel Fun

As I write, there’s an ongoing thread on Oracle-L that started with the (paraphrased) question: “I’ve got this query that returns 7 million rows; when I change it to ‘select count(*)’ it returns in 4 seconds but when we display the full result set on screen it takes hours, and every second or two the screen pauses; how do I make it go faster.”

The general rapid response was: “You shouldn’t be running 7M rows to a screen – the time is the time for the network traffic and display.”

The first part of the statement is right – the second part is quite likely to be wrong, and there’s a very strong hint in the question that makes me say that: it’s the “pauses every second or two”. Of course we don’t know what the OP isn’t telling us, and we don’t know how accurate he is in what he is telling us, so any ideas we have may be completely wrong. For example, we haven’t been given any idea of how long a “pause” is, we don’t really know how accurate that “second or two” might be and whether “every” is an exaggeration, and maybe the query is returning CLOB columns (and that could make a big difference to what you can do to improve performance).

If we take the statement at face value, though, there is one very obvious inference: although some of the time will be due to network traffic time, most of the time is probably due to Oracle doing something expensive for a significant fraction of the rows returned. The pattern of activity probably looks like this:

  • client: call server to fetch next array of rows
  • server: spend some time populating array  — this is where the client sees a pause
  • client: display result array
  • client: call server to fetch next array of rows
  •  etc…

Here’s a trivial example:

connect / as sysdba
set arraysize 500
set pagesize 40

select
        o1.spare1 ,
        (
        select  max((ctime))
        from    obj$    o2
        where   o2.owner# = o1.owner#
        and     o2.obj# < o1.obj#
        ) ct
from obj$ o1
;

On my laptop, running an instance of 11.2.0.4 with about 80,000 rows in obj$ (and a lot of them owned by SYS), I can count seconds and find that (approximately) I alternate between one second watching results scrolling up the screen and one second waiting as the server generates the next 500 rows.
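
If you want to separate the server-side work from the screen-drawing time, one simple variation (a sketch, using the same query) is to let SQL*Plus fetch every row but render none of them:

set timing on
set arraysize 500
set autotrace traceonly statistics

select
        o1.spare1 ,
        (
        select  max((ctime))
        from    obj$    o2
        where   o2.owner# = o1.owner#
        and     o2.obj# < o1.obj#
        ) ct
from obj$ o1
;

set autotrace off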

Of course it’s possible to argue that the problem really is the network and nothing but the network struggling to cope with the never-ending stream of little packets produced by 7M rows. Could there be a choke point that causes the data to stop and start with great regularity? Maybe – but previous experience says probably not. I have experienced bad network problems in the past, but when they’ve occurred I’ve always observed extremely random stop/go behaviour. The regularity implied in the question makes the Oracle-based problem seem far more likely.

Conveniently a couple of people asked for more clues – like the query text and the execution plan; even more conveniently the OP supplied the answers in this response. Since the email format makes them a little hard to read I’ve copied them here:


SELECT  bunch of stuff.....,

        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr1.RELATED_SID
                        ||
                        ',')
                ORDER BY sr1.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr1
                WHERE   sr1.SID                    = slv.SID
                        AND sr1.RELATIONSHIP_LEVEL = '1'
                GROUP BY sr1.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL1,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr2.RELATED_SID
                        ||
                        ',')
                ORDER BY sr2.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr2
                WHERE   sr2.SID                    = slv.SID
                        AND sr2.RELATIONSHIP_LEVEL = '2'
                GROUP BY sr2.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL2,
        (
               SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr3.RELATED_SID
                        ||
                        ',')
                ORDER BY sr3.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr3
                WHERE   sr3.SID                    = slv.SID
                        AND sr3.RELATIONSHIP_LEVEL = '3'
                GROUP BY sr3.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL3,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr4.RELATED_SID
                        ||
                        ',')
                ORDER BY sr4.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr4
                WHERE   sr4.SID                    = slv.SID
                        AND sr4.RELATIONSHIP_LEVEL = '4'
                GROUP BY sr4.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL4,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr5.RELATED_SID
                        ||
                        ',')
                ORDER BY sr5.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr5
                WHERE   sr5.SID                    = slv.SID
                        AND sr5.RELATIONSHIP_LEVEL = '5'
                GROUP BY sr5.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL5
FROM    service_lookup slv
        LEFT JOIN service_location sl
        ON      sl.service_location_id = slv.service_location_id;

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1570133209

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                      |  7331K|  5593M|  1877   (5)| 00:00:01 |        |      |            |
|   1 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|   2 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10000             |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   4 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | PCWC |            |
|*  5 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | PCWP |            |
|   6 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|   7 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|   8 |    PX SEND QC (RANDOM)           | :TQ20000             |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | P->S | QC (RAND)  |
|   9 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | PCWC |            |
|* 10 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | PCWP |            |
|  11 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  12 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  13 |    PX SEND QC (RANDOM)           | :TQ30000             |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | P->S | QC (RAND)  |
|  14 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | PCWC |            |
|* 15 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | PCWP |            |
|  16 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  17 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  18 |    PX SEND QC (RANDOM)           | :TQ40000             |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | P->S | QC (RAND)  |
|  19 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | PCWC |            |
|* 20 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | PCWP |            |
|  21 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  22 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  23 |    PX SEND QC (RANDOM)           | :TQ50000             |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | P->S | QC (RAND)  |
|  24 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | PCWC |            |
|* 25 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | PCWP |            |
|  26 |  PX COORDINATOR                  |                      |       |       |            |          |        |      |            |
|  27 |   PX SEND QC (RANDOM)            | :TQ60002             |  7331K|  5593M|  1877   (5)| 00:00:01 |  Q6,02 | P->S | QC (RAND)  |
|* 28 |    HASH JOIN RIGHT OUTER BUFFERED|                      |  7331K|  5593M|  1877   (5)| 00:00:01 |  Q6,02 | PCWP |            |
|  29 |     PX RECEIVE                   |                      |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,02 | PCWP |            |
|  30 |      PX SEND HASH                | :TQ60000             |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | P->P | HASH       |
|  31 |       PX BLOCK ITERATOR          |                      |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | PCWC |            |
|  32 |        TABLE ACCESS STORAGE FULL | SERVICE_LOCATION     |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | PCWP |            |
|  33 |     PX RECEIVE                   |                      |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,02 | PCWP |            |
|  34 |      PX SEND HASH                | :TQ60001             |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | P->P | HASH       |
|  35 |       PX BLOCK ITERATOR          |                      |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | PCWC |            |
|  36 |        TABLE ACCESS STORAGE FULL | SERVICE_LOOKUP       |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------

We have a simple two-table outer join, and five scalar subqueries in the select list. (Not being very familiar with the various XML calls I had no idea of what the scalar subqueries were doing, or how they produced a result, beyond the fact that they were querying and aggregating multiple rows. In fact the combination of calls does much the same as listagg(), though it allows for a CLOB result (which could be part of the performance problem, of course) rather than being limited to a varchar2() result).

Would you like to guess at this point why I constructed my demonstration query against obj$ the way I did when presenting the idea of high-cost-per-row queries as a reason for regular pauses in the output? The execution plan matched one of my two initial guesses about what the query was going to look like. When you “select count(*) from {this query}”, the optimizer will factor out the scalar subqueries and only have to count the result set from the hash join – and it might even manage to use a couple of parallel index fast full scans to get that result rather than doing the tablescans. When you run the query itself, though, you have to run the scalar subqueries.
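Just to make that point concrete, here’s a minimal sketch of the count(*) test I have in mind – the full text of the OP’s query is left as a placeholder, exactly as the select list was elided above:

select  count(*)
from    (
        -- paste the full query here, scalar subqueries and all; the optimizer
        -- can discard the scalar subqueries in the select list because they
        -- cannot change the number of rows the outer join returns
        select 1 from dual      -- placeholder so the sketch parses on its own
        );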

If we trust the statistics, we have 5 subqueries to run for each row of the hash join – and the hash join is predicted to return 7.3 million rows. Given that the subqueries are all going to run parallel tablescans against a fairly large table (note – the cost of the tablescans on SERVICE_RELATIONSHIP is 368, compared to the cost of the tablescan on SERVICE_LOCATION which is 366 to return 3.1M rows) that’s an awful lot of work for each row returned – unless we benefit from an enormous amount of scalar subquery caching.

Here’s another performance threat that the plan shows, though: notice where the PX SEND QC operation appears – that means the PX slaves send their (7M) rows to the Query Co-ordinator and the QC is responsible for doing all the work of running the scalar subqueries. Another interesting little threat visible in the plan shows up in the TQ column – the plan uses six “data flow operations” (the original name for the concept; it changed some time ago but survives in the column names of v$pq_tqstat). In principle each DFO could allocate two sets of PX slaves (and every DFO could have a different degree of parallelism); in this example DFO number 6 (the driving hash join) uses two sets of slaves, and the other five DFOs (the scalar subqueries) use a single set each. The upshot of this is that if the default degree of parallelism in play is N this query will allocate 7N parallel query slaves. It gets a little nastier than that, though (based on checking the output from v$sql_plan_monitor), because each time one of the scalar subqueries runs Oracle seems to allocate and deallocate the slaves that are supposed to run it – which is probably going to cause some contention if there are other parallel queries trying to run at the same time.
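If you want to see that DFO breakdown for yourself, a query along these lines against v$pq_tqstat (run from the same session immediately after the parallel query completes) should report the table queues grouped by data flow operation; this is a generic sketch, not output supplied by the OP:

-- Report rows and bytes per table queue, per DFO, for the last parallel
-- statement executed in this session.
select  dfo_number, tq_id, server_type, process, num_rows, bytes
from    v$pq_tqstat
order by
        dfo_number, tq_id, server_type, process
;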

Optimisation

So what could you do with this query ? It depends on how much change you want to make to the code.

It’s possible that an index on service_relationship(relationship_level, sid) – with compress 1 – might help if it’s very precise, and if the target table stays in the buffer cache for the duration of the query – but, in the absence of scalar subquery caching, that could still leave the query co-ordinator executing 35 million (5 queries x 7 million rows) subqueries in a serialised process.
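For reference, the sort of index being suggested would look something like this (the index name is purely illustrative):

-- compress 1 compresses on the leading, low-cardinality relationship_level column
create index service_relationship_ix1
        on service_relationship(relationship_level, sid)
        compress 1
;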

A better bet may be to convert from subqueries to joins – remembering that the listagg() / xmlserialize() calls will require you to aggregate (which means sorting in this case) an estimated 25 rows per driving row per relationship_level; in other words you may need to sort 7M * 125 = 875M rows – but at least you could do that in parallel, and there’s always the possibility that the estimated 25 drops off as you work through the different levels. You could choose to do 5 outer hash joins or (as Iggy Fernandez outlined in the thread) you could do a single outer join with a decode on the relationship_level. Another variation on this theme (which would probably have a plan showing ‘join then aggregate’) would be to ‘aggregate then join’. It’s possible that creating a non-mergeable inline view for the 5 values of relationship_level from a single table access, aggregating it to produce the five required columns, then using the result in an outer join, would be the most efficient option (there’s a sketch of this last idea below). In the absence of a detailed understanding of the data volume and patterns it’s hard to make any prediction of which strategy would work best.
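To make that last option a little more concrete, here’s a hedged sketch of the ‘aggregate then join’ idea, assuming the column names are exactly as they appear in the original query and that the aggregated strings are short enough for listagg() to stand in for the XMLAGG()/XMLSERIALIZE() combination (remember the original allows a CLOB result, which listagg() does not):

with sr_agg as (
        -- Aggregate service_relationship once to get one row per sid with the
        -- five level columns; listagg() ignores the NULLs produced by the case
        -- expressions for non-matching levels.
        select  /*+ no_merge qb_name(agg) */
                sid,
                listagg(case when relationship_level = '1' then related_sid end, ',')
                        within group (order by related_sid)     as level1_ids,
                listagg(case when relationship_level = '2' then related_sid end, ',')
                        within group (order by related_sid)     as level2_ids,
                listagg(case when relationship_level = '3' then related_sid end, ',')
                        within group (order by related_sid)     as level3_ids,
                listagg(case when relationship_level = '4' then related_sid end, ',')
                        within group (order by related_sid)     as level4_ids,
                listagg(case when relationship_level = '5' then related_sid end, ',')
                        within group (order by related_sid)     as level5_ids
        from    service_relationship
        where   relationship_level in ('1','2','3','4','5')
        group by
                sid
)
select  /* bunch of stuff..... as in the original select list */
        sra.level1_ids          as relatedserviceinstanceidlevel1,
        sra.level2_ids          as relatedserviceinstanceidlevel2,
        sra.level3_ids          as relatedserviceinstanceidlevel3,
        sra.level4_ids          as relatedserviceinstanceidlevel4,
        sra.level5_ids          as relatedserviceinstanceidlevel5
from    service_lookup slv
left join
        service_location sl
        on      sl.service_location_id = slv.service_location_id
left join
        sr_agg sra
        on      sra.sid = slv.sid
;

The no_merge hint is there to keep the aggregation as the single-pass, non-mergeable inline view described above; as the paragraph says, without knowing the data volumes and patterns there’s no guarantee this is the best of the rewrite options.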

Footnote:

I may be wrong in my analysis of this problem. When I first saw the question the performance pattern suggested an “obvious” design error in either the SQL or the infrastructure, and when I saw that the query and execution plan matched my prediction it became very hard for me to think that there might be some other significant cause.

There were a couple of interesting details in the execution plan that made me pursue the problem a little further. First I built a very simple model to get an estimate of the time needed to display 7M rows of a reasonable width in SQL*Plus running across a typical LAN (my estimate was in the order of 45 minutes – not hours); then I spent a little more time (about 10 minutes) building a model that reproduced the key features of the execution plan shown.

I then spent two or three hours playing with the model, and I’ll be writing a further blog post with some of the results later on. One detail to carry away today, though, is that in 12c Oracle can do a new form of subquery unnesting which transformed the query from its five scalar subquery form into the seven-table join form that was one of the suggestions made on the thread; even more interestingly, if I blocked the unnesting (to force the subquery execution) Oracle 12.1.0.2 came up with a new operator (EXPRESSION EVALUATION) that allowed it to run the subqueries from the PX slaves before passing the results to the query co-ordinator – in other words eliminating the serialisation point.
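For anyone who wants to experiment, here’s a hedged sketch of one way of blocking the unnesting: a no_unnest hint inside each scalar subquery (shown here for just the first of the five, stripped down to a minimal driving query). This is an illustration of the mechanism rather than a report of the exact test code:

select  slv.sid,
        (
                select  /*+ no_unnest */
                        rtrim(xmlserialize(content extract(
                                xmlagg(xmlelement("e", sr1.related_sid || ',')
                                        order by sr1.related_sid),
                                '//text()' ) ), ',' )
                from    service_relationship sr1
                where   sr1.sid                = slv.sid
                and     sr1.relationship_level = '1'
                group by
                        sr1.sid
        ) as relatedserviceinstanceidlevel1
from    service_lookup slv
;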

To be continued …

Quick Links to Helpful Literature on Oracle Database In-Memory Column Store.

I’m surprised to find that Google is not cleanly ranking the helpful set of blog posts by Oracle’s Maria Colgan on the Oracle Database 12c In-Memory Column Store feature so I thought I’d put together this convenient set of links. Google search seems to only return a few of them in random order.

Over time I may add other helpful links regarding Oracle’s new, exciting caching technology.

Starter Information

Getting Started With Oracle Database In-Memory. Part I.

Getting Started With Oracle Database In-Memory. Part II.

Getting Started With Oracle Database In-Memory. Part III.

Getting Started With Oracle Database In-Memory. Part IV. 

In-Memory Column Store With Real Application Clusters

The following are links to information about Oracle Database In-Memory on Real Application Clusters:

Oracle Database In-Memory on RAC. Part I.

In-Memory Product That Requires Proprietary Storage?

How could the brand of storage matter for an in-memory cache feature? Good question.

Fellow Oaktable Network member Christian Antognini has produced a very important article regarding how Oracle Database 12c In-Memory Column Store with Real Application Clusters is of questionable value unless you are using Oracle storage (Exadata, SPARC SuperCluster). I found Christian’s article very interesting because, after all, the topic at hand is an in-memory cache product (a.k.a., In-Memory Column Store). I fail to see any technical reason why Oracle wouldn’t support an in-memory product with blocks from any and all storage. It is in-memory after all, isn’t it? Please visit Christian’s article here: The Importance of the In-Memory DUPLICATE Clause for a RAC System.

Filed under: oracle

Audio semi-Visual Presentation on Clustering Data in Oracle

I suppose it had to happen eventually but one of my presentations has ended up on YouTube. It’s a recent presentation I did for the Oracle Midlands user group in September.

The topic is (as the title of this blog post hints at!) Boosting select performance by clustering data. The video consists of the slides I presented, changing as the presentation progresses, with my audio over the top. It goes on for a bit, close to an hour, but you could watch a section and then go and do something else before watching a bit more.

I have to say, it is very odd hearing my voice (and the slight touch of the “brummie” {Birmingham} accent coming through) and I do wince at the places where I blather or say something slightly wrong or make a joke that involved a visual element that is lost. Oh well, at least you don’t see me wandering around and jumping up, literally, to point out bits on the slides.

I’m glad to say I will be repeating a slightly more polished version of the presentation at this year’s UKOUG Tech14 conference in December. I was a bit under the weather back on September the 16th, due to having just returned to the Working Life, and with this perfect example of what I did then, I should be able to make the next shot at it a cracker… maybe.

On the topic of Oracle Midlands, I like this user group, which is run by Mike Mckay Dirden; the meetings consist of evening presentations in Birmingham with a bit of support from Redgate. This includes half-time samosas to keep you going. The next meeting is described here and is on Tuesday 25th November. This meeting has two presentations by my friend Bjoern Rost, who is an Oracle Ace Director {gasps of appreciation from the audience} and a consummate presenter. I wish I could make it there as I would like to share a pint with Bjoern (well, he’ll have wine or a cocktail I suspect as he is not a beer fan) as well as some of my other friends up that part of the country.