Who's online

There are currently 0 users and 41 guests online.

Recent comments

Oakies Blog Aggregator

AskTom TV

Sometimes at conferences I get asked – “How do you know all of the answers for the AskTom questions?”

I’d love to say “Because we’re super smart” Smile but the reality is, we’re just like anyone else with a passion for database technology.  We like to research things, apply our skills and use our experience to solve problems.

So I’ve realised that even though we try to give as much detail as possible when we are helping the AskTom community, it might not always be immediately apparent what thought process we followed when tackling a problem.

To help with that, we now have “AskTom TV”.   It’s not really a TV show Smile but what we’re doing is taking a sample AskTom question, and talking about how we tackled a problem, rather than just jumping straight to the solution.

The first two episodes are here, and there will be more to come throughout 2017.



This is a note I wrote a couple of years ago, but never published. Given the way it’s written I think it may have been the outline notes for a presentation that I was thinking about rather than an attempt to write a little essay. Since it covers a number of points that are worth considering and since I’ve just rediscovered it by accident I thought I’d publish it pretty much as is. Many of the examples of change are now quite old – the intent was to demonstrate how to be cautious rather than trying to supply every possible change that might your next upgrade.

We start with a couple of

  • The effort worth spending to minimise the risk of performance-related surprises on an upgrade depends on the how critical the system is.
  • The effort needed to minimise the risk of performance-related surprises on an upgrade depends on how complex the system is.
  • The more “rare” features and “cunning strategies” and the greater the level of concurrent activity the more likely you are to find surprising effects.

Typical causes of problems are:

  • New automatic jobs installed by Oracle, which might cause extra load during overnight batch tasks
    • e.g. automatic stats collection (10g)
    • Automatic evolution of baselines (12c)
  • Changes to existing packaged procedures
    • e.g. switch to atomic refresh of MVs (11g)
      • changed the time take to do the refresh itself and added a new load to the redo log activity
  • automatic histograms (10g)
    • changed the time taken to collect stats
    • changed lots of execution plans as a side effect
    • changed the behaviour of cursor_sharing=similar as a side effect,
      • which increased CPU and library cache loading
  • “Notable changes” in behaviour
    • e.g. The standard audit trail no longer updates aud$ rows, it always inserts new records
      • This had side effects on redo generation
      • This also allowed Oracle to drop an index, with side effects on typical DBA queries
    • Oracle also introduced a timestamp with time zone as the “audit timestamp”
      • This had a catastrophic side effects on a particular OEM “failed logins” query
  • New optimizer  features appear
    • This can introduce lots of new plans – and a few may perform badly for your data set
  • Optimizer bugs will be fixed
    • This can also introduce new plans – and a few may perform badly for your data set
  • New stats collection mechanisms
    • e.g. approximate_ndv with auto_sample_size (11g)
    • If introduced by default you could get unlucky
      • (See histogram comment above, for example).
    • If not introduced by default (now) you might want to test and adopt them on the upgrade
    • This may require changes to your current coding, and checks for bad changes in plans.
  • Concurrency issues
    • Could introduce side effects in locking, latching, mutexes on busy systems
    • Very hard to test without knowing the system
    • e.g. partition exchange and query optimisation colliding (various changes in 11g)

Background reading which can prompt you to look at particular features of your application that might be at risk.

  • Read the “New Features” guide
  • Read the “Changes in this release” section of the various reference guides
  • Check MoS for upgrade notes for the version you are aiming at
  • Search MoS for any documents with the text: “Notable changes {in|of} behaviour”
  • Check MoS for bug fix notes for later releases (e.g. 12.1 if upgrading to 11.2)

Test as much as possible, and as realistically as possible. You could take advantage of flashback database to repeat large test suites starting from the same point in time with the same data set, but making changes to database structures or code strategies between tests. Proper concurrency tests are ultimately going to be the hardest things to do right – but you have to work on it (or understand Oracle’s architecture really well) if you are upgrading a highly concurrent system.


Scripts to Download Oracle Database 12c Release 2 Documentation

In the past (here, here and here) I already shared with you the scripts I use to download the Oracle Database documentation. The aim of this short post is to reference the scripts I just wrote for the latest version: 12c Release 2.

Happy downloading as well as happy reading!

How to Avoid Disasters with Delphix

Delphix focuses on virtualizing non-production environments, easing the pressure on DBAs, resources and budget, but there is a second use case for product that we don’t discuss nearly enough.

Protection from data loss.

Jamie Pope, one of the great guys that works in our pre-sales engineering group, sent Adam and I an article on one of those situations that makes any DBA, (or an entire business, for that matters) cringe. was performing some simple maintenance and someone deleted the wrong directory, removing over 300G of production data from their system.  It appears they were first going to use PostgreSQL “vacuum” feature to clean up the database, but decided they had extra time to clean up some directories and that’s where it all went wrong.  To complicate matters, the onsite backups had failed, so they had to go to offsite ones, (and every reader moans…)

Even this morning, you can view the tweets of the status for the database copy and feel the pain of this organization as they try to put right the simple mistake. 300w, 768w, 1184w" sizes="(max-width: 464px) 100vw, 464px" data-recalc-dims="1" />

Users are down as they work to get the system back up.  Just getting the data copied before they’re able to perform the restore is painful and as a DBA, I feel for the folks involved: 300w, 768w, 1188w" sizes="(max-width: 464px) 100vw, 464px" data-recalc-dims="1" />

How could Delphix have saved the day for GitLab?  Virtual databases, (VDBs) are read/write copies and derived from a recovered image that is compressed, duplicates removed and then kept in a state of perpetual recovery having the transactional data applied in a specific interval, (commonly once every 24 hrs) to the Delphix Engine source.  We support a large number of database platforms, (Oracle, SQL Server, Sybase, SAP, etc) and are able to virtualize the applications that are connected to them, too.  The interval of how often we update the Delphix Engine source is configurable, so depending on network and resources, this interval can be decreased to apply more often, depending on how up to date the VDBs need to be vs. production.

With this technology, we’ve come into a number of situations where customers suffered a cataclysmic failure situation in production.  While traditionally, they would be dependent upon a full recovery from a physical backup via tape, (which might be offsite) or scrambling to even find a backup that fit within a backup to tape window, they suddenly discovered that Delphix could spin up a brand new virtual database with the last refresh before the incident from the Delphix source and then use a number of viable options to get them up and running quickly.

  1. Switch the users and application to point to the new VDB that was recovered to the point in time, (PIT) before the incident occurred.  Meanwhile, IT is able to take their time recovering the production database with the physical backup, with little outage to the business.
  2. Create a VDB to the PIT before the failure and then create a connection between the production and the VDB, making a copy back to production of the data that was lost.
  3. If there was dire loss, (i.e. disk, etc.)  create a VDB to the PIT before the failure and perform what’s called a V2P, or virtual to physical, rehydrating the virtual data to become the new physical database.

This is the type of situation happens more often then we’d like to admit.  Many times resources have been working long shifts and make a mistake due to exhaustion, other times someone unfamiliar and with access to something they shouldn’t simply make a dire mistake, but these things happen and this is why DBAs are always requesting two or three methods of backups.  We learn quite quickly we’re only as good as our last backup and if we can’t protect the data, well, we won’t have a job for very long.

Interested in testing it out for yourself?  We have a really cool free Delphix trial via Amazon cloud that uses your AWS account.  There’s a source host and databases, along with a virtual host and databases, so you can create VDBs, blow away tables, recovery via a VDB, create a V2P, (virtual to physical) all on your own.


Tags:  , ,





Copyright © DBA Kevlar [How to Avoid Disasters with Delphix], All Right Reserved. 2017.

The post How to Avoid Disasters with Delphix appeared first on DBA Kevlar.

My New Role at Delphix

I’ve been at Delphix for just over six months now.  In that time, I was working with a number of great people on a number of initiatives surrounding competitive, the company roadmap and some new initiatives.  With the introduction of our CEO, Chris Cook, new CMO, Michelle Kerr and other pivotal positions within this growing company, it became apparent that we’d be redirecting our focus on Delphix’s message and connections within the community.

I was still quite involved in the community, even though my speaking had been trimmed down considerably with the other demands at Delphix.  Even though I wasn’t submitting abstracts to many of the big events I’d done so in previous years, I still spoke at 2-3 events each month during the fall and made clear introductions into the Test Data Management, Agile and re-introduction into the SQL Server communities.

As of yesterday, my role was enhanced so that evangelism, which was previously 10% of my allocation, is now going to be upwards of 80% as the Technical Evangelist for the Office of the CTO at Delphix.  I’m thrilled that I’m going to be speaking, engaging and blogging with the community at a level I’ve never done before.  I’ll be joined by the AWESOME Adam Bowen, (@CloudSurgeon on Twitter) in his role as Strategic Advisor and as the first members of this new group at Delphix.  I would like to thank all those that supported me to gain this position and the vision of the management to see the value of those in the community that make technology successful day in and day out.

I’ve always been impressed with the organizations who recognize the power of grassroots evangelism and the power it has in the industry.  What will I and Adam be doing?  Our CEO, Chris Cook said it best in his announcement:

As members of the [Office of CTO], Adam and Kellyn will function as executives with our customers, prospects and at market facing events.  They will evangelize the direction and values of Delphix; old, current, and new industry trends; and act as a customer advocate/sponsor, when needed.  They will connect identified trends back into Marketing and Engineering to help shape our message and product direction.  In this role, Adam and Kellyn will drive thought leadership and market awareness of Delphix by representing the company at high leverage, high impact events and meetings. []

As many of you know, I’m persistent, but rarely patient, so I’ve already started to fulfill my role and be prepared for some awesome new content, events that I’ll be speaking at and new initiatives.  The first on our list was releasing the new Delphix Trial via the Amazon Cloud.  You’ll have the opportunity to read a number of great posts to help you feel like an Amazon guru, even if you’re brand new to the cloud.  In the upcoming months, watch for new features, stories and platforms that we’ll introduce you to. This delivery system, using Terraform, (thanks to Adam) is the coolest and easiest way for anyone to try out Delphix, with their own AWS account and start to learn the power of Delphix with use case studies that are directed to their role in the IT organization.

Tags:  ,





Copyright © DBA Kevlar [My New Role at Delphix], All Right Reserved. 2017.

The post My New Role at Delphix appeared first on DBA Kevlar.

The Delphix Trial on AWS- Get It While Its HOT!

I don’t want to alarm you, but there’s a new Delphix trial on AWS!  It uses your own AWS account and with a simple set up, allows you to deploy a trial Delphix environment.  Yes, you hear me right-  just with a couple steps, you could have your own setup to work with Delphix!

There’s documentation to make it simple to deploy, simple to understand and then use cases for individuals determined by their focus, (IT Architect, Developer, Database Administrator, etc.) 300w" sizes="(max-width: 600px) 100vw, 600px" data-recalc-dims="1" />

This was a huge undertaking and I’m incredibly proud of Delphix to be offering this to the community!

So get out there and check this trial out!  All you need is an AWS account on Amazon and if you don’t have one, it only takes a few minutes to create one and set it up, just waiting for a final verification before you can get started!  If you have any questions or feedback about the trial, don’t hesitate to email me at dbakevlar at gmail.

Tags:  , ,





Copyright © DBA Kevlar [The Delphix Trial on AWS- Get It While Its HOT!], All Right Reserved. 2017.

The post The Delphix Trial on AWS- Get It While Its HOT! appeared first on DBA Kevlar.


I’ve written a couple of articles in the past about the problems of ASSM spending a lot of time trying to find blocks with usable free space. Without doing a bit of rocket science with some x$ objects, or O/S tracing for the relevant calls, or enabling a couple of nasty events, it’s not easy proving that ASSM might be a significant factor in a performance problem – until you get to 12c Release 2 where a staggering number of related statistics appear in v$sysstat.

I’ve published the full list of statistics (without explanation) at the end of this note, but here’s just a short extract showing the changes in my session’s ASSM stats due to a little PL/SQL loop inserting 10,000 rows, one row at a time into an empty table with a single index:

Name                                  Value
----                                  -----
ASSM gsp:get free block                 185
ASSM cbk:blocks examined                185
ASSM gsp:L1 bitmaps examined            187
ASSM gsp:L2 bitmaps examined              2
ASSM gsp:Search hint                      2
ASSM gsp:good hint                      185

It looks like we’ve checked a couple of “level 2” bitmap blocks (one for the table, one for the index, presumably) to pick a sequence of “level 1” bitmap blocks that have been very good at taking us to a suitable data (table or index) block that can be used.

You might have expected to see numbers more like 10,000 in the output, but remember that PL/SQL has lots of little optimisations built into it and one of those is that it pins a few blocks while the anonymous block is running so it doesn’t have to keep finding blocks for every single row.

In comparison here’s the effect of the same data load when operated as 10,000 separate insert statements called from SQL*Plus:

Name                                  Value
----                                  -----
ASSM gsp:get free block              10,019
ASSM cbk:blocks examined             10,220
ASSM cbk:blocks marked full             201
ASSM gsp:L1 bitmaps examined         10,029
ASSM gsp:L2 bitmaps examined              6
ASSM gsp:L2 bitmap full                   1
ASSM gsp:Search all                       1
ASSM gsp:Search hint                      2
ASSM gsp:Search steal                     1
ASSM gsp:bump HWM                         1
ASSM gsp:good hint                   10,016
ASSM rsv:fill reserve                     1

It’s interesting to note that in this case we see (I assume) a few cases where we’ve done the check for an L1 bitmap block, gone to a data blocks that was apparently free, and discovered that our insert would make to over full – hence the 201 “blocks marked full”.

Critically, of course, this is just another of the many little indications of how “client/server” chatter introduces lots of little bits of extra work when compared to the “Thick DB “ approach.

One final set of figures. Going back to an example that first alerted me to the type of performance catastrophes that ASSM could contribute to, I re-ran my test case on 12.2 and checked the ASSM figures reported. The problem was that a switch from a 4KB or 8KB blocks size to a 16KB bblock size produced a performance disaster. A version of my  test case and some timing results are available on Greg Rahn’s site.

In my test case I have 830,000 rows and do an update that sets column2 to column1 changing it from null to an 8-digit value. With a 16KB block size and PCTFREE set to a highly inappropriate value (in this case the default value of 10) this is what the new ASSM statistics looks like:

Name                                  Value
----                                  -----
ASSM gsp:get free block             668,761
ASSM cbk:blocks examined            671,404
ASSM cbk:blocks marked full           2,643
ASSM gsp:L1 bitmaps examined      1,338,185
ASSM gsp:L2 bitmaps examined        672,413
ASSM gsp:Search all                     332
ASSM gsp:Search hint                668,760
ASSM gsp:Search steal                   332
ASSM gsp:bump HWM                       332
ASSM wasted db state change         669,395

I’d love to know what the figures would have looked like if they had been available in the original Oracle case (my guess is that the “blocks examined” statistic would have been in the order of hundreds of millions); they look fairly harmless in this case even though the database (according to some of the other instance activity stats) did roughly 10 times the work you might expect from a perfect configuration.

Even here, though, where the original catastrophic bug has been addressed, the ASSM stats give you an important clue: we’ve been doing a simple update so why have we even been looking for free space (get free block); even stranger, how come we had to examine 1.3M L1 bitmaps when we’ve only updated 830,000 rows surely the worst case scenario shouldn’t have been worse that 1 to 1; and then there’s that “wasted db state change” – I don’t understand exactly what that last statistic is telling me but when I’m worried about performance I tend to worry about anything that’s being wasted.

In passing – if you want to insert a single row into an unindexed table you can expect Oracle to examine the segment header, then an L2 bitmap block, then an L1 bitmap block to find a data block for the insert. (In rare cases that might be segment header, L3, L2, L1). There are then optimisation strategies for pinning blocks – the session will pin the L1 bitmap block briefly because it may have to check several data blocks it references because they may be full even though they are flagged as having space; similarly the session will pin the L2 bitmap block because it may need to mark an L1 bitmap block as full and check another L1 block. The latter mechanism probably explains why we have examined more L1 bitmaps than L2 bitmaps.

Finally, the full monty

Just a list of all the instance statistics that start with “ASSM”:

ASSM bg: segment fix monitor
ASSM bg:create segment fix task
ASSM bg:mark segment for fix
ASSM bg:slave compress block
ASSM bg:slave fix one segment
ASSM bg:slave fix state
ASSM cbk:blocks accepted
ASSM cbk:blocks examined
ASSM cbk:blocks marked full
ASSM cbk:blocks rejected
ASSM fg: submit segment fix task
ASSM gsp:Alignment unavailable in space res
ASSM gsp:L1 bitmaps examined
ASSM gsp:L2 bitmap full
ASSM gsp:L2 bitmaps examined
ASSM gsp:L3 bitmaps examined
ASSM gsp:Optimized data block rejects
ASSM gsp:Optimized index block rejects
ASSM gsp:Optimized reject DB
ASSM gsp:Optimized reject l1
ASSM gsp:Optimized reject l2
ASSM gsp:Search all
ASSM gsp:Search hint
ASSM gsp:Search steal
ASSM gsp:add extent
ASSM gsp:blocks provided by space reservation
ASSM gsp:blocks rejected by access layer callback
ASSM gsp:blocks requested by space reservation
ASSM gsp:bump HWM
ASSM gsp:get free block
ASSM gsp:get free critical block
ASSM gsp:get free data block
ASSM gsp:get free index block
ASSM gsp:get free lob block
ASSM gsp:good hint
ASSM gsp:reject L1
ASSM gsp:reject L2
ASSM gsp:reject L3
ASSM gsp:reject db
ASSM gsp:space reservation success
ASSM gsp:use space reservation
ASSM rsv:alloc from reserve
ASSM rsv:alloc from reserve fail
ASSM rsv:alloc from reserve succ
ASSM rsv:clear reserve
ASSM rsv:fill reserve
ASSM wasted db state change

An introduction to Performance Co Pilot, part 2.

This second blogpost on Performance Co Pilot or PCP in short is about visualisation. In case you haven’t read the first part, here it is, which describes how it works, why you should use it, and how you can install it.

One way of visualising PCP is using the pmchart utility. The pmchart utility is installed via the pcp-gui package (yum install pcp-gui). The pmchart utility uses X to display a window and draw lines, bar graphs, area graphs, etc. in a sense alike the performance manager on Microsoft Windows. You can select the individual performance statistics PCP measures, which can be viewed and investigated with pminfo, for example kernel.all.cpu.user:

# pminfo -dT kernel.all.cpu.user

    Data Type: 64-bit unsigned int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: counter  Units: millisec
total user CPU time from /proc/stat for all CPUs, including guest CPU time

Use ‘pminfo’ without any argument to get a list of available statistics.

When I tried the pmchart utility on OL7.2, I needed to install the dejavu-sans fonts in order not to get black squares as letters in the pmchart window (yum install dejavu-sans-fonts). Also, there are pre-created ‘views’ for filesystem, load, memory, etc. (file, open view). This is very nice, it requires you to log on to a host to investigate with X tunnelled to your system (or have VNC setup), start pmchart and select metrics or open a view, and then record them. It doesn’t seem to be possible to use the archives pmlogger creates for historical analysis, nor it doesn’t seem to be able to save the recorded data, only the graphics that are created based on the selected statistics or using a view as a picture.
Note: upon reading the documentation, it turns out it is possible to use archives. This means you can pick up an archive from a machine and load it using pmchart locally.

However, there are different ways of investigating a host via PCP statistics… The pcp-webapi package provides exactly what the name suggests: a deamon that externalises PCP data via a web based (REST) API. Here’s how that installed:
There is a caveat: the pcp-webapi package is in the ‘optional latest’ repository, which might not be enabled. You can use ‘yum repolist all’ to see all configured repositories.

# yum install --enablerepo="ol7_optional_latest" pcp-webapi

After installing, you need to enable automatic startup (using systemctl in OL7) and start it:

# systemctl enable pmwebd
# systemctl start pmwebd

At this point I need to point out that starting pmwebd means a daemon starts at port 44323 for HTTP requests, which does not provide any authentication. If the host you are using is directly connected to the internet, you should protect this port and do not allow access. The best to use this port is using a tunnel.

This, however, just enables the API, it doesn’t provide anything that displays data useful for a person. That is done when you add the pcp-webjs package:

# yum install --enablerepo="ol7_optional_latest" pcp-webjs

This installs 3 webapps: vector, graphite and grafana.

Vector ( is a web based application that is written in javascript that fetches current data from the pcp webapi and creates graphs from it. You can use the vector app that is installed as part of the pcp-webjs package together with PCP, or start it up from somewhere else, as an example, the documentation mentions putting the app in an Amazon S3 bucket. If it’s started from somewhere else, you need to point vector to the PCP webapi of the machine you want to investigate because your browser, which runs vector, needs to have access to port 44323 to fetch data from the webapi. To use vector that is installed with PCP using the pcp-webjs package, point your browser to http://hostname:44323/vector and specify the hostname or ip address at “Hostname”. This is how that looks like:
vector 600w, 150w" sizes="(max-width: 300px) 100vw, 300px" />
Vector displays PCP measured statistics about CPU, memory, network, disk and even about containers, with an high resolution (2 seconds by default).

Grafana ( installed with the webjs package is a web based application that takes an approach opposite to vector, it fetches the data from the archives created by pmlogger and displays the statistics stored in these. The default dashboard displays the 1 minute load average, network I/O, disk reads and writes, memory statistics and filesystem fullness. The different colours in the graphs over time indicate different PCP archives being used, because by default PCP archives store data per day. To use grafana, point your browser to http://hostname:44323/grafana. This is how that looks like:
grafana 600w, 150w" sizes="(max-width: 300px) 100vw, 300px" />

Oracle Database Cloud (DBaaS) Performance - Part 3 - Storage - 12.2 Update

Recently I repeated the I/O related tests on a instance for curiosity and was surprised by the fact that I consistently got significantly better results as on and

Now you're probably aware that the version 12.2 so far is "cloud-only", so I can't tell / test whether the version 12.2 is generically providing that increased performance or whether Oracle has optimized the underlying stack, so that previous versions in general could also benefit from better performance if they ran on the same platform. Repeated tests with versions and confirmed the performance figures reported in the previous installment of this series, so as of the time of writing it's only the version 12.2 that provides the improved I/O performance.

Note that as of the time of writing only a single instance configuration was supported with version 12.2, so I wasn't able to run the tests in RAC configuration.

Here are the 12.2 I/O related test results:


Again running the test on a 4 OCPU single instance configuration (results in 8 CPUs / 8 cores as outlined previously) with eight sessions:

So that is more than 65,000 IOPS on average at 8 KB block size, significantly more than the corresponding (and already impressive) 40,000 IOPS seen in the previous versions, and even at 16 KB block size still more than 50,000 IOPS - that is more than 800 MB / sec in single block requests!

I/O Throughput

Repeating the same Parallel Execution based test that first creates a very large table (8 KB block size, "direct path write") and then re-reads it using Parallel Query ("direct path read") I got the following results on the 4 OCPU single instance configuration:

Again the results are significantly improved over previous versions. The read performance improved from 640 MB / sec to almost 940 MB / sec. More importantly however the write performance improved from 120 MB / sec to 200 MB / sec, a performance that should allow even more write intensive workloads to perform well.

I/O Latency

The same test as previously was run, switching from asynchronous I/O ("db file parallel read") to synchronous I/O ("db file sequential read") allowing measurement of single block reads, running with 8 threads on the 4 OCPU / 8 CPU / cores configuration at 8 KB block size.

Again an significant improvement across all figures, 0.375 ms average wait time vs. 0.45 ms previously. almost 19,000 IOPS vs. 16,380 IOPS and more than 90% of the waits within 512 microseconds vs. 75% before.

Write Performance

With the significantly improved write throughput figures the expectation was that the random read + write test would perform much better than before, and it did:

In previous versions this test waited almost 50% on "free buffer waits" (due to the minimized buffer cache), so clearly the DBWR couldn't keep up with writing dirty blocks to disk. The picture has changed here significantly, with "free buffer waits" going down to just 4.5% of the overall database time, and performing almost 5,000 write requests per second (and almost 20,000 IOPS in total).

Note that the version 12.2 obviously introduces an optimization that treats repeated modifications of the same kind to the same block (in this case here update a column of a row to the same value as before) not as a "db block change", therefore I had to adjust the test to update the column value to a different value each time. Running the original test case showed a totally different profile, due the minimized "db block changes".


The 12.2 version in the Oracle Cloud shows a significantly improved I/O performance profile compared to previous versions. So far it is unclear whether that is a generic improvement of the new release, or the underlying stack used by the virtual machines has been optimized.

In particular the improved write performance is an important improvement.

RMOUG Training days 2017

I will be speaking about the following topics in Rocky Mountain Oracle User group Training days (RMOUG, Denver) February 7-9, 2017.

Come to my presentations and say Hi to me </p />

    	  	<div class=