Top 60 Oracle Blogs

Recent comments

Oakies Blog Aggregator

Running pgBench on YugaByteDB 1.3

Running pgBench on YugaByte DB 1.3

My first test on this Open Source SQL distributed database.

Did you hear about YugaByteDB, a distributed database with an architecture similar to Google Spanner, using PostgreSQL as the query layer?

I started to follow when I’ve heard that Bryn Llewellyn, famous PL/SQL and EBR product manager, left Oracle to be their developer advocate. And YugaByteDB got more attention recently when announcing that their product license is now 100% Open Source.

I like to learn new things by trying and troubleshooting rather than reading the documentation. Probably because there’s more to learn aside of the documentation path. And also because troubleshooting is fun. So, one of the great features is that the query layer is compatible with PostgreSQL. Then I’ll try to run pgBench on YugaByteDB.

It is important to mention here that I’m running all nodes in a single lab VM, so performance is not representative. And I’m testing the YSQL query layer which is still in beta. The goal is to discover and learn about the distributed database challenges, rather than evaluating the product.

Install YugaByteDB

Nothing is easier than the installation of YugaByteDB, all documented:

Install YugaByte DB | YugaByte DB Docs

I install it in RHEL 7.6 (Actually OEL 7.6 as I’m running my lab in an Oracle Cloud VM). The install is just an un-tar followed by the post-install which patches the binaries so that we don’t have to set LD_LIBRARY_PATH. I set PATH to the YugaByte bin directory:

wget -O- | tar -xvf - 
export PATH=$PATH:$PWD/yugabyte-

I create a 3 nodes cluster on this host:

yb-ctl --rf 3 create
yb-ctl status

Those nodes are created as to here. I access remotely and I tunnel the interesting ports with my ssh_config file:

Host yb
User opc
ForwardX11 yes
DynamicForward 8080
LocalForward 15433
LocalForward 25433
LocalForward 35433
LocalForward 19042
LocalForward 29042
LocalForward 39042
# Web UI
LocalForward 17000
LocalForward 27000
LocalForward 37000

I create a database. The “psql” equivalent here is “ysqlsh”:

\timing on
drop database if exists franck;
create database franck;

ysqlsh (11.2)

Install PgBench

As I tunneled the 5433 port I can use pgBench from my laptop:

pgbench --host localhost --port 15433 --username postgres franck

But the full postgres installation is also there in ./yugabyte-

then, I add this in my path:

export PATH=$PATH:$PWD/yugabyte-$PWD/yugabyte-

Initialize pgBench

I run the initialization:

pgbench --initialize --host localhost -p 5433 -U postgres franck

ERROR: DROP multiple objects not supported yet

Ok, I have an error because pgBench uses the multi-table drop statement which is not supported yet. But what’s really nice is the error message containing a link to the GitHub issue about this.

No problem, I don’t need to drop the tables and pgBench has an --init-steps option to choose the steps: drop tables, table creation, generate data, vacuum, primary key creation, foreign key creation.

pgbench --initialize --init-steps=tgvpf -h localhost -p 5433 -U postgres franck

ERROR: VACUUM not supported yet
HINT: Please report the issue on

There’s no FILLFACTOR and no VACUUM per-se (the storage engine has a transparent garbage collector). YugaByteDB uses the PostgreSQL query layer, but not the same storage layer. My tables are created and data is generated:

select count(*) from pgbench_branches;
select count(*) from pgbench_accounts;
select count(*) from pgbench_tellers;
select count(*) from pgbench_history;

Let’s continue without vacuum, only the primary and foreign key definition:

pgbench --initialize --init-steps=pf -h localhost -p 5433 -U postgres franck

ERROR: This ALTER TABLE command is not yet supported.

pgBench adds the constraints with an ALTER TABLE but YugaByteDB supports only inline declaration in the CREATE TABLE. You can check the DDL from a PostgreSQL database (initialize with the ‘foreign key’ step which is not the default):

pg_dump --schema-only -h localhost -p 5433 -U postgres franck

Basically, here is what is missing in my YugaByteDB database:

alter table pgbench_branches add primary key (bid);
"alter table pgbench_tellers add primary key (tid)",
"alter table pgbench_accounts add primary key (aid)"
"alter table pgbench_tellers add constraint pgbench_tellers_bid_fkey foreign key (bid) references pgbench_branches",
"alter table pgbench_accounts add constraint pgbench_accounts_bid_fkey foreign key (bid) references pgbench_branches",
"alter table pgbench_history add constraint pgbench_history_bid_fkey foreign key (bid) references pgbench_branches",
"alter table pgbench_history add constraint pgbench_history_tid_fkey foreign key (tid) references pgbench_tellers",
"alter table pgbench_history add constraint pgbench_history_aid_fkey foreign key (aid) references pgbench_accounts"

Re-Create with Primary and Foreign Keys

Finally, here is what I want to run to get everything in a supported way:

ysqlsh franck
 drop table if exists pgbench_history;
drop table if exists pgbench_tellers;
drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
CREATE TABLE pgbench_branches (
bid integer NOT NULL
,bbalance integer
,filler character(88)
,CONSTRAINT pgbench_branches_pkey PRIMARY KEY (bid)
CREATE TABLE pgbench_accounts (
aid integer NOT NULL
,bid integer references pgbench_branches
,abalance integer
,filler character(84)
,CONSTRAINT pgbench_accounts_pkey PRIMARY KEY (aid)
CREATE TABLE pgbench_tellers (
tid integer NOT NULL
,bid integer references pgbench_branches
,tbalance integer
,filler character(84)
,CONSTRAINT pgbench_tellers_pkey PRIMARY KEY (tid)
CREATE TABLE pgbench_history (
tid integer references pgbench_tellers
,bid integer references pgbench_branches
,aid integer references pgbench_accounts
,delta integer
,mtime timestamp without time zone
,filler character(22)

This creates the tables without any error. Next step is to generate data

Generate data

Now the only “ --initialize” step I have to do is the generation of data.

Note that this step is doing the truncate with multi-table syntax and this one is already implemented.

So here is the “generate data” step:

pgbench --initialize --init-steps=g -h localhost -p 5433 -U postgres franck

ERROR: Operation only supported in SERIALIZABLEisolation level

PostgreSQL runs by default in “read committed” isolation level. As the GitHub issue mentions, YugaByteDB support for Foreign Keys requires “Serializable”. The reason is that the storage engine (DocDB) has no explicit row locking yet to lock the referenced row. But with referential integrity, inserting in a child table must lock the parent row in share mode (like a SELECT FOR KEY SHARE) to ensure that it is not currently being deleted (or the referenced columns updated). Then the no-lock solution is to run in a true Serializable isolation level.

About the support of Foreign Keys, read Bryn Llewellyn blog post:

Relational Data Modeling with Foreign Keys in a Distributed SQL Database - The Distributed SQL Blog

Note that I said “true” Serializable isolation level because I’m used to Oracle Database where this term is used for “Snapshot Isolation”- more about this:

Oracle serializable is not serializable - Blog dbi services

Serializable transaction isolation

So, the current transaction isolation level is “Read Committed” where repeatable reads, which is required by foreign keys, needs SELECT FOR KEY SHARE:

ysqlsh franck
select current_setting('transaction_isolation');

Then, in order to be able to insert in a table that has some foreign keys, I set the default isolation level to Serializable for my database, and re-connect to check it:

ysqlsh franck
alter database franck set default_transaction_isolation=serializable;
\c franck postgres localhost 5433
select current_setting('transaction_isolation');

Ok, let’s try to generate data now:

pgbench --initialize --init-steps=g -h localhost -p 5433 -U postgres franck

That is taking a long time so I attach the debugger on it:

gdb $(pgrep pgbench)

pgbench.c:3704 is the call to PQendcopy() which is waiting for the asynchronous COPY completion, COPY is used by pgBench for the large table “pg_accounts”. On YugaByteDB side, it seems that only one Tablet Server is doing the work, 100% in CPU:

The Tablet Server is running code from (the YugaByteDB storage engine, DocDB, is based on RocksDB):

perf top

While I was waiting I generated a Brendan Gregg flamegraph on the busy Tablet Server, but there’s no visible bottleneck.

sudo perf record -e cpu-cycles -o /tmp/perf.out -F 99 -g
git clone
sudo perf script -i /tmp/perf.out | ./FlameGraph/ | ./FlameGraph/ /dev/stdin > /tmp/perf.folded.svg

This goes beyond the goal of this post and, anyway, looking at performance is probably not relevant in this version. [21-JUL-2019: the support for COPY has just been added since this test - more info here]

Finally, the initialization of 100000 tuples finished after one hour:

time pgbench --initialize --init-steps=g -h localhost -p 5433 -U postgres franck

So, finally, I have my pgBench schema loaded, with all referential integrity and data.

pgBench simple-update in single-session

I’m running the “simple-update” which basically updates the “abalance” for a row in “pgbench_accounts”, and inserts into “pgbench_history”.

pgbench --no-vacuum --builtin=simple-update --protocol=prepared --time 30 -h localhost -p 5433 -U postgres franck

pgbench --no-vacuum --builtin=simple-update --protocol=prepared --time 30 -h localhost -p 5433 -U postgres franck

At least I know that this basic OLTP application can run without any change on YugaByteDB and that’s a very good point for application transparency. I’ll explain later why I start here the “simple update” workload rather than the default. The Transaction Per Second rate is not amazing (it is 150x higher on a plain PostgreSQL on the same platform), but this is not what I am testing here.

pgBench simple-update in multi-sessions

Obviously, a distributed database should be scalable when multiple users are working in parallel. For this I run pgBnech with 10 clients from 10 threads:

pgbench --no-vacuum --protocol=prepared --builtin=simple-update --time 30 --jobs=10 --client=10 -h localhost -p 5433 -U postgres franck

pgbench — no-vacuum — protocol=prepared — builtin=simple-update — time 30 — jobs=10 — client=10 -h localhost -p 5433 -U postgres franck

Good. I said that we should not look at the elapsed time, but the comparison with the previous run shows the scalability as 10 sessions can run about 10x more transactions per second. I run the 3 YugaByteDB nodes on a 24 core virtual machine here. The servers are multi-threaded:

nTH: number of threads (LWP), P: last used CPU (SMP)

pgBench TCPB-like in multi-sessions

Demystifying Benchmarks: How to Use Them To Better Evaluate Databases

Actually, the first test I did was the default pgBench workload, which is the “TPC-B (sort of)”. In addition to the “simple update”, each transaction updates, in addition to “pg_account”, the balance “pgbench_tellers” and “pgbench_branches”.

This is more tricky because multiple clients will have to concurrently update the same records. And there’s a high probability of contention given the cardinalities. There, pessimistic locking would be better than optimistic. I’m still with my Foreign Keys here and Serializable isolation level (which is optimistic locking).

pgbench --no-vacuum --protocol=prepared --builtin=simple-update --time 30 --jobs=10 --client=10 -h localhost -p 5433 -U postgres franck

ERROR: could not serialize access due to concurrent update

Quickly I can see that 9 out of the 10 clients failed with “Conflicts with higher priority transaction”. This is the equivalent to the PostgreSQL “ERROR: could not serialize access due to concurrent update”. With optimistic locking, the application must be ready to re-try a failed transaction, but pgBench has no option for that. This is not special to YugaByteDB: you will get the same in PostgreSQL when running pgBench default workload in a Serializable transaction isolation level.

Anyway, given the high probability of collision here, the solution is pessimistic locking.

Read Committed transaction isolation

If I set back the isolation level to “Read Committed” the UPDATE will use pessimistic locking, and then I expect the transactions to be serialized, waiting to see committed changes rather than failing when encountering a concurrent change.

ysqlsh franck
alter database franck set default_transaction_isolation='read committed';

But then, I cannot declare the foreign keys or I’ll get “ ERROR: Operation only supported in SERIALIZABLE isolation level” until the issue #1199 is fixed.

Without referential integrity constraints

I re-create the tables with the REFERENCES clause commented out:

ysqlsh franck
drop table if exists pgbench_history;
drop table if exists pgbench_tellers;
drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
CREATE TABLE pgbench_branches (
bid integer NOT NULL
,bbalance integer
,filler character(88)
,CONSTRAINT pgbench_branches_pkey PRIMARY KEY (bid)
CREATE TABLE pgbench_accounts (
aid integer NOT NULL
,bid integer --references pgbench_branches
,abalance integer
,filler character(84)
,CONSTRAINT pgbench_accounts_pkey PRIMARY KEY (aid)
CREATE TABLE pgbench_tellers (
tid integer NOT NULL
,bid integer --references pgbench_branches
,tbalance integer
,filler character(84)
,CONSTRAINT pgbench_tellers_pkey PRIMARY KEY (tid)
CREATE TABLE pgbench_history (
tid integer --references pgbench_tellers
,bid integer --references pgbench_branches
,aid integer --references pgbench_accounts
,delta integer
,mtime timestamp without time zone
,filler character(22)

And run the initialize again, which is much faster (2 minutes instead of 1 hour):

time pgbench --initialize --init-steps=g -h localhost -p 5433 -U postgres franck

Then, ready to run my 10 clients TPC-B workload:

pgbench --no-vacuum --protocol=prepared --builtin=simple-update --time 30 --jobs=10 --client=10 -h localhost -p 5433 -U postgres franck

Operation failed. Try again.: Conflicts with higher priority transaction
ERROR: Operation failed. Try again.: Conflicts with committed transaction
ERROR: Error during commit: Operation expired: Transaction expired

This is much better. Among the 23574 transactions, I got only 3 errors and 7 clients were still running concurrently. I’ve also run it with only one client where I had TPS=144 and here with 7 clients remaining we reach TPS=785.

Of course, I would expect no “Try again” error when in Read Committed isolation level. Here I have the 3 out of the 5 transactional errors we can get from libpq calls (according to yugabyte

  • Transaction expired
  • Conflicts with committed transaction
  • Conflicts with higher priority transaction
  • Restart read required
  • Value write after transaction start

But that’s probably for another post. The important outcomes from my very first test of YugaByteDB are:

  • It is very easy to install and test, so… try it (and you will get a link to get a nice T-Shirt shipped to your home)
  • I was able to run pgBench, a simple application written for PostgreSQL, without any change for YugaByteDB
  • Not all DDL is supported yet, but easy to workaround and follow the Git issue.
  • Foreign Keys are supported, which is a challenge for a distributed database.
  • Transaction concurrency is managed and the issues that will be fixed in later releases are clearly documented as Git issues


There are probably better ways, so please let me know (@FranckPachot). This is what I use when I want to get the SQL_ID and the PLAN_HASH_VALUE when looking at the SQL Plan Baselines.

The DBA_SQL_PLAN_BASELINES view does not provide them, probably because SQL Plan Management (SPM) is going from a statement and it’s execution plan to the SQL Plan Baselines, but doesn’t need to navigate in the other way. However, we need it when troubleshooting query performance.


I think I got this from It Tony Hasler “Expert Oracle SQL: Optimization, Deployment, and Statistics” book. There’s an internal function available since 11gR2 which calculates the SQL_ID from an SQL_TEXT (as null-terminated C string):

dbms_sqltune_util0.sqltext_to_sqlid(sql_text||chr(0)) sql_id


I lazily use DBMS_XPLAN.DISPLAY_SQL_PLAN_BASELINE for this one, as it displays a line with it:

( select to_number(regexp_replace(plan_table_output,'^[^0-9]*')) 
from table(
) where plan_table_output like 'Plan hash value: %') plan_hash_value


Here is an example where I query DBA_SQL_PLAN_BASELINE with those additional columns:

select dbms_sqltune_util0.sqltext_to_sqlid(sql_text||chr(0)) sql_id,
( select to_number(regexp_replace(plan_table_output,'^[^0-9]*'))
from table(dbms_xplan.display_sql_plan_baseline(sql_handle,plan_name))
where plan_table_output like 'Plan hash value: %') plan_hash_value
,dbms_xplan.format_time_s(elapsed_time/1e6) hours,creator,origin,created,last_modified,last_executed
from dba_sql_plan_baselines b
where sql_text like '%&sql_text_pattern.%'
order by sql_id,hours desc

For the ELAPSED_TIME, I use the FORMAT_TIME_S for pretty formatting.
More info:

Oracle numbers in K/M/G/T/P/E

So here is the output of the previous query. I have seen some SQL_ID executed with different PLAN_HASH_VALUE and I know which one was good or not. This helps me to know exactly which ones I want to accept or disable:

Reviving an iPad and On-Premises lesson 2.

<< Introducing I.T. to an Elderly Relative
<<<< Preparing the device
<<<<<< First Lessons, Frustrations, & Funny Stuff

In my previous post we finished with my mother having bricked her iPad – that is, having turned it onto a useless lump. So I drove up to see her again to sort it out. {BTW if you think I am being a bit mean to my mum – yes I am. But I do love her and in the end the iPad has resulted in us being much more in touch. But I think it helps to share the frustrations of getting someone utterly unused to technology on line}.

The first thing I did was to get Mum to turn it on and put in the password. Martin with an I (not a Y – “as some people spell it like that!”). Mum was, very slowly, putting in the password correctly. And then staring at the screen until it flashed up an error. She had forgotten about pressing DONE. Now, if she’d called me when she had started having trouble… The thing is, that is so true across all of IT support. If only people called up when they first had a problem or did something wrong (like deleted all those rows…). If you call up quickly, there is much more chance the problem will be solved quickly. Anyway, I digress. I now knew what had gone wrong, she may or may not type the password correctly but it was timing out each time. Of course, by this point the iPad would no longer respond to the correct password, it was locked out.

You may not know this but if an iPad is locked out as it thinks it might be stolen (password put in wrongly too many times), you can’t just factory reset it. At least, I could not and google-fu mostly confirmed this. You have to plug it into another device with iTunes on it. And you can’t just plug it into the device you set it up on and refresh it, even if you backed it up to this device. At least, I could not. Maybe I am not very good at this tech lark. You have to download the latest version of the OS to your device, plug the switched off iPad into your device, turn the iPad on and then press certain buttons on it in a given way within a 0.731 second window that occurs at an unspecified time after turning the device on. I don’t know how often I tried to get the sodding iPad into recovery mode and recognised by iTunes, but it sure as hell amused my Mum to watch me try. I then re-set-up the iPad to be the same, simple set-up I had done so before. See post 2 for some hints on that. All the time Mum was making snide comments about “how simple this all is, Martyn!”. I think she was having revenge. Sue was keeping out the way.

After all the issues with “Martin with a Y or I”, I set the pass code to be a number. Yes, it’s less secure but I have the Apple ID details for her account – if she loses the iPad I can either track it or wipe it remotely. But we were up and running again, we had a working iPad and on-premises lesson 2 could begin. I’m not sure either of us was 100% happy about this…

Mum wants 2 main things from “the interweb”. She wants to be able to contact me (and, I presume, her other Son and her daughter-in-law) and she wants to be able to look things up. If she can do the former than I can help, remotely, with the latter.

So I showed her how to use messenger to contact us again. It’s been a week or so since the last lesson so I knew she would need a reminder. I pointed at one of the various icons and asked her what it looked like “It’s a phone!” So what will it do? “I don’t know, you are supposed to tell me!”. If it looks like a phone, it’s probably… “{blank look}”. You pressed this by accident last week and it made you scream? “Oh, it’s a phone!”. Excellent, we gave it a quick go.

What about this one next to it? What does it look like? “A box and a little box”. Fair point, but it looks a but like a tv camera? She agreed. So, what will it do? “blank look”. You know this one, we tested this with Sue in the kitchen last week… “the kettle?!?”. It was like Star-Trek… “Oh yes, she appeared on the iPad and I could talk to her. It’s just like Star Trek!”. We tried that one too. All good.

OK, let’s re-visit sending messages and using the keyboard. I show her me sending her a couple of messages again. Enter some text, any text. Press the icon to send the message. “Which one”. The one next to the message, it looks like a plane. “Which message?” The one you just typed. “So I press this one {points to the enter key}” No! No, the blue plane one. “This one!” No!!! that is a phone symbol, I explained that one 5 minutes ago and you seem to have no trouble hitting that one despite that it is in utterly the wrong place and no where near the message. “What message?” THE ONE…..The one you just typed, there, the one that says ‘GFRYTSB’. “So I click on your name?” NO! NO! THE FUCKING PLANE! TAP THE FUCKING PLANE!!!!

She taps the plane.

It sends the message “Oh. It did that before. How do I know who it sent it to?” It sent it to the Pope. “Why did it send it to the Pope?” {sigh}. How many people did we set this up for? Me, Sue, Steve, no Pope. But you see my name at the top of the conversation? You know, third child your bore? The name right above all the other messages? It sent it to me.

“But there are three names {moves finger} over here”. THAT IS OVER THERE!!!!!!! You have spent 10 minutes calling me, star- treking me, seeing messages from me, who the hell do you think it sent the message to?!? “Susan?” {I’m losing it…}

OK, send me another message. You know it’s me, my picture and name is above the conversation. Here, look at my screen your picture and name is above *my* conversation and those are the messages you have sent me.

She types something.
and stares at the screen.
And stares at the screen…
And looks at my screen…
And back at her screen…
“It’s not sent! Has it gone to someone else?”
The. Plane. Tap the Plane.
‘Ping’ – “Ohh! you got the message! How does it know where you are?” The bloody computer pixies know. They track everyone in the world. “Can I message anyone in the world?” I lie a little and say no. only the people in the list. “Does it know Steve is in Wales?” Yes. Look, do you ask the phone how it knows where I am? “No, but this is not the phone”.

We exchange a few more messages for practice and then I get her to tap on the other names, to change conversations. She swaps to Sue and Mum sends her a couple of messages. Once again Mum is asking how the computer knows where Sue is. I point out that as Sue is in the room, the iPad can see her – and then realise that was a really stupid thing to say as Mum did not get the joke. “So it CAN’T message Steve if it can’t see him?” No, it can, it can message anyone on her list.”Shall I message him?” No, he lives in Wales, life is hard enough for him already.

It’s time to go home. I make mum turn the iPad off, turn it on, put in the code and send me a message. She’s got it. “What about the internet?” The internet is not ready for you yet Mum, that will be lesson three. Read the book I got you and give it a go if you like. You can’t break… Actually, just wait until I come back over.

I have to say, since then Mum has been able to message me without issue and can turn the iPad on and off with no trouble, so you do get there eventually.

But I do seem to be buying a lot more wine these days…

CockroachDB… true distributed system-of-record for the cloud

After exploring the analytics end of distributed NewSQL databases, I decided to explore true system-of-record in the cloud. Without a doubt, the future is moving to the cloud. CockroachDB is at the fore-front of this journey to be the backbone database for new micro-services in the cloud. Escape the legacy database rats nest that wasn’t designed to be truly distributed or in the cloud.

With CockroachDB, you can domicile data for regulatory reasons and ensure the best possible performance by accessing data locally. CockroachDB allows you to use create a SINGLE distributed database across multiple cloud providers and on-prem. Our self healing not only helps keep your data consistent, but it automates cloud-to-cloud migration without downtime.

cockroach geo-distributed cluster

It should be an exciting new journey. If you are interested, take a look at our architecture, download, and take the roach out for a spin!

Create an Oracle VM on Azure in Less than 5 Minutes

If there’s one thing I’ve been able to prove this week, it’s that even with the sweet 4G LTE, Wi-Fi setup in my RV, Montana still has the worst Wi-Fi coverage in the US.  Lucky for me, I work in the cloud and automate everything, because if there’s one thing I love about automating with scripts, is that I can build out a deployment faster and less resource intensive than anyone can from the portal.

Oracle Virtual Machines in Azure

When you build out an Azure VM, with Oracle, you’ll also need to have the supporting structure and a sufficiently sized additional disk for your database.  This can be a lot of clicks inside a portal, but from a script, a few questions and bam, you have everything you need.

One of my customers wanted a script to deploy ANY of the Oracle VM images from the catalog, so I created a new script to do so, (previous one was focused on just the 12c-18c database offerings.)  This required a switch from using the sku and the version to build the URN to pushing the URN directly to the customer and knowing this would be used by a more advanced DBA or Azure administrator to begin with, so although its still a simple script to use, the responses are more cryptic, (i.e. what Azure needs to build out under the covers.)

The deployment results is a separate resource group with an Oracle VM of any version available in the Azure catalog, two disks on drive types of available types, in any US location, along with all support resources: 300w, 768w, 1452w" sizes="(max-width: 650px) 100vw, 650px" />

The above is a 7.6 Red Hat OS installation, prepared for an Oracle database.  Note that the virtual network, etc. has been created as part of the script deployment.  The disks can be sized out from the initial deployment and depending on the version, (URN) chosen from the catalog, Oracle may already be installed.  I’m building out scripts to download and install Oracle on those that don’t have it for my next task..:)

Oh yeah and even on my slow network, it deployed in about 3 minutes for me.

It’s Yours

If you’re interested in working with this bash script to deploy an Azure VM either prepared for Oracle or with Oracle already installed on it, the can be pulled from Github.


Tags:  ,





Copyright © DBAKevlar [Create an Oracle VM on Azure in Less than 5 Minutes], All Right Reserved. 2019.

RBAL (ospid: nnn): terminating the instance due to error 27625 after patching Oracle Restart

I have come across an odd behaviour trying to patch an Oracle Restart environment to January 2019. Based on a twitter conversation this isn’t necessarily limited to my patch combination, there might be others as well. I have used opatchauto to apply patch 28813884 to both RDBMS and GRID homes plus its corresponding OJVM (Java) patch. Before diving into details, this is the environment I have been working with:

  • Oracle Restart with an earlier PSU/OJVM combo applied
  • Separation of duties with oracle as the RDBMS owner, and grid owning the GRID infrastructure installation. This is a key point!
  • Patches to be applied
    • Patch 28813884 (GI PSU)
    • Patch 28790654 (corresponding OJVM patch)
  • OPatch version as installed in the GRID|RDBMS home
    • opatchauto:
    • opatch:

Following the instructions in the readme file I upgraded OPatch to the required version. The file I downloaded was named I double-checked the readme, and to me this is the correct file. After upgrading OPatch in the RDBMS and GRID homes, I started patching.

After this process completed, I wanted to start the database in UPGRADE mode as required by the OJVM patch. This needs to be done via sqlplus since srvctl does not support an “upgrade” option in “start database”.

And that’s where it hit me: whenever I tried to open the database in UPGRADE mode, it threw an error:

 SQL*Plus: Release Production on Tue Jul 16 12:15:34 2019

 Copyright (c) 1982, 2014, Oracle.  All rights reserved.

 Connected to an idle instance.
 SQL> startup upgrade
 ORACLE instance started.
 Total System Global Area 1073741824 bytes
 Fixed Size                  2932632 bytes
 Variable Size             713031784 bytes
 Database Buffers          352321536 bytes
 Redo Buffers                5455872 bytes
 ORA-03113: end-of-file on communication channel
 Process ID: 22354
 Session ID: 14 Serial number: 41958

Looking at the alert log I can see that RBAL terminated the instance:

 2019-07-16 12:15:43.404000 -04:00
 NOTE: ASMB mounting group 1 (DATA)
 WARNING: cellinit.ora is missing. RBAL is terminating the instance.
 RBAL (ospid: 22332): terminating the instance due to error 27625
 System state dump requested by (instance=1, osid=22332 (RBAL)), summary=[abnormal instance termination].
 System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/ORCL/trace/ORCL_diag_22306_20190716121543.trc
 Dumping diagnostic data in directory=[cdmp_20190716121543], requested by (instance=1, osid=22332 (RBAL)), summary=[abnormal instance termination].
 Instance terminated by RBAL, pid = 22332
 2019-07-16 12:15:49.164000 -04:00

But since this is Oracle Restart, Clusterware will simply restart the database. Unless of course you’ve been changing the default behaviour. And funny enough, this works (see further down in the article as to why). Although I’d appreciate this in most cases, the automatic restart isn’t appropriate in my situation: when started by Clusterware, the database is not in upgrade mode:

 SQL> select status from v$instance;


Which is a problem for me. A MOS search about error 27625 didn’t reveal anything too useful, and it took me quite a while to realise the problem has to do with permissions. An Internet search finally gave me the right answer, a fellow blogger has pointed it out a little while ago …

So what exactly is the problem? The RDBMS “oracle” binary needs a specific set of permissions and ownership/group membership for Oracle Restart with separation of duties enabled. This is what it looked like before applying the patch:

$ ls -l $ORACLE_HOME/bin/oracle
-rwsr-s--x. 1 oracle asmadmin 324518992 Jul 16 12:15 /u01/app/oracle/product/

The file permissions are ‘6751’ with oracle owning the file but it belongs to the asmadmin group. Remember, I am using a separate user for Grid Infrastructure with its own specific operating system groups. After running opatch, this changed to:

$ ls -l $ORACLE_HOME/bin/oracle
-rwsr-s--x. 1 oracle oinstall 324518992 Jul 16 12:29 /u01/app/oracle/product/

Permissions remained the same, however the group changed from asmadmin to oinstall. Not quite what I had in mind, and it reproducibly causes instance crashes. There is a simple solution: make sure permissions are set correctly! Using “Database Creation on 11.2/12.1/12.2 Grid Infrastructure with Role Separation ( ORA-15025, KFSG-00312, ORA-15081 ) (Doc ID 1084186.1)” got me on the right track.

 $ ls -l $ORACLE_HOME/bin/oracle
 -rwsr-s--x. 1 oracle asmadmin 324518992 Jul 16 12:29 /u01/app/oracle/product/

 $ sqlplus / as sysdba

 SQL*Plus: Release Production on Tue Jul 16 12:34:45 2019

 Copyright (c) 1982, 2014, Oracle.  All rights reserved.

 Connected to an idle instance.

 SQL> startup upgrade
 ORACLE instance started.

 Total System Global Area 1073741824 bytes
 Fixed Size                  2932632 bytes
 Variable Size             713031784 bytes
 Database Buffers          352321536 bytes
 Redo Buffers                5455872 bytes
 Database mounted.
 Database opened.

 SQL> select status from v$instance;



With the database in upgrade mode (I believe the “migrate” is a left over from the 9i days) I am reassured that running datapatch (the OJVM part actually) works as advertised.

By the way Clusterware corrects the group permissions when you issue a “srvctl start database -db …” command as documented in “Starting the database using srvctl changes the group setting for oracle binary (Doc ID 1508027.1)”. Which didn’t solve my problem as I can’t start the database in upgrade mode using srvctl.

Hope this helps you one day!

No risk to activate Active Data Guard by mistake with SQL Developer SQLcl

If you have a Data Guard configuration without the Active Data Guard license, you can:

  • apply the redo to keep the physical standby synchronized
  • or open the database read-only to query it

but not at the same time.

Risk with sqlplus “startup”

Being opened READ ONLY WITH APPLY requires the Active Data Guard option. But that this may happen by mistake. For example, in sqlplus you just type “startup”, instead of “startup mount”. The standby database is opened read-only. Then the Data Guard broker (with state APPLY-ON) starts MRP and the primary database records that you are using Active Data Guard. And then DBA_FEATURE_USAGE_STATISTICS flags the usage of: “Active Data Guard — Real-Time Query on Physical Standby”. And the LMS auditors will count the option.

The ways to prevent it are unsupported:

alter system set "_query_on_physical"=false scope=spfile;

Active Data Guard's Real Time Query - avoid usage if not licensed

No problem with SQLcl

I’m pretty careful when I work on production databases but not when I’m on a lab as, there, errors are a nice way to learn new things. On a sandbox database on the Oracle Cloud ( I restarted the standby with a quick “startup force” and had the nice surprise to see the startup stopping in mount state. Look:

ORA-16003: standby database is restricted to read-only access

ORACLE_SID=CDB1B sql / as sysdba
SQLcl: Release 19.1 Production on Tue Jul 16 18:15:34 2019
Copyright (c) 1982, 2019, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
ORA-16003: standby database is restricted to read-only access

The message “ORA-16003: standby database is restricted to read-only access” is not new. This is what we have when we try a “alter database open read write;” on a standby database.

Actually, I started writing this blog thinking it was a new feature in 19c. And only when re-reading the paragraph above I realized that I was using SQLcl and maybe it has implemented the startup in two times: “startup mount” + “alter database open read write”. That is one of the reasons I try to write a blog post for each thing I discover. When the goal is to publish it, I spend more time thinking about the reasons, the context, the questions that can arise…

If you want to avoid to risk to activate Active Data Guard inadvertently, forget about sqlplus and use SQLcl. The “startup” command will stop in the mount state for a standby database. You need to type “startup open read only” to explicitly open it. Or issue an “alter database open;” after the “startup” returned ORA-16003.

Update 17-JUL-2019

Note that in multitenant, since 18c, the Active Data Guard is activated only where a pluggable database is opened. Then there’s no risk to open CDB$ROOT read-only and it is even recommended to have a correct replication when cloning PDBs. This means that with SQLcl you should “startup open read only” rather than simply “startup”. More info about it:

18c: No Active Data Guard required (and detected) when only CDB$ROOT and PDB$SEED are opened in read-only - Blog dbi services

Passwordless Data Pump 19c

That’s a very light bug with a very simple workaround, but it may require a little change in scripts. If you use passwordless authentication (external password file or OS authentication) with Data Pump in 19c it will ask for the password. The solution is just to answer whatever you want because the external authentication will be used anyway.


I create the wallet

mkstore -wrl $ORACLE_HOME/network/admin -create <w4ll3t-P455w0rd

I create a tnsnames.ora entry that I’ll use to connect:


I add a credential for this entry — here SYSTEM user and its password:

mkstore -wrl $ORACLE_HOME/network/admin -createCredential \
CDB1A_SYSTEM SYSTEM <user-P455w0rd

I also add a credential for Eazy Connect Plus (see this previous post for the dummy parameter):

mkstore -wrl $ORACLE_HOME/network/admin -createCredential \
//localhost/PDB1?_user=system SYSTEM <user-P455w0rd

Finally, I set the sqlnet.ora for this:

cat >> $ORACLE_HOME/network/admin/sqlnet.ora <<'CAT'

Then, I can connect passwordless

connect /@CDB1A_SYSTEM
show user
show con_name
connect /@//localhost/PDB1?_user=system
show user
show con_name

Nothing new here. It is an old feature and very simple to setup. In 2019 we do not see any password in clear text in scripts or command line, right?

Data Pump expdp/impdp

But let’s try to use the same with Data Pump

expdp /@CDB1A_SYSTEM

I want to connect passwordless because the password is in the wallet, but Data Pump asks me for the password. Don’t worry: it asks but doesn’t care. The connection will use the wallet one. Then the solution is to send /dev/null as the stdin:

‘“/ as sysdba”’

There’s the same when using OS authentication like running expdp or impdp connected SYSDBA. Note that this is not a recommendation: running Data Pump as SYSDBA is not a good idea. But if you don’t want to show the password and you are too lazy to setup a wallet, this was working until 19c. The only thing was to double-quote it so that Data Pump takes it as one parameter, and to single-quote around it so that the shell does not interpret your double quotes:

expdp '"/ as sysdba"'

Of course, another reason not to use this is that OS authentication connects you to the CDB$ROOT where you should not have a lot to export…

Fixed in 20.1

After writing this I realize that there’s now a MOS note about it, a patch and a solution (Upgrade to 20.1 when available):

Best Practices for Oracle Data Guard on Azure

I keep saying I’m going to start sharing what I’m doing in the Analytics space soon, but heck, there’s too much I need to keep adding to on the Oracle in Azure arena!

So, as most people know, I’m not a big fan of Oracle RAC, (Real Application Cluster).  My opinion was that it was often sold for use cases that it doesn’t serve, (such as HA) and the resource demands between the nodes, as well as what happens when a node is evicted to those that are left are not in the best interest for most use cases.  On the other hand, I LOVE Oracle Data Guard, active or standard, don’t matter, the product is great and it’s an awesome option for those migrating their Oracle databases to Azure VMs.

What is Oracle Data Guard

Oracle Data Guard is a cross between what Always on Availability Groups and SQL Server log shipping.  It “ships” the redo logs from the primary database to a standby database, (or active database that can be used for reporting, hence the similarities to AG) and with a sync process.  They is a complex configuration, but surprisingly simple failover process for the business to switch to one of the standbys if there is a failure on the primary.

The product has received excellent support from Oracle and the community and it’s something I highly recommend for those businesses now migrating databases over to Azure, but want something similar to what they had on-premises.    Unlike RAC, I can put a standby, (secondary replica for you Microsoft folks) in a second location, then use a Far Sync Instance as a “pass through” from the primary to the standby vs. using the built in Sync direct to the standby database.

Why Use a Far Sync Instance

Oracle has done some great research on the benefit of a Far Sync Instance and the truth is, it improves the performance of the Data Guard sync on a VM by over 12% vs. using a standard Sync between a primary and a standby in Data Guard. 300w, 768w, 1179w" sizes="(max-width: 489px) 100vw, 489px" />

Twelve may not seem like a high number, but when you’re looking to decrease latency any way you can and working between two different location zones in the cloud, this can be significant improvement.

A Far Sync Instance, is just a small allocation of memory for the Oracle background processes, a matching number of redo logs, (+1 for threads used by the Fast Sync) and no actual datafiles.  The primary then “ships” the redo logs to the Fast Sync Instance and then the Fast Sync can easily focus on its one task-  send the redo logs wherever they need to go, no matter where that standby is located.

All of this is build out on Azure Virtual Machines, (VMs) with the appropriate version of Oracle installed on each VM, but of course, the Fast Sync instance can be placed on a very low VM sizing tier, as it doesn’t require many resources or it can be on a shared server if the existing product sharing the VM isn’t very “chatty”.   The Oracle SGA for the Fast Sync can be as small as 300Mb and the CPU_COUNT=1.  Remember, this is pushing redo logs to and from it all day, so it’s going to be busy on that front.

The Architecture 300w, 768w" sizes="(max-width: 650px) 100vw, 650px" />

The above diagram demonstrates the best practices of both Oracle, as well as Azure for an Oracle Database, using Oracle Data Guard, but having the standby in a secondary location.  By using the Far Sync instance, we’re able to ensure that the standby commit of the redo logs is timely enough for the primary, (as this is part of the configuration for the Oracle Data Guard.)

In this configuration, you’ll notice that to support Oracle Data Guard in Azure, we’ve included Express Route to ensure solid performance to the users, since we all know, the network is the last bottleneck.

The Details

There are a few best practices to consider when building out this solution and this includes:

  • Use Async compression to decrease latency on the Far Sync transfers.
  • Configure redundant network links on the customer side of the Express Route to tolerate outages.
  • Choose IOPS for the standby VM that is faster than that of the redo on the primary VM to keep up with the sync, (something you might not consider for a standby server, but it’s necessary for this configuration)
  • The Far Sync instance should have the same number of redo logs as the primary, +1 for every thread
  • Instead of using RMAN, use Azure Site Recovery to take snapshots of each of the VMs, including the Far Sync VM, eliminating extra stress on the database tier.

If you’ve wondered how you would design an Oracle Data Guard environment, either standard or active in Azure, this is how I recommend my customers to do it.


Tags:  , ,





Copyright © DBAKevlar [Best Practices for Oracle Data Guard on Azure], All Right Reserved. 2019.

Ideas for Event Sourcing in Oracle

Log Miner or are there other alternatives?

With microservices, the architects want to dismantle the monolithic database and replicate data rather than share it. Then raises the need to audit the changes where the modifications are done (like the C in CQRS). The Oracle database already does that for recovery purpose, building the redo records before modifying the data blocks, but that’s a physical change vector. We need something logical with more possibilities to filter and transform. There are multiple possible methods for this. But unfortunately, the ones that were built in the database are slowly removed since Oracle has acquired Golden GAte and sells it separately.

And deprecated became desupported, and even removed in further releases, like Continuous Mine in 19c — the final pathset of 12cR2:

ORA-44609: CONTINOUS_MINE is desupported for use with DBMS_LOGMNR.START_LOGMNR

The doc says that the replacement is Golden Gate but that’s another product to buy, very powerful but expensive (and there’s no Standard Edition).

Debezium DBZ-137

Debezium, an open source distributed platform for change data capture, is working on an Oracle Database connector. Many ideas are mentioned in and in this post I give my thought about them.

[DBZ-137] Ingest change data from Oracle databases using LogMiner - JBoss Issue Tracker

Oracle XStreams

The perfect solution as it has minimal overhead on the source and is very efficient. But it requires Golden Gate licensing, and then is probably not the best solution for an Open Source product.

Oracle LogMiner

LogMiner is included in all Editions, reads the redo stream (archived and online redo logs) and extracts all information. When enabling Supplemental Logging, we have enough information to build the logical change information. Many solutions are already based on that. But I see two problems with it.

LogMiner Limits: Basically, LogMiner was not made for replication. The idea was more to give a troubleshooting tool to understand what happened on the data: what is generating too much redo? Who deleted some data? Which sessions were locking rows?… There are limits, like unsupported datatypes. And it is not designed to be efficient. But there’s also the possibility to mine on another system. However, I think that those limits can be acceptable for an Open Source solution on simple databases with low rate of changes.

LogMiner Future: What is more wondering is how Oracle removes the features that may give an alternative to Golden Gate. In 19c the CONTINUOUS_MINE was removed. This means that we need to constantly open and read the whole the redo logs. And do we know what Oracle will remove in future versions when they will see a robust Open Source product that competes with Golden Gate?

On the DBZ-137 there are some remarks about RAC which is more complex because there are many redo threads. I don’t think that RAC is in the scope for this. RAC is an expensive option that is required only on large databases with very high load. That fits more in the Golden Gate scope.

Note that we can parse the SQL_REDO and SQL_UNDO from V$LOGMINER_CONTENTS but there’s also the possibility to get them from dbms_logmnr.mine_value

Mining the binary log stream

There are some attempts to mine the binary redo logs. Some well known commercial products and some Open Source attempts. That’s very complex, but that’s also fun for an open source community. The redo log structure is proprietary but Oracle will not change it too often because all availability features (recovery, standby,…) are based on it. However, there may be a legal issue to open source this mining as it exposes the proprietary format of the redo. Reverse engineering is clearly forbidden by the Oracle license.

Continuous Query Notification

I studied the usage of dbms_change_notification as a CDC alternative: This feature is aimed at nearly static data, in order to invalidate and refresh a cache. It is not designed for a high change rate and is not efficient at all for this.

Client Result Cache

In the same idea as refreshing a cache from data that do not change often, one can think about querying with client result cache as it has a mechanism to invalidate the cache when a modification occurs. However, the granularity is bad here as any change on the table will invalidate all queries on it.

Materialized View Logs

All changes can be logged in materialized view logs. This feature is built for materialized views fast refresh which is a kind of replication. This has nothing to do with the redo log used by LogMiner. With materialized view logs, the changes are stored in a table and must be deleted when consumed. But this feature exists for a long time and is widely used. However, I would seriously question the architecture if there’s a general need for double writing, then reading it and deleting it, just to put the same data into another place.


With triggers, we can log the changes as with materialized view logs. It gives more possibilities, like sending the change rather than storing it in the database (but then we have to manage the transaction visibility). An optimized example to store the audited change has been published by Connor McDonald:

A Fresh Look at Auditing Row Changes

But this is still a lot of overhead and need to to be adapted when columns are added or removed.


When we enable row dependencies, the ORA_ROWSCN pseudo-column can help to filter the rows that may have been updated recently. However, there are two problems with this approach.

full read: if we want a near real-time replication, we will probably pool for changes frequently. ORA_ROWSCN would be nice if indexed, but that’s not the case. It just reads the information stored in the table block. That means that to find the changes done in the last 5 minutes we need to full scan the table and ORA_ROWSCN will then help to identify those rows that were changed. It is a transparent alternative to a “last update” column timestamp but does not help to access quickly to those rows.

commit time: there’s a general problem with anything that reads a “change” timestamp. Let’s say that I pool the changes every 5 minutes. I have a long transaction that updates a row at 12:39 and commits at 12:42. The pool that runs at 12:40, looking for changes since 12:35, does not see the change as it is not committed yet. The pool that runs at 12:45 can see it but not when it filters on the changes that occurred since the last run, which is 12:20. This means that each run must look on a larger window, including the longest transaction start. And then it must deal with duplicates as some of the changes have been captured by the previous run. This is a general problem when there’s no “commit SCN” available.


While talking about the visibility time (commit SCN) vs. the change there is an undocumented way to get it. insert or update with userenv(‘commitscn’) and this will magically get back to the table row at the end of the transaction to set the Commit SCN. It is not supported and anyway it can be invoked only once in a transaction and then cannot be added automatically in a trigger.

Oracle Flashback Query

If we don’t want to add additional auditing on DML, the redo log is not the only internal logging. Oracle also logs the undo information for consistent reads (MVCC) and this, without any additional overhead on the modification, can show all changes that occurred in a table. Basically, we can SELECT … FROM … VERSION BETWEEN SCN … AND … and all changes will be visible with the new and old values and additional information about the operation and the transaction.

However, this is not indexed. Like with ORA_ROWSCN we need to full scan the table and the consistent read will build the previous versions of the blocks, thanks to the undo.

Flashback Data Archive

Flashback Query can reconstruct the recent changes, limited by the undo retention, and by the last DDL that occurred. Flashback Data Archive (which was called Total Recall) can go further. This feature is available in Enterprise Edition without the need for an additional option. It can go beyond the undo retention and allows some DDL. But, again, it is not optimized to get all changes since a specific point in time. The idea is that when you know the rows you want to read, then it can get to the previous version.

Minimal trigger + flashback query

Some of those solutions can be combined. For example, a trigger can log only the ROWID of the changed rows and the replication process will get more information for these rows through flashback query. This lowers the overhead on the changes, while still avoiding a full scan for the replication. Or you may get those ROWID directly from custom-mining the redo logs, which is much simple than trying to get all information from it.

DDL triggers

SQL is agile and allows the structure to change. If adding a column breaks the whole replication, then we have a problem. All the solutions above need to handle those changes. The redo log contains the changes in the dictionary, but it can be complex to decode. All other solution must adapt to those changes and that means having a DDL trigger and handling the different kinds of changes.

Not easy…

The summary is that there are no easy solutions, and the easiest ones have been removed by Oracle to push the sales for Golden Gate. My first recommendation when someone wants to replicate to changes to query it from another place is: don’t do that. Relational databases are made to ingest new data and modifications, and be able to query for different purposes. We have views to show data in different formats. We have indexes to get fast access for different use cases. Oracle is not like many of its competitors. It has been optimized for mixed workloads from the first versions. You can query the same database where the changes occur because a SELECT does not lock anything. You have a resource manager to be sure that runaway queries cannot slow down the transactional activity. And the CPU usage for those queries, when correctly tuned, will rarely be higher than the replication activity you need to implement to stream the changes to another database.

Then, which technology should an event sourcing be built upon? LogMiner looks good for small databases with basic usage. And the project should adapt to the features that are removed by Oracle in the future.

Hybrid trigger / flashback query

When only a few tables are concerned, generating DML triggers is probably the simplest, especially if they log only the minimum, like the ROWID. The ROWID will be visible only when the transaction is committed. Then the replication process must use flashback query, reading only those blocks from the ROWID. The nice thing here is that flashback query shows when the change was visible (the commit time) rather than the change time, which makes it easier to filter out the changes already processed by the last run.

Here is the idea when a trigger has logged the ROWID changed into a DEMO_CDC table, and we query:

The execution plan for this shows optimal access with the ROWID:

Explain Plan
Plan hash value: 3039832324
| Id | Operation | Name |
| 1 | NESTED LOOPS | |
| 2 | SORT UNIQUE | |

The important thing here is that the query cost is proportional to the changes and not to the full size of the table. And the trigger overhead is limited to the ROWID only. There’s no need to store in a table the values that are stored already in the base table and the undo segments. If reading this is done frequently, there are good chances that all blocks involved (the ROWID list, the UNDO records and the table block) are still in the buffer cache.

This trigger+flashback approach is just an idea that I’ve never used. So feedbacks welcome on