Oakies Blog Aggregator

Experimenting with the ZFSSA’s snapshot capability using the simulator part 2

In my last post I wrote down some notes about my experience while experimenting with the ZFSSA simulator. A simulator is a great way to get familiar with the user interface and general usability of a product. What I wanted to find out using the ZFSSA simulator was the answer to the question: “what happens to a clone of a database when I roll the master copy forward?”

In the first part of the series I explained how I created a clone of a database, named CLONE1. It is based on a backup of my NCDB database. On top of the backup I have created a snapshot as the basis for my clone. A clone in ZFS(SA) terminology is a writeable snapshot, and CLONE1 uses it. But what would happen to CLONE1 if I modified the source database, NCDB? And can I create a new clone, CLONE2, based on a new backup of the source without modifying the first clone? Let's try this.

Changing the Source Database

Let’s change the source database a little by creating a new tablespace with a couple of data files. To add a little bit of spice to the scenario I decided to create a new user. Its only schema object will be created on the new tablespace.

[oracle@oraclelinux7 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Mar 3 17:25:57 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

SQL> create tablespace dropme datafile size 5m;

Tablespace created.

SQL> alter tablespace dropme add datafile size 5m;

Tablespace altered.

SQL> grant dba to new_user identified by ...;

Grant succeeded.

SQL> create table new_user.t1 tablespace dropme as 
  2  select * from dba_source where rownum <= 100; 

Table created. 

SQL> 

This should be enough for this quick test.

Backing Up

The next step is to roll my existing image copies forward to reflect the changes. That's not too hard: essentially you create an incremental backup "for recover of copy" … followed by a recover "copy of database with tag" … Once the backup/recover command combination has completed, you also need to back up the archived logs in a second step.
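For reference, the RMAN sequence looks roughly like this. This is a minimal sketch: the tag name and format strings are illustrative choices of mine, not copied from the original run, so adapt them to your own setup.

RUN {
  BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'incr_fwd'
    DATABASE FORMAT '/zfssa/ncdb_bkp/data/%U';
  RECOVER COPY OF DATABASE WITH TAG 'incr_fwd';
}
# second step: back up the archived logs
BACKUP ARCHIVELOG ALL FORMAT '/zfssa/ncdb_bkp/archive/%U';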

At the risk of repeating myself, please be careful: adding this procedure to an existing backup strategy might have adverse side effects. As always, make sure you understand the implications of this technique and its impact. Test thoroughly!

As with the previous backup I stored the incremental backup “for recover” of my image copies in ‘/zfssa/ncdb_bkp/data/’. That’s a location mounted via NFS from the ZFSSA. This is the same location I previously used for the image copies. There is nothing too exciting to report about the backup.

Just as with the previous post my archived logs went to ‘/zfssa/ncdb_bkp/archive/’ to complete the preparations. Here are the files that were created:

[oracle@oraclelinux7 ~]$ ls -lrt /zfssa/ncdb_bkp/data/
total 3035612
-rw-r-----. 1 oracle asmdba    7192576 Mar  3 17:26 6qqvijpt_1_1
-rw-r-----. 1 oracle asmdba    3702784 Mar  3 17:26 6pqvijps_1_1
-rw-r-----. 1 oracle asmdba    1851392 Mar  3 17:40 6tqvikjv_1_1
-rw-r-----. 1 oracle asmdba     442368 Mar  3 17:40 6sqvikjv_1_1
-rw-r-----. 1 oracle asmdba    5251072 Mar  3 17:40 data_D-NCDB_I-3358649481_TS-DROPME_FNO-7_6uqvikk2
-rw-r-----. 1 oracle asmdba  650125312 Mar  3 17:40 data_D-NCDB_I-3358649481_TS-SYSAUX_FNO-3_6kqvie3k
-rw-r-----. 1 oracle asmdba    5251072 Mar  3 17:40 data_D-NCDB_I-3358649481_TS-DROPME_FNO-2_6rqvijq0
-rw-r-----. 1 oracle asmdba  828383232 Mar  3 17:40 data_D-NCDB_I-3358649481_TS-SYSTEM_FNO-1_6jqvie1q
-rw-r-----. 1 oracle asmdba    5251072 Mar  3 17:40 data_D-NCDB_I-3358649481_TS-USERS_FNO-6_6mqvie57
-rw-r-----. 1 oracle asmdba  293609472 Mar  3 17:40 data_D-NCDB_I-3358649481_TS-UNDOTBS1_FNO-4_6lqvie4d
-rw-r-----. 1 oracle asmdba 1304174592 Mar  3 17:40 data_D-NCDB_I-3358649481_TS-EXAMPLE_FNO-5_6iqvie1p
[oracle@oraclelinux7 ~]$ ls -lrt /zfssa/ncdb_bkp/archive/
total 8204
-r--r-----. 1 oracle asmdba    1024 Mar  3 16:06 1_118_905507850.arc
-r--r-----. 1 oracle asmdba 2869760 Mar  3 16:06 1_117_905507850.arc
-r--r-----. 1 oracle asmdba    1024 Mar  3 17:49 1_120_905507850.arc
-r--r-----. 1 oracle asmdba 5426688 Mar  3 17:49 1_119_905507850.arc
[oracle@oraclelinux7 ~]$ 

The image copies are now current, and you can see the two data files for the new tablespace "DROPME" that didn't exist before.

Creating the Second Clone Database

With the preparations in place it is time to see if I can create a new clone that reflects the new tablespace and user. I would also like to see if the following steps have any implications for my database CLONE1.

On the ZFSSA Simulator

I headed over to the ZFSSA (simulator) and navigated to shares -> projects. After selecting “NCDB_BKP” I chose Snapshots and hit the (+) button to create snap1. You should see snap0 on that view as well if you are following the examples.

The next task is to create a new project. You should already see the projects pane on the left hand side. Click on the (+) sign next to ALL to create a new one. I named the project NCDB_CLONE2 to stay in line with the naming convention I used previously. With the project created, you should set the properties as needed. I moved the mount point to /export/ncdb_clone2/. On a real ZFSSA you’d set others as well, but that is out of scope of this post. Consult the relevant white papers for more information.

Just as described in the first post, you now need to create clones based on snap1. To do so, switch back to the NCDB_BKP project and select (the list of) shares. You should see alert, archive, data and redo. Create clones for each by following these steps per share:

  • Hover the mouse over the share name
  • Click on the share’s pencil icon to edit share properties
  • Select “snapshots”
  • Hover the mouse over the snapshot name, snap1
  • In the “Clones” column, click on the [+] sign
  • In the resulting pop-up, make sure to create the clone in NCDB_CLONE2 and give it the same name as the share you are cloning

The end result should be 4 shares shown in the new project, all of them based on snapshots of their cousins from NCDB_BKP. That concludes the work on the ZFSSA for now.

NB: you can script this :)

On the Database Server

If you haven’t done so yet, create the directories to mount the new shares. Everything I mount from the ZFSSA goes to /zfssa on the database server, and as a result I have my files in /zfssa/ncdb_clone2/{alert,archive,data,redo}. Update /etc/fstab accordingly and mount the shares.
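For illustration, the /etc/fstab entries could look something like this. The hostname zfssa and the mount options are my assumptions (use the NFS options Oracle recommends for database files on your platform); the archive and alert shares follow the same pattern:

zfssa:/export/ncdb_clone2/data  /zfssa/ncdb_clone2/data  nfs  rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600  0 0
zfssa:/export/ncdb_clone2/redo  /zfssa/ncdb_clone2/redo  nfs  rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600  0 0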

The remaining steps are the same as for the creation of CLONE1, and I am not repeating them here. Because it's very important, here is the warning again: be careful with the path information in the create controlfile statement and make sure the paths point to /zfssa/ncdb_clone2/!

Reviewing the Outcome

After CLONE2 is created, my initial questions can be answered.

  1. Is the new tablespace part of CLONE2?
  2. Can I access the newly created table new_user.t1 in CLONE2?
  3. Did anything change for CLONE1?

Starting with CLONE2, I can see the new tablespace, and the data in new_user.t1 is available as well:

SQL> select name from v$database;

NAME
---------
CLONE2

SQL> select tablespace_name from dba_tablespaces;

TABLESPACE_NAME
------------------------------
SYSTEM
SYSAUX
UNDOTBS1
TEMP
USERS
EXAMPLE
DROPME

7 rows selected.

SQL> select count(*) from new_user.t1;

  COUNT(*)
----------
       100

SQL>

And what about CLONE1?

SQL> select tablespace_name from dba_tablespaces;

TABLESPACE_NAME
------------------------------
SYSTEM
SYSAUX
UNDOTBS1
TEMP
USERS
EXAMPLE

6 rows selected.

SQL> select count(*) from new_user.t1;
select count(*) from new_user.t1
                              *
ERROR at line 1:
ORA-00942: table or view does not exist

Well that looks ok!

NoSQL workshop at Oracle user group conference

Oracle NoSQL Database has been regularly featured at the conferences of the Northern California Oracle Users Group. But at its most recent conference, the group dared to play outside the Oracle sandbox with a whole-day NoSQL workshop featuring three Oracle competitors: MongoDB, Couchbase, and Cassandra. (read more)


Exceptional SQL

Fifth in a series of posts in response to Tim Ford's #EntryLevel Challenge.


SQL implements a number of so-called union operators that under the right circumstances can make queries easy to write and intuitive to read and understand. One of these is the EXCEPT operator that "subtracts" one set of rows from another. 

Say for example that you're doing some work on data quality and want to investigate products that your firm has sold without ever having first purchased. What have you sold but never bought? You can answer that question easily by executing the following EXCEPT query:

SELECT ProductId 
FROM Sales.SalesOrderDetail
EXCEPT
SELECT ProductID 
FROM Purchasing.PurchaseOrderDetail

  ProductId
-----------
        717
        718
        719

SQL's expressiveness allows me to write the query intuitively in a manner closely corresponding to how I think about my business question. Then the optimizer finds an efficient execution path. Here's what's happening at the conceptual level:

  1. A set of product ID numbers is generated from sales data.
  2. A second set of product ID numbers is generated from purchase data. 
  3. All purchase data values are eliminated from the final result.

There's an implied SELECT DISTINCT operation at work. Relational database technology has roots in set theory, and the union operators such as EXCEPT operate in set-based terms. No matter how many duplicate rows are returned by the individual queries, you'll have only one of each row in your final result. 
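Here's a tiny illustration of that implied DISTINCT, using throwaway VALUES constructors; this example is mine, not part of the AdventureWorks scenario:

SELECT x FROM (VALUES (1),(1),(2)) AS a(x)
EXCEPT
SELECT x FROM (VALUES (2)) AS b(x)

          x
-----------
          1

The left-hand query produces the value 1 twice, yet it appears only once in the result, and the 2 is subtracted away.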

Not many know this, but IBM DB2 does in fact support an EXCEPT ALL operation that implements duplicate row semantics similar to those in the more commonly implemented UNION ALL. To my knowledge, IBM DB2 is the only relational database engine to support EXCEPT ALL.

By now you're looking at the query results and are realizing that you do in fact manufacture products as well as just buy them. A logical follow-up is to wonder whether those products sold without being purchased correspond to what it is that you make in house. Investigate that angle by adding a second EXCEPT operation involving bill-of-materials data:

SELECT ProductId 
FROM Sales.SalesOrderDetail
EXCEPT
SELECT ProductID 
FROM Purchasing.PurchaseOrderDetail
EXCEPT 
SELECT ProductAssemblyID 
FROM Production.BillOfMaterials

Read from top down, and you'll find the query intuitive and easy to grasp. Here's what's happening:

  1. Query results are initialized to the list of products that have been sold
  2. Any purchased products are removed from the list
  3. Any manufactured products are removed from the list

You get an empty result, which is precisely what you hope for. All the products you sell are either purchased or built in house from parts that you have purchased. Thankfully, there is no data quality problem to resolve, and you can break for lunch early.

But! You're curious about the query's efficiency. So instead of an early lunch, you decide to have a quick look at execution plans. Following is an alternative formulation of the query, written around NOT EXISTS operations, that provides the same result as the EXCEPT query. Notice in particular the need for a SELECT DISTINCT to guard against duplicate rows.

SELECT DISTINCT ProductID
FROM Sales.SalesOrderDetail
WHERE NOT EXISTS (
   SELECT ProductID 
   FROM Purchasing.PurchaseOrderDetail
   WHERE PurchaseOrderDetail.ProductID = SalesOrderDetail.ProductID) 
AND NOT EXISTS (
   SELECT ProductAssemblyID 
   FROM Production.BillOfMaterials
   WHERE BillOfMaterials.ProductAssemblyID = SalesOrderDetail.ProductID)

For a one-off or infrequently executed query, you can just run it and call it a day if you're happy with the execution speed. But it pays to look into execution plans when writing queries that might get executed frequently as part of a transactional system. Here is a screenshot showing the actual execution plans for the two queries as generated on my test system:

[Figure: The EXCEPT query's actual execution plan comes in with the lowest cost, in this one case.]

I generated the plans in the figure by putting both queries into one batch, and then executing the batch and requesting the actual execution plans along with the query results. The EXCEPT query has a slight edge in this specific case with a cost amounting to 46% of the total batch cost, whereas the NOT EXISTS query accounted for 54%.
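If you prefer raw numbers over graphical plans, SQL Server can also print per-statement timings; this tip is my addition and wasn't part of the original experiment:

SET STATISTICS TIME ON;
-- execute both query formulations here; CPU and elapsed times
-- appear in the Messages tab after each statement completes
SET STATISTICS TIME OFF;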

Don't generalize. Don't presume that EXCEPT is always better. My testing with different business questions and queries to answer them showed that in some cases NOT EXISTS had a slight edge, and in other cases the plans and costs were identical. In particular, I noticed that when operating on primary- and foreign-key columns, it almost didn't seem to matter how I formulated my query.

Anyone who is routinely writing SQL should be familiar with the various union operators such as EXCEPT. These operators can lead to query formulations that closely mirror your thought process around the business question being posed. When that happens, queries are easier to write in the short term, and easier to comprehend when you revisit them in the long term.

perf top ‘Too many events are opened.’ message

This is a small blogpost on using ‘perf’. I got an error message when I tried to run ‘perf top’ systemwide:

# perf top
Too many events are opened.
Try again after reducing the number of events

What is actually going on here is described in the perf wiki:

Open file limits
The design of the perf_event kernel interface which is used by the perf tool, is such that it uses one file descriptor per event per-thread or per-cpu.
On a 16-way system, when you do:
perf stat -e cycles sleep 1
You are effectively creating 16 events, and thus consuming 16 file descriptors.

The point of this blogpost is that perf (in Oracle Linux 7.1) says 'too many events', and hidden away in the perf wiki the true reason for the message is made clear: perf opens one file descriptor per CPU thread, which means that on a big system you may get this message if the open files limit (file descriptors, actually) is set lower than the number of CPU threads.
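A quick sanity check is to compare the two numbers. The output below is illustrative, not taken from the machine in question:

$ nproc        # number of CPU threads; perf opens one event fd per thread
1152
$ ulimit -n    # per-process open file (descriptor) limit
1024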

You can see the current set limits using ‘ulimit -a’:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 189909832
max locked memory       (kbytes, -l) 21878354152
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

If you are root, you can simply set the 'open files' limit higher than the number of CPU threads, and perf will work:

# ulimit -n 10240
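Note that ulimit only changes the limit for the current shell and its children. To make the change persistent for the oracle user, an /etc/security/limits.conf entry along these lines (values illustrative) does the trick:

oracle   soft   nofile   10240
oracle   hard   nofile   10240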

Tagged: linux, perf, perf not working, perf top, too many events are opened

SQL statements using literals

16 years ago, someone "Ask-ed Tom" how to find those SQL statements that were not using bind variables. You can see the question here (because we don't ever delete stuff :)) but I'll paraphrase the answer below:

Tom took the following approach

  • take a copy of SQL statements in the library cache
  • create a routine that would hunt for constants in the SQL text (that is, numbers and anything within quotes) and replace them with placeholders
  • then count the resulting statements for duplicates

create table t1 as select sql_text from v$sqlarea;

alter table t1 add sql_text_wo_constants varchar2(1000);

create or replace function 
remove_constants( p_query in varchar2 ) return varchar2
as
    l_query long;
    l_char  varchar2(1);
    l_in_quotes boolean default FALSE;
begin
    for i in 1 .. length( p_query )
    loop
        l_char := substr(p_query,i,1);
        if ( l_char = '''' and l_in_quotes )
        then
            l_in_quotes := FALSE;
        elsif ( l_char = '''' and NOT l_in_quotes )
        then
            l_in_quotes := TRUE;
            l_query := l_query || '''#';
        end if;
        if ( NOT l_in_quotes ) then
            l_query := l_query || l_char;
        end if;
    end loop;
    l_query := translate( l_query, '0123456789', '@@@@@@@@@@' );
    for i in 0 .. 8 loop
        l_query := replace( l_query, lpad('@',10-i,'@'), '@' );
        l_query := replace( l_query, lpad(' ',10-i,' '), ' ' );
    end loop;
    return upper(l_query);
end;
/
update t1 set sql_text_wo_constants = remove_constants(sql_text);

select sql_text_wo_constants, count(*)
  from t1
 group by sql_text_wo_constants
having count(*) > 100
 order by 2
/

It's a utility I have used many times over the years.  But times change, and it's always good to reflect on the tools and techniques you use, and whether anything has come along in later versions to assist you.  Consider the second bullet point above:

create a routine that would hunt for constants and replace them with placeholders

That sounds a lot like the process that must be followed when "cursor_sharing" is enabled (and set to "force"), the only difference being that the placeholders would become bind variables.  With that in mind, even if we are not using cursor_sharing, the information that was added to the library cache when cursor_sharing was introduced all those years ago can now be used to assist us.
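You can see that rewriting in action by forcing it for a session. The snippet below is an illustrative test of mine against a scratch table T (output abbreviated); the :"SYS_B_0" placeholder is the name Oracle generates for the replaced literal:

SQL> alter session set cursor_sharing = force;

Session altered.

SQL> select count(*) from t where x = 123;

SQL> select sql_text from v$sql where sql_text like 'select count(*) from t%';

SQL_TEXT
--------------------------------------------------
select count(*) from t where x = :"SYS_B_0"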

In V$SQLSTATS there is a column FORCE_MATCHING_SIGNATURE, which is a "signature" (or hash value) representing a SQL statement after conversion as if cursor_sharing = force were in effect. That becomes the perfect grouping mechanism to identify repeated statements using literals. For example:



SQL> select force_matching_signature, count(*)
  2  from  v$sqlstats
  3  where force_matching_signature > 0
  4  group by force_matching_signature
  5  having count(*) > 10
  6  order by 2 desc;

 FORCE_MATCHING_SIGNATURE   COUNT(*)
------------------------- ----------
      7756258419218828704         73
     15993438058742417605         16
     15893216616221909352         15
     14052954841849993177         12
     10493170372233636846         11

5 rows selected.

SQL> select sql_text from v$sqlstats
  2  where FORCE_MATCHING_SIGNATURE = 7756258419218828704;

SQL_TEXT
----------------------------------------------------------------
select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 148525
 and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 8448 
and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 6309 
and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 148524
 and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 8442 
and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 8594 
and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 148693
 and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 8602 
and bitand(FLAGS, 128)=0

...

Identification of literal SQL has become trivial. Fixing it? Well… that's another story :)

Storing Date Values As Characters Part II (A Better Future)

In the previous post, I discussed how storing date values within a character data type is a really really bad idea and illustrated how the CBO can easily get its costings totally wrong as a result. A function-based date index helped the CBO get the correct costings and protect the integrity of the date data. During […]

MERGE vs UPDATE/INSERT revisited

I wrote a few years back that for single row operations, MERGE might in fact have a larger overhead than the do-it-yourself approach (ie, attempt an update; if it fails, then do an insert).

Just to show that it's always good to revisit things as versions change, here's the same demo (scaled up now because my laptop is faster :))

As you can see, there is still a little difference between the two operations.  But even so, unless you need that absolute last little percentage of performance, and you know the data distribution (and hence the probability of updates versus inserts) extremely well (in particular, you're expecting nearly always updates), perhaps MERGE is the way to go, in that it more accurately reflects the operation being performed, and is hence closer to the concept of "self-documenting" code.




SQL> drop table t1 purge;

Table dropped.

SQL>
SQL> create table t1
  2   ( x int primary key,
  3     y int );

Table created.

SQL> -- all inserts
SQL>
SQL> set timing on
SQL> begin
  2  for i in 1 .. 500000 loop
  3    merge into t1
  4    using ( select i x, i y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:19.99

SQL> -- 50/50 updates and inserts
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    merge into t1
  4    using ( select i x, i y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:19.19
SQL> -- all updates
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    merge into t1
  4    using ( select i x, i+1 y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:17.48
SQL>
SQL> drop table t1 purge;

Table dropped.

Elapsed: 00:00:04.94
SQL>
SQL> create table t1
  2   ( x int primary key,
  3     y int );

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> set timing on
SQL> begin
  2  for i in 1 .. 500000 loop
  3    update t1 set y = i where x = i;
  4    if sql%notfound then insert into t1 values (i,i); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:22.79
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    update t1 set y = i where x = i;
  4    if sql%notfound then insert into t1 values (i,i); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:18.08
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    update t1 set y = i+1 where x = i;
  4    if sql%notfound then insert into t1 values (i,i+1); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:13.88
SQL>
SQL>
SQL>
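One closing caveat on the do-it-yourself approach; this is my addition, not part of the original demo. Under concurrent load, two sessions can both miss on the update and then collide on the insert, so a robust hand-rolled version typically also traps DUP_VAL_ON_INDEX. A sketch, where p_x and p_y stand in for the key and value being merged:

declare
  p_x int := 42;  -- placeholder key
  p_y int := 99;  -- placeholder value
begin
  update t1 set y = p_y where x = p_x;
  if sql%notfound then
    begin
      insert into t1 values (p_x, p_y);
    exception
      when dup_val_on_index then
        -- another session inserted the same key between our update and insert
        update t1 set y = p_y where x = p_x;
    end;
  end if;
end;
/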

Datatypes for DATES

Richard Foote has written a post about not using the DATE datatype for storing dates.

So I’ve come up with a revolutionary system to store dates as well…using the very efficient RAW datatype.

Here’s a demo


SQL> create table t ( x raw(7) );

Table created.

SQL>
SQL> create or replace
  2  procedure store_date(p_yyyymmddhh24miss varchar2) is
  3  begin
  4    insert into t
  5    values
  6      (
  7        hextoraw(
  8         to_char(to_number(substr(p_yyyymmddhh24miss,1,2))+100,'FM0X')||
  9         to_char(to_number(substr(p_yyyymmddhh24miss,3,2))+100,'FM0X')||
 10         to_char(to_number(substr(p_yyyymmddhh24miss,5,2)),'FM0X')||
 11         to_char(to_number(substr(p_yyyymmddhh24miss,7,2)),'FM0X')||
 12         to_char(to_number(substr(p_yyyymmddhh24miss,9,2))+1,'FM0X')||
 13         to_char(to_number(substr(p_yyyymmddhh24miss,11,2))+1,'FM0X')||
 14         to_char(to_number(substr(p_yyyymmddhh24miss,13,2))+1,'FM0X')
 15        )
 16      );
 17  end;
 18  /

Procedure created.

SQL>
SQL> exec store_date('20160528211212')

PL/SQL procedure successfully completed.

SQL> select * from t;

X
--------------
7874051C160D0D

As you can see, the dates are stored in a compact 7-byte format. I’ve added 100 to the century and the year so we can also store negative dates (before 0AD) without any dramas. I’m quite impressed with my ingenuity here. I’m not going to have any of those “number of seconds since 1970” issues, where a 32-bit number might overflow etc etc.

So let us compare that to the DATE datatype.


SQL> create table t1 ( x date );

Table created.

SQL> insert into t1 values ( to_date('20160528211212','yyyymmddhh24miss'));

1 row created.

SQL> select dump(x) from t1;

DUMP(X)
--------------------------------------------------------------------------------
Typ=12 Len=7: 120,116,5,28,22,13,13

Let me convert that to plain old bytes so we can compare


SQL> create or replace
  2  function dump_to_hex(p_str varchar2) return varchar2 is
  3    l_str varchar2(100) := p_str;
  4    l_elem varchar2(10);
  5    l_hex varchar2(100);
  6  begin
  7    l_str := substr(l_str,instr(l_str,':')+2);
  8    loop
  9      l_elem := substr(l_str,1,instr(l_str,',')-1);
 10      exit when l_elem is null;
 11      l_hex := l_hex || to_char(to_number(l_elem),'FM0X');
 12      l_str := substr(l_str,instr(l_str,',')+1);
 13    end loop;
 14    -- the last byte has no trailing comma, so append what remains of l_str
 15    l_hex := l_hex || to_char(to_number(l_str),'FM0X');
 16    return l_hex;
 17  end;
 18  /

Function created.

SQL>
SQL> select dump_to_hex(dump(x)) from t1;

DUMP_TO_HEX(DUMP(X))
-----------------------------------------------------------------
7874051C160D0D

Oh… Looks like someone beat me to it :)

So if you're thinking about re-inventing your own datatype for dates, perhaps just stick with the one that's provided for you … it works just fine :)

Oracle SGA memory allocation on startup

Recently I have been presenting on what running on a large Intel-based NUMA system looks like (on the OTN EMEA tour in Düsseldorf and Milan, and I will be presenting about this at the Dutch AMIS 25th anniversary event in June). The investigation for this presentation was done on an SGI UV 300 machine with 24 terabytes of memory, 32 sockets (= NUMA nodes), 480 cores and 960 threads.

Recently I have been given access to a new version of the UV 300, the UV 300 RL, in which the CPU has been upgraded from Ivy Bridge to Haswell and now has 18 cores per socket instead of 15. That brings a fully equipped system to 576 cores, which makes 1152 threads.

Now to get back to the actual purpose of this blogpost: SGA memory allocation on startup. One of the things you can do on such a system is allocate an extremely large amount of memory. In my case I chose to allocate 10 terabytes for the Oracle SGA. When you go (excessively) outside of normal numbers, you run into things simply because there is a (very) low chance that it has actually been tested, and very few (or none) have actually done it before.

First let me show you the other technical details:
Operating system: Oracle Linux version 7.1
Kernel: kernel-uek-3.8.13-98.4.1.el7uek.x86_64
Oracle grid infrastructure version: 12.1.0.2.160419
Oracle database version: 12.1.0.2.160419
For the database instance the use_large_pages parameter is set to ONLY. It simply does not make sense not to use huge pages for a normal database; in my opinion it really is mandatory if you work with large memory databases.

Firing up the instance…
At first I set sga_target to 10737418240000, i.e. 10 terabytes, and started up the instance. My sqlplus session got busy after pressing enter, and it remained busy. I got coffee downstairs, got back to my office, and it was still busy. Long story short: after 30 minutes it was still busy and I wondered if something was wrong.

What can you do? I first checked the alert.log of the instance. It told me:

Tue May 24 07:52:35 2016
Starting ORACLE instance (normal) (OS id: 535803)
Tue May 24 07:52:35 2016
CLI notifier numLatches:131 maxDescs:3801

Not a lot of information, but also no indication anything was wrong.

Then I looked with the ‘top’ utility what was active at my system:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 535803 oracle    20   0  9.765t  22492  17396 R  99.4  0.0  11:10.56 oracleSLOB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

I see my local bequeath session running at or near 100% CPU. It's working, and that is the expected state when you fire up an Oracle database instance. But what is it doing?

Next logical thing: strace:

[oracle@skynet5 frits]$ strace -fp 535803
Process 535803 attached

Yes, this is meant to show that strace does not produce any output. This means the process is not doing any system calls. It does NOT mean the process is doing nothing; we saw with 'top' that it's busy, running close to 100% on a CPU (thread).

When strace does not show anything, the next layer to dive into is the C function layer, with perf. The first thing I tried, and normally try first, is 'perf top'. 'perf top -p' zooms in on one process:

[oracle@skynet5 frits]$ perf top -p 535803
   PerfTop:    4003 irqs/sec  kernel:99.9%  exact:  0.0% [4000Hz cycles],  (target_pid: 535803)
-----------------------------------------------------------------------------------------------------------------------------------

    95.86%  [kernel]   [k] clear_page_c_e
     1.58%  [kernel]   [k] clear_huge_page
     0.41%  [kernel]   [k] __ticket_spin_lock
     0.32%  [kernel]   [k] clear_page
     0.29%  [kernel]   [k] _cond_resched
     0.28%  [kernel]   [k] mutex_lock
     0.11%  [kernel]   [k] apic_timer_interrupt
     0.11%  [kernel]   [k] hugetlb_fault

Ah! This shows what the process is actually doing! We can see it’s executing a function called ‘clear_page_c_e’ in kernel mode in 96% of the samples perf took of the running process. The name of the function describes what it does quite well: it is a kernel function to clear a page.

So this shows that the process starting up the instance is actually spending all of its time clearing memory, but it does not show how this relates to what the Oracle database is actually doing. In order to dig further, we can use perf in another way: record backtraces, and report on them later. perf samples full backtraces into a file called 'perf.data'. When perf is later used to report on that file, it will show the function that was sampled most, and all the backtraces leading to that function (when perf record is called with -g, "call graph").

[oracle@skynet5 frits]$ perf record -g -p 535803
...after some time: ^C, then:
[oracle@skynet5 frits]$ perf report
# Overhead  Command       Shared Object                                      Symbol
# ........  .......  ..................  ..........................................
#
    97.14%   oracle  [kernel.kallsyms]   [k] clear_page_c_e
             |
             --- clear_page_c_e
                |
                |--100.00%-- hugetlb_fault
                |          handle_mm_fault
                |          __do_page_fault
                |          do_page_fault
                |          page_fault
                |          slaac_int
                |          slrac
                |          sskgm_segment_notify_action
                |          skgmcrone
                |          skgm_allocate_areas
                |          skgmcreate
                |          ksmcrealm
                |          ksmcsg
                |          opistr_real
                |          opistr
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                 --0.00%-- [...]

What this shows is again the kernel function 'clear_page_c_e' (the [k] also indicates it's a kernel function), and when we follow the backtrace down, we see:
* hugetlb_fault: this is a page fault handling function, which shows that I am using huge pages.
* handle_mm_fault, __do_page_fault, do_page_fault, page_fault: these are linux kernel functions indicating it’s handling a page fault.
* slaac_int, slrac, sskgm_segment_notify_action: these are Oracle database functions. Obviously, slaac_int is the function actually ‘touching’ memory, because it results in a page_fault.

So what is a page fault? A page fault is an interrupt that is raised when a program tries to access memory that has already been allocated into the process’ virtual address space, but not actually loaded into main memory. The interrupt causes the kernel to handle the page fault, and make the page available. Essentially Linux, like most other operating systems, makes pages available once they are actually used instead of when they are ‘just’ allocated. (this is a summary, there is a lot to tell about page faults alone, but that’s beyond the scope of this article)
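As an aside (my addition, not something from the original investigation): you can watch a process' page fault counters grow while this is happening, using the min_flt/maj_flt fields of ps:

$ watch -n 1 'ps -o pid,min_flt,maj_flt -p 535803'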

It's important to realise there is no system call visible that allocates memory (think of mmap(); library functions like malloc() allocate through such calls). If you want to know more about these calls, please read up on them using 'man SYSCALLNAME' (on Linux, obviously). So, the above backtrace looks like code that deliberately touches the Oracle SGA memory in order to get it truly allocated!

Actually, in my case, with NUMA enabled and a large amount of memory to be allocated, the above process of getting 10 terabytes 'touched' or 'page faulted' can be watched in progress in the proc 'numa_maps' file of the 'bequeathing' process (the process that is starting up the instance). The Linux 'watch' utility is handy for this:

[oracle@skynet5 frits]$ watch -n 1 -d cat /proc/535803/numa_maps
00400000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:1 anon=28 dirty=28 N1=28
129b4000 prefer:1 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=360268 N0=11259 N1=11259 N2=11259 N3=11259 N4=11259 N5=11259 N6=11259 N7=11259 N8
=11259 N9=11259 N10=11259 N11=11259 N12=11258 N13=11258 N14=11258 N15=11258 N16=11258 N17=11258 N18=11258 N19=11258 N20=11258 N21=11258 N22=11258 N23
=11258 N24=11258 N25=11258 N26=11258 N27=11258 N28=11258 N29=11258 N30=11258 N31=11258
fca0000000 prefer:1 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:1 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:1 anon=534 dirty=534 N1=73 N2=461
...

With ‘watch’, ‘-n 1’ means a 1 second interval, ‘-d’ means highlight differences.
When the instance is starting (with NUMA enabled, and with a huge amount of memory allocated for the SGA), at first you will see it's busy in the line starting with '80000000' (which is a memory address). In my case it says 'interleave:0-31', which means the memory allocation is spread over all the indicated NUMA nodes. An entry Nx=y means an allocation of y pages on NUMA node x. In my case I see the numbers growing per NUMA node, jumping from one NUMA node to another for some time. This is the shared pool allocation, which is spread out over all NUMA nodes.

However, after some time, you’ll see this:

00400000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:0 anon=28 dirty=28 N1=28
129b4000 prefer:0 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=516352 N0=16136 N1=16136 N2=16136 N3=16136 N4=16136 N5=16136 N6=16136 N7=16136 N8
=16136 N9=16136 N10=16136 N11=16136 N12=16136 N13=16136 N14=16136 N15=16136 N16=16136 N17=16136 N18=16136 N19=16136 N20=16136 N21=16136 N22=16136 N23
=16136 N24=16136 N25=16136 N26=16136 N27=16136 N28=16136 N29=16136 N30=16136 N31=16136
fca0000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=144384 N25=144384
14320000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=143872 N26=143872
18960000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=12232 N27=12232
1cfa0000000 prefer:0 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:0 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:0 anon=534 dirty=534 N1=73 N2=461

Once the process is done allocating shared pool memory, it starts allocating buffer cache memory. It does this per NUMA node. You can see that here in the lines fca0000000, 14320000000 and 18960000000. Actually, I copied these figures while the instance was still allocating memory at memory address 18960000000. You can see it allocates memory local to a specific NUMA node, shown by 'prefer:nr'.

At the end, when all memory was allocated, the sqlplus session showed the sizes of the different memory pools and opened the database. This took approximately 90 minutes. At first, I thought this had to do with the parameter ‘PRE_PAGE_SGA’. However, this parameter was set at its default value, FALSE.

Some investigating and some testing revealed this behaviour (touching the entire SGA) was governed by the hidden parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION, which defaults to TRUE in 12.1.0.2 on Linux.

Next I tested setting it to FALSE, and my startup time was reduced to 3.5 minutes (!).
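For completeness, an underscore parameter like this is set with the name double-quoted. The statement below is illustrative; as always, set undocumented parameters only after consulting Oracle Support:

SQL> alter system set "_TOUCH_SGA_PAGES_DURING_ALLOCATION" = false scope=spfile;

System altered.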

I used perf record again to look at where the time is spent when _TOUCH_SGA_PAGES_DURING_ALLOCATION is set to FALSE:

# Overhead          Command         Shared Object                                       Symbol
# ........  ...............  ....................  ...........................................
#
    41.25%  oracle_563318_s  [kernel.kallsyms]     [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          |
               |          |--93.26%-- __intel_new_memset
               |          |          |
               |          |          |--79.48%-- ksmnsin
               |          |          |          |
               |          |          |          |--95.98%-- kcbnfy
               |          |          |          |          kscnfy
               |          |          |          |          ksmcsg
               |          |          |          |          opistr_real
               |          |          |          |          opistr
               |          |          |          |          opiodr
               |          |          |          |          ttcpip
               |          |          |          |          opitsk
               |          |          |          |          opiino
               |          |          |          |          opiodr
               |          |          |          |          opidrv
               |          |          |          |          sou2o
               |          |          |          |          opimai_real
               |          |          |          |          ssthrdmain
               |          |          |          |          main
               |          |          |          |          __libc_start_main
               |          |          |          |
               |          |          |          |--3.32%-- kcbw_setup

This is a new backtrace, which looks like the former backtrace at the top (between clear_page_c_e and page_fault), simply because it's doing exactly the same thing: handling a page fault. However, the Oracle functions are different. What is happening here is that Oracle calls memset(), a C library function used to wipe a memory area. If we go further down the stack trace, we see (full speculation!):
* ksmnsin: kernel system memory numa segment initialisation?
* kcbnfy/kscnfy: kernel cache buffers numa FY?/kernel system cache numa FY?
* ksmcsg: kernel system memory create sga?

When looking in numa_maps at the memory allocations during startup of the bequeathing process with _TOUCH_SGA_PAGES_DURING_ALLOCATION set to FALSE, far fewer memory pages are touched; numa_maps only shows truly allocated pages, not the logical memory allocations. The pattern is the same: allocations per NUMA node for the shared pool (passing over the NUMA nodes several times), and then the buffer cache allocations per NUMA node, again passing over the different NUMA nodes several times. Because of the much smaller number of pages touched, the startup time of the instance is greatly reduced.

So, is this a solution to the long startup time?
On one hand yes, on the other hand no. What it means is: you need to page the memory no matter what you do. If you choose not to page the memory during startup (_TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE), you need to do it later. Sounds logical, right: pay now, or pay later? But what does that mean?

After startup of the instance with _TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE, a lot of the memory is not really allocated. If you start to use the pristine new SGA with a normal session, you pay the price of the paging which the bequeathing session otherwise took. Let me show you a 'perf record -g' example of a full table scan (one which reads the Oracle blocks into the buffer cache, not a direct path read):

# Overhead          Command       Shared Object                                Symbol
# ........  ...............  ..................  ....................................
#
    61.25%  oracle_564482_s  [kernel.kallsyms]   [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |          |
               |          |--99.86%-- handle_mm_fault
               |          |          __do_page_fault
               |          |          do_page_fault
               |          |          page_fault
               |          |          __intel_ssse3_rep_memcpy
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

If we follow the call stack from bottom to top:
* qergsFetch, qertbFetch: query execute row source code.
* kdsttgr: kernel data scan table get row.
* kdstf00000010000100kmP: kernel data scan table full, the ultra fast full table scan.
* kdst: other functions in kernel data scan table functions.
* ktrget2: kernel transaction layer.
* kcbgtcr: kernel cache buffers get consistent row.
* kcbz: kernel cache buffers Z, physical IO helper functions.
* __intel_ssse3_rep_memcpy: this is actually the function memcpy, replaced by a version optimised for Intel CPUs.
* page_fault, do_page_fault, __do_page_fault, handle_mm_fault, hugetlb_fault, clear_page_c_e: these are the same page fault kernel functions we saw earlier.

This clearly shows my full table scan now needs to do the paging!

If I set _TOUCH_SGA_PAGES_DURING_ALLOCATION to TRUE, start up the instance (so the pages are touched, and thus paged, during startup), and profile a full table scan, I see:

# Overhead          Command         Shared Object                                    Symbol
# ........  ...............  ....................  ........................................
#
    24.88%  oracle_577611_s  oracle                [.] ksl_get_shared_latch
            |
            --- ksl_get_shared_latch
               |
               |--78.46%-- kcbzfb
               |          kcbzgb
               |          |
               |          |--99.63%-- kcbzgm
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

This shows that the most prominent function called is 'ksl_get_shared_latch'. No paging to be seen.

At this point the reason for having _TOUCH_SGA_PAGES_DURING_ALLOCATION should be clear. The question I had at this point is: what about PRE_PAGE_SGA? In essence, this parameter is supposed to solve more or less the same issue, by having the SGA pages touched at startup to prevent paging for foreground sessions.

BTW, if you read about PRE_PAGE_SGA in the online documentation, it gives a reason for using PRE_PAGE_SGA which is not true (that page table entries are prebuilt for the SGA pages), and it indicates the paging (= page faults) is done at startup, which is also not true. It also claims 'every process that starts must access every page in the SGA'; again, this is not true.

From what I can see, when PRE_PAGE_SGA is set to TRUE, a background process is started that touches all SGA pages AFTER the instance has started and is open for usage. The background process I witnessed is 'sa00'. When recording the backtraces of that process, I see:

# Overhead        Command      Shared Object                                      Symbol
# ........  .............  .................  ..........................................
#
    97.57%  ora_sa00_slob  [kernel.kallsyms]  [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--100.00%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          ksmprepage_memory
               |          ksm_prepage_sga_seg
               |          skgmapply
               |          ksmprepage
               |          ksm_sslv_exec_cbk
               |          ksvrdp
               |          opirip
               |          opidrv
               |          sou2o
               |          opimai_real
               |          ssthrdmain
               |          main
               |          __libc_start_main
                --0.00%-- [...]

The kernel paging functions are exactly the same ones we have seen several times now. It's clear the functions executed by this process are specifically for the prepage functionality. The pre-paging done on behalf of _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE is done as part of the SGA creation and allocation (as can be seen from the Oracle function names). PRE_PAGE_SGA seems to be a 'workaround' if you don't want to spend the long time paging at startup, but still want the memory paged as soon as possible after startup. Needless to say, this is not the same as _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE: PRE_PAGE_SGA paging is done serially by a single process after startup, when the database is open for usage. So normal foreground processes that encounter non-paged memory, which means they use it before the sa00 process pages it, still need to do the paging themselves.

Conclusion
If you want to allocate a large SGA with Oracle 12.1.0.2 (this may apply to earlier versions too), the startup time can be significant. The reason is that the bequeathing session pages the memory on startup. This can be turned off by setting the undocumented parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION to FALSE. As a result, foreground (normal user) sessions need to do the paging. You can set the PRE_PAGE_SGA parameter to TRUE to have the paging done after startup, however it is then done by a single process (sa00) that serially pages the memory. Foreground processes that encounter non-paged memory, meaning they use it before the sa00 process could page it, still need to page it themselves.

Thanks to: Klaas-Jan Jongsma for proofreading.