Oakies Blog Aggregator

(DESCRIPTION_LIST=(DESCRIPTION=(ADDRESS_LIST=(FAILOVER=YES)(LOAD_BALANCE=NO)

Do you have complex connection strings with DESCRIPTION_LIST, DESCRIPTION, ADDRESS_LIST, ADDRESS and a nice combination of FAILOVER and LOAD_BALANCE? You have probably checked the documentation, which tells you that FAILOVER=YES is the default at all levels, but LOAD_BALANCE=YES is the default only for DESCRIPTION_LIST. But when disaster recovery and availability are concerned, the documentation is not sufficient. I want to test it. And here is how I do it.

I don’t want to test it with the real configuration and stop the different instances. And I don’t need to. My way to test an address list is to define a tnsnames.ora with the connection string, such as the following:

NET_SERVICE_NAME=
(DESCRIPTION_LIST=
(DESCRIPTION=
(CONNECT_DATA=(SERVICE_NAME=pdb1))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=101))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=102))
)
(DESCRIPTION=
(CONNECT_DATA=(SERVICE_NAME=pdb1))
(ADDRESS_LIST=
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=201))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=202))
)
)
)

I used localhost because I know it’s there and I don’t want to wait for the TCP timeout. But I use fake ports, which do not exist. So, finally, a connection will never be established, but I will be able to see all the addresses that are tried. I check them with strace on the connect() system call, with the following script:


for i in {1..10}
do
TNS_ADMIN=/tmp strace -T -e trace=connect sqlplus -s -L sys/oracle@NET_SERVICE_NAME as sysdba <<< "" 2>&1 | awk '
/sa_family=AF_INET, sin_port=htons/{
gsub(/[()]/," ") ; printf "%s ",$5
}
END{
print ""
}
'
done | sort | uniq

So, I used meaningful numbers for my fake ports: 101 and 102 for the addresses in the first description of the description list, and 201 and 202 for the address list in the second description. The awk script shows the sequence that was tried. And, because of the random round robin, I run the connection attempt in a loop several times to see all patterns, aggregated by sort|uniq.
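For readers less comfortable with awk, the same port extraction can be sketched in Python. This is a hypothetical helper, not part of the original script; the strace lines below are fabricated samples in the format the awk script matches:

```python
import re

# Extract, in order of appearance, the TCP ports that sqlplus tried,
# from "strace -e trace=connect" output.
PORT_RE = re.compile(r'sa_family=AF_INET, sin_port=htons\((\d+)\)')

def tried_ports(strace_output):
    return [int(m.group(1)) for m in PORT_RE.finditer(strace_output)]

sample = (
    'connect(9, {sa_family=AF_INET, sin_port=htons(101), '
    'sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED\n'
    'connect(9, {sa_family=AF_INET, sin_port=htons(102), '
    'sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED\n'
)
print(tried_ports(sample))  # [101, 102]
```

Running this over each loop iteration and feeding the results to a set gives the same aggregation as sort|uniq.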

So here is the result from the connection string above using the defaults for load balancing and failover:

101 102 201 202
201 202 101 102

The sequence within each address list is always in order (101,102 and 201,202) because LOAD_BALANCE=NO is the default there. But I have two combinations for the descriptions because LOAD_BALANCE=YES is the default in DESCRIPTION_LIST. Finally, all addresses are tried because FAILOVER=YES is the default at all levels.
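My reading of these semantics can be modelled in a few lines of Python (a toy model of the documented behavior, not Oracle’s code): shuffle at a level only when LOAD_BALANCE=YES there, and keep the in-order list otherwise.

```python
import random

def attempt_order(descriptions, load_balance):
    """One attempt sequence over a list of address lists.
    load_balance applies at the DESCRIPTION_LIST level; each inner
    list keeps its order (LOAD_BALANCE=NO is the default there)."""
    order = list(descriptions)
    if load_balance:
        random.shuffle(order)  # pick descriptions in random order
    return tuple(port for d in order for port in d)

# Defaults: LOAD_BALANCE=YES for DESCRIPTION_LIST, NO below it.
patterns = {attempt_order([[101, 102], [201, 202]], load_balance=True)
            for _ in range(1000)}
print(sorted(patterns))  # [(101, 102, 201, 202), (201, 202, 101, 102)]
```

The two patterns match the strace output above: random order between descriptions, fixed order within each.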

LOAD_BALANCE

If I define LOAD_BALANCE at all levels, such as:

NET_SERVICE_NAME=
(DESCRIPTION_LIST=(FAILOVER=YES)(LOAD_BALANCE=YES)
(DESCRIPTION=(FAILOVER=YES)(LOAD_BALANCE=YES)
(CONNECT_DATA=(SERVICE_NAME=pdb1))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=101))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=102))
)
(DESCRIPTION=(FAILOVER=YES)
(CONNECT_DATA=(SERVICE_NAME=pdb1))
(ADDRESS_LIST=(FAILOVER=YES)(LOAD_BALANCE=YES)
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=201))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=202))
)
)
)

The result shows that all combinations can be tried in any order:

101 102 201 202
101 102 202 201
102 101 201 202
102 101 202 201
201 202 101 102
201 202 102 101
202 201 101 102
202 201 102 101

By running it in a large loop you will confirm that any address will be tried at most once.

FAILOVER

Now, if I set FAILOVER=NO within the first description:

NET_SERVICE_NAME=
(DESCRIPTION_LIST=(FAILOVER=YES)(LOAD_BALANCE=NO)
(DESCRIPTION=(FAILOVER=NO)(LOAD_BALANCE=YES)
(CONNECT_DATA=(SERVICE_NAME=pdb1))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=101))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=102))
)
(DESCRIPTION=(LOAD_BALANCE=NO)
(CONNECT_DATA=(SERVICE_NAME=pdb1))
(ADDRESS_LIST=(FAILOVER=YES)(LOAD_BALANCE=YES)
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=201))
(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=202))
)
)
)

the first attempt can be 101 or 102 (because of LOAD_BALANCE=YES) but only one of them will be tried in this address list, because of no failover there. Then, the second description is attempted (because FAILOVER=YES at the description list level) and with all addresses there (because of FAILOVER=YES and LOAD_BALANCE=YES in its address list). The result of all possible combinations is:


101 201 202
102 201 202
102 202 201

So here it is. You can test any complex connection description to check which connections are possible and in which order they will be tried. From this, you can infer what will happen with a real configuration: the wait for the TCP timeout for addresses tried on hosts that are not up, and the load balancing given by the different possible combinations.

 

The article (DESCRIPTION_LIST=(DESCRIPTION=(ADDRESS_LIST=(FAILOVER=YES)(LOAD_BALANCE=NO) appeared first on Blog dbi services.

Project RICO2 and the history of APEX upgrade that went terribly wrong.

In my last blog post I explained the XOR algorithm that is used to compute the Oracle database block checksum.
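As a quick reminder of that scheme, here is my sketch of it: the checksum is an XOR over the block’s 16-bit words, so a block with a correct stored checksum XORs to zero. The 8-byte "block" below is a toy, not a real 8k Oracle block:

```python
import struct

def xor16(block):
    """XOR all 16-bit little-endian words of a block together."""
    acc = 0
    for (word,) in struct.iter_unpack('<H', block):
        acc ^= word
    return acc

# Toy 8-byte "block" with the checksum word (word 1) initially zero:
block = bytearray(struct.pack('<4H', 0x1234, 0x0000, 0xBEEF, 0x00AB))
checksum = xor16(block)                     # XOR of the remaining words
struct.pack_into('<H', block, 2, checksum)  # store it at byte offset 2
print(hex(checksum), xor16(block))          # a valid block now XORs to 0
```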

I also wrote that sometimes you face problems that are unresolvable without low-level knowledge. This is the story of that kind of situation. The story of misread documentation. The story of haste and hex.

About a year ago, I got a call from a company that had done an APEX upgrade for one of their customers. A very big customer. Quite a big database and a very important one, working 24/7.

They told me that they had upgraded APEX on a database with one PDB, and that a week later they tried to apply some patches and restarted the database. After the restart they got the following error:

SQL> alter pluggable database kowalsky open;
alter pluggable database kowalsky open
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01405: fetched column value is NULL

Oracle Support told them that they had messed up the APEX upgrade and that they should restore the database from before the upgrade and perform it correctly. They were right. But the customer was not happy to find out that they were about to lose a week of production data.

And this is where the true fun begins!!!

To investigate which query is the problem, you can use the following syntax:

SQL> alter session set events '1405 trace name errorstack level 1';

Session altered.

The trace file showed that this query was a problematic one:

----- Error Stack Dump -----
ORA-01405: fetched column value is NULL
----- Current SQL Statement for this session (sql_id=) -----
select version from registry$ where status not in (9, 99) and namespace='SERVER' and cid='APEX'

After tracing the session with event 10046 on level 12 I found the execution plan for this query:

STAT #140575638426528 id=1 cnt=1 pid=0 pos=1 obj=1508 op='TABLE ACCESS BY INDEX ROWID REGISTRY$ (cr=2 pr=0 pw=0 time=10 us cost=1 size=25 card=1)'
STAT #140575638426528 id=2 cnt=1 pid=1 pos=1 obj=1509 op='INDEX UNIQUE SCAN REGISTRY_PK (cr=1 pr=0 pw=0 time=6 us)'

Here we can see that the column "version" should contain some value for 'APEX' but it is NULL. We could try to put this value in the datafile, but it is not easy to extend a row manually. The other thing we could do is to locate the blocks of the index REGISTRY_PK and make sure that the query will not return any rows (for example, change the value APEX to APEK or something like that).

Or just try to change the status of this row to value 9 or 99.

Back then I used BBED to do it, but there were other cases when I had to do it with a bunch of scripts and a hex editor because I couldn’t compile BBED. So I decided to write my own version of BBED to have something useful when needed.

Full page logging in Postgres and Oracle

In my opinion, the volume of logging (aka redo log, aka xlog, aka WAL) is the most important factor for OLTP performance, availability and scalability, for several reasons:

  • This is the only structure where disk latency is a mandatory component of response time
  • This is a big part of the total volume of backups
  • This is sequential by nature, and very difficult to scale by parallelizing

In this post, I look at the volume of logging generated by some DML in Postgres and Oracle. I know Oracle quite well and am just starting to look at Postgres. The comparison here is not a contest but a way to understand both better. For example, the default behavior of Postgres, with full_page_writes=on, is very similar to Oracle’s ‘begin backup’ mode. The comparison makes no sense for most Postgres DBAs, but it probably helps Oracle DBAs to understand it.

Measure WAL segment writes

Here is how I measured the volume of transaction log written: start the Postgres server with ‘strace -f’ and parse the open(), write() and close() calls with ‘awk’:

sudo su postgres <<'END'
export PGDATA=/u01/pgdata
/usr/pgsql-10/bin/pg_ctl stop
strace -e trace=open,close,write,recvfrom -f /usr/pgsql-10/bin/pg_ctl start 2>&1 | awk '
/^[^[]/{
$0="[pid MAIN] "$0
}
/strace: Process [0-9][0-9]* attached/{
sub(/^.*strace: /,"strace: ") ; "ps -o cmd -hp " $3 |& getline proc[$3"]"] ; print "" ; print $0,proc[$3"]"]
}
/open[(].*pg_wal[/].* = [0-9]*$/{
z=$0 ; gsub(qq," ") ; fd_wal[$2 $NF]=$4
}
/checkpoint;/{
total_written_wal=0
}
/write[(]/{
#pid=$2 ; sub("]","",$2) ; "ps -o cmd -hp " p |& getline proc[p"]"]
z=$0 ; gsub("[(,]"," ") ; if ( fd_wal[$2 $4]>0 ) { written_wal[$2 $4]=written_wal[$2 $4]+$NF ; total_written_wal=total_written_wal+$NF } next
}
/close[(]/{
pid=$2 ; sub("[^0-9]","",pid) ;
z=$0 ; gsub("[()]"," ") ; if ( ( fd_wal[$2 $4]!="" ) && ( written_wal[$2 $4] > 0 ) ) {
printf " ( written %d bytes to %s -> total WAL segments: %.2f MB ) cmd=%s\n",written_wal[$2 $4],fd_wal[$2 $4],total_written_wal/1024/1024 , proc[$2] ; fd_wal[$2 $4]=""
} next
}
' qq='"'
END

Do not do that in production. This is experimentation in a lab. Do not attach strace to a critical process in production.

There’s probably an easier way to get the same information, maybe with the postgres activity statistics, or through a size-counting archive_command, so please don’t hesitate to comment. Anyway, from the write() calls I am sure that I’m counting exactly what I want: the volume of logging written to disk. As an Oracle DBA used to the LogWriter and its slave threads managing all writes, I started by tracing only the WAL writer process but quickly realized that part of the logging is directly written by my server process.
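For reference, the statistics route is straightforward: sample pg_current_wal_lsn() before and after the statement and subtract. An LSN prints as two hex halves, and the byte difference can be computed as below (a sketch of what pg_wal_lsn_diff() does server-side; the sample LSN values are made up for illustration):

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN string like '0/16D6300' to a byte position."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) + int(lo, 16)

def wal_mb(before, after):
    """WAL volume in MB between two pg_current_wal_lsn() samples."""
    return (lsn_to_bytes(after) - lsn_to_bytes(before)) / 1024 / 1024

# e.g. two hypothetical samples taken around an INSERT:
print(wal_mb('A/5F000000', 'A/69000000'))  # 160.0 MB of WAL between them
```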

Postgres: insert

I create a table with some numbers and a 100 bytes character string.

create table demo as select generate_series a,generate_series b,generate_series c,generate_series d,generate_series e,generate_series f,lpad('x',100,'x') g from generate_series(0,0);

The first operation I test is the insert of 1 million rows.

insert into demo select generate_series a,generate_series b,generate_series c,generate_series d,generate_series e,generate_series f,lpad('x',100,'x') g from generate_series(1,1000000);
( written 4349952 bytes to pg_wal/000000010000000A0000005F -> total WAL segments: 4.16 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 8192 bytes to pg_wal/000000010000000A0000005F -> total WAL segments: 9.00 MB ) cmd=postgres: wal writer process
( written 17735680 bytes to pg_wal/000000010000000A00000060 -> total WAL segments: 20.07 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 3309568 bytes to pg_wal/000000010000000A00000060 -> total WAL segments: 31.40 MB ) cmd=postgres: wal writer process
( written 33783808 bytes to pg_wal/000000010000000A00000061 -> total WAL segments: 36.03 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 3997696 bytes to pg_wal/000000010000000A00000061 -> total WAL segments: 39.80 MB ) cmd=postgres: wal writer process
( written 49676288 bytes to pg_wal/000000010000000A00000062 -> total WAL segments: 51.19 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 65273856 bytes to pg_wal/000000010000000A00000063 -> total WAL segments: 66.06 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 79364096 bytes to pg_wal/000000010000000A00000064 -> total WAL segments: 82.04 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 6660096 bytes to pg_wal/000000010000000A00000064 -> total WAL segments: 82.39 MB ) cmd=postgres: wal writer process
( written 88285184 bytes to pg_wal/000000010000000A00000065 -> total WAL segments: 98.02 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 14491648 bytes to pg_wal/000000010000000A00000065 -> total WAL segments: 106.82 MB ) cmd=postgres: wal writer process
( written 101703680 bytes to pg_wal/000000010000000A00000066 -> total WAL segments: 113.99 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 17825792 bytes to pg_wal/000000010000000A00000066 -> total WAL segments: 117.19 MB ) cmd=postgres: wal writer process
( written 115769344 bytes to pg_wal/000000010000000A00000067 -> total WAL segments: 128.20 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 18661376 bytes to pg_wal/000000010000000A00000067 -> total WAL segments: 135.09 MB ) cmd=postgres: wal writer process
( written 19824640 bytes to pg_wal/000000010000000A00000068 -> total WAL segments: 144.17 MB ) cmd=postgres: wal writer process
( written 131350528 bytes to pg_wal/000000010000000A00000068 -> total WAL segments: 148.16 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 27435008 bytes to pg_wal/000000010000000A00000069 -> total WAL segments: 159.80 MB ) cmd=postgres: wal writer process
( written 140132352 bytes to pg_wal/000000010000000A00000069 -> total WAL segments: 159.80 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
INSERT 0 1000000

You can see that my ‘strace|awk’ script is running in the background and has counted about 160 MB of logging, partly from the ‘postgres: wal writer process’ and partly from ‘postgres: demo demo 192.168.56.122(38013)’ serving my connection.

The relation size as stored on disk is about 150 MB:

analyze demo;
ANALYZE
select relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, relpages*8/1024 MB from pg_class where relname = 'demo';
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | mb
---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+-----
demo | 2200 | 25157 | 0 | 16385 | 0 | 25155 | 0 | 19231 | 1e+06 | 0 | 150
(1 row)

This makes sense. An insert has to write all new data into the log in order to be able to recover the pages until they are checkpointed.

Note that I have no index on this table for this test.
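A quick back-of-the-envelope check of those numbers (simple arithmetic on the pg_class output, not a measurement):

```python
# 1 million rows of 6 integers + a 100-byte string, in 8k pages:
rows = 1_000_000
relpages = 19231          # from pg_class above
page_size = 8192
table_mb = relpages * page_size / 1024 / 1024
print(round(table_mb))                     # ~150 MB on disk
print(round(relpages * page_size / rows))  # ~158 bytes per row incl. tuple/page overhead
```

The ~160 MB of WAL is therefore roughly the table volume plus record overhead, which is what the next paragraph explains.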

Postgres: update

I’m now updating one column for all rows.

update demo set b=b+1;
( written 150528000 bytes to pg_wal/000000010000000A0000006A -> total WAL segments: 4.01 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 162693120 bytes to pg_wal/000000010000000A0000006B -> total WAL segments: 17.84 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 29769728 bytes to pg_wal/000000010000000A0000006B -> total WAL segments: 28.44 MB ) cmd=postgres: wal writer process
...
( written 84287488 bytes to pg_wal/000000010000000A00000081 -> total WAL segments: 343.65 MB ) cmd=postgres: wal writer process
( written 453705728 bytes to pg_wal/000000010000000A00000082 -> total WAL segments: 347.36 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
UPDATE 1000001

I touched only a small part of the volume in bytes, but I touched all rows and all pages. And, even if only a few bytes are modified, Postgres logs the whole page to protect from fractured blocks in case of a crash (pages partially written). So that’s about 150 MB. But Postgres does not update rows in-place. The whole row is inserted in its new version, which means the whole volume again, which is another 150 MB. If we look at the size of the table, we can see 300 MB of pages:

analyze demo;
ANALYZE
select relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, relpages*8/1024 MB from pg_class where relname = 'demo';
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | mb
---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-------------+---------------+-----
demo | 2200 | 25157 | 0 | 16385 | 0 | 25155 | 0 | 38462 | 1.21882e+06 | 0 | 300
(1 row)

So this update has generated even more logging: 347 MB.

Postgres: sparse update

Now updating only 1 row out of ten, still one column only:

update demo set b=b+1 where mod(a,10)=1;
( written 89923584 bytes to pg_wal/000000010000000A00000083 -> total WAL segments: 13.88 MB ) cmd=postgres: wal writer process
( written 469123072 bytes to pg_wal/000000010000000A00000084 -> total WAL segments: 22.98 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
...
( written 563576832 bytes to pg_wal/000000010000000A0000008D -> total WAL segments: 151.07 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 130940928 bytes to pg_wal/000000010000000A0000008D -> total WAL segments: 151.27 MB ) cmd=postgres: wal writer process
UPDATE 100000
analyze demo;

So, 10% of the rows had to be copied to their new version, which adds 15 MB to the table size.

analyze demo;
ANALYZE
select relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, relpages*8/1024 MB from pg_class where relname = 'demo';
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | mb
---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-------------+---------------+-----
demo | 2200 | 25157 | 0 | 16385 | 0 | 25155 | 0 | 40385 | 1.07267e+06 | 0 | 315
(1 row)

For these additional 15 MB, half of the table pages had to be modified (the current version having to point to the new version), and the logging generated was 150 MB. Because of MVCC at tuple level, doing something similar to ‘chained rows’ and ‘row migration’ for all updates, and because of full page logging, even sparse updates generate a lot of log writes.
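The arithmetic behind those 150 MB can be sketched as follows, using the page counts from the pg_class outputs above (half of the 38462 pages hold the current row versions):

```python
page_size = 8192
live_pages = 38462 // 2                   # pages holding the current row versions
rows_per_page = 1_000_000 // live_pages   # ~51 consecutive rows per page
updated_per_page = rows_per_page // 10    # mod(a,10)=1 hits ~5 rows per page
# Every page of current versions therefore contains updated rows
# and gets a full page image after the checkpoint:
print(updated_per_page)                          # 5
print(round(live_pages * page_size / 2**20))     # ~150 MB of full page images
```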

Postgres: delete

Here is a delete of those million rows:

delete from demo;
( written 576364544 bytes to pg_wal/000000010000000A0000008E -> total WAL segments: 6.44 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 134930432 bytes to pg_wal/000000010000000A0000008E -> total WAL segments: 6.73 MB ) cmd=postgres: wal writer process
( written 589225984 bytes to pg_wal/000000010000000A0000008F -> total WAL segments: 18.70 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
...
( written 162054144 bytes to pg_wal/000000010000000A00000099 -> total WAL segments: 184.70 MB ) cmd=postgres: wal writer process
( written 740352000 bytes to pg_wal/000000010000000A0000009A -> total WAL segments: 189.80 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
DELETE 1000001
( written 163217408 bytes to pg_wal/000000010000000A0000009A -> total WAL segments: 196.22 MB ) cmd=postgres: wal writer process

Marking tuples as deleted does not increase the table size:

analyze demo;
ANALYZE
select relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, relpages*8/1024 MB from pg_class where relname = 'demo';
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | mb
---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+-----
demo | 2200 | 25157 | 0 | 16385 | 0 | 25155 | 0 | 40385 | 275837 | 0 | 315
(1 row)

But all current tuples have to be marked as deleted and not visible once the transaction is committed. This touches all pages of the current version, which means more than 150 MB of logging here.

Postgres: vacuum

After two updates and a delete, I have old tuples in this table. It seems that VACUUM generates very little logging:

vacuum demo;
( written 762445824 bytes to pg_wal/000000010000000A0000009B -> total WAL segments: 14.67 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
VACUUM

My guess (but remember that I am a newbie in Postgres) is that in case of a crash occurring before the next checkpoint, we would just have to run VACUUM again. But this is not what was answered in the postgres-general list a few years ago.

Note that full page logging is not necessary for every change, but only for the first change after the page was read from disk after a checkpoint. This is sufficient to cover future write failures because recovery will start from there. Once the full page is logged, the change vector is sufficient for further recovery. However, I had the same amount of WAL, 15 MB, when vacuuming after a checkpoint.
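That rule can be sketched as a tiny model (my simplification: real WAL records also carry headers and compress the unused ‘hole’ in the page):

```python
PAGE_SIZE = 8192

def wal_record_size(page_lsn, checkpoint_redo_lsn, change_vector_size):
    """Toy model of full_page_writes=on: the first change to a page
    after a checkpoint logs the full page image; later changes to the
    same page log only the change vector."""
    if page_lsn < checkpoint_redo_lsn:
        return PAGE_SIZE           # first touch since the checkpoint
    return change_vector_size      # page already full-page-logged

redo = 1000  # REDO pointer of the last checkpoint
print(wal_record_size(900, redo, 40))   # 8192: page last written before checkpoint
print(wal_record_size(1200, redo, 40))  # 40: only the delta is needed
```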

Oracle: insert

Let’s do some similar things in Oracle, whose MVCC implementation is completely different: at block level, with undo logging.

SQL> create table demo as select rownum a,rownum b,rownum c,rownum d,rownum e,rownum f,lpad('x',100,'x') g from xmltable('0 to 0');
Table created.

I have shown in a previous post how I get the delta values from V$MYSTAT joined to V$STATNAME using (STATISTIC#) for ‘redo size’, so there is no need for strace here. But we could see the same result by measuring the writes to the redo log groups (without double counting the multiplexed members).

SQL> insert into demo select rownum a,rownum b,rownum c,rownum d,rownum e,rownum f,lpad('x',100,'x') g from xmltable('1 to 1000000');
1000001 rows inserted.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
141,342 155,218,876 4,380,448 104,411

This is about 150MB, which is the volume of the table:

SQL> exec dbms_stats.gather_table_stats(user,'DEMO');
PL/SQL procedure successfully completed.
SQL> select table_name,num_rows,blocks,(blocks*block_size/1024/1024) MB, avg_row_len from user_tables join dba_tablespaces using(tablespace_name) where table_name='DEMO';
 
TABLE_NAME NUM_ROWS BLOCKS MB AVG_ROW_LEN
------------------------------ ---------- ---------- ---------- -----------
DEMO 1000001 19280 150.625 131

Conclusion for inserts: all databases have to log the whole data inserted in order to be protected from an instance crash. Note that Oracle has a way to insert directly into the datafiles, bypassing the buffer cache, and thus reduce the logging required for crash recovery. But I’m not doing bulk inserts here.

Oracle: update

The update in Oracle is done in-place. There is no need to copy the whole row (except in the rare cases where the row grows and no longer fits into the block). However, the old value of the column must be copied for MVCC, into the UNDO segment. This is why we see 46 MB of ‘undo change vector size’ here.

SQL> update demo set b=b+1;
1000001 rows updated.
SQL> commit;
Commit complete.
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
170,777 105,301,308 48,641,772 82,221

The UNDO is only the change vector, not the full block. If you read about copies of full blocks to rollback segments, it is a confusion, either from veterans of Oracle 5 or from a misunderstanding of flashback features. The UNDO being stored in segments, written first into the buffer cache, it is protected by redo logging, so about 46 MB of the redo is actually the redo protecting the undo vectors. The other 54 MB of redo is the new value of the update.
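Dividing the measured figures by the number of rows gives an idea of the per-row cost (simple arithmetic on the statistics above):

```python
rows = 1_000_001
undo = 48_641_772    # 'undo change vector size' for the full update
redo = 105_301_308   # 'redo size', which includes the redo protecting the undo
print(round(undo / rows))           # ~49 bytes of undo per row
print(round((redo - undo) / rows))  # ~57 bytes of redo per row for the new values
```

Per row, that is a few dozen bytes in each direction, instead of a full 8k page image per touched block.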

Oracle: sparse update

The logging of change vectors rather than full pages is even cheaper with sparse updates:

SQL> update demo set b=b+1 where mod(a,10)=1;
100001 rows updated.
SQL> commit;
Commit complete.
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
56,583 15,414,328 6,111,608 36,921

The volume of undo and redo generated is only 15 MB here, including 6 MB of undo vectors. This is really optimized, and this is one reason why you should update only the columns that changed (and not use the default non-dynamic update of Hibernate, for example).

Oracle: delete

The delete has to mark all rows as deleted and, because the space can immediately be reused, the whole row must be logged into the UNDO, and this has to be logged into the REDO as well, so the delete generates a lot of logging:

SQL> delete from demo;
1000001 rows deleted.
SQL> commit;
Commit complete.
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
2,124,823 403,755,892 240,302,088 1,093,821

I have no indexes here. With indexes, all index entries have to be marked as deleted, and this generates undo and redo vectors because MVCC in Oracle is at block level: each block modification, for table or index, has to be logged.

Deleting a lot of rows is an expensive operation in Oracle. For bulk purges, it is often better to truncate and insert /*+ append */ when possible (as in a non-atomic materialized view refresh). Partitioning helps there as well, for example to purge old data when the table is partitioned on date.

Postgres without full page logging

Given the huge overhead, is full page logging really required? There are plans to avoid it, mentioned in the Postgres ToDo wiki, or at least to keep it only short-term for crash recovery and not for media recovery. Another possibility is to implement a checksum on the blocks so that fractured blocks can be detected. Then, when detected, the fractured blocks may not need full page logging to recover them if we can restore a previous backup. This takes longer to recover, but that can be acceptable given the low probability of this kind of failure. In addition, when you have a physical standby synchronized with log shipping, you have an easy way to recover without having to restore files. But you need a checksum to detect the problem.

Without a checksum, the problem is the detection of partial writes. But if you trust your storage and if you failover to the standby in case of a crash, you may accept to set full_page_writes=off and this is what I did here.


insert into demo select generate_series a,generate_series b,generate_series c,generate_series d,generate_series e,generate_series f,lpad('x',100,'x') g from generate_series(1,1000000);
...
( written 125255680 bytes to pg_wal/000000010000000A000000E3 -> total WAL segments: 140.65 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
INSERT 0 1000000

The insert still has to log all the new data: 140 MB.


update demo set b=b+1;
...
( written 72613888 bytes to pg_wal/000000010000000A000000F2 -> total WAL segments: 213.02 MB ) cmd=postgres: wal writer process
UPDATE 1000001

The update has to log only what is modified, but because of the Postgres MVCC implementation, the whole row has to be written in its new version, and the old ones have their pointers updated: 210 MB here.


update demo set b=b+1 where mod(a,10)=1;
( written 305709056 bytes to pg_wal/000000010000000A000000F3 -> total WAL segments: 1.96 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
( written 72613888 bytes to pg_wal/000000010000000A000000F3 -> total WAL segments: 5.62 MB ) cmd=postgres: wal writer process
( written 75718656 bytes to pg_wal/000000010000000A000000F4 -> total WAL segments: 9.65 MB ) cmd=postgres: wal writer process
( written 310665216 bytes to pg_wal/000000010000000A000000F4 -> total WAL segments: 9.65 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
UPDATE 100000

The sparse update benefits from logging only the changed rows: 10 MB here. This is even smaller than with Oracle because there’s no UNDO to write here: the old values stay in place.


delete from demo;
( written 323256320 bytes to pg_wal/000000010000000A000000F5 -> total WAL segments: 11.27 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
( written 338829312 bytes to pg_wal/000000010000000A000000F6 -> total WAL segments: 26.92 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
( written 76562432 bytes to pg_wal/000000010000000A000000F6 -> total WAL segments: 31.41 MB ) cmd=postgres: wal writer process
( written 345415680 bytes to pg_wal/000000010000000A000000F7 -> total WAL segments: 39.73 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
( written 83410944 bytes to pg_wal/000000010000000A000000F7 -> total WAL segments: 40.41 MB ) cmd=postgres: wal writer process
DELETE 1000001

The delete is cheap when full_page_writes=off because only the visibility is changed and the data remains (until committed and vacuumed). If you have a lot of rows to delete, you may consider setting full_page_writes=off, and be sure to have a backup to restore in case of a crash.

Oracle full page logging in backup mode

So, by default, Oracle does not need to protect from fractured blocks, because they can be detected. If the storage crashes while a block is partially written, the block is corrupt. Thanks to the checksum, this corruption will be detected during recovery (or even earlier, depending on DB_BLOCK_CHECKSUM and DB_LOST_WRITE_PROTECT). The redo is not sufficient, as it contains only change vectors, but you can restore the block from the last backup and Oracle can do a simple block recovery. This recovery can also be done from the standby database.

However, full page logging exists in Oracle. When running a backup with a non-Oracle tool that is not aware of block checksums, you need to enclose the copy or snapshot between ‘begin backup’ and ‘end backup’. You do this because an online backup may read partially updated blocks and, without the checksum, cannot detect them. A corrupt backup is not very useful, and this is why this backup mode generates more redo in order to be able to recover those blocks. This is very similar to full page logging: the redo generated for the first modification of the buffer will store the whole block. The next modifications, until the buffer is checkpointed, will need only the change vectors.

I think the first article I’ve ever written was a description of the Oracle backup mode. And it is still visible thanks to archive.org only because it was published on… Google Knol!

So, here is the same run with Oracle in backup mode.

Insert does not change a lot as it fills full blocks:

SQL> insert into demo select rownum a,rownum b,rownum c,rownum d,rownum e,rownum f,lpad('x',100,'x') g from xmltable('1 to 1000000');
1000000 rows created.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
141,376 156,527,072 4,380,448 124,195

A full update of one column generates the same undo, but more than 2x the redo because of full page logging:

SQL> update demo set b=b+1;
1000001 rows updated.
 
SQL> commit;
Commit complete.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
170,778 238,317,632 48,641,772 104,640

The sparse update is almost the same as the full update because these 10% of rows touch all pages:

SQL> update demo set b=b+1 where mod(a,10)=1;
100001 rows updated.
 
SQL> commit;
Commit complete.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
319,622 240,502,284 17,832,196 192,815

The delete generates even more because there’s all the UNDO in addition to all the data pages:

SQL> delete from demo;
1000001 rows deleted.
 
SQL> commit;
Commit complete.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
2,125,285 558,510,928 240,303,768 1,143,131

So what?

Beyond the very different implementations of Postgres and Oracle, we can see that we have flexibility: the large logging generated by Postgres by default may be reduced in some cases, and the minimal logging which is the default for Oracle may be larger in some situations. The most important thing, as with all technologies, is to understand how it works. Only then can you make the right choice to balance performance, availability, and cost. Understanding how it works means: reading the docs (how it is supposed to work) and testing (how it actually works). With Oracle, there’s additional information from a huge community testing and using it for decades. With Postgres, as with all open source projects, the source code with its comments is an amazing documentation.

 

The article Full page logging in Postgres and Oracle appeared first on Blog dbi services.

Join Factorization

This item is, by a roundabout route, a follow-up to yesterday’s note on a critical difference in cardinality estimates that appeared if you used the coalesce() function in its simplest form as a substitute for the nvl() function. Connor McDonald wrote a follow-up note about how using the nvl() function in a suitable predicate could lead to Oracle splitting a query into a UNION ALL (in version 12.2), which led me to go back to a note I’d written on the same topic about 10 years earlier where the precursor of this feature already existed but used CONCATENATION instead of OR-EXPANSION. The script I’d used for my earlier article was actually one I’d written in February 2003 and tested fairly regularly since – which brings me to this article, because I finally tested my script against 12.2.0.1 to discover a very cute bit of optimisation.

The business of splitting a query into two parts can be used even when the queries are more complex and include joins – this doesn’t always happen automatically and sometimes has to be hinted (but that can be a cost/statistics thing). For example, from 12.1.0.2 – a query and its execution plan:


select
        *
from
        t1, t2
where
        t1.v1 = nvl(:v1,t1.v1)
and     t2.n1 = t1.n1
;

---------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |         |  1001 |   228K|    11   (0)| 00:00:01 |
|   1 |  CONCATENATION                          |         |       |       |            |          |
|*  2 |   FILTER                                |         |       |       |            |          |
|*  3 |    HASH JOIN                            |         |  1000 |   228K|     8   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL                   | T2      |  1000 |   106K|     4   (0)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL                   | T1      |  1000 |   122K|     4   (0)| 00:00:01 |
|*  6 |   FILTER                                |         |       |       |            |          |
|   7 |    NESTED LOOPS                         |         |     1 |   234 |     3   (0)| 00:00:01 |
|   8 |     NESTED LOOPS                        |         |     1 |   234 |     3   (0)| 00:00:01 |
|   9 |      TABLE ACCESS BY INDEX ROWID BATCHED| T1      |     1 |   125 |     2   (0)| 00:00:01 |
|* 10 |       INDEX RANGE SCAN                  | T1_IDX1 |     1 |       |     1   (0)| 00:00:01 |
|* 11 |      INDEX UNIQUE SCAN                  | T2_PK   |     1 |       |     0   (0)| 00:00:01 |
|  12 |     TABLE ACCESS BY INDEX ROWID         | T2      |     1 |   109 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(:V1 IS NULL)
   3 - access("T2"."N1"="T1"."N1")
   5 - filter("T1"."V1" IS NOT NULL)
   6 - filter(:V1 IS NOT NULL)
  10 - access("T1"."V1"=:V1)
  11 - access("T2"."N1"="T1"."N1")

You can see in this plan how Oracle has split the query into two queries combined through concatenation with FILTER operations at lines 2 (:v1 is null) and 6 (:v1 is not null) to allow the runtime engine to execute only the appropriate branch. You’ll also note that each branch can be optimised separately and in this case the two branches get dramatically different paths because of the enormous difference in the estimated volumes of data.
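In case it helps to visualise the transformation, here is a hand-written rewrite that is logically equivalent to what the optimizer has produced – just a sketch, since the optimizer’s internal version carries some extra bookkeeping predicates, but it shows why each branch can pick its own access path:

select  *
from    t1, t2
where   :v1 is null
and     t1.v1 is not null
and     t2.n1 = t1.n1
union all
select  *
from    t1, t2
where   :v1 is not null
and     t1.v1 = :v1
and     t2.n1 = t1.n1
;

When :v1 is null the predicate t1.v1 = nvl(:v1, t1.v1) collapses to “t1.v1 = t1.v1”, which is true for every row where t1.v1 is not null – hence the full scans and hash join; when :v1 is not null it collapses to “t1.v1 = :v1” – hence the index-driven nested loop.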

So let’s move up to 12.2.0.1 and see what happens to this query – but first I’m going to execute a naughty “alter session…”:


------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                 |  1001 |   180K|    11   (0)| 00:00:01 |
|   1 |  VIEW                                    | VW_ORE_F79C84EE |  1001 |   180K|    11   (0)| 00:00:01 |
|   2 |   UNION-ALL                              |                 |       |       |            |          |
|*  3 |    FILTER                                |                 |       |       |            |          |
|   4 |     NESTED LOOPS                         |                 |     1 |   234 |     3   (0)| 00:00:01 |
|   5 |      NESTED LOOPS                        |                 |     1 |   234 |     3   (0)| 00:00:01 |
|   6 |       TABLE ACCESS BY INDEX ROWID BATCHED| T1              |     1 |   125 |     2   (0)| 00:00:01 |
|*  7 |        INDEX RANGE SCAN                  | T1_IDX1         |     1 |       |     1   (0)| 00:00:01 |
|*  8 |       INDEX UNIQUE SCAN                  | T2_PK           |     1 |       |     0   (0)| 00:00:01 |
|   9 |      TABLE ACCESS BY INDEX ROWID         | T2              |     1 |   109 |     1   (0)| 00:00:01 |
|* 10 |    FILTER                                |                 |       |       |            |          |
|* 11 |     HASH JOIN                            |                 |  1000 |   228K|     8   (0)| 00:00:01 |
|  12 |      TABLE ACCESS FULL                   | T2              |  1000 |   106K|     4   (0)| 00:00:01 |
|* 13 |      TABLE ACCESS FULL                   | T1              |  1000 |   122K|     4   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(:V1 IS NOT NULL)
   7 - access("T1"."V1"=:V1)
   8 - access("T2"."N1"="T1"."N1")
  10 - filter(:V1 IS NULL)
  11 - access("T2"."N1"="T1"."N1")
  13 - filter("T1"."V1" IS NOT NULL)

There’s nothing terribly exciting about the change – except for the disappearance of the CONCATENATION operator and the appearance of the VIEW and UNION ALL operators to replace it (plus you’ll see that the two branches appear in the opposite order in the plan). But let’s try again, without doing that “alter session…”:


--------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                    |  1001 |   229K|    10   (0)| 00:00:01 |
|*  1 |  HASH JOIN                              |                    |  1001 |   229K|    10   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL                     | T2                 |  1000 |   106K|     4   (0)| 00:00:01 |
|   3 |   VIEW                                  | VW_JF_SET$A2355C8B |  1001 |   123K|     6   (0)| 00:00:01 |
|   4 |    UNION-ALL                            |                    |       |       |            |          |
|*  5 |     FILTER                              |                    |       |       |            |          |
|*  6 |      TABLE ACCESS FULL                  | T1                 |  1000 |   122K|     4   (0)| 00:00:01 |
|*  7 |     FILTER                              |                    |       |       |            |          |
|   8 |      TABLE ACCESS BY INDEX ROWID BATCHED| T1                 |     1 |   125 |     2   (0)| 00:00:01 |
|*  9 |       INDEX RANGE SCAN                  | T1_IDX1            |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."N1"="ITEM_1")
   5 - filter(:V1 IS NULL)
   6 - filter("T1"."V1" IS NOT NULL)
   7 - filter(:V1 IS NOT NULL)
   9 - access("T1"."V1"=:V1)

The plan now shows a VIEW which is a UNION ALL involving only table t1 in both its branches. The result set from the view is then used as the probe table of a hash join with t2. You’ll note that the name of the view is now VW_JF_SET$A2355C8B – that’s JF for “Join Factorization”, and the alter session I executed to get the first plan was to disable the feature: ‘alter session set “_optimizer_join_factorization”= false;’.

Join factorization can occur when the optimizer sees a union all view with some tables that are common to both (all) branches of the query, and finds that it can move those tables outside the query while getting the same end result at a lower cost. In this case it happens to be a nice example of how the optimizer can transform and transform again to get to the lowest cost plan.
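As a sketch (using hypothetical tables t1a and t1b, not the tables from this test), join factorization takes a union all of the form:

select t1a.*, t2.* from t1a, t2 where t2.n1 = t1a.n1
union all
select t1b.*, t2.* from t1b, t2 where t2.n1 = t1b.n1

and rewrites it so that the common table is scanned and joined just once:

select  v.*, t2.*
from    (
        select * from t1a
        union all
        select * from t1b
        ) v,
        t2
where   t2.n1 = v.n1
;

which is why the 12.2 plan above shows a single full tablescan of t2 feeding one hash join, with the union all view as the probe input.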

It’s worth noting that Join Factorization has been around since 11.2.x.x, and Or Expansion has been around for even longer – but it’s not until 12.2 that nvl() transforms through Or Expansion, which allows it to transform through Join Factorization.

You’ll note, by the way, that with this plan we always do a full tablescan of t2, whereas with just Or-Expansion it’s a potential threat that may never (or hardly ever) be realised.  That’s a point to check if you find that the transformation starts to appear inappropriately on an upgrade. There is a hint to disable the feature for a query, but it’s not trivial to get it right, so if you do need to block the feature the smart hint (or SQL Patch) would be “opt_param(‘_optimizer_join_factorization’ ‘false’)”.

Footnote:

If you want to run the experiments yourself, here’s the script I used to generate the data. It’s more complicated than it needs to be because I use the same tables in several different tests:

rem
rem     Script:         null_plan_122.sql
rem     Author:         Jonathan Lewis
rem     Dated:          February 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1        Join Factorization
rem             12.1.0.2        Concatenation
rem
rem

drop table t2;
drop table t1;

-- @@setup  -- various set commands etc.

create table t1 (
        n1              number(5),
        n2              number(5),
        v1              varchar2(10),
        v2              varchar2(10),
        v3              varchar2(10),
        v4              varchar2(10),
        v5              varchar2(10),
        padding         varchar2(100),
        constraint t1_pk primary key(n1)
);

insert into t1
select
        rownum,
        rownum,
        rownum,
        trunc(100 * dbms_random.value),
        trunc(100 * dbms_random.value),
        trunc(100 * dbms_random.value),
        trunc(100 * dbms_random.value),
        rpad('x',100)
from all_objects
where
        rownum <= 1000 -- > comment to avoid WordPress format mess
;

create unique index t1_n2 on t1(n2);

create index t1_idx1 on t1(v1);
create index t1_idx2 on t1(v2,v1);
create index t1_idx3 on t1(v3,v2,v1);

create table t2 (
        n1              number(5),
        v1              varchar2(10),
        padding         varchar2(100),
        constraint t2_pk primary key(n1)
);

insert into t2
select
        rownum,
        rownum,
        rpad('x',100)
from all_objects
where
        rownum <= 1000     -- > comment to avoid WordPress format mess
;

create index t2_idx on t2(v1);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

variable n1 number
variable n2 number
variable v1 varchar2(10)
variable v2 varchar2(10)
variable v3 varchar2(10)

exec :n1 := null
exec :n2 := null
exec :v1 := null
exec :v2 := null
exec :v3 := null

spool null_plan_122

set autotrace traceonly explain

prompt  ============================================
prompt  One colx = nvl(:b1,colx) predicate with join
prompt  ============================================

select
        *
from
        t1, t2
where
        t1.v1 = nvl(:v1,t1.v1)
and     t2.n1 = t1.n1
;

alter session set "_optimizer_join_factorization" = false;

select
        *
from
        t1, t2
where
        t1.v1 = nvl(:v1,t1.v1)
and     t2.n1 = t1.n1
;

alter session set "_optimizer_join_factorization" = true;

set autotrace off

spool off

NVL vs COALESCE

Jonathan Lewis just published a blog post about NVL and COALESCE and the optimizer costings for each.

There is also perhaps a significant difference between NVL and COALESCE in that the former seems to have an in-built optimization for handling bind variables and nulls. 

Consider an application where users optionally pass in search criteria and you have to query a table based on that criteria.  You have three natural choices here to implement that:

WHERE column = :search_criteria or :search_criteria is null

or

WHERE column = nvl(:search_criteria ,column)

or

WHERE column = coalesce(:search_criteria,column)

Functionally they are identical*, but the implementation detail shows a nice little optimizer trick that only works with NVL.


SQL> create table t as select * From dba_objects;

Table created.

SQL> variable search_criteria number
SQL>
SQL> exec :search_criteria := 123

PL/SQL procedure successfully completed.

SQL>
SQL> create index ix1 on t ( object_id ) ;

Index created.

SQL>
SQL> set feedback only
SQL> select *
  2  from t
  3  where object_id = nvl(:search_criteria,object_id);

1 row selected.

SQL>
SQL> set feedback on
SQL> select * from dbms_xplan.display_cursor();

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------
SQL_ID  0g820t1jw00hm, child number 0
-------------------------------------
select * from t where object_id = nvl(:search_criteria,object_id)

Plan hash value: 2258578794

----------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                 |       |       |   430 (100)|          |
|   1 |  VIEW                                  | VW_ORE_1B35BA0F | 78868 |    36M|   430   (1)| 00:00:01 |
|   2 |   UNION-ALL                            |                 |       |       |            |          |
|*  3 |    FILTER                              |                 |       |       |            |          |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T               |     1 |   132 |     2   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN                  | IX1             |     1 |       |     1   (0)| 00:00:01 |
|*  6 |    FILTER                              |                 |       |       |            |          |
|*  7 |     TABLE ACCESS FULL                  | T               | 78867 |     9M|   428   (1)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(:SEARCH_CRITERIA IS NOT NULL)
   5 - access("OBJECT_ID"=:SEARCH_CRITERIA)
   6 - filter(:SEARCH_CRITERIA IS NULL)
   7 - filter("OBJECT_ID" IS NOT NULL)


27 rows selected.

SQL>
SQL> set feedback only
SQL> select *
  2  from t
  3  where object_id = coalesce(:search_criteria,object_id);

1 row selected.

SQL>
SQL> set feedback on
SQL> select * from dbms_xplan.display_cursor();

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------
SQL_ID  am3uvm7nvx5d9, child number 0
-------------------------------------
select * from t where object_id = coalesce(:search_criteria,object_id)

Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |       |       |   427 (100)|          |
|*  1 |  TABLE ACCESS FULL| T    |     1 |   132 |   427   (1)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("OBJECT_ID"=COALESCE(:SEARCH_CRITERIA,"OBJECT_ID"))


18 rows selected.

SQL>
SQL>
SQL>
SQL> set feedback only
SQL> select *
  2  from t
  3  where ( object_id = :search_criteria or :search_criteria is null );

1 row selected.

SQL>
SQL> set feedback on
SQL> select * from dbms_xplan.display_cursor();

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------
SQL_ID  ff0s2j51scxss, child number 0
-------------------------------------
select * from t where ( object_id = :search_criteria or
:search_criteria is null )

Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |       |       |   427 (100)|          |
|*  1 |  TABLE ACCESS FULL| T    |  3945 |   508K|   427   (1)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter((:SEARCH_CRITERIA IS NULL OR
              "OBJECT_ID"=:SEARCH_CRITERIA))


20 rows selected.

SQL>

Only NVL gets the benefit of the query being “split” into two pieces – one to handle the case where the passed criteria is null, and the other for when the criteria is not null.  The FILTER in line 3 shows that we will only run one or the other.

So for these particular types of queries, make sure you test all the possibilities – you might find NVL (currently) is your best bet.

 

* – Addenda:  Thanks to Jonathan for pointing out that you can get discrepancies in the results for the three strategies above for columns that may contain nulls, so as always, take care.
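To illustrate that discrepancy: suppose the column were nullable (object_id isn’t, in this demo) and the bind is null, then for a row where the column is null:

where col = nvl(:b, col)         -- becomes col = col, which is unknown for a null: row NOT returned
where col = coalesce(:b, col)    -- same: row not returned
where col = :b or :b is null     -- ":b is null" is true: row IS returned

so the three strategies only agree when the column contains no nulls.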

Coalesce v. NVL

“Modern” SQL should use the coalesce() function rather than the nvl() function – or so the story goes – but do you always want to do that to an Oracle database? The answer is “maybe not”. Although the coalesce() function can emulate the nvl() function (in many cases) there are significant differences in behaviour, some that suggest it’s a good idea to use the substitution and others that suggest otherwise. Different decisions may be appropriate for different circumstances, and this note highlights one case against the substitution. We’ll start with a simple data set:

rem
rem     Script:         nvl_coalesce_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format mess
)
select
        rownum                          id,
        case mod(rownum,4)
                when 0  then 'Y'
                        else 'N'
        end                             yes_no,
        case mod(rownum,5)
                when 0  then 'Y'
                when 1  then null
                        else 'N'
        end                             yes_null_no,
        lpad('x',100,'x')               padding
from
        generator
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1 for columns size 5 yes_no yes_null_no'
        );
end;
/

I’ve created a table with 10,000 rows and two columns with a highly skewed data distribution. Because I know that the skew is supposed to have a significant effect I’ve used a non-standard method_opt when gathering stats – in a production system I would have used the packaged procedure dbms_stats.set_table_prefs() to associate this with the table.

The difference between the yes_no and the yes_null_no columns is that the latter is null for a significant fraction of the rows.

  • yes_no has: 7,500 N, 2,500 Y
  • yes_null_no has: 6,000 N, 2,000 null, 2,000 Y

Let’s now try to count the “N or null” rows using two different functions and see what estimates the optimizer produces for the counts. First counting the yes_no column – using nvl() then coalesce():


set autotrace traceonly explain

select * from t1 where nvl(yes_no,'N') = 'N';
select * from t1 where coalesce(yes_no,'N') = 'N';

set autotrace off

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  7500 |   798K|    24   (5)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |  7500 |   798K|    24   (5)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NVL("YES_NO",'N')='N')

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 10900 |    25   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 10900 |    25   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COALESCE("YES_NO",'N')='N')

The estimate for the nvl() query is accurate; the estimate for the coalesce() query is just 100 rows.

Let’s repeat the test using the yes_null_no column, again starting with nvl() followed by coalesce():


set autotrace traceonly explain

select * from t1 where nvl(yes_null_no,'N') = 'N';
select * from t1 where coalesce(yes_null_no,'N') = 'N';

set autotrace off

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  8000 |   851K|    24   (5)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |  8000 |   851K|    24   (5)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NVL("YES_NULL_NO",'N')='N')

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 10900 |    25   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 10900 |    25   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COALESCE("YES_NULL_NO",'N')='N')

Again we get the right result for the nvl() estimate (8,000 = 6,000 N + 2,000 null) and 100 for the coalesce() estimate.

By now you’ve probably realised that the coalesce() estimate is simply the “1% guess for equality” that applies to most cases of function(column). So, as we saw in the previous post, coalesce() gives us the benefits of “short-circuiting” but now we see it also threatens us with damaged cardinality estimates. The latter is probably less important than the former in many cases (especially since we might be able to address the problem very efficiently using virtual columns), but it’s probably worth remembering.
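For reference, the virtual column workaround would look something like this – a sketch against the test table, where the column name is made up and you then gather stats on the new column:

alter table t1 add (
        yes_null_no_v generated always as (coalesce(yes_null_no,'N')) virtual
);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns size 5 yes_null_no_v'
        );
end;
/

After this the optimizer can match the expression coalesce(yes_null_no,’N’) in a predicate to the virtual column and use the column’s statistics rather than falling back to the 1% guess.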

 

 

V$MYSTAT delta values

Here is a little script I use from time to time to look at V$MYSTAT values, displaying on one line a set of statistics with their delta values between two calls.

The first script, _mystat_init.sql, initializes the variables. The second one displays the values, such as:

SQL> @ _mystat_diff.sql
 
db block changes        redo size undo change vector size     redo entries
---------------- ---------------- ----------------------- ----------------
          57,371       15,445,852               6,111,608            37,709

Those two scripts are generated by defining the statistics:

define names="'redo size','redo entries','undo change vector size','db block changes'"

and running the following to spool the two scripts:

sqlplus -s / as sysdba <<'END'
set pagesize 0 feedback off linesize 1000 trimspool on verify off echo off
with stats as (
select rownum n,stat_id,name from (select stat_id,name from v$statname where name in (&names) order by stat_id)
)
select 'define LAG'||stat_id||'=0' from stats
union all
select 'column "CUR'||stat_id||'" new_value '||'LAG'||stat_id||' noprint' from stats
union all
select 'column "DIF'||stat_id||'" heading '''||name||''' format 999G999G999G999' from stats
.
spool _mystat_init.sql
/
spool off
with stats as (
select rownum n,stat_id,name from (select stat_id,name from v$statname where name in (&names) order by stat_id)
)
select 'set termout off verify off' from dual
union all
select 'select ' from dual
union all
select ' '||decode(n,1,' ',',')||'"CUR'||stat_id||'" - '||'&'||'LAG'||stat_id||' "DIF'||stat_id||'"' from stats
union all
select ' '||',nvl("CUR'||stat_id||'",0) "CUR'||stat_id||'"' from stats
union all
--select ','''||'&'||'1'' comments' from dual
--union all
select q'[from (select stat_id,value from v$mystat join v$statname using(statistic#) where name in (&names)) pivot (avg(value)for stat_id in (]' from dual
union all
select ' '||decode(n,1,' ',',')||stat_id||' as "CUR'||stat_id||'"' from stats
union all
select '))' from dual
union all
select '.' from dual
union all
select 'set termout on' from dual
union all
select '/' from dual
.
spool _mystat_diff.sql
/
spool off
END

Then, in sqlplus or SQLcl, you run:

SQL> @ _mystat_init.sql

to initialize the values to 0 and:

SQL> @ _mystat_diff.sql

each time you want to display the difference from last call.
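For reference, the generated _mystat_init.sql contains one define and two column commands per statistic, along these lines (the stat_id value here is invented – the real ones depend on the version):

define LAG1236385760=0
column "CUR1236385760" new_value LAG1236385760 noprint
column "DIF1236385760" heading 'redo size' format 999G999G999G999

The new_value clause is what makes the delta possible: each run of _mystat_diff.sql saves the current value of the statistic into the LAG substitution variable, ready to be subtracted on the next run.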

 

The article V$MYSTAT delta values appeared first on Blog dbi services.

A look into Oracle redo, part 3: log writer work cycle overview

This is the third part of a series of blogposts on how the Oracle database handles redo. The previous part talked about the memory area that stores redo strand information: https://fritshoogland.wordpress.com/2018/02/05/a-look-into-oracle-redo-part-2-the-discovery-of-the-kcrfa-structure/.

The single most important process in the Oracle database for handling redo is the log writer, whose primary task is flushing the redo information that other Oracle database processes put in the public redo strands to disk. Now that we have investigated the public redo strands and concurrency (first part) and kcrfsg_ and the KCRFA structure (second part), it seems logical to me to look at the log writer.

Simply because we have seen the foreground process take the redo allocation latch, ‘fiddle around’ in the kcrfsg_ and KCRFA structures and release the latch, it is a pretty safe guess that the logwriter is using these structures to determine if it needs to write. It looks like the kcrfsg_ structure contains global redo information like redo log file information, and the KCRFA structure strand-specific information.

First we need to get an understanding of what the logwriter process is actually doing. The investigation below is done to see how the idle and write work cycles work; I explicitly excluded additional tasks like redo log file switches.

First of all, actually quite surprisingly, there is little information about what the log writer is actually doing. The only well-known features of the logwriter are that it writes the log buffer, instrumented by the ‘log file parallel write’ wait event, and that it sleeps on a semaphore in the wait event ‘rdbms ipc message’. In order to learn more, I’ve run a debugtrace on the logwriter process running idle, stripped the excess information and then captured only the calls one level from the stack depth the logwriter process is on in between the work cycles:

$ awk -n '$0 ~ /^\ \x3e\ /' lgwr_full_cycle_stripped.txt
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<
 > ksl_exit_main_loop_wait(0x9d64fb00c, 0x7, ...)
 > ksbcti(0x12b84c18, 0x7f2c1eb804b0, ...)          (nr 1)
 > dbktFlush(0, 0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)          (nr 2)
 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)          (nr 3)
 > sltrgatime64(0, 0, ...)
 > ksl_enter_main_loop_wait(0x1, 0x4ccef9975, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<
 > ksl_exit_main_loop_wait(0x9d67d8b01, 0x7, ...)
 > ksbcti(0x12b84c18, 0x7f2c1eb804b0, ...)          (nr 1)
 > dbktFlush(0, 0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)          (nr 2)
 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)          (nr 3)
 > sltrgatime64(0, 0, ...)
 > ksl_enter_main_loop_wait(0x1, 0x4cd05fb07, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<

ksarcv means kernel service asynchronous receive message, this is the function that sets up the semaphore and puts the logwriter process to sleep. This also explains the ksl_exit_main_loop_wait and ksl_enter_main_loop_wait functions, these are called exactly before (enter) and after (exit) the ksarcv function. The dbktFlush function is managing trace files, this function is used to write messages to the logwriter tracefile. The sltrgatime64 function is a function that calls clock_gettime. That leaves one function that can be seen in my example: ksbcti. ksbcti means kernel service background processes call timeout/interrupts. This is a function that performs several different functions based on the first argument.
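Putting those calls together, the idle work cycle of the logwriter can be summarised like this (pseudocode, reconstructed from the debugtrace above):

loop
        ksarcv()                    -- sleep on a semaphore: the 'rdbms ipc message' wait
        ksl_exit_main_loop_wait()   -- close the wait event
        ksbcti(0x12b84c18, ...)     -- housekeeping: process statistics (time, cpu) in the SGA
        dbktFlush()                 -- write any pending messages to the trace file
        ksbcti(0x12b84be0, ...)     -- calls kcrfw_redo_write_driver: the actual redo write
        ksbcti(0x12b84bfc, ...)     -- remaining timeout/interrupt handling
        ksl_enter_main_loop_wait()  -- open the 'rdbms ipc message' wait event again
end loop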

The first ksbcti call following ksl_exit_main_loop_wait (with first argument 0x12b84c18) seems to be the function that, among other things, administers the log writer process’s resource usage details in the SGA:

 | | > kews_timeout(0x7f2c1eb93ec0, 0, ...)
 | | | > sltrgftime64(0x7f2c1eb93ec0, 0, ...)
 | | | | > clock_gettime@plt(0x1, 0x7ffc3dd10ad0, ...)
 | | | | | > clock_gettime(0x1, 0x7ffc3dd10ad0, ...)
 | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | < clock_gettime+0x00000000003a returns: 0
 | | | < sltrgftime64+0x00000000004c returns: 0x9cb6163fa
 | | | > slcpums(0x7f2c23ba3c58, 0x191ae1d5, ...)
 | | | | > getrusage(0x1, 0x7ffc3dd10a30, ...)
 | | | | | > fthread_self(0x1, 0x7ffc3dd10a30, ...)
 | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | <> getrusage_ext(0x1, 0x7ffc3dd10a30, ...)
 | | | | < getrusage+0x00000000000f returns: 0
 | | | < slcpums+0x00000000008f returns: 0x31d592
 | | | > kews_sqlst_flush(0xc, 0x72fd7d40, ...)
 | | | | > sltrgftime64(0xc, 0x72fd7d40, ...)
 | | | | | > clock_gettime@plt(0x1, 0x7ffc3dd10a60, ...)
 | | | | | | > clock_gettime(0x1, 0x7ffc3dd10a60, ...)
 | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | < sltrgftime64+0x00000000004c returns: 0x9cb61723f
 | | | | > slcpums(0x7f2c23ba3c58, 0x7ae0ee70, ...)
 | | | | | > getrusage(0x1, 0x7ffc3dd109c0, ...)
 | | | | | | > fthread_self(0x1, 0x7ffc3dd109c0, ...)
 | | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | | <> getrusage_ext(0x1, 0x7ffc3dd109c0, ...)
 | | | | | < getrusage+0x00000000000f returns: 0
 | | | | < slcpums+0x00000000008f returns: 0x31e417
 | | | < kews_sqlst_flush+0x00000000016a returns: 0x7f2c23b9e900
 | | < kews_timeout+0x00000000052c returns: 0x7f2c23b9e900

kews_timeout means kernel event wait statistics, so these are statistics updated because the process timed out. sltrgftime64 is an OS-specific wrapper to read the time from the operating system; it calls clock_gettime to read the system clock (argument 1 is CLOCK_MONOTONIC, a clock that is not affected by discontinuous jumps of the system time; think of the wall clock being set manually or stepped by NTP, for example). Similarly, slcpums is an OS-specific wrapper to read CPU usage; it calls getrusage to obtain CPU accounting data from the operating system. kews_sqlst_flush seems to be related to the automatic tuning features; this *probably* (I haven't investigated further) flushes status data like time and CPU to a buffer so that features like SQL monitor and the adaptive features can use it.
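To make the distinction between the two time sources concrete, here is a small illustration (in Python rather than the C these kernel functions are written in) of the two OS facilities that sltrgftime64 and slcpums wrap:

```python
import time
import resource

# sltrgftime64 analogue: clock_gettime(CLOCK_MONOTONIC) returns a
# monotonically increasing timestamp, immune to wall-clock adjustments.
t1 = time.clock_gettime(time.CLOCK_MONOTONIC)
t2 = time.clock_gettime(time.CLOCK_MONOTONIC)

# slcpums analogue: getrusage(RUSAGE_SELF) returns CPU accounting data
# (user and system CPU seconds consumed by this process so far).
ru = resource.getrusage(resource.RUSAGE_SELF)

print(f"monotonic delta: {t2 - t1:.9f}s, "
      f"user cpu: {ru.ru_utime:.3f}s, sys cpu: {ru.ru_stime:.3f}s")
```

This only shows the underlying OS calls; what Oracle stores where in the SGA with these values is of course internal to the kews_* layer.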

The second ksbcti function call (with first argument 0x12b84be0) is the main function that performs the actual redo write, which is done by the kcrfw_redo_write_driver function:

 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)
 | > ksbckbast(0, 0, ...)
 | < ksbckbast+0x000000000061 returns: 0x1
 | > ksumcl(0, 0, ...)
 | < ksumcl+0x000000000045 returns: 0x1540
 | > kcrfw_redo_write_driver(0, 0, ...) 
...

The next blogpost will look into what the kcrfw_redo_write_driver function actually does. To spoil it a bit: the kcrfw_redo_write_driver function is called every 3 seconds, even if there's nothing to write!

The third ksbcti call (with first argument 0x12b84bfc) performs a task or series of tasks that are not clear to me, and the functions called inside ksbcti do not do enough to give me a hint:

 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)
 | > ksbmsg(0, 0, ...)
 | < ksbmsg+0x000000000040 returns: 0x60065d90
 | > ksbxiaf(0, 0, ...)
 | < ksbxiaf+0x000000000058 returns: 0x795b5828
 | > kjci_action(0, 0, ...)
 | < kjci_action+0x000000000035 returns: 0x60065d90
 | > kcfsmpoll(0, 0, ...)
 | < kcfsmpoll+0x000000000030 returns: 0x60065d90
 | > ksuwaitsysevent(0, 0, ...)
 | < ksuwaitsysevent+0x0000000001e7 returns: 0x795b4aa8
 | > krdrsb_lgwr_chintr(0, 0, ...)
 | | > krsh_trace(0x2000, 0, ...)
 | | < krsh_trace+0x00000000005d returns: 0
 | < krdrsb_lgwr_chintr+0x00000000008b returns: 0
 | > ksbckbast(0, 0, ...)
 | < ksbckbast+0x000000000061 returns: 0x1
 < ksbcti+0x00000000018f returns: 0x1

Of course, several of these point to features that are not used in my database: krdrsb points to Data Guard and kjci points to RAC.

If you look again at the sequence, you might wonder why the logwriter does not execute the redo write function immediately after waking up from the semaphore in ksarcv, so that it can write as soon as a process semctl's/semop's the semaphore to ask the logwriter to write. Well, actually there is another sequence of function calls that can happen:

 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > ksl_exit_main_loop_wait(0x6003ef40, 0x7af972b8, ...)
 > kcrfw_redo_write_driver(0x7ffc3dd122a0, 0x18, ...)         <<<<<
 > ksbcti(0x12b84c74, 0x7f2c1eb804b0, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > dbktFlush(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksl_enter_main_loop_wait(0x1, 0x7a6032e8, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)

Here you see ksarcv followed immediately by ksl_exit_main_loop_wait and then kcrfw_redo_write_driver, not inside ksbcti. What happened is that the semtimedop call inside ksarcv returned 0, indicating the semaphore was semctl'ed or semop'ed, instead of timing out after 3 seconds. If you paid close attention you saw that the first argument of the ksbcti call following kcrfw_redo_write_driver is different. I am not sure why, because the same functions are called as in the call with 0x12b84bfc as first argument (the ones hinting at Data Guard and RAC). The ksarcv call that follows does not put the process to sleep on a semaphore; it reads the asynchronous message structure in the fixed SGA (ksasga_) and then returns immediately. What follows are dbktFlush and sltrgatime64 for administrative purposes (writing to the tracefile if necessary, and updating resource usage information), after which the process prepares the semaphore sleep using ksl_enter_main_loop_wait and then goes to sleep on a semaphore in ksarcv.
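Putting the two sequences together, the main loop can be sketched as follows. This is purely my interpretation of the traces above, written in Python for readability: a threading.Event stands in for the SysV semaphore (wait() returning True maps to semtimedop() returning 0, False to a timeout), and the function bodies are placeholders, not Oracle code:

```python
import threading

class LogWriterSketch:
    """Sketch of the logwriter main loop: sleep on a semaphore with a
    3-second timeout; write immediately when posted, or fall through to
    the ksbcti timeout path otherwise."""

    def __init__(self, timeout=3.0):
        self.post_sem = threading.Event()  # stand-in for the SysV semaphore
        self.timeout = timeout
        self.writes = []                   # records which path triggered a write

    def kcrfw_redo_write_driver(self, reason):
        # placeholder for the function performing the actual redo write
        self.writes.append(reason)

    def one_cycle(self):
        # ksarcv: sleep on the semaphore; wait() returning True corresponds
        # to semtimedop() == 0 (posted), False to a timeout after 3 seconds.
        if self.post_sem.wait(self.timeout):
            self.post_sem.clear()
            self.kcrfw_redo_write_driver("posted")   # write immediately
        else:
            # the ksbcti timeout path; its second invocation calls the redo
            # write driver even when no process posted the semaphore
            self.kcrfw_redo_write_driver("timeout")

lgwr = LogWriterSketch(timeout=0.1)  # short timeout so the demo runs quickly
lgwr.one_cycle()                     # nobody posted: timeout path
lgwr.post_sem.set()                  # a foreground "posts" the semaphore
lgwr.one_cycle()                     # posted: immediate write
print(lgwr.writes)                   # → ['timeout', 'posted']
```

The point the sketch makes is that both paths end up in kcrfw_redo_write_driver; only the trigger (post versus 3-second timeout) differs.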

Please note this is the general cycle, there are other tasks that the log writer performs. One example of such a task is the log writer is looking up the process monitor (PMON) process to see if it is alive, which is done every minute.

Server process name in Postgres and Oracle

Every database analysis should start with system load analysis: if the host is starved of CPU, looking at other statistics can be pointless. With ‘top’ on Linux, or an equivalent such as Process Explorer on Windows, you see the processes (and threads). If the name of a process is meaningful, you already have a clue about the active sessions. Postgres goes further by showing the current operation (which SQL command), the state (running or waiting), and the identification of the client.

Postgres

By default ‘top’ displays the program name (like ‘comm’ in /proc or in ‘ps’ format), which will be ‘postgres’ for all PostgreSQL processes. But you can also display the command line with ‘c’ in interactive mode, or directly starting with ‘top -c’, which is the same as the /proc/$pid/cmdline or ‘cmd’ or ‘args’ in ‘ps’ format.
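To see where ‘top -c’ gets this from, you can read /proc/&lt;pid&gt;/cmdline yourself. A short Python illustration (Linux only; the file is NUL-separated, which is why a plain cat appears to glue the arguments together):

```python
import os

def cmdline(pid):
    """Read /proc/<pid>/cmdline; the arguments are NUL-separated."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        raw = f.read()
    return [arg.decode() for arg in raw.split(b"\0") if arg]

args = cmdline(os.getpid())
print(args)   # e.g. ['python3', 'demo.py'] -- what 'top -c' and 'ps' display
```

Postgres updates its title by overwriting this very area (setproctitle-style), which is why the SQL command shows up there.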


top -c
 
Tasks: 263 total, 13 running, 250 sleeping, 0 stopped, 0 zombie
%Cpu(s): 24.4 us, 5.0 sy, 0.0 ni, 68.5 id, 0.9 wa, 0.0 hi, 1.2 si, 0.0 st
KiB Mem : 4044424 total, 558000 free, 2731380 used, 755044 buff/cache
KiB Swap: 421884 total, 418904 free, 2980 used. 2107088 avail Mem
 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20347 postgres 20 0 394760 11660 8696 S 7.6 0.3 0:00.49 postgres: demo demo 192.168.56.125(37664) DELETE
20365 postgres 20 0 393816 11448 8736 S 6.9 0.3 0:00.37 postgres: demo demo 192.168.56.125(37669) idle
20346 postgres 20 0 393800 11440 8736 S 6.6 0.3 0:00.37 postgres: demo demo 192.168.56.125(37663) UPDATE
20356 postgres 20 0 396056 12480 8736 S 6.6 0.3 0:00.42 postgres: demo demo 192.168.56.125(37667) INSERT
20357 postgres 20 0 393768 11396 8736 S 6.6 0.3 0:00.40 postgres: demo demo 192.168.56.125(37668) DELETE waiting
20366 postgres 20 0 394728 11652 8736 S 6.6 0.3 0:00.35 postgres: demo demo 192.168.56.125(37670) UPDATE
20387 postgres 20 0 394088 11420 8720 S 6.6 0.3 0:00.41 postgres: demo demo 192.168.56.125(37676) UPDATE
20336 postgres 20 0 395032 12436 8736 S 6.3 0.3 0:00.37 postgres: demo demo 192.168.56.125(37661) UPDATE
20320 postgres 20 0 395032 12468 8736 R 5.9 0.3 0:00.33 postgres: demo demo 192.168.56.125(37658) DROP TABLE
20348 postgres 20 0 395016 12360 8736 R 5.9 0.3 0:00.33 postgres: demo demo 192.168.56.125(37665) VACUUM
20371 postgres 20 0 396008 12708 8736 R 5.9 0.3 0:00.40 postgres: demo demo 192.168.56.125(37673) INSERT
20321 postgres 20 0 396040 12516 8736 D 5.6 0.3 0:00.31 postgres: demo demo 192.168.56.125(37659) INSERT
20333 postgres 20 0 395016 11920 8700 R 5.6 0.3 0:00.36 postgres: demo demo 192.168.56.125(37660) UPDATE
20368 postgres 20 0 393768 11396 8736 R 5.6 0.3 0:00.43 postgres: demo demo 192.168.56.125(37671) UPDATE
20372 postgres 20 0 393768 11396 8736 R 5.6 0.3 0:00.36 postgres: demo demo 192.168.56.125(37674) INSERT
20340 postgres 20 0 394728 11700 8736 S 5.3 0.3 0:00.40 postgres: demo demo 192.168.56.125(37662) idle
20355 postgres 20 0 394120 11628 8672 S 5.3 0.3 0:00.32 postgres: demo demo 192.168.56.125(37666) DELETE waiting
20389 postgres 20 0 395016 12196 8724 R 5.3 0.3 0:00.37 postgres: demo demo 192.168.56.125(37677) UPDATE
20370 postgres 20 0 393768 11392 8736 S 4.6 0.3 0:00.34 postgres: demo demo 192.168.56.125(37672) DELETE
20376 postgres 20 0 393816 11436 8736 S 4.6 0.3 0:00.37 postgres: demo demo 192.168.56.125(37675) DELETE waiting
20243 postgres 20 0 392364 5124 3696 S 1.0 0.1 0:00.06 postgres: wal writer process

This is very useful information. Postgres changes the process title when it executes a statement. In this example:

  • ‘postgres:’ is the name of the process
  • ‘demo demo’ are the database name and the user name
  • ‘192.168.56.125(37664)’ are the IP address and port of the client.
  • DELETE, UPDATE… are the commands being executed. They correspond more or less to the command tag returned as feedback after command completion
  • ‘idle’ is for sessions not currently running a statement
  • ‘waiting’ is added when the session is waiting on a blocker session (enqueued on a lock for example)
  • ‘wal writer process’ is a background process

This is very useful information, especially because we have, on the same sampling, the Postgres session state (idle, waiting or running an operation) with the Linux process state (S when sleeping, R when runnable or running, D when in I/O,… ).
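If you want to post-process such a listing, the default title format is easy to split. Here is a hypothetical Python helper, assuming the default format shown above (no cluster_name prefix) and the fields described in the list:

```python
import re

# Pattern for the default Postgres process title, as shown in the top output:
# "postgres: <db> <user> <client-ip>(<port>) <command>[ waiting]"
TITLE_RE = re.compile(
    r"postgres: (?P<db>\S+) (?P<user>\S+) "
    r"(?P<host>[\d.]+)\((?P<port>\d+)\) (?P<state>.+)"
)

def parse_title(title):
    """Split a client-backend process title into its fields.
    Background processes ('postgres: wal writer process') return None."""
    m = TITLE_RE.match(title)
    return m.groupdict() if m else None

info = parse_title("postgres: demo demo 192.168.56.125(37664) DELETE waiting")
print(info)
# → {'db': 'demo', 'user': 'demo', 'host': '192.168.56.125',
#    'port': '37664', 'state': 'DELETE waiting'}
```

This is only a sketch for the exact format displayed above; for anything serious, pg_stat_activity gives the same information in queryable form.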

Oracle

With Oracle, you have ASH to sample the session states, but being able to see them at the OS level would be great. It would also be a safeguard if you ever need to kill a process.

But the Oracle process names do not change while the session is running. They are set at connection time.

The background processes show the Oracle process name and the instance name:

[oracle@VM122 ~]$ ps -u oracle -o pid,comm,cmd,args | head
 
PID COMMAND CMD COMMAND
1873 ora_pmon_cdb2 ora_pmon_CDB2 ora_pmon_CDB2
1875 ora_clmn_cdb2 ora_clmn_CDB2 ora_clmn_CDB2
1877 ora_psp0_cdb2 ora_psp0_CDB2 ora_psp0_CDB2
1880 ora_vktm_cdb2 ora_vktm_CDB2 ora_vktm_CDB2
1884 ora_gen0_cdb2 ora_gen0_CDB2 ora_gen0_CDB2

The foreground processes show the instance name and the connection type: LOCAL=YES for bequeath connections, LOCAL=NO for remote connections through the listener.


[oracle@VM122 ~]$ ps -u oracle -o pid,comm,cmd,args | grep -E "[ ]oracle_|[ ]PID"
 
PID COMMAND CMD COMMAND
21429 oracle_21429_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21431 oracle_21431_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21451 oracle_21451_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21517 oracle_21517_cd oracleCDB1 (LOCAL=NO) oracleCDB1 (LOCAL=NO)

You need to join V$PROCESS with V$SESSION on (V$PROCESS.ADDR=V$SESSION.PADDR) to find the state, the current operation, and the client information.

For fun, you can change the program name (ARGV0) and the arguments (ARGS).

For local connections, you can change the name in the BEQueath connection string:


sqlplus -s system/oracle@"(ADDRESS=(PROTOCOL=BEQ)(PROGRAM=$ORACLE_HOME/bin/oracle)(ARGV0=postgres)(ARGS='(DESCRIPTION=(LOCAL=MAYBE)(ADDRESS=(PROTOCOL=BEQ)))')(ENVS='ORACLE_HOME=$ORACLE_HOME,ORACLE_SID=CDB1'))" <<< "host ps -u oracle -o pid,comm,cmd,args | grep -E '[ ]oracle_|[ ]PID'"
 
PID COMMAND CMD COMMAND
21155 oracle_21155_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21176 oracle_21176_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21429 oracle_21429_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21431 oracle_21431_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21451 oracle_21451_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21517 oracle_21517_cd oracleCDB1 (LOCAL=NO) oracleCDB1 (LOCAL=NO)
22593 oracle_22593_cd postgres (DESCRIPTION=(LOCA postgres (DESCRIPTION=(LOCAL=MAYBE)(ADDRESS=(PROTOCOL=BEQ)))

For remote connections, the name can be changed through static registration, by adding an ARGV0 value on the listener side:


LISTENER=(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=0.0.0.0)(PORT=1521)))
SID_LIST_LISTENER=(SID_LIST=
(SID_DESC=(GLOBAL_DBNAME=MYAPP)(ARGV0=myapp)(SID_NAME=CDB1)(ORACLE_HOME=/u01/app/oracle/product/12.2.0/dbhome_1))
(SID_DESC=(GLOBAL_DBNAME=CDB1_DGMGRL)(SID_NAME=CDB1)(ORACLE_HOME=/u01/app/oracle/product/12.2.0/dbhome_1))
(SID_DESC=(GLOBAL_DBNAME=CDB2_DGMGRL)(SID_NAME=CDB2)(ORACLE_HOME=/u01/app/oracle/product/12.2.0/dbhome_1))
)

After reloading the listener with this (ARGV0=myapp), connections through the MYAPP service are identified by the new name:

[oracle@VM122 ~]$ sqlplus -s system/oracle@//localhost/MYAPP <<< "host ps -u oracle -o pid,comm,cmd,args | grep -E '[ ]oracle_|[ ]PID'"
PID COMMAND CMD COMMAND
21155 oracle_21155_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21176 oracle_21176_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21429 oracle_21429_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21431 oracle_21431_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21451 oracle_21451_cd oracleCDB2 (LOCAL=NO) oracleCDB2 (LOCAL=NO)
21517 oracle_21517_cd oracleCDB1 (LOCAL=NO) oracleCDB1 (LOCAL=NO)
24261 oracle_24261_cd myapp (LOCAL=NO) myapp (LOCAL=NO)

However, I would not recommend changing the defaults. It can be very confusing for people expecting ora_xxxx_SID and oracleSID process names.

 

The article Server process name in Postgres and Oracle appeared first on Blog dbi services.

OSWatcher, Tracefile Analyzer, and Oracle Restart 12.2

You are about to read the second part of this mini-series on TFA and Oracle 12.2. In the previous article I wrote about TFA and Oracle 12.2 single instance. In this short article I am going to look at TFA in a 12.2 Oracle Restart environment, before rounding the series off with an investigation into a full-blown RAC installation in part 3.

Summarising the first part I can only say that I am very happy that we now get TFA as part of the standard installation. Running it in daemon mode provides some great insights, and even if you did not upgrade the installation to “MOS-TFA”, you have a very fine tool for Oracle troubleshooting at your disposal.

Summary of the environment

My environment is largely the same as last time, except the machine name changed to server4 and I have additional storage for use with ASM.

  • It’s still Oracle Linux 7.4 with UEK4
  • Oracle 12.2.0.1.0 Grid Infrastructure was installed first
  • Followed by an installation of the Oracle 12.2.0.1.0 RDBMS EE software
  • After having created a database in ASM I applied the January 2018 GI RU
  • Finally I upgraded TFA to the current version (as downloaded from My Oracle Support DOC ID 1513912.1)

Not all of these steps are relevant for this article though.

Now where would you find TFA?

The question I had when creating this environment was essentially this: where would I find TFA? Would it be part of the Grid Home, or rather the RDBMS installation?

After having installed the binaries for Grid Infrastructure, I didn’t find a reference to roottfa.sh in the Grid Home’s root.sh script.

[oracle@server4 ~]$ cat /u01/app/oracle/product/12.2.0.1/grid/root.sh
#!/bin/sh
unset WAS_ROOTMACRO_CALL_MADE
. /u01/app/oracle/product/12.2.0.1/grid/install/utl/rootmacro.sh "$@"
. /u01/app/oracle/product/12.2.0.1/grid/install/utl/rootinstall.sh

#
# Root Actions related to network
#
/u01/app/oracle/product/12.2.0.1/grid/network/install/sqlnet/setowner.sh 

#
# Invoke standalone rootadd_rdbms.sh
#
/u01/app/oracle/product/12.2.0.1/grid/rdbms/install/rootadd_rdbms.sh

/u01/app/oracle/product/12.2.0.1/grid/rdbms/install/rootadd_filemap.sh 
/u01/app/oracle/product/12.2.0.1/grid/crs/config/rootconfig.sh $@ 
EXITCODE=$? 
if [ $EXITCODE -ne 0 ]; then
	exit $EXITCODE
fi

[oracle@server4 ~]$ 

I checked whether TFA had been configured somewhere else, but there was no TFA-related process running, nor a systemd unit file with *tfa* in its name. So it looks like you don’t get the option to install TFA automatically as part of the Grid Infrastructure installation.

Not finding TFA configured to run out of Grid Infrastructure surprised me, especially since Oracle states in chapter 4.2.2 of the Autonomous Health Framework documentation that TFA is automatically configured as part of the GI configuration, upon invoking root.sh or rootupgrade.sh. According to my testing, this is true for RAC, but not for Oracle Restart.

Which kind-of makes sense if you consider that most users will install the database software anyway in an Oracle Restart setup.

It doesn’t really matter, read on ;)

Installing the RDBMS software

If TFA isn’t configured automatically with Grid Infrastructure in an Oracle Restart configuration it should come as no surprise that TFA is once more installed and configured from the RDBMS home. Just as with the single instance installation I wrote about previously, you find a reference to roottfa.sh in the RDBMS home’s root.sh:

[oracle@server4 tfa]$ cat /u01/app/oracle/product/12.2.0.1/dbhome_1/root.sh 
#!/bin/sh
unset WAS_ROOTMACRO_CALL_MADE
. /u01/app/oracle/product/12.2.0.1/dbhome_1/install/utl/rootmacro.sh "$@"
. /u01/app/oracle/product/12.2.0.1/dbhome_1/install/utl/rootinstall.sh
/u01/app/oracle/product/12.2.0.1/dbhome_1/suptools/tfa/release/tfa_home/install/roottfa.sh
/u01/app/oracle/product/12.2.0.1/dbhome_1/install/root_schagent.sh

#
# Root Actions related to network
#
/u01/app/oracle/product/12.2.0.1/dbhome_1/network/install/sqlnet/setowner.sh 

#
# Invoke standalone rootadd_rdbms.sh
#
/u01/app/oracle/product/12.2.0.1/dbhome_1/rdbms/install/rootadd_rdbms.sh

/u01/app/oracle/product/12.2.0.1/dbhome_1/rdbms/install/rootadd_filemap.sh 
[oracle@server4 tfa]$ 

The procedure appears to be almost exactly the same as with single instance Oracle. I have again opted to use TFA in daemon mode, just as I did before in my post about Oracle single instance.

TFA thankfully discovers and adds both the RDBMS and the Grid Infrastructure directories (and quite a few Linux-related directories, too!). You can see for yourself by running tfactl print directories.

As a final step you might want to consider upgrading TFA to the MOS version. That’s it. Happy troubleshooting!