Little things worth knowing: redo transport in Data Guard 12.2 part 1

I have researched Data Guard redo transport quite a bit, but so far I haven't really blogged about some of the details I have worked out. That is what I am trying to do in this post.

In order to have a somewhat meaningful setup, I created a Data Guard configuration with 1 primary and 2 standby databases. They all reside on my lab (virtual) machines:

  • CDB1 currently operates in primary mode on server1
  • CDB2 and CDB3 are physical standby databases on server2
  • The VMs use Oracle Linux 7.4 with UEK4

The databases have been patched to 12.2.0.1.170814. I should probably have created a dedicated VM for the second standby database, but I hope what I found out with just two systems applies equally. If you know or find out otherwise, please let me know in the comments section of this article. This post is based on information from the Data Guard Concepts and Administration Guide version 12c as of 171212 as well as my own investigation shown here. There’s a good Oracle white paper out there on the impact of synchronous log shipping and some “best practices” which you might want to read as well.

Data Guard configuration

All my Data Guard instances are always managed via the Data Guard Broker; it’s just too convenient for me not to. Here’s the configuration detail for later reference:

DGMGRL> show configuration

Configuration - allthree

  Protection Mode: MaxAvailability
  Members:
  CDB1 - Primary database
    CDB2 - Physical standby database 
    CDB3 - Physical standby database 

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 22 seconds ago)

My protection mode is defined as MaxAvailability, which can be a good compromise between availability/data protection and performance. Your mileage may vary; picking the right protection mode for your environment is out of scope of this little article.
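
In case you are wondering how the protection mode is set in a Broker-managed configuration, it is a single DGMGRL command. The following is merely a sketch for reference (MaxAvailability requires at least one standby destination shipping redo synchronously); it is not something I had to run for the examples in this post:

DGMGRL> edit configuration set protection mode as MaxAvailability;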

I am using sync redo transport to CDB2, but asynchronous redo transport to CDB3:

DGMGRL> show database 'CDB2' logxptmode
  LogXptMode = 'SYNC'
DGMGRL> show database 'CDB3' logxptmode
  LogXptMode = 'ASYNC'
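
Should you want to reproduce this, the transport mode is a per-database Broker property. A quick sketch with the values matching my configuration:

DGMGRL> edit database 'CDB2' set property LogXptMode='SYNC';
DGMGRL> edit database 'CDB3' set property LogXptMode='ASYNC';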

In addition to the Broker, the primary database shows the same information:

SQL> select dest_id, destination, status, error, transmit_mode
  2  from v$archive_dest where destination is not null;

   DEST_ID DESTINATION                    STATUS    ERROR                TRANSMIT_MOD
---------- ------------------------------ --------- -------------------- ------------
         1 USE_DB_RECOVERY_FILE_DEST      VALID                          SYNCHRONOUS
         2 CDB2                           VALID                          PARALLELSYNC
         3 CDB3                           VALID                          ASYNCHRONOUS

DEST_ID 1 is of course the local archiving destination and doesn’t have anything to do with redo transport:

SQL> select dest_name, destination, type, valid_type, valid_role
  2  from v$archive_dest where dest_id = 1
  3  /

DEST_NAME            DESTINATION                    TYPE    VALID_TYPE      VALID_ROLE
-------------------- ------------------------------ ------- --------------- ------------
LOG_ARCHIVE_DEST_1   USE_DB_RECOVERY_FILE_DEST      PUBLIC  ALL_LOGFILES    ALL_ROLES
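
For completeness, a local archiving destination like this one is typically defined along the following lines. Treat this as a sketch only; the attributes you need may differ, and in my case the destination was already in place:

SQL> alter system set log_archive_dest_1='LOCATION=USE_DB_RECOVERY_FILE_DEST VALID_FOR=(ALL_LOGFILES,ALL_ROLES)' scope=both;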

As @fritshoogland would point out, the log writer trace is a great source of information, and I’ll show parts of it throughout the post. This excerpt shows LGWR stating that it is not responsible for destination 1:

Trace file /u01/app/oracle/diag/rdbms/cdb1/CDB1/trace/CDB1_lgwr_14986.trc
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Build label:    RDBMS_12.2.0.1.0_LINUX.X64_170125
ORACLE_HOME:    /u01/app/oracle/product/12.2.0.1/dbhome_1
System name:    Linux
Node name:      server1
Release:        4.1.12-94.3.9.el7uek.x86_64
Version:        #2 SMP Fri Jul 14 20:09:40 PDT 2017
Machine:        x86_64
Instance name: CDB1
Redo thread mounted by this instance: 0 
Oracle process number: 20
Unix process pid: 14986, image: oracle@server1 (LGWR)


*** 2017-12-12T09:58:48.421846+00:00 (CDB$ROOT(1))
*** SESSION ID:(10.52290) 2017-12-12T09:58:48.421874+00:00
*** CLIENT ID:() 2017-12-12T09:58:48.421878+00:00
*** SERVICE NAME:() 2017-12-12T09:58:48.421882+00:00
*** MODULE NAME:() 2017-12-12T09:58:48.421885+00:00
*** ACTION NAME:() 2017-12-12T09:58:48.421888+00:00
*** CLIENT DRIVER:() 2017-12-12T09:58:48.421892+00:00
*** CONTAINER ID:(1) 2017-12-12T09:58:48.421895+00:00
 
Created 2 redo writer workers (2 groups of 1 each)

*** 2017-12-12T09:59:06.794537+00:00 (CDB$ROOT(1))
Destination LOG_ARCHIVE_DEST_3 is using asynchronous network I/O
Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
...

The information I wanted to show you is found towards the end of this little extract: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR.
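
If you want to locate the log writer trace on your own system, the following little query should point you to it. This is just a sketch; you can of course also navigate to the diagnostic destination directly:

SQL> select tracefile from v$process where pname = 'LGWR';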

Investigation into Data Guard redo transport

Like so many other areas in the database, Data Guard is well instrumented. The main views I’ll use for this post are v$managed_standby and v$dataguard_process. Let’s begin by looking at the processes on the primary to work out which process is transmitting redo.

SQL> select process,pid,status,client_process,client_pid,sequence#,block#
  2  from v$managed_standby
  3  order by status;

PROCESS   PID        STATUS       CLIENT_P CLIENT_PID  SEQUENCE#     BLOCK#
--------- ---------- ------------ -------- ---------- ---------- ----------
DGRD      15275      ALLOCATED    N/A      N/A                 0          0
DGRD      15128      ALLOCATED    N/A      N/A                 0          0
DGRD      15362      ALLOCATED    N/A      N/A                 0          0
DGRD      15124      ALLOCATED    N/A      N/A                 0          0
ARCH      15240      CLOSING      ARCH     15240              84          1
ARCH      15224      CLOSING      ARCH     15224              83          1
ARCH      15126      CLOSING      ARCH     15126              85     342016
ARCH      15233      CLOSING      ARCH     15233              82          1
LGWR      14986      WRITING      LGWR     14986              86      51577
LNS       15252      WRITING      LNS      15252              86      51577

10 rows selected.

OK, so we have a few archiver processes present, a group of “DGRD” processes, LNS, and LGWR. This is the current state of my primary database.

SQL> archive log list
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence     84
Next log sequence to archive   86
Current log sequence           86

I am running a low-impact Swingbench workload to generate some redo, but not too much (Average Apply Rate: 56.00 KByte/s according to the Broker).
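
If you want to verify the redo volume on the primary yourself, the system metrics are one quick way of doing so. A sketch only (the Broker calculates its apply rate differently, so don’t expect the numbers to match exactly):

SQL> select begin_time, end_time, round(value/1024,1) as kbyte_per_sec
  2  from v$sysmetric
  3  where metric_name = 'Redo Generated Per Sec'
  4  and group_id = 2;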

These DGRD processes are visible as TTnn processes at the O/S level. I used the listagg function to group their PIDs into an egrep-able format:

SQL> select process , listagg(pid, '|') within group (order by pid) as pids
  2  from v$managed_standby group by process;

PROCESS   PIDS
--------- ------------------------------
ARCH      15126|15224|15233|15240
DGRD      15124|15128|15275|15362
LGWR      14986
LNS       15252

The first group to map to Linux processes is DGRD (I have an idea what the archiver processes will look like on the O/S, so I won’t go into that detail):

[oracle@server1 ~]$ ps -ef | egrep '15124|15128|15275|15362'
oracle   15124     1  0 09:59 ?        00:00:00 ora_tt00_CDB1
oracle   15128     1  0 09:59 ?        00:00:00 ora_tt01_CDB1
oracle   15275     1  0 09:59 ?        00:00:00 ora_tt03_CDB1
oracle   15362     1  0 09:59 ?        00:00:00 ora_tt04_CDB1

So the DGRD processes map to tt00 through tt04, with the exception of tt02. The Oracle 12.2 reference refers to the TTnn processes as redo transport slaves. The LNS process from the query above is our missing tt02:

[oracle@server1 ~]$ ps -ef | egrep 15252
oracle   15252     1  0 09:59 ?        00:00:04 ora_tt02_CDB1
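
By the way, you don’t have to resort to ps for this mapping; v$process provides the same information from within the database. A small sketch:

SQL> select pname, spid from v$process where pname like 'TT%' order by pname;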

Interestingly enough, the current (Oracle 12.2) documentation has nothing to say about the LNS process. Or it has hidden it well from me. I found a reference in the 10.2 Data Guard Concepts and Administration Guide, chapter 5 (Redo Transport), but nothing concrete.

The final process to show is LGWR:

[oracle@server1 ~]$ ps -ef | grep 14986
oracle   14986     1  0 09:58 ?        00:00:12 ora_lgwr_CDB1

The next step is to work out how the processes on the primary and the standby databases map to one another. v$dataguard_status is an interesting view: it shows relevant information as seen in the alert.log. I found v$dataguard_process to be even more valuable. On the primary, I am getting the following results when querying it:

SQL> select name, pid, role, action, client_pid, client_role, sequence#, block#, dest_id
  2  from v$dataguard_process order by action;
  
NAME  PID        ROLE                    ACTION       CLIENT_PID CLIENT_ROLE       SEQUENCE#     BLOCK#    DEST_ID
----- ---------- ----------------------- ------------ ---------- ---------------- ---------- ---------- ----------
ARC0  15126      archive local           IDLE                  0 none                      0          0          0
ARC3  15240      archive redo            IDLE                  0 none                      0          0          0
TMON  15020      redo transport monitor  IDLE                  0 none                      0          0          0
NSS2  15029      sync                    IDLE                  0 none                      0          0          0
TT00  15124      gap manager             IDLE                  0 none                     86          0          2
TT01  15128      redo transport timer    IDLE                  0 none                      0          0          0
TT04  15362      controlfile update      IDLE                  0 none                      0          0          0
TT03  15275      heartbeat redo informer IDLE                  0 none                      0          0          0
ARC1  15224      archive redo            IDLE                  0 none                      0          0          0
ARC2  15233      archive redo            IDLE                  0 none                      0          0          0
TT02  15252      async ORL multi         WRITING               0 none                     86      92262          0
LGWR  14986      log writer              WRITING               0 none                     86      92271          0

12 rows selected.

So from that output it appears that LGWR and TT02 are the only processes actually writing. It would be interesting to know where they are writing to :) It’s unclear to me why the process with PID 15252 appears as TT02 instead of LNS, as it does in v$managed_standby.

If you run the same query again you notice that the block# is increasing: as far as I remember, that’s a sign of work! This coincides with information from the alert.log (a trimmed-down query to watch just the writers follows the excerpt):

Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED
LGWR: Standby redo logfile selected to archive thread 1 sequence 94
LGWR: Standby redo logfile selected for thread 1 sequence 94 for destination LOG_ARCHIVE_DEST_2
2017-12-12T13:23:07.153256+00:00
Thread 1 advanced to log sequence 94 (LGWR switch)
  Current log# 1 seq# 94 mem# 0: /u01/oradata/CDB1/onlinelog/o1_mf_1_dvh035kx_.log
  Current log# 1 seq# 94 mem# 1: /u01/fast_recovery_area/CDB1/onlinelog/o1_mf_1_dvh0363f_.log
2017-12-12T13:23:07.310438+00:00
Archived Log entry 194 added for T-1.S-93 ID 0x3784c932 LAD:1
2017-12-12T13:23:07.531896+00:00
TT02: Standby redo logfile selected for thread 1 sequence 94 for destination LOG_ARCHIVE_DEST_3
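
To watch the block# advance without wading through the full output, you can restrict the query to the processes currently writing. A trivial variation of the earlier query:

SQL> select name, pid, role, action, sequence#, block#
  2  from v$dataguard_process
  3  where action = 'WRITING'
  4  order by name;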

There’s also a new entry that wasn’t present in v$managed_standby, named NSS2. This process has not yet been discussed:

[oracle@server1 ~]$ ps -ef  | grep 5348
oracle    5348     1  0 11:16 ?        00:00:01 ora_nss2_CDB1

Its job is described by the database reference as transferring redo for synchronous targets. Nevertheless, I never saw it in a state other than “IDLE”. The aforementioned white paper, however, references it in the section “Understanding how synchronous transport insures data integrity”.

In this particular section, the authors state that LGWR writes to the online redo logs and that an identical network redo write is performed by NSSn. Looking at the netstat output, LGWR (and its two worker processes) does not have a connection to “server2” open, but NSS2 does:

[root@server1 ~]# ps -ef | egrep 'lgwr|lg0|nss'
oracle   14986     1  0 09:58 ?        00:01:19 ora_lgwr_CDB1
oracle   14990     1  0 09:58 ?        00:00:00 ora_lg00_CDB1
oracle   14994     1  0 09:58 ?        00:00:00 ora_lg01_CDB1
oracle   15029     1  0 09:58 ?        00:00:43 ora_nss2_CDB1

[root@server1 ~]# netstat -vW --numeric-ports -ee -p -la --tcp | egrep '14986|14990|14994|15029'
tcp        0      0 server1.example.com:15515 server2.example.com:1521 ESTABLISHED oracle     16400768   15029/ora_nss2_CDB1 
[root@server1 ~]# 
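
For a second opinion you can cross-check the connection with lsof. Something along these lines (a sketch; the PID is the NSS2 process from above):

[root@server1 ~]# lsof -nP -p 15029 -a -i TCP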

I cross-referenced this with the output from lsof and it seems to be correct. What I haven’t been able to work out, though, is the relationship between a user session, LGWR, and the NSSn process. One clue is in the current log writer trace. Continuing from the previous trace:

*** 2017-12-12 09:59:12.450533 
Netserver NSS2 (PID:15029) for mode SYNC has been re-initialized
Performing a channel reset to ignore previous responses
Connecting as publisher to KSR Channel [id = 20]
Successfully reused NSS2 (PID:15029) for dest CDB2 mode SYNC ocis = 0x7fea5c6f6c40

There appears to be some sort of communication, although it’s not clear to me what kind ;)

Summary

In this first part of the two-part series I hope to have given you a few insights into Data Guard redo transport for synchronous and asynchronous redo shipping. In the next part the information in this article will be joined up with diagnostic information from server2 and both standby databases.