Search

OakieTags

Who's online

There are currently 0 users and 25 guests online.

Recent comments

Affiliations

Oakies Blog Aggregator

Database Thin Cloning: Summary

This post is part of an ongoing series:

Summary

These blog posts on thin cloning have covered a number of ways to create database thin clones, yet there are still some challenges:

Database-managed Cloning with Oracle Clonedb

  • Performance issues
  • Requires dNFS
  • Oracle 11.2.0.2 and higher only
  • No possibility of branching clones
  • No practical way to share datafiles from clones of the source DB at different points in time

Copy on Write with EMC COFW

  • Can cause write overhead on source filesystems
  • No branching of clones
  • Difficulty in maintaining efficient changes on secondary arrays
  • Limit of 16 snapshots per LUN

Redirect on Write with EMC VNX ROW

  • Limit of 256 snapshots per LUN
  • Difficulty maintaining efficient storage of changes from source database on secondary storage array

Write Anywhere File System with NetApp

  • Limit of 255 snapshots per file or LUN
  • Support provided for maintaining source database changes efficiently on a second array; however, they require a number of products and implementation overhead
  • Size limited to 16-100TB depending on the model of array

Allocate on Write with ZFS

  • Unlimited and instantaneous snapshots with no space overhead
  • Possibility of efficiently sending changes from a source array to a secondary array but no documentation exists for these methods

Each of these technologies faces different challenges. One obvious constraint is that any solution that depends on a vendor such as EMC, NetApp, or Oracle will tie the solution only to that storage solution. Some concepts such as Clonedb and Illumos ZFS can be run on any storage but have other constraints.

One of the consistent challenges across all of these technologies is the expert knowledge and extensive manual configuration that can impede implementation of these technologies to provide database thin clones. If database thin clones reduce storage so dramatically as well as reduce cloning times drastically, the technology normally would have taken off across multiple industries. This technology involving filesystem snapshots has existed since the mid 1990s—almost a decade and a half ago. In 1994, StorageTek introduced the virtual disk in their Iceberg release. In 1995 Iceberg started supporting filesystem snapshots. However after 15 years, database thin cloning is still rarely used.

The answer to this riddle lies in an analogy. The analogy is the Internet: it was around years before the browser was ever used. Before browsers one could do many things that they can do today on the Internet: use email, transfer files (via FTP), go to chat rooms, and use bulletin boards. But until the browser was created most activity on the Internet was from academics. It wasn’t until the introduction of the browser that usage of the Internet exploded. In a similar way it wasn’t until the introduction of Database Virtualization that usage of database thin cloning began to explode.

Stay tuned for new series on Data Virtualization for Databases, i.e. Database Virtualization.
Upcoming blog posts will discuss
  • OEM 12c Database as a Service (DBaaS) for Netapp
  • Snapshot Management Utility (SMU) for ZFS storage Appliance
  • Delphix any storage even JBODs

SQL Gone Bad – But Plan Not Changed?

Last week an interesting issue popped up on a mission critical production app (MCPA). A statement that was run as part of a nightly batch process ran long. In fact, the statement never finished and the job had to be killed and restarted. This particular system is prone to plan stability issues due to various factors outside the scope of this post, so the first thing that the guys checked was if there had been a plan change. Surprisingly the plan_hash_value was the same as it had been for the past several months. The statement was very simple and a quick look at the xplan output showed that the plan was indeed the same with one exception. The predicate section was slightly different.

As a quick diversion, you probably already know that the plan_hash_value is calculated based on partial information about the plan. Arguably it’s the most important parts, but there are some important parts of the plan that are not included (namely the stuff that shows up in the predicate section of the plan). Randolf Geist explained which parts of the plan are used in calculating the plan_hash_value well in a post on How PLAN_HASH_VALUES Are Calculated several years ago. His summary was this:

So in summary the following conclusions can be made:

- The same PLAN_HASH_VALUE is merely an indicator that the same operations on objects of the same name are performed in the same order.

- It tells nothing about the similarity of the expected runtime performance of the execution plan, due to various reasons as demonstrated. The most significant information that is not covered by the PLAN_HASH_VALUE are the filter and access predicates, but there are other attributes, too, that are not part of the hash value calculation.

The production issue got me thinking about several things:

    Could I come up with a simple test case to show a severe performance degradation between executions of a statement with the same plan_hash_value because of a change in the predicates? (the answer is it’s pretty easy actually)

    What forensics could be used to determine when this situation has occurred?

    How do you fix the problem?

So the first thing I did was create a test case that consists of a query with two predicates, one that is fairly CPU intensive and the other that does a simple equality comparison (which uses very little CPU). Both of the filters are very selective (and the optimizer actually estimate that the same number of rows will be returned on my test system) but the CPU cost of executing them is very different. So the order in which the filters are applied should make a big difference. I also put the low CPU filter last so that I could use a rarely used hint (ordered_predicates) to rearrange the way the optimizer wants to do it (i.e. apply the expensive one first). So here’s the set up.

 
SYS@DEMO1> -- fast filter 
SYS@DEMO1> select count(*) from kso.skew2 where col4='F';
 
  COUNT(*)
----------
         1
 
Elapsed: 00:00:00.82 <======== about 1 second with no smart scan tricks
 
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST')); 
 
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  gzvkkhaykkphp, child number 0
-------------------------------------
select count(*) from kso.skew2 where col4='F'
 
Plan hash value: 4220890033
 
-----------------------------------------------------
| Id  | Operation                  | Name  | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT           |       |        |
|   1 |  SORT AGGREGATE            |       |      1 |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW2 |   3610 | <=== optimizer estimates 3610
-----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - storage("COL4"='F')
       filter("COL4"='F')
 
Note
-----
   - dynamic sampling used for this statement (level=2)
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level
 
 
27 rows selected.
 
Elapsed: 00:00:00.01
SYS@DEMO1> -- slow filter
SYS@DEMO1> select count(*) from kso.skew2 where TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a';
 
  COUNT(*)
----------
         0
Elapsed: 00:05:49.85 <==== almost 6 minutes with no smart scan
 
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST')); 
 
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  gatc45y5cza17, child number 0
-------------------------------------
select count(*) from kso.skew2 where
TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a'
 
Plan hash value: 4220890033
 
-----------------------------------------------------
| Id  | Operation                  | Name  | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT           |       |        |
|   1 |  SORT AGGREGATE            |       |      1 |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW2 |   3610 | <=== optimizer still expects 3610 rows
-----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - storage(TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a')
       filter(TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a')
 
Note
-----
   - dynamic sampling used for this statement (level=2)
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level
 
 
28 rows selected.
Elapsed: 00:00:00.02

The first section shows how long it takes to read through the table and apply each of the 2 filters. As you can see, the COL4=’F’ filter takes much less time (note that I don’t want to get bogged down here, but I hope you’ll trust me that no Exadata magic kicked in and they did the same amount of work in terms of lio/pio, etc…). Also note that the STORAGE predicate is present merely because we are on an Exadata (as was the original problem query) and it only signifies that a predicate could potentially be offloaded, not that it actually has been offloaded. Please take my word for it that no smarts kick in for any of the testing in this post.

The next bit of code shows the actual test case which combines the two filters.

SYS@DEMO1> select /*+ monitor */ avg(pk_col) from kso.skew2
  2  where translate(to_char(sin(col1)),'1','a') = 'a'   
  3  and col4 = 'F'
  4  /
 
AVG(PK_COL)
-----------
 
 
Elapsed: 00:00:01.72
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
 
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------
SQL_ID  5p4610dk709pj, child number 0
-------------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where
translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
 
Plan hash value: 4220890033
 
-----------------------------------------------------
| Id  | Operation                  | Name  | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT           |       |        |
|   1 |  SORT AGGREGATE            |       |      1 |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW2 |   3784 |
-----------------------------------------------------
 
Predicate Information (identified by operation id): <==== note that the fast filter is applied first
---------------------------------------------------
 
   2 - storage(("COL4"='F' AND TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='
              a'))
       filter(("COL4"='F' AND TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a
              '))
 
Note
-----
   - dynamic sampling used for this statement (level=2)
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level
 
 
30 rows selected.
 
Elapsed: 00:00:00.01
SYS@DEMO1> @sm
Enter value for sid: 
Enter value for sql_id: 5p4610dk709pj
Enter value for sql_exec_id: 
 
 SID    SERIAL# SQL_ID        SQL_EXEC_ID SQL_EXEC_START       PLAN_HASH_VALUE         ETIME BUFFER_GETS DISK_READS
---- ---------- ------------- ----------- -------------------- --------------- ------------- ----------- ----------
####       2853 5p4610dk709pj    16777216 01-Jun-13 22:50:34        4220890033          1.63      325662          0
 
SYS@DEMO1> @rsm
Enter value for sid: 
Enter value for sql_id: 5p4610dk709pj
Enter value for sql_exec_id: 
 
REPORT
------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Monitoring Report
 
SQL Text
------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
 
Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  SYS (1366:2853)
 SQL ID              :  5p4610dk709pj
 SQL Execution ID    :  16777216
 Execution Started   :  06/01/2013 22:50:34
 First Refresh Time  :  06/01/2013 22:50:34
 Last Refresh Time   :  06/01/2013 22:50:36
 Duration            :  2s
 Module/Action       :  sqlplus@enkdb03.enkitec.com (TNS V1-V3)/-
 Service             :  SYS$USERS
 Program             :  sqlplus@enkdb03.enkitec.com (TNS V1-V3)
 Fetch Calls         :  1
 
Global Stats
=================================================
| Elapsed |   Cpu   |  Other   | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  |
=================================================
|    1.63 |    1.62 |     0.02 |     1 |   326K |
=================================================
 
SQL Plan Monitoring Details (Plan Hash Value=4220890033)
====================================================================================================================================
| Id |          Operation           | Name  |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Activity | Activity Detail |
|    |                              |       | (Estim) |       | Active(s) | Active |       | (Actual) |   (%)    |   (# samples)   |
====================================================================================================================================
|  0 | SELECT STATEMENT             |       |         |       |         1 |     +2 |     1 |        1 |          |                 |
|  1 |   SORT AGGREGATE             |       |       1 |       |         1 |     +2 |     1 |        1 |          |                 |
|  2 |    TABLE ACCESS STORAGE FULL | SKEW2 |    3784 | 89309 |         2 |     +1 |     1 |        0 |   100.00 | Cpu (2)         |
====================================================================================================================================

So the optimizer reordered the filters correctly (fast filter applied first) as we expected and the statement executed in less than 2 seconds. Note that this happened even though we had no stats on the table and dynamic sampling had to kick in and that the estimated cardinality of the two filters was the same. So this reordering was done based on estimated CPU cost.

Now let’s see what happens when we trick the optimizer into applying the filters in the wrong order. I could do this by changing the statistics but I’ll use a SQL Patch to apply the ordered_predicates hint. To do this I’ll use a little script I wrote called create_sql_patch.sql which prompts for a hint and applies it to a statement behind the scenes (much like a SQL Profile or Baseline).

 
SYS@DEMO1> @create_sql_patch
Enter value for sql_id: 5p4610dk709pj
Enter value for profile_name (PATCH_sqlid): ordered_predicates
Enter value for category (DEFAULT): 
Enter value for hint_text: ordered_predicates
Enter value for validate (false): 
 
SQL Patch ordered_predicates created.
 
Elapsed: 00:00:00.04
SYS@DEMO1> select /*+ monitor */ avg(pk_col) from kso.skew2
  2  where translate(to_char(sin(col1)),'1','a') = 'a'
  3  and col4 = 'F'
  4  /
 
AVG(PK_COL)
-----------
 
 
Elapsed: 00:05:54.05 <=== clearly something's changed!
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
 
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  5p4610dk709pj, child number 0
-------------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where
translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
 
Plan hash value: 4220890033
 
-----------------------------------------------------
| Id  | Operation                  | Name  | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT           |       |        |
|   1 |  SORT AGGREGATE            |       |      1 |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW2 |   3784 |
-----------------------------------------------------
 
Predicate Information (identified by operation id): <==== filters now in wrong order
---------------------------------------------------
 
   2 - storage((TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a' AND
              "COL4"='F'))
       filter((TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a' AND
              "COL4"='F'))
 
Note
-----
   - dynamic sampling used for this statement (level=2)
   - SQL patch "ordered_predicates" used for this statement <==== Patch applied
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level
 
 
32 rows selected.
 
Elapsed: 00:00:00.01

So now we have the same statement (sql_id: 5p4610dk709pj) and the same plan (plan_hash_value: 4220890033) but the performance is vastly different (~6 minutes vs. < 2 seconds). So yes this can and does happen. Fortunately it doesn’t happen very often, at least in my experience. Or at least I should say, it doesn’t seem to happen very often where the performance is vastly different. I should note that this specific test case is exceedingly simple and not very likely in real life because the optimizer should always know how expensive a filter is in terms of CPU usage. The more common case would be a change in estimated selectivity of a filter due to changes in statistics. Also, this simple test case is just reversing 2 filters on a single line in a plan but it is also possible that a filter would be moved to another line or that an access path may become a filter or some internal function may be applied inside a filter or access predicate. Since this post is getting long I will plan to do a Part 2 that discusses the forensic aspects (i.e. how do you find this situation?). As Tanel says, “the rabbithole gets deeper, as always”.

Delphix

If you’ve been keeping an eye on my Public Appearances page you’ll know that I am scheduled to go on line with Kyle Hailey for a second (more technical) discussion about Delphix and virtual databases on 19th June (tomorrow). If you haven’t registered, there’s still time to do so. It’s scheduled for 5:00 pm (BST), which makes it 9:00 am in San Francisco.

For an idea of the points we’ll cover, here’s a link with a draft agenda that Kyle Hailey has posted.

Update 1: Delphix have got 10 copies of Oracle Core to give away and they’ve decide to give one to every 10th registrant (until stocks run out) for the webinar.

Update 2: Over the last few days Kyle Hailey has been writing a short series comparing the commonest technologies currently available for Virtual Databases (or “Thin cloning”):

Database Thin Cloning: Allocate on Write (ZFS)

This post is part of an ongoing series:

(very interested in real life experiences and alternative strategies to using ZFS for thin cloning. Please comment and or email me at kylelf@gmail.com)

Allocate on Write Thin Cloning

Three challenges specifically stand out when considering Copy on Write filesystem snapshots described in the previous section:

  • The number of snapshots you can take of source database LUNs is limited
  • The size of the snapshots can also be limited
  • Difficulties arise sharing the base image of source databases at multiple points in time. In some cases it is not possible, in others difficult or resource heavy.

These challenges highlight a specific need: to create thin provision clones of a source database from multiple points of time at the same time without using any additional space consumption. This requirement is important, as it allows one base image to serve as the foundation for all subsequent clones and imposes no unplanned storage or refresh requirements on users of the target (cloned) systems.

With a filesystem storage technology called Allocate on Write, these challenges can be met. In allocate on write filesystems, data blocks are never modified. When modifications are requested to a block, the block with the new changes is written to a new location. After a request to modify a block has been issued and completed there will be two versions of the block: the version that existed prior to modification and the modified block. The location of the blocks and the versioning information for each block is located in a metadata area that is in turn managed by the same allocate on write mechanism. When a new version of a block has been written to a new location, the metadata has to be modified. However, instead of modifying the contents of the relevant metadata block, the new metadata block is written to a new location. These allocations of new metadata blocks with points to the new block ripple up the metadata structures all the way to the root block of the metadata. Ultimately, the root metadata block will be allocated in a new place pointing to the new versions of all blocks, meaning that the previous root block points to the filesystem at a previous point in time. The current, recently modified root block points to the filesystem at the current point in time. Through this mechanism an allocate on write system is capable of holding complete version history of not only a block, but all blocks involved in that block’s tracking.

Screen Shot 2013-06-03 at 10.29.37 AM

Figure 10. When a datablock in the bottom left is modified, instead of modifying the current block a new block is allocated with the modified contents. The metadata pointing to this new location has to be modified as well, and again instead of modifying the current metadata block, a new metadata block is allocated. These changes ripple up the structure such that the current root block points to the filesystem at the current point in time while the previous root block points to the filesystem at the previous point in time.

ZFS

Allocate on write has many similar properties with EMC’s VNX copy on write and NetApp’s WAFL systems, but the way allocate on write has been implemented in ZFS eliminates the boundaries found in both. With ZFS there is no practical size limitations to snapshots, no practical limit to the number of snapshots, and snapshots are almost instantaneously and practical zero space (on the order of a few kilobytes).

ZFS was developed by Sun Microsystems to address the limitations and complexity of filesystems and storage. Storage capacity is growing rapidly, yet filesystems have many limitations on how many files can be in a directory or how big a volume can be. Volume sizes are predetermined and have to be shrunk or expanded later depending on how far off the original calculation was, making capacity planning an incredibly important task. Any requirement to change filesystem sizes could cause hours of outages while filesystems are remounted and fsck is run. ZFS has no need for filesystem checks because it is designed to always be consistent on disk. The filesystems can be allocated without size constraints because they are allocated out of a storage pool that can easily be extended on the fly. The storage pool is a set of disks or LUNs. All disks are generally assigned to one pool on a system, and thus all ZFS filesystems using that pool have access to the entire space in the pool. More importantly, they have access to all the I/O operations for the spindles in that pool. In many ways, it completely eliminates the traditional idea of volumes.

On a non-ZFS filesystem the interface is a block device. Writes are done per block and there are no transaction boundaries. In the case of a loss of power or other critical issue there is also a loss of consistency. While the inconsistency issues have been addressed by journaling, that solution impacts performance and can be complex.

In a ZFS filesystem all writes are executed via allocate on write, and thus no data is overwritten. Writes are written in transaction groups such that all related writes succeed or fail as a whole, alleviating the need for fsck operations or journaling. On-disk states are always valid and there are no on-disk “windows of vulnerability”. Everything is checksummed and there is no silent data corruption.

Screen Shot 2013-06-03 at 11.11.33 AM

Figure 11. Comparison of non-ZFS filesystems on top and ZFS filesystems on the bottom. The ZFS filesystems are created in a storage pool that has all the available spindles, giving filesystems access to all the storage and IOPS from the entire pool. On the other hand, the non-ZFS filesystems are created on volumes and those volumes are attached to a specific set of spindles, creating islands of storage and limiting the IOPS for each filesystem.

Excepting certain hardware or volume manager specific software packages, the general comparison between non-ZFS and ZFS filesystems is as follows:

Filesystem (non-ZFS)

  • One filesystem per volume
  • Filesystem has limited bandwidth
  • Storage is stranded on the volume

ZFS Filesystem

  • Many filesystems in a pool
  • Filesystems grow automatically
  • Filesystems have access to all bandwidth

Along with many filesystem improvements, ZFS basically has moved the size barrier beyond any existing hardware that has yet been created and has no limitations on the number of snapshots that can be created. The maximum number of snapshots is 2^64 (18 quintillion) and the maximum size of a filesystem is 2^64 bytes (18.45 Exabytes).

A ZFS snapshot is a read-only copy of a filesystem. Snapshot creation is basically instantaneous and the number of snapshots is practically unlimited. Each snapshot takes up no additional space until original blocks become modified or deleted. As snapshots are used for clones and the clones are modified, the new modified blocks will take up additional space. A clone is a writeable copy of a snapshot. Creation of a clone is practically instantaneous and for all practical purposes the number of clones is unlimited.

Snapshots can be sent to a remote ZFS array via a send and receive protocol. Either a full snapshot or incremental changes between snapshots can be sent. Incremental snaps generally send and receive quickly and can efficiently locate modified blocks.

One concern with allocate on write technology is that a single block modification can set off a cascade of block allocations. First, the datablock to be modified is not overwritten but a new block is allocated and the modified contents are written into the new block (similar to copy on write). The metadata that points to the new datablock location has to be modified; but again, instead of overwriting the metadata block, a new block is allocated and the modified data is written into the new block. These changes cascade all the way up the metadata tree to the root block or uber block (see Figure 10). Thus for one data block change there can be 5 new blocks allocated. These allocations are quick as they take place in memory, but what happens when they are written out to disk? Blocks are written out to disk in batches every few seconds for non-synchronous writes. On an idle or low activity filesystem a single block change could create 5 writes to disk, but on an active filesystem the total number of metadata blocks changed will be small compared to the number of datablocks. For every metadata block written there will typically be several datablocks that have been modified. On an active filesystem typically a single metadata block covers the modifications of 10 or 20 datablocks and thus the extra number of blocks written to disk is usually on the order of 10% the actual metadata block count.

Screen Shot 2013-06-03 at 10.30.46 AM

Figure 12. The flow of transaction data through in-memory buffers and disk.

But what happens for sync writes that can’t wait for block write batches that happen every few seconds? In those cases the sync writes must be written out immediately. Sync writes depend on another structure called the ZFS Intent Log (ZIL). The ZIL is like a database change log or redo log. It contains just the change vectors and is written sequentially and continuously such that a synchronous write request for a datablock change only has to wait for the write to the ZIL to complete. There is a ZIL per filesystem, and it is responsible for handling synchronous write semantics. The ZIL creates log records for events that change the filesystem (write, create, etc.). The log records will have enough information to replay any changes that might be lost in memory in case of a power outage where the block changes in memory are lost. Log records are stored in memory until either:

  • Transaction group commits
  • A synchronous write requirement is encountered (e.g. fsync() or O_DSYNC)

In the event of a power failure or panic, log records are replayed. Synchronous writes will not return until ZIL log records are committed to disk.

Another concern is that blocks that were initially written sequentially next to each other may end up spread over the disk after modifications to those blocks due to the updates resulting in a new block being allocated to a different location. This fragmentation has little effect on random read workloads but multiblock reads can suffer from this because a simple request for a continuous number of blocks may turn into several individual reads by ZFS.

ZFS also introduced the concept of hybrid storage pools where both traditional spinning disks and modern flash-based SSDs are used in conjunction. In general, disks are cheap and large in size but are limited both in latency and throughput by mechanics. Flash devices on the other hand provide I/O requests with latency that is only a small fraction of that of disks; however, they are very expensive per gigabytes. So while it may be tempting to achieve the best possible performance by putting all data on SSDs, this is usually still too cost prohibitive. ZFS allows mixing these two storage technologies in a storage pool, after which the ZIL can be placed on a mirror of flash devices to speed up synchronous write requests where latency is crucial.

Another use for SSDs in ZFS is for cache devices. ZFS caches blocks in a memory area called the Adaptive Replacement Cache—also the name of the algorithm used to determine which blocks have a higher chance of being requested again. The ARC is limited in size by the available system memory; however, a stripe of SSD devices for a level 2 ARC can be configured to extend the size of the cache. Since many clones can be dependent on one snapshot, being able to cache that snapshot will speed up access to all the thin clones based off of that snapshot.

Screen Shot 2013-06-03 at 10.30.57 AM

Figure 13. A storage pool with an SSD caching layer and ZFS Intent Log for syncing.

With these capabilities in mind, there are several methods available to use this technology for database thin clones:

  • Open Source ZFS snapshots and clones
  • ZFS Storage Appliance from Oracle with RMAN
  • ZFS Storage Appliance from Oracle with Dataguard

(Open) Solaris ZFS

ZFS is available in a number of operating systems today. It was released in Solaris 10 and has gained even more features and importance in Solaris 11. After the acquisition of Sun by Oracle, the OpenSolaris project was abandoned but the community forked a number of open source projects, the most notable of which is Illumos and OpenIndiana. These releases are still actively being developed and maintained. Many commercial products are built on these open source projects.

Any one of these systems can be used to build your own ZFS based storage system to support thin cloning:

  • Database storage on local ZFS
  • ZFS storage as an NFS filer
  • ZFS storage as an iSCSI/block storage array

When a database is already running on Solaris with local disks, a ZFS filesystem can be used to hold all database files. Creating snapshots and clones on that filesystem is a simple matter of using a few ZFS commands; however, one does not have to bother with storage protocols like NFS. If Solaris is in use and datafiles are on ZFS anyways, it may also be a good idea to automate regular snapshots as an extra layer of security and to enable a “poor man’s flashback database”.

When a database is not running locally on a Solaris server, you can still benefit from ZFS features by building your own ZFS storage server. You can share ZFS volumes via iSCSI or fibre channel and use ASM on the database server for datafiles but instead we will focus on the easier setup with ZFS filesystems and the NFS protocol to share the volumes.

On a Solaris Storage server

  • Create a zpool (ZFS pool)
  • Create a ZFS filesystem in the pool
  • Export that filesystem via NFS

On the source database server

  • Mount the NFS filesystem
  • Put datafiles on the NFS mount as one of:
    • “live” data (this may have performance implications)
    • backup image copies (or an RMAN clone)
    • a replication target

On the Solaris Storage server

  • Take snapshots whenever necessary
  • Create clones from the snapshots
  • Export the clones via NFS

On the target database server

  • Mount NFS clones
  • Use this thin clone

ZFS Storage Appliance with RMAN

Oracle sells a ZFS storage appliance preconfigured with disks, memory, ZFS filesystem, and a powerful monitoring and analytics dashboard. One of these appliances can be used to create database thin clones; in fact, Oracle has published a 44-page white paper outlining the steps (found at http://www.oracle.com/technetwork/articles/systems-hardware-architecture/cloning-solution-353626.pdf). In brief, the steps involved are:

On the ZFS Appliance

  • Create a “db_master” project
  • Create a “db_clone” project
  • For both the “db_clone” and “db_master” project, create 4 filesystems:
    • datafile
    • redo
    • archive
    • alerts

On the Source Database

  • Mount a directory from the ZFS Appliance via NFS
  • Backup the source database with RMAN to the NFS mount directory

On the ZFS Appliance

  • Select the “db_master” project
  • Snapshot the “db_master” project
  • Clone each filesystem on “db_master” to the “db_clone” project

On the Target Host

  • Mount the 4 filesystems from the db_clone project via NFS
  • Startup the clone database on the target host using the directories from the db_clone project mount via NFS from the ZFS storage appliance

Screen Shot 2013-06-03 at 10.31.06 AM

Figure 14. A diagram of the procedure used to clone databases using the ZFS storage appliance and RMAN. First a directory is mounted on the source machine from the ZFS storage appliance via NFS. Then an RMAN backup is taken of the source database onto the NFS mounted directory. The snapshot can be taken off the RMAN backup on the ZFS storage appliance and then used to create thin clones.

ZFS Storage Appliance with DataGuard

One way to efficiently address getting changes from a source database onto a ZFS storage appliance is by using Dataguard as outlined in Oracle’s white paper on Maximum Availability Architecture (MAA) DB Cloning. You can find the document at the following link:

http://www.oracle.com/technetwork/database/features/availability/maa-db-clone-szfssa-172997.pdf

The concept revolves around using Dataguard to host the datafiles from a Dataguard instance on the ZFS storage appliance. With the datafiles hosted in ZFS, all changes from the source database will be propagated to the ZFS Storage Appliance via the Dataguard instance. Once the Dataguard datafiles are hosted on the ZFS storage appliance, the snapshots of the datafiles can easily be taken at desired points in time and clones can be made from the snapshots. The ZFS clones can be used to start up database thin clones on target database hosts by mounting those datafiles via NFS to the target hosts.

Screen Shot 2013-06-03 at 10.31.14 AM

Figure 15. Using Dataguard, files can be shared with a ZFS storage appliance via NFS to use for thin cloning of a target database.

Parallel DML

Parallel DML is not enabled by default, and it’s easy to forget this and think you’ve made an operation “as parallel as possible” by declaring objects parallel or putting in hints for parallelism.

A recent question on OTN asked about speeding up a  materialized view refresh that seemed to be spending a lot of its time waiting on “PX Deq Credit: send blkd”. The manuals describe this as an “idle event”; but that’s not always true. The OP had supplied the output from tkprof for one of the sessions showing the “insert as select” that was the (complete) refresh and it was clear that the select was running in parallel, but the insert wasn’t – and that’s one case in which the “PX Deq Credit: send blkd” is arguably an “idle” wait (with a timeout of 2 seconds). It’s possible that the refresh could go faster if the OP enabled parallel DML.

The argument for calling this wait idle (in this case) is that N-1 of the parallel slaves that are trying to feed the query co-ordinator are waiting because the query co-ordinator is soaking up data from the Nth slave as fast as it can – the query co-ordinator can’t go any faster and the slaves are being told to wait until the query co-ordinator is ready for their input. On the other hand, if you don’t need those waits to happen at all you could argue that they aren’t idle because they are affecting the end-user response time. To stop them happening, you can minimise the messages from the PX slaves to the query co-ordinator by making the insert run in parallel; and since it’s important to recognise the difference in plans between a parallel and non-parallel insert I thought I’d give you a little model to test.

create table t1 as select * from all_objects;
create table t2 as select * from t1 where rownum <= 1;  

alter table t1 parallel (degree 2); 
alter table t2 parallel (degree 2); 

explain plan for 
insert /*+ append */ into t2 select * from t1; 

select * from table(dbms_xplan.display(null,null,'-note -cost -rows -bytes')); 

commit; 
alter session enable parallel dml; 

explain plan for 
insert /*+ append */ into t2 select * from t1; 

select * from table(dbms_xplan.display(null,null,'-note -cost -rows -bytes')); 

This code creates a couple of tables, declared as parallel, then inserts (twice) from one to the other. In the first case only the select can run parallel so all the data from the parallel slaves will be passed to the query co-ordinator to be inserted. In the second case both the select and the insert can take place in parallel, so each slave would build its own data segments, and the only messages sent to the query co-ordinator would be about the list of extents in each “private” segment that have to be merged. Here are the two execution plans:

 
----------------------------------------------------------------------- 
| Id  | Operation             | Name     |    TQ  |IN-OUT| PQ Distrib | 
----------------------------------------------------------------------- 
|   0 | INSERT STATEMENT      |          |        |      |            | 
|   1 |  LOAD AS SELECT       | T2       |        |      |            | 
|   2 |   PX COORDINATOR      |          |        |      |            | 
|   3 |    PX SEND QC (RANDOM)| :TQ10000 |  Q1,00 | P->S | QC (RAND)  |
|   4 |     PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|   5 |      TABLE ACCESS FULL| T1       |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------

-----------------------------------------------------------------------
| Id  | Operation             | Name     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------
|   0 | INSERT STATEMENT      |          |        |      |            |
|   1 |  PX COORDINATOR       |          |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT     | T2       |  Q1,00 | PCWP |            |
|   4 |     PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|   5 |      TABLE ACCESS FULL| T1       |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------

Both plans are so simple in shape that you basically read them from the bottom up. The first plan shows the data passing from the PX slaves to the QC (PX coordinator) which does the LOAD AS SELECT. The second plan shows the PX slaves loading as they select (the load is grouped as PCWP – parallel combined with child – with the tablescan), and information is sent to the QC only after the load has completed.

Bottom line: if the PX SEND QC is above (lower line number) the LOAD AS SELECT the slaves are doing the insert, if the LOAD AS SELECT is above the PX SEND QC the query coordinator is doing the insert (and you’ll probably see lots of “PX Deq Credit: send blkd” at that point as the PX slaves wait for the QC to load data into blocks).

Getting the Most Out of ASH online seminar

Just a little reminder – next week (10-11th June) I’ll be delivering my last training session before autumn – a short 1-day (2 x 0.5 days actually) seminar about Getting the Most Out of Oracle’s Active Session History. In the future it will act as sort of a prequel (or preparation) for my Advanced Oracle Troubleshooting class, as the latter one deliberately goes very deep. The ASH seminar’s 1st half is actually mostly about the GUI way of troubleshooting the usual performance problems (EM/Grid Control) and the 2nd half is about all my ASH scripts for diagnosing more complex stuff.

P.S. I’ll also have multiple very cool news in a few months ;-)


Clusterware Startup

The restart of a UNIX server call initialization scripts to start processes and daemons. Every platform has a unique directory structure and follows a method to implement server startup sequence. In Linux platform (prior to Linux 6), initialization scripts are started by calling scripts in the /etc/rcX.d directories, where X denotes the run level of the UNIX server. Typically, Clusterware is started at run level 3. For example, ohasd daemon started by /etc/rc3.d/S96ohasd file by supplying start as an argument. File S96ohasd is linked to /etc/init.d/ohasd.

S96ohasd -> /etc/init.d/ohasd

/etc/rc3.d/S96ohasd start  # init daemon starting ohasd.

Similarly, a server shutdown will call scripts in rcX.d directories, for example, ohasd is shut down by calling K15ohasd script:

K15ohasd -> /etc/init.d/ohasd
/etc/rc3.d/K15ohasd stop  #UNIX daemons stopping ohasd

In Summary, server startup will call files matching the pattern of S* in the /etc/rcX.d directories. Calling sequence of the scripts is in the lexical order of script name. For example, S10cscape will be called prior to S96ohasd, as the script S10cscape occurs earlier in the lexical sequence.

Google if you want to learn further about RC startup sequence. Of course, Linux 6 introduces Upstart feature and the mechanism is a little different: http://en.wikipedia.org/wiki/Upstart

That’s not the whole story!

Have you ever thought why the ‘crsctl start crs’ returns immediately? You can guess that Clusterware is started in the background as the command returns to UNIX prompt almost immediately. Executing the crsctl command just modifies the ohasdrun file content to ‘restart’. It doesn’t actually perform the task of starting the clusterware. Daemon init.ohasd reads the ohasdrun file every few seconds and starts the Clusterware if the file content is changed to ‘restart’.

# cat /etc/oracle/scls_scr/oel6rac1/root/ohasdrun
restart

If you stop has using ‘crsctl stop has’ , then the ohasdstr file content is modified to stop and so, init.ohasd daemon will not restart Clusterware. However, stop command is synchronous and executes the stop of clusterware too.

# crsctl stop has
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'oel6rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'oel6rac1'
..

The content of ohasdrun is modified to stop:

# cat  /etc/oracle/scls_scr/oel6rac1/root/ohasdrun
stop # 

In a nutshell, init.ohasd daemon is monitoring the ohasdrun file and starts the Clusterware stack if the value in the file is modified to restart.

Inittab

Init.ohasd daemon is an essential daemon for Clusterware startup. Even if the Clusterware is not running on a node, you can start the Clusterware from a different node. How does that work? Init.ohasd is the reason.

The init.ohasd daemon is started from /etc/inittab. Entries in the inittab is monitored by the init daemon (pid=1) and init daemon will react if the inittab file is modified. The init daemon monitors all processes listed in the inittab file and reacts according to the configuration in the inittab file. For example, if init.ohasd fails for some reason, it is immediately restarted by init daemon.

Following is an example entry in the inittab file. Fields are separated a colon, second field indicates that init.ohasd will be started in run level 3, and the third field indicates an action field. Restart in the action field means that, if the target process exist, just continue scanning inittab file; if the target process does not exist, then restart the process.

#cat /etc/inittab
…
h1:3:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 

If you issue a clusterware startup command from a remote node, that a message sent to init.ohasd daemon in the target node, and the daemon initates the clusterware startup. So, init.ohasd will be always running irrespective of whether the Clusterware is running or not.

You can use strace on init.ohasd to verify this behavior. Following are a few relevant lines from the output of strace command of init.ohasd process:

…
5641  1369083862.828494 open("/etc/oracle/scls_scr/oel6rac1/root/ohasdrun", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
5641  1369083862.828581 dup2(3, 1)      = 1
5641  1369083862.828606 close(3)        = 0
5641  1369083862.828631 execve("/bin/echo", ["/bin/echo", "restart"], [/* 12 vars */]) = 0
…

Just for fun!

So, what happens if I manually modify that ohasdrun to restart? I copied the ohasdrun to a temporary file (/tmp/a1.lst) and stopped the clusterware.

cp /etc/oracle/scls_scr/oel6rac1/root/ohasdrun /tmp/a1.lst 
# crsctl stop has

I verified that Clusterware is completely stopped. Now, I will copy the file again overlaying ohasdrun:

# cat /tmp/a1.lst
restart
# cp /tmp/a1.lst  /etc/oracle/scls_scr/oel6rac1/root/ohasdrun

After a minute or so, I see that Clusterware processes are started. Not that, you would use this type of hack in a Production cluster, but this test proves my point.

It’s also important not to remove the files in the scls_scr directories. Any removal of the files underneath the scls_scr directory structure can lead to an invalid configuration.

There are also two more files in the scls_scr directory structure. Ohasdstr file decides if the HAS daemon should be started automatically or not. For example, if you execute ‘crsctl disable has’, that command modifies ohasdstr file contents to ‘disable’. Similarly, crsstart file controls CRS daemon startup. Again, you should use recommended commands to control the startup, rather than modifying any of these files directly.

11.2.0.1 and HugePages

If you tried to configure hugepages in 11.2.0.1 clusterware, by increasing memlock kernel parameter for GRID and database owner, you would have realized that database doesn’t use hugepages if started by the clusterware. Database startup using sqlplus will use hugepages, but the database startup using srvctl may not use hugepages.

As the new processes are cloned from the init.ohasd daemon, until init.ohasd is restarted, user level memlock limit changes are not correctly reflected in an already running process. Only recommended way to resolve the problem is to restart the node completely (not just the clusterware), as init.ohasd daemon must be restarted to reflect the user level limits.

Version 11.2.0.2 fixes this issue by explicitly calling ulimit command from /etc/init.d/ohasd files.


Summary

In Summary, init.ohasd process is an important process. Files underneath scls_scr directory is controlling the startup behavior. This also means that if a server is restarted, you don’t need to explicitly stop the Clusterware. You can let the server startup to restart the Clusterware.

PS: Some of you have asked my why my blogging frequency has decreased. I have been extremely busy co-authoring a book on RAC titled “Expert RAC Practices 12c”. We are covering lots of interesting stuff in the book. As soon as 12c release is Production, we can release the book also.

Database Thin Cloning: WAFL (Netapp)

This post is part of an ongoing series:

(very interested in real life experiences and alternative strategies to using Netapp for thin cloning. Please comment and or email me at kylelf@gmail.com)

Write Anywhere File Layout (WAFL)

With EMC, thin cloning can only be achieved by using backup technology; in essence, the process has to be architected manually in order to support databases. How can the same goals be achieved but with database thin cloning specifically in mind?

A more seamless approach to database thin cloning is SnapManager for Oracle and SnapManager for SQL Server offered by NetApp. NetApp employs a technology called Write Anywhere File Layout (WAFL) that sounds on the surface like EMC VNX copy on write but is different.  WAFL has been around far longer and has a track record of being used for database thin cloning. WAFL allows quick, easy, and efficient snapshots to be taken of a filesystem. New writes don’t overwrite previous blocks with WAFL; instead, the new writes go to a new location. With this architecture it is easy to snapshot files, filesystems or LUNs in minutes.

Up to 255 snapshots can be created from a single LUN. (The 255 limitation is  per volume) An entire LUN can be the source of a snapshot, or snapshots can be made of specific sets of files. Along with the quick and easy snapshot technology, NetApp provides a feature called SnapMirror that will propagate snapshots to a secondary filer. The secondary filer in turn can use a feature called FlexClone that can be used to create clones.

Clones created in this manner will share duplicate blocks and thus can be used to create database thin clones on a secondary filer. The snapshots on the source array can be managed specifically for databases with NetApp Snapshot Manager for Oracle (SMO), or Snapshot Manager for SQL Server. SMO connects to the database, and in the case of Oracle will put all tablespaces in hot backup mode before taking snapshots then take them out of hot backup mode when the snapshot is complete. Information about the snapshots is tracked and managed within SMO inside an Oracle database that serves as a repository.

The technology involved with snapshot cloning in WAFL is solid but very component heavy. On top of the components already listed is a required installation on the target array called NetApp SnapDrive for UNIX. Snapshots are propagated to the secondary array with SnapMirror but a feature called Protection Manager manages the process. A critical step in cloning operations is correctly synchronizing the snapshot schedule of SMO with the transfer schedule of Protection Manager so that the same retention class is maintained on the source and target arrays. On the destination array it is important to manage and track how many clones are made and which snapshot is used for the basis of each clone. If more than 255 clones are made of a single LUN, the next clone will no longer be a logic (virtual) clone sharing duplicate data blocks but a physical clone with a completely new copy of the datafiles.

Screen Shot 2013-06-03 at 10.29.29 AM

Figure 8. Using NetApp filer technologies including WAFL, SnapMirror, SMO, and FlexClone to create thin provisioned database clones.

An important consideration on WAFL volumes on NetApp is the aggregate pool. The aggregate pool defines which LUNs will be included in a snapshot. The size limitation on this pool varies between 16TB and 100TB depending on the model of the NetApp array. The limits on the size of this pool and the limit of 255 snapshots should be considered when evaluating the capabilities of SMO and FlexClone on NetApp.

Reference

from http://media.netapp.com/documents/tr-3761.pdf 

Screen Shot 2013-06-05 at 8.11.09 AM

Screen Shot 2013-06-05 at 8.11.18 AM

an interesting discussion of Netapp vs EMC filesystem snapshots:

 

Database Thin Cloning: Copy on Write (EMC)

This post is part of an ongoing series:

(very interested in real life experiences and alternative strategies to using EMC for thin cloning. Please comment and or email me at kylelf@gmail.com)

Copy on Write

Copy on write is a filesystem mechanism that allows filesystems to create snapshots at specific points in time. Whereas Clonedb is a little known and rarely used option, filesystem snapshot technologies are widely known and used in the industry. These snapshots maintain an image of a filesystem at a specific point in time. If the active filesystem makes a change to a block, the original block will be read from disk in its original form and written to a save location. Once the block save is completed, the snapshot will be updated to point to the new block location. After the snapshot has been updated, the active filesystem datablock can be written out and overwrite the original version.

Screen Shot 2013-06-03 at 10.28.39 AM

Figure 4. This figure shows filesystem blocks in green. A snapshot will point to the datablocks at a point in time as seen on the top left.

Screen Shot 2013-06-03 at 10.28.44 AM

Figure 5. When the active filesystem changes a block, the old version of the block has to be read and then written to a new location and the snapshot updated. The active filesystem can then write out the new modified block.

Using filesystem snapshots, an administrator can snapshot the filesystem containing datafiles for the database and use the snapshot to create a clone of a source database. With multiple snapshots, multiple clones with shared redundant blocks can be provisioned.

On the other hand, if the source database is an important production environment then creating clone databases on the same storage as the production database is generally not a good practice. A strategy that allows the cloned database files to be stored off of the production storage environment will be more optimal for performance and stability.

EMC Snapshot with BCV

EMC has a number of technologies that can create database thin clones. In the simplest case the clone databases can share the same storage as the source databases using snapshots of the filesystem. The filesystem snapshot can be taken and used to make a thin clone. EMC supports up to 16 writeable filesystem snapshots allowing up to 16 thin clones of the same source datafiles (while sharing the same filesystem as the source database). If the source database consists of several LUNs then snapshots must be taken of the LUNs at the same point in time. Taking consistent snapshots of multiple LUNs at the same point in time requires the EMC Timefinder product that will manage taking snapshots of multiple LUNs at the same point in time.

Taking load off of production databases and protecting production databases from possible performance degredation is an important goal of cloning. By taking snapshots of the production LUNs one incurs an extra read and extra write for every write issued by the production database. This overhead will impact both production and the clone. On top of the extra load generated by the snapshots, the clones themselves create load on the LUNs because of the I/O traffic they generate.

In order to protect the performance of the production database, clones are often provisioned on storage arrays that are separate from production. In the case where production LUNs are carved out of one set of isolated physical disk spindles and another set of LUNs are carved out of a separate set of physical spindles on the same array, it may be acceptable to run the clones within the same array. In this case, Business Continuance Volumes (BCV) can be used to mirror production LUNs onto the LUNs allocated for the clones. Then shapshots can be taken of the mirrors and those snapshots can be used for thin clones; or, in order to protect the production LUNs from the overhead generated by snapshots, the BCV mirrors can be broken and the LUNs allocated for cloning can be used to start up thin clone databases. Filesystem snapshots can be used to clone up to 16 thin clone databases using the LUNs mirrored from production.

More often than not, however, snapshots are taken of BCVs or the BCVs are broken and then copied to a second non-production storage array where snapshots can be taken and clones provisioned off of the snapshots. In this case, though the EMC environment is limited to only 16 clones and if those clones are from yesterday’s copy of production, then a whole new copy of production has to be made to create clones of today’s copy of production. This ends up taking more storage and more time, which goes against the goal of thin cloning.

EMC’s goal has been backup, recovery, and high availability as opposed to thin cloning; however, these same technologies can be harnessed for thin cloning.

The steps to set this configuration up on EMCs system are:

  1. Create BCVs and then break the BCVs
  2. Zone and mask a LUN to the target host
  3. Perform a full copy of the BCV source files to target array
  4. Perform a snapshot operation on target array
  5. Startup database and recover using  the target array

Screen Shot 2013-06-03 at 10.29.07 AM

Figure 6. Timefinder is used to snapshot multiple LUNs from the production filer to the non-production filer to be used for thin provision clones.

EMC is limited to 16 writeable snapshots and shapshots of snapshots (also known as branching) is generally not allowed. On some high-end arrays it may be possible to take a single snapshot of a snapshot, but not branch any deeper.

EMC VNX

While copy on write filesystem snapshots are limited to 16 snapshots, there are other options available in order to increase the number and to enable branching of oclones. EMC has another technology called VNX which improves upon previous Snapview snapshots. The VNX technology:

  • requires less space
  • has no read+write overhead of copy on first write (COFW)
  • makes snapshot reads simpler
  • supports clones of clones (branching)

When the older Snapview snapshots were created they required extra storage space at creation time. The newer VNX snapshots don’t require any extra storage space when they are created. The older COFW feature caused more writes for the filesystem than before the snapshot was in place. With newer VNX Snapshots the filesystem writes become Redirect on Write (ROW) where each new active filesystem modification is written to a different location with no extra read or write overhead.

Another benefit of VNX is how blocks are read from the source LUNs: in  the older Snapview, reads from snapshot had to merge data from the filesystem with the Reserve LUN Pool (RLP) where the original data blocks that have been modified are kept. With the newer VNX the snapshot data is read directly from the snapshot source LUN.

EMC’s Timefinder capability is also no longer necessary with VNX. Up to 256 snapshots can be taken in a VNX environment, and snapshots can be made of multiple LUNs simultaneously without needed additional software capabilities to create a consistent copy.

VNX relaxes some of the constraints of the older Snapview clones; however, in both cases the problem of efficiently bringing new changes from a source array to arrays used for development still exists. After a copy is brought over to a target array from source database LUNs, changes on the source (fresh data) cannot easily be brought over to the target array without a full new copy of the source database. Multiple point in time snapshots are also difficult, as having a target database on the development array share duplicate blocks with another version of the target database (different point in time) is impossible with this architecture. Instead, multiple copies will take up excess space on the target array, and none of the benefits of block sharing in cache or on disk will apply if multi-versioned clone databases are required.

EMC Snapshots with SRDF and Recover Point

A major challenge of both BCVs and VNX is keeping the remote storage array used for clones up to date with the source database. EMC has two solutions to this challenge; each provides a way of continuously pulling in changes from the source database into the second storage array in order to keep it up to date and usable for refreshed databases:

  • Symmetrix Remote Data Facility (SRDF)
  • RecoverPoint

SRDF streams changes from a source array to a destination array on Symmetric storage arrays only.

RecoverPoint is a combination of a RecoverPoint Splitter and a RecoverPoint appliance. The splitter splits writes, sending one write to the intended destination and the other to a RecoverPoint appliance. The splitter can live in the array, be fabric based, or host based. Host based splitting is implemented by installing a device driver on the host machine and allows RecoverPoint to work with non-EMC storage; however, because the drivers are implemented at the OS level the availability will depend on the operating system that has been ported. The fabric based splitters currently work with Brocade SAN switches and Cisco SANTap. Fabric splitters open up the usage of RecoverPoint with non-EMC storage. The RecoverPoint appliance can coalesce and compress the writes and send them back to a different location on the array or send them off to a different array either locally or in another datacenter.

One advantage of RecoverPoint over SRDF is that SRDF will immediately propagate any changes from the source array to the destination. As with all instant propagation systems if there is a logical corruption on the source (for instance, a table being dropped), it will immediately be propagated to the destination system. With RecoverPoint changes are recorded and the destination can be rolled back to before the point in time of the logical corruption.

SRDF could be used in conjunction with Timefinder snapshots to provide a limited number of consistent point-in-time recovery points for groups of LUNs. RecoverPoint on the other hand can work with consistency groups to guarantee write order collection over a group of LUNs, and provides continuous change collection. RecoverPoint tracks block changes and journals them to allow rolling back target systems in the case of logical corruption or the need to rewind the development system.

Screen Shot 2013-06-03 at 10.29.21 AM

Figure 7. EMC SRDF or RecoverPoint can propagate changes from source filer LUNs to the target filer dynamically, allowing better point in time snapshotting capabilities.

Using SRDF or RecoverPoint allows propagation of changes from a source array to a target array. On the target array, clones can be made from the source database at different points in time while still sharing duplicate blocks between the clones no matter which point in time they came from.

In all these cases, however, there are limits to the snapshots that can be taken as well as technical challenges trying to get the source changes to the target array in an easy and storage-efficient manner.

More information on EMC snapshot technologies can be found via the following website links:

Clustering Factor Calculation Improvement Part III (Too Much Rope)

In my previous post, I discussed how for larger tables that really do have randomised data, not even setting the new TABLE_CACHED_BLOCKS to its 255 maximum value is going to make any real difference to the resultant Clustering Factor (CF) of an index. The 255 maximum value here protects us somewhat from abusing this capability […]