
Making the Most of Oracle Exadata – A Technical Review

class="tweetmeme_button" style="float: right; margin-left: 10px;"> /> />

Over the past few weeks several people have asked me about an Exadata article entitled “Making the Most of Oracle Exadata” by Marc Fielding of Pythian. Overall it’s an informative article that touches on many of the key points of Exadata. However, even though I read (skimmed is a much better word) and briefly commented on the article back in August, on further review I found some technical inaccuracies, so I wanted to take the time to clarify this information for the Exadata community.

Exadata Smart Scans

Marc writes:

style="text-decoration: underline;">Smart scans: Smart scans are Exadata’s headline feature. They provide three main benefits: reduced data transfer volumes from storage servers to databases, CPU savings on database servers as workload is transferred to storage servers, and improved buffer cache efficiency thanks to column projection. Smart scans use helper processes that function much like parallel query processes but run directly on the storage servers. Operations off-loadable through smart scans include the following:

  • Predicate filtering – processing WHERE clause comparisons to literals, including logical operators and most SQL functions.
  • Column projection – by looking at a query’s SELECT clause, storage servers return only the columns requested, which is a big win for wide tables.
  • Joins – storage servers can improve join performance by using Bloom filters to recognize rows matching join criteria during the table scan phase, avoiding most of the I/O and temporary space overhead involved in the join processing.
  • Data mining model scoring – for users of Oracle Data Mining, scoring functions like PREDICT() can be evaluated on storage servers.

I personally would not choose a specific number of benefits for Exadata Smart Scan. Simply stated, the design goal behind Smart Scan is to reduce the amount of data that is sent from the storage nodes (or storage arrays) to the database nodes (why move data that is not needed?). Smart Scan does this in two ways: it applies the appropriate column projection and row restriction rules to the data as it streams off disk. Projection, however, is not limited to just the columns in the SELECT clause, as Marc suggests; it also includes columns in the WHERE clause. Obviously JOIN columns need to be projected to perform the JOIN in the database nodes. The one area that Smart Scan does not help with at all is improved buffer cache efficiency. The reason for this is quite simple: Smart Scan returns data in blocks that were created on-the-fly just for that given query — they contain only the needed columns (projections) and have rows filtered out by the predicates (restrictions). Those blocks could not be reused unless someone ran the exact same query (think of those blocks as custom built just for that query). The other thing is that Smart Scans use direct path reads (cell smart table scan), and these reads are done into the PGA space, not the shared SGA space (buffer cache).

As most know, Exadata can easily push down simple predicate filters (WHERE c1 = ‘FOO’) that can be applied as restrictions with Smart Scan. In addition, Bloom Filters can be applied as restrictions for simple JOINs, like those commonly found in Star Schemas (Dimensional Data Models). These operations can be observed in the query execution plan by the JOIN FILTER CREATE and JOIN FILTER USE row sources. What is very cool is that Bloom Filters can also pass their list of values to Storage Indexes to aid in further I/O reduction if there is natural clustering on those columns, or if it eliminates significant amounts of data (as in a highly selective set of values). Even if there isn’t significant data elimination via Storage Indexes, a Smart Scan Bloom Filter can be applied post-scan to prevent unneeded data from being sent to the database servers.
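To make that concrete, here is a sketch of the kind of star-style join where those row sources show up (the SALES/CUSTOMERS tables and the parallel degree are made up for illustration, not taken from Marc’s article):

-- hypothetical star-style join: the Bloom Filter built from the filtered
-- CUSTOMERS rows can be applied during the smart scan of SALES
select /*+ parallel(s 8) */
       d.region, sum(s.amount) total_amount
  from sales s, customers d
 where s.cust_id = d.cust_id
   and d.region = 'EMEA'
 group by d.region;

-- with this shape of plan, DBMS_XPLAN.DISPLAY_CURSOR typically shows the
-- JOIN FILTER CREATE / JOIN FILTER USE row sources described above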

Exadata Storage Indexes

Marc writes:

style="text-decoration: underline;">Storage indexes: Storage indexes reduce disk I/O volumes by tracking high and low values in memory for each 1-megabyte storage region. They can be used to give partition pruning benefits without requiring the partition key in the WHERE clause, as long as one of these columns is correlated with the partition key. For example, if a table has order_date and processed_date columns, is partitioned on order_date, and if orders are processed within 5 days of receipt, the storage server can track which processed_date values are included in each order partition, giving partition pruning for queries referring to either order_date or processed_date. Other data sets that are physically ordered on disk, such as incrementing keys, can also benefit.

In Marc’s example he states there is correlation between the two columns PROCESSED_DATE and ORDER_DATE, where PROCESSED_DATE = ORDER_DATE + [0..5 days]. That’s fine and all, but to claim that partition pruning takes place when specifying ORDER_DATE (the partition key column) or PROCESSED_DATE (a non-partition key column) in the WHERE clause, because the Storage Index can be used for PROCESSED_DATE, is inaccurate. The reality is that partition pruning can only take place when the partition key, ORDER_DATE, is specified, regardless of whether a Storage Index is used for PROCESSED_DATE.

Partition Pruning and Storage Indexes are completely independent of each other and Storage Indexes know absolutely nothing about partitions, even if the partition key column and another column have some type of correlation, as in Marc’s example. The Storage Index simply will track which Storage Regions do or do not have rows that match the predicate filters and eliminate reading the unneeded Storage Regions.
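If you want to see whether Storage Indexes actually kicked in for a given statement, a minimal sketch (run from the same session, right after the statement finishes) is to check the session statistic Oracle exposes for exactly this:

-- bytes of physical IO the cells were able to skip via Storage Index
-- region elimination (cumulative for the current session)
select sn.name, ms.value
  from v$mystat ms, v$statname sn
 where ms.statistic# = sn.statistic#
   and sn.name = 'cell physical IO bytes saved by storage index';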

Exadata Hybrid Columnar Compression

Marc writes:

style="text-decoration: underline;">Columnar compression: Hybrid columnar compression (HCC) introduces a new physical storage concept, the compression unit. By grouping many rows together in a compression unit, and by storing only unique values within each column, HCC provides storage savings in the range of 80 90% based on the compression level selected. Since data from full table scans remains compressed through I/O and buffer cache layers, disk savings translate to reduced I/O and buffer cache work as well. HCC does, however, introduce CPU and data modification overhead that will be discussed in the next section.

The Compression Unit (CU) for Exadata Hybrid Columnar Compression (EHCC) is actually a logical construct, not a physical storage concept. The compression gains from EHCC come from column-major organization of the rows contained in the CU and the encoding and transformations (compression) that can be done because of that organization (like values are more common within the same column across rows than across different columns in the same row). To say EHCC only stores unique values within each column is inaccurate; rather, the encoding and transformation algorithms use various techniques that yield very good compression by attempting to represent the column values with as few bytes as possible.
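If you want to get a feel for the compression gains on your own data, a simple (hedged) sketch is to build an EHCC copy of an existing table and compare segment sizes; the ORDERS table name here is hypothetical:

-- build an EHCC (query high) copy of a table via direct path CTAS
create table orders_hcc compress for query high
as select * from orders;

-- compare the segment sizes of the original and the EHCC copy
select segment_name, round(bytes / 1024 / 1024) as mb
  from user_segments
 where segment_name in ('ORDERS', 'ORDERS_HCC');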

Data from an EHCC full table scan only remains fully compressed if the table scan is not a Smart Scan; in that case the compressed CUs are passed directly up to the buffer cache and the decompression is then done by the database servers. If the EHCC full table scan is a Smart Scan, however, then only the columns and rows being returned to the database nodes are decompressed by the Exadata servers, and predicate evaluations can be performed directly on the EHCC compressed data.

Read more: Exadata Hybrid Columnar Compression Technical White Paper

Marc also writes:

Use columnar compression judiciously: Hybrid columnar compression (HCC) in Exadata has the dual advantages of reducing storage usage and reducing I/O for large reads by storing data more densely. However, HCC works only when data is inserted using bulk operations. If non-compatible operations like single-row inserts or updates are attempted, Exadata reverts transparently to the less restrictive OLTP compression method, losing the compression benefits of HCC. When performing data modifications such as updates or deletes, the entire compression unit must be uncompressed and written in OLTP-compressed form, involving an additional disk I/O penalty as well.

EHCC does require bulk direct path load operations to work. This is because the compression algorithms that are used for EHCC need sets of rows as input, not single rows. What is incorrect in Marc’s comments is the claim that when a row in a CU is modified (UPDATE or DELETE), the entire CU is uncompressed and changed to non-EHCC compression; in reality only the rows that are UPDATEd are migrated to non-EHCC compression. For DELETEs no row migrations take place at all. This is easily demonstrated by tracking ROWIDs, as in the example at the bottom of this post.
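If you want to verify the compression format of individual rows after an UPDATE, one (hedged) way is DBMS_COMPRESSION.GET_COMPRESSION_TYPE from 11.2, shown here against the ORDER_ITEMS1 table from the experiment at the bottom of this post; the number returned maps to the DBMS_COMPRESSION COMP_* constants:

-- check the compression type of specific rows by ROWID; migrated (updated)
-- rows should report a different type than rows still stored in EHCC format
select rnum,
       dbms_compression.get_compression_type(user, 'ORDER_ITEMS1', rowid) as comp_type
  from order_items1
 where rnum in (1, 100, 1000, 10000);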

Exadata Smart Flash Cache

Marc writes:

style="text-decoration: underline;">Flash cache: Exadata s flash cache supplements the database servers buffer caches by providing a large cache of 384 GB per storage server and up to 5 TB in a full Oracle Exadata Database Machine, considerably larger than the capacity of memory caches. Unlike generic caches in traditional SAN storage, the flash cache understands database-level operations, preventing large non-repeated operations such as backups and large table scans from polluting the cache. Since flash storage is nonvolatile, it can cache synchronous writes, providing performance benefits to commit-intensive applications.

While flash (SSD) storage is indeed non-volatile, the Exadata Smart Flash Cache is volatile – it loses all of its contents if the Exadata server is power cycled. Also, since the Exadata Smart Flash Cache is currently a write-through cache, it offers no direct performance advantages to commit-intensive applications; however, it does offer indirect performance advantages by servicing read requests that would otherwise be serviced by the HDDs, thus allowing the HDDs to service more write operations.
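For what it’s worth, the read-side benefit is easy to observe from the database side; a minimal sketch:

-- cumulative number of read requests satisfied from the Exadata Smart Flash Cache,
-- as seen by this instance
select name, value
  from v$sysstat
 where name = 'cell flash cache read hits';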

Read more: Exadata Smart Flash Cache Technical White Paper

EHCC UPDATE and DELETE Experiment

--
-- EHCC UPDATE example - only modified rows migrate
--

SQL> create table order_items1
  2  compress for query high
  3  as
  4  select rownum as rnum, x.*
  5  from order_items x
  6  where rownum <= 10000;

Table created.

SQL> create table order_items2
  2  as
  3  select rowid as rid, x.*
  4  from order_items1 x;

Table created.

SQL> update order_items1
  2  set quantity=10000
  3  where rnum in (1,100,1000,10000);

4 rows updated.

SQL> commit;

Commit complete.

SQL> select b.rnum, b.rid before_rowid, a.rowid after_rowid
  2  from order_items1 a, order_items2 b
  3  where a.rnum(+) = b.rnum
  4  and (a.rowid != b.rid
  5    or a.rowid is null)
  6  order by b.rnum
  7  ;

           RNUM BEFORE_ROWID       AFTER_ROWID
--------------- ------------------ ------------------
              1 AAAWSGAAAAAO1aTAAA AAAWSGAAAAAO1aeAAA
            100 AAAWSGAAAAAO1aTABj AAAWSGAAAAAO1aeAAB
           1000 AAAWSGAAAAAO1aTAPn AAAWSGAAAAAO1aeAAC
          10000 AAAWSGAAAAAO1aXBEv AAAWSGAAAAAO1aeAAD

--
-- EHCC DELETE example - no rows migrate
--

SQL> create table order_items1
  2  compress for query high
  3  as
  4  select rownum as rnum, x.*
  5  from order_items x
  6  where rownum <= 10000;

Table created.

SQL> create table order_items2
  2  as
  3  select rowid as rid, x.*
  4  from order_items1 x;

Table created.

SQL> delete from order_items1
  2  where rnum in (1,100,1000,10000);

4 rows deleted.

SQL> commit;

Commit complete.

SQL> select b.rnum, b.rid before_rowid, a.rowid after_rowid
  2  from order_items1 a, order_items2 b
  3  where a.rnum(+) = b.rnum
  4  and (a.rowid != b.rid
  5    or a.rowid is null)
  6  order by b.rnum
  7  ;

           RNUM BEFORE_ROWID       AFTER_ROWID
--------------- ------------------ ------------------
              1 AAAWSIAAAAAO1aTAAA
            100 AAAWSIAAAAAO1aTABj
           1000 AAAWSIAAAAAO1aTAPn
          10000 AAAWSIAAAAAO1aXBEv
name="googleone_share_1" style="position:relative;z-index:5;float: right; margin-left: 10px;">{lang: 'ar'} class='wpfblike' >

An investigation into exadata, part 2

It’s been a while since I posted the first part of this series.

Considerations

I created a little tool, snap, which is more or less a combination of Tanel Poder’s snapper and a script from Tom Kyte. I used it in the first part to give me all the details of the database process I was using. The next step is invoking Oracle’s Parallel Query feature. Parallel Query means multiple processes work together to execute a single query.
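As a reminder of what invoking it looks like (the table name and degree here are made up for illustration), either a statement-level hint or a session setting will do:

-- force a parallel scan of a hypothetical table T with degree 8
select /*+ parallel(t 8) */ count(*)
  from t;

-- or enable parallel query for everything in the session
alter session force parallel query parallel 8;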

Cellcli Command Syntax – Top 10 List

Exadata storage software uses the cellcli utility as its command line interface. Unfortunately, although the documentation set that comes with Exadata does have many examples of cellcli commands, and even a chapter dedicated to cellcli, it does not include any reference material on the syntax itself (particularly the LIST command). So I thought I would write up a few of the things I’ve learned while picking around at it. But first a little bit of a rant on why they wrote yet another command line interface.

They already had SQL*Plus for crying out loud. Why not just use that? SQL*Plus has all kinds of functionality for using variables, formatting output, etc. And on top of that, they invented a new syntax. Why use LIST instead of SELECT? They used WHERE and LIKE, so why not SELECT? I find it more than a little annoying (in case you couldn’t tell). I’m told that storage admins don’t like SQL and that’s one of the reasons for not using straight SQL syntax. That seems pretty silly since the storage is designed specifically for use with Oracle databases.

So anyway, here’s my quick top ten list of things you should know:

  1. cellcli does have a handful of SQL*Plus commands (START (@), SET ECHO ON, SPOOL, DESCRIBE, HELP)
  2. SELECT is replaced by LIST and it must be the first key word on the command line
  3. There is no FROM keyword (the LIST keyword must be immediately followed by the ObjectType which is equivalent to a table name)
  4. There is a DESCRIBE command which displays the attributes (columns) that make up an ObjectType (table)
  5. Column names are specified with the ATTRIBUTES keyword followed by the columns you wish to be displayed
  6. There is a default set of columns for each Object that will be returned if the ATTRIBUTES keyword is not specified
  7. There is a WHERE clause that can be applied to any attribute and multiple conditions can be ANDed together (no OR though)
  8. There is no ORDER BY equivalent
  9. The DETAIL key word can be appended to any LIST command to change the output from column oriented to row oriented
  10. The LIKE operator works but instead of the standard SQL wildcard, %, cellcli uses regex – so ‘%’ = ‘.*’

So here are a few examples:

CellCLI> help 
 
 HELP [topic]
   Available Topics:
        ALTER
        ALTER ALERTHISTORY
        ALTER CELL
        ALTER CELLDISK
        ALTER GRIDDISK
        ALTER IORMPLAN
        ALTER LUN
        ALTER THRESHOLD
        ASSIGN KEY
        CALIBRATE
        CREATE
        CREATE CELL
        CREATE CELLDISK
        CREATE FLASHCACHE
        CREATE GRIDDISK
        CREATE KEY
        CREATE THRESHOLD
        DESCRIBE
        DROP
        DROP ALERTHISTORY
        DROP CELL
        DROP CELLDISK
        DROP FLASHCACHE
        DROP GRIDDISK
        DROP THRESHOLD
        EXPORT CELLDISK
        IMPORT CELLDISK
        LIST
        LIST ACTIVEREQUEST
        LIST ALERTDEFINITION
        LIST ALERTHISTORY
        LIST CELL
        LIST CELLDISK
        LIST FLASHCACHE
        LIST FLASHCACHECONTENT
        LIST GRIDDISK
        LIST IORMPLAN
        LIST KEY
        LIST LUN
        LIST METRICCURRENT
        LIST METRICDEFINITION
        LIST METRICHISTORY
        LIST PHYSICALDISK
        LIST THRESHOLD
        SET
        SPOOL
        START
 
CellCLI> help set
 
  Usage: SET <variable> <value>
 
  Purpose: Sets a variable to alter the CELLCLI environment settings for your
           current session.
 
  Arguments:
    variable and value represent one of the following clauses:
    DATEFORMAT { STANDARD | LOCAL }
    ECHO { ON | OFF }
 
  Examples:
    set dateformat local
    set echo on 
 
 
CellCLI> help list
 
  Enter HELP LIST <object_type> for specific help syntax.
    <object_type>:  {ACTIVEREQUEST | ALERTHISTORY | ALERTDEFINITION | CELL 
                     | CELLDISK | FLASHCACHE | FLASHCACHECONTENT | GRIDDISK
                     | IORMPLAN | KEY | LUN 
                     | METRICCURRENT | METRICDEFINITION | METRICHISTORY 
                     | PHYSICALDISK | THRESHOLD }
 
CellCLI> help list FLASHCACHECONTENT
 
  Usage: LIST FLASHCACHECONTENT [<filters>] [<attribute_list>] [DETAIL] 
 
  Purpose: Displays specified attributes for flash cache entries.
 
  Arguments:
    <filters>:  An expression which determines the entries to be displayed.
    <attribute_list>: The attributes that are to be displayed.
                      ATTRIBUTES {ALL | attr1 [, attr2]... }
 
  Options:
    [DETAIL]: Formats the display as an attribute on each line, with
              an attribute descriptor preceding each value.
 
  Examples:
    LIST FLASHCACHECONTENT DETAIL

So as you can see, the help system allows you to see a bit of the syntax for each command. You may also have noticed a couple of SQL*Plus carry-overs. SET, SPOOL, and START work pretty much as expected. Note the @ is equivalent to START and that the only things you can SET are ECHO and DATEFORMAT. Now for a couple of queries (er LIST commands):

CellCLI> desc flashcachecontent
         ^
CELL-01504: Invalid command syntax.
 
CellCLI> describe flashcachecontent
        cachedKeepSize
        cachedSize
        dbID
        dbUniqueName
        hitCount
        hoursToExpiration
        missCount
        objectNumber
        tableSpaceNumber
 
CellCLI> set echo on
 
CellCLI> @fc_content
 
> CellCLI> list flashcachecontent where dbUniqueName like 'EXDB' and hitcount > 100 attributes dbUniqueName, objectNumber, cachedKeepSize, cachedSize, hitCount, missCount
         EXDB    2       0       4194304         600     208
         EXDB    40      0       2424832         376     60
         EXDB    224     0       1802240         115     80
         EXDB    267     0       458752          128     9
         EXDB    383     0       2547712         157     27
         EXDB    423     0       1867776         180     41
         EXDB    471     0       4071424         552     85
         EXDB    472     0       1277952         114     22
         EXDB    474     0       13246464        286     326
         EXDB    475     0       5914624         519     124
         EXDB    503     0       5308416         669     455
         EXDB    5710    0       3735552         363     90
         EXDB    6207    0       393216          112     9
         EXDB    6213    0       3842048         359     147
         EXDB    6216    0       1245184         184     29
         EXDB    6373    0       3481600         222     61
         EXDB    56085   0       4194304         822     129
         EXDB    66849   0       438763520       1221    3322
         EXDB    71493   0       5636096         302     127
         EXDB    71497   0       1351680         320     22
         EXDB    71573   0       2760704         101     37
         EXDB    71775   0       1801412608      34994   46315
 
CellCLI> list flashcachecontent where dbUniqueName like 'EX.?.?' and hitcount > 100 -
 
> attributes dbUniqueName, objectNumber, cachedKeepSize, cachedSize 
         EXDB    2       0       4194304
         EXDB    18      0       1179648
         EXDB    37      0       622592
         EXDB    40      0       2424832
         EXDB    63      0       524288
         EXDB    104     0       688128
         EXDB    224     0       3407872
         EXDB    267     0       458752
         EXDB    383     0       2670592
         EXDB    420     0       1507328
         EXDB    423     0       1867776
         EXDB    424     0       720896
         EXDB    471     0       4071424
         EXDB    472     0       1277952
         EXDB    473     0       2351104
         EXDB    474     0       13574144
         EXDB    475     0       5521408
         EXDB    503     0       5308416
         EXDB    5702    0       262144
         EXDB    5709    0       2416640
         EXDB    5710    0       3735552
         EXDB    6207    0       393216
         EXDB    6210    0       131072
         EXDB    6213    0       4227072
         EXDB    6216    0       1245184
         EXDB    6373    0       3579904
         EXDB    56085   0       4194304
         EXDB    66849   0       438763520
         EXDB    71493   0       5636096
         EXDB    71497   0       1351680
         EXDB    71573   0       2801664
         EXDB    71775   0       1801412608
 
CellCLI> list flashcachecontent where dbUniqueName like 'EX.?.?' and hitcount > 100 and objectNumber like '.*775'
         2356637742      6       71775
 
CellCLI> list flashcachecontent where dbUniqueName like '.*X.?.?' and objectNumber like '.*775' detail                                      
         cachedKeepSize:         0
         cachedSize:             1801412608
         dbID:                   2356637742
         dbUniqueName:           EXDB
         hitCount:               34994
         missCount:              46315
         objectNumber:           71775
         tableSpaceNumber:       6
 
CellCLI> list flashcachecontent where dbUniqueName like 'EX.?.?' and hitcount > 100 and objectNumber like '.*775'
         2356637742      6       71775
 
CellCLI> list flashcachecontent attributes objectNumber, hitCount, missCount where dbUniqueName like 'EX.?.?' and hitcount > 100 and objectNumber like '.*775'
         71775   34994   46315

So DESC doesn’t work as an abbreviation of DESCRIBE. Notice that there are no headings for column oriented output. As you can see, you can run “scripts” and SET ECHO ON to display the commands in any scripts that you execute. One of the LIST commands was strung across two lines by using the continuation operator (-). The LIST commands look a lot like SQL except for LIST being used instead of SELECT and the regex expressions for matching when using the LIKE key word. Also notice that in the last command a number was matched with a regex expression implying a data type conversion, although all data may be treated at text. You can see that the ATTRIBUTES and WHERE key words can be anywhere on the command line after the “LIST objectName” keywords. In other words, these two key words are not positional, either one can be first. Finally, the DETAIL keyword turns the output sideways. Or as the help says, “Formats the display as an attribute on each line, with
an attribute descriptor preceding each value.”

So the cellcli interface is really not that bad, I just happen to like SQL*Plus better. ;) I do think it would have been a simple matter to reuse SQL*Plus since they already have all the functionality built into it, but maybe there were other concerns that I’m not aware of. But cellcli works. And by the way, cellcli has the ability to scroll through previous commands and edit them via the arrow keys, which is quite handy. The editing capability is definitely a step forward from SQL*Plus on unix-like platforms (although you can use rlwrap to accomplish this – see this post for more details on that: Using rlwrap on Windows). And regex provides a very powerful pattern matching capability, although to my way of thinking it’s still a little confusing to have SQL-like syntax mixed with regex. Maybe if they just added the ability to use the % wildcard in addition to the regex I would feel better about it.

Applying Exadata Bundle Patch 6 (BP6)

Well, November came around quickly - when I started blogging I promised myself that I would at least try to share my Exadata findings once a month. So it is time to share my experience and findings from applying bundle patch 6. If you are new to Exadata patching you should hopefully find some important points below. The patching took place on a ¼ Rack (2 compute nodes). The high-level approach for applying BP6 is:

  1. Apply patch 10114705 - Database Bundle Patch 6 for Exadata first

Oracle XMonth

I spent most of the last week in California at Oracle’s XMonth (it was really a week so I’m not sure why they called it XMonth). Which reminds me of a tuning engagement I did a few years ago. I showed up at the client’s facilities and we started discussing the problem. They told me that their daily job was running too slow and it was causing major problems. So I asked them what time the daily job ran each day. And they told me that it ran every 20 minutes. Ha! “And why do you call it the daily job?”, I asked. They weren’t really sure. The job was taking about an hour to complete as I recall, but I digress.

At XMonth, they had three tracks: Exadata (and two others that I can’t remember). I did learn a few things I thought were worth sharing.

  1. Exalogic is not shipping yet, but we did cover it in some detail. Exalogic’s biggest advantage appears to be the Infiniband fabric. It can be connected directly to any DB server using IB (Exadata for example) and can communicate with the extremely low latency RDS protocol.
  2. Oracle has relaxed their “no changes to the Exadata configuration” stance (but only very slightly). They said that it was OK to change out the Cisco switch and replace it with some other equivalent switch.
  3. A competitive analysis of Sun’s server line was provided. It included T series, X series, and M series servers along with Exadata. Exadata pretty much kicks all other options in the teeth (in my opinion). M series are still suitable for very large applications that are unable to scale out via RAC – such as Oracle’s own MRP package, which uses the dbms_pipe package, limiting its ability to scale in a RAC environment. But in general, the advice to the sales team was that if you are in a competitive situation, in most cases you should lead with Exadata.

So that’s about it. Oh they also let me talk about our experiences with Exadata. That was fun and I got to try my hand at a virtual presentation, as there were participants all over the world following along with Webex. The software has gotten pretty good for doing these kinds of presentations by the way. It was good practice for the Virtual Oracle Conference we have coming up next month. (note that there are only a couple of days left to sign up at the discounted rate)

Exadata, to index or not to index, that is the question

We will take a look at that question in this blog post by testing several different approaches and comparing the time and the statistics for each scenario.

The tests have been performed on a quarter rack Exadata database machine (2 db nodes with 16 cores each, and 3 storage servers). The database is set up for a data warehouse implementation and had been patched with bundle patch 5 at the time of testing.

The tests were executed on a table with 403M rows distributed over 14 range partitions – 7 non-compressed and 7 partitions compressed with the HCC query option. Each test spans two partitions covering 57.5M rows. Please note the dimension tables contain 4,000 or fewer rows. The data is production data and is event based, meaning rows are generated when certain events occur.

TABLE_NAME  PARTITION_NAME COMPRESS COMPRESS_FOR LAST_ANALYZED  SAMPLE_SIZE   NUM_ROWS
EVENT_PART  TEST_20100901  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100902  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100903  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100904  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100905  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100906  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100907  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100908  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100909  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100910  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100911  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100912  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100913  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100914  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
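For reference, a DDL sketch along these lines (the column list and partition bounds are hypothetical and abbreviated, not the actual production DDL) would produce a layout like the one above, with some range partitions left uncompressed and others compressed with the HCC query option:

-- abbreviated sketch of the test table: range partitioned by day key,
-- with some partitions uncompressed and some compressed for query high
create table event_part (
  t_id     number,   -- day key, e.g. 20100901
  s_id     number,
  p_id     number,
  cc_id    number,
  c_id     number,
  c_level  number,
  cbg      number,
  ceg      number
)
partition by range (t_id)
(
  partition test_20100901 values less than (20100902),
  partition test_20100902 values less than (20100903),
  partition test_20100908 values less than (20100909) compress for query high,
  partition test_20100909 values less than (20100910) compress for query high
);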
 

Each test-case/SQL has been executed 5 times, under different scenarios:

(1)  Without any bitmap or regular indexes on non-compressed partitions
     Storage FTS on event_part and dimensions
(2)  Without any bitmap or regular indexes on HCC partitions
     Storage FTS on event_part and dimensions
(3)  With primary key constraint on dimension tables only
     Storage FTS on event_part with primary key lookup on dimensions
(4)  With bitmap and primary key indexes on non-compressed partitions
     Bitmap index lookup on event_part and primary key lookup on dimensions
(5)  With bitmap and primary key indexes on HCC partitions
     Bitmap index lookup on event_part and primary key lookup on dimensions

The test cases used are from a warehouse environment and I have modified the column and table names. For the bitmap test-cases I had to hint the queries to ensure the bitmap indexes were actually used, as sketched below.
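For reference, the kind of hinting I mean looks like this (the bitmap index names are hypothetical); INDEX_COMBINE is the usual way to ask the optimizer to consider bitmap access paths:

-- hypothetical hint forcing bitmap index access on the fact table
select /*+ monitor index_combine(ep ep_s_id_bix ep_cc_id_bix) */
       st.s_name, pt.p_name, cct.pc_name, ep.c_level, count(ep.c_id)
  from event_part ep, t_test tt, s_test st, p_test pt, cc_test cct
 where tt.t_id between 20100906 and 20100907
   and ep.t_id = tt.t_id
   and ep.s_id = st.s_id
   and ep.p_id = pt.p_id
   and ep.cc_id = cct.c_id
   and ep.s_id = 1
   and ep.cc_id = 7
 group by st.s_name, pt.p_name, cct.pc_name, ep.c_level;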

The output from the test cases is over 20K lines, so I have summed up the elapsed time and a few statistics in tables below to provide a better overview. 

select /*+ MONITOR  basic */
            st.s_name,  pt.p_name,  cct.pc_name,  ep.c_level, count(ep.c_id)
  from event_part ep,  t_test tt, s_test st, p_test pt, cc_test cct
   where tt.t_id between 20100901 and 20100902
        and ep.t_id = tt.t_id
        and ep.t_id between 20100901 and 20100902
        and ep.s_id = st.s_id
        and ep.p_id = pt.p_id
        and ep.cc_id = cct.c_id
    group by st.s_name, pt.p_name, cct.pc_name, ep.c_level
    order by st.s_name, pt.p_name, cct.pc_name, ep.c_level;
 

Table figure for basic breakdown – Test case 1

Stats / Tests                                                            1              2              3              4              5
Elapsed time (sec)                                                    8.87          12.00           9.22         185.05         149.55
cell physical IO bytes saved by storage index                            0              0              0              0              0
cell physical IO bytes eligible for predicate offload        5,209,325,568              0  5,209,325,568              0              0
cell physical IO interconnect bytes returned by smart scan   2,562,203,600              0  2,562,201,584              0              0
cell flash cache read hits                                               9             26              9          9,290          2,063
CC Total Rows for Decompression                                          -     57,586,000              -              -     57,586,000

This is a basic query which is used for high-level summaries, and it serves as a good baseline to compare the other test cases against. There is no selective WHERE-clause filter in this test case, so we will not benefit from any storage indexes here. The first 3 tests are without any indexes on the fact table and perform much better than tests 4 and 5; we should of course not expect the CBO to follow that path anyway. It is evident for tests 1 and 3 that the performance gained is supported by the storage server offloading and the smart scans. The CC stats above for test 2 tell us that the db node performs the decompression, so this test has to burn extra CPU cycles compared to tests 1 and 3. There is more to be mentioned for test 2, but I’ll try to cover that in the conclusion.

 select /*+ MONITOR lc breakdown */
           st.s_name, pt.p_name,  cct.pc_name,  ep.c_level,  count(ep.c_id)
 from event_part ep, t_test tt, s_test st, p_test pt, cc_test cct
 where tt.t_id between 20100903 and 20100904
      and ep.t_id = tt.t_id
      and ep.t_id between 20100903 and 20100904
      and ep.s_id = st.s_id and ep.p_id = pt.p_id
      and ep.cc_id = cct.c_id and ep.c_id = 7
 group by st.s_name, pt.p_name, cct.pc_name, ep.c_level
 order by st.s_name, pt.p_name, cct.pc_name, ep.c_level;
 

Table figure for LC breakdown - Test case 2

Stats / Tests                                                            1              2              3              4              5
Elapsed time (sec)                                                    2.05          19.17           1.84          30.33          36.58
cell physical IO bytes saved by storage index                4,186,636,288              0  4,186,636,288              0              0
cell physical IO bytes eligible for predicate offload        5,209,292,800              0  5,209,292,800              0              0
cell physical IO interconnect bytes returned by smart scan     317,496,848              0    317,497,280              0              0
cell flash cache read hits                                              18             59             36          1,043            219
CC Total Rows for Decompression                                          0     57,782,554              0              0      7,842,364

Similar findings to the 1st test case; however, in this test case we are performing the breakdown for a certain ID, and therefore the performance of tests 1 and 3 improved further thanks to the IO saved by the Storage Index. For this test case I ran tests 1 and 3 on the same partitions, and it is worth noticing that the second time around the savings from the Storage Index improved; so the storage indexes are further maintained/improved as we select data from the tables and partitions.

select /*+ MONITOR lp breakdown */
           st.s_name,  pt.p_name, cct.pc_name, ep.c_level,  count(ep.c_id)
 from event_part ep,  t_test tt,  s_test st,  p_test pt,  cc_test cct
 where tt.t_id between 20100905 and 20100906
       and ep.t_id = tt.t_id
      and ep.t_id between 20100905 and 20100906
      and ep.s_id = st.s_id  and ep.p_id = pt.p_id
      and ep.cc_id = cct.c_id and ep.p_id = 4611686019802841877
 group by st.s_name, pt.p_name, cct.pc_name, ep.c_level
 order by st.s_name, pt.p_name, cct.pc_name, ep.c_level;
 

Table figure for lp breakdown - Test case 3

Stats / Tests                                                            1              2              3              4              5
Elapsed time (sec)                                                    2.99           6.01           2.72          49.22          39.29
cell physical IO bytes saved by storage index                2,623,143,936              0  2,623,799,296              0              0
cell physical IO bytes eligible for predicate offload        5,209,325,568              0  5,209,325,568              0              0
cell physical IO interconnect bytes returned by smart scan     674,439,456              0    674,436,288              0              0
cell flash cache read hits                                              64             44             10          2,113            635
CC Total Rows for Decompression                                          0     57,979,108              0              0     15,582,048

Similar findings to the 2nd test case; this test is just performed on a different ID, which has a higher distinct count than the first ID we tested in test case 2. As a result of that, and of how the data was sorted during insert, we are seeing less IO saved by the storage index.

 select /*+ MONITOR spcl breakdown */
           st.s_name, pt.p_name,  cct.pc_name,  ep.c_level, count(ep.c_id)
  from event_part ep, t_test tt,   s_test st, p_test pt,  cc_test cct
 where tt.t_id between 20100906 and 20100907
      and ep.t_id = tt.t_id
      and ep.t_id between 20100906 and 20100907
      and ep.s_id = st.s_id and ep.p_id = pt.p_id
      and ep.cc_id = cct.c_id and ep.s_id = 1
      and ep.cc_id =7 and ep.p_id = 4611686019802841877
  group by st.s_name, pt.p_name, cct.pc_name, ep.c_level
  order by st.s_name, pt.p_name, cct.pc_name, ep.c_level;
 

Table figure for spcl breakdown – Test case 4

Stats / Tests                                                            1              2              3              4              5
Elapsed time (sec)                                                    1.67          13.69           1.14          12.77           7.90
cell physical IO bytes saved by storage index                4,531,191,808              0  4,532,174,848              0              0
cell physical IO bytes eligible for predicate offload        5,209,325,568              0  5,209,325,568              0              0
cell physical IO interconnect bytes returned by smart scan     237,932,736              0    237,933,312              0              0
cell flash cache read hits                                              73             52             10            594            183
CC Total Rows for Decompression                                          0     57,782,554              0              0      5,614,752

This test case is performed with a WHERE clause on multiple IDs. Again, tests 1 and 3 are taking advantage of the Exadata features and are performing well. Tests 4 and 5 are still not close to tests 1 or 3, but have definitely become a bit more competitive. Comparing the two HCC tests (2 and 5), test 5 seems to do better as it only has to burn CPU cycles for 10% of the result set of test 2. A valid question to ask here would be why we are not seeing any benefits from either storage offloading or indexing in test 2, but again I’ll defer that discussion to the conclusion.

select /*+ MONITOR  ttl */
           st.s_name, cct.pc_name, ep.c_level, count(ep.c_id) row_count,
           round(min((ep.ceg - ep.cbg) / 60),2) min_g_min,
           round(avg((ep.ceg - ep.cbg) / 60.0),2)  avg_g_min,
          round(max((ep.ceg - ep.cbg) / 60),2) max_g_min
  from event_part ep,  t_test tt,   s_test st,  cc_test cct
 where  tt.t_id between 20100901 and 20100902
      and ep.t_id = tt.t_id
      and ep.t_id between 20100901 and 20100902
      and ep.s_id = st.s_id
      and ep.cc_id = cct.c_id
  group by st.s_name, cct.pc_name, ep.c_level
  order by st.s_name, cct.pc_name, ep.c_level;
 

Table figure for ttl breakdown - Test case 5

Stats / Tests                                                            1              2              3              4              5
Elapsed time (sec)                                                   12.82          15.50          11.67         254.26         304.92
cell physical IO bytes saved by storage index                            0              0              0              0              0
cell physical IO bytes eligible for predicate offload        5,209,325,568              0  5,209,325,568              0              0
cell physical IO interconnect bytes returned by smart scan   2,328,566,432              0  2,328,567,440              0              0
cell flash cache read hits                                               9             15              9        132,467          2,341
CC Total Rows for Decompression                                          0     57,586,000              0              0     61,643,318

Very similar findings to the 1st test case; the only difference is that this query also examines the time taken (the min, avg and max columns) for each group.

select /*+ MONITOR ttlsc */
           st.s_name, cct.pc_name, ep.c_level,  count(ep.c_id) row_count,
           round(min((ep.ceg - ep.cbg) / 60),2) min_g_min,
           round(avg((ep.ceg - ep.cbg) / 60.0),2)  avg_g_min,
           round(max((ep.ceg - ep.cbg) / 60),2) max_g_min
 from event_part ep, t_test tt, s_test st, cc_test cct
where tt.t_id between 20100903 and 20100904
      and ep.t_id = tt.t_id
      and ep.t_id between 20100903 and 20100904
      and ep.s_id = st.s_id
      and ep.cc_id = cct.c_id
      and ep.s_id = 2
      and ep.cc_id =6
    group by st.s_name, cct.pc_name, ep.c_level
    order by st.s_name, cct.pc_name, ep.c_level;
 

Table figure for ttlsc breakdown - Test case 6

Stats / Tests                                                            1              2              3              4              5
Elapsed time (sec)                                                    1.16           4.57           1.01          12.71           3.87
cell physical IO bytes saved by storage index                4,697,096,192              0  4,698,832,896              0              0
cell physical IO bytes eligible for predicate offload        5,209,292,800              0  5,209,292,800              0              0
cell physical IO interconnect bytes returned by smart scan      55,906,960              0     55,906,384              0              0
cell flash cache read hits                                               9             31             10          3,891            107
CC Total Rows for Decompression                                          0     57,749,795              0              0      1,998,299

Very similar findings to the 4th test case.

 Conclusion

Most warehouse-like queries I have performed in our Exadata environment do well without indexes on fact tables. So it is no surprise to me to hear that more and more people are dropping most of their indexes and taking advantage of the Exadata features. If you would like to keep the primary key indexes on your dimension tables to avoid the hassle of resolving duplicate key issues, that seems to be a valid option as well.

In my environment I have yet to find a case where the bitmap index search could compete with the no-index approach; and even if we found such a case, it would still have to show significant improvements before I would choose that path, considering the benefit of not having to maintain the bitmap indexes after each load. There are also several restrictions with bitmap indexes that it would be nice not to have to worry about.

Now, I mentioned that I would get back to the test 2 results, which were based on Storage FTS on partitions compressed with the HCC query option.   In the past I have performed queries on HCC tables and have seen IO savings from the Storage indexes.  

Initially I suspected the test 2 results observed above to be a bug, or alternatively to be related to the fact that my HCC compressed partitions are only 29MB apiece versus 2.4GB uncompressed. Oracle support/development has confirmed it to be related to the data size, as we can see from the stat “cell physical IO bytes eligible for predicate offload”, which doesn’t get bumped up after the query. The reason is that after partition pruning, the table is too small for predicate push-down to kick in, and since predicate push-down doesn’t kick in, the Storage Indexes won’t kick in either.

Please be aware I don’t know the Storage Index internals, but I look forward to learning.

Exadata Database Machine X2-2 or X2-8? Sure! Why Not? Part II.

In my recent post entitled Exadata Database Machine X2-2 or X2-8? Sure! Why Not? Part I, I started to address the many questions folks are sending my way about what factors to consider when choosing between Exadata Database Machine X2-8 versus Exadata Database Machine X2-2. This post continues that thread.

As my friend Greg Rahn points out in his recent post about Exadata, the latest Exadata Storage Server is based on Intel Xeon 5600 (Westmere EP) processors. The Exadata Storage Server is the same whether the database grid is X2-2 or X2-8. The X2-2 database hosts are also based on Intel Xeon 5600. On the other hand, the X2-8 database hosts are based on Intel Xeon 7500 (Nehalem EX). This is a relevant distinction when thinking about database encryption.

Oracle Exadata Database Machine Offerings: X2-2 and X2-8

For those who followed or attended Oracle OpenWorld last week, you may have seen the introduction of the new hardware for the Oracle Exadata Database Machine. Here’s a high level summary of what was introduced:

  • Updated Exadata Storage Server nodes (based on the Sun Fire X4270 M2)
  • Updated 2 socket, 12 core database nodes for the X2-2 (based on the Sun Fire X4170 M2)
  • New offering of 8 socket, 64 core database nodes using the Intel 7500 Series (Nehalem-EX) processors for the X2-8 (based on the Sun Fire X4800)

The major updates in the X2-2 compared to V2 database nodes are:

  • CPUs updated from quad-core Intel 5500 Series (Nehalem-EP) processors to six-core Intel 5600 Series (Westmere-EP)
  • Network updated from 1 GbE to 10 GbE
  • RAM updated from 72 GB to 96 GB

The updates to the Exadata Storage Servers (which are identical for both the X2-2 and X2-8 configurations) are:

  • CPUs updated to the six-core Intel 5600 Series (Westmere-EP) processors
  • 600 GB 15k RPM SAS offering now known as HP (High Performance)
  • 2 TB 7.2k RPM SAS offering now known as HC (High Capacity) [previously the 2 TB drives were 7.2k RPM SATA]

One of the big advantages of the CPU updates to the Intel [...]

Software. Hardware. Complete.

A lot is happening here at Oracle Open World, more than my brain can master right now. Exalogic, ZFS storage, Unbreakable Enterprise Linux, Fusion Apps (finally), Oracle VM for Solaris, etc. Some of those topics aren’t that important for me and/or my customers right now, because they are simply out of reach – Exalogic, for example. I read that a full box setup will go for just over 1,000,000 US$, so that is a setup I won’t be managing for a while. On the other hand, from an Oracle perspective this is the logical next step to provide a solid, complete solution from the apps layer down to the hardware layer, a boxed solution for Oracle’s most demanding customers regarding performance, best of breed and availability.

Oracle Open World 2010

It has very good technology features in there, which I hope will be opened up to us “general” database consumers in the near future. One of those Exadata features I would really like to get my hands on is the in-memory index functionality, not least because it fits perfectly with the way I handle XML data, that is, “content” based. Another feature would be, for instance, those storage cell optimizations. Anyway, until now, when trying to enable them, it only comes back with an “Exadata Only” warning, so I will have to wait a little (I hope).

So right now I am following Wim Coekaert’s OEL / Unbreakable Enterprise Linux kernel session called “Oracle’s Linux Roadmap”, not least because I am interested in the statement of direction about ZFS, OEL and/or Oracle Linux. Oracle Linux will be a fork, as far as I have read, and/or it will be an option to choose while installing Oracle Linux. From now on (Oracle Enterprise Linux 5.5 and onwards) you can choose between a strict Red Hat Linux kernel install or an Oracle-optimized Unbreakable Linux kernel install. You can, although I haven’t tried it yet, also update the kernel afterwards with the Oracle-optimized kernel.

Some of Oracle's Open Source contributions

So what is this Oracle Unbreakable Linux kernel part all about…?

Until now you could download the freely available binaries or source code of Oracle Enterprise Linux, and applications would run unchanged if they were compatible with and supported on RedHat. No code change needed, and Oracle will/would fully support it. Until now there has not been, as mentioned during the presentation, a single bug reported of Oracle Enterprise Linux not being compatible with RedHat. Oracle Enterprise Linux is used by an enormous number of big customers, is among other things also implemented in the Exadata set-ups, and has cheaper support offerings than, for example, RedHat support. So far so good; nothing new.

But as was mentioned, the limits of staying strictly Red Hat compatible had some major disadvantages for Oracle, like getting bug fixes from a new Red Hat distribution working properly in, and with, an Oracle software environment. Red Hat does not validate Oracle software, plus Red Hat adopts community efforts very slowly. So these were some of the reasons why Oracle has now created the Unbreakable Enterprise Kernel. Be aware this is not the old thing called the Unbreakable Linux program, but actually a new Linux kernel.

So this new Oracle kernel is fast, modern, reliable and optimized for Oracle software. It is, among other places, used in the Exadata and Exalogic machines to deliver extreme performance. From now on it also allows Oracle to innovate without sacrificing compatibility. In short, regarding the software: it now includes both the Unbreakable Enterprise Kernel and the existing strict Red Hat Compatible Kernel. At boot time you can choose either the strict Red Hat kernel or the Unbreakable Enterprise Linux kernel optimized for Oracle software.

As Oracle states, “Oracle now recommends only the Unbreakable Enterprise Linux Kernel for all Oracle software on Linux”. You could wonder what this means regarding certified solutions. My guess would be that Oracle will push the Oracle Unbreakable Enterprise kernel (OUEL?) but will certify on both kernels, not least because a lot of Oracle’s On Demand services are still based on Red Hat environments.

So far the OUEL kernel has made huge improvements compared to Red Hat, for example a 400% gain for 8 kB flash cache reads (IOPS). Oracle now also has the ability to support bigger servers, up to 4096 CPUs and 2 TB of memory, up to 4 PB clustered volumes with OCFS2, and advanced NUMA support. CPUs can now stay in a low power state when the system is idle. This power management also supports ACPI 4.0 and fine grained CPU and memory resource control.

The Oracle Unbreakable Linux Kernel

All these additions will flow back into the community which, of course, is what Linux is all about. Some other stuff which Oracle pushed back into the stream/kernel source is better data integrity: stopping corrupted data in memory from actually being written by detecting in-flight memory corruption; at the hardware fault management level, error detection and logging before errors affect the OS or application; and, for example, automatic isolation of defective CPUs and memory, which avoids crashes and improves application uptime. Diagnostic tooling has been made less resource intensive, in the hope that people won’t switch it off, so that when a performance or corruption issue happens there will be trace info. During the presentation the “latencytop” diagnostic tool was mentioned as an example.

The Oracle Unbreakable Enterprise Linux kernel can be downloaded either via public-yum.oracle.com (for OEL 5.5 and onwards) or via the ULN network. Follow the instructions to download via the public yum server and/or look up the instructions in public-yum-el5.repo and apply it on your current 5.5 system; a different way to achieve the same is to use “up2date oracle-linux” via ULN if you have already bought/installed/registered with Oracle Unbreakable Linux support.

Wim described it in more detail on his blog via

(1) On ULN we created a new channel “Oracle Linux Latest (ol5_x86_64_latest)” which you can subscribe to. It’s really very simple. Take a standard RHEL5.5 or OEL5.5 installation, register with ULN with the O(E)L5.5 channel and add the Oracle Linux 5 latest on top as well. Then just run up2date -u oracle-linux

the oracle-linux rpm will pull in all the required packages

(2) On our public yum repo we updated the repo file public-yum-el5.repo . So just configure yum with the above repo file and enable the Oracle Linux channel. [ol5_u5_base] enabled=1 and run yum install oracle-linux

The OUEL kernel also improved on some miscellaneous things like NFS IPv6 support and RAID 5 to RAID 6 support. There will be a new volume manager kind of GUI that will support LUN creation, deletion, expansion and snapshots of NAS storage. All the work done is submitted to the mainline kernel, and enhancements will trickle down and be tested in the Enterprise Linux distribution (be aware: Oracle Enterprise Linux is not the same as the Oracle Unbreakable Enterprise kernel).

Also see Wim’s post of today on blogs.oracle.com/wim to get more info. I guess, this also counts for Sergio Leunissen’s blog, who is currently hammering his laptop behind me in the audience, while I am writing this. Keep watching those blogs for more info :-)

By the way, while speaking with Sergio about the “naming” of this software, I learned it was apparently driven by the need to have a clear distinction between “Oracle Solaris” and the more general, hardware based “Oracle Linux” architectures.

Exadata V3 – Oops – EXADATA X2-8

Oracle had the new version of the Exadata Machine on display at Oracle Open World this week. It’s called the Exadata X2-8. That’s a catchy name. It sounds very Iron Manish! In fact they had these fellows on display next to the demo machines.


The X2-8 uses two 4U Sun Fire X4800 servers, which each have 8 x eight-core Intel CPUs (X7560) and 1 terabyte of memory, along with 14 Exadata Storage Servers. Here’s a link to the spec sheet for the X2-8 and for the Oracle Sun X4800. Below are a couple of pictures. The first one shows one of the database servers (X4800) with one of the CPU modules out.

The storage cells have not changed much from the original V2. They still have 2 CPUs and 12 drives and 384G of flash cache. Although I’ve been told they have newer (faster) 6 core Intel CPUs. (I did get a look at an unpublished spec sheet on an Oracle employee’s iPhone and it said they were using the 5670 CPUs) Oh, and they will allow you to run Solaris on the database servers. Of course they will have to finish a new version of Solaris 11 before that can happen. It’s worth noting that X2-8 can be ordered but that they don’t have a firm delivery date yet.

So this configuration is definitely for the high end market and addresses a couple of issues. The increased memory will allow us to have a more robust consolidation platform. It will also allow bigger OLTP type systems to run better (i.e. the additional memory means we can support many more concurrent users and have much larger buffer caches). Note that Exadata’s offloading generally reduces the memory requirements, but nevertheless, very large systems, particularly ones with lots of fast OLTP type transactions and lots of users, will be better satisfied by this type of configuration. Also note that there is no little brother version of the X2-8. It comes in a full rack only. Which makes sense because there are only two database servers. I don’t believe the price has been set yet, but the word on the street is that the hardware will be about 50% more than the full rack version with the small 2 CPU blades (now renamed X2-2 by the way).

I did a post a couple of weeks ago (Thoughts on Exadata V3) about what I thought we might get to see in the way of Exadata changes. We got most of the things I was expecting but not all of them. Among the things we got are bigger/beefier servers with more memory and available slots for HBAs to provide some additional connection capabilities (although I’m not sure if Oracle is going to want people to open the machines up and put additional controllers in). I did see mention in the x4800 spec sheet of an HBA so they may actually have one in there already (I need to check that out). They also announced that they will be offering a version of Solaris that can run on the database servers which I was expecting, although they are still using Intel chips. The thing I was expecting that didn’t happen was a change of mind set about flexibility of configuration. They seem pretty set on maintaining a fixed configuration that will be the same for every customer. That is probably not such a bad idea. Certainly it’s easier to support and faster to deploy. But you know how customers are, they want their hamburgers with extra mustard and no pickles. So we’ll see how that works out over time. But for now, it’s a fixed menu. To quote Henry Ford, “You can get it in any color you want as long as it’s black”.

So that’s all I can think of at this point. Please let me know if you have any questions and I’ll see what I can find out.