
Exadata

The Technology Behind Exadata, Part 4: Implementing Parallel Query with Multi-core CPUs


Each server is built on a motherboard carrying two Intel Xeon 7560 processors (the September 2010 version), so a single DB server provides 4 cores x 2 = 8 cores.

The era of SMP machine competition among server vendors such as Sun, IBM, and HP is over. We are now in the era of multi-core, general-purpose processors from just two companies: Intel and AMD.
Exadata uses the most expensive of these, the Intel Xeon, in a dual-socket configuration.

The Technology Behind Exadata, Part 3: Automatic Storage Management (ASM)

ASM: the Oracle component that implements shared disk
OEL: Oracle Enterprise Linux
CELLSRV: Cell Services; handles communication with the database
MS: Management Service
RS: Restart Service
IORM: I/O Resource Manager

ASM is what makes a shared-disk environment for Oracle RAC possible. On the software side, it is the key player behind an I/O transfer rate and bandwidth that previously could only be achieved by deploying multiple expensive SAN disk array units.

Solaris ZFS is comparable open-source software, but if you are running Oracle, ASM has the advantage in terms of performance.

Overview of ZFS features (excerpt from an @IT article):

The Technology Behind Exadata, Part 1: InfiniBand

- The volume of parallel-computation data produced, or required, by high-performance cores such as the Intel Xeon has become enormous.
- In a parallel system that keeps using Gigabit Ethernet, traffic exceeds what the network can carry; many communication requests end up waiting, and parallel processing cannot proceed smoothly.
- This phenomenon, where the network saturates and parallel performance stops scaling, appears with only a few nodes when the computers are fast, and only with many nodes when they are slow.
- InfiniBand emerged as a new network fabric for parallel processing, offering more than ten times the throughput of Gigabit Ethernet with high bandwidth and low latency.
- In the Top500 list of the world's parallel computer systems (http://www.top500.org), InfiniBand dominates the alternatives.
Source: an article by Concurrent Systems, Inc. (http://www.concurrent.co.jp/products/option/infiniband.html)

Excerpt from http://ja.wikipedia.org/wiki/InfiniBand:
In late 1998, the various servers used for mission-critical work were getting faster at roughly the Moore's Law pace of doubling every 18 months. The bus architectures inside those servers, however, could not keep up because of mechanical and electrical constraints, and had become the bottleneck for improving overall computer system performance.

Performance Stories from Exadata Migrations

Here are my UKOUG 2010 slides about Exadata migration performance; this is real-life material, not a rehash of the marketing slides.


Making the Most of Oracle Exadata – A Technical Review


Over the past few weeks several people have asked me about an Exadata article entitled “Making the Most of Oracle Exadata” by Marc Fielding of Pythian. Overall it’s an informative article that touches on many of the key points of Exadata. However, even though I read (skimmed is a much better word) and briefly commented on the article back in August, on further review I found some technical inaccuracies, so I wanted to take the time to clarify them for the Exadata community.

Exadata Smart Scans

Marc writes:

Smart scans: Smart scans are Exadata’s headline feature. They provide three main benefits: reduced data transfer volumes from storage servers to databases, CPU savings on database servers as workload is transferred to storage servers, and improved buffer cache efficiency thanks to column projection. Smart scans use helper processes that function much like parallel query processes but run directly on the storage servers. Operations off-loadable through smart scans include the following:

  • Predicate filtering – processing WHERE clause comparisons to literals, including logical operators and most SQL functions.
  • Column projection – by looking at a query’s SELECT clause, storage servers return only the columns requested, which is a big win for wide tables.
  • Joins – storage servers can improve join performance by using Bloom filters to recognize rows matching join criteria during the table scan phase, avoiding most of the I/O and temporary space overhead involved in the join processing.
  • Data mining model scoring – for users of Oracle Data Mining, scoring functions like PREDICT() can be evaluated on storage servers.

I personally would not enumerate a specific number of benefits from Exadata Smart Scan. Simply stated, the design goal behind Smart Scan is to reduce the amount of data that is sent from the storage nodes (or storage arrays) to the database nodes (why move data that is not needed?). Smart Scan does this in two ways: it applies the appropriate column projection and row restriction rules to the data as it streams off of disk. Projection, however, is not limited to just the columns in the SELECT clause, as Marc suggests; it also includes columns in the WHERE clause. Obviously, JOIN columns need to be projected to perform the JOIN in the database nodes. The one area where Smart Scan does not help at all is buffer cache efficiency. The reason is quite simple: Smart Scan returns data in blocks that were created on the fly just for that given query; they contain only the needed columns (projections) and have rows filtered out by the predicates (restrictions). Those blocks could not be reused unless someone ran the exact same query (think of them as custom-built for that query). Furthermore, Smart Scans use direct path reads (cell smart table scan), and those reads go into PGA space, not the shared SGA space (the buffer cache).
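To make the projection-and-restriction idea concrete, here is a minimal sketch in plain Python (invented table layout and column names; this is an analogy, not Exadata code) of how filtering and projecting at the storage tier shrinks what crosses the interconnect:

```python
# Sketch of Smart Scan's two reductions: restriction (WHERE) and
# projection (only needed columns), applied at the "storage cell"
# before rows are shipped to the "database node".

rows = [
    {"order_id": i,
     "status": "OPEN" if i % 4 == 0 else "SHIPPED",
     "customer": "C%05d" % i,
     "notes": "x" * 200}                      # wide filler column
    for i in range(10_000)
]

def smart_scan(table, predicate, projected_cols):
    """Apply restriction, then project only the needed columns."""
    return [{c: r[c] for c in projected_cols}
            for r in table if predicate(r)]

# SELECT order_id, customer FROM orders WHERE status = 'OPEN'
# The WHERE column (status) is read at the cell but never shipped.
result = smart_scan(rows, lambda r: r["status"] == "OPEN",
                    ["order_id", "customer"])

shipped = sum(len(repr(r)) for r in result)
full = sum(len(repr(r)) for r in rows)
print("rows shipped: %d of %d" % (len(result), len(rows)))
print("bytes shipped: %d of %d" % (shipped, full))
```

Only a quarter of the rows survive the restriction, and each surviving row drops the wide `notes` column, so the bytes shipped are a small fraction of the full table.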

As most know, Exadata can easily offload simple predicate filters (WHERE c1 = ‘FOO’) that can be applied as restrictions with Smart Scan. In addition, Bloom Filters can be applied as restrictions for simple JOINs, like those commonly found in Star Schemas (dimensional data models). These operations can be observed in the query execution plan via the JOIN FILTER CREATE and JOIN FILTER USE row sources. What is very cool is that Bloom Filters can also pass their list of values to Storage Indexes to aid in further I/O reduction if there is natural clustering on those columns, or if the filter eliminates significant amounts of data (as with a highly selective set of values). Even if there isn’t significant data elimination via Storage Indexes, a Smart Scan Bloom Filter can be applied post scan to prevent unneeded data from being sent to the database servers.
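A Bloom filter built from the dimension table's join keys can be probed during the fact-table scan. The sketch below is a textbook Bloom filter in Python (not Oracle's implementation; sizes and data are invented) and shows the property that makes it safe as a restriction: it can pass a few non-matching rows, but it never drops a matching one.

```python
import hashlib

class BloomFilter:
    """Textbook Bloom filter: k hash positions in a bit array.
    False positives are possible; false negatives are not."""
    def __init__(self, m=8192, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(("%d:%s" % (i, key)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

# Build the filter from the (small) dimension's join keys ...
dim_keys = {10, 42, 99}
bf = BloomFilter()
for key in dim_keys:
    bf.add(key)

# ... then probe it while scanning the (large) fact table's join column.
fact_join_col = list(range(1000))
survivors = [k for k in fact_join_col if bf.might_contain(k)]

print(len(survivors))   # small; a few false positives may remain
```

Because the filter only ever produces false positives, any row it lets through incorrectly is still eliminated later by the real join; nothing needed is lost at the storage tier.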

Exadata Storage Indexes

Marc writes:

Storage indexes: Storage indexes reduce disk I/O volumes by tracking high and low values in memory for each 1-megabyte storage region. They can be used to give partition pruning benefits without requiring the partition key in the WHERE clause, as long as one of these columns is correlated with the partition key. For example, if a table has order_date and processed_date columns, is partitioned on order_date, and if orders are processed within 5 days of receipt, the storage server can track which processed_date values are included in each order partition, giving partition pruning for queries referring to either order_date or processed_date. Other data sets that are physically ordered on disk, such as incrementing keys, can also benefit.

In Marc’s example he states there is correlation between the two columns PROCESSED_DATE and ORDER_DATE, where PROCESSED_DATE = ORDER_DATE + [0..5 days]. That’s fine and all, but it is inaccurate to claim that partition pruning takes place when specifying either ORDER_DATE (the partition key column) or PROCESSED_DATE (a non-partition-key column) in the WHERE clause just because a Storage Index can be used for PROCESSED_DATE. The reality is that partition pruning can only take place when the partition key, ORDER_DATE, is specified, regardless of whether a Storage Index is used for PROCESSED_DATE.

Partition Pruning and Storage Indexes are completely independent of each other, and Storage Indexes know absolutely nothing about partitions, even if the partition key column and another column have some type of correlation, as in Marc’s example. The Storage Index simply tracks which Storage Regions do or do not contain rows matching the predicate filters, and eliminates the reads of the unneeded Storage Regions.
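The region-pruning mechanics can be sketched in a few lines of Python (hypothetical regions and data; a real Storage Index lives in cell memory and tracks several columns per 1 MB region, none of which is modeled here):

```python
# Sketch of storage-index pruning: keep a min/max pair per storage
# region and skip any region whose range cannot contain the value.

REGION_ROWS = 1000  # stand-in for "rows per 1 MB storage region"

# Data with natural clustering on the column (an incrementing key):
data = list(range(100_000))
regions = [data[i:i + REGION_ROWS] for i in range(0, len(data), REGION_ROWS)]

# The "storage index": min/max per region, maintained as data is written.
storage_index = [(min(r), max(r)) for r in regions]

def scan_equal(value):
    """Read only regions whose [min, max] range could contain value."""
    regions_read, hits = 0, []
    for (lo, hi), region in zip(storage_index, regions):
        if lo <= value <= hi:
            regions_read += 1
            hits.extend(x for x in region if x == value)
    return regions_read, hits

regions_read, hits = scan_equal(54_321)
print("regions read: %d of %d" % (regions_read, len(regions)))
```

With perfect clustering only one region out of a hundred is read; shuffle the data and every region's min/max range widens to cover the whole domain, and the pruning benefit disappears, which is exactly why natural clustering matters.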

Exadata Hybrid Columnar Compression

Marc writes:

Columnar compression: Hybrid columnar compression (HCC) introduces a new physical storage concept, the compression unit. By grouping many rows together in a compression unit, and by storing only unique values within each column, HCC provides storage savings in the range of 80-90% based on the compression level selected. Since data from full table scans remains compressed through I/O and buffer cache layers, disk savings translate to reduced I/O and buffer cache work as well. HCC does, however, introduce CPU and data modification overhead that will be discussed in the next section.

The Compression Unit (CU) in Exadata Hybrid Columnar Compression (EHCC) is actually a logical construct, not a physical storage concept. The compression gains from EHCC come from the column-major organization of the rows contained in the CU and from the encoding and transformations (compression) that this organization makes possible (similar values are more common within the same column across rows than across different columns in the same row). To say that EHCC stores only unique values within each column is inaccurate; rather, the encoding and transformation algorithms use various techniques that yield very good compression by attempting to represent the column values in as few bytes as possible.
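The benefit of column-major organization is easy to demonstrate with a generic compressor. The sketch below uses Python's zlib on invented order data; Oracle's actual EHCC algorithms are not public, so treat this strictly as an analogy for why grouping a column's values together compresses better than interleaving them row by row:

```python
import zlib

# Invented rows: values repeat heavily *within* a column (dates,
# statuses, warehouses), while the id column is unique per row.
rows = [(str(i),
         "2010-09-%02d" % (i % 28 + 1),
         "SHIPPED",
         "WAREHOUSE_%d" % (i % 3))
        for i in range(5000)]

# Row-major: a,b,c,d,a,b,c,d,...  Column-major: a,a,...,b,b,...
row_major = "|".join("|".join(r) for r in rows).encode()
col_major = "|".join("|".join(col) for col in zip(*rows)).encode()

rc = zlib.compress(row_major, 9)
cc = zlib.compress(col_major, 9)
print("row-major compressed: %d bytes" % len(rc))
print("column-major compressed: %d bytes" % len(cc))
```

Both serializations contain exactly the same bytes, just reordered; the column-major layout compresses smaller because the repetitive columns collapse into long matches instead of being broken up by the unique id on every row.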

Data from EHCC full table scans remains fully compressed only if the table scan is not a Smart Scan, in which case the compressed CUs are passed directly up to the buffer cache and decompression is then done by the database servers. If the EHCC full table scan is a Smart Scan, then only the columns and rows being returned to the database nodes are decompressed by the Exadata servers; moreover, predicate evaluation can be performed directly on the EHCC compressed data.

Read more: Exadata Hybrid Columnar Compression Technical White Paper

Marc also writes:

Use columnar compression judiciously: Hybrid columnar compression (HCC) in Exadata has the dual advantages of reducing storage usage and reducing I/O for large reads by storing data more densely. However, HCC works only when data is inserted using bulk operations. If non-compatible operations like single-row inserts or updates are attempted, Exadata reverts transparently to the less restrictive OLTP compression method, losing the compression benefits of HCC. When performing data modifications such as updates or deletes, the entire compression unit must be uncompressed and written in OLTP-compressed form, involving an additional disk I/O penalty as well.

EHCC does require bulk direct path load operations to work. This is because the compression algorithms used for EHCC need sets of rows as input, not single rows. What is incorrect in Marc’s comments is the claim that when a row in a CU is modified (UPDATE or DELETE), the entire CU is uncompressed and changed to non-EHCC compression; in fact, only the rows that are UPDATEd are migrated to non-EHCC compression. For DELETEs, no row migrations take place at all. This is easily demonstrated by tracking ROWIDs, as in the example at the bottom of this post.

Exadata Smart Flash Cache

Marc writes:

Flash cache: Exadata’s flash cache supplements the database servers’ buffer caches by providing a large cache of 384 GB per storage server and up to 5 TB in a full Oracle Exadata Database Machine, considerably larger than the capacity of memory caches. Unlike generic caches in traditional SAN storage, the flash cache understands database-level operations, preventing large non-repeated operations such as backups and large table scans from polluting the cache. Since flash storage is nonvolatile, it can cache synchronous writes, providing performance benefits to commit-intensive applications.

While flash (SSD) storage is indeed non-volatile, the Exadata Smart Flash Cache is volatile: it loses all of its contents if the Exadata server is power cycled. Also, since the Exadata Smart Flash Cache is currently a write-through cache, it offers no direct performance advantage to commit-intensive applications. It does, however, offer an indirect advantage by servicing read requests that would otherwise be serviced by the HDDs, thus freeing the HDDs to service more write operations.
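The write-through behavior described above can be sketched with generic cache logic (plain Python; this models the concept, not cellsrv's actual caching policy):

```python
class WriteThroughCache:
    """Write-through: every write goes to the backing store before it
    is acknowledged; the cache only accelerates reads. Losing the
    cache (a power cycle) therefore never loses committed data, but
    writes always pay full backing-store latency."""
    def __init__(self, backing):
        self.backing = backing   # stand-in for the HDDs
        self.cache = {}          # stand-in for the flash cards

    def write(self, key, value):
        self.backing[key] = value   # write lands on "disk" first
        self.cache[key] = value     # cache kept consistent for reads

    def read(self, key):
        if key in self.cache:       # flash hit: HDDs stay free
            return self.cache[key]
        value = self.backing[key]   # miss: read disk, then populate
        self.cache[key] = value
        return value

disk = {}
fc = WriteThroughCache(disk)
fc.write("block#1", "data")
fc.cache.clear()            # simulate a power cycle: cache is volatile
print(fc.read("block#1"))   # prints "data": nothing was lost
```

A write-back cache would acknowledge the write from flash and flush to disk later, which is what would make commits faster; the trade-off is that the cache contents would then have to survive failures.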

Read more: Exadata Smart Flash Cache Technical White Paper

EHCC UPDATE and DELETE Experiment

--
-- EHCC UPDATE example - only modified rows migrate
--

SQL> create table order_items1
  2  compress for query high
  3  as
  4  select rownum as rnum, x.*
  5  from order_items x
  6  where rownum <= 10000;

Table created.

SQL> create table order_items2
  2  as
  3  select rowid as rid, x.*
  4  from order_items1 x;

Table created.

SQL> update order_items1
  2  set quantity=10000
  3  where rnum in (1,100,1000,10000);

4 rows updated.

SQL> commit;

Commit complete.

SQL> select b.rnum, b.rid before_rowid, a.rowid after_rowid
  2  from order_items1 a, order_items2 b
  3  where a.rnum(+) = b.rnum
  4  and (a.rowid != b.rid
  5    or a.rowid is null)
  6  order by b.rnum
  7  ;

           RNUM BEFORE_ROWID       AFTER_ROWID
--------------- ------------------ ------------------
              1 AAAWSGAAAAAO1aTAAA AAAWSGAAAAAO1aeAAA
            100 AAAWSGAAAAAO1aTABj AAAWSGAAAAAO1aeAAB
           1000 AAAWSGAAAAAO1aTAPn AAAWSGAAAAAO1aeAAC
          10000 AAAWSGAAAAAO1aXBEv AAAWSGAAAAAO1aeAAD

--
-- EHCC DELETE example - no rows migrate
--

SQL> create table order_items1
  2  compress for query high
  3  as
  4  select rownum as rnum, x.*
  5  from order_items x
  6  where rownum <= 10000;

Table created.

SQL> create table order_items2
  2  as
  3  select rowid as rid, x.*
  4  from order_items1 x;

Table created.

SQL> delete from order_items1
  2  where rnum in (1,100,1000,10000);

4 rows deleted.

SQL> commit;

Commit complete.

SQL> select b.rnum, b.rid before_rowid, a.rowid after_rowid
  2  from order_items1 a, order_items2 b
  3  where a.rnum(+) = b.rnum
  4  and (a.rowid != b.rid
  5    or a.rowid is null)
  6  order by b.rnum
  7  ;

           RNUM BEFORE_ROWID       AFTER_ROWID
--------------- ------------------ ------------------
              1 AAAWSIAAAAAO1aTAAA
            100 AAAWSIAAAAAO1aTABj
           1000 AAAWSIAAAAAO1aTAPn
          10000 AAAWSIAAAAAO1aXBEv

An investigation into exadata, part 2

It’s been a while since I posted the first part of this series.

Considerations

I created a little tool, snap, which is more or less a combination of Tanel Poder’s snapper and a script from Tom Kyte. I used it in the first part to capture all the details of the database processes I was using. The next step is invoking Oracle’s Parallel Query feature, in which multiple processes work together to execute a query.

Cellcli Command Syntax – Top 10 List

Exadata storage software uses the cellcli utility as its command line interface. Unfortunately, although the documentation set that comes with Exadata has many examples of cellcli commands, and even a chapter dedicated to cellcli, it does not include any reference material on the syntax itself (particularly the LIST command). So I thought I would write up a few of the things I’ve learned while picking around at it. But first, a little bit of a rant on why they wrote yet another command line interface.

They already had SQL*Plus, for crying out loud. Why not just use that? SQL*Plus has all kinds of functionality for using variables, formatting output, etc. And on top of that, they invented a new syntax. Why use LIST instead of SELECT? They used WHERE and LIKE, so why not SELECT? I find it more than a little annoying (in case you couldn’t tell). I’m told that storage admins don’t like SQL, and that’s one of the reasons for not using straight SQL syntax. That seems pretty silly, since the storage is designed specifically for use with Oracle databases.

So anyway, here’s my quick top ten list of things you should know:

  1. cellcli does have a handful of SQL*Plus commands (START (@), SET ECHO ON, SPOOL, DESCRIBE, HELP)
  2. SELECT is replaced by LIST, and it must be the first keyword on the command line
  3. There is no FROM keyword (the LIST keyword must be immediately followed by the ObjectType, which is equivalent to a table name)
  4. There is a DESCRIBE command which displays the attributes (columns) that make up an ObjectType (table)
  5. Column names are specified with the ATTRIBUTES keyword followed by the columns you wish to display
  6. There is a default set of columns for each ObjectType that is returned if the ATTRIBUTES keyword is not specified
  7. There is a WHERE clause that can be applied to any attribute, and multiple conditions can be ANDed together (no OR, though)
  8. There is no ORDER BY equivalent
  9. The DETAIL keyword can be appended to any LIST command to change the output from column-oriented to row-oriented
  10. The LIKE operator works, but instead of the standard SQL wildcard (%) cellcli uses regex, so '%' becomes '.*'
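That last point means a habitual SQL `%` pattern has to be rewritten before cellcli's LIKE will match it. A small hypothetical helper like the one below (plain Python, just to show the mapping; not a cellcli feature) converts SQL LIKE syntax to the regex form cellcli expects:

```python
import re

def like_to_regex(pattern):
    """Convert a SQL LIKE pattern to regex as cellcli's LIKE expects:
    % becomes .* and _ becomes . (everything else is escaped)."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)

print(like_to_regex("EX%"))    # -> EX.*  (matches EXDB)
print(like_to_regex("%775"))   # -> .*775
```

So where SQL*Plus would take `WHERE name LIKE 'EX%'`, cellcli wants `LIKE 'EX.*'`, exactly as in the examples below.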

So here are a few examples:

CellCLI> help 
 
 HELP [topic]
   Available Topics:
        ALTER
        ALTER ALERTHISTORY
        ALTER CELL
        ALTER CELLDISK
        ALTER GRIDDISK
        ALTER IORMPLAN
        ALTER LUN
        ALTER THRESHOLD
        ASSIGN KEY
        CALIBRATE
        CREATE
        CREATE CELL
        CREATE CELLDISK
        CREATE FLASHCACHE
        CREATE GRIDDISK
        CREATE KEY
        CREATE THRESHOLD
        DESCRIBE
        DROP
        DROP ALERTHISTORY
        DROP CELL
        DROP CELLDISK
        DROP FLASHCACHE
        DROP GRIDDISK
        DROP THRESHOLD
        EXPORT CELLDISK
        IMPORT CELLDISK
        LIST
        LIST ACTIVEREQUEST
        LIST ALERTDEFINITION
        LIST ALERTHISTORY
        LIST CELL
        LIST CELLDISK
        LIST FLASHCACHE
        LIST FLASHCACHECONTENT
        LIST GRIDDISK
        LIST IORMPLAN
        LIST KEY
        LIST LUN
        LIST METRICCURRENT
        LIST METRICDEFINITION
        LIST METRICHISTORY
        LIST PHYSICALDISK
        LIST THRESHOLD
        SET
        SPOOL
        START
 
CellCLI> help set
 
  Usage: SET <variable> <value>
 
  Purpose: Sets a variable to alter the CELLCLI environment settings for your
           current session.
 
  Arguments:
    variable and value represent one of the following clauses:
    DATEFORMAT { STANDARD | LOCAL }
    ECHO { ON | OFF }
 
  Examples:
    set dateformat local
    set echo on 
 
 
CellCLI> help list
 
  Enter HELP LIST <object_type> for specific help syntax.
    <object_type>:  {ACTIVEREQUEST | ALERTHISTORY | ALERTDEFINITION | CELL 
                     | CELLDISK | FLASHCACHE | FLASHCACHECONTENT | GRIDDISK
                     | IORMPLAN | KEY | LUN 
                     | METRICCURRENT | METRICDEFINITION | METRICHISTORY 
                     | PHYSICALDISK | THRESHOLD }
 
CellCLI> help list FLASHCACHECONTENT
 
  Usage: LIST FLASHCACHECONTENT [<filters>] [<attribute_list>] [DETAIL] 
 
  Purpose: Displays specified attributes for flash cache entries.
 
  Arguments:
    <filters>:  An expression which determines the entries to be displayed.
    <attribute_list>: The attributes that are to be displayed.
                      ATTRIBUTES {ALL | attr1 [, attr2]... }
 
  Options:
    [DETAIL]: Formats the display as an attribute on each line, with
              an attribute descriptor preceding each value.
 
  Examples:
    LIST FLASHCACHECONTENT DETAIL

So as you can see, the help system allows you to see a bit of the syntax for each command. You may also have noticed a couple of SQL*Plus carry-overs. SET, SPOOL, and START work pretty much as expected. Note the @ is equivalent to START and that the only things you can SET are ECHO and DATEFORMAT. Now for a couple of queries (er LIST commands):

CellCLI> desc flashcachecontent
         ^
CELL-01504: Invalid command syntax.
 
CellCLI> describe flashcachecontent
        cachedKeepSize
        cachedSize
        dbID
        dbUniqueName
        hitCount
        hoursToExpiration
        missCount
        objectNumber
        tableSpaceNumber
 
CellCLI> set echo on
 
CellCLI> @fc_content
 
> CellCLI> list flashcachecontent where dbUniqueName like 'EXDB' and hitcount > 100 attributes dbUniqueName, objectNumber, cachedKeepSize, cachedSize, hitCount, missCount
         EXDB    2       0       4194304         600     208
         EXDB    40      0       2424832         376     60
         EXDB    224     0       1802240         115     80
         EXDB    267     0       458752          128     9
         EXDB    383     0       2547712         157     27
         EXDB    423     0       1867776         180     41
         EXDB    471     0       4071424         552     85
         EXDB    472     0       1277952         114     22
         EXDB    474     0       13246464        286     326
         EXDB    475     0       5914624         519     124
         EXDB    503     0       5308416         669     455
         EXDB    5710    0       3735552         363     90
         EXDB    6207    0       393216          112     9
         EXDB    6213    0       3842048         359     147
         EXDB    6216    0       1245184         184     29
         EXDB    6373    0       3481600         222     61
         EXDB    56085   0       4194304         822     129
         EXDB    66849   0       438763520       1221    3322
         EXDB    71493   0       5636096         302     127
         EXDB    71497   0       1351680         320     22
         EXDB    71573   0       2760704         101     37
         EXDB    71775   0       1801412608      34994   46315
 
CellCLI> list flashcachecontent where dbUniqueName like 'EX.?.?' and hitcount > 100 -
 
> attributes dbUniqueName, objectNumber, cachedKeepSize, cachedSize 
         EXDB    2       0       4194304
         EXDB    18      0       1179648
         EXDB    37      0       622592
         EXDB    40      0       2424832
         EXDB    63      0       524288
         EXDB    104     0       688128
         EXDB    224     0       3407872
         EXDB    267     0       458752
         EXDB    383     0       2670592
         EXDB    420     0       1507328
         EXDB    423     0       1867776
         EXDB    424     0       720896
         EXDB    471     0       4071424
         EXDB    472     0       1277952
         EXDB    473     0       2351104
         EXDB    474     0       13574144
         EXDB    475     0       5521408
         EXDB    503     0       5308416
         EXDB    5702    0       262144
         EXDB    5709    0       2416640
         EXDB    5710    0       3735552
         EXDB    6207    0       393216
         EXDB    6210    0       131072
         EXDB    6213    0       4227072
         EXDB    6216    0       1245184
         EXDB    6373    0       3579904
         EXDB    56085   0       4194304
         EXDB    66849   0       438763520
         EXDB    71493   0       5636096
         EXDB    71497   0       1351680
         EXDB    71573   0       2801664
         EXDB    71775   0       1801412608
 
CellCLI> list flashcachecontent where dbUniqueName like 'EX.?.?' and hitcount > 100 and objectNumber like '.*775'
         2356637742      6       71775
 
CellCLI> list flashcachecontent where dbUniqueName like '.*X.?.?' and objectNumber like '.*775' detail                                      
         cachedKeepSize:         0
         cachedSize:             1801412608
         dbID:                   2356637742
         dbUniqueName:           EXDB
         hitCount:               34994
         missCount:              46315
         objectNumber:           71775
         tableSpaceNumber:       6
 
CellCLI> list flashcachecontent where dbUniqueName like 'EX.?.?' and hitcount > 100 and objectNumber like '.*775'
         2356637742      6       71775
 
CellCLI> list flashcachecontent attributes objectNumber, hitCount, missCount where dbUniqueName like 'EX.?.?' and hitcount > 100 and objectNumber like '.*775'
         71775   34994   46315

So DESC doesn’t work as an abbreviation of DESCRIBE. Notice that there are no headings for column-oriented output. As you can see, you can run “scripts” and SET ECHO ON to display the commands in any scripts you execute. One of the LIST commands was strung across two lines by using the continuation operator (-). The LIST commands look a lot like SQL, except that LIST is used instead of SELECT and that LIKE matches regex expressions. Also notice that in the last commands a number was matched with a regex expression, implying a data type conversion, although all data may be treated as text. You can see that the ATTRIBUTES and WHERE keywords can appear anywhere on the command line after the “LIST objectType” keywords; in other words, these two keywords are not positional, and either one can come first. Finally, the DETAIL keyword turns the output sideways. Or, as the help says, it “Formats the display as an attribute on each line, with an attribute descriptor preceding each value.”

So the cellcli interface is really not that bad; I just happen to like SQL*Plus better. ;) I do think it would have been a simple matter to reuse SQL*Plus, since all the functionality is already built into it, but maybe there were other concerns that I’m not aware of. But cellcli works. By the way, cellcli can scroll through previous commands and edit them via the arrow keys, which is quite handy. The editing capability is definitely a step forward from SQL*Plus on Unix-like platforms (although you can use rlwrap to accomplish the same thing; see this post for more details: Using rlwrap on Windows). The regex support also provides very powerful pattern matching, although to my way of thinking it’s still a little confusing to have SQL-like syntax mixed with regex. Maybe if they just added the ability to use the % wildcard in addition to the regex I would feel better about it.

Applying Exadata Bundle Patch 6 (BP6)

Well, November came around quickly. When I started blogging I promised myself that I would at least try to share my Exadata findings once a month, so it is time to share my experience and findings from applying Bundle Patch 6. If you are new to Exadata patching, you should hopefully find some important points below. The patching took place on a ¼ rack (2 compute nodes). The high-level approach for applying BP6 is:
1. Apply patch 10114705, Database Bundle Patch 6 for Exadata, first

Oracle XMonth

I spent most of the last week in California at Oracle’s XMonth (it was really a week, so I’m not sure why they called it XMonth). Which reminds me of a tuning engagement I did a few years ago. I showed up at the client’s facilities and we started discussing the problem. They told me that their daily job was running too slow and it was causing major problems. So I asked them what time the daily job ran each day. And they told me that it ran every 20 minutes. Ha! “And why do you call it the daily job?”, I asked. They weren’t really sure. The job was taking about an hour to complete as I recall, but I digress.

At XMonth, they had three tracks: Exadata (and two others that I can’t remember). I did learn a few things I thought were worth sharing.

  1. Exalogic is not shipping yet, but we did cover it in some detail. Exalogic’s biggest advantage appears to be the InfiniBand fabric. It can be connected directly to any DB server using IB (Exadata for example) and can communicate with the extremely low latency RDS protocol.
  2. Oracle has relaxed their “no changes to the Exadata configuration” stance (but only very slightly). They said that it was OK to change out the Cisco switch and replace it with some other equivalent switch.
  3. A competitive analysis of Sun’s server line was provided. It included T series, X series, and M series servers along with Exadata. Exadata pretty much kicks all other options in the teeth (in my opinion). M series are still suitable for very large applications that are unable to scale out via RAC, such as Oracle’s own MRP package, which uses dbms_pipe, limiting its ability to scale in a RAC environment. But in general, the advice to the sales team was that if you are in a competitive situation, in most cases you should lead with Exadata.

So that’s about it. Oh they also let me talk about our experiences with Exadata. That was fun and I got to try my hand at a virtual presentation, as there were participants all over the world following along with Webex. The software has gotten pretty good for doing these kinds of presentations by the way. It was good practice for the Virtual Oracle Conference we have coming up next month. (note that there are only a couple of days left to sign up at the discounted rate)

Exadata, to index or not to index, that is the question

We will take a look at that question in this blog post by testing several different approaches and comparing the elapsed time and the statistics for each scenario.

The tests were performed on a quarter-rack Exadata Database Machine (2 db nodes with 16 cores each, and 3 storage servers). The database is set up for a data warehouse implementation and had been patched with Bundle Patch 5 at the time of testing.

The tests were executed on a table with 403M rows distributed over 14 range partitions: 7 non-compressed and 7 compressed with the HCC query option. Each test spans two partitions, covering 57.5M rows. Please note the dimension tables contain 4,000 rows or fewer. The data is production, event-based data, meaning rows are generated when certain events occur.

TABLE_NAME  PARTITION_NAME COMPRESS COMPRESS_FOR LAST_ANALYZED  SAMPLE_SIZE   NUM_ROWS
EVENT_PART  TEST_20100901  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100902  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100903  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100904  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100905  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100906  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100907  DISABLED              27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100908  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100909  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100910  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100911  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100912  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100913  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
EVENT_PART  TEST_20100914  ENABLED  QUERY HIGH   27-SEP-10         28793000   28793000
 

Each test case/SQL has been executed 5 times under different scenarios:

(1) Without any bitmap or regular indexes on non-compressed partitions
    Storage FTS on event_part and dimensions
(2) Without any bitmap or regular indexes on HCC partitions
    Storage FTS on event_part and dimensions
(3) With primary key constraints on dimension tables only
    Storage FTS on event_part with primary key lookup on dimensions
(4) With bitmap and primary key indexes on non-compressed partitions
    Bitmap index lookup on event_part and primary key lookup on dimensions
(5) With bitmap and primary key indexes on HCC partitions
    Bitmap index lookup on event_part and primary key lookup on dimensions

The test cases come from a warehouse environment; I have modified the column and table names. For the bitmap test cases I had to hint the queries to ensure the bitmap indexes were actually used.

The output from the test cases is over 20K lines, so I have summarized the elapsed times and a few statistics in the tables below to provide a better overview.

select /*+ MONITOR basic */
       st.s_name, pt.p_name, cct.pc_name, ep.c_level, count(ep.c_id)
  from event_part ep, t_test tt, s_test st, p_test pt, cc_test cct
 where tt.t_id between 20100901 and 20100902
   and ep.t_id = tt.t_id
   and ep.t_id between 20100901 and 20100902
   and ep.s_id = st.s_id
   and ep.p_id = pt.p_id
   and ep.cc_id = cct.c_id
 group by st.s_name, pt.p_name, cct.pc_name, ep.c_level
 order by st.s_name, pt.p_name, cct.pc_name, ep.c_level;
 

Table figure for basic breakdown – Test case 1

Stats / Tests                                                             1              2              3              4              5
Elapsed time (sec)                                                     8.87          12.00           9.22         185.05         149.55
cell physical IO bytes saved by storage index                             0              0              0              0              0
cell physical IO bytes eligible for predicate offload         5,209,325,568              0  5,209,325,568              0              0
cell physical IO interconnect bytes returned by smart scan    2,562,203,600              0  2,562,201,584              0              0
cell flash cache read hits                                                9             26              9          9,290          2,063
CC Total Rows for Decompression                                           0     57,586,000              0              0     57,586,000

This is a basic query, used for high-level summaries, and it serves as a good baseline to compare the other test cases against. There is no selective filter predicate in this test case (only the partition-pruning date range), so we will not benefit from any storage indexes here. The first three tests run without any indexes on the fact table and perform much better than tests 4 and 5; without the hints we should of course not expect the CBO to follow the index path anyway. It is evident for tests 1 and 3 that the performance gain comes from storage server offloading and smart scans. The CC stats above for test 2 tell us that the database node performs the decompression, so this test has to burn extra CPU cycles compared to tests 1 and 3. There is more to be said about test 2, but I'll cover that in the conclusion.
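To put the offload figures in perspective, here is a quick back-of-the-envelope calculation I added; the byte counts are copied from the statistics table above for test 1, and the variable names are my own:

```python
# Sketch: how much interconnect traffic the smart scan saved in test 1.
# Figures are taken from the statistics table above.
eligible_bytes = 5_209_325_568  # cell physical IO bytes eligible for predicate offload
returned_bytes = 2_562_203_600  # cell physical IO interconnect bytes returned by smart scan

saved_pct = (1 - returned_bytes / eligible_bytes) * 100
print(f"Smart scan cut interconnect traffic by {saved_pct:.1f}%")  # ~50.8%
```

So even without a selective predicate, roughly half of the scanned bytes never crossed the interconnect, because the cells returned only the projected columns.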

select /*+ MONITOR lc breakdown */
       st.s_name, pt.p_name, cct.pc_name, ep.c_level, count(ep.c_id)
  from event_part ep, t_test tt, s_test st, p_test pt, cc_test cct
 where tt.t_id between 20100903 and 20100904
   and ep.t_id = tt.t_id
   and ep.t_id between 20100903 and 20100904
   and ep.s_id = st.s_id
   and ep.p_id = pt.p_id
   and ep.cc_id = cct.c_id
   and ep.c_id = 7
 group by st.s_name, pt.p_name, cct.pc_name, ep.c_level
 order by st.s_name, pt.p_name, cct.pc_name, ep.c_level;
 

Table figure for LC breakdown - Test case 2

Stats / Tests                                                             1              2              3              4              5
Elapsed time (sec)                                                     2.05          19.17           1.84          30.33          36.58
cell physical IO bytes saved by storage index                 4,186,636,288              0  4,186,636,288              0              0
cell physical IO bytes eligible for predicate offload         5,209,292,800              0  5,209,292,800              0              0
cell physical IO interconnect bytes returned by smart scan      317,496,848              0    317,497,280              0              0
cell flash cache read hits                                               18             59             36          1,043            219
CC Total Rows for Decompression                                           0     57,782,554              0              0      7,842,364

Similar findings as we saw in the 1st test case; however, in this test case we are performing the breakdown for a specific ID, so the performance of tests 1 and 3 improved further thanks to the IO saved by the storage index. For this test case I ran tests 1 and 3 twice on the same partitions, and it is worth noticing that the second time around the savings from the storage index improved; the storage indexes are maintained and refined as we select data from the tables and partitions.
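As a rough illustration of how much scanning the storage index eliminated here, this is my own arithmetic on the figures reported in the table above for test 1:

```python
# Sketch: fraction of the eligible IO that the storage index let the cells
# skip entirely in test 1 of this test case. Figures come from the table above.
eligible_bytes = 5_209_292_800  # cell physical IO bytes eligible for predicate offload
si_saved_bytes = 4_186_636_288  # cell physical IO bytes saved by storage index

skipped_pct = si_saved_bytes / eligible_bytes * 100
print(f"Storage index skipped {skipped_pct:.1f}% of the eligible IO")  # ~80.4%
```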

select /*+ MONITOR lp breakdown */
       st.s_name, pt.p_name, cct.pc_name, ep.c_level, count(ep.c_id)
  from event_part ep, t_test tt, s_test st, p_test pt, cc_test cct
 where tt.t_id between 20100905 and 20100906
   and ep.t_id = tt.t_id
   and ep.t_id between 20100905 and 20100906
   and ep.s_id = st.s_id
   and ep.p_id = pt.p_id
   and ep.cc_id = cct.c_id
   and ep.p_id = 4611686019802841877
 group by st.s_name, pt.p_name, cct.pc_name, ep.c_level
 order by st.s_name, pt.p_name, cct.pc_name, ep.c_level;
 

Table figure for lp breakdown - Test case 3

Stats / Tests                                                             1              2              3              4              5
Elapsed time (sec)                                                     2.99           6.01           2.72          49.22          39.29
cell physical IO bytes saved by storage index                 2,623,143,936              0  2,623,799,296              0              0
cell physical IO bytes eligible for predicate offload         5,209,325,568              0  5,209,325,568              0              0
cell physical IO interconnect bytes returned by smart scan      674,439,456              0    674,436,288              0              0
cell flash cache read hits                                               64             44             10          2,113            635
CC Total Rows for Decompression                                           0     57,979,108              0              0     15,582,048

Similar findings as we saw in the 2nd test case; this test is simply performed on a different ID, which has a higher distinct count than the ID used in test case 2. As a result of that, and of how the data was sorted during insert, we see less IO saved by the storage index.
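Comparing the storage index savings across the two test cases makes the effect of the higher distinct count visible; again this is my own arithmetic on the reported figures for test 1 in each case:

```python
# Sketch: storage index savings for test 1 in test case 2 vs test case 3.
# The less selective ID in test case 3 lets the storage index skip fewer regions.
tc2_saved, tc2_eligible = 4_186_636_288, 5_209_292_800
tc3_saved, tc3_eligible = 2_623_143_936, 5_209_325_568

print(f"test case 2: {tc2_saved / tc2_eligible:.0%} of eligible IO skipped")  # ~80%
print(f"test case 3: {tc3_saved / tc3_eligible:.0%} of eligible IO skipped")  # ~50%
```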

select /*+ MONITOR spcl breakdown */
       st.s_name, pt.p_name, cct.pc_name, ep.c_level, count(ep.c_id)
  from event_part ep, t_test tt, s_test st, p_test pt, cc_test cct
 where tt.t_id between 20100906 and 20100907
   and ep.t_id = tt.t_id
   and ep.t_id between 20100906 and 20100907
   and ep.s_id = st.s_id
   and ep.p_id = pt.p_id
   and ep.cc_id = cct.c_id
   and ep.s_id = 1
   and ep.cc_id = 7
   and ep.p_id = 4611686019802841877
 group by st.s_name, pt.p_name, cct.pc_name, ep.c_level
 order by st.s_name, pt.p_name, cct.pc_name, ep.c_level;
 

Table figure for spcl breakdown – Test case 4

Stats / Tests                                                             1              2              3              4              5
Elapsed time (sec)                                                     1.67          13.69           1.14          12.77           7.90
cell physical IO bytes saved by storage index                 4,531,191,808              0  4,532,174,848              0              0
cell physical IO bytes eligible for predicate offload         5,209,325,568              0  5,209,325,568              0              0
cell physical IO interconnect bytes returned by smart scan      237,932,736              0    237,933,312              0              0
cell flash cache read hits                                               73             52             10            594            183
CC Total Rows for Decompression                                           0     57,782,554              0              0      5,614,752

This test case is performed with a where clause on multiple IDs. Again tests 1 and 3 take advantage of the Exadata features and perform well. Tests 4 and 5 are still not close to tests 1 and 3, but have definitely become a bit more competitive. Comparing the two HCC tests (2 and 5), test 5 does better, as it only has to burn CPU cycles decompressing roughly 10% of the rows of test 2. A valid question to ask here is why we are not seeing any benefit from either storage offloading or storage indexes in test 2, but again I'll defer that discussion to the conclusion.
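The 10% figure can be checked directly from the "CC Total Rows for Decompression" statistic in the table above (my own arithmetic):

```python
# Sketch: rows decompressed on the database node in test 5 vs test 2.
test2_rows = 57_782_554  # storage FTS of the HCC partitions
test5_rows = 5_614_752   # bitmap index access on the HCC partitions

print(f"test 5 decompresses {test5_rows / test2_rows:.1%} of test 2's rows")  # ~9.7%
```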

select /*+ MONITOR ttl */
       st.s_name, cct.pc_name, ep.c_level, count(ep.c_id) row_count,
       round(min((ep.ceg - ep.cbg) / 60), 2) min_g_min,
       round(avg((ep.ceg - ep.cbg) / 60.0), 2) avg_g_min,
       round(max((ep.ceg - ep.cbg) / 60), 2) max_g_min
  from event_part ep, t_test tt, s_test st, cc_test cct
 where tt.t_id between 20100901 and 20100902
   and ep.t_id = tt.t_id
   and ep.t_id between 20100901 and 20100902
   and ep.s_id = st.s_id
   and ep.cc_id = cct.c_id
 group by st.s_name, cct.pc_name, ep.c_level
 order by st.s_name, cct.pc_name, ep.c_level;
 

Table figure for ttl breakdown - Test case 5

Stats / Tests                                                             1              2              3              4              5
Elapsed time (sec)                                                    12.82          15.50          11.67         254.26         304.92
cell physical IO bytes saved by storage index                             0              0              0              0              0
cell physical IO bytes eligible for predicate offload         5,209,325,568              0  5,209,325,568              0              0
cell physical IO interconnect bytes returned by smart scan    2,328,566,432              0  2,328,567,440              0              0
cell flash cache read hits                                                9             15              9        132,467          2,341
CC Total Rows for Decompression                                           0     57,586,000              0              0     61,643,318

Very similar findings as we saw in the 1st test case; the only difference is that this query additionally reports the minimum, average and maximum of (ep.ceg - ep.cbg) in minutes, i.e. the time taken per group.

select /*+ MONITOR ttlsc */
       st.s_name, cct.pc_name, ep.c_level, count(ep.c_id) row_count,
       round(min((ep.ceg - ep.cbg) / 60), 2) min_g_min,
       round(avg((ep.ceg - ep.cbg) / 60.0), 2) avg_g_min,
       round(max((ep.ceg - ep.cbg) / 60), 2) max_g_min
  from event_part ep, t_test tt, s_test st, cc_test cct
 where tt.t_id between 20100903 and 20100904
   and ep.t_id = tt.t_id
   and ep.t_id between 20100903 and 20100904
   and ep.s_id = st.s_id
   and ep.cc_id = cct.c_id
   and ep.s_id = 2
   and ep.cc_id = 6
 group by st.s_name, cct.pc_name, ep.c_level
 order by st.s_name, cct.pc_name, ep.c_level;
 

Table figure for ttlsc breakdown - Test case 6

Stats / Tests                                                             1              2              3              4              5
Elapsed time (sec)                                                     1.16           4.57           1.01          12.71           3.87
cell physical IO bytes saved by storage index                 4,697,096,192              0  4,698,832,896              0              0
cell physical IO bytes eligible for predicate offload         5,209,292,800              0  5,209,292,800              0              0
cell physical IO interconnect bytes returned by smart scan       55,906,960              0     55,906,384              0              0
cell flash cache read hits                                                9             31             10          3,891            107
CC Total Rows for Decompression                                           0     57,749,795              0              0      1,998,299

Very similar findings as we saw in the 4th test case.

 Conclusion

Most warehouse-like queries I have run in our Exadata environment do well without indexes on the fact tables. So it is no surprise to me to hear that more and more people are dropping most of their indexes and taking advantage of the Exadata features. If you would like to keep the primary key indexes on your dimension tables to avoid the hassle of resolving duplicate key issues, that seems a valid option as well.

In my environment I have yet to find a case where a bitmap index search can compete with the no-index approach; and even if we found such a case, it would still have to show significant improvements before I would choose that path. Consider the benefit of not having to maintain the bitmap indexes after each load. There are also several restrictions on bitmap indexes that it would be nice not to have to worry about.

Now, I mentioned that I would get back to the test 2 results, which are based on a storage FTS against partitions compressed with the HCC QUERY option. In the past I have run queries on HCC tables and have seen IO savings from the storage indexes.

Initially I suspected the test 2 results observed above to be a bug, or alternatively to be related to the fact that my HCC-compressed partitions are only 29MB apiece versus 2.4GB uncompressed. Oracle support/development has confirmed it to be related to the data size, as we can see from the statistic "cell physical IO bytes eligible for predicate offload", which does not get bumped up by the query. The reason is that after partition pruning the table is too small for predicate push to kick in, and since predicate push doesn't kick in, the storage indexes won't kick in either.
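For context, here is how extreme the size difference between the compressed and uncompressed partitions is, which helps explain why the pruned segments fall below the point where predicate offload engages. This is my own arithmetic on the sizes quoted above, assuming binary units (1 GB = 1024 MB); the exact segment sizes were not measured more precisely:

```python
# Sketch: approximate compression ratio of the HCC QUERY HIGH partitions.
uncompressed_mb = 2.4 * 1024  # ~2.4 GB per uncompressed partition
compressed_mb = 29            # ~29 MB per HCC-compressed partition

ratio = uncompressed_mb / compressed_mb
print(f"Roughly {ratio:.0f}x compression")  # ~85x
```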

Please be aware that I don't know the storage index internals, but I look forward to learning them.