Exadata

EHCC and the GET_COMPRESSION_TYPE function

Well, I turned in the HCC chapter of the Exadata book last week, and of course, as is usually the case, I immediately ran across something kind of cool on the topic I just finished writing about. (We still have several editing passes to go, so I can add it later.) Anyway, although I don't have time to post much these days, I thought this one would make a quick little snippet. So here it is.

The Compression Advisor is part of the DBMS_COMPRESSION package. Specifically, it is the GET_COMPRESSION_RATIO procedure. This procedure is worthy of a separate post, but I won't discuss it here except to say that as of 11.2.0.2 you can use it to test HCC compression ratios on non-Exadata platforms. That's pretty cool, but what I wanted to tell you about is a handy little function in the same package called GET_COMPRESSION_TYPE. This function can tell you exactly what level of compression has been applied to a single row. This can come in handy for investigating the inner workings of HCC (or OLTP or BASIC compression, for that matter).
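If you want to take the advisor for a spin, a call looks roughly like the following. This is only a sketch based on the documented 11.2 signature; the scratch tablespace, owner, and table names are placeholders you would swap for your own:

set serveroutput on

declare
  l_blkcnt_cmp    pls_integer;
  l_blkcnt_uncmp  pls_integer;
  l_row_cmp       pls_integer;
  l_row_uncmp     pls_integer;
  l_cmp_ratio     number;
  l_comptype_str  varchar2(100);
begin
  dbms_compression.get_compression_ratio(
    scratchtbsname => 'USERS',                              -- scratch tablespace (placeholder)
    ownname        => 'KSO',                                -- table owner (placeholder)
    tabname        => 'SKEW_HCC3',                          -- table to test (placeholder)
    partname       => null,
    comptype       => dbms_compression.comp_for_query_high, -- compression type to estimate
    blkcnt_cmp     => l_blkcnt_cmp,
    blkcnt_uncmp   => l_blkcnt_uncmp,
    row_cmp        => l_row_cmp,
    row_uncmp      => l_row_uncmp,
    cmp_ratio      => l_cmp_ratio,
    comptype_str   => l_comptype_str);
  dbms_output.put_line('Estimated ratio: '||round(l_cmp_ratio,1)||' for '||l_comptype_str);
end;
/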

As you probably already know, HCC is only applied to records loaded via direct path writes. Any updates cause rows to be migrated out of that storage format into blocks flagged for OLTP compression. Of course, OLTP compression on a block only kicks in when the block is "full". On top of this, altering a table to change its compression does not actually change the storage format of any existing records (unless you use the MOVE keyword). So you could load some data and then change the designation (say from QUERY LOW to QUERY HIGH), and rows inserted after the change will be stored in the new format (assuming they are loaded via direct path writes, of course). So why am I telling you all this? Well, because I ran across a statement in some Oracle documentation saying that you can check which compression method a table is stored with by looking at the COMPRESS_FOR column in the DBA_TABLES view. That column does reveal what the table's designation is. However, it only tells you how rows inserted in the future will be compressed. It tells you absolutely nothing about how the current rows are stored.
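For example, here is a quick look at the table-level designation for my test table; again, this only shows how future direct path loads will be compressed, not how existing rows are stored:

select owner, table_name, compression, compress_for
from   dba_tables
where  owner = 'KSO'
and    table_name = 'SKEW_HCC3';

-- COMPRESS_FOR reports the current designation (e.g. ARCHIVE LOW); it says
-- nothing about the format of rows that are already sitting in the table.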

As for the mechanics, it appears that each row has a bitmask associated with it showing what compression format is being used. So I wrote a little script (check_row_comp.sql) to give me what I want to see, using the DBMS_COMPRESSION.GET_COMPRESSION_TYPE function. The script, and the session output below, also uses a small old_rowid() helper that translates a rowid into the old file.block.row format.
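The old_rowid() helper itself isn't shown in this post; a minimal sketch built on DBMS_ROWID (my own approximation, not necessarily the version used here) would be something like:

create or replace function old_rowid (p_rowid rowid) return varchar2 is
begin
  -- old-style rowid: relative file number . block number . row (slot) number
  return dbms_rowid.rowid_relative_fno(p_rowid) || '.' ||
         dbms_rowid.rowid_block_number(p_rowid) || '.' ||
         dbms_rowid.rowid_row_number(p_rowid);
end;
/

Here's an example of check_row_comp.sql in use.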

== Note this listing has been updated to fix the bitmask as suggested by Greg in the comments (I had it wrong initially)
 
SYS@SANDBOX1> !cat check_row_comp.sql
col old_rowid for a20
/*
There is a bit mask that indicates level of compression
00000001 (1)  = no compression
00000010 (2)  = BASIC/OLTP
00000100 (4)  = QUERY HIGH
00001000 (8)  = QUERY LOW
00010000 (16) = ARCHIVE HIGH
00100000 (32) = ARCHIVE LOW
*/
select 
old_rowid(rowid) old_rowid,
decode(
DBMS_COMPRESSION.GET_COMPRESSION_TYPE ( '&owner', '&table_name', '&rowid'), 
1, 'No Compression',
2, 'Basic/OLTP Compression', 
4, 'HCC Query High',
8, 'HCC Query Low',
16, 'HCC Archive High',
32, 'HCC Archive Low',
'Unknown Compression Level') compression_type
from dual;
 
SYS@SANDBOX1> select rowid from kso.skew_hcc3 where pk_col = 444444;
 
ROWID
------------------
AAAWbXAAIAAGAfiBdz
 
1 row selected.
 
SYS@SANDBOX1> @check_row_comp
Enter value for owner: KSO
Enter value for table_name: SKEW_HCC3
Enter value for rowid: AAAWbXAAIAAGAfiBdz
 
OLD_ROWID            COMPRESSION_TYPE
-------------------- -------------------------
1.929.0              HCC Archive Low
 
1 row selected.
 
SYS@SANDBOX1> update kso.skew_hcc3 set col1 = col1*2 where pk_col=444444;
 
1 row updated.
 
SYS@SANDBOX1> commit;
 
Commit complete.
 
SYS@SANDBOX1> select rowid, old_rowid(rowid) old_rowid from kso.skew_hcc3 where pk_col = 444444;
 
ROWID              OLD_ROWID
------------------ --------------------
AAAWbXAAIAAF7aUAAA 8.1554068.0
 
1 row selected.
 
SYS@SANDBOX1> -- row migrated to file 8
SYS@SANDBOX1> 
SYS@SANDBOX1> @check_row_comp
Enter value for owner: KSO
Enter value for table_name: SKEW_HCC3
Enter value for rowid: AAAWbXAAIAAF7aUAAA
 
OLD_ROWID            COMPRESSION_TYPE
-------------------- -------------------------
1.929.0              No Compression
 
1 row selected.

I did a previous post (Proof That Whole CU’s Are Not Decompressed) where I showed row migration to a block flagged for OLTP compression, but of course the OLTP compression would not be applied until the block was sufficiently full. As you can see in the example, the function showed that the record was not actually compressed, even though it had been migrated to a block flagged for OLTP compression.

So the GET_COMPRESSION_TYPE function is handy for showing us how an individual record is actually stored. Of course you can see this by dumping blocks as well, but I think this function is much easier than looking at block dumps. You could also write a script to sample some percentage of the rows in a table to get a feel for how well compressed the rows are, but I will leave that as an exercise for the reader. (please let me know if you undertake that task as it is something I will eventually want to do as well) ;)
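If you do want a starting point for that exercise, a rough sketch might look something like this (the 1% sample size is arbitrary, and calling the function once per sampled row will be slow on a really big table):

select compression_type, count(*) row_count
from (
  select decode(dbms_compression.get_compression_type('KSO', 'SKEW_HCC3', rowid),
           1, 'No Compression',
           2, 'Basic/OLTP Compression',
           4, 'HCC Query High',
           8, 'HCC Query Low',
           16, 'HCC Archive High',
           32, 'HCC Archive Low',
           'Unknown Compression Level') compression_type
  from kso.skew_hcc3 sample (1)
)
group by compression_type
order by row_count desc;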

Debunking More Netezza FUD About Exadata


A reader recently left a comment, and my reply turned out to be longer than I'd like to leave in the comment thread, so I'm answering it in detail with this blog post.

Gabramel writes:

Greg,
Nice article. I am just reading the Netezza paper.

You don’t appear to have debunked the following statement.

“Exadata is unable to process this three table join in its MPP tier and instead must inefficiently move all the data required by the calculation across the network to Oracle RAC.”

Not many queries exist where data is only required from two tables. Are Oracle suggesting we need to change the way data is structured to enable best use of Exadata – increasing TCO significantly?

Thanks & Nice post.

There is a reason that I did not debunk that statement – it did not exist in the original version of Netezza’s paper. It seems they have taken the shopping basket example that I debunked in my previous post and replaced it with this one. Nonetheless, let’s take a look at Netezza’s claim:

Exadata’s storage tier provides Bloom filters to implement simple joins between one large and one smaller table, anything more complex cannot be processed in MPP. Analytical queries commonly require joins more complex than those supported by Exadata. Consider the straightforward case of an international retailer needing insight to the dollar value of sales made in stores located in the UK. This simple SQL query requires a join across three tables – sales, currency and stores.

select sum(sales_value * exchange_rate) us_dollar_sales
from sales, currency, stores
where sales.day = currency.day
and stores.country = 'UK'
and currency.country = 'USA'

Exadata is unable to process this three table join in its MPP tier and instead must inefficiently move all the data required by the calculation across the network to Oracle RAC.

Before I comment, did you spot the error with the SQL query? Hint: Count the number of tables and joins.

Now we can clearly see that Netezza marketing cannot write good SQL: this query contains a cross product, because there is no join between SALES and STORES. The value returned from this query is therefore not “the [US] dollar value of sales made in stores located in the UK”, it’s some other rubbish number.

Netezza is trying to lead you to believe that sending data to the database nodes (running Oracle RAC) is a bad thing, which it most certainly is not. Let’s remember what Exadata is – Smart Storage. Exadata itself is not an MPP database, so of course it needs to send some data back to the Oracle database nodes, where the Oracle database kernel can use Parallel Execution to parallelize the execution of this query in an MPP fashion, efficiently leveraging all the CPUs and memory of the database cluster.

The reality here is that both Netezza and Oracle will do the JOIN in their respective databases, however, Oracle can push a Bloom filter into Exadata for the STORES.COUNTRY predicate so that the only data that is returned to the Oracle database are rows matching that criteria.

Let’s assume for a moment that the query is correctly written with two joins and the table definitions look like such (at least the columns we’re interested in):

create table sales (
 store_id    number,
 day         date,
 sales_value number
);

create table currency (
 day           date,
 country       varchar2(3),
 exchange_rate number
);

create table stores (
 store_id number,
 country  varchar2(3)
);

select
    sum(sales.sales_value * currency.exchange_rate) us_dollar_sales
from
    sales,
    currency,
    stores
where
    sales.day = currency.day
and sales.store_id = stores.store_id
and stores.country = 'UK'
and currency.country = 'USA'

For discussion’s sake, let’s assume the following:

  • There is 1 year (365 days) in the SALES table of billions of rows
  • There are 5000 stores in the UK (seems like a realistic number to me)

There is no magic in those numbers, it’s just something to add context to the discussion, so don’t think I picked them for some special reason. Could be more, could be less, but it really doesn’t matter.

So if we think about the cardinality of the tables:

  • STORES has a cardinality of 5000 rows
  • CURRENCY has a cardinality of 365 rows (1 year)

The table JOIN order should be STORES -> SALES -> CURRENCY.
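Just as an illustration, and not something the example depends on: if you wanted to spell that join order out explicitly, a LEADING hint on the corrected query would look roughly like this.

select /*+ leading(stores sales currency) */
    sum(sales.sales_value * currency.exchange_rate) us_dollar_sales
from
    sales,
    currency,
    stores
where
    sales.day = currency.day
and sales.store_id = stores.store_id
and stores.country = 'UK'
and currency.country = 'USA'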

With Exadata what will happen is such:

  • Get the STORE_IDs from STORES where COUNTRY = ‘UK’
  • Build a Bloom filter of these 5000 STORE_IDs and push it into Exadata
  • Scan SALES and apply the Bloom filter in storage, returning only rows for UK STORE_IDs and projecting only the necessary columns
  • JOIN that result to CURRENCY
  • Compute the SUM aggregate

All of these operations are performed in parallel using Oracle’s Parallel Execution.
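If you want to verify the Bloom filter pushdown for yourself, one way (my suggestion, not something taken from either vendor's paper) is to run the corrected query and then pull the actual plan for it from the same session:

-- run the corrected three-table query (with the tables decorated for parallel
-- execution, or with a statement-level parallel hint), then:
select * from table(dbms_xplan.display_cursor(null, null, 'TYPICAL'));

On Exadata you would expect to see JOIN FILTER CREATE / JOIN FILTER USE operations in the plan (with bind names like :BF0000), and in the Predicate Information section a storage() predicate on the SALES scan containing SYS_OP_BLOOM_FILTER, or SYS_OP_BLOOM_FILTER_LIST when more than one filter is applied.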

Netezza suggests that Exadata can use Bloom filters only for two-table joins (one big, one small), that analytical queries are more complex than that, and therefore that Exadata cannot use a Bloom filter, and they provide an example to suggest as much. The reality is that not only is their example incorrectly written SQL, it also works great with Exadata Bloom filters, and it involves more than two tables! In addition, it is a great demonstration of efficient and smart data movement: Exadata can filter with Bloom filters in storage and needs to project only a very few columns, likely a big savings versus sending all the columns and rows from storage. So Exadata Bloom filters do work with complex analytical queries of more than two tables, and they efficiently send data across the network to the Oracle RAC cluster, where Parallel Execution handles the joins and aggregation in an MPP manner.

Now to specifically answer your question: No, Oracle is not suggesting you need to change your data or queries to support two-table joins; Exadata will likely work fine with what you have today. And to let you and everyone else in on a little secret: Exadata actually supports applying multiple Bloom filters to a table scan (we call this a Bloom filter list, which shows up in the Predicate Information of a query plan as SYS_OP_BLOOM_FILTER_LIST), so multiple join filters can be applied in Exadata storage. In reality, Bloom filters are not even limited to two-table joins.

Oh well, so much for Netezza competitive marketing. Just goes to show that Netezza has a very poor understanding of how Exadata really works (yet again).


Comparison with Teradata

From “Exadata – the Sequel: Exadata V2 is Still Oracle” (http://www.teradata.com/t/assets/0/206/276/5bfc4694-ce82-4a07-867d-3f104...):

Shared Everything vs. Shared Nothing
“Nothing” tends to sound negative, but that is not necessarily the case.
Teradata avoids I/O bottlenecks by adopting a Shared Nothing architecture.

Exadata spreads data evenly across the Cells via ASM. This is called the SAME architecture (Stripe And Mirror Everywhere, or Everything). Because Parallel Query reads from Cells that are striped in this uniform way, I/O waits occur even between Parallel Query Slaves (labeled Workers in the figure below), and those waits grow in proportion to the number of concurrent sessions.
Whether a query finishes in a few seconds or takes tens of minutes, every Parallel Query Slave scans every Cell.

EHCC Mechanics – Proof that whole CU’s are not decompressed

I saw an interesting post recently where Greg Rahn talked about HCC mechanics. He claimed that an update to a record stored in HCC format does not require decompressing the whole Compression Unit (CU), which consists of several Oracle blocks. I’m assuming by this he meant that all the records contained in the CU do not get written back to storage in a non-HCC format due to a single record being updated. Greg then showed an example proving that row migration occurred for an updated record. He didn’t show that the other records had not been decompressed, though. So since I was already working on an HCC chapter for the upcoming Apress Exadata book, I thought I would take time off from the book writing to post this (hopefully the editors will forgive me).

Here’s the recipe: basically we’ll update a single row, see that its rowid has changed, verify that we can still get to the record via its original rowid, and check whether the TABLE FETCH CONTINUED ROW statistic gets incremented when we access the row via its original rowid, thus proving basic row migration (this is what Greg has already shown). Then we’ll look at block dumps of the original and new blocks to see what’s there.

-bash-3.2$ sqlplus / as sysdba
 
SQL*Plus: Release 11.2.0.2.0 Production on Fri Jan 14 14:16:20 2011
 
Copyright (c) 1982, 2010, Oracle.  All rights reserved.
 
 
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
 
SYS@SANDBOX1> select rowid, old_rowid(rowid) location from kso.skew_hcc3 where pk_col=16367;
 
ROWID              LOCATION
------------------ --------------------
AAATCBAAIAAF8uSFc9 8.1559442.22333
 
SYS@SANDBOX1> -- so my row is in file 8, block 1559442, slot 22333
SYS@SANDBOX1> 
SYS@SANDBOX1> update kso.skew_hcc3 set col1=col1 where pk_col=16367;
 
1 row updated.
 
SYS@SANDBOX1> select rowid, old_rowid(rowid) location from kso.skew_hcc3 where pk_col=16367;
 
ROWID              LOCATION
------------------ --------------------
AAATCBAAHAAMGMMAAA 7.3171084.0
 
SYS@SANDBOX1> -- Ha! The rowid has changed – the row moved to file 7, block 3171084, slot 0
SYS@SANDBOX1> 
SYS@SANDBOX1> -- Let's see if we can still get to it via the original rowid
SYS@SANDBOX1> 
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAIAAF8uSFc9';
 
    PK_COL
----------
     16367
 
SYS@SANDBOX1> -- Yes we can! – can we use the new rowid?
SYS@SANDBOX1> 
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAHAAMGMMAAA';
 
    PK_COL
----------
     16367
 
SYS@SANDBOX1> -- That works too! – It’s a migrated Row!
SYS@SANDBOX1> -- Let’s verify with “continued row” stat
SYS@SANDBOX1> 
SYS@SANDBOX1> @mystats
Enter value for name: table fetch continued row
 
NAME                                                                             VALUE
---------------------------------------------------------------------- ---------------
table fetch continued row                                                         2947
 
SYS@SANDBOX1> -- select via the original rowid
SYS@SANDBOX1> 
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAIAAF8uSFc9';
 
    PK_COL
----------
     16367
 
SYS@SANDBOX1> @mystats
Enter value for name: table fetch continued row
 
NAME                                                                             VALUE
---------------------------------------------------------------------- ---------------
table fetch continued row                                                         2948
 
SYS@SANDBOX1> -- Stat is incremented – so definitely a migrated row!

So the row has definitely been migrated. Now let’s verify that the migrated row is not compressed. We can do this by dumping the block where the newly migrated record resides.

SYS@SANDBOX1> !cat dump_block.sql
@find_trace
alter system dump datafile &fileno block &blockno;
 
SYS@SANDBOX1> @dump_block
 
TRACEFILE_NAME
------------------------------------------------------------------------------------
/u01/app/oracle/diag/rdbms/sandbox/SANDBOX1/trace/SANDBOX1_ora_5191.trc
 
Enter value for fileno: 7
Enter value for blockno: 3171084
 
System altered.

Now let’s look at the trace file produced in the trace directory. Here is an excerpt from the block dump.

Block header dump:  0x01f0630c
 Object id on Block? Y
 seg/obj: 0x13081  csc: 0x01.1e0574d4  itc: 3  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1f06300 ver: 0x01 opc: 0
     inc: 0  exflg: 0
 
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x002f.013.00000004  0x00eec383.01f2.44  ----    1  fsc 0x0000.00000000
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x01f0630c
data_block_dump,data header at 0x2b849c81307c
===============
tsiz: 0x1f80
hsiz: 0x14
pbl: 0x2b849c81307c
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f60
avsp=0x1f4c
tosp=0x1f4c
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f60
block_row_dump:
tab 0, row 0, @0x1f60
tl: 32 fb: --H-FL-- lb: 0x1  cc: 5
col  0: [ 4]  c3 02 40 44
col  1: [ 2]  c1 02
col  2: [10]  61 73 64 64 73 61 64 61 73 64
col  3: [ 7]  78 6a 07 15 15 0b 32
col  4: [ 1]  59
end_of_block_dump

Notice that there is only one row in the block (nrow=1). Also notice that the object_id is included in the block (in hex format). It is labeled “seg/obj:”. The table has 5 columns. The values are displayed – also in hex format. Just to verify that we have the right block and row we can translate the object_id and the value of the first column as follows:

SYS@SANDBOX1> !cat obj_by_hex.sql
col object_name for a30
select owner, object_name, object_type
from dba_objects
where object_id = to_number(replace('&hex_value','0x',''),'XXXXXX');
 
SYS@SANDBOX1> @obj_by_hex
Enter value for hex_value: 0x13081
 
OWNER                          OBJECT_NAME                    OBJECT_TYPE
------------------------------ ------------------------------ -------------------
KSO                            SKEW_HCC3                      TABLE
 
 
SYS@SANDBOX1> desc kso.skew_hcc3
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 PK_COL                                 NUMBER
 COL1                                   NUMBER
 COL2                                   VARCHAR2(30)
 COL3                                   DATE
 COL4                                   VARCHAR2(1)
 
SYS@SANDBOX1> !cat display_raw.sql
col value for a50
select display_raw(replace('&string',' ',''),nvl('&TYPE','VARCHAR2')) Value from dual
/
 
SYS@SANDBOX1> @display_raw
Enter value for string: c3 02 40 44
Enter value for type: NUMBER
 
VALUE
--------------------------------------------------
16367

As you can see, this is the record that we updated earlier in the SKEW_HCC3 table. Note: display_raw.sql depends on a little function called display_raw() which coincidentally I got from Greg Rahn. Here’s a script to create the function: create_display_raw.sql
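If you just want the general idea without downloading the script, a stripped-down sketch of such a function (my approximation, covering only the NUMBER, VARCHAR2, and DATE cases via DBMS_STATS.CONVERT_RAW_VALUE) looks something like this:

create or replace function display_raw (p_raw raw, p_type varchar2)
return varchar2
is
  cn number;
  cv varchar2(4000);
  cd date;
begin
  -- convert the raw column value from a block dump into something readable
  if p_type = 'NUMBER' then
    dbms_stats.convert_raw_value(p_raw, cn);
    return to_char(cn);
  elsif p_type = 'VARCHAR2' then
    dbms_stats.convert_raw_value(p_raw, cv);
    return cv;
  elsif p_type = 'DATE' then
    dbms_stats.convert_raw_value(p_raw, cd);
    return to_char(cd, 'DD-MON-YYYY');
  else
    return 'UNSUPPORTED DATATYPE';
  end if;
end;
/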

Now for a quick look back at the original block (note that in my testing I updated another row in this same block):

===============
tab 0, row 1, @0x32
tl: 5086 fb: --H-F--N lb: 0x3  cc: 1
nrid:  0x0217cb93.0
col  0: [5074]
Compression level: 03 (Archive Low)
 Length of CU row: 5074
kdzhrh: ------PC CBLK: 1 Start Slot: 00
 NUMP: 01
 PNUM: 00 POFF: 5054 PRID: 0x0217cb93.0
CU header:
CU version: 0   CU magic number: 0x4b445a30
CU checksum: 0x982dec03
CU total length: 11403
CU flags: NC-U-CRD-OP
ncols: 5
nrows: 32759
algo: 0
CU decomp length: 7266   len/value length: 945436
row pieces per row: 1
num deleted rows: 2
deleted rows: 22333, 30848,
START_CU:
 00 00 13 d2 1f 01 00 00 00 01 00 00 13 be 02 17 cb 93 00 00 00 4b 44 5a 30
 03 ec 2d 98 00 00 2c 8b eb 06 00 05 7f f7 00 0e 6d 1c 01 00 02 00 00 00 00

So this little excerpt shows that this is an HCC compressed block (Compression level: 03 (Archive Low), plus the many CU references). The nrows line shows us that the block contains 32759 rows. It also shows that 2 rows have been deleted from the block (num deleted rows). Notice that one of the deleted rows is the one in slot 22333 (sound familiar?). If you look back at the original rowid in the old format (fileno.blockno.slot), you’ll see that it is the row we updated. It was “deleted” from this block when it was migrated to the new block. Of course, there is still a pointer left behind.

SYS@SANDBOX1> select old_rowid('AAATCBAAIAAF8uSFc9') location from dual;
 
LOCATION
-----------------------------------------
8.1559442.22333
 
 
SYS@SANDBOX1> select old_rowid(rowid) location, a.* from kso.skew_hcc3 a where rowid = 'AAATCBAAIAAF8uSFc9';
 
LOCATION                 PK_COL       COL1 COL2                           COL3      C
-------------------- ---------- ---------- ------------------------------ --------- -
7.3171084.0               16367          1 asddsadasd                     21-JUL-06 Y

It’s hard to prove a negative, but it does not appear that any records are decompressed other than those that are actually updated. The other rows in the block appear to remain in HCC format.

Technologies Supporting Exadata, Part 6: Advanced Compression

Table compression with Oracle Advanced Compression also has a large impact on Parallel Query performance. From the Oracle Advanced Compression White Paper:

For large-scale DWH systems, reducing the amount of data read from physical disk is the key to improving performance. Looking at the impact in the TPC-H benchmark (from http://ichiroobata.blogspot.com/2010/08/partitioning.html):

Technologies Supporting Exadata, Part 5: Oracle Partitioning

From the Oracle Data Sheet:

Partitioning delivers a variety of benefits in availability, performance, and manageability for a wide range of applications.

Technologies Supporting Exadata, Part 4: Parallel Query on Multi-core CPUs

Each server is built around a motherboard with two Intel Xeon 7560 processors (as of the September 2010 version), so a single DB server has 4 cores x 2 = 8 cores.

The era of SMP machine competition among server vendors such as Sun, IBM, and HP is over. We are now in the era of multi-core, general-purpose processors from two vendors, Intel and AMD.
Exadata uses the most expensive of these, the Intel Xeon, in a dual configuration.

Technologies Supporting Exadata, Part 3: Automatic Storage Management (ASM)

ASM: the Oracle component that implements the shared disk
OEL: Oracle Enterprise Linux
CELLSRV: Cell Service, handles communication with the database
MS: Management Service
RS: Restart Service
IORM: I/O Resource Manager

It is ASM that makes the shared disk environment for Oracle RAC possible. It is the software-side workhorse that delivers an I/O transfer rate and bandwidth that previously could only be achieved with multiple expensive SAN disk array units.

Solaris ZFS is a comparable open-source technology, but if you are running Oracle, ASM has the edge in performance.

ZFS feature overview: excerpted from an @IT article:

Technologies Supporting Exadata, Part 1: InfiniBand

- The parallel-computation data produced, and required, by high-performance cores such as the Intel Xeon has become enormous
- In parallel systems that keep using Gigabit Ethernet as before, traffic exceeds the network's capacity, many communication requests end up waiting, and parallel processing cannot proceed smoothly
- This phenomenon of the network saturating and parallel performance no longer scaling shows up with only a few machines when the computers are fast, and only with many more machines when the computers are slower
- InfiniBand appeared as a new network facility with more than ten times the throughput of Gigabit Ethernet, offering the high bandwidth and low latency needed for parallel processing
- On the Top500 list of the world's parallel computer systems (http://www.top500.org), InfiniBand dominates the alternatives
Reference: from an article by Concurrent Systems, Inc. (http://www.concurrent.co.jp/products/option/infiniband.html)

Excerpt from http://ja.wikipedia.org/wiki/InfiniBand:
In late 1998, the servers used for mission-critical work had been getting faster in line with Moore's law, roughly doubling every 18 months. However, the bus architecture inside those servers could not keep up with that evolution because of mechanical and electrical limitations, and had become the bottleneck for improving computer system performance.

Performance Stories from Exadata Migrations

Here are my UKOUG 2010 slides about Exadata migration performance; this is real-life stuff, not a repeat of the marketing material:
View more presentations from tanelp.
