Yes, it has, and I’m really sorry but at the moment all the blog writing has to take a lower priority. I am working very hard to finish my chapters of the upcoming Pro Oracle Database 11g RAC on Linux, which turned out to be a lot more work than I expected. The last chapter I have to write from scratch is about ASM, and it’s again a meaty one. Luckily the first draft is complete and I sent it to Apress for review and the usual editorial work.
I so decided to quit my daytime job to focus entirely on the book, which I’d like to see released between Open World and the UKOUG conference this year.
Once completed, I intend to become a freelance Oracle database consultant, specialising on RAC (no surprises here!) and Oracle 11.2 upgrades. I am available from mid September (the first 4 weeks are already booked!) if you like-just contact me!
Recenty the following question was posted on oracle-l (I paraphrase…):
With Oracle Database it is possible to create something similar to Teradata’s sparse indexes?
Since the question is an interesting one, I decided to write this short post.
First of all, I have to say that such a feature is not supported by the CREATE INDEX statement [...]


前回db_file_multiblock_read_count=64のとき、安定しない結果に終わった。
そして「ディスク転送量+アルファ」の何かがあると書いた。
_db_file_direct_io_count : Sequential I/O buf size
今回のテストはBLOCK_SIZE=8K=8192Bでテストをしている。
このI/O buffer sizeを最適化してからテストを再度行う:
db_file_multiblock_read_count = 64, _db_file_direct_io_count = 524288
db_file_multiblock_read_count = 128, _db_file_direct_io_count = 1048576
db_file_multiblock_read_count = 256, _db_file_direct_io_count = 2907152
db_file_multiblock_read_count = 512, _db_file_direct_io_count = 4194304
db_file_multiblock_read_count = 1024, _db_file_direct_io_count = 8388608
結果db_file_multiblock_read_count = 64でも安定した。そして128のときと比べて、ほぼ同じ。
direct path readは、そのバッファサイズを最適化しておかないと無駄なTemporaryアクセスを発生する。
それを踏まえてdb_file_multiblock_read_countを調整する。
DWHだからって「むやみにdb_file_multiblock_read_countを増やせばよい」ではない。
因みに、Exadata1のTPC-Hベンチマークでは:
db_file_multiblock_read_count=64
db_block_size=32K
を設定している。
最後に、
_db_file_direct_io_countの設定はdb_file_multiblock_read_countと必ず対応しなければなりません。たとえば、multiblock_read_cnt=512なのに、この値を1024対応のものにすると最適化されなくなり、ディスク応答時間に影響します。今回の例ですとBlockSize=8K=8192。それにmultiblock_read_cntを掛けて算出しています。
因みに、これらは_adaptive_direct_read = TRUEで制御されているみたいです。
I got an email today, which I think is for once a genuine one…beside that, just before Oracle Open World, its always interesting to guess about what Mr. Ellison is about to announce…
I have checked out the site/post and, besides the apparent ones (PR), I couldn’t find any hidden traps (always a bit, healthy I hope, paranoia on this kind of stuff).
From: Stephen Jannise
Sent: Tuesday 3 August 2010 19:57
To: Marco Gralike
Subject: Editorial question about your blogHi Marco,
I thought you would be interested in an article I’ve written about Oracle’s next blockbuster deal. The company’s surprising acquisition of Sun suggests that Oracle is willing and able to make major deals in unexpected areas. I’m interested to see where they go next, so I’m hosting a poll on my blog at: http://www.softwareadvice.com/articles/manufacturing/oracle-mergers-acquisitions-whos-next-1080310/.
I’ve presented a list of thirteen potential targets for readers to vote on. Rather than guess these targets at random, I’ve done some research into the past five years of Oracle acquisitions as well as studied the current market to make some educated suggestions about possible future acquisitions.
Responses are trickling in, so I’m reaching out to a few bloggers to spread the word and drive more responses. Would you mind posting a brief entry about this on your blog? I would really appreciate your help. Please let me know what you think.
Thanks,
Stephen
Stephen said about his article in one of his emails to me:
The article traces the last five years of Oracle acquisitions in an attempt to pinpoint the strategies behind Oracle’s blockbuster purchases. Based on this analysis and a review of potential targets, the article suggests thirteen companies that Oracle may conceivably be interested in over the upcoming months. There are a few obvious choices along with some more unlikely options, and readers are encouraged to voice their own opinions and participate in the poll. You can find the article and the poll here:
http://www.softwareadvice.com/articles/manufacturing/oracle-mergers-acquisitions-whos-next-1080310/.
My idea about it, if I were allowed to gamble (getting into Mr. Ellison’s head) is that Stephen is wrong. The mentioned companies aren’t, IMHO, the ones that Oracle needs to keep in pace with Google, IBM or Microsoft, to name the few, although I think that “VMware” is an interesting one.
I would still vote for a company like Amazon (probably too expensive, but then again I have seen stranger things happen…), which isn’t on Stephens list, but it would provide the opportunity for Oracle to get into the Cloud with his virtualization software, deploy apps like OpenOffice, EBS or Fusion Apps and Middleware etc.
Being a Oracle DBA “nerd” for 15+ years I know that Mr. Ellison is always looking waaaaay ahead into the future, remember Video on Demand via “nCube” or the “raw iron” project…? He also sticks to it… Its all long term stuff, some paid off and on some he / Oracle lost, but you could say via JRockit / OEL some, also, still became a “raw iron” reality. With some data centers in place, “Oracle Docs” or “Oracle Apps” could become the next reality…
Guessing can be fun…
While browsing my old pictures, I discovered two small directories with some installation snapshots I once made, because there is almost no info left about this topic (besides in people’s heads) I left it here for “past” reference…
Double click on it to go to the bigger “version” on Picassa.
I think the VST diagrams are powerful on their own but my original goal was to show the execution path on top of the diagram in order to look at to execution plans side by side and quickly see the differences which isn’t possible with textual explain plans. Here is an example of two explain plans in text and then graphically This graphic was one of my attempts to show executions on top of the VST diagram, and I’ve been working with how best to show the order. Because the diagram layout stays the same, and the order is overlaid on top, it’s easy to compare them side by side. I can also see why original execution plan, on the left, was wrong. The original execution started at E which joins to C producing 198201422 rows (as seen on the join line), yet the query only returns 44,000 rows (not shown). Also table E has no filtering, really, since 99% of it's rows are returned after applying the filtering condition. The filtering is represented in blue to the bottom right of the table as a percentage of rows returned from the table. The query should start at A where there is 2% filter ratio. After starting at A we should join to C or B (which is a subquery) because the result set sizes are the smallest 85K and 642K respectively. Making this change (via hints) took this query from running over 24 hours to 5 minutes) Designing VST diagrams has been fun exciting and challenging. I wonder if you can imagine this: Have you ever been rock climbing ? and I’m not talking about the gym, but out on real cliffs? If so you probably know that 99% of the time people use a route map that shows exactly where to go and that indeed there is a way to climb up and off the cliff and so you won’t get stuck halfway up with nowhere to go. Climbing these routes is super scary and exciting, but the amazing thing is someone climbed them without a map, not knowing if they would get up 1000 feet and find that there is no way to finish, and somehow have to get down, which might not even be possible given the gear . Seriously life threatening. That’s what I think about writing a completely new way of tuning SQL. Of course it’s not mortally life threatening J, but years of my life and my job are on the line. I have no idea where the end is, or what all the road blocks will be. I certainly didn’t when I started but I’m far enough along to see that the diagrams provide powerful information rapidly to the user in an easy to understand graphic way. Graphics can be misused even abused resulting worse information instead of better, but when graphics are used well they are much more powerful than the textual quantitative data. The graphical diagrams are a great way to understand the relationships in the query faster than reading the query text. It’s sort of like looking at the Google map and seeing the route drawn verses having to read the directions. Both are important but I can understand the route on map much faster than the directions, though I might want to read the directions after reading the map. The map with the route though is often all I need. I’m not able to store vast amounts of information and see all the permutations like some Oracle experts can. This can be a curse and a blessing. It’s a blessing because it makes me look for ways where I can understand the problem space with out having to buffer tons of information and such solutions are valuable to the general public. On the other hand its scary in that I don’t see all the possible problems, but that can be a blessing as well because I try to just concentrate on the solutions that have the biggest return, the biggest bang for the buck , the issues that can be most easily solved, instead of getting distracted the overwhelming possibilities and issues. In some ways the SQL optimization space seems overwhelming and in others ways it seems really small. I think that VST diagrams will help make the problem space seem much smaller, manageable and understandable as the VST diagrams mature.
This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn’t miss, go take a look!
Everyone knows that seminal papers need a simple title and descriptive title. “A Relational Model for Large Shared Data Banks” for example. I think Michael Stonebraker overshot the target In a 2007 paper titled, “The End of an Architectural Era”.
Why is this The End? According to Michael Stonebraker “current RDBMS code lines, while attempting to be ‘one size fits all’ solution, in face, excel at nothing. Hence, they are 25 years old legacy code lines that should be retired in favor of a collection of ‘from scratch’ specialized engined”.
He makes his point by stating that traditional RDBM design is already being replaced for a variety of specialized solutions: Data-warehouses, streams processing, text and scientific databases. The only uses left for RDBMS is OLTP and hybrid systems.
The provocatively named paper is simply a description of a system, designed from scratch for modern OLTP requirements and the demonstration that this system gives better performance than traditional RDBMS on OLTP type load. The conclusion is that since RDBMS can’t even excel at OLTP – it must be destined for the garbage pile. I’ll ignore the fact that hybrid systems are far from extinct and look at the paper itself.
The paper starts with a short review of the design considerations behind traditional RDBMS, before proceeding to list the design considerations behind the new OLTP system, HStore.:
In other sections Stonebraker describes few more properties of the system:
The requirements seem mostly reasonable and very modern – use replication as a method of high availability and scalabilty, avoid disks and their inherent latencies, avoid the complications of concurrency, avoid ad-hoc queries, avoid SQL and avoid annoying DBAs. If Stonebraker can deliver on his promise, if he can do all of the above without sacraficing the throughput and durability of the system, this sounds like a database we’ll all enjoy.
In the rest of the paper, the authors describe some special properties of OLTP work loads, and then explains how HStore utilizes the special properties to implement a very efficient distributed OLTP system. In the last part of the paper, the authors use HStore to run a TPC-C like benchmark and compare the results with an RDBMS.
Here are in very broad strokes the idea:
The paper explains in some detail how things are done, while I only describe what is done:
The system is distributed, with each object partitioned over the nodes. You can have specify how many copies of each row will be distributed, and this will provide high availability (if one node goes down you will have all the data available on other nodes).
Each node is single threaded. Once SQL query arrives at a node, it will be performed to the end without interruptions. There are no physical files. The data objects are stored as Btrees in memory, Btree block is sized to match L2 cache line.
The system will have a simple cost-based optimizer. It can be simple because OLTP queries are simple. If multi-way joins happen they always involve identifying a single tuple and then tuples to join to that record in a small number of 1-to-n joins. Group by and aggregation don’t happen in OLTP systems.
The query plans can either run completely in one of the nodes, can be decomposed to a set of independent transactions that can run completely in one node each, or require results to be communicated between nodes.
The way to make all this efficient is by using a “database designer” – Since the entire workload is known in advance, the database designer’s job is to make sure that most queries in the workload can run completely on a single node. It does this by smartly partitioning the tables, placing parts that are used together frequently on the same node and copying tables (or just specific columns) that are read-only all over the place.
Since there are at least two copies of each row and each table, there must be a way to consistently update them. Queries that can complete on a single node, can just be sent to all relevant nodes and we can be confident that they will all complete them with identical results. The only complication is that each node must wait a few milliseconds before running the latest transaction to allow for recieving prior transactions from other nodes. The order in which transactions run is identified by timestamps and node ids. This allows for identical order of execution on all nodes and is responsible for consistent results.
In case of transactions that span multiple sites and involve changes that affect other transactions (i.e. The order in which they execute in relation to other transactions matter), one way to achieve consistency could be locking the data sources for the duration of the transaction. The HStore uses another method – each worker node recieves its portion of the transaction from a coordinator. If there are no conflicting transactions with lower timestamps, the transaction runs and the worker sends the coordinator an “ok”, otherwise the worker aborts and notifies the coordinator. The transaction failed and its up to the application to recover from this. Of course, some undo should be used to rollback the successfull nodes.
The coordinator monitors the number of aborts and if there are too many unsuccessfull transactions, it starts waiting longer between the time a transaction arrives at a node until the node attempts to run it. If there are still too many failures, a more advanced strategy of aborting is used. In short, this is a very optimistic database where failure is prefered to locking.
I’ll skip the part where a modified TPC-C proves that HStore is much faster than a traditional RDBMS tuned for 3 days by an expert. We all know that all benchmarks are institutionalized cheating.
What do I think of this database?
I hope that in the next few month I’ll add few more posts reviewing futuristic systems. I enjoy keeping in touch with industry trends and cutting-edge ideas.
TPC-Hベンチマークの続き、
db_file_multiblock_read_count=64
db_file_multiblock_read_count=128
db_file_multiblock_read_count=256
db_file_multiblock_read_count=512
db_file_multiblock_read_count=1024
db_file_multiblock_read_count=128が一番成績がよい。
SSDの応答時間(殆んど一定の応答時間)
| db_file_multiblock_read_count | x 8K | SSD応答時間(ミリ秒) |
| 64 | 512K | 3 |
| 128 | 1024K | 6 |
| 256 | 2048K | 10 |
| 512 | 4096K | 16 |
| 1024 | 8192K | 25 |
そして、、、、
db_file_multiblock_read_count > 128 でTemporary Tablespaceへの書き出しが始まった。
pga_aggregate_target=5Gでmemory_targetは使用していない。
書き出し量はdb_file_multiblock_read_countに比例して増える。
この地道な調査で、今後のチューニングの切り札DB_BLOCK_SIZEを割り出す方法を模索している。
いい加減にすると無駄なTemporaryアクセスが発生し、ノード間Parallel Queryの苦手パターンを踏むことになる。
最後に、
db_block_size=8Kでは1024以上はサポートされていない動きだった。
8MBが限界なのか?1024が限界なのか?
db_file_multiblock_read_count=64ではTPC-Hが安定しなかった。
しかし、ディスク転送量は430MB/sを軽く上回り現時点では最高値を記録した。
direct path readのベンチマークは「ディスク転送量」だけでは計れない「+アルファ」がある。
TPC-Hベンチマークの続き
前回のSQLを見てみると、やはりTPC-H。TPC-Cとはぜんぜん違う。
実行結果もたくさん返される:
TPC-Cのときは:
ネットワークバッファを広げてみる。
sqlnet.ora
DEFAULT_SDU_SIZE=32768
RECV_BUF_SIZE=524288
SEND_BUF_SIZE=524288
セッション・データ・ユニット(session data unit: SDU)
Oracle Netがネットワーク間でデータを転送する前にデータを配置するバッファ。Oracle Netがバッファ内のデータを送信するのは、データ送信が要求されたとき、またはバッファがデータでいっぱいになったときである。
hammeroraでTPC-Hベンチマークを再び行い、ディスク転送量をみる:
前回のほぼ最大が400MB/sから安定した420MB/s強に改善された。5%アップだ。
もう少し突っ込んでIPC接続でもテストをした(tnsnames.oraにORACLE2を追加):
ORACLE =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = oracle)
)
)
LISTENER_ORACLE =
(ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521))
ORACLR_CONNECTION_DATA =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1521))
)
(CONNECT_DATA =
(SID = CLRExtProc)
(PRESENTATION = RO)
)
)
ORACLE2 =
(DESCRIPTION=
(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = oracle)
)
)
5%→7%にUpした。
しかし、IPC接続は使用環境が限定されるため、今回は5%止まりとする。
最後に、
IPCの結果がTCPとそれほど変わらなかった。
UNIX環境ではIPCはOSの機能だが、Windows環境はOracleによるシュミレーション機能だ。
したがって、、、、。
Recent comments
17 weeks 3 days ago
27 weeks 2 days ago
29 weeks 1 hour ago
32 weeks 1 day ago
34 weeks 3 days ago
44 weeks 2 hours ago
45 weeks 3 days ago
46 weeks 3 days ago
46 weeks 5 days ago
49 weeks 3 days ago