Over the past few weeks Oracle Mix had opened the Oracle OpenWorld 2011 Suggest-a-Session to the general public where anyone could submit or vote on a session. One limitation of the Oracle Mix site was that it was impossible to sort the sessions by votes but that challenge was tackled by Roel Hartman with his blog post and APEX demo. After seeing the top session by votes, it was very interesting to me that around half of the top 15 sessions were all from the same author. That got me thinking…and that thinking turned into a little data hacking project that I embarked on. Now I admit it, I think data is very cool, and even cooler is extracting patterns and neat information from data.
The Oracle Mix site is very “crawler friendly” — it has well defined code and tags which made extracting the data fairly painless. The basic process I used came down to this:
I did all of that with curl, wget and some basic regex as a “version 1″ but was hoping to go back and try it again using some more sexy technology like Beautiful Soup. That will have to be continued…
With Oracle Mix Suggest-a-Session, people generally vote for a session for one of two reasons:
What I think is interesting to know is just how much of the voting is done because of #2. After all, Oracle Mix is a social networking site so there certainly is some voting for that reason. In fact, one of the session authors, Yury Velikanov from Pythian, even blogged his story of rounding up votes. The data shows us this, but more on that in just a bit…
The (Unofficial) Data
I took some time to mingle around the data and found some very interesting things. Let’s just start with a few high level points:
Here are some interesting tidbits I extracted from the data set (apologize for not making a cool visualization chart of all this – but I’ll make up for it later):
-- top 10 sessions by total votes: +-------------+-----------------+--------------------------------------------------------------------------------+ | total_votes | session_author | title | +-------------+-----------------+--------------------------------------------------------------------------------+ | 167 | tariq farooq | Oracle RAC Interview Q/A Interactive Competition | | 156 | tariq farooq | Database Performance Tuning: Getting the best out of Oracle Enterprise Manager | | 137 | tariq farooq | Overview & Implementation of Clustering & High Availability with Oracle VM | | 130 | tariq farooq | Migrate Your Online Oracle Database to RAC Using Streams and Data Pump | | 127 | tariq farooq | 360 Degrees - Achieving High Availability for Enterprise Manager Grid Control | | 126 | yury velikanov | Oracle11G SCAN: Sharing successful implementation experience | | 124 | sandip nikumbha | Accelerated Interface Development Approach - Integration Framework | | 123 | tariq farooq | Oracle VM: Overview, Architecture, Deployment Planning & Demo/Exercise | | 123 | sandip nikumbha | Remote SOA - Siebel Local Web Services Implementation | | 119 | yury velikanov | AWR Performance data mining | +-------------+-----------------+--------------------------------------------------------------------------------+ -- top 10 voters (who place the most votes) +--------------------+--------------+ | voter_name | votes_placed | +--------------------+--------------+ | arup nanda | 53 | | tariq farooq | 43 | | connie cservenyak | 36 | | xiaohuan xue | 36 | | bruce elliott | 36 | | peter khoury | 35 | | yugant patra | 35 | | balamohan manickam | 35 | | suresh kuna | 34 | | eddie awad | 34 | +--------------------+--------------+ -- top 10 voters by unique session authors (how many unique authors did they vote for?) +--------------------+----------------+ | name | unique_authors | +--------------------+----------------+ | arup nanda | 28 | | paul guerin | 24 | | eddie awad | 24 | | bruce elliott | 23 | | xiaohuan xue | 23 | | connie cservenyak | 23 | | peter khoury | 22 | | wai ling ng | 22 | | yugant patra | 22 | | balamohan manickam | 22 | +--------------------+----------------+ -- top 10 session authors by total votes received, number of sessions, avg votes per session +---------------------+-------------+----------+-----------------------+ | session_author | total_votes | sessions | avg_votes_per_session | +---------------------+-------------+----------+-----------------------+ | tariq farooq | 1057 | 8 | 132.1250 | | yury velikanov | 557 | 5 | 111.4000 | | alex gorbachev | 429 | 6 | 71.5000 | | sandip nikumbha | 360 | 3 | 120.0000 | | syed jaffar hussain | 281 | 4 | 70.2500 | | kristina troutman | 233 | 5 | 46.6000 | | russell tront | 221 | 3 | 73.6667 | | wendy chen | 217 | 3 | 72.3333 | | asif momen | 184 | 2 | 92.0000 | | alison coombe | 183 | 5 | 36.6000 | +---------------------+-------------+----------+-----------------------+
I could not help noticing that Tariq Farooq had the top 5 spots by total vote count. I would assert that is related to these two points:
I have no doubt there there is some of both in the mix, but just how much influence on the votes is there from a person’s circle of friends? Or to put another way: How many voters only voted for a single session author? Or even more interesting, how many people voted for every session for a single author, and voted for no other sessions? All good questions…with answers that reside in the data!
-- number of users who voted for exactly one author +---------------------------+ | users_voting_for_1_author | +---------------------------+ | 828 | +---------------------------+ -- number of voters who voted for every session by a given author -- and total # of votes per voter is the same # as sessions by an author +-------------------------------------------------+ | users_who_voted_for_every_session_of_an_author | +-------------------------------------------------+ | 826 | +-------------------------------------------------+
Wow – now that interesting! Of people only voting for a single session author, just two of them did not vote for every one of that author’s sessions. That’s community for you!
I was very interested to see what the Mix Voting Graph looked liked, so I imported the voting data into Gephi and rendered a network graph. What I was in search of was to identify the community structure of the voting community. Gephi lets you do this by partitioning the graph into modularity classes so that the communities become visible. This process is similar to how the LinkedIn InMap breaks your professional network into different communities.
Here is what the Oracle Mix voting community looks like:
/> 
This is a great visualization of the communities and it accentuates the data from above – the voters who only voted for a single author. This can be seen by the small nodes on the outer part of the graph that have just a single edge between it and the session author’s node. Good examples of this are for Yury Velikanov and Tariq Farooq. Also clearly visible is what I’d refer to the “Pythian and friends” community centered around Alex Gorbachev and Yury Velikanov in the darker green color. There are also several other distinct communities and the color coding makes that visible.
This is my first real data hacking attempt with web data and using some of the tools like Gephi for the graph analysis. One of my inspirations was Neil Kodner‘s Hadoop World 2010 Tweet Analysis, so I need to give a big shout out to Neil for that as well as his help with Gephi. Thanks Neil!
So what are people’s sessions about that were submitted? This Wordle says quite a bit.
If you wish to play on you own: https://github.com/grahn/oow-vote-hacking
Here is another graph where the edges are weighed according to votes to an author (obviously related to number of sessions for that author).
/> 
Recording courtesy of the people at Formspider.
The topic is about tuning a two-table join, and the video is about an hour long (so there’s a lot more that could be said on the topic, and one day I might do a half day “masterclass” on it).
Here’s a video of the presentation that I did as the keynote for the Turkish Oracle Users’ Group conference earlier on this year. Recording courtesy of the people at Formspider.
The topic is about tuning a two-table join, and the video is about an hour long (so there’s a lot more that could be said on the topic, and one day I might do a half day “masterclass” on it).
UPDATE: July 12, 2011 I just noticed (!) that the DBA_HIST_EVENT_HISTOGRAM is only on 11g, so for 10g, it requires collecting the histogram deltas manually off of V$EVENT_HISTOGRAM, something like
set pagesize 150
col event format a25
col tm format a14
select event,
to_char(sysdate,'J')||to_char(round(sysdate-trunc(sysdate),5)) tm,
wait_time_milli,
wait_count
from v$event_histogram
where event in ('log file parallel write' ,
'db file scattered read' ,
'db file sequential read' )
order by event;and collecting this information every N seconds and taking the deltas. Also if you want the buckets in columns instead of rows:
select event, to_char(sysdate,'J')||to_char(round(sysdate-trunc(sysdate),5)) tm, sum (case when WAIT_TIME_MILLI=1 then WAIT_COUNT else 0 end) b1, sum (case when WAIT_TIME_MILLI=2 then WAIT_COUNT else 0 end) b2, sum (case when WAIT_TIME_MILLI=4 then WAIT_COUNT else 0 end) b3, sum (case when WAIT_TIME_MILLI=8 then WAIT_COUNT else 0 end) b4, sum (case when WAIT_TIME_MILLI=16 then WAIT_COUNT else 0 end) b5, sum (case when WAIT_TIME_MILLI=32 then WAIT_COUNT else 0 end) b6, sum (case when WAIT_TIME_MILLI=64 then WAIT_COUNT else 0 end) b7, sum (case when WAIT_TIME_MILLI=128 then WAIT_COUNT else 0 end) b8, sum (case when WAIT_TIME_MILLI=256 then WAIT_COUNT else 0 end) b9, sum (case when WAIT_TIME_MILLI=512 then WAIT_COUNT else 0 end) b10, sum (case when WAIT_TIME_MILLI=1024 then WAIT_COUNT else 0 end) b11, sum (case when WAIT_TIME_MILLI=2048 then WAIT_COUNT else 0 end) b12, sum (case when WAIT_TIME_MILLI=4096 then WAIT_COUNT else 0 end) b13, sum (case when WAIT_TIME_MILLI=8192 then WAIT_COUNT else 0 end) b14, sum (case when WAIT_TIME_MILLI=16384 then WAIT_COUNT else 0 end) b15, sum (case when WAIT_TIME_MILLI=32768 then WAIT_COUNT else 0 end) b16, sum (case when WAIT_TIME_MILLI=65536 then WAIT_COUNT else 0 end) b17, sum (case when WAIT_TIME_MILLI=131072 then WAIT_COUNT else 0 end) b18, sum (case when WAIT_TIME_MILLI=262144 then WAIT_COUNT else 0 end) b19, sum (case when WAIT_TIME_MILLI=524288 then WAIT_COUNT else 0 end) b20, sum (case when WAIT_TIME_MILLI=1048576 then WAIT_COUNT else 0 end) b21, sum (case when WAIT_TIME_MILLI=2097152 then WAIT_COUNT else 0 end) b22, sum (case when WAIT_TIME_MILLI=4194304 then WAIT_COUNT else 0 end) b23, sum (case when WAIT_TIME_MILLI > 4194304 then WAIT_COUNT else 0 end) b24 from v$event_histogram where event='log file parallel write' group by event,to_char(sysdate,'J')||to_char(round(sysdate-trunc(sysdate),5)) ;
Exploring I/O histogram values in AWR. In the AWR repository I have several databases, all reading off the same LUNs, so below I try to take the deltas between each snapshot and sum up all the waits for each bucket in the histogram
set pagesize 50000
set feedback off
set echo off
SET markup HTML on
spool hist.html
with hist as (
select
sn.snap_id,
sn.dbid,
to_char(trunc(cast(begin_interval_time as date ))+
(ROUND ((cast(begin_interval_time as date) -
TRUNC (cast(begin_interval_time as date))) * 24) / 24),
'YYYY/MM/DD HH24:MI') btime,
h.event_name,
h.wait_time_milli,
h.wait_count
from dba_hist_event_histogram h,
dba_hist_snapshot sn
where
h.instance_number = 1
and sn.instance_number = 1
and h.event_name like 'db file seq%'
and sn.snap_id=h.snap_id
and sn.dbid=h.dbid
)
select a.btime,
a.wait_time_milli,
sum(b.wait_count - a.wait_count)
from hist a,
hist b
where a.dbid=b.dbid
and a.snap_id=b.snap_id-1
and a.wait_time_milli = b.wait_time_milli
group by a.btime, a.wait_time_milli
having sum(b.wait_count - a.wait_count) > 0;
spool off
SET markup HTML offWith the HTML output it’s easy to read into excel.
The data can be graphed in Excel using pivot tables or pivot charts, but if I simply save the data in an excel worksheet, then I can open it up in Tableau, which in some circumstances, can be easier to use than excel
The Y axis is in log scale. I don’t find that the data speaks to me immediately (other than there are some seriously slow I/Os) but it is easy to see the major outliers.
UPDATE:
One advantage of Tableau is the easy color coordination. Using the “cyclic” pallet, the colors come out pretty good for the I/O histograms (this would be time consuming on Excel)
I had to laugh when I read this story about Amazon Web Services. It’s posted with an attention grabbing title that implies this is an Amazon problem, but it is squarely down to user error/oversight.
Luckily I’ve not fallen into this trap yet, but I have done equally silly things in the past. That reminds me, I must go to a Pete Finnigan session next time we are at the same conference…
Cheers
Tim…
Someone asked me a question about generating HTML with embedded images from PL/SQL and my answer started to spiral out of control, so I figured I’d just write it as an article.
Cheers
Tim…
Update: nice query from Jonathan Lewis on AWR:
http://jonathanlewis.wordpress.com/awr-sys-stats/
I found this while looking for how to eliminate reporting stats over database restarts. Apparently if using snap_id then there will be no snap_id-1 over restarts because the ids will jump more than one across restarts – have to check into this
Exploring ways to mine data in AWR.
One situation I’m looking at now is a number of database that all use the same LUNs on the back end storage, so two metrics that would be interesting to look at are IOs from these databases and the IO latencies to answer the question, “does read and/or write I/O activity correlate to increased latency on the backend storage?”
I have multiple databases in one AWR repository.
I want to create the output in a format that I can read with something like Excel, so I make the values comma delimited,
AWRs values for sysmetric are hourly, so I want to round the times to the nearest hour to make correlation easier and I want to translate the DBID into the database name to make reading the data easier. I could do something like:
select
n.db_name||'_'||n.host_name||'_'||
decode(metric_name,
'Physical Write Total Bytes Per Sec','write','read')||','||
to_char(trunc(begin_time)+
( ROUND ((begin_time - TRUNC (begin_time)) * 24) / 24),
'YYYY/MM/DD HH24:MI')||','||
average
from dba_hist_sysmetric_summary s,
(select distinct dbid, db_name, host_name
from dba_hist_database_instance) n
where
( s.metric_name= 'Physical Write Total Bytes Per Sec'
or
s.metric_name= 'Physical Read Total Bytes Per Sec'
) and n.dbid=s.dbid
order by begin_time;For the I/O latencies it take some more work as I have to compute the deltas between snapshots getting the total elasped time and the count of waits to compute the average wait over the snapshot period for each I/O wait event. I also have to watch out for database bounces, which I have inelligantly addressed by simply filtering only for waits less than or equal to 0. I could do something like
select
btime||','||
n.db_name||'_'||n.host_name||'_'||
event_name||','||
nvl(round((time_ms_end-time_ms_beg)/nullif(count_end-count_beg,0),3),0) avg_ms
from (
select
e.event_name,
e.dbid,
to_char(trunc(cast(begin_interval_time as date ))+
(ROUND ((cast(begin_interval_time as date)- TRUNC (cast(begin_interval_time as date))) * 24) / 24),
'YYYY/MM/DD HH24:MI') btime,
Lag (e.total_waits) OVER( PARTITION BY e.event_name, e.dbid ORDER BY s.snap_id)
count_beg,
total_waits count_end,
Lag (e.time_waited_micro/1000) OVER( PARTITION BY e.event_name,e.dbid ORDER BY s.snap_id)
time_ms_beg,
time_waited_micro/1000 time_ms_end
from
DBA_HIST_SYSTEM_EVENT e,
DBA_HIST_SNAPSHOT s
where
s.snap_id=e.snap_id and
s.dbid=e.dbid and
(e.event_name= 'db file sequential read'
or e.event_name='db file scattered read' )
order by begin_interval_time
) s,
(select distinct dbid, db_name, host_name from dba_hist_database_instance) n
where
n.dbid=s.dbid
and
round((time_ms_end-time_ms_beg)/nullif(count_end-count_beg,0),3) >= 0
order by btime;Correlating the read/write throughput along with single and multiblock read latencies across multiple database over multiple days can pose challenges. One strategy is to boil the data down to a smaller number of metrics like total read, total write to average single block I/O. For average single block I/O, I’d want to normalize the I/O latency relative to read I/O for that database. Maybe I could get an average of latencies weighted by throughput of that database
Work in progress …
A question came up on Oracle-L recently about the difference in work done by the following two queries:
SELECT /*+ RULE */ DOM_NAME FROM DOMAINS, TABLE(CAST(:B1 AS DOMAIN_LIST)) DL WHERE DOM_NAME = DL.COLUMN_VALUE ; SELECT DOM_NAME FROM DOMAINS WHERE DOM_NAME IN ( SELECT COLUMN_VALUE FROM TABLE(CAST(:B1 AS DOMAIN_LIST)) ) ;
Before saying anything else, I should point out that these two queries are NOT logically equivalent unless you can guarantee that the table() operator returns a unique set of values – and Oracle doesn’t allow uniqueness to be enforced on collections.
However, the author of the question had ensured that the results from the table() operator contained no duplicates so the two versions of the query returned the same result set. The question is, why does one query do fewer buffer visits than the other – as evidenced by the results from sql_trace and tkprof.
Rule-based Join =============== call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.00 0 0 0 0 Execute 1 0.01 0.01 0 0 0 0 Fetch 1 0.86 0.86 0 200047 0 115195 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 3 0.87 0.87 0 200047 0 115195 Rows Row Source Operation ------- --------------------------------------------------- 115195 NESTED LOOPS (cr=200047 pr=0 pw=0 time=6355 us) 99704 COLLECTION ITERATOR PICKLER FETCH (cr=0 pr=0 pw=0 time=284 us) 115195 INDEX RANGE SCAN DOM_NAME_IDX (cr=200047 pr=0 pw=0 time=3191 us)(object id 54309) CBO with subquery ================= call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.00 0 0 0 0 Execute 1 0.01 0.01 0 0 0 0 Fetch 1 0.78 0.78 0 157986 0 115195 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 3 0.80 0.80 0 157986 0 115195 Rows Row Source Operation ------- --------------------------------------------------- 115195 NESTED LOOPS (cr=157986 pr=0 pw=0 time=4720 us cost=536 size=6141 card=267) 99704 SORT UNIQUE (cr=0 pr=0 pw=0 time=417 us) 99704 COLLECTION ITERATOR PICKLER FETCH (cr=0 pr=0 pw=0 time=259 us) 115195 INDEX RANGE SCAN DOM_NAME_IDX (cr=157986 pr=0 pw=0 time=3353 us cost=2 size=21 card=1)(object id 54309)
Notice how the optimizer has unnested the subquery and introduced a “sort unique” – this demonstrates the optimizer’s recognition that the two queries are not logically identical. However, the number of rows from the collection is the same (99,704) and the number of rows after joining is the same (115,195) – the data sets have been rigged so that it is a fair comparison. So why does the explicit join take 200,047 buffer visits when the transformed subquery approach take only 157,986 buffer visits.
The answer is that you can visit buffers without doing “logical I/Os” – and this benefit accrues most frequently to indexed access paths. Thanks to the way Oracle has sorted the collection before doing the join the nested loop follows the second index (dom_name_idx) in order – this increases the chance that index blocks that were used in the previous cycle of the loop will be re-used in the next cycle, which means that Oracle need not release the pins on the index blocks, but can revisit them incrementing the statitsic: “buffer is pinned count”.
I have to say that I haven’t proved that this is exactly what was happening, but when I suggested to the person who had asked the question that he re-run and check to see if the decrease in “session logical reads” was balanced by an increase in “buffer is pinned count” the answer seemed to confirm the hypothesis. [Correction (see comment #1 from Randolf): the results didn't confirm the hypothesis - but I think that' s because of the change in "pinning" that Randolf describes, so I'll have to find a way to confirm the hypothesis some other time.]
Just letting you know that we are all (almost) done with the Expert Oracle Exadata book work and it will be published in less than a month!!!
Expect (many) new blog entries soon! :-)
Recent comments
17 weeks 3 days ago
27 weeks 1 day ago
28 weeks 6 days ago
32 weeks 22 hours ago
34 weeks 2 days ago
43 weeks 6 days ago
45 weeks 3 days ago
46 weeks 3 days ago
46 weeks 4 days ago
49 weeks 2 days ago