May 26, 2011 In a previous article I showed a simple method to convert values found in multiple table rows into a comma delimited list. The method works very well, until the situation in which the approach is to be used becomes a bit more complex. Assume for a moment that the following table structure [...]![]()
I researched an interesting new feature available with Oracle 11g R2, the so called RAC FAN API when writing the workload management chapter for the RAC book. The RAC FAN API is documented in Oracle® Database JDBC Developer’s Guide, 11g Release 2 (11.2) available online, but when it came to the initial documentation following the 11.2.0.1 release on Linux it was pretty useless. The good news is that it improved!
The RAC FAN Java API
The aim of this API is to allow a Java application to listen to FAN events by creating a subscription to the RAC nodes’ ONS processes. The application then registers a FANListener, based on the subscription, which can pick up instances of the following events:
All of these are in the oracle.simplefan namespace, the javadoc reference of which you can find in the official documenation.
When it was initially released in 11.2.0.1 I tried to get the FAN subscription to work without any luck, the registration with the ONS didn’t work. Schematically, this is how it should work:
The application (shown on the top) requires an Oracle client for the ONS configuration and libraries, and the setup is very similar to how you’d set up FCF. With the ons.config file pointing to the RAC nodes’ ONS processes you should be able to read FAN events as they occur on the source system. The ons.config in $CLIENT_HOME/opmn/conf/ons.config has to contain the local and remote ONS port, as well as all the RAC nodes with their remote ONS port. When subscribing to the events, you specify the service name you’d like to receive events for.
My 4 node RAC cluster is comprised of node rac11gr2drnode{1,2,3}, with local ONS listing on port 6200 for external, and port 6100 for internal requests. This can be checked using the onsctl debug command on any cluster node.. The client’s ons.config contains these lines:
localport:6100
remoteport:6200
nodes=rac11gr2drnode1:6200, rac11gr2drnode2:6200,rac11gr2drnode3:6200, rac11gr2drnode4:6200
Unfortunately I couldn’t subscribe to the remote ONS-I know that my setup is valid because of my UCP/FCF testing I did previously (check my earlier blog postings about how to get started with UCP and Tomcat 6). Or maybe it doesn’t work on Windows, who knows? Trying to subscribe to the remote ONS my application bails out with the following error:
Exception in thread “main” oracle.ons.SubscriptionException: Subscription request timed out after 30000 millseconds. Possible causes: OPMN may not be running, you may have an OPMN running in an alternate ORACLE_HOME using duplicate port values, or OPMN may be misconfigured.
at oracle.ons.SubscriptionNotification.waitForReply(SubscriptionNotification.java:83)
at oracle.ons.ONS.addSubscriber(ONS.java:956)
at oracle.ons.Subscriber.realStartup(Subscriber.java:103)
at oracle.ons.Subscriber.
at oracle.simplefan.impl.FanSubscription.subscribe(FanSubscription.java:228)
at oracle.simplefan.impl.FanSubscription.createFanSubscription(FanSubscription.java:46)
at oracle.simplefan.impl.FanManager.subscribe(FanManager.java:120)
at de.mbh.TestClass.
at de.mbh.TestClass.main(TestClass.java:53)
I tried my favourite search engine on the Exception but couldn’t find any useful information. This is not a big deal for me, I can start a local ONS on the client. This is shown here:
C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl ping ons is not running ... C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl start onsctl start: ons started C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl debug HTTP/1.1 200 OK Content-Length: 3627 Content-Type: text/html Response: == client:6200 5872 11/05/26 11:47:16 == Home: c:\oracle\product\11.2.0\client_1 ======== ONS ======== IP ADDRESS PORT TIME SEQUENCE FLAGS --------------------------------------- ----- -------- -------- -------- 10.xxx.xx.98 6200 4dde2fb0 00000006 00000008 Listener: TYPE BIND ADDRESS PORT SOCKET -------- --------------------------------------- ----- ------ Local 127.0.0.1 6100 344 Remote any 6200 352 Servers: (3) INSTANCE NAME TIME SEQUENCE FLAGS DEFER ---------------------------------------- -------- -------- -------- ---------- dbInstance_rac11gr2drnode1_6200 4d53d6a4 00029496 00000002 0 10.xxx.xx.155 6200 dbInstance_rac11gr2drnode2_6200 4d3eb9e2 0006cbeb 00000002 0 10.xxx.xx157 6200 dbInstance_rac11gr2drnode3_6200 4d5937d6 0001990f 00000002 0 10.xxx.xx.158 6200 Connection Topology: (4) IP PORT VERS TIME --------------------------------------- ----- ----- -------- 10.xxx.xx.158 6200 4 4dde2fb2 ** 10.xxx.xx.155 6200 ** 10.xxx.xx.157 6200 ** 10.xxx.xx.98 6200 10.xxx.xx.157 6200 4 4dde2fb2 ** 10.xxx.xx.155 6200 ** 10.xxx.xx.158 6200 ** 10.xxx.xx.98 6200 10.xxx.xx.155 6200 4 4dde2fb2 ** 10.xxx.xx.98 6200 ** 10.xxx.xx.157 6200 ** 10.xxx.xx.158 6200 10.xxx.xx.98 6200 4 4dde2fb2= ** 10.xxx.xx.155 6200 ** 10.xxx.xx.158 6200 ** 10.xxx.xx.157 6200 Server connections: ID CONNECTION ADDRESS PORT FLAGS SENDQ REF WSAQ -------- --------------------------------------- ----- ------ ----- --- ---- 0 10.xxx.xx.155 6200 010405 00000 001 --- 1 10.xxx.xx.156 6200 002405 00000 001 2 10.xxx.xx.157 6200 010405 00000 001 -- 3 10.xxx.xx.158 6200 010405 00000 001 --- Client connections: ID CONNECTION ADDRESS PORT FLAGS SENDQ REF SUB W -------- --------------------------------------- ----- ------ ----- --- --- - 4 internal 0 01008a 00000 001 002 request 127.0.0.1 6100 03201a 00000 001 000 Worker Ticket: 28/28, Last: 11/05/26 11:47:15 THREAD FLAGS -------- -------- 120 00000012 124 00000012 128 00000012 Resources: Notifications: Received: Total 12 (Internal 6), in Receive Q: 0 Processed: Total 12, in Process Q: 0 Pool Counts: Message: 1, Link: 1, Ack: 1, Match: 1 C:\oracle\product\11.2.0\client_1\opmn\bin>
With a local ONS started on my client, I can actually subscribe to the ONS and make use of the events. The easiest way is to simply decode the load balancing events, as I did in my code, shown below (modified version of the code in the Oracle documentation to make it work):
package de.mbh;
import oracle.simplefan.FanSubscription;
import oracle.simplefan.FanEventListener;
import oracle.simplefan.FanManager;
import oracle.simplefan.LoadAdvisoryEvent;
import oracle.simplefan.NodeDownEvent;
import oracle.simplefan.ServiceDownEvent;
import java.util.Properties;
public class TestClass {
TestClass() {
System.out.println("Hello");
Properties p = new Properties();
p.put("serviceName", "OEMSRV");
System.setProperty("oracle.ons.oraclehome", "c:\\oracle\\product\\11.2.0\\client_1");
System.out.println(System.getProperty("oracle.ons.oraclehome"));
FanSubscription sub = FanManager.getInstance().subscribe(p);
System.out.println("I'm subscribed!");
sub.addListener(new FanEventListener() {
public void handleEvent(ServiceDownEvent arg0) {
System.out.println("Service Down registered!");
}
public void handleEvent(NodeDownEvent arg0) {
System.out.println("Node Down Event Registered");
}
public void handleEvent(LoadAdvisoryEvent arg0) {
System.out.println("Just got a Load Advisory event");
System.out.println("originating database: " + arg0.getDatabaseUniqueName());
System.out.println("originating instance: " + arg0.getInstanceName());
System.out.println("Service Quality : " + arg0.getServiceQuality());
System.out.println("Percent : " + arg0.getPercent());
System.out.println("Service Name : " + arg0.getServiceName());
System.out.println("Service Quality : " + arg0.getServiceQuality());
System.out.println("Observed at : " + arg0.getTimestamp() + "\n\n");
} } );
}
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
TestClass tc = new TestClass();
int i = 0;
while ( i < 100000) {
try {
Thread.sleep(100);
i++;
} catch (Exception e) {
System.out.println(e);
}
}
System.out.println("execution ended");
}
}
When compiling or executing the code, you need the simplefan.jar and ons.jar files in your classpath, and chances are that you need ojdbc6.jar as well.
Starting the application reveals Load Balancing events being read:
Hello c:\oracle\product\11.2.0\client_1 I'm subscribed! Just got a Load Advisory event originating database: LNGDS1RD originating instance: dbInstance_rac11gr2drnode3_6200 Service Quality : -1 Percent : 51 Service Name : OEMSRV Service Quality : -1 Observed at : Thu May 26 11:56:03 BST 2011 Just got a Load Advisory event originating database: LNGDS1RD originating instance: dbInstance_rac11gr2drnode3_6200 Service Quality : -1 Percent : 49 Service Name : OEMSRV Service Quality : -1 Observed at : Thu May 26 11:56:03 BST 2011
This proves that you can write your own connection cache, which allows you to react to *down events and rebalance your session differently.
Here’s a little gem from the OTN database forum – why should the CPU usage of querying an “interval partitioned” table depend on the size of something that’s missing ?
(The most significant symptom is identified in a reply I wrote to the thread – but that’s only a step in the right direction).
It’s been nearly six months since I made the switch from CentOS to Fedora as my main desktop OS.
The Fedora 15 final release dropped a couple of days ago and I slapped it on my main desktop PC straight away. As usual, the first impression is all about the visuals. GNOME 3 looks great. I’m sure lots of people think KDE looks great too, but I tend to just stick with the default window manager, so it’s GNOME for me.
If you’ve read any of the press you will know that the menu bar and task bar have been removed. This is kinda weird at first. How does it affect me?
The fancy visuals worked straight out of the box for my main desktop machine, but one of my other machines (with a better graphics card) couldn’t handle GNOME 3 and ran using fallback mode. Fallback mode is pretty much like previous GNOME releases with a menu bar and task bar. I’m sure some people will prefer fallback mode, but I think the new stuff is certainly worth a try.
If you really can’t handle the new interface you can manually switch to fallback mode. Start up the System Info dialog (Activities > Applications > System Settings > System Info), click “Graphics” , flick the “Forced Fallback Mode” switch and relog.
As for the OS itself, I’ve had no dramas so far, but it is early days. Time will tell…
By the way, I did the usual Oracle on Fedora thing.
Cheers
Tim…
UPDATE July 9, 2011: README.TXT 10046.pl
Often when I have a 10046 trace file, especially when looking at I/O issues, I want a histogram of I/O response time. To get I/O response time I’ve hacked out incomple awk scripts from time to time, always meaning to write a more complete one, well now I don’t have to. It’s already been done!
Here is a cool perl script from Clive Bostock: 10046.pl
(also checkout orasrp which produces a more indepth report in HTML. I like both. I like 10046.pl as a short easy portable script that I can modify, whereas orasrp is a binary and only works on some ports)
For example, if I trace a session with 10046, and retrieve the tracefile, then I can run:
$ 10046.pl -t mytrace.trc
and it will output a header and three sections
Header
This looks like
Header
Trace file mytrace.trc Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options ORACLE_HOME = /u02/oracle System name: SunOS Node name: toto Release: 5.10 Version: Generic_142900-12 Machine: sun4u Instance name: orcl Redo thread mounted by this instance: 1 Oracle process number: 177 Unix process pid: 16553, image: oracle@toto Trace input file : mytrace.trc
Wait summary
EVENT AGGREGATES
================
Wait Event Count Elapsed(ms) Avg Ela (ms) %Total
~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~
db file sequential read 2715 11593 4 3.74
direct path read 4484 4506 1 1.45
db file scattered read 141 898 6 0.29
log file sync 3 8 2 0.00
~~~~~~~~~~~~
Total Elapsed: 309821Wait Summary by object
Object Id : Wait Event Count Tot Ela (ms) %Total Avg Ela (ms) ~~~~~~~~~~ : ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~~ 28581 : direct path read 4484 4506 1.45 1 1756433 : db file sequential read 725 1891 0.61 2 764699 : db file sequential read 332 1762 0.57 5 37840 : db file sequential read 200 1044 0.34 5 38018 : db file sequential read 108 1009 0.33 9 81596 : db file scattered read 140 887 0.29 6
wait histogram by object
EVENT HISTOGRAM BREAKDOWN =========================== This section splits the event counts into elapsed time buckets so that we can see if there are any suspiciousn or anomalous response time / frequency patterns. Object Id : Wait Event <1ms <2ms <4ms <8ms <16ms <32ms <64ms <128ms <256ms <512ms >=1024ms ~~~~~~~~~ : ~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~~ ~~~~~ ~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~~~ 28581 : direct path read 7680 87 148 221 144 40 4 0 0 0 0 1756433 : db file sequential read 606 268 45 35 66 6 2 0 0 0 0 764699 : db file sequential read 74 119 11 78 78 9 0 0 0 0 0 37840 : db file sequential read 50 72 6 45 47 5 0 0 0 0 0 38018 : db file sequential read 12 38 7 10 30 12 5 0 0 0 0 81596 : db file scattered read 64 4 13 62 18 8 3 0 0 0 0 41995 : db file sequential read 20 39 0 7 16 8 4 0 0 0 0 108718 : db file sequential read 74 54 5 12 24 4 0 0 0 0 0 33490 : db file sequential read 0 5 11 25 19 4 0 0 0 0 0
Here’s one of those funny little details that can cause confusion:
SQL> select * from user_audit_object;
no rows selected
SQL> audit select on indjoin by session whenever successful;
Audit succeeded.
SQL> select
2 count(*)
3 from
4 indjoin ij
5 where
6 id between 100 and 200
7 and val between 50 and 150
8 ;
COUNT(*)
----------
51
1 row selected.
SQL> select * from user_audit_object where obj_name = 'INDJOIN';
no rows selected
So we had nothing in our “object audit trail”, then we enabled auditing on one particular table for select statements (and audit_trail = db has been set in the parameter file) but our select statement hasn’t been audited. What went wrong ?
Let’s try a different query against the audit view which, until a moment ago, held no data:
SQL> select obj_name from user_audit_object; OBJ_NAME ----------------------------------------------- index$_join$_001
So we have audited something – but what ? Here’s the execution plan for our query:
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 8 | 5 (20)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 8 | | |
|* 2 | VIEW | index$_join$_001 | 3 | 24 | 5 (20)| 00:00:01 |
|* 3 | HASH JOIN | | | | | |
|* 4 | INDEX RANGE SCAN| IJ_PK | 3 | 24 | 3 (34)| 00:00:01 |
|* 5 | INDEX RANGE SCAN| IJ_VAL | 3 | 24 | 3 (34)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("VAL"<=150 AND "ID"<=200 AND "ID">=100 AND "VAL">=50)
3 - access(ROWID=ROWID)
4 - access("ID">=100 AND "ID"<=200)
5 - access("VAL">=50 AND "VAL"<=150)
See that index$_join$_001 in line 2 ? We have a plan that uses the “index hash join” mechanism to query that table – so Oracle audits the query, but unfortunately uses the name of the internal view in the audit record.
Be very careful how you audit objects; you may think that an object has not been accessed when it has (sort of). If you’ve started doing this type of auditing to check whether or not an object is ever accessed you could be fooled.
You could claim, of course, that the object hasn’t really been accessed – but compare the index join above with the following (cut-n-paste) example, which I ran after deleting everything from the aud$ table:
SQL> connect test_user/test
Connected.
SQL> select obj_name from user_audit_object;
no rows selected
SQL> set autotrace on explain
SQL> select count(*) from indjoin where id = 23;
COUNT(*)
----------
1
1 row selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 689603510
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 1 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | INDEX UNIQUE SCAN| IJ_PK | 1 | 4 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ID"=23)
SQL> set autotrace off
SQL> select obj_name from user_audit_object;
OBJ_NAME
----------------
INDJOIN
1 row selected.
We haven’t actually visited the indjoin table, but the audit code recognises that that was the intent, even though the optimizer made physical access unnecessary. The behaviour ought to be the same for both cases.
(This example orginally came from a client site, by the way – with the question “how come the audit trail says we’re not visiting this table?”. The client version, and the version I tested on, was 10.2.0.3. There were a couple of other little surprises I found at the same time – I may find time to write about them some other day.)
On
http://dboptimizer.com/2011/04/27/envisioning-nfs-performance/
I showed some code that displayed the send and receive sizes and times over TCP on Solaris 10 with the dtrace command (see tcp.d for the code). I took this code on another machine and got errors like
“dtrace: error on enabled probe ID 29 (ID 5461: tcp:ip:tcp_input_data:receive): invalid alignment (0xffffff516eb5e83a) in predicate at DIF offset 136″
Not quite sure why this was happening but by a process of elimination I found that accessing args[4] in tcp:::receive caused these errors.
Fortunately many of the values in args[4] are found in args[3] as well.
To find arguments to tcp:::receive , first run the following command (or alternatively look the TCP probes up on the wiki at http://wikis.sun.com/display/DTrace/tcp+Provider)
( for command line flags check out http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?dtrace+7
i.e. list the verbose information about the probes named and don’t enable these probes, just list them)
$ dtrace -lvn tcp:::receive
5473 tcp ip tcp_output send
Argument Types
args[0]: pktinfo_t *
args[1]: csinfo_t *
args[2]: ipinfo_t *
args[3]: tcpsinfo_t *
args[4]: tcpinfo_t *(by the way, there are a number of probes that match tcp:::receive, but they all have the same arguments, I didn’t notice all these different tcp:::receive until I actually ran the above command. Before running the command I’d depended on the wiki. Now I’m wondering what the difference is between some of these tcp:::receive and tcp:::send probes )
After finding the args for a probe, you can look up the definition of the structs at http://cvs.opensolaris.org/source/
tcp:::send and tcp:::receive both have the same arguments
Looking up the structs at http://cvs.opensolaris.org/source/, I find the contents of the structs as follows:
tcpsinfo_t ( args[3] )
111 typedef struct tcpsinfo {
112 uintptr_t tcps_addr;
113 int tcps_local; /* is delivered locally, boolean */
114 int tcps_active; /* active open (from here), boolean */
115 uint16_t tcps_lport; /* local port */
116 uint16_t tcps_rport; /* remote port */
117 string tcps_laddr; /* local address, as a string */
118 string tcps_raddr; /* remote address, as a string */
119 int32_t tcps_state; /* TCP state */
120 uint32_t tcps_iss; /* Initial sequence # sent */
121 uint32_t tcps_suna; /* sequence # sent but unacked */
122 uint32_t tcps_snxt; /* next sequence # to send */
123 uint32_t tcps_rack; /* sequence # we have acked */
124 uint32_t tcps_rnxt; /* next sequence # expected */
125 uint32_t tcps_swnd; /* send window size */
126 int32_t tcps_snd_ws; /* send window scaling */
127 uint32_t tcps_rwnd; /* receive window size */
128 int32_t tcps_rcv_ws; /* receive window scaling */
129 uint32_t tcps_cwnd; /* congestion window */
130 uint32_t tcps_cwnd_ssthresh; /* threshold for congestion avoidance */
131 uint32_t tcps_sack_fack; /* SACK sequence # we have acked */
132 uint32_t tcps_sack_snxt; /* next SACK seq # for retransmission */
133 uint32_t tcps_rto; /* round-trip timeout, msec */
134 uint32_t tcps_mss; /* max segment size */
135 int tcps_retransmit; /* retransmit send event, boolean */tcpinfo_t (args[4])
95 typedef struct tcpinfo {
96 uint16_t tcp_sport; /* source port */
97 uint16_t tcp_dport; /* destination port */
98 uint32_t tcp_seq; /* sequence number */
99 uint32_t tcp_ack; /* acknowledgment number */
100 uint8_t tcp_offset; /* data offset, in bytes */
101 uint8_t tcp_flags; /* flags */
102 uint16_t tcp_window; /* window size */
103 uint16_t tcp_checksum; /* checksum */
104 uint16_t tcp_urgent; /* urgent data pointer */
105 tcph_t *tcp_hdr; /* raw TCP header */
106 } tcpinfo_t;In the script I accessed the TCP seq and ack in arg[4], which was giving me errors in tcp:::receive, so I just switch these for the equivalents in arg[3]. Now I’m not exactly clear on the equivalence, but it seems
The ack=rack seems solid as tcps_rack = “highest sequence number for which we have received and sent an acknowledgement”.
The seq=tcps_suna is less clear to me as tcps_suna = “lowest sequence number for which we have sent data but not received acknowledgement”
But for my purposes, these distinctions might be unimportant as I’ve stopped looking at seq and ack and now look at outstanding unacknowledge bytes on the receiver and sender, which is
but more about that later. Let’s look at the new version of the program and the output
#!/usr/sbin/dtrace -s #pragma D option defaultargs #pragma D option quiet inline string ADDR=$$1; dtrace:::BEGIN { TITLE = 10; title = 0; walltime=timestamp; printf("starting up ...\n"); } tcp:::send, tcp:::receive / title == 0 / { printf("%9s %8s %6s %8s %8s %12s %s \n", "delta" , "cid" , "pid" , "send" , "recd" , "seq", "ack" ); title=TITLE; } tcp:::send / ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) / { nfs[args[1]->cs_cid]=1; /* this is an NFS thread */ delta= timestamp-walltime; walltime=timestamp; printf("%9d %8d %6d %8d --> %8s %8d %8d %12s > %s \n", delta/1000, args[1]->cs_cid % 100000, args[1]->cs_pid , args[2]->ip_plength - args[4]->tcp_offset, "", args[4]->tcp_seq - args[3]->tcps_suna , args[4]->tcp_ack - args[3]->tcps_rack , args[3]->tcps_raddr, curpsinfo->pr_psargs ); title--; } tcp:::receive / ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] / { delta=timestamp-walltime; walltime=timestamp; printf("%9d %8d %6d %8s <-- %-8d %8d %8d %12s < %s \n", delta/1000, args[1]->cs_cid % 100000, args[1]->cs_pid , "", args[2]->ip_plength - args[4]->tcp_offset, args[3]->tcps_rack % 10000, args[3]->tcps_suna % 10000, args[3]->tcps_raddr, curpsinfo->pr_psargs ); title--; }
output
starting up ...
delta cid pid send recd
570 3904 845 <-- 0
34 3904 845 <-- 140
455 3904 845 <-- 0
24 3904 845 <-- 0
4789720 3904 845 124 -->
82 3904 845 244 -->
99 3904 845 132 -->
delta cid pid send recd
52 3904 845 1448 -->
28 3904 845 1448 -->
24 3904 845 1448 -->
36 3904 845 1448 -->
33 3904 845 1448 -->
26 3904 845 952 -->
86 3904 845 244 -->
212 3904 845 <-- 140
501 3904 845 <-- 132
60 3904 845 124 -->
256 3904 845 <-- 140
72 3904 845 <-- 0
39658 3904 845 <-- 0What the heck is that huge time 4789720 us? ie 4 secs? The whole operation took me less than 1 second.
I wouldn’t have found the answer to this without help from Adam Levanthal. Turns out that output is in order only per CPU, but different CPUs can output data in different order. Each CPU buffers up data and then passed the buffer back to userland dtrace, so on a one CPU machine, this code will always output in order, but on a multi-cpu machine there is no guarentee on the order of the output. Lets add CPU # to the output:
#!/usr/sbin/dtrace -s #pragma D option defaultargs #pragma D option quiet inline string ADDR=$$1; dtrace:::BEGIN { TITLE = 10; title = 0; walltime=timestamp; printf("starting up ...\n"); } tcp:::send, tcp:::receive / title == 0 / { printf("%9s %8s %6s %8s %8s %4s \n", "delta" , "cid" , "pid" , "send" , "recd" , "cpu#" ); title=TITLE; } tcp:::send / ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) / { nfs[args[1]->cs_cid]=1; /* this is an NFS thread */ delta= timestamp-walltime; walltime=timestamp; printf("%9d %8d %6d %8d --> %8s %d \n", delta/1000, args[1]->cs_cid % 100000, args[1]->cs_pid , args[2]->ip_plength - args[4]->tcp_offset, "", cpu ); title--; } tcp:::receive / ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] / { delta=timestamp-walltime; walltime=timestamp; printf("%9d %8d %6d %8s <-- %-8d %d \n", delta/1000, args[1]->cs_cid % 100000, args[1]->cs_pid , "", args[2]->ip_plength - args[4]->tcp_offset, cpu ); title--; }
output
delta cid pid send recd cpu#
42 3904 845 244 --> 0
66 3904 845 124 --> 0
6091 3904 845 124 --> 2
81 3904 845 244 --> 2
96 3904 845 132 --> 2
31 3904 845 1448 --> 2
20 3904 845 1448 --> 2
18 3904 845 1448 --> 2
17 3904 845 1448 --> 2
16 3904 845 1448 --> 2
8910406 3904 845 0 --> 3
375 3904 845 <-- 0 3
24 3904 845 <-- 140 3
34 3904 845 0 --> 3
470 3904 845 <-- 140 3
410 3904 845 <-- 132 3
delta cid pid send recd cpu#
491 3904 845 <-- 140 3
393 3904 845 <-- 0 3
15 3904 845 952 --> 3
36 3904 845 <-- 0 3
delta cid pid send recd cpu#
19 3904 845 <-- 0 3
40167 3904 845 <-- 0 3what we see is the data ordered by CPU. In other words for each CPU the data is ordered but which CPU get’s printed first is unknown. In dtrace each CPU buffers up it’s data and then sends it to the userland dtrace process.
The only “fix” for now is to add a timestamp and order the data by timestamp.
Unsorted, it looks like:
607858102997348 281 3904 845 124 --> 2 607858103608856 84 3904 845 244 --> 2 607858104125414 delta cid pid send recd cpu# 607858104143731 116 3904 845 132 --> 2 607858104176769 33 3904 845 1448 --> 2 607858104198187 21 3904 845 1448 --> 2 607858104215813 17 3904 845 1448 --> 2 607858104233004 17 3904 845 1448 --> 2 607858104267629 34 3904 845 1448 --> 2 607858104287379 19 3904 845 952 --> 2 607858102716187 11569935 3904 845 <-- 132 3 607858103245377 248 3904 845 <-- 0 3 607858103282421 37 3904 845 <-- 140 3 607858103339076 56 3904 845 244 --> 3 607858103524093 185 3904 845 <-- 140 3 607858103774417 165 3904 845 <-- 132 3 607858103823145 48 3904 845 124 --> 3 607858104027216 204 3904 845 <-- 140 3 607858104387780 100 3904 845 <-- 0 3 607858104401487 13 3904 845 <-- 0 3 607858104520815 119 3904 845 <-- 0 3 607858144436175 delta cid pid send recd cpu# 607858144454625 39933 3904 845 <-- 0 3
sorted it looks like
607858102716187 11569935 3904 845 <-- 132 3 607858102997348 281 3904 845 124 --> 2 607858103245377 248 3904 845 <-- 0 3 607858103282421 37 3904 845 <-- 140 3 607858103339076 56 3904 845 244 --> 3 607858103524093 185 3904 845 <-- 140 3 607858103608856 84 3904 845 244 --> 2 607858103774417 165 3904 845 <-- 132 3 607858103823145 48 3904 845 124 --> 3 607858104027216 204 3904 845 <-- 140 3 607858104125414 delta cid pid send recd cpu# 607858104143731 116 3904 845 132 --> 2 607858104176769 33 3904 845 1448 --> 2 607858104198187 21 3904 845 1448 --> 2 607858104215813 17 3904 845 1448 --> 2 607858104233004 17 3904 845 1448 --> 2 607858104267629 34 3904 845 1448 --> 2 607858104287379 19 3904 845 952 --> 2 607858104387780 100 3904 845 <-- 0 3 607858104401487 13 3904 845 <-- 0 3 607858104520815 119 3904 845 <-- 0 3 607858144436175 delta cid pid send recd cpu# 607858144454625 39933 3904 845 <-- 0 3
so now the strange long time delta is at the beginning where I’d expect it.
I’m not quite sure how to deal with this. Post processing the data by sorting the timestamp column works, but interactively processing the data to get it in the right order as it comes out seems problematic.

In my professional opinion, if your performance work is to have any credibility then you need to agree with and abide by this philosophy. I think it’s really what separates the real experts from the amateurs.
“Measurement is the first step that leads to control and eventually to improvement. If you can’t measure something, you can’t understand it. If you can’t understand it, you can’t control it. If you can’t control it, you can’t improve it.” – Dr. H. James Harrington
Discuss.
name="googleone_share_1" style="position:relative;z-index:5;float: right; margin-left: 10px;">Overheard in an IRC chat room (Freenode#oracle) this morning…
Recent comments
16 weeks 5 days ago
26 weeks 4 days ago
28 weeks 1 day ago
31 weeks 3 days ago
33 weeks 5 days ago
43 weeks 2 days ago
44 weeks 5 days ago
45 weeks 5 days ago
45 weeks 6 days ago
48 weeks 5 days ago