Search

OakieTags

Who's online

There are currently 0 users and 31 guests online.

Recent comments

Affiliations

March 2011

Upgrade Whoa

(The title’s a pun, by the way – an English form of humour that is not considered good unless it’s really bad.)

Very few people try to email me or call me with private problems – which is the way it should be, and I am grateful to my audience for realizing that this blog isn’t trying to compete with AskTom – but I do get the occasional communication and sometimes it’s an interesting oddity that’s worth a little time.

Today’s blog item is one such oddity – it was a surprise, it looked like a nasty change in behaviour, and it came complete with a description of environment, and a neatly formatted, complete, demonstration. For a discussion of the problem in Spanish you can visit the blog of John Ospino Rivas, who sent me the original email and has written his own blog post on the problem.

We start with a simple table, and then query it with a ‘select for update from two different sessions:


drop table tab1 purge;

create table tab1(
	id	number,
	info	varchar2(10),
	constraint tab1_pk primary key (id)
		using index (create index idx_tab1_pk on tab1(id))
);

insert into tab1 values(1,'a');
insert into tab1 values(2,'a');
insert into tab1 values(3,'a');
commit;

execute dbms_stats.gather_table_stats(user,'tab1',cascade=>true)

column id new_value m_id

set autotrace on explain

select  id
from    tab1
where   id = (
            select  min(id)
            from    tab1
        )
for update
;

set autotrace off

prompt	=============================================================
prompt  Now repeat the query in another session and watch it lock
prompt	And use a third session to check v$lock
prompt  Then delete here, commit and see what the second session does
prompt	=============================================================

accept X prompt 'Press return to delete and commit'

set verify on
delete from tab1 where id = &m_id;
commit;

The fact that the primary key index is created as a non-unique index isn’t a factor that affects this demonstration.

Given the query and the data in the table, you won’t be surprised by the result of the query from the first session (for convenience I’ve captured the selected value using the ‘column new_value’ option). Here’s the result of the query and its execution plan:


        ID
----------
         1

--------------------------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |     1 |     3 |     1   (0)| 00:00:01 |
|   1 |  FOR UPDATE                  |             |       |       |            |          |
|*  2 |   INDEX RANGE SCAN           | IDX_TAB1_PK |     1 |     3 |     0   (0)| 00:00:01 |
|   3 |    SORT AGGREGATE            |             |     1 |     3 |            |          |
|   4 |     INDEX FULL SCAN (MIN/MAX)| IDX_TAB1_PK |     3 |     9 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID"= (SELECT MIN("ID") FROM "TAB1" "TAB1"))

At this point the program issues instructions to repeat the query from a second session, then waits for you to press Return. When you run the same query from another session it’s going to see the data in read-consistent mode and try to select and lock the row where ID = 1, so the second session is going to hang waiting for the first session to commit or rollback.

Here’s the key question: what’s the second session going to return when you allow the first session to continue, delete the row it has selected, and commit ? Here’s the answer if you’re running 10.2.0.3 or 11.1.0.6 (which is what I happen to have easily available):

SQL> select  id
  2  from    tab1
  3  where   id = (
  4              select  min(id)
  5              from    tab1
  6          )
  7  for update
  8  ;

        ID
----------
         2

1 row selected.

Now, this seems perfectly reasonable to me – especially since I’ve read Tom Kyte’s notes on “write consistency” and seen the “rollback and restart” mechanism that kicks in when updates have to deal with data that’s changed since the start of the update. Session 2 had a (select for) update, and when it finally got to a point where it could lock the data it found that the read-consistent version of the data didn’t match the current version of the data so it restarted the statement at a new SCN. At the new SCN the current highest value was 2.

Now here’s what happened when I ran the test under 11.2.0.2:

SQL> select  id
  2  from    tab1
  3  where   id = (
  4              select  min(id)
  5              from    tab1
  6          )
  7  for update
  8  ;

no rows selected

The upgrade produces a different answer !

At first sight (or guess) it looks as if the query has run in two parts – the first part producing the min(id) of 1 using a read-consistent query block, with the second part then using the resulting “known value” to execute the outer select (shades of “precompute_subquery”) and restarting only the second part when it discovers that the row it has been waiting for has gone away.

It doesn’t really matter whether you think the old behaviour or the new behaviour is correct – the problem is that the behaviour has changed in a way that could silently produce unexpected results. Be careful if any of your code uses select for update with subqueries.

As a defensive measure you might want to change the code to use the serializable isolation level – that way the upgraded code will crash with Oracle error ORA-08177 instead of silently giving different answers:

SQL> alter session set isolation_level = serializable;

Session altered.

SQL> get afiedt.buf
  1  select  /*+ gather_plan_statistics */
  2          id
  3  from    tab1
  4  where   id = (
  5              select  min(id)
  6              from    tab1
  7          )
  8* for update
  9  /
from    tab1
        *
ERROR at line 3:
ORA-08177: can't serialize access for this transaction

It might be a way of avoiding this specific problem, of course, but it’s not a frequently used feature (the first pages of hits on Google are mostly about SQL Server) so who knows what other anomalies this change in isolation level might introduce.

PX COORDINATOR FORCED SERIAL operation

This is just a short heads-up for those that come across an execution plan showing the PX COORDINATOR FORCED SERIAL operation. I don't have official confirmation but according to my tests a plan with such an operation effectively means: Cost for parallel execution but execute serially (You might remember that I've recently mentioned in another post that there is the possibility to have a plan executed in parallel but costed serially, weird isn't it). Why such an operation exists is not clear to me - obviously it would make much more sense to cost straight away for serial execution in such a case. Probably there is a good reason, otherwise such an operation didn't exist but I think at least the costing is questionable in current versions.

LOBREAD SQL Trace entry in Oracle 11.2 (and tracing OPI calls with event 10051)

A few days ago I looked into a SQL Tracefile of some LOB access code and saw a LOBREAD entry there. This is a really welcome improvement (or should I say, bugfix of a lacking feature) for understanding resource consumption by LOB access OPI calls. Check the bottom of the output below:

*** 2011-03-17 14:34:37.242
WAIT #47112801352808: nam='SQL*Net message from client' ela= 189021 driver id=1413697536 #bytes=1 p3=0 obj#=99584 tim=1300390477242725
WAIT #0: nam='gc cr multi block request' ela= 309 file#=10 block#=20447903 class#=1 obj#=99585 tim=1300390477243368
WAIT #0: nam='cell multiblock physical read' ela= 283 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477243790
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390477243865
[...snipped...]
WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244205
WAIT #0: nam='SQL*Net more data to client' ela= 4 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244221
WAIT #0: nam='gc cr multi block request' ela= 232 file#=10 block#=20447911 class#=1 obj#=99585 tim=1300390477244560
WAIT #0: nam='cell multiblock physical read' ela= 882 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477245579
WAIT #0: nam='SQL*Net more data to client' ela= 16 driver id=1413697536 #bytes=2020 p3=0 obj#=99585 tim=1300390477245685
WAIT #0: nam='SQL*Net more data to client' ela= 6 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477245706
WAIT #0: nam='SQL*Net more data to client' ela= 5 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390477245720
#ff0000;">LOBREAD: c=1000,e=2915,p=8,cr=5,cu=0,tim=1300390477245735

In past versions of Oracle the CPU (c=) usage figures and other stats like number of physical/logical reads of the LOB chunk read OPI call were just lost – they were never reported in the tracefile. In past only the most common OPI calls, like PARSE, EXEC, BIND, FETCH (and recently CLOSE cursor) were instrumented with SQL Tracing. But since 11.2(.0.2?) the LOBREAD’s are printed out too. This is good, as it reduces the amount of guesswork needed to figure out what are those WAITs for cursor #0 – which is really a pseudocursor.

Why cursor#0? It’s because normally, with PARSE/EXEC/BIND/FETCH, you always had to specify a cursor slot number you operated on (if you fetch from cursor #5, it means that Oracle process went to slot #5 in the open cursor array in your session’s UGA and followed the pointers to shared cursor’s executable parts in library cache from there). But LOB interface works differently – if you select a LOB column using your query (cursor), then all your application gets is a LOB LOCATOR (sort of a pointer with LOB item ID and consistent read/version SCN). Then it’s your application which must issue another OPI call (LOBREAD) to read the chunks of that LOB out from the database. And the LOB locator is independent from any cursors, it doesn’t follow the same cursor API as regular SQL statements (as it requires way different functionality compared to a regular select or update statement).

So, whenever a wait happened in your session due to an access using a LOB locator, then there’s no specific cursor responsible for it (as far as Oracle sees internally) and that’s why a fake, pseudocursor #0 is used.

Note that on versions earlier than 11.2(.0.2?) when the LOBREAD wasn’t printed out to trace – you can use OPI call tracing (OPI stands for Oracle Program Interface and is the server-side counterpart to OCI API in the client side) using event 10051. First enable SQL Trace and then the event 10051 (or the other way around if you like):

SQL> @oerr 10051

ORA-10051: trace OPI calls

SQL> alter session set events '10051 trace name context forever, level 1';

Session altered.

Now run some LOB access code and check the tracefile:

*** 2011-03-17 14:37:07.178
WAIT #47112806168696: nam='SQL*Net message from client' ela= 6491763 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627178602
OPI CALL: type=105 argc= 2 cursor=  0 name=Cursor close all
CLOSE #47112806168696:c=0,e=45,dep=0,type=1,tim=1300390627178731
OPI CALL: type=94 argc=28 cursor=  0 name=V8 Bundled Exec
=====================
PARSING IN CURSOR #47112802701552 len=19 dep=0 uid=93 oct=3 lid=93 tim=1300390627179807 hv=1918872834 ad='271cc1480' sqlid='3wg0udjt5zb82'
select * from t_lob
END OF STMT
PARSE #47112802701552:c=1000,e=1027,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179805
EXEC #47112802701552:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179884
WAIT #47112802701552: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627179939
WAIT #47112802701552: nam='SQL*Net message from client' ela= 238812 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418785
OPI CALL: type= 5 argc= 2 cursor= 26 name=FETCH
WAIT #47112802701552: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418945
FETCH #47112802701552:c=0,e=93,p=0,cr=5,cu=0,mis=0,r=1,dep=0,og=1,plh=3547887701,tim=1300390627418963
WAIT #47112802701552: nam='SQL*Net message from client' ela= 257633 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676629
#ff0000;">OPI CALL: type=96 argc=21 cursor=  0 name=#ff0000;">LOB/FILE operations
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676788
[...snip...]
WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390627677054
LOBREAD: c=0,e=321,p=0,cr=5,cu=0,tim=1300390627677064

Check the bold and especially the red string above.  Tracing OPI calls gives you some extra details of what kind of tasks are executed in the session. The “LOB/FILE operations” call indicates that whatever lines come after it (unlike SQL trace call lines where all the activity happens before a call line is printed (with some exceptions of course)) are done for this OPI call (until a next OPI call is printed out). OPI call tracing should work even on ancient database versions…

By the way, if you are wondering, what’s the cursor number 47112801352808 in the “WAIT #47112801352808″ above? Shouldn’t the cursor numbers be small numbers?

Well, in 11.2.0.2 this was also changed. Before that, the X in CURSOR #X (and PARSE #X, BIND #X, EXEC #X, FETCH #X) represented the slot number in your open cursor array (controlled by open_cursors) in your session’s UGA. Now, the tracefile dumps out the actual address of that cursor. 47112801352808 in HEX is 2AD94DC9FC68 and it happens to reside in the UGA of my session.

Naturally I asked Cary Millsap about whether he had spotted this LOBREAD already and yes, Cary’s way ahead of me – he said that Method-R’s mrskew tool v2.0, which will be out soon, will support it too.

It’s hard to not end up talking about Cary’s work when talking about performance profiling and especially Oracle SQL trace, so here are a few very useful bits which you should know about:

If you want to understand the SQL trace & profiling stuff more, then the absolute must document is Cary’s paper on the subject – Mastering Performance with Extended SQL Trace:

Also, if you like to optimize your work like me (in other words: you’re proactively lazy ;-) and you want to avoid some boring “where-the-heck-is-this-tracefile-now” and “scp-copy-it-over-to-my-pc-for-analysis” work then check out Cary’s MrTrace plugin (costs ~50 bucks and has a 30-day trial) for SQL Developer. I’ve ended up using it myself regularly although I still tend to avoid GUIs:

Share

Review: Oracle Application Express 4.0 Cookbook

I decided to add a twist to my usual blogs - a book review, which I have not done before. I have been reading a book - Oracle Apex 4.0 Cookbook by Marcel van der Plas and Michel van Zoest and published by Packt, a UK based publisher. Michel is one of the first Apex Certified Experts in the world - a no small feat. It has been technically reviewed by a well known cast - Oracle ACE Director Dimitri Gielis, who also won Oracle Magazine's Apex Developer of the Year in '09; Maarten van Luijtelaar and Oracle ACE and frequent blogger Surachart Opun.

ASSM ouch!

Here’s a nasty little surprise I got last week while investigating an oddity with stats collection. I wanted to create a table in an ASSM tablespace and populate it from two or three separate sessions simultaneously so that I could get some “sparseness” in the data load. So I created a table and ran up 17 concurrent sessions to insert a few rows each. Because I wanted to know where the rows were going I got every session to dump the bitmap space management block at the start of the segment – the results were surprising.

I was using 11.1.0.6, with a locally managed tablespace with 1MB uniform extents, 8KB blocks, and ASSM. Here’s the basic code:

create table t1 (
	n1	number,
	v1	varchar2(100)
)
tablespace test_8k_assm
;

--
--  17 sessions then synchronised to do the following
--

begin
	for i in 1..3 loop
		insert into t1 values(i, rpad('x',100));
		commit;
	end loop;
end;
/

alter system flush buffer_cache;
execute dump_seg('t1',1,'table',0)

(The dump_seg() procedure is just a simple bit of code I wrote to dump selected blocks from a named segment.)
The results I saw in the bitmap block were so bizarre that I then ran the following query:

select
	dbms_rowid.rowid_block_number(rowid) as block_id,
	count(*)
from
	t1
group by
	dbms_rowid.rowid_block_number(rowid)
order by
	block_id
;

Remember, I was using 8KB blocks and 1MB uniform extents – which means 128 blocks per extent – and I had started with a completely new, clean, tablespace. So how many blocks do you think I had allocated to the table ? Here are the results of my query:

  BLOCK_ID   COUNT(*)
---------- ----------
        40          3
        41          3
        43          3
       106          3
       110          3
       112          3
       165          3
       173          3
       236          3
       303          3
       433          3
       464          3
       541          3
       555          3
       563          3
       626          3
      1000          3

What you see here is a table which has managed to allocate several extents – despite the fact that the 17 sessions could have inserted their data into 17 separate blocks in the first extent of the table. The table had grown to nine extents – although it hadn’t even put data into all the extents – and according to dbms_stats.gather_table_stats() there were 1,088 blocks in the table !

This looks pretty disastrous for ASSM – but I do have to say that flushing the buffer cache (17 times) could have introduced an extreme pinning problem that resulted in this surprising result. When I removed the flush and dump lines from my test the “realistic” results showed 17 blocks used in the first extent – so don’t panic if you’re using ASSM in a highly concurrent system; but if you have a couple of tables that seem to be much bigger than expected, maybe you now know what to look for.

Implicit datatype conversion in the parsing phase – something new I learned today!

Wow, I wasn’t aware that Oracle can also do an implicit datatype conversion for literal strings during parsing phase!

SQL> @desc t
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      A                                        NUMBER(38)

SQL> select * from t where a = '1' || RPAD('0',5,'0');

no rows selected

SQL> @x
Display execution plan for last statement for this session from library cache...

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------
SQL_ID  d7r6md8wfu74d, child number 0
-------------------------------------
select * from t where a = '1' || RPAD('0',5,'0')

Plan hash value: 1601196873

--------------------------------------------------------
| Id  | Operation         | Name | E-Rows | Cost (%CPU)|
--------------------------------------------------------
|   0 | SELECT STATEMENT  |      |        |     2 (100)|
|*  1 |  TABLE ACCESS FULL| T    |      1 |     2   (0)|
--------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("A"=100000)

You see what happened? The expression ’1′ || RPAD(’0′,5,’0′) has been evaluated, which returns a string. And this string ’100000′ has been converted to a NUMBER 100000 during parsing phase .. otherwise you would see quotes around the number above with a TO_NUMBER() function around it (so that Oracle could compare the NUMBER column “A” to the same datatype)…

I add a TO_CHAR() around the column A just for demoing that a varchar datatype (as the original “literal” in my query is) will be shown with quotes like every normal string:

SQL> select * from t where to_char(a) = '1'||rpad('0',5,'0');

no rows selected

SQL> @x
Display execution plan for last statement for this session from library cache...

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------
SQL_ID  7yf6j8fdyrvk7, child number 0
-------------------------------------
select * from t where to_char(a) = '1'||rpad('0',5,'0')

Plan hash value: 1601196873

--------------------------------------------------------
| Id  | Operation         | Name | E-Rows | Cost (%CPU)|
--------------------------------------------------------
|   0 | SELECT STATEMENT  |      |        |     2 (100)|
|*  1 |  TABLE ACCESS FULL| T    |      1 |     2   (0)|
--------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(TO_CHAR("A")='100000')

Let’s see whether this trick is somehow done also for bind variables:

SQL> var x varchar2(10)
SQL> exec :x:= '1' || RPAD('0',5,'0');

PL/SQL procedure successfully completed.

SQL> print x

X
--------------------------------
100000

SQL> select * from t where a = :x;

no rows selected

SQL> @x
Display execution plan for last statement for this session from library cache...

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------
SQL_ID  45f39y7580bdp, child number 2
-------------------------------------
select * from t where a = :x

Plan hash value: 1601196873

--------------------------------------------------------
| Id  | Operation         | Name | E-Rows | Cost (%CPU)|
--------------------------------------------------------
|   0 | SELECT STATEMENT  |      |        |     2 (100)|
|*  1 |  TABLE ACCESS FULL| T    |      1 |     2   (0)|
--------------------------------------------------------

Peeked Binds (identified by position):
--------------------------------------

   1 - (VARCHAR2(30), CSID=873): '100000'

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("A"=TO_NUMBER(:X))

Apparently not! And this kind of makes sense – as if this string to number conversion is done during parse phase – Oracle doesn’t know what the actual value is yet (in the bind variable memory) so it can’t convert it to number in advance either :-)

This is a little interesting detail… I didn’t know that in addition to the implicit datatype conversion during query execution (using TO_CHAR, TO_NUMBER functions etc) Oracle can sometimes convert a string literal to number datatype under the hood during the parse time!

P.S. I tested this on Oracle 11.2.0.2 with optimizer_features_enable set from 11.2.0.2 to all the way back to 8.0.0 and the behavior was the same. I didn’t find any mention of this conversion in the CBO tracefile although after a filter pushdown transformation (FPD) the string literal was already shown as a number datatype. If anyone still has access to ancient Oracle database versions (like 9.2 and 10.1 ;-) then let me know whether you see the same results!

Share

AIX note

This is just a temporary link to a note from Kurt Van Meerbeeck that should be of interest to anyone who’s using AIX and thinking of upgrading Oracle to 11.2 in the near future: Doubled memory footprint on AIX.

Troubleshooting Grid Infrastructure startup

This has been an interesting story today when one of my blades decided to reboot after an EXT3 journal error. The hard facts first:

  • Oracle Linux 5.5 with kernel 2.6.18-194.11.4.0.1.el5
  • Oracle 11.2.0.2 RAC
  • Bonded NICs for private and public networks
  • BL685-G6 with 128G RAM

First I noticed the node had problems when I tried to get all databases configured on the cluster. I got the dreaded “cannot communicate with the CRSD”

[oracle@node1.example.com] $ srvctl config database
PRCR-1119 : Failed to look up CRS resources of database type
PRCR-1115 : Failed to find entities of type resource that match filters (TYPE ==ora.database.type) and contain attributes DB_UNIQUE_NAME,ORACLE_HOME,VERSION
Cannot communicate with crsd

Not too great, especially since everything worked when I left yesterday. What could have gone wrong?An obvious reason for this could be a reboot, and fair enough, there has been one:

[grid@node1.example.com] $ uptime
09:09:22 up  2:40,  1 user,  load average: 1.47, 1.46, 1.42

The next step was to check if the local CRS stack was up, or better, to check what was down. Sometimes it’s only crsd which has a problem. In my case everything was down:

[grid@node1.example.com] $ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[grid@node1.example.com] $ crsctl check cluster -all
**************************************************************
node1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
CRS-4404: The following nodes did not reply within the allotted time:
node2,node3, node4, node5, node6, node7, node8

The CRS-4404 was slightly misleading, I assumed all cluster nodes were down after a clusterwide reboot. Sometimes a single node reboot triggers worse things. However, logging on to node 2 I saw that all but the first node were ok.

CRSD really needs CSSD to be up and running, and CSSD requires the OCR to be there. I wanted to know if the OCR was impacted in any way:

[grid@node1.example.com] $ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
ORA-29701: unable to connect to Cluster Synchronization Service

Well it seemed that the OCR location was unavailable. I know that on this cluster, the OCR is stored on ASM. Common reasons for the PROC-26 error are

  • Unix admin upgrades the kernel but forgets to upgrade the ASMLib kernel module (common grief with ASMLib!)
  • Storage is not visible on the host, i.e. SAN connectivity broken/taken away (happens quite frequently with storage/sys admin unaware of ASM)
  • Permissions not set correctly on the block devices (not an issue when using asmlib)

I checked ASMLib and it reported a working status:

[oracle@node1.example.com] $ /etc/init.d/oracleasm status
Checking if ASM is loaded: yes
Checking if /dev/oracleasm is mounted: yes

That was promising, /dev/oracleasm/ was populated and the matching kernel modules loaded. /etc/init.d/oracleasm listdisks listed all my disks as well. Physical storage not accessible (PROC-26) seemed a bit unlikely now.

I could rule out permission problems since ASMLib was working fine, and I also rule out the kernel upgrade/missing libs problem by comparing the RPM with the kernel version: they matched. So maybe it’s storage related?

Why did the node go down?

Good question, usually to be asked towards the unix administration team. Luckily I have a good contact placed right inside that team and I could get the following excerpt from /var/log/messages arond the time of the crash (6:31 this morning):

Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192116, count = 1
Mar 17 06:26:06 node1 kernel: Aborting journal on device dm-2.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192216, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192166, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192122, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192140, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192174, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 10 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_truncate: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_orphan_del: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_delete_inode: Journal has aborted
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 kernel: ext3_abort called.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Mar 17 06:26:06 node1 kernel: Remounting filesystem read-only
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:41076
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:41076
Mar 17 06:26:09 node1 kernel: SysRq : Resetting
Mar 17 06:31:15 node1 syslogd 1.4.1: restart.

So it looks like a file system error triggered the reboot-I’m glad the box came back up ok on it’s own. The $GRID_HOME/log/hostname/alerthostname.log didn’t show anything specific to storage. Normally you would see that it starts counting a node down if it lost contact to the voting disks (in this case OCR and voting disks share the same diskgroup).

And why does Clusteware not start?

After some more investigation it seems there was no underlying problem with the storage, so I tried to manually start the cluster, traililng the ocssd.log file for possible clues.

[root@node1 ~]# crsctl start cluster
CRS-2672: Attempting to start ‘ora.cssd’ on ‘node1′
CRS-2674: Start of ‘ora.cssd’ on ‘node1′ failed
CRS-2679: Attempting to clean ‘ora.cssd’ on ‘node1′
CRS-2681: Clean of ‘ora.cssd’ on ‘node1′ succeeded
CRS-5804: Communication error with agent process
CRS-2672: Attempting to start ‘ora.cssdmonitor’ on ‘node1′
CRS-2676: Start of ‘ora.cssdmonitor’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.cssd’ on ‘node1′

… the command eventually failed. The ocssd.log file showed this:

...
2011-03-17 09:47:49.073: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996354 ms, node 0x2aaab008a260 { host 'node4', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-a7491097, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.084: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996364 ms, node 0x2aaab008a630 { host 'node6', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-2f6ece1c, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.113: [    CSSD][1113332032]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846440, LATS 10996434, lastSeqNo 30846437, uniqueness 1300108895, timestamp 1300355268/3605443434
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355257, LATS 10996434, lastSeqNo 31355254, uniqueness 1300344405, timestamp 1300355268/10388584
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372473, LATS 10996434, lastSeqNo 31372470, uniqueness 1297097908, timestamp 1300355268/3605182454
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384686, LATS 10996434, lastSeqNo 31384683, uniqueness 1297098093, timestamp 1300355268/3604696294
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388819, LATS 10996434, lastSeqNo 31388816, uniqueness 1297098327, timestamp 1300355268/3604712934
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612975, LATS 10996434, lastSeqNo 29612972, uniqueness 1297685443, timestamp 1300355268/3603054884
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203293, LATS 10996434, lastSeqNo 31203290, uniqueness 1297156000, timestamp 1300355268/3604855704
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 3, node33, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355258, LATS 10996434, lastSeqNo 31355255, uniqueness 1300344405, timestamp 1300355268/10388624
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372474, LATS 10996434, lastSeqNo 31372471, uniqueness 1297097908, timestamp 1300355268/3605182494
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384687, LATS 10996434, lastSeqNo 31384684, uniqueness 1297098093, timestamp 1300355268/3604696304
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388821, LATS 10996434, lastSeqNo 31388818, uniqueness 1297098327, timestamp 1300355268/3604713224
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612977, LATS 10996434, lastSeqNo 29612974, uniqueness 1297685443, timestamp 1300355268/3603055224
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846441, LATS 10996474, lastSeqNo 30846438, uniqueness 1300108895, timestamp 1300355269/3605443654
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355259, LATS 10996474, lastSeqNo 31355256, uniqueness 1300344405, timestamp 1300355268/10389264
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203294, LATS 10996474, lastSeqNo 31203291, uniqueness 1297156000, timestamp 1300355269/3604855914
2011-03-17 09:47:49.619: [    CSSD][1116485952]clssnmSendingThread: sending join msg to all nodes
...

The interesting bit is the “BUT NO NETWORK HB”, i.e. something must be wrong with the network configuration. I quickly checked the output of ifconfig and found a missing entry for my private interconnect. This is defined in the GPnP profile if you are unsure:


 
 
 

Now that’s a starting point! If tried to bring up bond1.251, but that failed:

[root@node1 network-scripts]# ifup bond1.251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
ERROR: could not add vlan 251 as bond1.251 on dev bond1

The “invalid argument” didn’t mean too much to me, so I ran ifup with the “-x” flag to get more information about which argument was invalid:

[root@node1 network-scripts]# which ifup
/sbin/ifup
[root@node1 network-scripts]# view /sbin/ifup
# turned out it's a shell script! Let's run with debug output enabled
[root@node1 network-scripts]# bash -x /sbin/ifup bond1.251
+ unset WINDOW
...
+ MATCH='^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$'
+ [[ bond1.251 =~ ^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$ ]]
++ echo bond1.251
++ LC_ALL=C
++ sed 's/^[a-z0-9]*\.0*//'
+ VID=251
+ PHYSDEV=bond1
+ [[ bond1.251 =~ ^vlan[0-9]{1,4}? ]]
+ '[' -n 251 ']'
+ '[' '!' -d /proc/net/vlan ']'
+ test -z ''
+ VLAN_NAME_TYPE=DEV_PLUS_VID_NO_PAD
+ /sbin/vconfig set_name_type DEV_PLUS_VID_NO_PAD
+ is_available bond1
+ LC_ALL=
+ LANG=
+ ip -o link
+ grep -q bond1
+ '[' 0 = 1 ']'
+ return 0
+ check_device_down bond1
+ echo bond1
+ grep -q :
+ LC_ALL=C
+ ip -o link
+ grep -q 'bond1[:@].*,UP'
+ return 1
+ '[' '!' -f /proc/net/vlan/bond1.251 ']'
+ /sbin/vconfig add bond1 251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
+ /usr/bin/logger -p daemon.info -t ifup 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
+ echo 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
ERROR: could not add vlan 251 as bond1.251 on dev bond1
+ exit 1

Hmmm so it seemed that the underlying interface bond1 was missing-which was true. The output of ifconfig didn’t show it as configured, and trying to start it manually using ifup bond1 failed as well. It turned out that the ifcfg-bond1 file was missing and had to be recreated from the documentation. All network configuration files in Red Hat based systems belong into /etc/sysconfig/network-scripts/ifcfg-interfaceName. With the recreated file in place, I was back in the running:

[root@node1 network-scripts]# ll *bond1*
-rw-r–r– 1 root root 129 Mar 17 10:07 ifcfg-bond1
-rw-r–r– 1 root root 168 May 19  2010 ifcfg-bond1.251
[root@node1 network-scripts]# ifup bond1
[root@node1 network-scripts]# ifup bond1.251
Added VLAN with VID == 251 to IF -:bond1:-
[root@node1 network-scripts]#

Now I could try to start the lower stack again:

CRS-2672: Attempting to start ‘ora.cssdmonitor’ on ‘node1′
CRS-2676: Start of ‘ora.cssdmonitor’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.cssd’ on ‘node1′
CRS-2676: Start of ‘ora.cssd’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.cluster_interconnect.haip’ on ‘node1′
CRS-2672: Attempting to start ‘ora.ctssd’ on ‘node1′
CRS-2676: Start of ‘ora.ctssd’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.evmd’ on ‘node1′
CRS-2676: Start of ‘ora.evmd’ on ‘node1′ succeeded
CRS-2676: Start of ‘ora.cluster_interconnect.haip’ on ‘node1′ succeeded
CRS-2679: Attempting to clean ‘ora.asm’ on ‘node1′
CRS-2681: Clean of ‘ora.asm’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.asm’ on ‘node1′
CRS-2676: Start of ‘ora.asm’ on ‘node1′ succeeded
CRS-2672: Attempting to start ‘ora.crsd’ on ‘node1′
CRS-2676: Start of ‘ora.crsd’ on ‘node1′ succeeded
[root@node1 network-scripts]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Brilliant-problem solved. This is actually the first time that an incorrect network config prevented a cluster I looked after from starting. The best indication in this case is in the gipcd log file, but it didn’t occur to me to have a look at is as the error was clearly related to storage.

Telling the Truth in IT

I’ve been doing presentations for many years, mostly on Oracle Technology, occasionally on management topics. However, my favorite presentation to give is one about when things go wrong.

The title is usually something like “Surviving Survivable Disasters” or “5 ways to Seriously Screw up a Project” and, though the specific examples and flow may vary, the general content is the same. I talk about IT situations that have gone wrong or things that strike me as daft/silly/mindless in IT. My aim is to be entertaining and have a laugh at the situations but I also want to explore what causes disasters and how we might go about avoiding at least some of them.

When doing the presentation I have a couple of ground rules:

  • I must have witnessed the situation myself or know personally, and trust, the individual who is my source.
  • I do not name organisations or individuals unless I am specifically given permission {by individuals that is, organisations never get named. Except one}.
  • I try to resist the temptation to embellish. It’s not hard to resists, a good disaster usually stands on it’s own merits.

It’s a great talk for introducing some light relief into a series of very technical presentations or for opening up a day of talks, to get people relaxed. It’s also the only talk I get seriously nervous about doing – if you are aiming to be entertaining and you miss, you stand to die on stage. The first time I did the talk I was physically sweating. However, it went down a storm. I did it 4 or 5 more times over as many years and it always went down well.

However, about 4 years ago I did the presentation just as I was about to go back to being self employed. After the talk a very good friend came over and said something like “Really entertaining talk but…maybe you should tone it down? A lot. Potential employers are going to take a dim view of you doing this, they will worry they will appear in the next talk”. I protested that I never mention companies or people and, surely, all organisations are able to admit that things go wrong and it is to everyone’s benefit if we all learn from them? My friend was adamant that though companies want to benefit from other disasters, they never, ever want to in any way be the source of that benefit. He was sure it would be very damaging to my potential career. Hmmmm…. I could see his point.

I was already scheduled to do the talk again in a couple of months and I took heed of his advice for it. I toned down the material, I removed some of the best stories and I added several disclaimers. I also died on stage. It went from an amusing 45 minutes to a preachy and stodgy affair.

I have not done it since.

The question is, should I have pulled back from doing that talk? Is it really going to harm my potential employability? (After all, no work has ever come my way from presenting). Why can’t we be honest that issues occur and that learning from them is far more valuable than covering them up? After all, do we believe a person who claims never to have made mistakes?

What prompted this thread is that I have been asked to do the talk again – and I have agreed to do so. I’ll be doing it next week, with the title “5 ways to advance your career through IT Disasters” for the UK Oracle user group Back to Basics event. This is a day of introductory talks for people who are fairly new to Oracle, the brain-child of Lisa Dobson. Lisa realised a few years ago that there were not enough intro-type presentations, most technical talks are by experts for fellow experts {or, at least, people wanting to become experts}.

I’m very happy to support helping those who are new to Oracle and I think it is important that people who are new to IT are exposed to what can go wrong – and any advice that might help them avoid it. After all, it’s better they learn from our mistakes than just repeat them again. OK, they’ll just repeat them again anyway, but they might spot that they are doing so just in time :-)

Is this a good idea? What the hell, I want more free time to do things like this blog – and get on top of the garden.

RAC One Changes in 11.2.0.2 Patchset

Oracle patchsets used to be simple - they were bugfixes only; no additional functionalities were added. Oracle added stuff quietly in 10.2.0.4 patchset (the workload capture); but that was part of a new functionality anyway.


The 11.2.0.2 patchset changed all the rules. Several new functionalities were added to the patchsets. Several functionalities have been added to RACOne. Unfortunately they didn't make it to the manuals. Some of them I discovered only recently. If you have been using RACOne, or considering it, you will be delighted to know these.