Search

OakieTags

Who's online

There are currently 0 users and 39 guests online.

Recent comments

Affiliations

Over-indexing

This is the text of an article I published in the UKOUG magazine a few years ago, but it caught my eye while I was browsing through my knowledge base recently, and it’s still relevant. I haven’t altered the original apart from adding a couple of very brief comments in brackets [Ed: like this].

Over-indexing

One of the strengths of a relational database is that you should be able to throw any reasonable query (and even some unreasonable queries) at it and it will be able to return the right answer without being told how to navigate through the data.

There’s no guarantee, though, that you’ll get the answer quickly unless you’ve given the database some help by turning your logical model into a sensible physical implementation. Part of the physical implementation will be the choice of indexes – and this article reviews one of the commonest indexing issues that I see in OLTP systems

Costs and Benefits.

Why do we create indexes? There are three main reasons as far as Oracle is concerned. We need some indexes to help Oracle enforce uniqueness; we need some indexes for performance reasons to supply a high-precision access path to important data; and we may need some indexes to help Oracle enforce referential integrity constraints.

Theoretically, of course, we don’t need any indexes to ensure data correctness, so you could say that the only reason for having indexes is to improve performance.  For example, you could enforce primary key constraints simply by locking tables on every insert, update or delete and then doing a tablescan to make sure that there is no conflicting data – but having an index “for” the primary key allows you to check just one block in the index without locking the table, so you get better performance and better concurrency.  (In passing, an index that covers a primary key or unique constraint need not be a unique index, and the index definition need not be an exact match for the constraint – provided it starts with the relevant columns.)

So indexes can be a massive aid to performance and concurrency. But indexes have a price – you have to maintain them, usually in real-time, and that can cause problems. Roughly every two weeks [Ed: not quite so frequently in 2012], someone writes into the Oracle Forums with a question like: “I have a batch process that inserts 1 million rows of data into a table every night. But it’s very slow and I see lots of waits for ‘db file sequential read’ while it’s going on. Why is this happening?”

If you insert 1 million rows into a table you might have to create and (eventually) write something in the region of 25,000 table blocks (assuming a row size of about 200 bytes … 40 rows per block). If you have one index on that table then, in principle, you might have to do a random read to find an index leaf block for every single row you insert – and you may have to write the updated block out very soon afterwards to make space for the next random read. If the table is large and you have a couple of dozen indexes this “in principle” threat very soon becomes a real one.

There is a very important difference between (heap) tables and their indexes – a row can go into a table wherever the next empty space happens to be; a row has to go into an index in exactly the right place, and finding that place can be expensive. Things get worse if you are doing updates – you can update a row in the table “in situ”, but if you change the value of an indexed column you have to find an index entry for the old value, delete it, and then insert an index entry for the new value in the right place.

So don’t create indexes unless they really pay for their existence when you come to query the data – and when you do create an index make sure you maximise the payback.

The Commonest Error.

There are a number of subtle ways in which you can create too many indexes, but there is one error that is easy to spot – the foreign key index.

When you create a referential integrity constraint (foreign key) you do not need to create an index to support it and (unlike the unique constraint and primary key constraint) Oracle will not create such an index automatically. However, if you are going to update the primary key or unique key at the “parent” end of the referential integrity constraint Oracle will lock the “child” table to check for committed and uncommitted child rows if a suitable index does not exist. Consequently many people (and some application generators) automatically create a matching index for every foreign key they create.

Rule 1: don’t create “foreign key” indexes unless they are really necessary, or unless they are sufficiently useful that you would have created them any way even if there hadn’t been a referential integrity constraint to protect.

If you do have a foreign key constraint that needs an index, though, remember that the index does not have to be an exact match for the constraint – the critical feature is that it should start with the same columns (not necessarily in the same order) as the constraint definition. This means that you can always ‘add value’ to a foreign key index that might otherwise be just a technical overhead.

Rule 2: consider adding columns to foreign key indexes to make them useful to high precision queries.

Moreover, when you think about the parent/child relationship remember that you often see one parent row with several child rows – think orders/order_lines, or currency_codes/payments. In cases like this there may be an opportunity for saving quite a lot of space (and reducing the impact on the buffer cache) by compressing the index on some, or all, of the foreign key column(s).

Case Study.

Here’s a list of indexes (with a little camouflage) from a system I was looking at recently [Ed: some time in 2008]. It’s typical of the over-indexing problems that I see fairly frequently. I got this list by running a query similar to:

select
        index_name, column_name
from
        user_ind_columns
where
        table_name = 'XXXXX'
order by
        index_name, column_position
;

You could enhance this by query to use dba_ind_columns, of course, and add all sorts of extras by joining to dba_tab_columns, dba_indexes, and dba_ind_expressions to pick up details about mandatory columns, column types, uniqueness, index types, and functions involved in function-based indexes. In the case of my example, I used a slightly more sophisticated query to show the expressions used in function-based indexes – and here’s the list of indexes I found on one table:

ap_fun_fk_i ( fun_id )
ap_grp_fk_i ( grp_id )
ap_grp_fun_id_i( grp_id,fun_id )
ap_org_ap_i ( org_id,ap_id )
ap_org_fk_i ( org_id )
ap_per_ap_i ( per_id,ap_id )
ap_per_fk_i ( per_id )
ap_pk ( ap_id )
ap_ud_i ( trunc(update_date) )

This system had a convenient naming convention, PK in the name indicates the primary key, and FK indicates foreign keys. You can see immediately that we seem to have several redundant indexes as far as protecting foreign keys is concerned.

  • AP_GRP_FK_I is made redundant by AP_GRP_FUN_ID_I,
  • AP_ORG_FK_I is made redundant by AP_ORG_AP_I,
  • AP_PER_FK_I is made redundant by AP_PER_AP_I.

On top of this, we might decide that since there are only a few organizations (org_id) we could benefit by compressing the index on (org_id, ap_id) on its first column. We might also compress the index on functions (fun_id) as well – although we might go one step further and decide that we will never delete functions or update the key values, and drop the index altogether.

Following the thought about compression, we could also consider compressing the index on (trunc(update_date)) – a fairly large amount of data gets updated each day, so the value is quite repetitive, similarly (grp_id, fun_id) is also fairly repetitive, so we might compress on both columns.

This leaves us with:

ap_grp_fun_id_i ( grp_id,fun_id ) compress 2
ap_org_ap_i ( org_id,ap_id )  compress 1
ap_per_ap_i ( per_id,ap_id )
ap_pk ( ap_id )
ap_ud_i ( trunc(update_date) )  compress 1

The reduction in the number of indexes is likely to reduce the undo and redo by about 40% as we insert (single row) data into this table, as well as reducing the possible I/O demand by the same sort of percentage. At the same time, the effect of the compression could (in this case) eliminate a further 10% to 15% of I/O related to this table because of the reduction in size of the indexes.

Further considerations.

The only remaining issue to consider is whether there is anything about the nature of our processing that suggests the need for housekeeping on these indexes – and the index on trunc(update_date) is an ideal candidate for consideration.

As we update data, we are going to delete entries from blocks in the left hand end of the index and insert them at the right hand end of the index. Depending on the rate and pattern of updates it is possible that a large number of blocks at the left hand end of the index will become close to empty –  this could have a significant impact on the effectiveness of the buffer cache and might encourage us to use the coalesce command on the index every few days. (If this sounds familiar, I have written about this particular index in the past.)

Having thought about that, you might then wonder why we have an index on trunc(update_date) at all. A query that tries to find all rows updated on a certain date, or range of dates, may have to jump around all over the table to acquire a lot of data. Who wants to do this, and how often? It’s possible that the index exists for a particular report that runs just once per day – perhaps there is an argument for dropping this index as well.

Conclusion:

It’s very easy to create more indexes than you really need, and indexes can be very expensive to maintain. Sometimes you can spot “obviously” redundant indexes simply by inspecting index names and column names.  A few reasonable guesses may allow you to drop some indexes and make others more efficient with a minimum of risk and effort.