We’ve just booked the first European venue for the Understanding Storage Masterclass. I will be presenting the Masterclass on April 24/25 2012 at Prospero House in London; tickets are available HERE. I’m pretty excited to host this training session in my home country, and I hope to see you there!
Wow, it’s been a while since I wrote a post, sorry about that! I thought that I would take a brief break from the technical postings and espouse some opinion on something that has been bothering me for a while – ‘Best Practices.’ Best Practices have been around a long time, and started with very [...]
I recently had an interesting time with a customer who is all too familiar with SANs. SAN vendors typically use sizing numbers of 180 IOPS per drive. This is a good conservative measure for SAN sizing, but the drives are capable of much more, and indeed we state a higher figure with Exadata. So, how could this be possible? Does Exadata have an enchantment spell that makes the drives magically spin faster? Maybe a space-time warp to service IO?
The Exadata X2-2 data sheet states “up to 50,000 IOPS” for a full rack of high performance 600GB 15K rpm drives. This works out to about 300 IOPS per drive. At first glance, 300 IOPS might seem strange for a drive that spins at 250 revolutions per second. But really, it only means that on average you have to service more than one IO per revolution. So, how do you service more than one IO per revolution?
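As a rough check of that arithmetic, here is a minimal sketch in Python. The drive count is my assumption (14 storage cells of 12 drives each in a full rack); only the 50,000 IOPS and 15K rpm figures come from the data sheet quote above.

```python
# Back-of-the-envelope check of the data sheet figure.
RACK_IOPS = 50_000       # "up to 50,000 IOPS" for a full rack
CELLS_PER_RACK = 14      # assumption: storage cells in a full rack
DRIVES_PER_CELL = 12     # assumption: hard drives per storage cell
RPM = 15_000             # 15K rpm drives

drives = CELLS_PER_RACK * DRIVES_PER_CELL
iops_per_drive = RACK_IOPS / drives
revs_per_second = RPM / 60
ios_per_revolution = iops_per_drive / revs_per_second

print(f"{drives} drives, {iops_per_drive:.0f} IOPS per drive")
print(f"{revs_per_second:.0f} revolutions/s, "
      f"{ios_per_revolution:.2f} IOs per revolution")
```

Under those assumptions each drive has to complete a little more than one IO per revolution, which is the gap the rest of this post explains.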
Drive command queuing and short stroking
Modern drives have the ability to queue up more than one IO at a time. If queues are deep enough and the seek distance is short enough, it is more than possible to exceed one IO per revolution. As you increase the queue depth, the probability of having an IO in the queue that can be serviced before a full revolution increases. Lots of literature exists on this topic and indeed many have tested this phenomenon. A popular site, “Tom’s Hardware”, has tested a number of drives and shows that with a command queue depth of four, both the Hitachi and Seagate 15K rpm drives reach 300 IOPS per drive.
This effect of servicing more than one IO per revolution is enhanced when the seek distances are short. There is an old benchmark trick to use only the outer portion of the drive to shrink the seek distance. This technique combined with command queuing increases the probability of servicing more than one IO per revolution.
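To make the combined effect of queuing and short stroking concrete, here is a toy simulation in Python. It is only a sketch: the seek curve, the "service whichever queued request is quickest to reach" policy, and the decision to ignore transfer time are all my assumptions, not a model of any real drive's firmware. The function and parameter names are hypothetical.

```python
import math
import random

REV_MS = 60_000 / 15_000   # one revolution of a 15K rpm drive takes 4 ms


def seek_ms(distance):
    """Assumed seek curve: fixed settle time plus a square-root term."""
    return 0.0 if distance == 0 else 0.5 + 6.5 * math.sqrt(distance)


def simulate_iops(queue_depth, stroke=1.0, ios=10_000, seed=1):
    """Random reads, always servicing the queued request quickest to reach.

    stroke is the fraction of the drive's radius in use (1.0 = whole drive).
    Each request is (track, angle): track in [0, stroke], angle in [0, 1).
    Transfer time is ignored.
    """
    rng = random.Random(seed)

    def new_request():
        return (rng.uniform(0, stroke), rng.random())

    queue = [new_request() for _ in range(queue_depth)]
    now, head = 0.0, 0.0

    def positioning_ms(request):
        track, angle = request
        seek = seek_ms(abs(track - head))
        # Platter angle under the head once the seek has finished.
        angle_after_seek = ((now + seek) / REV_MS) % 1.0
        wait = ((angle - angle_after_seek) % 1.0) * REV_MS
        return seek + wait

    for _ in range(ios):
        best = min(queue, key=positioning_ms)
        now += positioning_ms(best)
        head = best[0]
        queue.remove(best)
        queue.append(new_request())   # keep the queue at a constant depth
    return ios / (now / 1000.0)


for depth in (1, 4, 16):
    full = simulate_iops(depth)
    short = simulate_iops(depth, stroke=0.5)
    print(f"queue depth {depth:2d}: {full:4.0f} IOPS full stroke, "
          f"{short:4.0f} IOPS using half the stroke")
```

The absolute numbers depend entirely on the assumed seek curve, so treat them as illustrative. The point is the trend: the modelled rate rises as the queue gets deeper and as the stroke gets shorter.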
But how can this old trick work with real world environments?
ASM intelligent data placement to the rescue
ASM has a feature, “Intelligent Data Placement” (IDP), that optimizes the placement of data such that the most active data resides on the outer portions of the drive. The drive is essentially split into “Hot” and “Cold” regions. This care in placement helps to reduce the seek distance and achieve a higher IOPS/drive. This is the realization of an old benchmark trick, using a real feature in ASM.
the proof is in the pudding… “calibrate” command shows drive capabilities
The “calibrate” command, which is part of the Exadata storage “cellcli” interface, is used to test the capabilities of the underlying components of Exadata storage. The throughput and IOPS of both the drives and Flash modules can be tested at any point to see if they are performing up to expectations. The calibrate command uses the popular Orion IO test utility, which is designed to mimic Oracle IO patterns. This utility is used to randomly seek over the first half of the drive in order to show the capabilities of the drives. I have included an example output from an X2-2 machine below.
CellCLI> calibrate
Calibration will take a few minutes...
Aggregate random read throughput across all hard disk luns: 1809 MBPS
Aggregate random read throughput across all flash disk luns: 4264.59 MBPS
Aggregate random read IOs per second (IOPS) across all hard disk luns: 4923
Aggregate random read IOs per second (IOPS) across all flash disk luns: 131197
Calibrating hard disks (read only) ...
Lun 0_0 on drive [20:0 ] random read throughput: 155.60 MBPS, and 422 IOPS
Lun 0_1 on drive [20:1 ] random read throughput: 155.95 MBPS, and 419 IOPS
Lun 0_10 on drive [20:10 ] random read throughput: 155.58 MBPS, and 428 IOPS
Lun 0_11 on drive [20:11 ] random read throughput: 155.13 MBPS, and 428 IOPS
Lun 0_2 on drive [20:2 ] random read throughput: 157.29 MBPS, and 415 IOPS
Lun 0_3 on drive [20:3 ] random read throughput: 156.58 MBPS, and 415 IOPS
Lun 0_4 on drive [20:4 ] random read throughput: 155.12 MBPS, and 421 IOPS
Lun 0_5 on drive [20:5 ] random read throughput: 154.95 MBPS, and 425 IOPS
Lun 0_6 on drive [20:6 ] random read throughput: 153.31 MBPS, and 419 IOPS
Lun 0_7 on drive [20:7 ] random read throughput: 154.34 MBPS, and 415 IOPS
Lun 0_8 on drive [20:8 ] random read throughput: 155.32 MBPS, and 425 IOPS
Lun 0_9 on drive [20:9 ] random read throughput: 156.75 MBPS, and 423 IOPS
Calibrating flash disks (read only, note that writes will be significantly slower) ...
Lun 1_0 on drive [FLASH_1_0] random read throughput: 273.25 MBPS, and 19900 IOPS
Lun 1_1 on drive [FLASH_1_1] random read throughput: 272.43 MBPS, and 19866 IOPS
Lun 1_2 on drive [FLASH_1_2] random read throughput: 272.38 MBPS, and 19868 IOPS
Lun 1_3 on drive [FLASH_1_3] random read throughput: 273.16 MBPS, and 19838 IOPS
Lun 2_0 on drive [FLASH_2_0] random read throughput: 273.22 MBPS, and 20129 IOPS
Lun 2_1 on drive [FLASH_2_1] random read throughput: 273.32 MBPS, and 20087 IOPS
Lun 2_2 on drive [FLASH_2_2] random read throughput: 273.92 MBPS, and 20059 IOPS
Lun 2_3 on drive [FLASH_2_3] random read throughput: 273.71 MBPS, and 20049 IOPS
Lun 4_0 on drive [FLASH_4_0] random read throughput: 273.91 MBPS, and 19799 IOPS
Lun 4_1 on drive [FLASH_4_1] random read throughput: 273.73 MBPS, and 19818 IOPS
Lun 4_2 on drive [FLASH_4_2] random read throughput: 273.06 MBPS, and 19836 IOPS
Lun 4_3 on drive [FLASH_4_3] random read throughput: 273.02 MBPS, and 19770 IOPS
Lun 5_0 on drive [FLASH_5_0] random read throughput: 273.80 MBPS, and 19923 IOPS
Lun 5_1 on drive [FLASH_5_1] random read throughput: 273.26 MBPS, and 19926 IOPS
Lun 5_2 on drive [FLASH_5_2] random read throughput: 272.97 MBPS, and 19893 IOPS
Lun 5_3 on drive [FLASH_5_3] random read throughput: 273.65 MBPS, and 19872 IOPS
CALIBRATE results are within an acceptable range.
As you can see, the drives can actually be driven even higher than the stated 300 IOPS per drive.
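If you want to pull the per-drive numbers out of a calibrate run rather than eyeball them, something like the following Python sketch would do. The regular expression is written against the output format shown above and is an assumption about that format rather than a documented interface; the function name is hypothetical, and the sample is just three lines copied from the run above.

```python
import re
from statistics import mean

# Matches the per-LUN result lines of the calibrate output shown above.
LUN_LINE = re.compile(
    r"Lun (?P<lun>\S+) on drive \[(?P<drive>[^\]]+)\] "
    r"random read throughput: (?P<mbps>[\d.]+) MBPS, and (?P<iops>\d+) IOPS"
)


def per_lun_iops(calibrate_output):
    """Return ({lun: iops} for hard disks, {lun: iops} for flash)."""
    disks, flash = {}, {}
    for match in LUN_LINE.finditer(calibrate_output):
        is_flash = match["drive"].strip().startswith("FLASH")
        target = flash if is_flash else disks
        target[match["lun"]] = int(match["iops"])
    return disks, flash


sample = """
Lun 0_0 on drive [20:0 ] random read throughput: 155.60 MBPS, and 422 IOPS
Lun 0_1 on drive [20:1 ] random read throughput: 155.95 MBPS, and 419 IOPS
Lun 1_0 on drive [FLASH_1_0] random read throughput: 273.25 MBPS, and 19900 IOPS
"""

disks, flash = per_lun_iops(sample)
print(f"hard disk LUNs: {len(disks)}, mean IOPS {mean(disks.values()):.1f}")
print(f"flash LUNs: {len(flash)}, mean IOPS {mean(flash.values()):.1f}")
```

Fed the full output above, the twelve hard disk figures average about 421 IOPS per drive.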
So, why can’t SANs achieve this high number?
A SAN that is dedicated to one server with one purpose should be able to take advantage of command queuing. But SANs are not typically configured in this manner. A SAN is a shared, general purpose disk infrastructure that is used by many departments and applications, from Database to Email. When sharing resources on a SAN, great care is taken to ensure that the number of outstanding IO requests does not get too high and cause the fabric to reset. In Solaris, SAN vendors require the setting of the “sd_max_throttle” parameter, which limits the amount of IO presented to the SAN. This is typically set very conservatively so as to protect the shared SAN resource by queuing the IO on the OS.
long story short…
A 180 IOPS/drive rule of thumb for SANs might be reasonable, but the “drive” is definitely capable of more.
Exadata has dedicated drives, is not artificially throttled, and can take full advantage of the drives’ capabilities.
How much Disk do I need for my new Oracle database? Answer:-
{Disclaimer. This is of course just my opinion, based on some experience. If you use the above figures for a real project and get the total disc space you need wrong, don’t blame me. If you do and it is right, then of course you now owe me a beer.}
Many of us have probably had to calculate the expected size of a database before, but the actual database is only one component of all the things you need to run the Oracle component of your system. You need to size the other components too – Archived redo logs, backup staging area, dataload staging area, external files, the operating system, swap space, the oracle binaries {which generally get bigger every year but shrink in comparison to the average size of an Oracle DB} etc…
In a similar way to my thoughts on how much database space you need for a person, I also used to check the total disk space taken up by every database I created and by those that I came across. {A friend emailed me after my earlier posting to ask if I had an obsession about size. I think the answer must be “yes”}.
First of all, you need to know how much “raw data” you have. By this I mean what will become the table data. Back in the early 90’s this could be the total size of the flat files the old system was using, even the size of the data as it was in spreadsheets. An Oracle export file of the system gives a pretty good idea of the raw data volume too. Lacking all of these, you need to roughly size your raw data. Do a calculation of “number_of_rows*sum_of_columns” for your biggest 10 tables (I might blog more on this later). Don’t be tempted to overestimate; my multipliers allow for the padding.
Let us say you have done this and it is 60GB of raw data for an OLTP system. Let the storage guys know you will probably want about 500GB of space. They will then mentally put it down as “of no consequence” because, if you have dedicated storage guys, you probably have many terabytes of storage. {Oh, I should mention that I am not considering redundancy at all but space that is provided. The amount of actual spinning disk is down to the level and type of RAID your storage guys make you use. That is a whole other discussion}.
If you come up with 5TB of raw data for a DW system then you need around 12-15TB of disk storage.
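The raw data estimate and the two examples above can be sketched in a few lines of Python. The table figures are invented purely so that they add up to the 60GB example, and the ratios are simply what the two examples imply (about 8x for the small OLTP system, 2.4x to 3x for the DW); they are not general constants.

```python
def raw_data_bytes(tables):
    """tables: iterable of (row_count, sum of average column lengths in bytes)."""
    return sum(rows * row_bytes for rows, row_bytes in tables)


# Hypothetical biggest tables: (rows, bytes per row)
biggest_tables = [
    (200_000_000, 150),
    (50_000_000, 400),
    (10_000_000, 1_000),
]
raw_gb = raw_data_bytes(biggest_tables) / 1e9

# Ratios implied by the two examples in the text.
OLTP_RATIO = 500 / 60                # 60GB raw -> about 500GB of disk
DW_RATIO_RANGE = (12 / 5, 15 / 5)    # 5TB raw -> 12-15TB of disk

print(f"raw data: {raw_gb:.0f} GB")
print(f"OLTP disk request: about {raw_gb * OLTP_RATIO:.0f} GB")
print(f"DW ratio range: {DW_RATIO_RANGE[0]:.1f}x to {DW_RATIO_RANGE[1]:.1f}x")
```

Note how sharply the ratio drops between the two cases, which is the point the following paragraphs go on to make.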
If you come up with more than a Terabyte or so of raw data for an OLTP system or 10 to 20 Terabytes for a DW, when you give your figures to the storage guys/procurement people then they may well go pale and say something like “you have got to be kidding!”. This is part of why the multiplication factor for Data Warehouses and larger systems in general is less, as you are forced to be more careful about the space you allocate and how you use it.
The overhead of total disk space over Raw data reduces as the database gets bigger for a number of reasons:
My best ever ratio of database size to raw data was around 1.6 and it took an awful lot of effort and planning to get there. And an IT manager who made me very, very aware of how much the storage was costing him (it is not the disks, it’s all the other stuff).
I should also just mention that the amount of disk you need is only one consideration. If you want your database to perform well you need to consider the number of spindles. After all, you can create a very large database indeed using a single 2TB disc – but any actual IO will perform terribly.
How big are you in the digital world?
By this, I mean how much space do you (as in, a random person) take up in a database? If it is a reasonably well designed OLTP-type database a person takes up 4K. OK, around 4K.
If your database is holding information about people and something about them, then you will have about 4K of combined table and index data per person. So if your database holds 100,000 customers, then your database is between 200MB and 800MB, but probably close to 400MB. There are a couple of situations I know of where I am very wrong, but I’ll come to that.
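The arithmetic behind that range is simple enough to write down. This Python sketch assumes the 200MB to 800MB spread corresponds to 2K to 8K per person and uses round decimal units, as the figures above do; the function name is hypothetical.

```python
def people_db_size_mb(people, kb_per_person=4):
    """Combined table and index size in MB, using 1MB = 1000K."""
    return people * kb_per_person / 1000


low, likely, high = (people_db_size_mb(100_000, kb) for kb in (2, 4, 8))
print(f"100,000 customers: {low:.0f}MB to {high:.0f}MB, "
      f"probably close to {likely:.0f}MB")
```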
How do I know this? It is an accident of the projects and places I have worked at for 20 years and the fact that I became strangely curious about this. My first job was with the NHS and back then disk was very, very expensive. So knowing how much you needed was important. Back then, it was pretty much 1.5K per patient. This covered personal details (names, addresses, personal characteristics), GP information, stays at hospitals, visits to outpatient clinics etc. It also included the “reference” data, i.e. the information about consultants, wards and departments, lookups etc. If you included the module for lab tests it went up to just over 2K. You can probably tell that doing this sizing was a job I handled. This was not Oracle, this was a database called MUMPS and we were pretty efficient in how we held that data.
When I moved to work on Oracle-based hospital systems, partly because I had done the data sizing in my previous job and partly because I was junior and lacked any real talent, I got the job to do the table sizings again, and a laborious job it was too. I did it very conscientiously, getting average lengths for columns, taking into account the length bytes, row overhead, block overhead, indexes etc etc etc. When we had built the database I added up the size of all the tables and indexes, divided by the number of patients and… it was 2K. This was when I got curious. Had I wasted my time doing the detailed sizings?
Another role and once again I get the database sizing job, only this time I wrote a little app for it. This company did utilities systems, water, gas, electricity. My app took into account everything I could think of in respect of data sizing, from the fact that the last extent would on average be 50% empty to the tablespace header. It was great. And pointless. Sum up all the tables and indexes on one of the live systems and divide by the number of customers and it came out at 2-3K per customer. Across a lot of systems. It had gone up a little, due to more data being held in your average computer system.
I’ve worked on a few more person-based systems since and for years I could not help myself, I would check the size of the data compared to the number of people. The size of the database is remarkably consistent. It is slowly going up because we hold more and more data, mostly because it is easier to suck up now as all the feeds are electronic and there is no real cost in taking in that data and holding it. Going back to the hospital systems example, back in 1990 it used to be that you would hold the fact a lab test had been requested and the key results information – like the various cell counts for a blood test. This was because sometimes you had to manually enter the results. Now the test results come off another computer and you get everything.
I said there were exceptions. There are three main ones:
I have to confess that I have not done this little trick of adding up the size of all the tables and indexes and dividing by the number of people so often over the last couple of years, but the last few times I checked it was still 3-4K – though a couple of times I had to ignore a table or two holding unstructured data.
{The massive explosion in the size of databases is at least partly down to holding pictures – scanned forms, photos of products, etc, but when it comes down to the core part of the app for handling people, it seems to have stayed at 4K. The other two main aspects driving up database size seem to me to be the move from regional companies and IT systems to national and international ones, and the fact that people collect and keep all and every piece of information, be it any good for anything or not}.
I’d love to know if your person-based systems come out at around 4K per person but I doubt if many of you would be curious enough to check – I think my affliction is a rare one.
In my last blog entry I alluded to perhaps not being all that happy about Fibre Channel. Well, it’s true. I have been having a love/hate relationship with Fibre Channel for the last ten years or so, and we have now decided to get a divorce. I just can’t stand it any more! I first [...]
OK, this one might be contentious, but what the heck – somebody has to say it. Let’s start with a question: Raise your hand if you have a feeling, even a slight one, that storage arrays suck? Most DBAs and sysadmins that I speak to certainly have this feeling. They cannot understand why the performance [...]
I was recently looking into a storage-related performance problem at a customer site. The system was an Oracle 10.2.0.4/SLES 9 Linux system, Fibre Channel attached to an EMC DMX storage array. The DMX was replicated to a DR site using SRDF/S. The problem was only really visible during the overnight batch runs, so AWR reports [...]
This year at the UKOUG Conference in Birmingham, acceptance permitting, I will present the successor to my original Sane SAN whitepaper first penned in 2000. The initial paper was spectacularly well received, relatively speaking, mostly because disk storage at that time was very much a black box to DBAs and a great deal of mystique [...]