“A beginning is the time for taking the most delicate care that the balances are correct.”
It is spring. Time for planting new seeds. I started on a new job last week, and it seems that few of my friends and former colleagues are on their way to new adventures as well. I’m especially excited because I’m starting not just a new job – I will be working on a new product, far younger than Oracle and even MySQL. I am also making first tiny steps in the open-source community, something I’ve been looking to do for a while.
I’m itching to share lessons I’ve learned in my previous job, three challenging and rewarding years as a consultant. The time will arrive for those, but now is the time to share what I know about starting new jobs. Lessons that I need to recall, and that my friends who are also in the process of starting a new job may want to hear.
This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn’t miss, go take a look!
Everyone knows that seminal papers need a simple title and descriptive title. “A Relational Model for Large Shared Data Banks” for example. I think Michael Stonebraker overshot the target In a 2007 paper titled, “The End of an Architectural Era”.
Why is this The End? According to Michael Stonebraker “current RDBMS code lines, while attempting to be ‘one size fits all’ solution, in face, excel at nothing. Hence, they are 25 years old legacy code lines that should be retired in favor of a collection of ‘from scratch’ specialized engined”.
(This post originally appeared at the Pythian blog)
I’ve been following the discussion in various MySQL blogs regarding the sort_buffer_size parameters. As an Oracle DBA, I don’t have an opinion on the subject, but the discussion did remind me of many discussions I’ve been involved in. What’s the best size for SDU? What is the right value for OPEN_CURSORS? How big should the shared pool be?
All are good questions. Many DBAs ask them hoping for a clear cut answer – Do this, don’t do that! Some experts recognize the need for a clear cut answer, and if they are responsible experts, they will give the answer that does the least harm.
Its March 22. Exactly one month ago, I came back from few days spent in Colorado. I gave presentations, met amazing people, enjoyed skiing and drank a lot of beer, wine and whiskey. I also barely made it back, but that’s a different story.
Obviously I drank too much, or maybe I gave too many presentations. I don’t remember.
What I do know is that in the month since that eventful weekend, my life has taken a sharp turn.
I’m about to start working for Pythian. There will be a separate post about that, where I explain how amazing Pythian is and how to continue to follow my blog once it is merged with the official Pythian blog. Same hypothetical blog post will also include comments about how much I love my current colleagues and how sorry I am to leave them.
Recently I did some soul searching about my expertise as a DBA. I am not talking about my knowledge, my talents and my work style. I’m talking about which things I’m really comfortable doing. The commands I know by heart, the issues I ran into so often that I can diagnose with the tiniest clues.
There are definitely things I’m very good at. Diagnosing why RAC crashed or wouldn’t start. Solving a range of different problems with Streams. User managed recoveries. Netapp. BASH. Top. sar. vmstat. Redo log mining. Datapump. ASH and its relatives AWR and ADDM. Using Snapper to work with wait event interface. SQL coding, network diagnosis, Patching.
These are mostly things I do every day or close enough to it that the commands, the techniques, the traps and the limitations are always clear in my mind. But there are things that I do rarely or even never. This are important DBA skills, some are even very basic, which I do not have because they are not very useful in my specific position.
These include RMAN, ASM, Dataguard, AWK, perl, python, PL/SQL, tracing, SQL tuning, upgrade testing, benchmarks, many Linux administration tools, hadoop and those new NoSQL things, MySQL, Amazon’s cloud databases, RAT, partitions, scheduler.
These are all things that I know something about, that I’ve read about – but I can’t say I’m confident with any of these because I simply haven’t played with them all that much. After all, you learn by doing and running into issues – not by reading people say how everything works perfectly when they use it.
I’ve ran into multiple products that claim to offer automated root cause analysis, so don’t think that I’m ranting against a specific product or vendor. I have a problem with the concept.
The problem these products are trying to solve: IT staff spend much of their time trying to troubleshoot issues. Essentially finding the cause of effects they don’t like. What is causing high response times on this report? What is causing the lower disk throughputs?
If we can somehow automate the task of finding a cause for a problem, we’ll have a much more efficient IT department.
The idea that troubleshooting can be automated is rather seductive. I’d love to have a “What is causing this issue” button. My problem is with the way those vendors go about solving this issue.
Most of them use variations of a very similar technique:
All these vendors already have monitoring software, so they usually know when there is a problem. They also know of many other things that happen at the same time. So if their software detects that response time go up, it can look at disk throughput, DB cpu, swap, load average, number of connections, etc etc.
When they see that CPU goes up together with response times – Tada! Root cause found!
First problem with this approach: You can’t look at correlation and declare that you found a cause. Enough said.
Second problem: If you collect so much data (and often these systems have millions of measurements) you will find many correlation by pure chance, in addition to some correlations that do indicate a common issue.
What these vendors do is ignore all the false findings and present the real problems found at a conference as proof that their method works. Also, you can’t reduce the rate of false-findings without losing the rate of finding real issues as well.
My notes from two presentations given to the data mining SIG of the local ACM chapter.
Hadoop is a scalable fault-tolerant grid operating system for data storage and processing.
It is not a database. It is more similar to an operating system: Hadoop has a file system (HDFS) and a job scheduler (Map-Reduce). Both are distributed. You can load any kind of data into Hadoop.
It is quite popular – the last Hadoop Summit had 750 attendees. Not bad for a new open-source technology. It is also quite efficient for some tasks. Hadoop cluster of 1460 nodes can sort a Terabyte of data in 62 seconds – currently the world record for sorting a terabyte.
Hadoop Design Axioms:
Distributed file system. Block size is 64M (!). User configures replication factor – each block is replicated on K machines (K chosen by user). More replication can be configured for hot blocks.
A name node keeps track of the blocks and if a node fails the data on it will be replicated to other nodes.
Distributes jobs. It tried to run jobs local to their data to avoid network overhead. It also detects failures and even servers running behind on the processing. If a part of the job is lagging in processing, it will start copies of this part of the job on other servers with the hope that one of the copies will finish faster.
37Signals, the company behind few highly successful web-based applications, has published a book about their business building experience. Knowing that the company is both successful and has an unconventional business development philosophy, I decided to browse a bit.
One of the essays that caught my attention is “Build Less”. The idea is that instead of having more features than the competition (or more employees or whatever), you should strive to have less. To avoid any sense of irony – the essay is very short
One of the suggestions I would add to the essay is:
“Keep less data”
Keeping a lot of data is a pain. Indexing, partitioning, tuning, backup and recovery – everything is more painful when you have terabytes instead of gigabytes. And when it comes to cleaning data out, it always causes endless debates on how long to keep the data (3 month? 7 years?) and different life-cycle options (move to “old data” system? archiving? how to purge? What is the process?).
What’s more, a lot of the time customers would really prefer we won’t keep the data. Maybe its privacy concerns (when we keep a lot of search history) or difficulty in generating meaningful reports or just plain confusion caused by all those old projects floating around.
Google taught us that all the data should be stored forever. But perhaps your business can win by keeping less data.