It has been nine months since I’ve written here. Needless to say, a lot has happened!
First, my family was living in Africa for three months earlier this year while I did some tech work at an NGO hospital. Second, upon our return I decided to join the good people at Pythian. I’m not moving to Canada, although I will travel a decent bit as part of the company’s consulting group.
If you’re interested in the Africa trip, look at the Africa page. I wasn’t working with Oracle technology but it was still a very interesting, challenging and engaging project.
I thought I’d briefly share a few high-level insights. You might be surprised how well these lessons apply almost anywhere (even Oracle-related projects)!
Two fun and important projects at the hospital in Africa:
How much more fundamental does it get than copper wires and radio signals? And in our industry, it doesn’t matter what you do – these are also the fundamentals underneath what you’re building.
In the days of [everything]-as-a-service and engineered-[anything]-appliances, we’re building at a higher level than ever before. You might think that building on clouds (or any abstracted/virtualized platform) means we can leave the platform implementation details to specialists.
But smart companies and experienced engineers still pay a lot of attention to the fundamentals. There’s no magic or voodoo in computer systems. An experienced engineer can understand how a particular stack works from top to bottom – and you should be wary of anyone who won’t explain at some level of detail how their piece works.
Do you remember that YouTube video where the datacenter guy shouts into a bank of hard drives? Even if you’re buying pre-engineered, pre-packaged systems that come by the rack and fill half a room, you still need to ask the same basic environmental questions that you would with any other deployment into your datacenter.
Bottom line: everything generally comes down to the same few basic things – for example, processors, I/O, and memory/storage hierarchies. Ninety percent of what you need to know, even for very complex systems, is in chapter one of my computer systems college textbook. Know your fundamentals and find them even in your complex systems.
When I arrived in Africa, there were a number of issues crying out for immediate attention. For example: the head of finance couldn’t log in to his workstation unless it was unplugged from the network. Every morning, this friendly Canadian guy unplugged his desktop, logged in to his domain account, then reconnected the network cable so he could access his network shares.
This problem had existed for almost a year. A sudden power outage had corrupted a virtual server running as a domain controller. To get systems back up, overseas IT support had directed local employees to restore a previous backup image of the VM. This did restore that server’s file shares but caused havoc in the multi-master domain controller setup – which was never totally resolved.
A simple workaround beyond the unplug-and-login trick was not forthcoming. But now – after discussion with the director of the hospital – we made a key decision and changed course. A major upgrade was already on the wish list, so rather than diving into a complex debug-and-repair operation on a multi-master Windows domain, we put all our effort into the upgrade. And we re-architected the system so that this particular problem could never happen again.
When I say “re-architected” … I mean we started at the very beginning. Did you pay attention during the requirements engineering section of your software engineering college class?
Asking these simple questions led to two very important findings:
We realized that it was much better to have a simple system which would nonetheless prove reliable and easy to maintain, instead of a complex system which was hard to fix if something ever went wrong. The result was – as I wrote in my technical summary of the trip – we completely got rid of virtualization and the multi-master domain controller “cluster”. We migrated from four operating systems on two servers to a single server with a single operating system. But we also added a few things: on-disk encryption, RAID mirroring, improved & thoroughly tested backups, a printed & tested DR plan, a test server/domain exactly like the production server/domain, and a wiki.
The most important word here is actually not complexity but rather unjustifiable. We justified everything in the architecture by showing traceability to this organization’s unique requirements.
Here in the States, I spend a lot of time working with clusters and I really enjoy it. But I can’t count how many times I’ve thought someone forgot these simple questions. Do you know your requirements?
Maybe you’re buying pre-engineered or pre-packaged systems that come by the rack and fill half a room. You already learned from my first lesson and you have engineers who understand the fundamentals of these systems. But the most critical part is this second step: now you need to justify those fundamental architectural characteristics by connecting them to the unique requirements of your organization.
Big job? Yes. Somebody else’s job? I don’t buy it. No matter where you are in your organization, you can start asking questions and learning. Don’t assume it’s not your problem just because you’re not the decision-maker. If you become an expert on both your business and your technology, then before long everyone will be specifically asking for your input!
If it’s worthwhile for a non-profit hospital in the middle of Africa, then how much more will it be beneficial for you?
I do have four lessons, but I think I’ll save the other two for another article. (This got long.)
Update: Lessons from Africa, Part 2