The “Prod-Server Casino”
A production environment isn’t anything like a casino, at least not from the perspective of the guests. It practically needs to be guaranteed to succeed (so I guess the house always winning is sort of similar). You want everything in it to be a sure thing. It isn’t grand. There are no fountains, no comps, no shows, no gastronomical amazement. A production environment isn’t anything like I’d imagine Vegas to be – in fact, what happens in a production SQL Server rarely stays on that SQL Server – sometimes it can follow you around everywhere you go…
But.. I do play a game of chance in production environments from time to time, and I think you should too. It’s not just any game of chance, though; I try to make losing bets every time. Before I describe a few of those bets, let me talk about what the “bets” are and what the “house” looks like. Let me talk about the main game in a production environment…
The Game – It’s a simple game. I guess you could call it “up-time”, “availability”, or “SLA Compliance.”
The House, in this case, could be called “Murphy’s Law”, “Failure”, or you might even say, “Fate” …
The Bets are simple – you are making a wager by doing something “extra” to provide insurance against a failure… Sounds simple. Why is that a losing bet? Well, it isn’t, if you’re talking about the “fixed” games that no “Prod-Server Casino” would take – like taking a backup while thinking “can I restore this?” instead of only thinking about performance, running DBCC CHECKDB on a regular basis, or testing your High Availability setup.
But… are those really risky bets? I mean, are you really risking anything by taking a backup and every so often making sure you can restore it? No, of course not – those things aren’t bets. It’s more like paying for a service to protect against a known failure. And I’d call you a risky fool if you didn’t do these things (well, at least if you knew you weren’t doing them… if you are an accidental DBA, I’d call you a typical client who never learned what you didn’t know, and I’d love to help you – but I digress again…)
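Those “fixed” games are cheap to play, too. Here’s a minimal T-SQL sketch of what I mean – the database name, backup path, and schedule are placeholders you’d swap for your own:

```sql
-- Take the backup with the restore in mind, not just the performance.
-- (Database name and path here are hypothetical examples.)
BACKUP DATABASE [YourDatabase]
TO DISK = N'X:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM, INIT;

-- Cheap sanity check: confirms the backup file is readable and the
-- page checksums inside it are intact.
RESTORE VERIFYONLY
FROM DISK = N'X:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM;

-- And on a regular schedule, check the database itself for corruption.
DBCC CHECKDB ([YourDatabase]) WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

The real test, of course, is periodically restoring that backup to a scratch server and running CHECKDB against the restored copy – RESTORE VERIFYONLY proves the file is readable, not that the database inside it is healthy.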
What I’m talking about here is “paying” with a little extra planned downtime (if you can spare it), or going a little further to protect against a problem that’s less likely than keeping your house after mortgaging it for one spin of an American roulette wheel.
What I mean to say is… Get paranoid… A good DBA has a healthy touch of paranoia..
So.. Some bets as an example…
“It’s MPIO, We’ve Tested the Failover”
You are on a system connected to a SAN with MPIO. You know the paths are good and redundant – in fact, you’ve had production issues kill a path and your SQL Server trudged along. The MPIO failover worked, life was happy, and all you suffered was a little performance degradation…
So the networking team wants to replace one of the switches on the path. You know this should be a non-event, and you could keep the system up as long as it isn’t the busiest point in the day. The chance of a failure here is pretty small – you’d win your mortgage back on that roulette wheel before you saw a failure here. But your system isn’t really 24×7; it can suffer a little planned downtime. Your users can handle a little inconvenience – they may get a bit uppity, but they can tolerate it – and the networking team only has to stay until 6:30…
So it’s a bit of a bet… Management wants to do it live, the network team would love to do it earlier, and your users would like to have access to their systems – they don’t work at Yahoo, so they can work from home.
The Wager – Some downtime, some minor scuffles with users, a little mocking for being paranoid, maybe an extra off-cycle copy-only backup or kicking off your log backup jobs before the work starts (see the sketch after this example)…
Betting For – You are betting that if you just let them do the work live, something would go wrong badly enough to cause some serious downtime – maybe even database corruption.
The House usually loses this one – normally this happens without issue; they’ve done it before, and you’ve seen the failover work before. You may have just caused 30–45 minutes of inconvenience, had to request a downtime window, had to go through paperwork, got folks mad at you, etc. – and it didn’t even fail!
But… imagine if The House wins! You just spent 30–45 minutes of downtime and some inconvenience, but you saved yourself a restore scenario, even more downtime, and potentially lost data…
I look at this one as a well-paid CEO walking by a nickel slot machine with a $2,000,000 payout and a bunch of lower payouts for various combinations, who just happens to have a nickel in their pocket… Spend the time to do it right…
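As for that off-cycle insurance backup in the wager above – COPY_ONLY is what lets you take it without disturbing your regular backup chain. A quick sketch, again with hypothetical names and paths:

```sql
-- An extra "insurance" backup right before the maintenance window.
-- COPY_ONLY means it won't reset the differential base or otherwise
-- interfere with the regular backup chain.
BACKUP DATABASE [YourDatabase]
TO DISK = N'X:\Backups\YourDatabase_PreMaintenance.bak'
WITH COPY_ONLY, CHECKSUM, INIT;

-- And one last log backup (part of the normal chain) so the most
-- recent transactions are safely on disk before the work starts.
BACKUP LOG [YourDatabase]
TO DISK = N'X:\Backups\YourDatabase_PreMaintenance.trn'
WITH CHECKSUM, INIT;
```

If the switch swap goes sideways, that’s the difference between restoring to a few minutes ago and explaining why you can only get back to last night.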
“It’s a Cluster! You Have Two Machines!”
My philosophy on HA solutions is “I want to avoid using them.” That happens to be my philosophy on backups, run books, etc. too.
I can’t remember exactly when this was, but it was Windows Server 2003 clustering with SQL Server 2000 or 2005, and I was a full-time DBA. I drew up a whiteboard diagram of what I wanted the networking setup on the cluster nodes to look like. There were a lot of lines criss-crossing around… I wanted redundant paths from each node to the public network, to the storage, and to the private heartbeat VLAN. I didn’t just want lots of cables; I wanted redundancy in NICs, switches, etc. (I also wanted the two nodes to be separated by a little more than a few vertical inches inside the rack…) I wanted to be able to lose components and not have a failover. The networking/Windows admin looked at me and said something like, “You have two machines! They’ll be clustered – isn’t this a little much?”
The Wager here was a couple of extra network cards, a few more cable runs, an angry system administrator, and some ribbing…
Betting For – I was betting that The House might let a path fail and we’d have to use the clustering for what we installed it for – we’d have seconds or minutes of downtime and some angry users in this busy call center, and be done with it. But I don’t like downtime; actually, I loathe downtime, especially in this particular busy environment. So I took the bet, and last I heard The House is still winning. Oh well – I’ll take and lose that bet every time.
Black Swan Events Happen
We’ve seen it. Those “once in a hundred years!” events happen – as I write this post, Texas has just experienced one. The storms and cold that “could never happen.” The “500-year floods” that come three times in 15 years. As DBAs, we need to be masters of the unexpected and act like what can go wrong will go wrong – and plan around it. In our SQL Server consulting practice, we’ve unfortunately made a lot of money from folks calling up after they’ve had their black swan event. If they had called ahead of time and acted proactively, we’d have made a lot less and everyone would have had a less stressful time.
We Get It…
I won’t keep going, but that “everything that can go wrong will go wrong someday” attitude is a healthy one for a DBA to have. I don’t mean dwell on failure, or analyze everything you do in life like you enjoy actuarial science… I mean have a healthy respect for the potential for failure and stack the deck in your favor. Be willing to make bets that you never want to cash in on and don’t expect to. Approach code reviews with that attitude; approach maintenance with that attitude. By planning for failure, you will succeed.