Yes, he was impeached by the House and acquitted by the Senate. Not the point of my post today, so forget the politics. He wasn’t impeached for all of that business with Monica Lewinsky. Was it unethical? Yes. Was it immoral? Well, to me it is. Was that activity an impeachable offense? No. Why was he in legal trouble then? Simple: he lied. According to the articles of impeachment passed by the House, he perjured himself about his affair while under oath. You may also remember his infamous press conference where he did the same thing to the public.
What’s Your Point? This is a SQL Server Blog…
Well, my point is simple: often the cover-up or lie is worse (or at least has worse effects) than the act being lied about.
That’s where this relates to my chosen career field. We, as DBAs, sysadmins, developers, etc., have to make sure we pay attention to the many lessons we see in public oopses.
Ever Have This Feeling?
One time, I was doing some work on a critical production cluster. I was adding drives and modifying an existing drive. I took the disk offline (shh if you already know how this ends)… And then… I had that feeling. Please tell me you know the feeling also – my face felt like it was on fire. My fight-or-flight system was in overdrive as I had an adrenaline dump. My hands got a bit jittery; awareness increased while, at the same time, a narrow focus set in. My stomach? Oh man. That horrible feeling of your stomach being twisted and turned upside down like an engine that just won’t turn over. Yup… I realized what I had done when I saw the SQL cluster begin a dance of fail-overs. You bonehead! That drive was a dependency of the SQL Server group.
Yup. I brought the production cluster to a standstill for about 10 minutes while I hustled to repair the damage. While fixing that issue I had two choices for how to proceed:
- Listen to the voice in my head that says (somehow italics look more sneaky), “quickly fix it, figure out how to cover the tracks and draft an e-mail that starts with something like, ‘a mysterious issue…’ “
- Listen to the voice in my head that says, “quickly fix it, ping someone in charge and give them a heads up and let them know what you did. Prepare to write a note explaining the event.”
I chose Option 2. I’m not saying I’m better than you if you went with Option 1. I’m not saying I’m perfect. I’m just saying that Option 1 doesn’t work. Early in my career I may have gone with something between the two options. Even that doesn’t work, because it leaves loose ends. The best manager I ever worked for (I blogged about him when I answered the SQL Quiz on Leadership) was really big on this. His philosophy, more or less, was “Own it”. If you don’t own it on his team, you have a problem – you’ve probably lost some respect and trust. If you do own it, and the mistake or issue wasn’t something really horrible, you’ll end up alright in the long run.
How Do You “Own It”?
It’s simple. You admit you messed up, fix the issue (or enlist help in fixing it), make a plan to prevent it from happening in the future, and move on. You’ll be fine. We are all human, we all mess up, and most of us still have our jobs. Heck, even Bill Clinton, after not owning it, commands a lot of money to speak.
A Pattern
Now, one of the flaws I am still working on after 10 years in the field is keeping e-mails short and to the point, so take this advice with that caveat in mind. A good, quick template of an e-mail I could have sent for the issue above:
Earlier today the cluster service became unavailable while I was doing maintenance on it. This e-mail outlines what happened.
Summary
I was tasked with removing a disk that was no longer being used. As the disk was removed, I realized (too late) that it was a dependency of the SQL Server group in the cluster. This meant that SQL Server failed back and forth between the nodes until it rested in a failed state. I was able to resolve this by breaking the dependency on the unused disk and bringing the cluster back online.
How Was This Missed?
I should have checked the dependencies before touching the disk. I did not use a checklist, and I missed this important detail.
What Will Be Done to Prevent It?
I have created a document for working with cluster resources like disks. This document links back to Microsoft checklists and includes a warning about resource dependencies, along with a process to check them. I have also sent a note to the server and DBA teams outlining what happened.
That isn’t exactly what I would write, but the bolded points are the “pattern”: describe what happened, why it happened and how it won’t happen again… and then move on. Deal with the repercussions and feel confident knowing that you were honest and up front and that you mitigated your own issue. And remember…
You Learned a Good Lesson
So maybe this isn’t one to include in the e-mail or in a mea culpa discussion with your manager (though you should have that mea culpa discussion). Keep this tip for yourself – whatever you did wrong… whatever caused that stomach roller coaster just taught you a lesson you won’t soon forget. Like I said, unless you really, totally ruined the day for your company, you’ll still have a job. You’ll still get respect from your colleagues because you quickly came clean and even talked about how to prevent it. You learned something, though. So eventually you may even have a net positive out of the issue. Just remember to learn from that lesson and don’t let pride get in the way.
Bring Your Own Negative Example
So, I didn’t talk about any negative examples. I can instantly think of times when I’ve been involved with projects and teams that had people who listened to voice number 1. I won’t share them, though. I don’t need to. You can think of your own already, right? How did you feel as the coworker (or victim) of the person who reached for the broom and the corner of the carpet? Don’t be that person…
Share your story below in the comments. I want to hear how you cope with a mistake you’ve made on the job. I want to hear about a counterexample you’ve bumped into and how it made you feel.
Thanks for sharing this. This is an important lesson for anyone to learn, from a junior DBA up to the most senior of DBAs. Credibility is key: it’s hard to build up, easy to lose (it just takes one major bonehead mistake) and even harder to regain after said mistake.

At my previous job I was tasked with migrating a custom web app from dev to staging and from staging to prod. This was all done on virtual servers, so the process was to snapshot the server, make your changes, wait a few days to see if we needed to roll back, and if all was well after a few days then commit the snapshot; otherwise roll back. Well, I had several hundred servers to deal with overall. On this particular rollout from dev to staging I had several deployments on different servers to deal with. I went through the whole process, soup to nuts, of moving all the parts and pieces to their respective servers when all of a sudden I started getting calls. Production had gone down – what happened? Aww crap, I had accidentally rolled out to production.

Luckily (and I thank God for VMware) the snapshot saved my bacon and I simply had to revert to it, but I still had to send out the mea culpa e-mail. Despite my e-mail I still got written up, but I’ll be damned if I didn’t learn a lesson, and I never repeated said mistake. Again, responsibility is key.
Jorge – Excellent comment. Being written up stinks, it hurts, but… your point stands. You still have your job. Ever wonder what would have happened if you didn’t send the e-mail? 🙂
You summed it up well: credibility is key and hard to build, but easy to lose. I think the bonehead mistake may cost you a tiny bit, but lying about the mistake? That credibility may never come back.
Thanks for the comment and for sharing your own experience. The FAA runs a “near miss” database where pilots, ground crew and controllers can anonymously report mistakes that were almost serious. They do it so they can see where training is needed, where warnings are needed, and so those roles can learn from others’ mistakes. Maybe we need an Information Technology near-miss database 🙂
Thanks Mike! The closest thing to a “near miss” database at this point is everyone’s blog, or maybe even SQLDumbass’ blog!
The only major snafu I had was using DTS to migrate from DEV to production back in the days of SQL Server 7.0. I cancelled the process, but it wasn’t wrapped in a transaction, so a bunch of tables in production had been dropped. Oops – and I hadn’t done a full backup of the production environment before starting, so it was the last full backup and then a bunch of transaction log backups. Fortunately this happened after hours, so only my boss had to be notified; no users were in the office using the application. I just had a late night that night.
Another minor issue was a classic: a delete without a where clause. I had the where clause, and I always do a select first to make sure I’m getting what I want. Well, I was doing it all in one SSMS tab, so when I commented out the select and highlighted the delete, I left the where clause out of the highlight, and it was a small table so the delete took < 1 sec. Fortunately it was a reporting table, so out went the e-mail saying I had done the stupid thing I don't give the users rights to do and that the affected reports would be up again in a short time.
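One habit that helps guard against that is wrapping the delete in an explicit transaction and eyeballing the row count before committing. A minimal sketch, with a made-up table name, filter and threshold:

```sql
-- Placeholder table and filter; the point is the pattern, not the names.
BEGIN TRANSACTION;

DELETE FROM dbo.ReportStaging
WHERE ReportDate < '2009-01-01';

-- If far more rows disappeared than expected, back out instead of committing.
IF @@ROWCOUNT > 500
BEGIN
    ROLLBACK TRANSACTION;
    PRINT 'Deleted more rows than expected - rolled back.';
END
ELSE
    COMMIT TRANSACTION;
```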
I always own up to my mistakes. It's hard not to make excuses, but when I catch myself doing it, I stop and say that I blew it and I'll fix it as soon as is humanly possible. It has always worked for me.
Ahh, yes… the classic “missing WHERE” conundrum. One of mine happened when I was training someone. He was mostly a front-end developer and I was training him on the job, giving him some SQL tips… early on in my career. We were connected to a large client site doing work for them. They needed some custom data manipulation… I forget what it was, but it was basically “these people need this information changed”… I wrote my update statement showing how to do it, and just as I clicked execute the person I was teaching said, “Where?!?!”… We ended up having to restore the pre-work backup and do a joined update. Got through it, but man… embarrassing.
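For anyone who hasn’t had to do that kind of repair, the idea is to restore the pre-work backup side by side under a different name and copy the clobbered values back with a joined update – roughly like this, with made-up database, table and column names:

```sql
-- ClientDB is the live database; ClientDB_Restore is the pre-work backup
-- restored under a different name. Pull the overwritten values back across.
UPDATE live
SET    live.AccountStatus = restored.AccountStatus
FROM   ClientDB.dbo.Customers         AS live
JOIN   ClientDB_Restore.dbo.Customers AS restored
       ON restored.CustomerID = live.CustomerID;
```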
This is cross-posted from my response to SQL Quiz #1:
While upgrading the second-largest database at my job from SQL Server 2000 to SQL Server 2005 – one that holds important financial data for credit card reconciliations – it was decided by the business users that a large portion of the data in the database would be archived off and then deleted from the database. At the time the database was just under 140GB in size, with the majority of the data in one very large table.
The end users worked with the vendor to get a script to handle this operation, and in testing, the process was going to run over 30 hours, which was not going to be possible. So after some review, and discussion with the vendor’s support and development staff, I realized that it would be faster to create a new copy of the table, insert the rows to be kept into it, drop the old table, rename the new one, and rebuild all the constraints and indexes. This took the process down to about half an hour in testing.
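In rough outline the swap pattern looked something like this – the table, columns and date here are invented for the example, and the real script also rebuilt the keys and indexes afterwards:

```sql
BEGIN TRANSACTION;

-- dbo.Transactions_New was created ahead of time with the same schema
-- (including the identity column) but no indexes yet.
SET IDENTITY_INSERT dbo.Transactions_New ON;

INSERT INTO dbo.Transactions_New (TransactionID, AccountID, Amount, TransactionDate)
SELECT TransactionID, AccountID, Amount, TransactionDate
FROM   dbo.Transactions
WHERE  TransactionDate >= '2007-01-01';   -- only the rows being kept

SET IDENTITY_INSERT dbo.Transactions_New OFF;

DROP TABLE dbo.Transactions;
EXEC sp_rename 'dbo.Transactions_New', 'Transactions';

-- Re-create the primary key, foreign keys and indexes here.

COMMIT TRANSACTION;
```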
The problem came when I loaded the script on production to run the purge process. In testing, I did the operation in an explicit transaction, and initially I had left the SET IDENTITY_INSERT statements out of the batch by mistake. It wasn’t a problem on development, because I had wrapped the code in an explicit transaction; when the inserts failed because I hadn’t SET IDENTITY_INSERT ON, I issued a rollback, waited, and then fixed the problem. For some reason, I never saved the corrected script.
Onto production: load the script, run it, and watch an 88GB table disappear in mere seconds. Now, I know that everyone has made a mistake at some point, and we are all familiar with that sinking, hot feeling you get in the seat of your pants when you realize it. Take the worst case of that feeling you have ever had, and multiply it by a factor of 10.
The good thing was that I had backups and the database was fully logged, so I backed up the tail of the log. (The first thing I actually did was let out a string of words that cannot be repeated, which caused the consultants in the cube next to mine to look over the wall and then take a break – but the log backup immediately followed.) Then I took a quick walk to my director’s office to explain the disaster I had just created and how long it would take to fix, and to say that I needed a quick walk around the parking garage to clear my head before attempting the fix, which he agreed was probably a good idea.
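The recovery itself was the standard tail-log-plus-restore sequence, which in simplified form (the database name, paths and time here are placeholders) looks like:

```sql
-- 1. Capture everything in the log since the last log backup.
BACKUP LOG FinanceDB
TO DISK = N'G:\Backups\FinanceDB_tail.trn'
WITH NORECOVERY;   -- leaves the database in a restoring state

-- 2. Restore the last full backup without recovering.
RESTORE DATABASE FinanceDB
FROM DISK = N'G:\Backups\FinanceDB_full.bak'
WITH NORECOVERY, REPLACE;

-- 3. Apply any intermediate log backups WITH NORECOVERY, then stop the
--    final (tail) log restore just before the accidental drop.
RESTORE LOG FinanceDB
FROM DISK = N'G:\Backups\FinanceDB_tail.trn'
WITH STOPAT = '2008-06-15 14:32:00', RECOVERY;
```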
Lesson Learned: Save early and save often. Really, I was fine; my backups existed and had been tested numerous times recently while performing mock upgrades, so I knew I was able to recover. The total impact was about 20 additional minutes of downtime over my initial estimate, which had provided ample wiggle room for a major problem requiring a restore of the database to a SQL Server 2000 instance if need be.
Plan for the worst, hope for the best, but never be without a good backup.
Another great example of a lesson learned during an oops. We can choose to learn from our mistakes or run and hide from them and pretend they didn’t happen. The only thing the latter seems to work well on is the bogeyman.
No, we can do both. The catch is to have your scapegoat picked out ahead of time. For Tom, it was Jerry. For Jerry, it was Tom. I’m not saying this is the right approach, just pointing out the error in logic. It isn’t the right approach. Owning the problem you’ve created is the only acceptable response for a true professional. Been there, done that, accidentally dropped the t-shirt along with the table. Good thing I had a valid recovery plan, just like Jonathan.
Mike, GREAT post and a very important tip for everyone. I have had my fair share of “OOPS!” moments, ranging from accidentally deploying something into production that wasn’t ready, to taking a database offline that I thought was no longer being used but was really backing the web site of a small division. That one was easy to recover from (just bring it back online), but the lesson learned went beyond documenting database usage: the database had been named for a specific purpose (which was no longer needed, hence my “cleanup” activity) but had been secondarily purposed for something completely unrelated to the original name. Thankfully I had developed the practice of taking production databases offline for a month or so before actually dropping them, just in case something like this happens.
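For anyone who wants to borrow that habit, the “offline first, drop later” step is as simple as this (the database name here is made up):

```sql
-- Take the suspect database offline; ROLLBACK IMMEDIATE kicks out open sessions.
ALTER DATABASE OldDivisionDB SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- If someone screams, the undo is instant:
-- ALTER DATABASE OldDivisionDB SET ONLINE;

-- If a month goes by with no complaints, drop it for real:
-- DROP DATABASE OldDivisionDB;
```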
And whenever an unexplained mystery happens in our shop, I remind everyone that accidents are understandable, but mysteries in tech just drive me (and our director) crazy.
Thank you for your great comment, Mark. It’s always good to hear more experiences so we can remember our mistakes aren’t that extraordinary.
I like your closing sentence. It sums it up really well. Accidents are fine. Mysteries aren’t – especially the mysteries that are only mysteries because someone is covering their tracks.
There are very few mistakes that should cause a person to lose his/her job. However, lying about a mistake (even a small one) will easily get you fired in many organizations. Personally, I’d rather live with the consequences of a dumb mistake than risk losing my job by trying to cover it up.
Good post… thanks for sharing.
Thanks, Tim.
Great comment, especially the “I’d rather live with the consequences of a dumb mistake than risk losing my job by trying to cover it up” part!
That is absolutely the right way. And yeah, some mistakes may get you fired, but most? If you own them and handle them, you have a chance 🙂 We try to instill that in our kids too – especially as they are now two years older than when I wrote that post, and there is one kicking around who wasn’t here then 🙂 – if you guys do something bad, there may be consequences. If you cover it up or lie about it, there will certainly be consequences, and that is worse than the original act in most cases.