Edited by: Narottam Regmi
Some important lessons come from highly technical issues; in other cases, we learn painful lessons from simple mistakes. This is a story about the latter.
I was working on a new project when I was called for a meeting. Apparently someone was looking through the log files of a set of servers I had previously managed and noticed that, on occasion, the servers were rebooting themselves for no reason and had been doing so for almost a year.
Head in hand, I asked them, “Do you see the Bounce service?”
“Don’t turn it off…it’s what’s keeping the system from crashing,” I said. I then explained that the service was designed to reboot the server if it detected a failure.
“What! So instead of fixing the problem you just set it up to reboot?”
They were incredulous, bordering on livid, and could not believe we would be so irresponsible as to implement an automatic reboot. If I were in their place I would have called our team much worse. The conversation with their vice president, who “wanted answers,” went something like this.
We were implementing our first business-to-business back end infrastructure. We had a large product database and were partnering with another organization to provide our products on their Web site over a private connection. Our architecture was sound, our development staff extraordinary, and our IT operations team experienced professionals. The business people were sharp and working with the technical side of the house along the way. This was a team designed to succeed.
Then there was the “problem.”
Shortly after we launched the service we unexpectedly lost our primary server and were forced to run on the fail-over server. The primary server simply crashed. Fortunately this happened during regular working hours, so we immediately addressed the issue and began our research. In the middle of that research, our fail-over server crashed. This wasn’t going to be pretty.
Our process was fairly standard: collect the data, analyze it, develop a theory, check whether the data supported the theory, then implement a fix. No luck. The next time the server dropped we did the same thing. No luck. The third, fourth, and fifth times we did exactly the same thing, still with no luck.
Over the next few weeks, nothing we did had any effect: new hardware, new network connections, re-installation of software. We tried everything. Our team was getting frustrated. Executives wanted answers, the business alliance was crumbling, and internally IT operations and the developers were doing battle. Things were looking bleak.
Finally, after getting a 3 a.m. page that the system was down again, I showered, went in to work, and pulled another 30+ hour day working through the issues. Realizing that we really could not operate like this for long, one of the developers hit on a brilliant idea: he would write a service that monitored the server locally and, when it detected the issue, rebooted the server before it became non-responsive. He called this service “bounce.”
Bounce was a stroke of brilliance and bought us the time we needed to finally isolate the offending problem and fix it without significantly impacting our business SLAs any further. Total cost to solve the problem: $250,000 in support and new hardware, before we accounted for lost revenue. The issue turned out to be one of our internally developed applications overwriting its designated memory space and stepping on other applications: we allocated 8 bytes but wrote 12 if a particular series of events occurred. Changing the allocation to 16 solved the issue.
Failure to Follow Up
Following the solution we had a month or so of root cause analysis meetings, process updates and new mandates for code review. We worked out the issues between the teams, celebrated success and moved on to other challenges satisfied with our results. If the story ended there it would be a success.
A year and a half later I had moved on to a different team, as had the developers. Our partners underwent similar changes, and even the executives had moved on in one way or another. Essentially, 18 months later a completely new team was managing the environment, which is where our story started.
Have you ever removed a band-aid and had the whole world fall apart? Perhaps not, but I have, and it taught me a painful lesson. The pace of IT operations ensures you will never have an opportunity to go back and clean up a temporary solution if you do not address it while the urgency is still fresh in mind. Removing temporary fixes, workarounds, and band-aids should be part of the follow-up in your problem management procedures. I realize this lesson is obvious, but most mistakes are, in fact, simple mistakes, and following standard procedures is one way to minimize them.