It’s been a long week. The vast majority of my time has been taken up by a series of bizarre problems at work. The element common to each of these problems is the firewall systems we employ. If you do not know what a firewall is exactly, a short definition might run something like this: it’s the chokepoint that regulates what sorts of traffic traverse the boundary between the wild and woolly public-access network and an individual’s or organization’s private networks. Any (and hopefully all) traffic that moves from one of those networks to the other does so through the firewall. The firewall, in turn, inspects the traffic to see if it is benign or malicious. If benign, the firewall lets it through, sometimes with some cosmetic modifications. If malicious, the firewall can block the traffic from going out or in. Think of it as a rural county sheriff’s office looking for out-of-state speeders to delay and ticket.

Here’s the abbreviated version of what transpired: the company adminisphere asked my group to add a new feature to our company’s firewall system. We got the instructions from the systems’ manufacturer; we followed them. The system broke.

Here’s the longer version. Tasks of this sort are a routine thing. Patches, updates, implementing new features—these are all the sorts of things that I commonly engineer and oversee. I do this on a wide variety of systems. This request was unique only in the fact that I hadn’t done this sort of feature update in some time. So I called the software vendor, spoke to their support and got the instructions for my particular requirements. They wrote down the procedure and sent it to me in e-mail. I was happy.

I looked over the procedure and everything looked straightforward. On the surface it looked almost trivial as these things go. If you’ve been reading along, you’re probably aware that this “trivial procedure” did not, in fact, prove to be trivial. Otherwise why would I be writing a long article about the bizarre, consequential series of unfortunate accidents and disasters? I knew I respected you readers for a reason.

The procedure was wildly incomplete. It did not do what it promised it would. The system broke in strange and inexplicable ways. The good news was that my company’s userbase – some one thousand programmers, artists and other assorted fatuous egotists – remained unaware of the problem. They weren’t broken. Their stuff still worked. They could still do everything they needed to do to complete their jobs. I witnessed the bad news. I had breathtakingly disabled my ability to work—at all. And I’d done so under the explicit instructions of the monster’s creator. I was no longer happy.

I spent almost three days on the phone with technical support. Since this system affected the lifeline of the company, I was limited in when I could take it down to attempt to resuscitate it. After all, it wasn’t hurting anyone but me. Except that I knew it couldn’t stay in this condition. I would need to work on it, and soon.

Enter the backup plan. I made a clone of the system. I did this the hard way, by hand, using my notes and documentation. I did it this way because none of the tools designed for just such an emergency were working. (See above, regarding: breathtaking disability.) I’d bring the broken system down and perform the various repair procedures. If they did not work, I would simply put the backup clone into production and everything would be smiles and rainbows.

I scheduled this for the wee hours of the morning. I went in early and took the broken system down. I performed all the ‘fix-it’ procedures technical support had recommended. I even performed some ‘fix-it’ procedures that I had come up with on my own. Mine worked better than theirs. But in the end, just as it looked like everything was going to be back to normal, the system stopped working entirely. Not just stopped working for me, but stopped working for all my users.

It was time to punt. I put in the backup systems. – They worked great, just like I thought they would.

“They’re getting closer.”
“Oh, yeah? Watch this.”
“Watch what?”
“I think we’re in trouble.”
“If I may say so, sir, I noticed earlier the hyperdrive motivator has been damaged. It’s impossible to go to light speed.”
“We’re in trouble.”

The backup system I had just built – carefully, methodically – was falling apart before my very eyes. I had no idea why. Technical Support had no idea why. No one had any idea why. It was frustrating. Our explanations seemed weak and ill-considered. But we were running out of options.

The good news was that the backup systems were working, if highly sporadically. I did find a way to jury-rig them and kick-start them when they started to misbehave. This bought me enough time to try using the quick cloning tools designed for just such an emergency. And this time they worked.

Clone Mark III went into production in the wee hours of Friday morning. So far, no problems.

The postscript to this story is the unfortunate one. At the end of the workday on Friday, my boss was questioned by a new vice president: Why did you not test this procedure? The answer: the vendor and our own experience had assured us that this was a routine procedure and that, even if it did not initially work, backing out the change was equally trivial. This proved to be untrue. The vice president had heard some sort of rumor from an anonymous coward that “some websites were slow” and “the Internet was down”. The first problem is a common one, with a wide variety of possible causes. The second problem was simply untrue. And we had the data to prove it.

This sort of unhelpful criticism and hindsight second-guessing is fairly common in my line of work. The vice president’s conclusion is not. We were instructed via email that if any change we are planning to make has the potential to affect even one user, we are to inform all users of the change, and get their approval, before it is made.

So rather than fix the problems, we’re supposed to ask everyone if it’s all right if we might, perhaps, trouble them so much as to fix the problems.

“I’ve got a bad feeling about this.”

In other news, the White Sox have taken the first two games from the Baltimore Orioles in their four-game series at the Cell. And Revenge of the Sith opens Thursday. This second fact is probably the reason for the quotes.
