In particularly distressful moments of frustration it is not uncommon for me to feel reduced to emotional arguments absent any useful data. And that just makes things worse. The rational engineer in me chafes. I know better. And then I feel guilty for not developing some strategy to provide a metric. If only I could quantify it. Quantifying the experience allows me to speak dispassionately to management. With a fistful of data, I can be calm, cool, collected. I move from being the boy who cried wolf to a more rational position. I am able to explain with an air of aloofness: critical infrastructure equipment is blowing up once a quarter, once a month, once a year. The actual numbers are less important than the fact that they’ve been collected and are reproducible. It becomes much more difficult to dismiss recommendations based on aging infrastructure when the rate of the infrastructure’s failure is clearly documented. I can think back to all those hours in the chemistry lab and still pull out the second law of thermodynamics. In an isolated system, entropy never decreases. Things fall apart.

So the end result is a pair of acronyms that we’ve come to know and love:

  • MTBF : Mean Time Between Failures
  • MTTR : Mean Time To Recovery

How long is it going to be before something breaks? How long is it going to take to get back to normal after it does? Attach dollar amounts to those times: equipment costs, labor costs, cost of lost business. Doing so makes it dramatically simpler to assess the risk of particular designs, or of other so-called cost-saving managerial decisions.
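To make the arithmetic concrete, here is a minimal sketch of how those two numbers and a dollar figure might fall out of a simple outage log. Everything in it is an assumption for illustration: the outage timestamps, the one-year observation window, the cost rates, and the function names are all made up, and MTBF is computed here as total uptime divided by the number of failures, which is one common convention among several.

```python
from datetime import datetime, timedelta

# Hypothetical outage log: (failure began, service restored). Made-up values.
outages = [
    (datetime(2023, 1, 10, 9, 0), datetime(2023, 1, 10, 13, 0)),  # 4 hours
    (datetime(2023, 4, 2, 22, 0), datetime(2023, 4, 3, 0, 0)),    # 2 hours
    (datetime(2023, 9, 15, 6, 0), datetime(2023, 9, 15, 12, 0)),  # 6 hours
]

# Assumed observation window: calendar year 2023 (365 days).
period_start = datetime(2023, 1, 1)
period_end = datetime(2024, 1, 1)


def downtime_hours(outages):
    """Total hours spent broken across all logged outages."""
    total = sum((end - start for start, end in outages), timedelta())
    return total.total_seconds() / 3600


def mttr_hours(outages):
    """Mean Time To Recovery: average outage duration in hours."""
    return downtime_hours(outages) / len(outages)


def mtbf_hours(outages, period_start, period_end):
    """Mean Time Between Failures: uptime in the window per failure."""
    window = (period_end - period_start).total_seconds() / 3600
    return (window - downtime_hours(outages)) / len(outages)


def outage_cost(outages, hourly_cost, cost_per_failure):
    """Dollar cost: time-based losses plus a fixed cost per incident."""
    return downtime_hours(outages) * hourly_cost + len(outages) * cost_per_failure


# Assumed rates: $5,300/hour in labor and lost business, $2,000 in parts per failure.
mttr = mttr_hours(outages)
mtbf = mtbf_hours(outages, period_start, period_end)
cost = outage_cost(outages, hourly_cost=5300, cost_per_failure=2000)

print(f"MTTR: {mttr:.1f} h, MTBF: {mtbf:.1f} h, cost: ${cost:,.0f}")
```

The particular rates matter less than having the log at all. Once failures are recorded with start and end times, the numbers can be recomputed by anyone who doubts them.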

But things are more complex now. Technology is still a mysterious, fetishized commodity. Personal taboos spring up and develop around how we contend with it. Many years ago I worked at a shop where all the engineers refused to make major changes on Tuesdays. Fridays I can understand. No one wants to kill the weekend if a big change goes wrong. But Tuesdays just seemed bizarrely arbitrary. The thing was that, for about a three-month period, every change made on a Tuesday backfired in a catastrophic way. Tuesdays were cursed. If you did anything significant on the systems on a Tuesday, it would blow up. Maybe this was self-fulfilling. Maybe this was anxiety developed over time, but I can tell you that after being the rat metaphorically shocked one too many times in the Tuesday Skinner Box, I learned to avoid major work on Tuesdays. And I wasn’t alone. Eventually even management got on board and actively recommended against major Tuesday changes. Mondays and Wednesdays were fine. Thursdays were still good. Fridays we avoided if possible for more rational concerns. Tuesdays were forbidden: there be monsters here.

The Tuesday example is an extreme case. I’m including it more for humorous effect. It does underline one important aspect, though: the role of conventional wisdom in the face of technology. Performance problems get attributed to particular fetishized causes across the board. It’s the firewall. It’s DNS. It’s Oracle. It’s swap. It’s spanning-tree. It’s the RAID controller. It’s a race condition. It doesn’t matter what the actual problem might be or how it’s manifesting; the first instinct is to blame the fetish.

If you happen to be responsible for the fetish in question, your first task is to clear the name of your particular ward before you can proceed. Because everyone knows that it’s always the “whatever the Hell it is that you’re responsible for.” And because this is conventional wisdom we’re talking about, it doesn’t matter how many times you clear your ward of wrongdoing. It’s always your stuff at fault. I don’t know where these conventions come from. Maybe they’re something akin to post-traumatic stress disorder. One bad experience scars you as a company for the rest of your technological life with that particular daemon.

The time spent clearing your system’s name is time wasted. It’s aggravating and depressing, too. I’m conscientious about my responsibilities. I want them to work. When they break, I step up and do what I can to correct them. It wounds me to hear it suggested time and time again that my systems are at fault. I take that personally. If my systems are at fault, I’m at fault.

In an increasingly complex technological world it is refreshing to be reminded “but we can measure that”. Enter a new metric promoted by Jim Metzler:

  • MTTI : Mean Time to Innocence

Metzler writes:

The conventional wisdom inside most companies, and even within most IT organizations, is that the cause of application degradation invariably is the network. This piece of conventional wisdom leads to a new management metric -- the mean time to innocence (MTTI). The MTTI is how long it takes for the networking organization to prove it is not the network causing the degradation. Once that task is accomplished, it is common to assume some other component of IT such as the servers must be at fault. This defensive (a.k.a., CYA) approach to troubleshooting elongates the time it takes to resolve application degradation issues.

It’s a great device. I think the MTTI metric, however cleverly titled, has the potential to refocus an entire IT organization away from perpetuating useless fetishes and onto fixing real problems. It accomplishes this by promoting cooperation and an appropriate realignment of the entire operational structure. Metzler concludes:

[M]ore IT organizations need to focus their management attention on performance and these organizations also need to move away from a CYA approach to troubleshooting that is based on assigning blame and adopt an approach to troubleshooting that is based on fixing the problem.

Alternately, we could just stop changing things on Tuesdays.
