By stretch | Wednesday, March 10, 2010 at 4:19 a.m. UTC
A number of people have written asking me what happened to a paper I wrote back in 2008 entitled "The Science of Network
Troubleshooting." Unfortunately, I neglected to republish the paper after revamping packetlife.net in late 2009, so here it is again
as a blog article.
Troubleshooting is not an art. Along with many other IT methodologies, it is often referred to as an art, but it's not. It's a science, if
ever there was one. Granted, someone with great skill in troubleshooting can make it seem like a natural talent, the same way a
professional ball player makes hitting a home run look easy, when in fact it is a learned skill. Another common misconception holds
troubleshooting as a skill derived entirely from experience with the involved technologies. While experience is certainly beneficial,
the ability to troubleshoot effectively arises primarily from the embrace of a systematic process, a science.
It's said that troubleshooting can't be taught, but I disagree. More accurately, I would argue that troubleshooting can't be taught
easily, or in great detail. This is because traditional education encompasses how a technology functions; troubleshooting
encompasses all the ways in which it can cease to function. Given that it's virtually impossible to identify and memorize all the
potential points of failure a system or network might hold, engineers must instead learn a process for identifying and resolving
malfunctions as they occur. To borrow a clichéd analogy, teach a man to identify why a fish is broken, rather than expecting him to
memorize all the ways a fish might break.
Troubleshooting as a Process
Essentially, troubleshooting is the correlation between cause and effect. Your proxy server experiences a hard disk failure, and you
can no longer access web pages. A backhoe digs up a fiber, and you can't call a branch office. Cause, and effect. Moving forward,
the correlation is obvious; the difficulty lies in transitioning from effect to cause, and this is troubleshooting at its core.
Consider walking into a dark room. The light is off, but you don't know why. This is the observed effect for which we need to
identify a cause. Instinctively, you'll reach for the light switch. If the light switch is on, you'll search for another cause. Maybe the
power is out. Maybe the breaker has been tripped. Maybe someone stole all the light bulbs (it happens). Without much thought,
you investigate each of these possible causes in order of convenience or likelihood. Subconsciously, you're applying a process to
resolve the problem.
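The dark-room search can be written down as a small sketch. The candidate causes, their likelihood and effort figures, and the `investigate` helper are all invented for illustration; the only point is that the checks are put in order before any of them is run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Candidate:
    name: str
    likelihood: float             # rough guess between 0 and 1 (assumed, not measured)
    effort: float                 # rough cost of checking, arbitrary units (assumed)
    is_cause: Callable[[], bool]  # returns True if this turns out to be the cause

def investigate(candidates: List[Candidate]) -> Optional[str]:
    """Check candidates in order of likelihood per unit of effort.

    Returns the name of the first confirmed cause, or None if none is confirmed.
    """
    ordered = sorted(candidates, key=lambda c: c.likelihood / c.effort, reverse=True)
    for candidate in ordered:
        if candidate.is_cause():
            return candidate.name
    return None

# Hypothetical dark room in which the breaker is the real culprit.
room = [
    Candidate("light switch is off", 0.60, 1.0, lambda: False),
    Candidate("breaker tripped",     0.20, 5.0, lambda: True),
    Candidate("power outage",        0.15, 2.0, lambda: False),
    Candidate("bulbs stolen",        0.05, 3.0, lambda: False),
]
found = investigate(room)
print(found)
```

Dividing likelihood by effort is just one simple way to combine "convenience" and "likelihood"; any ordering that puts cheap, probable checks first serves the same purpose.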
Even though our light bulb analogy is admittedly simplistic, it serves to illustrate the fundamentals of troubleshooting. The same
concepts scale to far more complex scenarios. From a high-level view, the troubleshooting process can be
reduced to a few core steps:
1. Identify the effect(s)
2. Eliminate suspect causes
3. Devise a solution
4. Test and repeat
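Read as a loop, the four steps above might look like the following sketch. Every name in it (`troubleshoot`, `observe`, `suspects`, `apply_fix`) is hypothetical, and the toy "network" is only a dictionary standing in for real devices.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
# A suspect pairs a test ("is this the cause?") with a fix to apply if it is.
Suspect = Tuple[str, Callable[[State], bool], Callable[[State], None]]

def troubleshoot(state: State,
                 observe: Callable[[State], List[str]],
                 suspects: List[Suspect],
                 max_rounds: int = 5) -> List[str]:
    """Return the names of the causes that were found and fixed, in order."""
    fixed: List[str] = []
    for _ in range(max_rounds):
        effects = observe(state)                 # 1. identify the effect(s)
        if not effects:
            break                                # nothing is wrong any more
        for name, is_cause, apply_fix in suspects:
            if is_cause(state):                  # 2. eliminate suspect causes
                apply_fix(state)                 # 3. devise (and apply) a solution
                fixed.append(name)
                break
        else:
            break                                # no known suspect explains the effects
        # 4. test and repeat: the next pass observes the network again
    return fixed

def observe(state: State) -> List[str]:
    # Toy observation: any failed component means web pages do not load.
    return [] if all(state.values()) else ["cannot load web pages"]

network = {"cable_plugged": False, "proxy_disk_ok": False}
suspects = [
    ("unplugged cable", lambda s: not s["cable_plugged"],
     lambda s: s.update(cable_plugged=True)),
    ("failed proxy disk", lambda s: not s["proxy_disk_ok"],
     lambda s: s.update(proxy_disk_ok=True)),
]
fixed = troubleshoot(network, observe, suspects)
print(fixed)  # ['unplugged cable', 'failed proxy disk']
```

Note that the first fix does not clear the symptom here, so the loop observes again and finds a second cause. That is the "repeat" in the last step.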
Step 1: Identify the Effect(s)
If you've been a network engineer for more than a few hours, you've been told at least once that the Internet is down. Yes, the
global information infrastructure some forty years in the making has fallen to its knees and is in a state of complete chaos. All this
is, of course, confirmed by Mary in accounting. Last time it was discovered her Ethernet cable had come unplugged, but this time
she's certain it's a global catastrophe.
Correctly identifying the effects of an outage or change is the most critical step in troubleshooting. A poor judgment at this first step
will likely start you down the wrong path, wasting time and resources. Identifying an effect is not to be confused with deducing a
probable cause; in this step we are focused solely on listing the ways in which network operation has deviated from the norm.
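One way to keep this step honest is to record effects as plain differences between a known-good baseline and what is observed now, with no guess at a cause attached. The check names and results below are invented for illustration.

```python
from typing import Dict, Optional, Tuple

def list_deviations(baseline: Dict[str, str],
                    observed: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {check: (expected, actual)} for every check that differs from baseline."""
    return {
        check: (expected, observed.get(check))
        for check, expected in baseline.items()
        if observed.get(check) != expected
    }

# Hypothetical results gathered from Mary's workstation.
baseline = {
    "ping gateway": "ok",
    "resolve DNS name": "ok",
    "load intranet page": "ok",
    "load external page": "ok",
}
observed = {
    "ping gateway": "ok",
    "resolve DNS name": "ok",
    "load intranet page": "ok",
    "load external page": "timeout",
}
effects = list_deviations(baseline, observed)
print(effects)  # {'load external page': ('ok', 'timeout')}
```

The output is a list of effects only: "external pages time out" is a far narrower statement than "the Internet is down", and it says nothing yet about why.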
Identifying effects is best...