Home |  Infosec |  DNS scripts |  Loghost HOWTO |  Syslog-ng FAQ |  The Art of System Administration 

Sep 2 20:31:19 PDT 2010

Campin dot Net

The Art of System Administration

Definition of troubleshooter

From the Merriam-Webster online dictionary

Main Entry: trou·ble·shoot·er
Pronunciation: -"shü-t&r
Function: noun
Date: 1905
  1. a skilled worker employed to locate trouble and make repairs in machinery and technical equipment
  2. an expert in resolving diplomatic or political disputes : a mediator of disputes that are at an impasse
  3. a person skilled at solving or anticipating problems or difficulties

This is a fitting definition of a system administrator - making repairs, dealing with people, and trying to anticipate as well as prevent problems.

An experienced admin is not necessarily a good admin

The best admin is the person with the right mindset, not the person with the most time on the job. In fact, the more "indispensable" an admin seems to be, the more likely they are a bad admin. The dead giveaway is if the person needs to be around to answer questions all the time, which strongly suggests two things:

  1. That the documentation is incomplete

    Complete network, host, and application documentation is a must for every site. An admin who is constantly called while on vacation, or while others are on call, is probably not documenting their systems properly. Everything needed to understand and fix problems at their site should be clearly documented in a central location.

  2. That the systems aren't discoverable

    The systems should be as self-documenting as possible. This means that start/stop/reload scripts should be in conventional locations (like /etc/init.d/ on SysV and on most Linux distributions), MOTD messages should give helpful info, scripts should have comments explaining their usage and purpose, and automated alerts should send useful information about the error condition(s) found. Leave nothing to be rediscovered every time someone new has to work on the systems; let them spend their time working on the actual issue(s) at hand.

    An admin unfamiliar with the machine(s) in question should be able to find their way around the system with a minimum of trouble.
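As one illustration of an alert that sends "useful information about the error condition", here is a minimal Python sketch. The function name format_alert, the field layout, and the example host and document path are all hypothetical, chosen only for illustration; the point is that the message says what failed, where, how bad it is, and where to look first.

```python
import socket
from datetime import datetime, timezone


def format_alert(check, observed, threshold, first_step, host=None, now=None):
    """Build an alert body that says what failed, where, and what to try first.

    All field names here are illustrative assumptions, not a standard format.
    """
    host = host or socket.gethostname()
    now = now or datetime.now(timezone.utc)
    lines = [
        f"ALERT: {check} on {host}",
        "Time (UTC): " + now.strftime("%Y-%m-%d %H:%M:%S"),
        f"Observed: {observed}",
        f"Threshold: {threshold}",
        f"First step: {first_step}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example values; the document path is an assumption.
    print(format_alert(
        check="disk usage on /var",
        observed="97%",
        threshold="90%",
        first_step="see /usr/local/doc/disk-full.txt",
    ))
```

An admin who has never seen the host can act on a message like this without first working out which machine, which check, and which document applies.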

Many admins become mired in their site's problems, and stop trying to improve their situation. They accept that their disks keep filling up, that their applications keep dying, and that mundane tasks take up all their time.

If they wrote cron jobs to trim files that grow until filesystems fill, restarted dead applications with init, cfengine, cron scripts, or daemontools, and automated repetitive tasks from cron or cfengine, they would have a smooth-running network. Once things run smoothly, they can spend their time updating software, improving security, or working on any of the many projects that improve overall conditions. Such projects get little effort at sites without a proactive attitude.
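A cron job that trims files before a filesystem fills can be very small. The following Python sketch is one possible approach, not a prescribed one: the function name trim_old_files, the age-based policy, and the "*.log" pattern are assumptions made for illustration. It deletes files older than a given number of days and returns what it removed, so that the calling script can log it.

```python
import time
from pathlib import Path


def trim_old_files(directory, max_age_days, pattern="*.log", dry_run=False, now=None):
    """Delete files in `directory` matching `pattern` that are older than
    `max_age_days` (by modification time). Returns the list of removed paths.

    With dry_run=True nothing is deleted; the list shows what would be removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        # Only plain files are considered; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

A small wrapper script run nightly from cron could call, for example, trim_old_files("/var/log/myapp", 14), where the directory and retention period are hypothetical values to be replaced with the site's own. Running it first with dry_run=True shows what would be deleted.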

We don't need no stinking disaster recovery plan

The network is totally unusable. System administrators are frantically running back and forth from the server room, in a total panic.

The web site is down, as well as corporate email, due to a power outage at the office server room where they are hosted. The IT director walks out into the area where the network and system administrators work and announces that it is time for plan B. Everyone stops running back and forth, stops yelling into phones and at each other, and looks at the director.

"Plan B?" the network manager asks. "We have no plan B. You laid off the person in charge of disaster and contingency planning six months ago!"

The director pauses for a moment, brow furrowed, before replying, "Well, he seems to have a replacement who replied when I sent email to the old guy's email address. The new guy was very responsive the first time I sent email, but after that he just seemed to ignore me."

The network manager asks for the name of the replacement.

"Vacation something," replies the director. "Oh, yes, I have it: Vacation Program."
