What is system troubleshooting?

On the surface it sounds like a simple thing: it works or it doesn’t. But a system is made up of more than one device, and it is all the devices together that make a system do its job.

Let’s examine a bicycle as a system.

  • There is the frame, holding all the other parts together.
  • The handlebars, connected to the front wheel.
  • The pedals, connected to the back wheel via a set of gears and a chain, could be considered a subsystem.
  • The seat, connected to the frame.

The bicycle as a system would fail if any of these components failed. If the bicycle is unusable because of a flat tire, the cause of the failure is not the bike as a whole but the tire.

Now let’s look at a computer service as a system.

The service (a point of sale or a library catalog, for example) is the primary purpose of the system. It depends on:

  • The server it is on.
  • The server’s:
    • Operating system (A subsystem of the larger system)
    • Hardware
    • Cooling (Could be a shared subsystem)
    • Physical housing (Could be a shared subsystem)
    • Electrical supply (A shared subsystem)
    • Networking (A shared subsystem)
  • The workstations that you use to access the program.
  • The workstations’:
    • Operating system (A subsystem of the larger system)
    • Hardware
    • Cooling (Could be a shared subsystem)
    • Physical housing (Could be a shared subsystem)
    • Electrical supply (A shared subsystem)
    • Networking (A shared subsystem)
  • The peripherals, such as printers used to share information.
  • Their:
    • Hardware
    • Consumable supplies
    • Cooling (Could be a shared subsystem)
    • Physical housing (Could be a shared subsystem)
    • Electrical supply (A shared subsystem)
    • Networking (A shared subsystem)
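
One way to put a list like this to work is to write it down as data and ask which subsystems a group of failing devices have in common. The following is only a minimal sketch: the device names, the subsystem names, and the `shared_suspects` function are all made up for illustration and would need to be replaced with your own inventory.

```python
# Hypothetical inventory: each device maps to the subsystems it depends on.
# None of these names come from a real site; substitute your own.
DEPENDS_ON = {
    "server": {"server OS", "server hardware", "cooling", "power", "network"},
    "workstation-1": {"workstation-1 OS", "workstation-1 hardware", "power", "network"},
    "workstation-2": {"workstation-2 OS", "workstation-2 hardware", "power", "network"},
    "printer": {"printer hardware", "consumables", "power", "network"},
}


def shared_suspects(affected):
    """Return the subsystems that every affected device depends on."""
    dependency_sets = [DEPENDS_ON[name] for name in affected]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# If a workstation and the printer both misbehave, the shared subsystems
# are the first places to look.
print(sorted(shared_suspects(["workstation-1", "printer"])))
```

A shared subsystem showing up here is not proof that it is the cause, only a hint about where to start looking.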

How do you start finding the problem in a system that is spread over many pieces of equipment and possibly geographically separate areas?

Your first task is to understand what all the parts are and what they do. In the case of the bicycle it seems easy: you have to know that the tire must be inflated to roll, and that the chain must be on the gears so that pedaling moves the bicycle.

A large software system can be a bit more difficult. A large failure, such as a power outage, is easy to find. A small failure, localised to one workstation, is also easy. The difficult ones are failures that affect a random number of people, or have no easily discernible pattern. It is these failures that require you to understand how your system works and what each part does.

When I begin working on these I try to gather as much information as I can about the problem.


When did it occur?

This helps to see if the problem is related to some kind of scheduled event.


What was going on just before?

Did an update just occur? A reboot or power surge? Was there a previous repair?


What exactly is the problem?

This can be the hardest information to gather. People often can only express the problem as “It didn’t work”. You may have to ask them to show you what they did, and what should have happened, before you understand what the problem is.


Can I recreate it?

This is another difficult thing to do, but an important one. You need to repeat the problem so that you can:

  1. Determine the cause.
  2. Ensure you have found and repaired it.

While gathering this information it is important not to blame or accuse the person you are asking. They may at first see you as the cyber cop who will punish them for doing something to break the system. To get to the source of the problem, they need to know that you are not blaming them for any part of it, that any keystroke errors were accidental (which they likely were), and that this was in no way their fault.

I hear you saying, “But there are many times it is their fault. They did not follow directions or procedures. They spilled a drink, or knocked something over.”

Fault is not important; intent is what matters, and getting the problem solved comes first. If they had an accident, then once everything is working again, work with the person involved to find ways to keep the accident from happening again. If there was in fact intent to do harm, such as a hammer to the screen or a computer that flies out the window, then follow your procedures for reporting it and let the natural consequences of the action run their course.

OK, you’re done gathering the symptoms of the problem. Now we take those symptoms, look for trends in what went wrong, and compare them to the parts of the system.

We use the divide and conquer method. I will use an exceptionally difficult problem to illustrate.

The problem was that the company’s database was being corrupted while just one person was working on it. Each time the software company repaired the file, they found just one changed character in the database.

I ran some hardware diagnostic tests to see if they would provide any clarity; they did not. Because I had recently changed to new network cards in the environment, I decided to test the data movement on all the workstations. I did this rather than risk damaging the database again.

The test comprised moving a very large file from the local hard drive to the network hard drive. Then I would compare it with a reference copy of the file already on the server.

Then I moved the file back to the local hard drive and compared it with the original file. I ran this same test on each workstation to locate where the failure occurred. I found that there was one failed character in a 100 MB file on one workstation. I know, that sounds small, but one wrong character each time the file is written adds up quickly when adding information to or indexing a database file.
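
For readers who want to try a similar test, here is a minimal sketch of the round trip in Python. It is not the tool I used at the time. The function names and the paths you pass in are placeholders, it copies the file rather than moving it so the original stays in place, and it assumes the network drive is reachable as an ordinary folder.

```python
import hashlib
import shutil
from pathlib import Path

CHUNK = 1024 * 1024  # read files 1 MiB at a time


def sha256_of(path):
    """Hash a file in chunks so that large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a, chunk_b = a.read(CHUNK), b.read(CHUNK)
            if chunk_a != chunk_b:
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None
            offset += len(chunk_a)


def round_trip_test(local_file, network_dir, reference_file, scratch_dir):
    """Copy out to the network, compare, copy back, compare again."""
    local_file = Path(local_file)
    network_copy = Path(network_dir) / local_file.name
    returned_copy = Path(scratch_dir) / local_file.name

    shutil.copyfile(local_file, network_copy)
    outbound_ok = sha256_of(network_copy) == sha256_of(reference_file)

    shutil.copyfile(network_copy, returned_copy)
    inbound_ok = sha256_of(returned_copy) == sha256_of(local_file)

    return outbound_ok, inbound_ok
```

If either result comes back `False`, `first_difference` can show where the two copies diverge. Keep in mind that the operating system may serve a freshly written file from its cache, so a clean result on one run is not a guarantee; repeating the test several times is still a good idea.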

The problem was in one of the new network cards. Because the same cards had been installed in all the workstations, rather than wait for another one to fail we replaced them all with a different brand. I reran the test several times, and all was well.


Let’s summarize this:

  • The initial symptom is a damaged database file. Initially the software company feels it is their problem and they fix it.
  • The problem recurs, and now we have to look for another part of the system that may be causing the issue.
  • Now we run diagnostics to locate which workstation or server in the system is the source of the issue.
  • Once we have a repeatable problem we can repair the issue and test it to ensure it is fixed.

This is a brief look at what can be a complex topic. I hope it helps to solve some of the problems you may encounter.

About jcoffey
