Thursday, September 19, 2024

Troubleshooting Mistakes

The very first part of troubleshooting is identifying the problem. That’s not always easy even for skilled professionals.

It’s definitely not easy for the typical computer user, so when you get the call (we’re assuming that you are the professional who gets called with problems), what you are told may not match reality. This isn’t to imply that users are stupid, or ignorant, or careless (though some are all of those things), but simply that they may misinterpret symptoms and miss seeing the real problem.

Professionals do the same thing. In my career I’ve had more than one telephone call where someone described themselves as a competent Windows administrator but apologized for not “knowing Unix”. Sometimes we end up having an easy conversation where the problem really is simply that they need a little (sometimes a very little) Unix guidance to help them fix their issue. Sometimes it’s a little more involved: they’ve hit a tough nut, and they’d really need years of experience to have any hope of fixing it.

Sometimes it’s not like that at all. More than once the immediate problem was a dead, non-booting machine. I don’t mean that Linux or Unix was trying to load and failing along the way, I mean that you could push the power button and the lights would come on and that was it. Nothing more. No BIOS display, no disk spin up, no beeps, nothing. Just dead. And yet here we have a supposedly competent Windows support person asking me what to do. What’s that have to do with me? It’s not a Unix issue – we haven’t got that far yet. It might become a Unix issue: if the hard drive has been damaged by whatever caused the stubborn nothingness being seen now, we might need a data recovery firm with knowledge of Unix/Linux file systems. Even if it’s just a missing boot sector, repairing that certainly requires OS specific knowledge. But right now? This is a low level hardware issue. Maybe a failed motherboard, power supply or missing/unplugged cables. Whatever is going on, right now it has nothing to do with Linux or Unix.

If you are dealing with a non-computer-savvy user, remember that they may not understand things that seem obvious to you. For example, the user may understand that the hard drive stores his operating system and files, but may not realize that the initial BIOS information that flashes by at boot doesn’t come from there. So while you would think very differently about a machine that displays BIOS information but does not continue versus a machine that displays no BIOS data at all, the user of that machine might not. You need to interpret problem reports with an eye toward the reporter’s knowledge.

But you know that. You also know that if it is your suspicion that somebody did something they shouldn’t have, the user may not be willing to admit to it. You are going to take everything with a big helping of salt, and decide for yourself what the problem is. After all, you are the professional.

OK. But professionals also misinterpret things. You probably know this too: what you think you know can hurt you more than what you don’t know. Do I assume too much? Maybe so: I know I make mistakes like that, and I’ve sure seen other people do the same thing, but you could be different. If so, you can either skip the rest of this post or read it with relish while you savor your superiority.

I can remember the first bad troubleshooting mistake I made. It cost me a good customer – not because they were angry with me, but because I had them switch to hardware and software I did not support. I advised them to switch because I thought the OS and hardware they were running on had reached their performance limits. They were running a product called Glovia on a SCO Unix 80386 box. There were only about twenty users, but the background code had been getting more and more complicated over the years, and the system was slowing down badly. I tried increasing swap, adding more memory, and everything else that I could think of, but it kept getting worse. As their Glovia programmer was constantly adding new features, I assumed that these new routines were simply overtaxing the system; heck, I could see it in the sar reports: both the cpu and disk i/o were under excessive load. Basically, I just gave up and agreed with the advice they had received from Glovia and their programmer: upgrade to a big HP/UX system. They did; performance returned to acceptable levels, but because I didn’t know much about HP/UX, another consultant took over my position. I felt good about it overall: I had done the right thing, and I had more clients than I needed anyway. All parties pleased, time to move on.

But I was very, very wrong. I assumed the increasing load was from the heavy new tasks being added weekly, so I just didn’t look far enough. I had done some “ps” runs, but had missed seeing something very important. The clutter of Glovia processes blinded me: I didn’t see the big lumbering elephant in the crowd of dancing lambs. What I missed was an MMDF process called “deliver”. The reason I probably missed it was because I was looking for processes that were gaining time right now: I’d take two “ps” snapshots and “diff” them (this procedure is covered in more detail in a later chapter). The processes that popped out had used cpu time between the snapshots. If I had been lucky, “deliver” would have been in that list, but my timing was unfortunate: although “deliver” was using a lot of cpu, it didn’t happen to be sucking any at the times I happened to look.
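For readers who want the mechanics, here is a minimal sketch of that snapshot-and-diff approach. I’m showing modern Linux ps options rather than the SCO syntax I was using back then, so treat the exact flags as illustrative and adjust for your own system:

    # Grab per-process CPU time twice, a minute apart, then diff the
    # snapshots: processes that gained CPU time in between will stand out,
    # along with any processes that started or exited.
    ps -e -o pid,time,comm | sort > /tmp/ps.1
    sleep 60
    ps -e -o pid,time,comm | sort > /tmp/ps.2
    diff /tmp/ps.1 /tmp/ps.2

The weakness is exactly what bit me: a process that happens to sit idle during both snapshots shows no change at all, no matter how much time it has already piled up, and that is just what “deliver” was doing whenever I looked.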

I know this because I accidentally saved some printouts from that system. For some reason I had tucked them in my briefcase and forgot all about them. When I found them a few years later while searching for something else, I happened to take a quick glance, recognized where they were from, and immediately had an awful feeling in the pit of my stomach. I got that feeling because I noticed “deliver” and saw that it had a lot of accumulated time. In the intervening years, I had seen that at other SCO Unix jobs, and I knew what it meant. It meant that there were thousands, perhaps tens of thousands, of mail messages backed up on the system. That “deliver” process was trying desperately to run through them to see if they could now be delivered. It would do a lot of disk i/o and consume a lot of cpu in the process, and then it would go away until it was scheduled to run again. Eventually there were so many messages that it was almost always running – except when I ran my snapshots, of course. Just my luck, I guess – or more likely I just didn’t run enough of them because I saw all those Glovia processes and “knew” they had to be the problem.

Why didn’t the customer notice backed up email? Because it was root’s messages that were being delayed (due to a lock file on root’s mail folder) and nobody cared about root’s mail. Users’ mail was of course getting slower and slower, but I took that as a symptom rather than a pointer toward the cause. My loss: I saw what I expected to see, I didn’t see anything else, and I gave away a good account years before I had to.
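Looking back, what would probably have caught it was stepping away from “what is busy right now” and asking two different questions: which processes have accumulated the most total CPU time over their lives, and how deep is the mail queue? A rough sketch, again using Linux-style ps, and with the MMDF spool path as a guess (layouts varied from system to system):

    # Which processes have accumulated the most lifetime CPU, whether or
    # not they are busy at this instant? The text sort is crude, but it is
    # usually good enough to make a hog stand out near the top.
    ps -e -o time,pid,comm | sort -r | head -20

    # Roughly how many messages are sitting in the mail spool?
    # /usr/spool/mmdf is only a guess at the MMDF queue area; check yours.
    find /usr/spool/mmdf -type f | wc -l

Either check would likely have put “deliver” and its mountain of undelivered mail right in front of me.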

*Originally published at APLawrence.com

A.P. Lawrence provides SCO Unix and Linux consulting services http://www.pcunix.com
