Mysterious Lockups

December 4, 2006


Of all computer problems, the unresponsive hang is the most annoying and most difficult to trace. There’s no crash, no panic: everything just stops dead.

The keyboard is useless, telnets just time out – you have no choice but to power cycle the machine.

Well, maybe. If you are running Linux and have Magic SysRq enabled, you might be able to do more: its key combinations are handled by the kernel itself, so they can sometimes still sync disks, dump task state or force a reboot when nothing else responds. Even SCO has something similar: scodb gives access to a kernel debugger, if it is available. I don’t know of anything like that for Mac OS X; there is ddb, but that requires attaching a serial terminal and a recompiled kernel.
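Before you need it, it is worth checking whether Magic SysRq is actually enabled. Here is a minimal Python sketch, assuming a Linux box that exposes the setting at /proc/sys/kernel/sysrq. The bit meanings are taken from the kernel's sysrq documentation; the function names are my own, chosen for illustration.

```python
# Sketch only: decode the Linux Magic SysRq setting.
# Assumes /proc/sys/kernel/sysrq exists (Linux only).
SYSRQ_PATH = "/proc/sys/kernel/sysrq"

# Bitmask values as listed in the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all real-time tasks",
}


def describe_sysrq(value):
    """Return a list of human-readable descriptions for a sysrq setting."""
    if value == 0:
        return ["disabled"]
    if value == 1:
        return ["all functions enabled"]
    # Any other value is a bitmask of individually allowed functions.
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path=SYSRQ_PATH):
    """Read the current setting from /proc (raises OSError if unavailable)."""
    with open(path) as f:
        return int(f.read().strip())
```

On a Linux machine, `describe_sysrq(read_sysrq())` tells you what the keyboard combinations will be allowed to do once the box has hung. Finding out that the answer is “disabled” after the hang is too late.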

But let’s say none of that is helpful. In that case, the first thing you want to know is “how dead is it?” Is the keyboard totally dead – if it has a Caps Lock light, does it toggle as you press the key? If not, the kernel may have stopped servicing the keyboard, or you may have a motherboard or keyboard problem. Can you “Ctrl-Alt-F3” (Linux and SCO) to switch screens? If so, the OS is still at least partially alive. Can you telnet or ssh to the box? Can you ping it? Do Samba, NFS and other network services still work? These give you clues as to the state of the networking stack.
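The network half of that checklist is easy to run from another machine. What follows is a rough Python sketch, not a definitive tool. The port numbers are the conventional defaults (an assumption about your setup), the ping flags are the Linux ones, and the function names are hypothetical.

```python
import socket
import subprocess

# Conventional default ports; adjust for the services the box actually runs.
SERVICES = {"ssh": 22, "telnet": 23, "smb": 445, "nfs": 2049}


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ping(host, timeout=3):
    """Return True if one echo request is answered (system ping, Linux flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def how_dead(host, ping_fn=ping, tcp_fn=tcp_open):
    """Probe host and return (per-check results, one-line summary)."""
    results = {"ping": ping_fn(host)}
    for name, port in SERVICES.items():
        results[name] = tcp_fn(host, port)
    services_up = [name for name in SERVICES if results[name]]
    if services_up:
        summary = "partially alive: " + ", ".join(services_up) + " still accept connections"
    elif results["ping"]:
        summary = "answers ping but no service accepts connections"
    else:
        summary = "no response at all from the network"
    return results, summary
```

Call it as `how_dead("hostname")` from a working machine. Keep in mind that a successful connect only shows that the kernel's TCP stack is still completing handshakes; the daemon behind the port may be hung all the same. A box that answers ping and accepts connections but never sends a login banner is telling you that the kernel is alive and userland is not.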

OK, you’ve given up. There’s nothing that can be done but a power cycle. Here’s another chance to possibly learn something: does a reset exhibit different behaviour than a complete power off? If it takes a power off to get the machine responsive again, how long does it have to be off? If a short rest is enough, that might indicate capacitor or register problems: giving the machine a little time to “bleed off” cures the problem. A need for a longer period off might mean heat problems – are fans malfunctioning, or are the insides coated with insulating dust?

No? OK, then maybe something in software is doing this. A “tail” of the system logs may give a clue as to what was happening just before the hang, as may tools like sar. If you aren’t sure logging stays running right up to the hang, enable syslog’s MARK option, which writes a periodic timestamp line. A build-up of unusual system activity prior to the hang might point to its cause. If the hangs repeat, leaving a periodic “ps” running that logs to a file can help you zero in on the cause the next time it happens.
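A periodic “ps” logger can be as simple as the Python sketch below. It assumes a Linux procps-style ps that accepts the -eo field list shown (other systems will need different options). The function names and field list are my own choices, not a standard tool.

```python
import os
import subprocess
import time

# Assumed Linux procps field list; change it to suit your ps.
PS_CMD = ("ps", "-eo", "pid,ppid,stat,pcpu,pmem,etime,args")


def snapshot(cmd=PS_CMD):
    """Run cmd and return its output as text (empty string on failure)."""
    try:
        done = subprocess.run(list(cmd), capture_output=True, text=True, timeout=30)
        return done.stdout
    except (OSError, subprocess.SubprocessError):
        return ""


def log_snapshots(path, interval=60, count=None, cmd=PS_CMD):
    """Append timestamped snapshots to path; count=None means run forever."""
    taken = 0
    with open(path, "a") as log:
        while count is None or taken < count:
            log.write(f"=== {time.strftime('%Y-%m-%d %H:%M:%S')} ===\n")
            log.write(snapshot(cmd))
            # Force the data to disk: a hang loses anything still buffered.
            log.flush()
            os.fsync(log.fileno())
            taken += 1
            if count is None or taken < count:
                time.sleep(interval)
    return taken
```

Something like `log_snapshots("/var/log/ps-watch.log", interval=60)` (the path is just an example) leaves a trail you can read after the next power cycle. The flush and fsync after every snapshot are what make this useful, because the last few snapshots before the hang are the ones you want. If possible, write the log to a different disk or over the network, in case the disk subsystem is the thing that is hanging.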

After all that, though, these things are almost (*almost*) always hardware, and more often than not the cause is power related: bad power supplies are the most common cause I’ve seen. After that come disk controllers and then motherboards, but nowadays I don’t feel it’s worth spending a lot of time chasing this sort of thing: move the system to new hardware as quickly as possible. If you then want to spend time investigating possibilities on the old hardware, at least you won’t be interfering with normal business. However, given the cost of hardware vs. the cost of labor, even that may not make sense: accept that the whole thing was mysterious, do whatever you need to do to protect any confidential data, and move on. Maybe some parts can be recycled, or maybe the machine can move down to a less important use, but the cost of messing around with it in its original role just doesn’t make sense.
