I had related in Another Raid Failure that an initial attempt to rebuild the raid had failed, and left the story with fresh drives on order and winging their way toward us. The drives did arrive, but unfortunately the rebuild still failed. A Knowledge Base article suggested that there might be unreported problems on other elements of the array, with the suggested solution of blowing it all away and recreating from scratch. As we still had some suspicions about the backups at that point, I didn’t want to do that until we had unquestionably good recovery capability.
As they were running a seven-year-old copy of Microlite Edge, I downloaded the current version and installed it. I took the machine out of normal multiuser mode (actually to “init 4” rather than true single user, so that I would still have multiple logins) both to keep users off and to have the machine quiet while backing up: I felt I didn’t need any extra stress on the CPU or the drives. A “mountall” remounted all the drives, and I then ran a backup. This would take a little more than five hours, so that pretty much killed another day. Unpleasantly, the customer had now been “down” most of the week, but we really needed to be conservative at this point.
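For the record, the quiet-system setup was nothing exotic. Roughly this (with the backup itself then kicked off from Edge’s menu interface, or however you normally drive it):

    init 4        # out of normal multiuser, but multiple logins still available
    mountall      # remount the data filesystems
    edgemenu      # then run the backup from Edge's menus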
After that backup completed successfully, I felt a little more confident about proceeding. To cover myself further, though, I also did more backups to a Windows SMB share that we mounted through VisionFS. Unfortunately, that’s limited to a 2GB file size, which required breaking the backup up into sections. I scripted that, and left it running overnight.
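That script is long gone, but the idea was simply to pipe each filesystem through split so that no single file on the share went over 2GB. A minimal sketch, assuming the VisionFS-mounted share sits at /mnt/winback and using a couple of filesystem names purely for illustration (if an older split won’t take the “m” suffix, spell the size out in k blocks instead):

    #!/bin/sh
    # Stream a tar of each filesystem and cut it into pieces that stay
    # under the 2GB SMB limit.
    # Reassemble later with:  cat $SHARE/u.tar.* | tar xf -
    SHARE=/mnt/winback
    tar cf - /u   | split -b 2000m - $SHARE/u.tar.
    tar cf - /app | split -b 2000m - $SHARE/app.tar.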
The next morning was Saturday. We had an early snowstorm Friday night, and I woke up to four inches of heavy, sticky snow. I dug out enough to get my car out and headed for the job, leaving my wife with instructions to try to hire someone to do the rest so that she could get out later. After a tough drive, I arrived early, along with the Windows person whose company had originally provided the hardware. Ordinarily I’d just do all this myself, but as they had sold it to start with, the customer wanted them to do the hardware piece. OK with me, I don’t turn down help. My first task was to make new Edge recovery media. We had old boot/filesystem disks, but as these were seven years old, I wanted a fresh set. Unfortunately, I ran into a problem: while the Boot and Filesystem diskettes were created without trouble, it kept failing with a “can’t run re2” error on the final “Misc.” disk. I searched the Web and Microlite’s site, but couldn’t find any solution. Feeling a little frustrated, I tested the older disk set, found that it could read both the tape and the drive, and so we decided to plow forward.
(This later turned out to be failing when it tried to resolve the host name – setting “Enable Network” to off fixed it. Debugging is by “re2 -debug”. It’s still a bug – it shouldn’t have crashed – but at least we can now make diskettes. Microlite support said “Make sure the file /usr/lib/edge/bin/edge.fqhn has the right system name.”)
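In other words, if you hit the same “can’t run re2” failure, the checks come down to something like this (I believe re2 lives in the same Edge bin directory as edge.fqhn; adjust the path if yours differs):

    re2 -debug                          # verbose output shows where it dies
    cat /usr/lib/edge/bin/edge.fqhn     # should hold the system's real name
    hostname                            # compare against what edge.fqhn says
    # or sidestep name resolution entirely: set "Enable Network" to off
    # before making the recovery media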
We pulled the array box, removed all the old drives, and installed three “new” 18GB drives to replace the six 9GB drives of the original array. However, when we put it back and reconnected it, the array controller saw no drives at all. Double check connections, nothing wrong. Pull the raid cage from the rack again, look inside more closely. Duh! There’s the problem: these new drives were LVD, but the terminator at the end of the cable was not. I hadn’t noticed it before because it was tucked away out of sight. I dug an LVD/SE terminator out of my trunk, and we put the cage back with that now at the end of the cable. The array controller now saw... one drive. At ID 2. No sign of ID 0 or ID 1. Kind of difficult to build an array with one drive. We now wondered if perhaps the raid controller itself was the problem, but we had a spare, so we pulled out the computer, replaced the controller, and after putting it all back in the rack once more, we got... nothing. No drives seen at all again.
I was getting suspicious of the external cable that connects the boxes, and both of us were tired of pulling heavy things out of the rack. We decided to pull both the box and the raid and set them up side by side on a nearby table with all covers off so that we could see what was going on. That took a few more minutes of scrounging for a monitor, finding a free table near a power supply, and getting everything hooked up. The day was slipping away.
Open boxes
With everything out in the open now, we could see the diagnostic lights on the new controller. Unfortunately, the pattern we saw wasn’t listed in the book. If we ignored one of the lights, it would seem to be saying we had a termination problem. I looked more closely at that idea and quickly realized that we probably did: the tape drive was attached to the internal connector of the card, but on this controller that isn’t a separate bus. So, at this point the controller was in the middle of the SCSI chain: the LVD-terminated drive cage on one side, and the tape on the other. That would mean the controller itself would need to be set either to “No termination” or “High only”, depending upon how the tape drive cable was terminated. Obviously the old controller had to have been set right, but we had changed that, and the new one defaulted to “On”. That was easily fixed in its BIOS (accessed with “Ctrl-D” on this card). But even after doing that, we had no joy.
I was still suspicious of termination at this point. I don’t trust Windows people with SCSI: very few of them understand it at all. I wanted to look at the termination of the tape. The Windows consultant said something to the effect of “it’s not needed, we use these all the time right out of the box”. I explained that on older systems, with a short cable, and luck on your side, you might be able to get away with this, but that it needed termination, and upon checking, sure enough, there was none. No jumper setting to provide it, either, and my trunk let me down: I was out of stock on cable terminators. I wonder if this might have affected the array rebuild and caused our failure – that would tick me off a bit if it did. Well, too late now.
OK, deep breath. Let’s get the raid working first, and then we’ll worry about the tape drive. I disconnected the tape, and we reset the controller back to termination “On”. Reboot, and now we were seeing ID 0 only. Not good. We disconnected the other two drives in the raid cage, and hooked up only ID 2. Rebooted, and the card saw that. Hooked up ID 1 alone. Reboot, and the card saw nothing. OK, that must be a bum drive. We had ordered four 18GB’s, intending one for a hot spare, so that wasn’t too unsettling. Pull ID 1, replace it, yes, we can see it. Hook up all three drives again, but now we were back to seeing only ID 0. Frustrating.
I was really feeling that there had to be something wrong with either the external or the internal cables in this box. As a test, we ran a cable from the internal connector (where the tape drive had been hooked up) to the three drives. Wham! All three drives reported for duty. One small problem: the cable I chose was thick and was interfering with the operation of the fans in the raid cage. That’s OK, it’s just a test. Back to the trunk to look for more cables.
I carry a lot of stuff in my trunk. I’ve got cables of every kind, adapters, converters, test rigs: a lot of stuff. However, over the years, there has been attrition. Five years ago I would have had a dozen terminators of various kinds, but as we’ve already seen, I was down to one. The reason I hadn’t replaced stock as I used it is simply that I’ve had less and less use for my stockpile: drives have become more and more reliable, systems ship with terminators built into cables, I just don’t have as much need for this stuff. So I had let it dwindle down. And now, as I dug around trying to find something that could bring me from the external connector of the raid card to the internal drives of the raid cage, I was coming up empty.
I did manage to come up with a thinner cable that could replace the too-thick one we had tested with, and we decided to at least get the array built with that. So I went into the raid configuration tool (“dptmgr” from a DOS floppy). To my surprise, that tool saw two more drives: IDs 4 and 5. What the heck? How could this be? Does the controller store old configuration data in NVRAM? I didn’t think it did, but there it was. The only way to get rid of the phantom drives was to tell it to do a mirror array, let that start building, and then come back in to tell it to do RAID 5. It was at that point that we saw what the problem really was: somehow, we had replaced the bad drive with a drive from the old array! THAT was where the controller had gotten the idea that there were more drives to work with, and of course that drive was useless to us because it would have limited our array to 18GB. Pull that out, locate the correct 18GB drive, put it back in... no, now we still don’t see three drives. What now?
Well, stupid thing. The fourth drive is a different brand than the other three. It looks very similar, but its ID jumpers aren’t the same: on the other three drives, the jumper for ID 1 is on the left of the block, while on this one it’s on the right. Fixed that, and we finally saw three drives and started the RAID 5 build.
While it built, we ate the lunch the customer brought in for us and thought about what to do next. As I saw it, we had two basic problems: the unterminated tape drive, and the lack of a way to reconnect the cage to the raid card on its external connector. Yes, this stuff is all available somewhere, but we needed it now. The customer really needed to be back up and running on Monday. We could try calling around, but I know how that goes: most of the local stores won’t know what we want even if they do stock SCSI cables and terminators, so we’d have to go look, and we were running out of time. My Windows companion had a happy thought: why not move the tape drive out of the computer and install it in the raid cage? We had plenty of room – six drives had come out and only three were going back in. That would solve the termination issue, though we now had the problem that the cage and the box couldn’t go back in the rack configured this way – they needed to be side by side for the cable to reach. We went back and looked at the rack. Beside it was another rack which was presently used mostly for the KVM and the keyboard and monitor. Hmmm... move the keyboard and monitor to a table beside both racks, steal a shelf from there... there we go, room to put these things back side by side!
We all looked at each other. Are we really going to put these machines back like that? Ayup, we really are. So, as the array finished its rebuild, we reconfigured the racks to accept it. Forty-five minutes later, everything was installed and running. I had one real concern: three of the six internal fans in the raid cage had inexplicably stopped working. I wasn’t worried too much about cooling: this was only three drives and a tape, and the cage was sitting open because of our cable needs. My concern was that we didn’t know why they stopped. Were they thermostatically controlled, or had they failed? Or was the power supply failing? If the latter, we could lose the array again.
A minor problem was that we couldn’t physically secure the tape drive: it sat loose in the cage, which meant that you needed to hold it in place to insert a tape. That wouldn’t be good for the woman who normally changes tapes, but we’d have to deal with that later. For now, we needed to get the OS restore going. I booted with the Edge recovery disks and began that restore.
It was now getting late. The restore would take approximately five hours, so I arranged to meet the customer on Sunday morning. I felt we could let the Windows guy go as we were done with hardware issues.
I got home after dark, and had to finish shoveling my driveway. My wife hadn’t found anyone (the kids in our neighborhood have too much money) so had cleared just enough for herself. I didn’t want her even doing that – she has back problems – but she had needed to get out, so she did what she had to do. I then did what I had to do: finish the shoveling, get some food, and then some sleep.
Up and running – almost
Sunday morning we returned. The restore seemed to have finished, and the machine rebooted fine. Things were looking good. We knew we’d be unable to run the app because of licensing issues tied to disk information, but the customer has 24×7 support, so we quickly got hold of the app folks and had them VPN in to set a temporary license. Unfortunately, they called back with a problem: missing files.
Huh? The restore had finished, the machine was up and running, how could data be missing? Well, it was. I felt it had to be because the backup was made with a new version of Edge but the restore was done with the older recovery disks. It didn’t seem to be a major problem: everything else was OK; the restore apparently just hadn’t gotten to the end of the tape. I started up “edgemenu” and asked for a non-destructive restore to bring back the missing data. Unfortunately, it took more than an hour to get to the point where it started finding files that needed to be restored, and another two hours to bring them back. We got the app support people back in at that point, and now they found permission problems on the restored files. Again, I felt confused. Why would there be permission problems? Edge certainly would have put back just what it had saved: we were restoring with a current version from a tape made by that version. There shouldn’t have been a problem! But there was. Easy enough to fix, of course, and confined only to /app, but then more oddities turned up – a missing symlink for the application start program. I was beginning to suspect that we’d had corruption prior to or during my backup. However, after recreating the symlink and putting in the temporary licensing, the app did come up. The customer ran several reports, all data seemed to check out and verify, and things were finally getting close to normal. On Monday they could deal with permanent relicensing, and perhaps do something better about securing the tape drive and finding out why the fans had stopped. For now, though, we were done and I went home.
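The cleanup itself was simple once we knew what was wrong. Something along these lines, where the owner, group, link location, and program names are stand-ins, not the application’s real ones:

    # put ownership back (names are stand-ins for the app's real owner/group)
    chown -R appowner /app
    chgrp -R appgroup /app
    # make files group-readable and directories traversable again
    # (X affects directories and files that are already executable)
    chmod -R u+rwX,g+rX /app
    # recreate the missing symlink to the application's start program
    ln -s /app/bin/startapp /usr/bin/startapp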
This is a situation that never should have gotten this bad. By delaying the upgrade beyond a safe time, we got into a mess of old hardware and old software. Because the software was old (SCO 5.0.4), we couldn’t just transfer it to new equipment. Because the equipment was old, we ran into difficulties there. It was all stacked against a quick recovery. The customer was down for a week, it cost them thousands of dollars in direct labor, and who knows how much in lost employee time. It didn’t need to happen. The person in charge of the department had spent two years recommending, even begging, for the needed upgrades. Management didn’t want to spend the money because “it’s working fine”.
Don’t let this happen to you.
*Originally published at APLawrence.com
A.P. Lawrence provides SCO Unix and Linux consulting services http://www.pcunix.com