Sunday, December 15, 2024

Kernel Link Failures

That’s a pretty awful feeling, isn’t it? You’ve got to link a new kernel because you need to change a value or add something, and it fails.

The near gibberish it outputs looks completely unhelpful and you haven’t a clue where to start. Well, this article hopes to give you some clues.

A cover-your-butt procedure I always follow is to link a kernel BEFORE changing anything. If it fails, you know it was already broken and didn’t break because of something you did. If you are feeling really paranoid, answer “N” to the “Do you want this kernel to boot by default” message, and then do:

-==-

and see if the two files are the same; they certainly should be if you haven’t changed anything yet. If they aren’t, I suggest:

-==-

We’re going to start with an actual case. A local consultant called me because he had tried to increase a kernel variable, but the link failed. The increase was critical to the proper functioning of the system, and he couldn’t fix it.

As it turns out, I could have identified the problem in seconds. Unfortunately, I didn’t realize that at the time (live and learn), but even if I had thought of that method, I would have dismissed it because I was sure the problem was elsewhere. I’ll tell you what I should have done that would have instantly told me what was wrong, but I’ll hold off explaining why until later. Here’s what would have given me the answer I needed:

-==-

Think about that as you read along.

This article doesn’t go into the whole subject of drivers and the link directories very deeply. You might want to read Understanding Device Drivers if you want to understand more.

The first thing I did was this:

-==-

After the script finished belching out its errors, I used CTRL-D to exit “script”, and went to look at /tmp/linkerr. Here it is:

-==-

Pretty awful mess, isn’t it? I was convinced that a driver file in /etc/conf/pack.d must be missing or horribly corrupted. Actually, though, it couldn’t have been a missing driver file: link_unix would have reported that in plain English. A really badly corrupted driver file would also have barfed differently, though the error message wouldn’t be as obvious (I’ll show examples of that later). Could it be that a good driver had been copied incorrectly, for example by somehow copying /etc/conf/pack.d/clone/Driver.o to /etc/conf/pack.d/kbd? No, because that would give us multiply defined symbols, and there’s no mention of that in the output.

How about a Driver.o from a different release, or from a backup prior to the application of patches? Yes, that could cause this kind of error, and that was my first thought. Yet I know the local consultant pretty well, and that doesn’t sound like something he would have done, even accidentally, so I gave up on that idea and decided that some needed driver was just not being linked into the kernel. Now to find it.

I picked a symbol from the list of errors and went looking for it like this:

-==-

Let me say right away: that’s NOT the best way to look for symbols in a .o file, but I got lucky and “str” popped up as a match. I checked /etc/conf/sdevice.d/str, and it was marked N:

-==-

Now that’s pretty odd. It shouldn’t have been: “str” is the Streams driver and is necessary for just about everything on the network. I changed it to “Y” and tried the link again:

-==-

That’s better; a lot fewer errors, but still no success. When you are linking a kernel, even one error is one too many. So I tried my script again, but with clnopen this time:

-==-

This didn’t work, though. It’s not that “clnopen” isn’t somewhere in one of those Driver.o files, it’s that “strings” isn’t good enough to find it. However, I had other weapons: I was dialed in to the customer, but was working from my own machine which happens to be the same OS release. On my machine, I have the Development System installed, and the Development System has “nm”. So on my system I did this:

-==-

Bingo! The “clone” driver has “clnopen”, and sure enough, it too was turned off in /etc/conf/sdevice.d (nobody knows how or why this happened, by the way). I turned it back on, and now the kernel linked successfully.
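The symbol hunt can be wrapped in a small shell function. This is only a sketch, not the command I actually ran: find_definer is a made-up name, it assumes a POSIX shell and an awk that accepts -v, and it assumes your nm prints each symbol as a line ending in a type letter followed by the name, with “U” marking a symbol that is only referenced. If your nm prints a different table, the awk test will need adjusting.

```shell
# find_definer SYMBOL [DIR]
# Print every DIR/*/Driver.o whose nm listing DEFINES the symbol.
# DIR defaults to /etc/conf/pack.d, the driver directory used in this article.
find_definer() {
    sym=$1
    dir=${2:-/etc/conf/pack.d}
    for obj in "$dir"/*/Driver.o; do
        [ -f "$obj" ] || continue
        # Assumed nm layout: "<value> <type> <name>". A type of U means the
        # symbol is used here but defined somewhere else, so skip those lines.
        if nm "$obj" 2>/dev/null |
            awk -v s="$sym" '$NF == s && $(NF-1) != "U" { hit = 1 }
                             END { exit !hit }'
        then
            echo "$obj"
        fi
    done
}
```

Running find_definer clnopen on a machine with the Development System should print the Driver.o files that define the symbol; the directory name in each path is the driver to check in /etc/conf/sdevice.d.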

If I had not had “nm”, I could have done this:

-==-

As I said at the outset, if I had done a diff on the two sdevice files, this would have shown me:

-==-

The reason that works is that link_unix apparently doesn’t replace sdevice until the link is successful (sdevice is built from the individual files in /etc/conf/sdevice.d). That’s very helpful for this kind of error, because it immediately shows you what has changed since the last successful link.
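If plain diff output is too noisy, the same comparison can be narrowed to just the Y/N flag. The function below is a sketch under assumptions: sdevice_changes is a hypothetical name, the first field of each line is taken to be the driver name and the second the Y/N configure flag, and lines starting with “#” or “*” are treated as comments. It takes any two files in that layout, so which two files you hand it is up to you.

```shell
# sdevice_changes OLD NEW
# Print each driver whose Y/N flag in NEW differs from its flag in OLD.
sdevice_changes() {
    awk '
        /^[#*]/ || NF < 2 { next }           # skip comments and blank lines
        FILENAME == ARGV[1] {                # first file: remember each flag
            old[$1] = $2
            next
        }
        # second file: report a driver once if its flag has changed
        ($1 in old) && old[$1] != $2 && !seen[$1]++ {
            print $1 ": " old[$1] " -> " $2
        }
    ' "$1" "$2"
}
```

A driver that has several lines in the file is judged by the last flag seen in OLD and is reported only once, which is good enough for spotting a driver that has been switched off.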

Other Linking Errors

Of course, there are other things that can go wrong. One I see now and then is a new device that has been only partially installed or partially removed: the kernel fails to link because enough of the device is still there to confuse the link. In a case like this, look in /etc/conf/cf.d/mdevice; the offending device will probably be at the end of it. If you are not really sure, you can comment out the line you think is the problem by putting a “#” at the beginning of it; if the kernel then relinks, that was it. For example, here’s the end of my mdevice; the E3H was the last thing I added to this machine:

-==-
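Commenting out a suspect line is easy to do by hand, but a small filter makes it repeatable. This is a sketch only: comment_out_device is a made-up name, it assumes the device name is the first field on its mdevice line, and it writes the edited copy to standard output rather than touching the real file.

```shell
# comment_out_device NAME FILE
# Copy FILE to standard output with a "#" put in front of every line whose
# first field is NAME. FILE itself is left unchanged.
comment_out_device() {
    awk -v dev="$1" '
        $1 == dev { print "#" $0; next }     # the suspect device
        { print }                            # everything else, untouched
    ' "$2"
}
```

Redirect the output to a scratch file, look it over, and copy it into place only after saving the original mdevice somewhere safe.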

Corruption

What about a corrupted driver? The errors you get will depend upon the nature of the corruption, but let’s try some experiments (if you aren’t comfortable and sure of yourself, don’t try this on a working machine):

-==-

When I did this, I got a message saying that the file “Wed” (it happened to be Wednesday) couldn’t be opened for input. Let’s try something else:

-==-

This time I got a message complaining that it couldn’t open “file ELF”. That would be a very definite sign of corruption: Driver files would always be “COFF”.

To put everything back as it was:

-==-

I hope this gives you a little more confidence should you ever run into a broken kernel relink. Certainly other errors are possible, but these are the most common I’ve seen.

Originally appeared at http://www.aplawrence.com (http://www.aplawrence.com/Unixart/linkfail.html)

A.P. Lawrence provides SCO Unix and Linux consulting services http://www.pcunix.com
