Friday, September 20, 2024

Fixing 404 errors

A 404 error is what you get when your browser tries to access a page that doesn’t exist. Maybe you mistyped something, or the link you followed was mistyped by someone else, or maybe the webmaster moved it or renamed it or just deleted it. It’s annoying for you, and sites that care about your visit try to avoid it happening.

Well, we can’t stop 404s 100%, and frankly dealing with them is an annoyance for those of us maintaining the website too. It’s bad enough that other sites cause us problems with incorrect links, but it is really annoying when we cause our own problems.

Unfortunately, tracking these things down and fixing them is a bit of a pain. This site’s “Custom 404” page and its associated script correct a lot of common errors automatically, and try to offer help when they can’t just redirect you to the right page, but I need to keep updating the script as I find new sources of errors. Sometimes the fix is as simple as making a symbolic link, but if the bad link comes from an outside source, I want to correct it if I can. Even if it was caused by my own error, I may still want to add correction code in case that original error gets picked up by someone else.
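To make the idea of “correction code” concrete, here is a minimal sketch of a lookup table that maps known-bad paths to the pages they should have been. It is not the actual Custom 404 script; the paths and the correct_path name are made up for illustration, and the real thing handles more cases than an exact match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical correction table: bad path => page it should have been.
my %fix = (
    '/old/page.html'  => '/new/page.html',
    '/Unix/index.htm' => '/Unix/index.html',
);

# Return the corrected path for a failed request, or undef if none is known.
sub correct_path {
    my ($path) = @_;
    $path =~ s/^\s+|\s+$//g;    # trim stray whitespace
    return $fix{$path};
}

# Usage: redirect when a correction exists, otherwise fall back to help.
for my $req ('/old/page.html', '/no/such/page.html') {
    my $target = correct_path($req);
    print defined $target
        ? "$req -> $target\n"
        : "$req: no correction known, show the help page\n";
}
```

A plain hash keeps each new fix down to one added line, which matters when the table keeps growing.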

So, to help me find errors, I have a Perl script that reads the error_log and compares it to a log of “corrections” already made by the Custom 404 script (this is necessary because the 404 ends up in my logs even though it was corrected). The script ignores pages that have already been corrected and spits out a list of 404s I need to at least investigate. Many of these will be confused web spiders – it’s really amazing how dumb some of these things are. For example, /MacOSX/macosxcupstofile.html contains this text:

sudo lpadmin -p tofile -E -v socket://localhost:12000 -m raw

Dumb spiders regularly think that is a link:

[Sun Jul 11 07:07:05 2004] [error] [client 217.107.152.79] File
does not exist:
/usr/local/www/vhosts/vps.pcunix.com/htdocs/MacOSX/socket://localhost:12000/
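Requests like that one could be flagged automatically. The following is only a rough heuristic and an assumption on my part, not something ck404.pl does: it treats any path with a segment starting in something like “scheme:” as text a spider mistook for a link. The looks_like_spider_junk name is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Heuristic (an assumption, not from ck404.pl): a path segment that starts
# with something like "scheme:" was probably scraped from page text.
sub looks_like_spider_junk {
    my ($path) = @_;
    return $path =~ m{/[A-Za-z][A-Za-z0-9+.-]*:} ? 1 : 0;
}

# Usage: label one spider-style path and one ordinary path.
for my $path ('/MacOSX/socket://localhost:12000/', '/blog/b930.html') {
    printf "%-40s %s\n", $path,
        looks_like_spider_junk($path) ? 'probably spider junk' : 'worth a look';
}
```

It will misjudge any real page whose name contains a colon, so it is better for sorting suspects to the bottom of the list than for discarding them.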

I have the script count the number of uncorrected 404 occurrences so that I can devote immediate effort to the more serious problems. The output of the script might look something like this:

/blog/b930.html 2
/SCOFAQ/news:comp.unix.admin 1
/cgi-bin/fmail.pl 1
/Books/creatingcoolwebsites.html 10
/e51/SCOFAQ/FAQ_scotec8xsession.html 1

Obviously I need to jump on that “creatingcoolwebsites.html” problem right away.
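The sample output is in no particular order, so the worst offender has to be picked out by eye. A small variation would sort by count first. This is a sketch under the assumption that the counts are already in a hash; the report_lines name and the %count hash are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return "path count" lines, highest count first; ties sort by path.
sub report_lines {
    my ($count) = @_;    # hash ref: path => number of 404s
    return map { "$_ $count->{$_}" }
           sort { $count->{$b} <=> $count->{$a} or $a cmp $b }
           keys %$count;
}

# The counts from the sample output above.
my %count = (
    '/blog/b930.html'                       => 2,
    '/SCOFAQ/news:comp.unix.admin'          => 1,
    '/cgi-bin/fmail.pl'                     => 1,
    '/Books/creatingcoolwebsites.html'      => 10,
    '/e51/SCOFAQ/FAQ_scotec8xsession.html'  => 1,
);

print "$_\n" for report_lines(\%count);
```

Breaking ties by path keeps the order stable between runs, which makes two reports easier to compare.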

See that “fmail.pl”? That’s a script kiddie trying to break in:

205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl
HTTP/1.0" 404 2317 " http://aplawrence.com/" "-"

Checking his other attempts proves it:

205.158.224.234 - - [12/Jul/2004:12:21:05 +0000] "POST /cgi-bin/formmail.pl
HTTP/1.0" 404 2320 " http://aplawrence.com/" "-"
205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl
HTTP/1.0" 404 2317 " http://aplawrence.com/" "-"

Nothing to worry about there.
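Pulling every request made by one client address out of the access log is easy to script. The sketch below assumes Common Log Format style entries with one request per line (the excerpts above are wrapped for display); the requests_from name is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return "status request" for every access log line from one client address.
# Assumes Common Log Format style lines, one request per line.
sub requests_from {
    my ($ip, @lines) = @_;
    my @hits;
    for my $line (@lines) {
        next unless $line =~ /^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" (\d{3})/;
        push @hits, "$3 $2" if $1 eq $ip;
    }
    return @hits;
}

# The two log entries quoted above, each rejoined onto one line.
my @log = (
    '205.158.224.234 - - [12/Jul/2004:12:21:05 +0000] "POST /cgi-bin/formmail.pl HTTP/1.0" 404 2320 " http://aplawrence.com/" "-"',
    '205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl HTTP/1.0" 404 2317 " http://aplawrence.com/" "-"',
);

print "$_\n" for requests_from('205.158.224.234', @log);
```

Against a real log, the lines would come from reading the file instead of the @log array. Printing the status code first makes it obvious at a glance whether any of the probes got something other than a 404.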

The actual script is pretty simple:

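#!/usr/bin/perl
# ck404.pl: list uncorrected 404s from the error_log, with counts
use strict;
use warnings;

open(LOG, "<", "www/logs/error_log") or die "error_log: $!";
open(C, "<", "www/data/corrections") or die "corrections: $!";
my %corrected = ();    # paths the Custom 404 script has already corrected
my %count = ();        # uncorrected path => number of 404s

# Keep only the text before "->" on each corrections line.
while (<C>) {
    chomp;
    s/->.*//;
    s/^ *//;
    s/ *$//;
    $corrected{$_} = $_;
}
close C;

# Reduce each error_log line to the requested path, then count it.
while (<LOG>) {
    chomp;
    s/.*htdocs//;
    s/.*cgi-bin/\/cgi-bin/;
    s/^ *//;
    s/ *$//;
    next if $corrected{$_};
    $count{$_}++;
}
close LOG;

foreach (keys %count) {
    print "$_ $count{$_}\n";
}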

This does generate some extra garbage now and then; it doesn’t need to be perfect – it’s just a helper script that saves me time.

Well, I’ve got a few hundred 404s I need to go look at. Most of them will probably be spider errors, or things I can easily fix, but inevitably there will be some new 404 mixup to deal with, and the Custom 404 code will grow some more.

*Originally published at APLawrence.com

A.P. Lawrence provides SCO Unix and Linux consulting services http://www.pcunix.com
