Create A .htaccess File Without Referral Spam

November 14, 2006


At present, users and administrators of web servers, and of blogs in particular, face a growing nuisance.

This nuisance takes three forms: comment spam, trackback spam and referrer spam. Various solutions have been proposed, and some of them work against two of these forms at once.

What is Referral Spam?

The referrer request-header field (spelled "Referer" in the HTTP specification) allows the client to specify the address (URI) of the resource from which the request-URI was obtained. In other words, it is a way for an HTTP client to report, in its request headers, the URI of the page that sent it there. This gives a site administrator insight into where the traffic on his web server is coming from. Most popular web server log analyzers also depend on it to produce statistics on the most common referrers.

The HTTP Referer: header is very useful, but it is also completely arbitrary. Any web browser or HTTP client is free to send a forged Referer: header with any request to a web server. Spammers have taken advantage of the fact that HTTP has no provision for authenticating this header, and they use that openness to craft requests that carry their own website in the Referer: header.

Most people find it difficult to understand why someone would bother spamming something that only the site administrator will see in the logs. One probable motivation is to boost search engine ranking. Another is simply to show up in any stats published by the site. If the site being spammed runs web server log analysis software and publishes the results, the spammer's URL readily ends up in the top referrers section.

A serious consequence of referrer spam is that it is often performed via an HTTP "GET" or "POST" request, which retrieves the entire body of the document being spammed. For a 30 KB document, for example, all 30 KB are transferred across one's Internet pipe on every spam request. This adds up to a considerable amount of web server traffic, which can be very costly since bandwidth is not cheap.

Referrer spam wastes CPU time and disk space and can be a source of endless annoyance to server operators. Search engine developers are actively fighting it, so its initial effectiveness in boosting a site's ranking has lessened considerably. However, the problem persists and much remains to be done to conquer it.

Some recommended practices for countering referral spam are: not publishing referrers at all, listing the page in robots.txt when referrers have to be published, using the rel="nofollow" attribute, and gathering a cleaner list of referrers with JavaScript and beacon images. Some bloggers have begun fighting referrer spammers at the .htaccess level. Others have even taken steps to automate this.

Blocking Users by Referrer Notes

A very useful feature of .htaccess is the ability to block users or sites that originate from a particular domain. When there are tons of referrals from a particular site that has no visible link to one's own site, the referrals probably aren't legitimate. Alternatively, the other site may be hotlinking to certain files such as images, CSS files or other files. Blocking access by referrer in .htaccess requires the Apache module mod_rewrite, which is used to inspect the referrer first. The concern is that spam will still come in even as the .htaccess file keeps growing. Blacklisting certain referrers in .htaccess is an option, but its effectiveness has been greatly diminished by the ease with which spammers register thousands of domains and rotate them as quickly as they are blacklisted.
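As an illustration of blocking by referrer, here is a minimal sketch of the kind of mod_rewrite rules involved. The two domain names are hypothetical placeholders, not real spam hosts, and the patterns assume the goal is to refuse any request whose referrer is on one of those domains or its subdomains.

```apache
# Sketch only: the two domains below are hypothetical placeholders.
# Requires mod_rewrite to be available on the server.
RewriteEngine On

# Test the Referer header against each unwanted domain, ignoring case.
# [OR] chains the conditions; the last condition must not carry it.
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?spam-domain\.example(/|:|$) [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?another-spammer\.example(/|:|$) [NC]

# Answer matching requests with 403 Forbidden instead of serving the page.
RewriteRule .* - [F]
```

Each additional domain needs its own RewriteCond line, which is why a list maintained this way tends to grow quickly.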

An .htaccess generator can be used to prevent people from certain IP addresses, domains or even countries from gaining access to a site or to specific folders. To block a specific IP, type the full IP address. To block a range of IPs, type a partial IP address. To block a particular domain, type the domain without the www. To block a country, type its domain extension (the country's top-level domain).

There is no limit to the number of entries, which are added one at a time. Click "add" after each entry, then copy the generated code and paste it into a plain text file. Name this file .htaccess. Note the "." before the file name and the absence of any file extension.

If there is already an .htaccess file in the root of the docs directory or in the folder where the rules are to be applied, add the generated code to the end of the current .htaccess file, taking extra care not to disturb the existing code. Then upload the file in ASCII mode.
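The generator's exact output is not reproduced here, but rules of this kind commonly look like the sketch below. Every address and domain in it is a placeholder chosen for illustration, and the syntax is the older Allow/Deny style; Apache 2.4 replaces these directives with Require.

```apache
# Sketch only: every address and domain below is a placeholder.
# Let everyone in by default, then apply the Deny lines.
Order Allow,Deny
Allow from all

# A full IP address blocks one specific host.
Deny from 192.0.2.15

# A partial IP address blocks the whole range 198.51.100.0 to 198.51.100.255.
Deny from 198.51.100

# A domain without the "www." blocks hosts whose names end in that domain.
Deny from spam-domain.example

# A bare top-level domain blocks hosts whose names end in that extension.
Deny from .xx
```

Rules that name a domain or extension depend on the server resolving each visitor's IP address to a host name, so they only catch visitors whose addresses resolve under that name.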

The rel="nofollow" Solution

A coalition of blogging and search engine companies has joined together to support an HTML attribute designed primarily to combat comment spam, but with high potential for use against referral spam as well. This attribute, rel="nofollow", is being praised by many bloggers as the ultimate solution to the prevailing problem. The idea is simple enough; the hardest part was convincing major players such as Google, Yahoo! and MSN to agree on it.

Tagging a link with the rel="nofollow" attribute prevents it from contributing to the linked site's PageRank. This means that comment and referral spammers will not be rewarded for their illegitimate activities on websites that implement the attribute. The problem is partially solved, but this solution cannot end it.

The reason is that 100% adoption is impossible, so there will always be an incentive to spam. Spammers essentially do not care whether their techniques work on a specific site as long as they work in general. They need no particular reason to hit any one site and will do so anyway, since their target is the blogosphere as a whole. It is also quite unfortunate that the resources required to fight spam, particularly referral spam, are far greater than the resources needed to create it.

Referral spam is an HTTP request. The client doesn’t even need to acknowledge the response. All it may need is a simple packet with formatted text.

Spammers take pains to make a request look legitimate. The User-Agent string will look very much like that of MSIE. Spam used to come from a single IP, but things have definitely gotten more complex since then.

Referrer IPs can also be filtered against spam blacklists. If the requesting IP is blacklisted, avoid listing its referring URL in any section of the site's web stats. Once a given site has been identified as a referral spam host name, do not process its requests any further.
