The robots.txt file.
Nothing can be more confusing to a website owner than the robots.txt file. Born in the programming world, the robots.txt file is nothing more than a plain text file of instructions for search engines. Unfortunately, while search engines understand the file, humans have a difficult time understanding machine language.
The Google blog is now running a two-part series on understanding robots.txt and the robots meta tag. Both of these articles, while providing a lot of great in-depth information, are much more than any site owner or manager wants to know, especially once you start talking about technology, bots, spiders, permissions, and so on. Most owners don’t know where to start, nor do they understand the technology behind either of these issues. What people really want to know is “what do I need to do?”
In fact, most website marketers don’t care. They just want it done.
Just tell me what to do
Most people just want to know what to do, where to put it, and be done with it. If that’s you – just go to the bottom of the article and you’ll get what you need. Otherwise, for those who are curious but don’t like technical explanations, I’m going to explain it the best I can, in terms the common man, like me, can understand.
Robots.txt Explained. Sort of . . .
The best way to explain the robots.txt file is that it is a ‘welcome mat’ for the search engines. It’s not so much that the file is necessary for search engine success, but it’s one of those hundreds of small things you need to consider, much like everything in SEO. If you have it, it will help your search engine success in a very small way. If you don’t have it, it won’t harm you; it’s simply a technical issue.
The technical issue is that the search engines request this file before or during every spidering session. Some request it before every session; some request it prior to crawling groups of pages. Either way, search engines request this file multiple times in a session and in a day. If the file does not exist, each request shows up as a ‘page not found’ error in your log files. This is getting borderline technical, so I’ll stop here with this explanation. So, if the search engines request it, it must be important. That’s why I believe it is important to have.
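For the curious, here is roughly what that looks like: a missing robots.txt shows up in an Apache-style access log as a 404 line, something like the one below (the IP address, date, and byte count are invented for illustration):

66.249.66.1 - - [10/Feb/2008:06:25:24 +0000] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The 404 is the ‘page not found’ code; creating even an empty robots.txt file makes those errors go away.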
Welcome Home
I like to explain it as a ‘welcome mat’ because some people have a welcome mat at the entrance of their house and some people don’t. Either way, it doesn’t prevent people from coming into the house. The same goes for the robots.txt file: it simply tells search engines that they are welcome to visit the site.
Don’t Go There!
If you want to get fancy with your welcome mat, you can tell the search engine where not to go in your house. Typically, these are files that are not important to the search engines or files that you don’t want showing up in the search results. It’s kind of like that closet where you store all your junk. When people come over, you don’t want them to go into the closet. It’s not vital for them to know it’s in there, as it’s stuff you typically store out of sight. For a website, some people “disallow” printer-friendly pages, images, or directories that they do not want to show up in the search results.
It’s Not for Security
Now, I am not saying to use this as a way of protecting information that you don’t want people to see. If that is the case, then you need to put that behind a password. The robots.txt file is not for hiding information from people. It simply tells the search engines not to crawl it; anyone who knows the URL can still open the page.
Knowing this is really what’s important from a marketing standpoint. The technical standpoint is a little more difficult, because it gets into crawler directives that most people frankly don’t understand. I’m surprised how many times I run into problems with the robots.txt file as the culprit. This little file has been the cause of a lot of problems for some very large websites.
The Robots.txt Structure
There are only two lines required for a standard robots.txt file. The first line identifies the robots you want to specifically command.
User-agent: *
The asterisk is a wildcard, meaning: all robots – follow these instructions.
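If you only want to address one specific robot, you can name it in place of the asterisk. For example, this line (using Googlebot, the name Google’s crawler identifies itself by; other engines have their own names) addresses Google alone:

User-agent: Googlebot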
The second line tells the robots where not to go, which is defined either at the directory level or the page level.
Disallow:
If you don’t want to disallow anything, then don’t put anything there. That’s the typical set-up to allow the search engines free rein of your website.
It’s as simple as that. Here is the complete file, exactly as it would look typed into Notepad or any plain text editor:
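User-agent: *
Disallow: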
Adding and Removing
Now, some people get a little fancy and like to disallow certain directories. This is usually done to remove any duplicate content. So, let’s say I have a directory of all of my printer-friendly pages, which are really only duplicates of the HTML pages.
User-agent: *
Disallow: /printerfriendly/
I’ve disallowed the entire directory by specifically naming it to the search engines.
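The disallow rule is a simple ‘starts with’ match on the URL path, so it covers everything inside that directory. Here is a quick sketch with made-up page names (the # lines are comments, which the robots ignore):

User-agent: *
Disallow: /printerfriendly/
# Blocked: /printerfriendly/about.html, /printerfriendly/news/today.html
# Still crawled: /about.html, and even /printerfriendly.html (it does not start with /printerfriendly/)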
The forward slash is an important part of this file. That is where most people make their mistake: with that slash.
Blocking your Website
By adding a slash to the disallow command, like this:
Disallow: /
You are telling the search engines to “go away.” That single slash blocks your entire website from being crawled.
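To see just how small the difference is, here are the ‘welcome’ and ‘go away’ versions side by side, with comments (the # lines) spelling out what each one does. The only change is that one slash:

User-agent: *
Disallow:
# Empty value: nothing is blocked, the entire site is open to crawling

User-agent: *
Disallow: /
# A single slash: the entire site is blocked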
More info
If you want more information about the robots.txt file and all the things you can do with it, I suggest the following resources:
Official Google Blog – The Robots Exclusion, pt 1
Official Google Blog – The Robots Exclusion, pt 2
Summary
Hopefully, this has helped a few people understand the place and purpose of the robots.txt file. Even more than that, I hope it has taken the fear away from dealing with this file. Many site managers are gun-shy because, at one time or another, a misplaced slash has disallowed their entire site from the search engines.
If you have any questions about this file, feel free to leave them in the comments. I and many others are very willing to help you understand what you need to know about the robots.txt file.
It’s better to ask questions and be sure that you are making the right move than to guess and disallow your entire website . . .