Today I was glancing over some of the source code for my site, cleaning up some random errors, when I happened upon a little booboo that could happen to anyone. Before we get into what I discovered about my own site, I want to explain exactly what a robots.txt file is. No, I am sorry, it is not a hidden file that Skynet has planted on your server as part of its upcoming plans to take over the world. Instead, it is a simple text file that, when placed in the root directory of your site, tells search engine robots which parts of the site they may crawl, and that can do amazing things for you.

Most people run with a default or blank robots.txt, which means the search engines will either check every single nook and cranny of your entire directory and every single file, or just pass you by, especially if you are not using something like Google Webmaster Tools or submitting a sitemap as well. All of these are great ways to help yourself stay indexed in a timely manner and to cut back on content, in this case duplicate content, that you do not want reread after the first time. Whether this affects things such as PageRank cannot be said with 100% certainty, but being organized, being error free, and not overloading your server are all good enough reasons to make sure your files and robots are in order.

My Robots.txt

User-agent: *
Disallow: /archives/
Disallow: /tag/
Disallow: /category/
Disallow: /cgi-bin/

# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://www.benjaminpatton.com/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN

As you can see, I do not want any of the search robots (hence the * instead of, say, just Google) crawling or indexing my archives, tags, or categories, because this will count as duplicate content. Again, I am not 100% sure if this affects anything, but I want the original posts indexed and not reindexed again. I added the cgi-bin disallow because, quite frankly, there is nothing there for them to see. In the past I ran a parabots email script, and for some reason I have a couple of links out there pointing to some cloaked links. This takes care of that as well.

The next part was added at the request of my XML sitemap plugin. It lets any robot that is looking know exactly where to find my sitemap, which makes indexing me much easier.
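
If you would rather test a file like this than trust your eyes, Python's standard urllib.robotparser module can read the rules and answer "may this robot fetch this URL?". The sketch below is only an illustration: the two page URLs are made up, and Python's parser is not guaranteed to behave exactly like Google's or any other search engine's.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from this post, pasted in as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archives/
Disallow: /tag/
Disallow: /category/
Disallow: /cgi-bin/

# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://www.benjaminpatton.com/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, only for illustration.
tag_page = "http://www.benjaminpatton.com/tag/seo/"
post_page = "http://www.benjaminpatton.com/some-post/"

tag_allowed = parser.can_fetch("*", tag_page)    # falls under Disallow: /tag/
post_allowed = parser.can_fetch("*", post_page)  # no rule matches this path
sitemaps = parser.site_maps()                    # needs Python 3.8 or newer

print(tag_allowed, post_allowed, sitemaps)
```

The tag page should come back as blocked, the post as allowed, and the sitemap line should show up in the list.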

Feel free to copy and paste my robots.txt, but make sure you understand how it works in case you need to edit it for your own circumstances.

The nomenclature of Robots.txt

A big word, which here basically means the language for making your own robots.txt. Blindly cutting and pasting will get you into a bit of trouble, and as I mentioned in the top paragraph, I did exactly that. The irony is that I did not catch that the person I copied from had disallowed their own sitemap, and as such I did too. This is not a great thing to do, especially when you are trying to stay indexed in a timely manner instead of ending up in a sandbox type situation, which we all fight to stay out of.
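
A mistake like that is easy to catch with a few lines of code. Here is a minimal sketch, again assuming Python's standard urllib.robotparser: it lists every Sitemap URL that the very same file forbids a robot to fetch. The function name blocked_sitemaps and the example.com file are made up for illustration; the file I actually copied is not shown in this post.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemaps(robots_txt, agent="*"):
    """Return the Sitemap URLs that the same file forbids `agent` to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    sitemaps = parser.site_maps() or []  # site_maps() is None when none are listed
    return [url for url in sitemaps if not parser.can_fetch(agent, url)]


# Hypothetical file that repeats the mistake: the sitemap itself is disallowed.
BAD_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /sitemap.xml.gz

Sitemap: http://www.example.com/sitemap.xml.gz
"""

# Same file with the offending line removed.
GOOD_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.example.com/sitemap.xml.gz
"""

print(blocked_sitemaps(BAD_ROBOTS_TXT))
print(blocked_sitemaps(GOOD_ROBOTS_TXT))
```

An empty list means none of the listed sitemaps are blocked by your own rules.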

So let me share with you exactly how you can build your own robots.txt.

Commands

User-agent: *
This means any robot visiting your site. You can replace the * with a specific robot's name, like Googlebot.
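
To see how a named robot and the * wildcard interact, here is a small sketch using Python's standard urllib.robotparser. The rules and paths are hypothetical: Googlebot gets its own group, and every other robot falls back to the * group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: Googlebot may see /photos/, other robots may not.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /photos/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A robot uses the group that names it and ignores the * group.
googlebot_photos = parser.can_fetch("Googlebot", "/photos/cat.jpg")
googlebot_drafts = parser.can_fetch("Googlebot", "/drafts/post.html")

# A robot with no group of its own gets the * rules.
other_photos = parser.can_fetch("SomeOtherBot", "/photos/cat.jpg")

print(googlebot_photos, googlebot_drafts, other_photos)
```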

Disallow: /
This tells robots not to crawl or index anything on your site. For us that is not the goal, but depending on your content you may someday use it.
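
One character makes a big difference here: a Disallow with a lone slash blocks everything, while a Disallow with nothing after the colon blocks nothing. A quick sketch with Python's standard urllib.robotparser (the helper name allowed and the path are just for illustration):

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="*"):
    """Parse robots_txt and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
BLOCK_NOTHING = "User-agent: *\nDisallow:\n"

everything_blocked = allowed(BLOCK_EVERYTHING, "/index.html")
nothing_blocked = allowed(BLOCK_NOTHING, "/index.html")

print(everything_blocked, nothing_blocked)
```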

Crawl-delay: 10
This asks robots to wait 10 seconds between requests. It is a great tool if crawlers are causing your host some issues or your server thinks a random attack is occurring. If that is the case, though, I suggest taking a look at what is going on and, moreover, looking for new hosting.
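
From the robot's side, honoring this looks roughly like the sketch below, which uses Python's standard urllib.robotparser to read the delay. The helper seconds_to_wait and its fallback of 1 second are my own assumptions, and keep in mind that support for Crawl-delay varies from crawler to crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("*")  # None when the file sets no Crawl-delay


def seconds_to_wait(delay, default=1):
    """Pick the pause between requests; `default` is an arbitrary fallback."""
    return delay if delay is not None else default


print(seconds_to_wait(delay))
```

A well-behaved crawler would sleep for that many seconds between page requests.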

Allow: /folder1/myfile.html
You can disallow a whole directory, but what if you need to allow just one file from it? Use this command to allow that file within the disallowed directory. Voilà, success!
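
Here is a small sketch of that exception, checked with Python's standard urllib.robotparser. The Allow line is placed before the Disallow line because Python's parser applies the first rule that matches; other robots may resolve conflicts differently, so treat the ordering as a safe habit and not a guarantee. The second file name is made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The one file we carved out of the blocked directory.
exception_allowed = parser.can_fetch("*", "/folder1/myfile.html")

# Hypothetical sibling file; it stays blocked with the rest of the directory.
sibling_allowed = parser.can_fetch("*", "/folder1/other.html")

print(exception_allowed, sibling_allowed)
```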

Request-rate: 1/5
Visit-time: 0600-0845

These are some of the newly proposed extended robots commands. The ones above ask for no more than one page every five seconds, and only allow the spider to visit between dedicated hours, so that crawler traffic is offloaded to hours when it is not an issue.
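
As a rough sketch of how a robot might read these two lines: Python's standard urllib.robotparser understands Request-rate, but it has no support for Visit-time, so the small parse_visit_time helper below is my own illustration. It assumes the HHMM-HHMM format shown above and, as I understand the proposal, that the hours are in UTC.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Request-rate: 1/5
Visit-time: 0600-0845
"""


def parse_visit_time(robots_txt):
    """Return (start, end) as datetime.time values, or None if not present."""
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "visit-time":
            start, _, end = value.strip().partition("-")
            return (time(int(start[:2]), int(start[2:])),
                    time(int(end[:2]), int(end[2:])))
    return None


def may_visit(now, window):
    """True when `now` (a datetime.time, assumed UTC) is inside the window."""
    start, end = window
    return start <= now <= end


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

rate = parser.request_rate("*")        # named tuple: (requests, seconds)
window = parse_visit_time(ROBOTS_TXT)  # (06:00, 08:45)

print(rate.requests, rate.seconds, may_visit(time(7, 30), window))
```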

Put Your Robots To Work For YOU

Now you have the tools and the wisdom to set up your robots.txt to work for you. It will help you keep that plethora of a site in perfect working order for Google, and not filled to the brim with errors and issues.

I would love to know any other methods you take advantage of in your robots.txt just in case I may have missed them. Knowing is half the battle. Get to it!



  1. M. Pence (4 comments.) (Reply) on Thursday 9, 2008

    Ben,

    Thank you so much for this article. This is something that has been on my list of things to Google and get to when I can–learning about robots.txt, but you’ve pretty much saved me the basic trip–that, and you’ve explained it in a manner that has completely made sense to me.

    As an easily confused beginner, that has to be one of the most important things for me; finding an article that explains it exactly and easily for me to pick up–so thanks for that!

    I wonder how many other blog owners aren’t aware of this?

    • Big Ben Patton (236 comments.) (Reply) on Thursday 9, 2008

      No problem at all really, I noticed while searching myself for a clear and concise easy to read answer that none existed that told me everything that I needed to know for my blog.

      I also wonder how many blog owners are completely unaware of what this is and what it can do for them.

  2. MLRebecca (2 comments.) (Reply) on Thursday 9, 2008

    This is so helpful! I have always wondered what the Robots.txt was for, and now I know. I’m definitely going to make this a project of mine. I love reading blog posts that teach me something about blogging that I didn’t already know! Thanks for posting!

  3. Mark Sierra at MeAndMyDrum.com (9 comments.) (Reply) on Thursday 9, 2008

    Hi Ben,

    This is very timely because I’ve had a question come up recently that I cannot answer regarding one of my robots files, but I’m hopeful you can provide some insight.

    I’m using Google Webmaster Tools for all my WP blogs and sites. When I go to Web Crawl for one of those sites, there’s a topic there for “Errors for URLs in Sitemaps”. The details column says, “robots.txt unreachable”. Why is that? I can access it just fine if I type in the path to it.

    I should point out that it’s a text file and reiterate that I’m not using a plugin to generate it. This is a static, one-page HTML site. So I don’t know if that has anything to do with it.

    Any help you can offer is very much appreciated. :)

  4. egk (25 comments.) (Reply) on Thursday 9, 2008

    Hey Ben,
    Haven’t commented in a while, but I do read every post and must say your posts have been getting consistently better over time. Every post nowadays is really good! Great job outlining the common components of robots.txt.

    @Mark Sierra: if you are referring to the site meandmydrum.com that you linked to then the problem is very obvious. The syntax of your robots.txt file is one very long line, with the different components running into each other without spaces. You need to break each entry into its own line.

  5. egk (25 comments.) (Reply) on Thursday 9, 2008

    BTW Ben,
    What is the difference between “Subscribe to comments via email” (above comment box) and “Notify me of followup comments via e-mail” (below comment box)? Does the first option send you an email for every comment on BPP, even if you didn’t comment on that post?

    • Big Ben Patton (236 comments.) (Reply) on Thursday 9, 2008

      More or less, yes: one will send you the comments, and one will notify you that there are some comments.

  6. Kevin from Great Wall of China Facts (1 comments.) (Reply) on Thursday 9, 2008

    Thanks for info on this. I need a robots.txt. I will do more research on this!

  7. Dinh Trung (3 comments.) (Reply) on Thursday 9, 2008

    Hi Benjamin,

    Sorry to ask you this question, since I do not know how to build a robots.txt for mine!

    In my case, on my site (kienthucdulich.info), I sometimes face problems with the Google sitemap.

    Please help clarify!

    Thanks and best regards,
    Dinh Trung
    http://www.kienthucdulich.info

  8. John from Maryland Real Estate (5 comments.) (Reply) on Thursday 9, 2008

    good post Ben,

    It’s also a good idea to disallow, under their own User-agent, a bunch of the site scraping scripts out there. You can really save yourself a bunch of bandwidth if you have a large site. Just make sure not to disallow any of the important search engine spiders, of course.

  9. Donace (25 comments.) (Reply) on Thursday 9, 2008

    One thing I will chip in here is that robots.txt and stuff like do-follow and no-follow are all just ‘recommendations’ to the search engines.

    Yahoo is notorious for disobeying robots.txt files and the like.

  10. Armen Shirvanian (3 comments.) (Reply) on Thursday 9, 2008

    This is wonderful because it allows me to see the specifics of a file that is used ubiquitously. The internals of the file are quite easily altered, and you have presented them smoothly.

  11. [...] Benjamin Patton – You Need a Custom Robots.txt, So Let’s Make One [...]

  12. Ben Moreno (1 comments.) (Reply) on Thursday 9, 2008

    Hey this is cool. Thanks for the info man. I need to do this for my site. My name is Ben too! I like your blog. Nice design.

  13. Ajith Edassery (6 comments.) (Reply) on Thursday 9, 2008

    Though I was reasonably well-versed with the robots.txt file, I wasn’t aware of that extended keyword set.

    Thanks a lot,

    Cheers,
    Ajith

  14. Beijinger (6 comments.) (Reply) on Thursday 9, 2008

    Request-rate: 1/5
    Visit-time: 0600-0845

    I was having some issues with certain bots eating my bandwidth, so I’m glad to know the above commands.

  15. Narzuty (7 comments.) (Reply) on Thursday 9, 2008

    robots.txt is an extremely important part of every blog. I’m surprised to see so many people in comments that were not aware of its functions.

