February 7, 2012

You Need A Custom Robots.txt, So Let's Make One.

Posted on 09. Oct, 2008 in Blogging Tips

Today I was glancing over the source code for my site, cleaning up some random errors, when I happened upon a little booboo that could happen to anyone. Before we get into what I discovered about my own site, I want to explain exactly what a robots.txt file is. No, I am sorry, it is not a hidden file that Skynet planted on your server as part of its upcoming plans to take over the world. Instead, it is a simple text file that, when placed in the root directory of your site, tells search engine crawlers which parts of the site they should and should not crawl.

Most people run with a default or blank robots.txt. With one of those, the search engines may check every single nook and cranny of your entire directory and every single file, or they may just pass you up, especially if you're not using something like Google Webmaster Tools or submitting a sitemap as well. All of these are great ways to help yourself stay indexed in a timely manner and to cut back on content, in this case duplicate content, that you do not want crawled again after the first time. Whether or not this affects things such as PageRank cannot be said with 100% certainty, but being organized, staying error free, and not overloading your server are all good enough reasons to make sure your files and robots are in order.

My Robots.txt

User-agent: *
Disallow: /archives/
Disallow: /tag/
Disallow: /category/
Disallow: /cgi-bin/

# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://www.benjaminpatton.com/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN

As you can see, I do not want any of the search robots (hence the * instead of, say, just Google's crawler) crawling or indexing my archives, tags, or categories, because these pages repeat my posts and would count as duplicate content. Again, I am not 100% sure if this affects anything, but I want the original posts indexed and not indexed a second time. I added the cgi-bin disallow because, quite frankly, there is nothing there for them to see. In the past I ran a parabots email script, and for some reason I have a couple of links out there to some cloaked links. This takes care of that as well.
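If you want to check a file like this before uploading it, Python's standard library ships a robots.txt parser. The following is only a rough sketch: the rules are pasted in as a string, and the paths being tested are made up for illustration, not real pages on my site.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt above, pasted in as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archives/
Disallow: /tag/
Disallow: /category/
Disallow: /cgi-bin/

Sitemap: http://www.benjaminpatton.com/sitemap.xml.gz
"""


def load_rules(robots_txt):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
SITE = "http://www.benjaminpatton.com"

# Hypothetical paths, used only to exercise the rules.
for path in ("/tag/seo/", "/category/tips/", "/cgi-bin/mail.pl", "/some-post/"):
    verdict = "allowed" if rules.can_fetch("*", SITE + path) else "blocked"
    print(path, verdict)

# site_maps() needs Python 3.8 or newer.
print(rules.site_maps())
```

The tag, category, and cgi-bin paths should come back blocked, while an ordinary post path stays open to crawlers.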

The next part was added by my XML sitemap plugin. It lets any crawler know exactly where to find my sitemap, which makes indexing me much easier.

Feel free to copy and paste my robots.txt, but make sure you understand how it works in case you need to edit it for your own circumstances.

The nomenclature of Robots.txt

A big word which basically means the language for making your own robots.txt. Blindly cutting and pasting will get you into a bit of trouble, and as I mentioned in the top paragraph, I did just that. The irony is that I did not catch that the person I copied from had disallowed their own sitemap, and so I did too. This is not a great thing to do, especially when you are trying to stay indexed promptly instead of landing in a sandbox type situation, which we all fight to stay out of.
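A mistake like that is easy to catch with a few lines of Python. This is a minimal sketch, assuming the file lists its sitemap with a Sitemap: line; the helper name blocked_sitemaps and the example.com file are invented for illustration, and the exact rule that blocked the sitemap in the file I copied is not reproduced here.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemaps(robots_txt, agent="*"):
    """Return every listed sitemap URL that the same file tells `agent` not to fetch.

    Hypothetical helper, not part of any library. Needs Python 3.8+ for site_maps().
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    sitemaps = parser.site_maps() or []  # site_maps() returns None when none are listed
    return [url for url in sitemaps if not parser.can_fetch(agent, url)]


# A made-up file that repeats the copy-and-paste mistake.
COPIED_FILE = """\
User-agent: *
Disallow: /sitemap.xml.gz

Sitemap: http://www.example.com/sitemap.xml.gz
"""

for url in blocked_sitemaps(COPIED_FILE):
    print("Sitemap is disallowed:", url)
```

An empty list means none of the listed sitemaps are blocked for that robot.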

So let me share with you exactly how you can build your own robots.txt.

Commands

User-agent: *
This means any robot visiting your site. You can replace the * with a specific robot's name, such as Googlebot.

Disallow: /
This will not allow any robot to crawl anything on your site. For us that is not the goal, but depending on your content you may someday use this.
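The two commands work together: a User-agent line names who the rules are for, and the Disallow lines under it say what is off limits. Here is a small sketch of shutting out one robot entirely while leaving everyone else alone. BadBot and the example.com URLs are made-up names, and the check uses Python's built-in parser, which may not match every real crawler's behavior.

```python
from urllib.robotparser import RobotFileParser

# "BadBot" is a hypothetical crawler name used only for this example.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

HOME = "http://www.example.com/"
SCRIPT = "http://www.example.com/cgi-bin/mail.pl"

# BadBot matches its own group, where "Disallow: /" covers the whole site.
print("BadBot, home page:", parser.can_fetch("BadBot", HOME))

# Any other robot falls through to the * group.
print("Googlebot, home page:", parser.can_fetch("Googlebot", HOME))
print("Googlebot, cgi-bin:", parser.can_fetch("Googlebot", SCRIPT))
```

Keep in mind that robots.txt is a request, so a badly behaved robot can simply ignore it.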

Crawl-delay: 10
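This is a great tool if crawlers are causing your host some issues or your server thinks an attack is occurring. It asks a crawler to wait between requests (the number is generally read as seconds), although not every crawler honors it. If this is the case, though, I suggest taking a look at what is going on and, moreover, looking for new hosting.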
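To see what delay a well-behaved robot would read from your file, Python's parser exposes the value directly. This is a sketch only: the helper delay_for and its fallback of zero are my own assumptions, not part of the robots.txt convention.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
"""


def delay_for(robots_txt, agent):
    """Return the Crawl-delay that applies to `agent`, or 0 if none is set.

    Hypothetical helper; the fallback of 0 is an assumption for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # None when the file sets no delay
    return delay if delay is not None else 0


print(delay_for(ROBOTS_TXT, "AnyBot"))
```

A polite crawler would pause for that long between page requests.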

Allow: /folder1/myfile.html
While you can disallow a whole directory, what if you need to allow just one file from it? Use this command to allow that file within the disallowed directory. Voilà, success!
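Here is a quick sketch of that exception in action. One caveat worth labeling: Python's built-in parser applies the first rule that matches, so the Allow line is placed before the Disallow line here. Other crawlers may pick the most specific rule instead, and putting the Allow first is a safe ordering either way. The example.com URLs and other.html are made up.

```python
from urllib.robotparser import RobotFileParser

# Allow comes first because Python's parser stops at the first matching rule.
ROBOTS_TXT = """\
User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://www.example.com"  # placeholder domain

# The one file that was explicitly allowed.
print(parser.can_fetch("*", BASE + "/folder1/myfile.html"))

# Anything else in the directory stays blocked.
print(parser.can_fetch("*", BASE + "/folder1/other.html"))
```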

Request-rate: 1/5
Visit-time: 0600-0845

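These are some of the proposed extended robots commands. As proposed, Request-rate: 1/5 allows one page every five seconds (the second number is read as seconds unless a unit is given), and Visit-time only allows the spider to visit between the stated hours, so you can shift crawler traffic to times when it is not an issue. Neither is part of the core standard, so do not count on every crawler honoring them.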
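For the curious, this is roughly how a robot might read those two lines. Python's built-in parser understands Request-rate and reports the second number as seconds; it does not read Visit-time at all, so the in_visit_window helper below is a hypothetical one written just for this sketch, and it assumes the window does not wrap past midnight.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Request-rate: 1/5
Visit-time: 0600-0845
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# request_rate() returns a named tuple: (requests, seconds).
rate = parser.request_rate("*")
print(rate.requests, "request(s) per", rate.seconds, "second(s)")


def in_visit_window(window, now):
    """Check whether `now` falls inside an HHMM-HHMM window.

    Hypothetical helper: Visit-time is not handled by urllib.robotparser.
    Assumes the window starts and ends on the same day.
    """
    start_text, end_text = window.split("-")
    start = time(int(start_text[:2]), int(start_text[2:]))
    end = time(int(end_text[:2]), int(end_text[2:]))
    return start <= now <= end


print(in_visit_window("0600-0845", time(7, 30)))
print(in_visit_window("0600-0845", time(9, 0)))
```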

Put Your Robots To Work For YOU

Now you have the tools and the wisdom to set up your robots.txt to work for you. This helps you keep that plethora of a site in top working order for Google and not filled to the brim with errors and issues.

I would love to know any other methods you take advantage of in your robots.txt just in case I may have missed them. Knowing is half the battle. Get to it!


26 Responses to “You Need A Custom Robots.txt, So Let’s Make One.”

  1. M. Pence 9 October 2008 at 7:11 am #

    Ben,

    Thank you so much for this article. Learning about robots.txt has been on my list of things to Google and get to when I can, but you’ve pretty much saved me the basic trip, and you’ve explained it in a manner that has completely made sense to me.

    As an easily confused beginner, that has to be one of the most important things for me; finding an article that explains it exactly and easily for me to pick up–so thanks for that!

    I wonder how many other blog owners aren’t aware of this?

    • Big Ben Patton 9 October 2008 at 2:11 pm #

      No problem at all really, I noticed while searching myself for a clear and concise easy to read answer that none existed that told me everything that I needed to know for my blog.

      I also wonder how many blog owners are completely unaware of what this is and what it can do for them.

  2. MLRebecca 9 October 2008 at 4:54 pm #

    This is so helpful! I have always wondered what the Robots.txt was for, and now I know. I’m definitely going to make this a project of mine. I love reading blog posts that teach me something about blogging that I didn’t already know! Thanks for posting!

  3. Mark Sierra at MeAndMyDrum.com 9 October 2008 at 11:52 pm #

    Hi Ben,

    This is very timely because I’ve had a question come up recently that I cannot answer regarding one of my robots files, but I’m hopeful you can provide some insight.

    I’m using Google Webmaster Tools for all my WP blogs and sites. When I go to Web Crawl for one of those sites, there’s a topic there for “Errors for URLs in Sitemaps”. The details columns says, “robots.txt unreachable”. Why is that? I can access it just fine if I type in the path to it.

    I should point out that it’s a text file and reiterate that I’m not using a plugin to generate it. This is a static, one-page HTML site. So I don’t know if that has anything to do with it.

    Any help you can offer is very much appreciated. :)

  4. egk 10 October 2008 at 3:28 am #

    Hey Ben,
    Haven’t commented in a while, but I do read every post and must say your posts have been getting consistently better over time. Every post nowadays is really good! Great job outlining the common components of robots.txt.

    @Mark Sierra: if you are referring to the site meandmydrum.com that you linked to then the problem is very obvious. The syntax of your robots.txt file is one very long line, with the different components running into each other without spaces. You need to break each entry into its own line.

  5. egk 10 October 2008 at 3:33 am #

    BTW Ben,
    What is the difference between “Subscribe to comments via email” (above comment box) and “Notify me of followup comments via e-mail” (below comment box)? Does the first option send you an email for every comment on BPP, even if you didn’t comment on that post?

    • Big Ben Patton 10 October 2008 at 5:46 pm #

      More or less, yes: one will send you the comments, and one will notify you that there are some comments.

  6. Kevin from Great Wall of China Facts 10 October 2008 at 5:46 pm #

    Thanks for info on this. I need a robots.txt. I will do more research on this!

  7. Dinh Trung 11 October 2008 at 9:02 am #

    Hi Benjamin,

    Sorry to ask you this question, but I do not know how to build a robots.txt for my site!

    In my case, on my site (kienthucdulich.info), I sometimes run into problems with my Google sitemap.

    Please help to clarify!

    Thanks and best regards,
    Dinh Trung
    http://www.kienthucdulich.info

  8. John from Maryland Real Estate 11 October 2008 at 2:27 pm #

    Good post, Ben,

    It’s also a good idea to disallow a bunch of the site scraping scripts out there under their own User-agent lines. You can really save yourself a bunch of bandwidth if you have a large site. Just make sure not to disallow any of the important search engine spiders, of course.

  9. Donace 11 October 2008 at 4:46 pm #

    One thing I will chip in here is that robots.txt and things like do-follow and no-follow are all just ‘recommendations’ that search engines may or may not follow.

    Yahoo is notorious for disobeying robots.txt files and such.

  10. Armen Shirvanian 12 October 2008 at 1:11 am #

    This is wonderful because it allows me to see the specifics of a file that is used ubiquitously. The internals of the file are quite easily altered, and you have presented them smoothly.

  11. Ben Moreno 12 October 2008 at 7:07 pm #

    Hey this is cool. Thanks for the info man. I need to do this for my site. My name is Ben too! I like your blog. Nice design.

  12. Ajith Edassery 23 October 2008 at 2:22 pm #

    Though I was reasonably well-versed with the robots.txt file, I wasn’t aware of that extended keyword set.

    Thanks a lot,

    Cheers,
    Ajith

  13. Beijinger 7 February 2009 at 10:55 am #

    Request-rate: 1/5
    Visit-time: 0600-0845

    I was having some issues with certain bots eating my bandwidth, so I am glad to know the above commands.

  14. Narzuty 2 April 2009 at 4:03 am #

    robots.txt is an extremely important part of every blog. I’m surprised to see so many people in comments that were not aware of its functions.

  15. DJ ARIF from Islam Blog 18 April 2011 at 9:08 am #

    I’ve updated my robots.txt file by watching yours, would you like to see that and suggest me to what to do with it next? waiting for your reply….

  16. Social Networking Website 27 May 2011 at 1:02 am #

    i had really gained awesome information, i will use this for my website thanks for sharing

  17. Islamic Website 8 August 2011 at 1:38 pm #

    Robots.txt file is very important as per giving access to spiders, a big thanks for summarizing the whole topic.

  18. article wizard 18 October 2011 at 3:51 pm #

    Nice post. I used to be checking constantly this blog and I am inspired! Extremely helpful information particularly the final section :) I care for such info much. I used to be seeking this particular information for a very lengthy time. Thank you and good luck.

  19. domain name web hosting 24 October 2011 at 1:16 pm #

    great issues altogether, you just received a new reader. What might you suggest in regards to your submit that you simply made a few days in the past? Any positive?

  20. Part Time Jobs 11 November 2011 at 4:18 pm #

    ok fine this is a very important for any site who run


