WordPress SEO: using robots.txt to avoid content duplication

Google really doesn’t like duplicated content on a site, so it is advisable to prevent the Google crawler from reaching the same content through more than one URL. Since WordPress offers many ways of reaching the same content, you should block certain URLs and URL paths with a suitable robots.txt.

Here’s my suggestion for a WordPress robots.txt:

User-agent: Googlebot
 
# Disallow all directories and files within
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
 
# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
 
# Disallow crawling of individual post feeds, categories and trackbacks
Disallow: */trackback/
Disallow: */feed/
Disallow: /category/*
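
To see which URLs these rules would catch before putting them live, here is a minimal Python sketch. It assumes the wildcard semantics Google documents for Googlebot (`*` matches any run of characters, a trailing `$` anchors the end of the URL). It is not Googlebot's actual matcher, it ignores Allow rules and rule precedence, and the helper names `rule_to_regex` and `is_blocked` as well as the sample paths are made up for illustration.

```python
import re

# The Disallow patterns from the robots.txt above.
RULES = [
    "/cgi-bin/", "/wp-admin/", "/wp-includes/",
    "/*.php$", "/*.js$", "/*.inc$", "/*.css$",
    "*/trackback/", "*/feed/", "/category/*",
]

def rule_to_regex(pattern):
    """Translate one Disallow pattern into a compiled regex (hypothetical helper)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_blocked(path, patterns=RULES):
    """Return True if any pattern matches the path from its first character."""
    return any(rule_to_regex(p).match(path) for p in patterns)

for sample in ["/index.php", "/index.php?p=12", "/2008/05/my-post/trackback/",
               "/2008/05/my-post/", "/"]:
    print(sample, "blocked" if is_blocked(sample) else "allowed")
```

Under these assumptions `/index.php` is blocked while `/index.php?p=12` is not, because the `$` anchor only matches URLs that end in `.php`.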

Be extremely careful when implementing this. For example, some WordPress installations have Gallery2 embedded, which, for reasons unknown, likes to run with main.php in the URL (even with URL rewriting enabled!), so the .php rule would block the gallery pages. Furthermore, robots.txt applies to the entire domain, so if your blog lives in a sub-directory and you change the domain's robots.txt, you might block essential pages in other sub-directories. I imagine this is why robots.txt isn't included in the default WordPress installation.
As explained by my fellow bloggers who sent trackbacks, you also need to take care with which agents you block; it is wiser to target bots specifically than to use the problematic * wildcard in the "User-agent" field.
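
Both cautions can be checked with Python's standard `urllib.robotparser`. The sketch below is only an illustration: the `/blog` prefix, the `example.com` URLs, the bot name `SomeOtherBot` and the helper `scoped_robots` are all assumptions, and since `robotparser` does simple prefix matching without wildcard support, only the plain directory rules are used.

```python
from urllib.robotparser import RobotFileParser

def scoped_robots(user_agent, directories, blog_prefix=""):
    """Build robots.txt lines for one named bot, scoped to a blog sub-directory.

    Hypothetical helper: prefixes each directory with blog_prefix so that
    rules meant for the blog do not touch other parts of the domain.
    """
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {blog_prefix}{d}" for d in directories]
    return lines

rules = scoped_robots("Googlebot", ["/wp-admin/", "/wp-includes/"], blog_prefix="/blog")

parser = RobotFileParser()
parser.parse(rules)

# Googlebot is kept out of the blog's admin area...
blog_admin = parser.can_fetch("Googlebot", "http://example.com/blog/wp-admin/options.php")
# ...but a similarly named directory elsewhere on the domain stays reachable...
other_dir = parser.can_fetch("Googlebot", "http://example.com/shop/wp-admin/")
# ...and a bot that is not named in the file is not affected at all.
other_bot = parser.can_fetch("SomeOtherBot", "http://example.com/blog/wp-admin/options.php")

print(blog_admin, other_dir, other_bot)
```

Naming the bot and prefixing the paths keeps the rules from reaching further than intended, which is the point of both warnings above.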

40 Comments

  1. filination says:

    You're absolutely right. Thanks for the helpful comments.

  2. Frostfox says:

    I tried your suggestion about the robots.txt. I am not sure if it was the reason, but my Google PageRank went from 2 to 4. The only other reason I can think of is that I moved my blog from the URL http://www.blog.frostfox.com to http://www.frostfox.com/blog.

  3. Lonnie says:

    Consolidating your incoming links ups your page rank score. Google.com/webmaster has some tools to help you with consolidation....

    I generally avoid any exclude tags on the site...The dangers outweigh the advantages for me...Any case studies on this one?

  4. filination says:

    Frostfox - Yeah, I believe Lonnie is right, but you should know that PageRank only updates once every 3-4 months, so it's not something you get immediate results on. But redirecting several reachable URLs into one is very good practice, especially if Google penalized you for duplicate content.

    Lonnie - Yeah, there are, and they are all over the SEO blogosphere. You can start with SEOBook and see his self-report as well as the incoming trackbacks. As long as you closely monitor the robots.txt performance in Google Webmaster Tools, I believe you'll be all right.

  5. Frostfox says:

    I have been playing around with the Google Webmaster Tools for a bit now.
    Two things about your robots.txt: first, you have a spelling mistake, "indididual" should be "individual"; second, isn't disallowing files that end with .php a bad idea? The index for your site is index.php.

  6. filination says:

    Thanks for the spelling correction :P :$

    Your question about index.php is exactly what content duplication is about. Some blogs serve the exact same page through both index.php and the main blog path "/", and that's something you want to avoid.

  7. Apache Gal says:

    Nice post dude.. You will want to check out Wordpress robots.txt for more examples.

  8. fiLi says:

    Yeah, I later found your post through various bloggers on the net (JohnTP etc.). That's a good comprehensive post you wrote there...

  9. Mark Wilson says:

    Hey Fili - thanks for the advice; unfortunately by blocking all PHP files I stopped Google from accessing my home page (the Google Webmaster Tools said that

    ).

    I read your comment above to Frostfox, do you have any advice for dealing with the situation where http://www.markwilson.co.uk/blog/ and http://www.markwilson.co.uk/blog/index.php are actually one and the same?

    TIA, Mark

  10. fiLi says:

    Glad I could help.

    I believe this next plugin will take care of that problem for you (and a few other duplicate content issues) :
    Permalink Redirect

  11. Mark Wilson says:

    Thanks again fiLi - that plugin looks really useful. M

  12. AskApache says:

    Great blog and nice post, Fili. I like how you are keeping it simple. I recently changed my robots.txt from the WordPress robots.txt example to a simpler version; perhaps it's time for a follow-up article?

  13. This robots.txt file looks very simple and thanks for some of the points you made. Now I again have to go through all the site I visited for the robots.txt file and create my own file based on the convincing information provided by them. Hopefully, after experimenting a little I would come to know which is best for a wordpress site. Thank you for the information.

  14. Romano says:

    Many thanks for your precious info.
    Grazie mille and greetings from Italy.

  15. saytopedia says:

    Useful tip and a simple solution. Thanks.

  16. Seo Google Pagerank says:

    Why does it work for some sites and not others? And how come some blogs get indexed in a day and then are dropped, and others stay in Google indefinitely?…

  17. jane says:

    Interesting! Always looking for useful SEO tips.

  18. Orion SEO says:

    Some people optimizing their websites or blogs are not aware of this, so it is good that there is always someone like you ready to share tips and SEO knowledge. Thanks for your kindness.

  19. Orion SEO says:

    Some people are not aware of what they are doing in SEO, so thanks to someone like you.

  20. You should check Wordpress Robots.txt for Silo SEO as it goes into more detail on removing duplicate content on wordpress using robots.txt.
