WordPress SEO: using robots.txt to avoid content duplication
Google really doesn't like duplicate content, so it is advisable to keep the Google crawler from reaching the same content on your site through more than one URL. Since WordPress offers many ways of reaching the same content, you should block certain URLs and URL paths with the right robots.txt.
Here's my suggestion for a WordPress robots.txt:
User-agent: Googlebot
# Disallow all directories and files within
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
# Disallow crawling individual post feeds, categories and trackbacks
Disallow: */trackback/
Disallow: */feed/
Disallow: /category/*
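Before deploying rules like these, it can help to check offline which paths they would catch. The following is only a minimal sketch, assuming the wildcard behaviour Google documents for Googlebot (`*` matches any run of characters, a trailing `$` anchors the end of the URL). It ignores Allow lines and rule precedence, and the function names and sample paths are made up for illustration.

```python
import re

# The Disallow patterns from the robots.txt above
RULES = [
    "/cgi-bin/", "/wp-admin/", "/wp-includes/",
    "/*.php$", "/*.js$", "/*.inc$", "/*.css$",
    "*/trackback/", "*/feed/", "/category/*",
]

def rule_to_regex(rule):
    """Translate one Disallow pattern into a compiled regular expression."""
    anchored = rule.endswith("$")          # trailing '$' means "end of URL"
    body = rule[:-1] if anchored else rule
    # Escape the literal parts and let each '*' match any run of characters
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_blocked(path, rules=RULES):
    """True if any rule matches the path, starting from its first character."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

# Hypothetical paths, chosen only to exercise the rules
for path in ["/", "/index.php", "/index.php?p=12",
             "/2007/05/my-post/", "/2007/05/my-post/feed/", "/category/seo/"]:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Under these assumptions, /index.php is blocked while / and /index.php?p=12 are not, so it is worth checking which form of your home page and post URLs is the one being linked to.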
Be extremely careful when implementing this. For example, some WordPress installations have Gallery2 embedded, which, for reasons unknown, likes to run with main.php in the URL (even with URL rewriting enabled!). Furthermore, if your blog lives in a sub-directory of your domain and you change the robots.txt for the entire domain, note that you might block essential pages in other sub-directories. I imagine this is why robots.txt isn't included as part of the default WordPress installation.
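If the blog does live in a sub-directory, one way to limit the damage is to scope every rule to that directory. This is a rough sketch rather than a recommended tool: the `scope_rules` helper and the `blog` directory name are assumptions for illustration, and the result should still be verified in Google's webmaster tools.

```python
def scope_rules(rules, subdir):
    """Prefix every Disallow pattern with the blog's sub-directory."""
    prefix = "/" + subdir.strip("/")
    # Caveat: a rule starting with '*' becomes e.g. '/blog*/feed/', which
    # would also match a sibling directory such as '/blogger/feed/'.
    return [prefix + rule for rule in rules]

# A few of the rules from the post, scoped to a hypothetical /blog directory
rules = ["/wp-admin/", "/*.php$", "*/feed/"]

print("User-agent: Googlebot")
for rule in scope_rules(rules, "blog"):
    print("Disallow: " + rule)
```

Pages outside the chosen directory are then left untouched by these patterns.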
As explained by my fellow bloggers who trackbacked, you also need to take care with the agents you block, and it would be wise to target bots specifically instead of using the problematic * symbol in the "user-agent" field.
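To illustrate what targeting a specific agent means in practice, here is a small sketch using Python's standard `urllib.robotparser`. That module only does plain prefix matching (no `*` or `$` wildcards), so the example sticks to the directory rules. The `allowed` helper, the example.com URL and the ExampleBot agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Only the prefix rules; they apply to Googlebot and to no other agent
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /wp-admin/
Disallow: /wp-includes/
"""

def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

URL = "http://example.com/wp-admin/options.php"
print("Googlebot:", allowed("Googlebot", URL))
print("ExampleBot:", allowed("ExampleBot", URL))
```

Because the file has no `User-agent: *` group, an agent that is not named in it is not restricted by these rules.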

You're absolutely right. Thanks for the helpful comments.
I tried your suggestion about the robots.txt. I am not sure if it was the reason, but my Google PageRank went from a 2 to a 4. The only other reason I can think of is that I moved my blog from the URL http://www.blog.frostfox.com to http://www.frostfox.com/blog.
Consolidating your incoming links ups your page rank score. Google.com/webmaster has some tools to help you with consolidation....
I generally avoid any exclude tags on the site... The dangers outweigh the advantages for me... Any case studies on this one?
Frostfox - Yeah, I believe Lonnie is right, but you should know that PageRank only updates once every 3-4 months, so it's not something you get immediate results on. But redirecting several reachable URLs into one is very good practice, especially if Google penalized you for duplicate content.
Lonnie - Yeah, there are, and they are all over the SEO blogosphere. You can start with SEOBook and see his self-report as well as the incoming trackbacks. As long as you closely monitor how your robots.txt performs in Google Webmaster Tools, I believe you'll be all right.
I have been playing around with the Google webmaster thing for a bit now.
Two things about your robots.txt: first, you have a spelling mistake ("indididual" should be "individual"), and second, isn't disallowing files that end with .php a bad idea? Your index for your site is index.php.
Thanks for the spelling correction
:$
Your question about index.php is actually what content duplication is about. Some blogs allow the exact same page to appear through index.php and their main blog path "/" and that's something you want to avoid.
Nice post dude.. You will want to check out Wordpress robots.txt for more examples.
Yeah, I later found your post through various bloggers on the net (JohnTP etc.). That's a good comprehensive post you wrote there...
Hey Fili - thanks for the advice; unfortunately by blocking all PHP files I stopped Google from accessing my home page (the Google Webmaster Tools said that
Glad I could help.
I believe this next plugin will take care of that problem for you (and a few other duplicate content issues) :
Permalink Redirect
Thanks again fiLi - that plugin looks really useful. M
Great blog and nice post Fili.. I like how you are keeping it simple. I recently changed my robots.txt from the WordPress robots.txt example to a simpler version, perhaps it's time for a followup article???
Hey you should check out the Updated WordPress SEO robots.txt!
This robots.txt file looks very simple, and thanks for some of the points you made. Now I have to go back through all the sites I visited for their robots.txt files and create my own file based on the convincing information they provide. Hopefully, after experimenting a little, I will find out which is best for a WordPress site. Thank you for the information.
Many thanks for your precious info..
Grazie mille and greetings from Italy.
Useful tip and a simple solution. Thanks.
Thank you for your tip. I also suggest:
http://codex.wordpress.org/Search_Engine_Optimization_for_Wordpress
Why does it work for some sites and not others? And how come some blogs get indexed in a day and then are dropped, and others stay in Google indefinitely?…
Interesting! Always looking for useful SEO tips.
Some people optimizing their websites/blogs are not aware of this, so it is good that there is always someone like you who is ready to share their tips and SEO knowledge. Thanks for your kindness.
Some people are not aware of what they are doing in SEO, so thanks to someone like you.
You should check Wordpress Robots.txt for Silo SEO, as it goes into more detail on removing duplicate content on WordPress using robots.txt.