Google really doesn’t like duplicate content, so it is advisable to prevent the Google crawler from reaching the same content on your site through more than one URL. Since WordPress offers many ways of reaching the same content (feeds, categories, archives and trackbacks, among others), you should block certain URLs and URL paths with a suitable robots.txt.
Here’s my suggestion for the WordPress robots.txt:
# Disallow all directories and files within
# Disallow all files ending with these extensions
# Disallow parsing individual post feeds, categories and trackbacks..
Be extremely careful when implementing this. For example, some WordPress installations have Gallery2 embedded, which, for reasons unknown, likes to run with main.php in the URL (even with URL rewriting enabled!), so any rule that disallows .php URLs would block the gallery pages as well. Furthermore, if your blog lives in a sub-directory of your domain and you change the robots.txt for the entire domain, note that you might block essential pages in other sub-directories. I imagine this is the reason why robots.txt isn't included as part of the default WordPress installation.
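One way to reduce that risk is to test a draft robots.txt against URLs that must stay crawlable before uploading it. The following is a minimal sketch using Python's standard `urllib.robotparser`; the rules, the `example.com` domain and the sample paths are illustrative assumptions, not the rules suggested in this post. Note that this parser only does plain path-prefix matching, so it will not evaluate wildcard patterns the way Google's crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules, for illustration only.
DRAFT_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /feed/
"""

def blocked_urls(rules, urls, agent="Googlebot"):
    """Return the URLs from `urls` that `agent` may not fetch under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]

# Hypothetical pages elsewhere on the domain that must remain crawlable.
must_stay_crawlable = [
    "http://example.com/blog/my-post/",
    "http://example.com/gallery/main.php",
    "http://example.com/feed/products.xml",  # belongs to another application
]

problems = blocked_urls(DRAFT_RULES, must_stay_crawlable)
print(problems)  # ['http://example.com/feed/products.xml']
```

Here the domain-wide `/feed/` rule, meant for the blog, also hides a feed that belongs to a different application in the same domain, which is exactly the kind of side effect to look for.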
As explained by my fellow bloggers who trackbacked, you also need to take care with the agents you block: it would be wise to target specific bots by name instead of using the problematic * wildcard in the "User-agent" field.
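To see what targeting a bot by name changes, here is a small sketch, again with Python's standard `urllib.robotparser`. The rule group and the `SomeOtherBot` agent name are hypothetical; the point is only that a group addressed to one crawler leaves the others unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group aimed at a named crawler, no "*" group.
TARGETED_RULES = """\
User-agent: Googlebot
Disallow: /trackback/
"""

def is_allowed(rules, agent, url):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

url = "http://example.com/trackback/"
googlebot_allowed = is_allowed(TARGETED_RULES, "Googlebot", url)
other_allowed = is_allowed(TARGETED_RULES, "SomeOtherBot", url)
print(googlebot_allowed, other_allowed)  # False True
```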