Google really doesn’t like duplicate content, so it is advisable to keep the Google crawler from reaching the same content on your site through more than one URL. Since WordPress offers many ways of reaching the same content, you should block certain URLs and URL paths with a suitable robots.txt.

Here’s my suggestion for the WordPress robots.txt:

User-agent: Googlebot
# Disallow all directories and files within
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
# Disallow crawling of individual post feeds, categories and trackbacks
Disallow: */trackback/
Disallow: */feed/
Disallow: /category/*

Be extremely careful when implementing this. For example, some WordPress installations have Gallery2 embedded, which for reasons unknown likes to run with main.php in the URL (even with URL rewriting enabled!), so the /*.php$ rule can block Gallery2 URLs that end in main.php. Furthermore, robots.txt applies to the entire domain: if your blog lives in a sub-directory and you change the robots.txt, you might block essential pages in other sub-directories. I imagine this is why robots.txt isn’t included in the default WordPress installation.
As explained by my fellow bloggers who sent trackbacks, you also need to take care with the agents you block, and it would be wise to target specific bots instead of using the problematic * symbol in the "User-agent" field.
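
Before uploading the file, it can help to check which of your URLs the rules above would actually block. The following is a minimal Python sketch, assuming Googlebot-style matching where * matches any run of characters and a trailing $ anchors the end of the URL. The helper names (pattern_to_regex, blocking_rule) and the sample paths are made up for illustration, and the sketch ignores Allow lines and rule precedence, so treat it as a rough check rather than a faithful model of the crawler.

```python
import re

# Disallow patterns copied from the robots.txt above.
DISALLOW = [
    "/cgi-bin/",
    "/wp-admin/",
    "/wp-includes/",
    "/*.php$",
    "/*.js$",
    "/*.inc$",
    "/*.css$",
    "*/trackback/",
    "*/feed/",
    "/category/*",
]


def pattern_to_regex(pattern):
    """Translate a wildcard Disallow pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is matched literally
    from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def blocking_rule(path, rules=DISALLOW):
    """Return the first rule that blocks `path`, or None if no rule matches."""
    for rule in rules:
        if pattern_to_regex(rule).match(path):
            return rule
    return None


if __name__ == "__main__":
    # Hypothetical sample paths; replace them with URLs from your own site.
    samples = [
        "/2008/05/my-post/",
        "/2008/05/my-post/feed/",
        "/gallery/main.php",
        "/gallery/main.php?g2_itemId=42",
        "/category/news/",
        "/wp-content/themes/default/style.css",
    ]
    for path in samples:
        print(path, "->", blocking_rule(path) or "allowed")
```

Under these assumptions, a Gallery2 URL ending in main.php is caught by the /*.php$ rule, while the same URL with a query string appended is not, because the $ anchor requires the URL to end in .php. Running your own sitemap through a check like this is a cheap way to spot both kinds of surprise.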
