Block Search Engine Traffic Overload – Yahoo, CUIL & Baidu
Although they bring in very little traffic, some of the smaller search engines crawl so aggressively that they can put a huge load on a web server and even bring a site down. For some reason, Yahoo, CUIL and Baidu are among the worst offenders, hammering sites repeatedly and consuming bandwidth and resources.
To tackle that, I would suggest setting Yahoo’s crawl rate to something very slow and blocking CUIL and Baidu altogether (unless you have a Chinese site).
To set a lower crawl rate for Yahoo’s Slurp crawler and block their useless ad crawler, add this to your robots.txt file:
User-agent: Slurp
Crawl-delay: 300
User-agent: Yahoo!-AdCrawler
Disallow: /
Please note that this sets the crawl rate to one page every 300 seconds. That’s very, very slow and will delay how quickly your new pages appear on Yahoo. But since Yahoo doesn’t bring in that much traffic, I’d say that doesn’t matter. Still, if you’re concerned about it, set the delay to something lower, perhaps between 20 and 50 seconds.
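To put those numbers in perspective, here is a quick back-of-the-envelope sketch in Python. The function name is just something I made up for illustration, and it assumes the crawler actually honours the delay and fetches one page per request:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400 seconds in a day


def max_pages_per_day(crawl_delay_seconds):
    """Upper bound on pages fetched per day by a crawler that waits
    crawl_delay_seconds between requests (assuming it honours the delay)."""
    return SECONDS_PER_DAY // crawl_delay_seconds


# Compare the 300 second setting with the gentler 50 and 20 second options.
for delay in (300, 50, 20):
    print(delay, "seconds ->", max_pages_per_day(delay), "pages per day at most")
```

At 300 seconds that works out to at most 288 pages a day, against 1728 at 50 seconds and 4320 at 20 seconds.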
To block CUIL altogether, add this:
User-agent: twiceler
Disallow: /
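If you want to sanity-check these rules before uploading them, Python’s standard urllib.robotparser module can parse them locally. Treat this as a rough sketch only: it shows how Python’s parser reads the rules, which is not necessarily how each crawler interprets them, and the /some-page.html path and the Googlebot comparison are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The rules from this post, gathered into a single robots.txt body.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 300

User-agent: Yahoo!-AdCrawler
Disallow: /

User-agent: twiceler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Slurp is slowed down but still allowed to crawl.
slurp_delay = parser.crawl_delay("Slurp")
slurp_allowed = parser.can_fetch("Slurp", "/some-page.html")

# The ad crawler and CUIL's twiceler are shut out completely.
adcrawler_allowed = parser.can_fetch("Yahoo!-AdCrawler", "/some-page.html")
twiceler_allowed = parser.can_fetch("twiceler", "/some-page.html")

# A crawler with no matching rule is unaffected.
other_allowed = parser.can_fetch("Googlebot", "/some-page.html")

print("Slurp delay:", slurp_delay, "allowed:", slurp_allowed)
print("AdCrawler allowed:", adcrawler_allowed)
print("twiceler allowed:", twiceler_allowed)
print("Other crawler allowed:", other_allowed)
```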
Baidu ignores robots.txt and so you have the following options (source):
Block the Baidu IP range (list)
If you have mod_rewrite enabled on your Apache server, add this to your .htaccess:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^baiduspider [NC]
RewriteRule .* - [F]
or this to your httpd.conf:
SetEnvIfNoCase User-Agent "^Baidu" bad_bot
<Directory />
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Directory>
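Before relying on either rule, it’s worth checking which User-Agent strings the patterns actually catch. Here is a small Python sketch of that check. Python’s re module with IGNORECASE stands in for Apache’s [NC] flag and for SetEnvIfNoCase, and the two sample User-Agent strings are illustrative assumptions, so compare them against what shows up in your own access logs.

```python
import re

# The two patterns from the Apache rules above, matched case-insensitively.
HTACCESS_PATTERN = re.compile(r"^baiduspider", re.IGNORECASE)
HTTPD_PATTERN = re.compile(r"^Baidu", re.IGNORECASE)

# Illustrative User-Agent strings (assumptions, check your own logs).
SAMPLE_AGENTS = [
    "Baiduspider+(+http://www.baidu.com/search/spider.htm)",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
]


def is_blocked(pattern, user_agent):
    """True if the pattern matches, i.e. the request would be refused."""
    return pattern.search(user_agent) is not None


# Map each sample agent to (blocked by .htaccess rule, blocked by httpd.conf rule).
results = {
    ua: (is_blocked(HTACCESS_PATTERN, ua), is_blocked(HTTPD_PATTERN, ua))
    for ua in SAMPLE_AGENTS
}

for ua, (htaccess_hit, httpd_hit) in results.items():
    print(htaccess_hit, httpd_hit, ua)
```

Because both patterns are anchored with ^, they only match when the header begins with the bot’s name. If your logs show the name appearing further into the string, as in the second sample, dropping the ^ would catch those requests too.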
Hope that helps. Let me know if you have run into this problem and if you have any other ways of tackling it.
