How To Block the CUIL Spider Bot
Some people have landed on my previous rant about CUIL after searching Google for “block cuil bot”. I realize that article does not answer that question (unless you click the link to read madstatter’s robots.txt file). So, here is how you block CUIL’s insane spiders.
CUIL’s spider is called “Twiceler”. Why? I have no idea. To block it from your site, add the following lines to your site’s robots.txt file:
User-agent: twiceler
Disallow: /
That’s it!
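If you want to confirm the rule does what you expect before relying on it, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `is_allowed` and the `ROBOTS_TXT` constant are made up for this illustration. It only shows how a well-behaved parser reads the two lines above, not how Twiceler itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: twiceler
Disallow: /
"""


def is_allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch path.

    Hypothetical helper for illustration; it parses the text locally
    and never touches the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print("Twiceler allowed:", is_allowed("Twiceler", "/"))
    print("Googlebot allowed:", is_allowed("Googlebot", "/"))
```

The standard-library parser matches the user-agent name case-insensitively, so `twiceler` in the file also covers `Twiceler`. Other crawlers are unaffected because no rule names them.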
For more information about robots.txt files, see robotstxt.org.
4 Responses to “How To Block the CUIL Spider Bot”
I have found that the bot does not respect robots.txt, and others have made the same complaint. Cuil claims its bot does respect robots.txt, but only after 7 days.
[...] the search engine that claimed they will be the new Google, is getting a lot of bad press [...]
Similar issues here. Referral logs are a mess with badly predicted links, and I haven’t even looked at the 404 logs yet.
Took out my poor VPS for a few days, and I had to restore from an old image (logs got too loaded? I dunno, but things are fine now).
I’ve “adjusted” all of my robots.txt files, thanks for the crawler name.
As a webmaster, you definitely should use user-agent headers to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.
I wrote more about this here:
Webmaster Tips: Blocking Selected User-Agents
http://faseidl.com/public/item/213126
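For a crawler that ignores robots.txt, as reported in the comments above, the usual fallback is to refuse requests by User-Agent header at the server. Here is a minimal sketch of that idea as WSGI middleware in Python. It assumes the site runs as a WSGI application, and the names `block_user_agents` and `BLOCKED_AGENTS` are invented for this example. As the last comment points out, the header is trivial to fake, so treat this as traffic management and not as security.

```python
# Substrings to look for in the User-Agent header, in lower case.
# "twiceler" comes from the post above; add others as needed.
BLOCKED_AGENTS = ("twiceler",)


def block_user_agents(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app so that requests from blocked agents get a 403.

    Hypothetical helper for illustration only.
    """
    def middleware(environ, start_response):
        # WSGI exposes the User-Agent header as HTTP_USER_AGENT.
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)

    return middleware
```

To use it, wrap the existing application object, for example `application = block_user_agents(application)`. The same substring match can be done in the web server's own configuration if the site is not a Python application.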