How CUIL Lost Me as a Customer Long Before They Launched

Several months ago I noticed a new spider called “Twiceler” crawling madstatter.com (my baseball statistics site). The link in the user-agent for Twiceler led to this page. Apparently it was for a new, unlaunched search engine called “cuil”. Obviously this Twiceler bot was just crawling sites to gather pages before the launch. Great, I was happy to be indexed by a new engine.

Then things went wrong… very wrong.

For some reason Twiceler was trying to index links such as:

http://www.madstatter.com/06/07/07/07/07/06/06/…/scoring.php

…which of course is a ridiculous address. Twiceler was creating hundreds (if not thousands) of these types of bogus addresses and trying to index them. They all returned 404, of course. I thought surely that this would stop soon enough because no spider could be this goofy.

It continued for days. At this point I decided to look up the contact for Twiceler and send off an informative email to tell them what was happening. I really was just trying to help.

Hi Jim,

I see your Twiceler robot crawling my site (www.madstatter.com) which is all fine and dandy, except that it is creating mal-formed addresses which all return 404 errors.

For example, it is trying to find this page:
http://www.madstatter.com/06/07/07/07/07/06/…/07/07/scoring.php

…which doesn’t exist of course. Perhaps there is something goofy with the logic that parses each page’s links and creates new addresses to crawl? For example, I use a lot of “../” and “./” references in my href links which tend to throw off some url-parsing robots (that is not intentional, it is just the way it is).

I am afraid at this rate your bot may be creating an infinite number of mal-formed addresses to crawl. I do not wish to block your bot, but I don’t want a whole ton of garbage addresses in my log files either :-)

Thank you for looking into this.
-Chad
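
To make the guess in that email concrete, here is a minimal Python sketch. I have no idea what Twiceler’s real link-parsing code looks like, so `resolve_naively` is a purely hypothetical bug, and the page URL and `href` value are made-up examples rather than actual links from my site. It compares standard resolution (via `urllib.parse.urljoin`) with a resolver that throws away the “../” segments instead of honoring them:

```python
from urllib.parse import urljoin

# Both values are illustrative assumptions, not real links from madstatter.com.
PAGE = "http://www.madstatter.com/06/07/scoring.php"
HREF = "../07/scoring.php"


def resolve_correctly(page_url: str, href: str) -> str:
    """Standard resolution: urljoin collapses './' and '../' segments."""
    return urljoin(page_url, href)


def resolve_naively(page_url: str, href: str) -> str:
    """Hypothetical buggy resolver: drops './' and '../' segments instead of
    honoring them, then appends what is left to the current page's directory."""
    base_dir = page_url.rsplit("/", 1)[0]
    kept = [seg for seg in href.split("/") if seg not in (".", "..")]
    return base_dir + "/" + "/".join(kept)


def crawl_depth(resolver, start: str, href: str, hops: int) -> str:
    """Follow the same relative link `hops` times, the way a crawler would
    when it revisits the pages it just 'discovered'."""
    url = start
    for _ in range(hops):
        url = resolver(url, href)
    return url


if __name__ == "__main__":
    print(crawl_depth(resolve_correctly, PAGE, HREF, 3))
    print(crawl_depth(resolve_naively, PAGE, HREF, 3))
```

With the correct resolver the crawler keeps landing on the same page no matter how many times it follows the link. With the hypothetical buggy one, every hop adds another bogus path segment, which is the general shape of what showed up in my logs.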

The next day I received this rather cold response:

Dear Chad,

Twiceler is the crawler that we are developing for our new search
engine. It is important to us that it obey robots.txt, and that it not
crawl sites that do not wish to be crawled. If you wish I will be
glad to add your site to our list of sites to exclude.

Like all startups, we hope to launch sooner rather later, but exactly
when that will be, I don’t know. Watch our web site (www.cuill.com) for
the announcement.

Recently we have seen a number of crawlers masquerading as Twiceler,
so please check that the IP address of the crawler in question is one of
ours. You can see our IP addresses at http://cuill.com/twiceler/robot.html

You may wish to add a robots.txt file to your site (I notice you don’t
have one). That is the standard mechanism for controlling robot access and
behavior. You can read about it at
http://www.robotstxt.org/wc/exclusion-admin.html
and there a simple generator of the file here
http://www.mcanerin.com/EN/search-engine/robots-txt.asp

Incorrectly formed URLs are usually the result of links we have
picked up from earlier crawls - usually from some other unrelated site
that has a stale or mangled link to yours. We have no way of knowing
their validity until we try to access them.

I apologize for any inconvenience this has caused you and please feel
free to contact me if you have any further questions.

Sincerely,

James Akers
Operations Engineer
Cuill, Inc.

I wasn’t sure if this was a form-response email or what. It did not seem to address the issue I had raised. Furthermore, it basically said “if you’re not happy with our spider crawling your site, just block it.”

In my confusion, I sent back another email:

Hi James,
Thanks for the reply.

I do have a robots.txt file; it is just blank. I do not wish to block any crawlers, so there are no entries in the file.

The IP that is crawling is 38.99.44.103, which is in the list of valid IPs you provide.

Perhaps a solution that will suit both of us is if you could clear out the “pending addresses” in your database for madstatter.com? That way Twiceler can still crawl the site, but it will just have to start from scratch creating the address links over again.

Is that a fair/doable compromise?

Thanks,
-Chad

I received no replies after sending this email. None.

After another week of Twiceler spamming my server (and logs) with insane requests, I decided I had had enough. I finally added the first (and only) entry to madstatter.com/robots.txt to block Twiceler entirely.
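
For anyone wanting to do the same, an entry of roughly the following shape is the standard way to shut out a single crawler. This is a sketch, not a copy of my actual file: the `Twiceler` user-agent token and the test URL are assumptions for illustration. The Python below uses the standard library’s `urllib.robotparser` to check that the rule blocks Twiceler while leaving other crawlers alone, and that a blank robots.txt blocks nobody:

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: one record that blocks only Twiceler.
ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /
"""

# Hypothetical URL used only to exercise the rules.
TEST_URL = "http://www.madstatter.com/scoring.php"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Twiceler", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, TEST_URL))
```

Keep in mind that robots.txt is only a request. It works only if the crawler chooses to honor it.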

I’m sorry, but if you are trying to create “the Google killer” search engine, then you shouldn’t mistreat (or outright ignore) people who are voluntarily trying to help you with your broken spider.

It’s no wonder that Cuil is claiming that they have indexed three times as many pages as Google. They have probably crawled millions of non-existent addresses!

I feel somewhat vindicated in my anger toward Cuil, seeing how overtly negative the reception of their official launch has been. I honestly don’t think Google has anything to worry about here.


Comments

20 Responses to “How CUIL Lost Me as a Customer Long Before They Launched”

  1. Mike Rapin on July 31st, 2008 12:17 pm

    I was hoping to see something come out of cuil.com, hopefully offering some kind of real competition to google, but obviously this site is just bogus…

    This post just confirms what I already assumed: cuil.com isn’t worth my time.

  2. Alexander Higgins on July 31st, 2008 7:42 pm

    It’s nice to see someone else stepping out with a Cuil horror story.

    I think it’s only a matter of time before webmasters and developers realize just who this company is.

  3. FlixPulse - Twitter Based Movie Reviews | Alex Higgins Blog on July 31st, 2008 8:17 pm

    [...] the webmaster/programmer stopped by my blog and left me a comment that Cuil had lost him as customer even before they launched.  So I stopped by his blog and ended up reading a really cool article about the technology [...]

  4. Alexander Higgins on August 1st, 2008 3:03 am

    Mike, in all honesty… I don’t think their site is bogus, they just don’t have it together right now. And they need to learn a very important lesson about respecting others on the internet. If the index is what it says it is, they will be able to compete with google.

  5. How To Block the CUIL Spider Bot : My Code is Compiling on August 2nd, 2008 8:39 pm

    [...] people have landed on my previous rant about CUIL after searching Google for “block cuil bot”.

  6. Is Cuil Killing Websites? on September 1st, 2008 1:08 pm

    [...] Cuil to see why Twiceler was hitting sites so often. James Akers, Cuil’s Operational Engineer responded to the issue by saying that “Twiceler is an experimental crawler that we are developing for [...]

  7. TechCrunch Japanese アーカイブ » Cuilでサイトがダウン? on September 1st, 2008 2:24 pm

    [...] Angry site owners, asking why Twiceler was so

  8. Is Cuil Killing Websites? | aboutCREATION on September 1st, 2008 2:28 pm

    [...] Cuil to see why Twiceler was hitting sites so often. James Akers, Cuil’s Operational Engineer responded to the issue by saying that “Twiceler is an experimental crawler that we are developing for [...]

  9. Cuil Bot Misbehaving? - Irish SEO, Marketing & Webmaster Discussion on September 1st, 2008 5:28 pm

    [...] it hasn’t overloaded the server. However this reply from one of Cuil’s engineers is very worrying: How CUIL Lost Me as a Customer Long Before They Launched : My Code is Compiling If someone used that "experimental spider" excuse on me, I’d use my experimental shotgun [...]

  10. Is Cuil Killing Websites? - Web News & Trends on September 2nd, 2008 1:54 am

    [...] Cuil to see why Twiceler was hitting sites so often. James Akers, Cuil

  11. Ciul crawlers crashing websites : Internet Business on September 2nd, 2008 6:10 am

    [...] According to James Akers, Operational Engineer for Cuil, there are other “crawlers masquerading as Twiceler”, Cuil blame malformed URLs for other sites as causing the problems. [...]

  12. Cuil har tabbet seg ut » IKTAvisen on September 2nd, 2008 1:46 pm

    [...] hear why their crawler generates so much traffic. The company’s operations engineer James Akers replied as follows: - Twiceler is an experimental crawler that we are developing for our new search engine. We [...]

  13. Prescott Shibles: B2B Internet Marketing and Media on September 3rd, 2008 9:56 pm

    Scared of Spiders?…

    My girlfriend really doesn’t like spiders, so I regularly get to roll up my newspaper and squash one. I’m not so much afraid of them, but that’s starting to change: they can be lethal to a website….

  14. NEWS » Blog Archive » Cuil e lo spider assassino on September 4th, 2008 1:40 am

    [...] But the protests, months after the first waves of panic, have not died down: some complain that Twiceler tries to reach nonexistent URLs in an attempt to fatten its own index [...]

  15. www.tegal-dalnet.com » Blog Archive » Block Cuil Now, Or Say Goodbye To Your Badass! on September 5th, 2008 1:48 pm

    [...] from the report of an early “victim” named Jazzychad. He reported that one of his sites suddenly went down because of overload. As soon as [...]

  16. chad on September 7th, 2008 4:34 pm

    HAHA… I use their search engine about once a week just as a joke. Their engine never reveals any valid results. It amuses me how much it sucks… only because of the ludicrous claim they made about being better than Google.

  17. semsemsam on September 10th, 2008 7:02 pm

    Let’s start fresh, rename cuil to fuil

  18. semsemsam on September 10th, 2008 7:09 pm

    By the way, isn’t it just the time to pull the plug on this “search engine”? At the moment it has 0.01% market share, what are they hoping for eventually? 1% market share?

  19. webman10000 on September 24th, 2008 10:21 pm

    Reading between the lines, it’s quite obvious Twiceler’s screwy design wasn’t by accident. You have an Irish techie who convinced VCs to dump 33 million in small change into his concept, which was based mainly on their massive index size. But guess what: when the VC auditors came to see the index size for real, Mr. Tom had to fill it with something. So he came up with a spur-of-the-moment idea to run the system dictionary against every site’s web directory, and presto, Cuil has now generated millions if not billions of indexes to unique IP addresses. Of course they are all 40(1,2,3) errors, but who cares, since he was selling index size to his VCs, not content. He passes the auditors’ test, they open up the bank account, and it’s muffins and chocolates forever (or at least until the money runs out and the VCs try to sell a goldmine full of fool’s gold).

  20. Horus Kol » Cuil is getting into hot water on December 21st, 2008 3:43 am

    [...] have also crafted detailed responses to the requesters, which are full of nice fuzzy non-statements, including a bunch of “oh, [...]
