No Twitter Trending Bots - Greasemonkey Script
First, the good stuff. You can read the backstory down below.
Script:
http://jazzychad.com/twitter/notrendingbots.user.js
If you are using Greasemonkey in Firefox, you can click the above link to install the script.
What does it do?
This Greasemonkey script will remove tweets by the Twitter Trending Bots from the http://search.twitter.com/ search results. Right now that list includes:
- tweet_trends
- realtimetrends
- twopular
- twopularfeed
- twopularalert
- twithority
- attrending
- trending
- tweetingtrends
- retweetingtrends
- daymix
- trendingtopics
If you are searching a trending topic, you don’t need a dozen bots telling you it’s trending. You already know that!
However, if you are specifically searching for one of these bots, the results will remain intact, just in case you want to use one of them to see the history of trends.
If any tweets are removed from the results, a little blue info window on the results page will tell you how many.
For people who know how to edit user/greasemonkey scripts, there is a section that looks like this:
trendbots.push("tweet_trends");
trendbots.push("realtimetrends");
trendbots.push("twopular");
trendbots.push("twopularfeed");
trendbots.push("twopularalert");
trendbots.push("twithority");
trendbots.push("attrending");
trendbots.push("trending");
trendbots.push("tweetingtrends");
trendbots.push("retweettrends");
trendbots.push("daymix");
trendbots.push("trendingtopics");
Feel free to add/delete other bots/accounts from that list to suit your taste.
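For anyone curious how this kind of filtering can work, here is a minimal sketch. It is not the code from the actual script: the function name `filterTrendBots`, the `{ author, text }` tweet shape, and the handling of `from:`, `to:` and `@` prefixes in the query are all assumptions made for illustration. It only shows the idea described above: drop tweets from listed accounts, unless the search itself is for one of those accounts.

```javascript
// Hypothetical sketch, not the real script. Only two bots are listed to keep it short.
const trendbots = [];
trendbots.push("tweet_trends");
trendbots.push("realtimetrends");

// tweets: array of { author, text } objects (an assumed shape)
// query:  the search string the user typed
// bots:   list of account names to hide
// Returns the tweets to keep and the number removed.
function filterTrendBots(tweets, query, bots) {
  const botSet = new Set(bots.map((b) => b.toLowerCase()));

  // Split the query into terms and strip common prefixes so that
  // "from:realtimetrends" or "@realtimetrends" count as a search for the bot.
  const terms = query
    .toLowerCase()
    .split(/\s+/)
    .map((t) => t.replace(/^(from:|to:|@)/, ""));

  // If the user is searching for one of the bots, leave the results intact.
  if (terms.some((t) => botSet.has(t))) {
    return { kept: tweets, removed: 0 };
  }

  const kept = tweets.filter((t) => !botSet.has(t.author.toLowerCase()));
  return { kept, removed: tweets.length - kept.length };
}
```

The `removed` count is the kind of number the little blue info window could report.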
Backstory:
Earlier this week, I launched a new twitter bot @RealTimeTrends which tweets when new trends appear and when currently existing trends move up in ranking. I found this to be more informative than the other existing trend bots. Also, I use it as a way to link to my TweetGrid Search site. Now, I have always felt that these sorts of bots were kind of spammy, but when I launched mine, I quickly found that lots of people click the links attached to those tweets… a surprising number, in fact. So, obviously people find this sort of information useful. My bot even has around 100 followers so far (in under a week, as of time of writing). Update: After 1 month it has 765 followers. *boggle*. I try not to let my bot be too verbose. It only tweets new and rising trends, not the entire list of trending topics every 5 minutes (which is how often twitter updates its trends list) like many of the other trend bots.
Suddenly I received a tweet to my bot from @gregorylent assailing its service:
@RealTimeTrends really sorry to see your service, you just add fog to twitter search ..
Apparently he was going after the other bots, too:
@twopularFeed what is your reason for being, i go to a topic to read about the topic, not to read your summary .. please get a job
I can’t say that I blame him, but like I said, it drives traffic so I’m certainly not going to turn it off.
So why write this greasemonkey script? Two reasons, really.
1. It occurred to me that the problem of removing trending bot tweets from search results could be easily achieved with a quick greasemonkey script. I had never written one before, so I took the opportunity to learn how. It was quite fun, so I think I’ll be writing some more in the future.
2. Even though my bot works, has followers, and even gets retweeted (another surprise), I still felt it was all kind of spammy. I felt like writing this sort of script would even things out. I am providing a bot to be informative, but I am also providing a mechanism to shut it up (along with others).
There you have it. As more bots come along I will try to keep the script updated.
When Cluster Computing Can Slow You Down, and How To Optimize It
I have long been fascinated with the concept of cluster (or distributed) computing. Projects like BOINC and Folding At Home that take advantage of millions of people’s idle computer cycles are brilliant. I have several computers in my house that I use for various purposes, but they are rarely ever all being used at the same time. I always wanted to try to run some sort of clustered application across them but never found a great reason until recently: making DVD backups by transcoding them into xvid files.
There is a great program for ripping DVDs in Linux called dvd::rip. It is written in Perl and is very easy to use if you are comfortable with a Linux/console environment. One of its great features is the ability to transcode ripped DVDs by creating a computer cluster out of your networked computers. Before I get into the details, let me step back and give a very brief overview of the basic way a computing cluster works.
The Ultimate Twitter Search Widget
I have released a Twitter Search Widget and supporting API for use on any website. You can find it here:
The Ultimate Twitter Search Widget
These widgets display tweets based on the search.twitter.com API and can be customized with great flexibility. Go check it out!
Initial Google Chrome Memory Usage Comparison
Before you flame me, I know that Google Chrome is early early beta, but I find this kinda silly. Bragging about your superior memory management means it should have superior memory management. After installing Chrome and firing up my 4 most frequently loaded pages in 4 new tabs, I noticed that things started to get clunky. I opened up Firefox 2, loaded the same 4 pages in 4 new tabs, and compared. In Chrome you can open ‘about:memory’ to see memory usage statistics (which is pretty awesome). Anyway, this is what I see:
Chrome is using almost twice what Firefox is using. If you’re interested, here are the 4 sites:
- Gmail
- Google Finance
- CNN Money
Now, I know that RAM is there to be used, and maybe I’m missing some point of what Chrome is trying to do at the moment (I know that the comic book said that it tries to amortize the memory cost of a tab up front at its creation), but yikes… this is a tab tad concerning…
Hurricane Gustav Twitter Tracker
At the suggestion of waynesutton, I have created another twitter tracking site. This time it is for Hurricane Gustav related tweets. There are a few filters available to narrow down the results on each page.
Hurricane Gustav Twitter Tracker
UPDATE:
I also created a widget you can install on your site/blog at http://jazzychad.com/twitter/gustav/widget.php
In the first 24 hours, over 32,000 widgets were served.
You can see it in action at these sites (probably until Gustav blows over):
Sarah Palin Little Known Facts
Today John McCain announced Alaska governor Sarah Palin as his candidate for Vice President. No sooner had that happened than @MichaelTurk inadvertently started the newest internet viral (micro)meme on twitter: “Little Known Facts: Sarah Palin” His unassuming tweet, “Little known fact: Sarah Palin used to wrestle kodiak bears in Alaskan bare knuckles fight clubs.” set off an explosion of other twitter users coming up with their own Chuck Norris-style little known facts about Palin. MichaelTurk is keeping a log of some of these little known facts over at his site http://www.palinfacts.com/
During the next few hours, over 1,000 little known facts were tweeted (perhaps not all unique), but the point is how quickly this caught on like wildfire.
Since you can only see up to the most recent 100 tweets on search.twitter.com, I decided to write a tweet collecting engine (similar to the one used for FlixPulse.com) to keep a catalog of these little known facts. The results are shown at http://jazzychad.com/twitter/palinfacts/ In just under 2 hours, I have collected over 200 little known fact tweets. Amazing.
During the first few hours, all of the tweets were positive in Chuck Norris fashion. Then all of a sudden I started to see some negative ones creep in. It seemed that those who are unhappy with McCain’s choice, and some Obama supporters, figured out that two can play at this game. Still, the overwhelming majority are positive in nature.
I see that even as I write this post, Turk from palinfacts.com has added a link to my database site on his blog. Thanks, Turk!
Who knows how long this will keep up, but I will keep logging the tweets as they come in!
Barcodes Are Fun
I have always been fascinated by barcodes and the various methods to encode human readable information into a machine readable format. A long time ago I found a great little program for my TI-89 calculator that would draw a user-entered UPC code onto the graph screen. I ported it to work in Windows using Visual Basic to gain an understanding of how barcodes are created. I even went so far as to use my CueCat to scan barcodes on products and have my program automatically reproduce the barcode on screen.
Later on when I was learning to use the gd library for image creation within php scripts, I decided to start by porting the barcode generator once again. I never got it quite right, but I had learned enough to move on with whatever project I was working on. Recently I stumbled upon that old barcode gd script and decided to fix it so it worked correctly.
First the barcode creator only generated UPC-A barcodes (the big 12-digit codes you scan at the check-out line). After getting that to work successfully, I decided to implement UPC-E barcode generation as well. UPC-E codes are usually found on soda cans or snacks and only have 6 digits so they take up less physical space on the product.
UPC-A and UPC-E Barcode Generator
If you should ever need to create barcodes on the fly, I have created a sort of API for them:
To create a UPC-A barcode, use the following pattern:
http://jazzychad.com/barcode/upca-<insert-12-digit-code-here>.png
Example:
http://jazzychad.com/barcode/upca-012000809965.png
will produce

To create a smaller version, simply add “small” after the 12-digit code, like so:
http://jazzychad.com/barcode/upca-012000809965small.png
will produce

If you would prefer gif format over png, just use .gif instead.
UPC-E creation is very similar:
http://jazzychad.com/barcode/upce-<insert-6-digit-code-here>.png
Notice “upce” instead of “upca”.
Example:
http://jazzychad.com/barcode/upce-120850.png
will produce

Again, to create a smaller version, just add “small” after the 6-digit code:
http://jazzychad.com/barcode/upce-120850small.png
will produce

.png and .gif work the same way as well.
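If you want to build these addresses from code, a small helper along these lines would do it. This is just a sketch of the URL patterns described above; the function name `barcodeUrl` and its `options` argument are made up for illustration and are not part of the barcode script itself.

```javascript
// Digits expected for each barcode type, per the URL patterns above.
const BARCODE_DIGITS = new Map([
  ["upca", 12],
  ["upce", 6],
]);

// Builds a barcode image URL.
// type:    "upca" or "upce"
// code:    string of 12 digits (UPC-A) or 6 digits (UPC-E)
// options: { small: boolean, format: "png" | "gif" } (hypothetical helper options)
function barcodeUrl(type, code, options = {}) {
  const { small = false, format = "png" } = options;

  const digits = BARCODE_DIGITS.get(type);
  if (digits === undefined) {
    throw new Error("type must be 'upca' or 'upce'");
  }
  if (typeof code !== "string" || !new RegExp(`^\\d{${digits}}$`).test(code)) {
    throw new Error(`${type} codes must be exactly ${digits} digits`);
  }
  if (format !== "png" && format !== "gif") {
    throw new Error("format must be 'png' or 'gif'");
  }

  const size = small ? "small" : "";
  return `http://jazzychad.com/barcode/${type}-${code}${size}.${format}`;
}
```

For example, `barcodeUrl("upce", "120850", { small: true })` gives the small UPC-E address shown above.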
Note: UPC-A creation will still work even if the check-digit (the last digit after the barcode) is incorrect. I think I will change the script to indicate an invalid UPC-A code by turning the check-digit red.
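For reference, the standard UPC-A check digit is computed by tripling the sum of the digits in the odd positions (1st, 3rd, ... 11th), adding the digits in the even positions, and taking whatever brings the total up to a multiple of 10. A minimal sketch of that calculation follows; the function names are mine and this is not the code the generator actually uses.

```javascript
// Computes the UPC-A check digit from the first 11 digits (given as a string).
function upcaCheckDigit(first11) {
  if (!/^\d{11}$/.test(first11)) {
    throw new Error("expected exactly 11 digits");
  }
  let sum = 0;
  for (let i = 0; i < 11; i++) {
    const d = Number(first11[i]);
    // Index 0, 2, 4, ... are the 1st, 3rd, 5th, ... positions, which get weight 3.
    sum += i % 2 === 0 ? d * 3 : d;
  }
  return (10 - (sum % 10)) % 10;
}

// Returns true when a 12-digit UPC-A string ends in the correct check digit.
function isValidUpca(code) {
  return (
    /^\d{12}$/.test(code) &&
    upcaCheckDigit(code.slice(0, 11)) === Number(code[11])
  );
}
```

With the example code above, 012000809965, the odd positions sum to 25 and the even positions to 10, so the total is 3 × 25 + 10 = 85 and the check digit is 5, which matches the last digit.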
I wanted to test scanning these barcodes after printing them out, but my CueCat seems to have died. Maybe I will take one to the store next time I want to buy a Mountain Dew 12-pack.
Resources:
Wikipedia UPC Article
Barcode Calculator Tool
How To Block the CUIL Spider Bot
Some people have landed on my previous rant about CUIL after searching Google for “block cuil bot”. I realize that article does not answer that question (unless you click the link to read madstatter’s robots.txt file). So, here is how you block CUIL’s insane spiders.
CUIL’s spider is called “Twiceler”. Why? I have no idea. To block it from your site, add the following lines to your site’s robots.txt file:
User-agent: twiceler
Disallow: /
That’s it!
For more information about robots.txt files, there is some great information at robotstxt.org.
Beware WordPress SQL Attack!
If you run a WordPress blog on your server (or are an admin for one), you need to read this.
I was watching my webserver log scroll by (yes, I do that a lot), and I witnessed yet another attempted SQL attack. This time it wasn’t trying to inject anything onto my server like last time. This time it was trying to get the admin password hash stored in the database. Yikes!
Here is what the request looked like:
GET /stuff/index.php?cat=999 UNION SELECT null,CONCAT(666,CHAR(58),user_pass,CHAR(58),666,CHAR(58)),null,null,null FROM wp_users where id=1
There were several variations of this request that came in very quick succession. Can you see what it’s doing?
Basically it is trying to dump your admin password hash to the screen. If successful it would look something like:
666:<password-hash-here>:666:
The evil bot/script/whatever would just look for the “666:” surrounding the hash and read it out. Then it would probably look up the hash in a Rainbow Table. If you have a weak password it would be completely compromised.
Luckily, since I had modified my webserver’s .htaccess file after enduring the last SQL attack, this attack got a very nice “HTTP 403: SCREW YOU!” response from my server! I am re-displaying the section of the .htaccess file for your edification (I also added an entry for “outfile” which should never be used in an http request).
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

RewriteCond %{QUERY_STRING} union [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]

RewriteCond %{QUERY_STRING} select [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]

RewriteCond %{QUERY_STRING} jatest [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]

RewriteCond %{QUERY_STRING} http [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]

RewriteCond %{QUERY_STRING} outfile [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]
</IfModule>
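To make the matching logic of those RewriteCond lines easier to see, here is a rough JavaScript equivalent. It is only an illustration, not a replacement for the .htaccess rules: the function name `isBlockedQuery` is made up, and it simply mirrors the case-insensitive substring checks (the [NC] flag) against the query string.

```javascript
// The same words the RewriteCond lines look for in the query string.
const BLOCKED_WORDS = ["union", "select", "jatest", "http", "outfile"];

// Returns true when the raw query string contains any blocked word,
// ignoring case, in the same spirit as the RewriteCond patterns with [NC].
function isBlockedQuery(queryString) {
  const q = queryString.toLowerCase();
  return BLOCKED_WORDS.some((word) => q.includes(word));
}
```

Because these are plain substring matches, a harmless request whose query string happens to contain one of the words (say, a search for "reunion") would be refused as well. That trade-off is worth keeping in mind before copying the rules.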
This should also be a good reminder to always use strong passwords.
How CUIL Lost Me as a Customer Long Before They Launched
Several months ago I noticed a new spider crawling madstatter.com (my baseball statistics site) called “Twiceler”. The link in the user-agent for Twiceler led to this page. Apparently it was for a new unlaunched search engine called “cuil”. Obviously this Twiceler bot was just crawling sites to gather pages before the launch. Great, I was happy to be indexed by a new engine.
Then things went wrong…. very wrong.
For some reason Twiceler was trying to index links such as:
http://www.madstatter.com/06/07/07/07/07/06/06/…/scoring.php
…which of course is a ridiculous address. Twiceler was creating hundreds (if not thousands) of these types of bogus addresses and trying to index them. They all returned 404, of course. I thought surely that this would stop soon enough because no spider could be this goofy.
It continued for days. At this point I decided to look up the contact for Twiceler and send off an informative email to tell them what was happening. I really was just trying to help.
Hi Jim,
I see your Twiceler robot crawling my site (www.madstatter.com) which is all fine and dandy, except that it is creating mal-formed addresses which all return 404 errors.
For example, it is trying to find this page:
http://www.madstatter.com/06/07/07/07/07/06/…/07/07/scoring.php
…which doesn’t exist of course. Perhaps there is something goofy with the logic that parses each page’s links and creates new addresses to crawl? For example, I use a lot of “../” and “./” references in my href links which tend to throw off some url-parsing robots (that is not intentional, it is just the way it is).
I am afraid at this rate your bot may be creating an infinite number of mal-formed addresses to crawl. I do not wish to block your bot, but I don’t want a whole ton of garbage addresses in my log files either.
Thank you for looking into this.
-Chad
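As a rough illustration of the kind of bug suspected in that email (nobody outside Cuil knows what Twiceler actually did), compare resolving a “../” link properly with a naive approach that just glues the link onto the current directory. The page and link addresses below are hypothetical examples, and both function names are made up.

```javascript
// Correct: the standard URL parser collapses "../" against the page address.
function resolveLink(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// Hypothetical buggy crawler logic: strip the leading "../" or "./" and
// append what is left to the directory of the current page.
function naiveResolve(href, pageUrl) {
  const dir = pageUrl.slice(0, pageUrl.lastIndexOf("/") + 1);
  return dir + href.replace(/^(\.\.?\/)+/, "");
}

// A made-up page in /06/ that links to its sibling in /07/.
const page = "http://www.madstatter.com/06/scoring.php";
const href = "../07/scoring.php";

const good = resolveLink(href, page);        // goes up one level, then into /07/
const bad = naiveResolve(href, page);        // wrongly stays inside /06/
const worse = naiveResolve(href, bad);       // each pass nests one level deeper
```

Here `good` is http://www.madstatter.com/07/scoring.php, while `bad` and `worse` are http://www.madstatter.com/06/07/scoring.php and http://www.madstatter.com/06/07/07/scoring.php. A crawler that feeds its own bad output back in would keep growing the path, much like the addresses showing up in the logs.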
The next day I received this rather cold response:
Dear Chad,
Twiceler is the crawler that we are developing for our new search
engine. It is important to us that it obey robots.txt, and that it not
crawl sites that do not wish to be crawled. If you wish I will be
glad to add your site to our list of sites to exclude.

Like all startups, we hope to launch sooner rather later, but exactly
when that will be, I don’t know. Watch our web site (www.cuill.com) for
the announcement.

Recently we have seen a number of crawlers masquerading as Twiceler,
so please check that the IP address of the crawler in question is one of
ours. You can see our IP addresses at http://cuill.com/twiceler/robot.html

You may wish to add a robots.txt file to your site (I notice you don’t
have one). That is the standard mechanism for controlling robot access and
behavior. You can read about it at
http://www.robotstxt.org/wc/exclusion-admin.html
and there a simple generator of the file here
http://www.mcanerin.com/EN/search-engine/robots-txt.asp

Incorrectly formed URLs are usually the result of links we have
picked up from earlier crawls - usually from some other unrelated site
that has a stale or mangled link to yours. We have no way of knowing
their validity until we try to access them.

I apologize for any inconvenience this has caused you and please feel
free to contact me if you have any further questions.

Sincerely,
James Akers
Operations Engineer
Cuill, Inc.
I wasn’t sure if this was a form-response email or what. It did not seem to address the issue I had raised. Furthermore, it basically said “if you’re not happy with our spider crawling your site, just block it.”
In my confusion, I sent back another email:
Hi James,
Thanks for the reply.
I do have a robots.txt file, it is just blank. I do not wish to block any crawlers, so there are no entries in the file.
The IP that is crawling is 38.99.44.103, which is in the list of valid IPs you provide.
Perhaps a solution that will suit both of us is if you could clear out the “pending addresses” in your database for madstatter.com? That way Twiceler can still crawl the site, but it will just have to start from scratch creating the address links over again.
Is that a fair/doable compromise?
Thanks,
-Chad
I received no replies after sending this email. None.
After another week of Twiceler spamming my server (and logs) with insane requests, I decided I had had enough. I finally added the first (and only) entry to madstatter.com/robots.txt to block Twiceler entirely.
I’m sorry, but if you are trying to create “the Google killer” search engine, then you shouldn’t mistreat (or outright ignore) people who are voluntarily trying to help you and your broken spider.
It’s no wonder that Cuil is claiming that they have indexed three times as many pages as Google. They have probably crawled millions of non-existent addresses!
I feel somewhat vindicated in my anger toward Cuil by seeing how overtly negative the reception of their official launch has been. I honestly don’t think Google has anything to worry about here.