FlixPulse has two basic parts: the front-end (website) where data about each movie is displayed, and the back-end where all the magic happens. I'll assume you know how a website works, so let's skip to the fun part.
At the core of FlixPulse is the Tweet Classification Engine (TCE). It is basically a script that runs indefinitely on one of my Linux servers. The TCE performs four basic steps.
Step 3 is also pretty easy, but Step 2 is the interesting bit.
Some people have asked whether classification is based on keywords, that is, whether I classify tweets as Good if they contain words like "awesome", "great", etc. The answer is no. I did consider this method early on, but quickly discovered that there are far too many ways to express positive, negative, and indifferent opinions to catch them all with keywords. A keyword approach would also be wrong much of the time. A phrase such as "that movie was not good at all" would register as a Good review because of the keyword "good" when it is, in fact, a Bad review. So, how do you get a computer to master the complexities of linguistic expression? Well, you can't really, but you can come close... with Bayesian Filters.
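To make the keyword problem concrete, here is a minimal sketch of the approach that was rejected. The keyword lists and the `keyword_classify` function are hypothetical, made up purely for illustration; they are not part of the TCE.

```python
import re

# Hypothetical keyword lists, for illustration only.
POSITIVE = {"awesome", "great", "good"}
NEGATIVE = {"awful", "terrible", "bad"}

def keyword_classify(tweet: str) -> str:
    """Label a tweet by counting keyword hits, ignoring context entirely."""
    words = re.findall(r"[a-z']+", tweet.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "Good"
    if neg > pos:
        return "Bad"
    return "Indifferent"

# The negation "not" is invisible to a keyword counter.
print(keyword_classify("that movie was not good at all"))  # prints "Good"
```

The single hit on "good" outweighs everything else in the sentence, so a clearly negative review comes out as Good.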
Bayesian Filters are the main workhorse for spam filtering in email. They require a large body of input data (or "knowledge") to be able to work. I was not able to implement such a system until I had a large number of classified tweets already in the database. The Mod Squad had done a great job of classifying a huge number of tweets in a short period of time. Using this data, I was able to construct some surprisingly accurate Bayesian filters to handle the onslaught of Dark Knight tweets during opening weekend.
Here is a basic explanation of how Bayesian spam filtering works, which I will then use to explain how TCE Step 2 works.
Bayesian filters are all about probability. They basically ask, "What is the probability that this email is spam, taking into account all the previous spam messages I have seen?" The formula behind them, Bayes' theorem, boils down to this: the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email.
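Written in symbols, that sentence is Bayes' theorem. The letter $W$ is shorthand introduced here for "the email contains those certain words":

$$P(\text{spam} \mid W) = \frac{P(W \mid \text{spam}) \, P(\text{spam})}{P(W)}$$

As a quick check with made-up numbers: if 20% of all email is spam, the words show up in 50% of spam, and they show up in 12.5% of all email, then

$$P(\text{spam} \mid W) = \frac{0.5 \times 0.2}{0.125} = 0.8$$

so an email containing those words would have an 80% chance of being spam.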
Now, in FlixPulse's case, replace the word "Spam" with "Good" and replace the word "email" with "tweet" and you can see how using Bayesian filters maps nicely to the problem of having a computer classify tweets on its own. Rinse and repeat for "Bad", "Indifferent", and "Noise" filters. Why have a "Noise" filter? Tweets such as "going to watch Iron Man with some friends" do not count as a review, and do not need to be marked as Good, Bad, or Indifferent. Now I have a set of four Bayesian filters that can determine the probability that any incoming tweet fits into one of those four categories.
Each filter is constructed using statistical distributions of individual words and phrases from each category of tweets.
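For the curious, here is a rough sketch of what building one such filter might look like. This is not the TCE's actual code: the `CategoryFilter` class, the use of two-word pairs as "phrases", and the add-one smoothing are all assumptions chosen to keep the example short. It is a plain naive Bayes filter, which treats every word and phrase as independent evidence.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a tweet and split it into single words and two-word phrases."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return words + pairs

class CategoryFilter:
    """One filter per category, e.g. Good (hypothetical, not the TCE's code)."""

    def __init__(self, in_category, out_of_category):
        if not in_category or not out_of_category:
            raise ValueError("need example tweets both inside and outside the category")
        # Word and phrase frequencies for tweets in and out of the category.
        self.in_counts = Counter(t for tw in in_category for t in tokenize(tw))
        self.out_counts = Counter(t for tw in out_of_category for t in tokenize(tw))
        self.in_total = sum(self.in_counts.values())
        self.out_total = sum(self.out_counts.values())
        self.vocab = len(set(self.in_counts) | set(self.out_counts))
        # Prior: the share of all classified tweets that fall in this category.
        self.prior = len(in_category) / (len(in_category) + len(out_of_category))

    def score(self, tweet):
        """Estimate P(category | tweet) under the naive independence assumption."""
        log_in = math.log(self.prior)
        log_out = math.log(1.0 - self.prior)
        for tok in tokenize(tweet):
            # Add-one (Laplace) smoothing so unseen tokens never give log(0).
            log_in += math.log((self.in_counts[tok] + 1) / (self.in_total + self.vocab))
            log_out += math.log((self.out_counts[tok] + 1) / (self.out_total + self.vocab))
        diff = log_out - log_in
        if diff > 700:  # avoid overflow in exp(); the probability is effectively 0
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))

# Tiny made-up training set; a real filter needs far more classified tweets.
good = ["that movie was awesome", "great film loved it"]
other = ["that movie was not good at all", "going to watch it with some friends"]
good_filter = CategoryFilter(good, other)

print(good_filter.score("loved it, awesome movie"))  # above 0.5
print(good_filter.score("not good at all"))          # below 0.5
```

Because the phrase "not good" is counted as its own token, the filter can learn that it points away from Good even though the word "good" appears in it.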
As each tweet is received by the TCE, it is given a score by each filter. If one of the scores registers very high and all the others score low, the tweet is automatically marked with that filter's category. If more than one score registers high, it is marked as "Unfiled" and sent off for human moderation.
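The routing rule above can be sketched in a few lines. The `HIGH` and `LOW` cut-offs and the `route` function are hypothetical; the actual thresholds are not stated here. The sketch also assumes that any case the rule does not cover (no score high, or one high score with a middling runner-up) goes to "Unfiled" as well.

```python
HIGH = 0.90  # hypothetical cut-off for a score that "registers very high"
LOW = 0.10   # hypothetical cut-off for a score that counts as "low"

def route(scores):
    """Pick a category from per-filter scores, or fall back to 'Unfiled'."""
    high = [label for label, s in scores.items() if s >= HIGH]
    if len(high) == 1:
        winner = high[0]
        # Auto-classify only when every other filter scores low.
        if all(s <= LOW for label, s in scores.items() if label != winner):
            return winner
    # More than one high score, or (by assumption) any other unclear case.
    return "Unfiled"

print(route({"Good": 0.97, "Bad": 0.02, "Indifferent": 0.03, "Noise": 0.01}))  # prints "Good"
print(route({"Good": 0.95, "Bad": 0.93, "Indifferent": 0.02, "Noise": 0.01}))  # prints "Unfiled"
```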
Every so often, the statistical distributions are automatically recalculated and the filters are re-created. In this way, the filters are self-learning in that they use all of the most recently classified tweets to add to their "knowledge base" for the next set of incoming tweets.
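A simplified picture of that recalculation, assuming the classified tweets are available as (text, category) pairs. The `rebuild_counts` function and the data layout are hypothetical, and the sketch counts single words only; the point is just that the counts are rebuilt from everything classified so far, so newly classified tweets feed into the next round.

```python
from collections import Counter, defaultdict

CATEGORIES = ("Good", "Bad", "Indifferent", "Noise")

def rebuild_counts(classified):
    """Recount word frequencies per category from (tweet, category) pairs."""
    counts = defaultdict(Counter)
    for text, label in classified:
        if label in CATEGORIES:  # skip "Unfiled" tweets still awaiting a human
            counts[label].update(text.lower().split())
    return counts

classified = [("that movie was awesome", "Good")]
counts = rebuild_counts(classified)

# Later, once more tweets have been classified, rebuild from the full set.
classified.append(("awesome soundtrack too", "Good"))
classified.append(("not sure what to think", "Unfiled"))
counts = rebuild_counts(classified)

print(counts["Good"]["awesome"])  # prints 2
```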
So there you have it. Pretty simple, right?