The Problem Statement
I decided to collect Twitter data and use it to work out what people were talking about. This was a three-step process:
- Collect the data (tweets) from Twitter regarding the elections.
- Run a program on these tweets to figure out what they are about (that is, what people are saying through them).
- Draw conclusions from the results of step 2.
Let’s go through each step in detail.
Collecting the Data
I used the following script to fetch the tweets in live mode (i.e. fetching them as they are posted).
For the above code to work, you will need tweepy, which you can install with pip:
pip install tweepy
The code uses the Twitter stream to fetch tweets as they are posted. Running it fetches a lot of tweets, which we can store for later analysis.
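Storing the tweets can be as simple as appending each one to a text file. Here is a minimal sketch of that idea, not the exact code I ran. It assumes each raw message from the stream arrives as a JSON string with a "text" field; the function name save_tweet_text and the one-tweet-per-line file layout are my own choices for illustration.

```python
import json


def save_tweet_text(raw_message, path):
    """Append the text of one streamed tweet to a file; return True if saved."""
    try:
        payload = json.loads(raw_message)
    except json.JSONDecodeError:
        return False  # skip malformed messages
    if not isinstance(payload, dict):
        return False
    text = payload.get("text")  # assumed field name for the tweet body
    if not text:
        return False  # messages without a tweet body are ignored
    # Collapse newlines so that each tweet occupies exactly one line.
    one_line = " ".join(text.split())
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(one_line + "\n")
    return True
```

A function like this could be called from whatever callback receives the raw stream data, so the collection script only has to pass each message along.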
In a similar way, we can fetch tweets about Donald Trump as well, using twitterStream.filter(track=["Trump"])
Analysis of the data collected
Let’s start the actual part with a little detour –
Some of you might be wondering what exactly Opinion Mining/Sentiment Analysis is. I will try to answer that first.
According to Wikipedia –
Opinion mining (sometimes known as sentiment analysis or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
In my own words –
Opinion Mining means extracting information from a text and determining what sentiment that set of words or sentences conveys, i.e. whether it is positive, negative, or just plain neutral. It can be applied in e-commerce, e.g. to Amazon product reviews, to automatically work out how many positive and negative reviews a product has received, as well as how strong they are.
From Step 1, I had gathered a lot of tweets (read: thousands of them) on Hillary Clinton and Donald Trump, and that data was just sitting on my system waiting to be analysed.
So, what next?
I needed to create a Data Science Model/Classifier to which I could feed these tweets and get a result. So the next target was to create such a model. This was the tricky part.
Disclaimer: I am not a Data Scientist by any means. This was just a hobby project that I continued after my project dissertation in my university days, so I used some naive ways to get to the conclusion.
Creation of Data Science Model –
- Gather some training data from a reliable source, in the format (sentence -> sentiment), e.g. (“You made delicious food today” -> “positive”) / (“The roads are in really bad shape” -> “negative”).
- Clean the data (stop word removal, stemming, etc.)
- Create classifier(s)/model(s) on the cleaned training data.
I gathered the data online, and part of it is shown below:
the rock is destined to be the 21st century’s new “ conan “ and that he’s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
the gorgeously elaborate continuation of “ the lord of the rings “ trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson’s expanded vision of j . r . r . tolkien’s middle-earth .
effective but too-tepid biopic
if you sometimes like to go to the movies to have fun , wasabi is a good place to start .
emerges as something rare , an issue movie that’s so honest and keenly observed that it doesn’t feel like one .
the film provides some great insight into the neurotic mindset of all comics — even those who have reached the absolute top of the game .
offers that rare combination of entertainment and education .
perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .
steers turns in a snappy screenplay that curls at the edges ; it’s so clever you want to hate it . but he somehow pulls it off .
take care of my cat offers a refreshingly different slice of asian cinema .
simplistic , silly and tedious .
it’s so laddish and juvenile , only teenage boys could possibly find it funny .
exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .
[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation .
a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification .
the story is also as unoriginal as they come , already having been recycled more times than i’d care to count .
about the only thing to give the movie points for is bravado — to take an entirely stale concept and push it through the audience’s meat grinder one more time .
not so much farcical as sour .
unfortunately the story and the actors are served with a hack script .
all the more disquieting for its relatively gore-free allusions to the serial murders , but it falls down in its attempts to humanize its subject .
a sentimental mess that never rings true .
while the performances are often engaging , this loose collection of largely improvised numbers would probably have worked better as a one-hour tv documentary .
interesting , but not compelling .
on a cutting room floor somewhere lies . . . footage that might have made no such thing a trenchant , ironic cultural satire instead of a frustrating misfire .
I then cleaned the data using the usual stop word removal process and stemming (Porter Stemmer).
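To give an idea of what this cleaning step does, here is a rough sketch. It is not the code I used: the stop word list is a tiny illustrative one, and crude_stem is a simple suffix stripper standing in for a real Porter Stemmer, so its output will differ from a proper stemmer's.

```python
import re

# Tiny illustrative stop word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "but", "in", "is", "it", "of", "that", "the", "to"}

# Suffixes removed by the crude stand-in stemmer, checked in this order.
SUFFIXES = ("ing", "ly", "ed", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for a real Porter stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(sentence):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(clean("The roads are in really bad shape"))
# ['road', 'real', 'bad', 'shape']
```

The point is only that each sentence ends up as a short list of normalized words, which is what the classifier is trained on.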
Now I was faced with the challenge of choosing the best classifier for the job. Since I was not that proficient in the field, I chose what any naive Data Science enthusiast would choose: my knight in shining armour, the Naive Bayes classifier.
I went with this classifier, created a model from the training data I had, and then ran that model on my tweets. The result was not what I expected: there were a lot of accuracy issues that I could clearly see. What could I do better?
Again, I am just a Data Science enthusiast, but I remembered a technique I had learnt from a course on pythonprogramming.net, where the instructor (in a similar course on Twitter Sentiment Analysis) suggests using more than one classifier and then deciding by a vote among all of them. So that’s what I did. I made 7 classifiers instead of just 1 and stored them (pickled them, in Python lingo) to be used for run-time analysis of the tweets.
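For readers who have not met Naive Bayes before, here is a small from-scratch sketch of the idea: count how often each word appears under each label, then pick the label that makes the words of a new sentence most probable. This is an illustration written for this post, not the classifier I actually trained. The class name and the four training sentences are assumptions chosen to keep the example short, and it uses add-one smoothing so unseen words do not zero out a score.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, samples):
        # samples: iterable of (list_of_tokens, label) pairs
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocabulary = set()
        for tokens, label in samples:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set in the (sentence -> sentiment) format.
training = [
    ("you made delicious food today".split(), "positive"),
    ("offers that rare combination of entertainment and education".split(), "positive"),
    ("the roads are in really bad shape".split(), "negative"),
    ("simplistic silly and tedious".split(), "negative"),
]
model = NaiveBayes().fit(training)
print(model.predict("delicious food".split()))         # positive
print(model.predict("bad and tedious roads".split()))  # negative
```

With only a handful of training sentences this is obviously a toy; the accuracy problems I ran into came from applying a model trained on one kind of text to a very different kind.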
With this, the 7 naive classifiers were ready. The next part of the job was to feed each tweet to the 7 classifiers separately and collect their results. A voting algorithm then decides whether the tweet is positive or negative, i.e. if 4 out of 7 classifiers say the tweet is positive and 3 say it is negative, we label the tweet as positive.
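The voting step can be sketched in a few lines. This is an illustration rather than my actual code: the seven keyword-based classifiers below are hypothetical stand-ins for the trained models, and the vote function simply returns the most common label together with the share of classifiers that agreed.

```python
from collections import Counter


def vote(classifiers, tweet):
    """Return (winning_label, agreement) from a majority vote."""
    votes = [classify(tweet) for classify in classifiers]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def keyword_classifier(positive_words):
    """Build a toy classifier (hypothetical stand-in for a trained model)."""
    def classify(text):
        words = set(text.lower().split())
        return "positive" if words & positive_words else "negative"
    return classify


classifiers = [
    keyword_classifier({"great"}),
    keyword_classifier({"great", "win"}),
    keyword_classifier({"win"}),
    keyword_classifier({"rally"}),
    keyword_classifier({"love"}),
    keyword_classifier({"good"}),
    keyword_classifier({"best"}),
]

label, agreement = vote(classifiers, "great rally and a big win")
print(label, round(agreement, 2))  # positive 0.57
```

Using an odd number of classifiers with two possible labels means the vote can never end in a tie, and the agreement figure gives a rough sense of how confident the ensemble is.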
Conclusion and Results
With all this arsenal, I finally ran the live tweets through my classifiers, and they gave me the results (positive/negative), which I stored in a text file. I tried plotting them on a graph using the code below:
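One simple way to plot such results is as a running score that goes up by one for each positive tweet and down by one for each negative tweet. The sketch below shows that approach under a few assumptions: the results file holds one label ("positive" or "negative") per line, matplotlib is installed, and the function names are made up for this example.

```python
def running_score(labels):
    """Turn a sequence of labels into a cumulative sentiment score."""
    score, series = 0, []
    for label in labels:
        label = label.strip().lower()
        if label == "positive":
            score += 1
        elif label == "negative":
            score -= 1
        # Any other line leaves the score unchanged.
        series.append(score)
    return series


def plot_results(path):
    """Plot the running score for a results file (one label per line)."""
    # matplotlib is imported here so running_score works without it.
    import matplotlib.pyplot as plt

    with open(path, encoding="utf-8") as handle:
        series = running_score(handle)
    plt.plot(range(1, len(series) + 1), series)
    plt.xlabel("Tweets processed")
    plt.ylabel("Cumulative sentiment (positive minus negative)")
    plt.show()


print(running_score(["positive", "positive", "negative", "positive"]))
# [1, 2, 1, 2]
```

A line that trends upward means positive tweets outnumber negative ones over time, and a downward trend means the opposite.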
Running the above code to plot the results, I got the following –