Sentiment is a huge driving factor in the cryptocurrency market, but it is a metric that is very hard to measure. Sentiment analysis has been on the rise for the past few years, and with new packages it can be done more quickly and efficiently than ever. In this post, you’ll see why measuring the mood on social media is not a great way to do sentiment analysis, and how people’s interest in topics changes over time.
Sentiment Analysis Using Social Media
Predicting market sentiment from social media is tricky. The analysis found that when crypto prices rise, people share positive messages on their social media accounts, and when prices fall, they share negative ones.
VADER is a great tool for simplifying sentiment analysis. Using it through the NLTK Python package, we ran a mood analysis on Reddit. The script used was a Jupyter Notebook.
The script pulls data from Reddit’s API and assigns a compound score to each post. This score estimates the mood of the post and lies between -1 (very negative) and 1 (very positive). Given below is the distribution of sentiment scores:
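At its core, VADER’s compound score is a sum of per-word valence ratings squashed into the [-1, 1] range. A toy sketch of that idea (the four-word lexicon here is invented for illustration; the real VADER lexicon has roughly 7,500 rated words, and `alpha = 15` matches VADER’s default normalization constant):

```python
import math

# Tiny made-up valence lexicon; the real VADER lexicon is far larger.
LEXICON = {"great": 3.1, "good": 1.9, "crash": -2.5, "scam": -2.9}

def compound_score(text, alpha=15):
    """Sum word valences, then squash into [-1, 1] as VADER does."""
    total = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    return total / math.sqrt(total * total + alpha)

print(compound_score("bitcoin is great"))    # positive
print(compound_score("another crash scam"))  # negative
```

The real library also handles negation, punctuation emphasis, and emoticons, which this sketch skips entirely.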
The continuous spectrum was split into three classes: positive, negative, and neutral posts. That meant deciding which score ranges map to each label, i.e. choosing an appropriate classification boundary. To do this, a logistic regression classifier was trained on a manually labelled subset of the data. After labelling a few hundred posts, a few things became clear:
- Most of the posts were neutral, or posts by confused crypto beginners.
- Sentiment and underlying meaning are different things, and they are hard to separate when you are labelling posts as a human.
The results are shown below. The computer-assigned labels and the manual labels clearly do not agree.
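The simplest baseline for mapping the continuous compound score to the three labels is a pair of fixed cut-offs. The ±0.05 boundaries below are the commonly cited VADER defaults, not values from this study, which fitted the boundary with logistic regression instead:

```python
def label(compound, pos_cut=0.05, neg_cut=-0.05):
    """Map a compound score in [-1, 1] to a discrete sentiment label."""
    if compound >= pos_cut:
        return "positive"
    if compound <= neg_cut:
        return "negative"
    return "neutral"

scores = [0.7, -0.4, 0.01]
print([label(s) for s in scores])  # ['positive', 'negative', 'neutral']
```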
It was difficult to separate context from sentiment when labelling the headlines. The problem is that people bet on both sides of cryptocurrencies: a happy event for you can be a sad event for someone else. Consider the two statements below:
- “Bitcoin market crash leaves behind a graveyard of startups”
- “Bitcoin crash is good news for investors”
Both statements are about the same topic, but their sentiment is opposite. This is why mood alone is a poor basis for predicting cryptocurrency prices.
So, what is another way to estimate cryptocurrency prices? The number of crypto-related posts increases in times of intense interest in virtual currencies. Does an increase in volume lead to an increase in price? Are the two linked at all?
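One simple way to probe that link is to correlate daily post counts against daily prices. A self-contained sketch using the Pearson correlation coefficient (the post counts and prices below are made-up numbers for illustration, not data from the study):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical daily Reddit post counts and BTC closing prices
posts = [120, 150, 400, 380, 90, 110]
prices = [6100, 6300, 7200, 7000, 5900, 6000]
print(pearson(posts, prices))  # close to 1: volume tracks price here
```

A value near 1 would suggest post volume and price move together, though correlation alone says nothing about which one drives the other.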
To get a better idea, it helps to look at the volume of posts on specific topics, since different topics affect crypto prices differently. Posts about vulnerabilities in crypto would push the price down, while posts about, say, Bitcoin’s strong security would push it up.
Posts on social media platforms like Reddit, Mamby, and Twitter are very short and do not make a good corpus for sentiment analysis. There is a News API that aggregates crypto articles from all the main news outlets. These articles come with both body text and headlines, which allow for better analysis, and using the API saves the hassle of a lot of content scraping.
Now, the question is how to classify these articles into different topics. This can be done with Latent Dirichlet Allocation (LDA).
The basic idea behind LDA is to learn topics from the words found in the text, and then figure out which topics each document is about.
Here are the stages of LDA:
- Create a distribution for every topic that captures your understanding of the words it contains, and a distribution for every document that captures your understanding of the topics it contains.
- Allocate topics to each document by drawing them randomly from the initial distribution, and do the same for words.
- Optimize, settling on the most natural grouping of words into topics and topics into documents.
LDA is unsupervised: you cannot know in advance which words a topic will contain or which topics a document will cover. To keep things simple, a uniform distribution was specified as the prior. It is also hard to know how many topics a corpus contains, so the number was a guesstimate.
The LDA model was fitted on approximately 48,000 news articles with 27 topics. Below are some of the topics it discovered:
Topic 4 is clearly about ICOs, but not all of the topics can be interpreted this way. The goal was to predict price from these topics, yet since the topics could not be explained easily, price prediction with plain LDA was difficult.
A lot of crypto news breaks every day, but it clusters around a few main themes. To estimate price well, it is better to decide on those themes up front rather than picking a random number of topics. With known themes, it is easier to tell the algorithm which words should fall into which category, so there is no need for a uniform prior.
We can specify words that are likely to appear in particular topics; for example, the Bitcoin topic is likely to contain words like BTC, Satoshi, etc.
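Building such a seeded prior can be as simple as up-weighting each seed word in its topic’s word distribution before training, instead of starting from a uniform prior. A minimal stand-alone sketch (the vocabulary, seed lists, and the `base`/`boost` values are illustrative assumptions, not the study’s actual settings):

```python
def seeded_prior(vocab, seed_words, n_topics, base=0.01, boost=1.0):
    """Build an (n_topics x len(vocab)) matrix of Dirichlet prior
    weights, with seed words up-weighted in their assigned topic."""
    prior = [[base] * len(vocab) for _ in range(n_topics)]
    index = {w: i for i, w in enumerate(vocab)}
    for topic, words in seed_words.items():
        for w in words:
            if w in index:
                prior[topic][index[w]] += boost
    return prior

vocab = ["btc", "satoshi", "ico", "token", "wallet"]
seeds = {0: ["btc", "satoshi"], 1: ["ico", "token"]}
prior = seeded_prior(vocab, seeds, n_topics=2)
print(prior[0])  # btc and satoshi up-weighted in topic 0
```

A matrix like this can then be handed to an LDA implementation that accepts a per-topic word prior, nudging the model toward the human-chosen themes while still letting it learn the rest of each topic from the data.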
Here are some initial visualisations using these topics.
There is perhaps some correlation with price. After the mood analysis on social media and the unsupervised topic classification, LDA made it possible to produce a fair prediction of prices. The study’s conclusion is that mood analysis of social media is not as effective as the LDA approach.