This all started when I was asked to speak at an AI FinTech forum in July. It represented a great opportunity for me to talk about Machine Box to a new audience. But there was a problem. I don’t know anything about financial technology!
My first thought was, “Google machine learning use cases in fintech”. So I did. The results were mostly about anomaly detection and fraud prevention. Those are great use cases for machine learning, but they're also well-trodden problems. Given that this was a forum on AI in financial technology, I figured there would already be plenty of talks from experts in anomaly detection.
One of Machine Box’s key use cases is face recognition that needs only a single photo of a face to train on. We already have a couple of customers using Facebox to verify people, so I figured I’d draw a link between that and securing in-person credit card transactions.
Yes, face recognition can be used to secure transactions, and I did end up talking about how you could accomplish this using Facebox and other methods, but I didn’t think it really spoke to the point of simplifying machine learning with better tools, which is what we’re all about.
If you follow my posts, then you know that I frequently use predicting the stock market as a prime example of how not to use machine learning. The stock market is a highly complex, multi-dimensional monstrosity of interdependencies. Not a good use case on which to try machine learning.
But… what if you could predict the stock market with machine learning?
The first step in tackling something like this is to simplify the problem as much as possible. I decided to make it a two-class problem: given some input, the market either goes up or down. And I limited the market to the Dow Jones Industrial Average.
What would make a good input? I decided, somewhat arbitrarily, on news headlines. I would use natural language processing to train a classification model on as many news headlines as I could get for a given day, sorting them into one of two classes: the market went up after the headlines, or the market stayed the same or went down.
Now comes the most difficult part: gathering the data.
Very fortunately, a quick Google search revealed this excellent dataset. It is a giant table of news headlines, labeled with the Dow Jones’ performance that day.
So, 5 minutes into the process, I had a glorious dataset and a plan. Next came the execution.
Because my developer skills are extremely limited, I decided to make life easy on myself and use this tool, which iterates through folders looking for labeled data and then trains Classificationbox automatically. But in order to run it on files and folders, I first had to convert the dataset into many tiny text files containing the text of the headlines, and put them into folders labeled either 0 or 1, indicating whether the Dow Jones moved up or down that day. This is the script I used to perform that task, which is probably unnecessary for real developers.
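For reference, the conversion can be sketched in a few lines of Python. This is a minimal sketch, not the exact script I used; it assumes the dataset is a CSV whose rows look like date, label (0 or 1), then one or more headline columns, so adjust the column indices to match the real file:

```python
import csv
import os

def split_into_folders(csv_path, out_dir):
    """Write each headline to its own text file under out_dir/<label>/.

    Assumes each row is: date, label ("0" or "1"), then headline columns.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for i, row in enumerate(reader):
            label = row[1]        # the day's class: "1" or "0"
            headlines = row[2:]   # remaining columns are the headlines
            folder = os.path.join(out_dir, label)
            os.makedirs(folder, exist_ok=True)
            for j, headline in enumerate(headlines):
                path = os.path.join(folder, f"{i}_{j}.txt")
                with open(path, "w", encoding="utf-8") as out:
                    out.write(headline)
```

The result is exactly the layout the training tool expects: one folder per class, one tiny text file per headline.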
Modifying this script for the dataset took about 10 minutes; running it took less than 30 seconds. The next step was to run textclass on the folder containing all the data. It trains Classificationbox on a random selection of 80% of the data, then validates the model against the remaining 20%.
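The split-and-validate loop that textclass performs can be approximated like this. It is only a sketch: the `predict` function here is a hypothetical stand-in for querying a trained Classificationbox model, and the teaching step is reduced to a comment:

```python
import random

def validate(examples, predict, train_ratio=0.8, seed=42):
    """Shuffle examples, hold out 80% for training, score the rest.

    examples: list of (text, label) pairs.
    predict:  stand-in for a call to a trained Classificationbox model.
    Returns accuracy on the held-out 20%.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train, test = shuffled[:cut], shuffled[cut:]
    # In the real tool, each (text, label) in `train` would be sent to
    # Classificationbox as a teaching example at this point.
    correct = sum(1 for text, label in test if predict(text) == label)
    return correct / len(test)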
The result; 54% accuracy.
Normally, an accuracy that low means your model isn’t useful. You need something like 80% to get to a place where the model starts to make sense for use in the real word. But when I told a room full of financial people that the model only had a 54% accuracy, I expected a chuckle, instead, I got very straight faces. A few seconds later, someone said, somewhat under their breath, “You could sell 4%”.
Perhaps. But my conclusion was that news headlines can’t predict the Dow Jones, at least, with the dataset I had. I highly recommend you give it a try and see what results you get. Maybe I made an error in my script!