There are many great resources for learning data science and machine learning out there, but the one thing that might be missing is a live accounting of a non-technical individual learning these skills. I use the term “outsider” in the title because I don’t feel like I have the typical background that most people do on Kaggle. I am not a machine learning expert, mathematician, or expert computer programmer. I have experience in finance and law, not computer science or statistics.
My goal in documenting my Kaggle progress is to:
- Hold myself accountable
- Share my experiences
This is by no means a perfect explanation of how to enter and win a Kaggle competition. This is my journey, and hopefully I will get immensely better at both Kaggle competitions and writing.
WHERE TO BEGIN?
One of the toughest things about this was just committing to doing it. Learning these skills seems like an overwhelming task when new information is coming out every day. So I began looking for good resources that would give me a framework for where to start.
I have a little experience teaching myself Python over the last few years, and I decided working with real data in a competition would be the best way to learn. Approaching this, I completed most of the learning courses on Kaggle, some MOOCs, and I dabbled in a couple competitions (not live).
The main takeaway for me came from an article describing the iterative process its author used in competitions. It laid out which steps in a competition produce the best model and output, and it recommended going back through the public kernels after your submission and writing down any successful approaches so they can be added to your repertoire for future competitions.
One important point I want to make is the reason for implementing other Kagglers’ best solutions. We do not want to simply copy the best kernel in the competition and submit it to try to move up the leaderboard. We want to read the best Kaggle solutions and try to understand them so we can implement those solutions in future competitions. The point is not to medal in competitions, but to learn how to use these tools.
After reading the article I decided I would try to implement my framework while also learning from the MOOC “How to Win a Data Science Competition” on Coursera. This way I could learn new skills while also completing a real competition.
ENTERING THE COMPETITION
Predict Future Sales
I began my Kaggle journey by entering the playground contest “Predict Future Sales” because it is the competition that goes with the “How to Win a Data Science Competition” MOOC on Coursera. I did not register for the course; I simply audited it.
The first thing we need to do is get an understanding of the competition. To do this we read the information provided by the competition creators. This should give us an idea of the data and goals of the competition. I recommend competitions with smaller sets of data for beginners, as larger sets will be more difficult to control and possibly require more compute power than you have available. Next, we look at the rules of the competition, which outline how we are to compete. This will include the timeline, whether teams are allowed, and anything else the creators want to specify.
The main steps I followed in the competition were:
- Exploratory Data Analysis (EDA)
- Modeling and Initial Submission
- Feature Engineering
The first step in my Kaggle competition is the Exploratory Data Analysis (EDA). This is exactly what it sounds like: an analysis that explores the data so we can learn more about the datasets we’re using.
Most of my time was spent on learning about the dataset and fixing small mistakes I had made when preprocessing, cleaning, and engineering features. This process is focused on the small details. If you can get those right, you can save yourself significant time. The best way to do this is to explore the data after any changes.
The data is in tabular form (similar to an Excel workbook), which means it has clearly identifiable rows and columns. The code train.describe() (train is what I am using as the name of the training dataset; you can choose to name your dataset anything you want) gives us an output describing the data.
Calling .shape on the train and test datasets shows their dimensions, with rows on the left and columns on the right. So train has 2,935,849 rows and 6 columns.
We can also use the head() command to get an idea of what the values in the dataset look like.
We can view which features (columns) are numerical and which are categorical (words like the shop names), and get a feel for what kind of data we are dealing with.
These are just a few of the functions I use to understand the data. More can be found in the documentation for the pandas library, the Python library used for data structures and tables.
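As a rough sketch of those calls (the file paths and the variable names train and test are simply the ones I use, and the folder name assumes the competition's standard Kaggle dataset):

```python
import pandas as pd

# Load the competition files (paths assume the usual Kaggle input folder layout)
train = pd.read_csv("../input/competitive-data-science-predict-future-sales/sales_train.csv")
test = pd.read_csv("../input/competitive-data-science-predict-future-sales/test.csv")

print(train.shape, test.shape)   # (rows, columns) of each dataset
print(train.head())              # the first five rows, to see what the values look like
print(train.describe())          # summary statistics for the numerical columns
train.info()                     # column names, data types, and non-null counts
```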
After exploring the data a little, I started to figure out the time series format. The data was given to us in days and we needed to group it into months for the submission. We do this with the sum function and then fill in the missing months with zero, as there were zero sales of the items in those months.
I made two plots at this stage, both depicting the number of items sold in each month. Graphs and plots like these allow us to see overarching trends in the data.
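A sketch of that grouping and one of the plots, assuming the competition's column names (date_block_num is the month index and item_cnt_day is the daily sales count):

```python
import matplotlib.pyplot as plt

# Sum the daily sales into monthly totals for each shop/item pair.
# (Shop/item pairs with no sales in a month still need to be filled with zero later.)
monthly = (
    train.groupby(["date_block_num", "shop_id", "item_id"])["item_cnt_day"]
         .sum()
         .rename("item_cnt_month")
         .reset_index()
)

# Total items sold across all shops in each month, to see the overall trend
monthly.groupby("date_block_num")["item_cnt_month"].sum().plot(
    kind="line", title="Items sold per month"
)
plt.xlabel("Month (0 = January 2013)")
plt.ylabel("Items sold")
plt.show()
```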
B. Univariate and Bivariate Data Analysis
There are two ways to explore the data. The first is the univariate analysis, which uses visualizations and statistics to explore a single feature in the dataset. This is a useful initial evaluation. I explored each feature individually and made a short summary in my notebook. This allowed me to have a basic understanding of each feature and how it might impact the target feature.
Bivariate data analysis can be more in-depth and revealing. A bivariate analysis looks at two or more features at once, so we can see their correlations and which features impact others.
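For example, a quick bivariate look at how the numerical columns of the training data move together (just a sketch, nothing fancy):

```python
import matplotlib.pyplot as plt

# Correlation matrix of the numerical columns: a quick bivariate overview
numeric_cols = train.select_dtypes(include="number")
print(numeric_cols.corr())

# A scatter plot shows how two features relate to each other
train.plot.scatter(x="item_price", y="item_cnt_day")
plt.show()
```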
C. Preprocessing and Cleaning
The first thing to check for is missing values. We use the function below to determine how many missing values are in each column.
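This is a standard pandas idiom (train is the training DataFrame from earlier):

```python
# Number of missing values in each column of the training data
print(train.isnull().sum())
```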
Once we have found the missing values we can decide what to do with them. We can drop the column if there are too many, or we can fill them with zero or another number such as the median value of that feature.
Preprocessing also consists of finding any outliers in the data, such as negative numbers or extremely large numbers. We drop these values because they can have an outsized effect on the model.
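For illustration, something like the following; the exact cut-off values here are assumptions of mine, not the ones from my notebook:

```python
# Drop obviously bad rows: negative prices are data errors, and the upper
# bounds remove a handful of extreme outliers (thresholds are illustrative)
train = train[(train["item_price"] > 0) & (train["item_price"] < 100000)]
train = train[(train["item_cnt_day"] >= 0) & (train["item_cnt_day"] < 1000)]
```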
MODELING and INITIAL SUBMISSION
Even after completing most of my EDA I still struggled to understand how the data was structured and how it could be used to predict the next month’s sales. Time series data is exactly what it sounds like: data recorded at successive points in time.
In a classic classification challenge we are given data to train on, such as passengers on the Titanic and whether they lived or died, and then a completely different set of passengers as test data for which we must predict survival. In a time series competition the same item appears in both the training and testing datasets, but at different periods in time. In this competition, the training data holds items and the stores they were sold in. We are then given the same items and stores in the test data, but at a later date, and we must predict how many are sold at this later time.
I pivoted the training data so that each row is a shop and item pair. The first two columns list the shop and item id, and the next thirty-four columns are the consecutive months from January 2013 to October 2015, each holding the number of that item sold in that month. I was now able to visualize what the time series data looked like, and I decided to run it through a model.
I ran the pivot table through a simple model to determine the feature importance. Training on the pivot table showed me which months were most predictive (the most recent ones), because those months were my features. Doing this helped me to understand the features and how the models worked.
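A hedged sketch of that step: build the pivot table (one row per shop/item pair, one column per month, missing months filled with zero) and fit a simple tree-based model on it. I use a random forest here purely for illustration, and treat the last month in the table as the target.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One row per shop/item pair, one column per month (0-33), zeros where nothing sold
pivot = train.pivot_table(
    index=["shop_id", "item_id"],
    columns="date_block_num",
    values="item_cnt_day",
    aggfunc="sum",
    fill_value=0,
).reset_index()

# Use the earlier months as features and the final month as the target
X = pivot.drop(columns=[33])
y = pivot[33]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Which columns does the model lean on most? (The most recent months.)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```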
A quick aside: after you commit your code in Kaggle, it shows a message summarizing the changes. Two columns show the difference between this version and the last version you committed. The far-right column (in red) is the number of lines of code you deleted or changed, and the column to its left (in green) is the number of lines of code you added. For example, in my latest version I deleted (or changed) two lines of code and added nine lines of code.
After my initial model, we return to the feature engineering step to add new predictive features. The data is back in its original format and not in the pivot table.
We can add extra features (columns) to the dataset. This process is called feature engineering, and it is when we use existing features in the dataset to create new ones.
After looking at our feature importance and model performance on the pivot table, we could see that different months of sales had different predictive value. I decided a few moving average features would be a good start. I made a previous month feature, a column holding the number of sales in the previous month, and then created 6-month and 12-month moving averages of sales for each item.
Next, I did some simple feature engineering by creating columns with the maximum number of each item sold, and the maximum price the item sold at.
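A sketch of those features on the monthly table (monthly is the grouped table from the EDA section; the new column names are ones I made up for this example):

```python
# Sort so that shift/rolling operate in chronological order within each shop/item pair
monthly = monthly.sort_values(["shop_id", "item_id", "date_block_num"]).reset_index(drop=True)

# Previous month's sales (lag-1 feature) for each shop/item pair
monthly["sales_prev_month"] = (
    monthly.groupby(["shop_id", "item_id"])["item_cnt_month"].shift(1)
)

# 6- and 12-month moving averages of past sales for each shop/item pair
for window in (6, 12):
    monthly[f"sales_ma_{window}"] = (
        monthly.groupby(["shop_id", "item_id"])["item_cnt_month"]
               .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )

# Maximum monthly sales of each item, and the maximum price it ever sold at
monthly["item_max_cnt"] = monthly.groupby("item_id")["item_cnt_month"].transform("max")
monthly["item_max_price"] = monthly["item_id"].map(train.groupby("item_id")["item_price"].max())
```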
Another feature I created was a mean encoding. In simple terms, if item #10 was sold twice this month and zero times next month, then the mean encoded feature for this item would be 1 [(2 + 0) / 2], meaning the average number of this item sold per month is one. If we make a feature with these values, we can produce a more accurate prediction.
Not So Fast!
We cannot forget about our validation strategy
When we build a model we need to validate it. Validation involves splitting the training dataset into two parts. One set will be the data we use to train our model and the other part will be used to test the accuracy of our model. The validation set acts as our test set, and allows us to test our models before running them on our test values.
We need to do this split before we mean encode, or there will be data leakage.
Data leakage is when values from the validation set leak into our training data and cause our model to perform better than it would in the real world. A simple example of data leakage would be mean encoding all our data including the validation dataset.
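To make that concrete, here is a sketch of mean encoding the item id without leakage: the means are computed on the training rows only and then mapped onto the validation rows (the month-33 split here is just an example).

```python
# Hold out the last month of the training data as a validation set
train_part = monthly[monthly["date_block_num"] < 33].copy()
valid_part = monthly[monthly["date_block_num"] == 33].copy()

# Mean encode item_id using ONLY the training rows...
item_means = train_part.groupby("item_id")["item_cnt_month"].mean()

# ...then map those means onto both splits; items unseen in training fall back
# to the global average, so the validation set never leaks into the encoding
train_part["item_mean_enc"] = train_part["item_id"].map(item_means)
valid_part["item_mean_enc"] = valid_part["item_id"].map(item_means).fillna(item_means.mean())
```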
After we finish feature engineering we are ready to begin building our models. The modeling portion of data science is probably the most talked about portion. However, in reality this is where you will spend the least amount of your time. And even the time spent here will be mostly spent waiting for a model to train. Most of the work will be done in exploring, cleaning, and processing the data, and feature engineering.
The main models I focused on and learned about in the beginning were Decision Trees, Random Forests, Linear/Logistic Regression, K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naïve Bayes Classifier, and Gradient Boosting models such as XGBoost and CatBoost.
We have our training data split into train and validation, so for each set we need to prepare our data for the model we are going to use. To prepare it I like to separate the train data into two sections: the input (X) and the output (y). These two parts of the data are used to train the model. We take the train data and make the input by dropping the target column.
The target column itself becomes the output (y). Now, when we train the model, it knows the input is the feature data we are given and the output is the target value we are trying to predict. We do the exact same thing for the validation set we have created. It will usually look something like this:
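(This sketch assumes the target column is named item_cnt_month and reuses the train_part/valid_part split from the mean encoding example.)

```python
# Input (X): every column except the target; output (y): the target column itself
X_train = train_part.drop(columns=["item_cnt_month"])
y_train = train_part["item_cnt_month"]

# Exactly the same separation for the validation set
X_valid = valid_part.drop(columns=["item_cnt_month"])
y_valid = valid_part["item_cnt_month"]
```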
Once we have done this, we train the model on the training set data (X & y). Then we try to find the target value of the validation set using the model we recently trained. So we input only the X portion of the validation data into the model. We then take the output the model gives us and compare it to the true target value for the validation data (y).
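A minimal version of that train-and-validate loop, assuming one of the gradient boosting models mentioned above (XGBoost) and RMSE, the metric this competition is scored on:

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

model = XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)

# Predict the validation targets, then compare them to the true values
valid_preds = model.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, valid_preds))
print(f"Validation RMSE: {rmse:.4f}")
```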
There is another way to get an even higher score in a competition. It is called stacking models or the ensemble method.
While one model may have the best score, we can often get an even better score by incorporating all of the models.
One public notebook I learned from creates a simple ensemble model by running a linear regression on all of the predictions made by the author's other models.
This seemed like the best way to get introduced to model ensembles. It is relatively simple, though it did not have any meaningful impact on my score. It’s not something I am too worried about as a beginner, but it is a very powerful tool that I hope to master in the future.
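As a rough sketch of the idea (not the notebook's actual code; model_a, model_b, and X_test are placeholders for models and data you would already have): the "meta" model is a linear regression trained on the base models' validation predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Predictions of two already-trained base models on the validation set
stack_valid = np.column_stack([
    model_a.predict(X_valid),
    model_b.predict(X_valid),
])

# The meta-model learns how much weight to give each base model
meta_model = LinearRegression()
meta_model.fit(stack_valid, y_valid)

# Combine the base models' test-set predictions the same way
stack_test = np.column_stack([
    model_a.predict(X_test),
    model_b.predict(X_test),
])
final_preds = meta_model.predict(stack_test)
```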
DATA SIZE PROBLEMS
During the commit process I ran into a few problems in some of my notebooks. The main problem was trouble committing the code. It appeared the data was too large for the Kaggle environment. This may not be a problem if you have your own private environment you run your models in, but for a beginner like me who is doing most of his work on Kaggle, this was a problem.
Larger data types such as int64 and float64, which pandas uses by default when Kaggle loads a dataset into a notebook, take up much more space. If the values in a column are not extremely large, you should be able to use a smaller data type such as int16 or float32.
(If you want to learn more about data size and bytes, there are some good resources on Coursera).
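A sketch of that downcasting idea with pandas (the specific target types are whatever the values allow; check the range of your columns first):

```python
import pandas as pd

def downcast(df):
    """Shrink 64-bit numeric columns to the smallest type that fits the values."""
    for col in df.select_dtypes(include=["int64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df

monthly = downcast(monthly)
print(round(monthly.memory_usage(deep=True).sum() / 1024**2, 1), "MB")
```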
I’m not promising that doing anything in this notebook will help you win a Kaggle competition. My submission scores are about average (50%), which isn’t great, but also isn’t horrible for my first real attempt. What I am hoping to do is to share my process in an understandable way. Hopefully this will inspire others to dive in and feed their curiosity.