Creating visualizations to better understand your data and models (Part 2)

Decision Boundaries

When you train a classifier on a dataset, the algorithm partitions the feature space into regions, one for each class. The surfaces where the predicted class switches from one to another are called decision boundaries. On one side of a decision boundary, a data point is more likely to be labeled as one class; on the other side, it's more likely to be labeled as another.

Boundaries are fuzzy, but they illustrate where key ‘decision points’ are made by the model.
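A minimal sketch of the idea, using hypothetical data (two synthetic clusters, not from the article): a linear classifier trained on two 2-D clusters predicts a different class on each side of the line separating them.

```python
# Sketch: two well-separated 2-D clusters and a linear classifier.
# The fitted model's decision boundary lies between the clusters,
# so points on opposite sides get different predicted labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, scale=0.5, size=(50, 2)),
               rng.normal(loc=+2, scale=0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# Query one point deep in each cluster: the predicted class flips
# as we cross the boundary.
left, right = clf.predict([[-2.0, -2.0], [2.0, 2.0]])
print(left, right)
```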

This visualization compares 10 algorithms on three two-dimensional datasets with different intrinsic properties. Taken from scikit-learn.org.

Importantly, decision boundaries are not confined to just the data points you provided — they span through the entire feature space you trained on. The model can predict a value for any possible combination of inputs in your feature space. If the data you train on is not ‘diverse’, the overall topology of the model (decision boundaries and classification regions) will generalize poorly to new instances.

This is important to know for models you put into production, or try to reuse on unrelated datasets. There is nothing inherent to a machine learning model that will warn you if the model is not appropriate for another dataset, and nothing that will tell you 'this data point is very different from the ones I learned on.'
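To make the point concrete, here is a small sketch (my own illustration, not from the article): a K-nearest neighbor model fit on the iris dataset will happily assign a confident class label to an input far outside anything it trained on, with no warning that the input is novel.

```python
# Sketch: a fitted model classifies a point far outside its training
# distribution without complaint. Nothing flags the input as novel.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# An input nothing like any iris measurement. The model still
# returns a class with full confidence rather than "I don't know".
far_point = [[1000.0, 1000.0, 1000.0, 1000.0]]
pred = clf.predict(far_point)
proba = clf.predict_proba(far_point).max()
print(pred[0], proba)
```

Detecting such out-of-distribution inputs requires separate machinery (e.g. distance checks against the training data); the classifier itself provides none.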

Understanding the limitations of existing models and the decision boundaries they learned is helpful for repurposing and reapplication, especially in instances where retraining or transfer learning are not possible.

Selecting Models

Training a classifier requires data and an algorithm. Choosing an algorithm is an iterative and often experimental process. Rarely do I pick the algorithm that will perform best on a particular dataset on my first try.

So why is that? Why is there no ‘one model to rule them all’? Can’t we just throw a neural net at every problem?

The “No Free Lunch Theorem” states that search and optimization algorithms with excellent performance for one class of problems will not excel at others. In other words, there is no universally-useful algorithm across all data. Selecting the right approach takes intuition, an understanding of the data and goals of the analysis, practice, and time.
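A hedged sketch of what that experimentation looks like in practice (my own example, with arbitrary model choices): cross-validated accuracy for three scikit-learn classifiers on one dataset. Per the No Free Lunch Theorem, re-running this on a different dataset can change the ordering.

```python
# Sketch: cross-validate several classifiers on one dataset.
# Which model tops the ranking depends on the data, so this kind
# of comparison has to be repeated per problem.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_wine(return_X_y=True)
models = {
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validation accuracy per model.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```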

Average ranking of supervised learning algorithms (scikit-learn) over 165 supervised classification datasets from the Penn Machine Learning Benchmark (PMLB). Gradient boosting is excellent, but is outperformed for many datasets. This analysis also does not take generalizability or interpretability of the model into account. Image taken from the excellent Olson, et al., available on PubMed.

Examining decision boundaries is a great way to learn how the training data you select affects performance and your model's ability to generalize, especially if you're someone who learns tactilely. Visualizing decision boundaries can illustrate how sensitive models are to each dataset. And it's a great way to build intuition for how specific algorithms work, and for their limitations on specific datasets.

Decision Boundary Plots in Bokeh

In Part 1, I discussed using Bokeh to generate interactive PCA reports. Here I’ll discuss how to use Bokeh to generate decision boundary plots.

Training a K-nearest neighbor classifier (K=3) on the first two principal components of the iris dataset. Each class has its own color, with corresponding decision boundaries. Outlined data points are test data, with the outline color showing the model's prediction. Where the outline color differs from the fill, the model misclassified that data point.
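The data pipeline behind a figure like this can be sketched as follows (variable names and split parameters are my own, not necessarily those of the original code): project iris onto its first two principal components, train K-nearest neighbors with K=3, and identify the misclassified test points.

```python
# Sketch of the pipeline: PCA to 2-D, train/test split, KNN (K=3),
# then locate the test points the model gets wrong.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X2, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Test points whose prediction disagrees with the true label;
# these are the points drawn with mismatched outlines in the plot.
misclassified = X_test[y_pred != y_test]
print(f"accuracy: {clf.score(X_test, y_test):.2f}, "
      f"misclassified: {len(misclassified)}")
```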

My goals for this visualization tool were three-fold. Given a model and a dataset:

  • I want to see which data points were used for training, to understand if the distribution was appropriate
  • I want to see which data points were used for testing, and ‘where’ the classifier is challenged with accurate predictions
  • I want to see the decision boundaries for each class, to understand if/how the model will be useful for prediction for future, unseen data points

In addition, the code should be as generalizable as possible: it should accept any scikit-learn classifier and any dataset with any number of classes. Note that a current limitation of the approach is that it only works on two-dimensional data, so transforming the data (with e.g. PCA) is necessary. In the future, I may explore visualizing multi-dimensional decision boundaries in two or three human-interpretable dimensions.

Below, I’ll walk through key components of the visualization.
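The core component of any decision boundary plot is evaluating the trained classifier over a dense grid spanning the feature space. A sketch of that step (grid resolution and padding are my own choices, not necessarily the original code's); the resulting class grid can then be drawn in Bokeh with the `image` glyph as a color-mapped background behind the scatter points.

```python
# Key component: predict a class for every cell of a dense grid
# covering the 2-D feature space. Decision boundaries are where
# adjacent grid cells disagree.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)
clf = KNeighborsClassifier(n_neighbors=3).fit(X2, y)

# Grid with some padding around the data's extent.
pad = 0.5
xs = np.linspace(X2[:, 0].min() - pad, X2[:, 0].max() + pad, 200)
ys = np.linspace(X2[:, 1].min() - pad, X2[:, 1].max() + pad, 200)
xx, yy = np.meshgrid(xs, ys)

# Flatten the grid, predict, and reshape back to 2-D.
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Z is a (200, 200) array of class labels, ready to render as an
# image underneath the training and test scatter points.
```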
