fast.ai in production. Real-world text classification with ULMFiT.

I’ve overcome my skepticism about fast.ai for production and trained a text classification system with ULMFiT for a non-English language, with a small dataset and lots of classes.

About the project

My friend and classmate, one of the founders of RocketBank (a leading online-only bank in Russia), asked me to develop a classifier to help the first line of customer support.

The initial set of constraints was pretty restrictive:

  • no labeled historical data
  • obfuscated personally identifiable information
  • “we want it yesterday”
  • mostly Russian language (hard to find a pre-trained model)
  • no access to cloud solutions due to privacy regulations
  • ability to retrain the model if new classes arise without my involvement

The scope of work was pretty straightforward: develop a model and a serving solution for classifying incoming messages into 25 classes.

Initially, after thinking about the restrictions, I was pretty sure that no neural networks should be used for this task. Why?

  • We would never label enough data for a neural model in a reasonable time.
  • Building an environment for reliable serving of a neural model is kind of a pain.
  • I was skeptical about reaching the requested performance (requests per second) with reasonable resources.

Based on that, I made a sad face and created a new conda environment. I went classic.

Dataset collection and initial research.

RocketBank had set up a task force consisting of a project manager and a DevOps engineer on their side, plus a bunch of people handling dataset labeling. That was extremely smart and helpful: in my opinion, this is a perfect team for handling a data science project in the industrial world.

We started with analyzing historical data and came to a number of conclusions:

  • To train the system, we take into account only the messages received by the bank before any response from customer support.
  • There are 2 distinct meta-classes of incoming messages: those coming from existing customers and those from new leads. Feeding the meta-class to the classifier as an extra input should give it useful context and boost classification scores.
  • The bank decided on 25 distinct classes of messages, ranging from ‘Credit request’ to ‘Harassment’.
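The first two conclusions can be sketched as a small data-preparation step. This is only an illustration: the conversation format (a list of `(sender, text)` pairs), the sender labels and the meta-class names are assumptions, since the real export format is not described here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical sender labels; the real data export may look different.
CUSTOMER = "customer"
SUPPORT = "support"


def first_customer_turn(conversation: List[Tuple[str, str]]) -> Optional[str]:
    """Join the customer's messages that arrive before the first support reply."""
    parts = []
    for sender, text in conversation:
        if sender == SUPPORT:
            break  # everything after the first reply is ignored
        if sender == CUSTOMER and text.strip():
            parts.append(text.strip())
    return " ".join(parts) if parts else None


def build_example(
    conversation: List[Tuple[str, str]], is_existing_customer: bool
) -> Optional[Dict[str, str]]:
    """Build one record to label: the message text plus its meta-class."""
    text = first_customer_turn(conversation)
    if text is None:
        return None  # the conversation was opened by support, nothing to classify
    meta = "existing_customer" if is_existing_customer else "new_lead"
    return {"text": text, "meta_class": meta}
```

For a conversation such as `[("customer", "Hi"), ("customer", "I lost my card"), ("support", "Hello")]`, only the two customer messages end up in the record.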

I requested around 25,000 unlabeled historical messages, and in just a few days the task force was able to classify around 1,500 of them into 25 classes. Initially, I assumed that this number (1.5k) was too low to even try any neural model (I was wrong).


I will fast forward through the non-interesting part. I decided to test various flavors of TF-IDF and embeddings, and to optimize the machine learning model using TPOT and DEAP.

TPOT and DEAP, for those unaware, are two secret weapons in a data scientist's arsenal that make model search CPU-intensive and hands-free.

TPOT runs the whole stack of machine learning methods available in sklearn, plus a few extras (XGBoost), and searches for the best pipeline. I played around with various embeddings, fed them into TPOT, and after 24 hours it said that Random Forest performs best for my task (ha-ha, what a surprise!).

Then I needed to find an optimal set of hyperparameters. I always do this using directed evolution strategies with the DEAP library. That actually deserves a separate post.
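To give a rough idea of what an evolutionary hyperparameter search does, here is a minimal sketch in plain Python. It does not use the DEAP API and is not the setup I actually ran: the search space, the function names and the toy scoring function are all assumptions made for illustration. In a real run, `score` would train a model with the given parameters and return a cross-validated metric.

```python
import random
from typing import Callable, Dict, Tuple

Params = Dict[str, int]

# Hypothetical search space for a random forest: inclusive integer bounds.
SPACE = {"n_estimators": (50, 500), "max_depth": (2, 30), "min_samples_leaf": (1, 20)}


def random_params(rng: random.Random) -> Params:
    return {name: rng.randint(lo, hi) for name, (lo, hi) in SPACE.items()}


def mutate(params: Params, rng: random.Random, rate: float = 0.3) -> Params:
    """Nudge each parameter with probability `rate`, staying inside the bounds."""
    child = dict(params)
    for name, (lo, hi) in SPACE.items():
        if rng.random() < rate:
            step = max(1, (hi - lo) // 10)
            child[name] = min(hi, max(lo, child[name] + rng.randint(-step, step)))
    return child


def crossover(a: Params, b: Params, rng: random.Random) -> Params:
    """Uniform crossover: each parameter comes from one of the two parents."""
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in SPACE}


def evolve(score: Callable[[Params], float], generations: int = 20,
           population: int = 12, elite: int = 4, seed: int = 0) -> Tuple[Params, float]:
    """Keep the `elite` best candidates each generation and breed the rest from them."""
    rng = random.Random(seed)
    pop = [random_params(rng) for _ in range(population)]
    for _ in range(generations):
        parents = sorted(pop, key=score, reverse=True)[:elite]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population - elite)
        ]
        pop = parents + children
    best = max(pop, key=score)
    return best, score(best)


def toy_score(p: Params) -> float:
    # Stand-in for a cross-validated metric; peaks at (300, 12, 3).
    return -(abs(p["n_estimators"] - 300) / 450
             + abs(p["max_depth"] - 12) / 28
             + abs(p["min_samples_leaf"] - 3) / 19)
```

Calling `evolve(toy_score)` returns the best parameter set found and its score. Because the elite survives unchanged, the best score never gets worse from one generation to the next.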

Anyway, at the end of the day, I had an optimal set of settings and my precision was around 63%. I think this was close to the maximum I could get from classical methods and a 1.5k dataset. While 63% for 25 classes sounds good from the machine learning perspective, it’s quite bad for real-world usage. So, I decided to take a look at neural nets as a last chance.

fast.ai comes into play.

So, I needed a fast way to check the performance of a neural-based model on the same task. While implementing a model from scratch using TensorFlow was the most viable option, I decided to run a fast test with fast.ai and their recent discovery of ULMFiT. The problem was that I needed a pretrained language model for Russian text, which isn’t available in fast.ai. After looking at the forums, I discovered an ongoing effort to create a set of language models for most languages. There was a thread for the Russian language and a pre-trained model from a Russian Kaggler, Pavel Pleskov, which he used to take second place in a Yandex competition.

From there it was mostly a matter of writing 20 lines of code and a few hours of GPU training time to get to 70% precision. After a few more days of tuning hyperparameters, I got to 80% precision. Some tips:

  • Use FocalLoss as the training objective.
  • Start from a pretrained language model, but fine-tune it on the unlabeled data you have available.
  • Convert text to lowercase and add a token for uppercase; add a special token for pieces of obfuscated data.
  • Put the meta-class token not only at the beginning but also at the end of the message.
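Three of the tips above (the loss, the special tokens and the meta-class token) can be sketched in plain Python. This is not the production code: the token names, the assumption that obfuscated data shows up as runs of asterisks, and the function names are all made up for illustration. The focal loss is the standard per-example form, -(1 - p_t)^gamma * log(p_t), written without a framework so it is easy to read; in training you would use a tensor version of it.

```python
import math
import re
from typing import Sequence

# Illustrative token names; the ones used in the real system are not shown here.
TOK_UP = "xxup"      # marks a word that was written in all caps
TOK_MASK = "xxmask"  # replaces a piece of obfuscated personal data
META_TOKENS = {"existing_customer": "xxexisting", "new_lead": "xxlead"}

# Assumption: obfuscated spans appear as runs of asterisks, e.g. "****".
MASK_RE = re.compile(r"\*{2,}")


def preprocess(text: str, meta_class: str) -> str:
    """Lowercase the text, mark all-caps words and masked spans,
    and wrap the message in the meta-class token on both sides."""
    meta = META_TOKENS[meta_class]
    out = []
    for word in MASK_RE.sub(f" {TOK_MASK} ", text).split():
        if word == TOK_MASK:
            out.append(word)
        elif len(word) > 1 and word.isupper():
            out.extend([TOK_UP, word.lower()])
        else:
            out.append(word.lower())
    return " ".join([meta] + out + [meta])


def focal_loss(probs: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one example, given predicted class probabilities.

    With gamma = 0 this is plain cross-entropy; a larger gamma shrinks the
    loss of examples the model already classifies confidently.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

For example, `preprocess("My CARD **** is lost", "new_lead")` gives `"xxlead my xxup card xxmask is lost xxlead"`. Down-weighting easy examples is useful here because some of the 25 classes are much easier than others.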


Ok, great. Should I convert the model into TensorFlow? Nope, I was lazy and decided to test model performance using native fast.ai + PyTorch + Docker.

After running stress tests in a single-core Docker container, I was surprised to see a response time of less than 300 milliseconds for an average request and no crashes. What else did I need? Nothing. fast.ai showed that it is a perfect solution for fast and precise development of production ML systems.
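A stress test like this boils down to timing many calls to the prediction function. The sketch below is a simplified, single-process version with hypothetical names; `predict` stands for whatever callable wraps the model, and a real test would send requests to the running container instead.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(predict: Callable[[str], object], messages: Sequence[str],
                    warmup: int = 3) -> Dict[str, float]:
    """Time `predict` on every message and report latencies in milliseconds.

    `messages` must not be empty.
    """
    for text in messages[:warmup]:
        predict(text)  # warm-up calls are not timed

    timings_ms = []
    for text in messages:
        start = time.perf_counter()
        predict(text)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "n": len(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def dummy_predict(text: str) -> int:
    # Stand-in for the real model call.
    return len(text) % 25
```

Running `measure_latency(dummy_predict, ["hello"] * 1000)` returns the mean, 95th percentile and maximum latency. Looking at the tail as well as the average matters, because a few slow requests are what support agents actually notice.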


The beauty of fast.ai + transfer learning is that retraining gives pretty predictable results in terms of quality and speed. I’ve shipped a script inside the Docker container that replicates my final training notebook and produces a new model as an asset. I’ve run a few cycles of retraining and cross-validation and obtained highly repeatable results, so this is a simple way to deliver not only a model but a training script as well.

I can’t share the actual code and system configs, but I am ready to answer any questions.

Clap, if you like and clap, if you want more details on DEAP and TPOT.
