The terms ‘MLOps’ and ‘AIOps’ are appearing more and more. Many from a traditional DevOps background might wonder why this isn’t just called ‘DevOps’. In this article we’ll explain why MLOps is so different from mainstream DevOps and see why it poses new challenges for the industry.
Current State of DevOps vs MLOps
Why So Different?
The driver behind all these differences can be found in what machine learning is and how it is practised. Software performs actions in response to inputs and in this ML and mainstream programming are alike. But the way actions are codified differs greatly.
Traditional software codifies actions as explicit rules. The simplest programming examples tend to be ‘hello world’ programs that simply codify that a program should output ‘hello world’. Control structures can then be added to perform more complex actions in response to inputs. As we add more control structures, we learn more of the programming language. This rule-based input-output pattern is easy to understand in relation to older terminal systems where inputs are all via the keyboard and outputs are almost all text. But it is also true of most of the software we interact with, though the types of inputs and outputs can be very diverse and complex.
ML does not codify rules explicitly. Instead, rules are set indirectly by capturing patterns from data. This makes ML more suitable for a more focused type of problem that can be treated numerically. An example is predicting salary from data points/features such as experience, education and location. This is a regression problem, where the aim is to predict the value of a variable (salary) from the values of other variables by use of previous data. Machine learning is also used for classification problems, where instead of predicting a value for a variable, the model outputs a probability that a data point falls into a particular class. Example classification problems are:
- Given hand-written samples of digits, predict which digit each sample represents.
- Classify images of objects according to category, e.g. types of flowers.
The line is embodied in an equation:
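For a simple linear model, such an equation might take the following form. The symbols are illustrative assumptions rather than the author's own notation: \hat{y} is the predicted value (e.g. salary), each x_i is a feature (e.g. years of experience) and each w_i is a coefficient/weight.

```latex
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n
```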
The coefficients/weights get set to initial values (e.g. at random). The equation can then be used on the training data set to make predictions. In the first run the predictions are likely to be poor. Exactly how poor can be measured by the error, which is the sum of the distances of all the observed values of the output variable (e.g. salary) from the prediction line. We can then update the weights to try to reduce the error and repeat the process of making new predictions and updating the weights. This process is called ‘fitting’ or ‘training’ and the end result is a set of weights that can be used to make predictions.
So the basic picture centres on running training iterations to update weights to progressively improve predictions. This helps to reveal how ML is different from traditional programming. The key points to take away from this from a DevOps perspective are:
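The fitting loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular toolkit implements training: the data is made up, there is a single feature (years of experience), the error is measured as mean squared error rather than a plain sum of distances, and the weights are updated by gradient descent with a hand-picked learning rate.

```python
# Made-up training data: years of experience -> salary in thousands.
# The points lie exactly on the line salary = 30 + 5 * experience.
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 40.0, 45.0, 50.0, 55.0]


def predict(w0, w1, x):
    """Prediction line: intercept w0 plus weight w1 times the feature."""
    return w0 + w1 * x


def mean_squared_error(w0, w1, xs, ys):
    """Average squared distance between predictions and observed values."""
    return sum((predict(w0, w1, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, learning_rate=0.05, iterations=5000):
    """Repeatedly predict, measure the error and nudge the weights."""
    w0, w1 = 0.0, 0.0  # initial weights (could equally be random)
    n = len(xs)
    for _ in range(iterations):
        errors = [predict(w0, w1, x) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to each weight.
        grad_w0 = 2 * sum(errors) / n
        grad_w1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


initial_error = mean_squared_error(0.0, 0.0, experience, salary)
w0, w1 = fit(experience, salary)
final_error = mean_squared_error(w0, w1, experience, salary)
```

The pair `(w0, w1)` returned by `fit` plays the role of the trained model: it is produced by the code and the data together, and different data would give different weights from the same code.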
- The training data and the code together drive fitting.
- The closest thing to an executable is a trained/weighted model. These vary by ML toolkit (TensorFlow, scikit-learn, R, H2O, etc.) and model type.
- Retraining can be necessary. For example, your model might make predictions for data that varies a lot by season, such as how many items of each type of clothing will sell in a month. In that case training on data from summer may give good predictions in summer but will not give good predictions in winter.
- Data volumes can be large and training can take a long time.
- The data scientist’s working process is exploratory and visualisations can be an important part of it.
This leads to different workflows for traditional programming and ML development.
With traditional programming a workflow might be as follows:
- User Story
- Write code
- Submit PR
- Tests run automatically
- Review and merge
- New version builds
- Built executable deployed to environment
- Further tests
- Promote to next environment
- More tests etc.
- Monitor – stacktraces or error codes
Typically the trigger for a build is a code change in Git. The packaging for an executable is often a Docker image.
With machine learning the driver for a build might be a code change. Or it might be new data. The data likely won’t be in Git due to its size. Any tests are not likely to be a simple pass/fail since you’re looking for quantifiable performance. One might choose to express performance criteria numerically by tolerating a certain error level. What is acceptable can vary a lot by business context. For example, consider a model that predicts the likelihood of a financial transaction being fraudulent. In that case there may be little risk in predicting good transactions as fraudulent so long as the customer is not impacted directly (there may be a manual follow-up). But predicting bad transactions as good could be very high risk.
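One way to express such asymmetric tolerances is to check false negatives and false positives against separate limits. The sketch below assumes binary labels where 1 means fraudulent; the function names and the threshold values are hypothetical and would in practice be set by the business.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels where 1 = fraudulent, 0 = good."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def passes_release_check(y_true, y_pred,
                         max_false_negative_rate=0.05,
                         max_false_positive_rate=0.30):
    """Tolerate many false alarms but very few missed frauds.

    Both thresholds are illustrative assumptions, not recommendations.
    """
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    false_negative_rate = fn / (fn + tp) if (fn + tp) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return (false_negative_rate <= max_false_negative_rate
            and false_positive_rate <= max_false_positive_rate)


actual = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
cautious_model = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # two false alarms, no misses
lenient_model = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]   # no false alarms, one miss
```

Here `cautious_model` passes the check despite its false alarms, while `lenient_model` fails because it lets a fraudulent transaction through.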
The ML workflow can also differ depending on whether the model can learn while it is being used (online learning) or if the training takes place separately from making live predictions (offline learning). For simplicity let’s assume the training takes place separately. In that case a high-level workflow could look like:
- Data inputs and outputs. Preprocessed. Large.
- Data scientist tries stuff locally with a slice of data.
- Data scientist tries with more data as long-running experiments.
- Collaboration – often in Jupyter notebooks & Git
- Model may be pickled/serialized
- Integrate into a running app e.g. add REST API (serving)
- Integration test with app.
- Rollout and monitor performance metrics
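The pickling/serializing step in the workflow above can be illustrated with Python's standard `pickle` module. This is only a sketch: the "model" here is a plain dictionary of made-up weights, whereas real toolkits usually serialize a model object, and the file name is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: just the learned weights (made-up values).
trained_model = {"model_type": "linear_regression", "intercept": 30.0, "slope": 5.0}


def save_model(model, path):
    """Serialize the model to a file so a serving process can load it."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model from disk. Only unpickle files from trusted sources."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict(model, experience_years):
    return model["intercept"] + model["slope"] * experience_years


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    save_model(trained_model, model_path)
    restored_model = load_model(model_path)
```

The restored object behaves like the original, which is what lets training and serving run as separate processes. Unpickling can execute arbitrary code, so model files should be treated with the same care as executables.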
As suggested already, monitoring performance metrics can be particularly challenging and may involve business decisions. For example, let’s say we have a model being used in an online store and we’ve produced a new version. In these cases it is common to check the performance of the new version by performing an A/B test. This means that a percentage of live traffic is given to the existing model (A) and a percentage to the new model (B). Let’s say that over the period of the A/B test we find that B leads to more conversions/purchases. But what if it also correlates with more negative reviews or more users leaving the site entirely or is just slower to respond to requests? A business decision may be needed.
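For the conversions part of such a comparison, a common statistical check is a two-proportion z-test. The sketch below uses made-up traffic numbers and only the standard library. It covers one metric only, so it cannot settle the wider trade-offs (reviews, latency, users leaving) mentioned above.

```python
import math


def two_proportion_z(conversions_a, visitors_a, conversions_b, visitors_b):
    """z statistic for the difference in conversion rate between B and A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)
    )
    return (rate_b - rate_a) / standard_error


def two_sided_p_value(z):
    """Probability of a difference at least this large if A and B were equal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up results: model A converts 200 of 5000 visitors, model B 260 of 5000.
z = two_proportion_z(200, 5000, 260, 5000)
p_value = two_sided_p_value(z)
```

A small `p_value` suggests the difference in conversions is unlikely to be noise, but whether to roll out B remains a decision that weighs this against the other signals.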
The role of MLOps is to support this whole flow of training, serving, rollout and monitoring. Let’s better understand the differences from mainstream DevOps by looking at some MLOps practices and tools for each stage of this flow.
Some training platforms can also be used for Continuous Integration. For example, a training run could be triggered on a commit to Git and the resulting model could be pushed from the job so that it is available to make live predictions. As noted before, deciding whether a model is good for live use can involve a complex mixture of factors. It might be that the main factors can be tested adequately at the training stage (e.g. model accuracy on test data). Or it might be that only initial checks are done at the training stage and the new version is only cautiously rolled out for live predictions. We’ll look at rollout and monitoring later – first we should understand what live predictions can mean.
Live Predictions and Model Serving
For some models there may be predictions to be made on a file of data points or a new file each week. This kind of scenario would be offline predictions. In other cases predictions need to be made on demand. For live use-cases typically the model is made available to respond to HTTP requests. This is called serving.
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: sklearn
spec:
  name: iris
  predictors:
  - graph:
      children: []
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
The ‘SeldonDeployment’ is a Kubernetes custom resource. Within that resource it needs to be specified which toolkit was used to build the model (here scikit-learn) and where to obtain the model (in this case a Google Cloud Storage bucket). Some serving solutions also cater for the model to be baked into a Docker image, but Python pickles are common as a convenient option for data scientists. Submitting this resource to Kubernetes will make an HTTP endpoint available that can be called to get predictions. Often the serving solution will automatically apply any routing/gateway configuration needed, so that data scientists don’t have to do so manually.
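To make the idea of serving concrete without depending on any particular product, here is a minimal sketch using only Python's standard library. It is not how Seldon or any other serving solution is implemented. The weights, the JSON field names (`experience`, `salary`) and the port are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained weights; a real server would load a serialized model.
WEIGHTS = {"intercept": 30.0, "slope": 5.0}


def handle_predict(body):
    """Turn a JSON request body into (status_code, response_bytes)."""
    try:
        payload = json.loads(body)
        experience = float(payload["experience"])
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a numeric 'experience'"}
        return 400, json.dumps(error).encode("utf-8")
    prediction = WEIGHTS["intercept"] + WEIGHTS["slope"] * experience
    return 200, json.dumps({"salary": prediction}).encode("utf-8")


class PredictionHandler(BaseHTTPRequestHandler):
    """Answers POST requests with a prediction for the posted features."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def run_server(port=8000):
    """Block forever, serving predictions on localhost (port is arbitrary)."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Calling `run_server()` starts the endpoint. Keeping the prediction logic in `handle_predict` separates it from the HTTP plumbing, which is roughly the part a serving solution takes off the data scientist's hands.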
Self-service for data scientists can also be important for rollout. This can need careful handling because the model has been trained on a particular slice of data and that data might turn out to differ from live. The key strategies used to reduce the risk of this are:
1) Canary rollouts
With a canary rollout a percentage of the live traffic is routed to the new model while most of the traffic goes to the existing version. This is run for a short period of time as a check before switching all traffic to the new model.
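2) A/B Test
With an A/B test a percentage of live traffic is given to the existing model (A) and a percentage to the new model (B), and the two are compared on their performance metrics, as in the online store example earlier.
3) Shadowing
With shadowing all traffic is sent to both the existing and new versions of the model. Only the predictions of the existing/live version are returned as responses to live requests. The non-live model’s predictions are not returned and are instead just tracked to see how well it is performing.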
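The difference between splitting traffic and shadowing it can be sketched as two small routing functions. In practice this logic usually lives in a gateway or the serving layer rather than in application code; the function names, the 10% fraction and the in-memory log are assumptions made for illustration.

```python
import random


def canary_route(live_model, candidate_model, features,
                 canary_fraction=0.1, rng=random):
    """Send roughly `canary_fraction` of requests to the candidate model.

    Returns which model answered along with its prediction.
    """
    if rng.random() < canary_fraction:
        return "candidate", candidate_model(features)
    return "live", live_model(features)


def shadow_route(live_model, candidate_model, features, shadow_log):
    """Send the request to both models but only return the live prediction."""
    live_prediction = live_model(features)
    entry = {"features": features, "live": live_prediction}
    try:
        entry["shadow"] = candidate_model(features)
    except Exception as exc:  # a failing shadow must not affect live traffic
        entry["shadow_error"] = repr(exc)
    shadow_log.append(entry)
    return live_prediction


def live_model(features):
    return features["x"] * 2


def candidate_model(features):
    return features["x"] * 3


shadow_log = []
response = shadow_route(live_model, candidate_model, {"x": 5}, shadow_log)
```

With the canary route some users do see the candidate's predictions, whereas with the shadow route none do; the candidate's outputs end up only in `shadow_log` for later comparison.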
Deciding between different versions of a model naturally requires monitoring.
With mainstream web apps it is common to monitor requests to pick up on any HTTP error codes or an increase in latency. With machine learning the monitoring often needs to go much deeper into domain-specific metrics. For example, for a model making recommendations on a website it can be important to track metrics such as how often a customer makes a purchase vs chooses not to make a purchase or goes to another page vs leaves the site.
It can also be important to monitor the data points in the requests to see whether they are approximately in line with the data that the model was trained on. If a particular data point is radically different from any in the training set then the quality of prediction for that data point could be poor. Such a data point is termed an ‘outlier’, and in cases where poor predictions carry high risk it can be valuable to monitor for outliers. If a large number of data points differ radically from the training data then the model risks giving poor predictions across the board. This shift in the input data is usually termed ‘data drift’; the related term ‘concept drift’ refers to the relationship between the inputs and the correct output changing over time. Monitoring for these is fairly advanced as the boundaries for outliers likely need to be set algorithmically by a data scientist.
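As a rough illustration of setting outlier boundaries from the training data, the sketch below flags any request whose features lie more than a chosen number of standard deviations from the training mean. This z-score rule is a deliberately simple stand-in for the more sophisticated detectors a data scientist would likely choose. The feature values and the threshold are made up.

```python
import statistics


def fit_feature_stats(training_rows):
    """Compute (mean, standard deviation) per feature column."""
    columns = list(zip(*training_rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def is_outlier(row, feature_stats, z_threshold=3.0):
    """Flag a row if any feature is far from what was seen in training.

    Features with zero spread in training are skipped in this sketch.
    """
    for value, (mean, stdev) in zip(row, feature_stats):
        if stdev > 0 and abs(value - mean) / stdev > z_threshold:
            return True
    return False


def outlier_fraction(rows, feature_stats, z_threshold=3.0):
    """Share of rows flagged; a persistently high value may hint at drift."""
    flagged = sum(1 for row in rows if is_outlier(row, feature_stats, z_threshold))
    return flagged / len(rows)


# Made-up training data: (years of experience, age).
training_rows = [(1, 22), (3, 25), (5, 30), (7, 35), (9, 41)]
stats = fit_feature_stats(training_rows)

live_requests = [(4, 28), (40, 30), (6, 33), (2, 90)]
fraction = outlier_fraction(live_requests, stats)
```

Two of the four live requests are flagged here: one with an implausible 40 years of experience and one with an age of 90, neither of which resembles anything in the training rows.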
For metrics that can be monitored in real-time it may be sufficient to expose dashboards with a tool such as Grafana. However, sometimes the information that reveals whether a prediction was good or not is only available much later. For example, there may be a customer account opening process that flags a customer as risky. This could lead to a human investigation and only later will it be decided whether the customer was risky or not. For this reason it can be important to log the entire request and the prediction and also store the final decision. Then offline analysis run over a longer period can provide a wider view of how well the model is performing.
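The pattern of logging now and evaluating later can be sketched as follows. A dictionary stands in for a durable store, and the field names and model version string are hypothetical. The key point is that each prediction gets an identifier so the eventual decision can be joined back to it.

```python
import uuid


def log_prediction(log, request, prediction, model_version):
    """Store the full request and prediction; the outcome is not known yet."""
    request_id = str(uuid.uuid4())
    log[request_id] = {
        "request": request,
        "prediction": prediction,
        "model_version": model_version,
        "outcome": None,
    }
    return request_id


def record_outcome(log, request_id, outcome):
    """Attach the final decision once the investigation has concluded."""
    log[request_id]["outcome"] = outcome


def accuracy_on_resolved(log):
    """Share of correct predictions among entries with a known outcome."""
    resolved = [entry for entry in log.values() if entry["outcome"] is not None]
    if not resolved:
        return None
    correct = sum(1 for e in resolved if e["prediction"] == e["outcome"])
    return correct / len(resolved)


prediction_log = {}  # in-memory stand-in for a database or log store
first = log_prediction(prediction_log, {"customer": "a"}, "risky", "v1")
second = log_prediction(prediction_log, {"customer": "b"}, "risky", "v1")
third = log_prediction(prediction_log, {"customer": "c"}, "not risky", "v1")

# Weeks later, two of the three investigations have finished.
record_outcome(prediction_log, first, "risky")
record_outcome(prediction_log, second, "not risky")
resolved_accuracy = accuracy_on_resolved(prediction_log)
```

Because each entry also records the model version, the same offline analysis can compare versions over a longer period than a live dashboard would show.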
If something goes wrong with running software at a given point in time then we need to be able to recreate the circumstances of the failure. With mainstream applications this typically means tracking which code version was running in the form of an executable (Docker image), which code commit it tracks back to and some information about the data state of the system at the time. That enables a developer to recreate that execution path in the source code. Taken to its fullest, the equivalent for machine learning would be much more extensive. It would involve knowing exactly what data was sent in (full request logging), which version of the model was running (not necessarily a Docker image; likely a Python pickle, but it could be various formats), what source code was used to build it, what parameters were set on the training run and what data was used for training. The data part can be particularly challenging as this means retaining the data from every training run that goes to live and in a form that can be used to recreate models, so any transformations on the data would need to be tracked and reproducible.
There are also wider governance challenges for ML concerning bias and ethics. Without care, models might end up being trained using data points that a human would consider unethical to use in decision-making. For instance, a loan approval system might be trained on historic loan repayment data. Without a conscious decision about which data points are to be used, it might end up making decisions based on race or gender.
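One possible shape for such a record is sketched below. It ties a model artefact to the training data, code version, parameters and transformations that produced it by storing content hashes alongside the other details. The field names, the commit identifier and the example values are hypothetical, and a real system would also need to retain the data itself rather than just its hash.

```python
import hashlib


def sha256_of_bytes(data):
    """Content hash used to identify an exact artefact or dataset."""
    return hashlib.sha256(data).hexdigest()


def build_lineage_record(model_bytes, training_data_bytes, git_commit,
                         hyperparameters, preprocessing_steps):
    """Collect what would be needed to trace a live model back to its origins."""
    return {
        "model_sha256": sha256_of_bytes(model_bytes),
        "training_data_sha256": sha256_of_bytes(training_data_bytes),
        "git_commit": git_commit,
        "hyperparameters": dict(hyperparameters),
        "preprocessing_steps": list(preprocessing_steps),
    }


# All values below are placeholders for illustration.
record = build_lineage_record(
    model_bytes=b"serialized model contents",
    training_data_bytes=b"experience,salary\n1,35\n2,40\n",
    git_commit="abc1234",
    hyperparameters={"learning_rate": 0.05, "iterations": 5000},
    preprocessing_steps=["drop rows with missing salary", "scale experience"],
)
```

Storing a hash makes it cheap to confirm later that a retained dataset or model file is exactly the one that was used, since any change to the bytes yields a different hash.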