The business card says CTO, the heart says “Ops guy” – I love building purpose-driven infrastructure at scale.
The quote “You build it, you run it” is among the most influential in modern software development. It has guided the concept of high-performing software teams since its inception. Pick any successful software-driven business of today: its mentality will, at the very least, be influenced by this sentiment.
Its origins go all the way back to 2006, when the CTO of Amazon, Werner Vogels, gave a seminal interview:
“Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.”
This quote applies directly to Machine Learning in production, and a healthy philosophy for high-performing Machine Learning teams can be derived from it. In short: you train it, you run it.
That means successful ML teams own Machine Learning from data to production. Let’s dissect this statement into smaller chunks.
Successful Teams …
The composition of successful teams is a research topic all on its own. I won’t dive into psychological character types, or their assessment, or how they influence teams. My argument comes from a more objective angle: The skill requirements in a successful team.
Any Machine Learning project requires 5 key skills to be successful:
- Understanding of input data
- Writing reasonably good code
- Organization/coordination of experiments
- Solid understanding of Machine Learning
- Understanding of the business domain
Please note that this is not necessarily a 1-to-1 relation to team size. Your team might have multiple ML experts, each with a supplementary skill, but not a single Software Engineer. Vice versa, your project might not have a single trained ML expert, but solid Software Engineers with a good grasp of the business domain and input data.
But eventually, for a project to reach success (e.g. ROI), your skill profiles will converge on this list as a common denominator.
Own Machine Learning from data to production.
Skillset alone, unfortunately, is not enough. The most gifted boxer will not stand a chance against a well-trained, but less prodigious opponent. However, when combining skills and training, true champions are born.
When applying this analogy to our scenario at hand, we can derive that teams need to own their projects from the input data available to them all the way to the later business application of their models in production.
This begins with structuring experimentation in predictable flows. The 12 success factors for ML in production apply. If you’re unfamiliar, they are essentially:
- Guide teams towards version control of code
- Enforce reproducibility through tracking
- Encourage fast iterations
- Focus on deployable results
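The four factors above can be made concrete with a minimal, tool-agnostic sketch. In practice a team would reach for a tracker such as MLflow or Weights & Biases; the names below (`ExperimentRun`, `run_id`) are illustrative assumptions, not any real library’s API.

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class ExperimentRun:
    code_version: str                 # version control: e.g. a git commit SHA
    params: dict                      # reproducibility: tracked hyperparameters
    metrics: dict = field(default_factory=dict)
    artifact_path: str = ""           # deployable result: where the model lands

    @property
    def run_id(self) -> str:
        # Deterministic ID derived from code version + params, so the exact
        # same configuration always maps to the same run.
        payload = json.dumps(
            {"code": self.code_version, "params": self.params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

run = ExperimentRun(code_version="a1b2c3d", params={"lr": 0.01, "epochs": 5})
run.metrics["accuracy"] = 0.93
run.artifact_path = f"artifacts/{run.run_id}/model.pkl"
print(run.run_id, run.artifact_path)
```

Because the ID is a pure function of code version and parameters, fast iterations stay cheap: re-running an identical configuration is immediately recognizable as a duplicate.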
These factors might sound daunting to ML practitioners without a background in Software Engineering. More experienced programmers will be very familiar with these concepts — they’ve proven their value over decades.
But why also own “production”?
Now, this is the crux of this article. Why own your projects all the way to production? Why not just hand over a model in an S3 bucket? Surely someone else would be better suited to tackle all the serving concerns? Well, no, not really.
A trained model does not provide value on its own. When applied to data, to solve a problem, that’s where value is created. This is what we call “production”. Therefore, if the value of a model is measured in production, the team responsible for the model also needs to be responsible for production.
And for production, two facts will always be true:
- Production is always subject to continuous improvement and/or feature expansion. Production will break.
- Naturally, the team that created a model is best-suited to expand its capabilities — and fix its problems. Data will change, and that’ll surface in production. Models might degrade — also in production. Accuracy might need improvements overall — in production. The theme should be clear.
When a team “owns” production, everyone in said team needs to understand the production environment — at least to a degree. Given the complexity of modern production infrastructure, that can be a challenge.
Similarly, a team needs to be able to discover which versions of their models are currently running. Which model is “live”, where does traffic go, and how can they examine performance?
And, ultimately, a team needs to be able to deploy new versions independently. Either to fix existing problems or to improve value generation.
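The “what is live right now?” question can be sketched as a tiny registry view listing versions, their traffic share, and a recent metric. The data and field names here are invented for illustration; a real setup would query a model registry or the serving backend itself.

```python
# Two live versions of a hypothetical model, with a canary taking 10% of traffic.
live_deployments = [
    {"model": "churn-model", "version": "v14", "traffic": 0.90, "p95_ms": 41},
    {"model": "churn-model", "version": "v15", "traffic": 0.10, "p95_ms": 38},
]

def describe_live(deployments: list) -> list:
    # One line per live version: who gets traffic, and how it performs.
    return [
        f"{d['model']} {d['version']}: {d['traffic']:.0%} traffic, p95={d['p95_ms']}ms"
        for d in sorted(deployments, key=lambda d: -d["traffic"])
    ]

for line in describe_live(live_deployments):
    print(line)
```

A view like this is the minimum a team needs to deploy independently: before shipping a new version, they can see exactly what it replaces and how traffic is split.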
This is where it gets practical: how? Production scenarios come in all shapes and forms. Yours might be a pre-existing eCommerce operation, with microservices on large Kubernetes clusters and a sizeable DevOps team. Or it might be an idea in a proverbial garage you’ve been working on with your classmate. Each poses its own set of additional challenges, with very different resources at hand.
Nonetheless, both also require the same basics.
From training to serving in one pipeline
Adding serving as an afterthought is prone to fail, and a model you can’t serve can’t generate value. At a minimum, you’ll need to automate your training all the way to producing a serve-able artifact. In an ideal world, every training run even produces a deployment you can examine in a real-world scenario.
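What “one pipeline from training to serving” can mean in code is sketched below, assuming invented names (`train_pipeline`, `ServableModel`) and a trivial one-parameter model. The point is that preprocessing and learned weights leave the pipeline together, as a single loadable artifact.

```python
import pickle

def preprocess(x: float) -> float:
    # Stand-in for real feature engineering; must be identical at serving time.
    return (x - 5.0) / 2.0

class ServableModel:
    """Bundle of preprocessing + learned parameters: one artifact to deploy."""
    def __init__(self, weight: float):
        self.weight = weight

    def predict(self, x: float) -> float:
        return preprocess(x) * self.weight

def train_pipeline(data: list) -> bytes:
    # "Training": fit a single weight by least squares on preprocessed inputs.
    num = sum(preprocess(x) * y for x, y in data)
    den = sum(preprocess(x) ** 2 for x, _ in data)
    model = ServableModel(weight=num / den)
    # The pipeline's final step always serializes a complete, loadable artifact.
    return pickle.dumps(model)

artifact = train_pipeline([(1.0, -2.0), (9.0, 2.0)])
served = pickle.loads(artifact)  # what a serving backend would load
print(served.predict(9.0))
```

Because serving loads the same bundle the pipeline emitted, there is no separate hand-written preprocessing step that can silently drift from training.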
Link training pipelines and deployments
Just deploying a model is not enough. Teams need to be able to discover on their own which pipeline yielded which deployment. A bidirectional link between a deployment and its training pipeline is necessary. Only then can root causes be identified and the required improvements made.
Architectures achieving both will follow similar patterns:
- Ensuring serve-able artifacts at the end of each pipeline requires a strong link to, if not full inclusion of, all preprocessing methods.
- Creating an automated deployment requires integration with another backend, e.g. a CI pipeline that builds a Docker container and deploys it, or an ML-specific serving backend like Seldon, Cortex, or Ray Serve.
- Linking training and deployment makes immutable versioning of pipelines mandatory, e.g. via unique execution IDs. These IDs then need to be propagated forward all the way to the deployment, with a clear bidirectional link for easy discovery.
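The execution-ID propagation can be sketched as a small in-memory registry. In practice the forward link would live in deployment labels or annotations (e.g. on Kubernetes) and the backward link in your pipeline tracker’s metadata; all names here are illustrative assumptions.

```python
import uuid

# In-memory stand-ins for a pipeline tracker and a deployment backend.
pipeline_runs = {}
deployments = {}

def run_pipeline(model_name: str) -> str:
    execution_id = uuid.uuid4().hex  # immutable, unique per run
    pipeline_runs[execution_id] = {"model": model_name, "deployment": None}
    return execution_id

def deploy(execution_id: str) -> str:
    deployment_id = f"deploy-{execution_id[:8]}"
    # Forward link: the deployment records which run produced it.
    deployments[deployment_id] = {"execution_id": execution_id}
    # Backward link: the run records where it is currently serving.
    pipeline_runs[execution_id]["deployment"] = deployment_id
    return deployment_id

run_id = run_pipeline("churn-model")
dep_id = deploy(run_id)
# From a live deployment, the originating run is one lookup away, and vice versa.
print(deployments[dep_id]["execution_id"] == run_id)
```

With both links in place, an on-call engineer can go from a misbehaving deployment straight to the pipeline run, its parameters, and its data, without asking around.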
Discoverable deployments are built on either:
- Close embedding of deployment mechanisms into everyday tooling. That way, one software stack manages all (or most) aspects of an ML project, and all efforts can be discovered from one place.
- Continuous education of all involved team members on the tooling in use. Familiarity with the infrastructure in place is a key requirement for operational awareness, efficient use of the available infrastructure, and a short time-to-recovery for bugs and incidents.
None of these patterns come for free; they require buy-in from the involved stakeholders. But fear not, the argument is clear-cut: achieving ownership all the way into production reduces the looming threat of undiscoverable technical debt throughout the project, and trades a small upfront investment of engineering time for vastly increased speed of innovation on all projects down the line.
If you’d like a head start for you and your project, meet my company ZenML. It’s built on a philosophy of easy-to-use integrations with the best ML tools. Check out our tutorials and examples on integrating ZenML training pipelines with an ever-growing number of backends for serving, training, preprocessing, and more — they provide an easy, guided path to achieving data-to-production ownership for your team: https://github.com/maiot-io/zenml