The inspiration for this post came from an interesting episode of Sam Charrington’s This Week in Machine Learning & AI podcast, “Benchmarking Custom Computer Vision Services,” featuring Tom Szumowski, a data scientist at URBN (Urban Outfitters’ parent company).
In the interview, Tom describes his team’s efforts to build a machine learning pipeline to automate fashion product attribution tagging. This essentially means taking a product like this dress:
and being able to automatically and accurately generate attributes for it, such as its sleeve length (none), neckline (scoop neck), length (mini dress), pattern (stripes), and color (green, yellow, red).
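One way to picture this task is to treat each attribute as its own classification problem and collect the per-attribute predictions into a single record for the product. Here is a minimal sketch; the attribute names, value lists, and `tag_product` helper are illustrative (mirroring the dress example above), not URBN’s actual pipeline:

```python
# One classifier per attribute; each takes an image and returns a label.
# Attribute names and values are illustrative, mirroring the dress example.
ATTRIBUTE_CLASSES = {
    "sleeve_length": ["none", "short", "three_quarter", "long"],
    "neckline": ["scoop", "v_neck", "crew", "halter"],
    "length": ["mini", "midi", "maxi"],
    "pattern": ["solid", "stripes", "floral", "plaid"],
}

def tag_product(image, classifiers):
    """Run every attribute classifier on the image and collect the
    predictions into one attribute dictionary for the product."""
    return {attr: classify(image) for attr, classify in classifiers.items()}
```

Framing it this way makes the scaling problem obvious: every new attribute means another model to train, evaluate, and maintain.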
Using machine learning, specifically computer vision, to handle this product attribute classification process makes sense for URBN because of the vast, diverse catalog of products they offer. Access to these attributes helps the business across several key initiatives:
- personalization around content and product recommendations
- improving discoverability and search in the user experience
- forecasting / planning for inventory management
The URBN team anticipated that the number of attributes would continue to grow and identified the opportunity costs of an in-house solution (in terms of team time spent), so they were interested in finding a scalable solution for managing the models for all those attributes.
Here’s an overview of the custom vision service vendors that their team evaluated:
Tom Szumowski has a great write-up covering the specifics of the case study, Exploring Custom Vision Services for Automated Fashion Product Attribution (Part I and Part II), and shared the presentation too.
How does this apply to my data science team?
While URBN’s specific use case is interesting, it is the thought process and framework the URBN team used to evaluate these different MLaaS products that proves most broadly useful, and that framework is the focus for the rest of this post.
Framework for Evaluating MLaaS offerings
Here’s a starting framework for evaluating these different MLaaS products. You’ll notice how the attributes of visibility, usability, flexibility, cost, and performance span each portion of the workflow. We’ll dive into each section more deeply below.
This framework and its questions will help you understand whether an MLaaS product is the right approach for your team, from the data you feed into the MLaaS offering to the actual model results you receive. To prevent these systems from becoming expensive black boxes, ask how the evaluation actually occurs and whether the results can be updated.
Perhaps you’re evaluating these MLaaS options against in-house models and approaches, as the URBN team did?
If so, it’s critical to have experiment comparability. Comparability means being consistent about the data, environment, and model you use so that you can make a fair assessment across the different options (after all, even differing library versions can affect performance). Tools like Comet.ml can automatically track the hyperparameters, results, model code, and environment details of these experiments. With visualizations across experiments, your team can directly compare how different models performed and identify the best approach.
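To make that comparability concrete, here is a minimal, framework-agnostic sketch of experiment tracking: each run records its hyperparameters, metrics, and environment details so runs can be compared on equal footing. The `log_experiment` and `best_run` helpers are hypothetical plain-Python stand-ins, not Comet.ml’s API:

```python
import json
import platform
import sys

def log_experiment(name, params, metrics, path):
    """Append one run's hyperparameters, metrics, and environment
    details to a JSON-lines log so runs can be compared fairly."""
    record = {
        "name": name,
        "params": params,
        "metrics": metrics,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def best_run(path, metric="accuracy"):
    """Return the logged run with the highest value for `metric`."""
    with open(path) as f:
        runs = [json.loads(line) for line in f]
    return max(runs, key=lambda run: run["metrics"][metric])
```

Capturing the environment alongside the metrics is the point: a result you can’t reproduce or attribute to a specific setup isn’t a fair comparison.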
Evaluation in Practice
So what were the results of the URBN team’s evaluation? Here’s the table comparing classification accuracy metrics for benchmark datasets like MNIST, Fashion MNIST, CIFAR-10, and the cleaned UO Dresses data:
Their team also put together a great usability comparison table where they compared key capabilities like deployment options, retraining options, APIs, and more.
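If you want to assemble a similar comparison for your own candidates, the underlying computation is simple: per-dataset classification accuracy for each service, laid out as a grid. A minimal sketch follows; the `accuracy` and `comparison_table` helpers are illustrative, not the study’s code, and any numbers you feed in would be your own results:

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    assert len(y_true) == len(y_pred) and y_true
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def comparison_table(results):
    """Format {service: {dataset: accuracy}} as a plain-text grid,
    one row per service and one column per dataset."""
    datasets = sorted({d for scores in results.values() for d in scores})
    lines = ["service".ljust(14) + "".join(d.ljust(16) for d in datasets)]
    for service in sorted(results):
        cells = "".join(
            f"{results[service].get(d, float('nan')):.3f}".ljust(16)
            for d in datasets
        )
        lines.append(service.ljust(14) + cells)
    return "\n".join(lines)
```

Running the same benchmark datasets through every candidate service, as the URBN team did, is what makes a table like this meaningful.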
Thanks for reading! I hope you found this framework and example useful for your data science projects. If you have questions or feedback, I would love to hear them at [email protected] or in the comments below ⬇️⬇️⬇️