Lessons learned on the road to MLOps
Amit Bendor
Posted on July 7, 2022
The start
Hello and welcome to our series of posts on MLOps at Artlist!
The goal of this series is to share our MLOps journey (which is still ongoing) and how we applied our vision and perspective to a real-world production infrastructure.
In the very early days of our department, when I (Amit) had just joined the company as its first data science employee, I had the rare opportunity to stop and think about how to "build things right" this time.
After managing a few data science teams and projects, and seeing both failures and success stories, I felt I had a fairly clear vision of the “values” that should guide us as we build our infrastructure and practices.
It took me about two days of distilling my thoughts into a Notion page, and we were ready to go.
At that time the buzzword “MLOps” was rising in popularity, and we felt it shared many notions with ours - although some of them are vague in terms of implementation.
Core values
A battle of ultimate goals
Before starting with values, we need to understand our main goals. They will be our "guiding star" whenever we have a decision to make.
If I had to choose one superior goal, it would be “business impact”. That's quite high-level, but ultimately we want to bring value to our users and to impact the company’s bottom line.
We can also measure it for every decision we take - for example, which features should we focus on next? Which implementation should we choose for our pipeline tool? We can answer all of those with an estimation of the business impact.
In order to get to this goal - we can say we’d like to be:
Independent - can bring value without relying on the development teams or data engineers (ideally)
Focus on DS - in order to bring our unique value to the company, we need to do mostly data science/research work and not general engineering tasks.
As you can see, these two goals conflict with each other - and this is exactly the fine line each team needs to define. We defined our own guidelines, as you’ll see below.
The modern Data Science team
One of our first decisions was to build the team the “modern way”. By that I mean we decided against the traditional way algorithm teams used to work - handing over some dirty research code and letting a software engineer decipher it and convert it into running production code.
Instead, we decided to strive toward a “Full Cycle Data Science” paradigm, where we take ownership of every activity related to our core data science work.
What does that mean exactly? What are the bounds?
We call it the “Up to the API” approach - all activities from research, to automation using pipelines, to owning the APIs that externalize our predictions.
What’s not included?
- Any operational backend/frontend work
- Data engineering ETLs
Turning “values” into infrastructure
Ok, so now let's move on to the most important subject - values.
You might be asking - is the rest of this series just going to be a bunch of clichés? I promise not. Let’s see how we took every “value” and turned it into actionable decisions and guidelines we use every day.
Enabling creativity
What does it mean?
Inspired by our company, we believe that researchers should “push the boundaries” - and that our MLOps infrastructure should enable it.
We truly believe we should enable maximum flexibility on the research side of our work - because this is where our core value to the company lies.
Decisions
- Research with notebooks/scripts
👉 As long as it supports fast iterations - use Jupyter notebooks, scripts, or a combination of both
- CYOF (Choose your own framework)
👉 Prefer TensorFlow over PyTorch? Not a problem. Use the framework/tools that make the most sense to you
Simplicity
What does it mean?
We already talked about conflicts, so here is another one.
We don’t want to activate our human decision-making process for operations that are outside of our core contribution.
Therefore, we decided to give minimal flexibility on the “engineering side” of our work.
Decisions
- One implementation™
👉 A minimal set of tools and just one implementation for each of our three main component types: pipeline, API, Python package
Example: for APIs, use only FastAPI (a minimal sketch follows at the end of this section)
- Write once
How is it being reflected?
- Write code templates (cookiecutters in Python) - to provide a fully functional environment, full of nifty automations
  - Use it when starting every project
  - See the links section below for a few examples
  - Deep dive post is coming up soon
- Standard operations - a library we developed covering the “must-haves” of any script or notebook (production or research)
  - Establishes a good standard
  - Includes: logging, configuration, and tracking modules
  - Deep dive post is coming up soon
- Everything is containerized
  - A standard way to deploy our code in different services without changing anything (!)
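To give a feel for the “one implementation” rule on the API side, here is a minimal sketch of what a FastAPI service can look like. The endpoint, schemas, and placeholder prediction are illustrative, not our actual code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="example-model-api")

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    # Placeholder for real model inference; in a real service the model
    # would be loaded once at startup and reused across requests.
    return PredictResponse(label="music", score=0.97)
```

Having exactly one blessed pattern like this means every API in the team looks the same, which is the whole point of the value.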
Transparency (of tools & infra)
What does it mean?
This one is connected to the previous value (simplicity).
We want things to “just work” without touching them when possible.
We’d rather sacrifice flexibility and cost to get there.
Decisions
- No DevOps
👉 We’ll prefer serverless, managed, cloud-native solutions - even at the cost of flexibility or cost of operation
👉 For example, we chose Vertex Pipelines as our main pipeline tool, and GCP-native monitoring for those features.
- Fewest interactions - we choose libraries and implementations that require the fewest lines of code to run
Example:

```python
from aistdops.logging import logger

logger.info()
```

over

```python
logger = Logger()
logger.info()
```
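The real standard-operations library isn’t shown in this post, but the underlying idea is easy to sketch: do all the configuration once, at import time, and expose a ready-made logger at module level. The following is an assumed illustration, not the actual aistdops code:

```python
# Hypothetical sketch of a "standard operations" logging module:
# configuration happens once at import time, so every script or
# notebook gets a ready-to-use logger with a single import.
import logging
import sys

def _build_logger(name: str = "aistdops") -> logging.Logger:
    log = logging.getLogger(name)
    if not log.handlers:  # configure only once, even if imported repeatedly
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
        )
        log.addHandler(handler)
        log.setLevel(logging.INFO)
    return log

# Module-level instance: `from aistdops.logging import logger` and you're done.
logger = _build_logger()
```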
How is it being reflected?
Cloud - prefer managed services, serverless
Choice of transparent libraries - for example, ClearML for experiment tracking, as it logs most things *implicitly*
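To make the “implicit” part concrete: with ClearML, a single Task.init call at the top of a script or notebook is enough to start capturing git state, installed packages, console output, and metrics from common frameworks automatically (the project and task names below are just examples):

```python
from clearml import Task

# One call is enough; ClearML hooks into common frameworks and logs
# git info, packages, console output, and metrics from here on.
task = Task.init(project_name="examples", task_name="training-run")

# Regular training code follows; explicit log calls are optional:
task.get_logger().report_scalar("loss", "train", value=0.42, iteration=1)
```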
Robustness + Well engineered
What does it mean?
So frankly, we really don’t like waking up in the middle of the night to a PagerDuty call.
Therefore, we’re doing all we can to provide testable, robust solutions.
Decisions
- Invest in testing code, data validation, and model monitoring
- Choose production-ready tools
- Track everything we can
  - To build a lineage map and enable reproducibility
- Reduce production risks as much as possible
  - Prefer batch jobs over online APIs
  - Always have a fallback (a minimal sketch follows this list)
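The “always have a fallback” rule can be as simple as a guard around inference: if the model fails for any reason, serve a safe default instead of failing the whole request or batch. A minimal sketch, where the model interface and the default value are hypothetical:

```python
# Hypothetical sketch of the fallback rule: if model inference fails,
# degrade to a safe default rather than failing the request or job.
import logging

logger = logging.getLogger(__name__)

FALLBACK_PREDICTION = {"label": "unknown", "score": 0.0}

def predict_with_fallback(model, features: dict) -> dict:
    try:
        return model.predict(features)  # assumed model interface
    except Exception:
        logger.exception("Model inference failed; returning fallback")
        return FALLBACK_PREDICTION
```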
How is it being reflected?
- Data validation - with Great Expectations (a short example follows this list)
- Tracking experiments with ClearML
- Embrace best practices in core components - configuration management, logging, dataset and model management
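As a small taste of the data-validation piece, here is what a check can look like with Great Expectations’ classic pandas API; the dataframe, column, and thresholds are made-up examples:

```python
import pandas as pd
import great_expectations as ge

# Wrap a regular dataframe so expectation methods become available
df = ge.from_pandas(pd.DataFrame({"duration_sec": [30, 95, 180]}))

# Fail fast if any track duration falls outside the expected range
result = df.expect_column_values_to_be_between(
    "duration_sec", min_value=1, max_value=600
)
assert result.success, "Found track durations outside the expected range"
```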
What's next?
In the next articles in this series, we'll go deeper into how we implemented these values in our infrastructure.
About the writer
Amit is the head of data science at Artlist. He’s an active contributor to the Association of Software Architecture and the Cloud Security Alliance.
If you google his name, you’ll see he talks about technology at every given chance:
co-hosting the award-winning podcast “Osim Tochna” in Israel, recording videocasts, and speaking at conferences and meetups.
When he’s not running into walls with a VR headset, he’s spreading the word of AI to developers at cloudaiworld.com