MLOps (a compound of “machine learning” and “operations”) is a practice for collaboration and communication between data scientists and operations professionals to help manage the production ML (or deep learning) lifecycle. Similar to the DevOps or DataOps approaches, MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements. While MLOps started as a set of best practices, it is slowly evolving into an independent approach to ML lifecycle management. MLOps applies to the entire lifecycle — from integration with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics.
As shown in the above diagram, only a small fraction of a real-world ML system is composed of the ML code. The required surrounding elements are vast and complex.
DevOps versus MLOps
DevOps is a popular practice in developing and operating large-scale software systems. This practice provides benefits such as shorter development cycles, increased deployment velocity, and dependable releases. To achieve these benefits, you introduce two concepts into software system development: continuous integration (CI) and continuous delivery (CD).
An ML system is a software system, so similar practices apply to help guarantee that you can reliably build and operate ML systems at scale.
However, ML systems differ from other software systems in the following ways:
- Team skills: In an ML project, the team usually includes data scientists or ML researchers, who focus on exploratory data analysis, model development, and experimentation. These members might not be experienced software engineers who can build production-class services.
- Development: ML is experimental in nature. You should try different features, algorithms, modeling techniques, and parameter configurations to find what works best for the problem as quickly as possible. The challenge is tracking what worked and what didn’t, and maintaining reproducibility while maximizing code reusability.
- Testing: Testing an ML system is more involved than testing other software systems. In addition to typical unit and integration tests, you need data validation, trained model quality evaluation, and model validation.
- Deployment: In ML systems, deployment isn’t as simple as deploying an offline-trained ML model as a prediction service. ML systems can require you to deploy a multi-step pipeline to automatically retrain and deploy models. This pipeline adds complexity and requires you to automate steps that data scientists perform manually before deployment to train and validate new models.
- Production: ML models can have reduced performance not only due to suboptimal coding, but also due to constantly evolving data profiles. In other words, models can decay in more ways than conventional software systems, and you need to consider this degradation. Therefore, you need to track summary statistics of your data and monitor the online performance of your model to send notifications or roll back when values deviate from your expectations.
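The production-monitoring point above can be made concrete with a small sketch: track summary statistics of a feature seen at training time and compare them against a live window of serving data, alerting when the mean drifts too far. The feature values, window sizes, and tolerance threshold here are illustrative, not from the original article.

```python
import statistics

def summarize(values):
    """Compute simple summary statistics for one numeric feature."""
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}

def drifted(train_stats, live_stats, tolerance=0.25):
    """Flag drift when the live mean moves more than `tolerance`
    training standard deviations away from the training mean."""
    if train_stats["stdev"] == 0:
        return live_stats["mean"] != train_stats["mean"]
    shift = abs(live_stats["mean"] - train_stats["mean"]) / train_stats["stdev"]
    return shift > tolerance

# Illustrative data: the live window has shifted upward relative to training.
train = summarize([10, 11, 9, 10, 10, 12, 9, 11])
live = summarize([14, 15, 13, 16, 14, 15, 14, 16])

if drifted(train, live):
    print("ALERT: feature distribution deviates from training data")
```

In a real system the alert would trigger a notification or a rollback rather than a print, and the statistics would be computed over streaming windows.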
ML and other software systems are similar in continuous integration of source control, unit testing, integration testing, and continuous delivery of the software module or the package. However, in ML, there are a few notable differences:
- CI is no longer only about testing and validating code and components, but also testing and validating data, data schemas, and models.
- CD is no longer about a single software package or a service, but a system (an ML training pipeline) that should automatically deploy another service (model prediction service).
- CT is a new property, unique to ML systems, that’s concerned with automatically retraining and serving the models.
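The extended notion of CI described above — validating data and schemas, not just code — can be sketched as a test that a CI job would run before training. The column names and types in `EXPECTED_SCHEMA` are hypothetical placeholders.

```python
# Hypothetical schema for a CI data-validation step: field names
# and types are illustrative, not from the original article.
EXPECTED_SCHEMA = {"age": int, "income": float, "label": int}

def validate_rows(rows, schema=EXPECTED_SCHEMA):
    """Fail fast (as a CI test would) if a row is missing a field
    or carries the wrong type. Returns a list of error messages."""
    errors = []
    for i, row in enumerate(rows):
        for field, ftype in schema.items():
            if field not in row:
                errors.append(f"row {i}: missing '{field}'")
            elif not isinstance(row[field], ftype):
                errors.append(f"row {i}: '{field}' is not {ftype.__name__}")
    return errors

good = [{"age": 42, "income": 50000.0, "label": 1}]
bad = [{"age": "42", "income": 50000.0}]  # wrong type, missing field

assert validate_rows(good) == []
assert len(validate_rows(bad)) == 2
```

Dedicated tools (e.g. TensorFlow Data Validation or Great Expectations) cover the same idea at production scale; the point is that the data contract is tested alongside the code.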
Example CI/CD pipeline automation
The following diagram shows an implementation of the ML pipeline using CI/CD, which combines the characteristics of an automated ML pipeline setup with automated CI/CD routines.
Data Science (& ML Ops) challenges
Undoubtedly this era belongs to Artificial Intelligence (AI), and as a result Machine Learning is being used in almost every field, solving different kinds of problems from healthcare to business and technical domains. Together with Open Source Software (OSS) and cloud-based distributed computing, this has produced a wealth of tools, techniques, and algorithms. Developing a Machine Learning model to solve a problem is no longer the challenge; the real challenge lies in managing these models and their data at massive scale.
The Data Science (& ML) development process needs to learn from the SDLC (Software Engineering) in order to face these challenges. And what are these challenges? They are the same challenges that the SDLC faces and addresses by adopting DevOps practices, for example:
1. Data challenges
Dataset dependencies: data in the training and evaluation stages can vary in real-world scenarios.
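One common form of this variation is a categorical feature taking values at evaluation or serving time that the model never saw during training. A minimal sketch of that check, with an invented "country" feature as the example:

```python
def unseen_categories(train_values, eval_values):
    """Return categorical values present in evaluation/serving data
    that never appeared in the training data."""
    return sorted(set(eval_values) - set(train_values))

# Illustrative example: a "country" feature gains a new value in production.
train_countries = ["US", "DE", "FR", "US", "DE"]
eval_countries = ["US", "FR", "BR", "DE"]

print(unseen_categories(train_countries, eval_countries))  # ['BR']
```

Surfacing such mismatches early is cheaper than debugging silently degraded predictions later.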
2. Model challenges
ML models are built in a data scientist’s sandbox. They are typically not developed with scalability in mind; rather, they are developed to achieve good accuracy and find the right algorithm.
3. Automation
Training a simple model, putting it into inference, and generating predictions is a simple, manual task. In real-world cases (at scale), everything must be automated, everywhere.
Automation is the only way to achieve scalability in the different stages of ML-SDLC.
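The stages that need automating can be sketched as a linear pipeline with a gate before deployment. The step names and bodies below are illustrative placeholders; a real orchestrator (e.g. a workflow engine) would run each stage as a separate job.

```python
def ingest():
    """Stand-in for pulling data from a source system."""
    return [1.0, 2.0, 3.0, 4.0]

def validate(data):
    """Automated data check before any training happens."""
    assert all(isinstance(x, float) for x in data), "bad input data"
    return data

def train(data):
    """The 'model' here is just the mean, standing in for a real fit."""
    return sum(data) / len(data)

def evaluate(model):
    """Trivial acceptance gate: the stand-in model must be a sane number."""
    return abs(model) < 1e6

def run_pipeline():
    data = validate(ingest())
    model = train(data)
    if not evaluate(model):
        raise RuntimeError("model failed evaluation; not deploying")
    return model  # in a real system: push to the prediction service

print("deployed model:", run_pipeline())
```

The value of chaining the stages programmatically is that a retrain can be triggered on a schedule or on data drift, with no manual steps in between.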
4. Observability
Monitoring, alerting, visualization, and metrics.
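These four concerns can be illustrated with a tiny in-process metrics recorder: it counts prediction outcomes and records an alert when latency exceeds a budget. Real systems would export such metrics to a monitoring stack (Prometheus/Grafana are common choices); this sketch and its latency budget are illustrative assumptions.

```python
from collections import Counter

class PredictionMetrics:
    """Minimal metrics recorder: outcome counts plus a latency-based
    alert hook. A sketch only, not a production monitoring client."""

    def __init__(self, latency_budget_s=0.5):
        self.counts = Counter()
        self.latency_budget_s = latency_budget_s
        self.alerts = []

    def observe(self, outcome, latency_s):
        self.counts[outcome] += 1
        if latency_s > self.latency_budget_s:
            self.alerts.append(f"slow prediction: {latency_s:.3f}s")

metrics = PredictionMetrics()
metrics.observe("ok", 0.02)
metrics.observe("ok", 0.9)    # over budget, so it is recorded as an alert
metrics.observe("error", 0.01)

print(metrics.counts["ok"], metrics.counts["error"], len(metrics.alerts))
```

Counts feed dashboards (visualization), deviations feed alerting, and both rest on the same observed metrics.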
5. The MLOps Tool Scene
The effort involved in solving MLOps challenges can be reduced by leveraging a platform and applying it to the particular case.
Platforms for MLOps
1. Open Source
Perfect for early adopters; also suitable for easily implementing proofs of concept or starting your own personal project.
2. Kubernetes-based
Kubernetes and containers are the new platform where our applications are going to run and live, even ML applications.
3. Stack or Platform
I don’t want to waste effort integrating heterogeneous tools; I want a stack or platform with mature tools already integrated seamlessly.
Summary
To summarize, implementing ML in a production environment doesn’t only mean deploying your model as an API for prediction. Rather, it means deploying an ML pipeline that can automate the retraining and deployment of new models. Setting up a CI/CD system enables you to automatically test and deploy new pipeline implementations. This system lets you cope with rapid changes in your data and business environment. You don’t have to immediately move all of your processes from one level to another. You can gradually implement these practices to help improve the automation of your ML system development and production.