
WWT has a long history of working with organizations to integrate the latest technologies into their processes, and our most recent focus in that area has been MLOps (Machine Learning Operations). Machine Learning (ML) is one of the most common current approaches to AI, and MLOps is the way to guarantee investment in ML resources pays dividends instead of adding costs.

This is the latest in a series of articles about the MLOps landscape as it continues to grow. Please see the previous articles, "Getting Started with MLOps: For Data Scientists" and "Is Your Organization Ready for MLOps?" We are committed to being a leader in this field because we believe MLOps will be an essential part of the AI toolkit going forward.

Why MLOps?

One of the largest barriers to the ongoing success of AI implementations within organizations is the increasing costs of maintaining those solutions over time. Maintenance is required not only for the continued use of ML models but also to ensure the outputs of those models remain accurate and useful over time.

[Figure: marginal revenue (MR) and marginal cost (MC) curves plotted against the number of ML use cases]

As visualized by the graph above, the intersection of the marginal revenue (MR) and marginal cost (MC) curves is the key decision point for an organization: if MR > MC for an AI use case, the organization should pursue it, but the opposite holds if MR < MC.

As such, any technology or process that lowers the MC of additional AI use cases is of interest because of its potential to increase the number of viable AI opportunities.  
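To make the decision rule concrete, the following is a minimal Python sketch; the use cases and their MR/MC figures are invented purely for illustration.

```python
# Hypothetical AI use cases with estimated marginal revenue (MR) and
# marginal cost (MC), in thousands of dollars per year. All figures
# are invented for illustration only.
use_cases = [
    {"name": "demand forecasting", "mr": 250, "mc": 120},
    {"name": "ticket triage", "mr": 90, "mc": 110},
    {"name": "churn prediction", "mr": 180, "mc": 95},
]

# The decision rule from the MR/MC framing above: pursue a use case
# only while its marginal revenue exceeds its marginal cost.
viable = [uc for uc in use_cases if uc["mr"] > uc["mc"]]

for uc in viable:
    print(f"Pursue {uc['name']}: MR {uc['mr']} > MC {uc['mc']}")

# Anything that lowers MC (such as MLOps tooling) shifts more use
# cases into the viable set.
```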

Top capabilities provided by MLOps

MLOps is a pipeline offering organizations the capabilities needed to develop, deploy, monitor, and reproduce secure ML solutions. As depicted below, the MLOps pipeline aids in the efficient management of ML models and therefore offers a framework to quickly productionize and scale data science solutions. The remainder of this article highlights three of the top capabilities provided by MLOps and their corresponding impact on MC.

[Figure: End-to-end MLOps lifecycle]

1. Model source control and reproducibility

In the context of ML, reproducibility is the ability to build and deploy a second model that achieves similar results to an original model when trained on the same data. This process involves following the same experimental procedures used to develop the original model and is facilitated, in theory, by the exact transfer of documented knowledge and computational procedures. In the above diagram, source control and reproducibility take place before the deployment and operation step in the MLOps process.

So why is it important? 

Reproducibility helps overcome human error in the experimental process. Data scientists often do not catalog their notes to the extent needed to exactly reproduce a model, and the time and effort required increase drastically if a new data scientist must take over the task.

Furthermore, model reproducibility is needed to keep up with changes to the dataset, updated software environments, bug fixes, and the many other changes a productionized model goes through as it is continuously developed and improved. Keeping data consistent and reducing variation across model reruns is therefore essential to the iterative refinement of models.
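As a concrete illustration of what that cataloging can look like, here is a minimal plain-Python sketch of a training script that pins its random seed and records a run manifest. It is not any particular MLOps product, and the file names and parameters are hypothetical.

```python
import hashlib
import json
import random
import sys

# Hypothetical experiment settings; in practice these would come from
# a config file or an experiment-tracking tool.
params = {"learning_rate": 0.01, "n_estimators": 200, "seed": 42}
data_path = "training_data.csv"  # hypothetical dataset file

# Pin the random seed so repeated runs make the same random choices.
random.seed(params["seed"])

# Fingerprint the exact training data so a later rerun can verify it
# is training on the same inputs.
with open(data_path, "rb") as f:
    data_hash = hashlib.sha256(f.read()).hexdigest()

# ... train the model here ...

# Write a run manifest capturing what is needed to reproduce the run.
manifest = {
    "params": params,
    "data_path": data_path,
    "data_sha256": data_hash,
    "python_version": sys.version,
}
with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```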

Finally, a pipeline that ensures the reproducibility of a model can help build credibility with the various stakeholders involved in a data science project. These pipelines require that models be properly designed and deployed, and that the process be communicated effectively among key stakeholders. This clear communication creates transparency across the entire ML process, which in turn builds trust even among the non-technical members of the organization.

Reproducing an ML model without MLOps is like trying to recreate a painting by looking only at the finished piece as an example. You need to determine what types of paints and brushes you need, the color combinations you need to mix, the correct sequence of strokes, and much more! The stakes get even higher when you are expected to produce an exact replica of the original painting that can act as a perfect substitute, hanging in a museum for public consumption. In contrast, the MLOps pipeline is like painting on a marked "paint-by-numbers" canvas – it guides you through the whole process required to replicate the artwork.

To summarize, MLOps makes the process of reproducing models more efficient by reducing the time needed for knowledge transfer between data scientists, streamlining cataloging of the experimental process, and increasing the transparency of technical solutions. By reducing the time commitment and expertise needed for model reproducibility, MLOps brings down the maintenance cost (and so the MC) associated with deploying each AI solution.

2. Model versioning and storage

Productionized ML models are built iteratively and need continuous monitoring and regular updates. Updates to the model parameters, underlying algorithm, or datasets used for training can all generate new versions of the model. Model versioning and storage also take place before deployment in the process diagram above.

So why is this important?

Cataloging releases and maintaining access to previous model iterations is essential so that data scientists can return to an earlier working version if a malfunction or other flaw surfaces in the latest release while the model is deployed. Sometimes a flaw is not noticed immediately and can take weeks, months, or even years to be discovered. In that case, every iteration since the flawed model needs to be corrected, and cataloging makes that task far easier.

Versioning also provides data scientists the freedom to experiment in code without the potential to break a working model. Data scientists usually tinker with a model to find the most effective solution possible, so having a space where they can compare various solutions and examine the trade-offs between different approaches is important.

Similar to enterprise software development, deploying ML models depends on a complex file system: the code, its dependencies, the evolving datasets, and the various deep learning packages leveraged all need to be tracked. And if multiple models interact to produce a result, different programming languages can add another layer of complexity. In that case, version control and dependency tracking become even more important.
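As one possible illustration, here is a minimal sketch using MLflow's model registry (one popular open-source option; any registry with similar versioning primitives would serve). The experiment name, model, and registry name are hypothetical, and registering assumes a registry-capable tracking backend.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy training data and model, purely for illustration.
X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

mlflow.set_experiment("churn-prediction")  # hypothetical experiment name

with mlflow.start_run() as run:
    # Log the trained model as a run artifact, together with the
    # parameters that produced it.
    mlflow.log_param("max_iter", 1000)
    mlflow.sklearn.log_model(model, "model")

# Register the logged model; each call creates a new numbered version,
# so earlier versions stay retrievable if this one misbehaves.
# (Assumes the tracking server supports a model registry, e.g., a
# database-backed MLflow server.)
model_uri = f"runs:/{run.info.run_id}/model"
mlflow.register_model(model_uri, "churn-model")
```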

Returning to the painting analogy, model versioning is like having your canvas available on demand, ready to be viewed or edited at any past stage. Versioning gives you the freedom to create and experiment with color and technique without fearing you will irrevocably ruin the final painting. Similarly, model versioning lets you try out different models or hyperparameters without fear of breaking your entire model.

Model versioning gives data scientists the creative freedom needed to develop the most effective solutions. It serves as both a safety net in the event of a model malfunction (which takes place more often than any data scientist would like) and a tool to improve the quality of the end result, thus lowering the technical debt of a solution. As such, versioning lowers the MC of each successive AI use case.

3. Continuous model training and validation

Model training is the core of any ML algorithm. This is the stage during which a model learns how to generate some result based on input data. After this stage, models should be able to handle anomalies, unexpected correlations, and new data from outside the training set while still producing reliable results. The model leverages these learnings to produce the best outputs within the constraints given by the model creators.

Model validation is the stage during which the model is tested against a dataset where the output is unknown to the model. This test assesses the accuracy of the model, quantifying its potential to produce quality results when faced with new data. Model training is a step in the MLOps process shown above, but it is also a key part of the automated MLOps workflow. Models are initially trained, then monitored for ongoing performance, and subsequently retrained when necessary.

That automation is another key factor, enabling the continuous nature of model training and validation. Setting thresholds for key metrics is a common MLOps approach: if any metric associated with a given model crosses its assigned threshold, an automated pipeline triggers a refresh cycle for the model.
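A minimal sketch of that threshold pattern might look like the following; the metric names, threshold values, and the retrain_and_validate step are hypothetical stand-ins for what a real pipeline would provide.

```python
# Hypothetical monitoring thresholds: refresh if accuracy falls below
# 0.90 or if an input drift score rises above 0.20.
THRESHOLDS = {"accuracy": ("min", 0.90), "drift_score": ("max", 0.20)}

def needs_refresh(latest_metrics: dict) -> bool:
    """Return True if any monitored metric crosses its threshold."""
    for name, (kind, limit) in THRESHOLDS.items():
        value = latest_metrics[name]
        if kind == "min" and value < limit:
            return True
        if kind == "max" and value > limit:
            return True
    return False

def retrain_and_validate():
    # Placeholder for the real pipeline step that retrains the model
    # and validates it before promotion.
    print("Refresh cycle triggered")

# In production this check would run on a schedule or as new data
# arrives; the metric values here are invented.
latest_metrics = {"accuracy": 0.87, "drift_score": 0.12}
if needs_refresh(latest_metrics):
    retrain_and_validate()
```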

So why are continuous training and validation important in MLOps pipelines?

The world is continuously evolving, but a trained model reflects a snapshot in time, so models need to keep up. With any change to a model – whether to the code, the dataset, or the hyperparameters – the model needs to be retrained. Retraining on the latest information ensures a model outputs the best results possible and prevents "model drift." To maintain reproducibility and stay aligned with the expected goals of a data science project, each round of training must be followed by model validation to confirm that the model is working as expected. Adding automation makes that process repeatable.
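To illustrate the validation step that gates each retraining cycle, here is a small sketch in which a freshly retrained "challenger" model is promoted only if it beats the currently deployed "champion" on held-out data. The models, data, and metric are toy examples, not a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the production dataset.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# "Champion" is the currently deployed model; "challenger" is the
# freshly retrained candidate. Both are hypothetical examples.
champion = LogisticRegression(max_iter=1000).fit(X_train, y_train)
challenger = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Validate both against a held-out set neither model trained on.
champ_acc = accuracy_score(y_holdout, champion.predict(X_holdout))
chall_acc = accuracy_score(y_holdout, challenger.predict(X_holdout))

# Promote the challenger only if it genuinely improves on the champion.
if chall_acc > champ_acc:
    print(f"Promote challenger ({chall_acc:.3f} > {champ_acc:.3f})")
else:
    print(f"Keep champion ({champ_acc:.3f} >= {chall_acc:.3f})")
```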

Back to our painting analogy: a big part of painting is using the right color combinations to evoke a certain emotion, convey a specific concept, or create a specific mood. Time and practice are required to understand how colors mix, which colors work well together, and which ones do not. In an ML model, the dataset is the color palette. It contains independent variables (individual colors) and dependent variables (the color combinations). During model training, the algorithm learns the relationship between the independent and dependent variables. During model validation, we verify the model's understanding of these relationships. Without model training and validation, there would be no "learning" in ML, and you would be left with a painting so bland that even your mother would not put it up on her fridge.

Now, imagine a new color was added to the palette, or the available quantity of a certain color changed. You would have to change the color mixture to achieve the same result. If you did not do so, your ratios would be off, and your color combinations would not turn out as desired. The same holds true for any ML model. An automated training and validation process would recognize that these changes took place and trigger retraining and validation of the algorithm, producing, in this case, a robotically created but similarly pleasing painting.

So, training and validation guarantee that a model is fit for purpose, and automating those steps of the ML process can both alert data scientists when a model is not functioning as expected and trigger an update to the model without requiring human intervention. The MC component for training and validation again comes down to technical debt; models are depreciating assets unless plugged into an MLOps pipeline that keeps them fresh. Automation lowers the MC even further by focusing the effort for each ML use case on the initial model training and deployment; the automated pipeline handles any necessary future updates.

Evaluating opportunity

The decision to invest in MLOps as an organization is just that, an investment decision. The choice should come down to weighing the benefits MLOps can bring relative to the costs required to achieve and maintain those benefits. AI, like every technology, is most effective when applied with precision to specific use cases based on clear goals. And there is a cost associated with each application.

The additional upfront investment in MLOps will benefit an organization's ML development process: it lowers the marginal cost of each solution and therefore increases the number of viable ML opportunities to pursue. The capabilities highlighted above, although not an exhaustive list, illustrate this point. AI is like painting; the right tools can make all the difference.
