Enhancing MLOps With a Single Pane of Glass
Machine Learning Operations (MLOps) is the application of development operations (DevOps) principles to machine learning (ML), with the goal of producing and maintaining robust ML models rather than traditional software. As an organization increases its use of ML models in pursuit of the outsized value they can generate, monitoring model performance and evaluating the output of data science teams become increasingly burdensome.
Transparency for data science leadership: can you understand the status and performance of all your organization's production models in under 30 seconds?
Automation for data scientists: is there an automated way to monitor your model's performance?
Trust for end-users of data science models: does the data team proactively notify you and act to remediate any problems that arise with deployed ML models?
If the answer to any of these questions is no, your approach to model monitoring may not be effective in ensuring your suite of ML models reliably produces business value. MLOps is meant to increase transparency, automation, and trust in an organization's models, and creating an ML single pane of glass (SPoG) is one key approach to accomplishing that goal.
An ML SPoG is a dashboard that removes the veil around ML model performance. In doing so, it provides different value to each of the personas mentioned above based on their connection to an ML model suite. For instance, data science leaders would be able to understand the status and performance of each of an organization's models at a glance, data scientists would no longer need to manually pull and share model performance metrics with their colleagues, and end-users would have greater insight into, and correspondingly greater trust in, the models that inform their decisions.
ML models have many points of complexity, such as disparate sources producing data that is rarely stationary over time, complex algorithms that can be challenging to understand and measure, and a network of tools and individuals required to maintain the models over time. This model ecosystem, which can easily incorporate thousands of variations on hundreds of models for a large organization, is managed by cross-functional teams of data scientists, data engineers, and ML engineers, to name a few. In the context of such complexity, clarity around model performance becomes a differentiator for data teams. Not only does it make the output from data teams more tangible, but it also removes the black box nature of many ML initiatives.
ML SPoGs consolidate both tracking and monitoring data for ML models in a high-level view. Model tracking data represents a model's lineage, as models often go through many different iterations before one variation is deemed ready for deployment. Monitoring data represents the deployed model's performance in production, measured against a benchmark usually established during training. Collecting these data in a single view provides the information necessary to judge how successfully a production model is delivering the desired business value. In addition, setting up a SPoG that updates automatically requires data teams to build enhanced continuous integration, continuous deployment (CI/CD) pipelines for their model suites, a step towards greater ML maturity that lessens the manual burden on data teams for reporting and analysis.
Like the single panes of glass used elsewhere in IT, an ML SPoG should provide people with easily digestible information they can access on demand and without assistance.
The framework above visualizes WWT's perspective on the critical elements of an end-to-end MLOps solution. While other aspects of the framework are useful in their own regard, the following capabilities should be considered prerequisites before developing an ML SPoG.
Model Experiments and tracking implemented in your chosen ML Environment (e.g., Azure ML). This is key to keeping records of different model experiments, which allow a data scientist to determine which iteration of a model is ready for deployment. It also makes it possible to benchmark Expected Model Performance, an important metric that indicates how well the model is anticipated to perform in production based on how it fared during training.
Model Continuous Integration and Continuous Deployment (CI/CD) to add as much automation as feasible into the process of updating and re-deploying a production model. These CI/CD principles usually entail setting up a Model Registry of all variations of one particular model, using pipelines to deploy the best one, and delivering results through an endpoint of some fashion (e.g., API, UI). This automated setup is preferred for the SPoG over manual deployments, as it allows the dashboard to continuously update with the latest data.
Prediction Service is a third essential piece of the puzzle, as ML SPoGs are generally used to monitor the performance of models in production. These deployed models should be available for inference as a service based on end-user needs. The SPoG should then show how accurate the model's predictions are against real data.
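As a rough illustration of how the first two prerequisites fit together, the sketch below picks the iteration to deploy from a set of tracked runs and records its Expected Model Performance. It assumes runs are available as simple in-memory records; the `ExperimentRun` class, its fields, the `select_champion` helper, and the sample values are all hypothetical and are not part of Azure ML or any other tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExperimentRun:
    """One tracked training run of a model (hypothetical record layout)."""
    model_name: str
    version: int
    validation_rmse: float  # error on held-out data; lower is better


def select_champion(runs: Sequence[ExperimentRun]) -> ExperimentRun:
    """Return the run with the lowest validation error."""
    if not runs:
        raise ValueError("no experiment runs to choose from")
    return min(runs, key=lambda run: run.validation_rmse)


# Illustrative values only, not real results.
runs = [
    ExperimentRun("demand_forecast", 1, 14.2),
    ExperimentRun("demand_forecast", 2, 11.7),
    ExperimentRun("demand_forecast", 3, 12.9),
]
champion = select_champion(runs)
# The champion's training-time error becomes the benchmark for production.
expected_performance = champion.validation_rmse
```

In a real setup the records would come from the experiment tracker or model registry rather than a hard-coded list, and the selection rule might weigh more than a single error metric.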
Audience: Who will use this dashboard, and how often? What types of questions are they inclined to ask? What models should be included in a SPoG? The answers to these questions will shape the design and indicate which new data sources, pipelines, or other infrastructure will need to be set up to make your SPoG a reality.
Phased approach: What is the most valuable information a SPoG can display for your organization? Start by setting up a very simple dashboard in an environment your organization's data team is very familiar with and can already access while keeping the Pareto Principle in mind. Focus on collecting the data and meeting end-users' business objectives before creating a more designed, elaborate SPoG.
Lifecycle management: Will your SPoG only be used to monitor production models, or will it include in-progress models as well? For monitoring non-prod model experiments, experiment tracking, logging, champion selection, and model promotion best practices should be considered for inclusion. More to come on those concepts in future articles!
Ease of reporting: Once DS Leadership has access to a fully-fledged SPoG, what types of questions can it be used to answer? What processes can be updated now that the infrastructure to support a SPoG as well as the dashboard exist? Areas impacted may include the method used to report any issues with existing models, how those models' performances are assessed, how data team members interact with end-users of their models, etc.
Production Status should reflect whatever an organization wants as the SPoG's quickest takeaway insight, shown above with red-amber-green (RAG) status. We have seen this used to represent how well a given model is performing relative to its baseline as well as whether a model is currently running in production vs being retrained or otherwise offline.
Primary Model Metric: While organizations track various error metrics such as RMSE, MAE, or MAPE in production models, it is key to define one primary metric for the SPoG. Choosing one metric will simplify the visualization of models' performances and any drift detection functionality, described in greater detail below. This metric should be selected to accurately reflect the business value obtained from the model and should be picked by the data science team working on the model based on input from business stakeholders. This metric should be updated on the dashboard every time the model runs in production.
Expected Model Performance: While the model is being trained, DS teams define a baseline or expected model performance, which acts as a benchmark against which to compare the same model's performance in production. This metric should be updated in the dashboard with every retrain or remodel.
Last Retraining Date: This metric adds more context to timelines or other visualizations while also allowing a data team to quantify model performance over time. Based on an upward or downward trend after the last retrain, a team may consider attempting to further reduce the error rate of a well-performing model or remodel entirely for a lackluster one. This information should be updated on the dashboard every time the model is retrained.
Drift Detection: This statistic is a more advanced SPoG feature, meant to determine if model performance is decaying over time. A drift detector can be as simple as a change percent tracker (e.g., compute the rolling average of the last 10 model outputs vs a fixed value) or as comprehensive as a drift detection algorithm that uses advanced statistics to identify a significant change in model performance (see scikit-multiflow for examples). Determining the best method to calculate drift addresses the same objective as having a person watch a model's performance trendline; in the case of drift or a substantial decrease in performance, a model may need to be retrained or remodeled.
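The dashboard elements above can be tied together in a few lines. The following is a minimal sketch, not a prescribed implementation: it assumes RMSE is the primary metric, tracks the rolling average of recent production errors against the Expected Model Performance as a simple change percent tracker, and maps the result to a RAG status. The class and function names, the window size, and the 10% and 25% cutoffs are assumptions chosen for illustration.

```python
import math
from collections import deque
from typing import Deque, Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error for one production run."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


class ChangePercentTracker:
    """Compare a rolling average of recent errors with a fixed baseline."""

    def __init__(self, expected_error: float, window: int = 10) -> None:
        if expected_error <= 0:
            raise ValueError("expected_error must be positive")
        self.expected_error = expected_error
        # deque(maxlen=...) silently drops the oldest value once full.
        self.recent: Deque[float] = deque(maxlen=window)

    def update(self, error: float) -> float:
        """Record the latest error; return percent change vs. the baseline."""
        self.recent.append(error)
        rolling = sum(self.recent) / len(self.recent)
        return 100.0 * (rolling - self.expected_error) / self.expected_error


def rag_status(change_pct: float, amber_at: float = 10.0,
               red_at: float = 25.0) -> str:
    """Map percent degradation to a RAG label (cutoffs are assumptions)."""
    if change_pct >= red_at:
        return "red"
    if change_pct >= amber_at:
        return "amber"
    return "green"


# Illustrative error values only; expected_error plays the role of
# Expected Model Performance established during training.
tracker = ChangePercentTracker(expected_error=10.0, window=3)
statuses = [rag_status(tracker.update(err)) for err in (10.0, 11.0, 15.0, 16.0)]
```

A tracker like this would be updated each time the model runs in production and reset with a new baseline after every retrain. A statistical detector could replace `ChangePercentTracker` later without changing how the status is displayed.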
MLOps often becomes a topic of interest when an organization scales its use of ML modeling, and a model suite becomes increasingly difficult to manage. Data science managers need new approaches to supervise the models and individuals for which they are responsible, data scientists spend their days reporting on existing models rather than creating new business value, and end-users of models may not be confident enough in their advanced analytic capabilities to immediately derive value from them.
An ML SPoG addresses these growing pains, and we have found it a helpful tool to increase the transparency, automation, and trust around an organization's ML models. It can give data scientists the ability to focus on improving models instead of on reporting, data science leaders the ability to make informed decisions about ML within their organizations and proactively inform end-users of any potential issues, and end-users the visibility that will increase their confidence in the outputs of ML.
We are engaged in the MLOps journey at WWT, both internally and with our customers and partners, so please reach out with any questions or comments. We are looking forward to hearing from you.