To Streamline Machine Learning Operations, You Need to Flip Software Development on Its Head
In our previous blog post, we described the Data Maturity Curve (Figure 1) in detail and discussed how we use it to assess our customers' capabilities. This approach gives us a baseline for recommending a path forward to becoming a more data-driven organization. Based on these assessments across a number of customers in a variety of industries, WWT is observing interesting trends in the machine learning (ML) space.
Over the last several years, there has been a dramatic shift in the maturity of ML and its use in large enterprises. A majority of the companies have gone from experimentation in small pockets (maturity level 0-1) to having an established team of data scientists/engineers and a platform in place to allow for exploratory analysis and ML model training (maturity level 2-3).
Of course, there are still companies at a "0" that haven't yet made ML a priority, and there are unicorns at a "5" with ML at their core, but these companies are in the minority right now. Figure 2 below shows this shift conceptually, overlaid on WWT's data maturity curve.
As the data maturity of companies increases, the speed at which they get value out of their data should theoretically increase as well. The process by which a data science team does their ML work will dictate how efficiently they are able to create and deploy models that bring value to the organization.
Figure 3 below shows the typical steps in a model building process (it is specific to supervised learning; other ML model types may have different steps not shown here). Each bar shows the relative amount of time data scientists spend in each area. Note that about 50 percent of the time goes to the upfront discovery and model training steps, which will become an important point later in this post.
One of the biggest factors holding companies back from leveraging ML to its fullest potential is having the right environmental setup to support a streamlined and holistic process across these five steps.
Data scientists with access to the right set of data and tools may be able to do endless data exploration, and they also may have the compute resources necessary to train a variety of models. The end-to-end process, however, is typically manual, and there is limited standardization across the enterprise.
Moreover, testing trained models and promoting them into production is typically done in an ad-hoc fashion, with little to no testing performed on the data, the model and its integration into the production environment.
How can that be? We have been developing software and putting it into production for many years. There are tried-and-true methodologies, tools and processes available at our fingertips. Can't we just take what we have done in the software development space and do the same thing for ML models?
Let's explore this question a bit deeper.
Note: If you are a software developer, or have some software development experience, this section may be pretty obvious to you, so feel free to just skim it over or skip directly to Why developing and productionizing machine learning code is different.
Software development has gone through a dramatic paradigm shift over the years, with the main goals of accelerating time-to-market and allowing for extensibility of applications and services. Software applications traditionally built with a waterfall approach and a monolithic architecture had long development cycles and hard-coded dependencies, making it difficult to get new features and functionality into production.
Now, with the ability to build microservices-based applications in an agile fashion using continuous integration/continuous delivery (CI/CD) tools, development cycles have shortened significantly, and services are abstracted away from each other so that updates can be made without affecting the entire system.
Even with all of these dramatic shifts in architecture and methodologies, one aspect has remained unchanged: the promotion pathway of software code through three environments.
- Development
- Quality assurance (QA)
- Production
Promotion may happen more rapidly and with more iterations, but the overall path has remained unchanged.
For software development, these three environments have distinct characteristics. For some organizations, these characteristics may be just rules of thumb; for others, they may be memorialized as policy. In general, however, a few characteristics of each environment are common across organizations.
As software developers begin building an application, they start in a development environment that allows for exploration and free-flowing ideation. Typically, development environments are small in size and have just enough horsepower to try different features and functionality. Security measures are light (if any), and backup and recovery are typically non-existent.
Because of these characteristics, the developer is typically not permitted to bring in production data. As long as the simulated data has a similar look and feel to what the production data will be, a small amount can be used to test out different features and functionality of the software application. Once unit testing has been performed and the developer feels their code is production-ready, it is promoted to the QA environment for acceptance and functional testing.
A software application will encounter a variety of situations in production that the code should be tested for to ensure behavioral reliability and end-user satisfaction. QA engineers perform acceptance and functional testing in a QA environment to ensure that the code itself is behaving as intended and that it performs seamlessly when integrated with the rest of the environment.
The tests can be performed in an automated or manual fashion, and the suite of tests performed should be managed and updated as new features and functionality are written. Overall, the QA environment should mimic the production environment, and data should be production data to ensure the code is robust. Once the software application code passes all QA testing, user-acceptance testing and user experience testing, it gets promoted to production.
Production is where all of the hard work gets put into action. Production code should be secure, and the environment should be sized to handle production data and workloads. Overall, a production environment will be reliable, fast, secure and flexible in order to deliver the intended user experience and have the ability to grow and scale as needed. Once a software application is in production, its code is versioned and any changes or updates must go through all of the previous stages mentioned above (we realize we are simplifying what is typically a very complex version control branching strategy, but this section is meant merely to set up the discussions below).
"Machine learning systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data," (Breck, Cai, Nielsen, Salib, & Sculley, 2017). This simple quote elegantly describes the subtle difference between software application development and ML code development.
In other words, for traditional software development, humans develop all of the code and directly program in rules and logic. For ML development, however, the developer programs a methodology for the computer to learn from data and the computer develops the ML code to mimic patterns it uncovers during the learning process. This subtle difference leads to major differences in the characteristics of the environments and processes needed for code development and production.
The development environment in the ML space is a place for ML developers to perform two tasks:
- Discovery — Explore the historical production data and generate insights that will inform their training process.
- Model training — Build and run code that allows the computer to learn patterns from historical data.
As shown above in Figure 3, these two tasks take a significant portion of a data scientist's time when developing ML models.
Let's take a step back and talk through some of the foundations of supervised ML models before we continue to put some context around the details of the discovery and model training processes…
Typical supervised ML models have two aspects: targets and predictors. Targets are the information you are trying to predict (e.g., the likelihood that a customer will buy a certain product), and predictors are the information that is likely to predict the target (e.g., the number of similar products that customer has bought in the recent past). The target should be chosen based on the business value of the actions that can be taken based on the predictions.
The predictors should be chosen based on their predictive power. Predictors can be very basic data elements but can also be complex aggregations of data elements. The overall goal of building a valuable ML model is to find the right predictors for a given target amongst the endless combinations of data elements available. This can be a daunting task.
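As a rough illustration of choosing predictors by predictive power, the sketch below ranks a few candidate predictors by the absolute value of their Pearson correlation with the target. The customer data, the predictor names and the use of correlation as the measure are all assumptions made for this example; in practice, teams often rely on richer measures, such as feature importance from a trained model.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_predictors(predictors, target):
    """Return (name, |correlation|) pairs, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in predictors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: one entry per customer.
target = [1, 0, 1, 1, 0, 0]  # did the customer buy the product?
predictors = {
    "similar_purchases_90d": [4, 0, 3, 5, 1, 0],
    "account_age_years": [2, 3, 3, 2, 2, 3],
    "store_id": [7, 7, 7, 7, 7, 7],  # constant, so it cannot predict anything
}

ranking = rank_predictors(predictors, target)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data set, the count of recent similar purchases ranks first and the constant column ranks last, which matches the intuition described above.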
…now back to our regularly scheduled programming.
The discovery process allows ML developers to "get a feel" for the data before diving into training a model. Typical investigations include understanding the data quality, examining statistical distributions and finding correlations between predictors and other predictors, as well as between predictors and targets. ML developers will create visualizations to interrogate the data from different angles and generate many new data elements through manipulation and aggregation of the provided data set.
A major aspect of the discovery process, called "feature engineering," is where new potential predictors are created for the ML model to use during training. Tens of thousands of features may be engineered during discovery; many of them, however, will not show high relative predictive power and will be dropped from the final model.
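To make feature engineering a bit more concrete, here is a minimal sketch that aggregates raw transaction rows into per-customer features. The transaction layout and the four feature names are hypothetical; real feature sets are usually far larger and are built with dedicated data tooling.

```python
from collections import defaultdict

# Hypothetical raw transactions: (customer_id, product_category, amount)
transactions = [
    ("c1", "shoes", 80.0),
    ("c1", "shoes", 120.0),
    ("c1", "hats", 20.0),
    ("c2", "hats", 15.0),
]

def engineer_features(rows):
    """Aggregate raw rows into one feature dictionary per customer."""
    grouped = defaultdict(list)
    for customer_id, category, amount in rows:
        grouped[customer_id].append((category, amount))

    features = {}
    for customer_id, purchases in grouped.items():
        amounts = [amount for _, amount in purchases]
        features[customer_id] = {
            "purchase_count": len(purchases),
            "total_spend": sum(amounts),
            "avg_spend": sum(amounts) / len(amounts),
            "distinct_categories": len({category for category, _ in purchases}),
        }
    return features

features = engineer_features(transactions)
print(features["c1"])
```

Each engineered column is a candidate predictor. Those that show little predictive power would then be dropped before the final model is trained.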
Model training process
After ML developers have a good feel for the data and have engineered a number of features, they are ready to train a model. There are many algorithms that can be trained, each with its own nuances and trade-offs. The ML developer must be acutely aware of these trade-offs and understand the business use case to ultimately select the right model to be used in production.
Of course, predictive performance is the overall goal, but aspects like over-fitting and the quality and speed of the data available in production should also be considered. Depending on the model and size of data, the training process can be compute-intensive; depending on the code efficiency and hardware available, the ML developer may face extremely long iterations between training runs. In the end, a model will be chosen and moved into testing and production.
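One common way to spot over-fitting is to compare accuracy on the training data with accuracy on data held out from training. The sketch below does this with a deliberately simple 1-nearest-neighbor model, which memorizes its training data. The data points and the choice of model are assumptions made purely for illustration.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the training point closest to x (1-nearest neighbor)."""
    _, label = min(train, key=lambda point: abs(point[0] - x))
    return label

def accuracy(train, rows):
    """Share of rows whose predicted label matches the true label."""
    hits = sum(1 for x, label in rows if nearest_neighbor_predict(train, x) == label)
    return hits / len(rows)

# Hypothetical (feature, label) pairs; the points at 3.0 and 4.0 act as noise.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1)]
holdout = [(1.5, 0), (2.9, 0), (4.1, 1), (5.5, 1)]

train_accuracy = accuracy(train, train)
holdout_accuracy = accuracy(train, holdout)
gap = train_accuracy - holdout_accuracy
print(f"train={train_accuracy:.2f} holdout={holdout_accuracy:.2f} gap={gap:.2f}")
```

A large gap between the two numbers suggests the model has learned noise in the training data rather than a pattern that will hold up in production.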
Now, let's understand what is needed for a development environment in ML knowing that it will be used for discovery and model training. Both discovery and model training require an environment that allows for hands-on access to large amounts of production data, extensive computing power and a set of ML, statistical and visualization tools.
Because large amounts of production data are absolutely necessary, the development environment for ML should be secure and have strict access controls, and backup and recovery should be required. This is the exact opposite of the development environment for traditional software engineering, and in fact looks more like a production environment.
This major difference can get companies hung up on what exactly an ML environment is. Is it a development environment or a production environment? Is it some new hybrid of the two?
Mature companies handle this in different ways, but the bottom line is that this is a different environment than what is needed for software development. Companies should be aware of this as they build out their ML capabilities, tackle it head on and be ready to make some decisions that are outside of the norm.
As mentioned above, the ML model code is essentially written by the computer to codify what it learns from the historical patterns found in the large amount of data available. This subtle difference from traditional software development leads to the need for QA testing above and beyond the typical unit and integration tests. If proper QA testing is not performed for ML models, the company is at risk of creating tremendous amounts of technical debt that will ultimately have to be unwound (Sculley, et al., 2014).
Google has developed a rubric of 28 actionable tests (Breck, Cai, Nielsen, Salib, & Sculley, 2017) that they recommend be performed before productionizing an ML model, and they are building a platform in TensorFlow to automate some of these tests (Baylor, et al., 2017). We find this rubric to be an excellent starting point for companies aspiring to build a robust QA pipeline for ML models. Some examples of the new dimensions to test based on this rubric are:
- data efficacy and quality used for training the model;
- quality of code used to engineer the features used as predictors in the model;
- thorough hyperparameter tuning;
- model quality across different slices of data; and
- integration testing of the full model pipeline (assembling training data, feature generation, model training, model verification and deployment to a serving system).
We encourage you to read the full paper (Breck, Cai, Nielsen, Salib, & Sculley, 2017).
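To give a flavor of what two of these checks might look like in code, the sketch below tests training data for missing values and measures model quality per slice of data. It is not taken from the rubric or from any Google tooling; the function names, the 5 percent threshold, the toy model and the rows are all hypothetical.

```python
def check_training_data(rows, required_fields, max_missing_rate=0.05):
    """Return a list of data-quality problems; an empty list means the check passed."""
    problems = []
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        rate = missing / len(rows)
        if rate > max_missing_rate:
            problems.append(f"{field}: {rate:.0%} missing")
    return problems

def accuracy_by_slice(rows, predict, slice_field, label_field="label"):
    """Compute accuracy separately for each value of slice_field."""
    totals, correct = {}, {}
    for row in rows:
        key = row[slice_field]
        totals[key] = totals.get(key, 0) + 1
        if predict(row) == row[label_field]:
            correct[key] = correct.get(key, 0) + 1
    return {key: correct.get(key, 0) / totals[key] for key in totals}

def predict(row):
    """Hypothetical stand-in for a trained model."""
    return 1 if row["similar_purchases"] >= 2 else 0

# Hypothetical labeled test rows.
rows = [
    {"region": "east", "age": 34, "similar_purchases": 3, "label": 1},
    {"region": "east", "age": None, "similar_purchases": 0, "label": 0},
    {"region": "west", "age": 51, "similar_purchases": 2, "label": 0},
    {"region": "west", "age": 28, "similar_purchases": 1, "label": 1},
]

data_problems = check_training_data(rows, ["age", "region"])
slice_accuracy = accuracy_by_slice(rows, predict, "region")
print(data_problems)
print(slice_accuracy)
```

Overall accuracy on these rows is 50 percent, which hides the fact that the toy model is right on every "east" row and wrong on every "west" row. That is the kind of issue slice-level testing is meant to surface.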
A production-ready ML model is essentially a sophisticated calculation that takes in data and outputs a prediction. The models themselves are not production-ready applications but rather production-ready services that the enterprise will leverage within production applications, data pipelines and/or reports. When a model is ready for production, it may be leveraged by an enterprise in three ways:
- Embedded in an Extract, Transform, Load (ETL) pipeline for use in reports, other models, etc.
- Embedded within a production application
- Exposed as an API for ad-hoc calculations by a variety of applications and/or end-users
There will have to be tight coordination between the software development process of the application hosting the ML model and the actual ML model development. The software itself will go through its own development/QA/production process while the ML model is being trained and tested simultaneously. The details of the agile coordination process are out of scope for this article, but will be discussed in a future post.
Developers should think through the integration of ML and software development prior to going off into their separate ML and software worlds. This article won't go deep into operational models for these teams to work together, but offers some initial guidance to ensure a streamlined path to production.
- Collaborate closely, especially upfront: The ML and the software development teams should work closely, especially in the beginning phases of development when several critical decisions are being made on tools, standards, etc.
- Align on the datasets: Historical datasets used to train the model will have several nuances (formatting, missing data points, etc.) that the ML development team will have to make assumptions about and adjust for while training the model. The software development team needs to know about these nuances and adjustments in real time while developing the software.
- Align on performance needs: In production, ML models will typically handle a large throughput of data and must run complex calculations to make predictions in a timely fashion. To meet the prediction cadence required, infrastructure performance requirements should be discussed upfront to handle the volume, velocity and variety of data that will be coming through the ML model in production.
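One way to keep both teams aligned on dataset nuances is to put the cleaning rules in a single function that is used both when training the model and when the application calls it. The sketch below assumes a hypothetical record layout and hypothetical default values agreed on by the two teams.

```python
# Hypothetical defaults agreed on by the ML and software development teams.
DEFAULTS = {"similar_purchases": 0, "region": "unknown"}

def prepare_record(raw):
    """Apply the same cleaning rules at training time and at prediction time."""
    record = {}
    for field, default in DEFAULTS.items():
        value = raw.get(field)
        record[field] = default if value is None else value
    # Normalize formatting so "East", " east " and "EAST" are treated alike.
    record["region"] = str(record["region"]).strip().lower()
    record["similar_purchases"] = int(record["similar_purchases"])
    return record

# Used by the ML team on historical data...
training_row = prepare_record({"similar_purchases": "3", "region": " East "})
# ...and by the application on a live request with missing fields.
serving_row = prepare_record({"region": None})
print(training_row, serving_row)
```

Because the assumptions live in one place, a change to how missing values are handled reaches the training pipeline and the production application at the same time.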
As shown above in Figure 3, the model building process is cyclical in nature. At the end of the process, the model needs to be refreshed in order to maintain its predictive power (the predictive power of models decreases over time due to changes in the environment and/or data quality/availability). Moreover, new data may become available that could boost the model's predictive power and should be introduced.
Models may be refreshed at different rates (daily, weekly, bi-annually, etc.) depending on the nature of the data and model being used. The governance for refreshing models should be thought through as part of putting models into production.
Processes should be put in place to monitor the overall model performance in production and thresholds should be set to determine the best cadence for a refresh. In addition, once a model is in production, a team should be devoted to start the exploration process all over again to ensure the most predictive features are being developed and selected for the next refresh cycle.
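A minimal sketch of such monitoring is shown below: it tracks accuracy over a rolling window of recent predictions and flags the model for a refresh when accuracy falls below a threshold. The class name, window size and threshold are assumptions for illustration; appropriate values depend on the model and the business use case.

```python
from collections import deque

class ModelMonitor:
    """Track accuracy over the most recent predictions and flag when a refresh is due."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window_size)  # old results drop off automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether a prediction matched the outcome later observed."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_refresh(self):
        # Only decide once the window is full, to avoid reacting to a handful of results.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.min_accuracy

monitor = ModelMonitor(window_size=5, min_accuracy=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
healthy = monitor.needs_refresh()

# Two misses push accuracy in the window down to 60 percent.
for predicted, actual in [(1, 0), (0, 1)]:
    monitor.record(predicted, actual)
degraded = monitor.needs_refresh()
print(healthy, degraded)
```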
The ML model development process has some subtle differences from the traditional software development process. However, these subtle differences drive major changes to the processes and environments typically used to develop software. In order for companies to mature their ML development teams and streamline the time-to-market for ML models, they should embrace these changes and build out the right environments with the right governance and processes.
Not doing so will stifle innovation in the ML space and slow down the ability to make valuable predictions. Moreover, the complexity and sheer amount of technical debt that can be created within a full ML pipeline across both data and software may lead to untrustworthy recommendations and disastrous business decisions.
Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Yu Foo, C., Haque, Z., . . . Roy. (2017). TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. Google AI R&D.
Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. Google AI R&D.
Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., . . . Young, M. (2014). Machine Learning: The High-Interest Credit Card of Technical Debt. Google AI R&D.