Productionizing Machine Learning in Software Delivery at Scale

Introduction

Enterprises across all industries are now starting to move their machine learning (ML) efforts out of experimental labs and into production use cases. Many of these organizations are beginning to see results of leveraging ML, and it is becoming a core competency. However, many feel that they are still at the beginning of this journey.

Enterprises today typically have one group of data scientists conducting ML in isolation or siloed groups each trying to solve a unique problem. These data scientists are exemplary when working at the individual level with relatively small-scale models. While seeing the potential, many of these organizations are facing challenges when deploying these models into production and making them available for broad use. The power of scale is not present.

There are several observable characteristics of companies that have reached Level 5 on the data maturity curve. The ML team and the models they build are leveraged for data-driven insights across the organization. This includes:

Driving some of the most critical processes and decisions.
Expanding the business value landscape.
Streamlining the business-as-usual.

The models are relied upon by all members of the organization to do their work confidently. Even for these organizations, there is still room to grow.

The next step is to use agile software development techniques to integrate ML models into software applications that can be deployed across the enterprise. This enables organizations to embrace ML for the most complex transformational efforts. As discussed in our previous article, ML teams can learn from how software delivery teams have evolved over the past fifteen years. The paradigm shift from waterfall processes to agile methodologies and DevOps concepts has enabled software delivery teams to move at the speed-of-business and allowed companies to digitize their processes and information at scale.

Many enterprises have not yet succeeded in deploying ML in the rapid, streamlined and standardized way they deploy software products. Some are aware of how these improvements would give them a stronger competitive position in their verticals. When organizations apply software engineering processes and techniques to conduct ML at scale, they can move from an excessive focus on training the model to a more balanced focus on training, quality assurance, productionization and model refreshing.

To do this effectively, an evolution has to take place with both the software delivery teams and ML teams, and these two teams need to work in lock-step. This article describes a model for how to achieve this vision by bringing the practitioners of software delivery and machine learning together.

As-is reality and roadblocks along the way

Achieving a future where ML is able to be integrated throughout an enterprise's appropriate software development projects requires education, flexibility and comfort with change between the ML teams and the software delivery teams. ML models have to evolve at the speed-of-business and scale across an organization, while software delivery teams need to seamlessly integrate ML models into their applications.

This requires the ML team and the software delivery teams to work together seamlessly and adopt similar processes and technology platforms, which may lead to some potential roadblocks along the way. Most of these potential roadblocks stem from three areas related to the current reality:

ML teams are stuck in the laboratory and have limited exposure to modern software delivery processes.
Software delivery teams think of ML as typical software.
Leadership is not setting expectations on the importance of embracing this change.

ML teams are stuck in the laboratory and have limited exposure to modern software delivery processes

As ML teams shift their focus from experimentation to production, their understanding of what it means to harden a model becomes extremely important. Often ML developers are unaware of what it takes to have a model making frequent predictions with simultaneous user access and user load, across multiple applications.

To start, they may be unfamiliar with DevOps and agile processes. They may never have seen a model deployed outside of their own specific workspace. In addition, the ability to constantly refresh or retrain a model as more recent and/or more predictive data becomes available can present a host of challenges data scientists may not be ready for (e.g., the ability to A/B test models and only productionize if the performance is superior). Overcoming these challenges allows the right model to be used to solve the business problem.
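The rule of productionizing a refreshed model only when it performs better can be pictured as a small promotion gate. The sketch below is illustrative only: the names `accuracy` and `should_promote`, the choice of accuracy as the metric, and the `min_gain` margin are all assumptions rather than anything the process above prescribes.

```python
from typing import Callable, Sequence

# Hypothetical type: a model is any callable mapping a feature row to a label.
Model = Callable[[Sequence[float]], int]


def accuracy(model: Model, rows: Sequence[Sequence[float]],
             labels: Sequence[int]) -> float:
    """Fraction of held-out rows the model labels correctly."""
    if not rows:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(rows)


def should_promote(champion: Model, challenger: Model,
                   rows: Sequence[Sequence[float]], labels: Sequence[int],
                   min_gain: float = 0.01) -> bool:
    """Promote the challenger only if it beats the champion by min_gain."""
    champion_score = accuracy(champion, rows, labels)
    challenger_score = accuracy(challenger, rows, labels)
    return challenger_score >= champion_score + min_gain


# Toy held-out data: the label is 1 when the single feature exceeds 0.5.
rows = [[0.1], [0.4], [0.6], [0.9]]
labels = [0, 0, 1, 1]

champion = lambda row: 1 if row[0] > 0.8 else 0    # misses the 0.6 row
challenger = lambda row: 1 if row[0] > 0.5 else 0  # matches the labelling rule

promote = should_promote(champion, challenger, rows, labels)
```

In practice the metric, the evaluation set and the margin would be agreed between the ML team and the product owner; the point is only that the decision can be automated and repeated on every refresh.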

Finally, ML team members will have to start thinking critically about what integration testing is required to ensure the model won't "break" anything when being placed into production. To work seamlessly with the software delivery process, ML teams will need to learn how to automate model testing in conjunction with the automated software delivery testing process that may already be in use. Each of these changes within the ML team will have to be managed and rolled out in a thoughtful manner to ensure alignment along the way.
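Automated model testing can begin as a handful of checks that run in the same pipeline as the application's tests. The following is a minimal sketch under stated assumptions: `predict` is a hypothetical stand-in for a trained model, and the three checks (output range, determinism, tolerance of a missing field) are examples, not a complete test plan.

```python
import math
from typing import Dict, List


def predict(features: Dict[str, float]) -> float:
    """Stand-in for a trained model: returns a probability between 0 and 1."""
    score = 0.8 * features.get("usage", 0.0) - 0.5 * features.get("tenure", 0.0)
    return 1.0 / (1.0 + math.exp(-score))


def run_model_checks() -> List[str]:
    """Return the names of failed checks; an empty list means none failed."""
    failures = []
    sample = {"usage": 2.0, "tenure": 1.0}

    # Check 1: the output stays inside the agreed range.
    if not 0.0 <= predict(sample) <= 1.0:
        failures.append("output_out_of_range")

    # Check 2: the same input always yields the same prediction.
    if predict(sample) != predict(sample):
        failures.append("non_deterministic")

    # Check 3: a missing field degrades gracefully instead of raising.
    try:
        predict({"usage": 2.0})
    except Exception:
        failures.append("crashes_on_missing_field")

    return failures


failed_checks = run_model_checks()
```

A pipeline step could fail the build whenever `failed_checks` is non-empty, which keeps a model that would "break" the application from reaching production.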

Software delivery teams think of ML as typical software

Software delivery teams are often not taking full advantage of the ML efforts in their organization. At times, they may build a rules-based engine into an application, but it is not an ML model trained on historical data. There are a variety of reasons for this:

They may not be regularly interacting with ML team members.
ML models' inner workings are less familiar.
ML process steps are not well understood by software delivery teams (e.g., training, refreshing).
The format of the ML model code may be completely different than the application's code.
The rules-based engine is easier to work with, quicker to integrate and trustworthy.
Even when ML models are available, leveraging them in real time can be a challenge.

Product owners are not typically thinking about potential opportunities for ML models in the strategic stages of product development; they may not fully understand the new dimensions that an ML model can bring to their product. There is no formal process for how to think about ML when defining user stories upfront or for integrating the model refresh process once the application is delivered. Additionally, identifying the use cases where ML is the best approach is often non-trivial.

Leadership is not setting expectations on the importance of embracing this change

Evolving to the point where the ML and software delivery teams work in lock-step may be a difficult change for an organization. Both the ML and software delivery teams may be set in their ways and moving through their day-to-day work in a relatively rapid fashion. To scale, however, they both need to slow down to speed up.

As mentioned above, achieving a future where ML is a core competency of an organization requires education, flexibility and comfort with change. This expectation needs to be understood and embraced by the leadership team before the journey begins so when the teams come out on the other end, they are working at the speed of business. Ideally, leadership works to understand the potential ML can provide to the organization and actively demonstrates a strong desire to have it used by the enterprise.

Focus areas for overcoming these roadblocks

It is possible to overcome these roadblocks. Teams can organize and execute so that ML and software delivery teams, who may not have worked together before, can create integrated solutions at scale. There are several areas that an organization should focus on in order to make this a reality over the long term.

Leadership alignment and sponsorship

Alignment and sponsorship across functional and technical leadership are both instrumental to bringing data scientists and software delivery teams together. Leadership's understanding of the value of ML, and its stating the importance of this initiative to the business's objectives, cannot be overlooked. It need not be a complex statement; simply stating why this initiative can bring better value to the business, such as allowing software to achieve specific market advantages, is a sufficient rallying point.

Additionally, leadership needs to establish who will represent them as the group's first product owner. This product owner needs to be highly engaged throughout the initiative's duration, which may mean the individual needs to relinquish current responsibilities. Multiple members of leadership should plan on attending the weekly demonstrations. Leadership will gain a collective understanding of what is possible in advance of establishing additional teams and will be able to socialize the impact this new team is having early and often.

Education on both sides

Both the ML and the software delivery teams need to keep an open mind and be ready to learn about new ways of working. ML team members need to better understand agile and DevOps practices, while the software delivery team needs to learn more about how ML models are developed and their unique operational requirements once in production. These teams should ultimately work together seamlessly and be viewed as partners in the pursuit of creating value for the business.

Shared continuous integration and delivery

Today, best-in-class software delivery teams focus on a whole-team approach. Team members typically treat the code base with shared ownership, all driving to continuously deliver business value. Both user functionality and scale are verified through automated tests. Both the software and the tests are automatically deployed and continuously executed via a pipeline that continuously integrates and delivers the software to multiple environments such as development, testing, staging and production.

As ML has become mission critical, data scientists have been coalescing as teams and are now looking to adopt this same continuous delivery approach. While in software this is referred to as DevOps, in ML it is referred to as MLOps.

A software delivery team's continuous integration and continuous delivery (CI/CD) pipeline and a data science team's MLOps pipeline can be used to facilitate rapid testing and deployment. Specifically, the MLOps pipeline can deliver the usable model as a microservice fronted by RESTful APIs. While this may seem like a tactical technical item, it has tremendous business value.
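To make the idea of a model delivered as a microservice behind a RESTful API concrete, here is a minimal sketch using only the Python standard library's WSGI conventions. Everything specific in it is an assumption for illustration: the `/predict` route, the JSON shape with a `features` object, and the toy `predict` function standing in for a real trained model.

```python
import io
import json


def predict(features):
    """Stand-in for a trained model; a real service would load an artifact."""
    return 1 if features.get("amount", 0) > 100 else 0


def respond(start_response, status, payload):
    """Send a JSON body with the given HTTP status line."""
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def app(environ, start_response):
    """Minimal WSGI app exposing the model at POST /predict (hypothetical route)."""
    if (environ.get("REQUEST_METHOD") != "POST"
            or environ.get("PATH_INFO") != "/predict"):
        return respond(start_response, "404 Not Found", {"error": "not found"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        features = payload["features"]
        if not isinstance(features, dict):
            raise TypeError("features must be an object")
    except (ValueError, KeyError, TypeError):
        return respond(start_response, "400 Bad Request",
                       {"error": "expected a JSON body with a 'features' object"})
    return respond(start_response, "200 OK", {"prediction": predict(features)})


def call(method, path, payload):
    """Invoke the app in-process, the way a pipeline test might."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {"REQUEST_METHOD": method, "PATH_INFO": path,
               "CONTENT_LENGTH": str(len(raw)), "wsgi.input": io.BytesIO(raw)}
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    chunks = app(environ, start_response)
    return seen["status"], json.loads(b"".join(chunks))
```

Because the service is an ordinary WSGI callable, the software delivery team can exercise it with `call(...)` in automated tests, and it could be served locally with `wsgiref.simple_server.make_server` for manual exploration. A production deployment would sit behind whatever serving stack the organization already uses.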

This makes it possible for models to be incorporated early and often by the software delivery team so that they reach the business more quickly. To further accelerate automation, a DevOps practitioner (often known as a delivery engineer) can work across both the software delivery and ML teams to create a CI/CD pipeline that builds the ML model's continuous training and deployment pipeline. This is in a similar vein to a software development team leveraging a CI/CD pipeline to continuously deliver software. For the MLOps pipeline, the delivery engineer is acting as an ML infrastructure engineer.

With early integration successes, coupled with early user feedback from the product owner and alpha testers, the ML model has a true chance of providing business value with the software's release. An initial set of user stories can be written that immediately calls for creating an ML model, along with a use case for the software that implements it.

There are several advantages for doing this right out of the gate. First, teams have fewer technical barriers that would impede technical collaboration. They can see the results of their working together in near real time. Second, this facilitates A/B testing of different models to be sure the most predictive model is being used at any given time. This is similar to how software delivery teams conduct side-by-side tests to glean what resonates the most with their users. Finally, leadership can see the collaboration coming to fruition and continue endorsing the path forward.
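Side-by-side testing of two models requires a way to split traffic between them. One common approach, sketched here as an assumption rather than as the method this article prescribes, is to hash a stable user identifier so that each user consistently sees the same model. The function name `assign_variant` and the 10% challenger share are illustrative.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, challenger_share: float = 0.1) -> str:
    """Deterministically route a user to the 'champion' or 'challenger' model."""
    if not 0.0 <= challenger_share <= 1.0:
        raise ValueError("challenger_share must be between 0 and 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "challenger" if bucket < challenger_share else "champion"


# Simulate 10,000 users to see how traffic would be divided.
counts = Counter(assign_variant(f"user-{i}", 0.1) for i in range(10_000))
```

Hashing rather than random sampling means no assignment table has to be stored, and a user's experience does not flip between requests.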

Creating continuous delivery pipelines that allow for software to quickly execute against continuously deployed ML models facilitates exploring which ML techniques are the best for the software use case. One significant use is that a CI/CD pipeline can assist ML teams in ingesting data at scale and in leveraging automation executed via the pipeline to assist in ensuring model veracity.
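As one example of automation that supports model veracity, a pipeline step can validate incoming records before they reach training. The sketch below is a simplified assumption: the `SCHEMA` fields, their types and their ranges are invented for illustration, and a real pipeline would likely rely on a dedicated data validation tool.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: field name -> (expected type, minimum, maximum).
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}


def validate_row(row: Dict[str, object]) -> List[str]:
    """Return the problems found in one record; an empty list means clean."""
    problems = []
    for field, (expected_type, low, high) in SCHEMA.items():
        value = row.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not isinstance(value, expected_type) or isinstance(value, bool):
            problems.append(f"{field}: expected {expected_type.__name__}")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


def split_batch(rows: List[Dict[str, object]]) -> Tuple[list, list]:
    """Separate clean rows from rejected rows, keeping the reasons for rejection."""
    clean, rejected = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            clean.append(row)
    return clean, rejected


batch = [
    {"age": 34, "income": 52000.0},
    {"age": -3, "income": 41000.0},   # out-of-range age
    {"age": 51},                      # income missing
]
clean, rejected = split_batch(batch)
```

Keeping the rejection reasons alongside each row gives the ML team something concrete to review when a data source starts drifting.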

Software delivery teams can access these productionized models much as they would any other software component. Once models are ready to be deployed, they can be serialized, pushed to cloud storage and exposed via REST endpoints. With the APIs discoverable (such as via a software library repository), the models can be more readily leveraged by other software delivery teams in the organization.
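The serialize-and-publish step can be sketched as follows. This is a minimal illustration under several assumptions: a local temporary directory stands in for cloud storage, the model is reduced to a dictionary of weights serialized as JSON, and the `publish_model` and `load_latest` helpers and the `v<N>.json` naming scheme are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def publish_model(store: Path, name: str, version: int, weights: dict) -> Path:
    """Serialize model parameters and write them under a versioned path."""
    target = store / name / f"v{version}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(
        {"name": name, "version": version, "weights": weights}))
    return target


def load_latest(store: Path, name: str) -> dict:
    """Load the highest-numbered version, as a prediction service might at start-up."""
    versions = sorted((store / name).glob("v*.json"),
                      key=lambda path: int(path.stem[1:]))
    if not versions:
        raise FileNotFoundError(f"no published versions of {name!r}")
    return json.loads(versions[-1].read_text())


# A local temporary directory stands in for a cloud storage bucket.
store = Path(tempfile.mkdtemp())
publish_model(store, "churn", 1, {"usage": 0.8})
publish_model(store, "churn", 2, {"usage": 0.7, "tenure": -0.5})
latest = load_latest(store, "churn")
```

Keeping every version addressable, rather than overwriting a single file, is what lets a team roll back or compare models later.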

*Figure 2: MLOps teams use the CI/CD pipeline to automate the deployment of their MLOps pipelines (CT pipeline + prediction service), deploying ML models as a microservice so that software delivery teams can easily access these prediction services (diagram from [1], expanded to include the software delivery team).*

Shared agile practices

Success cannot be achieved through ML deployment pipelines alone. One critical aspect is how the ML and software delivery teams work together. Leveraging agile processes facilitates this.

Particularly for these two groups, agile provides several benefits. The shared team is able to rally around a set of shared, prioritized goals from a single product owner. Regular, frequent team interaction occurs through daily stand-ups, weekly product demonstrations and regular retrospectives. Based on the learnings from the demonstrations and retrospectives, the team collectively rallies around improvement areas and pivots. Test-driven development shows how software that takes advantage of ML models can be rigorously tested.
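One way test-driven development can work for software that consumes a model is to write the application's tests against a stubbed model first, so the application logic is pinned down before the real model is ready. The sketch below assumes a hypothetical `approve_order` function and illustrative risk thresholds of 0.5 and 0.8.

```python
import unittest


def approve_order(order_total: float, fraud_model) -> str:
    """Application logic under test: decide what to do with an order."""
    risk = fraud_model(order_total)
    if not 0.0 <= risk <= 1.0:
        raise ValueError("model returned a score outside [0, 1]")
    if risk >= 0.8:
        return "reject"
    if risk >= 0.5:
        return "review"
    return "approve"


class ApproveOrderTest(unittest.TestCase):
    """Each test swaps in a stub model that returns a fixed score."""

    def test_low_risk_is_approved(self):
        self.assertEqual(approve_order(20.0, lambda total: 0.1), "approve")

    def test_medium_risk_goes_to_review(self):
        self.assertEqual(approve_order(20.0, lambda total: 0.6), "review")

    def test_high_risk_is_rejected(self):
        self.assertEqual(approve_order(20.0, lambda total: 0.95), "reject")

    def test_invalid_score_raises(self):
        with self.assertRaises(ValueError):
            approve_order(20.0, lambda total: 1.7)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApproveOrderTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the model is passed in as a parameter, the same tests keep working when the stub is replaced by a client for the real prediction service.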

A transition guide

Tactical steps are available to avoid and overcome typical roadblocks when starting a new initiative. In lieu of a "big bang" approach that demands drastic change (e.g., all software delivery teams inject ML into their software and all ML teams must now use an MLOps platform and integrate with the CI/CD methodologies and technology), a small steering team can be created. This team can define a single use case, considered the minimum viable product (MVP), in which to build, iterate, learn and deliver. First, let's make some assumptions about the current state of an organization that is about to start their journey.

Baseline starting point

Organizations that are looking to bring ML and software delivery together and make ML a core competency have a repository of relatively high-quality data at their fingertips. They have at least one pocket of data scientists building predictive models and at least one model in production making predictions. They have felt the pain of getting an ML model into production and are concerned that their current methods of injecting ML into software are not scalable. They typically have several use cases ready in the backlog that they are confident will provide business value if they are able to develop, deploy and iterate in a streamlined fashion.

On the software delivery side, these organizations have explored agile development in some capacity and are looking to move to a more continuous integration and continuous delivery model if they aren't already maturely there. Finally, leadership understands the benefits and upfront investment needed to bring these two teams together and is ready and willing to be that change agent.

Step 1: Assess

Getting a detailed picture of the current state gives the organization a starting point. The small steering team should assess the current state of both the software delivery teams and the ML teams. From the software perspective, example topics to be assessed are:

What are the software delivery processes and methodologies being used?
How do software delivery teams organize around use cases?
What are the preferred tools?
What are the organizational capabilities for creating and deploying microservices?

On the ML side, example topics to be assessed are:

What are the current preferred processes and tools?
What are the major bottlenecks that they faced when placing their models into production?
Are these teams working mainly in a public cloud, on-premises, etc.?

This can be performed via surveys and/or interviews and the leaders who are sponsoring this effort should be looped into this process. Finally, a roadmap can be created for the transition plan that is customized to the current state. It is critical to over-communicate these findings so that everyone is aligned on next steps.

Step 2: Educate and align

After the current state baseline has been established and a roadmap has been created, both teams need to start making their way up the learning curve independently and aligning on how they work together moving forward.

The ML team needs to ensure education is obtained on agile methodologies, including integrating and delivering in a continuous fashion. They need to learn how their counterparts on the software side do this, and also become very adept with processes and tools specifically related to delivering ML capabilities in a continuous fashion through an MLOps platform.

The software delivery team needs to educate themselves on the value of including ML models in their software and define how the software project will interface with outputs from an ML model (i.e., how the contract is defined). In addition, the upfront design and brainstorming work that helps initially define the value of a software application (e.g., storyboarding, wireframing) needs to incorporate how and where ML models can be injected.
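The contract between the application and the model can be written down as typed request and response structures that both teams agree on. The following is a sketch under assumptions: the field names (`customer_id`, `features`, `score`, `model_version`) and the rule that scores lie in [0, 1] are invented for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PredictionRequest:
    customer_id: str
    features: dict


@dataclass(frozen=True)
class PredictionResponse:
    customer_id: str
    score: float          # probability in [0, 1]
    model_version: str    # lets the application log which model answered

    def __post_init__(self):
        # Enforce the agreed output range at the boundary.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score {self.score} violates the contract")


def parse_response(payload: dict) -> PredictionResponse:
    """Validate a raw service reply against the contract before the app uses it."""
    return PredictionResponse(
        customer_id=str(payload["customer_id"]),
        score=float(payload["score"]),
        model_version=str(payload["model_version"]),
    )


request = PredictionRequest("c-17", {"usage": 2.0})
reply = parse_response(
    {"customer_id": "c-17", "score": 0.42, "model_version": "2"})
```

With a contract like this in place, data scientists can retrain or replace the model freely as long as the reply still parses, and the software delivery team learns about a breaking change from a failed validation rather than from a production incident.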

Step 3: Build the MVP together

After selecting the use case for the MVP, which should be based on the value to the organization and its simplicity, there are three equally important objectives for the team:

The ability to continuously deliver working software leveraging ML models that may be updated by data scientists asynchronously through their MLOps platform.
An operating model in which ML and software delivery teams execute as one team, following one shared agile process: combined daily stand-ups, customer demonstrations and the same deployment model.
The ability to capture lessons learned for a broader roll-out to the enterprise and the iterative conveying of these learnings. Many of these learnings can be gathered through a regular team retrospective that is predominately focused on current team improvements.

The MVP should define a way for APIs to front the ML model being created so it can be accessed via software. The team should establish that one of the software engineers be focused, at least part-time, on the model to facilitate the API creation. Additionally, the team should establish that one of the ML team's data scientists be focused, at least part-time, on how the software is going to leverage the ML model to its fullest extent. For the MVP, the model may not be performant out of the gate, but it is critical to productionize it on a regular cadence. The main focus is to ensure that the integration with software can be tested and proved out in an environment as close to production as possible.

Organizational success

*Figure 3: Mature integrated ML and software ecosystem*

Successfully bringing together ML and software delivery teams allows organizations to achieve business outcomes at scale that historically were not achievable:

The organization's strongest technical mindshare is brought together and has both an agile process and a technical methodology for working together.
The organization gains demonstrable experience of quickly shipping software that leverages ML models.
ML models are created in such a way that they can be published and reused across applications via microservices similar to other software components.
Software applications swap out models behind the same API, based on A/B testing, so that they always serve the most relevant predictions.
Software and model quality are strengthened by the reliability that comes with continuous delivery and automated tests.

Organizations can transition toward this state no matter where they are in their ML and software development journeys. By starting small, initial successes can be achieved that bring value to the organization; teams can learn these new methodologies and work better together.

References

[1] Architecture for MLOps using TFX, Kubeflow Pipelines, and Cloud Build. Last Updated 2020-04-20. Accessed on 2020-06-09.