
In the fourth article of our Reinventing AI R&D series, we recap the developments of the past four months in project tracking, AI R&D branding and the build-out of our MLOps capability. But first, let's take a brief look back at how far this program has come since the first Reinventing AI R&D post just over a year ago.

Following the first few months of the program's operations, the first AI R&D Operations Team rotation saw an opportunity to add rigor to the program's overall structure in order to address fundamental organizational challenges. In our first article, we focused on the people and processes driving the project work, establishing a formal program structure and outlining the various roles. We also defined the rotational periods, resources and responsibilities for each role.

The three roles outlined were:

  1. Project Selection Panel (PSP) – Selects the projects to execute based on their business and technical relevance to the program's overall mission.
  2. Data Science & Engineering Research Team – Performs core R&D project work.
  3. Operations Team – Composed of consultants who support program administration and development.

Depending on their role, members of these groups cycle in and out yearly, quarterly or monthly, allowing for a constant flow of ideas that expand the team's data science and engineering skills through hands-on experience.

With the right foundation of people and processes, we transitioned during the second rotation to building out a framework for the program's discovery and model training platform. As reflected in our second article, we built a containerized, cloud-first platform leveraging a range of software, hardware and tools, spanning both the cloud and the ATC, to facilitate R&D work and accelerate novel findings.

In our third and most recent article, we expanded on our capabilities by reassessing the methods by which data scientists work on projects in the backlog. We also introduced additional methodologies, notably MLOps, by which our data science teams push models into production. In developing this process, we accomplished two things: 

  • Strengthened our foundations with a focus on streamlining the processes and people-side of the platform.
  • Expanded our platform capabilities with a goal of increasing the R&D work produced by our team.

These developments have been crucial in getting us to where we are today. With the above foundation in place, the most recent rotation has been working to successfully build upon the program's current-state capabilities and features to promote its advancement in two ways: 

  • Enhancing our platform by overlaying ML workflow tools and technologies on top of the existing model training platform to support an MLOps capability.
  • Standardizing our processes with project enablement and management, as well as with the publication and marketing of the project deliverables.

In addition to these platform and process changes, the AI R&D program had a unique opportunity to contribute to COVID-19 research efforts. With the foundations and iterative improvements to our people and processes in place, we were able to act in an agile manner and shift our focus toward participating in the multitude of COVID-19 projects underway.

Enhancing our platform

In the previous rotation, we introduced the concept of MLOps and our intention to develop this capability internally to streamline and scale out the support of our R&D projects. To help drive this operational pivot, we further integrated our colleagues from Application Services into the AI R&D program. Their work alongside our data science and engineering team contributed to significant progress over the past couple of months.

Figure 1. High-level machine learning workflow

To refresh, MLOps is the concept of applying DevOps principles to the machine learning (ML) development process (outlined in Figure 1) to create a more streamlined and documented continuous integration/continuous delivery (CI/CD) pipeline. The goal is to facilitate the scaling out and productionalization of ML models in an efficient and repeatable manner. To do so, there must be a focus on more than just the model code. As Google's seminal paper on MLOps describes, the ML system is vast and complex, made up of the resources and tools at each stage of the ML workflow. Without MLOps principles, the surrounding ML infrastructure can become burdensome and ML workloads become susceptible to fundamental challenges, including:

  • Technical debt at a system level.
  • Tedious and manual productionalization of models.
  • Lack of reproducibility and repeatability.
  • Lack of model and data versioning.

These challenges are only compounded as data science programs mature and become more expansive and complex over time.

With these challenges in mind, we embarked on our journey of applying MLOps to our AI R&D program. We needed frameworks and tools to support the surrounding ML infrastructure and workflow, as well as to help drive three foundational MLOps concepts:

  • Automation/Orchestration – Containerization of code to promote portability as a means to eliminate manual effort and technical debt.
  • Collaboration – Interchangeable code components with a goal of reuse and open-sourcing of code across different ML projects.
  • Innovation/Iteration – Versioning and documented data science work to promote reliable refresh and modification of ML models.
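To make the third concept, innovation/iteration, concrete, the snippet below sketches one lightweight way to version a training run: hashing the training data and recording it alongside the model artifact so a result can be traced and reproduced later. This is an illustrative, standard-library-only sketch rather than a tool our platform prescribes; the file paths and metadata layout are hypothetical.

```python
# Illustrative sketch: tie a model artifact to the exact data and parameters
# behind it. Paths and the metadata schema are hypothetical placeholders.
import hashlib
import json
import time
from pathlib import Path

def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_run(data_path: Path, model_path: Path, params: dict) -> Path:
    """Write a small JSON record linking a model artifact to its inputs."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "data_sha256": hash_file(data_path),
        "model_artifact": str(model_path),
        "hyperparameters": params,
    }
    out = model_path.with_suffix(".run.json")
    out.write_text(json.dumps(record, indent=2))
    return out

# Hypothetical usage:
# record_run(Path("data/train.csv"), Path("models/classifier.pkl"),
#            {"learning_rate": 0.01, "epochs": 20})
```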

To help put these concepts into practice, we began by surveying our data science team to gain insight into how they worked, where most of their effort went, where they saw areas for improvement and which ML pipeline capabilities would be most valuable to their work. This information allowed us to hone our research, focusing on the industry stories most pertinent to our specific use cases and learning from past examples how best to bring the capabilities associated with MLOps to bear.

For this initial leap, given our program's current needs and level of maturity, we landed on Kubeflow and TensorFlow Extended (TFX). Kubeflow is flexible across model types and can scale containers elastically as prediction workloads change. Likewise, TFX provides compatibility and integration with many existing data science tools. Both platforms are relatively easy for our data scientists and engineers to learn, which will facilitate the rapid deployment of ML pipelines and deliver the value of MLOps sooner. We use several other MLOps tools alongside Kubeflow and TFX, but they can be interchanged based on project-specific technical requirements.
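To give a flavor of what a pipeline definition looks like in Kubeflow, the sketch below chains two containerized steps using the v1 kfp SDK so the training step consumes the preprocessing step's output. The container images, commands and file paths are hypothetical placeholders, not components of our platform.

```python
# Minimal Kubeflow Pipelines sketch (kfp v1 SDK): two containerized steps,
# with the trainer consuming the preprocessor's output file.
import kfp
from kfp import dsl

@dsl.pipeline(name="demo-pipeline",
              description="Preprocess data, then train a model.")
def demo_pipeline():
    # Hypothetical image and command; file_outputs exposes the cleaned data.
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="registry.example.com/preprocess:latest",
        command=["python", "preprocess.py"],
        file_outputs={"data": "/tmp/clean_data.csv"},
    )
    # Passing the output as an argument creates the dependency between steps.
    dsl.ContainerOp(
        name="train",
        image="registry.example.com/train:latest",
        command=["python", "train.py"],
        arguments=["--data", preprocess.outputs["data"]],
    )

if __name__ == "__main__":
    # Compile to a spec that can be uploaded to a Kubeflow cluster.
    kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```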

Figure 2. Data science survey questions with sample results from the 13 respondents

Our next step before building out the entire MVP reference architecture was to home in on the Deployment and Monitoring stage and apply TensorFlow to an already-published project, "Image Classification of Race Cars." We did this to demonstrate TensorFlow's capabilities, in particular TensorFlow Serving, as an MLOps tool. The result was a successfully deployed TensorFlow Serving instance and a Flask car classifier demo web application that lets a client request an image classification in a readily available and repeatable manner. Without TensorFlow Serving, delivering results from the image classification model would be a manual process for the end user. This demo showcased the power of TensorFlow while also giving us working experience with the tool.

Figure 3. TensorFlow Serving demo for deploying the serialized race car classifier model
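To illustrate the kind of client call the demo enables, the sketch below shows a minimal Flask endpoint that forwards an uploaded image to TensorFlow Serving's REST predict API. The model name, port, input size and class labels are hypothetical placeholders rather than the demo's actual configuration.

```python
# Sketch of a Flask endpoint that forwards an uploaded image to a
# TensorFlow Serving instance over its REST API. The model name, input
# size and labels below are hypothetical placeholders.
import io

import numpy as np
import requests
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

# TensorFlow Serving exposes a REST predict endpoint of the form
# http://<host>:8501/v1/models/<model_name>:predict
TF_SERVING_URL = "http://localhost:8501/v1/models/race_car_classifier:predict"
CLASS_NAMES = ["formula_1", "nascar", "rally"]  # hypothetical labels

@app.route("/classify", methods=["POST"])
def classify():
    # Decode the uploaded image and resize it to the model's expected input.
    image = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    image = image.resize((224, 224))
    batch = (np.asarray(image, dtype=np.float32) / 255.0)[np.newaxis, ...]

    # TF Serving expects a JSON body with an "instances" list and returns
    # a "predictions" list in the same order.
    response = requests.post(TF_SERVING_URL, json={"instances": batch.tolist()})
    response.raise_for_status()
    scores = response.json()["predictions"][0]

    return jsonify({"label": CLASS_NAMES[int(np.argmax(scores))],
                    "scores": scores})

if __name__ == "__main__":
    app.run(port=5000)
```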

The next step in our journey was to apply MLOps to not just the Deployment and Monitoring stage but the full end-to-end ML workflow. This would leverage the full scope of our existing AI R&D platform and its tools and resources as represented in Figure 4. 

The existing AI R&D platform architecture was set up and supported with an ML-forward design in mind. Overlaying the ML workflow onto the platform highlights the alignment in architecture that will help accelerate the value of the MVP. Among the many tools in use, the two selected at the onset, Kubeflow and TFX, play big parts in all this. The Kubeflow pipelines are where the magic of MLOps happens: Kubeflow logs the data movements, datasets and transformations at each step and provides the automation and orchestration that integrates one stage with the next, making it possible to replicate a pipeline's setup and results at different points in time. Meanwhile, TFX helps build those ML pipelines and their data engineering components, including the model validation and evaluation stages, in a scalable and automated manner. Together, they form the foundation for performing project work in the environment.

Figure 4. Reference architecture of the MVP MLOps platform overlaid on top of the AI R&D platform architecture
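For a sense of how TFX declares these pipeline stages, the sketch below wires a few standard TFX components together using the tfx v1 API, from data ingestion through pushing a trained model to a serving directory. The paths, trainer module file and step counts are hypothetical placeholders, and a production deployment would typically use the Kubeflow orchestrator rather than the local runner shown here.

```python
# Sketch of a TFX pipeline (tfx v1 API) chaining standard components from
# data ingestion through pushing a trained model. Paths and the trainer
# module file are hypothetical placeholders.
from tfx import v1 as tfx

def build_pipeline(data_root: str, module_file: str,
                   pipeline_root: str, serving_dir: str) -> tfx.dsl.Pipeline:
    # Ingest CSV data and emit examples for downstream components.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # Compute statistics and infer a schema for data validation.
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])

    # Train a model defined in a user-provided module file.
    trainer = tfx.components.Trainer(
        module_file=module_file,
        examples=example_gen.outputs["examples"],
        train_args=tfx.proto.TrainArgs(num_steps=1000),
        eval_args=tfx.proto.EvalArgs(num_steps=100))

    # Push the trained model to a directory TensorFlow Serving can watch.
    pusher = tfx.components.Pusher(
        model=trainer.outputs["model"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory=serving_dir)))

    return tfx.dsl.Pipeline(
        pipeline_name="demo_tfx_pipeline",
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen, trainer, pusher])

if __name__ == "__main__":
    # The local runner keeps the sketch self-contained; Kubeflow Pipelines
    # can serve as the orchestrator in a cluster deployment.
    tfx.orchestration.LocalDagRunner().run(
        build_pipeline("data/", "trainer_module.py",
                       "pipelines/demo/", "serving/demo/"))
```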

Fully embracing MLOps, from building the architecture and technologies to establishing the processes and appropriate project work, is a significant step forward for our AI R&D program. This should help address fundamental challenges we experience daily when working with ML infrastructure and workflows. We also anticipate more companies in all verticals encountering similar challenges as they advance up the data maturity curve and leverage ML models as part of their daily operations. With the hands-on experience we will gain, we will be in a prime position to advise and guide our customers on their own journeys.

Standardizing our processes

As our program continues to expand, it has become increasingly important to foster continued standardization in our processes. This will make onboarding smoother for the resources from our Application Services department and the WWT-dedicated solution architects from vendor partners. Additionally, this practice will increase consistency in how users access the output of our R&D projects. We accomplished this goal in four ways:

  1. Evangelization of our central project tracking system across siloed ad-hoc projects
  2. Improvements to the knowledge transfer process
  3. Standardization of AI R&D output material
  4. A new dedicated AI R&D platform page

Evangelization of our central project tracking system across siloed ad-hoc projects

In late 2019, we opened our project backlog to give data scientists not on rotation the chance to kickstart projects with preliminary research and data gathering at their discretion.

With the potential for several projects to be ongoing at any given time, we needed a way to uniformly log ad-hoc project updates. We weighed several software options, including Microsoft Planner, Trello (the system used for on-rotation projects) and Jira. In the end, we chose Microsoft Planner because it meets our current needs, can be hosted through Microsoft Teams (our R&D document repository and collaboration tool) and scales easily as ad-hoc projects come and go.

Improvements to the knowledge transfer process

Since the inception of the program, a key dynamic of the team has been its three-month rotational structure that allows for a constant flow of ideas from different consultants, data scientists and data engineers. As the team welcomes new members, individuals may pick up work that is in progress or be asked to improve models on ongoing projects. Whatever the situation, one thing is certain: an efficient, high-quality knowledge transfer methodology is needed. Making knowledge transfer more efficient requires well-documented processes and tracking methods for every project and task.

As a first step, we met with the entering rotational data science team to understand their knowledge transfer experience and to explore areas of opportunity. The team decided that each project should have its own Kanban board (Trello/Microsoft Planner) to track individual tasks and their status. A Kanban board is an agile project management tool designed to help visualize work, understand work in progress, maximize efficiency and identify blockers. This new method of documenting work will enhance the team's visibility and improve everyone's understanding of which tasks have been completed and which remain in the backlog.

In addition, we felt that incoming members of the rotation program would benefit from a high-level overview of each project to understand why certain steps were taken and why a specific direction was chosen. The format of this deliverable will be decided by each team, but everyone is encouraged to create a presentation outlining the project, the current approach and possible next steps.

Currently, the AI R&D program is experiencing an influx of members from other departments who are contributing to specific tasks. Improving knowledge transfer is therefore essential to the AI R&D team's growth.

Standardization of AI R&D output material

Another area of improvement was the presentation of our program. We looked to provide content that is consistent and clearly attributable to the AI R&D program, as well as to formalize the way we conduct projects and publish findings. With those goals in mind, we needed a standard template and logo to include with all our published material: demos, articles, ATC Insights, white papers and presentations.

We created the WWT Artificial Intelligence Research & Development team logo to represent the collective ongoing effort between the Business and Analytics Advisors and Application Services groups. We also updated the existing white papers with a closing page that describes the AI R&D program and links back to our WWT platform page. Since these white papers are often distributed at conferences to demonstrate thought leadership in AI/ML, this gives readers a way to find us later and learn more about our program.

Figure 5. Official logo of the AI R&D program featured in all our outputs

A new dedicated AI R&D platform page

Having standardized the AI R&D program's documentation and outputs, we next needed a centralized location where people could download our white papers, read our articles and easily learn more about the program. Until now, most AI R&D work had been disseminated by existing customers passing it along, word of mouth or ad-hoc emails.

To provide internal and external users with a streamlined way to engage our program directly, we created a separate landing page on the WWT platform. This is a game changer for many personas, from account managers wanting to learn how we have engaged with some of the most pressing use cases to companies wanting to leverage our expertise to build their own internal capabilities. Additionally, as the program expands to include more people from WWT and through strategic partnerships, it becomes even more important to keep track of the various personas interacting with our material.

This works in tandem with our ongoing marketing efforts to develop a more holistic picture of how our platform is being used and by whom. Funneling as many interactions with our program's material as possible through the platform page also lets us more actively log interactions, facilitating better follow-up measures to ensure users find relevant material and answers.

Next steps

Going forward, MLOps will continue to be the focus of the AI R&D team as we look to complete the MVP platform architecture and push this methodology throughout the program with all future projects. To accomplish this, we will start with an MLOps natural language processing (NLP) proof of concept (POC) project to showcase how the MLOps platform can automate the complex feature engineering often needed in data science projects.

Fully integrating MLOps into the program will require knowledge transfers, ongoing support from data engineers and continued testing by data scientists to standardize workflows. Additionally, as the platform is implemented and tested internally, a group of team members will be assisting with the organizational change management needed to adopt this new process throughout the program. Keep an eye out for Part V of Reinventing AI Research & Development, which will include a deeper dive into our MLOps progress and how the data science team is working together to adopt the new platform.