Reinventing AI Research & Development: Part III

In our first blog post, we discussed the people and processes around the AI R&D program, detailing the structure of the program and the roles within it. In our second post in the series, we focused on the need for our AI R&D team to build a modern AI platform to accelerate the development and production of AI models.

The two key considerations taken into account when designing and building this platform were:

Flexibility and extensibility: Ability to handle a variety of data sets, workloads and algorithms to be used across a variety of AI R&D projects and respond to the quickly changing landscape of AI.
Mimicking future AI platforms: Anticipate what our customers' AI platforms will look like in the future and gain hands-on experience with these cutting-edge capabilities.

Since then, our program has experienced changes around the platform, people and processes. The white papers published so far have gained visibility, and customers are increasingly interested in AI capabilities.

This blog post builds on the topics above and describes how we have:

Strengthened our foundations with a focus on streamlining the processes and people-side of the program.
Expanded our platform capabilities with the goal of broadening the kinds of R&D work our team can produce.

Strengthening our foundations

As our team grows, the solidification of people and processes will allow the AI R&D work to scale and become more streamlined. The main areas we are streamlining in the near term include:

Adding an ad hoc method for data scientists to work on projects in the certified backlog.
Incorporating members from WWT's Application Services team to build out our "MLOps" capabilities.
Restructuring the Project Selection Panel (PSP) process.

Ad Hoc Backlog Review

Our AI R&D work has so far been driven by two data scientists rotating their time. We have realized the need for a more fluid structure, one that can run in parallel and bring in additional data scientists whenever their day-to-day work leaves them time. Additionally, with companies rapidly adopting AI for a multitude of applications, the current rotation model could not keep up with the increasing demand for AI work.

We adapted by introducing an ad hoc project model alongside the focused rotation program. The ad hoc model encourages all data scientists to work on projects in the certified backlog as time permits. Regular updates and check-ins keep this work ongoing and consistent while still giving data scientists flexibility. It runs in parallel with our focused rotation, in which two data scientists are dedicated to the AI R&D program. See Figure 1 for details.

Figure 1: Additional Ad-Hoc R&D Teams

Application Services

In addition to adding an ad hoc project model for data scientists, we have incorporated members from WWT's Application Services team. They will focus on building the automation needed for the productionalization of models, often called "MLOps" (discussed in detail below). Overall, we look forward to Application Services bringing best-in-class agile software development concepts to the AI space and helping our data scientists develop, test and productionize AI models in a more streamlined manner.

Project Selection Panel

While the Project Selection Panel (PSP) has been in place for about a year now, its process for selecting projects has yet to receive a close inspection. It has been a challenge for the PSP to make decisions without multiple meetings. Moreover, there have been no set deadlines for getting proposals into the queue, which has led to last-minute additions that weren't properly vetted.

To remedy these issues, we put in place a more structured selection process that starts at the beginning of every month. The changes to this process include:

A more structured and quantifiable method of vetting new projects.
Longer pitching sessions by shortlisted candidates, so the panel can better understand the value added for customers.
A hard schedule and deadlines around submissions.

Expanding our platform capabilities

Since our last blog post, we have continued to develop the platform in the areas highlighted in Figure 2. These areas can be categorized into two broad sections: data accessibility and MLOps.

Data accessibility

Data accessibility has been improved in two areas: the data architecture of the data lake, and the data catalog.

The first step in building a more accessible data lake is defining an organizational structure for the data within it (i.e., a data architecture). This will allow the data science team to have a more streamlined understanding of where to find different data sets and what to expect about their quality, subject matter and normalization. The platform has now been organized into three major zones: Raw, Conformed and Semantic.

The Raw Zone houses data which is completely untouched (i.e., in its raw form).
The Conformed Zone houses data that has been manipulated, but in ways which did not result in a loss of information (e.g., format normalization).
The Semantic Zone houses data that has been manipulated in ways that bring business value and may be reusable across projects, but that may also have resulted in a loss of information (e.g., running averages).
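As a rough illustration of how the three zones differ, the sketch below moves a handful of records from raw to conformed to semantic form. The record fields (ts, lap_time), the two date formats and the window size are hypothetical and are not taken from our platform; they only show the difference between a lossless format normalization and a lossy running average.

```python
from datetime import datetime
from statistics import mean

# Hypothetical raw records, stored exactly as received (Raw Zone).
raw_zone = [
    {"ts": "2020-01-03", "lap_time": "48.2"},
    {"ts": "01/04/2020", "lap_time": "47.9"},
    {"ts": "2020-01-05", "lap_time": "48.5"},
    {"ts": "01/06/2020", "lap_time": "47.6"},
]


def parse_date(text):
    """Accept the two date formats that appear in the hypothetical raw sample."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def conform(records):
    """Conformed Zone: normalize formats only; no measurement is dropped or aggregated."""
    return [
        {"ts": parse_date(r["ts"]).isoformat(), "lap_time": float(r["lap_time"])}
        for r in records
    ]


def running_average(records, window=2):
    """Semantic Zone: a reusable derived feature; the individual values are not kept."""
    values = [r["lap_time"] for r in records]
    return [
        {
            "ts": records[i]["ts"],
            "avg_lap_time": round(mean(values[i - window + 1 : i + 1]), 3),
        }
        for i in range(window - 1, len(records))
    ]


conformed_zone = conform(raw_zone)
semantic_zone = running_average(conformed_zone)
```

The conformed records still contain every original measurement, while the semantic records contain one fewer row and only the averaged values, which is the loss of information the Semantic Zone accepts in exchange for reusability.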

A data catalog was established to let colleagues quickly familiarize themselves with the data, cutting down on knowledge-sharing time and increasing efficiency during the discovery phase. Members of the AI R&D Operations team will be responsible for the organization and upkeep of the data catalog. A process is currently being developed to ensure that any new data brought into the AI R&D platform is cataloged appropriately.
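To make the idea concrete, here is a minimal sketch of what a catalog entry and a keyword search could look like. This is not the catalog tooling we use; the class names, fields and sample entries are all hypothetical, and the only detail taken from the platform is the set of three zone names.

```python
from dataclasses import dataclass

# Zone names follow the data architecture described above.
ZONES = ("raw", "conformed", "semantic")


@dataclass(frozen=True)
class CatalogEntry:
    """One data set in the catalog (all fields are illustrative)."""
    name: str
    zone: str
    description: str
    owner: str
    tags: tuple = ()

    def __post_init__(self):
        # Reject entries that do not belong to a known zone.
        if self.zone not in ZONES:
            raise ValueError(f"unknown zone: {self.zone!r}")


class DataCatalog:
    """In-memory catalog keyed by data set name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # New data must be cataloged exactly once.
        if entry.name in self._entries:
            raise ValueError(f"already cataloged: {entry.name!r}")
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return entries whose name, description or tags mention the keyword."""
        keyword = keyword.lower()
        return [
            e for e in self._entries.values()
            if keyword in e.name.lower()
            or keyword in e.description.lower()
            or any(keyword in t.lower() for t in e.tags)
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="race_images_raw",
    zone="raw",
    description="Unmodified race photographs",
    owner="ai-rd-ops",
    tags=("images", "racing"),
))
catalog.register(CatalogEntry(
    name="sensor_running_avg",
    zone="semantic",
    description="Running averages of sensor readings",
    owner="ai-rd-ops",
    tags=("timeseries",),
))

matches = [e.name for e in catalog.search("images")]
```

Validating the zone at registration time is one simple way to enforce that new data is cataloged against the agreed architecture rather than in an ad hoc location.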

MLOps

Before talking about MLOps, it is important to note the difference here between MLOps and AIOps. AIOps is a more specific AI-driven category of the broader automation trend in IT, but not all forms of automation would necessarily be considered AIOps. According to Gartner's definition, "AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection, and causality determination." Put another way, it's about using AI and data to automatically identify or handle issues that would have once depended on a person to take care of manually.

Our work dealt only with MLOps, which refers to a consistent and streamlined methodology for promoting an AI model through testing and into production. We see this as an area ripe for disruption: the software development world has mature methodologies to continuously develop, test and ship software, yet these practices have not yet taken hold in the AI space.

There are subtle differences between developing software and developing AI code that make the productionalization process less clear for AI. For this work, we are collaborating with our Application Services peers, a team with significant DevOps and Agile methodology expertise, to tackle this problem with the goal of developing a best-in-class MLOps model for our customers.

To build out the Model Production Zone, we leveraged a real use case from a past R&D project. We chose the NASCAR image classification model (Image Classification for Race Cars) as a good place to start.

For some background, thousands of pictures are taken of the cars racing around the track during a NASCAR race. Each team has access to these images through a shared folder but has to manually sift through them to find the cars of interest. The model automates that process by using an ensemble of eight CNN models to identify the cars and the number corresponding to each car.
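This post does not describe how the outputs of the eight models are combined. One common way to combine an ensemble is soft voting, where the per-class probabilities from each model are averaged and the highest average wins. The sketch below assumes that approach purely for illustration; the function name, the car numbers and the probabilities are made up.

```python
def soft_vote(probabilities_per_model):
    """Average per-class probabilities across models and return (label, score)."""
    n_models = len(probabilities_per_model)
    if n_models == 0:
        raise ValueError("at least one model output is required")

    # Sum each label's probability over all models.
    totals = {}
    for probs in probabilities_per_model:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p

    # Labels a model did not report contribute zero for that model.
    averaged = {label: total / n_models for label, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


# Hypothetical outputs of eight models for a single image:
# five lean toward car number 18, three toward car number 19.
outputs = [{"18": 0.7, "19": 0.3}] * 5 + [{"18": 0.4, "19": 0.6}] * 3

label, score = soft_vote(outputs)
```

In this made-up case the averaged probability for car 18 is (5 × 0.7 + 3 × 0.4) / 8 = 0.5875, so the ensemble labels the image as car 18 even though three of the eight models disagree.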

The NASCAR model was chosen because it needs to be refreshed frequently (typically every race) and uses a relatively large data set. These characteristics will allow us to stress the MLOps process. With the data available to us, we can simulate every race from the entire season and run the MLOps process as we would have in real time had we been live.
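The replay idea can be sketched as a simple loop: step through the season one race at a time, refresh the model on everything seen so far, evaluate it, and promote it only if it clears a quality bar. Everything in the sketch is hypothetical, including the function names, the accuracy threshold and the stubbed train and evaluate callables, which stand in for the real training and testing steps.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Outcome of one simulated model refresh."""
    race: str
    accuracy: float
    promoted: bool


def replay_season(races, train, evaluate, min_accuracy=0.9):
    """Replay races in order, refreshing the model after each one as if live."""
    history = []
    seen = []
    production_model = None
    for race in races:
        seen.append(race)
        # Retrain on all races available at this point in the season.
        candidate = train(list(seen))
        # evaluate() is expected to score on held-out images from this race.
        accuracy = evaluate(candidate, race)
        promoted = accuracy >= min_accuracy
        if promoted:
            production_model = candidate
        history.append(RunRecord(race, accuracy, promoted))
    return production_model, history


# Stubs standing in for real training and evaluation (made-up scores).
FAKE_SCORES = {1: 0.88, 2: 0.91, 3: 0.94}


def fake_train(races_so_far):
    return {"trained_on": len(races_so_far)}


def fake_evaluate(model, race):
    return FAKE_SCORES[model["trained_on"]]


model, history = replay_season(
    ["race_01", "race_02", "race_03"], fake_train, fake_evaluate
)
promotions = [record.promoted for record in history]
```

Passing train and evaluate in as callables keeps the replay loop independent of any particular model, which matters if the same process is later scaled to other AI R&D projects.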

Our AI and Application Services teams have been working closely to streamline this process and ultimately scale it to all of the AI R&D projects we have done and will do in the future.

Next steps

As we move into the next rotation, the goal is to keep the project work momentum high and develop our program.

Actionable steps for the platform-side include solidifying Application Services' work on developing the Model Production Zone and scaling the productionalization process so it can be used for all AI R&D work.

Once we have a process for productionalizing our models, we can then work with Application Services to build demos using those models.

With our Business Analytics & Advisors team growing rapidly, our R&D program is gaining visibility and interest in it is increasing. We must continue to build our platform capabilities to keep up with the demand and be able to scale accordingly. We will continue to blend Application Services into the AI R&D team, both through the different team rotations and through representation in leadership meetings.

To streamline project work, we will also continue to enhance the data catalog. This will improve accessibility to data for our team as they rotate on projects.

Eventually, we aim to expand the platform into a hybrid on-prem/public cloud architecture, with the goal of making work more consistent across the two environments.

With these ongoing efforts to solidify our foundations, we look forward to the coming months where we can further leverage our program. WWT's upcoming AI Industry Day will be a great start to our efforts, where our team can interact with industry experts and work toward our program's growth goals.