Reinventing AI Research & Development: Part II

In our last article on AI R&D, we discussed several organizational challenges that accompanied the growth of our R&D program:

no documented processes for project selection (data scientists worked on what they personally felt was most interesting at any given time);
little direction around a technical platform or technologies to utilize (work was done locally, often without version control);
a lack of standardization around coding practices, style and quality, which led to inconsistencies between code developed by different data scientists; and
cumbersome and time-consuming knowledge-sharing techniques (knowledge sharing either took away from actual R&D time or was simply pushed aside for client work).

To address these challenges, the AI R&D team focused on building out three areas: people, process and platform. The previous article elaborated on the people and process aspects in maturing the AI R&D program.

On the people side, we discussed the AI R&D team's organizational structure and the different roles and responsibilities among team members. On the process side, we discussed our decision to create rotations of varying tenure for our operational team, project team and project selection panel. In this post, we'll cover the platform in detail, shedding light on the tools, capabilities and strategies of the AI R&D program.

The AI R&D platform

In order to learn about and experiment with cutting-edge technologies, algorithms and processes, it was critical that the AI R&D team build a modern AI platform to accelerate the development and production of AI models. Two key considerations were taken into account when designing and building this platform:

Flexibility and extensibility: Ability to handle a variety of data sets, workloads and algorithms to be used across a variety of AI R&D projects and respond to the quickly changing landscape of AI
Mimic future AI platforms: Anticipate what our customers' AI platforms will look like in the future and gain hands-on experience with these cutting-edge capabilities

The combination of these factors has been realized in our modern AI platform, which will fuel continuous growth over time and position the AI R&D team as a thought leader in the AI space. To meet these two considerations, a set of foundational architectural principles for the platform was developed:

Accessible: Available to WWT users through a variety of user-friendly methods; data lake is consumable by other environments
Portable: Code stored in and accessed via external WWT git; containers created and employed for stable packaging
Cost-efficient: Consume only what is necessary
Secure: Resources not exposed to the public internet unless secured for specific use cases

Why a cloud platform?

Over the last five years, WWT's customers have matured their data analytics capabilities and are now asking for guidance in areas that force WWT's AI R&D team beyond traditional analytics. Using WWT's Data Maturity Curve as a measure of this growth, most of our customers now sit in the 2-3 point range of maturity, meaning we must mature our expertise and capabilities to stay ahead of demand and continue being thought-leaders.

*Figure 1: Evolution of WWT's customers along the Data Maturity Curve*

WWT's customers are seeking increasingly flexible AI and machine learning solutions, typically in the cloud, and are now asking questions about:

traditional "big data" in the cloud;
API-driven services to increase flexibility and streamline integrations;
the "democratization of AI" with tools such as Amazon SageMaker; and
cost-effective ingest and data storage for machine learning services in the cloud.

To meet and get ahead of these new demands, the WWT AI R&D team has expanded its playground for data science exploration into the cloud. The AI R&D platform now includes integrations with on-premises components in the WWT Advanced Technology Center (ATC) and a public cloud component currently on Amazon Web Services (AWS). AWS was chosen first because of the team's experience with its cloud offerings; however, the AI R&D program plans to leverage other cloud providers in the future.

The AWS environment complements the work already being performed in the ATC and provides additional benefits. The ATC provides significant hands-on development and testing capabilities using GPU-based solutions for our data scientists and data engineers. Similarly, the R&D platform within AWS provides the same hands-on environment to build, test and explore solutions within the cloud. GitHub and Docker are used across both environments to store code and data and to share this information easily across the team for future projects, reducing the time needed for knowledge transfer.

The AWS platform brings additional benefits. It offers increased flexibility in provisioning, scaling and consuming resources without interfering with POCs and customer engagements within the ATC. It also allows the team to leverage pre-baked services from the cloud provider (e.g., Amazon Rekognition, Amazon Comprehend). These dual environments now position WWT as a leading strategic advisor not only for traditional analytics but also for AI, data science and machine learning within on-premises, cloud and hybrid environments.

The platform

On the platform side, the AI R&D team has mapped out the important functionalities required to go from raw data to running models in a production environment. The platform includes a variety of tools and capabilities for the AI R&D team to experiment with to develop first-hand experience and insights to best advise our clients.

The current platform is illustrated in Figure 2a, where the areas in blue are functionalities that are currently available, and the greyed-out portions represent future functionalities that will be built out. Additionally, current available functionalities and tools on the platform are designated as either an 'always on' or an 'on demand' capability. 'Always on' represents functionalities and tools that are constantly up and running. 'On demand' represents functionalities and tools that are turned on when they are required, a unique feature of working in a public cloud environment. Furthermore, the platform is divided into five main categories:

ETL Tools, Enterprise Service Bus, Enterprise Message Bus: A set of tools and communication systems that enable the raw data to be ingested, transformed to a state suitable for data science usage and loaded onto the containerization section
Containerization: Platform and tools to containerize models for streamlined production through modeling environments
Resource Management: Rules and standards for maintenance of CPUs, GPUs and other resources to enable CPU and GPU as a service
Data and Feature Catalog: Application for end users to see the entire landscape of data available to them and certified features used in past models
Code and Data Management: Code and data managed via standardized set of tools (e.g., GitHub) and processes (e.g., version control strategy)

Currently, the platform has an 'always on' raw zone that serves as the initial landing destination for ingested data (i.e., untouched, read-only data that serves as the new source data for all experiments). Additionally, an 'always on' discovery zone enables data manipulation, while an 'on demand' model training and inference zone allows AI/ML features to be engineered and models to be trained and validated.

The compute resources, including CPUs and GPUs, are spun up 'on demand' during the discovery and training phases. The end user tools are listed in box L, highlighting the variety of tools used by end users and administrators on the platform. A set of standards around code and data management is implemented to ensure consistency and reusability.
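The 'always on' versus 'on demand' split could be captured in a small registry. The following is a minimal Python sketch under stated assumptions, not the team's actual tooling: the names `Availability`, `Capability`, `PLATFORM` and `needs_provisioning` are hypothetical, and only the zone names and their modes come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Availability(Enum):
    ALWAYS_ON = "always on"   # constantly up and running
    ON_DEMAND = "on demand"   # turned on only when required


@dataclass(frozen=True)
class Capability:
    name: str
    availability: Availability


# Zone names and modes follow the article; the registry itself is illustrative.
PLATFORM = [
    Capability("raw zone", Availability.ALWAYS_ON),
    Capability("discovery zone", Availability.ALWAYS_ON),
    Capability("model training and inference zone", Availability.ON_DEMAND),
    Capability("compute (CPUs and GPUs)", Availability.ON_DEMAND),
]


def needs_provisioning(capabilities):
    """Return the names of capabilities that must be spun up before use."""
    return [
        c.name for c in capabilities
        if c.availability is Availability.ON_DEMAND
    ]


if __name__ == "__main__":
    print(needs_provisioning(PLATFORM))
```

Keeping the mode as data rather than as scattered conditionals makes it straightforward to ask which resources incur cost only while in use, which ties back to the "consume only what is necessary" principle.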

*Figure 2a: Current high-level architecture for AI R&D Platform*

*Figure 2b: Ideal future state architecture for AI R&D Platform*

*Figure 3: AI R&D Platform components legend*

The ideal future state for the platform is illustrated in Figure 2b, focusing first on establishing a streaming functionality and a model production zone. A streaming functionality will identify and build out tools for streaming data directly into the data lake, discovery zones and model inference zones.

Importantly, this will unlock and feed into the model production zone, leveraging large datasets that would stress the platform's capabilities. The combination of these two high priority functionalities will enable the AI R&D team to work through challenges and in environments representative of real-world implementation, creating an invaluable opportunity to stay ahead of the customer maturity curve.

Modernization of process for AI

Today, many organizations recognize AI as an important asset to fuel technological growth and transformation that will enable data-driven insights and actions to positively impact business processes. At the same time, most companies that have an AI team are running into obstacles when it comes to rapidly iterating on experiments and putting models into production.

Understanding the challenges faced by customers today, WWT's AI R&D team identified the following strategies as the four focus areas to modernize our AI processes:

Data: a work in progress and will be the main focus in the upcoming year
Code execution: will constantly evolve and transform to meet the requirements of different projects
Coding process: a work in progress to become more efficient in how we code, based on software development best practices
Version control: established process of managing data and model iterations

These four strategies define how WWT's AI R&D team is tackling the AI space. Additionally, WWT's Application Services team is partnering with the AI R&D team to push R&D to work in a more modern fashion. While AI is vastly different from software development, there are fundamentals and best practices within the software development world that may be borrowed to fast-track an AI platform.

Data strategy: Establish future dataflow process and leverage different levels of data readiness

The AI R&D team's data strategy aims to create standardized dataflow processes inside data pipelines in order to better manage and reuse data transformations. Data will be ingested into the Raw Zone and become the source data for all R&D experiments. As data is transformed, it will flow into the Conformed Zone for light transformations and the Semantic Zone for larger transformations and manipulations.

Any transformations that should be saved for potential reuse in the future will be stored in the Certified Feature Store to make experimenting with the same data sets easier and more efficient. Data can be utilized for discovery or model training, inference or production from any of the four zones within the data pipeline, driving flexibility in data consumption.

*Figure 4: Data pipeline for AI R&D*
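The flow from the Raw Zone through the Conformed and Semantic Zones to the Certified Feature Store can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the platform's implementation: the function and class names, the example columns (`quantity`, `unit_price`) and the specific transformations are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Table = List[Row]


def to_conformed(raw: Iterable[Row]) -> Table:
    """Light transformation: normalize column names and drop empty rows."""
    return [
        {key.strip().lower(): value for key, value in row.items()}
        for row in raw
        if any(value is not None for value in row.values())
    ]


def to_semantic(conformed: Table) -> Table:
    """Larger transformation: derive a new field from existing ones."""
    return [
        {**row, "total": row["quantity"] * row["unit_price"]}
        for row in conformed
    ]


class CertifiedFeatureStore:
    """Keeps named, reusable transformations for later experiments."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[Table], Table]] = {}

    def certify(self, name: str, transform: Callable[[Table], Table]) -> None:
        self._features[name] = transform

    def apply(self, name: str, table: Table) -> Table:
        return self._features[name](table)


if __name__ == "__main__":
    # The raw zone is treated as read-only: every step builds new rows.
    raw_zone = (
        {"Quantity ": 2, "Unit_Price": 3.0},
        {"Quantity ": None, "Unit_Price": None},
    )
    conformed_zone = to_conformed(raw_zone)
    store = CertifiedFeatureStore()
    store.certify("order_total", to_semantic)
    print(store.apply("order_total", conformed_zone))
```

Because each step returns new rows instead of mutating its input, an experiment can read from any zone without affecting the others, which is the flexibility in data consumption described above.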

Code execution strategy: Code execution on the cloud and less time on building the stacks

Code can be executed either on-premises or in the public cloud. For on-premises ATC code execution, customized environments must be built each time data scientists execute code. For cloud-based code execution, however, prebuilt and reusable environments are provided out of the box by Amazon SageMaker or by customized images we store in Docker Hub. Docker Hub serves as a repository for those custom images, which data scientists can pull for reuse or as a starting point for a similar image, driving further efficiencies.

*Figure 5: Code execution decision tree*
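The choice between these options can be expressed as a small function. The sketch below encodes only the three outcomes named in the text and does not reproduce the branches of Figure 5; the names `Target` and `choose_environment` and the `prebuilt_fits` flag are assumptions made for illustration.

```python
from enum import Enum


class Target(Enum):
    ON_PREMISES = "on-premises ATC"
    CLOUD = "public cloud"


def choose_environment(target: Target, prebuilt_fits: bool) -> str:
    """Pick an execution environment for a data science job.

    prebuilt_fits is a hypothetical flag: True when a prebuilt
    SageMaker environment already covers the job's dependencies.
    """
    if target is Target.ON_PREMISES:
        # On-premises runs need an environment built for each execution.
        return "build a customized environment for this run"
    if prebuilt_fits:
        return "use a prebuilt SageMaker environment"
    # Otherwise reuse, or start from, a custom image kept in Docker Hub.
    return "pull or extend a custom image from Docker Hub"


if __name__ == "__main__":
    print(choose_environment(Target.CLOUD, prebuilt_fits=False))
```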

Code process strategy: Mature AI and machine learning via best practices from software development

For the coding process, applying best practices from software development to the AI and machine learning space will provide the governance required for efficient collaboration and knowledge transfer.

WWT's AI R&D team has identified a set of fundamental and core principles within the software development space that are applicable to the AI space. These are principles that all data science resources must abide by; the continual application of these principles will provide structure, increased efficiency and reusability to AI processes and workflows. Figure 6 provides a detailed view of the value to AI that may be extracted from best practices in the software development space.

*Figure 6: Linking software development best practices to AI*

Version control strategy: Establish framework to tackle unique collaboration challenges

Data Version Control (DVC) is a high-value tool in the AI space, so the AI R&D team is experimenting with its different functionalities to create a framework for tackling the unique collaboration challenges that arise during data science development. One benefit of DVC is that it allows for full control over code and data versions, tracking a complete picture of the evolution of an AI or machine learning (ML) model.

As a result, models can be reproduced from any stage for reusability, contributing to efficient processes and workflows. Additionally, governance around consistent collaboration is outlined in DVC to guide knowledge transfer, result sharing and running finished models in a production environment. Overall, the functionalities of DVC positively impact processes and workflows in a variety of ways:

reuse data and outputs between experiments;
maintain code for a specific use case in its own repository;
easily transfer knowledge amongst the data science team;
reproduce results across different experiments;
compare different evaluation metrics across experiments;
collaborate across projects with team members; and
version data science models.
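The general idea behind versioning data alongside code can be shown with a short Python sketch. It is a conceptual illustration only and does not use DVC's API or reproduce its internals: the names `fingerprint`, `ExperimentVersion`, `record`, `same_inputs` and `best`, as well as the example revision id and metric values, are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


def fingerprint(data: bytes) -> str:
    """Content hash: identical bytes always yield the same identifier."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ExperimentVersion:
    code_revision: str          # e.g. a git commit id
    data_hash: str              # fingerprint of the training data
    metrics: Dict[str, float]   # evaluation results for this run


def record(code_revision: str, data: bytes,
           metrics: Dict[str, float]) -> ExperimentVersion:
    """Snapshot the code revision, data version and metrics of one run."""
    return ExperimentVersion(code_revision, fingerprint(data), metrics)


def same_inputs(a: ExperimentVersion, b: ExperimentVersion) -> bool:
    """Two runs are comparable reproductions if code and data both match."""
    return a.code_revision == b.code_revision and a.data_hash == b.data_hash


def best(versions: List[ExperimentVersion], metric: str) -> ExperimentVersion:
    """Compare experiments on a single evaluation metric."""
    return max(versions, key=lambda v: v.metrics[metric])


if __name__ == "__main__":
    first = record("abc123", b"x,y\n1,2\n", {"accuracy": 0.8})
    second = record("abc123", b"x,y\n1,3\n", {"accuracy": 0.9})
    print(same_inputs(first, second))
    print(best([first, second], "accuracy").data_hash)
```

Tying every result to both a code revision and a data fingerprint is what allows a model to be reproduced from any stage and evaluation metrics to be compared fairly across experiments.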

Next steps

The next high-priority focus for the AI R&D team is to mature our data strategy. One avenue to achieve this is building out a data catalog, which will help organize all of the data in the AI R&D program. Once the catalog is in place, knowledge sharing will be more efficient because teams can quickly and easily understand datasets.

An early effort in building out a data catalog is important because it will prevent large amounts of unorganized data from piling up. Another important aspect of the data strategy is governance, which will inform how new data should be brought in. From a collaborative perspective, the AI R&D team is extending its reach within WWT to involve teams and individuals with the goal of extracting and applying knowledge of relevant best practices to fast-track development and maturity of the platform.

Ultimately, we will continue to push the limits of the algorithms we investigate and identify unique use cases to demonstrate their value. For more on our work, take a look at the initial overview of the AI R&D program's beginnings or read our most recent white paper that discusses using representative data created by Generative Adversarial Networks (GANs) to train AI models.

We look forward to keeping you updated on our progress!