Deploying the AI R&D MLOps Platform to Enable End-to-End ML Workflows
Deploying the Kubeflow MLOps platform in AWS enabled our Data Science team to create end-to-end ML workflows for automated delivery of machine-learning models.
In this white paper, you will learn about the MLOps platform that a WWT machine-learning (ML) platform infrastructure team built to reliably deliver trained and validated ML models into production. By deploying the Kubeflow MLOps platform in AWS as a component of our common ML infrastructure, the team enabled WWT data scientists to create end-to-end ML workflows. As part of the MLOps platform deployment, the team built an automated delivery pipeline proof-of-concept to train and productionize a natural language processing (NLP) deep learning model, along with microservices that enable a user to search for relevant WWT platform articles that have been ranked by that productionized model.
As more organizations adopt a data-driven culture by moving up the Data Maturity Curve, MLOps has become a necessity for any data-mature company. While ML capabilities have progressed greatly over the last decade, many organizations still struggle to deploy these models consistently and reliably. As the industry enters the second decade of the machine learning revolution, there is a need for machine learning systems that can operate in an automated, repeatable and platform-independent manner. Such an environment should allow data scientists to focus more on the data science and less on where and how they will do it.
Further, as cloud computing has enabled big data and massive models, the demand for ML pipelines that encompass the entire machine learning lifecycle is also increasing. One example of such a model is Bidirectional Encoder Representations from Transformers (BERT), a large language model for Natural Language Processing (NLP) that often contains 110 million or more parameters. Such pipelines allow data scientists to work faster in a repeatable manner while still allowing for innovation. For example, many organizations have built data science teams that deploy many models but have siloed the various elements of these models, so the models start to break down and require constant attention. Through MLOps, these models can be deployed continually, reliably and automatically.
Here, we present a full ML pipeline with a massive NLP model using Kubeflow as an example of a successful pipeline. This by no means constitutes the only available ML pipeline but demonstrates what is possible. MLOps involves building an advanced infrastructure in which the various elements of the pipeline and tools need to integrate seamlessly. Thus, organizations may put together a different puzzle based on data maturity, resources and current infrastructure. That said, many of the lessons learned while building the end-to-end ML workflow with Kubeflow can be applied to a variety of ML pipelines.
Experimental Setup and Methodology
An effective ML infrastructure team enables data scientists to iterate quickly and more efficiently. An organization’s ML infrastructure typically includes platforms and internal tools for ML model training, experimentation and deployment. Unburdened from the complexity of building ML infrastructure, data scientists can focus on researching the latest concepts in Artificial Intelligence and Deep Learning, performing feature engineering, and training ML models that will eventually provide value in production. We selected Kubeflow to enhance the MLOps capabilities of our AI R&D platform and built a proof-of-concept system including data ingestion, an end-to-end ML workflow and a microservices ecosystem.
MLOps Platform Selection
When selecting the platform, we assessed multiple options, including MLflow, Flyte, Domino Data Lab and Kubeflow. We chose Kubeflow for several advantages, including its level of adoption, its deployment flexibility and its development community. Many organizations have successfully adopted Kubeflow to build reproducible ML pipelines, e.g., Spotify, Bloomberg, Volvo, US Bank, Chase and others. Kubeflow is freely available and, because it is installed on a Kubernetes cluster, it shares Kubernetes’ deployment flexibility in any of the primary cloud providers (AWS, GCP and Azure) and on-premises. Kubeflow was created by developers at Google, Cisco, IBM, Red Hat, CoreOS and CaiCloud to provide the type of MLOps platform that had previously only been achievable by major tech companies building ad-hoc platforms with large ML infrastructure teams. Critically, the Kubeflow development community provides valuable guidance around installation and use of the MLOps platform and its components.
Many MLOps platform options exist today. Selecting one of these platforms and performing a deep dive has allowed WWT not only to gain expertise with Kubeflow but also to help WWT data scientists learn generalized MLOps skills, techniques and patterns that readily transfer across MLOps platforms.
Building the Proof-of-Concept
As a proof-of-concept of the AI R&D Group’s MLOps capabilities, our ML Platform Infrastructure team built an ML-driven system that enables users to search for technology articles (such as the one you are reading right now) that have been published on WWT’s website, the Advanced Technology Center Platform (which we will call the WWT platform). To build this system, the team deployed Kubeflow, created a component that ingests data from the WWT platform, trained a deep learning model via an end-to-end ML workflow, and built a web application that enables the user to search for articles (Figure 1). The team consisted of a Data Scientist, a Data Engineer and three ML Infrastructure Engineers.
For your convenience, we are bolding the key features below.
Because the AI R&D group had adopted AWS for our development environment, we deployed Kubeflow in an Amazon Elastic Kubernetes Service (EKS) cluster. In addition to deploying the MLOps pipeline to Amazon EKS in AWS Cloud, the team also deployed the solution to an EKS cluster running on AWS Outposts. AWS Outposts is a fully managed offering that brings similar AWS infrastructure, services and APIs to a local datacenter or facility by way of a rack of hardware delivered to that facility. GitHub Actions, along with the Terraform infrastructure-as-code utility, helped us automate cluster provisioning and Kubeflow installation. Terraform allowed the team to provision different cloud services with varying configurations, perform experiments, and then immediately tear down the services to avoid the cost of keeping them (including GPU-accelerated instances) available. In addition to provisioning the EKS cluster, Terraform provisions many other AWS services, including Amazon Elastic File System (EFS). EFS enables multiple Kubeflow components (running as Kubernetes pods) to share training data and ML models with one another.
As part of the deployment, after Terraform deploys the AWS cloud services, we use the kfctl utility to install the Kubeflow custom resources on top of the running EKS cluster. By integrating with GitHub Actions, the team built a one-click deployment feature that enables a Data Scientist to easily spin up a Kubeflow cluster without having to gain expertise in Kubernetes or cloud service provisioning.
While a deployed Kubeflow cluster and Pipelines let you build end-to-end ML workflows, an ML infrastructure team must enable multiple data scientists to share the cluster. For this multi-tenancy, the team configured multi-user isolation (included with Kubeflow 1.1), which supports multiple user profiles in individual Kubernetes namespaces. In addition, Dex (an OpenID Connect identity provider) authenticated each Data Scientist via a web-based login form.
To further reduce the cluster running cost, the ML infrastructure team implemented autoscaling so GPUs would only run during training.
For data ingestion, the data engineer and data scientist collaborated to determine how to transform the WWT article content. The text data from the WWT platform was extracted using the Advanced Technology Center (ATC) Connect API. Text extraction began as a Python 3.8 program that ran locally and ingested a single HTML article (Figure 2). After several iterations, it evolved to retrieve multiple articles from the ATC Connect API and save them to a folder within the project code base. Two primary Python packages, BeautifulSoup and html5lib, helped parse the HTML and transform it into comma-separated values (CSV) data for training ML models.
Once the base code was complete, we transitioned to automating its execution with AWS Lambda. With Lambda and S3, we scheduled a time-based (cron) job to run against the WWT articles weekly and save the parsed data into an S3 bucket. We also experimented with a framework called Serverless to automate deployments, but it ultimately lacked the ability to upload multiple layers properly to Lambda. Lambda provides a specific set of libraries by default; layers let a user upload additional third-party dependencies. BeautifulSoup and several other packages were needed as layers for the R&D ML Pipeline Ingestion Lambda to execute properly. This system allowed us to dynamically pull, parse and save data into S3 for later use by the ML models.
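The HTML-to-CSV transformation described above can be illustrated with a minimal sketch. It uses Python's built-in html.parser in place of BeautifulSoup and html5lib so that it runs without extra dependencies. The sample HTML, the article ID and the column names (article_id, section, text) are assumptions for illustration, not the actual ATC Connect payload or the team's schema.

```python
import csv
import io
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collects (section heading, paragraph text) pairs from article HTML."""

    def __init__(self):
        super().__init__()
        self.rows = []        # list of (heading, paragraph text) tuples
        self._heading = ""    # most recent <h2> heading seen
        self._tag = None      # tag whose text is currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # ignore inline tags such as </b> inside a paragraph
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._buffer).split())
        if tag == "h2":
            self._heading = text
        elif text:
            self.rows.append((self._heading, text))
        self._tag = None


def article_to_csv(article_id, html):
    """Returns CSV text with one row per paragraph of the article."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["article_id", "section", "text"])  # assumed column names
    for section, text in parser.rows:
        writer.writerow([article_id, section, text])
    return out.getvalue()


# Hypothetical article fragment used only to exercise the sketch.
SAMPLE_HTML = """
<h2>Overview</h2>
<p>Kubeflow runs on <b>Kubernetes</b>.</p>
<h2>Deployment</h2>
<p>We deployed it, then   tested it.</p>
"""

csv_text = article_to_csv("article-001", SAMPLE_HTML)
print(csv_text)
```

Writing one row per paragraph keeps each section's text separate, which suits a use case where relevance is scored per article section.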
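A scheduled ingestion function of this kind might look roughly like the sketch below. The handler name, the bucket name, the object key layout and the fetch_articles helper are all hypothetical. The S3 client is passed in as an optional argument so the sketch can be exercised with an in-memory stand-in; inside Lambda the argument would be left unset and a real client created with boto3.

```python
import datetime
import json


def fetch_articles():
    """Hypothetical stand-in for calling the ATC Connect API and parsing HTML."""
    return [{"article_id": "article-001", "text": "Kubeflow on AWS"}]


def lambda_handler(event, context, s3_client=None, bucket="example-ingestion-bucket"):
    """Entry point invoked by the weekly schedule.

    The bucket name is an assumption. When s3_client is None (the normal case
    in AWS), a real client is created with boto3.
    """
    if s3_client is None:
        import boto3  # imported lazily so the sketch runs without boto3 installed
        s3_client = boto3.client("s3")

    articles = fetch_articles()
    today = datetime.date.today().isoformat()
    key = f"parsed/{today}/articles.json"  # assumed key layout
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(articles).encode("utf-8"),
    )
    return {"statusCode": 200, "saved": len(articles), "key": key}


class FakeS3Client:
    """In-memory stand-in that records put_object calls for local testing."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body


fake_s3 = FakeS3Client()
result = lambda_handler({}, None, s3_client=fake_s3)
print(result)
```

A weekly trigger could then be attached with a schedule expression such as cron(0 6 ? * MON *), where the day and time are chosen here purely for illustration.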
End-to-End ML Workflow
Similar to how platforms such as Jenkins or GitLab CI/CD produce automated delivery pipelines to put traditional software (applications and APIs) into production, Kubeflow automates delivery pipelines to put ML models into production. The purpose of our Kubeflow-enabled end-to-end ML workflow is to train, validate and deploy BERT, a massive NLP model.
BERT was developed by Google and is a massive pre-trained language model with more than 100 million parameters that can be fine-tuned for use in a specific domain (Devlin et al. 2018). Unlike traditional NLP deep learning models such as recurrent neural networks (RNNs), BERT can be trained in a parallel fashion, thus taking advantage of the CUDA cores and Tensor Cores of the cloud GPUs utilized by Kubeflow components. Furthermore, much of BERT’s power lies in its attention to the context of words in each sentence, unlike many previous NLP models.
Using the Kubeflow Pipelines feature, this deep learning model is trained on content from WWT platform articles. The pipeline is composed of a series of components as shown in Figure 3.
Using Kubeflow Pipelines, each component runs sequentially and passes intermediate artifacts to the next component. Note that each component runs as a separate Docker container, which isolates software library dependencies and allows individual pipeline components to be re-used. Containers are critical for making pipelines easy to re-use on a variety of platforms. Overall, the end-to-end ML workflow pre-processes the training data, trains and evaluates the ML model, and sends the ML model to a location where an API can expose it to external services.
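The sequential hand-off of artifacts can be sketched in plain Python. This is not the Kubeflow Pipelines SDK and not the team's pipeline definition: each function below stands in for a containerized component, the temporary directory stands in for the shared persistent volume, and the function names, file names and toy "model" are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def preprocess(workdir: Path) -> Path:
    """Writes cleaned training text and returns the artifact's path."""
    raw = ["  Kubeflow on AWS ", "Lean UX workshops  "]  # toy input data
    out = workdir / "preprocessed.json"
    out.write_text(json.dumps([line.strip().lower() for line in raw]))
    return out


def train(workdir: Path, data_path: Path) -> Path:
    """Stand-in for training: stores the vocabulary as a toy 'model'."""
    docs = json.loads(data_path.read_text())
    vocab = sorted({word for doc in docs for word in doc.split()})
    out = workdir / "model.json"
    out.write_text(json.dumps({"vocab": vocab}))
    return out


def evaluate(workdir: Path, model_path: Path) -> Path:
    """Reads the model artifact and writes a metrics artifact."""
    model = json.loads(model_path.read_text())
    out = workdir / "metrics.json"
    out.write_text(json.dumps({"vocab_size": len(model["vocab"])}))
    return out


def run_pipeline(workdir: Path) -> dict:
    """Runs the steps in order; each step consumes the previous step's file."""
    data_path = preprocess(workdir)
    model_path = train(workdir, data_path)
    metrics_path = evaluate(workdir, model_path)
    return json.loads(metrics_path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    metrics = run_pipeline(Path(tmp))

print(metrics)
```

Because each step communicates only through files at known paths, any single step can be swapped out or re-run without touching the others, which is the same property the containerized components rely on.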
With Kubeflow Pipelines, each component acts as an independent and self-contained step of the end-to-end ML workflow. In Figure 3, training data is first loaded from the Kubernetes persistent volume to be pre-processed. The “Perform Topic Modeling” ML pipeline component then uses the Scikit-Learn Python library to apply Latent Semantic Analysis (LSA) and find hidden topics in the training data; it also performs dimensionality reduction. Each topic is a cluster of words that typically relate in the corpus of WWT articles. Four of these hidden topics can be seen in Figure 4, where each topic is represented as a “word cloud.” Words that occur frequently in the WWT articles are larger.
Training a deep learning model requires a large amount of data. To build that data set, the extracted topics are first matched with topics from a large set of Wikipedia entries, and then the content from each of those matching entries is saved as a row in a .csv file. Next, the .csv file is saved to the persistent volume where it can be pulled in by the next component in the pipeline.
The next pipeline component is “Train BERT Model”; it takes data from the “Perform Topic Modeling” component and fine-tunes (re-trains) the pretrained BERT. This fine-tuning lets the model understand the domain-specific language found in WWT articles (such as Kubernetes, networking, Lean UX, Agile Software Development and other technical language – words that may not be as typical of BERT’s original training data).
To clarify our use case: we want to predict how well a user’s search phrase matches the content from a specific section of a WWT article. For example, we expect BERT to return a higher score for the search phrase “Kubernetes in AWS” along with content from a section of a Kubernetes-specific WWT article; and a lower score for the same search phrase but content from an article about Lean UX Workshops.
The “Train BERT Model” component first transforms the training data into BERT’s expected format (Devlin et al. 2018). Next, using the Keras and Hugging Face Python libraries, we define the architecture of a neural network that includes the pre-trained BERT embedding layer (see Figure 5). A summary of the assembled model, generated by Keras, is shown in Figure 6. The “Train BERT Model” component fine-tunes the model with our new data in a massively parallel strategy; we run four NVIDIA Tesla V100 GPUs in AWS. Metrics from fine-tuning BERT on each of the four GPUs are shown in Figure 7.
Finally, the fine-tuned ML model is serialized and saved in SavedModel (protobuf) format.
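To make the input format used during fine-tuning more concrete, the sketch below assembles a search phrase and an article passage into the sentence-pair layout described by Devlin et al.: a [CLS] token, the first segment, a [SEP] token, the second segment and a final [SEP], together with segment IDs and an attention mask. Whitespace splitting is used here as a simplifying stand-in for BERT's WordPiece tokenizer, and the function name, field names and max_len value are assumptions; in practice a tokenizer from a library such as Hugging Face would produce integer token IDs instead of strings.

```python
def encode_pair(query, passage, max_len=16):
    """Builds a BERT-style sentence-pair input from two strings."""
    # Whitespace tokenization is a stand-in for WordPiece (assumption).
    query_tokens = query.lower().split()[: max_len - 3]
    passage_tokens = passage.lower().split()

    # Truncate the passage so both segments plus three special tokens fit.
    budget = max_len - len(query_tokens) - 3
    passage_tokens = passage_tokens[:budget]

    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query and the first [SEP];
    # segment 1 covers the passage and the final [SEP].
    token_type_ids = [0] * (len(query_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    attention_mask = [1] * len(tokens)

    # Pad every field to the fixed length; padding is masked out.
    padding = max_len - len(tokens)
    tokens += ["[PAD]"] * padding
    token_type_ids += [0] * padding
    attention_mask += [0] * padding
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("Kubernetes in AWS", "Deploying Kubeflow on Amazon EKS", max_len=12)
for name, values in encoded.items():
    print(name, values)
```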
The next pipeline component (“Evaluate Trained Model”) loads and evaluates the serialized model using Mean Reciprocal Rank (MRR), a metric for evaluating ranking models. After evaluation, the ML model is pushed to an S3 bucket where it can be accessed for the purpose of serving predictions.
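For readers unfamiliar with the metric, Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the first relevant result appears; a query with no relevant result contributes zero. The following is a minimal sketch of that calculation, not the team's evaluation code, and the input layout (one list of relevance flags per query, in ranked order) and sample data are assumptions.

```python
def mean_reciprocal_rank(ranked_relevance):
    """Computes MRR from per-query lists of relevance flags in ranked order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


# Hypothetical results for three queries.
example = [
    [False, True, False],   # first relevant result at rank 2 -> 1/2
    [True, False],          # first relevant result at rank 1 -> 1
    [False, False, False],  # no relevant result -> 0
]
mrr = mean_reciprocal_rank(example)
print(mrr)
```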
So far in the end-to-end ML workflow, the model has been trained on pre-processed data, serialized, evaluated and pushed to cloud storage. However, the ML model predictions are still not accessible to other members within the organization. For that, the final component in the end-to-end ML workflow is “TensorFlow Serving”; it points to the trained and serialized ML model in the S3 bucket and exposes that ML model as a gRPC (Google’s RPC system) endpoint. The organization’s other applications and services can access that gRPC endpoint.
Productionizing the Trained Deep NLP Model
Completing the pipeline means the fine-tuned BERT is now ready to serve! We built a production web application that uses the BERT model to rank WWT articles based on relevance to a user’s search term. The web application consisted of a Python Flask application, a Scala-based Search Service, and a Lucene full text index. Figure 8 shows a sequence diagram illustrating the steps.
In Step 1 of Figure 8, the user enters search terms using the web application. A JSON payload with the search terms is sent to the Search Service (Step 2). Next, the full-text index is queried for all WWT articles that contain the user’s search terms (Step 3). In Steps 4, 5 and 6, the Search Service returns the article content, and the web application builds a BERT-formatted request (from a batch of WWT articles) for the prediction service.
In Steps 7 and 8, TensorFlow Serving takes the request, performs model inference (by running the batch of inputs through the model to make predictions), and returns the prediction scores. Finally, the web application uses the ML predictions of article relevance to sort the WWT articles for the user (Step 9).
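The final sorting step (Step 9) amounts to pairing each candidate article with its prediction score and ordering by that score. The sketch below assumes one score per article and uses made-up titles and scores; the function name and data layout are hypothetical rather than taken from the web application.

```python
def rank_articles(articles, scores):
    """Returns the articles ordered from highest to lowest prediction score."""
    if len(articles) != len(scores):
        raise ValueError("expected exactly one score per article")
    # Sort on the score only; sorted() is stable, so ties keep their input order.
    paired = sorted(zip(scores, articles), key=lambda pair: pair[0], reverse=True)
    return [article for _, article in paired]


# Hypothetical candidates returned by the full-text index and
# hypothetical relevance scores returned by the prediction service.
candidates = [
    {"title": "Lean UX Workshops"},
    {"title": "Kubernetes in AWS"},
    {"title": "Agile Basics"},
]
predicted = [0.12, 0.93, 0.40]

ranked = rank_articles(candidates, predicted)
print([article["title"] for article in ranked])
```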
Modern MLOps tools provide a standard approach for full end-to-end machine learning pipelines. Our proof of concept was built on Kubeflow with a massive NLP model. As the ML revolution has progressed, data and model parameter sizes have exploded. Thus, the pipeline presented here demonstrates the ability of MLOps to handle a truly deep and massive ML model. Smaller, less complicated models can easily be deployed within this framework. Still, there is not a one-size-fits-all solution for MLOps. Different organizations will likely require a different combination of tools, but many of the lessons are transferable across solutions.
Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 (2018).