MLOps Tools: The Ins and Outs of Choosing a Cloud Provider
MLOps has the potential to transform business outcomes. The first article in this MLOps Tools series, How to Choose MLOps Tools: Top Considerations that Impact Decision Making, focused on the business elements of MLOps. At a high level, it outlined a strategic approach to evaluating your business capabilities and needs so you can maximize the value of MLOps.
The next step is choosing the right tools for your business.
MLOps can be hosted either on-premises or in the cloud, each with its own benefits. Cloud-based MLOps gives you access to a wide variety of offerings and capabilities, like those provided by Amazon Web Services (AWS) and similar public cloud service providers (CSPs). These CSPs let you run your MLOps processes in their cloud, supplying whatever tools and compute capacity you need, so you don't have to procure hardware and build an environment in-house.
On the other hand, building an on-premises MLOps environment gives you direct control over meeting the cybersecurity and compliance requirements your organization is subject to.
This article will focus on cloud-based MLOps tools, highlighting the three industry-leading CSP platforms: Google Cloud, Amazon Web Services (AWS) and Microsoft Azure.
Google Cloud is a relatively new cloud provider compared with AWS and Azure. However, its reach is expanding, and it currently details more than 100 products on its website. Google strives to differentiate its cloud offering by focusing on open-source tooling and integration.
AWS is the oldest provider of the three. As a subsidiary of Amazon, AWS mirrors its parent's obsession with providing the best possible service to customers. AWS is also the largest of all CSPs, with an estimated share of about 30 percent of the cloud services market as of 2021.
Microsoft Azure is one more tool to expand Microsoft's enterprise reach. If you are already using many Microsoft applications, it's likely that Azure can integrate easily with your systems. Another major strength is its ease of use and user-friendly approach to configuration and operations.
A common analogy compares MLOps to a LEGO structure, where each individual piece is a key component of the overall system. As a reminder, the key features of an MLOps system are as follows:
- Data management
- Model versioning and storage
- Model training and deployment
- Model validation
- Continuous integration & continuous delivery (CI/CD)
- Model monitoring
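To make the pieces above concrete, here is a toy sketch of how the stages chain together. Everything in it — the function names, the in-memory registry, the validation threshold — is a hypothetical illustration, not any platform's API; the real tools for each stage are described below.

```python
"""A toy sketch of the MLOps stages listed above.

All names here (prepare_data, MODEL_REGISTRY, the error threshold)
are hypothetical illustrations, not a real platform API.
"""
from statistics import mean

MODEL_REGISTRY = {}  # model versioning and storage

def prepare_data():
    # Data management: in practice, a feature store or data lake.
    return [(x, 2 * x) for x in range(10)]

def train(data):
    # Model training: fit a one-parameter model y = w * x.
    w = mean(y / x for x, y in data if x != 0)
    return {"weight": w}

def validate(model, data):
    # Model validation: gate deployment on an error threshold.
    err = mean(abs(y - model["weight"] * x) for x, y in data)
    return err < 0.01

def deploy(model, version):
    # Deployment plus versioning: register the approved model.
    MODEL_REGISTRY[version] = model
    return version

def run_pipeline():
    # CI/CD-style orchestration: each step feeds the next,
    # and a failed validation blocks deployment.
    data = prepare_data()
    model = train(data)
    if validate(model, data):
        return deploy(model, "v1")
    raise RuntimeError("validation failed; model not deployed")
```

Model monitoring would then watch the deployed version's live metrics and, when drift appears, trigger this pipeline again — which is exactly the loop the platform tools below automate.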
Google kickstarted the discipline of MLOps in 2015 when its researchers argued that there was more to fully utilizing machine learning (ML) than just writing code. Since then, all three CSPs have taken great strides in improving and innovating their MLOps offerings. There are also many new third-party platforms and tools dedicated to overall MLOps systems as well as to individual capabilities listed above. This article focuses on the native tools of each platform and does not explore the additional third-party tools they support. Below is a listing of their offerings.
TensorFlow Enterprise: TensorFlow Extended (TFX) is an open-source ML pipeline framework developed by Google that specializes in training and monitoring deep neural networks. TensorFlow Enterprise adds capabilities on top of the open-source version and is available exclusively to Google Cloud customers.
Google Kubernetes Engine: Another exclusive offering built atop an open-source product originally developed by Google: Kubernetes, a container orchestration system used to develop, deploy and scale ML models. Google Kubernetes Engine's main purpose is to simplify running Kubernetes. Use it to spend less time managing your clusters and servers, and devote more energy to building and maintaining your models.
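As a rough sketch of what running a model on Kubernetes involves, a serving deployment is typically described in a manifest like the one below. The image name, replica count and port are invented for illustration; GKE's value is that it provisions and manages the nodes these pods run on.

```yaml
# deployment.yaml -- illustrative only; image name, replicas and port are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2                     # GKE schedules these pods across its managed nodes
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: gcr.io/my-project/model-server:latest   # hypothetical model-serving image
          ports:
            - containerPort: 8080
```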
Vertex AI: Brings all of Google's ML capabilities into one unified API. Complete with a Jupyter notebook UI, Vertex AI offers a single environment for building and managing the lifecycle of your ML project.
Cloud Build: Google's CI/CD tool lets customers build, test and deploy code to production. CI/CD is a key aspect of MLOps because its standardized framework gives engineers more flexibility and time to spend on innovating models instead of redundant operational tasks.
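As a hedged sketch of what that looks like in practice, a Cloud Build configuration lists build steps that run on each commit. The builder images, test command and image tag below are illustrative assumptions for an ML project, not taken from the source.

```yaml
# cloudbuild.yaml -- illustrative only; step images and commands are assumptions.
steps:
  # Continuous integration: run the model code's unit tests.
  - name: 'python:3.10'
    entrypoint: 'python'
    args: ['-m', 'pytest', 'tests/']
  # Continuous delivery: build a container image for the training/serving code.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/ml-model:$COMMIT_SHA', '.']
images:
  - 'gcr.io/$PROJECT_ID/ml-model:$COMMIT_SHA'
```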
Amazon SageMaker is AWS's end-to-end ML platform. It offers services across data preparation, model building, model training and model deployment, as well as ongoing model management. SageMaker's high degree of modularity lets engineers access the tools they need within the SageMaker environment, including SageMaker Projects, SageMaker Pipelines and SageMaker Model Registry. This modularity also allows for the integration of additional non-SageMaker AWS or third-party tools. Its main UI is based on Jupyter notebooks, although a Python SDK is also available.
Additional AWS services include:
- AWS CodePipeline: A continuous delivery service that automates the release of code changes. It allows for iterative deployment of changes to ML models already in production.
- AWS CodeBuild: Helps build source code, perform tests and create and deploy code packages. CodeBuild works together with AWS CodePipeline to create automated CI/CD pipelines.
- AWS Step Functions: Serves as an ML pipeline orchestrator that automates and chains tasks implemented through SageMaker into an end-to-end workflow.
- SageMaker Autopilot: An automated model building, training and selection tool that creates models based on your data, which you can then easily deploy through AWS CodePipeline.
- Amazon SageMaker Ground Truth: Serves as SageMaker's data labeling solution, which is key to effectively monitoring and reviewing model progress and metrics.
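To make the Step Functions idea from the list above concrete — discrete tasks chained into one workflow, with each state's output feeding the next — the pattern can be sketched in plain Python. The states and their logic are hypothetical; in a real workflow each state would invoke a SageMaker job rather than a local function.

```python
"""A toy state machine in the spirit of AWS Step Functions.

The states and tasks are hypothetical illustrations; a real workflow
would trigger SageMaker jobs instead of local functions.
"""

def preprocess(payload):
    # First state: turn raw inputs into features.
    payload["features"] = [x * 10 for x in payload["raw"]]
    return payload

def train(payload):
    # Second state: "train" a trivial model on the features.
    feats = payload["features"]
    payload["model"] = {"mean": sum(feats) / len(feats)}
    return payload

def evaluate(payload):
    # Third state: approve or reject the model for deployment.
    payload["approved"] = payload["model"]["mean"] > 0
    return payload

# The workflow definition: an ordered chain of states, loosely
# analogous to a (much simplified) Amazon States Language document.
WORKFLOW = [preprocess, train, evaluate]

def execute(workflow, payload):
    # Each state receives the previous state's output, as in Step Functions.
    for state in workflow:
        payload = state(payload)
    return payload

result = execute(WORKFLOW, {"raw": [1, 2, 3]})
```

The value of expressing the pipeline this way is that each state can succeed, fail or retry independently while the orchestrator tracks overall progress.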
Azure ML Pipelines: A platform that provides end-to-end ML lifecycle support, including data preparation, model training, validation and monitoring. In our experience, Azure ML has the strongest R support of the major cloud platforms. It is also the only platform with PyTorch Enterprise, which provides support services that aid your ML projects: long-term support for selected versions of PyTorch for up to two years, allowing a stable production environment without constant upgrades; prioritized troubleshooting; and seamless integration with Azure ML and other PyTorch add-ons, including ONNX Runtime, for faster inferencing.
Azure DevOps: A CI/CD tool that enables cross-team collaboration in code development and application deployment. Both on-premises and cloud options are available, depending on organizational budget and needs. Azure DevOps supports integration with a wide range of services, including GitHub, Campfire, Slack and Trello.
Azure Data Factory: A cloud ETL/ELT tool that provides code-free monitoring and management of data integration and transformation. It contains more than 90 built-in, maintenance-free connectors at no added cost.
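The extract-transform-load pattern that Data Factory manages can be illustrated with a minimal standard-library sketch. The source rows and cleaning rules below are invented for illustration; Data Factory itself performs this kind of work through code-free connectors rather than hand-written scripts.

```python
"""A minimal extract-transform-load sketch using only the standard library.

The input rows and cleaning rules are invented for illustration.
"""
import json

def extract():
    # Extract: pull raw records from a source system (stubbed here).
    return [
        {"user": " Alice ", "spend": "10.5"},
        {"user": "Bob", "spend": "3.0"},
        {"user": "", "spend": "7.2"},   # malformed record: missing user
    ]

def transform(rows):
    # Transform: trim names, cast types, drop malformed records.
    return [
        {"user": r["user"].strip(), "spend": float(r["spend"])}
        for r in rows
        if r["user"].strip()
    ]

def load(rows):
    # Load: serialize to the destination (a JSON string as a stand-in).
    return json.dumps(rows)

warehouse = load(transform(extract()))
```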
MLOps is growing in popularity, and many companies are beginning to explore its potential upside for their business. As demand grows, more MLOps products and tools are coming to market, and the seemingly endless options can be overwhelming. As with any technology, however, the best choice is the one that best serves your organization's needs, and that will vary from case to case.
*Special thanks to Betty Cao for her help in writing this article.
Disclaimer: This article provides a point-in-time snapshot of the offerings from the three major cloud service providers. We anticipate that the tools and services in the MLOps space will continually evolve given the rapid pace of MLOps and technology development today.