GPUs for ETL: Transforming the data science landscape

Many organizations are interested in expanding and leveraging their AI and Deep Learning capabilities across various use cases. However, acquiring the necessary AI infrastructure remains a chicken-and-egg problem: one needs the infrastructure to experiment with AI project ideas, but one needs AI project ideas to justify the investment in infrastructure. It is much easier to justify the investment if the same infrastructure can be used to accelerate current processes as well.

Summary

A primary part of many data science projects is cleaning and transforming data, evaluating missing information, and creating additional features that can be used for further discovery and model training. When it comes to pushing the model to production, the entire data processing pipeline needs to be set up. That pipeline involves steps like Extraction, Transformation, and Loading (ETL) of data into a data lake. For a model to run efficiently and quickly, this pipeline also needs to be fast. Making ETL robust is an iterative and therefore laborious process. Moreover, with the exponential increase in the amount of data available for analysis today, it becomes imperative to understand how GPUs can help cut down the processing time of ETL steps.

To show how GPUs can reduce processing time, we benchmarked the same ETL workload on GPUs and CPUs and recorded the results.

ATC Insight

CUDA is a parallel computing platform and API model created by NVIDIA.

Under the hood: understanding why GPUs can process tasks faster

GPUs have transformed the AI space in the last few years. CPUs and GPUs process tasks differently because of their different architectures. A CPU focuses its smaller number of cores on individual tasks and on getting things done quickly, which makes CPUs more effective at processing serial instructions or tasks. A GPU, on the other hand, parallelizes tasks more effectively because it is built from thousands of simple, efficient cores. This video by NVIDIA offers a fun metaphor to explain the difference.

This ability to parallelize work is making GPUs more prominent for tasks such as building deep neural networks, processing image and video data, or searching billions of possible item-warehouse-carrier combinations for e-commerce orders in milliseconds to find the least-cost combination. GPU-accelerated computing offloads the compute-intensive portions of an application to the GPU, while the remainder of the code still runs on the CPU.
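As a rough illustration of why a least-cost search like this parallelizes well, the sketch below scores every warehouse-carrier pair for a single order in one array operation and then picks the cheapest. It runs on the CPU with NumPy; the cost tables, their values, and the variable names are made up for illustration and do not come from the benchmarked solution. Because each cell of the cost grid is computed independently of the others, this is the kind of work that maps naturally onto many simple GPU cores.

```python
import numpy as np

# Hypothetical cost tables; the values are invented for illustration only.
warehouse_cost = np.array([4.0, 2.5, 3.0])   # handling cost per warehouse
carrier_cost = np.array([1.0, 1.8])          # base shipping cost per carrier
# Extra cost of pairing warehouse w (row) with carrier c (column).
lane_surcharge = np.array([[0.5, 0.0],
                           [2.0, 0.2],
                           [0.1, 0.9]])

# Broadcasting builds the full (warehouse x carrier) cost grid in one step.
# Every cell is independent of the others, so the work is data-parallel.
total_cost = warehouse_cost[:, None] + carrier_cost[None, :] + lane_surcharge

# Locate the cheapest combination in the grid.
best_flat = int(np.argmin(total_cost))
best_warehouse, best_carrier = np.unravel_index(best_flat, total_cost.shape)
best_cost = float(total_cost[best_warehouse, best_carrier])

print(f"warehouse {best_warehouse}, carrier {best_carrier}, cost {best_cost:.2f}")
```

A real search would add an item dimension and many more candidates, which is where the grid grows large enough for GPU parallelism to pay off.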

Benchmarking the processing time

We were able to speed up the complete ETL process we tested by a factor of 7. Some specific tasks, such as joining data frames (left/outer join), saw an even greater speed-up.
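For readers who want to run a similar comparison, here is a minimal sketch of how a speed-up factor can be measured. The helper names `time_call` and `speedup` and the placeholder workload are hypothetical; this is not the harness used for the results reported here. The speed-up factor is simply CPU time divided by GPU time for the same step.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(cpu_seconds, gpu_seconds):
    """Speed-up factor: how many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


def sample_step(n):
    """Placeholder workload so the sketch runs anywhere; swap in a real ETL step."""
    return sum(i * i for i in range(n))


elapsed = time_call(sample_step, 10_000)
print(f"best of 3 runs: {elapsed:.6f} s")
```

Taking the best of several runs reduces noise from one-off costs such as the first-use initialization of a GPU library. In practice, each ETL step would be timed once with the CPU library and once with the GPU library, and the two timings passed to `speedup`.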

Graphical form depiction of CPU vs GPU Benchmark results of testing (figure 1):

Tabular form depiction of CPU vs GPU Benchmark results of testing (figure 2):

Highlights & challenges: Processing time should not be the only factor when evaluating GPUs and CUDA libraries

CUDA based libraries have shown:

Ease of use
Libraries can be installed like any other Python library
The API interface is similar to commonly used libraries (e.g. pandas), making it easier to convert existing CPU-based code to CUDA-based code to leverage GPUs
Significantly lower processing time (due to high bandwidth) in performing ETL operations especially with larger datasets
Availability of basic and intermediate functions to do ETL
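To make the API similarity concrete, here is a minimal sketch in which the same DataFrame code runs on either library, depending on which import succeeds. The alias `xdf`, the column names, and the data are illustrative assumptions. The calls used (`DataFrame`, `merge`, `groupby(...).mean()`) exist in both pandas and cuDF, but cuDF does not cover every pandas feature, so real code should be checked operation by operation.

```python
# Use cuDF when it is installed; otherwise fall back to pandas so the sketch still runs.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Hypothetical sensor readings and a device lookup table.
readings = xdf.DataFrame(
    {"device_id": [1, 1, 2, 2], "reading": [10.0, 14.0, 7.0, 9.0]}
)
devices = xdf.DataFrame({"device_id": [1, 2], "site": ["north", "south"]})

# The same method calls work with either library.
merged = readings.merge(devices, on="device_id", how="left")
site_means = merged.groupby("site")["reading"].mean()

print(site_means)
```

In this sketch the import line is the only part that changes between the CPU and GPU versions.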

The speed-up in processing time can justify using GPUs on its own, but other key factors and challenges also need to be considered when evaluating a move to a GPU-based ETL infrastructure. One aspect data scientists must consider is the libraries used and their capabilities. See below for a quick comparison of two popular libraries.

Summary and Our Thoughts:

CUDA-based data processing libraries are still very new in the ETL space, but even now they show significant speed-up in processing large datasets when compared to their CPU counterparts, and more importantly, they also show the same ease of use.

The key to moving from a CPU- to a GPU-based ETL process today is evaluating multiple factors and understanding the needs of the process. GPUs work well for simple and intermediate ETL operations on large datasets, which is usually what we encounter in analytical problems. However, a CPU-based library such as pandas is more suitable for very complex ETL operations.

In a related development, NVIDIA has recently announced a RAPIDS Accelerator for Spark 3.0 that intercepts and accelerates ETL pipelines by dramatically improving the performance of Spark SQL and dataframe operations. Spark 3.0 also reaps performance gains by minimizing data movement to and from GPUs. More and more open-source frameworks are adding support for GPU acceleration.

In summary, with the CUDA-based libraries evolving so rapidly, running end-to-end data science pipelines on multiple GPUs is already transforming how we do ETL and other operations.

Test Plan

What kind of ETL processes did we focus on?
To assess the computational power GPUs can provide, we chose an existing solution developed for a client which required multiple complex ETL operations. The problem included:

Reading and writing large csv files
Merging multiple dataframes/tables on specific keys by different join logics: left-join, inner-join, etc.
Creating new variables by combining existing variables on different conditions (e.g., if-else operations)
Aggregating data (Summary statistics, e.g., mean, maximum, minimum values on different groups in the data)
Reshaping data (long to wide format and vice versa)
Manipulating string variables

We implemented the different operations in Python using primarily the pandas library, and then compared the performance and the feasibility of the same tasks using the cuDF library (Python GPU DataFrame library).
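The operation types listed above can be sketched in pandas as follows. This is an illustrative example on a tiny invented dataset, not the client solution. The column names, values, and the 25-degree threshold are assumptions, and in-memory buffers stand in for the large CSV files.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sensor data; column names and values are invented for illustration.
raw_csv = io.StringIO(
    "device_id,timestamp,temp_c\n"
    "a-1,2021-01-01 00:00,20.5\n"
    "a-1,2021-01-01 01:00,22.5\n"
    "b-2,2021-01-01 00:00,30.0\n"
    "b-2,2021-01-01 01:00,\n"
)
devices = pd.DataFrame({"device_id": ["a-1", "b-2"], "site": ["north", "south"]})

# Reading a CSV file (an in-memory buffer stands in for a file on disk).
readings = pd.read_csv(raw_csv, parse_dates=["timestamp"])

# Merging tables on a key with a chosen join logic.
merged = readings.merge(devices, on="device_id", how="left")

# Creating a new variable from a condition (if-else logic).
# A missing temperature compares as False, so it is labeled "normal" here.
merged["status"] = np.where(merged["temp_c"] > 25, "hot", "normal")

# Aggregating: summary statistics per group (missing values are skipped).
summary = merged.groupby("site")["temp_c"].agg(["mean", "max", "min"])

# Reshaping: long to wide, then back to long.
wide = merged.pivot(index="timestamp", columns="device_id", values="temp_c")
long_again = wide.reset_index().melt(
    id_vars="timestamp", var_name="device_id", value_name="temp_c"
)

# Manipulating string variables: take the prefix before the dash and upper-case it.
merged["device_family"] = merged["device_id"].str.split("-").str[0].str.upper()

# Writing a CSV file (again to an in-memory buffer).
out = io.StringIO()
summary.to_csv(out)
print(out.getvalue())
```

Each step corresponds to one item in the list, so the same structure can be used to time the operations individually.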

Data Profile
The experiment used multiple datasets with time-series data from IoT-driven devices.

Infrastructure Details
Benchmarking was performed on an AWS Cloud Instance with this infrastructure:

References
https://rapids.ai/dask.html (Refer to the example notebooks section on this page)

Technology

Using CUDA libraries to evaluate the power of GPUs

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model, proprietary to NVIDIA, that allows users to leverage GPU processing and high-bandwidth GPU memory through user-friendly interfaces. Built on the CUDA framework, RAPIDS is an open-source suite of libraries developed by NVIDIA for executing end-to-end data science and analytics pipelines entirely on GPUs. RAPIDS is cross-platform and free to download and use. We worked with its cuDF library to gauge the impact GPUs can have on speeding up the ETL process.

The two main libraries we worked with are:

pandas - a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series
cuDF - a GPU DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data. It provides a pandas-like API that is easy to understand and use