General Computing on the GPU: An Example Using Metal

Over the last few decades, engineers have recognized that the Graphics Processing Unit (GPU) can do more than render graphics. Why not take that power and use it for general-purpose computing on the GPU (GPGPU)? While GPGPU was adopted early for scientific research and cryptocurrency computation, GPGPU APIs are now available on most platforms. Examples include NVIDIA's CUDA, the Khronos Group's OpenCL, and Apple's Metal. This article explores the potential performance benefits and limitations of GPGPU using Apple's Metal API.

What is a GPU?

A GPU is a co-processor built into most computing devices, including desktop and notebook computers and phones. The primary function of the GPU is to process graphics and render them on screen. That is no small task. If you run a device with a 4K display (3840 by 2160 pixels), the GPU may have to update about 8.3 million pixels 60 times a second. That is nearly 500 million pixels a second!

To achieve that level of processing speed, engineers design GPUs differently than the computer's main brain, the CPU (Central Processing Unit). The CPU is designed as a general processor, usually with 32 or fewer processing units (that is, cores). Each core can handle a wide range of computational tasks, in some cases with multiple threads. As a result, the CPU can handle most general-purpose computing problems. The CPU can even handle work often associated with the GPU, including complex graphics work such as 3D rendering.

The GPU, on the other hand, can have hundreds or thousands of processing units. Compared with a CPU core, each GPU processing unit is slower, has a more limited instruction set, and has less memory. Engineers chose these design characteristics to run many small, similar work items at the same time: parallel computing. This design is perfect for computing the color of every pixel on the screen. The same characteristics also make the GPU attractive for some general-purpose applications.

The choice: CPU vs. GPGPU

If a problem is suited for GPGPU, then the choice of using GPGPU or sticking with the CPU comes down to performance. How quickly can a job be completed and how much energy is consumed? Can you make your app "snappy" without draining the device's battery using GPGPU? The answer is "maybe."

Given the architecture of the GPU, performance gains can only be obtained for highly parallel problems; that is to say, the same computations are executed many times and are not interdependent. The more parallel jobs required, the better the gains using the GPU. If the job does not fit these requirements, the CPU, rather than the GPU, is the best choice.

GPUs can draw more power than CPUs. Even if there is a performance gain using GPGPU, the increased power draw could be detrimental to the user. The balance between speed and energy consumption may be difficult to assess, but if the job is highly parallelizable, then the shorter run time may offset the higher power draw.

In most cases, developers should consider GPGPU programming as an optimization phase of development rather than a priority. Standard CPU development programming techniques are simpler to write, easier to debug, and easier to test. If performance is lacking at that stage, and the job is parallelizable, then GPGPU might be the next step.

Example problem

The United States Geological Survey provides free data about the world. One such dataset is GTOPO30, which describes the topography of the world. The data are arranged in a global grid of 43200 by 21600 datapoints (over 900 million datapoints!). As it would take 120 4K displays to fully visualize this dataset, it is not practical to use this dataset directly to work with global topography. Instead, the dataset can be used to generate statistics about global topography and learn more about the Earth.
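The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that "4K" means a 3840 by 2160 display and that the grid is tiled at one datapoint per pixel. The variable names are illustrative only.

```python
import math

# Grid dimensions quoted in the text for GTOPO30.
COLS, ROWS = 43_200, 21_600
# Assumed 4K UHD display resolution.
DISPLAY_W, DISPLAY_H = 3_840, 2_160

total_points = COLS * ROWS

# Whole displays needed to tile the grid at one datapoint per pixel.
displays_across = math.ceil(COLS / DISPLAY_W)
displays_down = math.ceil(ROWS / DISPLAY_H)
displays_needed = displays_across * displays_down

print(f"datapoints: {total_points:,}")
print(f"displays:   {displays_across} x {displays_down} = {displays_needed}")
```

Rounding up to whole displays in each direction is what yields 120 rather than the raw pixel ratio of about 112.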

Given the size and the potentially parallel nature of the data, processing the GTOPO30 data on a range of Metal-compatible devices can provide insight into GPGPU performance. Each device was tasked with computing a range of topographic statistics. Each statistic was calculated on 10 by 10 subsets of the GTOPO30 dataset, resulting in four datasets, each a 4320 by 2160 grid. Figures 1-4 show the results of these calculations converted into raster maps. Figure 1 shows the average elevations for each 10 by 10 grid, ignoring missing or oceanic data. This figure shows the general topography of the Earth. Figures 2 and 3 show the maximum and minimum elevations in each 10 by 10 grid. These statistics show the topographic extremes within the GTOPO30 dataset. Figure 4 shows relief, which is the maximum elevation minus the minimum elevation. This map highlights the roughness, or flatness, of the Earth.
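As a rough illustration of the four statistics, here is a minimal Python sketch that reduces each 10 by 10 block of a small grid to its mean, maximum, minimum, and relief. It is not the article's implementation. The `NODATA` value of -9999 for ocean or missing cells, and the names `block_stats` and `reduce_grid`, are assumptions made for this example.

```python
NODATA = -9999  # assumed marker for ocean / missing cells

def block_stats(grid, row0, col0, size=10):
    """Return (mean, max, min, relief) for one size x size block.

    Cells equal to NODATA are skipped; a block with no valid cells
    returns None for every statistic.
    """
    values = [
        grid[r][c]
        for r in range(row0, row0 + size)
        for c in range(col0, col0 + size)
        if grid[r][c] != NODATA
    ]
    if not values:
        return (None, None, None, None)
    highest, lowest = max(values), min(values)
    return (sum(values) / len(values), highest, lowest, highest - lowest)

def reduce_grid(grid, size=10):
    """Apply block_stats to every block; grid sides must be multiples of size."""
    rows, cols = len(grid), len(grid[0])
    return [
        [block_stats(grid, r, c, size) for c in range(0, cols, size)]
        for r in range(0, rows, size)
    ]

# Tiny demonstration: a 20 x 20 grid where elevation = row + column,
# with the top-left block marked entirely as ocean.
demo = [[r + c for c in range(20)] for r in range(20)]
for r in range(10):
    for c in range(10):
        demo[r][c] = NODATA
result = reduce_grid(demo)
print(result[1][1])  # statistics for the bottom-right block
```

Each output cell depends only on its own 100 input values, which is the property that makes the problem a good candidate for parallel execution.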

Each device and algorithm was expected to produce identical statistics; however, they did not produce the color maps shown in Figures 1-4. The figures were created with a separate custom Metal compute shader that applied a color texture based on the GMT topography color palette and a hill-shading algorithm to add detail.

Calculations

Both the CPU and GPU implementations used the same basic approach to calculate the statistics for each output datapoint. The algorithm reads 100 datapoints from a buffer in memory and calculates all four statistics. However, implementations differed between the CPU and GPU.

For CPUs, two implementations were tested. The first used a single core and a single thread, and was meant to be the simplest possible implementation of the algorithm. Since only one core was used, the code cycled through each sub-grid, performed the calculations, and stored the results. The second implementation used multiple cores and threads. In this case, the job must be split into many pieces that run separately. This was achieved using Apple's Operation and OperationQueue APIs, where each job was added to a queue and run when resources were available. This approach requires extra care because the threads share resources, including the original data and the output buffers, and coordinating access to those resources can slow performance.
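The difference between the two CPU approaches can be sketched in Python. This is a stand-in for the structure only, not for the Swift code or for Apple's operation queues. `ThreadPoolExecutor` plays the role of the queue, the helper names are hypothetical, and only relief is computed to keep the example short. Missing-data handling is omitted. Python threads will not show the speedup of native threads; the point is how the work is divided.

```python
from concurrent.futures import ThreadPoolExecutor

def band_relief(grid, block_row, size=10):
    """Relief (max - min) for every block in one band of `size` rows."""
    rows = grid[block_row * size:(block_row + 1) * size]
    out = []
    for c0 in range(0, len(rows[0]), size):
        cells = [v for row in rows for v in row[c0:c0 + size]]
        out.append(max(cells) - min(cells))
    return out

def relief_serial(grid, size=10):
    """Single-threaded version: walk the bands one after another."""
    return [band_relief(grid, b, size) for b in range(len(grid) // size)]

def relief_parallel(grid, size=10, workers=4):
    """Queue one job per band; each job returns its own result list."""
    bands = range(len(grid) // size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so rows stay aligned.
        return list(pool.map(lambda b: band_relief(grid, b, size), bands))

# Small synthetic grid (30 rows x 40 columns) used only for the demo.
demo = [[(r * 7 + c * 3) % 50 for c in range(40)] for r in range(30)]
serial = relief_serial(demo)
parallel = relief_parallel(demo)
print(serial == parallel)
```

Having each job return its own results, rather than write into a shared output buffer, is one way to sidestep the shared-resource problem described above. The input grid is shared but only read.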

For the GPU, a custom compute shader was written to do the calculations. A compute shader in Metal is a small program that runs on the GPU once per thread. In this case, each thread computes the statistics for a single 10 by 10 sub-grid (over 9 million threads are required to complete the calculations) and writes its results into the result buffers. Given the design of the shader, race conditions are not a problem since each thread is completely independent of the rest. Unlike CPU development, additional code must be written to set up execution on a GPU. For example, if a GPU does not share RAM with the CPU, the GTOPO30 dataset must be transferred to memory the GPU can read. On mobile devices, where applications cannot make full use of available RAM, the GTOPO30 data must be read from disk in segments, which can affect performance. These additional steps were included in performance calculations.
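The one-thread-per-sub-grid idea can be emulated in Python to show the indexing. This is a hedged sketch, not the article's Metal shader. `stats_kernel` stands in for the shader body, `gid` for the thread's position in the dispatch grid, and `dispatch` for the GPU launching the threads. The flat row-major buffers and the `NODATA` value of -9999 are assumptions. Only maximum and minimum are computed to keep the sketch short.

```python
NODATA = -9999   # assumed missing-data marker
BLOCK = 10

def stats_kernel(gid, src, src_cols, out_cols, out_max, out_min):
    """Body of one hypothetical GPU thread, identified by `gid`.

    Thread `gid` owns output cell `gid` and reads only its own
    BLOCK x BLOCK window of `src`, so no two threads write the same slot.
    """
    out_row, out_col = divmod(gid, out_cols)
    hi, lo = None, None
    for dr in range(BLOCK):
        start = (out_row * BLOCK + dr) * src_cols + out_col * BLOCK
        for v in src[start:start + BLOCK]:
            if v == NODATA:
                continue
            hi = v if hi is None else max(hi, v)
            lo = v if lo is None else min(lo, v)
    out_max[gid] = NODATA if hi is None else hi
    out_min[gid] = NODATA if lo is None else lo

def dispatch(src, src_rows, src_cols):
    """Stand-in for a GPU dispatch: run the kernel once per output cell."""
    out_rows, out_cols = src_rows // BLOCK, src_cols // BLOCK
    n = out_rows * out_cols
    out_max, out_min = [0] * n, [0] * n
    for gid in range(n):          # a GPU would run these concurrently
        stats_kernel(gid, src, src_cols, out_cols, out_max, out_min)
    return out_max, out_min

# 20 x 30 source grid stored row-major in a flat list, value = flat index.
rows, cols = 20, 30
flat = list(range(rows * cols))
max_buf, min_buf = dispatch(flat, rows, cols)
print(max_buf, min_buf)
```

Because every thread writes to a distinct index of the output buffers, the loop in `dispatch` could run in any order, or all at once, without changing the result.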

Test devices

Three Apple devices were tested for CPU performance (see Table 1). The three devices represent not only different generations of CPU, but also different CPU categories: Xeons are workstation/server CPUs, Intel Core i-series chips are for more standard computing, and Apple's M1 represents a new generation of ARM-based low-power CPUs. The devices had sufficient RAM to hold the 1.8 GB original dataset and the output data in memory and did not require any special segmented I/O.

Table 2 lists the GPUs tested in traditional computers. Like the CPUs, the GPUs tested represent a range of GPU generations. They also represent a range of GPU categories. Three of the GPUs were discrete, meaning they are separate from the CPU and do not share resources with it, and two were integrated and share memory with the CPU. All computers, except the MacBook Air, contain two GPUs. Each GPU on a device was tested independently. The maximum buffer size for the Mac Pro GPUs prevented loading the entire dataset into GPU-accessible memory; therefore, the data had to be loaded onto the GPU in multiple passes.

Table 2. GPU test devices (traditional computers)

Table 3 lists the mobile GPUs tested. The mobile GPUs were all generations of Apple's ARM-based A-series chips with integrated GPUs using a shared-memory model. These mobile devices had memory constraints that limited performance: the data buffers could not hold the complete dataset, and the OS would not allow an app to use that much memory anyway. Thus, the algorithm for these devices read the data from disk in segments. As a result, all run times for these devices include reading from disk.
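Reading the data in segments might look something like the following Python sketch. It is illustrative only. The file layout (headerless, row-major, big-endian signed 16-bit samples) is an assumption, `read_strips` is a hypothetical helper, and the demo file is synthetic rather than real GTOPO30 data.

```python
import os
import struct
import tempfile

def read_strips(path, cols, rows_per_strip=10):
    """Yield successive strips of rows from a raw 16-bit elevation file.

    Assumes row-major, big-endian signed 16-bit samples with no header.
    Only one strip is held in memory at a time.
    """
    strip_bytes = cols * rows_per_strip * 2
    with open(path, "rb") as f:
        while True:
            chunk = f.read(strip_bytes)
            if not chunk:
                break
            count = len(chunk) // 2
            values = struct.unpack(f">{count}h", chunk[:count * 2])
            # Split the flat strip back into rows of `cols` samples.
            yield [values[i:i + cols] for i in range(0, count, cols)]

# Demonstration with a small synthetic file (not real GTOPO30 data).
cols, rows = 20, 30
samples = [r * cols + c - 100 for r in range(rows) for c in range(cols)]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack(f">{len(samples)}h", *samples))
strip_maxima = [max(max(row) for row in strip)
                for strip in read_strips(tmp.name, cols)]
os.remove(tmp.name)
print(strip_maxima)
```

A strip height of 10 rows matches the 10 by 10 sub-grids, so each strip can be reduced to one row of output before the next strip is read. A real implementation would likely use larger strips to cut down on the number of disk reads.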

Computation performance

CPU computation times are based only on the time to calculate and assemble the desired statistics. The times do not include reading from disk, data verification, or writing results to disk. Depending on the device, reading the GTOPO30 data would add approximately 1 second to the run times.

GPU run times are measured similarly to the CPU with a few differences. Discrete GPUs must copy the GTOPO data from RAM to the GPU's memory. Since this was not a step required for the CPU, this data copy phase was included in the run times. On mobile devices, the run times included disk reads because of memory constraints.

Using a single CPU core, the performance correlates with the age of the CPU (Figure 5). The oldest CPU, in the Mac Pro, was the slowest, whereas the newer M1 was the fastest. However, even the fastest of these cases, 9 minutes, is too long for an average user.

Figure 5. CPU performance for single core

When splitting the work across multiple cores, performance improves, but that improvement depends on the CPU design (Figure 6). The most performant CPU tested was the M1 using 4 cores. When using 8 cores, the least performant CPU was also the M1 (using the low-power cores reduced performance). In contrast, the older Mac Pro with its server/workstation-grade CPU performs better with 8 or more threads than the MacBook Pro. With the best result just under 3 minutes, the compute time is approaching tolerability, if you only need to make the calculation once.

Figure 6. CPU performance for multiple cores

Figure 7 shows the performance of the GPUs. All GPUs, including those on mobile devices, calculated the statistics in less than 6 seconds. As with the CPUs, GPU design affected performance. Integrated GPUs calculated the results quickest because they used a shared memory model with the CPU and did not require copying the data to GPU memory. The discrete GPUs performed similarly to one another because copying the data took longer than processing it.

Energy performance

The energy performance of the GTOPO30 calculations is difficult to assess. Direct measurements of power consumption are preferred but were unavailable for this study. However, energy use can be estimated from published statistics. Chip power draw is commonly reported as the thermal design power (TDP), treated here as the maximum theoretical load. Tables 1 and 2 show estimated TDP for the test devices. To estimate energy consumption, the calculations assume that the full TDP was drawn for the duration of each run. For GPU runs, the TDP of both the CPU and the GPU was counted, since using the GPU also uses the CPU. Figure 8 shows the estimated energy use of the multi-core CPU calculations. Not surprisingly, the estimates show that the Mac Pro consumed the most energy, whereas the M1 MacBook Air consumed the least.
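The estimate described above amounts to multiplying power by time. The Python sketch below shows that arithmetic under the same full-TDP assumption. The function names are hypothetical, and the wattages and run times are placeholders chosen for illustration, not values from the tables or figures.

```python
def cpu_energy_joules(cpu_tdp_watts, seconds):
    """Upper-bound estimate: the CPU draws its full TDP for the whole run."""
    return cpu_tdp_watts * seconds

def gpu_energy_joules(cpu_tdp_watts, gpu_tdp_watts, seconds):
    """GPU runs are charged the TDP of both chips for the whole run."""
    return (cpu_tdp_watts + gpu_tdp_watts) * seconds

def joules_to_watt_hours(joules):
    """Convert joules (watt-seconds) to watt-hours."""
    return joules / 3600.0

# Placeholder inputs chosen for illustration only; they are NOT the
# measured run times or TDP values from the tables.
cpu_run = cpu_energy_joules(cpu_tdp_watts=45, seconds=180)
gpu_run = gpu_energy_joules(cpu_tdp_watts=45, gpu_tdp_watts=50, seconds=5)
print(f"CPU run: {cpu_run} J ({joules_to_watt_hours(cpu_run):.2f} Wh)")
print(f"GPU run: {gpu_run} J ({joules_to_watt_hours(gpu_run):.2f} Wh)")
```

Even though a GPU run is charged for two chips, a much shorter run time can dominate the product. This is why a run measured in seconds can come out ahead of one measured in minutes.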

For GPUs, energy consumption was much lower. Figure 9 shows estimated energy consumption for each of the Mac GPUs. All the GPUs consumed less energy than the CPUs in this experiment. The discrete GPUs consumed more energy than the integrated GPUs as discrete GPUs tend to be more powerful and require more hardware to power.

Conclusions

Apple's Metal implementation of GPGPU is a powerful computational tool. Given the right problem, Metal performance can outclass standard CPUs. In the test case, all GPUs performed better than CPUs, both in terms of speed and energy. This performance was due to the highly parallelizable nature of the problem. In cases where the calculations are not as large or as parallel, the CPU may be more performant. As a result, developing for the CPU should usually be the first choice, and developing for the GPU should be seen as a potential optimization solution.