If you're like me, you may have wondered whether GPUs share some of the same architectural patterns that exist in CPUs.

For example, do they run opcodes? Do they have stacks? Are they just fast and wide SIMD machines?

As a guy who grew up in the 8-bit computer era keying in machine code hex dumps out of magazines, I became familiar with simple CPU architecture and machine code: Registers, the program counter, the stack pointer, memory and interrupts. That's all you need to know to understand the rest of this article. If you need a refresher:

Architecture of the 8085 Microprocessor

Crochet terminology

But let's get back to the titular question: Do GPUs like to crochet? No, not really; but they do like warps, blocks and grids. I'll explain, but before we continue, it pays to know what those words mean in crochet circles. The analogy will help you visualize GPU operations.

Here's what grandma said about these terms:

  • Thread: Needs no explanation.
  • Warp: The set of parallel threads that run down the length of a loom.
    Yes, ok, this one has more to do with weaving than crochet.
  • Block: She sits for hours crocheting little squares, each containing the same pattern. That's a block.
  • Grid: She stitches all of those squares together to form a larger piece, like a quilt.

GPUs have program counters … and compilers

Quick review of opcodes, program counters and stacks

Having learned those crochet terms, let's now start building the GPU executable layer cake from the bottom up. As you probably know, CPUs run binary codes known as "opcodes" (operation codes) arranged sequentially in memory. Each opcode (one or more bytes) is fetched from memory and represents a single instruction. Opcodes are typically expressed in hexadecimal notation: 0x90 (or 90h) is the x86 opcode for "NOP" (no operation), i.e., do nothing, then move on to the next instruction.

Each CPU core has a program counter register (PC), as well as a stack pointer (SP). These registers are essentially high-speed memory locations that have special names and behave in special ways. The program counter advances to the address of the next instruction each time the CPU executes one (and is rewritten outright by jumps and calls). The stack pointer changes when you use "push" and "pop" opcodes to move data from registers onto or off of the stack.

Hold my beer

Stacks are very useful in procedural programming because they're an easy way to "stash" information about what the CPU is currently doing, so that it can go do something else and then come back to resume what it was doing before. For example, to call a subroutine the CPU will push the PC (program counter) onto the stack before jumping to the location of the subroutine (by setting the PC's value to that address), and will return to what it was doing before by popping the value of the PC off the stack. If that subroutine in turn calls another subroutine, there will be a bunch of PCs pushed onto the stack, and each one will be popped off in reverse order on the way back.
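You can't see the CPU's push/pop of return addresses from a high-level language, but the same bookkeeping exists one level up in an interpreter's call stack: every call records where to resume, and returns unwind in reverse order. Here's a small, purely illustrative Python sketch (the function names are arbitrary):

import inspect

def inner():
    # Walk the call stack: each frame records the function to resume
    # and the line it will resume at -- the interpreter-level analogue
    # of the return addresses the CPU pushes with each "call".
    for frame in inspect.stack():
        print(f"resume {frame.function} at line {frame.lineno}")

def middle():
    inner()     # records a return point for middle()

def outer():
    middle()    # records a return point for outer()

outer()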

What about GPUs?

So, do GPUs have opcodes? Yes: GPUs run opcodes and have a program counter and stack pointer like a traditional CPU. In other words, GPUs run programs. Those programs are expressed as "binaries" akin to the binaries used on traditional CPUs. Those binaries are generated by compilers that take higher-level source code as input.

In that sense, GPUs are just like traditional CPUs. However…

GPU programming models

CUDA, HIP and OpenCL

GPUs extend that programming model by allowing for parallel execution streams. More on that soon. First, let's quickly name the technologies that represent GPU programming models:

  • NVIDIA CUDA (Compute Unified Device Architecture)
  • AMD HIP (Heterogeneous-compute Interface for Portability)
  • Khronos Group's OpenCL

Whereas the lowest layer of CUDA and HIP may be conceived of as C/C++ enhanced with additional keywords and annotations that address the differences between GPUs and CPUs, OpenCL is closer to a set of programming primitives that you can use with relatively standard C compilers. CUDA and HIP compile to intermediate representations (such as PTX), which are then either pre-compiled or JIT-compiled to GPU-specific opcodes. OpenCL is either compiled to an intermediate representation known as the Standard Portable Intermediate Representation (SPIR) or consumes source code relatively directly (see Deeper Dives below). In either case, OpenCL targets the physical GPU via just-in-time (JIT) compilation of SPIR or source.

Programming model summary

Here is a summary table:

NOTE: ROCm (Radeon Open Compute) above is unrelated to RoCE (RDMA over Converged Ethernet)

CUDA vs HIP vs OpenCL: See "Deeper Dives" below.

Kernels

GPUs work in concert with CPUs to complete tasks. The role of the CPU is to orchestrate the activity of one or more GPUs by loading code onto the GPUs, arranging for data to flow into and out of the GPU, triggering GPU computations and providing user interfaces related to the tasks being performed.

The segments of code that are loaded onto GPUs are known as "kernels." Each kernel defines inputs, outputs and code. The operating system that hosts the CPU side of computations will typically have a "driver" for the GPU (or GPU cluster) loaded. This driver manages the communication context between the CPU(s) and GPU(s), which typically consists of ring buffers, memory-mapped IO ports, and PCIe or other physical communication channels.

When a kernel needs to be executed, these mechanisms are used to put the kernel into the execution context of the GPU. This step configures the execution context for that kernel, including its memory and its geometry. A key difference between GPUs and CPUs is that GPUs are architected to run similar operations in a highly parallelized manner across massive amounts of data. By "geometry" I mean the nature and scale of that parallelization; for example, a given kernel may need to be run in parallel on each item of a dataset arranged as a 1-D list of length N, i.e., the kernel will run in N parallel "threads," each one simultaneously operating on a single datum in the list.
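To make "geometry" concrete, here is a minimal sketch using Numba (one of the Python libraries discussed in the next section, and assuming a CUDA-capable GPU): the kernel is launched across enough 256-thread blocks to cover a 1-D list of length N, and each thread handles exactly one element. The names and sizes are arbitrary and purely illustrative.

import numpy as np
from numba import cuda

@cuda.jit
def square(data, out):
    i = cuda.grid(1)            # this thread's global index
    if i < data.shape[0]:       # guard: the last block may overhang the data
        out[i] = data[i] * data[i]

N = 100_000
data = np.arange(N, dtype=np.float32)
out = np.zeros_like(data)

threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block   # the "geometry"
square[blocks_per_grid, threads_per_block](data, out)                # ~N parallel threads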

Python

"But wait," you may be thinking, "I thought Python was the language of choice for ML practitioners!?" How does a dynamic non-compiled language like Python fit into the high-performance opcode binary-based execution model of a GPU?

Python itself does not run directly on the GPU. Instead, it plays a critical role as a high-level orchestration and scripting interface. Through libraries like PyCUDA, Numba, and CuPy, Python can prepare data, allocate GPU memory, compile kernels, and trigger their execution using low-level APIs wrapped in Pythonic interfaces. These libraries interact with the underlying CUDA or OpenCL runtimes, often via bindings to C-level libraries (like the CUDA Driver API or OpenCL ICD). The Python code handles the host-side responsibilities (compiling GPU kernels, uploading data, launching execution), while the bulk of the compute is done by binaries running on the GPU or by binary code triggered from Python on the CPU.

For example, a string containing kernel source code (like the OpenCL or CUDA kernels in Deeper Dives below) may be passed into a runtime compiler from Python (e.g., SourceModule in PyCUDA, or clCreateProgramWithSource via PyOpenCL), and the resulting compiled kernel is then launched on the GPU from Python code. This allows ML engineers to prototype and launch kernels with minimal boilerplate while still having full access to the GPU's performance. Python handles the preparation and control flow, but when the kernel is launched, execution transitions to the compiled GPU binary in device memory; Python waits for results and triggers further steps.
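As a concrete sketch of that flow (assuming PyCUDA is installed and a CUDA-capable GPU is present), the kernel below is an abbreviated version of the CUDA vector_add from Deeper Dives, handed to the runtime compiler as a plain source string:

import numpy as np
import pycuda.autoinit                      # creates a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Kernel source passed to the runtime compiler as a string
mod = SourceModule("""
__global__ void vector_add(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}
""")
vector_add = mod.get_function("vector_add")

N = 512
a = np.arange(N).astype(np.float32)
b = np.arange(N).astype(np.float32)
c = np.empty_like(a)

# drv.In/drv.Out handle the host<->device copies; block/grid set the geometry
vector_add(drv.In(a), drv.In(b), drv.Out(c),
           block=(256, 1, 1), grid=(N // 256, 1))
print(c[:4])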

PyTorch

Frameworks like PyTorch use Python as the user-facing API layer, but internally, performance-critical parts are implemented in C++ (or in GPU dialects of it such as CUDA and HIP). PyTorch employs JIT (just-in-time) compilation in two main ways: one, through TorchScript, which traces or compiles subsets of Python code into an intermediate representation that can be optimized and executed without the Python interpreter; and two, through dynamic kernel fusion and generation via tools like torch._inductor or functorch, which emit CUDA kernels on the fly, compile them during runtime, and launch them directly. This hybrid approach keeps Python expressive and flexible while delivering near-C performance on the GPU backend.
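Here is a rough sketch of those two paths (API names as of recent PyTorch releases, assuming a CUDA-capable GPU; treat the details as illustrative rather than definitive):

import torch

def f(x, y):
    return torch.sin(x) + torch.cos(y)

# Path 1: TorchScript -- compile to an IR that can run without the Python interpreter
scripted = torch.jit.script(f)

# Path 2: torch.compile (TorchDynamo + Inductor) -- may fuse ops and generate GPU kernels at runtime
compiled = torch.compile(f)

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
print(torch.allclose(scripted(x, y), compiled(x, y)))   # same math, different compilation paths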

This is not your grandma's parallelism

The needs of the many outweigh the needs of the one

                                                                         – Spock

CPUs are designed to handle a few threads quickly, employing deep instruction pipelines and sophisticated branch prediction. Note that physical (hardware) threads are not the same as the threads offered up by operating systems (e.g., pthreads on Unix). The latter are software constructs related to how operating systems divide time onto the physical threads provided by the hardware.

At a hardware level, CPUs have one thread per physical core, or two when "hyperthreading" (Intel's term for simultaneous multithreading, SMT) is enabled. The physical resources used by threads at this level are materialized directly in the hardware. For example, when hyperthreading two threads onto a single physical core, that core's registers are effectively duplicated so that each of the two hyperthreads experiences its own "copy" of each CPU register. Crucially, this is done in hardware at hardware speeds: Each time the execution pipeline switches to the other hyperthread, the CPU's registers do not have to be moved in and out of slower storage via CPU instructions.

For operating system threads, CPU registers and other physical resources are instead virtualized by "manually" managing multiple views and swapping the right view into place before continuing the execution of the associated thread. That's the role of the operating system "scheduler." This is significantly slower than what happens with physical threads, where there is either no "swapping" at all or, as with hyperthreads, the swap happens without executing additional opcode-level instructions, transparently to any software layer, and therefore much more efficiently.

Massively parallel

Whereas CPUs handle roughly as many physical threads as there are cores, GPUs tackle thousands of threads simultaneously. Grandma might crochet one square at a time with care and attention; a GPU spins up a warp of 32 threads and blasts through data in SIMD (single-instruction, multiple-data) fashion. It doesn't speculate, it doesn't branch easily—it marches forward like a platoon, all threads executing the same instruction unless diverged. Picture an armature tied to grandma's arms and fingers, so that her movements can be replicated across 32 balls of yarn, to make 32 squares at once.

Cue AI-generated image…

GPUs do like to crochet!

The fundamental unit of execution on an NVIDIA GPU is the warp, typically 32 threads that execute in lockstep on a streaming multiprocessor (SM). These warps are grouped into blocks, and blocks are organized into grids, forming a hierarchical execution geometry (and completing the crochet analogy). This structure isn't just theoretical—when you launch a CUDA kernel, you define this geometry explicitly, deciding how your workload will be subdivided and distributed across the available cores.

Within the warp, each thread executes the same opcode in lockstep; in a sense, they all share the same program counter. However, each thread has its own copy of physical registers (such as the stack pointer), and because those registers are initialized differently, each thread accesses different pieces of data as the warp marches forward.
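In Numba's CUDA dialect the same per-thread arithmetic is spelled out explicitly: every thread runs the identical line of code, but its private threadIdx/blockIdx values make i different in each one. A minimal sketch (names and sizes are illustrative):

import numpy as np
from numba import cuda

@cuda.jit
def copy_kernel(src, dst):
    # Same instruction stream for every thread; only the register contents differ
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < src.shape[0]:
        dst[i] = src[i]

src = np.arange(1024, dtype=np.float32)
dst = np.zeros_like(src)
copy_kernel[4, 256](src, dst)   # 4 blocks x 256 threads = 1024 threads, one per element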

Part of the value of a framework like PyTorch is that ML practitioners can focus on expressing the computations they wish to execute and can describe the geometry of those computations essentially declaratively. The framework then translates those declarations into specific CUDA calls that materialize that geometry onto physical GPU threads running the requested kernels.
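For example (a minimal sketch; the tensor sizes are arbitrary), the user declares only the computation and the device, and PyTorch and its backend libraries choose the kernels and their launch geometry:

import torch

a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")

# No grids, blocks, or threads in sight: the framework maps this matmul
# onto GPU kernels and picks their launch geometry for us.
c = a @ b
torch.cuda.synchronize()   # wait for the asynchronous GPU work to finish
print(c.shape)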

Breaking rank

It's easy to see how a team of threads marching along the same instructions across different data can work in lockstep when there are no branches (decisions) within the instruction stream. But what happens when those instructions do have a branch or decision? If the threads in a warp take different sides of a branch, the warp "diverges" (loses coherence): the GPU then executes each side of the branch serially, masking off the threads that didn't take it, which amounts to a slower, un-ganged mode of execution across the threads.
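Here's a sketch of the kind of branch that causes divergence (Numba again; the kernel and condition are illustrative). Threads within the same warp that disagree about x[i] > 0 force the warp to run both sides of the branch, one after the other, with the non-participating lanes masked off:

import numpy as np
from numba import cuda

@cuda.jit
def relu_double(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        if x[i] > 0.0:            # warp diverges when lanes disagree here
            out[i] = x[i] * 2.0   # taken by some lanes...
        else:
            out[i] = 0.0          # ...the rest wait, then run this side

x = np.random.randn(4096).astype(np.float32)
out = np.zeros_like(x)
relu_double[(4096 + 255) // 256, 256](x, out)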

Understanding the implications of warp divergence, memory coalescing, and occupancy is essential for getting peak performance; just like crochet has tension and yarn weight, GPU programming has registers per thread and shared memory limits. Where this model really shines is in tasks that map naturally to data-parallel operations: matrix multiplications, image convolutions, simulations, etc. These aren't tasks where you want one brilliant CPU core to think deeply; they're jobs for an army of cores, each doing simple operations over lots of data, quickly and together.
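Memory coalescing is the same kind of concern: adjacent threads should touch adjacent addresses so the hardware can merge their loads into a few wide transactions. A hedged sketch of the good and bad access patterns (Numba; the stride value is arbitrary and purely illustrative):

import numpy as np
from numba import cuda

@cuda.jit
def coalesced(src, dst):
    i = cuda.grid(1)
    if i < src.shape[0]:
        dst[i] = src[i]               # neighboring threads read neighboring floats: coalesced

@cuda.jit
def strided(src, dst, stride):
    i = cuda.grid(1)
    if i < src.shape[0]:
        j = (i * stride) % src.shape[0]   # neighboring threads read far-apart floats: many transactions
        dst[i] = src[j]

N = 1 << 20
src = np.arange(N, dtype=np.float32)
dst = np.zeros_like(src)
blocks = (N + 255) // 256
coalesced[blocks, 256](src, dst)
strided[blocks, 256](src, dst, 33)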

Summary

It's not a stretch to say that the GPU has become the loom of the modern ML era, weaving tensors and gradients instead of yarn. But the abstraction is fragile. GPU programming is elegant at a high level, but challenging when facing hardware realities: thread scheduling behavior, memory latency hiding, warp shuffles, and bank conflicts.

Just like your grandma's quilting projects, the pattern may look beautiful from afar, but achieving that regularity at scale requires precision, constraint, and understanding of the substrate. Thus, the popularity of frameworks like PyTorch.


Deeper dives


CUDA vs HIP vs OpenCL

CUDA

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vector_add(float *A, float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 512;
    size_t size = N * sizeof(float);

    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    for (int i = 0; i < N; i++) { h_A[i] = i; h_B[i] = i; }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    vector_add<<<(N+255)/256, 256>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    printf("C[0] = %f\n", h_C[0]);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

 

HIP

#include <hip/hip_runtime.h>
#include <stdio.h>

__global__ void vector_add(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 512;
    size_t size = N * sizeof(float);

    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i;
    }

    float *d_A, *d_B, *d_C;
    hipMalloc(&d_A, size);
    hipMalloc(&d_B, size);
    hipMalloc(&d_C, size);

    hipMemcpy(d_A, h_A, size, hipMemcpyHostToDevice);
    hipMemcpy(d_B, h_B, size, hipMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    hipLaunchKernelGGL(vector_add, dim3(blocksPerGrid), dim3(threadsPerBlock), 0, 0, d_A, d_B, d_C, N);

    hipMemcpy(h_C, d_C, size, hipMemcpyDeviceToHost);

    printf("C[0] = %f\n", h_C[0]);

    hipFree(d_A); hipFree(d_B); hipFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

OpenCL

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

const char* kernel_src =
"__kernel void vector_add(__global float* A, __global float* B, __global float* C) {\n"
"  int i = get_global_id(0);\n"
"  C[i] = A[i] + B[i];\n"
"}\n";

int main() {
    int N = 512;
    size_t size = N * sizeof(float);
    float *A = malloc(size), *B = malloc(size), *C = malloc(size);
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = i; }

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    cl_program program = clCreateProgramWithSource(context, 1, &kernel_src, NULL, NULL);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    cl_kernel kernel = clCreateKernel(program, "vector_add", NULL);

    cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
    cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
    cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size, NULL, NULL);

    clEnqueueWriteBuffer(queue, d_A, CL_TRUE, 0, size, A, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, d_B, CL_TRUE, 0, size, B, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);

    size_t global = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    clEnqueueReadBuffer(queue, d_C, CL_TRUE, 0, size, C, 0, NULL, NULL);

    printf("C[0] = %f\n", C[0]);

    clReleaseMemObject(d_A); clReleaseMemObject(d_B); clReleaseMemObject(d_C);
    clReleaseKernel(kernel); clReleaseProgram(program);
    clReleaseCommandQueue(queue); clReleaseContext(context);
    free(A); free(B); free(C);
    return 0;
}

Comparison


Other writeups

The following resources offer a deeper dive into the topics I've covered above:

GPU Architecture & CUDA Programming (Stanford)

"How is a CUDA kernel launched?" (Stack Overflow)

Programming on GPUs (Notre Dame)

Introduction to GPU (UCSB)

GPU Architectures, A CPU Perspective (U of Washington)

How GPUs work: from shader code to a TeraFLOP (UC Berkeley)

GPU Architecture and CUDA Programming (U of MN)