Unlocking AI Performance with Intel's AMX Accelerator in 5th-Gen Xeon Processors
As artificial intelligence (AI) continues to evolve, the demand for efficient, high-performance compute solutions grows exponentially. Traditionally, Graphics Processing Units (GPUs) have been the go-to hardware for accelerating AI workloads, especially for large models. However, recent advancements in Central Processing Unit (CPU) architecture have introduced powerful features that can handle many AI tasks without relying on GPUs. One such innovation is Intel's Advanced Matrix Extensions (AMX), integrated into the 5th-generation Intel Xeon processors.
World Wide Technology engineers recently explored how Intel's AMX accelerator can enhance AI workloads, particularly for models with up to 13 billion parameters. This capability offers an efficient alternative to GPU acceleration.
What is Intel AMX, and how can it take the place of a conventional GPU for AI workflows?
Intel AMX is a built-in accelerator in Intel Xeon Scalable processors, available in the 4th-, 5th- and 6th-generation parts. AMX is a set of hardware instructions designed to accelerate matrix operations, which are fundamental to many AI models, especially neural networks. It introduces new data types and instructions optimized for matrix multiplication, enabling CPUs to perform these computations more efficiently and at higher throughput.
Key features of Intel AMX include:
- Tile-based matrix operations: Breaks down large matrix computations into smaller tiles for efficient processing.
- Enhanced vector processing: Extends existing vector units with matrix-specific instructions.
- Support for INT8 and bfloat16 (BF16) operations: Suitable for quantized and mixed-precision models, reducing memory bandwidth requirements and increasing performance.
While GPUs excel at massive parallelism for extremely large models, CPUs equipped with AMX can deliver impressive performance for a wide range of AI tasks, especially when models are within a certain size threshold.
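As a quick sanity check before targeting AMX, you can confirm that a host actually exposes the instructions. The snippet below is a minimal sketch (not part of the original testing), assuming a Linux host whose kernel reports amx_tile, amx_int8 and amx_bf16 among the CPU flags in /proc/cpuinfo.

# Minimal sketch: report any AMX-related CPU flags on a Linux host.
from pathlib import Path

def amx_flags(cpuinfo_path="/proc/cpuinfo"):
    for line in Path(cpuinfo_path).read_text().splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            return sorted(f for f in flags if f.startswith("amx"))
    return []

if __name__ == "__main__":
    found = amx_flags()
    print("AMX flags:", ", ".join(found) if found else "none detected")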
Advantages of using Intel AMX for AI
- Reduced latency and improved throughput: AMX accelerates matrix multiplications directly on the CPU, reducing the need for data movement between CPU and GPU. When a workload's working set grows larger than a GPU's HBM capacity, data must be shuttled off and back onto the device, which can make the GPU markedly less efficient.
- Lower power consumption: CPUs with AMX can perform AI tasks more efficiently, consuming less power compared to GPU-based solutions.
- Simplified deployment: Eliminates the need for specialized GPU infrastructure, easing integration into existing CPU-based data centers.
Supporting models up to 13 billion parameters
Large language models (LLMs) and other AI architectures often exceed what standard CPU instructions can handle efficiently. However, CPUs with Intel AMX are well suited to models with up to approximately 13 billion parameters, making it feasible to run these models efficiently on CPU hardware.
Why this threshold?
- Memory footprint: Models larger than 13 billion parameters require extensive memory bandwidth and compute resources, which are often better served by GPUs or dedicated accelerators such as Intel Gaudi, whose architectures link multiple devices with high-speed interconnects.
- Model complexity: For models within this size range, AMX provides sufficient acceleration to approach GPU-level performance, especially when the model is quantized or otherwise optimized.
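As a rough, back-of-the-envelope illustration of why this threshold makes sense: a 13-billion-parameter model stored in BF16 needs about 13 × 10⁹ parameters × 2 bytes ≈ 26 GB for the weights alone (roughly half that in INT8). That fits comfortably in a Xeon server's DRAM, but it approaches or exceeds the HBM capacity of many single accelerators once activations and KV cache are added.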
Practical implications
- Cost-effective AI infrastructure: Organizations can deploy powerful AI inference and training capabilities without investing heavily in GPU clusters.
- Edge and data center deployment: CPUs with AMX are ideal for edge devices and data centers where power, space and cost constraints are critical.
- Simplified software stack: Developers can optimize existing CPU-based code to leverage AMX instructions, reducing complexity.
Workloads that commonly benefit from AMX acceleration include:
- NLP and LLM inference (including quantized models)
- Vision and object detection
- Recommendation systems
- Quantized deep learning via ONNX Runtime or PyTorch with AMX-aware kernels (see the sketch after this list)
- Emerging HPC workloads, such as LLM token decoding, that benefit from AMX combined with sparsity
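To make the PyTorch path above concrete, here is a minimal sketch (an illustration, not code from this testing) of bfloat16 inference on the CPU. On AMX-capable Xeons, the underlying oneDNN kernels can route these matrix multiplications onto AMX tiles; on other CPUs, the same code falls back to AVX-512 or AVX2 kernels.

# Minimal sketch: bfloat16 inference on the CPU with PyTorch autocast.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).eval()
x = torch.randn(32, 4096)

# Autocast runs eligible ops (e.g., nn.Linear) in bfloat16 on the CPU.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype, tuple(y.shape))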
Testing of the AMX instruction sets
To show the results across different instruction sets, we used PyTorch's torch.matmul to compute the matrix multiplication of two tensors (a sketch of this micro-benchmark appears after the ISA list below). While these runs were on 5th-generation Intel Xeon processors, we look forward to repeating them on Xeon 6, which adds FP16-capable AMX instructions.
We ran ten iterations across different Instruction Set Architectures (ISAs), and the results are summarized below.
A full list of ISAs can be found here.
ISAs in use
- SSE41: Intel Streaming SIMD Extensions 4.1 (Intel SSE4.1)
- AVX2_VNNI: Intel AVX2 with Intel Deep Learning Boost (Intel DL Boost)
- AVX512_CORE: Intel AVX-512 with AVX512BW, AVX512VL and AVX512DQ extensions
- AVX512_CORE_BF16: Intel AVX-512 with Intel DL Boost and bfloat16 support
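For reference, below is a minimal sketch of the kind of torch.matmul micro-benchmark described above; it is not the exact script behind the published numbers. It assumes the matrix multiplications run through oneDNN, whose CPU dispatcher can be capped with the ONEDNN_MAX_CPU_ISA environment variable (set before Python starts) to compare ISAs such as those listed above.

# Minimal sketch of a torch.matmul micro-benchmark (illustrative only).
# To compare ISAs, launch with the oneDNN dispatcher capped, e.g.:
#   ONEDNN_MAX_CPU_ISA=AVX512_CORE_BF16 python bench_matmul.py
import time
import torch

def bench_matmul(n=4096, dtype=torch.float32, iters=10):
    a = torch.randn(n, n, dtype=dtype)
    b = torch.randn(n, n, dtype=dtype)
    torch.matmul(a, b)                      # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    per_iter = (time.perf_counter() - start) / iters
    gflops = 2 * n ** 3 / per_iter / 1e9    # 2*n^3 FLOPs per square matmul
    print(f"{dtype}: {per_iter * 1e3:.1f} ms/iter, ~{gflops:.0f} GFLOP/s")

if __name__ == "__main__":
    bench_matmul(dtype=torch.float32)
    bench_matmul(dtype=torch.bfloat16)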
Taking it a step further for a more real-world look at what these processors can do, we used vLLM to serve the Qwen/Qwen2-VL-7B-Instruct model and run a benchmark test.
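# Start an OpenAI-compatible vLLM server for the model (CPU backend)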
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
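# Drive the server with vLLM's serving benchmark, using VisionArena-Chat prompts from Hugging Face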
python3 benchmark_serving.py \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--hf-split train \
--num-prompts 1000
# Source of benchmark:
https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md
Grafana graph of token throughput
CPU usage and memory usage per process. The KV cache size was set to 200 GB, and TCMalloc was used for high-performance memory allocation. Hyper-threading was disabled, and two CPU cores were reserved for the serving framework.
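# vLLM CPU-backend tuning: VLLM_CPU_KVCACHE_SPACE sets the KV cache budget in GB;
# VLLM_CPU_NUM_OF_RESERVED_CPU keeps cores free for the serving framework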
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=2
Conclusion
Intel's AMX accelerator in the 5th-generation Xeon processors signifies a major step forward in CPU-based AI acceleration. By enabling efficient matrix operations directly on the CPU, AMX makes it feasible to run models with up to 13 billion parameters without relying on GPUs. This advances AI deployment across various sectors, offering a cost-effective, power-efficient and simplified approach to high-performance AI computing.
As AI models continue to grow, innovations like AMX will play a crucial role in democratizing access to powerful AI capabilities, ensuring organizations can meet the increasing demand without the complexity of GPU infrastructure.