Intel AutoRound Boosts Speed and Efficiency of Quantized LLMs on Intel GPUs and CUDA Devices; Crescent Island with FP8, MXFP8, and MXFP4 Unveiled
Intel is setting a new standard for Large Language Models (LLMs) with its innovative post-training quantization algorithm, AutoRound. This advancement promises enhanced efficiency and speed, optimizing LLM inference across Intel’s extensive range of CPUs and GPUs. Moreover, the upcoming Crescent Island GPU is set to support the latest quantization formats, including FP8, MXFP8, and MXFP4.
Revolutionizing LLM Performance with AutoRound
Intel’s AutoRound, a cutting-edge post‑training quantization (PTQ) algorithm, has been integrated into LLM Compressor to elevate model performance. This collaboration ensures:
- Improved accuracy for low bit-width quantization
- Streamlined tuning process requiring only hundreds of steps
- Zero added inference overhead
- Effortless compatibility with compressed-tensors, enabling direct serving in vLLM
- Easy workflow: quantize and serve models with minimal code
Intel’s AutoRound is a pivotal advancement that minimizes output reconstruction error by jointly optimizing rounding and clipping for LLMs and VLMs.
Key Features and Capabilities of AutoRound
AutoRound is an advanced PTQ algorithm designed to enhance both Large Language Models (LLMs) and Vision-Language Models (VLMs). It introduces three trainable parameters per quantized tensor—v (rounding offset), and α and β (clipping range controls). Processing the model sequentially and applying signed gradient descent, it optimizes rounding and clipping together, reducing output reconstruction error.
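The core idea can be illustrated with a small, self-contained sketch—not Intel’s implementation: 4-bit fake quantization with a learnable per-weight rounding offset V, tuned by signed gradient descent through a straight-through estimator on the output reconstruction error. The shapes, calibration data, per-tensor scale, and learning rate are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))          # toy weight matrix
X = rng.normal(size=(64, 16))          # toy calibration activations

bits = 4
qmax = 2 ** (bits - 1) - 1             # 7 for signed 4-bit
s = np.abs(W).max() / qmax             # per-tensor scale (real use: per-group)

def dequant(V):
    # fake-quantize W with a learnable rounding offset V (one per weight)
    return np.clip(np.round(W / s + V), -qmax - 1, qmax) * s

def out_mse(V):
    # output reconstruction error on the calibration batch
    return np.mean((X @ dequant(V) - X @ W) ** 2)

V = np.zeros_like(W)
base = out_mse(V)
best, best_V = base, V.copy()
for _ in range(300):
    # straight-through gradient of the output MSE w.r.t. V, then a signed step
    E = X @ dequant(V) - X @ W
    G = s * 2.0 * (X.T @ E) / E.size
    V = np.clip(V - 0.01 * np.sign(G), -0.5, 0.5)
    cur = out_mse(V)
    if cur < best:
        best, best_V = cur, V.copy()

assert best <= base                    # tuned rounding never does worse here
```

In the full algorithm, v is tuned jointly with the clipping controls α and β, block by block, rather than the single per-tensor pass shown here.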

The core strengths of AutoRound include:
- Exceptional accuracy, particularly at very low bit‑widths
- Support for multiple data types: W4A16, MXFP8, MXFP4, FP8, NVFP4, and more
- Mixed‑bit, layer‑wise precision search for a balance between accuracy and efficiency
- Application across both LLMs and VLMs
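The mixed-bit, layer-wise idea can be sketched as a simple greedy search—a stand-in for AutoRound’s actual precision search, with made-up layer sizes and a made-up bit budget: start every layer at 4 bits, then upgrade the layers whose quantization error drops the most until an average-bit budget is spent.

```python
import numpy as np

rng = np.random.default_rng(1)
# four toy "layers" with increasingly wide weight distributions
layers = [rng.normal(scale=sc, size=(32, 32)) for sc in (0.5, 1.0, 2.0, 4.0)]

def quant_err(W, bits):
    # round-to-nearest quantization MSE for one layer at the given bit-width
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max() / qmax
    return np.mean((np.clip(np.round(W / s), -qmax - 1, qmax) * s - W) ** 2)

budget = 6 * len(layers)               # average of 6 bits per layer
assign = {i: 4 for i in range(len(layers))}
while sum(assign.values()) + 4 <= budget:
    # upgrade the layer whose error drops most when moved from 4 to 8 bits
    gains = {i: quant_err(W, 4) - quant_err(W, 8)
             for i, W in enumerate(layers) if assign[i] == 4}
    if not gains:
        break
    assign[max(gains, key=gains.get)] = 8
```

Here the wide-distribution layers end up at 8 bits while the well-behaved ones stay at 4, which is the accuracy/efficiency trade-off a layer-wise search is after.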
This technology enables the use of quantized models in a variety of low‑bit formats, accelerating inference on Intel Xeon processors, Intel Gaudi AI accelerators, Intel Data Center GPUs, and Intel Arc B‑Series Graphics, along with CUDA-based GPUs.
Looking Ahead: Next-Gen Support and Beyond
As Intel looks to the future, it’s integrating native support for formats like FP8, MXFP8, and MXFP4 in its upcoming Intel Data Center GPU, codenamed Crescent Island. With AutoRound, quantized models are poised to leverage these formats for seamless scaling across Intel’s AI hardware lineup, bridging the gap between algorithmic advancements and practical deployment.
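The MX formats pair a shared power-of-two scale per small block with narrow elements. A toy NumPy rendition of MXFP4-style quantization gives the flavor—the block size of 32 and the FP4 (E2M1) value grid follow the OCP Microscaling convention, but everything else here is illustrative, not a bit-exact encoder:

```python
import numpy as np

# FP4 (E2M1) magnitudes per the OCP Microscaling (MX) convention
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(x, block=32):
    # one shared power-of-two (E8M0-style) scale per block of 32 values
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / FP4_GRID[-1]))
    # snap each scaled magnitude to the nearest FP4 value, keep the sign
    mags = np.abs(xb) / scale
    idx = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(xb) * FP4_GRID[idx] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
xq = mxfp4_quantize(x)
```

Because each block stores only a tiny shared scale plus 4-bit elements, hardware with native MXFP4 paths—such as the formats planned for Crescent Island—can keep memory traffic low while preserving per-block dynamic range.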