
Demystifying Quantizations: Guide to Quantization Methods for LLMs

Quantization is key to running large language models efficiently, balancing accuracy, memory, and cost. This guide explains quantization from its early use in neural networks to today’s LLM-specific techniques like GPTQ, SmoothQuant, AWQ, and GGUF.

By Igor Šušić

You need to consider multiple factors when selecting which LLM to deploy. The goal of inference engines is to maximize throughput, minimize memory footprint, keep accuracy high, and control costs as much as possible. Quantization can help with this.

So what is quantization, and why does it matter? Quantization is the process of constraining an input from a continuous or otherwise large set of values to a discrete set.

The term “quantization” is often used in conjunction with high-throughput and memory-efficient serving engines such as vLLM, SGLang, Triton, etc. However, few good resources exist that make quantization more approachable to curious practitioners. 

After all, LLMs are not reserved for machine learning engineers only. Even the vLLM documentation lists GGUF under quantization options, which might mislead readers into thinking GGUF is a quantization method. In fact, it’s simply a file format used for storing models.

There is a well-known quote by Tim Dettmers that captures the essence of quantization research:

“Quantization research is like printers. Nobody cares about printers. Nobody likes printers. But everybody is happy if printers do their job.”

This article begins with quantization before the LLM era and explains why it matters in the LLM world. It then explores LLM-specific quantization and breaks down popular terms and techniques such as GGUF, SmoothQuant, AWQ, and GPTQ – providing just enough detail to clarify the concepts and their practical use.

Why quantization? Recap of data types used in LLMs

Note: There are two main approaches to quantization: quantization-aware training (QAT) and post-training quantization (PTQ). This article focuses only on post-training quantization (PTQ), which allows us to modify the model after training or, for most engineers, after the model has been open-sourced.

To understand quantization, it helps to start with a quick recap of the data types involved. When an open-source model is downloaded, the artificial neural network inside is essentially a collection of numbers stored across multiple files, along with some accompanying metadata.

Integers

Integers are the most basic data type, represented as a sequence of bits. They’re straightforward, efficient, and inexpensive to compute with, but they cannot represent fractional values, sacrificing precision. For example, representing a bank account balance solely with integers would be highly unreliable.

Floating-point number representation

When you need to represent fractional values with high precision, you turn to floating-point representation, standardized as IEEE 754.

Single precision 

In a 32-bit representation:

  • 1 bit is reserved for the sign, allowing both positive and negative values.
  • 8 bits are allocated to the exponent, which defines the range of representable values, that is, how large or small a number can be.
  • 23 bits are used for the mantissa, which determines the level of precision.
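The 1/8/23 bit split above can be inspected directly. The following sketch unpacks a Python float into its IEEE 754 single-precision fields:

```python
import struct

def float32_bits(x: float) -> tuple[int, int, int]:
    """Unpack a float into its IEEE 754 single-precision sign, exponent, and mantissa."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]  # the 32-bit pattern as an integer
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits (biased by 127)
    mantissa = raw & 0x7FFFFF        # 23 bits
    return sign, exponent, mantissa

# -1.5 = -1.1₂ × 2^0: sign 1, biased exponent 127, top mantissa bit set
print(float32_bits(-1.5))  # (1, 127, 4194304)
```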

A 32-bit floating-point number, or single precision, has been the most widely used precision for training neural networks throughout their history.

Half precision

The IEEE 754 standard also defines double precision (64-bit) and half precision (16-bit) formats. Double precision is rarely used in LLM training or deployment, so it is ignored here. Half precision, meanwhile, has become the de facto standard for newly released models, with a bit of a twist.

The standard IEEE 754 half-precision layout is 1 sign bit, 5 exponent bits, and 10 mantissa bits, yet most of the models published lately do not follow that layout, even though they are 16-bit formats. The mentioned twist is that Google invented a new format called “bfloat” or “brain floating point,” which is the standard used today when publishing unquantized models.

bfloat

The difference is in the allocation of bits for the exponent and mantissa: bfloat16 uses 8 exponent bits and 7 mantissa bits. The number of exponent bits is the same as in IEEE single precision, which makes it easy to convert from bf16 to fp32. It keeps the same dynamic range as fp32 but sacrifices precision.
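Because bf16 shares fp32’s exponent width, a crude conversion is just dropping the low 16 bits of the fp32 pattern. The sketch below uses simple truncation; production conversions typically round to nearest even, but the idea is the same:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncate an fp32 value to bf16: keep the top 16 bits
    (sign + 8-bit exponent + 7-bit mantissa), zero the rest."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    truncated = raw & 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]

# Same dynamic range as fp32, but only 7 mantissa bits of precision:
print(to_bfloat16(3.141592653589793))  # 3.140625
```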

4-bit numbers

Finally, consider 4-bit numbers such as int4 and fp4. While models can also be quantized using binary or ternary schemes, those approaches fall outside the scope of this discussion. In practice, 4-bit precision is generally the lowest useful level applied in post-training quantization.

Each format presents its own trade-offs. For instance, fp4 (e1m2) prioritizes precision over dynamic range and can represent infinity and NaN values. In contrast, fp4 (e3m0) sacrifices that capability and cannot represent NaN or infinity.

Another important factor to consider with different formats is the memory required per value. For the Qwen3-32B model, for example, this can mean around 45 GB less memory needed to store the model.
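The memory math is simple back-of-the-envelope arithmetic: parameter count times bytes per parameter. The sketch below ignores scale/zero-point metadata, the KV cache, and activations, so real savings are somewhat smaller than the raw numbers:

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight-storage size in GB: params × bits / 8 bits-per-byte / 1e9."""
    return num_params * bits_per_param / 8 / 1e9

params = 32e9  # a 32B-parameter model such as Qwen3-32B
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_memory_gb(params, bits):6.1f} GB")
# 16-bit weights take 64 GB; 4-bit weights take 16 GB
```

Going from 16-bit to 4-bit in this estimate saves 48 GB of raw weight storage, which lines up with the roughly 45 GB figure once quantization metadata overhead is accounted for.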

Although quantization adds some overhead and slightly increases runtime costs, it directly affects which GPUs are compatible and how expensive the inference workload becomes.

Energy needed to execute per format

With data types defined and knowing that LLMs devote much of their computation to raw number crunching, the next step is to examine the efficiency of ML pipelines. The table below shows the energy required to perform specific operations, depending on the numerical format used.

It is evident that the choice of number format and operation significantly impacts the efficiency of machine learning. By now, it should be clear how these factors influence model performance during both training and inference.

Intuition behind the neural network quantization: common types of values and operations

Let’s recap the most common types of values and operations when executing a forward pass. 

The diagram below illustrates a single artificial neuron, serving as a quick reminder of the core calculations involved. Regardless of the model’s architecture, each neuron computes a weighted sum of its inputs followed by an activation function.

The output in the sketch is unbounded and can range from -inf to +inf, since no specific activation function has been selected. The sigma used to represent the activation function should not be confused with the sigmoid activation function. In this context, sigma is a placeholder for any activation function, and the value before the activation is applied can span the entire real number range, depending on the inputs and weights.
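The computation in the sketch is a one-liner. Here is a minimal version, with tanh standing in for the placeholder activation sigma:

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """One artificial neuron: weighted sum of inputs plus bias, then an activation σ."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # pre-activation, unbounded
    return activation(z)

# tanh(0.5*0.8 + (-1.0)*0.2 + 0.1) = tanh(0.3)
print(neuron([0.5, -1.0], [0.8, 0.2], 0.1))
```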

From the image above, remember:

  • Weights
  • Activations
  • Bias
  • Inputs

When quantizing the network, all components are handled differently depending on which method is applied.

Now let’s explore the topic at hand: quantization.

Quantization: a short history

In the years leading up to transformer-based models and LLMs, machine learning models grew increasingly powerful. The field saw numerous breakthroughs, with the industry racing to uncover the next big advancement.

It’s important to note that quantization as an idea originated in other fields, such as signal processing, and was not a product of machine learning hype. 

Before 2017, quantization for neural networks was mostly an academic topic, but then the paper Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference made all the difference. Finally, there was evidence that quantization could be applied in practice. Methods from the paper were implemented in TensorFlow Lite; the approach was linear quantization.

These are the two common methods that were used:

  • K-means-based quantization
  • Linear quantization

Both approaches originated well before the Transformer architecture and LLMs. They were widely applied to convolutional neural networks (CNNs), which at the time represented the primary workhorse with the greatest commercial potential.

This section provides a high-level overview of the key ideas without delving too deeply into the methods.

K-means-based quantization

Weights in any given layer are typically normally distributed with a small number of outliers. The graph below shows the density of weights in a pruned and fine-tuned model, where the distribution appears bimodal rather than normal. Values near zero were pruned, and the graph reproduces results from the Deep Compression paper.

Once you apply K-means quantization to the given model, the weight distribution becomes discrete. Only a few centroids remain after the process, as shown in the image below.

In short, this quantization method clusters numbers around centroids. If you want to know exactly how, check the Deep Compression paper. The basic idea is that for the input numbers you want to quantize, you apply K-means clustering to find 2^n different centroids (where n is the bit width) and map the continuous values to those centroids as discrete values.
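A minimal sketch of the idea, using a plain 1-D Lloyd’s algorithm rather than the exact procedure from the paper:

```python
def kmeans_quantize(values, n_bits=2, iters=20):
    """Cluster weights into 2**n_bits centroids (1-D K-means) and
    replace each weight with its nearest centroid."""
    k = 2 ** n_bits
    lo, hi = min(values), max(values)
    # initialize centroids evenly across the weight range
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each value
        assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # update step: move each centroid to the mean of its cluster
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    # map every weight to its centroid: the distribution becomes discrete
    return [centroids[min(range(k), key=lambda c: abs(v - centroids[c]))] for v in values]

weights = [-1.0, -0.9, -0.1, 0.0, 0.1, 0.9, 1.0]
print(kmeans_quantize(weights, n_bits=2))  # at most 4 distinct values remain
```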

Linear quantization

Linear quantization is an affine mapping of integers to real numbers. There are two modes of linear quantization: symmetric and asymmetric. The asymmetric mode is illustrated in the image below. The exact mechanics will not be covered here; for a deeper explanation, see this paper.

Weights are represented as r (real numbers), and the objective is similar to the K-means approach. The goal is to map the blue space on the graph to the few discrete red dots. Integers are represented by q (quantized values). The minimum and maximum values for integers depend on which quantization you want to apply. 

Let’s look at the table below to spot a pattern in determining min and max:

From the 4-bit quantization example in the table, weights can be represented with integer values ranging from –8 to 7. The remaining step is to determine the scaling parameter (S) and the zero point (Z).
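The pattern in the table is just the signed-integer range formula:

```python
def int_range(bits: int) -> tuple[int, int]:
    """Min/max representable values for a signed b-bit integer: [-2**(b-1), 2**(b-1) - 1]."""
    return -2 ** (bits - 1), 2 ** (bits - 1) - 1

for b in (2, 3, 4, 8):
    print(b, int_range(b))  # e.g. 4 -> (-8, 7), matching the table
```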

This is where most of the magic happens, as there are multiple approaches to determining both parameters. Discovering these parameters is a topic for another article.

In a super simplified example, the process would be something as described in the image below.

If you compare the blue vector (original fp16 input) and the yellow vector (reconstructed fp16 vector), there is a visible difference between the values. That difference is called quantization error.
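The full round trip can be sketched in a few lines. This is a simplified asymmetric scheme using min/max calibration (one of several ways to pick S and Z); the gap between the input and the reconstruction is the quantization error:

```python
def quantize_asymmetric(r, bits=4):
    """Asymmetric linear quantization: q = round(r / S) + Z, so r ≈ S * (q - Z)."""
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    rmin, rmax = min(r), max(r)
    S = (rmax - rmin) / (qmax - qmin)  # scale: real range per integer step
    Z = round(qmin - rmin / S)         # zero point: integer that maps to r = 0
    q = [max(qmin, min(qmax, round(v / S) + Z)) for v in r]
    return q, S, Z

def dequantize(q, S, Z):
    """Reconstruct real values from the quantized integers."""
    return [S * (v - Z) for v in q]

r = [0.12, -0.8, 0.45, 0.31, -0.2]          # original values
q, S, Z = quantize_asymmetric(r, bits=4)     # 4-bit integers in [-8, 7]
r_hat = dequantize(q, S, Z)                  # reconstructed values
print(q)
print(max(abs(a - b) for a, b in zip(r, r_hat)))  # the quantization error
```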

Quantization in the LLM era

Why are these methods not good enough for quantizing LLMs? Looking at the graph below, you can see that at around 6.7B parameters inside the model, some features start to emerge that do not allow us to easily quantize the model.

What are those features? Simply put, in neural networks, most of the weights look like this:

[0.32, 0.64, 0.98, 0.11, 0.43]

But after scaling to and beyond 6.7B parameters, some vectors will look as follows:

[-60, -45, -51, -35, -20, -67]

Outliers like the ones above usually occur in activations, but they can also occur in weights. In some literature, these are called salient weights, meaning the weights that contribute the most to the model’s output.

Usual methods will not work well here. An outlier inside the weights or activations stretches the quantization range, squashing the remaining values together and losing too much of the model’s knowledge.
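A quick demonstration of the squashing effect, using simple symmetric (absmax) quantization: a single outlier inflates the scale so much that every normal value rounds to zero.

```python
def absmax_quantize(v, bits=4):
    """Symmetric (absmax) quantization: the scale is set by the largest magnitude.
    Returns the dequantized values for comparison with the input."""
    qmax = 2 ** (bits - 1) - 1
    S = max(abs(x) for x in v) / qmax
    q = [round(x / S) for x in v]
    return [S * x for x in q]

normal  = [0.32, 0.64, 0.98, 0.11, 0.43]
outlier = normal + [-60.0]  # one outlier, as in the vectors above

print(absmax_quantize(normal))   # small values remain distinguishable
print(absmax_quantize(outlier))  # every normal value collapses to 0.0
```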

Now let’s examine the methods developed for LLMs in chronological order. Each method builds on ideas from the previous ones.

Quantization methods for LLMs

Warning: This article will not cover the following methods in depth. Instead, the focus is on highlighting their importance and key contributions to the field.

Note on all ‘When to use it?’ sections: These provide only a rough idea. The actual choice depends on many more details and specific circumstances.

GPTQ

GPTQ is significant because it was the first quantization method to compress LLMs down to the 4-bit range while maintaining accuracy. When this paper was written, the method didn’t provide any speedups for computation, as hardware did not support it. At the time of writing, GPTQ speedups are available on some hardware.

GPTQ did not do any compression on activations.

When to use it?

This technique may not be used frequently, but it remains a solid approach that can provide significant memory savings when needed.

SmoothQuant

SmoothQuant provides a solution to quantize both weights and activations to 8 bits (W8A8). It also speeds up mathematical operations on hardware. The main insight of this work is that activations contain far more outliers than weights. Migrating some of that outlier magnitude into the weights smooths the quantization process.
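The migration trick is a per-channel rescaling that leaves the layer’s output mathematically unchanged. The sketch below shows the core idea only (it stops before the actual 8-bit quantization step); the scale formula follows the paper’s max-based heuristic with migration strength alpha:

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """SmoothQuant-style difficulty migration: a per-channel scale s moves
    outlier magnitude from activations X into weights W, keeping X @ W unchanged."""
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))
    return X / s, W * s[:, None]  # (X / diag(s)) @ (diag(s) @ W) == X @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 60.0                  # inject an outlier activation channel
W = rng.normal(size=(8, 5))

Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws))        # the layer output is preserved
print(np.abs(X).max(), np.abs(Xs).max())  # activation outliers are tamed
```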

When to use it?

It is useful for batch workloads, providing speedups and compressing your models; a 530B model could be served on just one node. Still, a single inference request will not see a speedup, as it remains memory-bound, at least for LLMs. In the end, it depends on which LLM you are actually serving.

Activation-aware weight quantization (AWQ)

As the name suggests, this quantization technique is special, as it quantizes the weights with respect to the activations. Remember those outliers mentioned above? The method compresses only weights to 4 bits and leaves activations untouched in 16 bits (W4A16).
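AWQ’s key observation is that the salient weight channels can be found by looking at activation magnitudes rather than at the weights themselves. The sketch below shows only that selection step, not the full per-channel scaling search; the function name and the 1% threshold are illustrative choices, not the paper’s API:

```python
import numpy as np

def pick_salient_channels(X, top_pct=0.01):
    """Rank input channels by mean absolute activation magnitude and
    return the indices of the most salient ones (illustrative sketch)."""
    importance = np.abs(X).mean(axis=0)   # per-channel activation magnitude
    k = max(1, int(len(importance) * top_pct))
    return np.argsort(importance)[-k:]    # the top-k salient channel indices

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 100))
X[:, 42] *= 50.0                          # channel 42 carries outlier activations

print(pick_salient_channels(X))  # [42]
```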

When to use it?

This method speeds up the inference process while maintaining most of the advantages of the previous method. AWQ, when published, was presented as an SOTA method.

GGUF

First, it is important to note that GGUF is simply a file format for storing models for inference. It is efficient and supports quantized models. GGUF originated from the GGML library, which provides several methods for post-training quantization.

You can find the term “GGUF quantization” on many blog posts or GitHub issues because Ivan Kawrakow implemented most of the quantization techniques in the GGML library. He did it in his spare time, hacked multiple approaches together, and was not interested in publishing the papers or advertising himself.

The following is a general description of the approach taken by most GGML quantization methods. A detailed exploration of these methods will be reserved for another article.

In GGML, you have (at the time of writing this post) three types of quants:

  • Legacy quants – named Q4_0, Q4_1, etc. These are not really used anymore
  • k-quants – named Q3_K_S, etc. These are older, and while still used, are not recommended anymore
  • i-quants – named IQ3_S, etc. These are SOTA in GGML/llama.cpp, and it is recommended to use them

They are all block-based quants, meaning that scaling (such as the one I described in linear quantization) is applied, and statistics are determined based on the given block of the tensor.
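The block-based idea can be sketched quickly: split the tensor into fixed-size blocks and give each block its own scale, so an outlier only degrades its own block. This is a generic absmax sketch of the principle, not the exact layout of any specific GGML quant:

```python
import numpy as np

def block_quantize(w, bits=4, block_size=32):
    """Block-based absmax quantization: each block gets its own scale.
    Assumes the tensor size is divisible by block_size."""
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax  # one scale per block
    q = np.round(w / scales).astype(np.int8)              # codes in [-qmax, qmax]
    return q, scales

def block_dequantize(q, scales):
    """Reconstruct the flat tensor from block codes and per-block scales."""
    return (q * scales).reshape(-1)

w = np.random.default_rng(2).normal(size=256).astype(np.float32)
q, scales = block_quantize(w)                 # 256 values -> 8 blocks of 32
err = np.abs(w - block_dequantize(q, scales)).max()
print(q.dtype, scales.shape, err)             # small per-element reconstruction error
```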

Currently, there are no significant comparisons between the methods in GGML/llama.cpp and those published in academic papers; this topic will be addressed in a separate article.

When to use it?

GGML offers a wide choice of quantization options that trade off speed and accuracy. The best approach is to try out a few methods and compare the results, testing them on smaller models first. Keep in mind that vLLM’s code is not fully optimized for GGML quants.

Note about hardware

It’s important to mention that each specific data type needs custom implementation on hardware. Therefore, the hardware you have at your disposal also influences your quantization process. This, in turn, dictates the speedup and improvements you can reach when running an inference engine.

One example of hardware support would be the lovely table created by vLLM.

Conclusion

This article has aimed to provide a high-level overview of LLM quantization, clarifying the core concepts while leaving out many of the finer details. Readers are encouraged to explore the topic further through additional research. For those seeking fast, practical results, AI Enabler offers solutions that address these challenges at scale.
