With ChatGPT shocking the world with its performance, large language models (LLMs) have been at the center of attention in Generative AI for a while. Despite their remarkable capabilities, both training and serving LLMs are costly in budget and energy because of their immense model size. One possible way to address this issue is quantization, which represents the model’s weights and activations with lower-bit data types to reduce GPU memory requirements.

While the solution seems relatively simple, quantizing the activations of LLMs has proven difficult. To address this problem, a paper by Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han proposes SmoothQuant, an accurate and efficient post-training quantization (PTQ) solution.

Below, we summarize the paper and our key takeaways about SmoothQuant.

**Quantization**

To provide some context, let us go over what quantization really is before delving into the details of SmoothQuant. Quantization refers to mapping a high-precision value to lower-precision, discrete levels. In this paper, integer uniform quantization, specifically quantization to Int8, is studied.

$$\bar{X}^{\text{INT8}} = \left\lceil \frac{X^{\text{FP16}}}{\Delta} \right\rfloor, \quad \Delta = \frac{\max(|X|)}{2^{N-1} - 1}$$

In the above equation, X corresponds to the original, high-precision tensor, X̄ to the quantized tensor, Δ to the quantization step size, ⌈·⌋ to the rounding function, and N to the number of bits (8 in this paper). Notice that the quantization step size is calculated from the maximum absolute value of the input. Depending on when this maximum is computed (static or dynamic) and over which granularity (per-tensor or per-token/channel), four types of quantization can be identified as follows.

- Static quantization: maximum calculated offline from calibration samples.
- Dynamic quantization: maximum calculated from runtime statistics of activations.
- Per-Tensor quantization: maximum calculated from the entire tensor.
- Per-Token/Channel quantization: maximum calculated from each token/channel dimension.
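To make the granularity options concrete, here is a minimal NumPy sketch of symmetric Int8 quantization following the equation above. The function names `quantize_int8` and `dequantize` and the toy tensor are our own illustrative choices, not code from the paper.

```python
import numpy as np

def quantize_int8(x, axis=None):
    """Symmetric uniform quantization to Int8.

    axis=None: per-tensor (one step size for the whole tensor).
    axis=1:    one step size per row (per-token for a T x Ci activation).
    axis=0:    one step size per column (per-output-channel for a Ci x Co weight).
    Returns the int8 tensor and the step size(s) delta.
    """
    q_max = 127                                    # 2**(N-1) - 1 for N = 8
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    delta = np.maximum(max_abs, 1e-8) / q_max      # guard against all-zero input
    x_q = np.clip(np.round(x / delta), -q_max, q_max).astype(np.int8)
    return x_q, delta

def dequantize(x_q, delta):
    """Map the integers back to approximate real values."""
    return x_q.astype(np.float32) * delta

x = np.array([[0.5, -1.0, 2.0],
              [0.1, 0.2, -0.4]], dtype=np.float32)

q_tensor, d_tensor = quantize_int8(x)          # dynamic, per-tensor
q_token, d_token = quantize_int8(x, axis=1)    # dynamic, per-token

print(d_tensor.shape, d_token.shape)           # (1, 1) (2, 1)
```

Here the step size is derived from the tensor being quantized, which corresponds to the dynamic case. A static variant would compute `delta` once from calibration samples and reuse it at inference time.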

This paper focuses on quantization applied to the linear layers in transformers, as they make up most of the parameters and computation in LLMs. The equation for the operation is simple:

$$Y = X \cdot W$$

Y with dimension T x Co refers to the output, X with dimension T x Ci to the activations, and W with dimension Ci x Co to the weights. Here, T is the number of tokens, Ci is the number of input channels, and Co is the number of output channels. Quantization is applied not only to the weights but also to the activations, so that Int8 GEMM kernels can be used for both inference acceleration and storage reduction.
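As a rough illustration of what quantizing both operands buys, the sketch below emulates an Int8 GEMM in NumPy: both operands are quantized, the product is accumulated in Int32, and the result is rescaled afterwards. The shapes, the random data, and the helper `quantize_int8` are assumptions made for the example; a real kernel would run on dedicated integer hardware rather than NumPy.

```python
import numpy as np

def quantize_int8(x, axis):
    """Symmetric Int8 quantization with one step size per slice along `axis`."""
    q_max = 127
    delta = np.max(np.abs(x), axis=axis, keepdims=True) / q_max
    x_q = np.clip(np.round(x / delta), -q_max, q_max).astype(np.int8)
    return x_q, delta

rng = np.random.default_rng(0)
T, Ci, Co = 4, 16, 8                                  # toy sizes
X = rng.standard_normal((T, Ci)).astype(np.float32)   # activations
W = rng.standard_normal((Ci, Co)).astype(np.float32)  # weights

X_q, d_x = quantize_int8(X, axis=1)   # per-token step sizes, shape (T, 1)
W_q, d_w = quantize_int8(W, axis=0)   # per-output-channel step sizes, shape (1, Co)

# Integer matrix multiply with Int32 accumulation, then rescale the output.
acc = X_q.astype(np.int32) @ W_q.astype(np.int32)
Y_q = acc.astype(np.float32) * d_x * d_w

Y = X @ W
rel_err = float(np.linalg.norm(Y_q - Y) / np.linalg.norm(Y))
print(f"relative error of the Int8 product: {rel_err:.4f}")
```

Note that both step sizes are applied to the output after the integer product, one per row and one per column of Y.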

With the above information in mind, here are the properties of LLMs in relation to quantization.

- Activations are harder to quantize than weights. In contrast to weights, whose distribution is uniform and flat, activations contain significant outliers.
- The outliers make quantization difficult. As depicted in fig. 2, outliers (~100x larger than other values) dominate the maximum used to calculate the quantization step size (refer to the equation above). Under per-tensor quantization, this leaves few effective quantization bits/levels for the non-outlier values.
- One noticeable aspect of these outliers is that they appear in a small fraction of the channels (see fig. 2). Additionally, a channel with outliers has a consistent value range across its token dimension. Taking advantage of this behavior, the issue described in the previous point can in fact be circumvented by per-channel quantization of the activations. However, this approach maps poorly onto Int8 GEMM kernels, where scaling can only be applied along the outer dimensions of the matrix multiplication, that is, the token dimension of the activations and the output channel dimension of the weights.
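The three observations above can be reproduced on synthetic data. The sketch below uses a made-up activation matrix with one channel scaled by 100 (the index `OUTLIER` and all sizes are hypothetical). It counts how many integer levels the non-outlier values actually occupy, then checks which step sizes can be pulled out of a matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
T, Ci, Co = 64, 32, 16
OUTLIER = 3                        # hypothetical index of the outlier channel
X = rng.standard_normal((T, Ci))
X[:, OUTLIER] *= 100.0             # roughly 100x larger than the other channels

def levels_used(x, delta):
    """Number of distinct integer levels the values of x land on."""
    return int(np.unique(np.round(x / delta)).size)

q_max = 127
delta_tensor = np.abs(X).max() / q_max                         # one step size
delta_channel = np.abs(X).max(axis=0, keepdims=True) / q_max   # one per channel

normal = np.delete(X, OUTLIER, axis=1)   # the non-outlier channels
levels_per_tensor = levels_used(normal, delta_tensor)
levels_per_channel = levels_used(normal, np.delete(delta_channel, OUTLIER, axis=1))
print(levels_per_tensor, levels_per_channel)

# Which step sizes can be applied after the integer product X_q @ W_q?
X_q = rng.integers(-127, 128, size=(T, Ci)).astype(np.float64)
W_q = rng.integers(-127, 128, size=(Ci, Co)).astype(np.float64)
d_token = rng.uniform(0.01, 0.1, size=(T, 1))    # per-token (outer dimension)
d_in = rng.uniform(0.01, 0.1, size=(1, Ci))      # per-input-channel (inner dimension)

# Per-token step sizes factor out of the product as a row scaling of the output.
outer_factors_out = np.allclose((X_q * d_token) @ W_q, d_token * (X_q @ W_q))
# Per-input-channel step sizes sit inside the sum over Ci; they can only be
# moved by folding them into the other operand before the product.
inner_must_be_folded = np.allclose((X_q * d_in) @ W_q, X_q @ (d_in.T * W_q))
print(outer_factors_out, inner_must_be_folded)
```

With a single per-tensor step size, the non-outlier values collapse onto a handful of levels, while per-channel step sizes spread them over most of the Int8 range. The last check hints at why a per-input-channel factor has to be absorbed into the weights instead of being applied after the GEMM.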

**SmoothQuant**

As a solution to the above problems, SmoothQuant “smooths” the input activations by dividing them by a per-channel smoothing factor s. To maintain mathematical equivalence, the weights are scaled by s in the opposite direction, resulting in the equation below:

$$Y = \left(X \, \mathrm{diag}(s)^{-1}\right) \cdot \left(\mathrm{diag}(s) \, W\right) = \hat{X} \hat{W}$$
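The equivalence is easy to verify numerically. In the sketch below the smoothing factors are arbitrary positive numbers chosen only for the check; the sizes and variable names are likewise our own.

```python
import numpy as np

rng = np.random.default_rng(0)
T, Ci, Co = 4, 8, 6
X = rng.standard_normal((T, Ci))      # activations, T x Ci
W = rng.standard_normal((Ci, Co))     # weights, Ci x Co
s = rng.uniform(0.5, 4.0, size=Ci)    # arbitrary positive per-channel factors

X_hat = X / s             # X diag(s)^-1: divide each input channel (column) by s_j
W_hat = W * s[:, None]    # diag(s) W:    multiply the matching row of W by s_j

same_output = np.allclose(X @ W, X_hat @ W_hat)
print(same_output)        # True
```

Because the scaling cancels inside the product, any positive s leaves the layer output unchanged in exact arithmetic. The choice of s only affects how easy each operand is to quantize.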

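Such a smoothing factor can easily be calculated offline from the values of the weights and activations, allowing efficient post-training quantization:

$$s_j = \frac{\max(|X_j|)^{\alpha}}{\max(|W_j|)^{1-\alpha}}$$

Here, j indexes the input channels.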

The equation for the smoothing factor essentially migrates part of the quantization difficulty from the activations to the weights, making both tensors manageable to quantize. The hyperparameter **α** controls how much difficulty is migrated: a larger α shifts more of it from the activations to the weights. This difficulty migration process is visualized below.
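A small end-to-end sketch of the migration, under our own assumptions: synthetic data with one outlier channel, per-tensor "fake" quantization (quantize, then immediately dequantize), and α = 0.5 picked purely for illustration. The helpers `fake_quantize` and `smoothing_factor` are hypothetical names, not the paper's code.

```python
import numpy as np

def fake_quantize(x, n_bits=8):
    """Per-tensor symmetric quantization followed by dequantization."""
    q_max = 2 ** (n_bits - 1) - 1
    delta = np.abs(x).max() / q_max
    return np.clip(np.round(x / delta), -q_max, q_max) * delta

def smoothing_factor(X, W, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one value per input channel."""
    act_max = np.abs(X).max(axis=0)   # over tokens, shape (Ci,)
    w_max = np.abs(W).max(axis=1)     # over output channels, shape (Ci,)
    return act_max ** alpha / w_max ** (1 - alpha)

rng = np.random.default_rng(0)
T, Ci, Co = 64, 32, 16
X = rng.standard_normal((T, Ci))
X[:, 3] *= 100.0                      # synthetic outlier channel
W = rng.standard_normal((Ci, Co))

s = smoothing_factor(X, W, alpha=0.5)
X_hat, W_hat = X / s, W * s[:, None]

Y = X @ W
err_naive = np.linalg.norm(fake_quantize(X) @ fake_quantize(W) - Y) / np.linalg.norm(Y)
err_smooth = np.linalg.norm(fake_quantize(X_hat) @ fake_quantize(W_hat) - Y) / np.linalg.norm(Y)
print(f"naive: {err_naive:.4f}  smoothed: {err_smooth:.4f}")
```

With α = 0.5, the per-channel maxima of the smoothed activations and the smoothed weights coincide, so the difficulty is shared evenly between the two tensors. On this toy data the smoothed product has a lower quantization error than the naive one.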

Within a transformer block, SmoothQuant is applied to the linear layers as well as to the BMM (batched matrix multiplication) operators used during attention computation. The diagram below shows which parts of a transformer block SmoothQuant is applied to.

**Results**

With the above configuration, SmoothQuant preserves the accuracy of existing language models across different scales when quantized to Int8. The graph below shows its performance on OPT models, using the SmoothQuant-O3 configuration: per-tensor quantization for weights and per-tensor static quantization for activations. Note that LLM.int8() uses mixed precision and suffers from inefficiency, unlike SmoothQuant.

**Conclusion**

The key takeaways for this paper are as follows:

- **Conventional quantization methods on LLMs were unsuccessful** due to the excessive presence of outliers in LLM activations, leading to low effective bits. While weights can successfully be quantized to reduce storage load, inference speed remains sub-optimal as activations are left unquantized.
- **SmoothQuant splits the quantization difficulty** of transformers between their activations and weights. This difficulty migration is controlled by a hyperparameter that must lie in a sweet-spot region to maintain model performance after quantization.
- The method alleviates the quantization difficulty of activations and **achieves efficient post-training quantization** of LLMs up to 530B parameters, significantly decreasing computational cost and accelerating inference while maintaining original model performance.

One of the major complaints about LLM-based TTS models is their speed, and this quantization method offers a direct way to address it. Not only does it help make TTS faster, it also reduces memory load, lowering overall costs.

Full Paper: https://arxiv.org/abs/2211.10438