AI Adequacy - Part I: Efficient Deployment of LLMs with Quantization


The advent of large language models (LLMs) like GPT-3 and GPT-4 has been an incredible breakthrough for the AI industry. Their emergent abilities to understand and generate natural language go far beyond what we explicitly programmed them to do. The downside is that these models require a massive number of parameters: GPT-3 has 175 billion, and GPT-4 is reported (though not officially confirmed) to have around 1.7 trillion!

A neural network comprises layers containing simple computing units called neurons. Each neuron performs a computation on its input and passes the result to the next layer. To do this, neurons have numeric values called weights and associated biases. You can think of weights as the strength of connections between neurons. A higher weight means a stronger connection. Biases are like an offset that is added to the computation. Together, they are the parameters of the neural network.
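
As a minimal sketch of the computation described above, a single neuron can be written as a weighted sum of its inputs plus a bias, followed by an activation function. The function name `neuron`, the choice of ReLU as the activation, and the sample numbers are illustrative assumptions, not taken from any particular model.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    pre_activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, pre_activation)  # ReLU: negative results become 0


# Three weights and one bias: this neuron has four parameters in total.
output = neuron(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 2.0], bias=0.5)
print(output)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```

An LLM repeats this pattern billions of times, which is where the parameter counts come from.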

The gigantic parameter counts of large language models require huge computational and memory resources. For example, only a handful of Big Tech companies have the computational resources to deploy a model the size of GPT-4. So, while LLMs have clearly opened up a new era in language AI, their gigantic size puts most companies at the mercy of a few Big Tech firms that are centralizing knowledge on an unprecedented scale.

For most organizations, simply scaling up computational power is not feasible; hence, the only way to free yourself from these companies' mercy is to reduce the resource needs of large language models. Quantization is one way to achieve this.

Quantization refers to techniques for converting the parameters from high-precision floating-point numbers (typically 32-bit or 16-bit) to low-precision formats, often 8-bit integers or even lower. The conversion greatly reduces the memory requirements of the model and, on hardware with low-precision support, its computation cost. Especially effective in downsizing is mixed-precision quantization, where each layer gets quantized to its own bit width based on how sensitive the model's accuracy is to that layer.
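
A rough back-of-the-envelope calculation shows where the savings come from. The sketch below counts weight storage only (it ignores activations and any runtime overhead) and assumes a hypothetical 70-billion-parameter model; the helper name `weight_memory_gb` is illustrative.

```python
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 70e9  # assumed example: a 70-billion-parameter model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):6.0f} GB")
```

Under these assumptions the weights shrink from 280 GB at 32-bit to 70 GB at 8-bit and 35 GB at 4-bit, which is the difference between a multi-GPU server and a single high-memory accelerator.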

While quantization addresses the resource issue, quantizing an already trained model usually decreases accuracy, and the loss tends to grow as the bit width shrinks. However, training huge models from scratch is too expensive to do repeatedly.
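
The source of that accuracy loss is rounding error, which a small experiment can make visible. The following is a minimal sketch of symmetric per-tensor 8-bit quantization using NumPy; the function names, the random stand-in weights, and the per-tensor scheme are assumptions chosen for illustration, and production libraries typically use finer-grained (per-channel or per-block) scales.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0  # assumes weights are not all zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map the int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=1000).astype(np.float32)  # stand-in weights

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

max_error = np.max(np.abs(weights - restored))
print(f"storage: {weights.nbytes} bytes -> {q.nbytes} bytes")
print(f"largest rounding error: {max_error:.6f} (step size {scale:.6f})")
```

Each restored weight can differ from the original by up to half a quantization step. One such error is harmless, but billions of them accumulate across layers, and at 4 bits the step size is far coarser than at 8 bits.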


Carefully Designed Finetuning

A reasonable solution is to use techniques like LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) to finetune a pretrained model like Llama 2 70B cheaply and, with QLoRA, to do so on top of a quantized model with little quality loss.

LoRA is a method to finetune a pretrained model like GPT-3 at a fraction of the usual cost. On its own it does not quantize anything; it makes finetuning cheap enough to be practical, which is what QLoRA builds on.

It works by freezing the pretrained weights and adding a pair of small trainable matrices next to selected weight matrices, typically in the attention layers. Specifically, the update to a weight matrix W is expressed as the product of two low-rank matrices, B and A, so the adapted layer computes with W + BA while only A and B are trained. Because the rank r is much smaller than the dimensions of W, the number of trainable parameters drops by orders of magnitude, and so does the memory needed for gradients and optimizer state. B is initialized to zero, so finetuning starts from exactly the pretrained model. The rank and a scaling factor are the main hyperparameters controlling the capacity and strength of this adaptation. After training, BA can be merged into W, so the adapter adds no inference latency.
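
The low-rank idea behind LoRA, a frozen weight matrix W plus a trainable update BA scaled by alpha / r, can be sketched in a few lines of NumPy. The matrix sizes, the rank of 8, the scaling factor of 16, and the randomly generated "trained" B are illustrative assumptions; no actual training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 512, 512, 8, 16  # illustrative sizes

# Pretrained weight matrix: frozen, never updated during finetuning.
W = rng.normal(0.0, 0.02, size=(d_out, d_in))

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially behaves exactly like the pretrained one.
A = rng.normal(0.0, 0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x), with only A and B trainable."""
    return W @ x + (alpha / rank) * (B @ (A @ x))


x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x, W, A, B, alpha, rank), W @ x)

# Stand-in for a trained B (random here, purely for illustration): after
# training, the update can be merged into W once, so inference costs nothing extra.
B_trained = rng.normal(0.0, 0.01, size=(d_out, rank))
merged = W + (alpha / rank) * (B_trained @ A)
assert np.allclose(merged @ x, lora_forward(x, W, A, B_trained, alpha, rank))

full_params = W.size
lora_params = A.size + B.size
print(f"full finetuning: {full_params} trainable parameters")
print(f"LoRA (rank {rank}): {lora_params} trainable parameters")
```

For this single 512 x 512 matrix, the trainable parameter count falls from 262,144 to 8,192. The gap widens as the matrices get larger, since the adapter grows linearly with the layer width while the full matrix grows quadratically.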

QLoRA combines LoRA with quantization to finetune a pretrained model that stays quantized throughout, without significant loss of accuracy. It quantizes the frozen pretrained weights to 4 bits, using a data type called 4-bit NormalFloat (NF4) that is designed for normally distributed weights, and trains LoRA adapters in 16-bit precision on top of them. During training, the 4-bit weights are dequantized on the fly for each forward and backward pass, and gradients flow through them into the adapters; the quantized weights themselves are never updated. This way the adapters learn in the presence of the quantization error and can compensate for it. QLoRA also quantizes the quantization constants themselves (double quantization) and uses paged optimizers to absorb memory spikes, which further lowers memory usage. Thus, where LoRA only shrinks the number of trainable parameters, QLoRA also shrinks the frozen base model. This results in an accurate model on a quantized base without expensive full retraining; the QLoRA authors report finetuning a 65-billion-parameter model on a single 48 GB GPU.
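
The structure of a QLoRA layer, a frozen 4-bit base that is dequantized on the fly plus a higher-precision low-rank adapter, can be sketched as follows. This is a deliberately simplified illustration: it uses uniform block-wise absmax quantization rather than the NF4 data type, stores each 4-bit code in a full int8 instead of packing two per byte, omits double quantization and paged optimizers, and performs no training. All function names and sizes are assumptions.

```python
import numpy as np


def quantize_4bit(weights, block_size=64):
    """Block-wise absmax quantization to signed 4-bit codes in [-7, 7].

    Simplification: uniform levels, not the NF4 data type used by QLoRA.
    """
    blocks = weights.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales


def dequantize_4bit(codes, scales, shape):
    """Rebuild an approximate float matrix from codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(shape)


def qlora_forward(x, codes, scales, shape, A, B, alpha, rank):
    """Frozen 4-bit base weights, dequantized on the fly, plus a LoRA update."""
    W_approx = dequantize_4bit(codes, scales, shape)
    return W_approx @ x + (alpha / rank) * (B @ (A @ x))


rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 256, 256, 8, 16  # illustrative sizes
W = rng.normal(0.0, 0.02, size=(d_out, d_in)).astype(np.float32)

codes, scales = quantize_4bit(W)  # computed once, then frozen
A = rng.normal(0.0, 0.01, size=(rank, d_in)).astype(np.float32)  # trainable
B = np.zeros((d_out, rank), dtype=np.float32)                    # trainable

x = rng.normal(size=d_in).astype(np.float32)
y = qlora_forward(x, codes, scales, W.shape, A, B, alpha, rank)

W_restored = dequantize_4bit(codes, scales, W.shape)
print("output shape:", y.shape)
print("largest base-weight error:", float(np.max(np.abs(W - W_restored))))
```

In a real finetuning run, gradients would update only A and B, letting the adapter absorb part of the error that the 4-bit base introduces.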

With optimized training data and a slightly narrowed use case, a finetuned quantized model can even produce better results on that use case than the original model while using a fraction of the resources!

In summary, combining quantization with efficient finetuning techniques like LoRA and QLoRA is a way to preserve, or for a narrowed task even improve, model accuracy while drastically reducing resource requirements. The key questions that remain are 1) how to determine the optimal training data and finetuning approach to maximize accuracy gains for a given use case, and 2) how to operationalize and manage the resulting multiple finetuned models optimized for different tasks. In an upcoming blog post, I will cover specific methodologies for data and technique selection and scalable model management architectures. Either way, quantization combined with efficient finetuning paves the way for customized, efficient models that capture the strengths of large language models without the massive resource demands.