
run open-source llms on everyday hardware: a hacking guide

imagine running a massive 30-billion-parameter language model on your gaming GPU. sounds impossible, right? thanks to breakthroughs in weight quantization, it’s now a reality. techniques like GPTQ, GGML, and NF4 enable us to compress these colossal models into just 4 bits, dramatically reducing their size while maintaining impressive performance. this means you can now run models like LLaMA-30B on consumer hardware, such as an RTX 3090, without breaking the bank.

in this article, we’re diving deep into the GPTQ algorithm, one of the most popular 4-bit quantization techniques. we’ll break down how it works and walk you through its implementation using the AutoGPTQ library.

let’s unlock the power of large language models for everyone. ready to get started?

optimal brain quantization: making big models small without losing smarts

imagine you have a giant neural network with millions (or even billions) of weights. running it on your laptop sounds impossible, right? that’s where quantization comes in. but not all quantization is created equal. some methods shrink models but make them dumber. others, like optimal brain quantization (OBQ), keep the model smart while making it small. let’s see how it works.

the problem: how to shrink without breaking

for every layer $\ell$ in a neural network, we have a set of weights ($W_\ell$). these weights are like the brain of the model: they decide how it thinks. quantization replaces these weights with smaller, simpler versions ($\widehat{W}_\ell$). but here’s the catch: we don’t want the model to start making dumb decisions after quantization. so, we need to make sure the outputs of the quantized weights ($\widehat{W}_\ell X_\ell$) are as close as possible to the original outputs ($W_\ell X_\ell$), where $X_\ell$ are the layer’s inputs.

in math terms, we’re solving:

$\arg\min_{\widehat{W}_\ell} \, \lVert W_\ell X_\ell - \widehat{W}_\ell X_\ell \rVert_2^2$

this is called the layer-wise compression problem, and OBQ solves it brilliantly.
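to make the objective concrete, here is a tiny, self-contained pytorch sketch: it builds a random layer, quantizes it with naive round-to-nearest (not OBQ), and measures the layer-wise output error that the methods below try to minimize. the shapes and the symmetric 4-bit grid are illustrative assumptions, not anything prescribed by the papers.

import torch

torch.manual_seed(0)
d_in, d_out, n_samples = 64, 32, 256
W = torch.randn(d_out, d_in)       # original layer weights W_l
X = torch.randn(d_in, n_samples)   # calibration inputs X_l

def round_to_nearest_4bit(w):
    # symmetric 4-bit grid with a single per-tensor scale (illustrative only)
    scale = w.abs().max() / 7
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

W_hat = round_to_nearest_4bit(W)   # a naive W_hat_l with no error compensation

# the layer-wise objective both OBQ and GPTQ try to minimize
error = torch.norm(W @ X - W_hat @ X) ** 2
print(f"||W X - W_hat X||^2 = {error.item():.2f}")

OBQ and GPTQ improve on this baseline by compensating for the error each quantized weight introduces, rather than rounding everything independently.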

how OBQ works: one step at a time

OBQ is like a skilled surgeon. instead of chopping off parts of the model randomly, it carefully quantizes one weight at a time while adjusting the rest to keep the model accurate. here’s how it does it:

  1. pick the easiest weight to quantize:
    OBQ looks at all the weights and asks, “which one can i simplify without causing too much trouble?” it uses something called the hessian matrix ($H_F$) to figure this out. the hessian tells us how sensitive the model is to changes in each weight.

  2. adjust the other weights:
    after quantizing a weight, OBQ tweaks the remaining weights to make up for the loss of precision. this ensures the model stays sharp.

the formula for this adjustment is: $\delta_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$

don’t worry if this looks complicated, it’s just OBQ’s way of saying, “let’s fix the model after each change.”
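to see the update in action, here is a toy sketch of a single OBQ step for one row of weights, assuming the hessian $H = 2XX^T$ has already been built from calibration data. the naive symmetric 4-bit quantizer and all shapes are assumptions for illustration; this is not the reference implementation, just the equations made concrete.

import torch

torch.manual_seed(0)
d = 16
w = torch.randn(d)                       # one row of the weight matrix
X = torch.randn(d, 128)                  # calibration inputs for this layer
H = 2 * X @ X.T + 1e-2 * torch.eye(d)    # hessian with a small damping term
H_inv = torch.linalg.inv(H)

scale = w.abs().max() / 7
quant = lambda v: torch.clamp(torch.round(v / scale), -8, 7) * scale

# 1. pick the weight that is cheapest to quantize:
#    argmin_q (w_q - quant(w_q))^2 / [H^-1]_qq
q = torch.argmin((w - quant(w)) ** 2 / torch.diag(H_inv)).item()

# 2. quantize it and spread the resulting error over the other weights (delta_F)
w_q_quant = quant(w[q])
err = w[q] - w_q_quant
w = w - (err / H_inv[q, q]) * H_inv[:, q]
w[q] = w_q_quant                         # the chosen weight is now fixed at its 4-bit value

# 3. remove weight q from the problem by updating the inverse hessian
H_inv = H_inv - torch.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]

in the full algorithm, this loop repeats until every weight in the row is quantized, which is exactly why the cost grows so quickly with model size.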

the trade-off

OBQ is amazing, but it’s not magic. as the model gets bigger, the computation time grows cubically. this makes it tough to use OBQ on massive models with billions of parameters. but for smaller models or specific layers, OBQ is a fantastic tool for keeping models both small and smart.

the GPTQ algorithm: scaling quantization for massive models

introduced by frantar et al. (2023), the GPTQ algorithm builds on the foundation of optimal brain quantization (OBQ) but takes it to the next level. GPTQ is designed to handle very large language models, something OBQ struggles with due to its computational complexity. let’s break down how GPTQ works and why it’s a game-changer.

step 1: arbitrary order insight

one of the key limitations of OBQ is that it quantizes weights in a specific order, starting with the ones that introduce the least error. while this works well for smaller models, it becomes inefficient for massive models with billions of parameters.

GPTQ makes a clever observation: for large models, the order in which weights are quantized doesn’t matter as much as we thought. here’s why: the weights that would introduce more error on their own tend to be quantized late in the process, when few unquantized weights are left to be disturbed, so across a huge weight matrix the extra error largely evens out.

this insight is a big deal because it simplifies the process. instead of carefully selecting which weight to quantize next, GPTQ quantizes all weights in the same order for every row of the weight matrix. this makes the algorithm much simpler and much faster: the hessian depends only on the layer inputs, not on any particular row of weights, so the expensive per-column bookkeeping only has to be done once per column instead of once per weight.

why this matters

by removing the need for a carefully chosen quantization order, GPTQ eliminates a major bottleneck in the OBQ method. this makes it possible to quantize huge models, like those with billions of parameters, on consumer hardware without losing performance.


step 2: lazy batch-updates

while the GPTQ algorithm is powerful, there’s a catch: updating a massive matrix entry by entry is slow. this approach doesn’t fully utilize the parallel processing capabilities of GPUs and can hit memory bottlenecks, especially for large models. to solve this, GPTQ introduces a clever optimization called lazy batch-updates.

the problem: slow matrix updates

in the original approach, each weight update requires modifying a small part of a huge matrix. this leads to:

  1. poor GPU utilization: each update is a tiny operation that can’t keep thousands of GPU cores busy.
  2. a memory bottleneck: the same large matrix is read and rewritten over and over, so the algorithm spends more time moving data than actually computing.

the solution: lazy batch-updates 😉

GPTQ’s lazy batch-updates solve these issues by processing multiple columns at once. here’s how it works:

  1. batch processing: instead of updating one column at a time, GPTQ processes a batch of columns (e.g., 128 columns) simultaneously. this allows the GPU to work on multiple updates in parallel, maximizing its compute power.
  2. local updates: during batch processing, GPTQ only updates the columns in the current batch and their corresponding block of the matrix. this reduces the number of memory operations, avoiding bottlenecks.
  3. global updates: once a batch is fully processed, GPTQ performs aĀ global update on the entire matrix. this ensures that all changes are reflected accurately across the model.

the math behind it

the lazy batch-update mechanism relies on two key formulas:

  1. weight adjustment: $\delta_F = -(w_Q - \text{quant}(w_Q)) \, ([H_F^{-1}]_{QQ})^{-1} \, (H_F^{-1})_{:,Q}$

  2. hessian update: $H^{-1}_{-Q} = \left(H^{-1} - H^{-1}_{:,Q} \, ([H_F^{-1}]_{QQ})^{-1} \, H^{-1}_{Q,:}\right)_{-Q}$

these formulas ensure that the updates are precise and efficient, even when processing multiple columns at once.
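here is an illustrative pytorch sketch of that pattern: it quantizes a toy weight matrix in blocks of B = 128 columns, keeps the error corrections local to the block, and applies a single batched update to everything after the block. the shapes, the naive 4-bit quantizer, and the fact that it reuses one fixed inverse hessian (instead of re-eliminating quantized columns, which the cholesky trick in the next step takes care of) are simplifications for readability, not AutoGPTQ internals.

import torch

torch.manual_seed(0)
d_out, d_in, B = 64, 256, 128
W = torch.randn(d_out, d_in)             # weights to quantize, column by column
Q = torch.zeros_like(W)                  # quantized result
X = torch.randn(d_in, 512)
H_inv = torch.linalg.inv(2 * X @ X.T + 1e-2 * torch.eye(d_in))

scale = W.abs().max() / 7
quant = lambda v: torch.clamp(torch.round(v / scale), -8, 7) * scale

for start in range(0, d_in, B):
    end = min(start + B, d_in)
    Err = torch.zeros(d_out, end - start)

    # local updates: quantize the block's columns, only touching columns inside the block
    for j in range(start, end):
        Q[:, j] = quant(W[:, j])
        e = (W[:, j] - Q[:, j]) / H_inv[j, j]
        W[:, j + 1:end] -= torch.outer(e, H_inv[j, j + 1:end])
        Err[:, j - start] = e

    # global update: one big matrix multiply corrects all columns after the block
    W[:, end:] -= Err @ H_inv[start:end, end:]

the design choice is simple: many tiny updates become one large matrix multiplication per block, which is exactly the kind of work GPUs are fast at.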

why this matters

lazy batch-updates make GPTQ faster and more scalable: the GPU works on large, parallel blocks of columns instead of tiny serial updates, and memory traffic drops because most corrections stay local to the current block, with only a single global update per batch.


step 3: cholesky reformulation

as GPTQ scales up to handle very large models, a new challenge emerges: numerical inaccuracies. repeated operations can lead to small errors that accumulate over time, potentially destabilizing the quantization process. to tackle this, GPTQ introduces a cholesky reformulation, a numerically stable method that ensures accuracy even for massive models.

the problem: numerical instability

when working with large models, small numerical errors can snowball into bigger problems. specifically, the repeated updates to the inverse hessian matrix accumulate floating-point error, and on very large layers the matrix can even become indefinite, which breaks the weight adjustments and derails the whole quantization.

the solution: cholesky decomposition

GPTQ uses cholesky decomposition, a mathematically robust technique, to solve this problem. here’s how it works:

  1. precompute with cholesky: before starting the quantization process, GPTQ precomputes key information from the hessian inverse matrix using the cholesky method. this ensures that all subsequent calculations are stable and accurate.
  2. dampening for stability: to further prevent numerical issues, GPTQ adds a small constant to the diagonal elements of the matrix (a process called dampening, roughly 1% of the average diagonal value). this tweak keeps the computations well-behaved, even for massive models.

how GPTQ works: step by step

here’s a breakdown of the GPTQ algorithm with cholesky reformulation:

  1. cholesky decomposition: start by performing a cholesky decomposition on the hessian inverse matrix. this sets the stage for stable and efficient computations.
  2. batch processing: GPTQ processes the weight matrix in batches of columns. for each column in a batch, it quantizes the weights, computes the resulting error using the precomputed cholesky information, and applies a local update to the remaining columns of the block.
  3. global updates: after processing a batch, GPTQ updates all remaining weights based on the accumulated errors from the current block. this ensures that the quantization process remains accurate across the entire model. a minimal sketch of the cholesky precomputation follows this list.
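below is a minimal sketch of that precomputation, reusing the toy shapes from the batching example above. damp_percent mirrors the parameter we set later in BaseQuantizeConfig; everything else is an illustrative assumption rather than the library’s actual code.

import torch

torch.manual_seed(0)
d_in, damp_percent = 256, 0.01
X = torch.randn(d_in, 512)                      # calibration activations for one layer

# 1. hessian of the layer-wise objective
H = 2 * X @ X.T

# 2. dampening: add a small fraction of the mean diagonal to keep H well-conditioned
H += damp_percent * torch.diag(H).mean() * torch.eye(d_in)

# 3. cholesky precomputation: the rows of the upper cholesky factor of H^-1 carry
#    (up to scaling) the information the quantization loop needs, so the fragile
#    per-step hessian-inverse updates disappear from the main loop
L = torch.linalg.cholesky(H)                    # H = L L^T
H_inv = torch.cholesky_inverse(L)               # numerically stable inverse from the factor
H_inv_chol = torch.linalg.cholesky(H_inv, upper=True)

# H_inv_chol now stands in for H_inv in the batched column loop sketched earlier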

real-world performance

GPTQ was tested on large language models like BLOOM (176B parameters) and OPT (175B parameters). here’s how it performed: it quantized these models down to 3 or 4 bits in roughly four GPU-hours on a single NVIDIA A100, with only a small increase in perplexity compared to the full-precision baselines, a scale that would be completely out of reach for the original OBQ approach.

why this matters

the cholesky reformulation makes GPTQ numerically stable and scalable: the fragile, repeated hessian-inverse updates are replaced by a single robust precomputation, so the algorithm stays accurate even on models with hundreds of billions of parameters.

quantizing a model with AutoGPTQ

enough theory, let’s quantize a model ourselves. we’ll use the AutoGPTQ library with a small GPT-2 model so that everything runs on a free T4 GPU. first, install the required packages:

!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers datasets

then, load the libraries and define the model you want to quantize:

import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer


# Define base model and output directory
model_name = "gpt2"
quantized_model_dir = model_name + "-GPTQ"

loading the model and tokenizer

next, we load the model and tokenizer. the tokenizer is loaded using the classic AutoTokenizer class from the transformers library. for the model, we use the AutoGPTQForCausalLM class, which takes a dedicated configuration object, BaseQuantizeConfig, to set up the quantization process.

in this configuration, we specify bits=4, which reduces the model to 4-bit precision, making it much smaller and faster while maintaining performance. we also define group_size, which makes groups of weights (e.g., 128 or 1024 weights) share their own quantization parameters instead of using one set per row. while optional, grouping improves quantization quality at minimal computational cost; we’ll use group_size=128 below. additionally, we set the damp_percent parameter, which controls the dampening that stabilizes the cholesky reformulation; this should generally be left unchanged.

finally, there’s the desc_act (act order) parameter, which quantizes weights in order of decreasing activation. this means the most impactful weights (determined from the sampled inputs and outputs) are quantized first, placing most of the quantization error on less significant weights and improving overall accuracy. however, when used with group_size, it can cause performance slowdowns due to frequent reloading of quantization parameters. for now, we’ll disable it, though future updates may address the issue.

here’s how to load the quantize config, model, and tokenizer:

# Load quantize config, model and tokenizer
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
)
new_model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
new_tokenizer = AutoTokenizer.from_pretrained(model_name)

preparing samples for quantization

the quantization process relies heavily on samples to evaluate and enhance the quality of the quantized model. these samples allow us to compare the outputs of the original model with those of the quantized model. the more samples we use, the better the comparison, leading to improved quantization quality.

for this article, we’ll use the C4 dataset (colossal clean crawled corpus), a large-scale, multilingual collection of web text from the common crawl project. the C4 dataset has been cleaned and prepared specifically for training large-scale language models, making it an excellent resource for tasks like quantization. another popular option is the WikiText dataset, but we’ll stick with C4 for this example.

loading and tokenizing samples

here’s how we load 1024 samples from the C4 dataset, tokenize them, and format them for quantization:

# Load data and tokenize examples
n_samples = 1024
data = load_dataset("allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split=f"train[:{n_samples*5}]")
tokenized_data = new_tokenizer("\n\n".join(data['text']), return_tensors='pt')

# Format tokenized examples
examples_ids = []
for _ in range(n_samples):
    i = random.randint(0, tokenized_data.input_ids.shape[1] - new_tokenizer.model_max_length - 1)
    j = i + new_tokenizer.model_max_length
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})

quantizing the model

now that the dataset is ready, we can start the quantization process. we’ll use a batch size of 1 and optionally enable OpenAI Triton, a CUDA alternative, to optimize GPU communication. once the quantization is complete, we’ll save the model and tokenizer in the safetensors format, which is efficient and secure.

here’s how to quantize the model and save the results:

%%time

# Quantize with GPTQ
new_model.quantize(
    examples_ids,
    batch_size=1,
    use_triton=True,
)

# Save model and tokenizer
new_model.save_quantized(quantized_model_dir, use_safetensors=True)
new_tokenizer.save_pretrained(quantized_model_dir)

loading the quantized model

once the model is quantized and saved, you can load it back using the AutoGPTQForCausalLM and AutoTokenizer classes. here’s how:

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Reload model and tokenizer
new_model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device=device,
    use_triton=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

testing the quantized model

the quantized model works just like a normal transformers model, making it compatible with inference pipelines. let’s test it with a simple text generation task:

from transformers import pipeline

generator = pipeline('text-generation', model=new_model, tokenizer=tokenizer)
result = generator("i enjoy training neural networks. it feels like I’m teaching a small baby.", do_sample=True, max_length=60)[0]['generated_text']
print(result)

results and next steps

the quantized GPT-2 model produces high-quality completions, showing that the quantization process successfully preserves the model’s performance. while the results are promising, a more thorough evaluation, such as measuring the perplexity of the quantized model compared to the original, would provide deeper insights into the impact of quantization. however, that’s a topic for another time. for now, we’ve achieved our goal: a compact, efficient model that delivers great results.

conclusion

in this article, we explored the GPTQ algorithm, a state-of-the-art quantization technique that makes it possible to run large language models (LLMs) on consumer-grade hardware. we walked through how GPTQ solves the layer-wise compression problem using advanced techniques like the arbitrary-order insight, lazy batch-updates, and the cholesky reformulation.

these innovations significantly reduce memory and computation requirements, making powerful LLMs accessible to a broader audience.

we also demonstrated how to quantize a GPT-2 model using a free T4 GPU and generate text with the quantized version. if you’re inspired to try this yourself, you can push your own 4-bit quantized models to the hugging face hub and share them with the community.
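as a hedged sketch (not something we ran above), sharing the saved folder could look roughly like this with the huggingface_hub client; the repo id is a placeholder and you need to be authenticated (for example via huggingface-cli login) for the upload to succeed.

from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/gpt2-GPTQ"   # placeholder: replace with your own namespace
api.create_repo(repo_id, exist_ok=True)

# upload everything written by save_quantized() and save_pretrained()
api.upload_folder(folder_path=quantized_model_dir, repo_id=repo_id)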

while GPTQ is a powerful tool, it’s not the only option for 4-bit quantization. alternatives like GGML and NF4 offer slightly different approaches and are worth exploring. i encourage you to dive deeper into these methods and experiment with them to see which works best for your needs.

references


  1. B. Hassibi, D. G. Stork, and G. J. Wolff, ā€œOptimal Brain Surgeon and general network pruning,ā€ IEEE International Conference on Neural Networks, San Francisco, CA, USA, 1993, pp. 293-299 vol.1, doi: 10.1109/ICNN.1993.298572.

  2. Elias Frantar, Sidak Pal Singh, & Dan Alistarh. (2023). Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. arXiv preprint.

  3. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, & Dan Alistarh. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint.

  4. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, & Peter J. Liu. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140), 1-67. arXiv preprint.