run open-source llms on everyday hardware: a hacking guide
imagine running a massive 30-billion-parameter language model on your gaming GPU. sounds impossible, right? thanks to breakthroughs in weight quantization, it's now a reality. techniques like GPTQ, GGML, and NF4 let us compress these colossal models down to 4-bit precision, dramatically reducing their size while maintaining impressive performance. this means you can now run models like LLaMA-30B on consumer hardware, such as an RTX 3090, without breaking the bank.
in this article, we're diving deep into the GPTQ algorithm, one of the most popular 4-bit quantization techniques. we'll break down how it works and walk you through its implementation using the AutoGPTQ library.
let's unlock the power of large language models for everyone. ready to get started?
optimal brain quantization: making big models small without losing smarts
imagine you have a giant neural network with millions (or even billions) of weights. running it on your laptop sounds impossible, right? that's where quantization comes in. but not all quantization is created equal. some methods shrink models but make them dumber. others, like optimal brain quantization (OBQ), keep the model smart while making it small. let's see how it works.
the problem: how to shrink without breaking
for every layer in a neural network, we have a set of weights ($W_\ell$). these weights are like the brain of the model, they decide how it thinks. quantization replaces these weights with smaller, simpler versions ($\widehat{W}_\ell$). but here's the catch: we don't want the model to start making dumb decisions after quantization. so, we need to make sure the outputs of the quantized weights ($\widehat{W}_\ell X_\ell$) are as close as possible to the original outputs ($W_\ell X_\ell$).
in math terms, we're solving $\arg\min_{\widehat{W}_\ell} \lVert W_\ell X_\ell - \widehat{W}_\ell X_\ell \rVert_2^2$.
this is called the layer-wise compression problem, and OBQ solves it brilliantly.
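to make this objective concrete, here's a tiny, purely illustrative pytorch snippet (the naive_quantize helper and the uniform 4-bit grid are assumptions for demonstration, not what OBQ or GPTQ actually do) that measures the reconstruction error a crude quantizer introduces on one layer:

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)      # original layer weights W_l
X = torch.randn(64, 256)     # calibration inputs X_l seen by this layer

def naive_quantize(w, n_bits=4):
    # round-to-nearest onto a uniform grid, just to get some W_hat to compare against
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

W_hat = naive_quantize(W)
# the layer-wise compression objective: how far the quantized layer's outputs drift
error = torch.norm(W @ X - W_hat @ X) ** 2
print(f"reconstruction error: {error.item():.2f}")
```

OBQ and GPTQ exist precisely to pick $\widehat{W}_\ell$ so that this error stays as small as possible.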
how OBQ works: one step at a time
OBQ is like a skilled surgeon. instead of chopping off parts of the model randomly, it carefully quantizes one weight at a time while adjusting the rest to keep the model accurate. here's how it does it:
1. pick the easiest weight to quantize: OBQ looks at all the weights and asks, "which one can i simplify without causing too much trouble?" it uses the hessian matrix ($H_F$) to figure this out. the hessian tells us how sensitive the model is to changes in each weight.
2. adjust the other weights: after quantizing a weight, OBQ tweaks the remaining weights to make up for the loss of precision. this ensures the model stays sharp. the formula for this adjustment is $\delta_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$. don't worry if this looks complicated, it's just OBQ's way of saying, "let's fix the model after each change."
3. handle outliers: some weights are like troublemakers, they're way bigger or smaller than the others and can mess things up. OBQ deals with these outliers by quantizing them as soon as they're spotted, so they don't cause problems later.
4. speed things up: quantizing a big model can take forever, but OBQ has a trick up its sleeve. instead of recalculating everything from scratch after each quantization, it updates the hessian matrix efficiently using gaussian elimination: $H^{-1}_{-q} = \left(H^{-1} - \frac{1}{[H^{-1}]_{qq}} H^{-1}_{:,q} H^{-1}_{q,:}\right)_{-q}$. it also processes multiple weights at once using vectorization, making the whole process faster. a toy sketch of a single step follows below.
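to make the procedure concrete, here's a minimal, illustrative pytorch sketch of a single OBQ-style step on one row of weights (the helper names, the fixed scale, and the uniform 4-bit grid are simplifying assumptions, not the actual OBQ implementation):

```python
import torch

def quantize_scalar(w, scale, n_bits=4):
    # nearest point on a uniform signed grid, purely illustrative
    qmax = 2 ** (n_bits - 1) - 1
    return scale * torch.clamp(torch.round(w / scale), -qmax - 1, qmax)

def obq_step(w, H_inv, q, scale=0.1):
    """quantize weight q of one row w (shape (d,)) and compensate the remaining weights.
    H_inv is the inverse Hessian of the layer-wise reconstruction loss."""
    w = w.clone()
    w_hat = quantize_scalar(w[q], scale)
    # adjustment of the other weights: delta = -(w_q - quant(w_q)) / [H^-1]_qq * (H^-1)_{:,q}
    delta = -(w[q] - w_hat) / H_inv[q, q] * H_inv[:, q]
    w = w + delta
    w[q] = w_hat  # freeze the quantized weight
    # gaussian-elimination update of the inverse Hessian (weight q is now fixed)
    H_inv = H_inv - torch.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
    return w, H_inv

# toy usage: an 8-weight row with a Hessian built from random calibration inputs
X = torch.randn(8, 128)
H_inv = torch.linalg.inv(2 * X @ X.T + 0.01 * torch.eye(8))
w, H_inv = obq_step(torch.randn(8), H_inv, q=0)
```

in the full algorithm this step is repeated, always picking the weight that currently causes the least error, until every weight in the row is quantized.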
the trade-off
OBQ is amazing, but it's not magic. as the model gets bigger, the computation time grows cubically with the size of the weight matrix. this makes it tough to use OBQ on massive models with billions of parameters. but for smaller models or specific layers, OBQ is a fantastic tool for keeping models both small and smart.
the GPTQ algorithm: scaling quantization for massive models
introduced by frantar et al. (2023), the GPTQ algorithm builds on the foundation of optimal brain quantization (OBQ) but takes it to the next level. GPTQ is designed to handle very large language models, something OBQ struggles with due to its computational complexity. let's break down how GPTQ works and why it's a game-changer.
step 1: arbitrary order insight
one of the key limitations of OBQ is that it quantizes weights in a specific order, starting with the ones that introduce the least error. while this works well for smaller models, it becomes inefficient for massive models with billions of parameters.
GPTQ makes a clever observation: for large models, the order in which weights are quantized doesn't matter as much as we thought. here's why:
- even if some weights introduce more error when quantized early, they're compensated for later in the process when fewer weights are left to adjust.
- this means we can quantize weights in any fixed order without sacrificing performance.
this insight is a big deal because it simplifies the process. instead of carefully selecting which weight to quantize next, GPTQ quantizes all weights in the same order for every row of the weight matrix. this makes the algorithm:
- faster: certain computations only need to be done once per column, not once per weight.
- scalable: it can handle massive models without blowing up computation time.
why this matters
by removing the need for a carefully chosen quantization order, GPTQ eliminates a major bottleneck in the OBQ method. this makes it possible to quantize huge models, like those with billions of parameters, on consumer hardware without losing performance.
key takeaways
- GPTQ builds on OBQ but is optimized for large-scale models.
- it quantizes weights in a fixed order, making the process faster and more scalable.
- this approach maintains model accuracy while significantly reducing computational overhead.
step 2: lazy batch-updates
while the GPTQ algorithm is powerful, there's a catch: updating a massive matrix entry by entry is slow. this approach doesn't fully utilize the parallel processing capabilities of GPUs and can hit memory bottlenecks, especially for large models. to solve this, GPTQ introduces a clever optimization called lazy batch-updates.
the problem: slow matrix updates
in the original approach, each weight update requires modifying a small part of a huge matrix. this leads to:
- inefficient GPU usage: GPUs are designed for parallel processing, but updating one entry at a time doesn't take advantage of this.
- memory bottlenecks: constantly reading and writing to memory slows down the process, especially when dealing with billions of parameters.
the solution: lazy batch-updates
GPTQ's lazy batch-updates solve these issues by processing multiple columns at once. here's how it works:
- batch processing: instead of updating one column at a time, GPTQ processes a batch of columns (e.g., 128 columns) simultaneously. this allows the GPU to work on multiple updates in parallel, maximizing its compute power.
- local updates: during batch processing, GPTQ only updates the columns in the current batch and their corresponding block of the matrix. this reduces the number of memory operations, avoiding bottlenecks.
- global updates: once a batch is fully processed, GPTQ performs a global update on the entire matrix. this ensures that all changes are reflected accurately across the model.
the math behind it
the lazy batch-update mechanism relies on two key formulas:
weight adjustment: $\delta_F = -(w_Q - \text{quant}(w_Q))\left([H_F^{-1}]_{QQ}\right)^{-1}(H_F^{-1})_{:,Q}$
hessian update: $H^{-1}_{-Q} = \left(H^{-1} - H^{-1}_{:,Q}\left([H^{-1}]_{QQ}\right)^{-1}H^{-1}_{Q,:}\right)_{-Q}$
these formulas ensure that the updates are precise and efficient, even when processing multiple columns at once.
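to see the idea in code, here's a simplified, illustrative pytorch sketch of the batched column loop (the block size, the quantize_col helper, and the fixed scale are assumptions for demonstration; the real implementation lives in the GPTQ/AutoGPTQ code and relies on the cholesky trick described in the next step):

```python
import torch

def quantize_col(col, scale=0.05, n_bits=4):
    # round one weight column to a uniform 4-bit grid (illustrative, not GPTQ's quantizer)
    qmax = 2 ** (n_bits - 1) - 1
    return scale * torch.clamp(torch.round(col / scale), -qmax - 1, qmax)

def gptq_lazy_batch(W, H_inv, block=128):
    """simplified lazy batch-update loop: quantize columns block by block,
    apply compensation locally inside the block, and defer the update of all
    later columns until the block is finished."""
    W, H_inv = W.clone(), H_inv.clone()
    Q = torch.zeros_like(W)
    d = W.shape[1]
    for start in range(0, d, block):
        end = min(start + block, d)
        Err = torch.zeros(W.shape[0], end - start)   # scaled errors of this block
        H_rows = torch.zeros(end - start, d)         # rows of H^-1 at quantization time
        for j in range(start, end):
            Q[:, j] = quantize_col(W[:, j])
            err = (W[:, j] - Q[:, j]) / H_inv[j, j]
            Err[:, j - start] = err
            H_rows[j - start] = H_inv[j]
            # local update: only the columns of the current block are touched
            W[:, j:end] -= err.unsqueeze(1) * H_inv[j, j:end].unsqueeze(0)
            # gaussian-elimination update of the inverse Hessian for column j
            H_inv -= torch.outer(H_inv[:, j], H_inv[j, :]) / H_inv[j, j]
        # global update: propagate the whole block's error to the remaining columns at once
        W[:, end:] -= Err @ H_rows[:, end:]
    return Q

# toy usage with a damped Hessian built from random calibration inputs
X = torch.randn(256, 512)
H_inv = torch.linalg.inv(2 * X @ X.T + 0.01 * torch.eye(256))
Q = gptq_lazy_batch(torch.randn(16, 256), H_inv, block=64)
```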
why this matters
lazy batch-updates make GPTQ faster and more scalable:
- faster processing: by batching updates, GPTQ fully utilizes GPU parallelism, speeding up the quantization process.
- memory efficiency: reducing the number of memory operations avoids bottlenecks, making it feasible to quantize massive models.
- scalability: this approach allows GPTQ to handle models with billions of parameters without slowing down.
key takeaways
- GPTQ's lazy batch-updates solve the inefficiencies of updating a matrix entry by entry.
- it processes multiple columns at once, maximizing GPU parallelism and minimizing memory bottlenecks.
- the algorithm performs local updates during batch processing and global updates afterward to ensure accuracy.
- this optimization makes GPTQ faster, more memory-efficient, and scalable for large models.
step 3: cholesky reformulation
as GPTQ scales up to handle very large models, a new challenge emerges: numerical inaccuracies. repeated operations can lead to small errors that accumulate over time, potentially destabilizing the quantization process. to tackle this, GPTQ introduces a cholesky reformulation, a numerically stable method that ensures accuracy even for massive models.
the problem: numerical instability
when working with large models, small numerical errors can snowball into bigger problems. specifically:
- error accumulation: repeated operations (like matrix updates) can cause tiny errors to build up, leading to inaccurate results.
- unstable computations: without a stable method, the algorithm might fail to converge or produce unreliable quantized models.
the solution: cholesky decomposition
GPTQ uses cholesky decomposition, a mathematically robust technique, to solve this problem. here's how it works:
- precompute with cholesky: before starting the quantization process, GPTQ precomputes key information from the hessian inverse matrix using the cholesky method. this ensures that all subsequent calculations are stable and accurate.
- dampening for stability: to further prevent numerical issues, GPTQ adds a small constant (a process called dampening) to the diagonal elements of the matrix. this tweak keeps the computations well-behaved, even for massive models. a minimal sketch of this precomputation follows below.
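as a rough illustration of the precomputation (the function name and the damp_percent-style dampening are assumptions in the spirit of the paper, not the exact GPTQ code):

```python
import torch

def damped_cholesky_inverse_hessian(X, damp_percent=0.01):
    """precompute a cholesky factor of the inverse Hessian for one layer.
    X holds the calibration inputs seen by that layer, shape (d, n_samples)."""
    H = 2.0 * X @ X.T                                # hessian of the layer-wise loss
    damp = damp_percent * torch.mean(torch.diag(H))  # small dampening constant
    H += damp * torch.eye(H.shape[0])                # added to the diagonal for stability
    H_inv = torch.linalg.inv(H)
    # upper-triangular cholesky factor: its rows hold the precomputed information
    # reused, without further matrix updates, while quantizing column after column
    return torch.linalg.cholesky(H_inv, upper=True)
```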
how GPTQ works: step by step
here's a breakdown of the GPTQ algorithm with cholesky reformulation:
- cholesky decomposition: start by performing a cholesky decomposition on the hessian inverse matrix. this sets the stage for stable and efficient computations.
- batch processing: GPTQ processes the model in batches of columns. for each column in a batch:
- quantize the weights.
- calculate the error introduced by quantization.
- update the weights in the current block to minimize the error.
- global updates: after processing a batch, GPTQ updates all remaining weights based on the errors from the current block. this ensures that the quantization process remains accurate across the entire model.
real-world performance
GPTQ was tested on large language models like BLOOM (176B parameters) and OPT (175B parameters). here's how it performed:
- hardware: quantization was done using a single NVIDIA A100 GPU.
- comparison: GPTQ outperformed simpler methods like round-to-nearest (RTN), which rounds all weights to the nearest quantized value without considering their impact on the model's performance (see the sketch after this list).
- results: GPTQ maintained high accuracy while significantly reducing the model size, making it a powerful tool for deploying large models on consumer hardware.
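for contrast, round-to-nearest is about as simple as quantization gets; here's a minimal per-tensor 4-bit RTN sketch (the scaling scheme is an illustrative assumption):

```python
import torch

def rtn_quantize(W, n_bits=4):
    """round-to-nearest: pick one scale for the tensor and round every weight to
    the closest grid point, ignoring how each weight affects the model's output."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().max() / qmax
    return torch.clamp(torch.round(W / scale), -qmax - 1, qmax) * scale

W = torch.randn(4, 8)
print((W - rtn_quantize(W)).abs().max())  # worst-case rounding error on this tensor
```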
why this matters
the cholesky reformulation makes GPTQ numerically stable and scalable:
- stability: by precomputing with cholesky and adding dampening, GPTQ avoids numerical errors that could derail the quantization process.
- scalability: this approach allows GPTQ to handle models with billions of parameters without compromising accuracy.
- efficiency: the batch processing and global updates ensure that the algorithm runs efficiently, even on a single GPU.
key takeaways
- GPTQ usesĀ cholesky decompositionĀ to ensure numerical stability for large models.
- dampeningĀ (adding a small constant to diagonal elements) further prevents numerical issues.
- the algorithm processes weights in batches, quantizing them and updating errors block by block.
- GPTQ outperforms simpler methods like RTN, making it a top choice for quantizing massive models.
quantizing an LLM with AutoGPTQ
now let's put GPTQ into practice and quantize a model with the AutoGPTQ library. first, install the required packages:
!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers
then, load the libraries and define the model you want to quantize:
import random
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer
# Define base model and output directory
model_name = "gpt2"
quantized_model_dir = model_name + "-GPTQ"
loading the model and tokenizer
next, we load the model and tokenizer. the tokenizer is loaded with the classic AutoTokenizer class from the transformers library. for the model, we use the AutoGPTQForCausalLM class, which requires a specific configuration, BaseQuantizeConfig, to set up the quantization process.
in this configuration, we specify bits=4, which reduces the model to 4-bit precision, making it smaller and faster while maintaining performance. we also define group_size, which determines the size of the lazy batch (e.g., 128 or 1024). while optional, using a group size improves quantization quality at minimal computational cost; for example, group_size=1024 is a common choice. additionally, we set the damp_percent parameter to help stabilize the cholesky reformulation; this should generally be left unchanged. finally, there's the desc_act (act order) parameter, which processes rows based on decreasing activation. this means the most impactful rows (determined by sampled inputs and outputs) are quantized first, placing most of the quantization error on less significant weights and improving overall accuracy. however, when combined with group_size, it can cause performance slowdowns due to frequent reloading of quantization parameters, so we'll disable it for now, though future updates may address the issue.
here's how to load the quantize config, model, and tokenizer:
# Load quantize config, model and tokenizer
quantize_config = BaseQuantizeConfig(
bits=4,
group_size=128,
damp_percent=0.01,
desc_act=False,
)
new_model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
new_tokenizer = AutoTokenizer.from_pretrained(model_name)
preparing samples for quantization
the quantization process relies heavily on samples to evaluate and enhance the quality of the quantized model. these samples allow us to compare the outputs of the original model with those of the quantized model. the more samples we use, the better the comparison, leading to improved quantization quality.
for this article, we'll use the C4 dataset (colossal clean crawled corpus), a large-scale, multilingual collection of web text from the common crawl project. the C4 dataset has been cleaned and prepared specifically for training large-scale language models, making it an excellent resource for tasks like quantization. another popular option is the WikiText dataset, but we'll stick with C4 for this example.
loading and tokenizing samples
here's how we load 1024 samples from the C4 dataset, tokenize them, and format them for quantization:
# Load data and tokenize examples
n_samples = 1024
data = load_dataset("allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split=f"train[:{n_samples*5}]")
tokenized_data = new_tokenizer("\n\n".join(data['text']), return_tensors='pt')
# Format tokenized examples
examples_ids = []
for _ in range(n_samples):
    # pick a random window of model_max_length tokens from the tokenized corpus
    i = random.randint(0, tokenized_data.input_ids.shape[1] - new_tokenizer.model_max_length - 1)
    j = i + new_tokenizer.model_max_length
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})
quantizing the model
now that the dataset is ready, we can start the quantization process. we'll use a batch size of 1 and optionally enable OpenAI Triton, a CUDA alternative, to optimize GPU communication. once the quantization is complete, we'll save the model and tokenizer in the safetensors format, which is efficient and secure.
here's how to quantize the model and save the results:
%%time
# Quantize with GPTQ
new_model.quantize(
examples_ids,
batch_size=1,
use_triton=True,
)
# Save model and tokenizer
new_model.save_quantized(quantized_model_dir, use_safetensors=True)
new_tokenizer.save_pretrained(quantized_model_dir)
loading the quantized model
once the model is quantized and saved, you can load it back using the AutoGPTQForCausalLM and AutoTokenizer classes. here's how:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Reload model and tokenizer
new_model = AutoGPTQForCausalLM.from_quantized(
quantized_model_dir,
device=device,
use_triton=True,
use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
testing the quantized model
the quantized model works just like a normal transformers model, making it compatible with inference pipelines. let's test it with a simple text generation task:
from transformers import pipeline
generator = pipeline('text-generation', model=new_model, tokenizer=tokenizer)
result = generator("i enjoy training neural networks. it feels like I'm teaching a small baby.", do_sample=True, max_length=60)[0]['generated_text']
print(result)
results and next steps
the quantized GPT-2 model produces high-quality completions, showing that the quantization process successfully preserves the model's performance. while the results are promising, a more thorough evaluation, such as measuring the perplexity of the quantized model compared to the original, would provide deeper insights into the impact of quantization. however, that's a topic for another time. for now, we've achieved our goal: a compact, efficient model that delivers great results.
conclusion
in this article, we explored the GPTQ algorithm, a state-of-the-art quantization technique that makes it possible to run large language models (LLMs) on consumer-grade hardware. we walked through how GPTQ solves the layer-wise compression problem using advanced techniques like:
- arbitrary order insight: simplifies the quantization process for large models.
- lazy batch updates: optimizes GPU utilization and reduces memory bottlenecks.
- cholesky reformulation: ensures numerical stability for accurate quantization.
these innovations significantly reduce memory and computation requirements, making powerful LLMs accessible to a broader audience.
we also demonstrated how to quantize a GPT-2 model using a free T4 GPU and generate text with the quantized version. if you're inspired to try this yourself, you can push your own 4-bit quantized models to the hugging face hub and share them with the community.
while GPTQ is a powerful tool, it's not the only option for 4-bit quantization. alternatives like GGML and NF4 offer slightly different approaches and are worth exploring. i encourage you to dive deeper into these methods and experiment with them to see which works best for your needs.
references
B. Hassibi, D. G. Stork, and G. J. Wolff, āOptimal Brain Surgeon and general network pruning,ā IEEE International Conference on Neural Networks, San Francisco, CA, USA, 1993, pp. 293-299 vol.1, doi: 10.1109/ICNN.1993.298572.
Elias Frantar, Sidak Pal Singh, & Dan Alistarh. (2023). Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. arXiv preprint.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, & Dan Alistarh. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, & Peter J. Liu. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140), 1-67. arXiv preprint.