
run open-source llms on everyday hardware: a hacking guide

imagine running a massive 30-billion-parameter language model on your gaming GPU. sounds impossible, right? thanks to breakthroughs in weight quantization, it’s now a reality. techniques like GPTQ, GGML, and NF4 enable us to compress these colossal models into just 4 bits, dramatically reducing their size while maintaining impressive performance. this means you can now run models like LLaMA-30B on consumer hardware, such as an RTX 3090, without breaking the bank.

in this article, we’re diving deep into the GPTQ algorithm, one of the most popular 4-bit quantization techniques. we’ll break down how it works and walk you through its implementation using the AutoGPTQ library.

let’s unlock the power of large language models for everyone. ready to get started?

optimal brain quantization: making big models small without losing smarts

imagine you have a giant neural network with millions (or even billions) of weights. running it on your laptop sounds impossible, right? that’s where quantization comes in. but not all quantization is created equal. some methods shrink models but make them dumber. others, like optimal brain quantization (OBQ), keep the model smart while making it small. let’s see how it works.

the problem: how to shrink without breaking

for every layer $\ell$ in a neural network, we have a set of weights ($W_\ell$). these weights are like the brain of the model: they decide how it thinks. quantization replaces these weights with smaller, simpler versions ($\widehat{W}_\ell$). but here’s the catch: we don’t want the model to start making dumb decisions after quantization. so, we need to make sure the outputs of the quantized weights ($\widehat{W}_\ell X_\ell$) are as close as possible to the original outputs ($W_\ell X_\ell$), where $X_\ell$ are the layer’s inputs.

in math terms, we’re solving:

$\arg\min_{\widehat{W}_\ell} \, \lVert W_\ell X_\ell - \widehat{W}_\ell X_\ell \rVert_2^2$

this is called the layer-wise compression problem, and OBQ solves it brilliantly.
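to make the objective concrete, here is a tiny, self-contained pytorch sketch: it builds a random layer, quantizes it with naive round-to-nearest (not OBQ), and measures the layer-wise output error that the methods below try to minimize. the shapes and the symmetric 4-bit grid are illustrative assumptions, not anything prescribed by the papers.

import torch

torch.manual_seed(0)
d_in, d_out, n_samples = 64, 32, 256
W = torch.randn(d_out, d_in)       # original layer weights W_l
X = torch.randn(d_in, n_samples)   # calibration inputs X_l

def round_to_nearest_4bit(w):
    # symmetric 4-bit grid with a single per-tensor scale (illustrative only)
    scale = w.abs().max() / 7
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

W_hat = round_to_nearest_4bit(W)   # a naive W_hat_l with no error compensation

# the layer-wise objective both OBQ and GPTQ try to minimize
error = torch.norm(W @ X - W_hat @ X) ** 2
print(f"||W X - W_hat X||^2 = {error.item():.2f}")

OBQ and GPTQ improve on this baseline by compensating for the error each quantized weight introduces, rather than rounding everything independently.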

how OBQ works: one step at a time

OBQ is like a skilled surgeon. instead of chopping off parts of the model randomly, it carefully quantizes one weight at a time while adjusting the rest to keep the model accurate. here’s how it does it:

  1. pick the easiest weight to quantize:
    OBQ looks at all the weights and asks, “which one can i simplify without causing too much trouble?” it uses something called the hessian matrix ($H_F$) to figure this out. the hessian tells us how sensitive the model is to changes in each weight.

  2. adjust the other weights:
    after quantizing a weight, OBQ tweaks the remaining weights to make up for the loss of precision. this ensures the model stays sharp.

the formula for this adjustment is: $\delta_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$

don’t worry if this looks complicated, it’s just OBQ’s way of saying, “let’s fix the model after each change.”
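to see the update in action, here is a toy sketch of a single OBQ step for one row of weights, assuming the hessian $H = 2XX^T$ has already been built from calibration data. the naive symmetric 4-bit quantizer and all shapes are assumptions for illustration; this is not the reference implementation, just the equations made concrete.

import torch

torch.manual_seed(0)
d = 16
w = torch.randn(d)                       # one row of the weight matrix
X = torch.randn(d, 128)                  # calibration inputs for this layer
H = 2 * X @ X.T + 1e-2 * torch.eye(d)    # hessian with a small damping term
H_inv = torch.linalg.inv(H)

scale = w.abs().max() / 7
quant = lambda v: torch.clamp(torch.round(v / scale), -8, 7) * scale

# 1. pick the weight that is cheapest to quantize:
#    argmin_q (w_q - quant(w_q))^2 / [H^-1]_qq
q = torch.argmin((w - quant(w)) ** 2 / torch.diag(H_inv)).item()

# 2. quantize it and spread the resulting error over the other weights (delta_F)
w_q_quant = quant(w[q])
err = w[q] - w_q_quant
w = w - (err / H_inv[q, q]) * H_inv[:, q]
w[q] = w_q_quant                         # the chosen weight is now fixed at its 4-bit value

# 3. remove weight q from the problem by updating the inverse hessian
H_inv = H_inv - torch.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]

in the full algorithm, this loop repeats until every weight in the row is quantized, which is exactly why the cost grows so quickly with model size.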

the trade-off

OBQ is amazing, but it’s not magic. as the model gets bigger, the computation time grows cubically. this makes it tough to use OBQ on massive models with billions of parameters. but for smaller models or specific layers, OBQ is a fantastic tool for keeping models both small and smart.

the GPTQ algorithm: scaling quantization for massive models

introduced by frantar et al. (2023), the GPTQ algorithm builds on the foundation of optimal brain quantization (OBQ) but takes it to the next level. GPTQ is designed to handle very large language models, something OBQ struggles with due to its computational complexity. let’s break down how GPTQ works and why it’s a game-changer.

step 1: arbitrary order insight

one of the key limitations of OBQ is that it quantizes weights in a specific order, starting with the ones that introduce the least error. while this works well for smaller models, it becomes inefficient for massive models with billions of parameters.

GPTQ makes a clever observation: for large models, the order in which weights are quantized doesn’t matter as much as we thought. here’s why: the weights that would introduce more error on their own tend to be quantized late in the process, when few unquantized weights are left to be disturbed, so across a huge weight matrix the extra error largely evens out.

this insight is a big deal because it simplifies the process. instead of carefully selecting which weight to quantize next, GPTQ quantizes all weights in the same order for every row of the weight matrix. this makes the algorithm much simpler and much faster: the hessian depends only on the layer inputs, not on any particular row of weights, so the expensive per-column bookkeeping only has to be done once per column instead of once per weight.

why this matters

by removing the need for a carefully chosen quantization order, GPTQ eliminates a major bottleneck in the OBQ method. this makes it possible to quantize huge models, like those with billions of parameters, on consumer hardware without losing performance.


step 2: lazy batch-updates

while the GPTQ algorithm is powerful, there’s a catch: updating a massive matrix entry by entry is slow. this approach doesn’t fully utilize the parallel processing capabilities of GPUs and can hit memory bottlenecks, especially for large models. to solve this, GPTQ introduces a clever optimization called lazy batch-updates.

the problem: slow matrix updates

in the original approach, each weight update requires modifying a small part of a huge matrix. this leads to:

  1. poor GPU utilization: each update is a tiny operation that can’t keep thousands of GPU cores busy.
  2. a memory bottleneck: the same large matrix is read and rewritten over and over, so the algorithm spends more time moving data than actually computing.

the solution: lazy batch-updates 😉

GPTQ’s lazy batch-updates solve these issues by processing multiple columns at once. here’s how it works:

  1. batch processing: instead of updating one column at a time, GPTQ processes a batch of columns (e.g., 128 columns) simultaneously. this allows the GPU to work on multiple updates in parallel, maximizing its compute power.
  2. local updates: during batch processing, GPTQ only updates the columns in the current batch and their corresponding block of the matrix. this reduces the number of memory operations, avoiding bottlenecks.
  3. global updates: once a batch is fully processed, GPTQ performs aĀ global update on the entire matrix. this ensures that all changes are reflected accurately across the model.

the math behind it

the lazy batch-update mechanism relies on two key formulas:

  1. weight adjustment: $\delta_F = -(w_Q - \text{quant}(w_Q)) \, ([H_F^{-1}]_{QQ})^{-1} \, (H_F^{-1})_{:,Q}$

  2. hessian update: $H^{-1}_{-Q} = \left(H^{-1} - H^{-1}_{:,Q} \, ([H_F^{-1}]_{QQ})^{-1} \, H^{-1}_{Q,:}\right)_{-Q}$

these formulas ensure that the updates are precise and efficient, even when processing multiple columns at once.
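here is an illustrative pytorch sketch of that pattern: it quantizes a toy weight matrix in blocks of B = 128 columns, keeps the error corrections local to the block, and applies a single batched update to everything after the block. the shapes, the naive 4-bit quantizer, and the fact that it reuses one fixed inverse hessian (instead of re-eliminating quantized columns, which the cholesky trick in the next step takes care of) are simplifications for readability, not AutoGPTQ internals.

import torch

torch.manual_seed(0)
d_out, d_in, B = 64, 256, 128
W = torch.randn(d_out, d_in)             # weights to quantize, column by column
Q = torch.zeros_like(W)                  # quantized result
X = torch.randn(d_in, 512)
H_inv = torch.linalg.inv(2 * X @ X.T + 1e-2 * torch.eye(d_in))

scale = W.abs().max() / 7
quant = lambda v: torch.clamp(torch.round(v / scale), -8, 7) * scale

for start in range(0, d_in, B):
    end = min(start + B, d_in)
    Err = torch.zeros(d_out, end - start)

    # local updates: quantize the block's columns, only touching columns inside the block
    for j in range(start, end):
        Q[:, j] = quant(W[:, j])
        e = (W[:, j] - Q[:, j]) / H_inv[j, j]
        W[:, j + 1:end] -= torch.outer(e, H_inv[j, j + 1:end])
        Err[:, j - start] = e

    # global update: one big matrix multiply corrects all columns after the block
    W[:, end:] -= Err @ H_inv[start:end, end:]

the design choice is simple: many tiny updates become one large matrix multiplication per block, which is exactly the kind of work GPUs are fast at.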

why this matters

lazy batch-updates make GPTQ faster and more scalable: the GPU works on large, parallel blocks of columns instead of tiny serial updates, and memory traffic drops because most corrections stay local to the current block, with only a single global update per batch.


step 3: cholesky reformulation

as GPTQ scales up to handle very large models, a new challenge emerges: numerical inaccuracies. repeated operations can lead to small errors that accumulate over time, potentially destabilizing the quantization process. to tackle this, GPTQ introduces a cholesky reformulation, a numerically stable method that ensures accuracy even for massive models.

the problem: numerical instability

when working with large models, small numerical errors can snowball into bigger problems. specifically, the repeated updates to the inverse hessian matrix accumulate floating-point error, and on very large layers the matrix can even become indefinite, which breaks the weight adjustments and derails the whole quantization.

the solution: cholesky decomposition

GPTQ uses cholesky decomposition, a mathematically robust technique, to solve this problem. here’s how it works:

  1. precompute with cholesky: before starting the quantization process, GPTQ precomputes key information from the hessian inverse matrix using the cholesky method. this ensures that all subsequent calculations are stable and accurate.
  2. dampening for stability: to further prevent numerical issues, GPTQ adds a small constant to the diagonal elements of the matrix (a process called dampening, roughly 1% of the average diagonal value). this tweak keeps the computations well-behaved, even for massive models.

how GPTQ works: step by step

here’s a breakdown of the GPTQ algorithm with cholesky reformulation:

  1. cholesky decomposition: start by performing a cholesky decomposition on the hessian inverse matrix. this sets the stage for stable and efficient computations.
  2. batch processing: GPTQ processes the weight matrix in batches of columns. for each column in a batch, it quantizes the weights, computes the resulting error using the precomputed cholesky information, and applies a local update to the remaining columns of the block.
  3. global updates: after processing a batch, GPTQ updates all remaining weights based on the accumulated errors from the current block. this ensures that the quantization process remains accurate across the entire model. a minimal sketch of the cholesky precomputation follows this list.
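below is a minimal sketch of that precomputation, reusing the toy shapes from the batching example above. damp_percent mirrors the parameter we set later in BaseQuantizeConfig; everything else is an illustrative assumption rather than the library’s actual code.

import torch

torch.manual_seed(0)
d_in, damp_percent = 256, 0.01
X = torch.randn(d_in, 512)                      # calibration activations for one layer

# 1. hessian of the layer-wise objective
H = 2 * X @ X.T

# 2. dampening: add a small fraction of the mean diagonal to keep H well-conditioned
H += damp_percent * torch.diag(H).mean() * torch.eye(d_in)

# 3. cholesky precomputation: the rows of the upper cholesky factor of H^-1 carry
#    (up to scaling) the information the quantization loop needs, so the fragile
#    per-step hessian-inverse updates disappear from the main loop
L = torch.linalg.cholesky(H)                    # H = L L^T
H_inv = torch.cholesky_inverse(L)               # numerically stable inverse from the factor
H_inv_chol = torch.linalg.cholesky(H_inv, upper=True)

# H_inv_chol now stands in for H_inv in the batched column loop sketched earlier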

real-world performance

GPTQ was tested on large language models like BLOOM (176B parameters) and OPT (175B parameters). here’s how it performed: it quantized these models down to 3 or 4 bits in roughly four GPU-hours on a single NVIDIA A100, with only a small increase in perplexity compared to the full-precision baselines, a scale that would be completely out of reach for the original OBQ approach.

why this matters

the cholesky reformulation makes GPTQ numerically stable and scalable: the fragile, repeated hessian-inverse updates are replaced by a single robust precomputation, so the algorithm stays accurate even on models with hundreds of billions of parameters.

quantizing a model with AutoGPTQ

enough theory, let’s quantize a model ourselves. we’ll use the AutoGPTQ library with a small GPT-2 model so that everything runs on a free T4 GPU. first, install the required packages:

!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers datasets

then, load the libraries and define the model you want to quantize:

import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer


# Define base model and output directory
model_name = "gpt2"
quantized_model_dir = model_name + "-GPTQ"

loading the model and tokenizer

next, we load the model and tokenizer. the tokenizer is loaded using the classic AutoTokenizer class from the transformers library. for the model, we use the AutoGPTQForCausalLM class, which takes a dedicated configuration object, BaseQuantizeConfig, to set up the quantization process.

in this configuration, we specify bits=4, which reduces the model to 4-bit precision, making it much smaller and faster while maintaining performance. we also define group_size, which makes groups of weights (e.g., 128 or 1024 weights) share their own quantization parameters instead of using one set per row. while optional, grouping improves quantization quality at minimal computational cost; we’ll use group_size=128 below. additionally, we set the damp_percent parameter, which controls the dampening that stabilizes the cholesky reformulation; this should generally be left unchanged.

finally, there’s the desc_act (act order) parameter, which quantizes weights in order of decreasing activation. this means the most impactful weights (determined from the sampled inputs and outputs) are quantized first, placing most of the quantization error on less significant weights and improving overall accuracy. however, when used with group_size, it can cause performance slowdowns due to frequent reloading of quantization parameters. for now, we’ll disable it, though future updates may address the issue.

here’s how to load the quantize config, model, and tokenizer:

# Load quantize config, model and tokenizer
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
)
new_model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
new_tokenizer = AutoTokenizer.from_pretrained(model_name)

preparing samples for quantization

the quantization process relies heavily on samples to evaluate and enhance the quality of the quantized model. these samples allow us to compare the outputs of the original model with those of the quantized model. the more samples we use, the better the comparison, leading to improved quantization quality.

for this article, we’ll use the C4 dataset (colossal clean crawled corpus), a large-scale, multilingual collection of web text from the common crawl project. the C4 dataset has been cleaned and prepared specifically for training large-scale language models, making it an excellent resource for tasks like quantization. another popular option is the WikiText dataset, but we’ll stick with C4 for this example.

loading and tokenizing samples

here’s how we load 1024 samples from the C4 dataset, tokenize them, and format them for quantization:

# Load data and tokenize examples
n_samples = 1024
data = load_dataset("allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split=f"train[:{n_samples*5}]")
tokenized_data = new_tokenizer("\n\n".join(data['text']), return_tensors='pt')

# Format tokenized examples
examples_ids = []
for _ in range(n_samples):
    i = random.randint(0, tokenized_data.input_ids.shape[1] - new_tokenizer.model_max_length - 1)
    j = i + new_tokenizer.model_max_length
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})

quantizing the model

now that the dataset is ready, we can start the quantization process. we’ll use a batch size of 1 and optionally enable OpenAI Triton, a CUDA alternative, to optimize GPU communication. once the quantization is complete, we’ll save the model and tokenizer in the safetensors format, which is efficient and secure.

here’s how to quantize the model and save the results:

%%time

# Quantize with GPTQ
new_model.quantize(
    examples_ids,
    batch_size=1,
    use_triton=True,
)

# Save model and tokenizer
new_model.save_quantized(quantized_model_dir, use_safetensors=True)
new_tokenizer.save_pretrained(quantized_model_dir)

loading the quantized model

once the model is quantized and saved, you can load it back using the AutoGPTQForCausalLM and AutoTokenizer classes. here’s how:

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Reload model and tokenizer
new_model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device=device,
    use_triton=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

testing the quantized model

the quantized model works just like a normal transformers model, making it compatible with inference pipelines. let’s test it with a simple text generation task:

from transformers import pipeline

generator = pipeline('text-generation', model=new_model, tokenizer=tokenizer)
result = generator("i enjoy training neural networks. it feels like I’m teaching a small baby.", do_sample=True, max_length=60)[0]['generated_text']
print(result)

results and next steps

the quantized GPT-2 model produces high-quality completions, showing that the quantization process successfully preserves the model’s performance. while the results are promising, a more thorough evaluation, such as measuring the perplexity of the quantized model compared to the original, would provide deeper insights into the impact of quantization. however, that’s a topic for another time. for now, we’ve achieved our goal: a compact, efficient model that delivers great results.

conclusion

in this article, we explored the GPTQ algorithm, a state-of-the-art quantization technique that makes it possible to run large language models (LLMs) on consumer-grade hardware. we walked through how GPTQ solves the layer-wise compression problem using advanced techniques like the arbitrary-order insight, lazy batch-updates, and the cholesky reformulation.

these innovations significantly reduce memory and computation requirements, making powerful LLMs accessible to a broader audience.

we also demonstrated how to quantize a GPT-2 model using a free T4 GPU and generate text with the quantized version. if you’re inspired to try this yourself, you can push your own 4-bit quantized models to the hugging face hub and share them with the community.
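as a hedged sketch (not something we ran above), sharing the saved folder could look roughly like this with the huggingface_hub client; the repo id is a placeholder and you need to be authenticated (for example via huggingface-cli login) for the upload to succeed.

from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/gpt2-GPTQ"   # placeholder: replace with your own namespace
api.create_repo(repo_id, exist_ok=True)

# upload everything written by save_quantized() and save_pretrained()
api.upload_folder(folder_path=quantized_model_dir, repo_id=repo_id)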

while GPTQ is a powerful tool, it’s not the only option for 4-bit quantization. alternatives like GGML and NF4 offer slightly different approaches and are worth exploring. i encourage you to dive deeper into these methods and experiment with them to see which works best for your needs.

references


  1. B. Hassibi, D. G. Stork, and G. J. Wolff, ā€œOptimal Brain Surgeon and general network pruning,ā€ IEEE International Conference on Neural Networks, San Francisco, CA, USA, 1993, pp. 293-299 vol.1, doi: 10.1109/ICNN.1993.298572.

  2. Elias Frantar, Sidak Pal Singh, & Dan Alistarh. (2023). Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. arXiv preprint.

  3. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, & Dan Alistarh. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint.

  4. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, & Peter J. Liu. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140), 1-67. arXiv preprint.