LoRA & Parameter-Efficient Fine-Tuning

Low-rank adapters, QLoRA, rank selection, and serving multiple adapters in production

Full fine-tuning updates every weight in the model. For many practical purposes, that's wasteful: LoRA often reaches roughly 90–99% of full fine-tuning quality at a fraction of the cost by training small low-rank matrices whose product is added to the frozen base weights. It's one of the most important techniques for making fine-tuning accessible.

How LoRA Works

The core idea is simple. For a pre-trained weight matrix W of shape (d x k), LoRA freezes W and adds a trainable low-rank decomposition:

W' = W + BA

where B is (d x r) and A is (r x k), with rank r ≪ min(d, k). Typically r = 8, 16, or 32 while d and k are in the thousands.

This means:

The base model is frozen: no gradients are computed for the original weights, so they are never updated.
Only B and A are trained: a tiny fraction of the original parameter count.
At inference, BA can be merged into W: zero latency overhead after merging.

For a 7B model, a rank-16 LoRA on the attention layers might add ~10–50 million trainable parameters, which is less than 1% of the total.
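
To make the parameter arithmetic concrete, here is a minimal sketch. The layer count, hidden size, and number of projections are assumed values chosen to resemble a typical 7B decoder; they are not taken from any specific model card.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d x k) weight: B is (d x r), A is (r x k)."""
    return d * r + r * k


# Assumed shapes, for illustration only: 32 transformer layers, hidden size 4096,
# four square attention projections (Q, K, V, O) per layer.
num_layers = 32
hidden = 4096
projections_per_layer = 4
rank = 16

per_matrix = lora_param_count(hidden, hidden, rank)
total = per_matrix * projections_per_layer * num_layers
fraction = total / 7e9  # relative to a nominal 7B-parameter base

print(f"per matrix: {per_matrix:,}")
print(f"total trainable: {total:,}")
print(f"fraction of 7B: {fraction:.4%}")
```

Under these assumptions the adapters come to about 16.8 million parameters, roughly 0.24% of the base model. Adding MLP projections or raising the rank moves the total toward the upper end of the range above.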

QLoRA: Quantized Base + LoRA

QLoRA combines 4-bit quantization of the base model with LoRA adapters trained in BF16/FP16. The result: you can fine-tune a 70B model on a single 48GB GPU, provided batch size and sequence length stay modest.

The recipe:

Load the base model in 4-bit NormalFloat (NF4) quantization.
Attach LoRA adapters in higher precision (BF16).
Train only the adapters while the base stays frozen and quantized.
Use double quantization to further reduce the memory of quantization constants.

QLoRA's memory savings are dramatic. The weights of a 70B model need ~140GB in FP16 but only ~35GB in NF4, leaving room for LoRA adapters, optimizer states, and activations on a single 80GB A100/H100.
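
The weight-memory figures are simple arithmetic, sketched below. This is a back-of-the-envelope estimate only: it counts the weights themselves and ignores quantization constants, activations, adapter weights, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, 16)  # 2 bytes per parameter
nf4_gb = weight_memory_gb(params_70b, 4)    # half a byte per parameter

print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"NF4 weights:  ~{nf4_gb:.0f} GB")
```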

Rank Selection

Rank is the main LoRA-specific hyperparameter. Guidelines:

r = 8 — good starting point for most tasks. Surprisingly effective.
r = 16 — the sweet spot for most production fine-tuning. Slightly better quality, still very efficient.
r = 32–64 — for complex tasks or when you're seeing underfitting at lower ranks.
r = 128+ — rarely needed. If you need this much capacity, consider whether full fine-tuning of a subset of layers would be simpler.

The scaling factor alpha is usually set to 2x the rank (alpha = 32 for r = 16). The LoRA update BA is multiplied by alpha / r before it is added to the original weights, so alpha controls the magnitude of the update relative to W. Higher alpha at a fixed rank = stronger adaptation.
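
A small sketch of how alpha and rank interact, assuming the standard LoRA scaling of alpha / r. The function name is made up for illustration.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to BA under the standard alpha / r convention."""
    return alpha / r


for r in (8, 16, 32, 64):
    tied = lora_scale(2 * r, r)   # alpha = 2x rank: multiplier stays constant
    fixed = lora_scale(32, r)     # alpha fixed at 32: multiplier shrinks as rank grows
    print(f"r={r:>2}  alpha=2r -> {tied:.2f}   alpha=32 -> {fixed:.2f}")
```

This is one reason to tie alpha to the rank: if you raise r but leave alpha alone, the effective update strength drops, which can look like the higher rank "not helping".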

Which Layers to Target

Not all layers benefit equally from LoRA:

Attention projections (Q, K, V, O) — the default and highest-impact target. Start here.
MLP layers (gate, up, down projections) — adding these to attention layers often gives a meaningful quality bump, especially for knowledge-heavy tasks.
All linear layers — the kitchen-sink approach. More parameters but sometimes the right call for complex domain adaptation.
Embedding / LM head — rarely targeted and usually not worth it.

Practical advice: if compute is tight, start with the Q and V projections only. If quality is insufficient, add K and O. If that's still not enough, add the MLP layers. Measure quality against training cost at each step.
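
One way to organize that escalation is as a list of target sets that you step through. The module names below are an assumption: they follow the Llama-style naming used in Hugging Face Transformers, and other architectures name their projections differently. The helper is hypothetical and only filters names; it doesn't attach any adapters.

```python
# Assumed Llama-style module names; check your own model's names before reusing.
TARGET_STAGES = [
    ["q_proj", "v_proj"],                                      # stage 1: Q and V only
    ["q_proj", "k_proj", "v_proj", "o_proj"],                  # stage 2: all attention
    ["q_proj", "k_proj", "v_proj", "o_proj",
     "gate_proj", "up_proj", "down_proj"],                     # stage 3: attention + MLP
]


def select_modules(module_names, targets):
    """Return the module paths whose final component is one of the target names."""
    wanted = set(targets)
    return [name for name in module_names if name.rsplit(".", 1)[-1] in wanted]


# A hand-written stand-in for one layer's module paths.
example_modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

for stage, targets in enumerate(TARGET_STAGES, start=1):
    matched = select_modules(example_modules, targets)
    print(f"stage {stage}: {len(matched)} modules per layer")
```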

Adapter Merging

After training, you can merge the LoRA weights back into the base model:

W_merged = W + BA

This produces a single model with no adapter overhead at inference. The tradeoff: you lose the ability to swap adapters dynamically.
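
A toy check that merging doesn't change the output, using plain Python lists and tiny made-up matrices (d = 3, k = 2, r = 1). The alpha / r scaling is left out to match the formula above; with scaling you would multiply BA by that factor before adding it.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # frozen base weight, (d x k)
B = [[0.5], [-1.0], [2.0]]                # trained, (d x r)
A = [[1.0, -2.0]]                         # trained, (r x k)
x = [[1.0], [1.0]]                        # one input column, (k x 1)

# Unmerged path: base output plus the adapter's low-rank correction.
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged path: fold BA into W once, then do a single matmul per input.
W_merged = add(W, matmul(B, A))
merged = matmul(W_merged, x)

print(unmerged)
print(merged)
```

Both paths give the same result. The merged one just skips the two extra small matmuls at inference time.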

Merging strategies:

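Simple merge: add BA to W. Default and usually sufficient.
TIES merging: when combining multiple LoRA adapters, trim small-magnitude changes, pick a sign for each parameter according to which direction has the larger total magnitude across adapters, and average only the values that agree with that sign.
DARE (Drop and Rescale): randomly drop a fraction of each adapter's weight changes and rescale the survivors before merging, to reduce interference between adapters.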
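
The drop-and-rescale step is short enough to sketch directly. This is a simplified illustration on a flat list of weight changes, not a full merging pipeline; the function name and sample values are made up.

```python
import random


def dare(delta, drop_prob, rng):
    """Zero each entry with probability drop_prob; scale survivors by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else v * keep_scale for v in delta]


rng = random.Random(0)  # seeded so the run is repeatable
delta = [0.2, -0.1, 0.4, 0.05, -0.3, 0.1]  # stand-in for one adapter's weight changes
sparse = dare(delta, drop_prob=0.5, rng=rng)
print(sparse)
```

The rescaling keeps the expected value of each entry unchanged, so the sparsified adapter behaves like the original on average while overlapping less with other adapters.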

Serving Multiple LoRA Adapters

In production, you often have one base model and many task-specific LoRA adapters. Loading a full model copy per adapter is wasteful. The solution: serve the base model once and swap adapters per request.

S-LoRA: batches requests across different LoRA adapters efficiently. Uses a unified memory pool for adapters and custom CUDA kernels for batched LoRA computation.
Punica: similar concept, built around a Segmented Gather Matrix-Vector multiplication (SGMV) kernel that applies different adapters to different requests within the same batch.
vLLM with LoRA: vLLM has built-in multi-LoRA support. Register adapters, then specify which one to use per request. Often the simplest path to production.

This pattern — one base model, many lightweight adapters — is powerful. You can personalize a single deployment for dozens of use cases without multiplying your GPU spend.
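
The routing logic behind this pattern can be shown with a toy example. This sketch loops over requests in plain Python, whereas real servers fuse the work into batched GPU kernels. The adapter names and matrices are invented for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def serve_batch(W, adapters, batch):
    """Apply the shared base weight to every request, plus that request's adapter if any.

    batch is a list of (adapter_name_or_None, input_vector) pairs.
    """
    outputs = []
    for name, x in batch:
        y = matvec(W, x)  # shared base computation, identical for all requests
        if name is not None:
            B, A = adapters[name]
            delta = matvec(B, matvec(A, x))  # low-rank correction for this adapter
            y = [a + b for a, b in zip(y, delta)]
        outputs.append(y)
    return outputs


W = [[1.0, 0.0], [0.0, 1.0]]  # one copy of the base weight
adapters = {
    "support": ([[1.0], [0.0]], [[1.0, 1.0]]),  # (B, A) pairs, rank 1
    "legal": ([[0.0], [2.0]], [[1.0, 0.0]]),
}
batch = [("support", [1.0, 2.0]), ("legal", [1.0, 2.0]), (None, [1.0, 2.0])]

for (name, _), y in zip(batch, serve_batch(W, adapters, batch)):
    print(name, y)
```

The base weight is stored once and only the small (B, A) pairs differ per task, which is where the memory savings come from.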

Practical Fine-Tuning Recipes

Recipe 1: Quick domain adaptation (7B–13B model)

QLoRA, rank 16, target Q/V/K/O projections
Learning rate: 2e-4 with cosine schedule
1–3 epochs on your dataset
Single GPU (24GB+ VRAM)

Recipe 2: Production fine-tune (70B model)

QLoRA, rank 32, target all attention + MLP layers
Learning rate: 1e-4 with warmup + cosine
1 epoch, carefully curated dataset
1–2 GPUs (48GB+ each)

Recipe 3: Maximum quality (any size)

Standard LoRA (BF16 base, not quantized), rank 64
FSDP across 4–8 GPUs
Lower learning rate: 5e-5
Multiple epochs with early stopping on a held-out set
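
The "warmup + cosine" schedule named in the recipes can be sketched as a single function. The linear warmup shape, the warmup length, and the step counts below are assumptions for illustration; the recipes don't specify them.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed run: 1,000 optimizer steps, 100 of them warmup, peak LR 1e-4.
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, 1000, 1e-4, warmup_steps=100):.2e}")
```

If loss spikes early, halving peak_lr is usually the first thing to try. The shape of the schedule matters less than the peak.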

Things That Go Wrong

Learning rate too high: LoRA is sensitive to learning rate. If loss spikes, cut it in half.
Too few training examples: LoRA can overfit fast on small datasets. Use fewer epochs or a lower rank.
Wrong layer targets: targeting only embeddings or only the LM head will barely move the needle.
Forgetting to set the base model to eval mode: if the base model defines nonzero dropout, leaving it active in the frozen layers adds noise and can hurt quality.
Merging then continuing training: once you merge, the adapter is baked in. Keep unmerged checkpoints.
