LoRA & PEFT: Fine-tuning on a Budget

Full fine-tuning of large models is out of reach for most teams. LoRA and other parameter-efficient fine-tuning (PEFT) methods have made task-specific fine-tuning accessible on consumer hardware.

How LoRA Works

LoRA (Low-Rank Adaptation) is based on a key insight: during fine-tuning, updates to large weight matrices tend to be low-rank — the meaningful changes can be expressed as the product of two much smaller matrices.

LoRA exploits this by freezing the original weight matrix W and training two small adapter matrices A and B, where the effective update is A × B. If W is 4096 × 4096 and you choose rank 16, then A is 4096 × 16 and B is 16 × 4096, so the trainable parameter count for that matrix falls from about 16.8 million to about 131,000. The base model stays frozen; only the adapters are trained.
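The arithmetic for the 4096 × 4096, rank-16 case can be checked directly. This is a minimal sketch; the helper name `lora_param_counts` is made up for illustration and is not part of any library.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full_update_params, lora_params) for one weight matrix."""
    full = d_in * d_out                 # every entry of W would be trainable
    lora = d_in * rank + rank * d_out   # A is d_in x rank, B is rank x d_out
    return full, lora


full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
# prints: full: 16,777,216  lora: 131,072  ratio: 128x
```

The adapter cost grows linearly with the rank, so doubling the rank to 32 doubles the adapter parameters while the frozen matrix is unchanged.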

At inference time, you can keep adapters separate (useful for swapping between tasks) or merge them into the base weights for zero-overhead inference.
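The two inference options produce the same outputs, which a small NumPy sketch can illustrate. It assumes the A × B convention used above and leaves out the scaling factor that real LoRA implementations usually apply to the update; the sizes are arbitrary and chosen only to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                               # small sizes for illustration

W = rng.standard_normal((d, d))            # frozen base weight
A = rng.standard_normal((d, r)) * 0.01     # trained adapter, d x r
B = rng.standard_normal((r, d)) * 0.01     # trained adapter, r x d
x = rng.standard_normal((2, d))            # a batch of two input vectors

# Option 1: keep the adapter separate (two extra small matmuls per call).
y_separate = x @ W + (x @ A) @ B

# Option 2: fold the update into the base weight once, then use one matmul.
W_merged = W + A @ B
y_merged = x @ W_merged

print(np.allclose(y_separate, y_merged))
```

Merging removes the extra matmuls but ties the weights to one task. Keeping A and B separate costs a little compute per call and leaves W untouched for other adapters.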

Practical Benefits

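Far fewer trainable parameters. Reductions of 10–100x or more are typical, and because gradients and optimizer state are kept only for the adapters, training memory drops substantially.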

Consumer GPU viability. 7B models can be LoRA fine-tuned on a single 24GB RTX 4090; 13B models generally need a quantized base model to fit (see QLoRA below).

Swappable adapters. One base model, many fine-tuned behaviors — swap at runtime without reloading the base.
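Adapter swapping can be pictured as one shared weight matrix plus a dictionary of small (A, B) pairs. The sketch below is a toy illustration, not how any particular library implements it; the class `AdaptedLinear` and the adapter names `summarize` and `translate` are hypothetical.

```python
import numpy as np


class AdaptedLinear:
    """A frozen base weight plus named low-rank adapters (illustrative only)."""

    def __init__(self, W: np.ndarray):
        self.W = W              # shared base weight, never modified
        self.adapters = {}      # name -> (A, B)
        self.active = None      # name of the adapter in use, or None

    def add_adapter(self, name: str, A: np.ndarray, B: np.ndarray) -> None:
        self.adapters[name] = (A, B)

    def set_adapter(self, name) -> None:
        if name is not None and name not in self.adapters:
            raise KeyError(name)
        self.active = name      # switching is just a pointer change

    def __call__(self, x: np.ndarray) -> np.ndarray:
        y = x @ self.W
        if self.active is not None:
            A, B = self.adapters[self.active]
            y = y + (x @ A) @ B
        return y


rng = np.random.default_rng(0)
d, r = 32, 4
layer = AdaptedLinear(rng.standard_normal((d, d)))
layer.add_adapter("summarize", rng.standard_normal((d, r)), rng.standard_normal((r, d)))
layer.add_adapter("translate", rng.standard_normal((d, r)), rng.standard_normal((r, d)))

x = rng.standard_normal((1, d))
layer.set_adapter("summarize")
y_sum = layer(x)
layer.set_adapter("translate")
y_tr = layer(x)
```

Switching tasks touches only the small adapter matrices, which is why the base model never needs to be reloaded.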

QLoRA: Going Further

QLoRA combines quantization with LoRA: the base model is loaded in 4-bit precision and kept frozen, while the LoRA adapters on top are trained in higher precision (typically 16-bit).

Result: a 70B model can be fine-tuned on a single 80GB A100, and a 13B model fits on a consumer GPU with 24GB of VRAM. The quality trade-off versus 16-bit LoRA is reported to be small for most tasks.
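A back-of-the-envelope estimate shows why these sizes fit. The sketch below counts only the memory for the base weights; it ignores quantization constants, activations, the adapters themselves, and optimizer state, so real usage will be higher. The helper name `weight_memory_gb` is an assumption for illustration.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n in [("13B", 13_000_000_000), ("70B", 70_000_000_000)]:
    fp16 = weight_memory_gb(n, 16)
    int4 = weight_memory_gb(n, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
# prints:
# 13B: 16-bit ~26.0 GB, 4-bit ~6.5 GB
# 70B: 16-bit ~140.0 GB, 4-bit ~35.0 GB
```

At 4 bits, 70B weights take roughly 35 GB, which leaves headroom on an 80GB card for activations and adapter training; at 16 bits the same weights would not fit at all.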

Other PEFT Methods

Prefix tuning: prepends trainable vectors to the attention keys and values at each transformer layer. The model's weights stay frozen; only the prefixes are trained.

Prompt tuning: adds trainable tokens only at the input embedding layer. Simpler and cheaper, but generally less effective than LoRA on complex tasks.
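Prompt tuning is the easier of the two to sketch: a small matrix of trainable "virtual token" embeddings is concatenated in front of the ordinary token embeddings, and everything else stays frozen. The sizes and variable names below are arbitrary assumptions, and the training loop is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_virtual = 100, 16, 5

embedding = rng.standard_normal((vocab, d_model))         # frozen model embeddings
soft_prompt = rng.standard_normal((n_virtual, d_model))   # the only trainable tensor


def embed_with_prompt(token_ids: np.ndarray) -> np.ndarray:
    """Look up token embeddings and prepend the soft prompt."""
    tokens = embedding[token_ids]                          # (seq_len, d_model)
    return np.concatenate([soft_prompt, tokens], axis=0)   # (n_virtual + seq_len, d_model)


inputs = embed_with_prompt(np.array([3, 14, 15]))
print(inputs.shape)   # (8, 16)
```

Here the trainable state is 5 × 16 = 80 numbers, regardless of model size. Prefix tuning follows the same idea but injects trainable vectors at every layer instead of only at the input.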

Tooling

Hugging Face PEFT library: the standard starting point. Wraps any supported model with a LoRA configuration in a few lines of code.

Unsloth: claims roughly 2–5x faster training than standard PEFT, with lower memory usage, through hand-written GPU kernels (largely written in Triton). A popular choice for training on consumer hardware.

Axolotl and LLaMA-Factory: higher-level, configuration-driven pipelines supporting LoRA and QLoRA across a wide range of open-source models.
