Full fine-tuning of large models is out of reach for most teams. LoRA and other parameter-efficient fine-tuning (PEFT) methods have made task-specific fine-tuning accessible on consumer hardware.
LoRA (Low-Rank Adaptation) is based on a key insight: during fine-tuning, updates to large weight matrices tend to be low-rank — the meaningful changes can be expressed as the product of two much smaller matrices.
LoRA exploits this by freezing the original weight matrix W and training two small adapter matrices A and B, whose product A × B is the effective update added to W. If W is 4096 × 4096 and you choose rank 16, then A is 4096 × 16 and B is 16 × 4096, which cuts the trainable parameters for that matrix from about 16.8 million to about 131,000 (a 128x reduction). The base model stays frozen; only the adapters are trained.
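To make the arithmetic concrete, here is a minimal NumPy sketch of a single LoRA-adapted layer using the shapes from the example above. The initialization (small random A, zero B, so the update starts at zero) is an assumption that follows common practice, and `lora_forward` is an illustrative name, not a library function.

```python
import numpy as np

d, r = 4096, 16  # matrix size and rank from the example above

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)          # frozen base weight
A = (rng.standard_normal((d, r)) * 0.01).astype(np.float32)  # trainable adapter
B = np.zeros((r, d), dtype=np.float32)                       # trainable adapter, zero init (assumption)

def lora_forward(x):
    """Base path plus low-rank path; the full d x d product A @ B is never built."""
    return x @ W + (x @ A) @ B

full_params = W.size            # parameters a full fine-tune would update
lora_params = A.size + B.size   # parameters LoRA actually trains
reduction = full_params / lora_params

print(full_params, lora_params, reduction)
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen base layer; training then moves A and B away from that starting point.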
At inference time, you can keep adapters separate (useful for swapping between tasks) or merge them into the base weights for zero-overhead inference.
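The two inference options give the same outputs, which a small NumPy sketch can show. The dimensions and random values here are arbitrary placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                      # small illustrative sizes
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((d, r))   # trained adapter (random stand-in)
B = rng.standard_normal((r, d))   # trained adapter (random stand-in)
x = rng.standard_normal((8, d))   # a batch of inputs

# Option 1: keep the adapter separate and add its contribution at run time.
y_separate = x @ W + (x @ A) @ B

# Option 2: fold the adapter into the base weight once, then use a plain matmul.
W_merged = W + A @ B
y_merged = x @ W_merged

outputs_match = np.allclose(y_separate, y_merged)
print(outputs_match)
```

Merging removes the extra matrix multiplications at inference time, at the cost of no longer being able to detach the adapter without keeping a copy of the original W.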
10–100x fewer trainable parameters. Memory requirements drop dramatically.
Consumer GPU viability. 7B models can be LoRA fine-tuned on a single RTX 4090; 13B models generally need a quantized base (as in QLoRA) to fit in its 24GB of VRAM.
Swappable adapters. One base model, many fine-tuned behaviors — swap at runtime without reloading the base.
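Adapter swapping can be sketched with a toy layer that holds one shared base weight and several named adapters. The class `AdaptedLinear`, its methods, and the adapter names are hypothetical, written only for this illustration; real libraries expose their own interfaces.

```python
import numpy as np

class AdaptedLinear:
    """Frozen base weight plus a set of named low-rank adapters (illustrative only)."""

    def __init__(self, W):
        self.W = W            # shared, frozen base weight
        self.adapters = {}    # name -> (A, B)
        self.active = None    # name of the adapter in use, or None for the bare base

    def add_adapter(self, name, A, B):
        self.adapters[name] = (A, B)

    def set_adapter(self, name):
        if name is not None and name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def __call__(self, x):
        y = x @ self.W
        if self.active is not None:
            A, B = self.adapters[self.active]
            y = y + (x @ A) @ B
        return y

rng = np.random.default_rng(0)
d, r = 32, 4
layer = AdaptedLinear(rng.standard_normal((d, d)))
layer.add_adapter("summarize", rng.standard_normal((d, r)), rng.standard_normal((r, d)))
layer.add_adapter("translate", rng.standard_normal((d, r)), rng.standard_normal((r, d)))

x = rng.standard_normal((2, d))
layer.set_adapter("summarize")
y_sum = layer(x)
layer.set_adapter("translate")   # switching touches only a name, not the base weight
y_tr = layer(x)
```

Switching tasks changes which small (A, B) pair is applied while the large base weight stays loaded and untouched.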
QLoRA combines quantization with LoRA: the base model is loaded in 4-bit precision and kept frozen, while the LoRA adapters train on top in higher precision (typically 16-bit).
Result: 70B models can be fine-tuned on a single 80GB A100. 13B models fit on consumer GPUs with 24GB VRAM. Quality trade-off versus full-precision LoRA is minimal for most tasks.
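A rough back-of-the-envelope estimate shows why these sizes work. The helper below (`weight_memory_gb`, an illustrative name) counts only the base model weights; it deliberately ignores quantization metadata, activations, adapter gradients, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit weights = {fp16:.1f} GB, 4-bit weights = {four_bit:.1f} GB")
```

Under these assumptions, 4-bit weights take roughly 35 GB for a 70B model and 6.5 GB for a 13B model, which leaves headroom on an 80GB A100 and a 24GB consumer GPU respectively. In 16-bit the same weights would need about 140 GB and 26 GB.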
Prefix tuning: prepends trainable vectors (virtual tokens) to the attention inputs at each transformer layer. The base weights stay frozen; only the prefixes are trained.
Prompt tuning: trainable embedding vectors (soft prompts) are added only at the input embedding layer. Simpler and cheaper, but generally less effective than LoRA on complex tasks.
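Prompt tuning is the simpler of the two to picture, so here is a minimal NumPy sketch of it. The vocabulary size, model width, number of virtual tokens, and the function name `embed_with_soft_prompt` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64   # illustrative sizes
n_virtual = 8                    # number of trainable virtual tokens

embedding = rng.standard_normal((vocab_size, d_model))           # frozen embedding table
soft_prompt = rng.standard_normal((n_virtual, d_model)) * 0.02   # the only trainable tensor

def embed_with_soft_prompt(token_ids):
    """Look up real token embeddings and prepend the trainable soft prompt."""
    tokens = embedding[token_ids]                          # (seq_len, d_model)
    return np.concatenate([soft_prompt, tokens], axis=0)   # (n_virtual + seq_len, d_model)

token_ids = np.array([5, 42, 7])
inputs = embed_with_soft_prompt(token_ids)
print(inputs.shape, soft_prompt.size)
```

The rest of the model sees a slightly longer input sequence and is otherwise unchanged. Prefix tuning follows the same idea but injects trainable vectors at every layer rather than only at the input.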
Hugging Face PEFT library: the standard starting point. Wraps any supported model with a LoRA configuration in a few lines of code.
Unsloth: reports 2–5x faster training than standard PEFT through custom GPU kernels (written in Triton), with lower memory usage. A popular choice for training on consumer hardware.
Axolotl and LLaMA-Factory: higher-level, configuration-driven pipelines supporting LoRA and QLoRA across a wide range of open-source models.
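As a rough illustration of the "few lines of code" involved with the Hugging Face PEFT library, the sketch below wraps a small causal language model with a LoRA configuration. The model name, the hyperparameter values, and the choice of target modules are assumptions picked for the example (the module names `q_proj` and `v_proj` match OPT-style models and differ for other architectures); `wrap_with_lora` is a hypothetical helper, not part of the library.

```python
def wrap_with_lora(model_name: str = "facebook/opt-125m"):
    """Load a causal LM and attach LoRA adapters to its attention projections."""
    # Imported inside the function so the heavy dependencies load only when it is called.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained(model_name)
    config = LoraConfig(
        r=16,                                  # adapter rank (assumed value)
        lora_alpha=32,                         # scaling factor (assumed value)
        lora_dropout=0.05,                     # dropout on the adapter path (assumed value)
        target_modules=["q_proj", "v_proj"],   # architecture-specific layer names
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)       # freezes the base, adds trainable adapters
    model.print_trainable_parameters()         # reports trainable vs. total parameter counts
    return model
```

Calling `wrap_with_lora()` downloads the model on first use and returns a model that can be passed to an ordinary training loop. After training, PEFT's `merge_and_unload()` can fold LoRA adapters into the base weights for deployment.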