RAG vs Fine-tuning vs Prompting

Three ways to customize AI behavior — when to use each and the tradeoffs involved.

Start With the Simplest Thing That Works

The most common mistake when building with LLMs is reaching for RAG or fine-tuning before exhausting what plain prompting can do. Both add complexity and cost. Both are the right answer in specific circumstances. But "my model doesn't know about our internal docs" does not automatically mean "we need to build a RAG pipeline."

Prompting: Zero Infrastructure, High Leverage

Prompting means giving the model the information it needs inside the context window along with your question. No pipelines, no databases, no training runs. You write good prompts.

When prompting is the right answer: - Your domain knowledge fits in the context window - Your information doesn't change frequently (or you can update the prompt when it does) - You need results today, not after a two-week infrastructure build - You're still learning what the model can and can't do for your use case

The biggest lever in prompting is the system prompt. A well-crafted system prompt that establishes role, constraints, output format, and examples will outperform a bare API call to a more expensive model.

Prompting limits: - Hard context window ceiling. You can't fit 500 documents in a prompt. - Cost scales with context size. Long system prompts that repeat on every call add up. - No truly dynamic information. Stale if the data changes and you forget to update the prompt.

RAG: Retrieval Augmented Generation

RAG solves the "the data doesn't fit in the context and it changes" problem. The architecture:

1.Embed your documents into vectors using an embedding model
2.Store those vectors in a vector database (Pinecone, pgvector, Weaviate, Chroma)
3.At query time, embed the user's question using the same embedding model
4.Retrieve the most semantically similar document chunks
5.Inject those chunks into the prompt
6.Generate the response using that augmented context

The model never learns your data — it reads it at runtime, the same way it reads anything else in the context window.

When RAG is the right answer: - You have a large corpus (hundreds to thousands of documents) that doesn't fit in context - Your data changes frequently (product docs, support articles, internal wikis) - You need citations or sourcing (you retrieved specific chunks, so you can cite them) - You have private data that can't leave your environment (RAG stays in your infrastructure)

RAG is not magic. Common failure modes: - Retrieval quality is the bottleneck, not generation. If you retrieve the wrong chunks, the model generates a confident wrong answer. - Chunking strategy matters enormously. Split documents too small and you lose context. Too large and retrieval becomes imprecise. - Embeddings don't capture everything. Keyword-based retrieval (BM25) sometimes outperforms or complements semantic search — hybrid approaches are often best.

Costs to budget for: - Embedding API calls (cheap per token, but scales with corpus size) - Vector database hosting ($0 self-hosted with pgvector, $70+/mo for managed Pinecone) - Engineering time to build and maintain the pipeline

Fine-tuning: What It Actually Does

Fine-tuning is the most misunderstood option. Fine-tuning does not inject new knowledge into a model. It shapes the model's behavior, style, and format.

What fine-tuning is good for: - Consistent output format (always respond as JSON with these exact fields) - Tone and style matching (respond like our brand voice, not like a generic assistant) - Domain-specific behavior patterns (triage support tickets into these categories) - Reducing the need for long few-shot examples (if you've been using 20 examples in every prompt, fine-tuning on those examples can shorten your prompts)

What fine-tuning is bad for: - Teaching the model facts it doesn't know. Models don't reliably retain specific factual information from fine-tuning datasets — they're more likely to hallucinate confidently than to correctly recall a specific fact. - Anything you could accomplish with good prompting. Fine-tuning adds cost and a training workflow for marginal gains.

Provider availability: - OpenAI: Fine-tuning supported for GPT-4o-mini and GPT-3.5-turbo. Requires training data in JSONL format. Costs ~$8/MTok for training, higher inference costs for fine-tuned models. - Anthropic: Limited fine-tuning access, not widely available as of mid-2025. Anthropic's position is that prompting with Claude is usually sufficient. - Google: Fine-tuning available for Gemini models via Vertex AI. - Open source: If you're running Llama, Mistral, etc., fine-tuning is fully available via tools like Unsloth or Axolotl.

Data requirements: You need at minimum ~100 high-quality examples to see any benefit. More like 1,000+ for meaningful behavior change. The quality of training examples matters more than the quantity.

The Decision Framework

Work through these in order:

1. Can you solve it with prompting? Try a well-crafted system prompt with a few examples first. For most use cases, this is sufficient. If your context fits and the data doesn't change often, stop here.

2. Do you have private/dynamic data that doesn't fit in context? Add RAG. Start simple — even a basic similarity search with pgvector (which you likely have if you're on Postgres) will get you 80% of the way there.

3. Do you have a consistent behavior/format problem that prompting doesn't fully solve? Consider fine-tuning. Make sure you have enough quality training examples and a clear evaluation metric.

The pattern that doesn't work: fine-tuning instead of RAG because you think it'll "teach the model your docs." It won't. If your problem is knowledge injection, RAG is the answer. Fine-tuning for knowledge is an expensive mistake that engineers make repeatedly.

Have a follow-up question about this topic?

Ask AI

← Previous

Prompting for Code

Embeddings & Vector Search