Embeddings & Vector Search

What embeddings are, how vector databases work, and when you actually need them.

Text as Coordinates

An embedding is a numerical representation of text as a vector — a list of floating-point numbers, typically 768 to 3072 dimensions depending on the model. The critical property: texts with similar meanings produce vectors that are close to each other in that high-dimensional space.

"The dog ran through the park" and "A canine sprinted across the green space" will produce very similar vectors. "The quarterly revenue exceeded expectations" will produce a very different one.

This is the foundation of semantic search, RAG pipelines, duplicate detection, clustering, recommendation systems, and more.

How Embeddings Capture Meaning

Embedding models are trained (often contrastively) to map semantically similar text to similar vector positions. The model learns that "attorney" and "lawyer" should be neighbors, that code about database connections and code about SQL queries should cluster together, that two product descriptions for the same item in different phrasings should be close.
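To make "trained contrastively" a little more concrete, here is a minimal NumPy sketch of an InfoNCE-style loss, one common contrastive objective. The choice of InfoNCE, the function name `info_nce_loss`, the temperature of 0.07, and the toy 3-dimensional vectors are all assumptions for illustration; this is not a description of how any specific embedding model was trained.

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.07) -> float:
    """Mean InfoNCE loss for a batch where anchors[i] is paired with positives[i]."""
    # Normalize rows so the dot product equals cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # logits[i, j] = similarity of anchor i to candidate j, scaled by temperature.
    logits = (a @ p.T) / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching candidate for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy batch (hypothetical vectors): each anchor matches its own positive.
anchors = np.eye(3)
aligned = np.eye(3)
shuffled = np.roll(np.eye(3), 1, axis=0)  # every anchor paired with the wrong text

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
```

The loss is small when each text sits closest to its own pair and large when the pairs are mismatched, which is the pressure that pushes "attorney" and "lawyer" toward the same region during training.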

The magic is that this works across paraphrases, synonyms, and even some cross-language pairs, in a way that keyword matching never could.

Generating Embeddings

OpenAI provides an embeddings API directly; Anthropic does not ship its own embedding model and points users to Voyage AI instead. OpenAI's models are more commonly used in open-source tooling:

```python
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # 1536 dimensions, cheap
        input=text
    )
    return response.data[0].embedding

vector = embed("How do I reset my password?")
print(f"Dimensions: {len(vector)}")  # 1536
```

```python
import voyageai

# Anthropic has no first-party embedding endpoint; its documentation
# points to Voyage AI's voyage-* models, called through the Voyage AI API.
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
result = vo.embed(["How do I reset my password?"], model="voyage-3")
vector = result.embeddings[0]
```

For most RAG use cases, OpenAI's text-embedding-3-small is a strong default — cheap, fast, and 1536 dimensions is plenty.

Vector Databases: What They Do

A vector database stores embeddings and makes similarity search fast. Given a query vector, it returns the most similar stored vectors (and their associated documents) efficiently — even across millions of entries.

The similarity metric is almost always cosine similarity or dot product. Cosine similarity measures the angle between two vectors, independent of magnitude — a good match for text where you care about direction (meaning) not length.
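A small sketch of both ideas, assuming NumPy and toy 2- and 3-dimensional vectors (the helper names `cosine_similarity` and `top_k` are illustrative, not from any library). The search here is exact brute force; production vector databases generally rely on approximate indexes to stay fast at scale.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; ignores vector length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: np.ndarray, matrix: np.ndarray, k: int = 3) -> list[int]:
    """Indices of the k rows of matrix most similar to query (exact search)."""
    # After normalizing, one matrix-vector product yields every cosine score.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_rows @ unit_query
    return [int(i) for i in np.argsort(sims)[::-1][:k]]

a = np.array([1.0, 2.0, 3.0])
same_direction = cosine_similarity(a, 10 * a)  # length differs, direction does not
raw_dot = float(np.dot(a, 10 * a))             # dot product grows with length

stored = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
nearest = top_k(np.array([1.0, 0.0]), stored, k=2)
```

Scaling a vector by 10 leaves its cosine similarity at 1 while the raw dot product changes, which is why the two metrics only agree once vectors are normalized to unit length.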

Your main options:

| Database | Best for | Notes |
|---|---|---|
| pgvector | Already using Postgres | Zero new infra, excellent for < 1M vectors |
| Pinecone | Managed, production at scale | Paid, very easy to operate |
| Weaviate | Self-hosted + GraphQL | More features, more complexity |
| Chroma | Local dev and prototyping | In-memory or SQLite, no server needed |
| Qdrant | Self-hosted, high performance | Rust-based, fast, good Kubernetes story |

For most teams: start with pgvector if you have Postgres, move to Pinecone or Qdrant if you hit scaling issues or operational pain.

Semantic Search vs Keyword Search

Keyword search (like Postgres full-text search or Elasticsearch) matches documents that contain the same words as the query. It's fast and precise, but it fails on synonyms, paraphrase, and conceptual queries.

Semantic search finds documents with similar meaning, regardless of exact wording. It handles "how do I log in" matching "account authentication process" gracefully.
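To see why plain keyword matching misses that pair, here is a deliberately naive sketch that intersects lowercase word sets. The helper `keyword_overlap` is hypothetical and much cruder than real full-text search, which adds stemming and ranking, but the failure mode is the same.

```python
def keyword_overlap(query: str, document: str) -> set[str]:
    """Words shared by query and document after lowercasing."""
    return set(query.lower().split()) & set(document.lower().split())

# No words in common, so a pure keyword matcher has nothing to score.
shared = keyword_overlap("how do I log in", "account authentication process")

# Exact-token matching also misses "password" vs "passwords".
shared_exact = keyword_overlap(
    "reset my password",
    "Users reset passwords via /account/reset",
)
```

An embedding model would place both pairs close together, because it compares meaning and not surface tokens.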

Hybrid search uses both. A practical approach: run BM25 keyword search and semantic search in parallel, then merge and rerank the results. This tends to outperform either alone, especially for short queries where keyword precision matters.
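One simple way to do the merge step is Reciprocal Rank Fusion (RRF), sketched below. RRF is a swapped-in choice for illustration, since the text above does not name a specific merge method; the document IDs are made up, and `k = 60` is a conventional default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; higher combined score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returned.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical BM25 results, best first
semantic_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical vector search results
merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

RRF only looks at ranks, so it avoids having to put BM25 scores and cosine similarities on a common scale. A dedicated reranking model can then reorder the merged shortlist if needed.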

The Basic RAG Pipeline with Embeddings

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

# 1. Embed your documents at index time
def build_index(documents: list[str]) -> tuple[list[list[float]], list[str]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=documents
    )
    vectors = [item.embedding for item in response.data]
    return vectors, documents

# 2. Embed the query and find nearest neighbors
def retrieve(query: str, vectors: list[list[float]], documents: list[str], top_k: int = 3):
    query_vec = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # Cosine similarity between the query and every stored vector
    q = np.array(query_vec)
    sims = [np.dot(q, np.array(v)) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in vectors]
    # Indices of the top_k highest scores, best first
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [documents[i] for i in top_indices]

# 3. Inject retrieved context into your prompt
docs = [
    "Users reset passwords via /account/reset",
    "The API rate limit is 1000 req/min",
    # ... more documents
]
vectors, docs = build_index(docs)

# top_k=1 keeps only the single best match
context = retrieve("How do I change my password?", vectors, docs, top_k=1)
# Expected: ["Users reset passwords via /account/reset"]
```

In production you'd replace the in-memory search with pgvector or Pinecone, but the logic is identical.

When You Actually Need This

You don't need embeddings and vector search for:

- A chatbot with a fixed system prompt
- A task where all relevant context fits in a single prompt
- A simple Q&A against a 10-page document (just paste it in)

You do need it for:

- Search over hundreds of documents or more
- Finding semantically similar support tickets
- Recommendation systems ("users who liked X also liked Y")
- Duplicate detection in large datasets
- Building a RAG pipeline against a knowledge base that changes

Costs

Embedding is cheap: roughly $0.02 per million tokens with text-embedding-3-small. For a 1,000-document corpus averaging 500 tokens per document, you're looking at about $0.01 to build the index. The ongoing cost is embedding each query, which is negligible.

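The real costs are storage and retrieval infrastructure. Vectors for 1M documents at 1536 dimensions take ~6GB when stored as 32-bit floats, before any index overhead. Self-hosted pgvector is essentially free if you already run Postgres. Managed services cost more, on the order of $70/mo for Pinecone's starter tier, though vendor pricing changes often and is worth checking before you commit.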
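The arithmetic behind those figures is easy to rerun for your own corpus. This is a back-of-the-envelope sketch: the helper names are made up, the price constant is the figure quoted above, and the storage estimate assumes 4 bytes per value (float32) with no index or metadata overhead.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the text-embedding-3-small figure quoted above

def index_cost_usd(num_docs: int, avg_tokens: int) -> float:
    """One-time cost of embedding the whole corpus."""
    return num_docs * avg_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

def vector_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), assuming float32 values by default."""
    return num_vectors * dims * bytes_per_value / 1e9

cost = index_cost_usd(1_000, 500)             # 500,000 tokens in total
storage = vector_storage_gb(1_000_000, 1536)  # 1M vectors of 1536 floats
```

With these inputs the index costs about one cent and the raw vectors come to roughly 6.1 GB, which matches the estimates above.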
