How to handle API errors, retries, backoff strategies, and rate limiting gracefully.
AI APIs are external services. They return errors. They get slow. They go down occasionally. If you treat them like they're always available and always fast, your production service will be fragile. Investing an afternoon in proper error handling, retry logic, and fallback strategies saves you a 2 AM incident three months later.
401 Authentication Error: Your API key is wrong, missing, or revoked. Not retryable; fix the key.
```python
# Anthropic
anthropic.AuthenticationError

# OpenAI
openai.AuthenticationError
```
Check: is your environment variable actually set? Is the key correct? Has it been rotated?
429 Rate Limit: You've exceeded your requests-per-minute or tokens-per-minute limit. Retryable with backoff. This is the most common error in production.
```python
anthropic.RateLimitError
openai.RateLimitError
```
The response headers include retry-after — how many seconds to wait. Always respect it.
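As a rough sketch of respecting that header, the helper below turns a retry-after value into a wait time in seconds. The name `retry_after_seconds` and its `default` and `now` parameters are illustrative, not part of either SDK. It assumes you can reach the response headers as a mapping; in the Anthropic and OpenAI Python SDKs, status errors expose the underlying HTTP response as `e.response`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0,
                        now: Optional[datetime] = None) -> float:
    """Return how long to wait, based on a retry-after header.

    The header may hold a number of seconds or an HTTP date. Falls back to
    `default` when the header is missing or cannot be parsed.
    Note: a plain dict is case-sensitive; HTTP client header objects usually are not.
    """
    value = headers.get("retry-after")
    if value is None:
        return default
    try:
        return max(0.0, float(value))  # Most common form: delay in seconds
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # Alternative form: HTTP date
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Inside an `except anthropic.RateLimitError as e:` handler, you could then sleep for `retry_after_seconds(e.response.headers)` instead of a computed backoff delay.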
400 Bad Request: Your request is malformed, for example invalid parameters, a message array that violates the schema, or a context window overflow. Not retryable without fixing the input.
Common causes:
- max_tokens exceeds the model's maximum output length
- Messages array starts with an assistant message (must start with user)
- Content block format is wrong
- For Anthropic: system prompt in the messages array instead of top-level
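Some of these causes can be caught before the request leaves your process. The check below is a minimal sketch, not an exhaustive validator: `find_request_problems` is a hypothetical helper, and `model_max_output` is a value you would supply yourself from the provider's model documentation.

```python
def find_request_problems(messages: list[dict], max_tokens: int,
                          model_max_output: int) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if max_tokens > model_max_output:
        problems.append(
            f"max_tokens {max_tokens} exceeds model limit {model_max_output}")
    if not messages:
        problems.append("messages array is empty")
    elif messages[0].get("role") != "user":
        problems.append("first message must have role 'user'")
    for i, message in enumerate(messages):
        # Anthropic expects the system prompt as a top-level parameter
        if message.get("role") == "system":
            problems.append(
                f"message {i}: system prompt belongs in the top-level 'system' parameter")
    return problems
```

Running this before each call turns a round trip that ends in a 400 into a local error you can log with full context.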
500 / 529 Server Error: An internal error at the provider. Transient and retryable. Uncommon, but it happens.
Timeout: The request took too long and your client closed the connection. For long responses, increase your client timeout.
```python
import anthropic

# Anthropic: set a default timeout (in seconds) for every request
client = anthropic.Anthropic(timeout=60.0)

# Per-request timeout
response = client.messages.create(
    ...,
    timeout=30.0
)
```
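The status codes above reduce to a single retry decision. One way to keep that decision in one place is a small lookup table; the names `ERROR_ACTIONS` and `is_retryable` are illustrative, and treating unknown codes as non-retryable is an assumption you may want to change.

```python
# Status code -> (label, retryable), following the error types described above
ERROR_ACTIONS = {
    401: ("authentication", False),
    429: ("rate_limit", True),
    400: ("bad_request", False),
    500: ("server_error", True),
    529: ("server_error", True),
}


def is_retryable(status_code: int) -> bool:
    """Return True for transient errors; unknown codes default to False."""
    _, retryable = ERROR_ACTIONS.get(status_code, ("unknown", False))
    return retryable
```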
Never retry immediately on rate limits. Implement exponential backoff with jitter — each retry waits longer, plus a random component to prevent thundering herd.
```python
import random
import time

import anthropic


def call_with_retry(client, max_retries=5, base_delay=1.0, **kwargs):
    """
    Make an API call with exponential backoff on retryable errors.
    """
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)

        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Double the wait on each attempt, plus up to 1s of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

        except anthropic.InternalServerError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Server error. Retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

        except anthropic.AuthenticationError:
            raise  # Don't retry auth errors

        except anthropic.BadRequestError:
            raise  # Don't retry bad requests


# Usage
client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment
response = call_with_retry(
    client,
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)
```
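Both SDKs retry some failures on their own (the client's max_retries option controls how many times), but your own wrapper gives you control over logging, metrics, and fallback behavior. If you use both, set max_retries=0 on the client so the two layers of retries don't multiply.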
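To illustrate what that control can look like, here is a sketch of a provider-agnostic retry helper with a hook for logging or metrics. Everything in it (`retry_with_hooks`, `is_retryable`, `on_retry`, the injectable `sleep`) is a hypothetical design, not an SDK feature; it reuses the same backoff formula as the wrapper above.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_hooks(
    call: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 5,
    base_delay: float = 1.0,
    on_retry: Optional[Callable[[int, float, Exception], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call()` with exponential backoff, reporting each retry to `on_retry`."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            # Give up on the last attempt or on errors the caller marks as permanent
            if attempt == max_retries - 1 or not is_retryable(exc):
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            if on_retry is not None:
                on_retry(attempt + 1, delay, exc)  # Hook for logs or metrics
            sleep(delay)
    raise RuntimeError("max_retries must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to unit test, since tests can record delays instead of waiting for them.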
Rate limits have two dimensions: requests per minute (RPM) and tokens per minute (TPM). You can hit either one.
Track your own usage. If you're approaching TPM limits, you'll see 429s. Add a token counter to your wrapper and implement client-side throttling before hitting the API:
```python
import time
from collections import deque


class RateLimitedClient:
    def __init__(self, client, tokens_per_minute=100_000):
        self.client = client
        self.tpm_limit = tokens_per_minute
        self.token_timestamps = deque()  # (timestamp, token_count) pairs

    def _tokens_used_last_minute(self):
        now = time.time()
        cutoff = now - 60
        # Drop entries that have aged out of the 60-second window
        while self.token_timestamps and self.token_timestamps[0][0] < cutoff:
            self.token_timestamps.popleft()
        return sum(t for _, t in self.token_timestamps)

    def create(self, **kwargs):
        # Simple check; could be more sophisticated.
        # Wait while usage is above 90% of the TPM limit.
        while self._tokens_used_last_minute() > self.tpm_limit * 0.9:
            time.sleep(1)

        response = self.client.messages.create(**kwargs)

        total_tokens = response.usage.input_tokens + response.usage.output_tokens
        self.token_timestamps.append((time.time(), total_tokens))
        return response
```
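Since limits apply to both requests and tokens, a throttle can track the two dimensions together. The class below is a sketch under that assumption; `SlidingWindowBudget`, its method names, and the injectable `clock` are all hypothetical, and the limits you pass in would come from your own account tier.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowBudget:
    """Track requests and tokens over a rolling 60-second window."""

    def __init__(self, requests_per_minute: int, tokens_per_minute: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rpm_limit = requests_per_minute
        self.tpm_limit = tokens_per_minute
        self.clock = clock
        self.events = deque()  # (timestamp, token_count), one entry per request
        self.tokens_in_window = 0

    def _expire(self) -> None:
        # Remove requests that are older than 60 seconds
        cutoff = self.clock() - 60
        while self.events and self.events[0][0] <= cutoff:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def has_room(self, estimated_tokens: int = 0) -> bool:
        """True if one more request of this size fits under both limits."""
        self._expire()
        return (len(self.events) < self.rpm_limit
                and self.tokens_in_window + estimated_tokens <= self.tpm_limit)

    def record(self, tokens: int) -> None:
        """Register a completed request and the tokens it used."""
        self.events.append((self.clock(), tokens))
        self.tokens_in_window += tokens
```

A caller would loop on `has_room(...)` with a short sleep before sending, then call `record(...)` with the token counts from the response's usage data.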
Increasing limits: Both OpenAI and Anthropic raise limits automatically as you spend more. If you're hitting limits consistently, contact support — they can often increase limits with a clear use case.
When your input exceeds the model's context window, you get a 400 error. Strategies:
1. Chunk and process separately (for long documents):
   ```python
   def chunk_text(text: str, max_chars: int = 80_000) -> list[str]:
       """Split text into chunks at sentence boundaries."""
       sentences = text.split('. ')
       chunks, current = [], ""
       for sentence in sentences:
           # Start a new chunk once adding this sentence would exceed the limit
           if current and len(current) + len(sentence) > max_chars:
               chunks.append(current)
               current = sentence + '. '
           else:
               current += sentence + '. '
       if current:
           chunks.append(current)
       return chunks
   ```
2. Summarize the conversation history when it gets long:
   ```python
   import json


   def compress_history(messages: list, client, keep_last_n=4) -> list:
       if len(messages) <= keep_last_n:
           return messages

       # Everything except the last N messages gets folded into a summary
       to_compress = messages[:-keep_last_n]
       summary_response = client.messages.create(
           model="claude-haiku-3-5",  # Use cheap model for summarization
           max_tokens=512,
           messages=[{
               "role": "user",
               "content": f"Summarize this conversation in 2-3 sentences: {json.dumps(to_compress)}"
           }]
       )
       summary = summary_response.content[0].text

       return [{"role": "user", "content": f"[Prior conversation summary: {summary}]"}] + messages[-keep_last_n:]
   ```
3. Sliding window: Always include the system prompt and last N turns, drop older context.
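The sliding-window option can be sketched in a few lines. This version assumes the system prompt is passed separately (as with Anthropic's top-level system parameter), so only the message list needs trimming; `sliding_window` and its default of six messages are illustrative choices. It also drops leading non-user messages so the trimmed list still starts with a user turn.

```python
def sliding_window(messages: list[dict], keep_last_n: int = 6) -> list[dict]:
    """Keep the most recent messages, starting on a user turn."""
    if keep_last_n <= 0:
        return []
    recent = messages[-keep_last_n:]
    # The API expects the first message to come from the user
    while recent and recent[0].get("role") != "user":
        recent = recent[1:]
    return recent
```

Unlike summarization, this costs no extra API call, at the price of losing older context entirely.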
Set max_tokens conservatively. A runaway prompt or a loop bug can generate thousands of dollars in API costs in minutes.
```python
# Be explicit about output length limits
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=512,  # Don't default to the model maximum
    ...
)
```
Set up budget alerts in your provider's dashboard. Both OpenAI and Anthropic have spend limit notifications.
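Dashboard alerts can be complemented by a guard inside your own process that stops making calls once a budget is spent. The sketch below is one possible shape for that: `SpendGuard` and `BudgetExceeded` are hypothetical names, and the per-million-token prices are parameters you would fill in from your provider's current price list.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recorded spend reaches the configured limit."""


class SpendGuard:
    def __init__(self, limit_usd: float, input_usd_per_mtok: float,
                 output_usd_per_mtok: float):
        self.limit_usd = limit_usd
        self.input_price = input_usd_per_mtok    # USD per million input tokens
        self.output_price = output_usd_per_mtok  # USD per million output tokens
        self.spent_usd = 0.0

    def check(self) -> None:
        """Call before each request; raises once the budget is used up."""
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f} budget")

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add the cost of one response, using its usage counts."""
        cost = (input_tokens * self.input_price
                + output_tokens * self.output_price) / 1_000_000
        self.spent_usd += cost
        return cost
```

Calling `check()` before each request and `record(...)` after each response means a loop bug fails fast with an exception instead of running until someone notices the bill.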
Log enough to debug production issues without storing sensitive user data:
```python
import logging
import time

logger = logging.getLogger("ai_client")


def logged_create(client, **kwargs):
    start = time.time()
    try:
        response = client.messages.create(**kwargs)
        elapsed = time.time() - start
        # Log metadata only: no prompt or response text
        logger.info(
            "api_call_success",
            extra={
                "model": kwargs.get("model"),
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "latency_ms": int(elapsed * 1000),
                "stop_reason": response.stop_reason,
            }
        )
        return response
    except Exception as e:
        elapsed = time.time() - start
        logger.error(
            "api_call_error",
            extra={
                "model": kwargs.get("model"),
                "error_type": type(e).__name__,
                "latency_ms": int(elapsed * 1000),
            }
        )
        raise
```
When a model is unavailable or rate limited beyond recovery:
- Fall back to a cheaper model such as claude-haiku-3-5 or gpt-4o-mini when your primary model is unavailable.
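A model fallback like this can be sketched as a loop over an ordered list of models. The helper is hypothetical: `call_with_fallback`, the `call` callback (which takes a model name and performs the request), and `should_fall_back` are names introduced for illustration, and which errors justify a fallback is a decision for your application.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(
    call: Callable[[str], T],
    models: Sequence[str],
    should_fall_back: Callable[[Exception], bool],
) -> T:
    """Try each model in order; move to the next only for qualifying errors."""
    last_error = None
    for model in models:
        try:
            return call(model)
        except Exception as exc:
            if not should_fall_back(exc):
                raise  # e.g. auth or bad-request errors: another model won't help
            last_error = exc
    if last_error is None:
        raise ValueError("models must not be empty")
    raise last_error  # Every model failed; surface the most recent error
```

In practice `call` would wrap your retrying client, so the fallback only triggers after retries on the primary model are exhausted.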