
Error Handling & Rate Limits

How to handle API errors, retries, backoff strategies, and rate limiting gracefully.

Reliability Is an Engineering Problem

AI APIs are external services. They return errors. They get slow. They go down occasionally. If you treat them like they're always available and always fast, your production service will be fragile. Investing an afternoon in proper error handling, retry logic, and fallback strategies saves you a 2 AM incident three months later.

Common Error Types

401 (Authentication Error): Your API key is wrong, missing, or revoked. Not retryable; fix the key.

```python
# Anthropic
anthropic.AuthenticationError

# OpenAI
openai.AuthenticationError
```

Check: is your environment variable actually set? Is the key correct? Has it been rotated?
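One way to catch a missing key before the first request is to check the environment at startup. The sketch below assumes the key lives in `ANTHROPIC_API_KEY`; `require_api_key` is a hypothetical helper, not part of either SDK.

```python
import os


def require_api_key(var_name: str = "ANTHROPIC_API_KEY", env=None) -> str:
    """Return the API key from the environment, or fail with a clear message.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set or is empty")
    return key
```

Calling this once during startup turns a confusing 401 at request time into an explicit configuration error.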

429 (Rate Limit): You've exceeded your requests-per-minute or tokens-per-minute limit. Retryable with backoff. The most common error in production.

```python
anthropic.RateLimitError
openai.RateLimitError
```

The response headers include retry-after — how many seconds to wait. Always respect it.
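A minimal sketch of turning that header into a wait time. It assumes the headers are available as a plain mapping and that the value is either a number of seconds or an HTTP date; `retry_after_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=1.0, now=None):
    """Return how long to wait, based on a Retry-After header.

    Falls back to `default` when the header is missing or unparseable.
    """
    # Header names are case-insensitive, so normalize the keys first.
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("retry-after")
    if value is None:
        return default
    try:
        return max(0.0, float(value))  # plain number of seconds
    except ValueError:
        pass
    try:
        target = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (target - now).total_seconds())
```

With the SDKs, the headers are usually reachable from the caught exception (for example `e.response.headers`); treat that attribute path as an assumption to verify against your SDK version.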

400 (Bad Request): Your request is malformed: invalid parameters, a message array that violates the schema, or a context window overflow. Not retryable without fixing the input.

Common causes:
- max_tokens exceeds the model's maximum output length
- Messages array starts with an assistant message (must start with user)
- Content block format is wrong
- For Anthropic: system prompt in the messages array instead of top-level
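Some of these causes can be caught locally before the request is sent. The following is a rough sketch of such a pre-flight check for an Anthropic-style messages array; `validate_messages` is a hypothetical helper and covers only the structural rules listed above, not the full API schema.

```python
def validate_messages(messages):
    """Return a list of problems found in an Anthropic-style messages array."""
    problems = []
    if not messages:
        problems.append("messages is empty")
        return problems
    if messages[0].get("role") != "user":
        problems.append("first message must have role 'user'")
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "system":
            problems.append(
                f"message {i}: system prompt belongs in the top-level 'system' parameter"
            )
        elif role not in ("user", "assistant"):
            problems.append(f"message {i}: unknown role {role!r}")
        if not msg.get("content"):
            problems.append(f"message {i}: content is empty")
    return problems
```

An empty list means none of these checks failed; the API may still reject the request for other reasons.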

500 / 529 (Server Error): Internal error at the provider. Transient and retryable. Uncommon, but it happens.

Timeout: The request took too long and the client closed the connection. For long responses, increase your client timeout.

```python
# Anthropic: set a default timeout (in seconds) for every request
client = anthropic.Anthropic(timeout=60.0)

# Per-request timeout
response = client.messages.create(
    ...,
    timeout=30.0
)
```

Exponential Backoff with Jitter

Never retry immediately on rate limits. Implement exponential backoff with jitter: each retry waits longer than the last, plus a random component so that many clients do not retry in lockstep (the thundering herd problem).

```python
import time
import random
import anthropic

def call_with_retry(client, max_retries=5, base_delay=1.0, **kwargs):
    """
    Make an API call with exponential backoff on retryable errors.
    """
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)

        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 1s of jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

        except anthropic.InternalServerError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Server error. Retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

        except anthropic.AuthenticationError:
            raise  # Don't retry auth errors

        except anthropic.BadRequestError:
            raise  # Don't retry bad requests

# Usage
client = anthropic.Anthropic()
response = call_with_retry(
    client,
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)
```

The SDKs have some built-in retry logic, but having your own wrapper gives you control over logging, metrics, and fallback behavior.
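As an illustration of that control, here is a provider-agnostic sketch of a retry wrapper with a hook for logs or metrics. The names `retry_with_backoff`, `is_retryable`, and `on_retry` are hypothetical, and the `max_delay` cap is an addition that the wrapper above does not have.

```python
import random
import time


def retry_with_backoff(fn, is_retryable, max_retries=5, base_delay=1.0,
                       max_delay=30.0, on_retry=None, sleep=time.sleep):
    """Call fn() and retry with capped exponential backoff plus jitter.

    is_retryable(exc) decides whether an exception is worth retrying.
    on_retry(attempt, delay, exc) is an optional hook for logs or metrics.
    `sleep` is injectable so tests do not have to wait.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_retries - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Exponential part is capped; jitter of up to 1s is added on top.
            delay = min(max_delay, base_delay * (2 ** attempt)) + random.uniform(0, 1)
            if on_retry is not None:
                on_retry(attempt + 1, delay, exc)
            sleep(delay)
```

In practice `fn` would be a lambda wrapping the SDK call, and `is_retryable` would check for the rate limit and server error exception types described earlier.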

Rate Limit Strategy

Rate limits have two dimensions: requests per minute (RPM) and tokens per minute (TPM). You can hit either one.

Track your own usage. If you're approaching TPM limits, you'll see 429s. Add a token counter to your wrapper and implement client-side throttling before hitting the API:

```python
import time
from collections import deque

class RateLimitedClient:
    def __init__(self, client, tokens_per_minute=100_000):
        self.client = client
        self.tpm_limit = tokens_per_minute
        self.token_timestamps = deque()  # (timestamp, token_count) pairs

    def _tokens_used_last_minute(self):
        now = time.time()
        cutoff = now - 60
        # Drop entries that have aged out of the 60-second window
        while self.token_timestamps and self.token_timestamps[0][0] < cutoff:
            self.token_timestamps.popleft()
        return sum(t for _, t in self.token_timestamps)

    def create(self, **kwargs):
        # Simple check; could be more sophisticated.
        # Wait while usage is above 90% of the per-minute limit.
        while self._tokens_used_last_minute() > self.tpm_limit * 0.9:
            time.sleep(1)

        response = self.client.messages.create(**kwargs)

        total_tokens = response.usage.input_tokens + response.usage.output_tokens
        self.token_timestamps.append((time.time(), total_tokens))
        return response
```
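The wrapper above tracks only tokens. A comparable sketch for the requests-per-minute dimension is shown below; `RequestRateLimiter` is a hypothetical class, the default limit of 50 is a placeholder rather than a real provider limit, and the limit is assumed to be at least 1.

```python
import time
from collections import deque


class RequestRateLimiter:
    """Sliding-window limiter: at most `requests_per_minute` calls per 60 s."""

    def __init__(self, requests_per_minute=50, clock=time.monotonic, sleep=time.sleep):
        self.rpm_limit = requests_per_minute  # assumed to be >= 1
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.request_times = deque()

    def _prune(self, now):
        # Forget requests that are 60 seconds old or older.
        while self.request_times and self.request_times[0] <= now - 60:
            self.request_times.popleft()

    def acquire(self):
        """Block until a request slot is free, then record the request."""
        now = self.clock()
        self._prune(now)
        while len(self.request_times) >= self.rpm_limit:
            # Wait until the oldest request leaves the 60-second window.
            wait = self.request_times[0] + 60 - now
            self.sleep(max(wait, 0.01))
            now = self.clock()
            self._prune(now)
        self.request_times.append(now)
```

Calling `acquire()` immediately before each API request keeps the client under the configured request rate without waiting for a 429.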

Increasing limits: Both OpenAI and Anthropic raise limits automatically as you spend more. If you're hitting limits consistently, contact support — they can often increase limits with a clear use case.

Handling Context Window Overflow

When your input exceeds the model's context window, you get a 400 error. Strategies:

1. Chunk and process separately (for long documents):
```python
def chunk_text(text: str, max_chars: int = 80_000) -> list[str]:
    """Split text into chunks at sentence boundaries."""
    sentences = text.split('. ')
    chunks, current = [], ""
    for i, sentence in enumerate(sentences):
        # split() removed the separator; restore it except after the last piece
        piece = sentence if i == len(sentences) - 1 else sentence + '. '
        # Start a new chunk when adding this piece would exceed the limit
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

2. Summarize the conversation history when it gets long:
```python
import json

def compress_history(messages: list, client, keep_last_n=4) -> list:
    # Nothing to compress if the history is already short
    if len(messages) <= keep_last_n:
        return messages

    to_compress = messages[:-keep_last_n]
    summary_response = client.messages.create(
        model="claude-haiku-3-5",  # Use cheap model for summarization
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 2-3 sentences: {json.dumps(to_compress)}"
        }]
    )
    summary = summary_response.content[0].text

    # Replace the older turns with a single summary message
    return [{"role": "user", "content": f"[Prior conversation summary: {summary}]"}] + messages[-keep_last_n:]
```

3. Sliding window: Always include the system prompt and last N turns, drop older context.
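A minimal sketch of the sliding window approach, assuming the system prompt is passed separately (top-level, as with Anthropic) so only the turns need trimming. `sliding_window` is a hypothetical helper; it also drops any leading assistant turn so the array still starts with a user message.

```python
def sliding_window(messages, keep_last_n=6):
    """Keep only the most recent messages, starting on a user turn."""
    recent = messages[-keep_last_n:] if keep_last_n > 0 else []
    # Drop leading non-user turns so the array still starts with a user message.
    while recent and recent[0].get("role") != "user":
        recent = recent[1:]
    return recent
```

The trade-off against summarization is that older context is lost entirely, but no extra API call is needed.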

Cost Controls

Set max_tokens conservatively. A runaway prompt or a loop bug can generate thousands of dollars in API costs in minutes.

```python
# Be explicit about output length limits
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=512,  # Don't default to the model maximum
    ...
)
```

Set up budget alerts in your provider's dashboard. Both OpenAI and Anthropic have spend limit notifications.
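Dashboard alerts can be complemented by a hard cap inside the application. The sketch below is one possible shape for that; `BudgetGuard` is a hypothetical class, and the per-million-token prices are parameters you supply from your provider's current pricing, not values taken from this guide.

```python
class BudgetGuard:
    """Track estimated spend and refuse further calls past a hard cap."""

    def __init__(self, max_usd, input_usd_per_mtok, output_usd_per_mtok):
        self.max_usd = max_usd
        self.input_usd_per_mtok = input_usd_per_mtok    # price per 1M input tokens
        self.output_usd_per_mtok = output_usd_per_mtok  # price per 1M output tokens
        self.spent_usd = 0.0

    def check(self):
        """Raise before a call if the budget is already used up."""
        if self.spent_usd >= self.max_usd:
            raise RuntimeError(
                f"Budget exhausted: ${self.spent_usd:.2f} of ${self.max_usd:.2f}"
            )

    def record(self, input_tokens, output_tokens):
        """Add the estimated cost of one finished call; return the running total."""
        cost = (input_tokens * self.input_usd_per_mtok
                + output_tokens * self.output_usd_per_mtok) / 1_000_000
        self.spent_usd += cost
        return self.spent_usd
```

Call `check()` before each request and `record()` with the token counts from `response.usage` afterwards, so a loop bug stops at the cap instead of running unbounded.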

Logging for Debugging

Log enough to debug production issues without storing sensitive user data:

```python
import logging
import time

logger = logging.getLogger("ai_client")

def logged_create(client, **kwargs):
    start = time.time()
    try:
        response = client.messages.create(**kwargs)
        elapsed = time.time() - start
        # Log metadata only: no prompt or completion text
        logger.info(
            "api_call_success",
            extra={
                "model": kwargs.get("model"),
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "latency_ms": int(elapsed * 1000),
                "stop_reason": response.stop_reason,
            }
        )
        return response
    except Exception as e:
        elapsed = time.time() - start
        logger.error(
            "api_call_error",
            extra={
                "model": kwargs.get("model"),
                "error_type": type(e).__name__,
                "latency_ms": int(elapsed * 1000),
            }
        )
        raise
```

Fallback Strategies

When a model is unavailable or rate limited beyond recovery:

  • Fall back to a smaller/cheaper model: Route to claude-haiku-3-5 or gpt-4o-mini when your primary model is unavailable
  • Queue and retry later: For non-real-time tasks, queue the request and retry with backoff
  • Graceful degradation: Return a "try again in a moment" message rather than a 500 error to your users
  • Cross-provider fallback: If Anthropic is down, fall back to OpenAI for critical paths (requires having both integrated)
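The model and cross-provider fallbacks in the list above can share one mechanism: an ordered list of callables tried in turn. The following is a sketch under that assumption; `call_with_fallback` and the `(name, call)` pair format are hypothetical, and each `call` would wrap a real SDK request in practice.

```python
def call_with_fallback(providers, prompt, is_unavailable=lambda exc: True):
    """Try each (name, call) pair in order.

    Returns (name, result) from the first provider that succeeds.
    is_unavailable(exc) decides whether to move on to the next provider;
    by default every exception triggers a fallback.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            if not is_unavailable(exc):
                raise  # e.g. a bad request would fail on every provider too
            errors.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result makes it easy to log how often the fallback path is taken.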
