Reliability

Agents make LLM calls and execute tools — both can fail. Building reliable agents means handling failures gracefully and preventing runaway behavior.

Common Failure Modes

Failure	Cause	Symptom
API timeout	Network issues, slow or overloaded API	`openai.APITimeoutError`
Rate limiting	Too many requests per minute	`openai.RateLimitError` (HTTP 429)
Invalid tool args	Model produces bad function arguments	`json.JSONDecodeError`, missing keys
Infinite loops	Model keeps calling tools without converging	Agent runs forever
Context overflow	Conversation exceeds token limit	`openai.BadRequestError`
Refusal	Model declines to answer	Empty response or safety refusal

Max Iterations: The Essential Guardrail

Every agent loop must have a maximum iteration limit:

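from dataclasses import dataclass

@dataclass
class Agent:
    max_iterations: int = 10  # Hard cap on loop iterations; never remove this

def run(agent, messages, client):
    for i in range(agent.max_iterations):
        response = client.chat.completions.create(...)
        # "stop" means the model gave a final answer instead of requesting a tool
        if response.choices[0].finish_reason == "stop":
            return response.choices[0].message.content
        # ... handle tool calls
    # Loop exhausted without a final answer: stop rather than run forever
    return "Max iterations reached"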

Without this, a confused model can loop forever. This is the single most important reliability measure.

Retry with Exponential Backoff

For transient API errors, retry with increasing delays:

import logging
import time
from openai import APITimeoutError, RateLimitError

def call_with_retry(fn, max_retries=3):
    """Retry a function call with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except (APITimeoutError, RateLimitError) as e:
            if attempt == max_retries - 1:
                raise  # Final attempt failed: propagate the error
            # Delays grow as 1s, 2s, 4s, ...; the default of 3 attempts sleeps 1s, then 2s
            delay = 2 ** attempt
            logging.warning("Retry %d/%d after %ds: %s", attempt + 1, max_retries, delay, e)
            time.sleep(delay)

Don't retry everything

Only retry transient errors (timeouts, rate limits). Don't retry permanent errors like invalid API keys or malformed requests.
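
The distinction between transient and permanent errors can be sketched without any SDK dependency. In the block below, `TransientError`, `PermanentError`, and `call_with_selective_retry` are hypothetical names introduced for illustration; in real code the `except` clause would list the SDK's own timeout and rate-limit exceptions instead.

```python
import time

# Hypothetical error types standing in for a real SDK's exceptions.
class TransientError(Exception):
    """Worth retrying: timeouts, rate limits."""

class PermanentError(Exception):
    """Not worth retrying: invalid API keys, malformed requests."""

def call_with_selective_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry only TransientError; any other exception propagates immediately."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
```

Because `PermanentError` is not listed in the `except` clause, it fails on the first attempt with no delay. The `sleep` parameter is injectable only so the function can be exercised in tests without waiting.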

Tool Execution Safety

Tools execute arbitrary code. Guard against unexpected inputs:

import json

def execute_tool(name: str, args: dict, tool_functions: dict) -> str:
    """Safely execute a tool function."""
    fn = tool_functions.get(name)
    if fn is None:
        return json.dumps({"error": f"Unknown tool: {name}"})

    try:
        result = fn(**args)
        # Tool results must be strings; serialize anything else as JSON
        return json.dumps(result) if not isinstance(result, str) else result
    except Exception as e:
        # Report the failure to the model instead of crashing the agent loop
        return json.dumps({"error": str(e)})

Return error messages as tool results rather than crashing — the model can often recover from a tool error by trying a different approach.
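
The same idea applies one step earlier, when the model's argument string is parsed. The sketch below assumes tool arguments arrive as a raw JSON string; `parse_tool_args` and its `required` parameter are hypothetical helpers, not part of any SDK. It turns malformed JSON or missing keys into an error message that can be sent back as the tool result.

```python
import json
from typing import Optional, Tuple

def parse_tool_args(
    raw_args: str, required: Tuple[str, ...] = ()
) -> Tuple[Optional[dict], Optional[str]]:
    """Return (args, None) on success, or (None, error_json) on failure."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        return None, json.dumps({"error": f"Arguments are not valid JSON: {e.msg}"})
    if not isinstance(args, dict):
        return None, json.dumps({"error": "Arguments must be a JSON object"})
    missing = [key for key in required if key not in args]
    if missing:
        return None, json.dumps({"error": f"Missing required arguments: {missing}"})
    return args, None
```

A caller would check the second element first: if it is not `None`, it is returned to the model as the tool result and the tool function is never invoked.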

Timeout Patterns

Set timeouts at multiple levels:

# Per-request timeout
response = client.chat.completions.create(
    model=model,
    messages=messages,
    timeout=30.0,  # 30-second timeout
)

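# Per-agent timeout (wall clock)
# Note: signal.SIGALRM is available only on Unix, and signal handlers
# can only be installed from the main thread.
import signal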

def timeout_handler(signum, frame):
    raise TimeoutError("Agent execution timed out")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(120)  # 2-minute total timeout
try:
    result = run(agent, messages, client)
finally:
    signal.alarm(0)  # Cancel alarm
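
Where `SIGALRM` is unavailable (for example on Windows or in a worker thread), a wall-clock budget can be approximated by checking a monotonic clock between iterations. This is a minimal sketch under that assumption: `run_with_deadline` and the `step` callable (one agent iteration that returns a final answer, or `None` to continue) are hypothetical names.

```python
import time

def run_with_deadline(step, max_iterations=10, deadline_seconds=120.0,
                      clock=time.monotonic):
    """Call step() until it returns a non-None result or a limit is hit."""
    start = clock()
    for _ in range(max_iterations):
        # Check the wall-clock budget before starting another iteration
        if clock() - start >= deadline_seconds:
            return "Deadline exceeded"
        result = step()
        if result is not None:
            return result
    return "Max iterations reached"
```

The check runs only between iterations, so it cannot interrupt a call that is already blocked. It complements the per-request timeout rather than replacing it.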

Circuit Breaker Pattern

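In systems with multiple agents, a circuit breaker prevents cascading failures. After a set number of consecutive failures, it rejects further calls until a cooldown period has elapsed: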

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout  # Cooldown in seconds
        self.last_failure = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            # time.monotonic() is unaffected by system clock adjustments
            elapsed = time.monotonic() - self.last_failure
            if elapsed < self.reset_timeout:
                raise RuntimeError("Circuit breaker open: too many failures")
            self.failures = 0  # Reset after timeout

        try:
            result = fn()
            self.failures = 0  # Any success closes the circuit
            return result
        except Exception:
            self.failures += 1
            self.last_failure = time.monotonic()
            raise

Graceful Degradation

When a specialist agent fails, the system should still produce useful output:

try:
    result = run_worker(client, model, worker_name, task)
except Exception as e:
    logger.warning("[%s] Worker failed: %s — using fallback", worker_name, e)
    result = f"Unable to complete task: {task.description}. Error: {e}"
    # The manager can still synthesize a report from partial results
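
Extended to several workers, the same pattern collects whatever succeeded and records what failed. The sketch below is illustrative only: `run_all_workers` and the `run_one` callable are hypothetical, and `tasks` is assumed to map a worker name to its task.

```python
import logging

logger = logging.getLogger(__name__)

def run_all_workers(tasks, run_one):
    """Run every task, recording failures instead of aborting the batch."""
    results, failures = {}, {}
    for name, task in tasks.items():
        try:
            results[name] = run_one(name, task)
        except Exception as e:
            # One failed worker must not discard the others' output
            logger.warning("[%s] Worker failed: %s", name, e)
            failures[name] = str(e)
    return results, failures
```

A manager could then synthesize its report from `results` and list the entries in `failures` as known gaps.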

Key Takeaways

Max iterations is non-negotiable — every agent loop needs a hard limit
Retry transient errors (timeouts, rate limits) with exponential backoff
Return errors as tool results — let the model recover gracefully
Set timeouts at both request and agent levels
Design for partial failure — a multi-agent system should degrade gracefully