Chat Completions API¶
Chapter Slides
Before building agents, you need to understand the foundation: how Large Language Models (LLMs) process conversations through the Chat Completions API.
The Chat Completions API¶
The Chat Completions API is a stateless, request-response interface. You send a list of messages, and the model returns a completion — the next message in the conversation.
sequenceDiagram
participant App as Your Code
participant API as Chat Completions API
participant LLM as Language Model
App->>API: messages + parameters
API->>LLM: Process messages
LLM-->>API: Generated completion
API-->>App: Response with choices
Every request includes the full conversation history. The API has no memory between calls — you manage the conversation state.
Messages and Roles¶
A conversation is a list of messages, each with a role:
| Role | Purpose | Example |
|---|---|---|
system |
Sets the model's behavior, persona, and constraints | "You are a helpful travel assistant." |
user |
Human input — questions, instructions, data | "What's the best time to visit Japan?" |
assistant |
Model's previous responses (for multi-turn context) | "Spring (March-May) is ideal for cherry blossoms..." |
tool |
Results from tool/function calls (covered later) | {"temperature": 22, "condition": "sunny"} |
The developer role
The latest OpenAI models also support a developer role as a replacement for system. They are functionally equivalent for our purposes. This workshop uses system because it works universally across all providers.
Single-Turn vs. Multi-Turn¶
Single-turn — one question, one answer:
messages = [
{"role": "system", "content": "You are a travel assistant."},
{"role": "user", "content": "What's the best time to visit Japan?"},
]
Multi-turn — a conversation with history:
messages = [
{"role": "system", "content": "You are a travel assistant."},
{"role": "user", "content": "What's the best time to visit Japan?"},
{"role": "assistant", "content": "Spring (March-May) is ideal..."},
{"role": "user", "content": "What about budget tips?"},
]
The model sees the entire message list on every call. This is how it maintains context — and why context management matters as conversations grow (see Context Management).
Key Parameters¶
| Parameter | What It Does | Typical Values |
|---|---|---|
model |
Which model to use | gpt-4o-mini, gpt-4o |
temperature |
Randomness (0 = deterministic, 2 = very random) | 0.0 – 1.0 |
max_tokens |
Maximum response length | 100 – 4096 |
top_p |
Nucleus sampling (alternative to temperature) | 0.1 – 1.0 |
Temperature for agents
For agentic tasks, use low temperature (0.0 – 0.3) to get consistent, reliable behavior. Save higher temperatures for creative tasks like brainstorming.
Token Usage¶
Every API call consumes tokens (roughly 4 characters per token in English). The response includes usage information:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
)
print(response.usage.prompt_tokens) # Tokens in your input
print(response.usage.completion_tokens) # Tokens in the response
print(response.usage.total_tokens) # Total
Understanding token usage matters for cost control and for staying within context window limits (see Context Management).
Key Takeaways¶
- The Chat Completions API is stateless — you send the full conversation every time
- Roles define who said what:
system,user,assistant,tool - Temperature controls randomness — use low values for agents
- You manage conversation state — the API doesn't remember previous calls
References¶
- OpenAI Chat Completions API Guide
- OpenAI API Reference
- Andrej Karpathy — "Intro to Large Language Models" (YouTube)
- Anthropic Prompt Engineering Guide
Hands-On Exercise¶
Now try it yourself — head to the Chat Completion exercise to build a travel assistant with single-turn and multi-turn conversations.
You can run exercises from the terminal or use the Workshop TUI.