Streaming LLM Responses: Getting the Answer Piece by Piece

Author: 04e5cc8b-58ac-4bdc-bdee-661bbb
Published: 04.06.2026
Level: Beginner

By default, messages.create() waits until the model has fully generated its response before returning anything. For long responses this means 5–15 seconds of silence. Streaming solves this: the response arrives token by token as the model generates it.

How Streaming Works

Technically it is server-sent events (SSE): the server holds the HTTP connection open and sends each chunk as a separate event as soon as it is generated. The SDK wraps this in a convenient context manager, messages.stream().
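To make the wire format concrete, here is a minimal sketch of what the SDK does for you: it reads `event:` / `data:` lines, treats a blank line as the end of an event, and pulls the text out of `text_delta` payloads. The sample lines are illustrative (the text is made up), and `iter_sse_events` and `extract_text` are hypothetical helper names, not part of the SDK.

```python
import json
from typing import Iterable, Iterator

# Illustrative lines in the shape of a streamed response; the text is made up.
SAMPLE_LINES = [
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo"}}',
    '',
    'event: message_stop',
    'data: {"type": "message_stop"}',
    '',
]


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs; a blank line ends each event."""
    event_name, data_parts = "message", []
    for line in lines:
        if line == "":
            if data_parts:
                yield event_name, json.loads("\n".join(data_parts))
            event_name, data_parts = "message", []
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())


def extract_text(lines: Iterable[str]) -> str:
    """Join the text carried by text_delta events, ignoring everything else."""
    parts = []
    for name, payload in iter_sse_events(lines):
        if name == "content_block_delta" and payload["delta"]["type"] == "text_delta":
            parts.append(payload["delta"]["text"])
    return "".join(parts)


print(extract_text(SAMPLE_LINES))  # prints "Hello"
```

In practice you rarely parse this yourself; the sketch only shows why text can be displayed before the response is complete.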

Basic Example

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain what an API is"}],
) as stream:
    for text_chunk in stream.text_stream:
        print(text_chunk, end="", flush=True)

print()  # newline after completion

Key details:
- with ... as stream: — context manager, closes the connection automatically
- stream.text_stream — iterator over text chunks (strings)
- end="" — do not add a newline after each chunk
- flush=True — flush the buffer immediately, don’t wait for it to fill up
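The effect of `end=""` is easy to check without calling the API. The sketch below writes a few made-up chunks (a stand-in for what `stream.text_stream` yields) into in-memory buffers, once with the streaming-friendly arguments and once with the defaults.

```python
import io

# Hypothetical chunks standing in for what stream.text_stream yields.
fake_chunks = ["An API ", "is a ", "contract."]

streamed = io.StringIO()
for chunk in fake_chunks:
    # end="" keeps chunks on one line; flush=True pushes each one out immediately.
    print(chunk, end="", flush=True, file=streamed)
print(file=streamed)  # single newline once the stream is finished

naive = io.StringIO()
for chunk in fake_chunks:
    print(chunk, file=naive)  # default end="\n" breaks the sentence apart

print(repr(streamed.getvalue()))  # 'An API is a contract.\n'
print(repr(naive.getvalue()))     # 'An API \nis a \ncontract.\n'
```

`flush=True` makes no visible difference with an in-memory buffer; it matters on a real terminal or pipe, where output may otherwise sit in a buffer until it fills.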

Getting Usage After Streaming

with client.messages.stream(...) as stream:
    for chunk in stream.text_stream:
        print(chunk, end="", flush=True)

    # After completion — the final message with usage stats:
    final = stream.get_final_message()

print()
print(f"Tokens: {final.usage.input_tokens} in / {final.usage.output_tokens} out")
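If a program makes several streamed calls, it can be handy to keep a running total of these counts. The following is a minimal sketch: `UsageTotals` and `FakeUsage` are hypothetical names, and the numbers are made up. `FakeUsage` only mimics the two attributes read from `final.usage` above.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, usage) -> None:
        """Add one response's usage; expects input_tokens/output_tokens attributes."""
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens


@dataclass
class FakeUsage:
    """Stand-in for final.usage; the values used below are made up."""
    input_tokens: int
    output_tokens: int


totals = UsageTotals()
for usage in (FakeUsage(12, 140), FakeUsage(160, 95)):
    totals.add(usage)

print(f"Total: {totals.input_tokens} in / {totals.output_tokens} out")
```

In real code you would call `totals.add(final.usage)` after each `stream.get_final_message()`.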

Streaming with rich

from rich.console import Console

console = Console()

with client.messages.stream(...) as stream:
    for chunk in stream.text_stream:
        # highlight=False is required when streaming:
        console.print(chunk, end="", highlight=False)

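highlight=False matters: by default rich runs its automatic highlighter on every printed string, so each chunk is styled in isolation and numbers, quotes, or URLs split across chunks end up colored inconsistently. Disable it when streaming token by token. If the model may emit square brackets, consider passing markup=False as well, so that rich does not interpret them as console markup.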

Handling Interruption

try:
    with client.messages.stream(...) as stream:
        for chunk in stream.text_stream:
            print(chunk, end="", flush=True)
except KeyboardInterrupt:
    print("\n[interrupted]")
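When the user presses Ctrl+C, the text received so far is often still worth keeping. Here is a minimal sketch of that idea, assuming a hypothetical `consume` helper and a generator that simulates the interruption. In real code the iterable would be `stream.text_stream`.

```python
from typing import Iterable, Iterator


def consume(chunks: Iterable[str]) -> tuple[str, bool]:
    """Print chunks as they arrive; return (text_so_far, was_interrupted)."""
    parts: list[str] = []
    try:
        for chunk in chunks:
            print(chunk, end="", flush=True)
            parts.append(chunk)
    except KeyboardInterrupt:
        print("\n[interrupted]")
        return "".join(parts), True
    print()  # newline after a normal completion
    return "".join(parts), False


def interrupted_chunks() -> Iterator[str]:
    """Hypothetical stand-in for stream.text_stream with Ctrl+C after two chunks."""
    yield "Streaming "
    yield "keeps "
    raise KeyboardInterrupt


partial, interrupted = consume(interrupted_chunks())
```

The caller can then decide whether to store `partial`, show it with a marker, or discard it.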

Event Streaming (Low Level)

stream.text_stream is a high-level iterator over text only. To access all events (message_start, content_block_delta, message_delta, message_stop, and others), iterate over stream directly:

with client.messages.stream(...) as stream:
    for event in stream:
        if hasattr(event, 'delta') and hasattr(event.delta, 'text'):
            print(event.delta.text, end="", flush=True)
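An alternative to `hasattr` checks is to branch on each event's `type` field. This sketch uses `SimpleNamespace` objects as hypothetical stand-ins for streamed events, so it runs without an API key. The `summarize` helper is an illustrative name, and the events carry only the fields the example reads.

```python
from types import SimpleNamespace as NS

# Hypothetical events mimicking the shape of streamed events.
fake_events = [
    NS(type="message_start"),
    NS(type="content_block_delta", delta=NS(type="text_delta", text="Hi")),
    NS(type="content_block_delta", delta=NS(type="text_delta", text=" there")),
    NS(type="message_delta", delta=NS(stop_reason="end_turn")),
    NS(type="message_stop"),
]


def summarize(events):
    """Collect text deltas and the stop reason from an event sequence."""
    text_parts, stop_reason = [], None
    for event in events:
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            text_parts.append(event.delta.text)
        elif event.type == "message_delta":
            stop_reason = event.delta.stop_reason
    return "".join(text_parts), stop_reason


text, stop_reason = summarize(fake_events)
print(text, f"(stop reason: {stop_reason})")
```

Branching on `type` makes it explicit which events are handled and which are ignored on purpose.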

When to Use Streaming

Use streaming when:
- The response is long (>200 tokens)
- A user is watching the screen
- UX matters — the feeling of a “live” response

Don’t use streaming when:
- You are processing the response programmatically (parsing JSON)
- You are making batch requests in the background
- You only need the final result

Collecting the Full Text While Streaming

chunks: list[str] = []

with client.messages.stream(...) as stream:
    for chunk in stream.text_stream:
        print(chunk, end="", flush=True)
        chunks.append(chunk)
    final = stream.get_final_message()

full_text = "".join(chunks)  # complete response to save to history
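Saving to history usually means appending the joined text as an assistant message, so the next request carries the whole conversation. Below is a minimal sketch of one turn, assuming a hypothetical `stream_turn` helper. `fake_chunk_source` stands in for the streaming call and ignores its input.

```python
from typing import Callable, Iterable

Message = dict[str, str]


def stream_turn(
    history: list[Message],
    user_text: str,
    chunk_source: Callable[[list[Message]], Iterable[str]],
) -> str:
    """Run one turn: record the question, stream the answer, record the answer."""
    history.append({"role": "user", "content": user_text})
    chunks: list[str] = []
    for chunk in chunk_source(history):
        print(chunk, end="", flush=True)
        chunks.append(chunk)
    print()
    full_text = "".join(chunks)
    history.append({"role": "assistant", "content": full_text})
    return full_text


def fake_chunk_source(messages: list[Message]) -> Iterable[str]:
    """Hypothetical stand-in for a streaming API call; ignores its input."""
    yield from ["An API ", "is an ", "interface."]


history: list[Message] = []
answer = stream_turn(history, "Explain what an API is", fake_chunk_source)
```

With the real SDK, `chunk_source` would open `client.messages.stream(...)` with `messages=history` and yield from `stream.text_stream`.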
