When you send a request to a language model, it doesn’t see words or letters. It sees tokens — chunks of text roughly a few characters each. Understanding tokens will help you save money and write better prompts.
What Is a Token
A token is the unit of text a model works with. It’s neither a character nor a word — something in between. The model is trained to predict the next token based on all previous ones. That’s how it “generates” a response — one token at a time.
Rough rules:
- 1 English word ≈ 1–1.5 tokens
- 1 non-Latin word ≈ 2–3 tokens (non-Latin scripts are more expensive)
- 1 token ≈ 4 characters in English
- Indented code uses more tokens than it appears to, because whitespace is tokenized too
Examples (exact counts vary by tokenizer):
- Hello → 1 token
- Привет → 2 tokens
- Hello, World! → 4 tokens
- 100 lines of Python code → ~500–800 tokens
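The rough rules above can be turned into a quick back-of-the-envelope estimator. This is only a sketch: `estimate_tokens` is a hypothetical helper, not part of any SDK, and the 4-characters-per-token ratio is the English rule of thumb from this article. Real counts depend on the model's tokenizer.

```python
import math

# Rule of thumb for English text from the list above; an assumption, not a tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Roughly estimate how many tokens a piece of text will use."""
    if not text:
        return 0
    # Round up: a leftover fragment of text still occupies a token.
    return max(1, math.ceil(len(text) / chars_per_token))


print(estimate_tokens("Understanding tokens will help you save money."))
```

For non-Latin text a smaller `chars_per_token` value would be more realistic, since such words tend to split into more pieces.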
Why Non-English Text Costs More
Tokenizer vocabularies are built from predominantly English text, so English words more often map cleanly to single vocabulary entries. Words in other languages are frequently split into multiple pieces. Practical implication: given the rough rules above, a system prompt in English is typically 1.5–2× cheaper than the same prompt in a non-Latin script.
Input and Output Tokens
Anthropic bills per token, with separate rates for input and output:
- Input tokens — everything you sent: system prompt + full conversation history + the new message
- Output tokens — everything the model responded with
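import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(model="claude-sonnet-4-6", max_tokens=256, messages=[{"role": "user", "content": "Hello"}])
# After each request, the usage field reports the token counts:
print(response.usage.input_tokens)   # tokens sent
print(response.usage.output_tokens)  # tokens in the response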
A typical chat exchange is roughly 200–500 input + 300–800 output tokens. At thousands of requests, the numbers add up.
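Because every request resends the full conversation history, input tokens grow faster than the number of turns suggests. The simulation below is a minimal sketch of that effect. The function name and the per-message sizes (100, 50 and 300 tokens) are made-up illustrative values, not measurements.

```python
def cumulative_input_tokens(turns, system_tokens=100, user_tokens=50, reply_tokens=300):
    """Total input tokens billed over a chat where each request resends the whole history."""
    total = 0
    history = 0  # tokens of all earlier user messages and model replies
    for _ in range(turns):
        # Each request carries: system prompt + full history + the new message.
        total += system_tokens + history + user_tokens
        # After the reply, both the new message and the reply join the history.
        history += user_tokens + reply_tokens
    return total


for turns in (1, 5, 10, 20):
    print(turns, cumulative_input_tokens(turns))
```

With these assumed sizes, the first request sends 150 input tokens, while the third alone sends 850, because it carries two earlier exchanges with it.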
Tokens and Money
Anthropic pricing (2026):
| Model | Input | Output |
|---|---|---|
| claude-haiku-4-5 | $1 / 1M | $5 / 1M |
| claude-sonnet-4-6 | $3 / 1M | $15 / 1M |
| claude-opus-4-7 | $15 / 1M | $75 / 1M |
A typical Sonnet request (200–500 input and 300–800 output tokens) costs roughly $0.005–0.014, so a thousand requests come to about $5–14.
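The per-request figure is plain arithmetic on the table: tokens times price per million, summed for input and output. Here is a small sketch using the Sonnet row as quoted in this article. `request_cost` and `PRICES_PER_MILLION` are hypothetical names, and the prices are copied from the table rather than fetched from anywhere.

```python
# (input, output) prices in USD per 1M tokens, copied from the table in this article.
PRICES_PER_MILLION = {
    "claude-sonnet-4-6": (3.00, 15.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD, given its input and output token counts."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# The "typical chat exchange" range: 200-500 input and 300-800 output tokens.
low = request_cost("claude-sonnet-4-6", 200, 300)
high = request_cost("claude-sonnet-4-6", 500, 800)
print(f"per request: ${low:.4f} to ${high:.4f}")
print(f"per 1,000 requests: ${low * 1000:.2f} to ${high * 1000:.2f}")
```

Output tokens dominate here: at these prices each one costs five times as much as an input token, so capping response length saves more than trimming a short prompt.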
Context Window
Every model has a context window: the maximum number of tokens in a single request (input + output combined). For claude-sonnet-4-6 that’s 200,000 tokens, about 150,000 words of English, or roughly two typical novels.
As conversation history grows, it fills the context window. Strategies:
- Truncate old messages (keep only the last N)
- Summarize history via a separate request
- Store only key facts rather than verbatim exchanges
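The first strategy can be sketched as trimming by token budget rather than by a fixed message count. Everything below is illustrative: `truncate_history` and `rough_count` are hypothetical helpers, and the counter uses the 4-characters-per-token rule of thumb instead of a real tokenizer.

```python
def rough_count(text: str) -> int:
    """Very rough token estimate (about 4 characters per token); an assumption."""
    return len(text) // 4


def truncate_history(messages, max_tokens, count_tokens=rough_count):
    """Keep the most recent messages whose combined size fits within max_tokens."""
    kept = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 8},
]
print(truncate_history(history, max_tokens=12))
```

A real implementation would probably also make sure the trimmed history still begins with a user message and that the system prompt is counted against the same budget.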
Counting Tokens in Advance
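The SDK lets you count a prompt’s tokens without generating a response (the count itself is still an API call):
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system="You are a Python tutor.",
    messages=[{"role": "user", "content": "What is a decorator?"}],
)
print(response.input_tokens)  # input tokens this prompt would use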
Use this to verify a prompt won’t exceed limits before sending it.
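Once you have a token count, the limit check is a single comparison, as long as you remember to leave room for the response. This is a minimal sketch: `fits_in_context` is a hypothetical helper, and the 200,000 figure is the context window this article quotes for claude-sonnet-4-6.

```python
# Context window quoted in this article for claude-sonnet-4-6; adjust per model.
CONTEXT_WINDOW = 200_000


def fits_in_context(input_tokens: int, max_output_tokens: int,
                    context_window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus the reserved response budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window


print(fits_in_context(150_000, 4_096))  # comfortably inside the window
print(fits_in_context(199_000, 4_096))  # prompt fits, but leaves no room for the reply
```

In practice `input_tokens` would come from the counting call above and `max_output_tokens` from the `max_tokens` value you plan to send.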
LLMs Are Predictors, Not Knowledge Bases
The key insight: an LLM doesn’t look up answers in a database. It predicts the most likely next token. That’s why:
- The model can “hallucinate” — generating plausible but incorrect facts
- temperature=0 produces more stable answers, because the model almost always picks the most likely token
- The same question at temperature=1 can give a different answer each time
- More relevant context lets the model “see” more of the task, which usually improves its answers
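The effect of temperature is easiest to see on a toy example. The sketch below picks a "next token" from hand-written scores using temperature-scaled softmax sampling. The tokens and scores are invented for illustration, and `sample_token` is a hypothetical helper, not how any particular model or API is implemented internally.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token from a {token: score} dict using temperature-scaled softmax."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: always take the top score.
        return max(logits, key=logits.get)
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Invented scores for the token after "The capital of France is".
logits = {"Paris": 3.0, "London": 1.0, "Rome": 0.5}

print([sample_token(logits, 0) for _ in range(5)])    # the same token every time
print([sample_token(logits, 1.0) for _ in range(5)])  # usually "Paris", sometimes not
```

The occasional "London" at temperature 1 is the same mechanism behind hallucinations: a plausible but wrong token was simply likely enough to be picked.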