OpenAI Prompt Caching: How It Works and What It Saves
OpenAI bills cached input at $0.125 per million tokens on GPT-5 versus $1.25 fresh, as of June 2026. The prefix rules, cache-friendly prompt structure, and hit-rate math.
OpenAI prompt caching bills the repeated part of your prompts at a steep discount, automatically: on GPT-5, cached input costs $0.125 per million tokens versus $1.25 fresh, a 90 percent cut, as of June 2026 (openai.com/api/pricing). To qualify, requests must share a byte-identical prefix of at least 1,024 tokens, which makes the entire craft one rule applied everywhere: static content first, variable content last. Below are the cache rules, the savings math, the prompt structure that earns hits, and how to measure the hit rate you actually get.
The rules the cache plays by
Caching is automatic. There is no parameter to set and no separate endpoint; OpenAI applies the discount whenever a request qualifies.
Matching is exact-prefix. The cache compares the opening tokens of your request against recent requests; the first divergent token ends the match. Everything inside the matched prefix bills at the cached rate, everything after bills fresh.
The minimum is 1,024 tokens. Prompts shorter than that never cache, no matter how often they repeat.
Entries are short-lived. OpenAI describes cache entries as evicting after minutes of inactivity, typically around 5 to 10, up to about an hour. Caching rewards steady traffic; a job that runs once a day re-pays full price every run.
Hits are visible per request, in the usage object:
{
"usage": {
"prompt_tokens": 3500,
"completion_tokens": 350,
"prompt_tokens_details": { "cached_tokens": 3000 }
}
}
What a good hit rate is worth
Take a support assistant on GPT-5 with a 3,000-token static prefix (system prompt plus tool schemas), 500 variable tokens (retrieved context plus the user message), and 350 output tokens, at one million requests a month. Say 90 percent of requests hit the cache.
Without caching:
input 3,500 × 1M = 3.5B tokens × $1.25/1M = $4,375.00
With caching (90% of requests hit):
cached 3,000 × 900k = 2.7B × $0.125/1M = $337.50
fresh remaining 800M × $1.25/1M = $1,000.00
input total $1,337.50
Output, both cases: 350 × 1M = 350M × $10/1M = $3,500.00
The monthly bill drops from $7,875 to $4,837.50: 69 percent off input, 39 percent off the total. The output line does not move, which is the honest boundary of this lever. Caching discounts the prefix; it never touches output.
Structuring prompts for cache hits
Order every request by stability, most static first:
1. System prompt static, identical every request
2. Tool definitions static, same order, same serialization
3. Few-shot examples static
4. Retrieved context variable
5. User message variable
The classic failure is one volatile token at the top:
# Cache-hostile: hit rate 0%
[system] You are SupportBot. Current time: 2026-06-12T09:14:03Z. User: maria-7c2f. <policies, tools...>
# Cache-friendly: first 3,000 tokens identical on every request
[system] You are SupportBot. <policies, tools, examples...>
[user] (time: 2026-06-12T09:14:03Z, user: maria-7c2f) How do I change my plan?
One timestamp at the top of a system prompt can zero out an entire cache. The same applies to request IDs, user names, session metadata, and randomized example order. Keep tool schemas in a fixed order with deterministic JSON serialization, and remember that every variant of a system prompt is its own cache entry, so an A/B test on the prefix halves the traffic each arm sees.
Prompts below the 1,024-token threshold have nothing to win here; their lever is trimming, not caching.
Measuring the hit rate you actually get
The formula, over any window:
hit rate = sum(cached_tokens) ÷ sum(prompt_tokens)
Chat-style traffic with a stable prefix typically lands at 60 to 80 percent. Low-volume endpoints score worse because entries evict between calls, and a sudden drop usually means a deploy moved something variable into the prefix.
Every request through our endpoint logs cached_tokens, and the free cache analytics chart hit rate per key over time, so a structure regression shows up as a line going down instead of a bill going up.
Where caching pays the most
Agent loops, first. Every call in an agent task reopens with the same system-plus-tools prefix, and a task that makes 18 calls re-sends that prefix 18 times within seconds of each other, comfortably inside the eviction window. On GPT-5, a 2,000-token stable prefix inside a 6,000-token average input drops input cost per call from $0.0075 to about $0.0053, roughly 30 percent off the input line of every call after the first, with no code changes.
Multi-turn chat, second. A conversation re-sends its whole transcript every turn, and that transcript is a growing prefix: system prompt plus all earlier turns are identical from one message to the next. The cache match extends as the conversation does, which takes real pressure off the quadratic cost curve of long chats.
RAG endpoints, third, with one caveat: retrieved chunks differ per request, so the match ends where retrieval begins. Keep the retrieval block below the static prompt and accept that only the static share discounts.
What caching cannot fix
Output spend, first of all: generation bills at full rate, and on GPT-5 output costs eight times input. One-off prompts and once-a-day jobs, second: there is no repeat traffic to discount, which is where the Batch API takes over for asynchronous bulk work. And model rates themselves: a cached GPT-5.5 token still costs more than many fresh Mini tokens, so routing stays the bigger lever. The full per-model table lives in OpenAI API pricing explained, and caching’s place among all seven cost levers is ranked in how to reduce OpenAI API costs.
Caching is the rare discount that costs nothing but discipline. Structure the prompts once, then check what your bill looks like after the discount in the calculator.
Frequently asked questions
How does OpenAI prompt caching work?
Automatically. When a request repeats the exact opening tokens of a recent request and the shared prefix is at least 1,024 tokens, OpenAI bills the repeated part at the cached-input rate. There is no flag to set; the discount shows up in each response's usage object as cached_tokens.
How much does prompt caching save?
Cached input bills at $0.125 per million tokens on GPT-5 versus $1.25 fresh, a 90 percent discount, as of June 2026. The effect on a bill depends on how much of your input is a repeated prefix: at a 60 percent prefix share, input costs drop roughly in half. Output tokens are never discounted.
Why is my prompt cache hit rate zero?
Almost always because something variable sits at the top of the prompt. A timestamp, request ID, or user name in the first line makes every prefix unique, and prefix matching is exact. Move all variable content below the static block, and keep the static block above the 1,024-token minimum.
How do I measure my prompt cache hit rate?
Read prompt_tokens_details.cached_tokens in each response's usage object and divide the sum by total input tokens over a window. Chat-style traffic with a stable system prompt typically lands at 60 to 80 percent. ProxyLLM's free cache analytics chart the rate per key so you can watch it instead of computing it.
Does prompt caching discount output tokens?
No. Caching discounts repeated input only; output always bills at the full per-model rate. Workloads dominated by generation see little benefit from caching and more from output caps and model choice.