GPT-5’s Reasoning Tokens Count Against Your Output Budget — and Bill at the Output Rate

Brian Carpio

We support both Claude and GPT-5 on Amazon Bedrock in a code-generation platform that deploys into customer AWS accounts. This is the engineering writeup of what that integration actually took — the two APIs hiding under one service, the cost surprise, and the production failures that taught us each lesson. It’s vendor-neutral on purpose: if you’re doing this integration yourself, this is the post I wish I’d had. No pitch.

32K-token budget. The model reasoned for the full 900-second Lambda timeout. Emitted zero visible tokens. Timed out.

That was our introduction to what “GPT-5 on Bedrock” actually means once you get past the launch-day blog posts. We’d been running Claude on Bedrock for months. Adding the GPT-5 family looked, from the outside, like a config change — same service, same account, same IAM. It is not a config change. It’s a second integration wearing the first one’s clothes.

Here are the five things that surprised us, in the order they cost us time.

1. Bedrock is not one API. It’s two.

AWS markets Bedrock as a unified gateway to many model families. That’s true for billing and IAM. It is not true at the API layer.

  • Converse (bedrock-runtime.converse_stream) is the unified chat-completion surface. Same JSON shape across Claude, Nova/Titan, Llama, Mistral, Cohere. One client, many model families. This is the Bedrock most people mean when they say “Bedrock.”
  • Responses — AWS launched the surface under the name Mantle (the endpoint is literally bedrock-mantle.{region}.api.aws) — is where the GPT-5 family lives on Bedrock. Its payload mirrors OpenAI’s own /v1/responses contract, not Converse. AWS didn’t fold it into Converse — plausibly because the reasoning-token and tool-use semantics don’t normalize cleanly. Worth knowing: Mantle actually hosts two dialects. GPT-5 and Grok are on the Responses dialect; a growing list of open-frontier models (Qwen, DeepSeek, Mistral variants, MoonshotAI, and others) are on a chat-completions dialect that lives on Mantle too. This post is scoped to the Responses dialect.

Same account. Same IAM identity. Same KMS keys. Two completely different payload shapes depending on which model family you call.

There’s also a Mantle-specific bearer-token flow (bedrock:CreateBearerToken) sitting alongside your standard IAM identity — which is why a plain boto3.client("bedrock-runtime").invoke_model() won’t Just Work against GPT-5 even with the right IAM policy. Two auth surfaces to reason about, not one.

The same logical request — system prompt plus a user message — looks like this on Converse (Claude):

{
  "system": [{"text": "You are a code assistant."}],
  "messages": [{"role": "user", "content": [{"text": "explain X"}]}],
  "inferenceConfig": {"maxTokens": 8192, "temperature": 0.1},
  "toolConfig": {"tools": ["..."]}
}

And like this on Responses (GPT-5):

{
  "model": "openai.gpt-5.5",
  "input": [{"role": "user", "content": [{"type": "input_text", "text": "explain X"}]}],
  "instructions": "You are a code assistant.",
  "max_output_tokens": 32768,
  "temperature": 1.0,
  "reasoning": {"effort": "none"},
  "text": {"format": {"type": "text"}},
  "tools": ["..."],
  "store": false
}

Different field names (system vs instructions, messages vs input, maxTokens vs max_output_tokens). Different content-block shapes. Different defaults — note the temperature: Converse takes your 0.1 fine; the GPT-5 reasoning models reject low temperatures and effectively force 1.0. If you carry your Claude inference config over verbatim, you get a validation error from a model the IAM simulator swears you can call.

If you want a single call site that can target either family, something has to absorb this. Bedrock doesn’t. You do.
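A minimal sketch of what "absorbing this" can look like, assuming a backend-neutral request object. `ChatRequest`, `to_converse`, `to_responses`, and `build_payload` are hypothetical names, and routing on the `openai.` prefix is an assumption for illustration, not how Bedrock identifies a surface. The field names in the payloads are the ones shown in the two examples above.

```python
from dataclasses import dataclass


@dataclass
class ChatRequest:
    """Backend-neutral request. Field names here are hypothetical."""
    model_id: str
    system: str
    user_text: str
    max_tokens: int
    temperature: float = 0.1


def to_converse(req: ChatRequest) -> dict:
    # Converse shape: system/messages/inferenceConfig, as in the Claude example.
    return {
        "system": [{"text": req.system}],
        "messages": [{"role": "user", "content": [{"text": req.user_text}]}],
        "inferenceConfig": {"maxTokens": req.max_tokens, "temperature": req.temperature},
    }


def to_responses(req: ChatRequest, effort: str = "none") -> dict:
    # Responses shape: instructions/input/max_output_tokens, as in the GPT-5 example.
    return {
        "model": req.model_id,
        "instructions": req.system,
        "input": [{"role": "user", "content": [{"type": "input_text", "text": req.user_text}]}],
        "max_output_tokens": req.max_tokens,
        "temperature": 1.0,  # low temperatures are rejected, so the caller's value is ignored
        "reasoning": {"effort": effort},
        "store": False,
    }


def build_payload(req: ChatRequest) -> tuple[str, dict]:
    # Assumption: the model-id prefix is enough to pick the surface.
    if req.model_id.startswith("openai."):
        return "responses", to_responses(req)
    return "converse", to_converse(req)
```

The caller still has to send each payload through the right client and auth flow; this only covers the shape translation.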

2. The reasoning-token tax (the one nobody warns you about)

This is the big one. It’s a cost story and a correctness story at the same time.

With extended thinking off, Claude’s max_tokens is a ceiling on visible output. GPT-5’s max_output_tokens is a ceiling on reasoning tokens plus visible output, combined, and the reasoning tokens are billed at the output rate while being invisible to your application unless you explicitly opt into reasoning summaries.

Read that again, because it has two consequences.

Correctness: if you set max_output_tokens = 8192 and the model spends 6000 tokens reasoning, you have 2192 left for the actual answer. Ask for something that needs more and the response truncates — or, in the pathological case, the model spends the entire budget thinking and emits nothing. That’s the cold open above: a generous 32K budget, high effort, a hard question, and the model reasoned until the Lambda timed out with zero user-visible bytes.
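The arithmetic is worth writing down once, because it is the whole failure mode. A tiny sketch (the function name is hypothetical):

```python
def visible_budget(max_output_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer once reasoning has been deducted."""
    return max(0, max_output_tokens - reasoning_tokens)


print(visible_budget(8192, 6000))    # 2192
print(visible_budget(32768, 32768))  # 0: the model thought the whole budget away
```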

Cost: GPT-5.4 is list-priced within about 10% of Claude Sonnet 4.6 on the output line ($16.50 vs $15.00 per 1M tokens in the table below). But every call quietly carries reasoning tokens billed at that output rate. When we measured the same logical turn across both, GPT-5.4 came in roughly 1.8–2.5x the Sonnet cost once the reasoning tokens were counted. The pricing page comparison and the invoice comparison are not the same comparison.

Approximate Bedrock list pricing we were reasoning against (us-east, mid-2026 — check current pricing before you cite this, it moves):

Model               Input $/1M   Output $/1M   Hidden reasoning billed at output rate
Claude Haiku 4.5    $1.00        $5.00         No
Claude Sonnet 4.6   $3.00        $15.00        No
GPT-5.4             $2.75        $16.50        Yes
GPT-5.5             $5.50        $33.00        Yes

GPT-5.5 lands around 2x GPT-5.4 on top of that. So “which tier handles this turn” stops being a “use the best model” reflex and becomes a real per-request cost decision — especially for grounded retrieval, where the cheaper sibling is often the correct pick, not just the frugal one.
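To make the invoice-versus-pricing-page gap concrete, here is a rough per-turn cost sketch using the approximate list prices from the table. The dictionary keys and the token counts are made up for illustration. They are not our measurements, and the real ratio depends on how much the model reasons on a given turn.

```python
# USD per 1M tokens (input, output), copied from the approximate table above.
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "gpt-5.4": (2.75, 16.50),
}


def turn_cost(model: str, input_tokens: int, visible_tokens: int,
              reasoning_tokens: int = 0) -> float:
    """Cost of one turn in USD; reasoning tokens are billed at the output rate."""
    in_rate, out_rate = PRICES[model]
    billed_output = visible_tokens + reasoning_tokens
    return (input_tokens * in_rate + billed_output * out_rate) / 1_000_000


# Hypothetical turn: 2,000 input tokens, 1,000 visible output tokens,
# plus 1,500 hidden reasoning tokens on the GPT-5.4 side only.
sonnet = turn_cost("claude-sonnet-4.6", 2000, 1000)
gpt = turn_cost("gpt-5.4", 2000, 1000, reasoning_tokens=1500)
print(f"{sonnet:.5f}")        # 0.02100
print(f"{gpt:.5f}")           # 0.04675
print(f"{gpt / sonnet:.2f}")  # 2.23
```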

Two mitigations that mattered: give the Responses path a much larger max_output_tokens headroom than you’d give Claude (we run ~4x) so reasoning can’t starve the answer, and pin reasoning effort to the floor by default for anything that doesn’t need deliberation. Which brings up a trap —

The effort vocabulary is not consistent. OpenAI’s direct API accepts minimal | low | medium | high. The Bedrock Mantle equivalent accepts none | low | medium | high | xhigh. Same dial, different words at the extremes. Send minimal to Mantle and you get a validation error. Send none to OpenAI-direct and same thing. If you’re normalizing across both, this is a lookup table, not a constant.
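A sketch of that lookup table, assuming you keep your own neutral vocabulary and translate at the wire. The neutral level names (`floor`, `max`) and target names are hypothetical. Clamping `max` to `high` on OpenAI-direct is a design assumption, since that API has no `xhigh` in the list above.

```python
# Neutral level -> wire value, per target. Wire values are the ones listed above.
EFFORT_MAP = {
    "openai-direct": {
        "floor": "minimal", "low": "low", "medium": "medium",
        "high": "high", "max": "high",  # assumption: clamp, no xhigh here
    },
    "bedrock-mantle": {
        "floor": "none", "low": "low", "medium": "medium",
        "high": "high", "max": "xhigh",
    },
}


def wire_effort(level: str, target: str) -> str:
    """Translate a neutral effort level into the word the target accepts."""
    try:
        return EFFORT_MAP[target][level]
    except KeyError:
        raise ValueError(f"unknown effort {level!r} or target {target!r}") from None


print(wire_effort("floor", "bedrock-mantle"))  # none
print(wire_effort("floor", "openai-direct"))   # minimal
```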

3. Streaming: two event vocabularies that don’t overlap — and a silent outage

Both surfaces stream. The event vocabularies share nothing.

Converse:

messageStart → contentBlockStart → contentBlockDelta (text) → contentBlockStop
  → messageStop → metadata { usage: { inputTokens, outputTokens } }

Responses (SSE):

response.created → response.output_item.added (reasoning) → response.output_item.added (message)
  → response.output_text.delta → … → response.completed { usage: { input_tokens, output_tokens } }

Even the usage casing diverges — inputTokens vs input_tokens. Anything downstream that does cost-by-model rollup, usage logging, or token metering has to speak both.
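One way to keep metering code from caring which surface produced the numbers is to normalize the usage block as soon as it arrives. A minimal sketch, assuming you have already pulled the usage dict out of the final event. The function name and the choice of snake_case as the internal shape are ours.

```python
def normalize_usage(usage: dict) -> dict:
    """Return usage as {'input_tokens', 'output_tokens'} regardless of source casing."""
    if "inputTokens" in usage:  # Converse metadata event
        return {
            "input_tokens": usage["inputTokens"],
            "output_tokens": usage["outputTokens"],
        }
    # Responses: already snake_case; copy only the two fields we meter on.
    return {
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
    }


print(normalize_usage({"inputTokens": 120, "outputTokens": 45}))
print(normalize_usage({"input_tokens": 120, "output_tokens": 45}))
```

Both calls print the same dictionary, which is the point: everything downstream sees one shape.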

The one that actually paged us: GPT-5 streams its reasoning tokens silently. They never reach the client, but the model is working the whole time. From your load balancer’s perspective, that connection looks idle. Our ALB’s default idle_timeout of 60 seconds killed the stream before the first visible byte arrived on longer reasoning turns. Claude never exposed this because its first content chunk lands within seconds — there’s no long silent gap to time out against.

Fix was mechanical (raise idle_timeout to 900 to match the Lambda ceiling). Knowing the failure mode exists is the part that saves you the outage. A silent-but-working stream is invisible to every layer that only watches for bytes on the wire.
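If you manage the load balancer from code rather than infrastructure templates, the same change can be sketched with boto3's `elbv2` client, whose `modify_load_balancer_attributes` call takes the `idle_timeout.timeout_seconds` attribute. The helper names are hypothetical, and you supply the ARN. The 1 to 4000 second bound is the ALB's documented range for this attribute.

```python
def idle_timeout_attributes(seconds: int) -> list[dict]:
    """Build the attribute list for an ALB idle-timeout change."""
    if not 1 <= seconds <= 4000:
        raise ValueError("ALB idle timeout must be between 1 and 4000 seconds")
    # The API expects attribute values as strings.
    return [{"Key": "idle_timeout.timeout_seconds", "Value": str(seconds)}]


def raise_idle_timeout(load_balancer_arn: str, seconds: int = 900) -> None:
    """Apply the new idle timeout; requires AWS credentials and boto3."""
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("elbv2").modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=idle_timeout_attributes(seconds),
    )
```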

4. The small divergences that add up

Individually trivial. Collectively, the difference between “a config line” and “a quarter of integration work”:

  • Parameter names for the same concept. maxTokens vs max_output_tokens, system vs instructions, and several tool-schema shape differences.
  • Parameters one backend requires and the other rejects. GPT-5 wants reasoning.effort; Converse has no such field and errors on it. Converse’s inferenceConfig block has no home on Responses.
  • Forced defaults. The temperature floor on reasoning models mentioned above. store: false if you don’t want server-side conversation state you didn’t ask for.
  • Two failure modes to catch. A bad Converse call throws ValidationException. A Mantle entitlement problem throws a 401 access_denied — because, unlike Claude (access via IAM policy), the GPT-5 family is gated by a per-model commercial entitlement that AWS Sales controls. A brand-new account with correct IAM will still get a 401 until that entitlement clears. Your IAM policy simulator will tell you the call is allowed. The call is not allowed. Budget for that gap in your rollout timeline.
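The two failure modes in the last bullet can be told apart with a small classifier. This is a sketch under assumptions: that your client code has already extracted an HTTP status and an error code from whatever exception or response it received (with boto3, the code is available on `ClientError.response["Error"]["Code"]`), and the returned labels are hypothetical.

```python
from typing import Optional


def classify_failure(surface: str, status: Optional[int] = None,
                     code: Optional[str] = None) -> str:
    """Map a backend error onto an action the caller can take."""
    if surface == "converse" and code == "ValidationException":
        # Bad request shape or a rejected parameter: fix the payload.
        return "fix_request"
    if surface == "responses" and status == 401 and code == "access_denied":
        # Per the text above: a commercial entitlement gap, not an IAM problem.
        return "entitlement_pending"
    return "unknown"


print(classify_failure("converse", code="ValidationException"))          # fix_request
print(classify_failure("responses", status=401, code="access_denied"))   # entitlement_pending
```

Surfacing `entitlement_pending` as its own label keeps on-call engineers from spending an afternoon in the IAM policy simulator.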

5. Don’t branch your prompts per model. Fix the prompt.

We learned this one the expensive way. Same prompt, two models, divergent output: Claude obeyed a structural-attribution rule; GPT-5.5 fabricated aggregate numbers that weren’t in the source.

The tempting fix is if model_id.startswith("openai."): <special-case the prompt>. Don’t. Every per-model branch you add is a combinatorial tax you pay forever, on every prompt, for every future model. We deleted the branch and instead tightened the rule wording and added a couple of few-shot examples. That closed the gap on both models — Claude got more reliable too — and left the code model-agnostic.

The discipline that fell out of it: the layer that routes between backends is a wire abstraction only. It translates parameter names, normalizes the effort vocabulary, drops rejected fields with a logged warning, and applies family-aware budget ceilings. It does not know about prompts. Prompts are model-agnostic by design, and when a model misbehaves, the prompt is where you fix it. Push model-awareness up into your prompt layer and you’ve built a maintenance bomb.
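A sketch of the wire-only part: drop fields the target surface rejects (with a warning) and scale the token budget per family. The names are hypothetical, the rejected-field sets cover only the two examples named earlier in this post, and the 4x multiplier is the headroom we mentioned running on the Responses path. Nothing here touches the prompt.

```python
import logging

log = logging.getLogger("dispatch")

# Fields each surface rejects, limited to the examples named in this post.
REJECTED = {
    "converse": {"reasoning"},
    "responses": {"inferenceConfig"},
}

# Budget multiplier per surface; Responses needs room for reasoning tokens.
HEADROOM = {"converse": 1, "responses": 4}


def prepare(surface: str, params: dict, base_max_tokens: int) -> tuple[dict, int]:
    """Return (params with rejected fields removed, token budget for this surface)."""
    cleaned = {}
    for key, value in params.items():
        if key in REJECTED[surface]:
            log.warning("dropping %s: not accepted by %s", key, surface)
            continue
        cleaned[key] = value
    return cleaned, base_max_tokens * HEADROOM[surface]


params, budget = prepare(
    "responses",
    {"inferenceConfig": {"temperature": 0.1}, "store": False},
    base_max_tokens=8192,
)
print(params, budget)  # {'store': False} 32768
```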

The checklist, if you’re doing this yourself

  • Bedrock is two APIs (Converse + Responses/Mantle). Plan for a dispatch boundary, not a config flag.
  • max_output_tokens on GPT-5 is reasoning + visible, billed at output rate. Give it headroom, pin effort low by default, and price against your invoice, not the pricing page — expect ~1.8–2.5x Sonnet for GPT-5.4 on the same turn.
  • Two streaming vocabularies; watch the usage-field casing. And raise your load-balancer idle timeout — silent reasoning streams look idle and get reaped.
  • Normalize the effort vocabulary (minimal vs none), expect forced defaults (temperature), and catch both ValidationException and 401 access_denied. The 401 is a commercial entitlement gated by AWS Sales, not an IAM problem — it won’t show up in the policy simulator.
  • Never branch prompts per model. Fix the prompt; keep the routing layer wire-only.

None of these are dealbreakers. Each is small in isolation. In aggregate they’re the difference between “we support GPT-5” as a one-line claim and “we support GPT-5” as a real engineering migration. We did the work once so it lives in one place — but the divergences above are true whether you build on a platform or roll your own. If you’re integrating the GPT-5 family on Bedrock, steal all of it.


Or Skip Building This

We wrote the dispatch layer described above for OutcomeOps — a code-generation platform that deploys into customer AWS accounts and supports Claude, GPT-5, Grok, and the rest of Bedrock’s model list through a single tfvars change. If you’d rather buy than build, that’s what we’re for. Otherwise, hopefully the teardown above saves you a couple of production incidents.

Model choice is configuration. The dispatch layer is the moat.
