You're Probably Using the Wrong Bedrock Model. Here's How to Tell.
We ran eight models through the same RAG pipeline. The cheapest Claude model won. Here's why — and what it means for your model selection strategy.
The Follow-Up
Last month we published *Same Context. Three Models. The Floor Isn't Zero.* — three Bedrock models, identical context, wildly different output quality. It hit Reddit. The feedback was immediate:
"This was a good read. Please update if you do again with more/newer models."
"Why would you even attempt Nova instead of Nova 2?"
"Would be interesting to see this for open source models."
Fair. So we ran it again. Same pipeline. Same question. Same audit table. Eight models this time — spanning five providers on AWS Bedrock.
And the results broke an assumption we didn't know we had.
Same Pipeline, More Models
Same setup as before. Thirteen ADRs. Two vector stores. Compliance documents. Twenty-eight indexed blog posts. The same Context Engineering methodology. The Bedrock Converse API — model-agnostic, so swapping models is a single Terraform variable change. No code modifications. No prompt tuning. Every model gets the exact same system prompt, the exact same RAG context, the exact same question:
"A customer asked about SOC 2 compliance, how do I respond?"
Every response logged to DynamoDB with full audit trail — model ID, input tokens, output tokens, duration, cost. Same methodology. More data.
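As a sketch of what that loop looks like (the function names, the placeholder model ID, and the sample response values below are illustrative, not the production pipeline), assuming a boto3 `bedrock-runtime` client:

```python
from datetime import datetime, timezone

def call_model(client, model_id: str, system_prompt: str, question: str) -> dict:
    """One Converse API call. `client` is a boto3 bedrock-runtime client;
    the model ID is the only thing that changes between runs."""
    return client.converse(
        modelId=model_id,
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": question}]}],
    )

def build_audit_item(model_id: str, response: dict) -> dict:
    """Flatten a Converse response into the audit fields logged above:
    model ID, input/output tokens, duration."""
    return {
        "model_id": model_id,
        "input_tokens": response["usage"]["inputTokens"],
        "output_tokens": response["usage"]["outputTokens"],
        "duration_ms": response["metrics"]["latencyMs"],
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }

# Converse responses carry token usage and latency alongside the text,
# so the audit record needs no extra instrumentation. Values here
# mirror the Haiku row of the results table.
sample_response = {
    "output": {"message": {"content": [{"text": "..."}]}},
    "usage": {"inputTokens": 59221, "outputTokens": 1122},
    "metrics": {"latencyMs": 13800},
}
print(build_audit_item("example-model-id", sample_response))
```

Writing the record is then a single `table.put_item(Item=item)` against the DynamoDB table; swapping models is just a different `model_id` argument.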
The Results
| Model | Provider | Input Tokens | Output Tokens | Duration |
|---|---|---|---|---|
| Haiku 4.5 | Anthropic | 59,221 | 1,122 | 13.8s |
| Sonnet 4.6 | Anthropic | 59,222 | 655 | 16.7s |
| GPT OSS 120B | OpenAI | 51,027 | 539 | 5.7s |
| DeepSeek R1 | DeepSeek | 52,438 | 481 | 12.2s |
| DeepSeek V3.2 | DeepSeek | 52,437 | 400 | 18.9s |
| Qwen3 235B | Alibaba | 52,282 | 377 | 10.1s |
| Llama 4 Maverick | Meta | 50,269 | 344 | 4.7s |
| Nova 2 Lite | Amazon | 49,547 | 294 | 5.3s |
Same context. Eight models. Five providers. Every response logged with full audit trail. Let's talk about what happened.
Haiku Beat Sonnet
Read that again. The cheapest Claude model outperformed the mid-tier Claude model on the same context. Haiku produced 1,122 tokens — a complete sales playbook with pushback handlers, framework-specific compliance answers, competitive positioning, "what NOT to say" guardrails, and CRM-ready talking points.
Sonnet produced 655 tokens. Good structure. Pushback handler. Guardrails. But less comprehensive. Fewer objection scenarios. No HIPAA/PCI-DSS follow-ups. No CRM template.
The more expensive model gave us less.
This isn't a fluke. It reveals something fundamental about how different models handle different cognitive tasks.
The Wrong Model for the Job
Here's the insight that changes how you should think about model selection:
Retrieval + Formatting Tasks
The answer already exists in the context. The model needs to find it, structure it, and present it. This is what our sales assistant does — the ADRs contain the compliance positioning, the competitive intelligence is in the blog posts, the framework details are in the compliance docs. The model's job is extraction and formatting.
Haiku excels here. Fast, cheap, thorough.
Reasoning + Synthesis Tasks
The answer doesn't exist anywhere in the knowledge base. The model has to construct it — connecting information across sources, inferring relationships, drawing conclusions that aren't explicitly stated. "Explain the relationship between app_a, app_b, and app_c" requires reasoning, not retrieval.
Sonnet earns its cost here. It thinks, not just formats.
Match the model to the cognitive task, not the price tier.
Our RAG sales assistant is a retrieval task. The answer is already in the ADRs. The model just needs to find it and format it correctly. Haiku is the right tool. Throwing Sonnet at it is like hiring a senior architect to fill out a form.
The mistake enterprises are making is picking one model for everything and tuning the prompt to compensate. Wrong direction. You tune the model selection to the task type and keep the prompts simple.
The Full Breakdown
Haiku 4.5 — The Retrieval King
1,122 tokens · 13.8s
Complete sales playbook from a single query: plain-English explanation, ready-to-send email, pushback handler with Terraform analogy, HIPAA/PCI-DSS framework answers, "what NOT to say" guardrails, competitive positioning against Cursor and Copilot, and a CRM notes template.
The rep handles a three-round CISO conversation without escalating to engineering.
Sonnet 4.6 — Good, But Overqualified
655 tokens · 16.7s
Well-structured response with email copy, a pushback handler, guardrails, and CRM notes. But fewer objection scenarios. No framework-specific follow-ups. No competitive positioning. Slower and more expensive than Haiku for a less complete result.
Good — but not 4x-the-cost better. The wrong tool for this task.
GPT OSS 120B — Clean but Shallow
539 tokens · 5.7s
Professional email with solid bullet points. Fast. No objection handling, no competitive context, no guardrails. The rep sends a good first email but is exposed the moment the CISO pushes back.
DeepSeek R1 & V3.2 — Capable, Not Exceptional
R1: 481 tokens · 12.2s | V3.2: 400 tokens · 18.9s
R1 produced a structured email with talking points — decent but surface-level. V3.2 was similar quality but the slowest model in the entire test at 18.9s. Both extracted the core facts but missed the deeper context: no framework-specific answers, no guardrails, no competitive positioning.
Interesting footnote: R1 is a reasoning model, but on a retrieval task, that reasoning overhead didn't translate to better output.
Qwen3 235B — Competent Middle of the Pack
377 tokens · 10.1s
Structured email with bold formatting, PoC offer, and compliance bullets. No objection handling. No competitive context. Correctly identified the deployment model advantage but didn't anticipate follow-up scenarios.
Nova 2 Lite — Improved, Still Minimal
294 tokens · 5.3s
Reddit asked us to test Nova 2. It's better than Nova Lite v1 — it produced a ready-to-send email with the Terraform analogy, which Nova v1 never surfaced. But still minimal depth. One email, no pushback handling, no framework specifics. The floor got raised. The ceiling didn't move much.
Llama 4 Maverick — The Disappointment
344 tokens · 4.7s
Meta's newest model. Fast — 4.7s. But the weakest extraction from context in the entire test. Generic email with bullet points that any model could produce without the RAG context. 50,269 input tokens of ADRs, compliance docs, and competitive intelligence — and it produced a response that barely acknowledges any of it.
The fastest model to give you the least useful answer.
Ranked by Usefulness, Not Tokens
Output tokens don't tell you quality. A 1,122-token response that covers every objection beats a 539-token response that only handles the first email. Here's how we rank them by what actually matters — can the sales rep handle the full conversation without escalating?
A Framework for Model Selection
This data gives us a practical framework. Stop picking models by benchmark scores or price tier. Pick them by what cognitive task you're asking the model to perform.
| Task Type | Description | Right Model |
|---|---|---|
| Retrieval + Formatting | Answer exists in context. Find it, structure it, present it. | Haiku-class |
| Reasoning + Synthesis | Answer must be constructed. Connect sources, infer relationships. | Sonnet-class |
| Complex Architecture | Novel design decisions. Multi-system tradeoffs. Ambiguous requirements. | Opus-class |
The sales assistant is a retrieval task. The ADRs contain the compliance positioning. The blog posts contain the competitive intelligence. The model's job is to find it and format it. Haiku does this better than Sonnet — not because Haiku is a "better" model, but because it's the right model for the task.
Ask Sonnet to explain the relationship between three interconnected microservices using only code summaries and ADRs as evidence — and it earns its cost. That's reasoning. That's synthesis. That's where the extra capability matters.
The most expensive model isn't the best model. The right model is the best model.
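In code, the framework table reduces to a small routing map. A minimal sketch, with hypothetical model IDs (substitute the Bedrock IDs for your region; classifying the task itself is the hard part and is left to the caller):

```python
from enum import Enum

class TaskType(Enum):
    RETRIEVAL = "retrieval"        # answer exists in context: find, structure, present
    REASONING = "reasoning"        # answer must be constructed across sources
    ARCHITECTURE = "architecture"  # novel design decisions, multi-system tradeoffs

# Placeholder IDs -- not real Bedrock model identifiers.
MODEL_FOR_TASK = {
    TaskType.RETRIEVAL: "haiku-class-model-id",
    TaskType.REASONING: "sonnet-class-model-id",
    TaskType.ARCHITECTURE: "opus-class-model-id",
}

def select_model(task: TaskType) -> str:
    """Route by cognitive task, not price tier."""
    return MODEL_FOR_TASK[task]

print(select_model(TaskType.RETRIEVAL))
```

Because the Converse API is model-agnostic, the return value of `select_model` can feed straight into the same pipeline call with no other changes.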
What the Open-Source Models Tell Us
Reddit specifically asked about open-source models. Here's the honest answer: on a retrieval task with well-engineered context, they're fine. Not great. Fine.
Qwen3 235B, DeepSeek R1, DeepSeek V3.2, GPT OSS 120B — they all produced usable first-email responses. A sales rep could send any of them. But none of them anticipated the follow-up. None of them built guardrails. None of them surfaced competitive positioning that wasn't explicitly asked for.
The context contained all of that information. These models read it and returned the minimum viable answer. Haiku read it and returned the maximum useful answer.
For internal tools where "good enough" works? Open-source models are viable. For customer-facing workflows where the first response determines the outcome? The extraction gap matters.
A note on DeepSeek R1
R1 is a reasoning model — chain-of-thought, deliberate thinking. On a retrieval task, that reasoning overhead doesn't help. It's like using a chess engine to look up a phone number. The engine is brilliant at chess; it's not better at phone books. R1 would likely shine on our reasoning tasks. On retrieval, it's average.
About Llama 4 Maverick
We expected more. Meta's newest model, Mixture of Experts architecture, 17B active parameters. Fastest response time in the test at 4.7 seconds. And the weakest context extraction across all eight models.
It produced a generic email that barely referenced the 50,000+ tokens of context it received. The deployment model fact was there. The compliance positioning was absent. The competitive intelligence was ignored. The framework-specific details were skipped.
Speed without extraction is just fast nothing.
The Real Takeaway
The first blog proved that context engineering raises the floor. This follow-up proves something more nuanced:
1. Context engineering is still the foundation.
Every model — even Llama 4 Maverick — got the core fact right because the context contained it. Without context, they all hallucinate. With context, they all produce something usable. The floor is real.
2. More expensive doesn't mean better.
Haiku outperformed Sonnet on a retrieval task. Sonnet's extra reasoning capability didn't translate to better extraction — it translated to a more cautious, less comprehensive response. The model was overqualified for the task.
3. Match the model to the cognitive task.
Retrieval tasks need extraction capability, not reasoning power. Reasoning tasks need synthesis capability, not just speed. The right model isn't the most expensive one — it's the one whose strengths align with what you're asking it to do.
Eight models. Five providers. One pipeline. One question. The data is in the audit table.
Stop picking models by price tier. Start picking them by task type.
See the Full Outputs
Want to see all eight unredacted responses side-by-side? We'll walk you through:
- The complete output from every model
- How to identify retrieval vs. reasoning tasks in your workflows
- A model selection framework for your Bedrock deployment
- What this looks like applied to your engineering team
Context engineering raises the floor. The task type determines the model. Choose accordingly.
Eight models. One pipeline. The cheapest Claude model won. Not because it's the best model — because it was the right model.
Related Reading
- Context Engineering: The Next Evolution Beyond DevOps — The methodology behind the pipeline that made these results possible.
- What is an ADR? Why They're Critical for AI Development — The 13 ADRs that powered every model's context in this test.
- Why Most AI Platforms Over-Engineer RAG — The retrieval architecture behind the pipeline.
- Anthropic Says Build Skills, Not Agents. We Already Do. — Why model selection is a skill-level decision, not an agent-level one.