You're Probably Using the Wrong Bedrock Model. Here's How to Tell.

We ran eight models through the same RAG pipeline. The cheapest Claude model won. Here's why — and what it means for your model selection strategy.

Brian Carpio
Context Engineering · AI Model Comparison · RAG · AWS Bedrock · Enterprise AI · LLM Evaluation

The Follow-Up

Last month we published Same Context. Three Models. The Floor Isn't Zero. — three Bedrock models, identical context, wildly different output quality. It hit Reddit. The feedback was immediate:

"This was a good read. Please update if you do again with more/newer models."

"Why would you even attempt Nova instead of Nova 2?"

"Would be interesting to see this for open source models."

Fair. So we ran it again. Same pipeline. Same question. Same audit table. Eight models this time — spanning five providers on AWS Bedrock.

And the results broke an assumption we didn't know we had.

Same Pipeline, More Models

Same setup as before. Thirteen ADRs. Two vector stores. Compliance documents. Twenty-eight indexed blog posts. The same Context Engineering methodology. The Bedrock Converse API — model-agnostic, so swapping models is a single Terraform variable change. No code modifications. No prompt tuning. Every model gets the exact same system prompt, the exact same RAG context, the exact same question:

"A customer asked about SOC 2 compliance, how do I respond?"

Every response logged to DynamoDB with full audit trail — model ID, input tokens, output tokens, duration, cost. Same methodology. More data.
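The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' pipeline: the model ID and the audit-item field names are placeholders, while the Converse request shape and the `usage` fields (`inputTokens`, `outputTokens`) follow the documented Bedrock Converse API.

```python
import uuid

# Placeholder model ID -- in the pipeline described above this is the one
# Terraform variable that changes between runs.
MODEL_ID = "anthropic.claude-haiku-4-5-v1:0"

def build_converse_request(model_id, system_prompt, rag_context, question):
    """Assemble a Bedrock Converse API request. Converse is model-agnostic,
    so only model_id differs between the eight runs."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [{
            "role": "user",
            "content": [{"text": f"{rag_context}\n\n{question}"}],
        }],
    }

def build_audit_item(model_id, response, started_at, ended_at):
    """Flatten a Converse response into an audit-trail record: model ID,
    token counts, duration. Field names here are illustrative."""
    usage = response["usage"]
    return {
        "run_id": str(uuid.uuid4()),
        "model_id": model_id,
        "input_tokens": usage["inputTokens"],
        "output_tokens": usage["outputTokens"],
        "duration_s": round(ended_at - started_at, 1),
    }
```

In use, you would pass the first dict to `bedrock_runtime.converse(**request)` and the second to a DynamoDB `table.put_item(Item=...)` call; swapping models never touches this code.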

The Results

Model | Provider | Input Tokens | Output Tokens | Duration
Haiku 4.5 | Anthropic | 59,221 | 1,122 | 13.8s
Sonnet 4.6 | Anthropic | 59,222 | 655 | 16.7s
GPT OSS 120B | OpenAI | 51,027 | 539 | 5.7s
DeepSeek R1 | DeepSeek | 52,438 | 481 | 12.2s
DeepSeek V3.2 | DeepSeek | 52,437 | 400 | 18.9s
Qwen3 235B | Alibaba | 52,282 | 377 | 10.1s
Llama 4 Maverick | Meta | 50,269 | 344 | 4.7s
Nova 2 Lite | Amazon | 49,547 | 294 | 5.3s

Same context. Eight models. Five providers. Every response logged with full audit trail. Let's talk about what happened.

Haiku Beat Sonnet

Read that again. The cheapest Claude model outperformed the mid-tier Claude model on the same context. Haiku produced 1,122 tokens — a complete sales playbook with pushback handlers, framework-specific compliance answers, competitive positioning, "what NOT to say" guardrails, and CRM-ready talking points.

Sonnet produced 655 tokens. Good structure. Pushback handler. Guardrails. But less comprehensive. Fewer objection scenarios. No HIPAA/PCI-DSS follow-ups. No CRM template.

The more expensive model gave us less.

This isn't a fluke. It reveals something fundamental about how different models handle different cognitive tasks.

The Wrong Model for the Job

Here's the insight that changes how you should think about model selection:

Retrieval + Formatting Tasks

The answer already exists in the context. The model needs to find it, structure it, and present it. This is what our sales assistant does — the ADRs contain the compliance positioning, the competitive intelligence is in the blog posts, the framework details are in the compliance docs. The model's job is extraction and formatting.

Haiku excels here. Fast, cheap, thorough.

Reasoning + Synthesis Tasks

The answer doesn't exist anywhere in the knowledge base. The model has to construct it — connecting information across sources, inferring relationships, drawing conclusions that aren't explicitly stated. "Explain the relationship between app_a, app_b, and app_c" requires reasoning, not retrieval.

Sonnet earns its cost here. It thinks, not just formats.

Match the model to the cognitive task, not the price tier.

Our RAG sales assistant is a retrieval task. The answer is already in the ADRs. The model just needs to find it and format it correctly. Haiku is the right tool. Throwing Sonnet at it is like hiring a senior architect to fill out a form.

The mistake enterprises are making is picking one model for everything and tuning the prompt to compensate. Wrong direction. You tune the model selection to the task type and keep the prompts simple.

The Full Breakdown

Haiku 4.5 — The Retrieval King

1,122 tokens · 13.8s

Complete sales playbook from a single query: plain-English explanation, ready-to-send email, pushback handler with Terraform analogy, HIPAA/PCI-DSS framework answers, "what NOT to say" guardrails, competitive positioning against Cursor and Copilot, and a CRM notes template.

The rep handles a three-round CISO conversation without escalating to engineering.

Sonnet 4.6 — Good, But Overqualified

655 tokens · 16.7s

Well-structured response with email copy, a pushback handler, guardrails, and CRM notes. But fewer objection scenarios. No framework-specific follow-ups. No competitive positioning. Slower and more expensive than Haiku for a less complete result.

Good — but not 4x-the-cost better. The wrong tool for this task.

GPT OSS 120B — Clean but Shallow

539 tokens · 5.7s

Professional email with solid bullet points. Fast. No objection handling, no competitive context, no guardrails. The rep sends a good first email but is exposed the moment the CISO pushes back.

DeepSeek R1 & V3.2 — Capable, Not Exceptional

R1: 481 tokens · 12.2s | V3.2: 400 tokens · 18.9s

R1 produced a structured email with talking points — decent but surface-level. V3.2 was similar quality but the slowest model in the entire test at 18.9s. Both extracted the core facts but missed the deeper context: no framework-specific answers, no guardrails, no competitive positioning.

Interesting footnote: R1 is a reasoning model, but on a retrieval task, that reasoning overhead didn't translate to better output.

Qwen3 235B — Competent Middle of the Pack

377 tokens · 10.1s

Structured email with bold formatting, PoC offer, and compliance bullets. No objection handling. No competitive context. Correctly identified the deployment model advantage but didn't anticipate follow-up scenarios.

Nova 2 Lite — Improved, Still Minimal

294 tokens · 5.3s

Reddit asked us to test Nova 2. It's better than Nova Lite v1 — it produced a ready-to-send email with the Terraform analogy, which Nova v1 never surfaced. But still minimal depth. One email, no pushback handling, no framework specifics. The floor got raised. The ceiling didn't move much.

Llama 4 Maverick — The Disappointment

344 tokens · 4.7s

Meta's newest model. Fast — 4.7s. But the weakest extraction from context in the entire test. Generic email with bullet points that any model could produce without the RAG context. 50,269 input tokens of ADRs, compliance docs, and competitive intelligence — and it produced a response that barely acknowledges any of it.

The fastest model to give you the least useful answer.

Ranked by Usefulness, Not Tokens

Output tokens don't tell you quality. A 1,122-token response that covers every objection beats a 539-token response that only handles the first email. Here's how we rank them by what actually matters — can the sales rep handle the full conversation without escalating?

1. Haiku 4.5 — Full playbook. Email + pushback + frameworks + guardrails + CRM.
2. Sonnet 4.6 — Good structure, guardrails, but less comprehensive than Haiku.
3. GPT OSS 120B — Professional first email, exposed on follow-up.
4. DeepSeek R1 — Decent structure, surface-level extraction.
5. DeepSeek V3.2 — Similar to R1 but slowest in the test.
6. Qwen3 235B — Competent email, no objection handling.
7. Nova 2 Lite — Core fact plus basic email. Improved over Nova v1.
8. Llama 4 Maverick — Fastest to respond, least useful response.

A Framework for Model Selection

This data gives us a practical framework. Stop picking models by benchmark scores or price tier. Pick them by what cognitive task you're asking the model to perform.

Task Type | Description | Right Model
Retrieval + Formatting | Answer exists in context. Find it, structure it, present it. | Haiku-class
Reasoning + Synthesis | Answer must be constructed. Connect sources, infer relationships. | Sonnet-class
Complex Architecture | Novel design decisions. Multi-system tradeoffs. Ambiguous requirements. | Opus-class

The sales assistant is a retrieval task. The ADRs contain the compliance positioning. The blog posts contain the competitive intelligence. The model's job is to find it and format it. Haiku does this better than Sonnet — not because Haiku is a "better" model, but because it's the right model for the task.

Ask Sonnet to explain the relationship between three interconnected microservices using only code summaries and ADRs as evidence — and it earns its cost. That's reasoning. That's synthesis. That's where the extra capability matters.

The most expensive model isn't the best model. The right model is the best model.
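The framework above reduces to a small routing table: classify the task, then select the model class. A minimal sketch in Python, where the model IDs are placeholders rather than exact Bedrock identifiers (check the model catalog for your region):

```python
# Route by cognitive task, not price tier. IDs below are illustrative.
MODEL_BY_TASK = {
    "retrieval": "anthropic.claude-haiku-4-5-v1:0",    # extract and format
    "reasoning": "anthropic.claude-sonnet-4-6-v1:0",   # connect and infer
    "architecture": "anthropic.claude-opus-4-v1:0",    # novel design work
}

def pick_model(task_type: str) -> str:
    """Return the model ID for a given cognitive task type."""
    try:
        return MODEL_BY_TASK[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}")
```

The point of making this a table rather than a hardcoded ID is that the pipeline stays model-agnostic: the sales assistant routes to "retrieval", a cross-service explanation routes to "reasoning", and the prompts stay simple.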

What the Open-Source Models Tell Us

Reddit specifically asked about open-source models. Here's the honest answer: on a retrieval task with well-engineered context, they're fine. Not great. Fine.

Qwen3 235B, DeepSeek R1, DeepSeek V3.2, GPT OSS 120B — they all produced usable first-email responses. A sales rep could send any of them. But none of them anticipated the follow-up. None of them built guardrails. None of them surfaced competitive positioning that wasn't explicitly asked for.

The context contained all of that information. These models read it and returned the minimum viable answer. Haiku read it and returned the maximum useful answer.

For internal tools where "good enough" works? Open-source models are viable. For customer-facing workflows where the first response determines the outcome? The extraction gap matters.

A note on DeepSeek R1

R1 is a reasoning model — chain-of-thought, deliberate thinking. On a retrieval task, that reasoning overhead doesn't help. It's like using a chess engine to look up a phone number. The engine is brilliant at chess; it's not better at phone books. R1 would likely shine on our reasoning tasks. On retrieval, it's average.

About Llama 4 Maverick

We expected more. Meta's newest model, Mixture of Experts architecture, 17B active parameters. Fastest response time in the test at 4.7 seconds. And the weakest context extraction across all eight models.

It produced a generic email that barely referenced the 50,000+ tokens of context it received. The deployment model fact was there. The compliance positioning was absent. The competitive intelligence was ignored. The framework-specific details were skipped.

Speed without extraction is just fast nothing.

The Real Takeaway

The first blog proved that context engineering raises the floor. This follow-up proves something more nuanced:

1. Context engineering is still the foundation.

Every model — even Llama 4 Maverick — got the core fact right because the context contained it. Without context, they all hallucinate. With context, they all produce something usable. The floor is real.

2. More expensive doesn't mean better.

Haiku outperformed Sonnet on a retrieval task. Sonnet's extra reasoning capability didn't translate to better extraction — it translated to a more cautious, less comprehensive response. The model was overqualified for the task.

3. Match the model to the cognitive task.

Retrieval tasks need extraction capability, not reasoning power. Reasoning tasks need synthesis capability, not just speed. The right model isn't the most expensive one — it's the one whose strengths align with what you're asking it to do.

Eight models. Five providers. One pipeline. One question. The data is in the audit table.

Stop picking models by price tier. Start picking them by task type.

See the Full Outputs

Want to see all eight unredacted responses side-by-side? We'll walk you through:

  • The complete output from every model
  • How to identify retrieval vs. reasoning tasks in your workflows
  • A model selection framework for your Bedrock deployment
  • What this looks like applied to your engineering team

Context engineering raises the floor. The task type determines the model. Choose accordingly.

Eight models. One pipeline. The cheapest Claude model won. Not because it's the best model — because it was the right model.
