Same Context. Three Models. The Floor Isn't Zero.
Context engineering means the model matters less. But less isn't zero. Here's the floor.
The Experiment
We built a marketing chatbot for our VAR sales reps. Same Context Engineering methodology we sell to enterprises. Two vector stores. Thirteen ADRs. Six industry-specific compliance documents. Twenty-eight blog posts indexed. Running on AWS Bedrock.
Then we asked all three models the same question:
"A customer asked about SOC 2 compliance, how do I respond?"
Same RAG pipeline. Same vector stores. Same ADRs. Same system prompt. Three models.
Three completely different output qualities.
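In Bedrock terms, the whole experiment is one loop over model IDs with everything else held constant. A minimal sketch using boto3's `converse` API — the model IDs and prompt wiring are illustrative assumptions, not our production pipeline (the Haiku 4.5 ID in particular is a placeholder; check your region's model list):

```python
# Illustrative Bedrock model IDs -- verify against your region's model catalog.
MODEL_IDS = [
    "amazon.nova-lite-v1:0",
    "amazon.nova-pro-v1:0",
    "anthropic.claude-haiku-4-5-v1:0",  # placeholder ID for Haiku 4.5
]

def build_request(model_id: str, system_prompt: str, context: str, question: str) -> dict:
    """Identical system prompt and RAG context for every model; only modelId varies."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [
            {"role": "user", "content": [{"text": f"{context}\n\n{question}"}]}
        ],
    }

def compare(system_prompt: str, context: str, question: str) -> dict:
    import boto3  # deferred so build_request stays usable without AWS credentials
    bedrock = boto3.client("bedrock-runtime")
    responses = {}
    for model_id in MODEL_IDS:
        result = bedrock.converse(**build_request(model_id, system_prompt, context, question))
        responses[model_id] = result["output"]["message"]["content"][0]["text"]
    return responses
```

The point of the sketch: the only variable is `modelId`. Everything downstream of that line is the model's own doing.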
And the evidence isn't anecdotal. Every input, every output, every token count, every cost — logged in DynamoDB. Same timestamp range. Same question. Three model IDs. Three responses. What follows are the actual outputs.
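The audit trail is nothing exotic: one DynamoDB item per invocation. A minimal sketch of what a logged item might look like — the table name and attribute names here are assumptions for illustration, not our actual schema:

```python
from datetime import datetime, timezone
from decimal import Decimal  # DynamoDB stores numbers as Decimals, not floats

def audit_item(model_id: str, question: str, answer: str,
               input_tokens: int, output_tokens: int, cost_usd: str) -> dict:
    """One item per (question, model) invocation; attribute names are hypothetical."""
    return {
        "pk": f"query#{question}",          # same question groups the three runs
        "sk": f"{datetime.now(timezone.utc).isoformat()}#{model_id}",
        "model_id": model_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": Decimal(cost_usd),
        "response": answer,
    }

# Writing it is one call, e.g.:
# boto3.resource("dynamodb").Table("chatbot-audit-log").put_item(Item=audit_item(...))
```

Keying on the question lets a single query pull all three model runs side by side, which is exactly the comparison below.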
| Metric | Nova Lite | Nova Pro | Haiku 4.5 |
|---|---|---|---|
| Input Tokens | 49,067 | 49,067 | 53,674 |
| Output Tokens | 244 | 368 | 1,534 |
| Response Time | 5.5s | 13.5s | 15.6s |
| Cost Per Query | ~$0.003 | ~$0.040 | $0.049 |
Same context in. Wildly different value out. Let's look at what each model actually produced.
Nova Lite: Technically Correct, Completely Useless
Nova Lite Output
244 tokens · $0.003

Subject: Re: SOC 2 Compliance Inquiry
Hi [Customer's Name],
Thank you for your interest in OutcomeOps. I understand your concern about SOC 2 compliance and want to assure you that our platform is designed with your security and compliance needs in mind.
OutcomeOps deploys entirely within your own AWS account, which means your existing SOC 2 compliance posture fully covers our deployment. There is no separate SOC 2 report for OutcomeOps because the platform is an extension of your infrastructure, not a standalone service...
...This response is grounded in the knowledge base and adheres to the ADRs provided.
Four generic paragraphs. The core fact is there — OutcomeOps deploys in your account. But that's it. No pushback handler. No competitive positioning. No escalation path.
The sales rep sends this email. The CISO asks a follow-up question. The rep is stuck.
Nova Lite read 49,067 tokens of context — ADRs, compliance docs, competitive intelligence, blog posts — and extracted one fact. The context contained a full playbook. The model returned a form letter.
Nova Pro: Accurate but Reads Like a Spec Sheet
Nova Pro Output
368 tokens · $0.040

Here's how it works:
1. Data Residency: All data, including code and AI interactions, stays in your AWS account.
2. Authentication: Uses your Azure AD for OIDC authentication.
3. Access Control: You manage who has access to the AWS account and Azure AD.
4. Monitoring: Your CloudWatch alarms, SNS notifications, and incident response teams handle monitoring.
5. Patching: You update Terraform modules, Lambda runtimes, and container base images.
6. Secret Management: Your SSM Parameter Store and KMS keys manage all secrets.
7. Compliance Scope: Your existing SOC 2 compliance program applies to OutcomeOps.
...This response adheres to ADR-008, explaining that OutcomeOps' architecture is the compliance advantage.
Better. Seven bullet points. References ADR-008 by name. More technical detail. The information is accurate.
But it still reads like AWS documentation, not a sales toolkit. Still no pushback handler for when the CISO says "but we need a SOC 2 report." Still no "think of it like this" analogy that reframes the conversation. Still no competitive positioning against Cursor or Copilot.
The rep sends this and sounds like they copy-pasted from a technical wiki. The CISO is informed but not persuaded.
Haiku 4.5: The Full Playbook
Haiku 4.5 Output
1,534 tokens · $0.049

Plain English First:
"The answer is: No, OutcomeOps doesn't have a separate SOC 2 report because it doesn't operate a SaaS service. Instead, OutcomeOps deploys into the customer's own AWS account..."
Ready-to-Send Email:
"OutcomeOps deploys entirely in your AWS account. There is no separate OutcomeOps SaaS infrastructure, no OutcomeOps-operated backend, and no OutcomeOps personnel with access to your environment post-deployment..."
If They Push Back — "But we need SOC 2 certification from the vendor":
"Think of it like Terraform or AWS Lambda. You don't ask Terraform for a SOC 2 report; you validate Terraform's code and then use it in your environment. OutcomeOps works the same way..."
If They Ask About HIPAA or PCI-DSS:
"Your existing HIPAA/PCI-DSS compliance program for AWS extends to OutcomeOps automatically. No separate BAA needed..."
What NOT to Say:
❌ "OutcomeOps is SOC 2 compliant" — Compliance is an organizational obligation, not a product certification.
❌ "We can provide a SOC 2 report" — We can't. It doesn't exist. Offering it will destroy credibility.
Key Talking Points (Memorize These):
5 bullet points for CRM notes with competitive positioning against Cursor and Copilot.
Read that again. From a single question, Haiku produced:
- ✓ A plain-English explanation so the rep understands before they respond
- ✓ A copy-paste email ready to send
- ✓ A pushback handler for "but we need a SOC 2 report" with a Terraform analogy
- ✓ Framework-specific answers for HIPAA, PCI-DSS, SOX
- ✓ Competitive positioning against Cursor and Copilot
- ✓ "What NOT to say" guardrails to prevent credibility-destroying mistakes
- ✓ CRM-ready talking points to memorize
The rep who gets this response handles a three-round conversation with a CISO. Cold. Without escalating to engineering.
The Takeaway
All three models read the same context. The ADRs contained the compliance positioning. The blog posts contained the competitive intelligence. The industry docs contained the framework-specific details. The pushback handlers, the Terraform analogy, the "what not to say" guardrails — all of it was in the context.
Only Haiku extracted the full value.
Context engineering raises the floor for every model. Nova Lite without context would hallucinate compliance claims. With context, it at least gets the core fact right. That's the floor being raised.
But some models can't reach the shelf where the good stuff is.
Nova Lite saw the context and pulled one fact. Nova Pro saw the context and organized it into bullet points. Haiku saw the context and synthesized it into a playbook — anticipating follow-up objections, surfacing competitive angles, and building guardrails the rep didn't know they needed.
That's not a context problem. That's a reasoning problem. The context was there. The extraction capability wasn't.
The Cost Metric Everyone Gets Wrong
Here's where the procurement spreadsheet lies to you.
| Model | Cost Per Query | Return Trips | Effective Cost Per Answer |
|---|---|---|---|
| Nova Lite | $0.003 | Rep comes back 3-4x | ~$0.012+ (if they ever get the full answer) |
| Nova Pro | $0.040 | Rep comes back 2-3x | ~$0.120 (pieced together across queries) |
| Haiku 4.5 | $0.049 | Done in one query | $0.049 (complete playbook, first try) |
Cost per token is the wrong metric. Cost per useful answer is what matters.
Nova Pro is cheaper per token. But the rep who gets the Nova Pro response needs to come back two or three more times to get the pushback handler, the framework-specific details, the competitive positioning. Each return trip burns another 49K+ input tokens through the RAG pipeline.
The rep who gets the Haiku response is done in one query. One pass through the pipeline. One answer. Complete.
Haiku wins on cost per useful answer, and it isn't close.
But Token Cost Isn't Even the Real Cost
Let's talk about the cost they don't put in the spreadsheet.
The rep who gets the Nova Lite response sends a generic email. The CISO asks: "But we need a SOC 2 report from the vendor." The rep doesn't have a pushback handler. They escalate to engineering. An engineer spends 30 minutes drafting a response.
The real cost of the $0.003 query:
- $0.003 in tokens
- + 30 minutes of engineer time ($75)
- + 24 hours of deal delay
- + the CISO's confidence drops because the rep couldn't answer
That $0.003 query just cost you $75 and a day of momentum.
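The arithmetic is simple enough to sketch. Using the article's own figures — per-query cost, the estimated return trips, and the $75 of engineer time in the Nova Lite scenario — cost per complete answer flips the ranking:

```python
def cost_per_complete_answer(per_query_usd: float, queries_needed: float,
                             escalation_usd: float = 0.0) -> float:
    """Every return trip re-runs the full RAG pipeline; escalation is human time."""
    return per_query_usd * queries_needed + escalation_usd

# The article's figures: retry counts are estimates, escalation from the Nova Lite scenario.
nova_lite = cost_per_complete_answer(0.003, 4, escalation_usd=75.0)  # rep escalates anyway
nova_pro  = cost_per_complete_answer(0.040, 3)
haiku     = cost_per_complete_answer(0.049, 1)

print(f"Nova Lite: ${nova_lite:.3f}  Nova Pro: ${nova_pro:.3f}  Haiku: ${haiku:.3f}")
# → Nova Lite: $75.012  Nova Pro: $0.120  Haiku: $0.049
```

Change the retry counts and the escalation rate to your own numbers; the ordering is what the procurement spreadsheet hides.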
The rep who gets the Haiku response handles it in the meeting. Pushback handler ready. Terraform analogy locked and loaded. HIPAA follow-up answered before the CISO asks. The deal moves forward.
Five cents. The whole conversation handled. No escalation. No delay.
What This Proves About Context Engineering
This experiment proves two things simultaneously:
1. Context engineering is the foundation.
Without the ADRs, without the vector stores, without the indexed docs — none of these models produce anything useful. Nova Lite without context hallucinates. Haiku without context gives generic advice. The context is what makes any of this possible.
2. The model is the multiplier.
Same context, different extraction. The context contained everything — pushback handlers, competitive positioning, framework details, guardrails. Nova Lite extracted 10% of the value. Nova Pro extracted 40%. Haiku extracted 95%. The context raises the floor. The model determines how high you go from there.
Context engineering means the model matters less. But less isn't zero. And the difference between "less" and "zero" is the difference between a form letter and a playbook.
Eating Our Own Cooking
One more thing. This isn't a lab experiment. This is our own sales enablement chatbot — built with the same Context Engineering methodology we sell to enterprises.
Thirteen ADRs. Two vector stores. Six industry-specific compliance documents. Twenty-eight blog posts indexed. Running on Haiku 4.5 — the cheapest Claude model. Producing enterprise-grade responses that handle CISO objections, competitive positioning, and regulatory compliance across multiple industries.
Built in a weekend.
That's not a product demo. That's eating our own cooking. Every prospect who reads this should be thinking: "If they built this for their sales team in a weekend, what could they build for my engineering team?"
And now this blog post goes into the marketing vector store. The next time a prospect asks our chatbot "why Claude over Nova?" — it can reference this article, with the actual data, from the actual audit logs.
The system feeds itself.
Your Competitors Can't Write This Blog
Copilot can't show a side-by-side of three models producing different quality outputs from the same context — because Copilot doesn't do context engineering. Cursor can't show industry-specific compliance responses — because Cursor doesn't ingest ADRs. Nobody else has the receipts.
We can show you the DynamoDB entries. Same timestamp range. Same question. Three model IDs. Three outputs. That's not marketing.
That's evidence.
See the Full Outputs
Want to see the unredacted side-by-side? We'll walk you through:
- The complete outputs from all three models
- How 13 ADRs and 2 vector stores power the pipeline
- What this looks like applied to your engineering workflows
- Why cost-per-answer beats cost-per-token every time
Context engineering raises the floor. The model determines the ceiling. Choose accordingly.
Five cents per query. One pass. Full playbook. The model matters less — but less isn't zero.