RAG Has a Code Blind Spot: Why OutcomeOps Runs Two Retrieval Modes and Lets a Router Pick Per Query

Brian Carpio

RAG is everywhere in 2026. It is the default for any AI coding tool that wants to ground its answers in your codebase, and the default is mostly correct. RAG is good. But it has a blind spot — and the blind spot shows up at exactly the moments retrieval matters most: when an engineer asks which services call this handler, or every class that extends this base type, or every consumer of this shared library I am about to refactor. A summary-based retrieval system gives you what was documented yesterday. A code knowledge graph gives you what the AST says today. OutcomeOps now runs both, and a router decides per query which one (or both) to use.

This post explains what each retrieval mode is good at, where each one fails, why we did not replace RAG with a graph, and how the router shows up across Chat, PR review, and code generation without the engineer ever seeing it.

What OutcomeOps RAG Already Does Well

OutcomeOps RAG is not just embeddings over raw source files. The platform generates code-maps: LLM-produced summaries of every service, every handler, and every shared library, written from both technical and business angles. Those summaries get embedded, weighted, and stored alongside ADRs (architecture decision records) and README content, then retrieved at query time. We covered the underlying pattern in Self-Documenting Architecture: When Code Becomes Queryable.

Code-maps are extraordinary for the kind of question that asks for an application graph: which services participate in the order-fulfillment flow, how does the billing pipeline talk to the audit log, which area of the codebase handles tenant isolation. The retrieval returns summaries that an LLM can reason over, and the answer comes back grounded in language a human wrote (or that the system wrote and a human reviewed). This is what RAG was designed for, and OutcomeOps has been shipping it since the early days of the platform.

The Blind Spot

Code-maps are summaries. Two things follow from that:

  • They lag the source. A summary was correct when it was generated. Between then and now, three commits added two new callers and renamed a function. The next code-map regeneration will catch up; the next retrieval before that regeneration will not.
  • They are not exhaustive. A summary describes a handler at a level of abstraction that loses individual call sites. A summary of a shared library mentions typical consumers, not every consumer. A summary of a base class names the obvious subclasses, not the obscure one in a service nobody’s touched in eighteen months.
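
One way to make the lag visible is to record, next to each summary, a hash of the source it was generated from, and compare that hash against the current file before trusting the summary. The sketch below is illustrative only: the `CodeMap` record, its fields, and `stale_code_maps` are hypothetical names, not a description of how OutcomeOps stores code-maps.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeMap:
    """Hypothetical record: a summary plus the hash of the source it describes."""
    path: str
    summary: str
    source_sha256: str


def content_hash(source: str) -> str:
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def stale_code_maps(code_maps: list[CodeMap], current_sources: dict[str, str]) -> list[str]:
    """Return paths whose summary no longer matches the source (or whose file is gone)."""
    stale = []
    for code_map in code_maps:
        current = current_sources.get(code_map.path)
        if current is None or content_hash(current) != code_map.source_sha256:
            stale.append(code_map.path)
    return stale


# The summary was generated before a function was renamed.
old_source = "def format_iso(ts):\n    return ts.isoformat()\n"
new_source = "def to_iso(ts):\n    return ts.isoformat()\n"
maps = [
    CodeMap("util/timestamps.py", "Formats timestamps as ISO 8601.", content_hash(old_source)),
    CodeMap("util/ids.py", "Generates request ids.", content_hash("def new_id(): ...\n")),
]
sources = {"util/timestamps.py": new_source, "util/ids.py": "def new_id(): ...\n"}
stale = stale_code_maps(maps, sources)
```

A check like this tells you a summary is out of date. It does not tell you which callers were added in the meantime, which is the gap the rest of this post is about.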

Both gaps stop mattering for application-graph questions, where you want the gestalt. Both gaps start mattering immediately for symbol-level questions, where you want completeness. Every caller of util.timestamps.format_iso matters when you are about to change the function signature. Every class that extends BaseEventHandler matters when you are about to add a required method. Summaries cannot promise completeness, and a refactor that misses a single caller is the kind of bug that ships to production and gets discovered six weeks later by a confused on-call.

What a Code Knowledge Graph Adds

A code knowledge graph parses the source directly — AST-level — and produces a structured database of typed nodes and edges. Files, directories, classes, functions, modules become nodes. The relationships between them become edges: calls, inherits-from, imports, implements, references. The graph is rebuilt continuously as code changes, so the answer to every caller of this function is exact and current at query time.
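
To make the node-and-edge idea concrete, here is a minimal sketch that uses Python's standard `ast` module to pull calls, inherits-from, and imports edges out of a single module. It is an illustration under simplifying assumptions, not the OutcomeOps parser: `extract_edges` is a hypothetical helper, targets are recorded as they are written in the source rather than resolved to fully qualified symbols, and methods are named by module rather than by class.

```python
import ast


def extract_edges(source: str, module: str) -> list[tuple[str, str, str]]:
    """Return (source_node, edge_type, target_name) triples for one module."""
    tree = ast.parse(source)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # One inherits-from edge per declared base class.
            for base in node.bases:
                edges.append((f"{module}.{node.name}", "inherits-from", ast.unparse(base)))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # One calls edge per call expression found inside the function body.
            caller = f"{module}.{node.name}"
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    edges.append((caller, "calls", ast.unparse(inner.func)))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                edges.append((module, "imports", alias.name))
        elif isinstance(node, ast.ImportFrom):
            edges.append((module, "imports", node.module or "."))
    return edges


SAMPLE = '''
from util.timestamps import format_iso

class OrderHandler(BaseEventHandler):
    def handle(self, event):
        return format_iso(event.created_at)
'''

sample_edges = extract_edges(SAMPLE, "orders.handler")
```

A production graph would also need name resolution across modules and parsers for each language in the codebase, but the shape of the output (typed edges between named nodes) is the same.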

The graph does not replace summaries. It complements them. A graph traversal can tell you with certainty that UserCreated is consumed by sixteen handlers across four services, but it cannot tell you why the system is shaped that way — that explanation lives in the ADRs and code-maps the RAG layer indexes. The graph is the structural ground truth; the RAG layer is the intent and reasoning. Together, they cover the spectrum of questions an engineer actually asks.

For an enterprise audience, the operational properties of a code knowledge graph also matter. The graph is rebuilt from the source, so there is no separate corpus to keep in sync, no drift between “what we documented” and “what we shipped.” The graph is queryable in the same VPC the rest of OutcomeOps runs in — we covered that deployment posture in AI Coding Tool That Deploys in Your AWS Account. And because both RAG and the graph live inside the customer’s AWS account, the audit trail of what got retrieved for any given query stays where compliance can query it directly.

Why We Added Both, Not Replaced RAG

The lazy version of this post would be “graphs beat RAG, switch.” That is wrong. A pure-graph system handles symbol-level questions beautifully and gives terrible answers to architectural questions because the response is a list of edges instead of a coherent narrative. A pure-RAG system handles narrative questions beautifully and misses callers because the underlying summaries are not exhaustive. The two systems have different failure modes, and the failure modes are complementary.

Hybrid retrieval, RAG plus graph, is the architectural answer. It is also what we expect every serious code-AI platform to converge on over the next two years, because neither mode handles every question shape better than the other. The interesting engineering question is not which mode to use; it is how to decide per query.

The Router: A Classifier Per Query

OutcomeOps puts a small classifier in front of every retrieval call. The classifier looks at the incoming query — whether that is a chat message, a PR diff being analyzed, or a code-generation task — and picks one of three modes:

  • RAG only — for questions that want a narrative or an architectural overview. How does the auth flow work? Which area of the codebase owns billing?
  • Graph only — for questions that want exact symbol traversal. Every caller of this function. Every subclass of this handler. Every import of this module.
  • Both — for hybrid questions where the engineer wants the structural answer and the architectural reasoning behind it. Why does the recommendation handler dispatch through SQS instead of calling the embedder directly, and what calls the dispatcher today?
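
The three-way decision above can be sketched with a toy rule-based router. This is a stand-in for illustration only: the post does not describe how the OutcomeOps classifier works internally, and the `Mode` enum, the cue patterns, and `route` are all hypothetical. A real classifier would more plausibly be a small model than a pair of regular expressions.

```python
import re
from enum import Enum


class Mode(Enum):
    RAG = "rag"
    GRAPH = "graph"
    BOTH = "both"


# Hypothetical cues: phrases that ask for an exhaustive list of symbols...
SYMBOL_CUES = re.compile(
    r"\b(every|all)\s+(callers?|subclass(es)?|imports?|consumers?)\b|\b(what|who) calls\b",
    re.IGNORECASE,
)
# ...and phrases that ask for explanation or an architectural overview.
NARRATIVE_CUES = re.compile(
    r"\b(why|how does|how do|which area|overview|architecture)\b",
    re.IGNORECASE,
)


def route(query: str) -> Mode:
    """Pick a retrieval mode; fall back to RAG when no symbol cue is present."""
    wants_symbols = bool(SYMBOL_CUES.search(query))
    wants_narrative = bool(NARRATIVE_CUES.search(query))
    if wants_symbols and wants_narrative:
        return Mode.BOTH
    if wants_symbols:
        return Mode.GRAPH
    return Mode.RAG
```

With these cues, "How does the auth flow work?" routes to RAG, "Every caller of this function" routes to the graph, and the SQS dispatcher question routes to both, because it contains a "why" and a "what calls".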

The engineer never sees the routing decision. They get an answer; the answer is grounded in the right kind of retrieval; the citations come back pointing at either the code-map summaries, the source files the graph traversed, or both. Hiding the routing is the point of the design. Forcing engineers to know which retrieval primitive is best for their question is exactly the kind of cognitive overhead that AI tooling is supposed to remove.

Where the Router Lives

The router is not a chat-only feature. It is wired into every place OutcomeOps reasons about code:

Chat

The most obvious surface. An engineer types a question, the router picks the retrieval mode, the chat handler returns a grounded answer with citations. This is where the routing pays back the most for individual users — the same chat box answers how does the order-fulfillment pipeline work with a narrative and every caller of OrderRepository.save with a complete list, and the engineer never has to explain to the system which kind of question they are asking.

PR review

When OutcomeOps analyzes a pull request, the structural-review pass needs to know exactly what a change touches. A diff that modifies a shared library function needs the graph to enumerate every consumer; a diff that adds a new method to a base class needs the graph to enumerate every subclass. The graph is the only retrieval mode that can promise completeness here, and incomplete coverage in PR review is how regressions ship. The router routes the structural pass to the graph and the contextual pass (does this change line up with our ADRs?) to RAG.
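
The structural pass amounts to a reverse traversal: start at the changed symbol and follow edges backwards until nothing new turns up. The sketch below assumes the graph is available as a list of (source, edge type, target) triples. The `impacted_by` helper and the sample edges are hypothetical and say nothing about how OutcomeOps stores or queries its graph.

```python
from collections import defaultdict, deque

Edge = tuple[str, str, str]  # (source, edge_type, target)


def impacted_by(changed: str, edges: list[Edge], edge_types: set[str]) -> set[str]:
    """Return every node that reaches `changed`, directly or transitively, via the given edge types."""
    reverse = defaultdict(set)
    for src, kind, dst in edges:
        if kind in edge_types:
            reverse[dst].add(src)

    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {changed}  # the changed symbol is not its own consumer


EDGES = [
    ("billing.invoice.render", "calls", "util.timestamps.format_iso"),
    ("audit.log.write", "calls", "util.timestamps.format_iso"),
    ("api.invoices.get", "calls", "billing.invoice.render"),
    ("orders.OrderHandler", "inherits-from", "BaseEventHandler"),
    ("orders.RushOrderHandler", "inherits-from", "orders.OrderHandler"),
]

callers = impacted_by("util.timestamps.format_iso", EDGES, {"calls"})
subclasses = impacted_by("BaseEventHandler", EDGES, {"inherits-from"})
```

Here `callers` contains the two direct callers plus `api.invoices.get`, which reaches the function through `billing.invoice.render`, and `subclasses` contains both the direct and the indirect subclass. Whether a review should stop at direct consumers or follow the chain is a policy choice; the traversal supports either.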

Code generation

Before generating new code, OutcomeOps runs an impact-analysis pass — what does this change affect? The router pulls graph data to enumerate exact dependencies and RAG data to retrieve relevant ADRs and patterns. The generated code arrives already grounded in both what the codebase actually looks like right now and what your team has decided about how this kind of code should be written. The combination is what produces the first-pass production-ready output rates we have been writing about since the early days of the platform — we covered the ADR side of that loop in How 3 ADRs Changed Everything: The Spring PetClinic Proof.

What This Means for Buyers

Two things, primarily.

Refactor confidence. The single biggest source of preventable bugs in AI-assisted refactoring is incomplete consumer enumeration. The model changes a function signature, the test suite passes for the modified call sites, and a forgotten caller breaks in production three days later. Code-knowledge-graph retrieval makes this class of bug significantly less likely — the model knows up front that there are seventeen callers, not the four it remembered from the last summary regeneration. PR review surfaces them; code generation accounts for them.

Better architectural reasoning. When a chat question or a PR review needs both structural ground truth and architectural intent — does this change line up with our pattern, does it cover all the consumers, does it respect the relevant ADR — the router pulls both and the answer combines them. This is the kind of answer that gets a senior engineer to nod, not the kind that triggers a comment chain about what the AI obviously missed.

The Anticipated Objection: “Just Build a Better RAG”

Reasonable take, and worth addressing. You can absolutely improve RAG retrieval — better chunking, better metadata weighting, fresher summary regeneration, larger context windows, more aggressive reranking. All of those help. None of them give you the property a graph traversal gives you for free, which is completeness. A summary describes; a graph enumerates. Even a perfect RAG system that never misses a relevant chunk cannot promise that a particular function has exactly seventeen callers and here they are. The graph can. For symbol-level questions, that completeness guarantee is what you actually want.

The right framing is not “RAG is broken, replace it.” It is “RAG and graphs solve different problems and the platform should run both.”

When This Doesn’t Help

Hybrid retrieval is not free. There is operational cost to maintaining a code knowledge graph alongside the embedding store, and the router itself is one more component to monitor. For tiny single-repo teams, RAG with code-maps is sufficient and the graph is overhead. The pattern starts paying back when the codebase has enough shared libraries, base classes, or cross-service dependencies that completeness becomes a real concern — typically 20+ engineers, multi-repo, or any codebase old enough that nobody confidently knows where every consumer of a critical utility lives.

If you are not in that environment, RAG alone is the right call. If you are, the graph is the difference between PR review that sometimes misses things and PR review that does not.

How to Evaluate

The two-week proof of concept exercises both retrieval modes. Apply the Terraform into a non-production AWS account and connect 20 representative repositories, ideally with at least one shared library and one base class hierarchy worth caring about. Then ask chat about the same area of the code twice: once as a narrative question and once as a symbol question. Watch the router pick. Inspect the graph traversal output and the RAG citations. Verify the audit log captured both.

Book an enterprise briefing to start the PoC, or run the five-minute Readiness Assessment to get a written report on where your organization sits before scheduling.
