Why OutcomeOps Doesn’t Use DynamoDB Global Tables: How We Survive a Region-Wide AWS Outage
When an AWS region degrades, the teams responding to it aren’t just trying to keep their apps running. They’re trying to figure out which apps depend on the failed region, what the blast radius looks like, which on-call gets paged, and what the architectural workaround is. Increasingly they query a knowledge platform to get those answers — ADRs, code maps, dependency graphs, runbook summaries. The platform that holds the map of how your systems work has to stay up when those systems are misbehaving, because that’s exactly when teams need the map.
That is why OutcomeOps ships with multi-region support. If we deployed into a single AWS region and that region had a bad day, your engineering, security, and architecture teams would lose the queryable view of their own infrastructure at the worst possible moment. The October 20, 2025 us-east-1 event was full of teams rediscovering this lesson the hard way, and full of services that failed because they depended on DynamoDB Global Tables, a cross-region service whose own control plane went down. We deliberately avoided that pattern. This post walks through the simpler architecture we run instead.
OutcomeOps uses Lambda dual-writes to DynamoDB and S3 Vectors in both regions. Customer-managed DNS pointing at two stable per-region endpoints. AWS AppConfig deciding which region’s scheduled jobs do the work so the same ingestion doesn’t happen twice. Human-in-the-loop failover via DNS update or a Slack/Teams announcement. No managed cross-region service in the dependency graph.
The Pattern That Failed in October 2025
DynamoDB Global Tables solve a real problem: take a regional database and make it look multi-regional with last-writer-wins replication handled by AWS. They work well in the common case. The failure mode that bit everyone in October 2025 was that the control plane responsible for coordinating that replication was itself centralized. When the control plane degraded, every Global Table degraded with it — not just the ones in the affected region. The pattern that’s supposed to make you multi-region depended on a service that wasn’t.
That’s the architectural lesson worth carrying forward: any managed “global” AWS service has a regional control plane somewhere. CloudFront has us-east-1. IAM has us-east-1. Route 53 health checks have a regional brain. If your multi-region story depends on AWS getting every one of those control planes right under simultaneous load, you have a multi-region story that fails on the worst possible day.
What OutcomeOps Actually Runs
The architecture is deliberately simple. Five components, no managed cross-region services.
1. Lambda dual-writes to DynamoDB
Every write to the workspace metadata table, the audit log table, and the code-graph (knowledge-graph) tables for ingested repos is performed by the same Lambda invocation against both regions. Each write is atomic within its region; the two writes run sequentially inside the Lambda. If both succeed, the ingestion cycle is acknowledged. If either fails, the cycle is not acknowledged and retries on the next scheduled run, and per-write observability shows exactly which region drifted. RPO for any acknowledged write is zero because both regions have the data before the job is marked complete.
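To make the acknowledge-or-retry rule concrete, here is a minimal Python sketch of a sequential dual-write. It is an illustration, not our production Lambda: the names dual_write and DualWriteResult are hypothetical, the region names in the usage note are placeholders, and the choice to keep going after the first region fails (so the result records both outcomes) is an assumption about how the per-write observability might be fed.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Protocol


class TableWriter(Protocol):
    """Anything exposing a boto3-Table-style put_item method."""

    def put_item(self, *, Item: Mapping[str, Any]) -> Any: ...


@dataclass
class DualWriteResult:
    succeeded: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # region -> error text

    @property
    def acknowledged(self) -> bool:
        # Acknowledge only when every region accepted the write.
        return not self.failed


def dual_write(tables: Mapping[str, TableWriter], item: Mapping[str, Any]) -> DualWriteResult:
    """Write `item` to each region's table in turn, recording per-region outcomes."""
    result = DualWriteResult()
    for region, table in tables.items():
        try:
            table.put_item(Item=item)
            result.succeeded.append(region)
        except Exception as exc:  # sketch only: real code would catch botocore's ClientError
            # Assumption: still attempt the remaining region so the result
            # shows exactly which side drifted.
            result.failed[region] = repr(exc)
    return result
```

With boto3, the tables mapping could be built from boto3.resource("dynamodb", region_name=region).Table(table_name) for each region. A caller would mark the ingestion cycle complete only when result.acknowledged is true, and otherwise leave it for the next scheduled run.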
2. Lambda dual-writes to S3 Vectors
Same pattern for the vector store. The Lambda that processes code-maps and ingests documents from GitHub, GitLab, Confluence, SharePoint, Jira, and the rest of the integration list writes the embeddings to S3 Vectors in both regions before completing. No async replication, no cross-region control plane, and no S3 Cross-Region Replication, which introduces its own failure modes. Once a cycle is acknowledged, the customer’s ingested corpus is identical in both regions.
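Retrying a whole cycle only keeps the two regions identical if a repeated write lands on the same record. One way to get that property is a deterministic vector key, sketched below. This is an assumption about how keys could be derived, not a description of our schema: vector_key and vector_record are hypothetical helpers, the metadata fields are illustrative, and the record shape follows the S3 Vectors PutVectors request as we understand it (key, data.float32, metadata), which is worth checking against the current API reference. It also assumes that putting a vector under an existing key overwrites it.

```python
import hashlib
from typing import Any, Sequence


def vector_key(source: str, document_id: str, chunk_index: int) -> str:
    """Derive a stable key so a retried ingestion overwrites instead of duplicating."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    raw = "\x1f".join([source, document_id, str(chunk_index)]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def vector_record(
    source: str, document_id: str, chunk_index: int, embedding: Sequence[float]
) -> dict[str, Any]:
    """Build one record in the assumed PutVectors shape."""
    return {
        "key": vector_key(source, document_id, chunk_index),
        "data": {"float32": [float(x) for x in embedding]},
        "metadata": {
            "source": source,
            "document_id": document_id,
            "chunk_index": chunk_index,
        },
    }
```

The same list of records can then be sent to each region's index in turn, following the dual-write pattern described for DynamoDB.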
3. Customer-managed DNS, two stable endpoints
Most enterprise customers run their own internal DNS through their platform team and don’t want a per-AI-tool dependency on Route 53. The deployment exposes two stable per-region endpoints — for example outcomeops1.company-internal.com and outcomeops2.company-internal.com — each pointing at one region’s internal-only ALB. The customer points outcomeops.company-internal.com at whichever region is currently active, using whatever DNS provider they already have. Users in a region that’s closer to the secondary endpoint can use it directly — both regions are continuously serving the same data, so there’s no “wrong” region to hit.
4. AWS AppConfig as the per-region schedule gate
One of the harder problems in active-active is preventing duplicate work for scheduled jobs. EventBridge fires the hourly ingestion sync in both regions; without coordination, every repo would be ingested twice. AppConfig holds a per-region is_active_for_scheduled_work flag that the ingestion Lambda checks at the top of every invocation. The active region does the work; the passive region’s Lambda is invoked, checks the flag, and exits cleanly. Failover means flipping the flag in both regions: one AppConfig update per region, no infrastructure changes.
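A minimal sketch of that gate in Python follows. Several things here are assumptions rather than our implementation: the configuration is taken to be a freeform JSON document such as {"is_active_for_scheduled_work": true}, make_handler and fetch_flags_from_appconfig are hypothetical names, and the gate fails closed (skips the work) when the flag cannot be read. The boto3 calls are the AppConfig Data API's start_configuration_session and get_latest_configuration.

```python
import json
from typing import Any, Callable, Mapping

FLAG_NAME = "is_active_for_scheduled_work"  # flag name as given in this post


def fetch_flags_from_appconfig(application: str, environment: str, profile: str) -> Mapping[str, Any]:
    """Read the current configuration via the AppConfig Data API (requires boto3)."""
    import boto3  # imported lazily so the gate logic can be exercised without AWS

    client = boto3.client("appconfigdata")
    session = client.start_configuration_session(
        ApplicationIdentifier=application,
        EnvironmentIdentifier=environment,
        ConfigurationProfileIdentifier=profile,
    )
    response = client.get_latest_configuration(
        ConfigurationToken=session["InitialConfigurationToken"]
    )
    # Assumption: the profile holds a freeform JSON object.
    return json.loads(response["Configuration"].read())


def make_handler(
    fetch_flags: Callable[[], Mapping[str, Any]],
    ingest: Callable[[Any], None],
) -> Callable[[Any, Any], dict]:
    """Build a Lambda-style handler that checks the flag before doing any work."""

    def handler(event: Any, context: Any) -> dict:
        try:
            flags = fetch_flags()
        except Exception:
            # Assumption: fail closed, so an unreadable flag never causes duplicate work.
            return {"ran": False, "reason": "flag_unavailable"}
        if flags.get(FLAG_NAME) is not True:
            return {"ran": False, "reason": "passive_region"}
        ingest(event)
        return {"ran": True}

    return handler
```

In a deployed function, fetch_flags would wrap fetch_flags_from_appconfig with that region's application, environment, and profile identifiers. Failing closed is a trade-off: it guarantees no duplicate ingestion, at the cost of a skipped cycle if AppConfig is unreachable in the region that is supposed to be active.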
5. Human-in-the-loop failover
Failover is deliberate, not automatic. When AWS posts an event affecting one region, the customer’s on-call updates AppConfig in both regions, optionally updates the active-region DNS record, and announces the alternate endpoint over MS Teams or Slack. Users on the affected region get a one-line message: “Use outcomeops2.company-internal.com while the AWS event is going on.” Time-to-recovery is measured in the seconds-to-minutes range, depending on whether the customer prefers DNS or chat-based failover.
Human-in-the-loop failover is a feature, not a limitation. Automatic failover requires a health-check service that itself has to be highly available, and we’ve already established what tends to happen to highly available control planes during the worst kind of AWS event. We’d rather hand the customer a runbook that takes one minute than ship a routing decision that fails the moment it’s most needed.
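The runbook itself is small enough to express as data. The sketch below computes what the on-call would apply: the flag value for each region and the one-line announcement. It only plans the change and does not call AWS, DNS, or chat APIs. The function name plan_failover and the region-to-endpoint mapping in the usage note are hypothetical; the endpoint hostnames and message wording come from the examples in this post.

```python
from typing import Any, Mapping

FLAG_NAME = "is_active_for_scheduled_work"  # flag name as given in this post


def plan_failover(endpoints: Mapping[str, str], new_active: str) -> dict[str, Any]:
    """Return per-region flag documents and the announcement for a manual failover.

    `endpoints` maps each region to its stable per-region hostname.
    """
    if new_active not in endpoints:
        raise ValueError(f"unknown region: {new_active!r}")
    # Exactly one region is active for scheduled work after the flip.
    flags = {region: {FLAG_NAME: region == new_active} for region in endpoints}
    announcement = f"Use {endpoints[new_active]} while the AWS event is going on."
    return {"flags": flags, "announcement": announcement}
```

For example, with endpoints of {"us-east-1": "outcomeops1.company-internal.com", "us-west-2": "outcomeops2.company-internal.com"} (an assumed mapping), plan_failover(endpoints, "us-west-2") yields a true flag for us-west-2, a false flag for us-east-1, and the announcement pointing users at outcomeops2.company-internal.com. The on-call then applies each flag document through AppConfig in its region and posts the announcement.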
RTO and RPO
Stated honestly:
- RPO ≈ 0 for any acknowledged ingestion. Both regions have the data before the cycle completes.
- RPO ≈ one ingestion interval (typically one hour) in the rare case of an asymmetric dual-write failure where the cycle did not complete cleanly — the next scheduled run picks it up.
- RTO ≈ seconds-to-minutes, customer-controlled. A Slack/Teams endpoint announcement is near-zero. A DNS A-record flip is bounded by the customer’s internal-DNS TTL, typically 5–15 minutes.
These are honest numbers, not aspirational ones. The architecture is what makes them honest — there’s no managed replication to wait on, no warm-up to sequence, no cross-region control plane to recover before the surviving region becomes usable.
The Anticipated Objection: “AWS Has Improved Global Tables”
AWS has, in fact, fixed the specific race condition that triggered the October 2025 event, and the underlying systems have improved since then. That’s the right counter-argument, and it’s worth taking seriously.
Our position isn’t that Global Tables are unfixable. It’s that the architectural pattern — a managed cross-region control plane that has to be available for your “global” service to be available — hasn’t changed. Every managed cross-region AWS feature has a regional control plane somewhere, and those control planes will, eventually, have a bad day. We prefer designs that don’t require AWS to get every cross-region control plane perfect under simultaneous load. The Lambda dual-write pattern keeps the dependency graph short: customer code, customer Lambdas, customer DynamoDB, customer S3 Vectors. No managed cross-region service in the path.
It’s also worth noting what this doesn’t argue. We use plenty of managed AWS services within a region: Bedrock, S3 Vectors (GA 2025), DynamoDB, Comprehend, KMS, the whole VPC endpoint catalog. Regional managed services have a tight failure domain that’s easy to reason about: when one of them fails, the failure stays inside its region. The pattern we avoid is specifically cross-region managed services that paper over a centralized control plane.
What This Means for Procurement and Compliance
Multi-region adds two questions to the security review path: where does the data live and who controls failover.
Both answers stay simple. The data lives in the customer’s two AWS regions, in the customer’s S3 buckets and DynamoDB tables, encrypted with the customer’s KMS keys. There is no third-party replication service, no vendor-managed cross-region pipeline, no new entry on the SOC 2 sub-processor list. Failover is initiated by the customer’s on-call, using the customer’s DNS or the customer’s collaboration tooling. OutcomeOps personnel are not in the failover path.
For HIPAA-eligible workloads in healthcare, the second region inherits the same BAA-covered AWS services as the first — Bedrock, DynamoDB, S3, KMS — because we deliberately didn’t add a new managed service to the dependency graph. For FedRAMP environments, the same is true with GovCloud regions. The compliance posture extends to the second region by virtue of being the same architecture.
When This Doesn’t Make Sense
Multi-region is an option, not a default. For non-regulated buyers running greenfield workloads with no audit-traceability requirement, single-region is fine and cheaper. The compute-and-storage footprint roughly doubles when you turn on the second region (because both are continuously active), and the dual-write Lambda overhead adds latency to ingestion cycles. We turn it on for customers who explicitly need region-wide outage resilience: financial services, healthcare, insurance, defense, or anyone who’s already lost a procurement cycle to a vendor that couldn’t articulate its multi-region story.
How to Evaluate
The two-week proof of concept includes the multi-region option. Apply the Terraform into two AWS accounts (or two regions in one account), connect 20 representative repositories, watch the dual-write metrics, and run a simulated failover drill. The compliance review of the deployment is a Terraform read-through: same posture as the single-region case, just doubled.
Book an enterprise briefing to start the PoC, or run the five-minute Readiness Assessment to get a written report on where your organization sits before scheduling.
Related reading
- AI Coding Tool That Deploys in Your AWS Account — the single-region architecture this post extends.
- AI Coding Tools for Regulated Industries — the compliance-burden lens on deployment model.
- Context Engineering Platforms: A Comparison Guide — how deployment model dominates platform selection.
- Why F500s Got It Wrong Again: The AWS us-east-1 Outage — the pattern this post is the architectural answer to.
- Security & Compliance overview — multi-region is documented here for procurement reviewers.