Distributed Systems SaaS infrastructure 2024

Multi-region failover for a global API

Active-active across three regions with automatic failover measured in seconds, not hours.

- <30s recovery time
- 3 active regions
- 99.995% availability

## The challenge
A single-region API meant a single point of failure — and an SLA the team couldn’t honestly promise.

## What we did
- Adopted CockroachDB for a geo-distributed, strongly consistent data layer.
- Fronted services with Envoy, using health-aware routing to shift traffic away from a failing region automatically.
- Ran quarterly game-days that physically pulled a region to prove the numbers.
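
To make the health-aware routing idea concrete, here is a minimal sketch in Python. It is an illustration only, not the project's Envoy configuration: the region names, the `HealthAwareRouter` class, and the threshold values are all hypothetical. It shows the general pattern of ejecting a region after several consecutive failed probes and spreading requests across whatever remains healthy.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0


@dataclass
class HealthAwareRouter:
    """Toy router: round-robin over regions that currently pass health checks."""

    regions: list[Region]
    unhealthy_threshold: int = 3  # assumed: failed probes before ejection
    healthy_threshold: int = 2    # assumed: passed probes before re-admission
    _cursor: int = 0

    def record_probe(self, name: str, ok: bool) -> None:
        """Update one region's health state from a single probe result."""
        region = next(r for r in self.regions if r.name == name)
        if ok:
            region.consecutive_failures = 0
            region.consecutive_successes += 1
            if region.consecutive_successes >= self.healthy_threshold:
                region.healthy = True
        else:
            region.consecutive_successes = 0
            region.consecutive_failures += 1
            if region.consecutive_failures >= self.unhealthy_threshold:
                region.healthy = False

    def pick(self) -> str:
        """Return the next healthy region, skipping ejected ones."""
        healthy = [r for r in self.regions if r.healthy]
        if not healthy:
            raise RuntimeError("no healthy regions available")
        region = healthy[self._cursor % len(healthy)]
        self._cursor += 1
        return region.name


# Hypothetical region names, chosen only for the example.
router = HealthAwareRouter([Region("us-east"), Region("eu-west"), Region("ap-south")])

# Simulate a game-day: eu-west fails three probes in a row and is ejected.
for _ in range(3):
    router.record_probe("eu-west", ok=False)

served_by = [router.pick() for _ in range(6)]
```

In a scheme like this, the time to detect a failure is roughly the probe interval multiplied by the unhealthy threshold, so those two settings largely decide whether a recovery target measured in seconds is achievable.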

## Outcome
Failover now completes in under 30 seconds, unattended, and the platform holds a verified 99.995% availability.
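
To put the two headline numbers side by side, the short calculation below converts 99.995% availability into a yearly downtime budget and compares it with a 30-second failover. It is a back-of-the-envelope sketch, not how the availability figure was measured: it assumes a 365-day year and that each failover costs the full 30 seconds, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime allowed in the period at the given availability."""
    return period_minutes * (1 - availability_pct / 100)


budget_minutes = downtime_budget_minutes(99.995)

# Worst case from the case study: each failover takes the full 30 seconds.
failover_seconds = 30
failovers_within_budget = int(budget_minutes * 60 // failover_seconds)

print(f"Yearly downtime budget: {budget_minutes:.2f} minutes")
print(f"30-second failovers that fit in the budget: {failovers_within_budget}")
```

Under these assumptions the budget is about 26 minutes a year, which leaves room for roughly 52 worst-case failovers. A recovery measured in hours would use up the whole budget in a single incident.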
