~ Mohan Sankaran.
From steady state to surge
Some systems grow predictably; others get stress-tested overnight. When the world moved online at once, traffic patterns stopped looking like gentle hills and started looking like cliffs. The first lesson was humility: capacity plans built for “peak season” weren’t built for sustained peak. We learned to treat surge as a mode, not a moment: something you operate in deliberately, with different controls, dashboards, and expectations.
From capacity to elasticity
Autoscaling helped, but elasticity isn’t just more machines. It’s admission control at the edge, backpressure in the middle, and graceful degradation at the leaf. Rate limit early. Queue predictably. Make writes idempotent so retries are safe. Choose responsive timeouts over optimistic ones; a fast failure beats a slow collapse. When upstreams wobble, trip circuit breakers, serve cached or partial results, and give users a path that still “works enough.” Not perfect, but present.
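The circuit-breaker pattern above can be sketched in a few lines. This is a minimal illustration, not any particular library’s API: the breaker trips open after consecutive failures, fast-fails to a fallback while open, and allows a probe call after a cooldown. The threshold and timeout values are placeholders.

```python
import time

class CircuitBreaker:
    """Trip open after consecutive failures; allow a probe after a cooldown."""
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let one probe through once the cooldown elapses.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def call_with_fallback(breaker, upstream, fallback):
    """Serve the fallback (cached or partial result) instead of waiting
    on a wobbling upstream: a fast failure beats a slow collapse."""
    if not breaker.allow():
        return fallback()
    try:
        result = upstream()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

The key property is that an open breaker never blocks on the upstream at all, so latency stays bounded even while the dependency is down.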
From features to fundamentals
During a surge, every UI pixel competes for CPU cycles. Freeze non-critical launches. Turn experiments off. Replace synchronous chains with event-driven workflows so one slow service doesn’t strand the user. Cache aggressively (CDN, edge, device), precompute hot views, and batch cold work. Treat the product like a power grid: shed non-essential load so core flows stay lit. Users remember what worked when everything else didn’t.
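The power-grid analogy suggests a simple admission policy: classify work by priority and drop the least essential tiers first as load climbs. A minimal sketch, with made-up priority names and cutoff values; a real shedder would track a live signal such as CPU, queue depth, or p99 latency rather than a manually set number.

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0    # e.g. login, checkout
    NORMAL = 1      # e.g. browse, search
    BACKGROUND = 2  # e.g. recommendations, analytics

class LoadShedder:
    """Shed non-essential load first so core flows stay lit."""
    def __init__(self, normal_cutoff=0.7, critical_cutoff=0.9):
        self.normal_cutoff = normal_cutoff
        self.critical_cutoff = critical_cutoff
        self.load = 0.0  # 0.0 = idle, 1.0 = saturated (illustrative)

    def admit(self, priority):
        if self.load >= self.critical_cutoff:
            return priority == Priority.CRITICAL
        if self.load >= self.normal_cutoff:
            return priority <= Priority.NORMAL
        return True
```

Rejections should be cheap and explicit (a fast 429 or a queued retry token), so the user gets an honest answer instead of a timeout.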
From monitoring to sense-making
Charts don’t keep systems up; shared reality does. We shifted from internal metrics to user-centric SLOs: time-to-login, time-to-checkout, time-to-confirmation. We traced a single request across gateways, services, and storage using correlation IDs, then painted the whole journey on one screen. When alerts fired, the incident channel showed logs, traces, and runbook links in line. Less swivel-chair, more context. The result wasn’t fewer alerts; it was faster decisions.
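Correlation-ID propagation is straightforward to sketch in Python: reuse the caller’s ID when one arrives, mint one otherwise, and stamp it on every log record via a filter. The header name `X-Correlation-ID` is a common convention, not a standard, and this is a minimal illustration rather than a full tracing setup.

```python
import logging
import uuid
from contextvars import ContextVar

# One correlation ID per request, carried safely across async boundaries.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current request's correlation ID."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

def handle_request(headers):
    """Adopt the caller's ID so one trace spans gateway, service, storage."""
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid

# Wire the filter into a logger whose format exposes the ID.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
logger = logging.getLogger("service")
logger.addFilter(CorrelationFilter())
logger.addHandler(handler)
```

With the same ID echoed by every hop, grepping one token reconstructs the whole journey: the “single request across gateways, services, and storage” painted on one screen.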
From reliability to resilience
Reliability is hitting targets in calm seas. Resilience is staying useful in rough water. We invested in graceful degradation deliberately: read-only modes for admin screens, “try again later” tokens for heavy reports, degraded image quality on slow paths, and offline queues on mobile so work could continue without a network. We wrote “disaster contracts” into APIs: documented fallback payloads the client could always render. Every service promised a minimum, not just an ideal.
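A “disaster contract” can be as simple as a guaranteed-minimum payload merged under whatever live data the backend manages to produce. The field names below are hypothetical, just to show the shape: every key the client needs to render is always present, and a `degraded` flag tells the UI to show a retry affordance.

```python
# Hypothetical contract minimum: the fields a client can ALWAYS render,
# even when the backend can only partially (or not at all) answer.
FALLBACK_PRODUCT = {
    "id": None,
    "title": "Item temporarily unavailable",
    "price": None,                          # client renders a placeholder
    "image_url": "/static/placeholder.png",
    "degraded": True,                       # UI shows a retry affordance
}

def product_payload(product_id, fetch):
    """Return live data when possible, the contract minimum otherwise."""
    try:
        data = fetch(product_id)
    except Exception:
        data = {}  # total failure still yields a renderable payload
    # Live fields override the minimum; missing keys stay renderable.
    return {**FALLBACK_PRODUCT, **data, "id": product_id}
```

The promise is structural: clients never branch on “did the call fail,” only on `degraded`, so the fallback path gets exercised by the same rendering code as the happy path.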
From heroics to habits
Early spikes ran on adrenaline. That doesn’t scale. We built staged rollouts with automatic guardrails: canary, watch, expand. Feature flags gated new behaviors; configs could flip instantly without waiting on app stores. We rehearsed failure: terminate pods mid-checkout, drop a region, corrupt a cache keyspace, throttle the database. Chaos drills stopped being a novelty and became routine: short, controlled, and blameless. Confidence came from rehearsal, not hope.
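The canary-watch-expand loop rests on one small mechanism: a deterministic percentage rollout. Hashing the user and flag name together gives each user a stable bucket, so widening a flag from 1% to 5% only adds users and never flaps anyone in and out. A minimal sketch, assuming string user IDs:

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """Deterministic staged rollout: the same user always lands in the
    same bucket for a flag, so expanding the percentage only adds users."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket 0..99
    return bucket < percent
```

Including the flag name in the hash decorrelates flags, so the same early-adopter cohort doesn’t absorb the risk of every experiment at once.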
From local fixes to global thinking
Bottlenecks weren’t just CPU. They were DNS TTLs, TLS handshakes, NAT port exhaustion, cloud quotas, and third-party limits we didn’t own. We moved static assets closer to users, warmed caches before campaigns, and negotiated capacity reservations with critical vendors. Where possible, we split traffic by geography and feature so one region’s surge didn’t drown the rest. Globally distributed systems fail in local ways; plan local answers.
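Splitting traffic by geography can use the same stable-hashing idea: route each user to a weighted region deterministically, so their caches stay warm and weight changes shift load without a redeploy. The region names and weights below are purely illustrative; real weights would come from capacity reservations and live health checks.

```python
import hashlib

# Illustrative weights; in practice derived from reserved capacity.
REGION_WEIGHTS = {"us-east": 50, "eu-west": 30, "ap-south": 20}

def pick_region(user_id: str, weights=REGION_WEIGHTS) -> str:
    """Stable weighted routing: a user sticks to one region, and one
    region's surge doesn't drown the rest."""
    total = sum(weights.values())
    bucket = int.from_bytes(
        hashlib.sha256(user_id.encode()).digest()[:2], "big"
    ) % total
    for region, weight in sorted(weights.items()):
        if bucket < weight:
            return region
        bucket -= weight
    return next(iter(sorted(weights)))  # unreachable; satisfies the type
```

Because routing depends only on the user ID and the weight table, failing a region is a config change, not a code change.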
From privacy promises to practical design
Surge isn’t an excuse to collect more data. We kept privacy budgets intact by aggregating telemetry, tokenizing identifiers, and tuning sampling instead of widening logs. For models, we shipped binaries to devices rather than raw events to servers. When we had to ask the user for more friction, we explained why in plain language. Trust is an SLA, too.
From team strain to team design
Systems weren’t the only things under pressure. We standardized incident command, shortened on-call rotations, and rotated roles so no one carried the pager forever. Postmortems were short, specific, and focused on systemic fixes (guardrails, tests, preflight checks) over individual error. We archived learnings as playbooks and made them part of onboarding. Culture is infrastructure; under load, it’s the first dependency that matters.
From surge response to durable advantage
The most valuable outcome of the surge wasn’t just uptime; it was clarity. We discovered the few metrics that truly represent user trust. We built fast paths and honest fallbacks. We learned which features earn their CPU and which ones don’t. And we transformed release practices so changes arrive smaller, safer, and easier to reverse.
Resilience under pressure isn’t a heroic posture; it’s a design principle. Admit reality quickly. Degrade with intent. Rehearse the bad day. Measure what the user feels. Do these well and the next spike feels less like a cliff and more like the next step up.