01 / 07 · Distribution
Consistent Hashing
A hashing technique that minimizes key remapping when nodes are added to or removed from a distributed system. Core to distributed caches, databases, and load balancers.
Interactive Visualizer
Add nodes or hash a key to see placement on the ring
⚑ Staff signal: With N nodes, adding or removing 1 node moves only ~1/N keys. Modular hashing (key % N) moves ALL keys. Virtual nodes give each physical node ~150 ring positions for even distribution across heterogeneous hardware.
Keys moved on resize
~1/N
vs 100% with modulo hashing
Virtual nodes
~150 per node
Cassandra & DynamoDB default
Ring lookup
O(log N)
Binary search on sorted node list
Interview Questions & Deep Dives
Q1
Explain consistent hashing and why it's better than modulo hashing for distributed systems.
▾
With modulo hashing (key % N), adding or removing one node remaps nearly every key because every bucket assignment changes. Consistent hashing maps both nodes and keys onto a ring (0 to 2³²−1), and each key is owned by the nearest node clockwise. Adding a node only claims keys from its clockwise neighbor — about 1/N of the keyspace.
Deep Dive
Implementation: Hash each node to a position on the ring using a stable hash function (MurmurHash3, xxHash). For a lookup, hash the key and do a binary search on the sorted node list to find the successor node — O(log N).
The hotspot problem: With only physical nodes, some nodes may own more ring real estate than others due to hash distribution — uneven load. The fix is virtual nodes (vnodes): each physical node appears N times on the ring (e.g., 150 positions). A removal of one physical node redistributes load to many different nodes rather than one overwhelmed neighbor.
Data movement on scale-out: When adding a node, the new node takes over keys in the arc between itself and its counter-clockwise neighbor. Cassandra and DynamoDB stream this data automatically. In a manual sharding setup (Redis Cluster), you explicitly migrate slots.
Replication factor: Each key is replicated to the next R clockwise nodes, not just one. Cassandra default is RF=3. The coordinator for a write hashes the partition key, finds the ring position, then writes to the next 3 distinct nodes.
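The ring lookup and vnode placement described above can be sketched in a few lines of Python. This is a hypothetical minimal implementation: production systems use MurmurHash3 or xxHash rather than truncated MD5, and handle concurrent membership changes.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # Stable 32-bit position on the ring (MD5 stands in for MurmurHash3/xxHash).
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")

class ConsistentHashRing:
    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes          # ring positions per physical node
        self._ring = []               # sorted list of (position, node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        # Binary search for the first ring position >= hash(key); wrap to 0.
        i = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHashRing()
for n in ("cache-a", "cache-b", "cache-c"):
    ring.add_node(n)
owner_before = {k: ring.lookup(k) for k in (f"user:{i}" for i in range(1000))}
ring.add_node("cache-d")
owner_after = {k: ring.lookup(k) for k in owner_before}
# Only ~1/4 of keys should move, and every moved key goes to the new node.
moved = sum(owner_before[k] != owner_after[k] for k in owner_before)
```

Note the ~1/N property falls out for free: the new node's vnodes only steal the arcs immediately counter-clockwise of them.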
Follow-up Questions
How does Cassandra use consistent hashing for its partition key?
What happens to in-flight requests during a node addition?
How do you handle a hotspot key (one key receiving disproportionate traffic)?
⚑ Staff signal: Know that consistent hashing solves the resharding problem but not the hotspot key problem. A hotspot key (e.g., a celebrity's user ID) always hashes to the same node regardless of how many vnodes exist. Fix: add a random salt suffix, use Redis sharded data structures, or fan out the hot key across multiple cache entries.
Q2
Design a distributed cache (like Redis Cluster) — walk me through the sharding strategy.
▾
Redis Cluster uses a static hash slot approach: 16,384 slots, each node owns a contiguous range. A key maps to slot = CRC16(key) % 16384. This is simpler than a full ring but requires explicit slot migration during rebalancing. A distributed cache built from scratch should use consistent hashing with vnodes for more elastic scaling.
Deep Dive
Key decisions:
1. Partition key selection: Choose a key with high cardinality (user_id, order_id) — not low-cardinality (status, country) to avoid hotspots.
2. Replication: Each primary shard has ≥1 replica. Redis Cluster uses async replication by default (can lose data on failover). For strong durability: WAIT 1 0 before acknowledging writes.
3. Client routing: Smart clients cache the slot-to-node map and route directly. Dumb clients send to any node; nodes return MOVED redirect. Redis clients (Jedis, ioredis) maintain a slot cache and refresh on MOVED.
4. Resharding: Redis Cluster migrates slots one key at a time: CLUSTER SETSLOT slot MIGRATING dstNode on the source (with CLUSTER SETSLOT slot IMPORTING srcNode on the destination) → CLUSTER GETKEYSINSLOT → MIGRATE → CLUSTER SETSLOT slot NODE dstNode. During migration, clients use the ASKING command for interim lookups.
5. Multi-key operations: Operations spanning multiple slots (MGET, LRANGE on different keys) are disallowed unless using hash tags:
{user}.followers and {user}.posts always land on the same slot.
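The slot mapping and hash-tag rule above can be reproduced in a short sketch (Redis implements CRC16-CCITT/XMODEM in C with a lookup table; the bitwise version below is equivalent but slower):

```python
def crc16(data: bytes) -> int:
    # CRC16-CCITT (XMODEM): poly 0x1021, init 0 — the variant Redis Cluster uses.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    # Hash-tag rule: if the key contains {...} with a non-empty tag,
    # only the tag is hashed, so related keys land in the same slot.
    s = key.find("{")
    if s != -1:
        e = key.find("}", s + 1)
        if e > s + 1:
            key = key[s + 1:e]
    return crc16(key.encode()) % 16384

# {user}.followers and {user}.posts both hash only "user" — same slot,
# so multi-key operations across them are allowed.
assert key_hash_slot("{user}.followers") == key_hash_slot("{user}.posts")
```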
Follow-up Questions
Why does Redis use 16,384 slots rather than a larger number like 65,536?
How do you handle partial writes across multiple shards atomically?
How would you add a new shard with zero downtime?
⚑ Staff signal: Mention that the choice of partition key is the most consequential decision — it determines whether load is even or skewed forever. Composite keys (tenant_id + entity_id) or pre-sharding (create 100 logical shards on 10 nodes, then move logical shards when scaling) are common production patterns.
Q3
What sharding strategies exist beyond consistent hashing, and when do you choose each?
▾
Range sharding, hash sharding, directory sharding, and geo sharding each optimize for different access patterns. Hash sharding (consistent or modular) optimizes for even distribution. Range sharding optimizes for range scans. Directory sharding offers flexibility at the cost of an extra lookup hop.
Deep Dive
Range sharding: Shard 1 holds users 1–1M, Shard 2 holds 1M–2M, etc. Great for range queries (fetch all orders from Jan–Feb). Problem: hotspots if recent data is always on the last shard (time-series monotonic IDs). Fix: UUID or Snowflake IDs to distribute inserts.
Hash sharding: Deterministic, even distribution, zero extra hops. Terrible for range queries since adjacent keys scatter across shards. Use when: random access, no range queries (cache, session store).
Directory sharding: A lookup table maps key→shard. Fully flexible — you can rebalance arbitrarily. Cost: the directory is a single point of failure and a latency hop. Mitigate with a distributed, cached directory (ZooKeeper, etcd) or a sharding proxy.
Consistent hashing: Hash sharding plus cheap elasticity — minimal data movement on topology changes. Still not ideal when you need range queries across shard boundaries.
Compound sharding: Shard by (tenant_id % N) for multi-tenant SaaS so all a tenant's data is colocated — simplifies queries, enables tenant isolation, but risks uneven load if one tenant is large.
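The range-vs-hash trade-off above can be seen in a toy router (an illustrative sketch — the shard count and bounds are made up):

```python
import bisect
import hashlib

# Upper bounds of shards 0..2; shard 3 is the open-ended tail.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    # Adjacent IDs land on the same shard -> cheap range scans, hotspot risk.
    return bisect.bisect_left(RANGE_BOUNDS, user_id)

def hash_shard(user_id: int, n_shards: int = 4) -> int:
    # Adjacent IDs scatter -> even load, but range queries must hit every shard.
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

assert range_shard(5) == range_shard(6)          # neighbors colocated
scatter = {hash_shard(i) for i in range(100)}    # 100 neighbors hit all shards
```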
Follow-up Questions
How does Google Spanner handle sharding across regions?
What is the "scatter-gather" problem in sharded databases?
⚑ Staff signal: In practice, resharding is the most dangerous operation in a database system. A well-designed system is pre-sharded: create 100–1000 logical shards at launch even on 3 physical nodes; migrating logical shards is far safer than live resharding. Pinterest and Discord both use this pattern.
Q4
How would you design a URL shortener that handles 100K writes/s and 1M reads/s?
▾
The core challenge is generating globally unique short IDs at high write throughput, then serving read-heavy redirect traffic with sub-millisecond latency. The solution combines pre-generated ID pools, distributed caching, and globally replicated storage.
Deep Dive
ID generation (100K writes/s): Don't use UUID (too long). Use base62 encoding of a 64-bit integer. Sources: (a) Twitter Snowflake — 41-bit timestamp + 10-bit machine ID + 12-bit sequence = ~4096 IDs/ms/machine, or (b) pre-generated ID pool: a background job fills a Redis LPUSH queue with IDs; writers pop IDs (LPOP) — decoupled, fast, no coordination.
Storage: A simple KV store (short_id → long_url) with consistent hashing across shards. 100K writes/s ÷ 10 shards = 10K writes/s/shard. Use Cassandra or a sharded Redis with persistence. For durability: write to primary + 2 replicas before acking.
Read path (1M reads/s): 99%+ of reads are cache hits. Multi-level cache: (1) CDN edge caches 301/302 redirects — most requests never hit origin, (2) In-process LRU cache on each app server for the hottest 10K URLs, (3) Redis cache layer, (4) Database as fallback. 302 (temporary redirect) is better than 301 for analytics since browsers don't cache 302.
Sharding key: Hash(short_id) — uniform distribution, no range queries needed. But use consistent hashing so adding servers doesn't remap all keys.
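The ID-generation path above (Snowflake bit layout, then base62) can be sketched as follows. This is simplified: a real generator must also wait out sequence overflow within a millisecond and handle clock regression; the epoch constant is Twitter's.

```python
import threading
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

class Snowflake:
    """41-bit ms timestamp | 10-bit machine id | 12-bit sequence."""
    def __init__(self, machine_id: int, epoch_ms: int = 1_288_834_974_657):
        self.machine_id = machine_id & 0x3FF
        self.epoch_ms = epoch_ms
        self.last_ms, self.seq = -1, 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000) - self.epoch_ms
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # up to 4096 ids per ms
            else:
                self.last_ms, self.seq = now, 0
            return (now << 22) | (self.machine_id << 12) | self.seq

gen = Snowflake(machine_id=7)
short = base62(gen.next_id())   # ~11 chars for a 64-bit id
```

Because the timestamp occupies the high bits, IDs are roughly time-sortable — useful for expiring old URLs in bulk.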
Follow-up Questions
How do you handle custom aliases ("mycompany.io/promo") without collision?
How do you track click analytics without blocking the redirect?
How do you expire URLs after 30 days at scale?
⚑ Staff signal: The redirect HTTP status code choice matters. 301 = permanent (browsers cache, no analytics hits), 302 = temporary (no browser caching, analytics works). Most commercial URL shorteners use 302. Also mention: for analytics at 1M events/s, write to a Kafka topic, not directly to a database.
02 / 07 · Distribution
CAP Theorem
In a distributed system, during a network partition you must choose between Consistency (every read sees the latest write) and Availability (every request receives a response). Partition tolerance is mandatory.
Interactive Visualizer
CP
Consistent + Partition Tolerant
AP
Available + Partition Tolerant
CA
Single node only — not distributed
Click CP or AP to explore. CA is impossible in a distributed system.
⚑ Staff signal: Say "P is unavoidable, so the real choice is C vs A during a partition." PACELC extends CAP: even without partition, systems choose between latency (L) and consistency (C). Cassandra: W+R>N gives strong consistency even when AP. Spanner: TrueTime commits enable global external consistency.
P always required
Distributed systems
Partitions happen — plan for them
CP → partition
Returns error
etcd, HBase, Zookeeper
AP → partition
Returns stale data
Cassandra, CouchDB, DNS
Interview Questions & Deep Dives
Q1
Explain CAP theorem. Why is CA not a real option in distributed systems?
▾
CAP states: a distributed system can guarantee at most 2 of — Consistency (every read returns the latest write), Availability (every request gets a non-error response), and Partition Tolerance (system operates during network partitions). Since network partitions are unavoidable (cables break, switches fail, GC pauses cause timeouts), P is always required. The real choice is C vs A during a partition.
Deep Dive
Why CA is impossible: Claiming CA means "consistent and available even during partition" — but by definition a partition means nodes can't communicate. A consistent system that can't communicate must return an error (unavailable). A CA system requires a network with zero partitions — only possible on a single machine.
CP behavior during partition: etcd elects a new leader via Raft and refuses reads/writes from nodes that can't communicate with a quorum. This guarantees linearizability but means partial unavailability.
AP behavior during partition: Cassandra accepts writes on any replica. When the partition heals, last-write-wins (LWW) based on timestamp reconciles conflicts. The client may read stale data during the partition.
PACELC (the real model): Even without partition, there's a trade-off: lower latency or higher consistency. Spanner prioritizes consistency (2PC + TrueTime), adding ~10ms latency. DynamoDB eventual reads save latency vs consistent reads that always hit the leader.
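The W+R>N claim can be checked exhaustively for small clusters — every possible write quorum must intersect every possible read quorum (a brute-force sketch):

```python
from itertools import combinations

def quorums_overlap(n: int, w: int, r: int) -> bool:
    # True iff every write-set of size w shares at least one replica with
    # every read-set of size r — i.e. a read always sees the latest write.
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

assert quorums_overlap(3, 2, 2)      # Cassandra QUORUM/QUORUM: 2 + 2 > 3
assert not quorums_overlap(3, 1, 1)  # ONE/ONE: a read may miss the write
```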
Follow-up Questions
How does Zookeeper maintain CP guarantees during a leader election?
What does "eventual consistency" actually mean in practice?
How does Cassandra's quorum (W+R>N) give strong consistency despite being AP?
⚑ Staff signal: Mention PACELC — the real systems trade-off is latency vs consistency, not just C vs A. In interviews, classify specific databases: Zookeeper=CP, Cassandra=AP, CockroachDB=CP, DynamoDB=AP by default/CP with consistent reads. Showing you know the nuance of tunable consistency (Cassandra ONE vs QUORUM vs ALL) demonstrates production depth.
Q2
Design a distributed leader election system. What consistency guarantees do you need?
▾
Leader election requires strong consistency — split-brain (two nodes believing they're leader) is catastrophic. You need a CP system. The standard approach is a consensus algorithm (Raft or Paxos) that requires a quorum (majority) of nodes to agree before electing a leader.
Deep Dive
Raft election basics: Nodes start as Followers. If no heartbeat from leader within election timeout (150–300ms), a Follower becomes Candidate and requests votes. A node votes for at most one candidate per term. A candidate that receives majority becomes Leader and begins sending heartbeats.
Fencing tokens: Every leader gets a monotonically increasing token (etcd: lease + revision). Clients include the token in every request. Storage layer rejects requests with stale tokens — prevents a zombie leader from making writes after losing its lease.
ZooKeeper approach: Clients create ephemeral sequential znodes (lock/node_0000001). The node that creates the lowest-numbered znode is the leader. Nodes watch the znode just below theirs — this avoids herd effect (all nodes waking on each deletion).
etcd approach: Use a lease (TTL-bound key). Leader heartbeats to refresh the lease. If leader crashes, lease expires and a new election starts. etcd guarantees linearizable reads via Raft log.
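The fencing-token check lives in the storage layer, not the election system. A minimal sketch (hypothetical; etcd derives the token from lease and revision numbers):

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.max_token = -1   # highest token seen so far
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.max_token:
            return False      # zombie leader: token from an old term
        self.max_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(1, "config", "v1")       # leader of term 1
assert store.write(2, "config", "v2")       # new leader elected, token 2
assert not store.write(1, "config", "v3")   # old leader wakes from GC pause: rejected
```

The key property: even if the old leader still believes it holds the lease, its writes are refused once any higher token has been seen.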
Follow-up Questions
What is the split-brain problem and how do fencing tokens prevent it?
How does Redis Sentinel differ from a proper consensus-based election?
⚑ Staff signal: Fencing tokens are the key insight most candidates miss. Even with Raft, a leader can be temporarily paused (GC stop-the-world) and keep believing it's the leader while a new one is elected. Fencing tokens at the storage layer provide the hard guarantee. Leslie Lamport's paper "Time, Clocks, and the Ordering of Events" is foundational here.
Q3
Explain the difference between strong consistency, eventual consistency, and causal consistency.
▾
Strong/linearizable consistency means reads always return the latest write — as if the system is a single machine. Eventual consistency means all replicas will converge to the same value if no new writes occur, but reads may see stale data. Causal consistency is in between: if operation A causally precedes B, everyone sees A before B.
Deep Dive
Linearizability (strong): Every operation appears to take effect instantaneously at some point between its invocation and response. Cost: requires coordination (2PC, Raft commit) → higher latency. Used for: financial transactions, inventory deductions, leader election. etcd, CockroachDB, Spanner.
Sequential consistency: Operations appear in order across all processes, but not necessarily in real-time order. Less strict than linearizability. Java volatile gives sequential consistency within a single machine.
Causal consistency: Operations causally related are seen in order by all nodes. Concurrent operations may be seen in different orders. Vector clocks or hybrid logical clocks (HLC) track causality. Used in: comment threading ("replies appear after the comment they reply to"), shopping carts. MongoDB session consistency is causal.
Eventual consistency: Weakest guarantee. Guarantees convergence but not order or timing. DNS updates propagate eventually — you may query a stale IP for minutes. Great for: view counts, likes, recommendations. Cassandra ONE, DynamoDB default.
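Causality tracking with vector clocks can be sketched directly (a minimal sketch; this assumes zero counters are simply omitted from the dict):

```python
def vc_merge(a: dict, b: dict) -> dict:
    # Element-wise max: the clock of an event that has seen both histories.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def happens_before(a: dict, b: dict) -> bool:
    # a -> b iff every counter in a is <= b's and the clocks differ.
    keys = a.keys() | b.keys()
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

comment = {"alice": 1}
reply = vc_merge(comment, {"bob": 1})   # bob saw the comment, then replied
assert happens_before(comment, reply)   # causally ordered: show comment first
edit = {"alice": 2}                     # alice edits without seeing the reply
# Neither precedes the other -> concurrent; replicas may order them differently.
assert not happens_before(edit, reply) and not happens_before(reply, edit)
```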
Follow-up Questions
How does Google's Spanner achieve external consistency across data centers?
What are vector clocks and how do they track causality?
⚑ Staff signal: Map consistency models to real products. Knowing that "eventual" means "no guarantee on when" — and that in practice, Cassandra replication lag is typically <100ms but can be unbounded — shows you understand operational reality, not just theory.
03 / 07 · Distribution
Replication Lag
Async replication writes to primary first and streams changes to replicas via WAL. Reads from replicas may return stale data. Understanding lag, its causes, and mitigation patterns is critical for read-heavy architectures.
Interactive Visualizer
Primary
LSN = 0
No writes yet
WAL stream (async)
Replica
LSN = 0
No replicated writes yet
Write to primary, then quickly read from replica to observe stale reads.
⚑ Fix read-your-writes: route reads to primary for 1s post-write, OR return write LSN to client and have replica wait until it reaches that LSN. Monitor: pg_stat_replication write_lag / flush_lag / replay_lag.
Async lag (typical)
10–200ms
Can grow to minutes under pressure
Sync replication
0ms replica lag
Primary waits for ≥1 replica ack
Monitor via
pg_stat_replication
write_lag · flush_lag · replay_lag
Interview Questions & Deep Dives
Q1
What causes replication lag and how do you detect and mitigate it in production?
▾
Replication lag accumulates when the replica cannot apply WAL records as fast as the primary generates them. Causes: heavy write workloads on primary, replica running analytics queries that lock rows, network saturation between primary and replica, or large transactions that hold locks during WAL replay.
Deep Dive
Lag sources:
1. Write amplification: Primary writes 100K rows/s; replica replays them serially. Mitigate with max_worker_processes and parallel WAL apply (Postgres 14+ parallel logical replication).
2. Long-running queries on replica: A 30-minute analytics query on the replica can cause lag because WAL apply waits for conflicting queries. Set max_standby_streaming_delay = 0 to cancel blocking queries.
3. Network: WAL segments are up to 16MB. High-latency links between AZs cause lag spikes. Monitor pg_stat_replication.sent_lsn - replay_lsn.
Detection: SELECT * FROM pg_stat_replication shows write_lag, flush_lag, replay_lag for each standby. Alert if replay_lag exceeds a threshold. Application-level: embed the write timestamp in the response; replicas report how far behind they are.
Mitigation patterns:
1. Read-your-writes: After a write, read from the primary for 30s. Or attach the write LSN to the session and have the replica check pg_last_wal_replay_lsn() >= session_lsn.
2. Semi-sync replication: Primary waits for ≥1 replica to flush WAL before acknowledging the write (MySQL semi-sync, Postgres synchronous_standby_names).
3. Route by staleness tolerance: User-facing reads → primary. Reporting/analytics → replica.
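The LSN-wait pattern from mitigation 1 can be sketched with a toy replica. This is hypothetical application-side logic; in Postgres the actual check on the standby is pg_last_wal_replay_lsn() >= session_lsn.

```python
import time

class Replica:
    """Toy replica that replays a primary's log with some lag."""
    def __init__(self):
        self.replay_lsn = 0
        self.data = {}

    def apply(self, lsn: int, key: str, value: str) -> None:
        self.data[key] = value
        self.replay_lsn = lsn

def read_your_writes(replica, session_lsn, key, timeout_s=1.0, primary_read=None):
    # Wait until the replica has replayed past the session's write LSN;
    # if it stays behind too long, fall back to reading from the primary.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if replica.replay_lsn >= session_lsn:
            return replica.data.get(key)
        time.sleep(0.01)
    return primary_read() if primary_read else None
```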
Follow-up Questions
What is the difference between logical and physical replication?
How does MySQL Group Replication differ from standard async replication?
⚑ Staff signal: Physical replication copies WAL bytes — the replica must be the same Postgres major version and same architecture. Logical replication decodes WAL into row-level changes — can replicate to a different version, filter specific tables, or fan out to multiple subscribers. Important for zero-downtime major version upgrades.
Q2
Design a social media feed system (like Twitter/X) — how do you handle read fanout?
▾
The core problem is fanout: when a user with 10M followers posts, do you push to 10M timelines immediately (write fanout), or do you pull and merge on read (read fanout)? The answer is a hybrid — push for most users, pull for mega-celebrities.
Deep Dive
Write fanout (push model): On post, asynchronously write to every follower's timeline cache (Redis sorted set, score = timestamp). Feed read is O(1) — just ZRANGE. Problem: Kylie Jenner posts → 300M Redis writes. Not feasible.
Read fanout (pull model): On feed read, query followed accounts, fetch their recent posts, merge-sort. Fast for celebrities, terrible for users following 10K accounts — expensive N-way merge at read time.
Hybrid approach (Twitter's actual model):
1. Users with <1M followers → write fanout (push on post)
2. Celebrities (>1M followers) → read fanout (pull their timeline when user opens feed)
3. On feed read: load pre-built timeline from cache + pull from celebrities + merge
Storage: Redis sorted set per user: timeline:{user_id} with score=epoch, value=tweet_id. Keep last 800 tweets (beyond that, cold storage). Tweet content stored separately in a tweet service — timeline only stores IDs.
Replication for feed reads: Timeline reads don't need strong consistency — serve from read replicas. Write to primary when building timelines. ~50ms stale lag is acceptable for a social feed.
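The hybrid read path above — merge the pre-pushed timeline with posts pulled live from followed celebrities — is an N-way sorted merge (a sketch; feed entries are hypothetical (epoch, tweet_id) pairs, newest first):

```python
import heapq

def read_feed(pushed, celebrity_feeds, limit=3):
    # Each input is sorted newest-first; merge on descending epoch.
    merged = heapq.merge(pushed, *celebrity_feeds, key=lambda t: -t[0])
    return [tweet_id for _, tweet_id in list(merged)[:limit]]

pushed = [(105, "t5"), (101, "t1")]                   # from the Redis sorted set
celebs = [[(104, "c4"), (100, "c0")], [(103, "x3")]]  # pulled at read time
```

heapq.merge streams the inputs lazily, so in a real service you would stop consuming after `limit` items rather than materializing the whole merge.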
Follow-up Questions
How does Instagram handle "Stories" which expire after 24h at scale?
How would you design notifications (real-time vs batch)?
⚑ Staff signal: The celebrity threshold for switching from push to pull is usually around 1M followers. The key insight is that this is a runtime decision per post — the same infrastructure handles both. Also mention: fan-out-on-write requires the poster's write latency is decoupled from follower count — use a Kafka queue and async workers.
04 / 07 · Storage & Indexing
B-Tree & Indexing
B-Trees are the dominant index structure in relational databases. Their self-balancing, high fan-out, and linked leaf nodes make them ideal for both point lookups and range scans with minimal disk I/O.
Interactive Visualizer
Press Search to traverse the tree
⚑ Staff signal: B-Tree leaves are a doubly-linked list — enabling range scans without returning to root. Height ≈ log_b(N), b = branching factor (100–1000). A 1TB table with 1KB rows needs only 3–4 disk reads. Hash index: O(1) equality-only, no ranges, no ORDER BY.
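The height claim is quick arithmetic — 1 TB of 1 KB rows is about 10⁹ keys (a back-of-envelope sketch; the branching factor 500 is an assumed mid-range fan-out):

```python
import math

def btree_height(n_rows: int, branching: int) -> int:
    # height ≈ ceil(log_b N): each level multiplies reachable keys by b.
    return math.ceil(math.log(n_rows) / math.log(branching))

# ~1e9 rows with fan-out 500 -> only 4 levels, i.e. ~4 disk reads per lookup.
assert btree_height(1_000_000_000, 500) == 4
```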
Point query
O(log N)
Root to leaf, 3-4 disk reads
Range query
O(log N + K)
K = matching rows via leaf scan
Write overhead
O(log N) + rebalance
Each index = extra write cost
Interview Questions & Deep Dives
Q1
Explain composite indexes and the leftmost prefix rule. When does an index NOT get used?
▾
A composite index on (a, b, c) sorts rows first by a, then by b within each a group, then by c. Queries can use the index for prefixes: (a), (a, b), (a, b, c) — but not (b) or (c) alone, since those columns are not sorted at the top level.
Deep Dive
Leftmost prefix rule: An index on (user_id, created_at, status) helps: WHERE user_id=5, WHERE user_id=5 AND created_at>X, WHERE user_id=5 AND created_at=X AND status='active'. Does NOT help: WHERE created_at>X (no user_id prefix), WHERE status='active' (not leftmost).
When indexes are skipped:
1. Function on indexed column: WHERE LOWER(email)='x' — create an expression index instead: CREATE INDEX ON users(LOWER(email))
2. Leading wildcard: WHERE name LIKE '%smith' — can't use a B-tree; use full-text search
3. Implicit cast: WHERE user_id='123' when user_id is INTEGER — the cast is effectively a function applied to the column
4. Low selectivity: An index on a column where 90% of rows have status='active' — the planner chooses a sequential scan
5. OR conditions: WHERE a=1 OR b=2 — each index can serve only one arm; use UNION ALL instead
Covering index: The index includes all columns the query needs — no heap fetch required. CREATE INDEX ON orders(user_id) INCLUDE (total, created_at). The query SELECT total, created_at FROM orders WHERE user_id=5 is satisfied entirely from the index.
Follow-up Questions
What is an index-only scan vs index scan vs bitmap index scan?
How do you use EXPLAIN ANALYZE to diagnose a slow query?
What is the N+1 problem and how do you fix it?
⚑ Staff signal: Know EXPLAIN ANALYZE output: Seq Scan = no index, Index Scan = index lookup + heap fetch, Index Only Scan = covered (fast), Bitmap Index Scan = used for low-selectivity or OR conditions. The planner's cost model uses pg_statistics — run ANALYZE after bulk loads to update statistics.
Q2
Compare B-Tree indexes vs LSM-Trees. When would you choose an LSM-Tree database?
▾
B-Trees optimize for reads (low read amplification: 3-4 page reads). LSM-Trees (Log-Structured Merge) optimize for writes (sequential WAL append, batch flush) at the cost of read amplification (may check multiple SSTables). LSM wins for write-heavy workloads, time-series, and high-cardinality key-value inserts.
Deep Dive
LSM-Tree write path: Write → MemTable (in-memory sorted tree) → WAL (durability). When MemTable fills, flush to SSTable (immutable sorted file on disk). Background compaction merges and garbage-collects SSTables.
LSM-Tree read path: Check MemTable → L0 SSTables → L1 → L2 (each level has larger, compacted SSTables). Bloom filter per SSTable: probabilistically skip SSTables that don't contain the key. Without bloom filters, worst-case reads check every level.
Write amplification: B-Tree: random writes touch many pages — write amplification ~10x. LSM: writes land sequentially, but compaction rewrites data repeatedly — roughly 10x per level transition, ~30x total in leveled compaction. Still, LSM I/O is sequential — faster on HDDs and friendlier to NVMe.
B-Tree wins for: Mixed OLTP (reads and writes), complex queries, range scans, relational workloads. PostgreSQL, MySQL InnoDB.
LSM wins for: Write-heavy (IoT, time-series, audit logs), high write throughput, SSD-optimized workloads. RocksDB, Cassandra, LevelDB, InfluxDB.
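The per-SSTable Bloom filter mentioned in the read path can be sketched minimally (a toy version; real engines size m and k from a target false-positive rate and use faster non-cryptographic hashes):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter like those attached to SSTables."""
    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0                      # bit array packed into one int

    def _positions(self, key: str):
        for i in range(self.k):            # k independent hash positions
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        # False -> definitely absent (skip this SSTable); True -> maybe present.
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
assert bf.might_contain("user:42")   # no false negatives, ever
```

The asymmetry is the point: a "no" lets the read path skip an entire SSTable without touching disk, at the cost of occasional false "maybe" answers.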
What is compaction and how does it affect read/write performance in RocksDB?
How does a Bloom filter work and what is its false positive rate?
⚑ Staff signal: In interviews, tie LSM to specific products. RocksDB is the embedded storage engine used by CockroachDB, TiKV, MyRocks (MySQL LSM variant), Yugabyte, and countless others. The key insight: LSM converts random writes to sequential — this is why SSDs don't fully eliminate LSM's advantage; sequential I/O is still 3-5x faster than random on NVMe.
Q3
Design a database schema for a ride-sharing app (Uber/Lyft). How do you index for geospatial queries?
▾
The core challenge is "find all available drivers within 5km of a rider" — efficiently. A composite B-Tree on (lat, lng) doesn't help: it can only seek on the leading column, so a 2D radius query degenerates into a scan. Solutions: space-filling curves (Geohash, S2), R-Trees, or PostGIS with GIST indexes.
Deep Dive
Geohash approach: Encode (lat, lng) into a string (e.g., "9q8yy"). Nearby locations share a common prefix. Store driver locations with a Geohash index. Query: search all geohash prefixes covering the 5km radius. Fast B-tree range scan on a string column. Caveats: grid cells at Geohash boundaries are adjacent but don't share a prefix — must check 8 surrounding cells.
PostGIS + GIST index: Store location as GEOGRAPHY(POINT). Create the index with CREATE INDEX ON drivers USING GIST(location). Query: SELECT * FROM drivers WHERE ST_DWithin(location, 'POINT(-122 37)'::geography, 5000). The GIST index supports radius queries natively. Best accuracy, but requires the PostGIS extension.
Redis GEO commands: GEOADD drivers:available lng lat driver_id, GEORADIUS drivers:available -122 37 5 km. Internally uses Geohash stored in a sorted set. O(N+log(N)) with N nearby results. Fast for real-time driver tracking — update location on every GPS ping.
Driver location update design: Drivers send GPS every 5s. Write to Redis (real-time tracking) + async write to Postgres (historical). Matching service reads from Redis. Billing/analytics read from Postgres replicas.
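The prefix property that makes Geohash B-tree-friendly falls out of the encoding itself — alternate longitude/latitude bisection bits, packed five at a time into base32. A minimal encoder:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lng, precision=5):
    """Encode (lat, lng) as a geohash string of `precision` characters.

    Even-numbered bits bisect longitude, odd-numbered bits bisect latitude;
    every 5 bits become one base32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits, even = [], True
    while len(bits) < precision * 5:
        if even:  # longitude bit
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits.append(1); lng_lo = mid
            else:
                bits.append(0); lng_hi = mid
        else:     # latitude bit
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)

geohash(37.7749, -122.4194)  # downtown San Francisco -> "9q8yy"
```

Because nearby points bisect the same way for most of their bits, they share a string prefix — which is exactly what a B-tree range scan can exploit (modulo the boundary caveat in the text).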
How would you shard the driver location data across regions?
How does Uber H3 (hexagonal grid) differ from Geohash?
⚑ Staff signal: Uber built H3 — a hexagonal hierarchical spatial index — specifically because Geohash rectangles have distortion at the poles and inconsistent cell sizes. Hexagons minimize distortion. For interviews, the key signal is knowing that Redis GEO is good for real-time tracking but you need a durable store (Postgres + PostGIS) for history and analytics.
05 / 07 · Storage & Indexing
MVCC & Transaction Isolation
Multi-Version Concurrency Control keeps multiple versions of each row so readers and writers don't block each other. Each transaction sees a consistent snapshot. Dead tuples are reclaimed by VACUUM.
Interactive Visualizer
Row 42 heap versions:
⚑ Core insight: Readers never block writers, writers never block readers. xmin/xmax are hidden system columns on every Postgres heap tuple. VACUUM must run regularly to reclaim dead tuples and prevent table bloat and transaction ID wraparound (32-bit, wraps at ~2B).
Read-write blocking
None
MVCC gives each txn its snapshot
Hidden columns
xmin · xmax · ctid
Every Postgres heap tuple
VACUUM prevents
Bloat + xid wraparound
autovacuum_freeze_max_age=200M
Interview Questions & Deep Dives
Q1
Explain the four transaction isolation levels and what anomalies each permits.
▾
SQL defines four isolation levels: Read Uncommitted (allows dirty reads), Read Committed (prevents dirty reads), Repeatable Read (prevents non-repeatable reads), and Serializable (prevents all anomalies). PostgreSQL doesn't implement Read Uncommitted — its minimum is Read Committed.
Deep Dive — Isolation Levels
Dirty Read: Txn A reads uncommitted data from Txn B. If B rolls back, A read data that never existed. Prevented at Read Committed+.
Non-repeatable Read: Txn A reads row twice; B commits update in between; A gets different values. Prevented at Repeatable Read+.
Phantom Read: Txn A runs a range query twice; B inserts a new row that matches; A gets different set. Prevented at Serializable. (Postgres Repeatable Read prevents phantoms too — an implementation detail above the SQL standard.)
Write Skew (MVCC-specific): Two transactions both read an overlapping set and make decisions based on it. Classic example: two doctors both see "1 on-call doctor" and both decide to go off-call — now 0 on-call. Prevented only by Serializable (SSI in Postgres).
Postgres defaults: Read Committed (default). Each statement sees a new snapshot. Safe from dirty reads, but prone to non-repeatable reads. For financial transactions: use Repeatable Read or Serializable. Postgres uses SSI (Serializable Snapshot Isolation) — optimistic, detects conflicts at commit time with minimal locking.
How does Postgres SSI detect write skew without locks?
When would you explicitly use SERIALIZABLE in a Postgres application?
⚑ Staff signal: Write skew is the isolation anomaly that surprises most engineers. It requires Serializable to prevent. Give the on-call doctor example. Also mention: Postgres Repeatable Read is technically Snapshot Isolation — it prevents phantoms (which the SQL standard doesn't require at that level), giving it better safety than the standard defines.
Q2
What is VACUUM in PostgreSQL and why is it critical? What happens if it doesn't run?
▾
VACUUM reclaims disk space from dead tuples (rows invalidated by MVCC) and updates visibility maps for index-only scans. Without VACUUM, tables grow unboundedly (bloat) and transaction IDs eventually wrap around — a catastrophic event that makes all tuples invisible and requires a full database freeze.
Deep Dive
Dead tuple lifecycle: UPDATE creates a new row version. The old version's xmax is set to the updating transaction's xid. It becomes a dead tuple once that xid is committed and no active snapshot needs it. VACUUM marks dead tuples as free space, reclaimed for future inserts.
Table bloat: Without VACUUM, dead tuples accumulate. A table that is 100GB logically may be 400GB on disk after many updates. Queries read more pages than necessary — performance degrades. Fix: VACUUM FULL rewrites the table (takes an exclusive lock, slow) or pg_repack / pg_squeeze (online).
Transaction ID wraparound: Postgres xid is 32-bit — wraps at ~2.1 billion. When a table's oldest unfrozen xid gets older than autovacuum_freeze_max_age (default 200M), autovacuum must freeze those tuples (mark them with FrozenTransactionId=2 — always visible). If freezing falls too far behind, Postgres stops accepting writes with "database is not accepting commands to avoid wraparound data loss". Recovery: enter single-user mode, run VACUUM FREEZE.
autovacuum tuning: autovacuum_vacuum_scale_factor=0.2 (trigger at 20% dead tuples). For high-write tables: lower to 0.01 or use cost-based throttling. Monitor: SELECT relname, n_dead_tup, n_live_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC.
What is the difference between VACUUM and VACUUM FULL?
How do you detect and prevent table bloat in production?
⚑ Staff signal: XID wraparound is a production incident many engineers have never seen but every Postgres DBA fears. The tell is the age() function:
SELECT age(datfrozenxid) FROM pg_database. If this exceeds 1.5B, you're at risk. Monitor and alert on it. Also: long-running transactions hold snapshots and prevent VACUUM from cleaning dead tuples — they're a major source of bloat.
Q3
Design a bank transfer system that prevents double-spends and race conditions.
▾
A bank transfer requires atomicity (both debit and credit happen or neither does), isolation (concurrent transfers don't see each other's in-progress state), and durability. The key patterns are: pessimistic locking with SELECT FOR UPDATE, or optimistic locking with version columns + conditional update.
Deep Dive
Pessimistic locking approach:
BEGIN;
SELECT balance FROM accounts WHERE id=1 FOR UPDATE;
SELECT balance FROM accounts WHERE id=2 FOR UPDATE;
-- check balance >= amount
UPDATE accounts SET balance = balance - amount WHERE id=1;
UPDATE accounts SET balance = balance + amount WHERE id=2;
INSERT INTO transfers (from_id, to_id, amount) VALUES (1, 2, 100);  -- "from"/"to" are reserved words; name the columns from_id/to_id
COMMIT;
FOR UPDATE acquires a row-level lock; concurrent transfers touching the same accounts wait rather than race. Critical: always lock accounts in a consistent order (lower ID first) to prevent deadlocks.
Optimistic locking approach (for low contention): Add a version column. Read version + balance. Update with WHERE id=1 AND version=read_version. If 0 rows affected → conflict → retry. No locks held between read and write — better throughput when conflicts are rare.
Idempotency key: Client includes idempotency_key (UUID). Store (idempotency_key, result) with a UNIQUE constraint. If the same key is submitted twice (retry after network failure), return the cached result instead of applying a duplicate transfer.
Event sourcing for audit: Never update balances in-place. Append a ledger entry for every transaction. Balance = SUM of all ledger entries. Immutable, auditable, trivially reversible.
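The idempotency-key pattern can be sketched end to end with an in-memory SQLite database standing in for Postgres (the table and column names here are illustrative, not from the source) — the UNIQUE constraint is what makes the retry safe:

```python
import sqlite3
import uuid

# In-memory schema for illustration; a real system would use Postgres
# with the same PRIMARY KEY / UNIQUE constraint on idempotency_key.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE transfers (
    idempotency_key TEXT PRIMARY KEY,   -- second insert with the same key fails
    from_id INTEGER, to_id INTEGER, amount INTEGER
);
INSERT INTO accounts VALUES (1, 100), (2, 0);
""")

def transfer(key, from_id, to_id, amount):
    """Apply the transfer at most once per idempotency key."""
    try:
        with db:  # one transaction: ledger row + both balance updates
            db.execute("INSERT INTO transfers VALUES (?,?,?,?)",
                       (key, from_id, to_id, amount))
            db.execute("UPDATE accounts SET balance = balance - ? WHERE id=?",
                       (amount, from_id))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE id=?",
                       (amount, to_id))
        return "applied"
    except sqlite3.IntegrityError:
        return "duplicate"  # retry of an already-applied transfer: no double debit

key = str(uuid.uuid4())
transfer(key, 1, 2, 30)   # applied
transfer(key, 1, 2, 30)   # client retry after a network timeout -> duplicate
```

The key design point: the duplicate INSERT fails *inside* the same transaction as the balance updates, so the whole retry rolls back atomically.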
How do you handle distributed transactions across multiple databases (Saga vs 2PC)?
How would you implement an idempotency layer for payment APIs?
⚑ Staff signal: Mention the Saga pattern for distributed transactions. 2PC (two-phase commit) works but is blocking and fragile. Saga: choreography (events trigger compensating transactions) or orchestration (a saga orchestrator drives the workflow). For payments, Stripe uses idempotency keys + event sourcing — every state change is an immutable ledger record. This is the gold standard.
06 / 07 · Storage & Indexing
Deadlocks & Locking
A deadlock occurs when two transactions wait on each other in a cycle. PostgreSQL detects cycles in the wait-for graph and rolls back the cheapest victim. Prevention is far better than detection.
Interactive Visualizer
⚑ Prevention: Always lock rows in the same global order — sort IDs ascending. T1 and T2 both lock id=5 then id=7, never reversed. No cycle possible. Set
lock_timeout='2s'. For queues: SELECT FOR UPDATE SKIP LOCKED.
Detection
Cycle in wait-for graph
Checked every deadlock_timeout (1s)
Victim selection
Lowest rollback cost
Fewest locks / least work done
Prevention
Global lock ordering
Always lock lower ID first
Interview Questions & Deep Dives
Q1
What is SELECT FOR UPDATE SKIP LOCKED and how do you build a job queue with it?
▾
SELECT FOR UPDATE SKIP LOCKED acquires a row lock and skips any rows that are already locked by another transaction — instead of blocking. This makes it ideal for a distributed job queue where multiple workers claim work concurrently without deadlocks or queueing behind each other.
Deep Dive
Job queue schema:
CREATE TABLE jobs (id BIGSERIAL PRIMARY KEY, status TEXT DEFAULT 'pending', payload JSONB, attempts INT DEFAULT 0, run_at TIMESTAMPTZ DEFAULT NOW(), created_at TIMESTAMPTZ DEFAULT NOW(), updated_at TIMESTAMPTZ DEFAULT NOW());
CREATE INDEX ON jobs(status, run_at) WHERE status IN ('pending','failed');
Worker claim logic:
BEGIN;
SELECT id, payload FROM jobs
WHERE status = 'pending' AND run_at <= NOW()
ORDER BY run_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
UPDATE jobs SET status='running', attempts=attempts+1, updated_at=NOW() WHERE id=<claimed_id>;
COMMIT;
SKIP LOCKED ensures each worker atomically claims a unique job. Without it, all workers would queue on the first row's lock.
At-least-once delivery: If a worker crashes mid-job, the row stays 'running'. A reaper job periodically re-queues stuck jobs: UPDATE jobs SET status='pending', run_at=NOW()+interval '5m' WHERE status='running' AND updated_at < NOW()-interval '10m'.
Versus Kafka/SQS: Postgres-as-queue works excellently up to ~1000 jobs/s. Beyond that, use a dedicated message broker. Advantages: transactional (enqueue inside the business-logic transaction, no dual write), no extra infrastructure, easy visibility into queue state.
How do you prevent a job from being picked up by multiple workers (exactly-once semantics)?
When should you migrate from a Postgres job queue to Kafka?
⚑ Staff signal: Many engineers don't know about SKIP LOCKED and propose complex coordination mechanisms. Knowing this built-in is a strong signal. Also mention advisory locks:
pg_try_advisory_xact_lock(job_id) — session-level locks you can use for distributed mutual exclusion without dedicated rows.
Q2
How does distributed locking work with Redis? What are its failure modes?
▾
Redis distributed locking uses SET with NX (set if not exists) and PX (expiry milliseconds) as an atomic operation. The lock value is a random token unique to the holder — release uses a Lua CAS script to ensure only the holder can release. Failure modes: clock skew, GC pauses outliving TTL, network partitions, and Redis failover.
Deep Dive
Acquire: SET lock:resource <token> NX PX 30000 — atomic; returns OK if acquired, nil if already held.
Release (Lua CAS):
if redis.call("get",KEYS[1]) == ARGV[1] then
return redis.call("del",KEYS[1])
else return 0 end
The token check + delete is atomic. Without it, two scenarios break: (a) the lock expires, B acquires it, A releases B's lock; (b) A finishes and releases, B acquires, A crashes and retries the release.
Failure modes:
1. GC pause: Process A holds the lock, GC pauses for 35s, TTL=30s expires, B acquires the lock, GC resumes — A and B both believe they hold the lock. Fix: fencing tokens (a monotonic counter included in every request; storage rejects older tokens).
2. Redis failover: A acquires the lock on the primary. The primary crashes before replicating to the replica. The replica becomes primary — the lock is gone. B acquires the same lock. Fix: Redlock algorithm (acquire on 3+ Redis masters, need a majority).
3. Clock skew: Redlock relies on TTL being honored consistently across machines. NTP skew can cause early expiry on one node.
When Redis locks are fine: For "advisory" locks where occasional failures are acceptable (preventing duplicate sends, rate limiting). Not for critical mutual exclusion (use Postgres FOR UPDATE or ZooKeeper/etcd ephemeral nodes).
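The SET NX PX + Lua-CAS-release semantics can be modeled with an in-memory stand-in (a sketch of the protocol, not a real lock service — `ToyLockServer` and its methods are illustrative names, and a real deployment talks to Redis over the network):

```python
import secrets
import time

class ToyLockServer:
    """In-memory stand-in for Redis SET NX PX plus the Lua
    compare-and-delete release."""
    def __init__(self):
        self.store = {}  # lock name -> (holder token, expiry timestamp)

    def acquire(self, name, ttl_s):
        entry = self.store.get(name)
        if entry and entry[1] > time.monotonic():
            return None                       # held and not yet expired: like nil
        token = secrets.token_hex(8)          # random token identifies the holder
        self.store[name] = (token, time.monotonic() + ttl_s)
        return token                          # like OK + the token you set

    def release(self, name, token):
        entry = self.store.get(name)
        if entry and entry[0] == token:       # CAS: only the holder may delete
            del self.store[name]
            return True
        return False                          # expired/stolen: refuse the delete

srv = ToyLockServer()
t1 = srv.acquire("resource", ttl_s=30)
srv.acquire("resource", ttl_s=30)             # second client: None (blocked)
srv.release("resource", "wrong-token")        # False: CAS protects the holder
srv.release("resource", t1)                   # True: holder releases cleanly
```

Note what this model does *not* protect against: a holder paused past its TTL still believes it owns the lock — which is exactly why fencing tokens are needed for safety-critical use.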
What is the Redlock algorithm and why is it controversial?
When would you use etcd over Redis for distributed locking?
⚑ Staff signal: Martin Kleppmann's analysis of Redlock is a famous distributed systems debate — he argues Redlock is fundamentally unsafe without fencing tokens even with 5 Redis masters. Antirez (Redis author) disagrees. Knowing this debate and the fencing token argument shows deep understanding. Conclusion: for safety-critical locks, use etcd or ZooKeeper with proper fencing.
07 / 07 · Caching
Cache Failure Modes
Three critical failure modes in caching systems: Cache Stampede (hot key expires, N requests flood DB), Cache Avalanche (mass expiry), and Cache Penetration (non-existent keys). Each has distinct causes and fixes.
Interactive Visualizer
💥 WITHOUT Mutex
✓ WITH Mutex
Choose a scenario to simulate
⚑ Three failure modes: (1) Stampede — mutex or probabilistic XFetch. (2) Avalanche — TTL jitter = base + random(0, 10%). (3) Penetration — cache null values with short TTL, or Bloom filter at edge to reject unknown keys without hitting cache or DB.
Without mutex
N DB queries
1 per concurrent cache miss
With mutex
1 DB query
Rest queue and get cached value
XFetch early recompute
P ∝ β·log(rtime)
Probabilistic pre-expiry refresh
Interview Questions & Deep Dives
Q1
Design a caching layer for a product catalog (100M items, 1M reads/s, updates every few minutes).
▾
A multi-level cache strategy: CDN edge for static product pages (HTML), distributed Redis cluster for product JSON, and a read-through L1 in-process LRU for the hottest items. TTL-based expiry with jitter for freshness. Cache-aside pattern for application control.
Deep Dive
Read path: L1 (in-process) → L2 (Redis cluster) → L3 (DB replica) — a miss at each level populates the level above it (read-through or cache-aside).
Cache key design: product:{product_id}:v{version}, or product:{product_id} with TTL. With the version in the key, the cache is auto-busted on publish — old keys expire naturally. This allows instant invalidation by incrementing the version in Redis (atomic, O(1)) without scanning all product keys.
Write invalidation strategies:
1. TTL expiry: Simple, eventual. Cache may serve 5min-stale data. Acceptable for many products.
2. Cache-aside invalidation: On product update → DELETE cache key → next read populates fresh. Risk: stampede on hot products (use a mutex or stale-while-revalidate).
3. Write-through: Update DB + cache in the same transaction. Ensures no stale reads. Doubles write latency.
4. CDC + async invalidation: Debezium captures DB changes → Kafka → cache invalidation workers. Decoupled, scales independently. Sub-second invalidation.
TTL jitter: All 100M products added at launch — without jitter, all expire simultaneously. TTL = base_ttl + random(0, 0.1 * base_ttl) spreads expiry over time.
Hot key mitigation: Top 100 products by traffic get a longer TTL and proactive refresh (a background job refreshes before expiry). Or a local L1 with 1s TTL on each app server absorbs spikes.
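The stampede mutex and TTL jitter mentioned above can be sketched together — one per-key lock so N concurrent misses produce one DB query (`load_product` is a hypothetical stand-in for the real fetch; in production the mutex itself would be a distributed lock, not a process-local one):

```python
import random
import threading
import time

db_calls = 0
cache = {}
cache_lock = threading.Lock()   # guards the per-key lock registry
per_key_locks = {}

def load_product(pid):
    """Stands in for the expensive DB query everyone is trying to avoid."""
    global db_calls
    db_calls += 1
    time.sleep(0.05)
    return {"id": pid, "name": "widget"}

def get_with_mutex(pid, base_ttl=300):
    key = f"product:{pid}"
    if key in cache:
        return cache[key]
    with cache_lock:                       # pick/create this key's mutex
        lock = per_key_locks.setdefault(key, threading.Lock())
    with lock:                             # only one caller recomputes...
        if key in cache:                   # ...the rest find it populated
            return cache[key]
        value = load_product(pid)
        ttl = base_ttl + random.uniform(0, 0.1 * base_ttl)  # jitter vs avalanche
        cache[key] = value                 # in Redis: SETEX key <jittered ttl> value
        return value

# 20 concurrent misses on a hot key -> exactly 1 DB query
threads = [threading.Thread(target=get_with_mutex, args=(42,)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The double-check inside the lock is essential: everyone who waited on the mutex re-reads the cache instead of re-querying the DB.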
How do you handle cache invalidation when a product has many variants?
How would you implement stale-while-revalidate in a distributed cache?
⚑ Staff signal: The hardest part of caching is invalidation consistency. Mention CDC (Change Data Capture) as the gold standard for event-driven invalidation — Debezium tails Postgres WAL and publishes change events to Kafka. Cache workers consume and invalidate. This decouples the write path from cache management and provides a reliable event log for all consumers.
Q2
What cache eviction strategies exist and how do you choose among them for different workloads?
▾
LRU (Least Recently Used) is the default for most workloads. LFU (Least Frequently Used) is better when access frequency matters more than recency. ARC (Adaptive Replacement Cache) adapts between LRU and LFU dynamically. Redis offers 8 eviction policies covering variants of each.
Deep Dive
LRU: Evict the item not accessed for the longest time. Works well for temporal locality (recent data is more likely to be accessed). Weakness: a large sequential scan pollutes the cache with items that won't be accessed again (cache pollution). Redis implements approximate LRU: sample 5 random keys, evict the oldest — saves memory overhead of a true LRU linked list.
LFU: Evict the item accessed the fewest times overall. Better for stable hot sets (a product that was hot last month but not last week would be protected by LFU). Redis allkeys-lfu uses a logarithmic frequency counter (decay over time). Best for: CDN edge caches, recommendation results, stable popular content.
ARC (used in ZFS, some CDNs): Maintains two LRU lists — recently used (T1) and frequently used (T2) — plus ghost entries to adapt the balance. Outperforms both LRU and LFU on mixed workloads without tuning.
Redis eviction policies (8): noeviction (error on writes at OOM), allkeys-lru, volatile-lru (only keys with a TTL), allkeys-lfu, volatile-lfu, allkeys-random, volatile-random, volatile-ttl (evict soonest-expiring).
Recommendation: General cache → allkeys-lru. Stable hot content → allkeys-lfu. Mixed → use ARC at the CDN level, LRU at Redis level.
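Exact LRU is small enough to sketch with an ordered map — which also makes Redis's design choice concrete: it approximates this by sampling keys rather than paying for a full recency list per key:

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU eviction over an OrderedDict: insertion order doubles
    as recency order."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # touched: now most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used

c = LRUCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")       # "a" is now most recent
c.put("c", 3)    # capacity exceeded: evicts "b", the LRU entry
```

Note how a sequential scan through many keys would evict the whole working set — the cache-pollution weakness the text describes, and the case where LFU or ARC does better.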
How does a Bloom filter prevent cache penetration?
What is a "write-behind" cache and when is it appropriate?
⚑ Staff signal: Cache penetration (requests for non-existent keys) can be fixed with a Bloom filter: on startup, load all valid keys into the filter. Requests for unknown keys are rejected before hitting cache or DB — with a small false positive rate (a valid key is occasionally rejected, handled by fallback). False positive rate = function of bit array size and number of hash functions.
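A minimal Bloom filter makes the penetration defense concrete — false positives are possible, false negatives never, so "not in filter" safely rejects unknown keys before they touch cache or DB (the sizes below are illustrative; real deployments size m and k from the target false-positive rate p ≈ (1 − e^(−kn/m))^k):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per key over an m-bit array."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k independent positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # All k bits set -> maybe present; any bit clear -> definitely absent.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
for pid in ["p1", "p2", "p3"]:      # load valid product IDs at startup
    bf.add(pid)
bf.might_contain("p2")              # True: known key passes through
bf.might_contain("no-such-id")      # almost certainly False: rejected at the edge
```

The fallback the callout mentions handles the rare false positive: a key that passes the filter but misses the cache and DB simply gets a cached null.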
Q3
Design a rate limiter — implement Token Bucket and Sliding Window, compare their trade-offs.
▾
Token Bucket allows bursting up to a maximum capacity, refilling at a steady rate. Sliding Window counts requests in a rolling time window with no burst allowance. Token Bucket is more user-friendly (allows legitimate bursts), Sliding Window is more strict for abuse prevention.
Deep Dive
Token Bucket in Redis — Lua script for atomicity (the key and the capacity/rate parameters arrive via KEYS/ARGV):
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local rate = tonumber(ARGV[3])
local tokens = tonumber(redis.call("get", key) or capacity)
local last = tonumber(redis.call("get", key..":ts") or now)
tokens = math.min(capacity, tokens + (now-last) * rate)
if tokens >= 1 then
redis.call("set", key, tokens-1)
redis.call("set", key..":ts", now)
return 1 -- allowed
else return 0 end
Sliding Window Counter in Redis — a sorted set with the timestamp as score:
MULTI
ZREMRANGEBYSCORE key 0 (now-window_ms)
ZCARD key
ZADD key now request_id
EXPIRE key window_seconds
EXEC
The count in the sorted set = requests in the last window_ms. If count < limit → allow.
Trade-offs:
Token Bucket: allows bursts (100 req/s limit, user sends 100 at t=0, then waits 1s) — feels natural. Good for APIs. Memory O(1) per user.
Sliding Window: no burst. Strictly N requests per window. Slightly higher memory (O(N) per user). Better for abuse detection.
Fixed Window: simple but suffers the boundary attack — 100 req at 11:59:59pm + 100 req at 12:00:00am = 200 requests in 2 seconds, both technically within limits.
Distributed rate limiting: Each app server keeps a local counter + syncs to Redis every 100ms. Trade-off: occasional over-counting (up to 100ms window × N servers) for lower Redis latency.
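The token-bucket refill arithmetic from the Lua script, as an in-process Python class — a single-threaded sketch with an injectable clock so the burst/refill behavior is easy to see (and test) deterministically:

```python
import time

class TokenBucket:
    """Token bucket: capacity allows bursts, refill rate enforces the average.

    Mirrors the Redis Lua logic: refill proportional to elapsed time,
    capped at capacity, spend one token per allowed request."""
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = refill_rate          # tokens per second
        self.tokens = float(capacity)    # start full: an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Deterministic demo with a fake clock:
fake_now = [0.0]
tb = TokenBucket(capacity=5, refill_rate=1, clock=lambda: fake_now[0])
burst = [tb.allow() for _ in range(6)]   # 5 allowed at t=0, the 6th denied
fake_now[0] = 2.0                        # 2 seconds later: 2 tokens refilled
```

Contrast with the sliding window: here the user may legitimately spend the whole capacity at t=0, then is throttled to the refill rate — exactly the burst-friendliness trade-off described above.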
How do you implement rate limiting without Redis — in a stateless Lambda function?
How would you rate limit at different granularities (per-IP, per-user, per-endpoint)?
⚑ Staff signal: The rate limit response headers matter. Return X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset on every response — not just 429s. This is a DX courtesy; the server-side enforcement is the real gate. Also: Cloudflare Workers has a built-in rate limiter backed by Durable Objects that handles distributed counting without a separate Redis.
08 / 14 · Messaging
Kafka & Message Queues
Kafka is a distributed, append-only commit log. Producers write to partitions. Consumers in groups each own a partition subset. Offsets enable replay and exactly-once semantics. The backbone of modern event-driven architectures.
Interactive Visualizer — Kafka Partition Flow
Produce messages and observe partition distribution
⚑ Staff signal: Partition count determines max parallelism — you can never have more active consumers than partitions. Choose partition count at topic creation (changing it redistributes keys). Key-based partitioning ensures ordering per key; null key → round-robin.
Max parallelism
= Partition count
Extra consumers sit idle
Retention default
7 days / 1GB
Configurable; log compaction for KV
Throughput
1M+ msgs/s
Per broker on commodity hardware
Interview Questions & Deep Dives
Q1
Explain Kafka's delivery guarantees — at-most-once, at-least-once, and exactly-once.
▾
At-most-once: fire and forget — messages may be lost. At-least-once: producers retry on failure — messages may be duplicated, consumers must be idempotent. Exactly-once: Kafka's transactional API (enable.idempotence=true + transactions) provides end-to-end exactly-once within a Kafka topology.
Deep Dive
At-most-once (producer): acks=0 — producer doesn't wait for broker ack. Highest throughput, data loss on broker failure. Used for: metrics, logs where loss is acceptable.
At-least-once (producer): acks=all + retries=MAX_INT + enable.idempotence=false. Broker confirms all in-sync replicas wrote. If ack is lost in transit, producer retries → duplicate. Consumer must be idempotent (check if already processed). Used for: most event pipelines.
Exactly-once (producer + consumer): enable.idempotence=true gives idempotent producer (sequence number per partition, broker deduplicates). Transactional API (initTransactions / beginTransaction / commitTransaction) wraps consume+produce in an atomic unit — offsets committed only if the entire batch commits. Used for: financial calculations, inventory updates.
Consumer offset commits: enable.auto.commit=true commits every 5s (auto.commit.interval.ms) — at-least-once (may reprocess on crash). Safer: manual commit — call commitSync() after handling each batch. Consumer-side idempotency: use a DB unique constraint or Redis SET NX on the message_id before processing.
How does Kafka's transactional API interact with consumer group rebalancing?
What is the "zombie fencing" problem in Kafka transactions?
⚑ Staff signal: Exactly-once processing end-to-end (Kafka → app logic → database) requires idempotent consumers even with Kafka's transactional API, because the Kafka transaction only covers the Kafka topology. If your consumer writes to Postgres, you need an idempotency key there too. True end-to-end EOS requires coordination across both systems.
Q2
How do you design a Kafka-based event system for an e-commerce order pipeline?
▾
Model each state transition as an event on a topic. Services are loosely coupled producers and consumers. Use the Outbox pattern to guarantee atomicity between DB write and Kafka publish. Consumer groups allow each downstream service to process independently at its own pace.
Deep Dive
Topic design: One topic per business entity/event type: orders.placed, orders.payment-processed, orders.fulfilled, orders.cancelled. Partition by order_id — ensures all events for an order arrive in order to one consumer.
Outbox pattern (dual-write problem): The problem: writing to DB and Kafka in a single transaction is impossible. If you write DB then Kafka crashes → inconsistency. Solution: write to an outbox table in the same DB transaction. A CDC process (Debezium) tails the WAL and publishes outbox records to Kafka. Atomicity guaranteed by the DB transaction; Kafka publish is a separate async step.
Schema registry: Use Confluent Schema Registry with Avro/Protobuf. Producers register schemas; compatibility rules (backward/forward) gate schema evolution. Critical for a multi-team event bus — prevents a schema change from silently breaking consumers.
Consumer group isolation: payment-service, inventory-service, notification-service each have their own consumer group. Each gets every event independently. One service being slow doesn't affect others — it just falls behind (monitor consumer lag).
Dead letter queue (DLQ): If a consumer fails processing after N retries, route to an orders.dlq topic. Monitor DLQ; don't silently drop. Replay from DLQ after fixing the bug.
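The Outbox pattern above can be sketched with sqlite3 standing in for the service database — the order row and its event commit in one local transaction, and a toy relay plays the role Debezium plays in production (table and column names are invented):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,"
           " topic TEXT, payload TEXT)")

def place_order(order_id):
    with db:  # both inserts commit or roll back together — no dual-write gap
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "placed"))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders.placed", json.dumps({"order_id": order_id})))

def relay_once():
    """Stand-in for the CDC relay: drain outbox rows in commit order."""
    rows = db.execute("SELECT seq, topic, payload FROM outbox ORDER BY seq").fetchall()
    db.execute("DELETE FROM outbox")  # real CDC tracks a log position instead
    return rows

place_order("ord-42")
events = relay_once()
assert [(t, json.loads(p)["order_id"]) for _, t, p in events] == [("orders.placed", "ord-42")]
```

The relay here deletes rows for simplicity; Debezium instead tails the WAL and never mutates the outbox, which preserves the audit log mentioned in the staff signal.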
How do you handle event ordering when consumers process in parallel?
When would you use Kafka Streams vs Flink for stream processing?
⚑ Staff signal: The Outbox pattern is the production-grade solution to the dual-write problem — most candidates propose "write DB then Kafka" which has a race condition. Outbox + CDC (Debezium) is how Stripe, Airbnb, and Uber solve this. It also gives you a built-in audit log of all state changes.
Q3
Compare Kafka vs RabbitMQ vs SQS — when do you choose each?
▾
Kafka: high-throughput event streaming, replay, fan-out to many consumer groups, event sourcing. RabbitMQ: flexible routing (exchanges, routing keys), traditional job queues, lower throughput needs, complex routing. SQS: managed, serverless, simple queuing, AWS-native, no message replay.
Deep Dive
Kafka: Log-based — messages are retained and replayable. Consumers track their own offset. Fan-out to N consumer groups with no extra cost. Throughput: 1M+ msgs/s. Use for: event sourcing, analytics pipelines, CDC, microservice event bus, audit logging. Operational complexity: requires ZooKeeper/KRaft, partition management, schema registry.
RabbitMQ: Message broker — delete-on-ack model. Rich routing: fanout exchange (broadcast), direct (routing key), topic (pattern matching), headers. Messages are consumed once and deleted. Priority queues, TTL, dead-letter exchanges. Use for: task queues, RPC over messaging, complex routing, low-volume high-flexibility needs.
SQS: Managed AWS service. FIFO queues or standard (at-least-once, best-effort ordering). Visibility timeout prevents double-processing without locking. Dead-letter queues built-in. Max 14-day retention. No replay. Use for: serverless architectures, AWS-native systems, simple decoupling, Lambda triggers. SQS + SNS = pub-sub fan-out.
Decision framework: Need replay/audit? → Kafka. Complex routing rules? → RabbitMQ. AWS-managed, no ops burden? → SQS/SNS. High throughput streaming? → Kafka. Simple job queue? → RabbitMQ or SQS.
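The decision framework above, codified as a tiny lookup — a mnemonic for the interview, not a real selection tool:

```python
# Encodes the prose rules: replay/throughput → Kafka, routing → RabbitMQ,
# managed AWS → SQS/SNS, plain job queue → either RabbitMQ or SQS.
def pick_broker(need_replay=False, complex_routing=False,
                aws_managed=False, high_throughput=False):
    if need_replay or high_throughput:
        return "Kafka"
    if complex_routing:
        return "RabbitMQ"
    if aws_managed:
        return "SQS/SNS"
    return "RabbitMQ or SQS"   # simple job queue: either works

assert pick_broker(need_replay=True) == "Kafka"
assert pick_broker(complex_routing=True) == "RabbitMQ"
assert pick_broker(aws_managed=True) == "SQS/SNS"
```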
How does Kafka's log compaction work and when would you enable it?
What is the difference between a queue and a topic in messaging systems?
⚑ Staff signal: Log compaction keeps only the latest value per key — turns a Kafka topic into a key-value store backed by an immutable log. Useful for: configuration changes (keep latest config per service), materialized views, changelogs. Flink and Kafka Streams use compacted topics for state stores that survive consumer restarts.
09 / 14 · Networking
Load Balancing
Load balancers distribute traffic across backend servers for scalability and fault tolerance. L4 (TCP/UDP) vs L7 (HTTP) operate at different OSI layers with different capabilities. Algorithm choice affects distribution quality and session stickiness.
Interactive Visualizer — Load Balancing Algorithms
Send requests to observe load distribution
⚑ Staff signal: L7 load balancers can route by path (/api → service A, /static → CDN), inspect headers, terminate TLS, do content-based routing, and sticky sessions via cookies. L4 is faster (no HTTP parsing) — use it for non-HTTP protocols (gRPC over TCP, database proxies like HAProxy for Postgres).
L4 (Transport)
TCP/UDP routing
Faster, no content inspection
L7 (Application)
HTTP/gRPC routing
Path, header, cookie routing
Health check interval
5–30 seconds
TCP ping or HTTP /health endpoint
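Two of the algorithms the visualizer demonstrates can be sketched minimally (backend names invented):

```python
import itertools

backends = ["app-1", "app-2", "app-3"]

# Round robin: cycle through backends regardless of their current load.
rr = itertools.cycle(backends)

# Least connections: pick the backend with the fewest in-flight requests.
active = {b: 0 for b in backends}

def least_connections():
    choice = min(active, key=active.get)   # ties break by insertion order
    active[choice] += 1                    # request starts on this backend
    return choice

first = least_connections()    # all idle → app-1
second = least_connections()   # app-1 now busy → app-2
assert first == "app-1" and second == "app-2"
assert next(rr) == "app-1" and next(rr) == "app-2"
```

Real load balancers also decrement the counter when a request finishes and weight backends by capacity; this sketch omits both.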
Interview Questions & Deep Dives
Q1
Explain the difference between L4 and L7 load balancing. When do you use each?
▾
L4 operates at the transport layer — it sees source/destination IP and port, but not the content. It forwards TCP/UDP packets without parsing HTTP. L7 operates at the application layer — it reads the full HTTP request, enabling routing by URL path, hostname, headers, and cookies. L7 adds latency but enables sophisticated routing.
Deep Dive
L4 load balancers (e.g., AWS NLB, HAProxy TCP mode): Establish one TCP connection to backend per client connection (full proxy) or use DSR (Direct Server Return) where the response bypasses the LB. Low latency (~microseconds added), can handle millions of concurrent connections. Used for: database proxies, game servers, any non-HTTP protocol, ultra-low latency scenarios.
L7 load balancers (e.g., AWS ALB, Nginx, Envoy): Terminate the client TLS/TCP connection, read the full HTTP request, then make a routing decision and open a new connection to the backend. Can do: path-based routing (/api → backend, /static → CDN), host-based routing (api.example.com → API cluster), header injection (X-Forwarded-For), WebSocket upgrades, HTTP/2 multiplexing, sticky sessions via cookies. Added latency: 1–5ms for TLS + HTTP parsing.
Sticky sessions: L7 LB sets a cookie on first response. Subsequent requests from the same client include the cookie → same backend. Required for stateful apps that store session in memory. Better: eliminate stickiness by storing session in Redis.
AWS architecture: NLB (L4) in front of ALB (L7) for maximum flexibility. NLB gets a static IP (required for whitelisting), passes through to ALB which does HTTP routing.
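The L7 routing decisions described above — path-based rules plus stickiness — fit in a few lines. This is a toy: the rules, pool names, and hash-the-session stickiness scheme are invented for the sketch (real L7 LBs typically set an affinity cookie instead):

```python
import hashlib

RULES = [("/api", ["api-1", "api-2"]), ("/static", ["cdn-edge"])]

def route(path, session_id=""):
    for prefix, pool in RULES:
        if path.startswith(prefix):
            # stickiness: the same session id always maps to the same pool index
            h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
            return pool[h % len(pool)]
    return "default-backend"

assert route("/static/logo.png") == "cdn-edge"                        # path rule
assert route("/api/users", "sess-abc") == route("/api/orders", "sess-abc")  # sticky
assert route("/health") == "default-backend"
```

An L4 balancer could not implement any of this: it never sees the path or the session — only IPs and ports.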
How does a load balancer handle health checks for a database that's slow but alive?
What is BGP anycast and how does Cloudflare use it?
⚑ Staff signal: The "thundering herd" problem — if a backend goes down and comes back up, all health checks pass simultaneously and the LB floods it with traffic it can't handle yet. Fix: slow start / gradual ramp-up. AWS ALB does this automatically. Nginx upstream: slow_start=30s. Envoy: slow_start_window.
Q2
Design a globally distributed system with multi-region load balancing and failover.
▾
Global load balancing uses DNS-based routing (Route 53 latency-based, geolocation) combined with Anycast to route users to the nearest healthy region. Active-active multi-region serves traffic from multiple regions simultaneously. Active-passive maintains warm standby with automated failover.
Deep Dive
DNS-based global routing (Route 53): Latency-based routing: returns the IP of the region with lowest latency to the client. Geolocation routing: EU clients → EU region (data residency compliance). Health checks: if us-east-1 health check fails, Route 53 removes it from DNS and routes to eu-west-1. TTL: set low (60s) for fast failover, but DNS caching by ISPs can still cause 5–15min propagation delay.
Anycast: Cloudflare and AWS Global Accelerator advertise the same IP prefix from all PoPs globally. BGP routing automatically sends clients to the nearest PoP. No DNS TTL delay — failover happens at BGP convergence time (~30s). Better for real-time apps where DNS TTL delay is unacceptable.
Active-active vs active-passive: Active-active: both regions serve live traffic. Write conflicts require conflict resolution (last-write-wins, CRDTs, or sticky writes to one region). Active-passive: primary region serves all traffic; passive is warm standby. RTO (Recovery Time Objective) ~minutes. Lower complexity.
Data replication: Stateless services are easy — just add replicas. Databases: async replication with possibility of data loss on failover, or synchronous replication with latency penalty (Spanner, CockroachDB, Aurora Global).
What is RTO vs RPO and how do they influence architecture decisions?
How does AWS Global Accelerator differ from CloudFront?
⚑ Staff signal: Know RTO (Recovery Time Objective — how fast you recover) vs RPO (Recovery Point Objective — how much data you can lose). RTO=0/RPO=0 requires hot-hot active-active with synchronous replication — very expensive. Most systems accept RPO=30s (async replication) and RTO=5min (automation + DNS failover). Tie this to business cost of downtime.
10 / 14 · Networking
CDN & DNS
CDNs serve content from edge PoPs close to users, eliminating origin round-trips. DNS is the internet's phone book — a hierarchical, distributed, eventually-consistent system. Both are fundamental to any low-latency global architecture.
Interactive Visualizer — DNS Resolution Chain
Press Resolve to walk the full DNS resolution chain
⚑ Staff signal: DNS is eventually consistent — lowering TTL before a migration gives faster propagation but increases resolver load. DNS over HTTPS (DoH) encrypts queries. Cloudflare's 1.1.1.1 and Google's 8.8.8.8 are recursive resolvers that cache answers.
Full resolution time
50–300ms
Root → TLD → authoritative
Cached resolution
<5ms
Resolver has TTL-valid answer
CDN cache hit ratio
85–98%
Depends on content + TTL config
Interview Questions & Deep Dives
Q1
Walk me through what happens when a user types "example.com" in a browser.
▾
Browser checks local cache → OS resolver cache → recursive resolver (ISP/8.8.8.8) → root nameserver → TLD nameserver (.com) → authoritative nameserver → IP returned → TCP connection → TLS handshake → HTTP request. This entire flow takes 100–500ms on a cold start.
Deep Dive
1. Local caches: Browser DNS cache (chrome://net-internals/#dns), OS hosts file (/etc/hosts), OS resolver cache (nscd/systemd-resolved). TTL-bounded.
2. Recursive resolver query: OS sends query to configured DNS resolver (DHCP-assigned or 8.8.8.8). Resolver checks its cache. If miss, begins iterative resolution.
3. Root nameserver: 13 logical root server clusters (A–M), distributed globally via anycast. Returns NS records for .com TLD server. Response cached; root servers are rarely queried.
4. TLD nameserver: .com TLD operated by Verisign. Returns NS records for example.com's authoritative nameserver. TTL: 2 days (changes rarely).
5. Authoritative nameserver: Returns A record (IPv4) or AAAA (IPv6) for example.com. This is what you configure in Route 53 / Cloudflare DNS. TTL: 300s (5min) is common. Lowered to 60s before migrations.
6. TCP + TLS: 3-way handshake (1 RTT) + TLS 1.3 handshake (1 RTT with 0-RTT for repeat connections). HTTP/2 multiplexes multiple requests over one connection. QUIC/HTTP3 eliminates TCP handshake overhead entirely.
CDN interception: If Cloudflare is the authoritative NS, it returns a CDN edge IP. Request goes to nearest PoP, served from edge cache. Origin is only contacted on cache miss.
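The caching behavior in steps 1–2 above is just a TTL-bounded map. A minimal sketch, with the upstream walk faked (the IP and the 300s TTL are illustrative — example.com's real record may differ):

```python
cache = {}   # name → (ip, expires_at)

def fake_upstream(name):
    # stand-in for the full root → TLD → authoritative walk
    return ("93.184.216.34", 300)   # (A record, TTL seconds) — invented

def resolve(name, now):
    hit = cache.get(name)
    if hit and hit[1] > now:        # TTL still valid → the <5ms cached path
        return hit[0]
    ip, ttl = fake_upstream(name)   # miss → 50-300ms iterative resolution
    cache[name] = (ip, now + ttl)
    return ip

assert resolve("example.com", 0) == "93.184.216.34"     # cold: upstream walk
assert resolve("example.com", 299) == "93.184.216.34"   # warm: from cache
# at t=301 the entry has expired and the next call re-queries upstream
```

This is also why lowering TTL before a migration works: new answers replace expired entries sooner, at the cost of more upstream queries.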
What is DNSSEC and what attack does it prevent?
How does a CDN decide what to cache vs pass to origin?
⚑ Staff signal: DNSSEC prevents DNS spoofing/cache poisoning by signing DNS records with cryptographic signatures. Cloudflare supports DNSSEC automatically. Also mention: negative caching (NXDOMAIN TTL) — if a domain doesn't exist, resolvers cache that too. Important when launching new domains — propagation delay before the world can resolve them.
Q2
How do you design a CDN strategy for a video streaming platform?
▾
Video streaming requires high-throughput, low-latency delivery of large files to millions of concurrent users globally. Multi-CDN strategy with adaptive bitrate (ABR) streaming, segment-level caching, and real-time analytics on quality of experience (QoE).
Deep Dive
ABR streaming (HLS / DASH): Video encoded at multiple quality levels (360p, 720p, 1080p, 4K). Split into 2–6 second segments. Client player downloads a manifest (.m3u8 or .mpd) listing segment URLs. Player monitors download speed and buffer health, switching quality levels seamlessly. Each segment is independently cacheable at CDN edge.
CDN caching for video: Segments (small, fixed-size) → high cache hit ratio. Manifest file (.m3u8) → short TTL (30s) since live streams update it constantly. Signed URLs for auth: CDN validates HMAC signature so you don't need to call origin for auth on every segment request.
Multi-CDN: Use 2–3 CDN providers (Cloudflare, Fastly, Akamai). Route users to fastest CDN based on real-time QoE data. Fallback: if CDN A has elevated error rate → shift traffic to CDN B. Tested by Netflix (OpenConnect) and YouTube.
Origin shield: CDN PoPs closer to edge may not cache long-tail content. Origin shield = a single regional CDN node that consolidates cache-miss requests to origin. Reduces origin load from N PoP misses to 1 shield miss per segment.
Live streaming: Ultra-low latency (<3s): CMAF Low-Latency HLS or WebRTC. Standard live (3–8s): HLS/DASH with short segment duration. Broadcast-quality (8–30s): larger segments for better compression.
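The ABR switching logic above reduces to a throughput-based rendition pick. A naive sketch — the ladder values and 0.8 safety factor are illustrative choices, and real players also weigh buffer health:

```python
# (height, required Mbps) — an invented encoding ladder
LADDER = [(360, 1.0), (720, 3.0), (1080, 6.0), (2160, 16.0)]

def pick_rendition(throughput_mbps, safety=0.8):
    budget = throughput_mbps * safety   # leave headroom for throughput variance
    best = LADDER[0][0]                 # never drop below the lowest rung
    for height, bitrate in LADDER:
        if bitrate <= budget:
            best = height
    return best

assert pick_rendition(10.0) == 1080   # budget 8 Mbps: 6 fits, 16 doesn't
assert pick_rendition(1.0) == 360     # budget 0.8: floor at the lowest rung
assert pick_rendition(25.0) == 2160
```

Because each 2–6s segment is fetched independently, the player re-runs this choice per segment and the CDN caches every rendition separately.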
How does Netflix's OpenConnect CDN differ from commercial CDNs?
How would you implement DRM (Digital Rights Management) with CDN delivery?
⚑ Staff signal: Netflix runs OpenConnect — their own embedded CDN with ISP partnerships. ISPs host Netflix appliances on their network; Netflix traffic never leaves the ISP network. Result: near-zero CDN costs, maximum throughput. Not feasible for most companies, but understanding the concept shows you think about cost and architecture at scale.
11 / 14 · API & Services
API Design
Great APIs are contracts. REST, gRPC, and GraphQL each optimize for different trade-offs. Versioning, idempotency, rate limiting, pagination, and error handling are the difference between an API that teams love and one they dread.
Interactive Visualizer — REST vs gRPC vs GraphQL
REST
GET /users/42
→ 200 OK { id, name, email,
avatar, bio, followers,
...20 other fields }
Over-fetch: client needs
only name + avatar
gRPC
rpc GetUser(UserRequest)
returns (UserResponse);
Binary Protobuf encoding
Streaming: server, client,
bidirectional
✓ 5-10x smaller payload
GraphQL
query {
user(id: 42) {
name avatar
}
}
✓ Exactly what you need
⚑ Staff signal: REST for public APIs (universal tooling), gRPC for internal microservice communication (performance + type safety), GraphQL for frontend-driven flexible queries (BFF pattern). Never use GraphQL as a public API without depth limiting and query cost analysis — N+1 and deeply nested queries can DoS your DB.
REST payload
JSON ~5–50KB
Human-readable, universal
gRPC payload
Protobuf ~1–10KB
5–10x smaller, binary, typed
GraphQL
Request exactly fields needed
Needs N+1 protection (DataLoader)
Interview Questions & Deep Dives
Q1
Design a robust pagination strategy for a large dataset API. Compare cursor vs offset pagination.
▾
Offset pagination (LIMIT/OFFSET) is simple but slow on deep pages — database scans and discards offset rows. Cursor-based pagination uses an opaque pointer to the last seen item, enabling O(log N) seeks via index. Cursor is the production standard for any table with frequent inserts/deletes.
Deep Dive
Offset pagination problem: SELECT * FROM posts ORDER BY created_at LIMIT 20 OFFSET 10000 forces the DB to read and discard 10,000 rows. Each page gets slower. Also: if a row is inserted/deleted between page 1 and page 2 requests, you get duplicates or skipped items.
Cursor-based (keyset) pagination:
-- First page
SELECT * FROM posts ORDER BY created_at DESC LIMIT 20;
-- Returns: last item has created_at='2024-01-15 10:30:00'
-- Next page cursor = base64('2024-01-15 10:30:00')
SELECT * FROM posts
WHERE created_at < '2024-01-15 10:30:00'
ORDER BY created_at DESC LIMIT 20;
Index seek directly to cursor position — O(log N) regardless of page depth. Stable: inserts/deletes don't cause skips or duplicates.
Cursor collision (non-unique sort key): If two items have the same created_at, the cursor is ambiguous. Fix: composite cursor (created_at, id) → stable, unique ordering: WHERE (created_at, id) < (cursor_ts, cursor_id).
API design: Return { data: [...], cursor: "opaque_base64", has_more: true }. The cursor is opaque to the client — the implementation can change without breaking clients. GraphQL Relay spec: edges + pageInfo { endCursor, hasNextPage }.
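End to end, the composite-cursor scheme looks like this sketch — an in-memory list stands in for the posts table, with invented data; field names mirror the SQL above:

```python
import base64
import json

POSTS = sorted(
    [{"id": i, "created_at": "2024-01-%02d" % (15 - i % 3)} for i in range(1, 8)],
    key=lambda p: (p["created_at"], p["id"]),
    reverse=True,   # same order as ORDER BY created_at DESC, id DESC
)

def encode_cursor(post):
    raw = json.dumps([post["created_at"], post["id"]])
    return base64.urlsafe_b64encode(raw.encode()).decode()   # opaque to clients

def page(cursor=None, limit=3):
    items = POSTS
    if cursor:
        ts, pid = json.loads(base64.urlsafe_b64decode(cursor))
        # keyset filter: strictly past the cursor in (created_at, id) DESC order
        items = [p for p in POSTS if (p["created_at"], p["id"]) < (ts, pid)]
    data = items[:limit]
    return {"data": data,
            "cursor": encode_cursor(data[-1]) if data else None,
            "has_more": len(items) > limit}

p1 = page()
p2 = page(p1["cursor"])
ids = [p["id"] for p in p1["data"] + p2["data"]]
assert len(ids) == len(set(ids))   # no duplicates or skips across pages
assert p1["has_more"] is True
```

Because the cursor encodes (created_at, id), ties on created_at are broken deterministically — the collision fix described above.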
How do you implement bidirectional pagination (previous page support) with cursors?
How would you paginate across multiple shards?
⚑ Staff signal: Cursor pagination with composite keys is the gold standard. But for search results that need "jump to page 50" (like Google's pagination), offset is necessary — just cache the intermediate offsets and limit max page depth. Twitter's timeline uses cursor pagination; Google Search uses offset with max 100 pages.
Q2
How do you design API versioning and handle breaking vs non-breaking changes?
▾
URL versioning (/v1/, /v2/) is most common — visible, cacheable, easy to route. Header versioning (Accept: application/vnd.api.v2+json) is cleaner but less visible. Breaking changes require a new version. Non-breaking changes (adding fields, new optional parameters) can be deployed without versioning.
Deep Dive
Breaking changes (require new version): Removing fields, renaming fields, changing field types, changing HTTP methods, changing auth scheme, removing endpoints, changing error codes.
Non-breaking changes (safe to deploy): Adding new optional request parameters, adding new response fields (clients should ignore unknown fields), new endpoints, new enum values (if client handles gracefully), looser validation.
Versioning strategies:
1. URL path: /v1/users, /v2/users. Simple, explicit, excellent CDN cacheability. Most popular for public APIs (Twitter, Stripe, Twilio).
2. Header versioning: API-Version: 2024-01-01 (Stripe's date-based approach). Clean URLs; version is a request concern, not a resource concern. Harder to test in a browser.
3. Query param: /users?version=2. Convenient for quick tests; messy for complex APIs.
Sunset policy: When deprecating v1: (a) add Sunset: Sat, 31 Dec 2024 and Deprecation: Mon, 01 Jan 2024 headers on every v1 response, (b) email registered API key owners, (c) log which clients still use v1, (d) enforce the sunset date. Stripe runs versions for years — very generous sunset windows for public APIs.
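The sunset-policy headers can be stamped by a tiny middleware. A sketch with an invented response shape and illustrative dates (the Sunset header format follows RFC 8594's HTTP-date convention):

```python
import datetime

SUNSET = datetime.datetime(2024, 12, 31, tzinfo=datetime.timezone.utc)

def with_deprecation_headers(response, version):
    """Add Deprecation + Sunset headers to every response from a deprecated version."""
    if version == "v1":
        response["headers"]["Deprecation"] = "Mon, 01 Jan 2024 00:00:00 GMT"
        response["headers"]["Sunset"] = SUNSET.strftime("%a, %d %b %Y %H:%M:%S GMT")
    return response

old = with_deprecation_headers({"status": 200, "headers": {}}, "v1")
new = with_deprecation_headers({"status": 200, "headers": {}}, "v2")
assert old["headers"]["Sunset"].endswith("Dec 2024 00:00:00 GMT")
assert "Deprecation" not in new["headers"]   # current version is untouched
```

Pair this with request logging keyed by API key so step (c) — finding who still calls v1 — is a query, not a guess.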
How does Stripe's date-based versioning work and what are its advantages?
How do you manage versioning in a microservices internal API mesh?
⚑ Staff signal: Stripe's versioning model — each API key is pinned to a version at key creation. Breaking changes are released as new versions. Customers explicitly upgrade. This means Stripe runs 10+ simultaneous API versions in production, with a compatibility shim layer. This is the gold standard for developer-friendly public APIs. For internal microservices, prefer backward-compatible changes and deprecation via Protobuf field deprecation markers.
Q3
Design an idempotent payment API. How do you handle network retries safely?
▾
Idempotency keys are client-generated UUIDs sent with every mutating request. The server stores (idempotency_key → response) in a durable cache. On retry with the same key, return the cached response instead of re-executing. This decouples "did the request arrive?" from "did the operation succeed?"
Deep Dive
Client implementation: Generate UUID before sending. Retry the exact same request (same body + same key) on timeout or 5xx. Never generate a new key for a retry — that creates a new operation.
Server implementation:
BEGIN;
-- Atomic check-and-insert
INSERT INTO idempotency_keys (key, status, response)
VALUES ('uuid-123', 'processing', NULL)
ON CONFLICT (key) DO UPDATE
SET key = EXCLUDED.key  -- no-op update so RETURNING also yields an existing row
RETURNING status, response;
-- If status='completed' → return cached response
-- If status='processing' → wait or return 409
-- Execute payment logic...
UPDATE idempotency_keys SET status='completed', response=... WHERE key='uuid-123';
COMMIT;
Concurrent retry handling: Two retries arrive simultaneously. The first gets the row lock (SELECT FOR UPDATE on idempotency_keys). The second sees 'processing' and returns 409 or waits. When the first completes, the second re-reads and returns the cached response.
TTL: Keep idempotency keys for 24 hours (Stripe) or 7 days. After expiry, the key is invalid — the client must generate a new one. Store in Redis with TTL for fast lookups + periodic sync to DB for durability.
Response reproducibility: The same idempotency key must always return the exact same response — including error responses. If the payment failed with "insufficient funds" on the first attempt, a retry must return the same error, not re-attempt.
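The control flow of the server-side table can be sketched in memory — a lock plays the role of the row lock, and the charge function is a stand-in for the real payment call:

```python
import threading

_store = {}          # key → ("processing" | "completed", response)
_lock = threading.Lock()

def charge(amount):
    return {"status": "succeeded", "amount": amount}   # pretend payment call

def handle_payment(idempotency_key, amount):
    with _lock:
        if idempotency_key in _store:
            state, response = _store[idempotency_key]
            if state == "completed":
                return response            # retry → replay the cached response
            return {"status": "conflict"}  # concurrent duplicate → 409
        _store[idempotency_key] = ("processing", None)   # claim the key
    response = charge(amount)
    with _lock:
        _store[idempotency_key] = ("completed", response)
    return response

first = handle_payment("uuid-123", 500)
retry = handle_payment("uuid-123", 500)
assert first == retry == {"status": "succeeded", "amount": 500}
assert len(_store) == 1    # one key, one charge
```

Note the gap this sketch shares with naive real implementations: a crash between charge() and the 'completed' write re-runs the charge on retry — the exact atomicity problem the staff signal below calls out, solved by putting the key write in the same DB transaction as the payment.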
What happens if the server commits the payment but crashes before recording the idempotency key?
How do you implement idempotency for a distributed workflow (Saga)?
⚑ Staff signal: The atomicity problem — you need to commit the idempotency key in the same transaction as the payment. If payment succeeds but key recording fails (crash), the next retry re-executes the payment creating a duplicate charge. Solution: write idempotency key in the same DB transaction as all side effects. Stripe does this by making the idempotency key table part of the payments database.
12 / 14 · API & Services
Microservices & Circuit Breaker
Microservices decompose a monolith into independently deployable services. This creates distributed systems challenges: network partitions, cascading failures, service discovery, and distributed tracing. Circuit breakers prevent one slow service from taking down the entire system.
Interactive Visualizer — Circuit Breaker State Machine
Circuit CLOSED — system nominal. Trigger failures to open the circuit.
⚑ Staff signal: Circuit breakers only make sense with a fallback — degraded response, cached data, or a graceful error. Without a fallback, the circuit breaker just converts a slow error into a fast error. Hystrix introduced the pattern; Resilience4j, Envoy, and Istio implement it at the infrastructure layer.
CLOSED state
Normal operation
Requests pass through, errors counted
OPEN state
Fail fast
Requests rejected immediately — no calls to downstream
HALF-OPEN state
Probe recovery
1 request let through; success → CLOSE, fail → OPEN
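The three-state machine above fits in a small class. A minimal sketch — the thresholds and the injectable clock are choices for the demo, not a spec (Resilience4j and Envoy expose far richer knobs):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state, self.failures, self.opened_at = "CLOSED", 0, 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"          # let one probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "OPEN", self.clock()
            raise
        self.failures, self.state = 0, "CLOSED"   # success (or probe) → close
        return result

now = [0.0]                       # fake clock so the demo needs no sleeping
cb = CircuitBreaker(clock=lambda: now[0])

def boom(): raise IOError("downstream timeout")

for _ in range(3):                # three failures trip the breaker
    try: cb.call(boom)
    except IOError: pass
assert cb.state == "OPEN"
now[0] = 31.0                     # after reset_timeout, one probe is allowed
assert cb.call(lambda: "ok") == "ok"
assert cb.state == "CLOSED"
```

While OPEN, call() raises immediately without touching the downstream — which is only useful if the caller has the fallback the staff signal insists on.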
Interview Questions & Deep Dives
Q1
What is a service mesh and when would you introduce one? (Istio, Linkerd, Envoy)
▾
A service mesh is an infrastructure layer that handles service-to-service communication — mTLS, retries, circuit breaking, observability — via sidecar proxies injected alongside every service. It moves cross-cutting concerns out of application code into the platform layer. Introduce it when you have 10+ services and cross-cutting concerns are duplicated across codebases.
Deep Dive
Sidecar proxy pattern: Envoy (or Linkerd's linkerd-proxy) runs as a sidecar container in the same pod. All inbound and outbound traffic is transparently intercepted. Services talk to localhost; the sidecar handles everything else.
What a service mesh provides:
1. mTLS (mutual TLS): Every service-to-service call is encrypted and mutually authenticated. No more trusting "because it's inside the cluster." Istio does this with SPIFFE/SPIRE certificate issuance.
2. Retries + circuit breaking: Configured in YAML, not code. Consistent policy across all services.
3. Observability: Automatic distributed traces (Jaeger/Zipkin integration), golden signals (latency, traffic, errors, saturation) for every service pair without code changes.
4. Traffic management: Canary deployments (route 5% of traffic to v2), A/B testing, fault injection for chaos engineering.
Drawbacks: Sidecar adds ~2ms latency per hop. Operational complexity of Istio is significant. Alternative: Cilium eBPF-based mesh (no sidecar, kernel-level, lower overhead).
When NOT to use: <5 services, single team, small scale. The overhead isn't worth it. Start with a good HTTP client library (Resilience4j, go-kit) for retry/circuit breaking.
How does eBPF-based networking (Cilium) compare to sidecar-based service meshes?
What is the difference between east-west traffic and north-south traffic?
⚑ Staff signal: The hidden cost of microservices is distributed systems complexity: network failures, partial failures, cascading timeouts. The fallacies of distributed computing (network is reliable, latency is zero, bandwidth is infinite, network is secure) become daily operational realities. Microservices should be a conscious trade-off for independent deployability and scalability — not a default architecture choice.
Q2
How do you decompose a monolith into microservices? What patterns guide the decomposition?
▾
Decompose along bounded contexts (Domain-Driven Design). Each service owns its data, has a single responsibility, and communicates via well-defined APIs or events. The Strangler Fig pattern migrates a monolith incrementally — new features go in services while the monolith continues operating.
Deep Dive
Bounded contexts (DDD): Identify domain boundaries where concepts have consistent meaning. "Order" means different things to the payment team (a charge to process) vs the warehouse team (a pick list) vs the customer team (a status to track). These are natural service boundaries.
Strangler Fig pattern: Build new functionality as services. Add a routing layer (API gateway) in front of the monolith. Gradually migrate monolith endpoints to services behind the same gateway. The monolith "strangles" over time — new code goes to services, old code stays until migrated. Zero-risk incremental migration.
Database decomposition: Most painful part. Options:
1. Shared DB (anti-pattern) → two services sharing a DB schema — creates coupling.
2. Database-per-service → each service owns its schema. Joins go away; replaced by API calls or events. Enables polyglot persistence (use the right DB for each service).
3. Shared DB with schema isolation → same Postgres instance but different schemas — a stepping stone toward full separation.
Anti-patterns to avoid: Distributed monolith (services that are tightly coupled and must deploy together). Nano-services (too granular — network overhead dominates). Services that share databases (schema coupling).
How do you handle cross-service transactions (Saga vs 2PC)?
What is the Backend-for-Frontend (BFF) pattern?
⚑ Staff signal: The Saga pattern for distributed transactions — either choreography (services emit events, react to each other's events, emit compensating events on failure) or orchestration (a central saga orchestrator calls each service and issues rollbacks on failure). Choreography is more decoupled but harder to debug. Orchestration is more explicit and observable. Temporal.io is a popular orchestration engine.
13 / 14 · Search & Observe
Search & Full-Text Indexing
Search engines use inverted indexes to map terms → document IDs. Elasticsearch (Lucene) shards documents across nodes, enables near-real-time search, and supports full-text, faceted, and vector search. The challenge is balancing freshness, relevance, and performance.
Interactive Visualizer — Inverted Index Construction
Add documents to build the inverted index
⚑ Staff signal: Inverted index lookup is O(1) for a term — just a hash map lookup. The cost is in query expansion (synonyms, stemming), scoring (BM25), and merging postings lists for multi-term queries. Use Elasticsearch for full-text; don't build full-text search on top of relational LIKE queries.
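A toy inverted index makes the O(1)-per-term lookup and the postings-list merge concrete. Function names and documents here are illustrative; real engines add stemming, BM25 scoring, and compressed postings:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """term -> sorted postings list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_and(index, query):
    """Multi-term AND query = intersection of each term's postings list."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "cheap running shoes", 2: "running watch", 3: "cheap watch"}
index = build_inverted_index(docs)
# index["cheap"] == [1, 3]
# search_and(index, "cheap watch") == [3]
```

The per-term lookup is a dict access; the real work for multi-term queries is intersecting (merging) the postings lists, exactly as the staff signal says.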
Inverted index lookup
O(1) per term
Hash map: term → postings list
Indexing lag (ES)
~1 second NRT
Near-real-time, configurable refresh
Reindex cost
Full corpus rescan
Use index aliases for zero-downtime reindex
Interview Questions & Deep Dives
Q1
Design a full-text search system for a product catalog with 100M items.
▾
Use Elasticsearch as the search layer with Postgres as the system of record. CDC (Debezium) or dual-write keeps ES in sync. Design the ES mapping carefully — analyzers, field types, and the primary shard count are fixed at index creation; changing them later requires a full reindex.
Deep Dive
Architecture: Postgres (source of truth) → CDC/Debezium → Kafka → Elasticsearch indexing workers → ES cluster → Search API → clients.
ES cluster sizing (100M docs): ~1KB per doc average = 100GB source data. ES index size ~3–5x source (inverted index overhead, stored fields, doc values) = 300–500GB. With 3 replicas: ~1TB. Shard size recommendation: 20–50GB per shard → 20–50 primary shards. Too many small shards → overhead per shard (heap, segment merges). Too few large shards → uneven distribution, slow reindex.
Mapping design:
- title: text with standard analyzer + keyword for exact sort/facet
- description: text with english analyzer (stemming: "running" → "run")
- price: float (range queries, sorting)
- category: keyword (facets, filtering)
- embedding: dense_vector(dims=1536) for semantic search
Hybrid search (keyword + semantic): BM25 for exact keyword relevance + vector similarity (cosine) for semantic meaning. Combine with Reciprocal Rank Fusion (RRF) or weighted combination. ES 8.x has native hybrid search with kNN + BM25.
Zero-downtime reindex: Create new index with updated mapping → reindex documents → atomic alias swap (POST _aliases: remove old, add new) → delete old index. Never reindex in place.
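The Reciprocal Rank Fusion merge used for hybrid search is simple enough to sketch. The formula score(d) = Σ 1/(k + rank) with k ≈ 60 is the standard RRF definition; the document IDs below are invented:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over ranked lists of 1 / (k + rank)."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]  # keyword (BM25) ranking
knn_hits = ["d1", "d3", "d4"]   # vector-similarity ranking
# rrf_merge([bm25_hits, knn_hits]) puts "d1" first: it tops both lists.
```

RRF needs only ranks, not raw scores, which is why it fuses BM25 and cosine similarity without any score normalization.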
How does Elasticsearch handle a node failure during a search?
What is the difference between a query context and a filter context in ES?
⚑ Staff signal: The query vs filter context distinction is critical for performance. Filters are cached (bitsets, extremely fast), don't affect relevance scores, and should be used for all structured data (status=active, price_range, category). Queries compute BM25 scores and are not cached. Wrong: putting a filter in a query context wastes computation. Right: bool query with must (scored) + filter (cached).
Q2
How does vector search work? When would you use it over BM25 keyword search?
▾
Vector search converts text (or images) into high-dimensional embeddings using ML models. Similarity is measured by cosine or dot product distance. It finds semantically similar content even without exact keyword matches. Use for: semantic search, recommendations, duplicate detection, multimodal search. BM25 wins for: exact keyword matching, high-precision structured lookups.
Deep Dive
Embedding generation: Pass text through a model (OpenAI text-embedding-3-small, sentence-transformers, Cohere) → 768–3072 dimension float vector. Each document gets an embedding at index time. Query also gets an embedding at query time.
Approximate Nearest Neighbor (ANN) algorithms: Exact kNN (brute force cosine similarity over all vectors) is O(N×D) — too slow for 100M docs. ANN algorithms trade recall for speed:
- HNSW (Hierarchical Navigable Small World): Graph-based, O(log N) query, excellent recall. Used by: pgvector, Pinecone, Weaviate, ES 8.x. Memory-intensive: 1536-dim float32 = 6KB/vector → 600GB for 100M vectors.
- FAISS (Facebook AI Similarity Search): IVF (inverted file index) + PQ (product quantization) compresses vectors 4–32x. Lower memory, slight recall loss. Good for massive scale.
When BM25 wins: "GET invoice INV-2024-001" — exact match, structured lookup. Part numbers, SKUs, specific names — users know exactly what they want.
When vector wins: "comfortable shoes for standing all day" — semantic intent, no exact keywords. Recommendations ("similar to this product"). Multimodal ("search by image").
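A brute-force exact kNN sketch (pure Python, toy 2-dimensional vectors, invented names) shows the O(N×D) scan that ANN structures like HNSW and IVF exist to avoid:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn_exact(query, vectors, k=2):
    """Exact kNN: score every vector against the query, O(N*D) per search."""
    ranked = sorted(vectors, key=lambda doc: cosine(query, vectors[doc]), reverse=True)
    return ranked[:k]

embeddings = {"shoes": [0.9, 0.1], "boots": [0.8, 0.2], "watch": [0.1, 0.9]}
# knn_exact([1.0, 0.0], embeddings) == ["shoes", "boots"]
```

At 100M 1536-dim vectors this scan is billions of multiplications per query, which is why ANN trades a little recall for sub-linear search.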
How do you keep vector embeddings fresh as content updates?
What is RAG (Retrieval Augmented Generation) and how does vector search enable it?
⚑ Staff signal: RAG — LLMs have a fixed knowledge cutoff. RAG gives LLMs access to current, proprietary data: embed user query → retrieve top-K relevant chunks via vector search → inject chunks into LLM prompt as context → LLM answers with grounded knowledge. This is how Notion AI, Cursor, and most AI assistants work. The retrieval quality determines the answer quality — garbage in, garbage out.
14 / 14 · Search & Observe
Observability
Observability is the ability to understand a system's internal state from external outputs. The three pillars: Metrics (what is happening), Logs (what happened), Traces (why it happened). SLIs, SLOs, and error budgets define reliability contracts between engineering and the business.
Interactive Visualizer — SLO Error Budget Burn
SLO: 99.9% availability. Error budget: 43.8 min/month.
⚑ Staff signal: Error budgets make reliability a data-driven conversation. When budget is ample → ship features. When budget is exhausted → freeze deployments, focus on reliability. This removes the engineering vs PM tension — both sides agreed to the budget upfront.
99.9% SLO monthly budget
43.8 minutes
8.7h/year downtime allowance
99.99% SLO monthly budget
4.4 minutes
52.6 min/year downtime allowance
MTTD → MTTR
Detect fast, fix fast
Alerting: symptom-based, not cause-based
Interview Questions & Deep Dives
Q1
Explain SLIs, SLOs, and error budgets. How do you set meaningful SLOs?
▾
SLI (Service Level Indicator): a quantitative measure of service quality (e.g., % requests returning 2xx in <200ms). SLO (Service Level Objective): the target value for an SLI (e.g., 99.9% over 30 days). Error budget: 1 - SLO = acceptable failure rate. SLA (Service Level Agreement): a legal commitment with penalties, typically looser than the internal SLO so that internal alerts fire well before a contractual breach.
Deep Dive
Good SLIs: Measure what users care about — not internal metrics. Bad: CPU < 80%. Good: request latency p99 < 500ms at the load balancer level. Best SLI categories (Google's four golden signals): Latency (how slow), Traffic (how much load), Errors (how many failures), Saturation (how full — queue depth, DB connections).
Setting SLOs:
1. Start with historical data — what's your p99 latency actually been? Set SLO slightly below your current best performance.
2. Ask: what does the user experience? For payments, 99.99% availability. For recommendations, 99.5% is fine (degrading to no recommendations is acceptable).
3. Consider dependency SLOs. Your SLO cannot exceed your dependencies' SLOs. If your DB is 99.9%, your service SLO must be ≤ 99.9% (in practice, lower due to your own failures).
Error budget usage: 99.9% SLO = 43.8 min budget/month. Spend budget on: incidents, planned maintenance, risky feature deploys. When budget is low (<20% remaining): pause risky changes, focus reliability work. When budget is ample: move fast, ship features. This creates a data-driven conversation between reliability and velocity.
Multi-window burn rate alerts: Google SRE recommendation: alert when burn rate is high over both short (1h) and long (6h) windows simultaneously. Avoids false positives from brief spikes. A 14x burn rate over 1h means you'll exhaust the monthly budget in 2 days.
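The budget and burn-rate arithmetic above can be checked in a few lines. Function names are illustrative; 43,830 min/month assumes an average 365.25/12-day month, which is how the 43.8-minute figure is derived:

```python
MONTH_MINUTES = 43_830  # 365.25 days / 12 months, in minutes

def error_budget_minutes(slo, window_minutes=MONTH_MINUTES):
    """Allowed downtime per window: (1 - SLO) * window."""
    return (1.0 - slo) * window_minutes

def hours_to_exhaustion(slo, observed_error_rate, window_minutes=MONTH_MINUTES):
    """Burn rate = observed error rate / allowed error rate.
    At burn rate B, the window's whole budget lasts window / B."""
    burn_rate = observed_error_rate / (1.0 - slo)
    return window_minutes / burn_rate / 60.0

# error_budget_minutes(0.999)       -> ~43.8 min/month
# hours_to_exhaustion(0.999, 0.014) -> ~52 hours: a 14x burn rate empties the
# monthly budget in just over 2 days, which is why it pages immediately.
```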
How do you handle an SLO that keeps getting violated — when do you raise vs lower it?
What's the difference between availability SLO and latency SLO measurement?
⚑ Staff signal: Alerting on SLI burn rate (not raw error rate) dramatically reduces alert fatigue. A 1% error rate on 10 req/min (1 error) is very different from 1% on 10M req/min (100K errors). Burn rate alerts: "you will exhaust your monthly budget in Xh at the current rate" is actionable. Raw error rate alerts fire on every hiccup. Google's SRE Workbook chapter on alerting is worth reading verbatim.
Q2
How does distributed tracing work? Design a tracing system from scratch.
▾
Distributed tracing tracks a request as it flows through multiple services by propagating a trace context (trace ID + span ID) in headers. Each service creates a span (start time, end time, service name, attributes, parent span ID). Spans form a tree. A trace collector aggregates spans; a UI (Jaeger, Zipkin, Honeycomb) renders the timeline.
Deep Dive
Trace context propagation (W3C traceparent header): the header has the form 00-{trace_id}-{span_id}-{flags} (01 = sampled).
trace_id: 128-bit, unique per request (e.g., user clicks "checkout"). span_id: 64-bit, unique per operation within the trace. Each service extracts the parent span from incoming headers, creates a child span, and injects its own span ID before calling downstream services.
Span attributes: http.method, http.status_code, db.statement, peer.service, error=true, user.id. Rich attributes enable powerful queries: "all traces where db.statement contained a slow query AND error=true."
Sampling: At 10K req/s, tracing 100% creates enormous data volume. Strategies:
1. Head sampling: decide at trace entry point (1% of requests). Simple, but misses rare slow requests.
2. Tail sampling: collect all spans, make sampling decision after trace completes (keep 100% of errors + slow traces, 1% of fast successful ones). Requires a trace buffer (Tempo's tail sampler, Honeycomb's Refinery).
OpenTelemetry standard: Vendor-neutral: instrument once, export to any backend (Jaeger, Zipkin, Honeycomb, Datadog, Grafana Tempo). SDK available in Go, Java, Python, JS, Rust. Auto-instrumentation: zero-code spans for HTTP, gRPC, DB drivers via agent injection.
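A minimal sketch of traceparent propagation, assuming only the W3C header format (version-traceid-parentid-flags); in practice the OpenTelemetry SDK does this for you via auto-instrumentation:

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

def make_traceparent(trace_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 128-bit trace id (32 hex chars)
    span_id = secrets.token_hex(8)                # 64-bit span id for this hop
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def child_headers(incoming):
    """Continue the trace: same trace_id, fresh span_id for the outgoing call."""
    m = TRACEPARENT_RE.fullmatch(incoming or "")
    if not m:
        return {"traceparent": make_traceparent()}  # no valid context: new root trace
    trace_id, _parent_span, flags = m.groups()      # real SDKs record _parent_span too
    return {"traceparent": make_traceparent(trace_id, sampled=flags == "01")}
```

The constant trace_id is what lets the collector stitch spans from every service back into one tree.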
How do you correlate logs with traces? (trace_id in log fields)
What is the OpenTelemetry Collector and why is it valuable?
⚑ Staff signal: The OTel Collector is a critical piece — it receives spans/metrics/logs from services, buffers, batches, and routes to multiple backends. Decouples instrumentation from destination: same code can send to Jaeger in dev and Honeycomb in prod. Also enables enrichment (add k8s metadata), sampling, and filtering centrally. Running without a Collector means vendor lock-in from day one.
Q3
Design a metrics pipeline that handles 1M time-series at 10s intervals. (Prometheus-style)
▾
1M series at 10s = 100K samples/s. Prometheus scrape model: server pulls metrics from each service's /metrics endpoint. For this scale, use federated Prometheus or switch to a remote write pipeline — Prometheus → Thanos/VictoriaMetrics for long-term, scalable storage. Pre-aggregate with recording rules to reduce query-time computation.
Deep Dive
Prometheus internals: Time-series DB (TSDB) on local disk. 2-hour in-memory block, flushed to immutable on-disk block. Blocks compacted over time. Retention: 15 days default. WAL for durability. Query engine: PromQL (functional, vector-oriented).
Scale challenges: Single Prometheus: max ~5M active series, ~1M samples/s. Beyond that: Cortex, Thanos, or VictoriaMetrics for horizontal scaling.
Thanos architecture:
1. Sidecar runs next to each Prometheus, uploads blocks to object storage (S3) every 2h.
2. Store Gateway serves queries against S3 blocks.
3. Querier fans out to multiple Prometheus instances + Store Gateways, deduplicates replicated series.
4. Compactor merges and downsamples old blocks (5m resolution after 2 weeks).
Result: unlimited retention, global query view across Prometheus instances.
Cardinality explosion: Each unique label combination = one time-series.
http_requests_total{user_id="12345"} — if user_id has 10M values, this single metric creates 10M series. This is a "high cardinality" metric and will OOM Prometheus. Fix: use trace attributes for high-cardinality data; metrics for aggregate dimensions only.
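The explosion is just multiplication of distinct values per label; a sketch (illustrative function, not a Prometheus API):

```python
def series_count(label_values):
    """Active series for one metric = product of distinct values per label."""
    n = 1
    for values in label_values.values():
        n *= len(values)
    return n

# Bounded labels stay cheap: 3 status buckets x 2 regions = 6 series.
# series_count({"status": ["2xx", "4xx", "5xx"], "region": ["us", "eu"]}) == 6
# One unbounded label turns one metric into 10M series:
# series_count({"user_id": range(10_000_000)}) == 10_000_000
```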
What is the difference between push-based (StatsD) and pull-based (Prometheus) metrics collection?
How do you alert on metric anomalies vs static thresholds?
⚑ Staff signal: Cardinality is the most common Prometheus footgun. The rule: never put unbounded values in metric labels (user_id, request_id, IP address, URL path with IDs). These belong in traces/logs. Metrics labels should have bounded cardinality: status_code (finite set), service_name, region, method. Violating this causes OOM crashes on Prometheus — I've seen it take down monitoring during incidents.
15 / 19 · Transactions
Distributed Transactions
When a business operation spans multiple services or databases, coordinating atomicity is the hardest problem in distributed systems. Two-Phase Commit gives strong guarantees but blocks. Saga gives flexibility at the cost of eventual consistency.
Interactive Visualizer — Saga Choreography
Simulating a travel booking: Flight → Hotel → Payment → Confirm
⚑ Staff signal: 2PC requires a coordinator — if it crashes after sending PREPARE but before COMMIT, all participants hold locks indefinitely. Saga avoids this: each step is a local transaction + event. Failure triggers published compensation events. No coordinator, no global lock.
2PC lock window
Coordinator lifetime
Can deadlock if coordinator crashes
Saga consistency
Eventual (BASE)
ACD without I across services
Outbox guarantee
At-least-once delivery
CDC tail + idempotent consumers
Interview Questions & Deep Dives
Q1
Compare 2PC and Saga. When would you use each in production?
▾
Two-Phase Commit gives atomic, synchronous commit across multiple databases by having a coordinator drive PREPARE then COMMIT/ROLLBACK phases. Saga breaks a distributed transaction into a sequence of local transactions, each publishing events. On failure, compensating transactions undo previous steps. 2PC = strong atomicity, blocking. Saga = available, eventually consistent.
Deep Dive — 2PC
Phase 1 (PREPARE): Coordinator sends PREPARE to all participants. Each acquires locks and writes a redo log, then responds YES/NO.
Phase 2 (COMMIT or ROLLBACK): If all YES → COMMIT. Any NO → ROLLBACK. Participants apply and release locks.
Failure modes:
1. Coordinator crashes after PREPARE, before COMMIT: Participants hold locks indefinitely — must block waiting for coordinator to recover. This is the "in-doubt transaction" problem.
2. Participant crashes after ack, before commit: Coordinator waits for timeout, then rolls back. Participant must re-sync on recovery.
When to use 2PC: Same-organization, low-latency networks (e.g., two Postgres shards in the same AZ). Never across microservices owned by different teams — coordinator coupling is too tight.
Saga advantages: Each service uses only local transactions. No coordinator. Service failures trigger compensation events. Pairs perfectly with event-driven architecture. Used by Stripe, Uber, Netflix.
Saga disadvantages: No isolation between saga steps — concurrent sagas may interleave. Intermediate states are visible. Must design idempotent compensating transactions.
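The two phases are easy to sketch. This toy (invented class names, no network, no real redo log) also makes the blocking hazard visible: if the "coordinator" died between the two loops, every participant would sit with its locks held:

```python
class Participant:
    """Toy 2PC participant: PREPARE acquires locks and votes; phase 2 releases them."""
    def __init__(self, vote=True):
        self.vote, self.locked, self.committed = vote, False, False

    def prepare(self):
        self.locked = True   # acquire locks + write redo log, then vote YES/NO
        return self.vote

    def commit(self):
        self.committed, self.locked = True, False

    def rollback(self):
        self.locked = False

def two_phase_commit(participants):
    """Coordinator: all YES votes in phase 1 => COMMIT in phase 2, else ROLLBACK.
    (all() stops at the first NO; a real coordinator also durably logs its decision.)"""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "COMMITTED"
    for p in participants:
        p.rollback()
    return "ROLLED_BACK"
```

A crash between `prepare` and the commit/rollback loop is exactly the in-doubt window described above: participants cannot unilaterally release their locks.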
What is a "lost update" anomaly in saga and how do you prevent it?
How does the Outbox Pattern guarantee at-least-once event delivery?
When would you use orchestration-based saga vs choreography-based saga?
⚑ Staff signal: The key insight is that Saga trades isolation for availability. Intermediate states are visible (e.g., "flight booked, payment pending"). To mitigate: use semantic locks (mark resources as PENDING), countermeasures (check state before compensation), and pivot transactions. Chris Richardson's "Microservices Patterns" is the canonical reference.
Q2
Explain the Outbox Pattern. Why is a direct dual-write (DB + publish event) unsafe?
▾
If you write to a database and publish a Kafka event in the same code path (dual-write), either can succeed without the other — creating inconsistency. The Outbox Pattern atomically persists the event to the same DB transaction as the business data, then a separate process reliably publishes it.
Deep Dive
The dual-write failure scenarios:
1. DB write succeeds → process crashes → Kafka publish never happens → downstream services miss the event
2. Kafka publish succeeds → DB write fails (timeout, constraint) → event fired for a transaction that was rolled back
Outbox Pattern:
1. Within the same DB transaction: INSERT into business table + INSERT into outbox table (event_type, payload, status='pending')
2. A separate publisher process (CDC relay or polling) reads pending outbox rows and publishes to Kafka
3. On successful Kafka ack: mark outbox row as published (or delete it)
CDC approach (preferred): Use Debezium to tail the WAL for changes to the outbox table. No polling — events reach Kafka within milliseconds of commit. Debezium publishes one Kafka message per row change. The outbox table never needs explicit status updates — deletion is the signal.
Idempotency: Outbox gives at-least-once delivery (publisher may crash between Kafka ack and DB update). Consumers must be idempotent: check a unique event_id before processing. Store processed event_ids in a dedup table with TTL.
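A minimal sketch of the pattern, using SQLite as a stand-in for Postgres and a Python callback as a stand-in for the Kafka producer (table and function names are invented):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT, payload TEXT);
""")

def place_order(order_id, total):
    """Business row and outbox event commit in ONE local transaction: no dual-write gap."""
    with db:  # BEGIN ... COMMIT, or ROLLBACK if either INSERT fails
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   ("OrderPlaced", json.dumps({"order_id": order_id, "total": total})))

def relay(publish):
    """Polling relay: delete a row only after the broker acks, so delivery is at-least-once."""
    rows = db.execute("SELECT id, event_type, payload FROM outbox ORDER BY id").fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # crash here => re-publish on restart: dedupe downstream
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    db.commit()
```

If the process dies after `place_order` commits, the event is still in the outbox and the relay picks it up on restart: the failure scenario 1 above cannot lose the event.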
How does Debezium ensure exactly-once delivery end-to-end?
What is the "inbox pattern" and how does it complement the outbox?
⚑ Staff signal: Exactly-once delivery is a myth at the messaging layer — Kafka Transactions + idempotent producers give exactly-once within Kafka, but end-to-end exactly-once requires idempotent consumers. Design all event consumers around at-least-once delivery + idempotency keys. This is the correct production stance.
Q3
Design a payment processing system that handles partial failures across services.
▾
A payment flow involves: validate → reserve inventory → charge card → fulfill order → send receipt. Each step can fail independently. Design uses a Saga orchestrator (explicit state machine) rather than choreography for financial flows, since the payment state machine is complex and auditability is critical.
Deep Dive
Why orchestration over choreography for payments: Choreography (services react to events) becomes a distributed state machine that's hard to visualize, debug, and audit. An explicit orchestrator (a service or workflow engine) owns the saga state, retries, timeouts, and compensations — a single source of truth for payment status.
Payment saga steps:
1. ReserveInventory → compensation: ReleaseInventory
2. AuthorizeCard (pre-auth) → compensation: VoidAuthorization
3. DeductInventory → compensation: RestockInventory
4. CaptureCharge → compensation: RefundCharge (expensive!)
5. FulfillOrder → compensation: CancelFulfillment
Idempotency keys: Every step receives an idempotency_key = payment_id + step_name. If a step is retried (network timeout after success), the downstream service returns the cached result instead of double-charging.
Timeout handling: Each step has a deadline. If no response in 30s → orchestrator marks step as UNKNOWN → retries with same idempotency key → if 3 retries fail → compensation saga begins from the last confirmed step.
State persistence: Saga state stored in a JSONB column (saga_id, current_step, status, compensation_log). Updated atomically with each step. On orchestrator restart, resume from last known state.
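An orchestrator's core loop, reduced to a sketch (step names follow the payment saga; real engines such as Temporal also persist state after every step and enforce idempotency keys):

```python
SAGA_STEPS = [
    # (forward step, compensating step)
    ("ReserveInventory", "ReleaseInventory"),
    ("AuthorizeCard", "VoidAuthorization"),
    ("CaptureCharge", "RefundCharge"),
]

def run_saga(execute):
    """Run forward steps in order; on failure, run recorded compensations in reverse."""
    compensations = []
    for step, compensation in SAGA_STEPS:
        try:
            execute(step)
            compensations.append(compensation)
        except Exception:
            for comp in reversed(compensations):
                execute(comp)  # compensations must be idempotent and retriable
            return "COMPENSATED"
    return "COMPLETED"
```

Only steps that actually completed get compensated, and in reverse order: a declined card voids the authorization and releases inventory, but never "refunds" a charge that was never captured.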
How does Stripe's idempotency key mechanism work internally?
What is the "pivot transaction" technique in Saga design?
⚑ Staff signal: Pivot transaction — the last "forward" transaction in a saga that, once committed, makes the saga irreversible (capture charge is typically the pivot). Steps before the pivot can be compensated easily. Steps after the pivot are "retriable" transactions that must eventually succeed. This pattern structures the compensation design cleanly.
Q4
What is event sourcing and how does it differ from traditional CRUD storage?
▾
Event sourcing stores the full sequence of state-changing events as the source of truth — not the current state. Current state is derived by replaying events. CRUD stores only the current snapshot. Event sourcing gives a complete audit log, time-travel queries, and natural event publishing — at the cost of query complexity and eventual consistency of read models.
Deep Dive
CRUD model: accounts table has a balance column. UPDATE accounts SET balance=balance-100 WHERE id=5. Previous state is gone. No history without a separate audit log.
Event sourcing model: events table: {account_id, type:'DEBIT', amount:100, timestamp, version}. Current balance = SUM over event stream. Full history retained. Append-only — no updates or deletes.
Projections (Read Models): Replaying every event for every read is slow. Event sourcing pairs with CQRS: a separate read model (denormalized Postgres table, Redis, Elasticsearch) is updated by an event processor. Read model may lag by milliseconds.
Snapshots: For long-lived aggregates, store a periodic snapshot of current state + the event version at snapshot time. On load: load snapshot, replay only events after snapshot version. Prevents O(N) replay.
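A minimal sketch of snapshot-accelerated replay, assuming the event shape from the example above (the CREDIT type is added for symmetry):

```python
# Derive current balance from an append-only event stream; a snapshot
# (state, version) lets us replay only events after the snapshot version.

def apply(balance, event):
    if event["type"] == "DEBIT":
        return balance - event["amount"]
    if event["type"] == "CREDIT":
        return balance + event["amount"]
    raise ValueError(f"unknown event type: {event['type']}")

def load_balance(events, snapshot=None):
    """Replay only events newer than the snapshot version — avoids O(N) replay."""
    balance, version = snapshot if snapshot else (0, 0)
    for e in events:
        if e["version"] > version:
            balance = apply(balance, e)
    return balance
```

With a snapshot at version 2, only the version-3 event is replayed; the result is identical to a full replay.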
When to use: Financial ledgers (required audit trail), e-commerce orders (need to reconstruct order state at any point), collaborative tools (Figma, Google Docs — operations as events enable conflict-free merge), compliance-heavy domains.
Downsides: Schema evolution is hard (event format v1 must be migratable to v2), query patterns are limited (cannot directly query "all accounts with balance > 1000" without a projection), eventual consistency of read models needs careful design.
How do you handle schema evolution in an event store without breaking old event replays?
What is CQRS and when should it be separated from event sourcing?
⚑ Staff signal: Event sourcing ≠ CQRS. They're often used together but are independent patterns. CQRS = separate write model (commands) from read model (queries). You can do CQRS without event sourcing (separate read replica). Event sourcing without CQRS means your event log is both write and read model — feasible for simple cases but doesn't scale for complex queries.
16 / 19 · Real-time
Real-time Systems
Delivering low-latency updates to clients requires choosing between WebSockets (full duplex), Server-Sent Events (server→client only), and Long Polling (HTTP compatible). Each has radically different scalability characteristics.
Protocol Comparison Visualizer
Select a protocol to visualize
⚑ Staff signal: WebSocket connections are stateful — they pin to a specific server. A 1M-user chat needs connection state (user→server mapping) in a shared store (Redis). When server A receives a message for user B connected to server B, it must fan-out via pub/sub (Redis Pub/Sub, Kafka) to server B.
WebSocket latency
< 5ms
After initial handshake, full duplex
SSE reconnect
Auto via Last-Event-ID
Browser reconnects, resumes stream
WS connections/server
~10K–100K
File descriptor limit. Node.js good fit
Interview Questions & Deep Dives
Q1
Design a scalable chat system like Slack/WhatsApp (1B messages/day, millions of concurrent connections).
▾
A chat system needs: persistent connection management, message routing across servers, message storage with efficient retrieval, presence tracking, and push notifications for offline users. The core challenge is that WebSocket connections are stateful — users must be routed to their connection server for message delivery.
Deep Dive
Connection layer: Stateless HTTP + WebSocket gateway servers. Each server holds 10K–50K connections in memory. A connection service writes to Redis: user:{id}:server = server_id with a 30s TTL (heartbeat refreshes). On disconnect, the key expires.
Message flow (1-to-1):
1. User A (on Server 1) sends message to User B
2. Server 1 looks up Redis: B is on Server 3
3. Server 1 publishes to Redis channel server:3
4. Server 3 receives pub/sub event, pushes to B's WebSocket
5. Message written to Cassandra asynchronously (partition key: conversation_id, sort key: timestamp)
Group messages (fan-out): A group has 1000 members. One message → 1000 deliveries. At 100K messages/s in a large group: use a fan-out service. Option A: write once to DB, online members pull via long-poll (WhatsApp model). Option B: fan-out to all online members' servers via pub/sub (Slack model). Option A scales better; Option B is lower latency.
Message storage: Cassandra: partition by conversation_id, cluster by created_at DESC. Allows efficient "latest 50 messages" query. Retention: keep 90 days in hot storage, archive to S3 Parquet for compliance.
Offline delivery: If user is offline (no entry in Redis), write message to inbox table. On reconnect, drain inbox. Also trigger push notification (APNS/FCM) via a notification service.
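The routing decision above — look up the recipient's server, publish or fall back to the inbox — can be sketched with in-memory dicts standing in for the Redis presence map, pub/sub channels, and inbox table (illustrative only, not production code):

```python
# In-memory stand-ins for the stores described above.
presence = {"B": "server:3"}        # user → connection server (TTL'd in real Redis)
channels = {"server:3": []}         # per-server pub/sub channel
inbox = {}                          # offline delivery table

def send_message(sender, recipient, text):
    server = presence.get(recipient)
    if server is None:
        # Offline: no presence entry → write to inbox (push notification would fire here).
        inbox.setdefault(recipient, []).append((sender, text))
        return "queued"
    # Online: publish to the recipient server's channel; that server
    # pushes the message down B's WebSocket.
    channels[server].append((recipient, text))
    return "published"
```

The Cassandra write (step 5) happens asynchronously in either branch and is omitted here.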
How do you implement message ordering guarantees in a distributed chat?
How does Slack handle message threading and reactions at scale?
How do you implement end-to-end encryption in a group chat?
⚑ Staff signal: Message ordering in distributed systems is hard. Per-conversation monotonic IDs (via a distributed sequence per conversation ID) or vector clocks. In practice, Slack uses Snowflake-like IDs scoped per channel — slightly out-of-order delivery is tolerated on the client and displayed in order after a small buffer window. Total ordering would require a single writer per conversation — a scalability bottleneck.
Q2
How do you implement presence (online/offline/typing indicators) at scale?
▾
Presence is a high-frequency, low-durability state. Users send heartbeats every 10–30s. The system stores last-seen timestamp in Redis with TTL. A pub/sub fanout broadcasts presence changes to subscribed clients. Typing indicators are ephemeral — never persisted, TTL-based.
Deep Dive
Online status: SETEX presence:{user_id} 60 "online" — client sends heartbeat every 30s. TTL=60s means 2 missed heartbeats = offline. Query: GET presence:{user_id}. For bulk friend status: MGET presence:u1 presence:u2 ...
Status fan-out: When user A comes online, their status change must be pushed to all of A's contacts who are currently online. Naive: query all friends (potentially 5000), look up their servers, send presence update to each. Optimized: clients subscribe to presence channels for friends they care about; server pushes deltas only.
Typing indicators: "User is typing" debounced every 2s. Server receives → broadcasts to conversation participants → TTL 5s (if no new typing event in 5s, display disappears). Never persisted — entirely ephemeral Redis pub/sub.
Scale concern: 100M users × 30s heartbeat = ~3.3M heartbeat writes/s to Redis. This is fine for a Redis cluster (sharded by user_id). The expensive part is fan-out: publishing presence changes to all connected friends. Batch updates: aggregate presence changes over 1s windows and bulk-push diffs.
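The heartbeat/TTL semantics can be simulated without a Redis server — timestamps stand in for SETEX expiry in this sketch (not the redis-py API):

```python
import time

# SETEX presence:{user} 60 "online"  ≈  record heartbeat time; expired = offline.
TTL = 60  # seconds: two missed 30s heartbeats → considered offline
last_heartbeat = {}

def heartbeat(user, now=None):
    """Client calls this every 30s; equivalent to refreshing the Redis TTL."""
    last_heartbeat[user] = now if now is not None else time.time()

def is_online(user, now=None):
    now = now if now is not None else time.time()
    ts = last_heartbeat.get(user)
    return ts is not None and now - ts < TTL
```

One missed heartbeat keeps the user online; after the TTL window the key has "expired" and the user reads as offline.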
How does Discord handle presence for millions of users in large servers?
How do you reduce presence fan-out cost for users with tens of thousands of followers?
⚑ Staff signal: Discord's architecture blog describes "lazy fan-out" for presence in large servers — rather than pushing presence to all 500K server members, presence is fetched on-demand when a user opens a channel (pull on view). This trades freshness for scalability. The threshold is usually around 75K members (Discord's "large server" threshold).
Q3
Design a live collaboration tool like Google Docs. How do you handle concurrent edits?
▾
Concurrent editing requires a conflict resolution algorithm: Operational Transformation (OT) or CRDTs (Conflict-free Replicated Data Types). Google Docs uses OT. Figma uses CRDTs. Both allow multiple users to edit simultaneously and converge to the same document state without locking.
Deep Dive
Naive approach (fails): Last write wins — if A inserts "hello" at position 5 and B deletes character at position 5 concurrently, naive merge loses data or corrupts the document.
Operational Transformation (OT): Each edit is an operation (INSERT char at pos, DELETE char at pos). Operations are transformed against concurrent operations before being applied. If A inserts at pos 5 and B inserts at pos 3, A's operation is transformed: insert at pos 6 (accounting for B's insert). Requires a server to sequence and transform operations — can't be fully P2P.
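The insert-vs-insert case described here reduces to a position shift. A simplified sketch — real OT also handles deletes and uses site IDs for tie-breaking:

```python
# Transform A's concurrent insert position against B's insert at pos_b:
# if B inserted at or before A's position, A shifts right by one.

def transform_insert(pos_a, pos_b, a_wins_ties=False):
    """Return A's adjusted insert position after B's concurrent insert at pos_b."""
    if pos_b < pos_a or (pos_b == pos_a and not a_wins_ties):
        return pos_a + 1
    return pos_a
```

This is exactly the example in the text: A inserts at 5, B concurrently inserts at 3, so A's operation becomes an insert at 6.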
CRDTs: Mathematical data structures designed to merge without conflicts. Each character gets a globally unique ID and a position derived from its neighbors (not integer offset). Insertions and deletions commute — apply in any order, same result. No server coordination needed, P2P friendly. Used in: Figma, Automerge, Yjs framework (used by VS Code, Jupyter).
Architecture for Google Docs-scale:
1. All clients connect via WebSocket to a document session server
2. Operations buffered locally, sent to server
3. Server applies OT, assigns a global sequence number, broadcasts to all participants
4. Clients apply received operations, transform against pending local ops
5. Document state checkpointed every N operations to S3; operations stored in append-only log
Presence in docs: Cursor positions broadcast via WebSocket every 100ms. Not persisted — ephemeral.
What are the CAP tradeoffs in a CRDT-based collaborative editor?
How does Figma scale a single document to thousands of concurrent viewers?
⚑ Staff signal: CRDTs are the future of real-time collaboration — they enable P2P sync (local-first apps) without a server as coordinator, making offline editing + merge seamless. The Yjs CRDT library is open-source and powers collaborative features in VS Code Live Share, Jupyter, and many modern editors. Knowing the difference between OT (server-centric) and CRDT (P2P-friendly) shows architectural depth.
17 / 19 · Storage
Object Storage & File Systems
Object storage (S3-compatible) provides infinitely scalable blob storage via flat key-value semantics. Unlike block storage or file systems, there are no directories and no random-write capability; S3 has provided strong read-after-write consistency since December 2020.
Multipart Upload Visualizer
Press Upload to start a multipart upload
⚑ Staff signal: Multipart upload: split file into ≥5MB parts, upload each independently (parallelizable), S3 assembles on CompleteMultipartUpload. Failed parts can be retried without re-uploading the whole file. Parts are durably stored until CompleteMultipartUpload or AbortMultipartUpload (clean up with lifecycle rules).
Min part size
5MB
Except the last part (any size)
Max parts
10,000
Max object = 5TB
S3 consistency (2020+)
Strong
Read-after-write for all ops
Interview Questions & Deep Dives
Q1
Design Dropbox / Google Drive. Walk through upload, sync, and conflict resolution.
▾
The core components: a chunk-based upload pipeline (files split into 4MB chunks, deduplicated by hash), a metadata service (file tree, version history, sync state per device), a notification service (SSE/WebSocket to push change events to connected clients), and S3-compatible storage for chunk data.
Deep Dive
Upload flow:
1. Client splits file into 4MB chunks, computes SHA256 of each chunk
2. Client sends chunk hashes to metadata API: "which of these do I need to upload?"
3. API checks chunk store — returns list of hashes not yet uploaded (deduplication)
4. Client uploads only missing chunks to S3 via presigned URLs
5. Client calls CompleteUpload(file_id, [chunk_hashes]) — metadata DB records the file as version N
Deduplication benefit: If 100 users upload the same file, only 1 copy is stored. Even within a file, duplicate chunks (e.g., large files with repeated sections) are stored once. Dropbox reported 90%+ deduplication rate on early servers.
Sync: Each device has a sync cursor (last-seen change_id). On reconnect, client queries: "give me all changes since change_id=X" → receive delta of modified files → pull only changed chunks.
Conflict resolution: If two devices edit the same file while offline, both generate version N+1. On sync, server detects version conflict. Strategy: create a "conflicted copy" (Dropbox's approach — no data loss) or merge (requires CRDT for structured data). For binary files (photos, PDFs), always create a conflicted copy.
Presigned URLs: Metadata server issues S3 presigned URLs with 15-minute expiry for each chunk upload. Client uploads directly to S3 — metadata server is never in the data path. Reduces infrastructure cost dramatically (no bandwidth billing on app servers).
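Steps 1–3 of the upload flow can be sketched directly; the tiny chunk size in the usage note below is for illustration only (real clients use 4MB):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, as in the flow above

def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Step 1: split into fixed-size chunks and SHA256 each one."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def missing_chunks(hashes, chunk_store):
    """Steps 2–3 (server side): return only hashes not already stored."""
    return [h for h in hashes if h not in chunk_store]
```

Identical chunks hash identically, so a repeated section inside one file — or the same file uploaded by 100 users — is uploaded and stored once.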
How do you implement efficient delta sync for large files (like video editing projects)?
How does Dropbox's Magic Pocket (proprietary storage) differ from S3?
⚑ Staff signal: The presigned URL pattern (client uploads directly to S3) is critical for scaling. The metadata server should never be in the file data path — it would become the bottleneck. The server only generates a short-lived signed URL (AWS SigV4 or GCS signed URL) and the client uploads directly. This also reduces egress costs on the app server tier.
Q2
Design a video streaming platform (like YouTube). How do you handle transcoding and adaptive bitrate?
▾
Video upload triggers an async transcoding pipeline: raw video → multiple resolutions (1080p, 720p, 480p, 360p) + codecs (H.264, VP9, AV1). Each resolution is split into 2-10 second segments. A manifest file (HLS .m3u8 or DASH .mpd) lists available quality levels. The player picks segments dynamically based on bandwidth (ABR — Adaptive Bitrate).
Deep Dive
Upload pipeline:
1. Creator uploads to a resumable upload endpoint (multipart to S3)
2. S3 event triggers a transcoding job (published to SQS/Kafka)
3. Transcoding workers (GPU or AWS MediaConvert) produce: 4 resolution variants × 3 codecs × N segments
4. Segments stored in S3, manifest files generated per quality level
5. Video metadata (duration, thumbnails, chapters) stored in PostgreSQL
Adaptive Bitrate (ABR): HLS playlist has a master manifest listing all quality levels. Player buffers 30s of video. If download speed drops → switch to lower quality at next segment boundary. If speed improves → switch up. Seamless to user (no rebuffering on quality switch).
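The player-side quality choice at a segment boundary can be sketched as below; the bitrate ladder and the 0.8 headroom factor are assumed illustrative values, not any real player's configuration:

```python
# (resolution, kbps) ladder, sorted highest quality first — illustrative numbers.
LADDER = [(1080, 5000), (720, 2800), (480, 1400), (360, 800)]

def pick_quality(bandwidth_kbps, ladder=LADDER, headroom=0.8):
    """Pick the highest bitrate that fits within measured bandwidth, with headroom."""
    budget = bandwidth_kbps * headroom
    for res, kbps in ladder:
        if kbps <= budget:
            return res
    return ladder[-1][0]  # floor: lowest quality rather than stalling
```

Re-running this at every segment boundary is what makes the up/down switches seamless: the decision only changes which segment file is fetched next.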
CDN strategy: Segments are static files → perfect for CDN caching. Long-tail videos (watched rarely) are cached at origin. Popular videos (top 0.1%) are cached at all 200 CDN PoPs globally. YouTube serves 1B+ hours/day — without CDN edge caching, origin bandwidth would be economically infeasible.
DRM (Digital Rights Management): Encrypted segments keyed with content-specific AES keys. Key server (KMS) issues decryption keys only to authenticated players. Widevine (Google) for Chrome/Android, FairPlay (Apple) for Safari/iOS, PlayReady (Microsoft) for Windows.
Thumbnail generation: Extract frames at N-second intervals during transcoding. Generate animated previews (hover thumbnails). Stored as WebP/AVIF for compression. Served from CDN.
How would you design the recommendation system that picks the next video?
How do you store and serve live video (low-latency streaming)?
⚑ Staff signal: The cost split for video platforms: ~70% CDN bandwidth, ~15% storage, ~10% transcoding, ~5% compute. CDN egress optimization (choosing AV1 codec: 50% smaller than H.264 at same quality) has huge economic impact at YouTube scale. AV1 adoption was primarily a cost decision, not a quality decision.
Q3
How does S3's internal architecture achieve 11 nines of durability?
▾
S3 durability (99.999999999%) comes from erasure coding across multiple Availability Zones, continuous data integrity verification, and automatic repair of corrupted or lost chunks. Each object is split into chunks and encoded such that data can be reconstructed from a subset — similar to RAID but across AZs and even regions.
Deep Dive
Erasure coding: An object is split into k data chunks + m parity chunks. Any k out of k+m chunks can reconstruct the full object. S3 uses a variant where chunks are stored across multiple AZs — loss of an entire AZ still allows reconstruction. More storage-efficient than 3× replication (typical overhead: 1.4× vs 3×) while providing stronger durability.
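A toy erasure code makes the "any k of k+m chunks" idea concrete: k=2 data chunks plus one XOR parity chunk, the m=1 special case (S3-class systems use larger Reed–Solomon-style codes across AZs):

```python
# k=2 data chunks + m=1 XOR parity chunk. Losing any ONE chunk is recoverable.
# Chunks must be equal length for XOR to line up.

def encode(a: bytes, b: bytes):
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def reconstruct(chunks):
    """chunks: [a, b, parity] with at most one entry replaced by None (lost)."""
    a, b, parity = chunks
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))   # a = b XOR parity
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))   # b = a XOR parity
    return a + b
```

Storage overhead here is 1.5× versus 3× for triple replication, which is the efficiency argument in the text; real codes (e.g., k=10, m=4) get closer to 1.4×.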
Intra-AZ redundancy: Within each AZ, chunks are stored on multiple physical drives across multiple racks. A single disk failure or rack power failure doesn't lose any chunk.
Continuous scrubbing: S3 continuously reads and checksums stored data. Silent bit-rot (bit flips due to cosmic rays, aging drives) is detected within seconds and repaired from redundant chunks before the corruption spreads.
S3 vs EBS vs EFS:
S3 = object store (key-value, strongly consistent since 2020, infinite scale, high latency ~50ms)
EBS = block store (attached to one EC2 instance, POSIX-like, low latency ~1ms)
EFS = distributed file system (multi-AZ, NFS-compatible, can attach to many EC2, ~5ms)
Instance Store = ephemeral NVMe on the EC2 host (fastest, no durability — data lost on stop)
Strong consistency (added Dec 2020): S3 now provides read-after-write consistency for all operations. Previously, new objects were eventually consistent — there was a window where a GET after a PUT could return 404. This is no longer the case.
What is S3's consistency model for list operations?
How would you design an S3-compatible object store from scratch?
⚑ Staff signal: 11 nines means losing 1 object per 100 billion stored objects per year. For context: at 1B objects stored, expected loss = 0.01 objects/year. The durability guarantee is about the storage layer — it does NOT protect against application bugs that delete objects. Use S3 Versioning + MFA Delete + Object Lock for protection against accidental/malicious deletion.
18 / 19 · Probabilistic
Bloom Filters & Probabilistic Data Structures
A Bloom filter is a space-efficient probabilistic set membership structure. It can definitively say "NOT in set" (zero false negatives) but may say "IN set" when wrong (false positives). Used in Cassandra, Postgres, Chrome's Safe Browsing, and countless caches.
Interactive Bloom Filter
Add words to the filter, then test membership
⚑ Staff signal: False positive rate p ≈ (1 − e^(−kn/m))^k where k=hash functions, n=items, m=bit array size. Optimal k = (m/n)·ln(2). At p=1%, need ~10 bits/element. Bloom filters NEVER have false negatives — "no" is always correct. "Yes" may be wrong. Deletion not supported (use Counting Bloom Filter).
False negatives
Impossible
If item added, always found
Space efficiency
~10 bits/item
At 1% FPR vs hash set ~100× larger
Lookup complexity
O(k)
k = constant number of hash functions
Interview Questions & Deep Dives
Q1
Where are Bloom filters used in production database systems?
▾
Bloom filters are used in LSM-tree databases (Cassandra, RocksDB, BigTable) to avoid disk reads for keys that don't exist. Each SSTable has a Bloom filter — a "probably here" check before reading disk. Also used in: Chrome Safe Browsing (check if URL is malicious before full lookup), Postgres amcheck, CDN origin bypass, rate limiting (cache penetration prevention).
Deep Dive
LSM-Tree + Bloom Filter: When reading key K, you must check: MemTable → L0 SSTables → L1 → L2 ... Each level may have N SSTables, each needing a disk seek. Bloom filter per SSTable: if filter says NO, skip the SSTable entirely — O(1) instead of O(log N) disk reads. Cassandra reports 90% of reads served without disk I/O for keys in cache thanks to Bloom filters.
Cassandra Bloom filter tuning: bloom_filter_fp_chance = 0.1 (10% FPR). Lower FPR = more memory: bloom_filter_fp_chance = 0.01 (1% FPR) uses ~10× more memory per SSTable. Tune based on read/write ratio — read-heavy workloads benefit from lower FPR.
Chrome Safe Browsing: Downloading the full list of 2M+ malicious URLs to every browser would be infeasible. Instead: a Bloom filter of ~300MB covers all known bad URLs with <1% FPR. On URL check: if the filter says "possibly malicious" → real-time check against Google's API. If "definitely not" → skip the check. 99% of URL checks are handled locally.
Cache penetration prevention: Load all valid user IDs into a Bloom filter at startup. Requests for non-existent user IDs (common in scraping/DDoS attacks) are rejected by the filter before hitting cache or DB. Zero false negatives = valid users always pass through.
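A minimal Bloom filter sketch — an m-bit array with k hashes derived by salting one hash function. "No" is definitive; "yes" may be a false positive:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=7):
        self.m, self.k = m, k
        self.bits = 0  # Python int used as an m-bit array

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # All k bits set → "probably yes"; any bit clear → definitively no.
        return all(self.bits >> p & 1 for p in self._positions(item))
```

With m/n ≈ 10 bits per element and k ≈ (m/n)·ln 2 ≈ 7, this matches the ~1% FPR figures in the staff signal above.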
What is a Counting Bloom Filter and when do you need it?
How do you update a Bloom filter when items are deleted from the set?
⚑ Staff signal: Standard Bloom filters don't support deletion (you can't unset a bit — it might be shared with other items). Counting Bloom Filter replaces each bit with a counter: increment on add, decrement on delete. Supports deletion at 4× memory cost. Cuckoo Filter is a modern alternative that supports deletion with better cache performance than Counting Bloom.
Q2
Explain HyperLogLog. How does Redis use it for cardinality estimation?
▾
HyperLogLog (HLL) estimates the count of distinct elements in a stream using ~12KB of memory regardless of stream size, with ~0.81% standard error. It works by observing the maximum number of leading zeros in the hash of each element — more leading zeros = likely more distinct elements. Used in: Redis PFADD/PFCOUNT, Spark, databases for COUNT DISTINCT estimation.
Deep Dive
Intuition: Hash each element to a uniformly random binary string. The probability of seeing k leading zeros is (1/2)^k. If the maximum leading zeros observed is k, the estimate of distinct elements is ~2^k. HLL uses many hash sub-streams and harmonic mean to reduce variance.
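The core observable can be sketched directly — the 1-based position of the first 1-bit in a hash (real HLL averages this across thousands of registers via a harmonic mean rather than trusting one maximum):

```python
# rho(h) = leading zeros + 1 for a fixed-width hash value.
# P(rho >= k) = 2^-(k-1), so a max rho of k over a stream suggests
# roughly 2^(k-1) distinct elements.

def rho(hash_value, width=32):
    """1-based position of the leftmost 1-bit in a `width`-bit value."""
    for i in range(width):
        if hash_value >> (width - 1 - i) & 1:
            return i + 1
    return width + 1  # all-zero hash
```

A single max-rho estimate has huge variance (one lucky hash doubles it), which is exactly why HLL splits the stream into sub-streams by hash prefix and averages.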
Redis HyperLogLog: PFADD daily_visitors user_id, PFCOUNT daily_visitors. Returns approximate distinct count. Memory: 12KB per HLL regardless of cardinality. Compare to keeping a set: 1M distinct user IDs × 8 bytes = 8MB. HLL is 99.8% smaller with <1% error.
Merge: HLLs can be merged: PFMERGE total daily:mon daily:tue daily:wed — union cardinality estimated without re-scanning source data. Useful for "unique users over 7 days" type queries.
Use cases: Daily active users, unique IPs per endpoint (for rate limiting alerting), unique search queries per day, unique items viewed per user session.
Other Probabilistic Data Structures
Count-Min Sketch: Estimates frequency of items in a stream. 2D array of counters, k hash functions. Query: min count across all hash functions. Used for: top-K frequent items, approximate frequency counting in Kafka streams, detecting trending topics. Never underestimates — hash collisions can only inflate counters, so the query result is an upper bound on the true count.
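A minimal Count-Min Sketch, showing why estimates never undercount: every row's counter for an item includes its true count plus any collisions, so the min over rows can only be inflated, never deflated:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, item, row):
        # One independent-ish hash per row via salting.
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._idx(item, r)] += count

    def estimate(self, item):
        # Min across rows: the least-collided counter is the tightest upper bound.
        return min(self.rows[r][self._idx(item, r)] for r in range(self.depth))
```

For trending topics, pair this with a small heap of the current top-K items, updating the heap whenever an item's estimate exceeds the heap minimum.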
MinHash / LSH: Estimate Jaccard similarity between sets without comparing full sets. Used in: near-duplicate detection, plagiarism detection, recommendation systems (find users with similar listening history).
How would you use Count-Min Sketch to implement a real-time trending topics feature?
⚑ Staff signal: The trade-off with all probabilistic structures is: exactness vs. space/time. For exact COUNT DISTINCT of 1B events, you need O(N) space. For ~1% error, you need O(log log N). In practice, analytics workloads (DAU, unique page views) are fine with <1% error — HLL is the right tool. For billing or security decisions requiring exact counts, use exact algorithms.
19 / 19 · Data Selection
Database Selection Guide
Choosing the right database is one of the most consequential system design decisions. SQL, document, wide-column, key-value, graph, time-series, and search engines each optimize for different access patterns. A wrong choice at schema design time is expensive to undo.
Interactive Decision Tree
Answer the questions to get a DB recommendation
SQL sweet spot
ACID + joins
Complex queries, relational data
NoSQL sweet spot
Scale + flexibility
Known access patterns, high write
Graph DB sweet spot
N-hop traversals
Fraud detection, recommendations
Interview Questions & Deep Dives
Q1
When would you choose DynamoDB over PostgreSQL? Walk through the single-table design pattern.
▾
Choose DynamoDB when: access patterns are known and fixed at design time, you need infinite horizontal scale without sharding complexity, write throughput > 10K/s on a single table, or schema flexibility is needed for heterogeneous items. Single-table design stores all entities in one table — access patterns drive the schema, not relationships.
Deep Dive — Single-Table Design
Why single-table: DynamoDB charges per-request and per-stored-byte. Multiple tables means multiple round-trips for related data. Single-table lets you collocate related items under the same partition key, enabling efficient queries that would require JOINs in SQL.
Pattern — Generic keys: Use overloaded PK/SK (partition_key, sort_key) with type-prefixed values:
PK=USER#123, SK=PROFILE → user profile
PK=USER#123, SK=ORDER#456 → order by user
PK=ORDER#456, SK=ORDER#456 → order details
PK=ORDER#456, SK=ITEM#789 → order line item
One query: PK=USER#123, SK begins_with ORDER# → all orders for user.
GSI (Global Secondary Index): Reverse lookups. If you need "all orders by status", create a GSI with PK=status, SK=created_at. Query GSI instead of scan. Up to 20 GSIs per table.
Access pattern first: Start by listing every query your app needs. Design PK/SK/GSI combinations to serve each query in a single request. If an access pattern requires a Scan — rethink the schema.
When to avoid DynamoDB: Ad-hoc queries unknown at design time, complex reporting and analytics, heavy JOINs, strong ACID transactions across many items, team unfamiliar with NoSQL modeling. In those cases: PostgreSQL + read replicas scales to 100K+ reads/s before you need DynamoDB.
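To make the generic-key pattern concrete, here is a toy in-memory model of the single-table layout (plain Python, not the DynamoDB API; the items and the query helper are hypothetical):

```python
# Real DynamoDB sorts items by SK within each PK partition; a sorted list
# of (PK, SK, attrs) tuples mimics that ordering.
table = sorted([
    ("USER#123",  "PROFILE",   {"name": "Ada"}),
    ("USER#123",  "ORDER#456", {"total": 90}),
    ("USER#123",  "ORDER#789", {"total": 15}),
    ("ORDER#456", "ITEM#1",    {"sku": "X"}),
])

def query(pk, sk_prefix=""):
    # Mimics: KeyConditionExpression "PK = :pk AND begins_with(SK, :prefix)"
    return [item for item in table
            if item[0] == pk and item[1].startswith(sk_prefix)]

orders = query("USER#123", "ORDER#")  # all of the user's orders in one request
```

Because related items share a partition key, the "all orders for a user" access pattern is a single range read rather than a JOIN.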
How do DynamoDB transactions (TransactWrite) compare to PostgreSQL transactions?
What is DynamoDB's consistency model and how do you get strong consistency?
⚑ Staff signal: DynamoDB's access-pattern-first design is fundamentally different from SQL schema design. In SQL, you normalize for correctness and rely on the query planner. In DynamoDB, you denormalize for access patterns. This means changing access patterns after launch is expensive (requires table scan + migrate). Know this trade-off deeply — it comes up in every DynamoDB system design question.
Q2
When would you use a time-series database vs PostgreSQL for metrics storage?
▾
Time-series databases (InfluxDB, TimescaleDB, Prometheus) optimize for append-only, time-ordered data with aggressive compression (delta encoding, run-length encoding), auto-retention (TTL on old data), and time-bucketing aggregations (rollups). PostgreSQL with TimescaleDB extension can bridge both worlds — familiar SQL with time-series optimizations.
Deep Dive
Why TSDB beats Postgres for metrics:
1. Compression: Time-series data has high temporal correlation — values close in time are similar. Delta encoding (store difference instead of full value) achieves 5-20× compression vs raw floats. InfluxDB's Gorilla compression (Facebook's TSDB paper) compresses doubles to ~1.4 bytes average.
2. Ingestion throughput: Prometheus scrapes millions of metrics/s. A relational DB with row-per-point can't keep up with index updates. TSDBs use append-only segment files, batched writes, in-memory buffers.
3. Retention policies: Keep raw 1s data for 7 days → downsample to 1min for 90 days → downsample to 1hr for 2 years. TSDBs automate this (continuous queries, retention policies). In Postgres you'd write custom jobs.
4. Time-window queries: SELECT mean(cpu_usage) FROM metrics WHERE time > now()-1h GROUP BY time(5m) — TSDBs optimize these natively.
Prometheus + Thanos for scale: Prometheus is single-node (2B samples/shard practical limit). For multi-region scale: Thanos adds a sidecar that uploads Prometheus blocks to S3 + a query layer that merges results from multiple Prometheus instances. This gives cheap, effectively unlimited historical retention.
When PostgreSQL + TimescaleDB is enough: Under 1M data points/day, need SQL JOINs with business data (e.g., metrics + user accounts), team already expert in PostgreSQL. TimescaleDB's hypertables give automatic time-partitioning, continuous aggregation, and compression within Postgres.
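The delta-encoding idea from point 1 can be sketched in a few lines of Python (illustration only; real TSDBs layer varint or Gorilla-style bit packing on top of the deltas):

```python
def delta_encode(values):
    # Store the first value plus successive differences. Time-series
    # neighbors are similar, so deltas are small and compress well.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    vals = [deltas[0]]
    for d in deltas[1:]:
        vals.append(vals[-1] + d)
    return vals
```

A CPU-usage series like [100, 102, 101, 105, 105] becomes [100, 2, -1, 4, 0]: the small deltas need far fewer bits than the raw values, which is where the 5-20× compression comes from.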
How does Prometheus handle high-cardinality labels?
Compare ClickHouse vs InfluxDB for large-scale analytics.
⚑ Staff signal: High cardinality is the Achilles' heel of Prometheus. Labeling metrics with user_id or request_id creates a unique time series per user/request — millions of series, each needing memory and disk. The rule: labels should have bounded cardinality (status codes, endpoints, regions — not user IDs). Use a column-oriented analytics DB (ClickHouse, BigQuery) for per-user metrics.
Q3
When would you choose a graph database? Design a fraud detection system using graph traversal.
▾
Graph databases (Neo4j, Amazon Neptune, TigerGraph) excel at multi-hop traversals — "find all accounts connected within 3 hops to a known fraudster." In a relational DB, a k-hop query requires k recursive self-joins, and the intermediate result sets grow roughly exponentially with k. Graph DBs store adjacency lists natively — traversal is O(edges traversed), independent of total graph size.
Deep Dive
The SQL problem for graphs: "Find all friends of friends of user 123 who have purchased in the last 7 days." In SQL: 3 recursive JOINs on a users table. At 1M users with an average of 200 friends each, the intermediate join result explodes (200² ≈ 40K friend-of-friend rows per user). Postgres can handle 2–3 hops with CTEs but performance degrades rapidly beyond 3 hops.
Graph DB fraud detection:
Nodes: Account, Device, IP, Phone, Email, Card
Edges: USED_DEVICE, USED_IP, LINKED_PHONE, LINKED_EMAIL, CHARGED_CARD
Query: "Is this new account connected to any known fraudulent account within 2 hops via shared device or IP?"
MATCH (newAccount)-[:USED_DEVICE|USED_IP*1..2]-(fraudAccount {is_fraud:true})
RETURN COUNT(fraudAccount) > 0
This traversal visits only relevant subgraphs — not the full graph. At 100M accounts: milliseconds.
Real-world graph fraud signals:
1. Shared device between 50+ accounts → device farming
2. IP shared between new accounts created in burst → account factory
3. Phone number linked to known fraudulent accounts → phone resale
4. Ring structure: A → B → C → D → A (money laundering cycle)
Graph DB vs Postgres for graphs: Postgres with recursive CTEs handles simple graphs (social networks up to ~5M nodes, 3-hop queries). Beyond that, or for complex pattern matching, use a purpose-built graph DB.
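A minimal sketch of the bounded-hop traversal in Python (a toy adjacency list stands in for the graph store; the node names and edge data are hypothetical):

```python
from collections import deque

# Bipartite-style graph: accounts connect to devices/IPs they used.
edges = {
    "acct:new":  ["device:D1"],
    "device:D1": ["acct:new", "acct:bad"],
    "acct:bad":  ["device:D1"],
}
fraud = {"acct:bad"}  # known fraudulent accounts

def connected_to_fraud(start, max_hops=2):
    # BFS capped at max_hops: cost is proportional to edges actually
    # visited, not to the total graph size.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nxt in edges.get(node, []):
            if nxt in seen:
                continue
            if nxt in fraud:
                return True
            seen.add(nxt)
            frontier.append((nxt, hops + 1))
    return False
```

This is the same shape as the Cypher query: expand from the new account through shared devices/IPs, stop at 2 hops, and report whether any fraud node was reached.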
How does LinkedIn use graph queries for degree-of-connection ("2nd degree connection")?
What is PageRank and how would you compute it on a distributed graph?
⚑ Staff signal: Most companies don't need a graph database — they need better SQL. Postgres recursive CTEs and ltree (hierarchical data) handle the majority of graph use cases. Invest in a graph DB only when: queries require 4+ hops at production scale, graph pattern matching is a core feature, or real-time recommendation/fraud traversal is business-critical. Neo4j's overhead (JVM, complex clustering) isn't worth it for simple hierarchy queries.
Q4
Design a leaderboard system that updates in real-time and serves top-N queries at scale.
▾
Real-time leaderboards need: O(log N) score updates, O(log N) rank queries, and efficient top-K retrieval. Redis Sorted Sets (ZSET) are purpose-built for this: ZADD is O(log N), ZRANK is O(log N), ZRANGE/ZREVRANGE is O(log N + K). For tens of millions of players, a single Redis shard can serve the leaderboard with millisecond latency.
Deep Dive
Redis Sorted Set operations:
ZADD leaderboard:global 9500 user:123 — O(log N)
ZRANK leaderboard:global user:123 — O(log N), returns 0-indexed rank
ZREVRANGE leaderboard:global 0 99 — O(log N + 100), top 100
ZINCRBY leaderboard:global 50 user:123 — O(log N), atomic increment
Ranking window complexity: "Rank among my friends" — maintain a separate ZSET per user of their social graph + scores. Update on each score change: ZINCRBY friend_leaderboard:{user_id} delta friend_id for all friends. Fan-out on write. Or: on read, ZUNIONSTORE a temp set from all friend score ZSETs — fan-out on read, but single lookup at display time.
Handling 100M players: Redis ZSET performance is excellent up to ~10M members. Beyond that: shard by user_id bucket (ZSET per shard), maintain a global top-1000 in a separate ZSET (reconciled periodically). For the "exact global rank" of any user: probabilistic rank estimation + exact rank only for top 10K.
Persistence: Redis is in-memory. Persist to Postgres asynchronously: Redis acts as the write buffer and read cache. Background job drains Redis scores to Postgres every 60s. On Redis failure: reload from Postgres snapshot + replay score events from Kafka.
Windowed leaderboards (daily/weekly): Use EXPIREAT on the ZSET key — the key for "today's leaderboard" expires at midnight UTC. No manual cleanup needed. Or: key = leaderboard:{date} rotated daily.
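The sorted-set semantics above can be modeled with a toy in-memory class (plain Python with a sorted list; Redis uses a skiplist, which keeps inserts at true O(log N) rather than the O(N) list shift here):

```python
import bisect

class Leaderboard:
    """Toy stand-in for a Redis sorted set (illustration only)."""
    def __init__(self):
        self.scores = {}   # member -> score
        self.sorted = []   # ascending (score, member) pairs

    def zadd(self, member, score):
        if member in self.scores:
            old = (self.scores[member], member)
            self.sorted.pop(bisect.bisect_left(self.sorted, old))
        self.scores[member] = score
        bisect.insort(self.sorted, (score, member))

    def zrevrange(self, start, stop):
        # Top entries, highest score first (like ZREVRANGE start stop)
        return [m for _, m in reversed(self.sorted)][start:stop + 1]

    def zrevrank(self, member):
        # 0-indexed rank from the top (like ZREVRANK)
        idx = bisect.bisect_left(self.sorted, (self.scores[member], member))
        return len(self.sorted) - 1 - idx
```

Updating a score is a remove-then-insert on the ordered structure, which is exactly why ZADD and ZINCRBY stay logarithmic in Redis.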
How do you handle ties in ranking (same score, different players)?
How would you design a leaderboard for 10B players where top-N must be exact?
⚑ Staff signal: Redis Sorted Set is the canonical answer for leaderboards — interviewers expect it. The sophisticated follow-up is the "friends leaderboard" problem: naive fan-out on write requires updating N ZSETs per score change (N = friend count). For power users with 100K friends, this is 100K Redis writes per score change. Solution: hybrid approach — fan-out on write for users with <1000 friends, fan-out on read (union at display time) for power users.