Interactive Reference

Database
Visualizers

Animated, interactive diagrams for the hardest database concepts. Click, step through, and build intuition — not just definitions.

Consistent Hashing Ring MVCC Snapshots Cache Stampede CAP Explorer B-Tree Search Deadlock Cycle Replication Lag Isolation Anomalies Sharding & Hotspots WAL & Crash Recovery Distributed Transactions
01 — Consistent Hashing

The Ring

Keys are hashed onto a ring. Each node owns keys clockwise until the next node. Adding/removing a node only moves ~1/N of the keys — not a full reshuffle.

⚑ Staff signal: explain why consistent hashing minimizes data movement on rebalance. Mention virtual nodes (vnodes) for even distribution.
Node
Key
New Node
Removed Node
Click "Hash Key" to see placement
Data moved on add/remove
~1/N keys
vs 100% with modular hashing
Virtual nodes (vnodes)
Each node → 150 vnodes
Ensures even distribution
Used by
Cassandra, DynamoDB
Also CDN cache routing
Failure mode
Hot partition
One vnode range gets disproportionate traffic. Fix: split hot range, add vnodes, or use client-side routing
Alternative
Rendezvous (HRW) hashing
Rank all nodes per key — highest wins. No ring needed. Used in GitHub load balancer, Ceph CRUSH
Interview one-liner
"Only ~K/N keys move"
On rebalance, consistent hashing moves O(K/N) keys vs O(K) with mod-N. This is the core insight.
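The ring can be sketched in a few lines of Python. This is a minimal sketch, not a production implementation: node names, the vnode count, and the key set are all made up. It demonstrates the core claim that adding a node relocates only the keys that now belong to it:

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    # Deterministic position on a 2^32 ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, vnodes=150):
        self.vnodes = vnodes
        self.ring = []                      # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node appears as `vnodes` points for even spread
        for i in range(self.vnodes):
            bisect.insort(self.ring, (ring_pos(f"{node}#{i}"), node))

    def lookup(self, key):
        # Owner = first vnode clockwise from the key's position
        i = bisect.bisect(self.ring, (ring_pos(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")                          # rebalance: add a fourth node
moved = [k for k in keys if ring.lookup(k) != before[k]]
# Roughly K/N of the keys move, and every one moves TO node-d
```

Every relocated key lands on the new node; with mod-N hashing, adding a fourth node would remap roughly three quarters of the keys instead.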
02 — MVCC

Multi-Version Concurrency Control

PostgreSQL never overwrites data in place. Each transaction sees a consistent snapshot of the DB from when it started. Readers never block writers; writers never block readers.

⚑ This is why Postgres can have high concurrency. Explain xmin/xmax hidden columns and that old versions are reclaimed by VACUUM.
Txn A (Reader)
Txn B (Writer)
Row Version
Invisible to Txn A
Step through MVCC lifecycle
xmin
Creator txn ID
Row is visible if xmin committed before your snapshot
xmax
Deleter txn ID
0 means row is alive. Non-zero = deleted/updated
VACUUM
Reaps dead rows
Without it, tables grow unbounded (bloat)
Failure mode
Table bloat from long txns
Long-running transactions pin old row versions. VACUUM can't reclaim them → table grows unbounded. Monitor pg_stat_user_tables.n_dead_tup
Tuning lever
autovacuum_vacuum_scale_factor
Default 0.2 = vacuum when 20% dead. For high-write tables set to 0.01–0.05. Also tune autovacuum_naptime, cost_delay
Interview one-liner
"Readers never block writers"
This is MVCC's superpower. Each txn sees a snapshot. The cost: dead row cleanup (VACUUM) and transaction ID wraparound prevention (FREEZE)
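The xmin/xmax visibility rule above reduces to a small predicate. A sketch under simplifying assumptions: field names mirror Postgres's hidden columns, and the snapshot is modeled as just the set of txn IDs committed when the reader started:

```python
def visible(row, snapshot):
    """Is this row version visible to a reader whose snapshot saw
    `snapshot` as the set of already-committed txn ids?"""
    # The creator must have committed before our snapshot was taken...
    if row["xmin"] not in snapshot:
        return False
    # ...and the deleter must not have (xmax = 0 means never deleted)
    if row["xmax"] != 0 and row["xmax"] in snapshot:
        return False
    return True

snapshot = {100, 101}                    # txns committed when we started

v1 = {"xmin": 100, "xmax": 105}          # created by 100, deleted by 105
v2 = {"xmin": 105, "xmax": 0}            # created by in-flight txn 105
# v1 is visible to us (105 hadn't committed yet) and v2 is not:
# two versions of one row, and each snapshot picks at most one
```

Note how a later snapshot that includes txn 105 flips both answers, which is exactly why VACUUM can only reap v1 once no live snapshot can still see it.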
03 — Cache Stampede

Without vs With Mutex

When a hot key expires, thousands of requests simultaneously miss the cache and pile onto the database. Watch the two strategies side-by-side.

⚑ Name all three failure modes: Stampede (hot key expires), Avalanche (many keys expire together), Penetration (non-existent keys). Each has a different fix.
Client Request
Cache Hit
DB Hit
Mutex Wait
Choose a scenario to simulate
Fix 1: Mutex
Lock + wait
One request populates, others queue
Fix 2: XFetch
Probabilistic early refresh
Refresh before TTL with probability ∝ recompute_time
Fix 3: Stale-while-revalidate
Serve stale, refresh async
No user ever waits. Data may be briefly old
Avalanche
Mass TTL expiry
Many keys expire at same second (e.g., deployed at same time). Fix: TTL jitter — add random(0, spread) to each key's TTL
Penetration
Non-existent key lookups
Attacker queries IDs that don't exist → always miss cache, always hit DB. Fix: Bloom filter or cache negative results with short TTL
Hot key sharding
Shard popular keys across N slots
key = original_key + ":" + hash(request_id) % N. Spreads single hot key across N cache entries. Used at Twitter/Meta scale
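Fix 1 above (the per-key mutex) can be sketched with Python threads. `load_from_db` is a made-up stand-in for the slow recompute; the double-check inside the lock is what stops the pile-up:

```python
import threading
import time

cache = {}
key_locks = {}
registry_lock = threading.Lock()
db_hits = 0

def load_from_db(key):
    # Hypothetical slow recompute: the thing we must not stampede
    global db_hits
    db_hits += 1
    time.sleep(0.05)
    return f"value:{key}"

def get(key):
    if key in cache:                      # fast path: cache hit
        return cache[key]
    with registry_lock:                   # one lock object per key
        lock = key_locks.setdefault(key, threading.Lock())
    with lock:
        # Double-check: another request may have filled it while we waited
        if key not in cache:
            cache[key] = load_from_db(key)
    return cache[key]

# 20 simultaneous misses on a hot key: exactly one reaches the DB
threads = [threading.Thread(target=get, args=("hot",)) for _ in range(20)]
for t in threads: t.start()
for t in threads: t.join()
```

The cost is that 19 requests block behind the lock; stale-while-revalidate trades that waiting for briefly stale data.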
04 — CAP Theorem

Partition Tolerance is Not Optional

In any distributed system, network partitions happen. The real tradeoff during a partition is: do you stay consistent (and return errors) or stay available (and serve stale data)?

⚑ CA systems only exist on a single machine. In interviews, say "since P is mandatory, the choice is C vs A during a partition."
Click a region to explore
CP Examples
etcd, HBase, Zookeeper
Returns error during partition. Leader election.
AP Examples
Cassandra, CouchDB, DNS
Returns stale data during partition. Eventually consistent.
Tunable
DynamoDB, Cassandra
W + R > N (e.g. QUORUM writes + QUORUM reads) gives strongly consistent reads
PACELC extension
If Partition → A or C; Else → Latency or Consistency
CAP only describes partition behavior. PACELC adds the normal-operation tradeoff. DynamoDB: PA/EL. Postgres: PC/EC
Consistency spectrum
Linearizable → Causal → Eventual
Linearizable: single-copy illusion. Causal: respects happens-before. Eventual: converges "sometime." Most apps need causal, not linearizable
Interview one-liner
"P is not a choice — it's a fact"
Networks partition. The question is always: during a partition, do you serve errors (CP) or stale data (AP)? CA only exists on a single node
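The "Tunable" card's claim can be checked by brute force: with W + R > N, every possible read set must intersect every possible write-ack set (pigeonhole), so some read replica always holds the latest acknowledged write. A sketch:

```python
from itertools import combinations

def read_always_sees_write(n, w, r):
    # Enumerate every set of W write-acks vs every set of R read replicas:
    # if any pair is disjoint, a read could miss the latest write
    replicas = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )

# N=3 with QUORUM (2) on both sides: W + R = 4 > 3, sets must overlap.
# N=3 with ONE   (1) on both sides: W + R = 2 <= 3, a stale read is possible.
```

This is why dialing W and R is really dialing where you sit on the C-vs-latency axis of PACELC.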
05 — B-Tree Index

How Indexes Find Rows

A B-Tree is a balanced tree where every leaf is at the same depth. Searches always take O(log N) steps regardless of data distribution. This is the default index in every major database.

⚑ B-Tree supports equality, range, and sort. Hash index is faster for equality-only but can't do range queries. Know when each applies.
Press Search to traverse the tree
Time complexity
O(log N)
Height ≈ log_b(N), b = branching factor (typically 100–1000)
Leaf nodes
Linked list
Range scans are sequential reads along leaf level
Write cost
O(log N) + rebalance
Every index pays this cost on each write, which is why over-indexed tables write slowly
Composite index rule
Left-prefix matching only
Index on (a, b, c): WHERE a=1 AND b=2 ✓. WHERE b=2 ✗. WHERE a=1 AND c=3 → only uses 'a' prefix. Order matters.
Covering index
Index-only scan — no heap access
If all SELECTed columns are in the index, DB never reads the table. Postgres: INCLUDE clause. Huge perf win for hot queries
Clustered vs secondary
Clustered = data order = index order
InnoDB: primary key IS the clustered index. Secondary indexes point to PK, not heap. Postgres: all indexes are secondary (use CLUSTER to reorder)
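The traversal can be sketched on a tiny hand-built tree. Dict-based nodes stand in for real disk pages; the point is that the number of node visits equals the tree height, not N:

```python
import bisect

# A height-2 B-Tree: one root, three leaves, every leaf at the same depth
tree = {
    "keys": [40, 80],
    "children": [
        {"keys": [10, 20, 30], "children": None},
        {"keys": [50, 60, 70], "children": None},
        {"keys": [90, 100],    "children": None},
    ],
}

def search(node, target, steps=0):
    # Binary-search within the node, then descend the matching child
    i = bisect.bisect_left(node["keys"], target)
    if i < len(node["keys"]) and node["keys"][i] == target:
        return True, steps + 1
    if node["children"] is None:          # leaf: nowhere left to look
        return False, steps + 1
    return search(node["children"][i], target, steps + 1)

# Any key is found (or ruled out) in at most 2 node visits: the height
```

With a realistic branching factor of a few hundred, three levels already cover tens of millions of keys, which is why index lookups stay fast as tables grow.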
06 — Deadlock

Wait-For Cycle Detection

A deadlock occurs when transactions form a circular wait. The database detects the cycle in the wait-for graph and kills one transaction (the victim) to break it.

⚑ Prevention: always acquire locks in the same global order (sort IDs ascending). Detection: DB cycle detection + victim rollback. Always set lock_timeout.
Step through lock acquisition
Detection method
Cycle in wait-for graph
DB checks periodically (e.g. every 1s in PostgreSQL)
Victim selection
Lowest cost to rollback
Fewest locks held, least work done, youngest txn
Prevention
Ordered lock acquisition
Always lock row_id=5 before row_id=7. Never reverse.
Lock modes
Shared (S) vs Exclusive (X)
S+S compatible. S+X conflict. X+X conflict. SELECT FOR SHARE = S lock. SELECT FOR UPDATE = X lock. UPDATE/DELETE = X lock
Gap locks (InnoDB)
Lock ranges between index keys
Prevents phantom reads under REPEATABLE READ. Can cause unexpected deadlocks on INSERT into adjacent ranges. Postgres uses predicate locks instead
Practical tips
NOWAIT / SKIP LOCKED / lock_timeout
NOWAIT: fail immediately if locked. SKIP LOCKED: skip locked rows (great for job queues). Always set lock_timeout to avoid indefinite waits
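The prevention rule above, always lock in a global order, as a thread sketch (row ids and the "work" are made up):

```python
import threading

row_locks = {5: threading.Lock(), 7: threading.Lock()}
log = []

def txn(name, rows):
    # Sort first: every txn climbs the same ladder, so no cycle can form
    ordered = sorted(rows)
    for r in ordered:
        row_locks[r].acquire()
    log.append(name)                      # the critical section
    for r in reversed(ordered):
        row_locks[r].release()

# Without sorting, t1 (5 then 7) and t2 (7 then 5) could deadlock;
# sorted, both take 5 before 7 and one simply waits for the other
t1 = threading.Thread(target=txn, args=("t1", [5, 7]))
t2 = threading.Thread(target=txn, args=("t2", [7, 5]))
t1.start(); t2.start()
t1.join(timeout=2); t2.join(timeout=2)
```

The same principle scales up: sort the rows (or shards) you will touch before the first lock request, not as you go.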
07 — Replication Lag

Primary → Replica Delay

Writes go to the primary and asynchronously stream to replicas via WAL (Write-Ahead Log). During that lag window, reads from replicas may see stale data.

⚑ Solution for read-your-writes: route reads to primary for a short window after write, OR pass a replication token (LSN) and have replica wait to catch up.
WAL Stream
Committed Write
Stale Read
Consistent Read
Write to primary, then read from replica to see lag
Async replication lag
Typically 10–200ms
Can grow to minutes under write pressure
Sync replication
Zero lag, higher write latency
Primary waits for ≥1 replica ack before committing
Monitor via
pg_stat_replication
write_lag, flush_lag, replay_lag columns
Sync modes
Async → Semi-sync → Sync
Async: fastest writes, may lose data on failover. Semi-sync: ≥1 replica acks before commit. Sync: all replicas ack — highest durability, worst latency
Read-your-writes pattern
Route to primary after write
After INSERT, read from primary for N seconds. Or pass LSN token — replica waits until replay reaches that position before serving read
Failover timeline
Detect (5–30s) → Elect → Promote → Redirect
Total: 10–60s typically. During failover: writes fail, reads may serve stale. pg_stat_replication.replay_lag is your key metric
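The LSN-token pattern above as a sketch: a toy replica whose replay trails the primary, where the client carries the LSN of its own write and the replica blocks until replay catches up. Class names and timings are made up:

```python
import threading
import time

class Replica:
    def __init__(self):
        self.replay_lsn = 0               # how far WAL replay has gotten
        self.data = {}

    def apply(self, lsn, key, value):     # async WAL replay
        self.data[key] = value
        self.replay_lsn = lsn

    def read(self, key, min_lsn, timeout=2.0):
        # Read-your-writes: block until replay reaches the client's token
        deadline = time.monotonic() + timeout
        while self.replay_lsn < min_lsn:
            if time.monotonic() > deadline:
                raise TimeoutError("replica lag exceeded timeout")
            time.sleep(0.005)
        return self.data[key]

replica = Replica()
replica.apply(1, "profile", "old")

# Primary accepts a write at LSN 2; replay arrives 50ms later
threading.Thread(
    target=lambda: (time.sleep(0.05), replica.apply(2, "profile", "new"))
).start()

# Without the token this read would return "old"; with it, we wait
value = replica.read("profile", min_lsn=2)
```

The timeout matters: if the replica has fallen minutes behind, failing over the read to the primary beats stalling the user.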
08 — Isolation Levels

Transaction Anomalies

Different isolation levels protect against different anomalies. MVCC alone does not guarantee serializability. Step through each anomaly to see exactly when and why data goes wrong.

⚑ Staff signal: "Snapshot isolation prevents dirty/non-repeatable/phantom reads but still allows write skew. For correctness-critical paths, you need SELECT FOR UPDATE or SERIALIZABLE."
Txn 1
Txn 2
Anomaly
Lock / Block
Select an anomaly to visualize
Read Uncommitted
Allows dirty reads
Almost never used. Sees uncommitted writes from other txns. No real database defaults to this
Read Committed
Postgres default
Each statement sees latest committed data. Same SELECT can return different results within one txn. Prevents dirty reads only
Repeatable Read
Snapshot at txn start
Postgres: uses MVCC snapshot. MySQL/InnoDB: uses gap locks. Both prevent non-repeatable reads. Postgres still allows write skew
Serializable
Full correctness guarantee
Postgres SSI: optimistic, detects conflicts at commit. InnoDB: pessimistic, uses gap+next-key locks. Expect serialization failures — must retry in app code
Write skew example
Double-booking / overselling
Two txns each check "seats available > 0", both see true, both INSERT. Result: oversold. Fix: SELECT FOR UPDATE or SERIALIZABLE
Interview one-liner
"MVCC ≠ serializable"
Snapshot isolation is powerful but not a silver bullet. For invariants like "at most N", you need explicit locking or serializable isolation
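The double-booking anomaly as a race sketch: Python threads stand in for two snapshot-isolated txns, and a lock plays the role of SELECT FOR UPDATE by making check-and-decrement atomic. The seat counts and names are made up:

```python
import threading
import time

seats = {"available": 1}
bookings = []
row_lock = threading.Lock()               # the SELECT FOR UPDATE stand-in

def book_with_skew(user):
    # Check and write are separate steps: both txns can pass the check
    if seats["available"] > 0:
        time.sleep(0.01)                  # widen the race window
        bookings.append(user)
        seats["available"] -= 1           # can go negative: oversold

def book_locked(user):
    # The lock spans the check AND the write, so the invariant holds
    with row_lock:
        if seats["available"] > 0:
            bookings.append(user)
            seats["available"] -= 1

threads = [threading.Thread(target=book_locked, args=(u,)) for u in ("A", "B")]
for t in threads: t.start()
for t in threads: t.join()
# Exactly one booking succeeds; book_with_skew could have sold the seat twice
```

In SQL terms, `book_locked` is `SELECT ... FOR UPDATE` followed by the decrement in the same transaction; SERIALIZABLE achieves the same invariant optimistically, at the cost of retry-on-conflict logic.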
09 — Sharding

Partition Key & Hotspots

Sharding distributes data across nodes by a partition key. The wrong key creates hotspots — one shard handles disproportionate load while others idle. Choose keys with high cardinality and uniform distribution.

⚑ Staff signal: "Hash sharding gives uniform distribution but loses range queries. Range sharding preserves ordering but creates hotspots on sequential inserts. The partition key choice is the most consequential schema decision."
Shard (normal)
Hot Shard
Request
Scatter-Gather
Choose a partition strategy
Hash partitioning
hash(key) % N
Uniform distribution. Cannot do range scans across shards efficiently. Used by DynamoDB, Cassandra (with vnodes)
Range partitioning
key_range → shard
Supports efficient range queries within one shard. Sequential keys (timestamps, auto-increment IDs) create write hotspots on last shard
Scatter-gather cost
Latency = max(all shards)
Query without partition key hits ALL shards. P99 dominated by slowest shard. Cross-shard joins are extremely expensive
Rebalancing
Split → migrate → cutover
Online resharding requires dual-write period or change-data-capture. Vitess, Citus, CockroachDB handle this automatically. DIY is risky
Whale tenant problem
One customer = 40% of traffic
Dedicated shard for whale tenants. Or sub-shard within tenant by secondary key. Monitor per-shard QPS to detect early
Interview one-liner
"Shard by access pattern, not by data"
Pick the key that appears in every query's WHERE clause. If queries need multiple partition keys, you need secondary indexes or denormalization
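Hash vs range partitioning on sequential keys, as a sketch with made-up shard counts and range boundaries. It shows the write hotspot the cards describe:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def hash_shard(key):
    # Uniform spread, but keys 900 and 901 land on unrelated shards,
    # so a range scan must scatter-gather across all of them
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_SHARDS

def range_shard(key, boundaries=(250, 500, 750)):
    # Ordering preserved, but sequential inserts pile onto the last shard
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)

new_ids = range(900, 1000)                # today's auto-increment ids
hash_load = Counter(hash_shard(k) for k in new_ids)
range_load = Counter(range_shard(k) for k in new_ids)
# range_load puts all 100 writes on shard 3; hash_load spreads them out
```

This is the tradeoff in miniature: the same key choice that makes range queries cheap concentrates every fresh insert on one shard.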
10 — Write-Ahead Log

WAL & Crash Recovery

The WAL ensures durability: every change is written to an append-only log BEFORE modifying data pages. On crash, the database replays the WAL from the last checkpoint to recover committed transactions.

⚑ Staff signal: "Commit means WAL is fsync'd to disk. Data pages are written lazily by the background writer. Checkpoints bound recovery time but cause I/O spikes. Spread checkpoints to avoid latency cliffs."
WAL Record
Data Page (clean)
Data Page (dirty)
Checkpoint
Crash Point
Write data to see WAL in action
WAL-before-data rule
Log first, then modify pages
WAL record must be fsync'd before transaction is acknowledged as committed. This is what makes COMMIT durable
Group commit
Batch multiple txns into one fsync
Instead of one fsync per commit, batch N commits into one disk write. Dramatically improves throughput. commit_delay in Postgres
Checkpoint
Flush all dirty pages to disk
Creates a known-good recovery point. Old WAL segments before checkpoint can be recycled. checkpoint_completion_target spreads I/O
Recovery time
Proportional to WAL since last checkpoint
More frequent checkpoints = faster recovery but more I/O during normal operation. Typical: checkpoint every 5–15 min
Redo vs Undo
Postgres: redo-only (MVCC handles undo)
InnoDB uses undo logs for rollback + redo logs for recovery. Postgres skips undo because dead tuples are kept in-place (VACUUM cleans later)
Interview one-liner
"COMMIT = WAL fsync, not page write"
The actual data pages are written asynchronously by the background writer. This is why random writes become sequential WAL appends — huge perf win
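Crash recovery as a sketch: an append-only list of log records stands in for the fsync'd WAL, and `recover()` replays only transactions whose commit record survived (redo-only, in the Postgres style; a real WAL also carries LSNs and checkpoints):

```python
def recover(wal):
    # Redo pass: find txns whose COMMIT record survived the crash...
    committed = {rec["txn"] for rec in wal if rec.get("commit")}
    # ...then replay only their changes onto fresh pages
    pages = {}
    for rec in wal:
        if "key" in rec and rec["txn"] in committed:
            pages[rec["key"]] = rec["value"]
    return pages

# The durable log at crash time: txn 1 committed, txn 2 did not
wal = [
    {"txn": 1, "key": "a", "value": 10},
    {"txn": 1, "commit": True},           # COMMIT = this record hit disk
    {"txn": 2, "key": "b", "value": 20},  # never committed: must vanish
]

pages = recover(wal)                      # rebuilt pages: {"a": 10}
```

The WAL-before-data rule is what makes this sound: because every page change was logged before the page could be written, replaying the log from the last checkpoint reconstructs all committed state.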
11 — Distributed Transactions

Cross-Shard Coordination

When a transaction spans multiple shards, single-node deadlock detection and ACID guarantees break down. You need a coordination protocol — each with real tradeoffs in latency, availability, and failure modes.

⚑ Staff signal: "Avoid distributed transactions when possible — redesign the data model so hot paths stay within one shard. When you can't avoid them, choose 2PC for strong consistency or Sagas for availability. Never pretend the problem doesn't exist."
Coordinator
Shard (OK)
Committed
Aborted / Failed
Compensating Action
Choose a scenario to simulate
Two-Phase Commit (2PC)
Prepare → Commit
Coordinator asks all shards to PREPARE (lock resources). If all vote YES, send COMMIT. If any votes NO, send ABORT. Guarantees atomicity across shards
2PC failure mode
Coordinator crash = blocking
If coordinator crashes after PREPARE but before COMMIT, all participants hold locks indefinitely. This is the fundamental weakness of 2PC
Saga pattern
Forward actions + compensations
Each step has an undo. If step 3 fails, run compensate_2, compensate_1 in reverse. No global lock — higher availability, eventual consistency
Cross-shard deadlock
No global wait-for graph
Each shard only sees local waits. Shard A: T1→T2. Shard B: T2→T1. Neither detects the cycle. Fix: global lock ordering, timeouts, or a distributed deadlock detector
Best practice
Avoid cross-shard txns entirely
Model data so transactions stay within one shard. Denormalize. Use eventual consistency with idempotent events between shards. Reserve 2PC for the rare cases that truly need it
Interview one-liner
"2PC for correctness, Sagas for availability"
2PC blocks on failure but guarantees atomicity. Sagas never block but require compensating logic and tolerate intermediate inconsistency. Know when each applies
Global lock ordering
Sort shard IDs, lock in order
Same principle as single-node deadlock prevention, applied globally. Always acquire shard_1 before shard_2. Prevents cycles in the distributed wait-for graph
Timeout-based detection
lock_timeout = 3–5s across shards
If a cross-shard lock isn't acquired within timeout, abort and retry. Simple, pragmatic. Used by CockroachDB, Vitess, Citus. Better than waiting forever
Real-world usage
Google Spanner: Paxos + 2PC
Spanner uses Paxos groups per shard for fault tolerance + 2PC across groups for atomicity. TrueTime provides global ordering. Most teams use Sagas instead
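2PC's two phases in miniature: a sketch where an unhealthy shard votes NO in the prepare phase. Class names and the health flag are made up, and a real coordinator also persists its own decision to a log so it can resume after a crash:

```python
class Shard:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.prepared = self.committed = False

    def prepare(self):                    # phase 1: lock resources, vote
        self.prepared = self.healthy
        return self.prepared

    def commit(self):                     # phase 2a: make it permanent
        self.committed, self.prepared = True, False

    def abort(self):                      # phase 2b: release locks
        self.prepared = False

def two_phase_commit(shards):
    votes = [s.prepare() for s in shards]     # ask everyone first
    if all(votes):
        for s in shards:
            s.commit()                        # unanimous YES: commit all
        return "committed"
    for s in shards:
        s.abort()                             # any NO: abort everywhere
    return "aborted"

ok = two_phase_commit([Shard("s1"), Shard("s2")])
bad = two_phase_commit([Shard("s1"), Shard("s2", healthy=False)])
```

The blocking failure mode is visible in the structure: if the coordinator dies between the two loops, every shard that voted YES sits in `prepared` holding locks with no one to tell it what to do.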