Interactive Reference

Database
Visualizers

Animated, interactive diagrams for the hardest database concepts. Click, step through, and build intuition — not just definitions.

Consistent Hashing Ring MVCC Snapshots Cache Stampede CAP Explorer B-Tree Search Deadlock Cycle Replication Lag Isolation Anomalies Sharding & Hotspots WAL & Crash Recovery Distributed Transactions
01 — Consistent Hashing

The Ring

Keys are hashed onto a ring. Each node owns keys clockwise until the next node. Adding/removing a node only moves ~1/N of the keys — not a full reshuffle.

⚑ Staff signal: explain why consistent hashing minimizes data movement on rebalance. Mention virtual nodes (vnodes) for even distribution.
Node
Key
New Node
Removed Node
Click "Hash Key" to see placement
Data moved on add/remove
~1/N keys
vs 100% with modular hashing
Virtual nodes (vnodes)
Each node → 150 vnodes
Ensures even distribution
Used by
Cassandra, DynamoDB
Also CDN cache routing
Failure mode
Hot partition
One vnode range gets disproportionate traffic. Fix: split hot range, add vnodes, or use client-side routing
Alternative
Rendezvous (HRW) hashing
Rank all nodes per key — highest wins. No ring needed. Used in GitHub load balancer, Ceph CRUSH
Interview one-liner
"Only ~K/N keys move"
On rebalance, consistent hashing moves O(K/N) keys vs O(K) with mod-N. This is the core insight.
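The ring can be sketched in a few lines of Python. This is a minimal sketch, not a production implementation: node names, the vnode count, and the key set are all made up. It demonstrates the core claim that adding a node relocates only the keys that now belong to it:

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    # Deterministic position on a 2^32 ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, vnodes=150):
        self.vnodes = vnodes
        self.ring = []                      # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node appears as `vnodes` points for even spread
        for i in range(self.vnodes):
            bisect.insort(self.ring, (ring_pos(f"{node}#{i}"), node))

    def lookup(self, key):
        # Owner = first vnode clockwise from the key's position
        i = bisect.bisect(self.ring, (ring_pos(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")                          # rebalance: add a fourth node
moved = [k for k in keys if ring.lookup(k) != before[k]]
# Roughly K/N of the keys move, and every one moves TO node-d
```

Every relocated key lands on the new node; with mod-N hashing, adding a fourth node would remap roughly three quarters of the keys instead.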
02 — MVCC

Multi-Version Concurrency Control

PostgreSQL never overwrites data in place. Each transaction sees a consistent snapshot of the DB from when it started. Readers never block writers; writers never block readers.

⚑ This is why Postgres can have high concurrency. Explain xmin/xmax hidden columns and that old versions are reclaimed by VACUUM.
Txn A (Reader)
Txn B (Writer)
Row Version
Invisible to Txn A
Step through MVCC lifecycle
xmin
Creator txn ID
Row is visible if xmin committed before your snapshot
xmax
Deleter txn ID
0 means row is alive. Non-zero = deleted/updated
VACUUM
Reaps dead rows
Without it, tables grow unbounded (bloat)
Failure mode
Table bloat from long txns
Long-running transactions pin old row versions. VACUUM can't reclaim them → table grows unbounded. Monitor pg_stat_user_tables.n_dead_tup
Tuning lever
autovacuum_vacuum_scale_factor
Default 0.2 = vacuum when 20% dead. For high-write tables set to 0.01–0.05. Also tune autovacuum_naptime, cost_delay
Interview one-liner
"Readers never block writers"
This is MVCC's superpower. Each txn sees a snapshot. The cost: dead row cleanup (VACUUM) and transaction ID wraparound prevention (FREEZE)
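The xmin/xmax visibility rule above reduces to a small predicate. A sketch under simplifying assumptions: field names mirror Postgres's hidden columns, and the snapshot is modeled as just the set of txn IDs committed when the reader started:

```python
def visible(row, snapshot):
    """Is this row version visible to a reader whose snapshot saw
    `snapshot` as the set of already-committed txn ids?"""
    # The creator must have committed before our snapshot was taken...
    if row["xmin"] not in snapshot:
        return False
    # ...and the deleter must not have (xmax = 0 means never deleted)
    if row["xmax"] != 0 and row["xmax"] in snapshot:
        return False
    return True

snapshot = {100, 101}                    # txns committed when we started

v1 = {"xmin": 100, "xmax": 105}          # created by 100, deleted by 105
v2 = {"xmin": 105, "xmax": 0}            # created by in-flight txn 105
# v1 is visible to us (105 hadn't committed yet) and v2 is not:
# two versions of one row, and each snapshot picks at most one
```

Note how a later snapshot that includes txn 105 flips both answers, which is exactly why VACUUM can only reap v1 once no live snapshot can still see it.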
03 — Cache Stampede

Without vs With Mutex

When a hot key expires, thousands of requests simultaneously miss the cache and pile onto the database. Watch the two strategies side-by-side.

⚑ Name all three failure modes: Stampede (hot key expires), Avalanche (many keys expire together), Penetration (non-existent keys). Each has a different fix.
Client Request
Cache Hit
DB Hit
Mutex Wait
Choose a scenario to simulate
Fix 1: Mutex
Lock + wait
One request populates, others queue
Fix 2: XFetch
Probabilistic early refresh
Refresh before TTL with probability ∝ recompute_time
Fix 3: Stale-while-revalidate
Serve stale, refresh async
No user ever waits. Data may be briefly old
Avalanche
Mass TTL expiry
Many keys expire at same second (e.g., deployed at same time). Fix: TTL jitter — add random(0, spread) to each key's TTL
Penetration
Non-existent key lookups
Attacker queries IDs that don't exist → always miss cache, always hit DB. Fix: Bloom filter or cache negative results with short TTL
Hot key sharding
Shard popular keys across N slots
key = original_key + ":" + hash(request_id) % N. Spreads single hot key across N cache entries. Used at Twitter/Meta scale
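Fix 1 above (the per-key mutex) can be sketched with Python threads. `load_from_db` is a made-up stand-in for the slow recompute; the double-check inside the lock is what stops the pile-up:

```python
import threading
import time

cache = {}
key_locks = {}
registry_lock = threading.Lock()
db_hits = 0

def load_from_db(key):
    # Hypothetical slow recompute: the thing we must not stampede
    global db_hits
    db_hits += 1
    time.sleep(0.05)
    return f"value:{key}"

def get(key):
    if key in cache:                      # fast path: cache hit
        return cache[key]
    with registry_lock:                   # one lock object per key
        lock = key_locks.setdefault(key, threading.Lock())
    with lock:
        # Double-check: another request may have filled it while we waited
        if key not in cache:
            cache[key] = load_from_db(key)
    return cache[key]

# 20 simultaneous misses on a hot key: exactly one reaches the DB
threads = [threading.Thread(target=get, args=("hot",)) for _ in range(20)]
for t in threads: t.start()
for t in threads: t.join()
```

The cost is that 19 requests block behind the lock; stale-while-revalidate trades that waiting for briefly stale data.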
04 — CAP Theorem

Partition Tolerance is Not Optional

In any distributed system, network partitions happen. The real tradeoff during a partition is: do you stay consistent (and return errors) or stay available (and serve stale data)?

⚑ CA systems only exist on a single machine. In interviews, say "since P is mandatory, the choice is C vs A during a partition."
Click a region to explore
CP Examples
etcd, HBase, Zookeeper
Returns error during partition. Leader election.
AP Examples
Cassandra, CouchDB, DNS
Returns stale data during partition. Eventually consistent.
Tunable
DynamoDB, Cassandra
W + R > N (e.g. QUORUM writes + QUORUM reads) gives strongly consistent reads
PACELC extension
If Partition → A or C; Else → Latency or Consistency
CAP only describes partition behavior. PACELC adds the normal-operation tradeoff. DynamoDB: PA/EL. Postgres: PC/EC
Consistency spectrum
Linearizable → Causal → Eventual
Linearizable: single-copy illusion. Causal: respects happens-before. Eventual: converges "sometime." Most apps need causal, not linearizable
Interview one-liner
"P is not a choice — it's a fact"
Networks partition. The question is always: during a partition, do you serve errors (CP) or stale data (AP)? CA only exists on a single node
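The "Tunable" card's claim can be checked by brute force: with W + R > N, every possible read set must intersect every possible write-ack set (pigeonhole), so some read replica always holds the latest acknowledged write. A sketch:

```python
from itertools import combinations

def read_always_sees_write(n, w, r):
    # Enumerate every set of W write-acks vs every set of R read replicas:
    # if any pair is disjoint, a read could miss the latest write
    replicas = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )

# N=3 with QUORUM (2) on both sides: W + R = 4 > 3, sets must overlap.
# N=3 with ONE   (1) on both sides: W + R = 2 <= 3, a stale read is possible.
```

This is why dialing W and R is really dialing where you sit on the C-vs-latency axis of PACELC.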
05 — B-Tree Index

How Indexes Find Rows

A B-Tree is a balanced tree where every leaf is at the same depth. Searches always take O(log N) steps regardless of data distribution. This is the default index in every major database.

⚑ B-Tree supports equality, range, and sort. Hash index is faster for equality-only but can't do range queries. Know when each applies.
Press Search to traverse the tree
Time complexity
O(log N)
Height ≈ log_b(N), b = branching factor (typically 100–1000)
Leaf nodes
Linked list
Range scans are sequential reads along leaf level
Write cost
O(log N) + rebalance
Every index pays this cost on each write, which is why over-indexed tables write slowly
Composite index rule
Left-prefix matching only
Index on (a, b, c): WHERE a=1 AND b=2 ✓. WHERE b=2 ✗. WHERE a=1 AND c=3 → only uses 'a' prefix. Order matters.
Covering index
Index-only scan — no heap access
If all SELECTed columns are in the index, DB never reads the table. Postgres: INCLUDE clause. Huge perf win for hot queries
Clustered vs secondary
Clustered = data order = index order
InnoDB: primary key IS the clustered index. Secondary indexes point to PK, not heap. Postgres: all indexes are secondary (use CLUSTER to reorder)
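The traversal can be sketched on a tiny hand-built tree. Dict-based nodes stand in for real disk pages; the point is that the number of node visits equals the tree height, not N:

```python
import bisect

# A height-2 B-Tree: one root, three leaves, every leaf at the same depth
tree = {
    "keys": [40, 80],
    "children": [
        {"keys": [10, 20, 30], "children": None},
        {"keys": [50, 60, 70], "children": None},
        {"keys": [90, 100],    "children": None},
    ],
}

def search(node, target, steps=0):
    # Binary-search within the node, then descend the matching child
    i = bisect.bisect_left(node["keys"], target)
    if i < len(node["keys"]) and node["keys"][i] == target:
        return True, steps + 1
    if node["children"] is None:          # leaf: nowhere left to look
        return False, steps + 1
    return search(node["children"][i], target, steps + 1)

# Any key is found (or ruled out) in at most 2 node visits: the height
```

With a realistic branching factor of a few hundred, three levels already cover tens of millions of keys, which is why index lookups stay fast as tables grow.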
06 — Deadlock

Wait-For Cycle Detection

A deadlock occurs when transactions form a circular wait. The database detects the cycle in the wait-for graph and kills one transaction (the victim) to break it.

⚑ Prevention: always acquire locks in the same global order (sort IDs ascending). Detection: DB cycle detection + victim rollback. Always set lock_timeout.
Step through lock acquisition
Detection method
Cycle in wait-for graph
DB checks periodically (e.g. every 1s in PostgreSQL)
Victim selection
Lowest cost to rollback
Fewest locks held, least work done, youngest txn
Prevention
Ordered lock acquisition
Always lock row_id=5 before row_id=7. Never reverse.
Lock modes
Shared (S) vs Exclusive (X)
S+S compatible. S+X conflict. X+X conflict. SELECT FOR SHARE = S lock. SELECT FOR UPDATE = X lock. UPDATE/DELETE = X lock
Gap locks (InnoDB)
Lock ranges between index keys
Prevents phantom reads under REPEATABLE READ. Can cause unexpected deadlocks on INSERT into adjacent ranges. Postgres uses predicate locks instead
Practical tips
NOWAIT / SKIP LOCKED / lock_timeout
NOWAIT: fail immediately if locked. SKIP LOCKED: skip locked rows (great for job queues). Always set lock_timeout to avoid indefinite waits
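The prevention rule above, always lock in a global order, as a thread sketch (row ids and the "work" are made up):

```python
import threading

row_locks = {5: threading.Lock(), 7: threading.Lock()}
log = []

def txn(name, rows):
    # Sort first: every txn climbs the same ladder, so no cycle can form
    ordered = sorted(rows)
    for r in ordered:
        row_locks[r].acquire()
    log.append(name)                      # the critical section
    for r in reversed(ordered):
        row_locks[r].release()

# Without sorting, t1 (5 then 7) and t2 (7 then 5) could deadlock;
# sorted, both take 5 before 7 and one simply waits for the other
t1 = threading.Thread(target=txn, args=("t1", [5, 7]))
t2 = threading.Thread(target=txn, args=("t2", [7, 5]))
t1.start(); t2.start()
t1.join(timeout=2); t2.join(timeout=2)
```

The same principle scales up: sort the rows (or shards) you will touch before the first lock request, not as you go.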
07 — Replication Lag

Primary → Replica Delay

Writes go to the primary and asynchronously stream to replicas via WAL (Write-Ahead Log). During that lag window, reads from replicas may see stale data.

⚑ Solution for read-your-writes: route reads to primary for a short window after write, OR pass a replication token (LSN) and have replica wait to catch up.
WAL Stream
Committed Write
Stale Read
Consistent Read
Write to primary, then read from replica to see lag
Async replication lag
Typically 10–200ms
Can grow to minutes under write pressure
Sync replication
Zero lag, higher write latency
Primary waits for ≥1 replica ack before committing
Monitor via
pg_stat_replication
write_lag, flush_lag, replay_lag columns
Sync modes
Async → Semi-sync → Sync
Async: fastest writes, may lose data on failover. Semi-sync: ≥1 replica acks before commit. Sync: all replicas ack — highest durability, worst latency
Read-your-writes pattern
Route to primary after write
After INSERT, read from primary for N seconds. Or pass LSN token — replica waits until replay reaches that position before serving read
Failover timeline
Detect (5–30s) → Elect → Promote → Redirect
Total: 10–60s typically. During failover: writes fail, reads may serve stale. pg_stat_replication.replay_lag is your key metric
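The LSN-token pattern above as a sketch: a toy replica whose replay trails the primary, where the client carries the LSN of its own write and the replica blocks until replay catches up. Class names and timings are made up:

```python
import threading
import time

class Replica:
    def __init__(self):
        self.replay_lsn = 0               # how far WAL replay has gotten
        self.data = {}

    def apply(self, lsn, key, value):     # async WAL replay
        self.data[key] = value
        self.replay_lsn = lsn

    def read(self, key, min_lsn, timeout=2.0):
        # Read-your-writes: block until replay reaches the client's token
        deadline = time.monotonic() + timeout
        while self.replay_lsn < min_lsn:
            if time.monotonic() > deadline:
                raise TimeoutError("replica lag exceeded timeout")
            time.sleep(0.005)
        return self.data[key]

replica = Replica()
replica.apply(1, "profile", "old")

# Primary accepts a write at LSN 2; replay arrives 50ms later
threading.Thread(
    target=lambda: (time.sleep(0.05), replica.apply(2, "profile", "new"))
).start()

# Without the token this read would return "old"; with it, we wait
value = replica.read("profile", min_lsn=2)
```

The timeout matters: if the replica has fallen minutes behind, failing over the read to the primary beats stalling the user.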
08 — Isolation Levels

Transaction Anomalies

Different isolation levels protect against different anomalies. MVCC alone does not guarantee serializability. Step through each anomaly to see exactly when and why data goes wrong.

⚑ Staff signal: "Snapshot isolation prevents dirty/non-repeatable/phantom reads but still allows write skew. For correctness-critical paths, you need SELECT FOR UPDATE or SERIALIZABLE."
Txn 1
Txn 2
Anomaly
Lock / Block
Select an anomaly to visualize
Read Uncommitted
Allows dirty reads
Almost never used. Sees uncommitted writes from other txns. No real database defaults to this
Read Committed
Postgres default
Each statement sees latest committed data. Same SELECT can return different results within one txn. Prevents dirty reads only
Repeatable Read
Snapshot at txn start
Postgres: uses MVCC snapshot. MySQL/InnoDB: uses gap locks. Both prevent non-repeatable reads. Postgres still allows write skew
Serializable
Full correctness guarantee
Postgres SSI: optimistic, detects conflicts at commit. InnoDB: pessimistic, uses gap+next-key locks. Expect serialization failures — must retry in app code
Write skew example
Double-booking / overselling
Two txns each check "seats available > 0", both see true, both INSERT. Result: oversold. Fix: SELECT FOR UPDATE or SERIALIZABLE
Interview one-liner
"MVCC ≠ serializable"
Snapshot isolation is powerful but not a silver bullet. For invariants like "at most N", you need explicit locking or serializable isolation
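The double-booking anomaly as a race sketch: Python threads stand in for two snapshot-isolated txns, and a lock plays the role of SELECT FOR UPDATE by making check-and-decrement atomic. The seat counts and names are made up:

```python
import threading
import time

seats = {"available": 1}
bookings = []
row_lock = threading.Lock()               # the SELECT FOR UPDATE stand-in

def book_with_skew(user):
    # Check and write are separate steps: both txns can pass the check
    if seats["available"] > 0:
        time.sleep(0.01)                  # widen the race window
        bookings.append(user)
        seats["available"] -= 1           # can go negative: oversold

def book_locked(user):
    # The lock spans the check AND the write, so the invariant holds
    with row_lock:
        if seats["available"] > 0:
            bookings.append(user)
            seats["available"] -= 1

threads = [threading.Thread(target=book_locked, args=(u,)) for u in ("A", "B")]
for t in threads: t.start()
for t in threads: t.join()
# Exactly one booking succeeds; book_with_skew could have sold the seat twice
```

In SQL terms, `book_locked` is `SELECT ... FOR UPDATE` followed by the decrement in the same transaction; SERIALIZABLE achieves the same invariant optimistically, at the cost of retry-on-conflict logic.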
09 — Sharding

Partition Key & Hotspots

Sharding distributes data across nodes by a partition key. The wrong key creates hotspots — one shard handles disproportionate load while others idle. Choose keys with high cardinality and uniform distribution.

⚑ Staff signal: "Hash sharding gives uniform distribution but loses range queries. Range sharding preserves ordering but creates hotspots on sequential inserts. The partition key choice is the most consequential schema decision."
Shard (normal)
Hot Shard
Request
Scatter-Gather
Choose a partition strategy
Hash partitioning
hash(key) % N
Uniform distribution. Cannot do range scans across shards efficiently. Used by DynamoDB, Cassandra (with vnodes)
Range partitioning
key_range → shard
Supports efficient range queries within one shard. Sequential keys (timestamps, auto-increment IDs) create write hotspots on last shard
Scatter-gather cost
Latency = max(all shards)
Query without partition key hits ALL shards. P99 dominated by slowest shard. Cross-shard joins are extremely expensive
Rebalancing
Split → migrate → cutover
Online resharding requires dual-write period or change-data-capture. Vitess, Citus, CockroachDB handle this automatically. DIY is risky
Whale tenant problem
One customer = 40% of traffic
Dedicated shard for whale tenants. Or sub-shard within tenant by secondary key. Monitor per-shard QPS to detect early
Interview one-liner
"Shard by access pattern, not by data"
Pick the key that appears in every query's WHERE clause. If queries need multiple partition keys, you need secondary indexes or denormalization
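Hash vs range partitioning on sequential keys, as a sketch with made-up shard counts and range boundaries. It shows the write hotspot the cards describe:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def hash_shard(key):
    # Uniform spread, but keys 900 and 901 land on unrelated shards,
    # so a range scan must scatter-gather across all of them
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_SHARDS

def range_shard(key, boundaries=(250, 500, 750)):
    # Ordering preserved, but sequential inserts pile onto the last shard
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)

new_ids = range(900, 1000)                # today's auto-increment ids
hash_load = Counter(hash_shard(k) for k in new_ids)
range_load = Counter(range_shard(k) for k in new_ids)
# range_load puts all 100 writes on shard 3; hash_load spreads them out
```

This is the tradeoff in miniature: the same key choice that makes range queries cheap concentrates every fresh insert on one shard.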
10 — Write-Ahead Log

WAL & Crash Recovery

The WAL ensures durability: every change is written to an append-only log BEFORE modifying data pages. On crash, the database replays the WAL from the last checkpoint to recover committed transactions.

⚑ Staff signal: "Commit means WAL is fsync'd to disk. Data pages are written lazily by the background writer. Checkpoints bound recovery time but cause I/O spikes. Spread checkpoints to avoid latency cliffs."
WAL Record
Data Page (clean)
Data Page (dirty)
Checkpoint
Crash Point
Write data to see WAL in action
WAL-before-data rule
Log first, then modify pages
WAL record must be fsync'd before transaction is acknowledged as committed. This is what makes COMMIT durable
Group commit
Batch multiple txns into one fsync
Instead of one fsync per commit, batch N commits into one disk write. Dramatically improves throughput. commit_delay in Postgres
Checkpoint
Flush all dirty pages to disk
Creates a known-good recovery point. Old WAL segments before checkpoint can be recycled. checkpoint_completion_target spreads I/O
Recovery time
Proportional to WAL since last checkpoint
More frequent checkpoints = faster recovery but more I/O during normal operation. Typical: checkpoint every 5–15 min
Redo vs Undo
Postgres: redo-only (MVCC handles undo)
InnoDB uses undo logs for rollback + redo logs for recovery. Postgres skips undo because dead tuples are kept in-place (VACUUM cleans later)
Interview one-liner
"COMMIT = WAL fsync, not page write"
The actual data pages are written asynchronously by the background writer. This is why random writes become sequential WAL appends — huge perf win
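Crash recovery as a sketch: an append-only list of log records stands in for the fsync'd WAL, and `recover()` replays only transactions whose commit record survived (redo-only, in the Postgres style; a real WAL also carries LSNs and checkpoints):

```python
def recover(wal):
    # Redo pass: find txns whose COMMIT record survived the crash...
    committed = {rec["txn"] for rec in wal if rec.get("commit")}
    # ...then replay only their changes onto fresh pages
    pages = {}
    for rec in wal:
        if "key" in rec and rec["txn"] in committed:
            pages[rec["key"]] = rec["value"]
    return pages

# The durable log at crash time: txn 1 committed, txn 2 did not
wal = [
    {"txn": 1, "key": "a", "value": 10},
    {"txn": 1, "commit": True},           # COMMIT = this record hit disk
    {"txn": 2, "key": "b", "value": 20},  # never committed: must vanish
]

pages = recover(wal)                      # rebuilt pages: {"a": 10}
```

The WAL-before-data rule is what makes this sound: because every page change was logged before the page could be written, replaying the log from the last checkpoint reconstructs all committed state.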
11 — Distributed Transactions

Cross-Shard Coordination

When a transaction spans multiple shards, single-node deadlock detection and ACID guarantees break down. You need a coordination protocol — each with real tradeoffs in latency, availability, and failure modes.

⚑ Staff signal: "Avoid distributed transactions when possible — redesign the data model so hot paths stay within one shard. When you can't avoid them, choose 2PC for strong consistency or Sagas for availability. Never pretend the problem doesn't exist."
Coordinator
Shard (OK)
Committed
Aborted / Failed
Compensating Action
Choose a scenario to simulate
Two-Phase Commit (2PC)
Prepare → Commit
Coordinator asks all shards to PREPARE (lock resources). If all vote YES, send COMMIT. If any votes NO, send ABORT. Guarantees atomicity across shards
2PC failure mode
Coordinator crash = blocking
If coordinator crashes after PREPARE but before COMMIT, all participants hold locks indefinitely. This is the fundamental weakness of 2PC
Saga pattern
Forward actions + compensations
Each step has an undo. If step 3 fails, run compensate_2, compensate_1 in reverse. No global lock — higher availability, eventual consistency
Cross-shard deadlock
No global wait-for graph
Each shard only sees local waits. Shard A: T1→T2. Shard B: T2→T1. Neither detects the cycle. Fix: global lock ordering, timeouts, or a distributed deadlock detector
Best practice
Avoid cross-shard txns entirely
Model data so transactions stay within one shard. Denormalize. Use eventual consistency with idempotent events between shards. Reserve 2PC for the rare cases that truly need it
Interview one-liner
"2PC for correctness, Sagas for availability"
2PC blocks on failure but guarantees atomicity. Sagas never block but require compensating logic and tolerate intermediate inconsistency. Know when each applies
Global lock ordering
Sort shard IDs, lock in order
Same principle as single-node deadlock prevention, applied globally. Always acquire shard_1 before shard_2. Prevents cycles in the distributed wait-for graph
Timeout-based detection
lock_timeout = 3–5s across shards
If a cross-shard lock isn't acquired within timeout, abort and retry. Simple, pragmatic. Used by CockroachDB, Vitess, Citus. Better than waiting forever
Real-world usage
Google Spanner: Paxos + 2PC
Spanner uses Paxos groups per shard for fault tolerance + 2PC across groups for atomicity. TrueTime provides global ordering. Most teams use Sagas instead
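2PC's two phases in miniature: a sketch where an unhealthy shard votes NO in the prepare phase. Class names and the health flag are made up, and a real coordinator also persists its own decision to a log so it can resume after a crash:

```python
class Shard:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.prepared = self.committed = False

    def prepare(self):                    # phase 1: lock resources, vote
        self.prepared = self.healthy
        return self.prepared

    def commit(self):                     # phase 2a: make it permanent
        self.committed, self.prepared = True, False

    def abort(self):                      # phase 2b: release locks
        self.prepared = False

def two_phase_commit(shards):
    votes = [s.prepare() for s in shards]     # ask everyone first
    if all(votes):
        for s in shards:
            s.commit()                        # unanimous YES: commit all
        return "committed"
    for s in shards:
        s.abort()                             # any NO: abort everywhere
    return "aborted"

ok = two_phase_commit([Shard("s1"), Shard("s2")])
bad = two_phase_commit([Shard("s1"), Shard("s2", healthy=False)])
```

The blocking failure mode is visible in the structure: if the coordinator dies between the two loops, every shard that voted YES sits in `prepared` holding locks with no one to tell it what to do.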