Everything from storage engines to cache stampedes — the depth that separates senior from staff.
Picking the right database is the first signal in any system design. Each type has a distinct data model, consistency profile, and performance envelope.
Most systems use 2–3 database types together. A common pattern: PostgreSQL as the system of record, Redis for caching + sessions, Elasticsearch for search. Design the sync strategy and failure modes between them.
Network partitions are unavoidable in distributed systems. So the real choice is C vs A during a partition: do you return potentially stale data (AP) or refuse the request (CP)? CA systems only exist on a single machine.
| Database | CAP Choice | During Partition... | Examples |
|---|---|---|---|
| CP | Consistent | Returns error rather than stale data | HBase, Zookeeper, etcd, MongoDB (strong) |
| AP | Available | Returns potentially stale data | Cassandra, CouchDB, DynamoDB (eventual) |
| Tunable | Configurable | Depends on read/write quorum settings | Cassandra (W+R>N), DynamoDB |
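The tunable row's W+R>N rule works because any write quorum must share at least one node with any read quorum, so a read always touches a node holding the latest acked write. A minimal brute-force check of that overlap property (names are illustrative, not a real client API):

```python
# Quorum overlap: with N replicas, a write acked by W nodes and a read
# from R nodes share >=1 node whenever W + R > N, so the read sees the
# latest acked write. Exhaustively verify for small N.
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True iff every W-subset intersects every R-subset of N nodes."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )
```

For N=3, W=2, R=2 every pair of quorums overlaps (strong reads); W=1, R=1 does not (eventual).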
PostgreSQL defaults to Read Committed and implements isolation via MVCC (Multi-Version Concurrency Control) — readers never block writers. Serializable uses SSI (Serializable Snapshot Isolation), which detects dangerous dependency cycles and aborts one transaction rather than blocking — far cheaper than traditional lock-based serializability.
| Index Type | Structure | Best For | Tradeoff |
|---|---|---|---|
| B-Tree | Balanced tree, O(log n) | Equality + range queries, sorts. Default index type. | Write overhead per index |
| Hash | Hash table, O(1) | Exact equality only. Faster than B-tree for = | No range queries, no sorts |
| GiST / GIN | Generalized tree | Full-text search, arrays, geometric, JSONB | Slow builds, larger size |
| Composite | B-tree on N columns | Queries filtering on multiple columns together | Column order matters critically |
| Partial | B-tree with WHERE clause | Index a subset (e.g. WHERE status='active') | Smaller, faster, but limited use |
| Covering | Includes extra columns | Index-only scans — no heap access needed | Larger index size |
| Clustered | Data stored with index | Range scans on primary key (InnoDB default) | Only one per table |
-- Composite index: column order matters — equality-filtered columns first, range columns last
CREATE INDEX idx_orders ON orders(user_id, status, created_at);
-- Supports: WHERE user_id=? | WHERE user_id=? AND status=?
-- Ignores: WHERE status=? (leftmost prefix rule)
-- Covering index: avoids heap fetch entirely
CREATE INDEX idx_covering ON orders(user_id) INCLUDE (total, created_at);
-- Partial index: index only what you query
CREATE INDEX idx_pending ON orders(created_at) WHERE status = 'pending';
-- Expression index: on computed values
CREATE INDEX idx_email_lower ON users(LOWER(email));
-- EXPLAIN ANALYZE — always verify your index is used
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 123;
Sharding splits data across multiple nodes when a single machine can't handle the volume. It's a last resort — exhaust vertical scaling, read replicas, and caching first.
| Strategy | How | Pros | Problems |
|---|---|---|---|
| Range-based | user_id 1–1M → Shard 1, 1M–2M → Shard 2 | Simple, range queries stay local | Hotspots — new users pile onto latest shard |
| Hash-based | shard = hash(user_id) % N | Even distribution, no hotspots | Range queries hit all shards. Resharding = full data move |
| Consistent Hashing | Ring topology, virtual nodes | Minimal data movement on add/remove nodes | Complex implementation, uneven without vnodes |
| Directory-based | Lookup table: key → shard mapping | Full control, easy resharding | Lookup table = single point of failure + bottleneck |
| Geographic | EU users → EU shard, US → US shard | Data residency, low latency | Cross-region queries, uneven growth |
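The consistent-hashing row above can be sketched in a few lines — a hedged, in-memory illustration (MD5 and 100 virtual nodes per node are arbitrary choices here, not what any real system mandates). The key property: adding a node remaps only the keys that node takes over.

```python
# Consistent hashing sketch: nodes and keys hash onto a ring; a key is
# owned by the first node point clockwise from it. Virtual nodes (vnodes)
# smooth out the distribution.
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # each physical node contributes `vnodes` points on the ring
        self._points = sorted(
            (_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def owner(self, key: str) -> str:
        # first point clockwise of the key's hash (wrap around the ring)
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]
```

Adding node "d" to a 3-node ring moves roughly 1/4 of keys — and every moved key lands on "d", never shuffling between the old nodes.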
-- Problem: celebrity/viral user gets millions of requests → single shard overloaded
-- Solution 1: Shard key salting — spread one user across sub-shards
shard_key = user_id + "_" + random(1, 100) // write to 100 sub-shards
// Read: query all 100, merge results (scatter-gather)
-- Solution 2: Dedicated shard for known hot keys
IF user_id IN hot_users_list:
    route_to_dedicated_hot_shard(user_id)
-- Solution 3: Cache in front of shard
-- Absorb read spikes with Redis before hitting the shard
-- Solution 4: Time-based compound key for time-series
BAD: shard_key = device_id -- all writes go to same shard
GOOD: shard_key = (device_id, month) -- distributes over time
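Solution 1 (salting) can be made concrete with a runnable sketch — an in-memory dict stands in for the sub-shards, and the salt count of 10 is an arbitrary illustration:

```python
# Hot-key salting: writes for one hot user spread across SALT sub-shards;
# reads scatter to every sub-shard and merge (scatter-gather).
import random

SALT = 10
shards = {}  # shard_key -> list of events (stand-in for real shards)

def write(user_id: str, event: str) -> None:
    key = f"{user_id}_{random.randint(0, SALT - 1)}"  # salted shard key
    shards.setdefault(key, []).append(event)

def read_all(user_id: str) -> list:
    out = []
    for i in range(SALT):  # scatter to all sub-shards, then merge
        out.extend(shards.get(f"{user_id}_{i}", []))
    return out
```

The trade is explicit: writes scale out by SALT, but every read fans out to SALT sub-shards.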
Sharding makes JOINs, multi-row transactions, and aggregations across shards extremely expensive. You must denormalize, use scatter-gather with application-level merging, or redesign queries to stay within a single shard. This is why sharding is always a last resort.
| Mode | How | Durability | Latency | Failover |
|---|---|---|---|---|
| Synchronous | Primary waits for ≥1 replica to ack before committing | Strong | High | No data loss |
| Asynchronous | Primary commits immediately, replica lags | Eventual | Low | Possible data loss (lag) |
| Semi-sync | Wait for ≥1 replica, rest are async | Balanced | Medium | At most 1 copy lost |
| Quorum | Write accepted when W nodes ack, read from R nodes | W+R > N = strong | Configurable | Tunable trade-off |
// Problem: User posts a tweet, reads feed, doesn't see their own tweet
// Cause: read served from replica lagging 200ms behind primary
// Solution 1: Route reads to primary for 1 second after write
function read(userId, lastWriteTime):
    if now() - lastWriteTime < 1000ms:
        return primary.read()   // strong consistency window
    return replica.read()       // eventual after window
// Solution 2: Replication tokens — read from replica that has seen the write
write_token = primary.write(data) // returns LSN/position
replica.read(min_lsn=write_token) // waits for replica to catch up
// Solution 3: Sticky sessions — same user always hits same replica
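Solution 2 can be sketched end to end — tiny in-memory stand-ins for the primary and replica show how the returned LSN token gates the replica read (class and method names are illustrative, not a real driver API):

```python
# Read-your-writes via replication tokens: the primary returns an LSN per
# write; a read carries the client's last LSN and the replica serves it
# only once it has replayed past that point.
class Primary:
    def __init__(self):
        self.lsn, self.data = 0, {}

    def write(self, key, value):
        self.lsn += 1
        self.data[key] = (value, self.lsn)
        return self.lsn  # replication token handed back to the client

class Replica:
    def __init__(self, primary):
        self.primary, self.applied = primary, 0

    def replay(self, up_to):
        # stand-in for replication catching up to a given position
        self.applied = min(up_to, self.primary.lsn)

    def read(self, key, min_lsn):
        if self.applied < min_lsn:
            self.replay(min_lsn)  # block/wait until caught up
        return self.primary.data.get(key, (None, 0))[0]
```

In production the "wait until caught up" step is the cost: a lagging replica stalls the read instead of returning stale data.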
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Cache-Aside (Lazy Loading) | App checks cache → miss → load from DB → write to cache | Only caches what's requested. Resilient — cache failure doesn't break reads | Cache miss = extra DB hit. Stale data after write |
| Write-Through | App writes to cache + DB atomically on every write | Cache always fresh, no stale reads | Write latency doubles. Cache fills with unread data |
| Write-Behind (Write-Back) | Write to cache immediately, async flush to DB | Fastest writes. DB can absorb bursts | Data loss if cache crashes before flush |
| Read-Through | Cache sits in front, loads from DB automatically on miss | Simple app code | First request always slow. Needs a cache layer that supports loader callbacks |
| Refresh-Ahead | Pre-emptively refresh cache before TTL expires | No cold start on expiry | Loads data that may not be re-requested |
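Cache-aside, the most common of these, fits in a few lines — plain dicts stand in for Redis and the DB, and the function names are illustrative:

```python
# Cache-aside: the app owns the logic — check cache, fall back to DB,
# populate on miss, and *delete* (not update) the cache on write.
cache, db = {}, {"user:1": {"name": "ada"}}

def get(key):
    if key in cache:            # hit
        return cache[key]
    value = db.get(key)         # miss -> load from DB
    if value is not None:
        cache[key] = value      # populate for the next reader
    return value

def update(key, value):
    db[key] = value
    cache.pop(key, None)        # invalidate; next read repopulates
```

Deleting instead of updating on write avoids the race where two concurrent writers leave the cache holding the older value.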
allkeys-lru — evict any key LRU (general cache)
volatile-lru — evict only keys with TTL set
allkeys-lfu — evict any key LFU (hot data bias)
volatile-ttl — evict shortest TTL first
noeviction — return error when full (dangerous)
// Hard problem: "There are only two hard things in CS..."
// 1. TTL-based (simplest) — accept brief staleness
cache.set(key, value, ttl=300) // expires in 5 minutes regardless
// 2. Event-driven invalidation
on_db_write(user_id):
    cache.delete(f"user:{user_id}") // invalidate on write
// Next read will repopulate from DB
// 3. Write-through with version tag
version = db.increment(f"user_version:{id}")
cache.set(f"user:{id}:{version}", data, ttl=3600)
// 4. CDC (Change Data Capture) — Debezium reads WAL
// DB WAL → Kafka → cache invalidation consumer
// Most reliable for complex invalidation logic
// 5. Two-level cache (L1 local + L2 Redis)
L1 = process_memory_lru(max=1000, ttl=10s) // ultra-fast, small
L2 = redis_cluster(ttl=300s) // shared across instances
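Option 5 can be sketched as a small wrapper — an OrderedDict standing in for the L1 LRU and a plain dict for the shared L2 tier (TTLs omitted for brevity; names are illustrative):

```python
# Two-level cache: a tiny in-process L1 in front of a shared L2.
# L1 hits skip the network entirely; OrderedDict gives L1 LRU eviction.
from collections import OrderedDict

L2 = {}  # stand-in for the shared Redis tier

class TwoLevelCache:
    def __init__(self, l1_max=1000):
        self.l1, self.l1_max = OrderedDict(), l1_max

    def get(self, key, loader):
        if key in self.l1:
            self.l1.move_to_end(key)      # refresh LRU position
            return self.l1[key]
        value = L2.get(key)
        if value is None:
            value = loader(key)           # fall through to the DB
            L2[key] = value
        self.l1[key] = value
        if len(self.l1) > self.l1_max:
            self.l1.popitem(last=False)   # evict least recently used
        return value
```

The catch in production: L1 copies go stale independently per process, which is why L1 TTLs are kept very short (10s above vs 300s for L2).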
-- Pessimistic: lock the row immediately
BEGIN;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE; -- row locked
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT; -- lock released
-- Optimistic: version-based CAS
SELECT balance, version FROM accounts WHERE id = 42;
-- ... do processing ...
UPDATE accounts
SET balance = balance - 100, version = version + 1
WHERE id = 42 AND version = 7; -- fails if version changed
-- If 0 rows affected → conflict, retry
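The optimistic path needs a retry loop around that conditional UPDATE. A minimal in-memory sketch (a dict row stands in for the table; the function names are illustrative):

```python
# Optimistic concurrency: compare-and-swap on a version column, retried
# on conflict. cas_update models "UPDATE ... WHERE id=? AND version=?".
row = {"balance": 500, "version": 7}

def cas_update(expected_version: int, new_balance: int) -> bool:
    """True iff the row still had the expected version (rows affected = 1)."""
    if row["version"] != expected_version:
        return False  # 0 rows affected -> someone else won the race
    row.update(balance=new_balance, version=expected_version + 1)
    return True

def withdraw(amount: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        snap = dict(row)  # SELECT balance, version
        if cas_update(snap["version"], snap["balance"] - amount):
            return True   # CAS succeeded
    return False          # give up after repeated conflicts
```

Optimistic wins when conflicts are rare (retries are cheap); pessimistic wins when the same rows are hammered (retries would thrash).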
Transaction A holds lock on row 1, waits for row 2. Transaction B holds lock on row 2, waits for row 1. Both wait forever. DBs detect cycles in the wait-for graph and kill one transaction (victim selection by lowest cost).
-- Prevention rule: always acquire locks in the same global order
-- BAD: T1 locks user_A then account_B; T2 locks account_B then user_A → deadlock
-- GOOD: always lock by ascending ID
ids = sorted([user_id, account_id])
FOR next_id IN ids:
    SELECT * FROM resources WHERE id = next_id FOR UPDATE
-- Set lock timeout to avoid indefinite waits
SET lock_timeout = '2s';
-- Use SKIP LOCKED for queue-like workloads (no blocking)
SELECT * FROM jobs WHERE status = 'pending'
LIMIT 1 FOR UPDATE SKIP LOCKED;
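The global-ordering rule can be demonstrated with two threads that lock the same resources in opposite argument order — because both sort their lock IDs first, the wait-for graph can never form a cycle. threading.Lock stands in for SELECT ... FOR UPDATE:

```python
# Deadlock prevention by global lock ordering: both workers touch
# resources 1 and 2, but always acquire in ascending-ID order.
import threading

locks = {1: threading.Lock(), 2: threading.Lock()}
log = []

def transfer(a: int, b: int) -> None:
    for rid in sorted((a, b)):            # the prevention rule
        locks[rid].acquire()
    log.append((a, b))                    # critical section
    for rid in sorted((a, b), reverse=True):
        locks[rid].release()

t1 = threading.Thread(target=transfer, args=(1, 2))
t2 = threading.Thread(target=transfer, args=(2, 1))
t1.start(); t2.start()
t1.join(); t2.join()                      # both finish — no deadlock
```

Swap `sorted((a, b))` for the raw `(a, b)` order and the same two threads can deadlock exactly as the BAD comment above describes.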
Multi-Version Concurrency Control stores multiple versions of each row. Readers see a consistent snapshot of the DB at their transaction start time — they never block writers, writers never block readers.
-- Each row has hidden system columns:
-- xmin: transaction ID that inserted this version
-- xmax: transaction ID that deleted/updated (0 = live)
-- Snapshot isolation: your transaction sees rows where
-- xmin <= your_txn_id AND (xmax = 0 OR xmax > your_txn_id)
-- Consequence: old versions accumulate → VACUUM reclaims them
-- VACUUM FREEZE prevents transaction ID wraparound (32-bit overflow)
-- autovacuum_freeze_max_age = 200M transactions (monitor this!)
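The xmin/xmax snapshot rule above is just a predicate, which a few lines make concrete (real PostgreSQL also checks commit status and in-progress transactions — omitted here for brevity):

```python
# MVCC visibility: a transaction sees a row version if it was inserted
# at or before the snapshot and not yet deleted at snapshot time.
def visible(xmin: int, xmax: int, txn_id: int) -> bool:
    return xmin <= txn_id and (xmax == 0 or xmax > txn_id)

# Txn 10 updated a row: old version gets xmax=10, new version xmin=10.
old_version = (5, 10)   # (xmin, xmax)
new_version = (10, 0)
```

Transaction 8 (started before the update) sees only the old version; transaction 12 sees only the new one — each gets a consistent snapshot without blocking.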
-- Problem: global counter row gets thousands of updates/sec
-- BAD: single row counter — serialized writes, massive contention
UPDATE counters SET count = count + 1 WHERE key = 'page_views';
-- GOOD: Sharded counters — N rows, randomly pick one to increment
n = random(0, 99)
UPDATE counters SET count = count + 1
WHERE key = 'page_views' AND shard = n;
-- Read: SELECT SUM(count) FROM counters WHERE key = 'page_views'
-- BETTER: Buffer increments in Redis INCR, flush to DB in batches
redis.incr("page_views") // O(1), no DB contention
// Background job flushes Redis→DB every 10s
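The sharded-counter approach above, sketched runnably — a dict stands in for the counters table, and 10 shards is an arbitrary illustration:

```python
# Sharded counter: N rows per logical key; each increment picks a random
# shard (spreading write contention), reads SUM across all shards.
import random

N_SHARDS = 10
counters = {("page_views", s): 0 for s in range(N_SHARDS)}

def incr(key: str = "page_views") -> None:
    shard = random.randrange(N_SHARDS)   # random shard absorbs the write
    counters[(key, shard)] += 1

def total(key: str = "page_views") -> int:
    # SELECT SUM(count) ... GROUP BY key
    return sum(v for (k, _), v in counters.items() if k == key)
```

Writes now contend on 1/N_SHARDS of the traffic each; the cost is that reads must aggregate N rows instead of one.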
// For cross-service mutual exclusion — e.g. cron job should run once
// Simple single-node lock
acquired = redis.set(
    "lock:job_name",
    "owner_uuid",
    NX=True,  // only set if Not eXists
    PX=30000  // auto-expire in 30s (prevents deadlock on crash)
)
if acquired:
    try:
        do_work()
    finally:
        // Release ONLY if we still own it (Lua script for atomicity)
        redis.eval("""
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            end
        """, ["lock:job_name"], [owner_uuid])
// Redlock: acquire on N/2+1 independent Redis nodes for safety
// Clock skew and GC pauses can still cause split-brain — use with care