System Design · Staff Engineer Series

Database Design

Everything from storage engines to cache stampedes — the depth that separates senior from staff.

01

Database Types

Picking the right database is the first signal in any system design. Each type has a distinct data model, consistency profile, and performance envelope.

Relational (SQL)
PostgreSQL · MySQL · CockroachDB
Tables, foreign keys, ACID transactions. Schema-enforced. Best query flexibility via SQL joins.
Use: financial systems, user profiles, inventory, anything needing strong consistency + ad-hoc queries
Document
MongoDB · Firestore · CouchDB
JSON-like documents, flexible schema. Good for hierarchical or polymorphic data.
Use: catalogs, CMS, user-generated content, schemas that change frequently
Key-Value
Redis · DynamoDB · Memcached
Hash map at scale. O(1) reads/writes. No query language beyond key lookup.
Use: sessions, caching, rate limiting, leaderboards, feature flags
Wide-Column
Cassandra · HBase · Bigtable
Row key + dynamic columns. Tunable consistency. Designed for massive write throughput and horizontal scale.
Use: time-series events, IoT, messaging history, audit logs at 10M+ writes/sec
Graph
Neo4j · Amazon Neptune · Dgraph
Nodes + edges as first-class citizens. Native graph traversal. Terrible for non-graph queries.
Use: social networks, fraud detection, recommendation engines, knowledge graphs
Time-Series
InfluxDB · TimescaleDB · Prometheus
Optimized for append-only time-stamped data. Efficient compression, downsampling, retention policies.
Use: metrics, telemetry, stock prices, sensor data, observability pipelines
Search Engine
Elasticsearch · OpenSearch · Typesense
Inverted indexes for full-text search. Relevance scoring, faceting, fuzzy matching. Not a primary store.
Use: product search, log analytics, autocomplete — always sync from primary DB
NewSQL
CockroachDB · Spanner · YugabyteDB
SQL + ACID + horizontal scale. Distributed transactions via consensus. Higher latency than single-node SQL.
Use: global apps needing SQL semantics + multi-region writes without manual sharding
⚑ Staff Signal

Most systems use 2–3 database types together. A common pattern: PostgreSQL as the system of record, Redis for caching + sessions, Elasticsearch for search. Design the sync strategy and failure modes between them.

02

CAP Theorem, ACID vs BASE

C
Consistency
Every read returns the most recent write or an error. All nodes see the same data simultaneously.
A
Availability
Every request receives a response (not necessarily the latest data). System always stays operational.
P
Partition Tolerance
System continues to operate even when network partitions split nodes. Always required in practice.
Critical Nuance

Network partitions are unavoidable in distributed systems. So the real choice is C vs A during a partition: do you return stale data (AP) or refuse to answer (CP)? CA systems only exist on a single machine.

Choice | Prioritizes | During Partition… | Examples
CP | Consistency | Returns an error rather than stale data | HBase, ZooKeeper, etcd, MongoDB (strong)
AP | Availability | Returns potentially stale data | Cassandra, CouchDB, DynamoDB (eventual)
Tunable | Configurable | Depends on read/write quorum settings | Cassandra (W+R>N), DynamoDB
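The quorum row comes down to a single inequality: if write and read quorums overlap in at least one node, every read touches a node that saw the latest write. A minimal sketch (function name is illustrative):

```python
def quorum_is_strong(n: int, w: int, r: int) -> bool:
    # Any write quorum of size w and read quorum of size r must share
    # at least one of the n replicas: w + r > n guarantees overlap.
    return w + r > n

# Cassandra-style QUORUM/QUORUM on a 3-replica set: overlap guaranteed
assert quorum_is_strong(n=3, w=2, r=2)
# W=1, R=1 favors latency: a read may land on a replica the write skipped
assert not quorum_is_strong(n=3, w=1, r=1)
```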

ACID vs BASE

ACID (Relational DBs)

  • Atomic — all or nothing. Transaction fully commits or fully rolls back
  • Consistent — DB moves from one valid state to another. Constraints enforced
  • Isolated — concurrent transactions don't interfere. See isolation levels
  • Durable — committed data survives crashes (WAL / fsync)

BASE (NoSQL / Distributed)

  • Basically Available — system stays responsive even during failures
  • Soft state — state may change over time even without new input
  • Eventually consistent — all replicas will converge given no new updates
  • Trade correctness for availability and partition tolerance

Transaction Isolation Levels

Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance
Read Uncommitted | Possible | Possible | Possible | Fastest
Read Committed | Safe | Possible | Possible | Fast
Repeatable Read | Safe | Safe | Possible | Moderate
Serializable | Safe | Safe | Safe | Slowest
PostgreSQL Default

PostgreSQL defaults to Read Committed and implements isolation via MVCC (Multi-Version Concurrency Control) — readers never block writers. Serializable is implemented as SSI (Serializable Snapshot Isolation), which carries far less overhead than traditional lock-based serializability.

03

Indexing & Query Optimization

Index Type | Structure | Best For | Tradeoff
B-Tree | Balanced tree, O(log n) | Equality + range queries, sorts. Default index type | Write overhead per index
Hash | Hash table, O(1) | Exact equality only. Faster than B-tree for = | No range queries, no sorts
GiST / GIN | Generalized tree | Full-text search, arrays, geometric, JSONB | Slow builds, larger size
Composite | B-tree on N columns | Queries filtering on multiple columns together | Column order matters critically
Partial | B-tree with WHERE clause | Index a subset (e.g. WHERE status='active') | Smaller, faster, but limited use
Covering | Includes extra columns | Index-only scans — no heap access needed | Larger index size
Clustered | Data stored with index | Range scans on primary key (InnoDB default) | Only one per table
-- Composite index: column order = selectivity order (most selective first)
CREATE INDEX idx_orders ON orders(user_id, status, created_at);
-- Supports: WHERE user_id=? | WHERE user_id=? AND status=?
-- Ignores:  WHERE status=? (leftmost prefix rule)

-- Covering index: avoids heap fetch entirely
CREATE INDEX idx_covering ON orders(user_id) INCLUDE (total, created_at);

-- Partial index: index only what you query
CREATE INDEX idx_pending ON orders(created_at) WHERE status = 'pending';

-- Expression index: on computed values
CREATE INDEX idx_email_lower ON users(LOWER(email));

-- EXPLAIN ANALYZE — always verify your index is used
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 123;

Index Kills Performance When...

  • High write tables — each index = extra write
  • Low cardinality columns (boolean, status with 3 values)
  • Leading wildcard LIKE '%foo' — index skipped
  • Function on indexed column: WHERE LOWER(email)=... without expression index
  • Implicit type cast: WHERE int_col = '123' (string)

N+1 & Slow Query Patterns

  • N+1: fetching 1 user then N queries for their orders — use JOINs or batch loading
  • SELECT * — fetches unused columns, breaks covering indexes
  • Missing index on FK — parent-child join scans entire child table
  • OFFSET pagination — scans all preceding rows
  • Unbounded queries — always enforce LIMIT
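The N+1 fix can be sketched with an in-memory stand-in for the database client. FakeDB and its method names are hypothetical; the real fix is a JOIN or a batched SELECT … IN:

```python
from collections import defaultdict

class FakeDB:
    """Stand-in for a SQL client; counts round trips."""
    def __init__(self, orders):
        self.orders = orders           # user_id -> list of order rows
        self.queries = 0

    def query_orders(self, uid):       # SELECT * FROM orders WHERE user_id = ?
        self.queries += 1
        return self.orders.get(uid, [])

    def query_orders_in(self, uids):   # SELECT * FROM orders WHERE user_id IN (...)
        self.queries += 1
        return [row for uid in uids for row in self.orders.get(uid, [])]

def load_n_plus_1(db, user_ids):
    # Anti-pattern: one round trip per user (1 + N queries overall).
    return {uid: db.query_orders(uid) for uid in user_ids}

def load_batched(db, user_ids):
    # Fix: one IN (...) query, grouped back per user in the application.
    grouped = defaultdict(list)
    for row in db.query_orders_in(user_ids):
        grouped[row["user_id"]].append(row)
    return {uid: grouped.get(uid, []) for uid in user_ids}
```

With 1,000 users, the first version issues 1,000 queries; the second issues one.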
04

Sharding — Strategies & Problems

Sharding splits data across multiple nodes when a single machine can't handle the volume. It's a last resort — exhaust vertical scaling, read replicas, and caching first.

Strategy | How | Pros | Problems
Range-based | user_id 1–1M → Shard 1, 1M–2M → Shard 2 | Simple, range queries stay local | Hotspots — new users pile onto latest shard
Hash-based | shard = hash(user_id) % N | Even distribution, no hotspots | Range queries hit all shards. Resharding = full data move
Consistent Hashing | Ring topology, virtual nodes | Minimal data movement on add/remove nodes | Complex implementation, uneven without vnodes
Directory-based | Lookup table: key → shard mapping | Full control, easy resharding | Lookup table = single point of failure + bottleneck
Geographic | EU users → EU shard, US → US shard | Data residency, low latency | Cross-region queries, uneven growth
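To see why `hash % N` resharding means a near-full data move, here is a quick experiment (md5 is used only as a stable hash): growing from 4 to 5 shards relocates roughly 80% of keys.

```python
import hashlib

def shard_of(key: str, n: int) -> int:
    # md5 as a stable hash — Python's built-in hash() is salted per process
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if shard_of(k, 4) != shard_of(k, 5))
moved_fraction = moved / len(keys)   # ~0.8: most data must migrate
```

This is the motivation for consistent hashing, which moves only ~1/N of the keys instead.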

Consistent Hashing — Virtual Nodes

• Node A owns 0–63, plus vnodes at 96–127 and 192–255
• Node B owns 64–127, plus vnodes at 0–31 and 160–191
• Node C owns 128–191, plus vnodes at 32–63 and 224–255
• Node D (new) takes ~25% of the ring from each existing node; only that data migrates
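A minimal ring with virtual nodes, assuming md5 as the hash (class and method names are illustrative). Adding a node moves only the keys that fall into its new arcs; everything else stays put:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    # Stable hash for ring positions and keys
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Sketch of a hash ring with virtual nodes, not production code."""
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.ring = []                         # sorted (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for v in range(self.vnodes):
            bisect.insort(self.ring, (_h(f"{node}#vn{v}"), node))

    def get_node(self, key):
        # Walk clockwise to the first vnode at or after the key's hash
        i = bisect.bisect_left(self.ring, (_h(key), ""))
        return self.ring[i % len(self.ring)][1]
```

In a quick simulation with nodes A, B, C and 2,000 keys, adding node D reassigns roughly a quarter of the keys, and every reassigned key lands on D.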

Hotspot Problem & Solutions

-- Problem: celebrity/viral user gets millions of requests → single shard overloaded

-- Solution 1: Shard key salting — spread one user across sub-shards
shard_key = user_id + "_" + random(1, 100)    // write to 100 sub-shards
// Read: query all 100, merge results (scatter-gather)

-- Solution 2: Dedicated shard for known hot keys
IF user_id IN hot_users_list:
    route_to_dedicated_hot_shard(user_id)

-- Solution 3: Cache in front of shard
-- Absorb read spikes with Redis before hitting the shard

-- Solution 4: Time-based compound key for time-series
BAD:  shard_key = device_id          -- all writes go to same shard
GOOD: shard_key = (device_id, month) -- distributes over time
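Solution 1 (salting) in runnable form, using a plain dict as a stand-in for the sharded datastore; SUB_SHARDS and the function names are assumptions:

```python
import random

SUB_SHARDS = 8            # fan-out factor for one hot key (assumption)
store = {}                # stand-in for the sharded datastore

def write_event(user_id, event):
    # Salt the shard key: the hot user's writes spread over SUB_SHARDS keys
    salted = f"{user_id}_{random.randrange(SUB_SHARDS)}"
    store.setdefault(salted, []).append(event)

def read_events(user_id):
    # Scatter-gather: query every sub-key, merge results in the application
    merged = []
    for i in range(SUB_SHARDS):
        merged.extend(store.get(f"{user_id}_{i}", []))
    return merged
```

The trade-off is visible in the read path: one logical read becomes SUB_SHARDS physical reads, which is why salting is reserved for genuinely hot keys.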

Cross-Shard Operations — The Hard Part

The Sharding Tax

Sharding makes JOINs, multi-row transactions, and aggregations across shards extremely expensive. You must denormalize, use scatter-gather with application-level merging, or redesign queries to stay within a single shard. This is why sharding is always a last resort.

Problems Sharding Creates

  • Cross-shard JOINs — must be done in application layer
  • Distributed transactions — 2PC with coordinator overhead
  • Rebalancing — expensive data movement when adding shards
  • Global uniqueness — can't rely on auto-increment IDs
  • Schema migrations — must run on all shards

Solutions

  • Snowflake IDs / ULIDs for globally unique IDs
  • Denormalize frequently joined data into same shard
  • Saga pattern instead of distributed transactions
  • Consistent hashing to minimize rebalancing
  • Blue-green migrations per shard
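A sketch of a Snowflake-style ID generator. The bit layout follows Twitter's original 41/10/12 split; the custom epoch value is an assumption:

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000          # custom epoch, Sep 2020 (assumption)

class SnowflakeGen:
    """64-bit ID: 41 bits ms-timestamp | 10 bits machine | 12 bits sequence."""
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024
        self.machine_id = machine_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF    # 4096 IDs per ms per machine
                if self.seq == 0:                    # exhausted: spin to next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.seq
```

IDs are unique without coordination and sort roughly by creation time, which is exactly what auto-increment can no longer give you across shards.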
05

Replication & Consistency

Mode | How | Durability | Latency | Failover
Synchronous | Primary waits for ≥1 replica to ack before committing | Strong | High | No data loss
Asynchronous | Primary commits immediately, replica lags | Eventual | Low | Possible data loss (lag)
Semi-sync | Wait for ≥1 replica, rest are async | Balanced | Medium | At most 1 copy lost
Quorum | Write accepted when W nodes ack, read from R nodes | W+R > N = strong | Configurable | Tunable trade-off

Replication Topologies

Primary–Replica
One primary handles all writes. Replicas serve reads. Simple failover by promoting a replica. Replication lag can cause stale reads.

Use: Most OLTP workloads.
Multi-Primary
Multiple nodes accept writes. Conflicts resolved via LWW (Last-Write-Wins) or CRDTs. Geographic distribution.

Use: Multi-region writes, offline-first apps. Conflicts are hard.
Consensus (Raft/Paxos)
Leader elected, followers must ack majority. Strongly consistent. CockroachDB, etcd, Spanner.

Use: Coordination services, strongly consistent distributed DBs.
// Problem: User posts a tweet, reads feed, doesn't see their own tweet
// Cause: read served from replica lagging 200ms behind primary

// Solution 1: Route reads to primary for 1 second after write
function read(userId, lastWriteTime):
  if now() - lastWriteTime < 1000ms:
    return primary.read()   // strong consistency window
  return replica.read()     // eventual after window

// Solution 2: Replication tokens — read from replica that has seen the write
write_token = primary.write(data)           // returns LSN/position
replica.read(min_lsn=write_token)           // waits for replica to catch up

// Solution 3: Sticky sessions — same user always hits same replica
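Solution 1 as runnable code, using dicts as stand-ins for primary and replica; the class name and window constant are illustrative:

```python
import time

WINDOW_S = 1.0   # read-your-writes window after a user's own write (assumption)

class ReplicaRouter:
    """Route a user's reads to the primary briefly after their own writes."""
    def __init__(self, primary: dict, replica: dict):
        self.primary, self.replica = primary, replica
        self.last_write = {}                     # user_id -> monotonic timestamp

    def write(self, user_id, key, value):
        self.primary[key] = value                # async replication not modeled here
        self.last_write[user_id] = time.monotonic()

    def read(self, user_id, key):
        recent = time.monotonic() - self.last_write.get(user_id, float("-inf"))
        if recent < WINDOW_S:
            return self.primary.get(key)         # strong-consistency window
        return self.replica.get(key)             # may be stale
```

Note the guarantee is per-user: the author always sees their own tweet, while other users may briefly read the lagging replica.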
06

Caching — Patterns & Pitfalls

Cache Placement Strategies

Strategy | How | Pros | Cons
Cache-Aside (Lazy Loading) | App checks cache → miss → load from DB → write to cache | Only caches what's requested. Resilient — cache failure doesn't break reads | Cache miss = extra DB hit. Stale data after write
Write-Through | App writes to cache + DB synchronously on every write | Cache always fresh, no stale reads | Write latency doubles. Cache fills with unread data
Write-Behind (Write-Back) | Write to cache immediately, async flush to DB | Fastest writes. DB can absorb bursts | Data loss if cache crashes before flush
Read-Through | Cache sits in front, loads from DB automatically on miss | Simple app code | First request always slow. Requires cache-provider support
Refresh-Ahead | Pre-emptively refresh cache before TTL expires | No cold start on expiry | Loads data that may not be re-requested

Cache Eviction Policies

Eviction Algorithms

  • LRU — evict least recently used. Redis implements an approximated LRU. Good for temporal locality
  • LFU — evict least frequently used. Better for skewed access patterns
  • ARC — adaptive, balances recency + frequency. Used by ZFS; patent-encumbered
  • FIFO — simple, unfair to popular items. Rarely optimal
  • TTL-only — time-based expiry regardless of access patterns

Redis Eviction Modes

  • allkeys-lru — evict any key LRU (general cache)
  • volatile-lru — evict only keys with TTL set
  • allkeys-lfu — evict any key LFU (hot data bias)
  • volatile-ttl — evict shortest TTL first
  • noeviction — return error when full (dangerous)

The Three Cache Failure Modes

Cache Stampede
Problem: Popular key expires. 10,000 concurrent requests miss cache simultaneously, all hit DB.

Solutions:
• Lock/mutex: only one request populates cache, others wait
• Probabilistic early recomputation (XFetch)
• Background refresh before TTL expires
• Stale-while-revalidate: serve stale while refreshing async
Cache Avalanche
Problem: Many keys expire at the same time (e.g. after a restart). Massive DB load spike.

Solutions:
• Jitter on TTL: TTL = base + random(0, 10%)
• Stagger cache warmup on restart
• Circuit breaker to protect DB
• Persistent cache (Redis RDB/AOF)
Cache Penetration
Problem: Requests for non-existent keys (bad IDs) always miss cache, always hit DB.

Solutions:
• Cache null results with short TTL (e.g., 30s)
• Bloom filter at edge: reject unknown IDs before cache
• Input validation / rate limit by IP
• Negative caching with sentinel values
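The stampede and penetration defenses combine naturally in a cache-aside wrapper: a per-key mutex ensures only one loader runs per miss, and a sentinel caches "key does not exist." A sketch with illustrative names; real code would add TTLs to both positive and null entries:

```python
import threading

class Cache:
    """Cache-aside with per-key mutex (anti-stampede) + null caching (anti-penetration)."""
    MISSING = object()                 # sentinel: "key absent in DB"

    def __init__(self, loader):
        self.loader = loader           # callable: key -> value or None
        self.data = {}
        self.locks = {}
        self.meta_lock = threading.Lock()
        self.db_calls = 0

    def _lock_for(self, key):
        with self.meta_lock:
            return self.locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key not in self.data:
            with self._lock_for(key):          # only one thread populates
                if key not in self.data:       # re-check after acquiring lock
                    self.db_calls += 1
                    value = self.loader(key)
                    self.data[key] = value if value is not None else Cache.MISSING
        v = self.data[key]
        return None if v is Cache.MISSING else v
```

Concurrent misses on the same key collapse into a single DB call; repeated requests for a bad ID hit the DB once, then serve the cached null.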
// Hard problem: "There are only two hard things in CS..."

// 1. TTL-based (simplest) — accept brief staleness
cache.set(key, value, ttl=300)   // expires in 5 minutes regardless

// 2. Event-driven invalidation
on_db_write(user_id):
  cache.delete(f"user:{user_id}")    // invalidate on write
  // Next read will repopulate from DB

// 3. Write-through with version tag
version = db.increment(f"user_version:{id}")
cache.set(f"user:{id}:{version}", data, ttl=3600)

// 4. CDC (Change Data Capture) — Debezium reads WAL
// DB WAL → Kafka → cache invalidation consumer
// Most reliable for complex invalidation logic

// 5. Two-level cache (L1 local + L2 Redis)
L1 = process_memory_lru(max=1000, ttl=10s)   // ultra-fast, small
L2 = redis_cluster(ttl=300s)                // shared across instances
07

Contention, Locks & Transactions

Locking Strategies

Pessimistic Locking

  • Lock the row before reading it (SELECT FOR UPDATE)
  • Blocks other transactions until released
  • Safe for high-conflict scenarios
  • Risk: deadlocks if locks acquired in different order
  • Use: bank transfers, inventory decrement, seat booking

Optimistic Locking

  • Read row with version number, write only if version unchanged
  • No locks held during processing — high concurrency
  • Retry on version conflict (CAS — compare-and-swap)
  • Risk: high retry rate under heavy contention
  • Use: low-conflict reads, user profile updates, config changes
-- Pessimistic: lock the row immediately
BEGIN;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;  -- row locked
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;  -- lock released

-- Optimistic: version-based CAS
SELECT balance, version FROM accounts WHERE id = 42;
-- ... do processing ...
UPDATE accounts
  SET balance = balance - 100, version = version + 1
  WHERE id = 42 AND version = 7;  -- fails if version changed
-- If 0 rows affected → conflict, retry
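The optimistic path is normally wrapped in a retry loop. An in-memory sketch, with `cas_update` standing in for the versioned UPDATE above (names are illustrative):

```python
class Account:
    def __init__(self, balance):
        self.balance, self.version = balance, 0

def cas_update(acct, expected_version, new_balance):
    # Stand-in for: UPDATE accounts SET balance=?, version=version+1
    #               WHERE id=? AND version=?  -> returns rows affected
    if acct.version != expected_version:
        return 0
    acct.balance, acct.version = new_balance, acct.version + 1
    return 1

def withdraw(acct, amount, max_retries=5):
    for _ in range(max_retries):
        bal, ver = acct.balance, acct.version       # SELECT balance, version
        if bal < amount:
            raise ValueError("insufficient funds")
        if cas_update(acct, ver, bal - amount):     # 0 rows -> conflict, retry
            return acct.balance
    raise RuntimeError("gave up after repeated version conflicts")
```

The retry cap matters: under heavy contention the loop can spin, which is exactly when pessimistic locking becomes the better choice.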

Deadlocks — Detection & Prevention

Deadlock

Transaction A holds lock on row 1, waits for row 2. Transaction B holds lock on row 2, waits for row 1. Both wait forever. DBs detect cycles in the wait-for graph and kill one transaction (victim selection by lowest cost).

-- Prevention rule: always acquire locks in the same global order
-- BAD: T1 locks user_A then account_B; T2 locks account_B then user_A → deadlock

-- GOOD: always lock by ascending ID
ids = sorted([user_id, account_id])
FOR each_id IN ids:
  SELECT * FROM resources WHERE id = :each_id FOR UPDATE

-- Set lock timeout to avoid indefinite waits
SET lock_timeout = '2s';

-- Use SKIP LOCKED for queue-like workloads (no blocking)
SELECT * FROM jobs WHERE status = 'pending'
LIMIT 1 FOR UPDATE SKIP LOCKED;

MVCC — How Postgres Avoids Read-Write Contention

Multi-Version Concurrency Control stores multiple versions of each row. Readers see a consistent snapshot of the DB at their transaction start time — they never block writers, writers never block readers.

-- Each row has hidden system columns:
--   xmin: transaction ID that inserted this version
--   xmax: transaction ID that deleted/updated (0 = live)

-- Snapshot isolation (simplified): your transaction sees row versions where
--   xmin <= your_txn_id AND (xmax = 0 OR xmax > your_txn_id)
-- (real visibility also checks commit status and in-progress transaction lists)

-- Consequence: old versions accumulate → VACUUM reclaims them
-- VACUUM FREEZE prevents transaction ID wraparound (32-bit overflow)
-- autovacuum_freeze_max_age = 200M transactions (monitor this!)
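The simplified visibility rule from the comments, written as a function. Real Postgres also consults commit status and the snapshot's in-progress transaction list, so treat this as the intuition only:

```python
def visible(xmin: int, xmax: int, snapshot_txid: int) -> bool:
    """Simplified MVCC visibility: the row version was inserted at or before
    our snapshot and not yet deleted/updated as of it (xmax == 0 means live)."""
    inserted_before = xmin <= snapshot_txid
    not_deleted = xmax == 0 or xmax > snapshot_txid
    return inserted_before and not_deleted
```

For a transaction with snapshot id 10: a row inserted by txn 5 and still live is visible; a row inserted by txn 12 is not; a row deleted by txn 8 is not; a row deleted by the later txn 15 is still visible to us.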

Write Amplification & Hot Row Contention

-- Problem: global counter row gets thousands of updates/sec
-- BAD: single row counter — serialized writes, massive contention
UPDATE counters SET count = count + 1 WHERE key = 'page_views';

-- GOOD: Sharded counters — N rows, randomly pick one to increment
shard = random(0, 99)
UPDATE counters SET count = count + 1
  WHERE key = 'page_views' AND shard = shard;
-- Read: SELECT SUM(count) FROM counters WHERE key = 'page_views'

-- BETTER: Buffer increments in Redis INCR, flush to DB in batches
redis.incr("page_views")   // O(1), no DB contention
// Background job flushes Redis→DB every 10s
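The sharded-counter idea in miniature, with a list standing in for the N counter rows (N_SHARDS is an assumption):

```python
import random

N_SHARDS = 16                      # number of counter rows (assumption)

class ShardedCounter:
    """N rows instead of one hot row: writes hit a random shard, reads sum all."""
    def __init__(self):
        self.shards = [0] * N_SHARDS

    def incr(self, delta=1):
        # UPDATE counters SET count = count + ? WHERE key = ? AND shard = ?
        self.shards[random.randrange(N_SHARDS)] += delta

    def value(self):
        # SELECT SUM(count) FROM counters WHERE key = ?
        return sum(self.shards)
```

Write contention drops by roughly N_SHARDS at the cost of a slightly heavier read, which suits counters that are written far more often than displayed.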

Distributed Locking

// For cross-service mutual exclusion — e.g. cron job should run once

// Simple single-node lock
acquired = redis.set(
  "lock:job_name",
  "owner_uuid",
  NX=True,          // only set if Not eXists
  PX=30000          // auto-expire in 30s (prevents deadlock on crash)
)

if acquired:
  try:
    do_work()
  finally:
    // Release ONLY if we still own it (Lua script for atomicity)
    redis.eval("""
      if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
      end
    """, ["lock:job_name"], [owner_uuid])

// Redlock: acquire on N/2+1 independent Redis nodes for safety
// Clock skew and GC pauses can still cause split-brain — use with care
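The same pattern with an in-memory stand-in for the two Redis operations it needs (SET with NX/PX, and the Lua compare-and-delete); FakeRedis and its method names are purely illustrative:

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for SET NX PX and the Lua compare-and-delete."""
    def __init__(self):
        self.store = {}                          # key -> (owner, expires_at)

    def set_nx_px(self, key, owner, px_ms):
        now = time.monotonic()
        cur = self.store.get(key)
        if cur and cur[1] > now:                 # live lock held by someone
            return False
        self.store[key] = (owner, now + px_ms / 1000)
        return True

    def release_if_owner(self, key, owner):
        cur = self.store.get(key)
        if cur and cur[0] == owner:              # only the owner may release
            del self.store[key]
            return True
        return False

# Usage mirrors the Redis snippet above
r = FakeRedis()
me = str(uuid.uuid4())
if r.set_nx_px("lock:nightly_job", me, px_ms=30_000):
    try:
        pass                                     # do_work()
    finally:
        r.release_if_owner("lock:nightly_job", me)
```

The compare-on-release step is the part people skip: without it, a worker whose lock expired mid-job can delete a lock now owned by someone else.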
08

Staff Engineer Playbook

Ask Before Designing
Clarify read/write ratio, peak QPS, consistency requirements, and acceptable latency. A 99:1 read-write ratio calls for heavy caching; a 50:50 ratio needs an entirely different indexing strategy.
Scale in Order
Vertical scale → Query optimization → Caching → Read replicas → Sharding. Most interviewers want to hear you exhaust simpler options before jumping to sharding.
Name the Failure Mode
For every design decision, name what breaks and how you detect + recover. Replica lag → stale reads. Cache miss storm → circuit breaker. Shard failure → replica promotion.
Polyglot Persistence
Real systems use multiple DB types. Propose how they sync: CDC/Kafka for DB→search sync, Redis as write-through cache, object store for BLOBs. Design the data flow, not just the storage.
Operations Matter
Mention schema migrations (expand-contract pattern), VACUUM for Postgres, index bloat monitoring, slow query logs, and connection pool sizing. Staff engineers think about Day 2.
Mention Observability
Cache hit rate, replication lag, p99 query latency, deadlock frequency, connection pool saturation, IOPS. If you can't measure it, you can't operate it.
◈ Quick Reference — Staff Cheat Sheet
CAP during partition — Choose CP (error) or AP (stale). CA = single machine only
Postgres default isolation — Read Committed. MVCC: readers never block writers
Shard key rule — High cardinality + even distribution + avoids cross-shard ops
Cache stampede fix — Mutex lock OR probabilistic early refresh (XFetch)
Cache avalanche fix — Jitter TTL: base_ttl + random(0, 10%)
Cache penetration fix — Cache null + short TTL, or Bloom filter at edge
Deadlock prevention — Always acquire locks in same global order (sorted IDs)
Optimistic vs pessimistic — Optimistic = low conflict. Pessimistic = high conflict (finance)
Hot row counter fix — Shard counter into N rows OR buffer in Redis INCR
Distributed lock TTL — Always set expiry. Release with Lua CAS to avoid stealing
Replication lag issue — Read-your-writes: route to primary for N seconds after write
Composite index order — Most selective first. Supports leftmost prefix queries only
N+1 query fix — Eager load with JOIN, or batch fetch (SELECT … IN)
Schema migration pattern — Expand (add new) → deploy → contract (remove old)
Global unique IDs — Snowflake ID (Twitter) or ULID — sortable + distributed
Cassandra quorum — W + R > N for strong consistency. W=QUORUM, R=QUORUM