Everything from storage engines to cache stampedes — the depth that separates senior from staff.
Picking the right database is the first signal in any system design. Each type has a distinct data model, consistency profile, and performance envelope.
Most systems use 2–3 database types together. A common pattern: PostgreSQL as the system of record, Redis for caching + sessions, Elasticsearch for search. Design the sync strategy and failure modes between them.
Network partitions are unavoidable in distributed systems. So the real choice is C vs A during a partition: do you return potentially stale data (AP) or refuse the request (CP)? CA systems only exist on a single machine.
| Database | CAP Choice | During Partition... | Examples |
|---|---|---|---|
| CP | Consistent | Returns error rather than stale data | HBase, Zookeeper, etcd, MongoDB (strong) |
| AP | Available | Returns potentially stale data | Cassandra, CouchDB, DynamoDB (eventual) |
| Tunable | Configurable | Depends on read/write quorum settings | Cassandra (W+R>N), DynamoDB |
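The tunable row's W+R>N rule works because any write quorum must share at least one node with any read quorum, so a read always touches a node holding the latest acked write. A minimal brute-force check of that overlap property (names are illustrative, not a real client API):

```python
# Quorum overlap: with N replicas, a write acked by W nodes and a read
# from R nodes share >=1 node whenever W + R > N, so the read sees the
# latest acked write. Exhaustively verify for small N.
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True iff every W-subset intersects every R-subset of N nodes."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )
```

For N=3, W=2, R=2 every pair of quorums overlaps (strong reads); W=1, R=1 does not (eventual).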
PostgreSQL defaults to Read Committed and implements isolation via MVCC (Multi-Version Concurrency Control) — readers never block writers. Serializable uses SSI (Serializable Snapshot Isolation), which detects dangerous dependency cycles and aborts one transaction rather than blocking — far cheaper than traditional lock-based serializability.
| Index Type | Structure | Best For | Tradeoff |
|---|---|---|---|
| B-Tree | Balanced tree, O(log n) | Equality + range queries, sorts. Default index type. | Write overhead per index |
| Hash | Hash table, O(1) | Exact equality only. Faster than B-tree for = | No range queries, no sorts |
| GiST / GIN | Generalized tree | Full-text search, arrays, geometric, JSONB | Slow builds, larger size |
| Composite | B-tree on N columns | Queries filtering on multiple columns together | Column order matters critically |
| Partial | B-tree with WHERE clause | Index a subset (e.g. WHERE status='active') | Smaller, faster, but limited use |
| Covering | Includes extra columns | Index-only scans — no heap access needed | Larger index size |
| Clustered | Data stored with index | Range scans on primary key (InnoDB default) | Only one per table |
-- Composite index: column order matters — equality-filtered columns first, range columns last
CREATE INDEX idx_orders ON orders(user_id, status, created_at);
-- Supports: WHERE user_id=? | WHERE user_id=? AND status=?
-- Ignores: WHERE status=? (leftmost prefix rule)
-- Covering index: avoids heap fetch entirely
CREATE INDEX idx_covering ON orders(user_id) INCLUDE (total, created_at);
-- Partial index: index only what you query
CREATE INDEX idx_pending ON orders(created_at) WHERE status = 'pending';
-- Expression index: on computed values
CREATE INDEX idx_email_lower ON users(LOWER(email));
-- EXPLAIN ANALYZE — always verify your index is used
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 123;
Sharding splits data across multiple nodes when a single machine can't handle the volume. It's a last resort — exhaust vertical scaling, read replicas, and caching first.
| Strategy | How | Pros | Problems |
|---|---|---|---|
| Range-based | user_id 1–1M → Shard 1, 1M–2M → Shard 2 | Simple, range queries stay local | Hotspots — new users pile onto latest shard |
| Hash-based | shard = hash(user_id) % N | Even distribution, no hotspots | Range queries hit all shards. Resharding = full data move |
| Consistent Hashing | Ring topology, virtual nodes | Minimal data movement on add/remove nodes | Complex implementation, uneven without vnodes |
| Directory-based | Lookup table: key → shard mapping | Full control, easy resharding | Lookup table = single point of failure + bottleneck |
| Geographic | EU users → EU shard, US → US shard | Data residency, low latency | Cross-region queries, uneven growth |
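The consistent-hashing row above can be sketched in a few lines — a hedged, in-memory illustration (MD5 and 100 virtual nodes per node are arbitrary choices here, not what any real system mandates). The key property: adding a node remaps only the keys that node takes over.

```python
# Consistent hashing sketch: nodes and keys hash onto a ring; a key is
# owned by the first node point clockwise from it. Virtual nodes (vnodes)
# smooth out the distribution.
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # each physical node contributes `vnodes` points on the ring
        self._points = sorted(
            (_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def owner(self, key: str) -> str:
        # first point clockwise of the key's hash (wrap around the ring)
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]
```

Adding node "d" to a 3-node ring moves roughly 1/4 of keys — and every moved key lands on "d", never shuffling between the old nodes.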
-- Problem: celebrity/viral user gets millions of requests → single shard overloaded
-- Solution 1: Shard key salting — spread one user across sub-shards
shard_key = user_id + "_" + random(1, 100) // write to 100 sub-shards
// Read: query all 100, merge results (scatter-gather)
-- Solution 2: Dedicated shard for known hot keys
IF user_id IN hot_users_list:
    route_to_dedicated_hot_shard(user_id)
-- Solution 3: Cache in front of shard
-- Absorb read spikes with Redis before hitting the shard
-- Solution 4: Time-based compound key for time-series
BAD: shard_key = device_id -- all writes go to same shard
GOOD: shard_key = (device_id, month) -- distributes over time
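Solution 1 (salting) can be made concrete with a runnable sketch — an in-memory dict stands in for the sub-shards, and the salt count of 10 is an arbitrary illustration:

```python
# Hot-key salting: writes for one hot user spread across SALT sub-shards;
# reads scatter to every sub-shard and merge (scatter-gather).
import random

SALT = 10
shards = {}  # shard_key -> list of events (stand-in for real shards)

def write(user_id: str, event: str) -> None:
    key = f"{user_id}_{random.randint(0, SALT - 1)}"  # salted shard key
    shards.setdefault(key, []).append(event)

def read_all(user_id: str) -> list:
    out = []
    for i in range(SALT):  # scatter to all sub-shards, then merge
        out.extend(shards.get(f"{user_id}_{i}", []))
    return out
```

The trade is explicit: writes scale out by SALT, but every read fans out to SALT sub-shards.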
Sharding makes JOINs, multi-row transactions, and aggregations across shards extremely expensive. You must denormalize, use scatter-gather with application-level merging, or redesign queries to stay within a single shard. This is why sharding is always a last resort.
| Mode | How | Durability | Latency | Failover |
|---|---|---|---|---|
| Synchronous | Primary waits for ≥1 replica to ack before committing | Strong | High | No data loss |
| Asynchronous | Primary commits immediately, replica lags | Eventual | Low | Possible data loss (lag) |
| Semi-sync | Wait for ≥1 replica, rest are async | Balanced | Medium | At most 1 copy lost |
| Quorum | Write accepted when W nodes ack, read from R nodes | W+R > N = strong | Configurable | Tunable trade-off |
// Problem: User posts a tweet, reads feed, doesn't see their own tweet
// Cause: read served from replica lagging 200ms behind primary
// Solution 1: Route reads to primary for 1 second after write
function read(userId, lastWriteTime):
    if now() - lastWriteTime < 1000ms:
        return primary.read()   // strong consistency window
    return replica.read()       // eventual after window
// Solution 2: Replication tokens — read from replica that has seen the write
write_token = primary.write(data) // returns LSN/position
replica.read(min_lsn=write_token) // waits for replica to catch up
// Solution 3: Sticky sessions — same user always hits same replica
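Solution 2 can be sketched end to end — tiny in-memory stand-ins for the primary and replica show how the returned LSN token gates the replica read (class and method names are illustrative, not a real driver API):

```python
# Read-your-writes via replication tokens: the primary returns an LSN per
# write; a read carries the client's last LSN and the replica serves it
# only once it has replayed past that point.
class Primary:
    def __init__(self):
        self.lsn, self.data = 0, {}

    def write(self, key, value):
        self.lsn += 1
        self.data[key] = (value, self.lsn)
        return self.lsn  # replication token handed back to the client

class Replica:
    def __init__(self, primary):
        self.primary, self.applied = primary, 0

    def replay(self, up_to):
        # stand-in for replication catching up to a given position
        self.applied = min(up_to, self.primary.lsn)

    def read(self, key, min_lsn):
        if self.applied < min_lsn:
            self.replay(min_lsn)  # block/wait until caught up
        return self.primary.data.get(key, (None, 0))[0]
```

In production the "wait until caught up" step is the cost: a lagging replica stalls the read instead of returning stale data.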
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Cache-Aside (Lazy Loading) | App checks cache → miss → load from DB → write to cache | Only caches what's requested. Resilient — cache failure doesn't break reads | Cache miss = extra DB hit. Stale data after write |
| Write-Through | App writes to cache + DB atomically on every write | Cache always fresh, no stale reads | Write latency doubles. Cache fills with unread data |
| Write-Behind (Write-Back) | Write to cache immediately, async flush to DB | Fastest writes. DB can absorb bursts | Data loss if cache crashes before flush |
| Read-Through | Cache sits in front, loads from DB automatically on miss | Simple app code | First request always slow. Needs a cache layer that supports loader callbacks |
| Refresh-Ahead | Pre-emptively refresh cache before TTL expires | No cold start on expiry | Loads data that may not be re-requested |
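Cache-aside, the most common of these, fits in a few lines — plain dicts stand in for Redis and the DB, and the function names are illustrative:

```python
# Cache-aside: the app owns the logic — check cache, fall back to DB,
# populate on miss, and *delete* (not update) the cache on write.
cache, db = {}, {"user:1": {"name": "ada"}}

def get(key):
    if key in cache:            # hit
        return cache[key]
    value = db.get(key)         # miss -> load from DB
    if value is not None:
        cache[key] = value      # populate for the next reader
    return value

def update(key, value):
    db[key] = value
    cache.pop(key, None)        # invalidate; next read repopulates
```

Deleting instead of updating on write avoids the race where two concurrent writers leave the cache holding the older value.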
allkeys-lru — evict any key LRU (general cache)
volatile-lru — evict only keys with TTL set
allkeys-lfu — evict any key LFU (hot data bias)
volatile-ttl — evict shortest TTL first
noeviction — return error when full (dangerous)
// Hard problem: "There are only two hard things in CS..."
// 1. TTL-based (simplest) — accept brief staleness
cache.set(key, value, ttl=300) // expires in 5 minutes regardless
// 2. Event-driven invalidation
on_db_write(user_id):
    cache.delete(f"user:{user_id}") // invalidate on write
// Next read will repopulate from DB
// 3. Write-through with version tag
version = db.increment(f"user_version:{id}")
cache.set(f"user:{id}:{version}", data, ttl=3600)
// 4. CDC (Change Data Capture) — Debezium reads WAL
// DB WAL → Kafka → cache invalidation consumer
// Most reliable for complex invalidation logic
// 5. Two-level cache (L1 local + L2 Redis)
L1 = process_memory_lru(max=1000, ttl=10s) // ultra-fast, small
L2 = redis_cluster(ttl=300s) // shared across instances
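Option 5 can be sketched as a small wrapper — an OrderedDict standing in for the L1 LRU and a plain dict for the shared L2 tier (TTLs omitted for brevity; names are illustrative):

```python
# Two-level cache: a tiny in-process L1 in front of a shared L2.
# L1 hits skip the network entirely; OrderedDict gives L1 LRU eviction.
from collections import OrderedDict

L2 = {}  # stand-in for the shared Redis tier

class TwoLevelCache:
    def __init__(self, l1_max=1000):
        self.l1, self.l1_max = OrderedDict(), l1_max

    def get(self, key, loader):
        if key in self.l1:
            self.l1.move_to_end(key)      # refresh LRU position
            return self.l1[key]
        value = L2.get(key)
        if value is None:
            value = loader(key)           # fall through to the DB
            L2[key] = value
        self.l1[key] = value
        if len(self.l1) > self.l1_max:
            self.l1.popitem(last=False)   # evict least recently used
        return value
```

The catch in production: L1 copies go stale independently per process, which is why L1 TTLs are kept very short (10s above vs 300s for L2).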
-- Pessimistic: lock the row immediately
BEGIN;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE; -- row locked
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT; -- lock released
-- Optimistic: version-based CAS
SELECT balance, version FROM accounts WHERE id = 42;
-- ... do processing ...
UPDATE accounts
SET balance = balance - 100, version = version + 1
WHERE id = 42 AND version = 7; -- fails if version changed
-- If 0 rows affected → conflict, retry
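The optimistic path needs a retry loop around that conditional UPDATE. A minimal in-memory sketch (a dict row stands in for the table; the function names are illustrative):

```python
# Optimistic concurrency: compare-and-swap on a version column, retried
# on conflict. cas_update models "UPDATE ... WHERE id=? AND version=?".
row = {"balance": 500, "version": 7}

def cas_update(expected_version: int, new_balance: int) -> bool:
    """True iff the row still had the expected version (rows affected = 1)."""
    if row["version"] != expected_version:
        return False  # 0 rows affected -> someone else won the race
    row.update(balance=new_balance, version=expected_version + 1)
    return True

def withdraw(amount: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        snap = dict(row)  # SELECT balance, version
        if cas_update(snap["version"], snap["balance"] - amount):
            return True   # CAS succeeded
    return False          # give up after repeated conflicts
```

Optimistic wins when conflicts are rare (retries are cheap); pessimistic wins when the same rows are hammered (retries would thrash).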
Transaction A holds lock on row 1, waits for row 2. Transaction B holds lock on row 2, waits for row 1. Both wait forever. DBs detect cycles in the wait-for graph and kill one transaction (victim selection by lowest cost).
-- Prevention rule: always acquire locks in the same global order
-- BAD: T1 locks user_A then account_B; T2 locks account_B then user_A → deadlock
-- GOOD: always lock by ascending ID
ids = sorted([user_id, account_id])
FOR next_id IN ids:
    SELECT * FROM resources WHERE id = next_id FOR UPDATE
-- Set lock timeout to avoid indefinite waits
SET lock_timeout = '2s';
-- Use SKIP LOCKED for queue-like workloads (no blocking)
SELECT * FROM jobs WHERE status = 'pending'
LIMIT 1 FOR UPDATE SKIP LOCKED;
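The global-ordering rule can be demonstrated with two threads that lock the same resources in opposite argument order — because both sort their lock IDs first, the wait-for graph can never form a cycle. threading.Lock stands in for SELECT ... FOR UPDATE:

```python
# Deadlock prevention by global lock ordering: both workers touch
# resources 1 and 2, but always acquire in ascending-ID order.
import threading

locks = {1: threading.Lock(), 2: threading.Lock()}
log = []

def transfer(a: int, b: int) -> None:
    for rid in sorted((a, b)):            # the prevention rule
        locks[rid].acquire()
    log.append((a, b))                    # critical section
    for rid in sorted((a, b), reverse=True):
        locks[rid].release()

t1 = threading.Thread(target=transfer, args=(1, 2))
t2 = threading.Thread(target=transfer, args=(2, 1))
t1.start(); t2.start()
t1.join(); t2.join()                      # both finish — no deadlock
```

Swap `sorted((a, b))` for the raw `(a, b)` order and the same two threads can deadlock exactly as the BAD comment above describes.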
Multi-Version Concurrency Control stores multiple versions of each row. Readers see a consistent snapshot of the DB at their transaction start time — they never block writers, writers never block readers.
-- Each row has hidden system columns:
-- xmin: transaction ID that inserted this version
-- xmax: transaction ID that deleted/updated (0 = live)
-- Snapshot isolation: your transaction sees rows where
-- xmin <= your_txn_id AND (xmax = 0 OR xmax > your_txn_id)
-- Consequence: old versions accumulate → VACUUM reclaims them
-- VACUUM FREEZE prevents transaction ID wraparound (32-bit overflow)
-- autovacuum_freeze_max_age = 200M transactions (monitor this!)
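The xmin/xmax snapshot rule above is just a predicate, which a few lines make concrete (real PostgreSQL also checks commit status and in-progress transactions — omitted here for brevity):

```python
# MVCC visibility: a transaction sees a row version if it was inserted
# at or before the snapshot and not yet deleted at snapshot time.
def visible(xmin: int, xmax: int, txn_id: int) -> bool:
    return xmin <= txn_id and (xmax == 0 or xmax > txn_id)

# Txn 10 updated a row: old version gets xmax=10, new version xmin=10.
old_version = (5, 10)   # (xmin, xmax)
new_version = (10, 0)
```

Transaction 8 (started before the update) sees only the old version; transaction 12 sees only the new one — each gets a consistent snapshot without blocking.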
-- Problem: global counter row gets thousands of updates/sec
-- BAD: single row counter — serialized writes, massive contention
UPDATE counters SET count = count + 1 WHERE key = 'page_views';
-- GOOD: Sharded counters — N rows, randomly pick one to increment
n = random(0, 99)
UPDATE counters SET count = count + 1
WHERE key = 'page_views' AND shard = n;
-- Read: SELECT SUM(count) FROM counters WHERE key = 'page_views'
-- BETTER: Buffer increments in Redis INCR, flush to DB in batches
redis.incr("page_views") // O(1), no DB contention
// Background job flushes Redis→DB every 10s
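The sharded-counter approach above, sketched runnably — a dict stands in for the counters table, and 10 shards is an arbitrary illustration:

```python
# Sharded counter: N rows per logical key; each increment picks a random
# shard (spreading write contention), reads SUM across all shards.
import random

N_SHARDS = 10
counters = {("page_views", s): 0 for s in range(N_SHARDS)}

def incr(key: str = "page_views") -> None:
    shard = random.randrange(N_SHARDS)   # random shard absorbs the write
    counters[(key, shard)] += 1

def total(key: str = "page_views") -> int:
    # SELECT SUM(count) ... GROUP BY key
    return sum(v for (k, _), v in counters.items() if k == key)
```

Writes now contend on 1/N_SHARDS of the traffic each; the cost is that reads must aggregate N rows instead of one.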
// For cross-service mutual exclusion — e.g. cron job should run once
// Simple single-node lock
acquired = redis.set(
    "lock:job_name",
    "owner_uuid",
    NX=True,  // only set if Not eXists
    PX=30000  // auto-expire in 30s (prevents deadlock on crash)
)
if acquired:
    try:
        do_work()
    finally:
        // Release ONLY if we still own it (Lua script for atomicity)
        redis.eval("""
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            end
        """, ["lock:job_name"], [owner_uuid])
// Redlock: acquire on N/2+1 independent Redis nodes for safety
// Clock skew and GC pauses can still cause split-brain — use with care