Advanced
Database Replication
Data replication - single-leader, multi-leader, and consensus algorithms
Replication is the process of copying data to multiple nodes to improve availability, fault tolerance, and read performance.
Why Replication?
┌─────────────────────────────────────────────────────────┐
│ Purpose of replication: │
│ │
│ 1. High availability: Auto-failover when nodes fail │
│ 2. Read scaling: Distribute read load across replicas │
│ 3. Geographic distribution: Place data close to users │
│ 4. Disaster recovery: Prevent data loss │
└─────────────────────────────────────────────────────────┘
Single-Leader Replication
The most common replication mode: a single primary node (the leader) handles all writes, and multiple followers replicate its data and can serve reads.
Architecture
┌───────────────┐
│ Primary │
Writes ─────────▶│ (Leader) │
└───────┬───────┘
│
┌─────────────┼─────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
Reads◀─│ Replica1 │ │ Replica2 │ │ Replica3 │
│(Follower)│ │(Follower)│ │(Follower)│
└──────────┘ └──────────┘ └──────────┘
Synchronous vs Asynchronous Replication
Synchronous Replication:
┌─────────────────────────────────────────────────────────┐
│ Primary Replica │
│ │ │ │
│ │──Write──▶│ │
│ │ │ Apply │
│ │◀───ACK───│ │
│ │ │ │
│ Commit │ │
│ │◀─Client──│ │
│ │
│ Pros: No data loss, strong consistency │
│ Cons: Higher latency, replica failure affects writes │
└─────────────────────────────────────────────────────────┘
Asynchronous Replication:
┌─────────────────────────────────────────────────────────┐
│ Primary Replica │
│ │ │ │
│ Commit │ │
│ │◀─Client──│ │
│ │ │ │
│ │──Write──▶│ │
│ │ │ Apply (background) │
│ │
│ Pros: Low latency, high availability │
│ Cons: Primary failure may lose data │
└─────────────────────────────────────────────────────────┘
PostgreSQL Streaming Replication
# Primary node config (postgresql.conf)
wal_level = replica
max_wal_senders = 10
synchronous_commit = on                  # wait for the standby before reporting commit
synchronous_standby_names = 'replica1'   # a non-empty list enables synchronous replication
# Replica node config (postgresql.conf; PostgreSQL 12+ also needs an empty standby.signal file)
primary_conninfo = 'host=primary port=5432 user=replicator'
primary_slot_name = 'replica1_slot'
-- Check replication status (run on the primary)
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
MySQL Primary-Replica Replication
# Primary node config (my.cnf)
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
# Replica node config (my.cnf)
server-id = 2
relay_log = relay-bin
-- Configure replication on the replica
-- (MySQL 8.0.22+ prefers CHANGE REPLICATION SOURCE TO / START REPLICA / SHOW REPLICA STATUS)
CHANGE MASTER TO
  MASTER_HOST='primary',
  MASTER_USER='replicator',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=0;  -- use the file and position reported by SHOW MASTER STATUS on the primary
START SLAVE;
SHOW SLAVE STATUS\G
Multi-Leader Replication
Multiple nodes can accept writes, suitable for multi-datacenter deployments.
Architecture
Datacenter A Datacenter B
┌─────────────┐ ┌─────────────┐
│ Leader A │◀─────────▶│ Leader B │
│ (Write) │ Conflict │ (Write) │
└──────┬──────┘ Resolve └──────┬──────┘
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ Followers │ │ Followers │
└─────────────┘ └─────────────┘
Conflict Handling
Write conflict scenario:
┌─────────────────────────────────────────────────────────┐
│ Leader A: UPDATE users SET name = 'Alice' WHERE id = 1; │
│ Leader B: UPDATE users SET name = 'Bob' WHERE id = 1; │
│ Happens simultaneously, how to resolve? │
└─────────────────────────────────────────────────────────┘
Conflict resolution strategies:
1. Last Write Wins (LWW)
- Based on timestamp, newer overwrites older
- Simple but may lose data
2. Merge Values
- Combine conflicting values: name = 'Alice,Bob'
- Requires application-level handling
3. Custom Logic
- Decide based on business rules
- e.g., take higher version number
4. Keep All Versions
- Let user or application decide
- Like CouchDB conflict documents
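A minimal Python sketch of strategies 1 and 3 (the Write record and the resolver names are illustrative, not taken from any particular database): each write carries metadata, and the resolver picks a winner deterministically so that every replica converges on the same value.
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    timestamp: int   # wall-clock or logical timestamp assigned by the accepting leader
    version: int     # application-level version number
    node_id: str     # used to break exact timestamp ties deterministically

def resolve_lww(a: Write, b: Write) -> Write:
    # Strategy 1: Last Write Wins - the newer timestamp overwrites the older one.
    # Ties are broken by node_id so all replicas pick the same winner.
    return max(a, b, key=lambda w: (w.timestamp, w.node_id))

def resolve_by_version(a: Write, b: Write) -> Write:
    # Strategy 3: custom business rule - keep the higher version number.
    return a if a.version >= b.version else b

# Leader A and Leader B accepted conflicting writes for the same row:
wa = Write("Alice", timestamp=1000, version=7, node_id="A")
wb = Write("Bob",   timestamp=1001, version=6, node_id="B")
print(resolve_lww(wa, wb).value)         # Bob   (newer timestamp)
print(resolve_by_version(wa, wb).value)  # Alice (higher version)

CRDTs (Conflict-free Replicated Data Types)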
Data structures whose merge operation is commutative, associative, and idempotent, so concurrent updates converge without coordination:
G-Counter (Grow-only counter):
┌─────────────────────────────────────────────────────────┐
│ Node A: {A: 5, B: 0} │
│ Node B: {A: 0, B: 3} │
│ Merge: {A: 5, B: 3} → Total = 8 │
│ Same result regardless of merge order │
└─────────────────────────────────────────────────────────┘
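A minimal Python sketch of a G-Counter (illustrative, not tied to any specific database): each node only increments its own slot, and merge takes the element-wise maximum, which makes merging commutative, associative, and idempotent.
class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}            # node_id -> count observed so far

    def increment(self, n: int = 1):
        # A node may only increment its own entry.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter"):
        # Element-wise max: the order of merges does not matter.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("A"), GCounter("B")
for _ in range(5): a.increment()    # Node A: {A: 5}
for _ in range(3): b.increment()    # Node B: {B: 3}
a.merge(b)
print(a.value())                    # 8, regardless of merge order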
LWW-Register (Last-Write-Wins Register):
┌─────────────────────────────────────────────────────────┐
│ Each write has timestamp │
│ {value: 'Alice', timestamp: 1000} │
│ {value: 'Bob', timestamp: 1001} │
│ Merge takes larger timestamp: 'Bob' │
└─────────────────────────────────────────────────────────┘
Leaderless Replication
All nodes can accept reads and writes, using quorum for consistency.
Architecture
Client
│
┌─────┼─────┐
│ │ │
▼ ▼ ▼
┌─────┐┌─────┐┌─────┐
│Node1││Node2││Node3│
│ R/W ││ R/W ││ R/W │
└─────┘└─────┘└─────┘
Write: Send to all nodes, wait for W acknowledgments
Read: Query all nodes, wait for R responses
Quorum Mechanism
Quorum formula: R + W > N
N = Total number of replicas
W = Number of nodes that must acknowledge writes
R = Number of nodes that must respond to reads
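A toy Python sketch of a quorum coordinator, assuming in-memory "replicas" and ignoring timeouts and retries: a write succeeds once W replicas acknowledge it, a read consults R replicas and returns the freshest version, and R + W > N guarantees the two sets overlap.
N, W, R = 3, 2, 2
assert R + W > N, "read and write quorums must overlap for strong consistency"

# Hypothetical in-memory replicas: each maps key -> (value, version)
replicas = [dict() for _ in range(N)]

def write(key, value, version):
    acks = 0
    for replica in replicas:              # send the write to every replica
        replica[key] = (value, version)
        acks += 1                         # in practice some replicas may not respond
    return acks >= W                      # succeed once at least W acknowledged

def read(key):
    responses = [r[key] for r in replicas[:R] if key in r]   # wait for R responses
    # The highest version wins; R + W > N guarantees at least one fresh copy is seen.
    return max(responses, key=lambda vv: vv[1])[0] if responses else None

write("user:1", "John", version=1)
print(read("user:1"))                     # John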
Example (N=3):
┌─────────────────────────────────────────────────────────┐
│ W=2, R=2 → Strong consistency (2+2>3) │
│ W=3, R=1 → Strong consistency, slower writes │
│ W=1, R=3 → Strong consistency, slower reads │
│ W=1, R=1 → Eventual consistency (1+1≤3) │
└─────────────────────────────────────────────────────────┘
Cassandra Implementation
-- Create Keyspace with replication factor 3
CREATE KEYSPACE myapp WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
-- CQL 3 has no per-statement USING CONSISTENCY clause; in cqlsh, set the
-- consistency level for the session before running statements
CONSISTENCY QUORUM;
-- QUORUM = floor(RF/2) + 1 = 2 replicas when RF = 3
-- Write at QUORUM
INSERT INTO users (id, name) VALUES (1, 'John');
-- Read at QUORUM
SELECT * FROM users WHERE id = 1;
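Applications usually set the consistency level through the client driver rather than cqlsh; a short sketch using the DataStax Python driver (host and keyspace names are placeholders):
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session = Cluster(["cassandra-node1"]).connect("myapp")

# Write at QUORUM: at least 2 of the 3 replicas must acknowledge.
insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (1, "John"))

# Read at QUORUM: at least 2 replicas must respond.
select = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(select, (1,)).one()

Failover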
How the system detects and handles a primary node failure.
Automatic Failover
Failure detection and switchover:
┌─────────────────────────────────────────────────────────┐
│ 1. Heartbeat detection: Periodically check primary │
│ 2. Failure confirmation: Confirm after multiple misses │
│ 3. Election: Select new primary node │
│ 4. Switchover: Update config, redirect traffic │
│ 5. Recovery: Old primary becomes follower when back │
└─────────────────────────────────────────────────────────┘
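A simplified Python sketch of steps 1-3; ping_primary and elect_new_primary are placeholders, and the thresholds are arbitrary. Requiring several consecutive missed heartbeats avoids failing over on a single dropped packet.
import time

HEARTBEAT_INTERVAL = 1.0    # seconds between checks
MISS_THRESHOLD = 3          # consecutive misses before declaring failure

def ping_primary() -> bool:
    """Placeholder: send a heartbeat to the primary and return True if it answered."""
    ...

def elect_new_primary():
    """Placeholder: run leader election among the surviving followers."""
    ...

def monitor_primary():
    misses = 0
    while True:
        if ping_primary():
            misses = 0                       # healthy: reset the counter
        else:
            misses += 1                      # missed heartbeat
            if misses >= MISS_THRESHOLD:
                return elect_new_primary()   # step 3: trigger an election
        time.sleep(HEARTBEAT_INTERVAL)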
Timeline:
T0: Primary running normally
Primary ◀──── Heartbeat ────▶ Follower
T1: Primary fails
Primary (❌) ◀──── Heartbeat (timeout) ────▶ Follower
T2: Failure detection
Threshold exceeded, failure confirmed
T3: Elect new Primary
Follower1 elected as new Primary
T4: Traffic switchover
Client ────▶ New Primary (formerly Follower1)
Split Brain Problem
Split Brain:
┌─────────────────────────────────────────────────────────┐
│ Network partition causes both nodes to think they are │
│ the primary │
│ │
│ ┌──────────┐ ╳╳╳╳╳╳ ┌──────────┐ │
│ │ Primary │ Network │ Primary? │ │
│ │ (old) │ split │ (newly │ │
│ └──────────┘ │ elected) │ │
│ │ └──────────┘ │
│ Writes │ │
│ (conflicting!) Writes │
│ (conflicting!) │
└─────────────────────────────────────────────────────────┘
Solutions:
1. Fencing: Ensure old primary cannot write (STONITH)
2. Quorum: Minority partition auto-demotes
3. Lease: Primary holds a time-limited lease
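A minimal sketch of the lease approach (solution 3), assuming leases are granted by a quorum-backed coordinator such as ZooKeeper or etcd; the names here are illustrative. An old primary whose lease has expired refuses writes, fencing itself off instead of contributing to a split brain.
import time

class LeasedPrimary:
    def __init__(self, lease_seconds: float = 10.0):
        self.lease_seconds = lease_seconds
        self.lease_expires_at = 0.0

    def renew_lease(self):
        # In practice the lease is granted by a quorum-backed coordinator;
        # here we simply extend a local deadline.
        self.lease_expires_at = time.monotonic() + self.lease_seconds

    def handle_write(self, write):
        if time.monotonic() >= self.lease_expires_at:
            # Lease expired (possibly due to a partition): step down instead
            # of risking conflicting writes.
            raise RuntimeError("not the primary: lease expired")
        apply_to_storage(write)   # hypothetical storage call

def apply_to_storage(write):
    ...

Consensus Algorithms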
Algorithms for reaching agreement in distributed systems.
Raft Algorithm
Raft state machine:
┌─────────────────────────────────────────────────────────┐
│ │
│ ┌──────────┐ Election timeout ┌──────────┐ │
│ │ Follower │───────────────────▶│ Candidate│ │
│ └──────────┘ └──────────┘ │
│ ▲ │ │
│ │ Discovers higher term │ Wins majority │
│ │ ▼ │
│ │ ┌──────────┐ │
│ └─────────────────────────│ Leader │ │
│ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
Leader election:
1. Follower times out without heartbeat
2. Becomes Candidate, increments term
3. Requests votes from other nodes
4. Wins majority to become Leader
5. Leader sends periodic heartbeats
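A condensed Python sketch of the election rules above, with message transport reduced to a placeholder RPC: a follower whose election timer fires becomes a candidate, increments its term, votes for itself, and becomes leader once it holds a majority.
import random

def request_vote_from(peer, term, candidate_id) -> bool:
    """Placeholder RPC: a real peer grants its vote at most once per term."""
    return False

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers               # the other nodes in the cluster
        self.state = "follower"
        self.current_term = 0
        self.voted_for = None
        # Randomized timeout so nodes rarely start elections simultaneously.
        self.election_timeout = random.uniform(150, 300)   # milliseconds

    def on_election_timeout(self):
        # Steps 1-3: become candidate, increment term, vote for self, request votes.
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1 + sum(request_vote_from(p, self.current_term, self.node_id)
                        for p in self.peers)
        # Step 4: a majority of the full cluster (peers + self) makes us leader.
        if votes > (len(self.peers) + 1) // 2:
            self.state = "leader"        # step 5: start sending heartbeats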
Log replication:
1. Client sends write request to Leader
2. Leader appends to local log
3. Sends in parallel to all Followers
4. Commits after majority confirm
5. Responds to Client
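A sketch of the commit rule in step 4, with Raft's bookkeeping reduced to a match_index list: the leader counts itself plus every follower that has stored an entry, and commits once that count is a majority.
def try_commit(leader_log, match_index, cluster_size, commit_index):
    """match_index[i] = highest log index known to be replicated on follower i."""
    # Walk backwards from the newest entry looking for one stored on a majority.
    # (Full Raft also requires that entry to be from the leader's current term.)
    for index in range(len(leader_log), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index if m >= index)  # leader + followers
        if replicas > cluster_size // 2:
            return index          # committed: safe to apply and reply to the client
    return commit_index           # nothing newly committed

# Leader has 3 entries; follower 1 replicated up to index 3, follower 2 up to 1.
print(try_commit(["set x=1", "set y=2", "set z=3"],
                 match_index=[3, 1], cluster_size=3, commit_index=0))   # -> 3

Paxos Algorithm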
Paxos roles:
┌─────────────────────────────────────────────────────────┐
│ Proposer: Initiates proposals │
│ Acceptor: Votes on decisions │
│ Learner: Learns the final decision │
└─────────────────────────────────────────────────────────┘
Two phases (not to be confused with two-phase commit):
Phase 1 (Prepare):
Proposer → Acceptors: "Prepare proposal N"
Acceptors → Proposer: "Promise not to accept < N"
Phase 2 (Accept):
Proposer → Acceptors: "Accept proposal N, value V"
Acceptors → Proposer: "Accepted"
A majority of Acceptors accept → the value is chosen
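A single-decree Paxos acceptor sketch in Python (networking and the full proposer logic are omitted): in Phase 1 the acceptor promises to ignore lower-numbered proposals and reports any value it already accepted, which the proposer must then re-propose in Phase 2.
class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised so far
        self.accepted_n = -1       # number of the last accepted proposal
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise not to accept any proposal numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted value; the proposer must reuse it.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n, value):
        """Phase 2: accept proposal n unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return "accepted"
        return "rejected"

acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
if sum(p[0] == "promise" for p in promises) > len(acceptors) // 2:
    accepted = sum(a.accept(1, "V") == "accepted" for a in acceptors)
    print("chosen" if accepted > len(acceptors) // 2 else "not chosen")

Best Practices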
- Choose an appropriate replication mode:
  - Single-leader: most scenarios
  - Multi-leader: multi-datacenter deployments
  - Leaderless: high-availability requirements
- Configure an appropriate sync level:
  - Critical data: synchronous replication
  - General data: asynchronous replication
- Monitor replication lag:
  -- PostgreSQL (run on the replica)
  SELECT NOW() - pg_last_xact_replay_timestamp() AS replication_lag;
- Test failover:
  - Regularly practice failover
  - Verify data integrity
- Handle consistency issues:
  - Critical reads go to the primary
  - Use a read-your-writes guarantee (see the sketch below)
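A small sketch of a read-your-writes routing policy; the router and the primary/replica connection objects are hypothetical. After a user writes, that user's reads are pinned to the primary for a short window so they never observe a stale replica.
import time

class ReadYourWritesRouter:
    def __init__(self, primary, replicas, stickiness_seconds: float = 5.0):
        self.primary = primary
        self.replicas = replicas
        self.stickiness_seconds = stickiness_seconds
        self.last_write_at = {}      # user_id -> time of that user's last write

    def write(self, user_id, query):
        self.last_write_at[user_id] = time.monotonic()
        return self.primary.execute(query)

    def read(self, user_id, query):
        recently_wrote = (time.monotonic() - self.last_write_at.get(user_id, 0.0)
                          < self.stickiness_seconds)
        # Users who just wrote read from the primary; everyone else may use replicas.
        target = self.primary if recently_wrote else self.replicas[0]
        return target.execute(query)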