Understanding how distributed nodes reach agreement
A distributed system is a collection of independent computers (nodes) that work together as a single system. Examples include databases like Google Spanner or blockchain networks like Bitcoin and Ethereum.
Since these nodes communicate over potentially unreliable networks, keeping them synchronized is challenging.
Consensus means agreement. In distributed systems, it refers to getting all nodes to agree on a single value or state, even if some nodes fail or send incorrect information.
Goal: All non-faulty nodes should eventually agree on the same result.
Example: Multiple database servers must agree on the order of transactions or the identity of the leader node.
A consensus protocol is an algorithm that ensures nodes agree on a common value (such as a transaction order, leader, or block) even in the presence of failures.
Protocol | Main Use | Key Feature |
---|---|---|
Paxos | Databases, distributed coordination | High fault tolerance |
Raft | Cluster management (e.g., etcd, Kubernetes) | Simpler to understand and implement than Paxos |
PBFT (Practical Byzantine Fault Tolerance) | Blockchain systems | Tolerates malicious (Byzantine) nodes |
Proof of Work / Proof of Stake | Blockchains | Achieve consensus over open, untrusted networks |
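To make this concrete, here is a rough sketch of the kind of interface a consensus protocol exposes to the rest of the system, regardless of whether Paxos, Raft, or PBFT sits underneath. The `Protocol` interface and its method names are hypothetical, not taken from any particular library.

```go
package consensus

// Entry is a value the cluster must agree on, e.g. a serialized transaction.
type Entry []byte

// Protocol is a hypothetical interface capturing what Paxos, Raft, or PBFT
// give the application layer: submit a value, then learn the single order
// in which every non-faulty node will apply values.
type Protocol interface {
	// Propose asks the cluster to agree on e; it may fail if, for example,
	// this node is not the leader or no majority is reachable.
	Propose(e Entry) error

	// Committed delivers entries in the agreed order. Every non-faulty
	// node observes the same sequence on this channel.
	Committed() <-chan Entry
}
```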
Let's make this intuitive with an example. Imagine several database servers that each hold a copy of the same account balance and receive updates independently. If they update separately, their balances may diverge. Consensus ensures all servers agree on the same operation order, maintaining consistency.
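Here is a toy sketch (with made-up account figures) of how replicas diverge when they apply the same operations in different orders, which is exactly what consensus prevents by fixing one global order first:

```go
package main

import "fmt"

// apply runs a sequence of balance updates in the given order.
func apply(balance float64, ops []func(float64) float64) float64 {
	for _, op := range ops {
		balance = op(balance)
	}
	return balance
}

func main() {
	deposit := func(b float64) float64 { return b + 100 }  // deposit $100
	interest := func(b float64) float64 { return b * 1.1 } // add 10% interest

	// Replica 1 applies the deposit first; replica 2 applies interest first.
	r1 := apply(1000, []func(float64) float64{deposit, interest})
	r2 := apply(1000, []func(float64) float64{interest, deposit})

	fmt.Println(r1, r2) // 1210 vs 1200: the replicas have diverged
}
```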
Consensus ensures data consistency across all nodes and provides fault tolerance: the system keeps serving correct results even when some nodes crash or become unreachable.
System: Kubernetes uses etcd (a distributed key-value store) for storing cluster state.
Consensus Protocol: Raft
How it works: every write to the cluster state goes through the current etcd leader, which appends it to its Raft log and replicates it to the followers; the write is committed (and acknowledged to the client) only after a majority of nodes have persisted it, so all nodes converge on the same state.
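From a client's point of view, this machinery is invisible: you just read and write keys. Below is a minimal sketch using the etcd Go client (go.etcd.io/etcd/client/v3), assuming an etcd server is reachable on localhost:2379; the key name is made up for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a local etcd endpoint (assumed to be running on :2379).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// A write is forwarded to the Raft leader, replicated to followers,
	// and only acknowledged once a majority has persisted it.
	if _, err := cli.Put(ctx, "/cluster/desired-replicas", "3"); err != nil {
		panic(err)
	}

	// The read returns the committed, agreed-upon cluster state.
	resp, err := cli.Get(ctx, "/cluster/desired-replicas")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```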
Concept | Meaning |
---|---|
Consensus | Agreement among distributed nodes on a single value or state |
Why Needed | Ensures data consistency and fault tolerance |
Common Protocols | Paxos, Raft, PBFT, PoW/PoS |
Example System | etcd (used by Kubernetes) uses Raft for consistent cluster state |
Let's see what happens when the leader fails in Raft; this is key to its reliability.
Imagine 5 nodes: A, B, C, D, E. B is the leader. If B crashes or loses network connectivity, followers stop receiving heartbeats from B.
Each follower expects regular heartbeats from the leader. If a follower hears nothing for its election timeout (e.g., 150-300 ms, randomized per node), it assumes the leader is dead; suppose C times out first and becomes a candidate.
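A minimal sketch (not etcd's or Raft's actual code) of that follower-side logic: reset a randomized election timer on every heartbeat, and switch to candidate when it fires without one.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout in the 150-300 ms range.
// Randomization makes it unlikely that two followers become candidates
// at the same moment, which would split the vote.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}

func main() {
	heartbeats := make(chan struct{}) // delivered by the leader in a real node
	timer := time.NewTimer(electionTimeout())

	for {
		select {
		case <-heartbeats:
			// Leader is alive: reset the timer and stay a follower.
			timer.Reset(electionTimeout())
		case <-timer.C:
			// No heartbeat within the timeout: assume the leader is dead,
			// increment the term, and start an election as a candidate.
			fmt.Println("timeout elapsed: becoming candidate, requesting votes")
			return
		}
	}
}
```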
Once C is elected leader, it starts sending heartbeats and replicating its log to the other nodes. Entries from the old leader that were never committed are discarded, while committed entries are preserved, so no acknowledged write is lost.
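The log repair step can be sketched roughly as below. This simplifies Raft's real AppendEntries consistency check (which compares prevLogIndex and prevLogTerm on each RPC) down to a single pass, and the commands are made up:

```go
package main

import "fmt"

// LogEntry is a simplified Raft log entry: the term it was created in
// plus an opaque command.
type LogEntry struct {
	Term    int
	Command string
}

// reconcile brings a follower's log in line with the new leader's log:
// the first index where the terms disagree marks the start of stale,
// uncommitted entries, which are dropped and replaced by the leader's.
// Committed entries are never in conflict, so they are always retained.
func reconcile(follower, leader []LogEntry) []LogEntry {
	i := 0
	for i < len(follower) && i < len(leader) && follower[i].Term == leader[i].Term {
		i++
	}
	return append(follower[:i], leader[i:]...)
}

func main() {
	follower := []LogEntry{{1, "x=1"}, {1, "x=2"}, {2, "x=9"}} // last entry came from the crashed leader B
	leader := []LogEntry{{1, "x=1"}, {1, "x=2"}, {3, "x=5"}}   // new leader C's log

	fmt.Println(reconcile(follower, leader))
	// [{1 x=1} {1 x=2} {3 x=5}]: the conflicting, uncommitted entry from term 2 is gone
}
```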
Even if one or two nodes fail, the cluster continues to operate normally: Raft only needs a majority to function, so in a 5-node cluster, 3 working nodes are enough.
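The majority rule is simple arithmetic, as this small sketch shows:

```go
package main

import "fmt"

// quorum returns the minimum number of nodes that must be reachable
// for the cluster to elect a leader and commit new entries.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d-node cluster: needs %d, tolerates %d failures\n",
			n, quorum(n), n-quorum(n))
	}
	// 5-node cluster: needs 3, tolerates 2 failures
}
```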
Stage | Description |
---|---|
Failure Detection | Followers stop receiving heartbeats from the leader |
Leader Election | A follower becomes a candidate and requests votes |
New Leader Elected | Majority votes decide the new leader |
Log Synchronization | Uncommitted logs are removed; committed entries retained |
System Recovery | Cluster resumes normal operation |