System Design Tutorial: 3 Must-Know Distributed Systems Concepts

In this post, we will discuss three system design concepts that can be used to solve design problems relating to distributed systems. As these concepts can be applied to all kinds of distributed systems, they are very useful during System Design Tutorial.

Here is the list of concepts we will be discussing:

Explore Free Engineering Handwritten Notes!

Looking for comprehensive study materials on Python, Data Structures and Algorithms (DSA), Object-Oriented Programming (OOPs), Java, Software Testing, and more?

We earn a commission if you make a purchase, at no additional cost to you.
  1. Heartbeat
  2. Split Brain
  3. Merkle Tree

1. Heartbeat

Background

In a distributed environment, data (or work) is distributed among servers. Such a setup requires servers to know what other servers are part of the system in order to route requests efficiently. Furthermore, servers should be able to tell if other servers are up and running. In a decentralized environment, whenever a request arrives at a server, the server should be able to decide which server is responsible for entertaining that request. In this way, timely detection of server failure is crucial, enabling the system to take corrective action and move data (or work) to another healthy server and stop the environment from deteriorating further.

Definition

In a distributed environment, each server periodically sends a heartbeat message to a central monitoring server or other servers in the system to show that it is still alive and functioning.

Solution

Heartbeating is one of the mechanisms for detecting failures in a distributed system. If there is a central server, all servers periodically send a heartbeat message to it. If there is no central server, all servers randomly choose a set of servers and send them a heartbeat message every few seconds. This way, if no heartbeat message is received from a server for a while, the system can suspect that the server might have crashed. If there is no heartbeat within a configured timeout period, the system can conclude that the server is not alive anymore and stop sending requests to it and start working on its replacement.

Example

Google File System (GFS) and HDFS use Heartbeating to communicate with each other servers in the system to give instructions and collect state.

2. Split Brain

Background

In a distributed environment with a central (or leader) server, if the central server dies, the system must quickly find a substitute; otherwise, the system can quickly deteriorate.

One of the problems is that we cannot truly know if the leader has stopped for good or has experienced an intermittent failure like a stop-the-world GC pause or a temporary network disruption. Nevertheless, the cluster has to move on and pick a new leader. If the original leader had an intermittent failure, we now find ourselves with a so-called zombie leader. A zombie leader can be defined as a leader node that had been deemed dead by the system and has since come back online. Another node has taken its place, but the zombie leader might not know that yet. The system now has two active leaders that could be issuing conflicting commands. How can a system detect such a scenario so that all nodes in the system can ignore requests from the old leader and the old leader itself can detect that it is no longer the leader?

Definition

The common scenario in which a distributed system has two or more active leaders is called split-brain.

Split-brain is solved through the use of Generation Clock, which is simply a monotonically increasing number to indicate a server’s generation.

Solution

Every time a new leader is elected, the generation number gets incremented. This means if the old leader had a generation number of ‘1’, the new one will have ‘2’. This generation number is included in every request that is sent from the leader to other nodes. This way, nodes can now easily differentiate the real leader by simply trusting the leader with the highest number. The generation number should be persisted on disk, so that it remains available after a server reboot. One way is to store it with every entry in the Write-ahead Log.

Examples

Kafka: To handle Split-brain (where we could have multiple active controller brokers), Kafka uses ‘Epoch number,’ which is simply a monotonically increasing number to indicate a server’s generation.

HDFS: ZooKeeper is used to ensure that only one NameNode is active at any time. An epoch number is maintained as part of every transaction ID to reflect the NameNode generation.

3. Merkle Trees

Background

Distributed systems maintain multiple copies of data on different servers (called replicas) for fault tolerance and higher availability. To keep the data in sync among all replica servers, the system needs an efficient mechanism to compare data between two replicas. In a distributed environment, how can we quickly compare two copies of a range of data residing on two different replicas and figure out exactly which parts are different?

Definition

A replica can contain a lot of data. Naively splitting up the entire range to calculate checksums for comparison is not very feasible; there is simply too much data to be transferred. Instead, we can use Merkle trees to compare replicas of a range.

Solution

A Merkle tree is a binary tree of hashes, where each internal node is the hash of its two children, and each leaf node is a hash of a portion of the original data.

Comparing Merkle trees is conceptually simple:

  1. Compare the root hashes of both trees.
  2. If they are equal, stop.
  3. Recurse on the left and right children.

Ultimately, this means that replicas know exactly which parts of the range are different, but the amount of data exchanged is minimized. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set. Hence, Merkle trees minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads.

The disadvantage of using Merkle trees is that many key ranges can change when a node joins or leaves, at which point the trees need to be recalculated.

Examples

For anti-entropy and to resolve conflicts in the background, Amazon’s Dynamo uses Merkle trees.

Conclusion

➡ Practice these system design concepts to distinguish yourself from others!

Leave a Reply