This is the 6th post in a series on System Design.
The emergence of distributed, NoSQL databases has increased the importance of eventual consistency.
Consistency was not a problem when a system kept a single source of truth for all data in the database: replicas simply did not exist.
To provide sufficient processing and storage capacity, many modern systems scale out their databases across multiple nodes. To eliminate single points of failure, you also need to replicate the contents of each node to ensure the data for each node is highly available.
Your database has suddenly become a distributed system. When the database nodes and networks are fast and reliable, your users don’t realize they are interacting with a distributed system. User requests are processed with low response times, and replicas are updated seemingly instantly. It is rare for reads to be inconsistent.
A distributed system must be able to handle a variety of failure modes. A database must deal with all the issues associated with highly variable network latencies, communication failures, and machine failures. Your database replicas may remain inconsistent for longer periods of time as a result of these failures.
An eventually consistent system’s inconsistency window is the period of time it takes for an update to propagate to all replicas. The leader coordinates the updating of other replicas in a leader-based system. The update is coordinated by any replica in a leaderless system. As soon as all replicas have the same value, the inconsistency window ends.
Several factors affect the duration of the inconsistency window:
- The number of replicas — The more replicas you have, the more replica updates must be coordinated.
- Operational Environment — Instantaneous operational glitches, such as temporary network failures and packet loss, can extend the inconsistency window.
- Distance between replicas — It is possible to achieve submillisecond communication latencies if all replicas are on the same local area network subnet. If one of your replicas is across a continent or around the world, the round-trip time sets a lower bound on the inconsistency window.
Read Your Own Writes
The Read Your Own Writes (RYOW) property of a system ensures that when a client persistently changes data, subsequent reads by that client are guaranteed to return the updated value.
The inconsistency window allows a client in an eventually consistent system to:
- Issue an update to a database object key.
- Read the same database object key and see the old value, because the read is served by a replica that has not yet persisted the update.
To avoid this situation, a system must provide RYOW consistency, so that any updates made by an individual user are visible in that user’s subsequent reads.
Implementing read-your-writes consistency is straightforward with leader-follower replication. You simply ensure that the subsequent read is handled by the leader replica when RYOW is required, since the leader is guaranteed to hold the latest value.
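The routing idea above can be sketched in a few lines. This is a minimal illustration, not a real database API: the `Replica` and `RYOWClient` classes and the single stale follower are assumptions made for the example.

```python
class Replica:
    """A trivial in-memory key-value replica (illustrative only)."""
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class RYOWClient:
    """Routes reads to the leader for keys this client has written."""
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.written_keys = set()  # keys this client has updated

    def write(self, key, value):
        self.leader.write(key, value)  # leader applies the write first
        self.written_keys.add(key)     # remember: reads for this key must hit the leader

    def read(self, key):
        if key in self.written_keys:
            return self.leader.read(key)    # guaranteed up to date
        return self.followers[0].read(key)  # may lag behind the leader


# Usage: the follower has not replicated yet, but the client still
# reads its own write because the read is routed to the leader.
leader, follower = Replica(), Replica()
client = RYOWClient(leader, [follower])
client.write("profile:42", "new-bio")
print(client.read("profile:42"))  # new-bio
```

A real implementation would track written keys per session (and expire them once replication catches up) rather than forever, but the routing decision is the same.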
Most eventually consistent databases offer configuration options and API parameters that allow you to tailor the database’s behavior. A use case can therefore trade off the performance of individual reads and writes based on the level of eventual replica consistency it can tolerate.
An application can specify the number of replicas it needs to access in order to complete a database request with tunable consistency.
- N: Total number of replicas.
- W: Number of replicas to update before confirming the update to the client.
- R: Number of replicas to read before returning a value.
With W=3, the request coordinator waits until all three replicas are updated before returning success to the client.
With W=1, the request coordinator confirms the update locally and returns success to the client. The other two replicas are updated asynchronously.
If W=3, all replicas will be consistent once the write completes. This is also known as immediate consistency. The client can then issue reads with R=1 and receive the latest value.
With W=1, you have an inconsistency window, as only one replica is guaranteed to have the latest value. A read with R=1 may or may not return the latest value.
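The W=1 versus W=3 behavior can be simulated with a toy store. The class and method names are hypothetical; asynchronous replication is modeled as a pending queue that a `flush()` call applies later, which is an assumption made to keep the sketch deterministic.

```python
class TunableStore:
    """N replicas; W replicas are updated before the write is acknowledged.
    The remaining replicas are updated when flush() runs, simulating
    asynchronous replication."""
    def __init__(self, n=3):
        self.replicas = [dict() for _ in range(n)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value, w):
        for replica in self.replicas[:w]:
            replica[key] = value                  # synchronous updates
        for i in range(w, len(self.replicas)):
            self.pending.append((i, key, value))  # async updates, applied later
        return "ack"                              # confirmed after W replicas

    def read(self, key, r):
        # Return the value seen on the first of R replicas (no version
        # reconciliation here; a real store would resolve conflicts).
        return [rep.get(key) for rep in self.replicas[:r]][0]

    def flush(self):
        """Apply all outstanding asynchronous replica updates."""
        for i, key, value in self.pending:
            self.replicas[i][key] = value
        self.pending.clear()


store = TunableStore(n=3)
store.write("x", 1, w=1)           # acknowledged after one replica
print(store.read("x", r=1))        # 1: the read happens to hit the updated replica
print(store.replicas[2].get("x"))  # None: the inconsistency window is open
store.flush()                      # async replication completes
print(store.replicas[2].get("x"))  # 1: all replicas are now consistent
```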
Eric Brewer’s famous CAP theorem elegantly summarizes the options for replica consistency and availability in distributed databases. This describes the options a database system has when there is a network partition, such as when the network drops or delays messages.
It is possible for a system to be both consistent and available if the network is functioning properly. In the event of a network partition, a system can either be consistent (CP) or available (AP).
A network partition means some nodes in the database are not accessible to others, dividing the database into two groups. If an update occurs and the replicas for the updated data object reside on both sides of the partition, then the database can either:
- Reject the update and return an error, sacrificing availability to preserve consistency (CP).
- Update the replicas that are visible (AP). Until the partition heals and the database can make all replicas consistent, there will be replica inconsistency. Clients may see different values for the same data object until the inconsistency is resolved.
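The CP-versus-AP choice can be shown as a tiny decision sketch. The `handle_write` function and its `mode` flag are assumptions for illustration; the CP policy here simply refuses any write that cannot reach every replica.

```python
def handle_write(visible_replicas, n_total, key, value, mode="CP"):
    """visible_replicas: the replicas reachable on this side of the partition.
    mode='CP': refuse writes that cannot reach all N replicas.
    mode='AP': apply the write to whatever replicas are visible."""
    if len(visible_replicas) < n_total and mode == "CP":
        raise RuntimeError("network partition: write rejected to stay consistent")
    for replica in visible_replicas:
        replica[key] = value  # AP: visible replicas now diverge from the rest
    return "ok"


# A partition leaves only 2 of 3 replicas visible.
side_a = [dict(), dict()]
print(handle_write(side_a, 3, "k", "v", mode="AP"))  # ok, but replicas diverge
try:
    handle_write(side_a, 3, "k", "v", mode="CP")
except RuntimeError as e:
    print(e)  # the CP system returns an error instead
```

Real CP systems usually accept writes as long as a majority is reachable (see the quorum discussion below); requiring all N replicas is the simplest possible policy for the sketch.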
Quorum Reads and Writes
A quorum is simply a majority, which equals (N/2)+1 (using integer division). It is possible to balance the performance of reads and writes by configuring both W and R as a quorum, while still guaranteeing access to the latest updated value.
A quorum with N=3 replicas requires a write to succeed at 2 replicas, and a read to access 2 replicas. Because R + W > N, every read quorum overlaps every write quorum in at least one replica, so read requests always see the latest version of the database object.
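The overlap guarantee can be checked exhaustively for N=3, W=R=2: every possible write quorum shares at least one replica with every possible read quorum.

```python
from itertools import combinations

N, W, R = 3, 2, 2  # quorum configuration: R + W > N

for write_set in combinations(range(N), W):
    for read_set in combinations(range(N), R):
        # Since R + W > N, the two sets must intersect, so at least
        # one replica in the read set holds the latest write.
        assert set(write_set) & set(read_set)

print("every read quorum overlaps every write quorum")
```

The same check fails for W=1, R=1: the write quorum {0} and the read quorum {1} are disjoint, which is exactly the inconsistency window described earlier.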
Any replica can handle writes in a leaderless system. As a result, two clients can update the same database key independently on different replicas at the same time. How should the updates be applied when this occurs? What should be the final value of all replicas?
Last Writer Wins — Timestamps can be used to determine the final, definitive value. A timestamp is generated for each update request, and when concurrent updates occur, the database keeps the value with the most recent timestamp as the final version. From the database’s perspective, this is simple and fast. Unfortunately, the approach has a flaw: which update actually happened first? Machine clocks drift, so if one node’s clock is ahead of another’s, comparing timestamps is unreliable, and a genuinely newer value can be silently discarded.
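A last-writer-wins merge reduces to taking the maximum timestamp. The function name and the timestamp values below are illustrative assumptions; the point is that the losing updates are simply discarded.

```python
def lww_merge(versions):
    """versions: list of (timestamp, value) pairs for the same key.
    Returns the value with the highest timestamp; all others are discarded."""
    return max(versions, key=lambda tv: tv[0])[1]


# Two concurrent updates to the same key. The one with the later
# timestamp wins, even if that client's clock was simply running fast.
print(lww_merge([(1700000000.1, "alice"), (1700000000.2, "bob")]))  # bob
```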
Version Vectors — To handle concurrent updates without losing data, conflicts must be detected and resolved explicitly. To make this possible, a version number is tracked per replica for each unique database object; comparing these version vectors reveals whether one update supersedes another or the two are concurrent.
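The comparison rule can be sketched directly. This is a simplified illustration of version-vector ordering, assuming each vector is a dict mapping a node id to that node's update counter:

```python
def compare(a, b):
    """a, b: version vectors as dicts mapping node id -> update counter.
    A vector supersedes another if it is >= in every component;
    if neither dominates, the updates are concurrent."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a supersedes b"
    if b_ge:
        return "b supersedes a"
    return "concurrent"  # conflict: keep both values for the application to resolve


print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # a supersedes b
print(compare({"n1": 2}, {"n2": 1}))                    # concurrent
```

Unlike last writer wins, a "concurrent" result does not discard anything: both values are retained as siblings until the application (or a merge function) reconciles them.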
Book: Designing Data-Intensive Applications
Book: Foundations of Scalable Systems