This is the 7th post in a series on System Design.
Eventual consistent databases allow data sets to be partitioned and replicated across multiple machines, as discussed in our previous post. Scalability is achieved at the expense of maintaining strong data consistency across replicas and allowing conflicting writes. These trade-offs have two consequences. The first problem is that different clients may see either the old or the latest value for a data object until all replicas converge on the latest value. A second concern is the application’s responsibility to ensure that data is not lost when multiple clients update an object concurrently.
Strong consistent data systems are another class of distributed databases.
Strong consistent systems, also known as NewSQL, aim to ensure that data objects are updated consistently across all clients. Furthermore, they provide the well-known benefits of atomicity, consistency, isolation, and durability (ACID) database transactions.
A single-node relational database is already known for its transactional and data consistency features, eliminating many of the complications associated with eventually consistent systems.
Consensus algorithms are used to ensure transactional and replica consistency. The algorithms enable nodes in a distributed system to agree on the value of a shared state.
- To ensure transactional consistency, all participants must agree to commit or abort changes within the transaction.
- For replica consistency, all replicas must agree on the same order of updates.
Atomicity - Database changes must be executed as a single operation. In other words, all updates must succeed, or all must fail (roll back). Consistency - The database will remain consistent after transactions have been processed. Isolation - During a transaction, any data modified by the transaction is invisible to other concurrent transactions. Concurrent transactions are isolated from each other, and the results of a transaction are not accessible to other concurrent transactions until the transaction has completed. This is accomplished by acquiring locks on the data objects that a transaction accesses and releasing them at the end of the transaction. Durability - When a transaction commits, the changes are permanent and can be recovered in case of a system failure.
Serializability and linearizability are two of the most restrictive consistency models defined by the database and distributed systems communities. It is possible to achieve the strongest data consistency by combining these two models.
ACID refers to this as transactional consistency, the “C” in ACID. Transactions involve the reading and writing of multiple data objects at the same time. As a result of serializability, executing a set of concurrent transactions over multiple items is equivalent to performing them sequentially.
This refers to “reads” and “writes” to single data objects. All clients should always see the most recent value of a data object. After a write to a data object succeeds, all subsequent reads must return the value of that write until the object is modified again. Linearizable consistency is concerned with replica consistency, essentially the “C” in the CAP theorem.
The transactional semantics ensure that all operations succeed or fail.
START TRANSACTION;SELECT @income:= MAX(income) FROM employees; INSERT INTO employees(emp_id, emp_name, emp_age, city, income) VALUES (100, 'Mr X', 45, 'Delhi', 99999); INSERT INTO orders(order_id, prod_name, order_num, order_date, emp_id) VALUES (600, 'Laptop', 34324, '2022-07-20', 100);COMMIT;
Committing a transaction in a single node database is relatively straightforward. In a transaction log file, the database engine ensures that transaction modifications and states are persisted on the disk. On the restart, the transaction log can be used to restore a consistent database state if the database engine fails.
The process becomes more complicated if “employees” and the “orders” tables reside in different databases or partitions in a distributed database. To ensure that both nodes agree on the outcome, you need an algorithm.
A two-phase commit (2PC) is a classic distributed transaction consensus algorithm. It is widely implemented in established relational databases.
A coordinator or leader drives the protocol. As part of a multipartition transactional update in a distributed SQL database, the coordinator can be one of the partitions being updated. A coordinator is selected when a database client initiates a transaction. Clients receive a globally unique transaction identifier (“tid”) from the coordinator. A “tid” identifies a data structure maintained by the coordinator called the transaction context. Transaction context records the database partitions, or participants, that participate in a transaction and their communication status. In order to maintain the transaction’s state durably, the coordinator persists the context.
The client then executes the database operations defined by the transaction, passing the “tid” to each participant. Participants acquire locks on mutated objects and perform operations locally. The “tid” is also durably associated with the updates in a local transaction log. Only after the transaction commits will these database updates be completed.
After all operations in the transaction have been completed successfully, the client attempts to commit the transaction. As a result, the 2PC algorithm is applied to the coordinator, which drives two rounds of voting:
All participants are notified by the coordinator to prepare for the commit transaction. Successful preparation ensures that a participant will be able to commit to the transaction and make it durable. After this, it cannot unilaterally abort the transaction. If a participant cannot prepare, i.e., cannot guarantee to commit the transaction, it must abort. By returning a message, each participant informs the coordinator of their decision to commit or abort.
As soon as all participants have responded to the “prepare” phase, the coordinator examines the results. The coordinator sends a commit message to each participant, and as soon as all participants commit, the whole transaction commits. The coordinator sends an abort message to each participant if one of them decides to abort the transaction or does not reply within a specified time period.
2PC Failure Modes
There are two main failure modes for 2PC. Failures of participants and coordinators fall into these categories. A failure can occur when a system crashes or is partitioned from the rest of the application. From the perspective of 2PC, the crashes and partitions are indistinguishable.
When a participant crashes before the “prepare” phase is completed, the coordinator aborts the transaction. It is a straightforward failure scenario. Participants can also reply to the “prepare” message and then fail. When the participant restarts, it must communicate with the coordinator to determine the outcome of the transaction. Using its transaction log, the coordinator can look up the outcomes and inform the recovered participant. After that, the participant completes the transaction locally. As long as the correct transaction outcome is reached, participant failure does not threaten consistency.
In the event that the coordinator fails to send the “prepare” message, participants are faced with a dilemma. If a participant has voted to commit, they must block until the coordinator informs them of the outcome of the transaction. When the coordinator crashes before or during sending out the commit messages, participants cannot proceed, as the coordinator has failed and will not be able to send the transaction outcome until it has recovered.
There is no simple solution to this problem. In the absence of information about how other participants voted, a participant cannot autonomously decide to commit. The semantics of a transaction would be violated if one participant voted to roll back, and others to commit. The only practical solution is for participants to wait until the coordinator recovers and examines the transaction log. Using the log, the coordinator is able to resolve all incomplete transactions. It will inform the participants to commit if it has logged a commit entry for an incomplete transaction. In any other case, the transaction will be rolled back.
Using the transaction coordinator recovery and the transaction log, incomplete transactions can be finalized and consistency is maintained in the system. Participants must block while the coordinator recovers, which is a downside.
Additionally, participants must hold locks on the data objects that have been mutated by the transaction during this time. To ensure transaction isolation, locks are necessary. Other concurrent transactions will be blocked if they try to access these locked data items. As a result, response times are increased and requests may time out. Based on the characteristics of the system design, this can cause cascading failures, circuit breakers to open, and other generally undesirable outcomes in heavily loaded systems or during request spikes.
Due to its incapacity to tolerate coordinator failure, 2PC has a weakness. As with all single point of failure problems, it is possible to replicate the coordinator and transaction state across participants. A participant can be promoted to the coordinator if the coordinator fails. This path leads to a solution that requires a distributed consensus algorithm
Distributed Consensus Algorithms
Consensus, or agreement, on every replica value, is required to implement replica consistency so that all clients see the same replica values. An object’s replicas must be updated in the same order at every replica. Distributed consensus algorithms are required to make this possible.
Fault-tolerant consensus approaches use algorithms known as atomic broadcast, total order broadcast, or replicated state machines. A set of values, or states, is delivered to multiple nodes exactly once, and in the same order. 2PC is also a consensus algorithm.
There are a number of well-known consensus algorithms. Raft, for example, uses a leader-based atomic broadcast algorithm. To ensure a consistent order of updates, a single leader receives clients’ requests, establishes their order, and broadcasts it atomically to followers.
Leslie Lamport’s Paxos, probably the most well-known consensus algorithm, is leaderless.
Consensus algorithms must be fault tolerant in the face of both leader and follower failures. If a leader fails, a single new leader must be elected and all followers must agree on the new leader. Algorithms use different approaches for selecting new leaders, but at their core, they all require:
- Detection of failed leaders.
- Followers nominate themselves as leaders.
- Electing a new leader through voting, possibly in multiple rounds.
- A recovery protocol to ensure all replicas attain a consistent state after a new leader is elected
Book: Designing Data-Intensive Applications
Book: Foundations of Scalable Systems