THE BYZANTINE FAULT — REPELLING AN ATTACK

- Falk Borgmann

On the defense of distributed systems

In our last blog entry about the Byzantine fault, we demonstrated how a one-third minority of compromised nodes can prevent consensus and destroy trust in the entire cluster. This gives rise to the question of how a system must be constructed to be Byzantine fault-tolerant.

Byzantine fault tolerance for distributed systems requires two thirds of the nodes participating in a cluster to be working correctly at any time. Then, the system cannot be attacked. The percentage of nodes compromised by attackers must therefore remain below the one-third threshold. To confirm this assumption, we are going to simulate an attack scenario in this post as well. But this time, on a four-node cluster. We will use agreement on a meeting time as the objective of the transaction here as well.

Deep Shore

Figure 1: An intact four-node cluster

Node A triggers the process by sending a message with the suggestion of “4:30 pm” simultaneously to nodes B, C, and D. The nodes B, C, and D in turn confirm to one another and the sender A that they have received the information “4:30 pm.” The information is consistent across all nodes, the transaction is validated, consensus has been reached. The meeting will thus take place at 4:30 pm.

Deep Shore

Figure 2: Attack on a four-node cluster

In this illustration, node A has been compromised and is now under the control of the attacker. Three different times are sent to the peers through this hacked node to generate confusion about when the actual meeting is to take place. The loyal instances immediately detect that they are receiving contradictory information from their peers — yet they are unable to easily determine which one or ones are the source of the problem. Without this knowledge, repairing and securing the system is no more possible than it is to reach a valid agreement for the meeting time.

Deep Shore

Figure 3: Validation in a four-node cluster

The challenge for each one of the loyal nodes is thus to a) determine which of the nodes is trustworthy, and b) achieve a quorum of over two thirds versus the malicious nodes. In the case of a four-node system, that means at least three intact nodes must unite against the one compromised node. This is achieved using a simple counter proposal. If, for instance, node C detects that a received message cannot be validated, it sends its own proposal with the content “4:30 pm.” to all peers for validation. The intact nodes B and D confirm this value both to the sender as well as amongst themselves. Only the compromised node A will send a value that deviates, “2:00 pm.”, back to the cluster or, alternatively, tries to disguise its compromised state towards C with a correct answer which is initially successful. Thus, C receives a confirmation from all nodes and assumes that “4:30 pm.” will hold. From the perspective of D and B, two out of three nodes have confirmed “4:30 pm.” Since there is no reason to doubt the achieved quorum of three, D and B will accept the proposal as a consensus – the transaction “4:30 pm.” is valid. A is powerless against three nodes. For D and B, this procedure has also revealed A as the originator of the inconsistent information. Except for C, the attacker is now known to everyone. By way of a validation message, D and B could trigger the unmasking of making the diverging answers of A also known to C. Since all loyal nodes form the system quorum, A can subsequently be excluded from the system.

This simplified model does not, of course, take into account all eventualities that a real transaction can involve. For instance, individual messages within a cluster are not always sent at the exact same time. Instead, the process in a distributed system is always sequential. This means that, in the real world, the recognition of which node is infected can only take place after the validation process has completed. The protection mechanism thus only fully applies to subsequent transactions.

To make the model more complete, the specific process of blocking an infected node would need to be considered. After the malicious node has been identified, a mechanism is used in which the trusted nodes distribute a proposal to remove node A, which they have identified as not trustworthy, from the system. The mechanism could be “applied for” just like a validation message in the cluster. Since the three properly functioning nodes were in agreement and together can create a quorum, a valid consensus has been established that this action is correct. The attack is thus repelled, and the system is trustworthy with respect to Byzantine fault tolerance.

Share