Raft algorithm
Raft is an algorithm for solving the consensus problem in distributed systems. It simplifies and restricts Multi-Paxos: unlike the famously hard-to-understand Paxos, Raft's primary design goal is understandability, making it an easy-to-understand and easy-to-implement distributed consensus protocol.

Raft decomposes the consensus problem into several key sub-problems, such as leader election, log replication, and safety. This article briefly analyzes the Raft algorithm based on the Raft paper.

Raft uses a strong leader model: everything is driven by the leader. For example, log entries flow only from the leader to the other servers, so leader election is a very important part of the algorithm.

First, the three server states of the Raft algorithm are introduced: leader, candidate, and follower.

There can only be one leader in a cluster at any time.
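To make the three roles concrete, here is a minimal Go sketch (my own illustration, not code from the Raft paper) of how a server's role could be represented:

```go
package raft

// Role is the state a Raft server is in at any given moment.
type Role int

const (
	Follower  Role = iota // default state; responds to RPCs from leaders and candidates
	Candidate             // started an election after its election timeout expired
	Leader                // handles client requests and replicates log entries
)
```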

Raft uses a heartbeat mechanism to trigger leader election. When a server starts up, it is in the follower role. Note that each node's election timeout (how long it waits for a heartbeat from the leader before starting an election) is random, typically in the range of 150-300 ms.

As shown in the figure above, the cluster has three nodes, A, B and C, with election timeouts of 150 ms, 200 ms and 300 ms respectively. Their initial term numbers are all 0, and they all start as followers. Node A's election timeout expires first, so it is the first to change from follower to candidate: it increments its term number, votes for itself, and sends vote requests to the other nodes in the cluster. When nodes B and C receive A's vote request, they grant it, since at this point they have not voted for any other node in this term. Node A has now received votes from more than half of the nodes in the cluster and becomes the leader for term 1.

The above is the simplest election process, but several concepts still need explaining: why do the timeouts differ? What is the term number? What are the rules for comparing votes?

1. Term number

Every leader has its own term of office, and term numbers increase monotonically across the cluster. Every node stores its current term number. When a node becomes a candidate, it increments its current term number before initiating a vote.

In addition, when a node receives a request with a term number higher than its own, it updates its term number to the higher value; if its current role is leader, it steps down from leader to follower. When a node receives a request with a term number smaller than its own, it rejects the request directly.
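A hedged Go sketch of this rule (the struct and its field names are my own, not from the paper):

```go
package raft

// raftNode holds just the fields this sketch needs (names are illustrative).
type raftNode struct {
	currentTerm int
	votedFor    int
	isLeader    bool
}

// handleIncomingTerm applies the rule above: a request carrying a higher
// term makes this node adopt that term and step down to follower; a request
// with a smaller term is rejected.
func (rf *raftNode) handleIncomingTerm(reqTerm int) (reject bool) {
	if reqTerm > rf.currentTerm {
		rf.currentTerm = reqTerm
		rf.votedFor = -1    // no vote cast yet in the new term
		rf.isLeader = false // even a leader steps down on seeing a newer term
		return false
	}
	return reqTerm < rf.currentTerm // reject stale requests
}
```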

2. Voting comparison rules

A. First come, first served: a node can vote only once per term. If nodes A and B both ask node C for its vote and node C votes for A first, it will reject node B's vote request.

B. Log completeness: if the candidate's log, as described in the vote request, is less up-to-date than the node's own log, the node rejects the vote request.

C. Majority rule: when a node receives votes from more than half of the nodes in the cluster, it becomes the leader for that term and starts sending leader heartbeats to the other nodes.

D. While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be the leader. If the leader's term number (included in the RPC) is not smaller than the candidate's current term number, the candidate recognizes the leader as legitimate and returns to the follower state.
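As a rough illustration, rules A and B above might be combined like this in Go (the struct and function names are mine; the field names follow the paper's RequestVote parameters):

```go
package raft

// RequestVoteArgs carries what a candidate sends when asking for a vote.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// grantVote combines rules A and B: vote at most once per term, and only
// for a candidate whose log is at least as up-to-date as our own.
func grantVote(args RequestVoteArgs, currentTerm, votedFor, myLastIndex, myLastTerm int) bool {
	if args.Term < currentTerm {
		return false // stale term: reject
	}
	if votedFor != -1 && votedFor != args.CandidateID {
		return false // rule A: already voted for another candidate this term
	}
	// Rule B: compare last log terms first, then last log indexes.
	if args.LastLogTerm != myLastTerm {
		return args.LastLogTerm > myLastTerm
	}
	return args.LastLogIndex >= myLastIndex
}
```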

3. Random timeout

As mentioned above, each node's election timeout is different. The advantage is that split votes are avoided and leader election completes quickly. If every node had the same timeout, split votes would be likely: if no node received more than half of the votes, another round of election would be needed and the election could take a long time. With a randomized timeout, under normal circumstances only one node initiates a vote request within a given time window.
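A minimal sketch of the randomized timeout (the 150-300 ms range is taken from the example above; the function name is my own):

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout draws a fresh timeout in the 150-300 ms range.
// Because every node draws its own value, normally only one node times out
// first and starts an election, which avoids split votes.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}
```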

The following figure shows how server roles change across the whole cluster.

Once elected, the leader serves client requests: it appends each received command to its log as a new entry, then issues AppendEntries RPCs in parallel to the other servers so that they replicate the entry. When the entry has been safely replicated (stored on more than half of the nodes), the leader applies it to its state machine (executing the command) and returns the result to the client. If followers crash, run slowly, or the network drops packets, the leader keeps retrying AppendEntries RPCs (even after it has responded to the client) until all followers eventually store all log entries.
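The "stored on more than half of the nodes" condition can be sketched as a simple count (a rough illustration; matchIndex follows the paper's naming, the function name is mine):

```go
package raft

// committedByMajority reports whether the entry at index is stored on more
// than half of the cluster. matchIndex[i] is the highest log index known to
// be replicated on peer i; the leader counts itself as one replica.
func committedByMajority(matchIndex []int, index, clusterSize int) bool {
	count := 1 // the leader's own copy
	for _, m := range matchIndex {
		if m >= index {
			count++
		}
	}
	return count > clusterSize/2
}
```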

The figure above shows the log format; each log entry contains three parts: its index in the log, the term number in which it was created, and the command for the state machine.
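Expressed as a Go struct (an illustration of the three parts, not code from the paper):

```go
package raft

// LogEntry is one entry in the replicated log: its position in the log,
// the term in which the leader created it, and the client command for the
// state machine.
type LogEntry struct {
	Index   int         // position of the entry in the log
	Term    int         // leader's term when the entry was created
	Command interface{} // state machine command from the client
}
```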

The leader replicates the log to other nodes through the AppendEntries RPC.

AppendEntries RPC arguments (from the Raft paper): term (the leader's term), leaderId (so followers can redirect clients), prevLogIndex (index of the log entry immediately preceding the new ones), prevLogTerm (term of the prevLogIndex entry), entries[] (the log entries to store, empty for heartbeats), and leaderCommit (the leader's commit index). The results are term (the receiver's current term) and success (true if the follower contained an entry matching prevLogIndex and prevLogTerm).

Receiver implementation (from the Raft paper):
1. Reply false if term < currentTerm.
2. Reply false if the log does not contain an entry at prevLogIndex whose term matches prevLogTerm.
3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all entries that follow it.
4. Append any new entries not already in the log.
5. If leaderCommit > commitIndex, set commitIndex to min(leaderCommit, index of last new entry).

The above is how the receiver handles the AppendEntries RPC parameters. term and leaderId are straightforward, but the role of prevLogIndex and prevLogTerm is to check log consistency: if the follower cannot find an entry in its own log with the same index and term number, it rejects the new entries. The consistency check works like an induction step: the empty log trivially satisfies the Log Matching Property, and the check preserves the property every time the log grows. Therefore, whenever AppendEntries returns successfully, the leader knows that the follower's log is identical to its own, from the first log entry up through the newest entries.
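A hedged Go sketch of the arguments and the consistency check (field names follow the paper's parameters; the function name, the 1-based indexing convention, and the reuse of the LogEntry type from the earlier sketch are my own choices):

```go
package raft

// AppendEntriesArgs mirrors the RPC arguments listed above.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // entries to store (empty for heartbeats)
	LeaderCommit int        // leader's commit index
}

// consistencyCheckOK is the follower-side check described above: the
// follower must already hold an entry at PrevLogIndex whose term is
// PrevLogTerm; otherwise it rejects the request and the leader retries.
// Log indexes start at 1 here, so index 0 means "before the first entry".
func consistencyCheckOK(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // empty-log base case of the induction
	}
	if prevLogIndex > len(log) {
		return false // follower's log is too short
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}
```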

During normal operation the logs of the leader and the followers stay consistent, so the AppendEntries consistency check never fails. However, a leader crash can leave the logs inconsistent (the old leader may not have fully replicated all the entries in its log), as in the situations shown in the following figure.

In the Raft algorithm, the leader resolves inconsistencies by forcing the followers to duplicate its own log. This means that log entries on a follower that conflict with the leader's will be overwritten by the leader's log entries.

The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a new leader is elected, it initializes every nextIndex to the index of its own last log entry plus 1. If a follower's log is inconsistent with the leader's, the next AppendEntries consistency check will fail; after being rejected by the follower, the leader decrements that nextIndex and retries the AppendEntries RPC. Eventually nextIndex reaches a point where the leader's and follower's logs match. At that point AppendEntries succeeds: the follower deletes all conflicting entries in its log and appends the leader's entries (if any). Once AppendEntries succeeds, the follower's log is consistent with the leader's, and it stays that way for the rest of the term.
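A rough sketch of the nextIndex bookkeeping described above (the function names, the map representation, and the send callback are assumptions of this illustration, not the paper's interface):

```go
package raft

// initNextIndex runs when a node becomes leader: every follower's nextIndex
// starts at the leader's last log index + 1.
func initNextIndex(nextIndex map[int]int, peers []int, lastLogIndex int) {
	for _, p := range peers {
		nextIndex[p] = lastLogIndex + 1
	}
}

// retryAppendEntries sketches the backoff loop described above. The send
// callback is assumed to issue an AppendEntries RPC with entries starting at
// fromIndex and report whether the follower's consistency check passed.
// While the follower rejects, the leader steps nextIndex back and retries;
// index 1 (an empty prefix) is always accepted, so the loop terminates.
func retryAppendEntries(nextIndex map[int]int, follower int,
	send func(follower, fromIndex int) bool) {
	for !send(follower, nextIndex[follower]) && nextIndex[follower] > 1 {
		nextIndex[follower]-- // back up one entry and try again
	}
}
```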

This article has briefly introduced Raft's leader election and log replication. Raft has other features that are not covered here; reading the Raft paper is recommended for a complete understanding.

In a previous article I analyzed ZooKeeper's ZAB protocol; the similarities and differences between ZAB and Raft are compared there.

Finally, recommended reading: /blob/master/raft-zh_cn.md (the Chinese translation of the Raft paper).