One of the advantages of distributed system over stand-alone system is better fault tolerance.
How is this done? An easy way is to back up. The working mode of a system is: accept the command of the client, the system processes it and returns the processing result to the client. It can be seen that the data in the system may change due to the command.
One way to realize backup is replication state machine (RSM), which has very important attribute certainty:
That is to say, if we can apply commands to the state machine in sequence, it can produce the same state and the same output.
So how do you implement a state machine? As shown in the following figure (from raft protocol):
In the above figure, each RSM has a replicated log in which commands from the client are stored. The commands in replicate log in each RSM are in the same order, and the state machine processes the commands in replicate log in sequence and returns the processing results to the client. Because state machines are deterministic, the output and state of each state machine are the same.
There is a module in the above picture-the consensus module is just not mentioned. This module is used to ensure the consistency of logs on each server!
So raft is a consistency protocol and an algorithm to ensure the consistency of copies on the server.
2.raft agreement principle
The following will record the important points I think when reading the paper.
The nature of raft agreement
How to ensure election security
In raft, as long as a candidate gets a majority of votes, he can become a leader. Followers vote under two conditions:
For the election results:
The important point here is: how to ensure that a candidate is determined in a limited time and the votes will not be divided up all the time? Raft uses a more elegant implementation to randomize the election timeout. This makes the timeout of each server different. When a new election begins, some servers are not voters. Or some qualified candidates did not participate in the next round. This practice enabled a single leader to be elected quickly.
2.2 How to ensure that the logs match?
When the leader executes appendEntry rpcs, the message will carry two pieces of information, preLogIndex and preLogTerm. When the follower receives the message, it will first judge whether the index and $ term of its latest log entry are the same as those in RPC, and if so, it will append it.
This ensures that the newly added log must be the same as its previous log. Following this principle from the first log, it is natural to make such an inference.
2.3 How to ensure the integrity of leaders
This is fully proved in the Raft Covenant. This proof is relatively short. I added some notes when I read it in reverse order.
Suppose the leader is the first person who does not include the submission point in the leader (T
2.4 There is an agreement in the RAFT agreement that you can't submit the journal entry of the last issue as the submission point. Why is this?
This problem is mainly due to the confusion in the submission entries of the former $ TERM part of the raft protocol, and it is misunderstood that this protocol is used to ensure that the logs that have been copied to most servers in the previous terms but have not been submitted will not be overwritten in the new terms.
In fact, the process of Figure 8 in this paper is correct.
Index=2 in (c) has not been submitted, and copying in (d) is correct. What the paper wants to explain is that if in (c), the leader submits the previous item, it will still be overwritten in (d), that is to say, it will be overwritten by the item of commit, which is wrong! Therefore, it is agreed that "entries from previous $ TERM cannot be submitted"
2.5 changes in cluster members
For example, if the configuration of the cluster changes, several new servers will be added and several servers will be suspended. This will affect the election.
Raft's solution is a two-stage method, which introduces an over-configuration called * * * identical state. Specific practices and charts:
Consider the above process:
2.6 What should I do if the log is too long or the log playback time is too long?
At this point, you need to compress the log.
Raft's method is a copy-on-write snapshot (in linux, writing or copying can be done through fork)
Copy-on-write is mainly related to performance. If there is too much data in the state machine, snapshots will consume a lot of time, which may lead to a great decrease in system availability.