RATIS-1273. Fix split brain by leader lease #383
@szetszwo Could you help review this proposal?
Sure, will review this.
The design looks good. Will look at the code changes.
Question: when the leader has lost the leader lease, should it step down?
The calculation is a little bit tricky. See the comment inlined.
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java
Sure, the plan sounds great.
I agree with @GlenGeng that we may not need LEADER_LEASE_TIMEOUT_RATIO_KEY. If we take rpc-send-time right before sending out appendEntries, it seems pretty safe to use min-rpc-timeout as the leader-lease-timeout. If split brain happens, it has to take at least (min-rpc-timeout + leader election time) to elect a new leader, so the old leader's lease must have expired by that time. @runzhiwang, what do you think?
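The timing argument above can be checked with a little arithmetic. The following is only an illustrative sketch, not code from this PR: the class and method names are hypothetical, and it assumes all servers' clocks advance at the same rate (no drift).

```java
// Hypothetical sketch of the timing argument; not part of the Ratis code base.
final class LeaseTimingSketch {
  // The old leader's lease expires min-rpc-timeout after the send time it
  // recorded right before sending out appendEntries.
  static long leaseExpiryMs(long rpcSendTimeMs, long minRpcTimeoutMs) {
    return rpcSendTimeMs + minRpcTimeoutMs;
  }

  // A follower that received that appendEntries must wait at least
  // min-rpc-timeout before starting an election, and the election itself
  // takes some time before a new leader exists.
  static long earliestNewLeaderMs(long receiveTimeMs, long minRpcTimeoutMs,
      long electionDurationMs) {
    return receiveTimeMs + minRpcTimeoutMs + electionDurationMs;
  }

  public static void main(String[] args) {
    long send = 1000, receive = 1005, minRpcTimeout = 150, election = 20;
    long expiry = leaseExpiryMs(send, minRpcTimeout);
    long newLeader = earliestNewLeaderMs(receive, minRpcTimeout, election);
    // Because receive >= send, the lease always expires first.
    System.out.println("lease expires at " + expiry
        + ", earliest new leader at " + newLeader);
  }
}
```

Since the receive time can never be earlier than the send time, the lease expiry is never later than the earliest possible new leader, which is why taking rpc-send-time before sending matters.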
@szetszwo I agree. Some unit tests related to this PR are failing; let me fix them.
@runzhiwang, thanks a lot. BTW, we should add confs to enable/disable PreVote and LeaderLease. Some applications may not require these features. This is suggested by @bshashikant.
@szetszwo @bharatviswa504 Thanks, got it. I will add the config in the next PR.
@szetszwo @bharatviswa504 Sorry, I have a question. I understand that LeaderLease may not be needed in some applications. But
@runzhiwang, in general, I agree that PreVote should help all applications. However, PreVote needs an additional phase before the real election, which could potentially slow down some applications. Individual applications like Ozone may want to benchmark it. If it is not configurable, it is impossible to benchmark.
@szetszwo Thanks, got it. With leader lease, some unit tests have become flaky; I need some time to fix them.
@runzhiwang, no problem. Please take your time. Thanks a lot for working hard on this.
For now, @runzhiwang and I are developing SCM HA. In SCM HA, SCM will cache SCM HA does not invoke I suggest to implement the leader with lease solution in What do you think @szetszwo @runzhiwang?
@szetszwo Hi, with leader lease, the CI becomes flaky. There are 2 reasons:
I think we have the following options:
What do you think?
@runzhiwang , thanks again for working on this.
With the leader lease feature, the leader probably should send heartbeats separately, since followers may take a long time to process log entries. (The followers do not count the log processing time when counting the heartbeat timeout. However, it is impossible for the leader to apply the same discount.)
Let's disable leader lease by default. When the feature becomes stable, we can change the default to enabled.
@szetszwo Hi, I find it's almost impossible to enable leader lease in CI, because it sometimes takes 300ms from the leader sending a heartbeat to the follower receiving it. So CI will become very unstable unless we increase rpc.timeout.min. Besides, what do you think of @GlenGeng's suggestion that the leader step down when its lease becomes invalid? In SCM HA, we do not check isLeaderReady; we depend on StateMachine#notifyLeaderChanged to change leadership.
Yes, we may increase rpc.timeout.min if necessary.
Let's also make it configurable? It seems that both ways have their own benefits.
@szetszwo Thanks, I agree.
This PR depends on: #398
Hi, I am not sure whether you are still working on this JIRA. For a good Raft implementation, I think leader lease is very important, so we should continue and complete the work. If needed, it would be my pleasure to continue this work! @szetszwo
It seems that @runzhiwang is no longer working on this. (Please correct me if I am wrong.) @JacksonYao287, please feel free to take over this. Thanks a lot.
@szetszwo Sorry for the delay. @JacksonYao287, please feel free to take over this, thanks.
What changes were proposed in this pull request?
What's the problem?
For example, there are 3 servers, s1, s2, and s3, and s1 is the leader. When split brain happens, s2 is elected as the new leader, but s1 still thinks it is the leader. If a client reads from s1 after s2 has processed a write request, the client will read stale data from s1.
How to fix?
As the Raft paper describes, assign the leader a lease, which the leader maintains with the normal heartbeat mechanism. Once the leader's heartbeats are acknowledged by a majority of the cluster, it extends its lease to start + election timeout, since the followers shouldn't time out before then. This ensures that no new leader is elected before start + election timeout (this needs the pre-vote feature and needs to consider the transferLeadership feature), so split brain cannot happen before then.
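One way to compute the lease from heartbeat acknowledgements can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in this PR: the class and method names are hypothetical, "start" is taken to be the send time of the most recent heartbeat that a majority (counting the leader itself) has acknowledged, and times are plain millisecond values.

```java
import java.util.Arrays;

// Hypothetical sketch; names and structure are not taken from Ratis.
final class LeaderLeaseSketch {
  // followerAckedSendTimesMs[i] is the send time of the latest heartbeat
  // that follower i has acknowledged. Returns the most recent time at which
  // a majority of the cluster was known to follow this leader.
  static long leaseStartMs(long[] followerAckedSendTimesMs, long nowMs) {
    int clusterSize = followerAckedSendTimesMs.length + 1;
    long[] times = Arrays.copyOf(followerAckedSendTimesMs, clusterSize);
    times[clusterSize - 1] = nowMs; // the leader counts itself
    Arrays.sort(times);
    int majority = clusterSize / 2 + 1;
    // The majority-th largest value is the lease start.
    return times[clusterSize - majority];
  }

  // The lease holds until start + election timeout.
  static boolean isLeaseValid(long[] followerAckedSendTimesMs, long nowMs,
      long electionTimeoutMs) {
    return nowMs < leaseStartMs(followerAckedSendTimesMs, nowMs) + electionTimeoutMs;
  }

  public static void main(String[] args) {
    long[] acked = {100L, 40L}; // a 3-server cluster: s2 and s3
    System.out.println(isLeaseValid(acked, 150L, 150L));
    System.out.println(isLeaseValid(acked, 260L, 150L));
  }
}
```

A leader serving reads would check `isLeaseValid` first and refuse (or step down) when it returns false; which of the two behaviours to use is the open question discussed in the conversation.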
[TODO] Why is the pre-vote feature needed?
As the image shows, s1 is the leader but cannot connect to s2. s1 extends its lease to start + election timeout when it receives an acknowledgement from s3. If s1 then becomes isolated from all servers before start + election timeout, s2 may time out, start an election, and become leader immediately with the vote from s3, so both s1 and s2 think they are the leader before start + election timeout. With the pre-vote feature, when s2 requests a vote, s3 checks whether s1's leadership is still valid and, if so, rejects the vote for s2, so only one leader exists.
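The check that s3 performs could look roughly like the sketch below. It is an assumption-laden illustration rather than Ratis code: the class and method names are hypothetical, and "leadership is still valid" is approximated as "s3 heard from the leader less than one election timeout ago".

```java
// Hypothetical sketch of a follower's pre-vote decision; not Ratis code.
final class PreVoteSketch {
  // Grant the pre-vote only if this server has not heard from the current
  // leader for at least one election timeout. While the leader may still
  // hold a valid lease, the candidate is rejected.
  static boolean grantPreVote(long nowMs, long lastLeaderHeartbeatMs,
      long electionTimeoutMs) {
    return nowMs - lastLeaderHeartbeatMs >= electionTimeoutMs;
  }

  public static void main(String[] args) {
    long lastHeartbeatFromS1 = 1000L;
    long electionTimeout = 150L;
    // s2 asks s3 shortly after s3 heard from s1: rejected.
    System.out.println(grantPreVote(1100L, lastHeartbeatFromS1, electionTimeout));
    // s2 asks again after s3 itself would have timed out: granted.
    System.out.println(grantPreVote(1200L, lastHeartbeatFromS1, electionTimeout));
  }
}
```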
[TODO] How to address transferLeadership?
For example, s1 is the leader and extends its lease to start + election timeout when it receives acknowledgements from s2 and s3. Before start + election timeout, an admin may call transferLeadership(s2). If s1 becomes isolated from all servers after sending StartLeaderElectionRequest to s2, s2 starts an election and becomes leader immediately with the vote from s3, so both s1 and s2 think they are the leader before start + election timeout. So s1 should step down to follower when it sends StartLeaderElectionRequest to s2.
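The proposed rule, stepping down as soon as StartLeaderElectionRequest is sent, can be illustrated with a toy state holder. Everything here is hypothetical (the class, the `Role` enum, and the `canServeRead` method are invented for illustration); only the StartLeaderElectionRequest and transferLeadership names come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the old leader; not Ratis code.
final class TransferLeadershipSketch {
  enum Role { LEADER, FOLLOWER }

  private Role role = Role.LEADER;
  private final List<String> sentMessages = new ArrayList<>();

  // Send StartLeaderElectionRequest to the target, then step down at once,
  // so an unexpired lease can no longer be used to serve reads.
  void transferLeadership(String target) {
    sentMessages.add("StartLeaderElectionRequest -> " + target);
    role = Role.FOLLOWER;
  }

  // Reads are served only by a leader whose lease is still valid.
  boolean canServeRead(boolean leaseValid) {
    return role == Role.LEADER && leaseValid;
  }

  Role getRole() {
    return role;
  }

  List<String> getSentMessages() {
    return sentMessages;
  }
}
```

With this rule, even if s1 is isolated right after the request is sent and its lease has not yet expired, `canServeRead(true)` returns false on s1, so s1 and s2 never both act as leader.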
@szetszwo Could you help review this proposal?
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-1273
How was this patch tested?
TODO