IF: Document correct process for switching to a backup producer #93
Comments
We should clarify how BP fail-over for block proposals can be handled separately from the process for block finalization. The pause/resume endpoints in the producer plugin remain relevant for quickly switching block production over to a backup node. With the HotStuff transition, produced blocks (and the BP signature included with them) no longer convey an attestation with regard to the finality algorithm, so there is less risk to the network if the BP signs two conflicting blocks. BPs could therefore still use an off-chain method leveraging pause/resume to quickly switch block production over to a backup node. However, with the HotStuff changes, we can and should also reduce latency in a producer schedule change, so it would not be burdensome to handle that on-chain as well. In the case of a BP switching block finalization from one of their nodes to another, we want to strongly encourage BPs to use separate finalizer keys for each machine and to handle the switchover using an on-chain action that changes their active finalizer key. Due to the speed of IF, this process should be fast (seconds).
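As a rough illustration of the off-chain pause/resume switchover described above, here is a minimal Python sketch. It assumes both nodes expose the producer API plugin's `/v1/producer/pause` and `/v1/producer/resume` endpoints on an address reachable by the operator; the function names, the `force` flag, and the host URLs in the usage note are hypothetical and not part of nodeos.

```python
import json
import urllib.request


def post(url: str) -> dict:
    """POST an empty JSON body to a nodeos endpoint and return the decoded reply."""
    req = urllib.request.Request(
        url,
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read() or b"{}")


def switch_block_production(primary: str, backup: str, call=post, force: bool = False) -> list:
    """Pause production on the primary node, then resume it on the backup.

    Returns the URLs that were called successfully, in order.
    If the primary cannot be reached, the error is re-raised unless
    force=True (hypothetical flag), in which case the backup is resumed
    anyway. That is the case where two conflicting blocks could be signed.
    """
    calls = []
    pause_url = f"{primary}/v1/producer/pause"
    try:
        call(pause_url)
        calls.append(pause_url)
    except OSError:
        if not force:
            raise
    resume_url = f"{backup}/v1/producer/resume"
    call(resume_url)
    calls.append(resume_url)
    return calls
```

A call such as `switch_block_production("http://10.0.0.1:8888", "http://10.0.0.2:8888")` (placeholder addresses) would pause the first node before resuming the second. Note that this only covers block production; it does not move finalizer duties between machines.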
Not sure what exactly the plan is here, but keep in mind that in some BP operations the person managing the infrastructure is different from the one who can execute an on-chain transaction. The ability to fail over nodes without touching the chain is a highly desirable feature to keep.
Great feedback! Would it be an acceptable compromise to use linkauth and a custom permission, so that a dedicated key can sign the transaction that switches the finalizer key on-chain (and can authorize only that action)? The problem with not doing it on-chain is that it isn't technically safe unless another consensus algorithm is used to ensure safety among the replicas of each BP (which would add additional latency as well). We are trying to be more rigorous with consensus safety going forward with the Instant Finality switchover. The high time-to-finality of the current algorithm has reluctantly pushed us to accept the current "unsafe" approach that producers use, since the probability of enough BPs messing up at the same time to cause an actual finality violation is low. But with Instant Finality, this latency penalty shouldn't exist. So I am hoping to encourage BPs to adopt best practices. Obviously, BPs are ultimately the ones who get to decide how they manage their keys and operations. In the future, perhaps economic disincentives (e.g. automatically slashing bonds for double signing, which the new consensus algorithm enables), if adopted by the BPs, could change the playing field enough to convince each BP that it is best for their net economic outcome to adopt those best practices (e.g. balancing risk of missed income due to loss of availability versus risk of lost money due to slashing). But it would be ideal if we could remove as many of the obstacles BPs currently face in adopting the best practices as possible, so that BPs can comfortably adopt them for post-IF operation as soon as possible.
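To make the linkauth suggestion concrete, the following Python sketch builds the two system actions involved: `updateauth` to create a custom permission holding the dedicated key, and `linkauth` to bind that permission to a single action. The field names follow the standard `eosio::updateauth` and `eosio::linkauth` actions. The account name, permission name, key placeholder, and especially the name of the finalizer-key action are assumptions, since that action is still being decided in AntelopeIO/reference-contracts#24.

```python
def finalizer_switch_permission(account, permission, public_key, action, contract="eosio"):
    """Return the actions that create `permission` and restrict it to `action`.

    `action` is the on-chain action that changes the active finalizer key;
    its real name is not settled in this thread, so the caller must supply it.
    """
    authorization = [{"actor": account, "permission": "active"}]
    update_auth = {
        "account": "eosio",
        "name": "updateauth",
        "authorization": authorization,
        "data": {
            "account": account,
            "permission": permission,
            "parent": "active",
            # Single-key authority: the dedicated infrastructure key.
            "auth": {
                "threshold": 1,
                "keys": [{"key": public_key, "weight": 1}],
                "accounts": [],
                "waits": [],
            },
        },
    }
    link_auth = {
        "account": "eosio",
        "name": "linkauth",
        "authorization": authorization,
        "data": {
            "account": account,
            "code": contract,          # contract that hosts the action
            "type": action,            # the only action this permission may authorize
            "requirement": permission,
        },
    }
    return [update_auth, link_auth]


# Hypothetical names throughout; "setfinkey" is a placeholder action name.
actions = finalizer_switch_permission("bpaccount111", "finswitch", "<PUBLIC_KEY>", "setfinkey")
```

The account owner would push these two actions once. Afterwards the infrastructure operator can hold only the `finswitch` key, which is limited to the linked action and cannot transfer funds or change other permissions.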
Other reading material relevant to this discussion is provided in this old issue: eosnetworkfoundation/mandel#265. Note that Solution 1 described in that issue is essentially the current path we are considering for the system contract accompanying Leap 5.0 and the launch of Instant Finality (that work is tracked in AntelopeIO/reference-contracts#24). That old issue also describes a "Solution 2", which provides an alternative "off-chain" mechanism to prevent BP backup nodes from double-signing that still remains safe (the high-level idea is still good, but the details of the design need updates to reflect the specific constraints imposed by the on-chain consensus algorithm of HotStuff that was selected for Instant Finality). It requires using a completely different consensus algorithm within an internal network composed of just the BP nodes, so it is not under consideration for the Leap 5.0 timeframe. The additional development complexity does however provide some benefits that may be of interest to the BPs:
I personally believe that with the very low time-to-finality provided by IF, the latencies involved with the on-chain method are already low enough to not cause any significant risk of the BP nodes being unavailable to contribute to block finalization for any noticeable amount of time. The other risks, limitations, or costs of the on-chain solution also appear to me to be negligible given the safety win it enables for the entire EOS network, compared to the risks currently imposed on the network by the typical (theoretically unsafe) methods used to handle failover between BP machines now. However, if there is still significant concern with the long-term use of an on-chain BP failover method, perhaps that should influence prioritization of the development of "Solution 2", to be delivered some time in the future as a way to eventually improve upon the limitations of the recommended on-chain BP failover method without giving up safety.
Labelled as pending discussion after relevant decisions are made as part of AntelopeIO/reference-contracts#24.
Documentation should also capture how BPs should be connected to other BPs (including potentially standby nodes) and how they should configure their vote-threads to ensure that all the BP nodes that may participate in consensus send/receive vote messages so that finality can still advance.
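One way to sanity-check such a topology before relying on it is a simple reachability test over the planned peer links. The sketch below is only an illustration: the node names are hypothetical, and it assumes that a node with vote processing disabled (for example, vote-threads set to 0) does not relay vote messages, which should be confirmed against the nodeos documentation for the release in use.

```python
from collections import deque


def isolated_voters(links, voters, relays):
    """Return the voting nodes that cannot exchange votes with the others.

    links  : iterable of (node_a, node_b) p2p connections, treated as two-way
    voters : nodes that may participate in finality (active or standby)
    relays : nodes assumed to process and forward vote messages
    """
    # Keep only links where both ends forward votes (assumption stated above).
    adjacency = {}
    for a, b in links:
        if a in relays and b in relays:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    ordered = sorted(voters)
    if not ordered:
        return []

    # Breadth-first search from one voter over vote-relaying links.
    start = ordered[0]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)

    return [v for v in ordered if v not in seen]


# The standby node is reachable only through an API node that does not relay votes.
links = [("bp1-main", "bp2-main"), ("bp1-standby", "api"), ("api", "bp2-main")]
voters = {"bp1-main", "bp1-standby", "bp2-main"}
relays = {"bp1-main", "bp1-standby", "bp2-main"}
print(isolated_voters(links, voters, relays))  # ['bp1-standby']
```

In this example the standby node would come up after a failover without a path for its votes, so it needs a direct link to at least one vote-relaying peer.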
See: Between these three documents, we cover the intended documentation of this issue. Improvements and reorganization can be covered in future issues.
Depends on AntelopeIO/reference-contracts#24 for decision on how finalizers register their keys.
Need to document a new process for switching to backup.