IF: Document correct process for switching to a backup producer #93

Closed
Tracked by #39
ericpassmore opened this issue Aug 18, 2023 · 8 comments
Labels
discussion documentation Improvements or additions to documentation

Comments

@ericpassmore
Contributor

ericpassmore commented Aug 18, 2023

Depends on AntelopeIO/reference-contracts#24 for decision on how finalizers register their keys.

Need to document a new process for switching to backup.

@arhag
Member

arhag commented Aug 18, 2023

We should clarify how BP fail-over for block proposals can be kept separate from the process for block finalization.

The pause/resume endpoints in the producer plugin can still remain relevant for a quick process of switching over block production to a backup node. With the HotStuff transition, produced blocks (and the BP signature included with them) would no longer convey an attestation with regard to the finality algorithm, so there is less of a risk to the network if the BP signs two conflicting blocks. BPs could therefore still use an off-chain method leveraging pause/resume to quickly switch block production over to a backup node. However, with the HotStuff changes, we can and should also reduce the latency of a producer schedule change, so it would not be burdensome to handle that on-chain as well.
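As a rough illustration of the off-chain switchover described above, the sketch below builds the two HTTP calls involved: pause production on the current node, then resume it on the backup. It assumes both nodes run nodeos with the producer API plugin enabled and reachable at the given addresses; the addresses and helper names are hypothetical and not part of any official tooling.

```python
import urllib.request

# Paths exposed by the nodeos producer API plugin.
PAUSE_PATH = "/v1/producer/pause"
RESUME_PATH = "/v1/producer/resume"


def build_switchover_requests(primary_url, backup_url):
    """Return the ordered POST requests for an off-chain switchover.

    The primary is paused first so that the window in which both
    nodes could produce is kept as small as possible.
    """
    steps = [(primary_url, PAUSE_PATH), (backup_url, RESUME_PATH)]
    return [
        urllib.request.Request(base.rstrip("/") + path, data=b"{}", method="POST")
        for base, path in steps
    ]


def run_switchover(primary_url, backup_url, timeout=2.0):
    """Send the requests in order; a failed pause raises before resume is sent."""
    for req in build_switchover_requests(primary_url, backup_url):
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
```

Note that in this sketch an unreachable primary causes the pause call to raise, so the backup is never resumed. An operator would have to decide whether resuming anyway is acceptable in that situation.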

In the case of a BP switching over block finalization from one of their nodes to another, we want to strongly encourage the BPs to use separate finalizer keys for each machine and to handle the switchover using an on-chain action that changes their active finalizer key. Due to the speed of IF, this process should be fast (seconds).

@matthewdarwin

Not sure what exactly the plan is here, but keep in mind that in some BP operations the person managing the infrastructure is different from the one who can execute an on-chain transaction. The ability to fail over nodes without touching the chain is a highly desirable feature to keep.

@arhag
Member

arhag commented Aug 21, 2023

@matthewdarwin:

Great feedback! Would it be an acceptable compromise to use linkauths and custom permissions so that a dedicated key can sign the transaction that switches the finalizer key on-chain (and can authorize only that one action)?
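To make the suggestion concrete, here is a minimal sketch of the two system actions such a setup would need: `updateauth` to create a custom permission holding the dedicated key, and `linkauth` to restrict that permission to a single action. The action name `actfinkey` and the permission name `finswitch` are assumptions for illustration (the finalizer key actions were still undecided in AntelopeIO/reference-contracts#24 at the time of this comment), and the account name and key in the test are placeholders.

```python
def finalizer_switch_permission_actions(account, switch_pubkey,
                                        permission="finswitch",
                                        switch_action="actfinkey"):
    """Build updateauth + linkauth actions for a single-purpose key.

    `permission` and `switch_action` are hypothetical names; adjust them
    to whatever the deployed system contract actually exposes.
    """
    authority = {
        "threshold": 1,
        "keys": [{"key": switch_pubkey, "weight": 1}],
        "accounts": [],
        "waits": [],
    }
    signer = [{"actor": account, "permission": "active"}]

    # Create (or update) the custom permission as a child of `active`.
    update = {
        "account": "eosio",
        "name": "updateauth",
        "authorization": signer,
        "data": {"account": account, "permission": permission,
                 "parent": "active", "auth": authority},
    }
    # Link the permission so it satisfies only the key-switch action.
    link = {
        "account": "eosio",
        "name": "linkauth",
        "authorization": signer,
        "data": {"account": account, "code": "eosio",
                 "type": switch_action, "requirement": permission},
    }
    return [update, link]
```

The returned dictionaries follow the usual action JSON layout and would still need to be packed into a transaction and signed with the account's `active` authority. After that, the infrastructure operator could hold only the dedicated key.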

The problem with not doing it on-chain is that it isn't technically safe unless another consensus algorithm is used to ensure safety among the replicas of each BP (which would add additional latency as well). We are trying to be more rigorous with consensus safety going forward with the Instant Finality switchover. The high time-to-finality of the current algorithm has reluctantly pushed us to accept the current "unsafe" approach that producers use since the probability of enough BPs messing up at the same time to cause an actual finality violation is low. But with Instant Finality, this latency penalty shouldn't exist. So I am hoping to encourage BPs to adopt best practices.

Obviously, BPs are ultimately the ones who get to decide how they manage their keys and operations. In the future, economic disincentives, if adopted by the BPs (e.g. automatically slashing bonds for double signing, which the new consensus algorithm enables), could change the playing field enough to convince each BP that adopting those best practices is best for their net economic outcome (e.g. balancing the risk of missed income due to loss of availability against the risk of lost money due to slashing). But it would be ideal if we could remove as many as possible of the obstacles BPs currently face in adopting the best practices, so that they can comfortably adopt them for post-IF operation as soon as possible.

@matthewdarwin

Could we discuss this at the next node operator round table? @bhazzard @heifner

@arhag
Member

arhag commented Aug 25, 2023

Other reading material relevant to this discussion is provided in this old issue: eosnetworkfoundation/mandel#265

Note that Solution 1 described in that issue is essentially the current path we are considering for the system contract accompanying Leap 5.0 and the launch of Instant Finality (issue for that work is tracked in AntelopeIO/reference-contracts#24).

That old issue also describes a "Solution 2" which provides an alternative "off-chain" mechanism to prevent BP backup nodes from double-signing that still remains safe (the high-level idea is still good, but the details of the design need updates to reflect the specific constraints imposed by the on-chain consensus algorithm of HotStuff that was selected for Instant Finality). It requires using a completely different consensus algorithm within an internal network composed of just the BP nodes, so it is not under consideration for the Leap 5.0 timeframe. The additional development complexity does however provide some benefits that may be of interest to the BPs:

  • it is a safe way of preventing double signing without forcing the BPs to be dependent on the overall liveness of the blockchain itself;
  • it does not require expending on-chain resources (CPU/NET) to change the active key/machine;
  • and, it should have slightly lower latencies for switching the active key/machine than the on-chain method.

I personally believe that with the very low time-to-finality provided by IF, the latencies involved with the on-chain method are already low enough to not cause any significant risk of the BP nodes being unavailable to contribute to block finalization for any noticeable amount of time. The other risks, limitations, or costs of the on-chain solution also appear to me to be negligible given the safety win it enables for the entire EOS network, compared to the risks currently imposed on the network by the typical (theoretically unsafe) methods used to handle failover between BP machines now.

However, if there is still significant concern with the long-term use of an on-chain BP failover method, perhaps that should influence prioritization of the development of "Solution 2", to be delivered some time in the future as a way to eventually improve upon the limitations of the recommended on-chain BP failover method without giving up safety.

@bhazzard

bhazzard commented Sep 7, 2023

Labelled as pending discussion after relevant decisions are made as part of AntelopeIO/reference-contracts#24.

@arhag
Member

arhag commented Apr 15, 2024

Documentation should also capture how BPs should be connected to other BPs (including potentially standby nodes) and how they should configure their vote-threads to ensure that all the BP nodes that may participate in consensus send/receive vote messages so that finality can still advance.
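One way to review a set of node configurations for the concern above is sketched below. It reads `config.ini`-style text for each node and flags any node that does not set `vote-threads` to a positive number. The working assumption, which should be checked against the nodeos documentation for the release in use, is that a node needs a non-zero `vote-threads` value to process and forward vote messages. Defaults for nodes that omit the option may differ by node role, so a flagged node means "review this", not "misconfigured". The node labels and helper names are hypothetical.

```python
def read_option(config_text, name):
    """Return the last value set for `name` in config.ini-style text, or None."""
    value = None
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, val = line.split("=", 1)
        if key.strip() == name:
            value = val.strip()
    return value


def nodes_to_review_for_votes(configs):
    """`configs` maps a node label to its config text; return labels to review.

    A node is flagged when vote-threads is missing, non-numeric, or zero.
    """
    flagged = []
    for label, text in configs.items():
        threads = read_option(text, "vote-threads")
        if threads is None or not threads.isdigit() or int(threads) == 0:
            flagged.append(label)
    return sorted(flagged)
```

Running this over every node on the path between a BP's finalizer nodes and its peers (including standby and relay nodes) gives a quick list of places where votes might stop propagating.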

@arhag arhag transferred this issue from AntelopeIO/leap Apr 29, 2024
@arhag
Member

arhag commented Aug 21, 2024

@arhag arhag closed this as completed Aug 21, 2024
@github-project-automation github-project-automation bot moved this from Blocked to Done in Team Backlog Aug 21, 2024

5 participants