
3966 add_epoch_root and sync_l1 on Membership #3984

Merged: 1 commit merged into main from the ps/3966B branch on Jan 3, 2025
Conversation

@pls148 (Contributor) commented Dec 19, 2024:

Closes #3966

This PR:

Adds add_epoch_root and sync_l1 to the Membership trait and calls them from within the quorum vote task (a rough sketch of the new trait surface is shown below).
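
For context, the new trait surface looks roughly like the following. This is a hedged sketch only: the exact generic bounds, the epoch and header types, the callback signature, and whether add_epoch_root is async are assumptions, not taken verbatim from the PR.

// Hedged sketch; types and bounds are illustrative only.
pub trait Membership: Sized + Send + Sync {
    type Epoch;
    type BlockHeader;

    /// Called when the epoch-root block is decided. May return a callback
    /// that the caller later applies under a write lock to install the
    /// membership for the new epoch.
    async fn add_epoch_root(
        &self,
        epoch: Self::Epoch,
        block_header: Self::BlockHeader,
    ) -> Option<Box<dyn FnOnce(&mut Self) + Send>>;

    /// Refresh L1-derived state (e.g. the stake table). Also returns an
    /// optional write callback.
    async fn sync_l1(&self) -> Option<Box<dyn FnOnce(&mut Self) + Send>>;
}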

This PR does not:

Key places to review:

.block_number();

// Skip if this is not the expected block.
if task_state.epoch_height != 0 && (decided_block_number + 3) % task_state.epoch_height == 0 {
@sveitser (Contributor) commented Dec 20, 2024:

Currently what would happen if a node misses the block when the update is triggered? Would it never have add_epoch_root called for that epoch?

I'm wondering if the Membership trait should have a method like is_epoch_initialized(&self, epoch) -> bool, so that if a node is not online for the beginning of the epoch we can still initialize the membership for that epoch (see the usage sketch below). But then we also have to dig out the correct header from history, which I think might be cumbersome.

So maybe this should be handled with catchup in the confirmation layer? What do you think @imabdulbasit ?
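
For concreteness, usage of the suggested query might look something like the following. This is a sketch of the idea in this comment only; neither the method nor these variable names are part of the PR.

// Hypothetical usage of the suggested method; not part of this PR.
let needs_init = {
    let membership_reader = task_state.membership.read().await;
    !membership_reader.is_epoch_initialized(current_epoch)
};
if needs_init {
    // Dig the epoch-root header for `current_epoch` out of history (the
    // cumbersome part mentioned above), then call add_epoch_root with it,
    // or fall back to catchup in the confirmation layer.
}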

Reply (Contributor):

I think doing a catchup is better, but I am wondering how we can do that now with Hotshot managing the locks

Reply (Contributor):

For our existing catchup it's hotshot that triggers the catchup (for example by calling validate_and_apply_header) and the confirmation layer that performs the catchup. Maybe it should be the same here: hotshot needs to call something to trigger catchup for the membership and the sequencer performs the catchup.

If we were to do it analogously for membership we would have to make all the reading functions async and &mut self, which seems terrible. I think we can come up with something better.

Reply (Contributor):

For example, hotshot could call add_epoch_root (probably under another name) for every view, and then the sequencer could decide whether it needs to do catchup for that epoch (in most cases doing (almost) nothing and returning None as the write_callback). This way we ensure the catchup is actually triggered (sketched below).

I think our overall design is a bit sub-optimal because it's an invariant that the membership is static for an epoch, but that invariant isn't really reflected in the code. I think eventually we should enforce this in hotshot, and the confirmation layer should somehow provide a constructor for an "EpochMembership" type, taking an epoch as input, that hotshot calls where needed. @ss-es mentioned this would be difficult to do in hotshot, but I'm not sure pushing the complexity elsewhere is actually better.
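
A minimal sketch of the call-every-view idea from the first paragraph, on the sequencer side, assuming an in-memory committee map. All names here (SequencerMembership, ensure_epoch_membership, the catchup closure) are illustrative, not from the PR.

use std::collections::BTreeMap;

type PubKey = Vec<u8>; // placeholder for a real staking key type

// Hypothetical sequencer-side membership state.
struct SequencerMembership {
    committees: BTreeMap<u64, Vec<PubKey>>, // epoch -> stake table
}

impl SequencerMembership {
    /// Called by hotshot every view (per the idea above). Returns None in the
    /// common case where the epoch is already initialized, so no write lock is
    /// ever taken; otherwise does the catchup work and returns a callback that
    /// installs the result under the write lock.
    fn ensure_epoch_membership(
        &self,
        epoch: u64,
        catchup: impl FnOnce() -> Option<Vec<PubKey>>,
    ) -> Option<Box<dyn FnOnce(&mut Self) + Send>> {
        if self.committees.contains_key(&epoch) {
            return None;
        }
        let committee = catchup()?;
        Some(Box::new(move |m: &mut Self| {
            m.committees.insert(epoch, committee);
        }))
    }
}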

Reply (Contributor):

Yeah, but then hotshot has to decide when to call it? I think calling it for every view might be too much. Maybe it makes sense to do catchup when the membership getter functions don't return any data?

@imabdulbasit (Contributor) commented Dec 20, 2024:

Maybe we do catchup at node startup and pass in the Committee after catchup, so hotshot does not need to do anything?

Reply (Contributor):

I think catchup at startup might not work if a node misses some views for some other reason, for example due to a temporary network outage.

Reply (Contributor):

Quoting the reply above: "yeah but then hotshot has to decide when to call it? I think calling it for every view would be too much maybe."

If it doesn't do anything except once per epoch, then what's the problem with calling it every view?

@tbro (Contributor) commented Dec 20, 2024:

Quoting the earlier suggestion: "I'm wondering if the Membership trait should have a method like is_epoch_initialized(&self, epoch) -> bool so that if the node is not online for the beginning of the epoch we can still initialize the membership for the epoch. But then we also have to dig out the correct header from history which I think might be cumbersome."

Why do we need this? When a node comes online it should go through the same routine as all nodes, which includes calling add_epoch_root for the block being proposed/validated. Hotshot should ensure mechanisms are in place to bring the late-joining node back in sync. I don't see the need for an extra mechanism.

Reply (Contributor):

I might be reading it wrongly, but I think the current implementation in this PR only calls the function if (decided_block_number + 3) % task_state.epoch_height == 0, so if a node joins at a later point and the decided block has moved on, it won't be called. Maybe I'm misunderstanding.

epoch_from_block_number(decided_block_number, task_state.epoch_height) + 1,
);

let membership_reader = task_state.membership.read().await;
Comment (Contributor):

Possibly personal preference, but I think it is more idiomatic to prefer introducing a scoped block over explicit drops. Something like:

let write_callback = {
    let membership_reader = task_state.membership.read().await;
    membership_reader.add_epoch_root(next_epoch_number, proposal.block_header.clone())
};
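
For contrast, the explicit-drop version this replaces would read roughly like the following sketch.

// The drop-based alternative the comment argues against (sketch only):
let membership_reader = task_state.membership.read().await;
let write_callback =
    membership_reader.add_epoch_root(next_epoch_number, proposal.block_header.clone());
drop(membership_reader); // easy to forget, and the lock scope is less obvious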

Reply (PR author):

done

@sveitser mentioned this pull request on Dec 20, 2024.
@pls148 force-pushed the ps/3966B branch 3 times, most recently from 2cb5789 to 13a9ce4, on January 3, 2025.
@ss-es (Contributor) left a review:

Left some comments, though I'm not necessarily sure about any of them.

Comment on lines 147 to 149
fn uses_sync_l1() -> bool {
    false
}
Comment (Contributor):

is this actually needed with the current signature for sync_l1?

Reply (Contributor):

actually, I feel like we wouldn't want to do this even if we could

I think we'd (eventually) always try to take this lock in deployment, so this would only help us in tests. but I don't think we want our tests to have different locking behavior

I might be misunderstanding though

crates/task-impls/src/helpers.rs: a resolved comment thread (not expanded here).
Comment on lines 201 to 205
if TYPES::Membership::uses_sync_l1() {
    let write_callback = {
        let membership_reader = membership.read().await;
        membership_reader.sync_l1().await
    };
Comment (Contributor):

rather than checking uses_sync_l1() we can just check whether add_epoch_root actually did something, and if so immediately apply the callback and the sync
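
A sketch of that restructuring, keying off the add_epoch_root result rather than a uses_sync_l1() flag. Variable names follow the surrounding diff; whether add_epoch_root is awaited, and the exact control flow in the merged code, are assumptions.

// Hedged sketch of the suggested control flow, not the merged code.
let epoch_root_callback = {
    let membership_reader = membership.read().await;
    membership_reader
        .add_epoch_root(next_epoch_number, proposal.block_header.clone())
        .await
};

if let Some(write_callback) = epoch_root_callback {
    // add_epoch_root actually did something: apply it, then sync with L1.
    {
        let mut membership_writer = membership.write().await;
        write_callback(&mut *membership_writer);
    }
    let sync_callback = {
        let membership_reader = membership.read().await;
        membership_reader.sync_l1().await
    };
    if let Some(write_callback) = sync_callback {
        let mut membership_writer = membership.write().await;
        write_callback(&mut *membership_writer);
    }
}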

@pls148 force-pushed the ps/3966B branch 3 times, most recently from 7b49557 to e6ab29f, on January 3, 2025.
@ss-es (Contributor) left a review:

overall looks good to me, though I think it's worth merging main in and using is_epoch_root

.block_number();

// Skip if this is not the expected block.
if epoch_height != 0 && (decided_block_number + 3) % epoch_height == 0 {
Comment (Contributor):

#3948 just added hotshot_types::utils::is_epoch_root which I think we should use here too (I'm not 100% sure if we're going with the 3rd last block, but we should definitely be using the same thing as we do in the drb)
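
That swap would look roughly like the following, assuming is_epoch_root takes the decided block number and the epoch height.

// Sketch: reuse the helper from #3948 instead of the hand-rolled modulus
// check, so this trigger stays consistent with the DRB logic.
use hotshot_types::utils::is_epoch_root;

if epoch_height != 0 && is_epoch_root(decided_block_number, epoch_height) {
    // ...trigger add_epoch_root for the next epoch...
}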

Comment on lines +204 to +215
let write_callback = {
    let membership_reader = membership.read().await;
    membership_reader.sync_l1().await
};

if let Some(write_callback) = write_callback {
    let mut membership_writer = membership.write().await;
    write_callback(&mut *membership_writer);
}
}
Comment (Contributor):

I think we might want this in a tokio::spawn, so we don't block on the sync_l1 call but I'm not 100% sure what the right approach is here (e.g. I don't know if doing that adds deadlock risks). probably also fine to leave as-is and deal with it if problems come up
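
A sketch of the spawn variant being discussed, assuming membership is an Arc<RwLock<_>> that can be cloned into the task; whether this introduces lock-ordering or deadlock issues is exactly the open question raised here.

// Hedged sketch of the alternative discussed; the hunk above shows the
// inline version. Assumes `membership: Arc<RwLock<_>>` and a tokio runtime.
let membership = Arc::clone(&membership);
tokio::spawn(async move {
    let write_callback = {
        let membership_reader = membership.read().await;
        membership_reader.sync_l1().await
    };
    if let Some(write_callback) = write_callback {
        let mut membership_writer = membership.write().await;
        write_callback(&mut *membership_writer);
    }
});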

@pls148 merged commit e8cd837 into main on Jan 3, 2025 (17 checks passed).
@pls148 deleted the ps/3966B branch on January 3, 2025.
Successfully merging this pull request may close issue #3966: Membership trait changes to support PoS.