Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test Failure: nodeos_startup_catchup_if_lr_test #69

Closed
heifner opened this issue Apr 25, 2024 · 4 comments · Fixed by #82
Closed

Test Failure: nodeos_startup_catchup_if_lr_test #69

heifner opened this issue Apr 25, 2024 · 4 comments · Fixed by #82
Assignees
Labels
👍 lgtm OCI Work exclusive to OCI team

Comments

@heifner
Copy link
Member

heifner commented Apr 25, 2024

https://github.com/AntelopeIO/spring/actions/runs/8831635726/job/24247940383

ERROR: Node14 has unlinkable blocks

@heifner
Copy link
Member Author

heifner commented Apr 25, 2024

Did eventually come out of it, but had 219 unlinkable block exceptions. The test expects less than 15.

@heifner
Copy link
Member Author

heifner commented Apr 25, 2024

This looks suspicious:

debug 2024-04-25T11:44:33.707 net-3     net_plugin.cpp:3788           handle_message       ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] posting block 161 to dispatcher strand
debug 2024-04-25T11:44:33.707 net-0     net_plugin.cpp:3840           operator()           ] posting block 161 to app thread
debug 2024-04-25T11:44:33.709 net-2     net_plugin.cpp:3788           handle_message       ] ["localhost:9877 - b41fa1a" - 15 127.0.0.1:42822] posting block 163 to dispatcher strand
debug 2024-04-25T11:44:33.709 net-1     net_plugin.cpp:3840           operator()           ] posting block 163 to app thread

Looks like net-3 thread didn't get any CPU for about 3ms. These log statements that should happen right after each other. There is a mutex lock between them, which I assume is what is causing this large delay.

debug 2024-04-25T11:44:33.707 net-3     net_plugin.cpp:2532           sync_recv_block      ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] calling sync_wait, block 162
debug 2024-04-25T11:44:33.710 net-3     net_plugin.cpp:3788           handle_message       ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] posting block 162 to dispatcher strand

During those 3ms, the net_plugin has started syncing from a different peer and received 14 blocks. We continue to post from two different streams of blocks causing a huge number of unlinkable (out of order) blocks.

One possible solution is to keep an explicit ordered queue of incoming blocks while in LIB catchup. We could insert into this queue and pull from it as we process the blocks. If the next block is not in the queue we could wait for it to arrive. Currently there is an implicit queue for the app thread we post into. This explicit ordered queue would only be applicable during LIB catchup. After LIB catchup the current scheme of processing blocks as they come in would need to be used.

Seems low priority as the node does recover and continue syncing, the only real harm is a large number of unlinkable block log messages and a bit of extra processing determining that the block does not link which does not take much time.

@heifner
Copy link
Member Author

heifner commented Apr 26, 2024

Same failure: https://github.com/AntelopeIO/spring/actions/runs/8847642165/job/24296229867

debug 2024-04-26T11:53:34.168 net-0     net_plugin.cpp:3130           process_next_block_m ] ["localhost:9880 - 24f47bf" - 6 127.0.0.1:9880] received block 149, id bf2b2b53dc02a90e..., latency: 64168ms, head 1
debug 2024-04-26T11:53:34.168 net-0     net_plugin.cpp:2476           sync_recv_block      ] ["localhost:9880 - 24f47bf" - 6 127.0.0.1:9880] got block 149:bf2b2b53dc02a90e.. latency 0ms
debug 2024-04-26T11:53:34.168 net-0     net_plugin.cpp:2532           sync_recv_block      ] ["localhost:9880 - 24f47bf" - 6 127.0.0.1:9880] calling sync_wait, block 149
debug 2024-04-26T11:53:34.171 net-0     net_plugin.cpp:3788           handle_message       ] ["localhost:9880 - 24f47bf" - 6 127.0.0.1:9880] posting block 149 to dispatcher strand
debug 2024-04-26T11:53:34.171 net-3     net_plugin.cpp:3840           operator()           ] posting block 149 to app thread

3ms between receiving and posting on the main, meanwhile many other blocks have been posted to the main thread.

@arhag
Copy link
Member

arhag commented Apr 26, 2024

For this issue, we just want to make the test more reliable and less sensitive to the number of unlinkable blocks that occurred during the test.

Either we remove the check that the number of unlinkable blocks is less than the threshold, or we just increase the threshold to account for the fact that such unlinkable blocks during lib catchup are more likely with the changes we made to the implementation for how we process blocks.

A separate issue can track the low-priority optimization to reduce the likelihood of these unlinkable blocks occurring during lib catchup in the first place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
👍 lgtm OCI Work exclusive to OCI team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants