Indexer Causes Unclean Shutdown Race Condition #1387

Closed · crypto-services opened this issue Dec 12, 2024 · 8 comments
crypto-services commented Dec 12, 2024

System information

Bor client version: 0.5.1

Heimdall client version: 1.0.10

OS & Version: Ubuntu 22.04/24.04

Environment: Polygon Mainnet

Type of node: All

Overview of the problem

When attempting to shut down the Bor service (SIGINT), the process appears unable to exit if the indexer was running at the time. As a result, the SIGINT times out and SIGKILL is triggered, which either loses the recent unwritten state or, worse, corrupts the database.

Reproduction Steps

Happens often when stopping the Bor service with SIGINT.

Logs / Traces / Output / Error Messages

Gracefully shutting down agent...
INFO [12-12|03:06:12.072] HTTP server stopped                      endpoint=127.0.0.1:8545
INFO [12-12|03:06:12.072] IPC endpoint closed                      url=/home/polygon/.bor/data/bor.ipc
INFO [12-12|03:06:12.072] Stats daemon stopped
INFO [12-12|03:06:12.073] Ethereum protocol stopped
INFO [12-12|03:06:12.073] Transaction pool stopped
INFO [12-12|03:06:12.074] Waiting background transaction indexer to exit
INFO [12-12|03:06:22.080] Looking for peers                        peercount=1 tried=135 static=5
INFO [12-12|03:06:32.085] Looking for peers                        peercount=1 tried=168 static=5
...

The node never reaches the Writing cached state to disk step and is eventually killed. This does not happen when the indexer isn't running at the moment of shutdown.
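
For context, here is a minimal sketch of the kind of ordering that would produce these logs (illustrative only; the type and field names below are assumptions, not bor's actual shutdown code): the stop path waits for the indexer goroutine before it ever reaches the state flush, so a stuck indexer means Writing cached state to disk is never printed.

```go
// Illustrative sketch only -- not bor's actual shutdown code. It shows why
// a stuck indexer goroutine blocks everything after it in the stop sequence,
// including the final state flush.
package main

import (
	"fmt"
	"time"
)

type backend struct {
	indexerDone chan struct{} // closed by the indexer goroutine when it exits
}

func (b *backend) Stop() {
	fmt.Println("Transaction pool stopped")
	fmt.Println("Waiting background transaction indexer to exit")

	<-b.indexerDone // a blocked indexer keeps the shutdown parked here until SIGKILL

	// Never reached while the indexer is stuck, so recent unwritten blocks
	// are lost when the service manager escalates to SIGKILL.
	fmt.Println("Writing cached state to disk")
}

func main() {
	b := &backend{indexerDone: make(chan struct{})}
	go func() {
		time.Sleep(24 * time.Hour) // stand-in for an indexer that never honours shutdown
		close(b.indexerDone)
	}()
	b.Stop() // hangs, mirroring the behaviour in the logs above
}
```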

@crypto-services (Author)

This has just occurred again while updating one of our sentries. We've updated 6 of our nodes in the past couple of days and it happened on 50% of them, all with the same behaviour. We have a 10-minute timeout set for our service and we're deploying/updating with Ansible, so there is no change to the process.

All print Waiting background transaction indexer to exit, then Looking for peers for 10 minutes, followed by Killing process 3706391 (bor) with signal SIGKILL. Writing cached state to disk never occurs, which causes the node to lose unwritten blocks, as shown in the screenshot below.

[Screenshot 2024-12-13 at 09:02:27]

@marcello33 (Contributor)

Hey @crypto-services have you tried with a more recent version of bor?

@crypto-services (Author)

> Hey @crypto-services have you tried with a more recent version of bor?

The issue still appears to be present on v1.5.3.

@manav2401 (Contributor)

Hi @crypto-services, apologies for the delay. I tried to reproduce this issue locally and on one of our internal nodes and couldn't. It seems like there's a blocking call in the indexing service which doesn't honour the interrupt and hence stalls until the process is forcefully killed, leading to an unclean shutdown. To debug this further, it would be great if you could help me with the following:

  1. Stack trace of bor when it's stuck. To get this, start bor, let it run for a while, and try to stop it. Once you see the log Waiting background transaction indexer to exit, suggesting that it's stuck somewhere, kill the bor process with: pkill -QUIT bor (replace the process name if it's different for you). This will dump a large stack trace and help us narrow down the function the indexing service is stuck on. Please share all the logs here if possible, or in a gist.
  2. You can also capture traces via the pprof subcommand. When bor is stuck, please run bor debug pprof --seconds 5 to capture a trace for 5s. Please share the file to debug further.
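
As an aside on the "blocking call which doesn't honour the interrupt" point above, here is a hedged sketch of the pattern involved (illustrative names, not the actual indexer code): a bare blocking receive ignores shutdown, while selecting on the close channel lets the goroutine exit as soon as the stop path asks it to.

```go
// Hedged sketch, not bor's indexer code: the select on closeCh is what lets
// the background loop honour an interrupt instead of stalling until SIGKILL.
package main

import "fmt"

func indexLoop(work <-chan uint64, closeCh <-chan struct{}) {
	for {
		select {
		case block, ok := <-work:
			if !ok {
				return
			}
			fmt.Println("indexing block", block)
		case <-closeCh:
			fmt.Println("indexer: shutdown requested, exiting")
			return
		}
	}
}

func main() {
	work := make(chan uint64)
	closeCh := make(chan struct{})
	done := make(chan struct{})

	go func() {
		indexLoop(work, closeCh)
		close(done)
	}()

	close(closeCh) // simulate the shutdown request
	<-done         // shutdown completes instead of hanging
}
```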

Thanks!

@crypto-services (Author)

Ok, sure thing. I'll notify the team and get them to capture it when possible.

@manav2401 (Contributor)

Hey @crypto-services, did you prune ancient data by any chance on the node showing this issue?

@crypto-services (Author)

@manav2401 Yeah, they've all either been pruned or resynced at some point with a pruned snapshot. Additionally, they've all had bor snapshot prune-block run against them at various times to clear the ancient data.

@manav2401 (Contributor)

Hi @crypto-services, this issue has been fixed in #1411. It will be available in a beta branch soon. For now, to avoid the issue, please set txlookuplimit to a very low number, as the indexer was trying to access data which had already been pruned. Closing this issue for now.
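
For anyone hitting this before the fix lands, a rough illustration of why a low txlookuplimit helps (a sketch under assumptions, not the actual implementation; the numbers are made up): the limit bounds how far back from the chain head the indexer reaches, so a small limit keeps the indexing tail above the height where ancient data was pruned and the indexer never asks for blocks that no longer exist on disk.

```go
// Illustrative only -- not bor's actual indexer logic. A tx lookup limit
// bounds how far back from the chain head the indexer goes; if that tail
// falls below the pruned-ancients boundary, the indexer requests data that
// is no longer on disk.
package main

import "fmt"

// indexTail returns the oldest block the indexer wants to cover, given the
// chain head and the lookup limit (0 means "index everything").
func indexTail(head, txLookupLimit uint64) uint64 {
	if txLookupLimit == 0 || head < txLookupLimit {
		return 0
	}
	return head - txLookupLimit + 1
}

func main() {
	const (
		head        uint64 = 65_000_000 // hypothetical chain head
		prunedBelow uint64 = 64_900_000 // hypothetical: ancients below this were pruned
	)

	for _, limit := range []uint64{2_350_000, 50_000} { // example large vs. small limits
		tail := indexTail(head, limit)
		if tail < prunedBelow {
			fmt.Printf("txlookuplimit=%d -> tail=%d: overlaps pruned range, indexer can stall\n", limit, tail)
		} else {
			fmt.Printf("txlookuplimit=%d -> tail=%d: stays within available data\n", limit, tail)
		}
	}
}
```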
