Indexer Causes Unclean Shutdown Race Condition #1387

Closed · crypto-services opened this issue Dec 12, 2024 · 8 comments
crypto-services commented Dec 12, 2024

System information

Bor client version: 0.5.1

Heimdall client version: 1.0.10

OS & Version: Ubuntu 22.04/24.04

Environment: Polygon Mainnet

Type of node: All

Overview of the problem

When attempting to shut down the Bor service (SIGINT), the process appears unable to exit if the indexer was running at the time. As a result, the SIGINT times out and SIGKILL is triggered, which either loses the recent unwritten state or, worse, corrupts the database.

Reproduction Steps

Happens often when stopping the Bor service with SIGINT.

Logs / Traces / Output / Error Messages

Gracefully shutting down agent...
INFO [12-12|03:06:12.072] HTTP server stopped                      endpoint=127.0.0.1:8545
INFO [12-12|03:06:12.072] IPC endpoint closed                      url=/home/polygon/.bor/data/bor.ipc
INFO [12-12|03:06:12.072] Stats daemon stopped
INFO [12-12|03:06:12.073] Ethereum protocol stopped
INFO [12-12|03:06:12.073] Transaction pool stopped
INFO [12-12|03:06:12.074] Waiting background transaction indexer to exit
INFO [12-12|03:06:22.080] Looking for peers                        peercount=1 tried=135 static=5
INFO [12-12|03:06:32.085] Looking for peers                        peercount=1 tried=168 static=5
...

The node never reaches the Writing cached state to disk step and is eventually killed. This does not happen when the indexer isn't running at the moment of shutdown.
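
For context, here is a minimal sketch of the kind of ordering that would produce these logs (illustrative only; the type and field names below are assumptions, not bor's actual shutdown code): the stop path waits for the indexer goroutine before it ever reaches the state flush, so a stuck indexer means Writing cached state to disk is never printed.

```go
// Illustrative sketch only -- not bor's actual shutdown code. It shows why
// a stuck indexer goroutine blocks everything after it in the stop sequence,
// including the final state flush.
package main

import (
	"fmt"
	"time"
)

type backend struct {
	indexerDone chan struct{} // closed by the indexer goroutine when it exits
}

func (b *backend) Stop() {
	fmt.Println("Transaction pool stopped")
	fmt.Println("Waiting background transaction indexer to exit")

	<-b.indexerDone // a blocked indexer keeps the shutdown parked here until SIGKILL

	// Never reached while the indexer is stuck, so recent unwritten blocks
	// are lost when the service manager escalates to SIGKILL.
	fmt.Println("Writing cached state to disk")
}

func main() {
	b := &backend{indexerDone: make(chan struct{})}
	go func() {
		time.Sleep(24 * time.Hour) // stand-in for an indexer that never honours shutdown
		close(b.indexerDone)
	}()
	b.Stop() // hangs, mirroring the behaviour in the logs above
}
```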

@crypto-services (Author)

This has just occurred again while updating one of our sentries. We've updated 6 of our nodes in the past couple of days and it happened on 50% of them, all with the same behaviour. We have a 10-minute timeout set for our service and we're deploying/updating with Ansible, so there is no change to the process.

All print Waiting background transaction indexer to exit, then Looking for peers for 10 minutes, followed by Killing process 3706391 (bor) with signal SIGKILL. Writing cached state to disk never occurs, which causes the node to lose unwritten blocks, as shown in the screenshot below.

[Screenshot 2024-12-13 at 09:02:27]

@marcello33 (Contributor)

Hey @crypto-services have you tried with a more recent version of bor?

@crypto-services (Author)

> Hey @crypto-services have you tried with a more recent version of bor?

The issue still appears to be present on v1.5.3.

@manav2401 (Contributor)

Hi @crypto-services, apologies for the delay. I tried to reproduce this issue locally and on one of our internal nodes and couldn't. It seems like there's a blocking call in the indexing service which doesn't honour the interrupt and hence stalls until the process is forcefully killed, leading to an unclean shutdown. To debug this further, it would be great if you could help me with the following:

  1. Stack trace of bor when it's stuck. To get this, start bor, let it run for a while, and try to stop it. Once you see the log Waiting background transaction indexer to exit, suggesting that it's stuck somewhere, kill the bor process with: pkill -QUIT bor (replace the process name if it's different for you). This will dump a large stack trace and help us narrow down the function the indexing service is stuck on. Please share all the logs here if possible, or in a gist.
  2. You can also capture traces via the pprof subcommand. When bor is stuck, please run bor debug pprof --seconds 5 to capture a trace for 5s. Please share the file to debug further.
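
As an aside on the "blocking call which doesn't honour the interrupt" point above, here is a hedged sketch of the pattern involved (illustrative names, not the actual indexer code): a bare blocking receive ignores shutdown, while selecting on the close channel lets the goroutine exit as soon as the stop path asks it to.

```go
// Hedged sketch, not bor's indexer code: the select on closeCh is what lets
// the background loop honour an interrupt instead of stalling until SIGKILL.
package main

import "fmt"

func indexLoop(work <-chan uint64, closeCh <-chan struct{}) {
	for {
		select {
		case block, ok := <-work:
			if !ok {
				return
			}
			fmt.Println("indexing block", block)
		case <-closeCh:
			fmt.Println("indexer: shutdown requested, exiting")
			return
		}
	}
}

func main() {
	work := make(chan uint64)
	closeCh := make(chan struct{})
	done := make(chan struct{})

	go func() {
		indexLoop(work, closeCh)
		close(done)
	}()

	close(closeCh) // simulate the shutdown request
	<-done         // shutdown completes instead of hanging
}
```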

Thanks!

@crypto-services (Author)

Ok, sure thing. I'll notify the team and get them to capture it when possible.

@manav2401 (Contributor)

Hey @crypto-services, did you prune ancient data by any chance on the node showing this issue?

@crypto-services (Author)

@manav2401 Yeah, they've all either been pruned or resynced at some point with a pruned snapshot. Additionally, they've all had bor snapshot prune-block run against them at various times to clear the ancient data.

@manav2401 (Contributor)

Hi @crypto-services, this issue has been fixed in #1411. It will be available in a beta branch soon. For now, to avoid the issue, please set txlookuplimit to a very low number, as the indexer was trying to access data which had already been pruned. Closing this issue for now.
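
For anyone hitting this before the fix lands, a rough illustration of why a low txlookuplimit helps (a sketch under assumptions, not the actual implementation; the numbers are made up): the limit bounds how far back from the chain head the indexer reaches, so a small limit keeps the indexing tail above the height where ancient data was pruned and the indexer never asks for blocks that no longer exist on disk.

```go
// Illustrative only -- not bor's actual indexer logic. A tx lookup limit
// bounds how far back from the chain head the indexer goes; if that tail
// falls below the pruned-ancients boundary, the indexer requests data that
// is no longer on disk.
package main

import "fmt"

// indexTail returns the oldest block the indexer wants to cover, given the
// chain head and the lookup limit (0 means "index everything").
func indexTail(head, txLookupLimit uint64) uint64 {
	if txLookupLimit == 0 || head < txLookupLimit {
		return 0
	}
	return head - txLookupLimit + 1
}

func main() {
	const (
		head        uint64 = 65_000_000 // hypothetical chain head
		prunedBelow uint64 = 64_900_000 // hypothetical: ancients below this were pruned
	)

	for _, limit := range []uint64{2_350_000, 50_000} { // example large vs. small limits
		tail := indexTail(head, limit)
		if tail < prunedBelow {
			fmt.Printf("txlookuplimit=%d -> tail=%d: overlaps pruned range, indexer can stall\n", limit, tail)
		} else {
			fmt.Printf("txlookuplimit=%d -> tail=%d: stays within available data\n", limit, tail)
		}
	}
}
```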
