spring node suddenly freezes and corrupts its state on restart #982
Comments
This should be something more like …
Let me try setting that value.
We have tried different settings and our nodes just won't run. Even if we are able to successfully restore from a snapshot, the node runs for a few minutes and still crashes with …
We have been running stable EOS nodes for about 4 years. Suddenly, we are completely unable to run a node starting from a snapshot. Steps to reproduce:
We are running on a 256 GB server, but the process doesn't use more than 4 GB. In this screenshot you can see how it quickly catches up, continues syncing, and then freezes.
Are you able to start a node from scratch using a snapshot? If so, we would love to see which configuration you are using.
Maybe try setting …
We already tried with …
Please make sure you do both.
Just confirmed we are currently testing with …
10/30/24 2:20:34 ➜ bin git:(52215f90c7) ✗ zstd -d snapshot-2024-10-30-16-eos-v8-0402124605.bin.zst
snapshot-2024-10-30-16-eos-v8-0402124605.bin.zst: 46486102417 bytes
10/30/24 2:22:40 ➜ bin git:(52215f90c7) ✗ ./nodeos --version
info 2024-10-30T19:22:40.911 nodeos main.cpp:158 main ] nodeos started
v1.0.2
10/30/24 2:23:09 ➜ bin git:(52215f90c7) ✗ rm -rf main-d
10/30/24 2:25:50 ➜ bin git:(52215f90c7) ✗ ./nodeos --data-dir main-d --config-dir main-d --chain-state-db-size-mb 180000 --plugin eosio::chain_api_plugin --plugin eosio::db_size_api_plugin --p2p-peer-address eos.seed.eosnation.io:9876 --p2p-peer-address peer1.eosphere.io:9876 --p2p-peer-address peer2.eosphere.io:9876 --p2p-peer-address p2p.genereos.io:9876 --snapshot snapshot-2024-10-30-16-eos-v8-0402124605.bin
...
info 2024-10-30T20:13:41.600 nodeos controller.cpp:1872 startup ] Snapshot loaded, lib: 402124605
...
info 2024-10-30T20:14:18.200 nodeos controller.cpp:3641 log_applied ] Received block 8625ffa71c93416f... #402125000 @ 2024-10-30T16:03:19.000 signed by eosnationftw [trxs: 7, lib: 402124998, net: 1336, cpu: 12415 us, elapsed: 28617 us, applying time: 29212 us, latency: 15059200 ms]
...
info 2024-10-30T20:22:41.109 nodeos controller.cpp:3641 log_applied ] Received block 4b731dbac2e51af3... #402155884 @ 2024-10-30T20:22:41.500 signed by aus1genereos [trxs: 9, lib: 402155882, net: 1336, cpu: 11855 us, elapsed: 2298 us, applying time: 3053 us, latency: -390 ms]
Note: …
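Since chain-state-db-size-mb keeps coming up in this thread, one way to watch chain-state usage while the node syncs is the db_size endpoint exposed by eosio::db_size_api_plugin (enabled in the command above). A minimal sketch, assuming the default HTTP endpoint of 127.0.0.1:8888:
# Report chain-state database usage (size, used and free bytes); requires only curl.
curl -s http://127.0.0.1:8888/v1/db_size/get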
@petemarinello are you working with @the-smooth-operator, or did you hijack this issue? Can you provide your complete log?
We are working together on this issue.
Update from our side after removing …
2 nodes have been running for 12 hours now. They are doing slightly better, I guess due to using …. One interesting thing: at some point both processes stopped producing logs, yet they are still able to answer RPC calls from time to time, and we can see their height increasing. They are using 100 GB+ of memory and almost no CPU or disk. In this screenshot you can see how the nodes occasionally answer the internal RPC calls that we use to determine height. Current configuration:
…
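On the height checks mentioned above: a minimal sketch of polling head height over the chain API, assuming the default HTTP endpoint of 127.0.0.1:8888 and eosio::chain_api_plugin enabled, as in the nodeos command earlier in the thread (jq is used only to trim the output):
# Print head block number and last irreversible block number from get_info.
curl -s http://127.0.0.1:8888/v1/chain/get_info | jq '{head_block_num, last_irreversible_block_num}'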
Thanks for the log. Very helpful. You hit an issue I thought was introduced (and fixed) after 1.0.2. I see now that it is in 1.0.2. This is fixed in 1.0.3. Until 1.0.3 is released, you could build …
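A minimal sketch of such a source build, assuming a standard CMake flow and that the distro's build dependencies are already installed; <fix-branch> is a placeholder for the branch referenced above:
# Clone Spring and build nodeos from source (branch name and output path are assumptions).
git clone --recursive https://github.com/AntelopeIO/spring.git
cd spring
git checkout <fix-branch>
git submodule update --init --recursive
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j "$(nproc)"
# The nodeos binary typically lands under build/programs/nodeos/.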
That's great, we really need to get our nodes up. We are going to run some replicas with the code on …. I have 2 questions: what did you spot in the logs that made you relate it to a bug? And do you know why other operators are able to run nodes without hitting this issue?
…
Assuming this is fixed in v1.0.3. Feel free to re-open if you discover otherwise. |
Since Thursday, October 24th, we have been experiencing an issue where nodes suddenly freeze: they stop emitting logs and stop answering RPC calls.
We start from V8 snapshots; the node runs for 2 to 3 hours at most and then freezes.
We were compiling our own binary from the v1.0.2 tag, and we also tried the DEB package. We experience the same issue with both.
The node runs on a machine with 192 GB of memory and several CPU cores, which it doesn't come close to using: just a few GB of memory and 1 to 2 CPU cores.
These are our config settings:
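For illustration only (not the reporter's actual settings), a minimal config.ini carrying the flags from the nodeos command quoted earlier in this thread would look roughly like this:
# Illustrative config.ini; values mirror the command-line flags shown earlier in the thread.
chain-state-db-size-mb = 180000
plugin = eosio::chain_api_plugin
plugin = eosio::db_size_api_plugin
p2p-peer-address = eos.seed.eosnation.io:9876
p2p-peer-address = peer1.eosphere.io:9876
p2p-peer-address = peer2.eosphere.io:9876
p2p-peer-address = p2p.genereos.io:9876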
In the screenshot you can see how logs stop:
These are the last log lines we get:
Any help will be appreciated, as we can't run any node reliably as of now.