
spring node suddenly freezes and corrupts its state on restart #982

Closed
the-smooth-operator opened this issue Oct 29, 2024 · 21 comments

@the-smooth-operator

Since Thursday, 24 October we have been experiencing an issue where nodes suddenly freeze: they stop emitting logs and stop answering RPC calls.

We start from V8 snapshots; the node runs for 2 to 3 hours at most and then freezes.
We were compiling our own binary from the v1.0.2 tag, and we also tried the DEB package; the issue is the same with both.
The node has 192 GB of memory and several CPU cores available, but it uses nowhere near that: just a few GB of memory and 1 to 2 CPU cores.

These are our config settings:

    # Public peers that are relatively consistent

    # Per EOS documentation, keep your p2p-peer-address list short
    # Ideally, limit it to the 3 or 4 closest nodes geographically
    # You should have at least 2 or 3 nodes that extend all the way to the genesis block
    # Reference: https://eosnetwork.com/blog/leap-3-1-upgrade-guide/

    p2p-peer-address = eos.seed.eosnation.io:9876
    p2p-peer-address = peer1.eosphere.io:9876
    p2p-peer-address = peer2.eosphere.io:9876
    p2p-peer-address = p2p.genereos.io:9876

    # Plugins
    plugin = eosio::chain_plugin
    plugin = eosio::net_plugin
    plugin = eosio::http_plugin
    plugin = eosio::chain_api_plugin
    plugin = eosio::net_api_plugin

    # Stability Configs
    http-validate-host = false
    chain-state-db-size-mb = 81920

    # Default settings are way too low
    net-threads = 6
    http-threads = 6
    chain-threads = 6

In the screenshot you can see how logs stop:
[screenshot]

These are the last log lines we get:

    net_plugin.cpp:3761           handle_message       ] ["peer1.eosphere.io:9876 - fee7272" - 3 198.50.156.24:9876] received packed_transaction 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe
    debug 2024-10-29T16:41:43.116 nodeos    producer_plugin.cpp:2428      log_trx_results      ] [TRX_TRACE] Speculative execution is ACCEPTING tx: 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe, auth: miner2.sx, cpu: 437
    debug 2024-10-29T16:41:43.116 nodeos    producer_plugin.cpp:2430      log_trx_results      ] [TRX_TRACE] Speculative execution is ACCEPTING tx: {"expiration":"2024-10-29T16:43:03","ref_block_num":25098,"ref_block_prefix":2376131749,"max_net_usage_words":0,"max_cpu_usage_ms":0,"delay_sec":0,"context_free_actions":[],"actions":[{"account":"push.sx","name":"mine","authorization":[{"actor":"miner2.sx","permission":"active"}],"data":{"executor":"miner2.sx","nonce":1377},"hex_data":{"size":16,"hex":"0000e81888aba6936105000000000000"}}]}
    debug 2024-10-29T16:41:43.117 nodeos    producer_plugin.cpp:2432      log_trx_results      ] [TRX_TRACE] Speculative execution is ACCEPTING tx: {"id":"6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe","block_num":401957469,"block_time":"2024-10-29T16:41:43.500","producer_block_id":null,"receipt":{"status":"executed","cpu_usage_us":437,"net_usage_words":14},"elapsed":437,"net_usage":112,"scheduled":false,"action_traces":[{"action_ordinal":1,"creator_action_ordinal":0,"closest_unnotified_ancestor_action_ordinal":0,"receipt":{"receiver":"push.sx","act_digest":"00264089d6e05c468b211135c0318bead8abd1286ef92bb4028439f689803af2","global_sequence":"362066506033","recv_sequence":501569411,"auth_sequence":[["miner2.sx",198816888]],"code_sequence":315,"abi_sequence":62},"receiver":"push.sx","act":{"account":"push.sx","name":"mine","authorization":[{"actor":"miner2.sx","permission":"active"}],"data":{"executor":"miner2.sx","nonce":1377},"hex_data":{"size":16,"hex":"0000e81888aba6936105000000000000"}},"context_free":false,"elapsed":35,"console":"","trx_id":"6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe","block_num":401957469,"block_time":"2024-10-29T16:41:43.500","producer_block_id":null,"account_ram_deltas":[],"except":null,"error_code":null,"return_value_hex_data":""},{"action_ordinal":2,"creator_action_ordinal":1,"closest_unnotified_ancestor_action_ordinal":1,"receipt":{"receiver":"fastram.sx","act_digest":"00264089d6e05c468b211135c0318bead8abd1286ef92bb4028439f689803af2","global_sequence":"362066506034","recv_sequence":69976621,"auth_sequence":[["miner2.sx",198816889]],"code_sequence":315,"abi_sequence":62},"receiver":"fastram.sx","act":{"account":"push.sx","name":"mine","authorization":[{"actor":"miner2.sx","permission":"active"}],"data":{"executor":"miner2.sx","nonce":1377},"hex_data":{"size":16,"hex":"0000e81888aba6936105000000000000"}},"context_free":false,"elapsed":315,"console":"","trx_id":"6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe","block_num":401957469,"block_time":"2024-10-29T16:41:43.500","producer_block_id":null,"account_ram_deltas":[],"except":null,"error_code":null,"return_value_hex_data":""}],"account_ram_delta":null,"except":null,"error_code":null}
    debug 2024-10-29T16:41:43.117 nodeos    net_plugin.cpp:3773           operator()           ] chain accepted transaction, bcast 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe
    debug 2024-10-29T16:41:43.117 net-1     net_plugin.cpp:4094           operator()           ] signaled ACK, trx-id = 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe
    debug 2024-10-29T16:41:43.117 net-1     net_plugin.cpp:2783           operator()           ] sending trx: 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe, to connection - 1
    debug 2024-10-29T16:41:43.117 net-1     net_plugin.cpp:2783           operator()           ] sending trx: 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe, to connection - 2
    debug 2024-10-29T16:41:43.117 net-2     net_plugin.cpp:1707           ope

Any help would be appreciated, as we can't run any node reliably at the moment.

@heifner
Member

heifner commented Oct 29, 2024

chain-state-db-size-mb = 81920

This should be something more like chain-state-db-size-mb = 180000. I would have expected that to have given you an error.
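To see how full the chain-state database actually is, here is a minimal sketch of querying it; it assumes eosio::db_size_api_plugin is enabled and the HTTP endpoint listens on 127.0.0.1:8888 (both are assumptions, not settings from this thread):

    # Assumes: plugin = eosio::db_size_api_plugin and http-server-address = 127.0.0.1:8888
    curl -s http://127.0.0.1:8888/v1/db_size/get | jq '{size, used_bytes, free_bytes}'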

@the-smooth-operator
Author

Let me try setting that value.

@nmiceli-simtlix

We have tried different settings and our nodes just won't run. Even when we are able to successfully restore from a snapshot, the node runs for a few minutes and then crashes with terminate called after throwing an instance of 'Serialization::FatalSerializationException'.
Can anyone share a config.ini file that we can work with?
Thanks!

@the-smooth-operator
Author

We have been running stable EOS nodes for about 4 years. Suddenly, we are completely unable to run a node starting from a snapshot.
Honestly we are desperate and out of ideas.

Steps to reproduce:

  • Download a V8 snapshot from https://snapshots.eosnation.io/
  • Run with the default config (it fails to start due to std::exception::what: Unknown option 'reversible-blocks-db-size-mb' inside the config file config.ini)
  • Bump chain-state-db-size-mb to 65536 or any multiple. The node processes the snapshot, starts syncing, and freezes after about an hour, throwing 'Serialization::FatalSerializationException'

We are running on a 256 GB server, but the process doesn't use more than 4 GB.

In this screenshot you can see how it quickly catches up, continues syncing, and then freezes.
[screenshot]

@the-smooth-operator
Author

Are you able to start a node from scratch using a snapshot? If so, we would love to see which configuration you are using.

@greg7mdp
Contributor

Honestly we are desperate and out of ideas.

Maybe try setting database-map-mode = mapped_private. It could help if you are using a slow disk. Also set chain-state-db-size-mb = 180000 as suggested above.
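Taken together, a minimal config.ini fragment with both settings suggested in this thread would be:

    # Size the state DB generously and map it privately (values as suggested above)
    chain-state-db-size-mb = 180000
    database-map-mode = mapped_private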

@the-smooth-operator
Author

We already tried chain-state-db-size-mb = 180000.
We are now going to test database-map-mode = mapped_private.

@greg7mdp
Contributor

We already tried chain-state-db-size-mb = 180000. We are now going to test database-map-mode = mapped_private.

Please make sure you do both.

@petemarinello

Please make sure you do both.

Just confirmed we are currently testing with chain-state-db-size-mb = 180000 and database-map-mode = mapped_private.

@heifner
Member

heifner commented Oct 30, 2024

10/30/24 2:20:34 ➜  bin git:(52215f90c7) ✗ zstd -d snapshot-2024-10-30-16-eos-v8-0402124605.bin.zst
snapshot-2024-10-30-16-eos-v8-0402124605.bin.zst: 46486102417 bytes
10/30/24 2:22:40 ➜  bin git:(52215f90c7) ✗ ./nodeos --version
info  2024-10-30T19:22:40.911 nodeos    main.cpp:158                  main                 ] nodeos started
v1.0.2
10/30/24 2:23:09 ➜  bin git:(52215f90c7) ✗ rm -rf main-d
10/30/24 2:25:50 ➜  bin git:(52215f90c7) ✗ ./nodeos --data-dir main-d --config-dir main-d --chain-state-db-size-mb 180000 --plugin eosio::chain_api_plugin --plugin eosio::db_size_api_plugin --p2p-peer-address eos.seed.eosnation.io:9876 --p2p-peer-address peer1.eosphere.io:9876 --p2p-peer-address peer2.eosphere.io:9876 --p2p-peer-address p2p.genereos.io:9876 --snapshot snapshot-2024-10-30-16-eos-v8-0402124605.bin
...
info  2024-10-30T20:13:41.600 nodeos    controller.cpp:1872           startup              ] Snapshot loaded, lib: 402124605
...
info  2024-10-30T20:14:18.200 nodeos    controller.cpp:3641           log_applied          ] Received block 8625ffa71c93416f... #402125000 @ 2024-10-30T16:03:19.000 signed by eosnationftw [trxs: 7, lib: 402124998, net: 1336, cpu: 12415 us, elapsed: 28617 us, applying time: 29212 us, latency: 15059200 ms]
...
info  2024-10-30T20:22:41.109 nodeos    controller.cpp:3641           log_applied          ] Received block 4b731dbac2e51af3... #402155884 @ 2024-10-30T20:22:41.500 signed by aus1genereos [trxs: 9, lib: 402155882, net: 1336, cpu: 11855 us, elapsed: 2298 us, applying time: 3053 us, latency: -390 ms]

Note peer2.eosphere.io:9876 did keep giving me errors. I would recommend removing that peer.

@petemarinello

The situation seems to be improving: we are currently observing over 100 GB of memory usage, which is expected since mapped_private keeps the chain state in a private memory mapping, if I understand correctly. However, after several hours the same issue remains: the node still halts block processing. It is also worth noting that there have been no restarts of the service.

[screenshot]

@heifner
Member

heifner commented Oct 30, 2024

node continues to halt block processing

@petemarinello are you working with @the-smooth-operator or did you hijack this issue?

Can you provide your complete log?

@petemarinello

petemarinello commented Oct 30, 2024

@petemarinello are you working with @the-smooth-operator or did you hijack this issue?

We are working together on this issue.

@petemarinello

Update from our side: after removing peer2.eosphere.io:9876 from the peer list, things seem to be working as expected. However, we need more time to monitor before fully calling the issue resolved.

@the-smooth-operator
Author

Two nodes have been running for 12 hours now. They are doing slightly better, I guess due to using database-map-mode = mapped_private. However, they still throw the same serialization errors and are not functioning correctly.

One interesting thing: at some point both processes stopped producing logs, yet they still answer RPC calls from time to time, and we can see their block height increasing.
I'm attaching the log output of the last minutes before the process froze.
eos-fn-log.txt

They are using 100GB+ of memory and almost no CPU or disk.

In this screenshot you can see how the nodes occasionally answer the internal RPC calls that we use to determine their height, even though they are not logging anything.
[screenshot]
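A minimal sketch of that kind of height check against the chain API (the host and port here are placeholders, not our actual endpoints):

    # Assumes eosio::chain_api_plugin and an HTTP endpoint on 127.0.0.1:8888 (placeholder)
    curl -s http://127.0.0.1:8888/v1/chain/get_info | jq '{head_block_num, last_irreversible_block_num}'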

Current configuration:

    p2p-peer-address = eos.seed.eosnation.io:9876
    p2p-peer-address = peer1.eosphere.io:9876
    p2p-peer-address = p2p.genereos.io:9876


    # Plugins
    plugin = eosio::chain_plugin
    plugin = eosio::net_plugin
    plugin = eosio::http_plugin
    plugin = eosio::chain_api_plugin
    plugin = eosio::net_api_plugin
    plugin = eosio::db_size_api_plugin

    # Stability Configs
    http-validate-host = false
    chain-state-db-size-mb = 180000
    eos-vm-oc-cache-size-mb = 4096
    database-map-mode = mapped_private


    # Default settings are way too low
    net-threads = 6
    http-threads = 6
    chain-threads = 4

@heifner
Member

heifner commented Oct 31, 2024

Thanks for the log. Very helpful. You hit an issue I thought was introduced (and fixed) after 1.0.2; I see now that it is in 1.0.2. This is fixed in 1.0.3. Until 1.0.3 is released, you could build the release/1.0 branch.
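A rough sketch of building from that branch, assuming the standard CMake flow; check the spring repository's build documentation for the exact prerequisites on your platform:

    # Rough sketch -- see the spring repo's build docs for required dependencies
    git clone --recursive https://github.com/AntelopeIO/spring.git
    cd spring
    git checkout release/1.0
    git submodule update --init --recursive
    mkdir -p build && cd build
    cmake -DCMAKE_BUILD_TYPE=Release ..
    make -j "$(nproc)"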

@the-smooth-operator
Author

That's great, we really need to get our nodes up. We are going to run some replicas with the code on the release/1.0 branch.
My colleagues will report later on this thread.

I have two questions: what did you spot in the logs that made you relate it to a bug? And do you know why other operators are able to run nodes without hitting this issue?

@heifner
Member

heifner commented Oct 31, 2024

  1. The sync_timeout. Your running with the net_plugin_impl logger at debug level provided this very useful information (a sketch of enabling that logger follows below). I remembered we fixed an issue with the sync timeout; see https://github.com/AntelopeIO/spring/pull/897/files

         debug 2024-10-30T20:00:34.813 net-5     net_plugin.cpp:2268           sync_timeout         ] ["eosn-eos-seed170:9876 - be6402b" - 1 209.249.216.152:9876] sync timed out
         terminate called after throwing an instance of 'Serialization::FatalSerializationException'

  2. You will only hit this issue if you have a sync timeout, which requires a slow peer that you are syncing from.
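For reference, a minimal sketch of the loggers entry in logging.json that enables that logger at debug level; the appender name "stderr" is an assumption here, so match it to whatever console appender your generated logging.json defines:

    {
      "loggers": [
        {
          "name": "net_plugin_impl",
          "level": "debug",
          "enabled": true,
          "additivity": false,
          "appenders": [ "stderr" ]
        }
      ]
    }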

@heifner
Member

heifner commented Oct 31, 2024

Assuming this is fixed in v1.0.3. Feel free to re-open if you discover otherwise.

@ericpassmore
Contributor

Note:start
category: System Stability
component: Internal
summary: Node freeze closed by other fixes.
Note:end
