
spring node suddenly freezes and corrupts its state on restart #982

Closed
the-smooth-operator opened this issue Oct 29, 2024 · 21 comments

@the-smooth-operator

Since Thursday, 24 October we have been experiencing an issue where nodes suddenly freeze: they stop emitting logs and stop answering RPC calls.

We start from V8 snapshots; the node runs for 2 to 3 hours at most and then freezes.
We were compiling our own binary from the v1.0.2 tag, and we also tried the DEB package; the issue is the same with both.
The node has 192 GB of memory and several CPU cores available, but it uses nowhere near that: just a few GB of memory and 1 to 2 CPU cores.

These are our config settings:

    # Public peers that are relatively consistent

    # Per EOS documentation, keep your p2p-peer-address list short
    # Ideally, limit it to the 3 or 4 closest nodes geographically
    # You should have at least 2 or 3 nodes that extend all the way to the genesis block
    # Reference: https://eosnetwork.com/blog/leap-3-1-upgrade-guide/

    p2p-peer-address = eos.seed.eosnation.io:9876
    p2p-peer-address = peer1.eosphere.io:9876
    p2p-peer-address = peer2.eosphere.io:9876
    p2p-peer-address = p2p.genereos.io:9876

    # Plugins
    plugin = eosio::chain_plugin
    plugin = eosio::net_plugin
    plugin = eosio::http_plugin
    plugin = eosio::chain_api_plugin
    plugin = eosio::net_api_plugin

    # Stability Configs
    http-validate-host = false
    chain-state-db-size-mb = 81920

    # Default settings are way too low
    net-threads = 6
    http-threads = 6
    chain-threads = 6

In the screenshot you can see how logs stop:
[screenshot]

These are the last log lines we get:

    net_plugin.cpp:3761           handle_message       ] ["peer1.eosphere.io:9876 - fee7272" - 3 198.50.156.24:9876] received packed_transaction 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe
    debug 2024-10-29T16:41:43.116 nodeos    producer_plugin.cpp:2428      log_trx_results      ] [TRX_TRACE] Speculative execution is ACCEPTING tx: 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe, auth: miner2.sx, cpu: 437
    debug 2024-10-29T16:41:43.116 nodeos    producer_plugin.cpp:2430      log_trx_results      ] [TRX_TRACE] Speculative execution is ACCEPTING tx: {"expiration":"2024-10-29T16:43:03","ref_block_num":25098,"ref_block_prefix":2376131749,"max_net_usage_words":0,"max_cpu_usage_ms":0,"delay_sec":0,"context_free_actions":[],"actions":[{"account":"push.sx","name":"mine","authorization":[{"actor":"miner2.sx","permission":"active"}],"data":{"executor":"miner2.sx","nonce":1377},"hex_data":{"size":16,"hex":"0000e81888aba6936105000000000000"}}]}
    debug 2024-10-29T16:41:43.117 nodeos    producer_plugin.cpp:2432      log_trx_results      ] [TRX_TRACE] Speculative execution is ACCEPTING tx: {"id":"6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe","block_num":401957469,"block_time":"2024-10-29T16:41:43.500","producer_block_id":null,"receipt":{"status":"executed","cpu_usage_us":437,"net_usage_words":14},"elapsed":437,"net_usage":112,"scheduled":false,"action_traces":[{"action_ordinal":1,"creator_action_ordinal":0,"closest_unnotified_ancestor_action_ordinal":0,"receipt":{"receiver":"push.sx","act_digest":"00264089d6e05c468b211135c0318bead8abd1286ef92bb4028439f689803af2","global_sequence":"362066506033","recv_sequence":501569411,"auth_sequence":[["miner2.sx",198816888]],"code_sequence":315,"abi_sequence":62},"receiver":"push.sx","act":{"account":"push.sx","name":"mine","authorization":[{"actor":"miner2.sx","permission":"active"}],"data":{"executor":"miner2.sx","nonce":1377},"hex_data":{"size":16,"hex":"0000e81888aba6936105000000000000"}},"context_free":false,"elapsed":35,"console":"","trx_id":"6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe","block_num":401957469,"block_time":"2024-10-29T16:41:43.500","producer_block_id":null,"account_ram_deltas":[],"except":null,"error_code":null,"return_value_hex_data":""},{"action_ordinal":2,"creator_action_ordinal":1,"closest_unnotified_ancestor_action_ordinal":1,"receipt":{"receiver":"fastram.sx","act_digest":"00264089d6e05c468b211135c0318bead8abd1286ef92bb4028439f689803af2","global_sequence":"362066506034","recv_sequence":69976621,"auth_sequence":[["miner2.sx",198816889]],"code_sequence":315,"abi_sequence":62},"receiver":"fastram.sx","act":{"account":"push.sx","name":"mine","authorization":[{"actor":"miner2.sx","permission":"active"}],"data":{"executor":"miner2.sx","nonce":1377},"hex_data":{"size":16,"hex":"0000e81888aba6936105000000000000"}},"context_free":false,"elapsed":315,"console":"","trx_id":"6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe","block_num":401957469,"block_time":"2024-10-29T16:41:43.500","producer_block_id":null,"account_ram_deltas":[],"except":null,"error_code":null,"return_value_hex_data":""}],"account_ram_delta":null,"except":null,"error_code":null}
    debug 2024-10-29T16:41:43.117 nodeos    net_plugin.cpp:3773           operator()           ] chain accepted transaction, bcast 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe
    debug 2024-10-29T16:41:43.117 net-1     net_plugin.cpp:4094           operator()           ] signaled ACK, trx-id = 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe
    debug 2024-10-29T16:41:43.117 net-1     net_plugin.cpp:2783           operator()           ] sending trx: 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe, to connection - 1
    debug 2024-10-29T16:41:43.117 net-1     net_plugin.cpp:2783           operator()           ] sending trx: 6f18e4f4e243469612dfa42467f9c692f2d1c5eca029f433f44b14debb3e5bfe, to connection - 2
    debug 2024-10-29T16:41:43.117 net-2     net_plugin.cpp:1707           ope

Any help would be appreciated, as we can't run any node reliably at the moment.

@heifner
Member

heifner commented Oct 29, 2024

chain-state-db-size-mb = 81920

This should be something more like chain-state-db-size-mb = 180000. I would have expected that to have given you an error.
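To see how full the chain-state database actually is, here is a minimal sketch of querying it; it assumes eosio::db_size_api_plugin is enabled and the HTTP endpoint listens on 127.0.0.1:8888 (both are assumptions, not settings from this thread):

    # Assumes: plugin = eosio::db_size_api_plugin and http-server-address = 127.0.0.1:8888
    curl -s http://127.0.0.1:8888/v1/db_size/get | jq '{size, used_bytes, free_bytes}'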

@the-smooth-operator
Author

Let me try setting that value.

@nmiceli-simtlix

We have tried different settings and our nodes just won't run. Even when we are able to successfully restore from a snapshot, the node runs for a few minutes and then crashes with terminate called after throwing an instance of 'Serialization::FatalSerializationException'.
Can anyone share a config.ini file that we can work with?
Thanks!

@the-smooth-operator
Author

We have been running stable EOS nodes for about 4 years. Suddenly, we are completely unable to run a node starting from a snapshot.
Honestly we are desperate and out of ideas.

Steps to reproduce:

  • Download a V8 snapshot from https://snapshots.eosnation.io/
  • Run with the default config (it fails to start due to std::exception::what: Unknown option 'reversible-blocks-db-size-mb' inside the config file config.ini)
  • Bump chain-state-db-size-mb to 65536 or any multiple. The node processes the snapshot, starts syncing, and freezes after about an hour, throwing 'Serialization::FatalSerializationException'

We are running on a 256 GB server, but the process doesn't use more than 4 GB.

In this screenshot you can see how it quickly catches up, continues syncing, and then freezes.
[screenshot]

@the-smooth-operator
Author

Are you able to start a node from scratch using a snapshot? If so, we would love to see which configuration you are using.

@greg7mdp
Contributor

Honestly we are desperate and out of ideas.

Maybe try setting database-map-mode = mapped_private. It could help if you are using a slow disk. Also set chain-state-db-size-mb = 180000 as suggested above.
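Taken together, a minimal config.ini fragment with both settings suggested in this thread would be:

    # Size the state DB generously and map it privately (values as suggested above)
    chain-state-db-size-mb = 180000
    database-map-mode = mapped_private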

@the-smooth-operator
Author

We already tried chain-state-db-size-mb = 180000.
We are now going to test database-map-mode = mapped_private.

@greg7mdp
Contributor

We already tried chain-state-db-size-mb = 180000. We are now going to test database-map-mode = mapped_private.

Please make sure you do both.

@petemarinello

Please make sure you do both.

Just confirmed we are currently testing with chain-state-db-size-mb = 180000 and database-map-mode = mapped_private.

@heifner
Member

heifner commented Oct 30, 2024

10/30/24 2:20:34 ➜  bin git:(52215f90c7) ✗ zstd -d snapshot-2024-10-30-16-eos-v8-0402124605.bin.zst
snapshot-2024-10-30-16-eos-v8-0402124605.bin.zst: 46486102417 bytes
10/30/24 2:22:40 ➜  bin git:(52215f90c7) ✗ ./nodeos --version
info  2024-10-30T19:22:40.911 nodeos    main.cpp:158                  main                 ] nodeos started
v1.0.2
10/30/24 2:23:09 ➜  bin git:(52215f90c7) ✗ rm -rf main-d
10/30/24 2:25:50 ➜  bin git:(52215f90c7) ✗ ./nodeos --data-dir main-d --config-dir main-d --chain-state-db-size-mb 180000 --plugin eosio::chain_api_plugin --plugin eosio::db_size_api_plugin --p2p-peer-address eos.seed.eosnation.io:9876 --p2p-peer-address peer1.eosphere.io:9876 --p2p-peer-address peer2.eosphere.io:9876 --p2p-peer-address p2p.genereos.io:9876 --snapshot snapshot-2024-10-30-16-eos-v8-0402124605.bin
...
info  2024-10-30T20:13:41.600 nodeos    controller.cpp:1872           startup              ] Snapshot loaded, lib: 402124605
...
info  2024-10-30T20:14:18.200 nodeos    controller.cpp:3641           log_applied          ] Received block 8625ffa71c93416f... #402125000 @ 2024-10-30T16:03:19.000 signed by eosnationftw [trxs: 7, lib: 402124998, net: 1336, cpu: 12415 us, elapsed: 28617 us, applying time: 29212 us, latency: 15059200 ms]
...
info  2024-10-30T20:22:41.109 nodeos    controller.cpp:3641           log_applied          ] Received block 4b731dbac2e51af3... #402155884 @ 2024-10-30T20:22:41.500 signed by aus1genereos [trxs: 9, lib: 402155882, net: 1336, cpu: 11855 us, elapsed: 2298 us, applying time: 3053 us, latency: -390 ms]

Note peer2.eosphere.io:9876 did keep giving me errors. I would recommend removing that peer.

@petemarinello

The situation seems to be improving: we are currently observing over 100 GB of memory usage, which is expected since mapped_private keeps the chain state in a private memory mapping, if I understand correctly. However, after several hours the same issue remains: the node still halts block processing. It is also worth noting that there have been no restarts of the service.

[screenshot]

@heifner
Member

heifner commented Oct 30, 2024

node continues to halt block processing

@petemarinello are you working with @the-smooth-operator or did you hijack this issue?

Can you provide your complete log?

@petemarinello

petemarinello commented Oct 30, 2024

@petemarinello are you working with @the-smooth-operator or did you hijack this issue?

We are working together on this issue.

@petemarinello

Update from our side: after removing peer2.eosphere.io:9876 from the peer list, things seem to be working as expected. However, we need more time to monitor before fully calling the issue resolved.

@the-smooth-operator
Author

Two nodes have been running for 12 hours now. They are doing slightly better, I guess due to using database-map-mode = mapped_private. However, they still throw the same serialization errors and are not functioning correctly.

One interesting thing: at some point both processes stopped producing logs, yet they still answer RPC calls from time to time, and we can see their block height increasing.
I'm attaching the log output of the last minutes before the process froze.
eos-fn-log.txt

They are using 100GB+ of memory and almost no CPU or disk.

In this screenshot you can see how the nodes occasionally answer the internal RPC calls that we use to determine their height, even though they are not logging anything.
[screenshot]
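A minimal sketch of that kind of height check against the chain API (the host and port here are placeholders, not our actual endpoints):

    # Assumes eosio::chain_api_plugin and an HTTP endpoint on 127.0.0.1:8888 (placeholder)
    curl -s http://127.0.0.1:8888/v1/chain/get_info | jq '{head_block_num, last_irreversible_block_num}'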

Current configuration:

    p2p-peer-address = eos.seed.eosnation.io:9876
    p2p-peer-address = peer1.eosphere.io:9876
    p2p-peer-address = p2p.genereos.io:9876


    # Plugins
    plugin = eosio::chain_plugin
    plugin = eosio::net_plugin
    plugin = eosio::http_plugin
    plugin = eosio::chain_api_plugin
    plugin = eosio::net_api_plugin
    plugin = eosio::db_size_api_plugin

    # Stability Configs
    http-validate-host = false
    chain-state-db-size-mb = 180000
    eos-vm-oc-cache-size-mb = 4096
    database-map-mode = mapped_private


    # Default settings are way too low
    net-threads = 6
    http-threads = 6
    chain-threads = 4

@heifner
Member

heifner commented Oct 31, 2024

Thanks for the log. Very helpful. You hit an issue I thought was introduced (and fixed) after 1.0.2; I see now that it is in 1.0.2. This is fixed in 1.0.3. Until 1.0.3 is released, you could build the release/1.0 branch.
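A rough sketch of building from that branch, assuming the standard CMake flow; check the spring repository's build documentation for the exact prerequisites on your platform:

    # Rough sketch -- see the spring repo's build docs for required dependencies
    git clone --recursive https://github.com/AntelopeIO/spring.git
    cd spring
    git checkout release/1.0
    git submodule update --init --recursive
    mkdir -p build && cd build
    cmake -DCMAKE_BUILD_TYPE=Release ..
    make -j "$(nproc)"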

@the-smooth-operator
Author

That's great, we really need to get our nodes up. We are going to run some replicas with the code on the release/1.0 branch.
My colleagues will report later on this thread.

I have two questions: what did you spot in the logs that made you relate it to a bug? And do you know why other operators are able to run nodes without hitting this issue?

@heifner
Member

heifner commented Oct 31, 2024

  1. The sync_timeout. Your running with the net_plugin_impl logger at debug level provided this very useful information (a sketch of enabling that logger follows below). I remembered we fixed an issue with the sync timeout; see https://github.com/AntelopeIO/spring/pull/897/files

         debug 2024-10-30T20:00:34.813 net-5     net_plugin.cpp:2268           sync_timeout         ] ["eosn-eos-seed170:9876 - be6402b" - 1 209.249.216.152:9876] sync timed out
         terminate called after throwing an instance of 'Serialization::FatalSerializationException'

  2. You will only hit this issue if you have a sync timeout, which requires a slow peer that you are syncing from.
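For reference, a minimal sketch of the loggers entry in logging.json that enables that logger at debug level; the appender name "stderr" is an assumption here, so match it to whatever console appender your generated logging.json defines:

    {
      "loggers": [
        {
          "name": "net_plugin_impl",
          "level": "debug",
          "enabled": true,
          "additivity": false,
          "appenders": [ "stderr" ]
        }
      ]
    }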

@heifner
Member

heifner commented Oct 31, 2024

Assuming this is fixed in v1.0.3. Feel free to re-open if you discover otherwise.

@ericpassmore
Contributor

Note:start
category: System Stability
component: Internal
summary: Node freeze closed by other fixes.
Note:end
