Fix interrupt transaction race condition #1107

heifner · 2025-01-16T21:32:37Z

The support for interrupting a transaction on a new best head (#1047) has a race condition: an interrupt could hit a speculative transaction instead of a transaction in apply block. The speculative transaction would then fail and not be retried on the node.

This PR refactors platform_timer to track whether it was triggered by an explicit interrupt or by the timer timing out. This allows transaction_context::checktime to tell directly whether a transaction was interrupted and to raise an appropriate interrupt exception. An interrupted transaction can then be retried, similar to how a transaction is retried if it hits a block boundary.

Resolves #1095

spoonincode · 2025-01-16T22:01:17Z

libraries/chain/include/eosio/chain/platform_timer.hpp

@@ -17,7 +15,8 @@ struct platform_timer {

   void start(fc::time_point tp);
   void stop();
-   void expire_now();
+   void interrupt_timer();
+   void _expire_now(); // called by internal timer


called by internal timer

Can be made private then?

heifner · 2025-01-16

Not without some friends, and with the multiple impls that becomes rather a pain.

spoonincode · 2025-01-16

All of the impls are the same platform_timer struct though, not separate classes. And even _state is already private.

heifner · 2025-01-16

Hmm, I'm a bit surprised that worked.

greg7mdp · 2025-01-17T15:38:47Z

libraries/chain/platform_timer_accuracy.cpp

         auto end = std::chrono::high_resolution_clock::now();
         int timer_slop = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count() - interval;
+         timer.stop();


Not seeing why this added line is necessary?

heifner · 2025-01-17

I added it because I originally added an assert that start() is only called when the timer is stopped. However, we don't actually honor that constraint elsewhere currently. It seems better to honor that constraint normally.

heifner · 2025-01-17 (edited)

I decided it was best to enforce the invariant that stop() is always called before start(). Surprise, surprise: the issue there is with deferred transactions.

greg7mdp · 2025-01-17T15:40:32Z

libraries/chain/platform_timer_asio_fallback.cpp

      call_expiration_callback();
   }
 }

 void platform_timer::stop() {
-   if(expired)
+   if(_state == state_t::stopped)
      return;

   my->timer->cancel();


Wouldn't it make sense to call my->timer->cancel(); only if (_state == state_t::running)?

heifner · 2025-01-17

stop() now needs to set the state to stopped so that the assert in start() is not triggered.
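
The invariant discussed in the last two threads can be sketched as follows. This is a hedged illustration only: `guarded_timer`, `timer_scope`, and the `cancels()` counter are hypothetical names, and the counter merely stands in for the place where a real timer would cancel its pending wait.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch of the start()/stop() invariant; not the actual platform_timer.
enum class timer_state { stopped, running, timed_out, interrupted };

class guarded_timer {
public:
   void start() {
      // Invariant: the previous run must have been stopped first.
      assert(_state.load() == timer_state::stopped);
      _state.store(timer_state::running);
   }
   void stop() {
      if (_state.load() == timer_state::stopped)
         return;                               // already stopped, nothing to do
      ++_cancels;                              // a real timer would cancel its wait here
      _state.store(timer_state::stopped);      // reset from running, timed_out, or interrupted
   }
   void interrupt() {
      timer_state expected = timer_state::running;
      _state.compare_exchange_strong(expected, timer_state::interrupted);
   }
   timer_state state() const { return _state.load(); }
   int cancels() const { return _cancels; }
private:
   std::atomic<timer_state> _state{timer_state::stopped};
   int _cancels = 0;
};

// Scope guard: stop() runs on every exit path, including exceptions,
// so the next start() finds the timer in the stopped state.
class timer_scope {
public:
   explicit timer_scope(guarded_timer& t) : _t(t) { _t.start(); }
   ~timer_scope() { _t.stop(); }
   timer_scope(const timer_scope&) = delete;
   timer_scope& operator=(const timer_scope&) = delete;
private:
   guarded_timer& _t;
};
```

If stop() returned early for every state other than running, a timer left in the interrupted or timed_out state would trip the assert on the next start(). Resetting from any non-stopped state avoids that.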

greg7mdp · 2025-01-17T15:51:46Z

libraries/chain/platform_timer_asio_fallback.cpp

+   state_t expected = state_t::running;
+   if (_state.compare_exchange_strong(expected, state_t::timed_out)) {
+      call_expiration_callback();
+   }
+}
+
+void platform_timer::interrupt_timer() {
+   state_t expected = state_t::running;
+   if (_state.compare_exchange_strong(expected, state_t::interrupted)) {


Why don't these call my->timer->cancel()?

Isn't it possible that someone calls interrupt_timer() (which updates the state but doesn't cancel the async_wait), then calls start(fc::time_point::maximum()), and is surprised when the previous async_wait expires the timer?

heifner · 2025-01-17

We would need to add a mutex around the timer to do that. I think the intention is that stop() should be called before calling start() again.
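
The hazard raised in this thread can be reproduced in a toy, single-threaded model. Everything here is an assumption made for illustration: `toy_timer` is hypothetical, its queue of callbacks stands in for outstanding async_wait operations, and `start(false)` models a start with no deadline that does not arm a new wait. It does not reflect how the real timer or Asio is implemented.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model only; names and behavior are assumptions, not the real platform_timer.
enum class timer_state { stopped, running, timed_out, interrupted };

class toy_timer {
public:
   // arm == false models a start with no deadline: no new wait is queued.
   void start(bool arm) {
      _state = timer_state::running;
      if (arm)
         _pending.push_back([this] { expire(); });
   }
   // Updates the state but leaves any queued wait in place.
   void interrupt() {
      if (_state == timer_state::running)
         _state = timer_state::interrupted;
   }
   // Cancels queued waits and resets the state.
   void stop() {
      _pending.clear();
      _state = timer_state::stopped;
   }
   // Simulates the deadlines of all queued waits passing.
   void fire_pending() {
      auto waits = std::move(_pending);
      _pending.clear();
      for (auto& w : waits)
         w();
   }
   timer_state state() const { return _state; }
private:
   void expire() {
      if (_state == timer_state::running)
         _state = timer_state::timed_out;
   }
   timer_state _state = timer_state::stopped;
   std::vector<std::function<void()>> _pending;
};

// Interrupts a run, optionally stops, restarts with no deadline,
// then lets the old wait fire. Returns the state of the new run.
inline timer_state restart_after_interrupt(bool call_stop) {
   toy_timer t;
   t.start(true);
   t.interrupt();
   assert(t.state() == timer_state::interrupted);
   if (call_stop)
      t.stop();
   t.start(false);
   t.fire_pending();
   return t.state();
}
```

In this model, `restart_after_interrupt(false)` ends in timed_out because the stale wait expires the new run, while `restart_after_interrupt(true)` stays running. That is consistent with the stated intention that stop() be called before start().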

ericpassmore · 2025-01-17T16:24:30Z

Note:start
category: System Stability
component: Internal
summary: Fix a race condition where an interrupt could interrupt a speculative transaction.
Note:end

heifner added 6 commits January 15, 2025 12:56

GH-1091 Attempt to make interrupt of trx more common

c9d4276

GH-1095 Additional use of bios node to test using a configured producer node that is not producing for trx execution

1a2a0c9

GH-1095 Refactor platform_timer and transaction_checktime_timer to use an enum for timer state instead of a bool expired. This allows differentiation between timer expired and interruption.

11f1e9a

GH-1095 Revert removal of eos_vm_oc_compile_interrupt flag, as it is still needed.

8ad83f2

GH-1095 Update to use the new _state like platform_timer_posix

1a50b7b

GH-1095 Allow interrupted transaction to be retried

0f4fbb4

heifner requested review from spoonincode and greg7mdp January 16, 2025 21:32

heifner added the OCI label (Work exclusive to OCI team) Jan 16, 2025

spoonincode reviewed Jan 16, 2025

GH-1095 Make expire_now() private

d072fcd

greg7mdp reviewed Jan 17, 2025

GH-1092 Add assert that start() only called after stop(). Fix transaction_context to call stop() on undo/squash

848f90e

greg7mdp approved these changes Jan 17, 2025
