Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix interrupt transaction race condition #1107

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

heifner
Copy link
Member

@heifner heifner commented Jan 16, 2025

The support for interrupting transaction on new best head, #1047, has a race condition such that an interrupt could interrupt a speculative transaction instead of a transaction in apply block. This would cause the transaction to fail and not be re-tried on the node.

This PR refactors the platform_timer to keep track if it was trigger by an explicit interrupt or by the timer timing out. This allows for transaction_context::checktime to directly tell if a transaction was interrupted and raise an appropriate interrupt exception. If a transaction is interrupted it can then be retried similar to how a transaction is retried if it hits a block boundary.

Resolves #1095

@heifner heifner added the OCI Work exclusive to OCI team label Jan 16, 2025
@@ -17,7 +15,8 @@ struct platform_timer {

void start(fc::time_point tp);
void stop();
void expire_now();
void interrupt_timer();
void _expire_now(); // called by internal timer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called by internal timer

Can be made private then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not without some friends and with the multiple impls that becomes rather a pain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the impls are the same platform_timer struct though, not separate classes. And even _state is already private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm a bit surprised that worked.

auto end = std::chrono::high_resolution_clock::now();
int timer_slop = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count() - interval;
timer.stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not seeing why this added line is necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it because I originally added an assert that start() only called when timer is stopped. However, we don't actually honor that constraint currently elsewhere. Seems like it is better to honor that constraint normally.

Copy link
Member Author

@heifner heifner Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided it best to fix the invariant that stop() always called before start(). Surprise-surprise the issue there is with deferred-transactions.

call_expiration_callback();
}
}

void platform_timer::stop() {
if(expired)
if(_state == state_t::stopped)
return;

my->timer->cancel();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make sense to call my->timer->cancel(); only if (_state == state_t::running)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now needs to set to stopped, so the assert on start() is not triggered.

Comment on lines +78 to +86
state_t expected = state_t::running;
if (_state.compare_exchange_strong(expected, state_t::timed_out)) {
call_expiration_callback();
}
}

void platform_timer::interrupt_timer() {
state_t expected = state_t::running;
if (_state.compare_exchange_strong(expected, state_t::interrupted)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't these call my->timer->cancel()?

Isn't it possible that someone calls interrupt_timer() (which updates the state but doesn't cancel the async_wait), and then calls start(fc::time_point::maximum()), and may be surprised when the previous async_wait expires the timer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to add a mutex around the timer to do that. I think the intention is that stop() should be called before calling start() again.

@ericpassmore
Copy link
Contributor

Note:start
category: System Stability
component: Internal
summary: Fix a race condition where an interrupt could interrupt a speculative transaction.
Note:end

…tion_context to call stop() on undo/squash
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCI Work exclusive to OCI team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test failure: restart-scenarios-if-test-resync
4 participants