Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix interrupt transaction race condition #1107

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
2 changes: 1 addition & 1 deletion libraries/chain/controller.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4533,7 +4533,7 @@ struct controller_impl {
// validated.
if (applying_block) {
ilog("Interrupting apply block");
main_thread_timer.expire_now();
main_thread_timer.interrupt_timer();
}
}

Expand Down
16 changes: 12 additions & 4 deletions libraries/chain/include/eosio/chain/platform_timer.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,6 @@

#include <atomic>

#include <signal.h>

namespace eosio { namespace chain {

struct platform_timer {
Expand All @@ -17,7 +15,8 @@ struct platform_timer {

void start(fc::time_point tp);
void stop();
void expire_now();
void interrupt_timer();
void _expire_now(); // called by internal timer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called by internal timer

Can be made private then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not without some friends and with the multiple impls that becomes rather a pain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the impls are the same platform_timer struct though, not separate classes. And even _state is already private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm a bit surprised that worked.


/* Sets a callback for when timer expires. Be aware this could might fire from a signal handling context and/or
on any particular thread. Only a single callback can be registered at once; trying to register more will
Expand All @@ -35,9 +34,18 @@ struct platform_timer {
_expiration_callback_data = user;
}

std::atomic_bool expired = true;
enum class state_t {
running,
timed_out,
interrupted,
stopped
};
state_t timer_state() const { return _state; }

private:
std::atomic<state_t> _state = state_t::stopped;


struct impl;
constexpr static size_t fwd_size = 8;
fc::fwd<impl,fwd_size> my;
Expand Down
3 changes: 2 additions & 1 deletion libraries/chain/include/eosio/chain/transaction_context.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,13 @@ namespace eosio::chain {
void start(fc::time_point tp);
void stop();

platform_timer::state_t timer_state() const { return _timer.timer_state(); }

/* Sets a callback for when timer expires. Be aware this could might fire from a signal handling context and/or
on any particular thread. Only a single callback can be registered at once; trying to register more will
result in an exception. Use nullptr to disable a previously set callback. */
void set_expiration_callback(void(*func)(void*), void* user);

std::atomic_bool& expired;
private:
platform_timer& _timer;

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,7 @@ struct eosvmoc_tier {
if (executing_code_hash.load() == code_id) {
ilog("EOS VM OC tier up interrupting ${id}", ("id", code_id));
eos_vm_oc_compile_interrupt = true;
main_thread_timer.expire_now();
main_thread_timer.interrupt_timer();
}
});
}
Expand Down Expand Up @@ -181,7 +181,7 @@ struct eosvmoc_tier {
try {
get_instantiated_module(code_hash, vm_type, vm_version, context.trx_context)->apply(context);
} catch (const interrupt_exception& e) {
if (allow_oc_interrupt && eos_vm_oc_compile_interrupt) {
if (allow_oc_interrupt && eos_vm_oc_compile_interrupt && main_thread_timer.timer_state() == platform_timer::state_t::interrupted) {
++eos_vm_oc_compile_interrupt_count;
dlog("EOS VM OC compile complete interrupt of ${r} <= ${a}::${act} code ${h}, interrupt #${c}",
("r", context.get_receiver())("a", context.get_action().account)
Expand Down
3 changes: 2 additions & 1 deletion libraries/chain/platform_timer_accuracy.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -39,9 +39,10 @@ void compute_and_print_timer_accuracy(platform_timer& timer) {
for(unsigned int i = 0; i < loops; ++i) {
auto start = std::chrono::high_resolution_clock::now();
timer.start(fc::time_point(fc::time_point::now().time_since_epoch() + fc::microseconds(interval)));
while(!timer.expired) {}
while(timer.timer_state() == platform_timer::state_t::running) {}
auto end = std::chrono::high_resolution_clock::now();
int timer_slop = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count() - interval;
timer.stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not seeing why this added line is necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it because I originally added an assert that start() only called when timer is stopped. However, we don't actually honor that constraint currently elsewhere. Seems like it is better to honor that constraint normally.

Copy link
Member Author

@heifner heifner Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided it best to fix the invariant that stop() always called before start(). Surprise-surprise the issue there is with deferred-transactions.


//since more samples are run for the shorter expirations, weigh the longer expirations accordingly. This
//helps to make a few results more fair. Two such examples: AWS c4&i5 xen instances being rather stable
Expand Down
25 changes: 16 additions & 9 deletions libraries/chain/platform_timer_asio_fallback.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -57,36 +57,43 @@ platform_timer::~platform_timer() {

void platform_timer::start(fc::time_point tp) {
if(tp == fc::time_point::maximum()) {
expired = false;
_state = state_t::running;
return;
}
fc::microseconds x = tp.time_since_epoch() - fc::time_point::now().time_since_epoch();
if(x.count() <= 0)
expired = true;
_state = state_t::timed_out;
else {
expired = false;
_state = state_t::running;
my->timer->expires_after(std::chrono::microseconds(x.count()));
my->timer->async_wait([this](const boost::system::error_code& ec) {
if(ec)
return;
expire_now();
_expire_now();
});
}
}

void platform_timer::expire_now() {
bool expected = false;
if (expired.compare_exchange_strong(expected, true)) {
void platform_timer::_expire_now() {
state_t expected = state_t::running;
if (_state.compare_exchange_strong(expected, state_t::timed_out)) {
call_expiration_callback();
}
}

void platform_timer::interrupt_timer() {
state_t expected = state_t::running;
if (_state.compare_exchange_strong(expected, state_t::interrupted)) {
Comment on lines +78 to +86
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't these call my->timer->cancel()?

Isn't it possible that someone calls interrupt_timer() (which updates the state but doesn't cancel the async_wait), and then calls start(fc::time_point::maximum()), and may be surprised when the previous async_wait expires the timer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to add a mutex around the timer to do that. I think the intention is that stop() should be called before calling start() again.

call_expiration_callback();
}
}

void platform_timer::stop() {
if(expired)
if(_state == state_t::stopped)
return;

my->timer->cancel();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make sense to call my->timer->cancel(); only if (_state == state_t::running)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now needs to set to stopped, so the assert on start() is not triggered.

expired = true;
_state = state_t::stopped;
}

}}
19 changes: 13 additions & 6 deletions libraries/chain/platform_timer_kqueue.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ platform_timer::platform_timer() {

if(c == 1 && anEvent.filter == EVFILT_TIMER) {
platform_timer* self = (platform_timer*)anEvent.udata;
self->expire_now();
self->_expire_now();
}
else if(c == 1 && anEvent.filter == EVFILT_USER)
return;
Expand Down Expand Up @@ -105,21 +105,28 @@ void platform_timer::start(fc::time_point tp) {
}
}

void platform_timer::expire_now() {
bool expected = false;
if (expired.compare_exchange_strong(expected, true)) {
void platform_timer::_expire_now() {
state_t expected = state_t::running;
if (_state.compare_exchange_strong(expected, state_t::timed_out)) {
call_expiration_callback();
}
}

void platform_timer::interrupt_timer() {
state_t expected = state_t::running;
if (_state.compare_exchange_strong(expected, state_t::interrupted)) {
call_expiration_callback();
}
}

void platform_timer::stop() {
if(expired)
if(_state == state_t::stopped)
return;

struct kevent64_s stop_timer_event;
EV_SET64(&stop_timer_event, my->timerid, EVFILT_TIMER, EV_DELETE, 0, 0, 0, 0, 0);
kevent64(kqueue_fd, &stop_timer_event, 1, NULL, 0, KEVENT_FLAG_IMMEDIATE, NULL);
expired = true;
_state = state_t::stopped;
}

}}
27 changes: 17 additions & 10 deletions libraries/chain/platform_timer_posix.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ struct platform_timer::impl {

static void sig_handler(int, siginfo_t* si, void*) {
platform_timer* self = (platform_timer*)si->si_value.sival_ptr;
self->expire_now();
self->_expire_now();
}
};

Expand Down Expand Up @@ -56,35 +56,42 @@ platform_timer::~platform_timer() {

void platform_timer::start(fc::time_point tp) {
if(tp == fc::time_point::maximum()) {
expired = false;
_state = state_t::running;
return;
}
fc::microseconds x = tp.time_since_epoch() - fc::time_point::now().time_since_epoch();
if(x.count() <= 0)
expired = true;
_state = state_t::timed_out;
else {
time_t secs = x.count() / 1000000;
long nsec = (x.count() - (secs*1000000)) * 1000;
struct itimerspec enable = {{0, 0}, {secs, nsec}};
expired = false;
_state = state_t::running;
if(timer_settime(my->timerid, 0, &enable, NULL) != 0)
expired = true;
_state = state_t::timed_out;
}
}

void platform_timer::expire_now() {
bool expected = false;
if (expired.compare_exchange_strong(expected, true)) {
void platform_timer::_expire_now() {
state_t expected = state_t::running;
if (_state.compare_exchange_strong(expected, state_t::timed_out)) {
call_expiration_callback();
}
}

void platform_timer::interrupt_timer() {
state_t expected = state_t::running;
if (_state.compare_exchange_strong(expected, state_t::interrupted)) {
call_expiration_callback();
}
}

void platform_timer::stop() {
if(expired)
if(_state == state_t::stopped)
return;
struct itimerspec disable = {{0, 0}, {0, 0}};
timer_settime(my->timerid, 0, &disable, NULL);
expired = true;
_state = state_t::stopped;
}

}
8 changes: 4 additions & 4 deletions libraries/chain/transaction_context.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,7 @@
namespace eosio::chain {

transaction_checktime_timer::transaction_checktime_timer(platform_timer& timer)
: expired(timer.expired), _timer(timer) {
expired = 0;
: _timer(timer) {
}

void transaction_checktime_timer::start(fc::time_point tp) {
Expand Down Expand Up @@ -489,11 +488,12 @@ namespace eosio::chain {
}

void transaction_context::checktime()const {
if(BOOST_LIKELY(transaction_timer.expired == false))
platform_timer::state_t expired = transaction_timer.timer_state();
if(BOOST_LIKELY(expired == platform_timer::state_t::running))
return;

auto now = fc::time_point::now();
if (explicit_billed_cpu_time && block_deadline > now) {
if (expired == platform_timer::state_t::interrupted) {
EOS_THROW( interrupt_exception, "interrupt signaled, ran ${bt}us, start ${s}",
("bt", now - pseudo_start)("s", start) );
} else if( explicit_billed_cpu_time || deadline_exception_code == deadline_exception::code_value ) {
Expand Down
3 changes: 2 additions & 1 deletion libraries/chain/webassembly/runtimes/eos-vm.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,8 @@ namespace {
guard(transaction_checktime_timer& timer, F&& func)
: _timer(timer), _func(std::forward<F>(func)) {
_timer.set_expiration_callback(&callback, this);
if(_timer.expired) {
platform_timer::state_t expired = _timer.timer_state();
if(expired == platform_timer::state_t::timed_out || expired == platform_timer::state_t::interrupted) {
_func(); // it's harmless if _func is invoked twice
}
}
Expand Down
1 change: 1 addition & 0 deletions plugins/producer_plugin/producer_plugin.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,7 @@ bool exception_is_exhausted(const fc::exception& e) {
return (code == block_cpu_usage_exceeded::code_value) ||
(code == block_net_usage_exceeded::code_value) ||
(code == deadline_exception::code_value) ||
(code == interrupt_exception::code_value) || // allow interrupted trxs to be retried
(code == ro_trx_vm_oc_compile_temporary_failure::code_value);
}
} // namespace
Expand Down
3 changes: 3 additions & 0 deletions tests/nodeos_high_transaction_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,9 @@
nonProdNodes.append(node)
else:
prodNodes.append(node)
if not args.send_duplicates:
# want to test sending to a node that is configured as a producer but not producing
nonProdNodes.append(cluster.biosNode)
nonProdNodeCount = len(nonProdNodes)

# *** delegate bandwidth to accounts ***
Expand Down
3 changes: 2 additions & 1 deletion tests/nodeos_retry_transaction_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -130,7 +130,7 @@

apiNodeCount = len(apiNodes)

node=apiNodes[0]
node=cluster.biosNode # use biosNode to test configured as a producer but not producing
checkTransIds = []
startTime = time.perf_counter()
Print("Create new accounts via %s" % (cluster.eosioAccount.name))
Expand All @@ -155,6 +155,7 @@
Print("Transfer funds took %s sec" % (nextTime - startTime))
startTime = nextTime

node=apiNodes[0]
Print("Delegate Bandwidth to new accounts")
for account in accounts:
trans=node.delegatebw(account, 200.0000, 200.0000, waitForTransBlock=False, exitOnError=True, reportStatus=False)
Expand Down
Loading