Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DDS-related processes crash after system time adjustment #3836

Open
1 task done
xjzer opened this issue Sep 9, 2023 · 4 comments
Open
1 task done

DDS-related processes crash after system time adjustment #3836

xjzer opened this issue Sep 9, 2023 · 4 comments
Labels
bug Issue to report a bug

Comments

@xjzer
Copy link
Contributor

xjzer commented Sep 9, 2023

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

Expected normal operation of the process

Current behavior

1. After the controller is powered up, the DDS-related processes start normally, and then the system time is adjusted, at which point the processes may crash occasionally.
2. The results of gdb parsing the core file show that both crash the program after an exception is thrown by the destructor ~RTPSMessageGroup().
3. After removing try/catch/throw from the destructor ~RTPSMessageGroup(), and after a couple of problems, the gbd backtrace is as follows.

  • delete try catch
RTPSMessageGroup::~RTPSMessageGroup() noexcept(false)
{
    //try
    //{
        send();
    //}
    //catch (...)
    //{
    //    if (!internal_buffer_)
    //   {
    //         participant_->return_send_buffer(std::move(send_buffer_));
    //     }
    //     throw;
    // }

    if (!internal_buffer_)
    {
        participant_->return_send_buffer(std::move(send_buffer_));
    }
}
  • add log
std::chrono::time_point<std::chrono::steady_clock> start_steady_clock = std::chrono::steady_clock::now();
std::chrono::time_point<std::chrono::system_clock> start_system_clock = std::chrono::system_clock::now();
if (!sender_->send(msgToSend, max_blocking_time_is_set_ ? max_blocking_time_point_
                                                        : (std::chrono::steady_clock::now() + std::chrono::hours(24))))
{
    std::chrono::time_point<std::chrono::steady_clock> end_steady_clock = std::chrono::steady_clock::now();
    std::chrono::time_point<std::chrono::system_clock> end_system_clock = std::chrono::system_clock::now();

    std::time_t start_c = std::chrono::system_clock::to_time_t(start_system_clock);
    std::time_t end_c = std::chrono::system_clock::to_time_t(end_system_clock);
    std::cerr << "max_blocking_time_is_set_ = " << max_blocking_time_is_set_ << std::endl;
    std::cerr
        << "max_blocking_time_point_ = "
        << std::chrono::duration_cast<std::chrono::milliseconds>(max_blocking_time_point_.time_since_epoch()).count()
        << std::endl;
    std::cerr << "start steady_clock = "
              << std::chrono::duration_cast<std::chrono::milliseconds>(start_steady_clock.time_since_epoch()).count()
              << ",
        end steady_clock =
        "
        << std::chrono::duration_cast<std::chrono::milliseconds>(end_steady_clock.time_since_epoch()).count()
        << std::endl;
    std::cerr << "start system_clock = " << std::put_time(std::localtime(&start_c), "% F % T") << "."
              << std::setfill('0') << std::setw(3)
              << (std::chrono::duration_cast<std::chrono::milliseconds>(start_system_clock.time_since_epoch()) -
                  std::chrono::duration_cast<std::chrono::seconds>(start_system_clock.time_since_epoch()))
                     .count()
              << std::endl;

    std::cerr << "end system_clock = " << std::put_time(std::localtime(&end_c), "% F % T") << "." << std::setfill('0')
              << std::setw(3)
              << (std::chrono::duration_cast<std::chrono::milliseconds>(end_system_clock.time_since_epoch()) -
                  std::chrono::duration_cast<std::chrono::seconds>(end_system_clock.time_since_epoch()))
              << std::endl;

    std::cerr << "== typeid(*sender_)" << typeid(*sender_).name() << std::endl;
    std::cout << "dynamic_cast<LocatorSelectorSender *> " << dynamic_cast<LocatorSelectorSender *>(sender_)
              << std::endl;
    std::cout << "dynamic_cast<ReaderLocator *> " << dynamic_cast<ReaderLocator *>(sender_) << std::endl;
    std::cout << "dynamic_cast<DirectMessageSender *> " << dynamic_cast<DirectMessageSender *>(sender_) << std::endl;
    std::cout << "dynamic_cast<WriterProxy *> " << dynamic_cast<WriterProxy *>(sender_) << std::endl;
    throw timeout();
}
  • log1
    image

  • log2
    image

  • gdb bt1
    image

  • gdb bt2
    image

  • gdb bt3
    image

Steps to reproduce

  1. After the controller is powered up, the DDS-related processes start normally, and then the system time is adjusted, at which point the processes may crash occasionally.
  2. This problem is episodic and no clear steps to reproduce it have been found.

Fast DDS version/commit

2.5.0

Platform/Architecture

Other. Please specify in Additional context section.

Transport layer

Default configuration, UDPv4 & SHM

Additional context

  1. aarch64
  2. Linux
  3. UDP Transport

XML configuration file

use fastddsgen

Relevant log output

No response

Network traffic capture

No response

@xjzer xjzer added the triage Issue pending classification label Sep 9, 2023
@JLBuenoLopez JLBuenoLopez added need more info Issue that requires more info from contributor and removed triage Issue pending classification labels Sep 11, 2023
@JLBuenoLopez
Copy link
Contributor

Hi @xjzer

Thanks for opening the ticket. I am afraid that Fast DDS v2.5.x is no longer being maintained. Can you please check if it is happening in some of the currently maintained branches? You can find more information in Fast DDS Release support page.

Also, is this crash happening when the system time adjustment is to fast the time forward or to rewind it?

@xjzer
Copy link
Contributor Author

xjzer commented Sep 11, 2023

Hi @xjzer

Thanks for opening the ticket. I am afraid that Fast DDS v2.5.x is no longer being maintained. Can you please check if it is happening in some of the currently maintained branches? You can find more information in Fast DDS Release support page.

Also, is this crash happening when the system time adjustment is to fast the time forward or to rewind it?

Hi @JLBuenoLopez-eProsima

  1. Can you recommend a relatively stable version, as the latest is not necessarily the most stable.
  2. every time the controller starts, there is a time jump, for example from 2020 to 2023, but the process does not necessarily crash. But on the other hand, although the process crashes are occasional, the point in time of each crash must correspond to the point in time when the time is adjusted.In short, when time is adjusted, there is not necessarily a process crash; when the process crashes, there must be a time adjustment.

@JLBuenoLopez JLBuenoLopez added bug Issue to report a bug and removed need more info Issue that requires more info from contributor labels Sep 12, 2023
@JLBuenoLopez
Copy link
Contributor

@xjzer,

  1. Any Fast DDS version is released as a stable version. Nevertheless, patch releases are frequently done to fix detected bugs. Currently, Fast DDS v2.10.x and Fast DDS v2.11.x are the officially supported versions. Both of them are in its second patch release so the can be considered to be really stable.
  2. Thanks for the info. We will need to take a look into the reported issue.

@xjzer
Copy link
Contributor Author

xjzer commented Sep 12, 2023

@JLBuenoLopez-eProsima

Some more clues, based on the backtrace provided earlier:

  1. The problem occurs because sender_->send() in RTPSMessageGroup::send() function returns false.

  2. Then according to the dynamic_case message, there are two possibilities for sender_ when something goes wrong, one is LocatorSelectorSender and the other is WriterProxy.

  3. so far found that the two derived classes send, will call RTPSParticipantImpl::sendSync () function, and this function returns false only one case, is try_lock_until return false, I added a log here to reproduce the problem. The log when the problem occurs is as follows.

  • function call graph
    func_call

  • add log
    try_lock_until_add_fail_log

  • coredump1 log
    core_log_1

  • coredump1 backtrace
    gdb_core_1

  • coredump2 log
    core_log_2

  • coredump2 backtrace
    gdb_core_2

  1. According to the information as above, it can be confirmed that RTPSMessageGroup::send() throws an exception due to the failure of try_lock_until, but try_lock_until is not returned false due to timeout, how should this be troubleshooted or handled, is it triggering the here mentioned "fail spuriously"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue to report a bug
Projects
None yet
Development

No branches or pull requests

2 participants