Realm: barrier assertion #1809

Open
syamajala opened this issue Dec 19, 2024 · 13 comments

@syamajala
Contributor

I'm seeing the following assertion when running S3D on Frontier:

s3d.x: /lustre/orion/cmb103/scratch/seshuy/legion_s3d_tdb/legion/runtime/realm/barrier_impl.cc:1259: static void Realm::BarrierTriggerMessage::handle_message(Realm::NodeID, const Realm::BarrierTriggerMessage&, const void*, size_t, Realm::TimeLimit): Assertion `datalen == (impl->redop->sizeof_lhs * (trigger_args.internal.trigger_gen - trigger_args.internal.previous_gen))' failed.

It seems to happen when I turn on profiling as my run without profiling succeeded.

@syamajala syamajala added the S3D label Dec 19, 2024
@syamajala
Contributor Author

Also I am running on the onepool branch.

@lightsighter
Contributor

The critical path profiling requires the use of Realm barrier reductions which are not used when you are not profiling. What are the values of datalen, impl->redop->sizeof_lhs, trigger_args.internal.trigger_gen, trigger_args.internal.previous_gen, and trigger_args.internal.first_gen?
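For anyone collecting those values, the failing check can be restated as a small standalone sketch. This is not Realm's code: the struct and function names below are hypothetical, and only the field names (datalen, sizeof_lhs, first_gen, previous_gen, trigger_gen) are taken from the assertion message and the question above. It assumes, as the assertion's arithmetic implies, that the payload carries one reduction value of size sizeof_lhs for each generation between previous_gen and trigger_gen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the generation fields named in the assertion;
// this is NOT Realm's actual type.
struct TriggerGens {
  std::uint64_t first_gen;     // only logged here, not part of the check
  std::uint64_t previous_gen;
  std::uint64_t trigger_gen;
};

// Payload size the assertion expects:
// sizeof_lhs * (trigger_gen - previous_gen).
std::size_t expected_datalen(std::size_t sizeof_lhs, const TriggerGens &g) {
  assert(g.trigger_gen >= g.previous_gen);
  return sizeof_lhs *
         static_cast<std::size_t>(g.trigger_gen - g.previous_gen);
}

// Logs the values asked for above and reports whether datalen matches.
bool check_trigger_payload(std::size_t datalen, std::size_t sizeof_lhs,
                           const TriggerGens &g, std::ostream &log) {
  const std::size_t expected = expected_datalen(sizeof_lhs, g);
  log << "datalen=" << datalen
      << " sizeof_lhs=" << sizeof_lhs
      << " first_gen=" << g.first_gen
      << " previous_gen=" << g.previous_gen
      << " trigger_gen=" << g.trigger_gen
      << " expected=" << expected << "\n";
  return datalen == expected;
}
```

Calling check_trigger_payload(datalen, sizeof_lhs, gens, std::cerr) just before the assert would print all five values in one line, which makes it easy to see whether the payload is short by whole generations or by some other amount.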

@lightsighter
Contributor

If you want to avoid this error for now, and you don't mind losing critical path profiling, you can run with -lg:prof_no_critical_paths.

I'll also note that this doesn't have anything to do with onepool and will happen in the master branch as well.

@apryakhin
Contributor

Yep, that is certainly a bug introduced by my recent merge of the notification broadcast. I will fix it as soon as I can.

@apryakhin
Contributor

apryakhin commented Jan 9, 2025

That will be the patch:

It still needs a test, which I will add shortly; after that it's ready.

@apryakhin
Contributor

@syamajala Are you hitting the assert only on the onepool branch? I wonder if we can actually test the path with S3D.

@lightsighter
Contributor

The onepool branch is merged into the master branch now, so you can merge those changes into whatever feature branch you are testing.

@apryakhin
Contributor

PR 1613 should already include the changes from the onepool branch, then. @syamajala, just following up again on testing the fix.

@syamajala
Contributor Author

Will test the pull request.

@syamajala
Contributor Author

Frontier is down today.

@syamajala
Contributor Author

Job is in the queue, but it looks like the compute nodes are still down on Frontier.

@syamajala
Contributor Author

I tested on sapling and it seems to be working.

@syamajala
Contributor Author

Oops, I meant to click comment, not close. Please close after you merge to master.

@syamajala syamajala reopened this Jan 16, 2025

3 participants