Replies: 2 comments
-
Unfortunately, that message is rather generic -- pretty much any issue with MPI termination produces it, so there isn't much to go on to figure out what happened. It would help to provide more information: the type and version of the MPI implementation you're using (e.g. OpenMPI 4.1.1), the full command line you ran (including the number of processes passed to mpirun), and any additional detail from the tracer output (e.g. the last lines printed from each of the nodes). In particular, Rosetta has a couple of different MPI job-distribution systems, and depending on your flags you may have invoked one or the other; knowing which was used would help narrow things down. Also, since you got all of your outputs, the problem is likely in the cleanup code, so checking how the number of extracted structures compares with the number of processes requested may point to the issue. (In particular, I seem to recall similar problems when the number of processes exceeded the number of structures to extract -- the "extra" processes that never got anything to do might not clean up correctly.)
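For reference, a hedged sketch of the kind of details that help (the file names, `-np` value, and build suffix below are placeholders, not taken from the actual run):

```sh
# Example launch line to report back (placeholder paths and process count):
mpirun -np 8 extract_pdbs.mpi.linuxgccrelease -in:file:silent decoys.silent

# Rough count of structures in a silent file: typically one "SCORE:" line per
# decoy plus a single header line, so subtract one from this count.
grep -c '^SCORE:' decoys.silent
```

Comparing that count against the `-np` value would show whether some ranks had nothing to do.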
-
Hi, Noora. This would be expected when running `extract_pdbs` with the MPI build: `extract_pdbs` is inherently single-threaded and doesn't use MPI at all. So if you launch multiple MPI processes, only the first one does anything, and the others sit idle (and likely don't spin down properly at the end). You *can* run the MPI build of `extract_pdbs` (i.e. aside from that message, it won't cause any problems), but there's no advantage to it.
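In practice, the simplest way to avoid the message is to not launch extra ranks at all. A minimal sketch, assuming a silent-file input (the file name and build suffixes are placeholders and vary by platform/compiler):

```sh
# Run the serial build if you have one:
extract_pdbs.linuxgccrelease -in:file:silent decoys.silent

# Or, if only the MPI build is installed, launch a single rank so no process sits idle:
mpirun -np 1 extract_pdbs.mpi.linuxgccrelease -in:file:silent decoys.silent
```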
On Tue, Jul 30, 2024 at 11:28 AM Noora Azadvari wrote:
I’m getting this error at the end of running extract_pdbs on mpi release. All PDBs are extracted, but the job is reported as Failed, and I just wanted to report it.
mpirun has exited due to process rank X with PID X on
node X exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
-
I’m getting this error at the end of running `extract_pdbs` on mpi release. All PDBs are extracted, but the job is reported as Failed, and I just wanted to report it.