
Correctly handle dynamic extensions of the DVM #854

Merged
rhc54 merged 1 commit into openpmix:master from topic/dvm on Mar 23, 2021

Conversation

@rhc54 (Contributor) commented Mar 23, 2021

The DVM can be extended in response to add_host and add_hostfile
directives. In such cases, we need to provide the new daemons
with a complete picture of the currently executing jobs so they
can properly map the new one.

Signed-off-by: Ralph Castain <[email protected]>
rhc54 merged commit 543f62e into openpmix:master on Mar 23, 2021
rhc54 deleted the topic/dvm branch on Mar 23, 2021
@hppritcha (Contributor) commented:
@rhc54 we are looking at making use of this expansion feature. Are there any examples in the test suite that demonstrate how to use add_hosts to grow the DVM?

@rhc54 (Contributor Author) commented Jun 29, 2023

Currently, it is done via a "spawn" command - i.e., as part of starting another job. I suppose we could either add an option to the PMIx palloc tool or create a PRRTE pctrl tool that would independently support it. We could avoid invoking a scheduler if a hostfile was given.

@rhc54 (Contributor Author) commented Jun 29, 2023

Note that I'd need to check that this still works, as it has been a couple of years since anyone used it. I'm unaware of any examples that exercise it, nor of anything in the test suite that tests it.

@hppritcha (Contributor) commented:
Thanks. We actually found an "MPI" equivalent test in the ompi unit tests. We're currently thinking of adding an extension to prun for the experiments we're trying to do. We'll look into palloc and pctrl as other options too.

@rhc54 (Contributor Author) commented Jun 29, 2023

I don't see an "add-host" or "add-hostfile" cmd line option defined in src/util/prte_cmd_line.h, but they should be easy enough to add and then update the prun cmd line. All you'd have to do then is update src/prted/prte_app_parse.c to look for those options and include them in the pmix_app_t for the spawn call.

Like I said, I'll have to check the backend to ensure PRRTE still handles those correctly.

@rhc54 (Contributor Author) commented Jun 29, 2023

It looks like the backend support is present, though I haven't checked it out. In order to do that, I had to add the cmd line options anyway - see #1769. Check the help text to see if it makes sense to you and meets your needs.

I'll let you know once I've checked the backend to ensure it is working.

@rhc54 (Contributor Author) commented Jul 12, 2023

Just an update: the add-host and add-hostfile features are now working on PRRTE master branch as per the help text from #1769. I don't plan to bring that to the v3.0 or v3.1 release branches - let me know if you need it and we can discuss backporting it.
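
For reference, a rough usage sketch of the workflow with these options. The file names and mapping policy are placeholders; the authoritative option names and behavior are the help text added in #1769.

    # start a persistent DVM and record its contact URI
    prte --report-uri dvm.uri --hostfile hostfile --daemonize

    # launch a job and ask the DVM to absorb the nodes listed in new_hosts
    prun --dvm-uri file:dvm.uri --add-hostfile new_hosts --map-by ppr:2:node hostname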

@BhattaraiRajat commented:
@rhc54 I am using #1769. I am trying to use add-hostfile to add a new node to an existing DVM, but I am getting a "PMIx_Spawn failed (-25): UNREACHABLE" error. I am using it in the following way:

  1. Get an allocation from Slurm with four nodes: salloc -N 4
  2. Create a hostfile with three of these nodes and use it to start the DVM: prte --report-uri dvm.uri --hostfile hostfile --daemonize
  3. Create new_hostfile with all four nodes and pass it to --add-hostfile in prun: prun --dvm-uri file:dvm.uri --add-hostfile new_hostfile --hostfile new_hostfile --map-by ppr:2:node hostname

Could you please let me know if I am using this feature incorrectly, or what the issue might be?
I have also tried creating add_hostfile with only the node to be added and running prun --dvm-uri file:dvm.uri --add-hostfile add_hostfile --hostfile new_hostfile --map-by ppr:2:node hostname, which throws the same error.

@rhc54 (Contributor Author) commented Jul 13, 2023

Not entirely sure of the problem. Could be that the presence of Slurm is causing confusion. You might try adding --prtemca ras ^slurm --prtemca plm ^slurm to the prte cmd line to see if that helps. I also suspect there might be a problem with having both add-hostfile and hostfile options on the prun cmd line together.
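
As a concrete sketch, this is step 2 from the report above with those MCA parameters added (same placeholder file names as before):

    # start the DVM while disabling the Slurm RAS and PLM components
    prte --report-uri dvm.uri --hostfile hostfile --daemonize \
         --prtemca ras ^slurm --prtemca plm ^slurm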

@BhattaraiRajat commented:
@rhc54 Thank you. This works now. Apparently, we have to specify the number of slots along with the node name in the hostfile. Otherwise, the added node is assigned -1 slots. I am not sure if it is a bug or expected.
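
For anyone following along, a minimal illustration of the workaround described above; the node name and slot count are placeholders, and slots=N is the standard PRRTE hostfile syntax for an explicit slot count.

    # create the add_hostfile with an explicit slot count for the node being added
    echo "node04 slots=4" > add_hostfile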

@rhc54 (Contributor Author) commented Jul 19, 2023

Otherwise, the added node is assigned -1 slots. I am not sure if it is a bug or expected.

A bit of both. It is intended as an indicator that PRRTE should discover the number of slots based on CPUs on the new node. However, there is a check in there so that it isn't done in managed environments such as Slurm because the scheduler assigns the number of slots - and we cannot override it.

The problem here is that we are faking a dynamic environment inside of what is actually a static one. Slurm assigned the nodes and defined the number of slots for each node. We are then trying to use those nodes as if they are ours to define.

I'll have to ponder this a bit. There might be a way around it, but we have to be careful not to break the normal mode of operation.

@rhc54 (Contributor Author) commented Nov 18, 2023

@BhattaraiRajat I believe this should now work correctly, even under a Slurm allocation. See #1851 for the change.
