Fix parallel I/O for timers #292
Merged
It is important to have AbstractDicts with a deterministic order of entries for options, etc., so that when writing them to the output files with parallel I/O, the writes all happen in exactly the same order, which is necessary for consistency. Failing to do this can cause hangs when writing output because the HDF5 cache(s) are inconsistent on different ranks, which can lead to some ranks calling an MPI function (e.g. MPI.Bcast!()) when others do not.
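The ordering requirement above can be illustrated without MPI or HDF5. The following is a minimal sketch, not the code from this PR: `entries_in_fixed_order` and `write_order` are hypothetical helpers, and the option names `nstep` and `dt` are made up for illustration. It sorts the keys before iterating, which is one way to get the same order on every rank; the PR itself uses ordered dictionary types instead.

```julia
# Hypothetical helper: return the entries of any AbstractDict in sorted-key
# order, so that iteration does not depend on the Dict's internal layout.
function entries_in_fixed_order(options::AbstractDict)
    sorted_keys = sort!(collect(keys(options)))
    return [k => options[k] for k in sorted_keys]
end

# Stand-in for the real output routine: record the order in which variables
# would be created, instead of actually calling HDF5.
function write_order(options::AbstractDict)
    order = String[]
    for (name, _) in entries_in_fixed_order(options)
        push!(order, name)
    end
    return order
end

# Two "ranks" that built the same options in a different insertion order.
options_a = Dict{String,Any}("nstep" => 100, "dt" => 0.01, "parallel_io" => true)
options_b = Dict{String,Any}("parallel_io" => true, "nstep" => 100, "dt" => 0.01)
```

Both `write_order(options_a)` and `write_order(options_b)` give the same sequence of names, so every rank would issue its collective writes in the same order.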
Not sure whether deleting and re-creating String variables was actually causing a problem (suspect now that it was not), but requiring that Strings always have the same length when overwriting seems a bit nicer and may be slightly more efficient.
Convert `global_timer_string` to a fixed-length String (with a hard-coded length) before (over-)writing. This hopefully avoids HDF5 errors. Ensure that `global_timer_string` can only contain ASCII characters. Reset timers during cleanup rather than at the beginning of `run_moment_kinetics()`, which helps to keep the timers in a consistent state, e.g. for tests that do not use `run_moment_kinetics()` but may be run after another run that created timers.
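As a rough illustration of the fixed-length conversion, here is a minimal sketch using only Base functions. `to_fixed_length` and `TIMER_STRING_LENGTH` are hypothetical names, and the width of 64 is an assumption; the actual hard-coded length is not stated here.

```julia
# Hypothetical fixed width (assumption: the real hard-coded value may differ).
const TIMER_STRING_LENGTH = 64

# Pad an ASCII string with spaces to exactly `n` characters. For ASCII text
# the number of characters equals the number of bytes, so every rank ends up
# with a buffer of the same size.
function to_fixed_length(s::AbstractString, n::Integer=TIMER_STRING_LENGTH)
    isascii(s) || error("timer string must contain only ASCII characters")
    length(s) <= n || error("timer string is longer than $n characters")
    return rpad(s, n)
end

padded = to_fixed_length("setup: 1.23 s")
```

Rejecting non-ASCII input is what makes the character count and the byte count agree, which is presumably why the commit restricts `global_timer_string` to ASCII.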
Inconsistent formatting on different ranks gives the description string different lengths, which causes HDF5 errors.
On ranks other than the root of each shared-memory block, set `io_moments` and `io_dfns` to `(io_input=io_input)` so that the input (e.g. `parallel_io` setting) is available on all ranks.
johnomotani force-pushed the fix-timer-output branch from 52db8e3 to dba2241 (November 13, 2024 14:59)
...rather than in a separate group for each rank. This avoids creating a very large number of variables in the output file when running on many cores, which seems to help prevent parallel HDF5 errors.
Something has made DataInspector work in all the figures when they are opened simultaneously - I guess either the change to making `inspector_label` be defined separately for each plot in the previous commit, or some update to `Makie`.
`mergewith()` always returns a `Dict`, ignoring the types of its arguments. We need `recursive_merge()` to return an `OrderedDict`, so re-implement `recursive_merge()` 'by hand' without using `mergewith()`.
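A hand-written recursive merge could look roughly like the sketch below. This is not the implementation from this PR, only an illustration under the assumption that the result should have the same container type as the first argument. It starts from `copy(a)`, which keeps the container type of `a`, so passing an `OrderedDict` from OrderedCollections.jl as the first argument would give an `OrderedDict` back. The example data uses plain `Dict`s and made-up section names so that it runs with Base alone.

```julia
# Merge `b` into a copy of `a`. Nested dictionaries present in both are
# merged recursively; for any other key the value from `b` wins.
function recursive_merge(a::AbstractDict, b::AbstractDict)
    result = copy(a)  # shallow copy, same container type as `a`
    for (key, value) in b
        if haskey(result, key) && result[key] isa AbstractDict && value isa AbstractDict
            # Builds a new nested dict, so `a`'s nested dict is not mutated.
            result[key] = recursive_merge(result[key], value)
        else
            result[key] = value
        end
    end
    return result
end

# Hypothetical input sections, for illustration only.
defaults = Dict{String,Any}(
    "timestepping" => Dict{String,Any}("nstep" => 5, "dt" => 0.1),
    "run_name" => "default",
)
overrides = Dict{String,Any}(
    "timestepping" => Dict{String,Any}("dt" => 0.05),
)
merged = recursive_merge(defaults, overrides)
```

Because the result is built from the first argument rather than through `mergewith()`, the choice of dictionary type stays under the caller's control.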
Should help reduce memory usage and so avoid some errors.
johnomotani force-pushed the fix-timer-output branch from dba2241 to d62bba1 (November 13, 2024 16:54)
The timers added in #276 introduced errors in parallel I/O. These errors seem to have come from `Dict`s. Julia `Dict`s do not have a consistent order, so different processes sometimes tried to create/write variables in different orders. Fixed by using `OrderedDict` or `SortedDict`. There was also a potential bug with input options, which were also stored in a `Dict`, being created in inconsistent orders, which is also fixed by this PR.

Also changes the `global_timer_string` output to be a fixed-length string, to avoid repeatedly creating and deleting the variable. Don't know if this was actually part of the problem - suspect not in the end - but the updated version seems a bit nicer.