Upgrade to spack-stack libraries on non-production machines #624
* corrections to module-setup.sh
* update to Orion software stack in a new role-epic location
* updated regression/regression_param.sh srun command for Orion

Co-authored-by: Natalie Perlin <[email protected]>
Merged the develop branch into spack-stack and reran the regression tests:
The following tests failed due to differing penalties on the first iteration onward: The |
The latest develop branch was merged into the spack-stack branch and ctests were run again. This time, I ran them both with the standard spack-stack/1.4.1 packages as well as my custom builds of ncio and crtm (with hpc-stack optimization). The results were as expected for the standard set. For the custom builds, the results were as follows:
The rrfs_3denvar_glbens test failed for scalability (updat with a time of 35.984s/node and contrl with 36.316s/node). Both the global_4denvar and global_3dvar tests experienced maxmem failures, both exceeding the memory thresholds by > 1000GB. This seems problematic. I will look into other libraries to see if I can find which one is causing this memory issue and in the meantime convert this PR to a draft. |
@DavidHuber-NOAA , will this PR add a Hercules build option to NOAA-EMC/GSI? Daryl asked about Hercules yesterday. Are all the libraries needed to build NOAA-EMC/GSI available in a Hercules version of spack-stack? |
@RussTreadon-NOAA Yes, thank you for the reminder. I will check on the available libraries and add Hercules if all requirements are available (#574). |
Thank you, @DavidHuber-NOAA ! |
The culprit behind the To move forward, I will increase the |
I've compiled this branch on Orion, and it passed the GSI RT. However, when I replaced gsi.x with the newly compiled executable and tested it on one of my HAFS experiment cases, the following error showed up: "./hafs_gsi.x: error while loading shared libraries: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory". The module file that HAFS currently uses on Orion is "/work/noaa/hwrf/save/jcheng/HAFS/modulefiles/hafs.orion.lua". I am not sure whether there are any inconsistencies between the libraries. |
@JingCheng-NOAA I would be happy to take a look at the modulefile and see if I can find any issue. Would you mind opening permissions for me with |
Thanks for helping! However the second command came back with an error message "find: missing argument to `-exec'" |
@DavidHuber-NOAA , this PR can be scheduled for merger into |
@JingCheng-NOAA Whoops, that should have been a |
Thanks! It's done. Please feel free to check. |
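For context on the exchange above: `find` prints "missing argument to `-exec'" when the `-exec` action is not closed with `;` (usually written `\;`) or `+`. The exact commands used in this thread are elided, so the following is only a hedged sketch of what "opening permissions" recursively could look like. The helper name `add_read_access` and the choice of permission bits are assumptions, not the commands that were actually run.

```python
import os
import stat

def add_read_access(root):
    """Recursively add group/other read access below root (hypothetical helper)."""
    # Directories also need the execute bit so that others can traverse them.
    dir_bits = stat.S_IRGRP | stat.S_IXGRP | stat.S_IROTH | stat.S_IXOTH
    file_bits = stat.S_IRGRP | stat.S_IROTH
    for current, _dirs, files in os.walk(root):
        mode = stat.S_IMODE(os.stat(current).st_mode)
        os.chmod(current, mode | dir_bits)
        for name in files:
            path = os.path.join(current, name)
            if os.path.islink(path):
                # Skip symlinks; their targets may live outside root.
                continue
            mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, mode | file_bits)
```

Existing permission bits are preserved; the helper only adds read (and, for directories, traverse) access.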
@JingCheng-NOAA Hmm, that is odd. When I load that module and |
Thanks for checking. Yes after I used the updated gsi_common.lua, the |
Make sure these modules are also loaded at runtime and that subsequent
module loads don't implicitly unload them.
…On Mon, Nov 27, 2023 at 10:49 AM JingCheng-NOAA ***@***.***> wrote:
@JingCheng-NOAA <https://github.com/JingCheng-NOAA> Hmm, that is odd.
When I load that module and echo $LD_LIBRARY_PATH, I do see
/apps/intel-2022.1.2/intel-2022.1.2/mkl/2022.0.2/lib/intel64 in the list,
which is where the libmkl_intel_lp64.so.2 library resides. Can you point
me to the script that runs the GSI and/or the log output?
Thanks for checking. Yes after I used the updated gsi_common.lua, the
libmkl_intel_lp64.so.2 was no longer the issue. However right now the
error message is ./hafs_gsi.x: error while loading shared libraries:
libnetcdff.so.7: cannot open shared object file: No such file or directory
.
You can find the log file here
/work/noaa/hwrf/scrub/jcheng/HAFS_v1p1a_amv/2023091100/13L/hafs_analysis.log
--
George W Vandenberghe
*Lynker Technologies at * NOAA/NWS/NCEP/EMC
|
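The check described in this exchange (looking for the directory that holds a shared library in `$LD_LIBRARY_PATH`) can be sketched as a small script to run in the same environment as the job. This is a minimal illustration, and the function name `find_shared_library` is an assumption. The dynamic loader also consults other locations such as an executable's rpath and the system cache, so a miss here is a hint rather than proof that the library cannot be loaded.

```python
import os

def find_shared_library(lib_name, search_path):
    """Return every directory in a separator-delimited search path that contains lib_name."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Empty entries appear when the variable starts, ends, or doubles a separator.
        if not directory:
            continue
        if os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits

if __name__ == "__main__":
    # Library names taken from the error messages quoted in this thread.
    for lib in ("libmkl_intel_lp64.so.2", "libnetcdff.so.7"):
        dirs = find_shared_library(lib, os.environ.get("LD_LIBRARY_PATH", ""))
        print(lib, "->", dirs if dirs else "not found on LD_LIBRARY_PATH")
```

Running it once right after loading the modules and once inside the job script would show whether a later module load dropped a directory from the path.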
@JingCheng-NOAA Looking through the log, I see that the |
Thank you! I will give it a try! |
@DavidHuber-NOAA @RussTreadon-NOAA |
@aerorahul , I approved this PR. We are waiting for two peer reviews as per GSI code management policy. |
Great. Do we have an estimate? |
Reviews have been requested. No estimate as to when reviews will be complete. It would be good to ping peer reviewers for an update. |
@aerorahul: @JingCheng-NOAA and I have worked through the issues on Orion. He is going to run tests on Jet and then said he will approve if no issues remain. However, if I understand the comments above, it seems Jet had preexisting issues before this PR within the HAFS system. If issues are still present, I would argue for a separate issue in the HAFS repo to handle that as well. I'm not sure if @hu5970 has had a chance to test this yet or not. |
@DavidHuber-NOAA It seems there is an uninitialized variable "toff" in the subroutine read_radar_l2rw_novadqc (read_radar.f90). This variable is getting NaN in our failed HAFS regression tests on Jet. I assume that this might be the reason for the HAFS RT failure on Jet. |
@BijuThomas-NOAA I'm glad you were able to find the issue with Jet. Looking at |
The GSI regression test on JET passed for all the cases except for |
@DavidHuber-NOAA Thanks. Opened a new issue here 661 |
@JingCheng-NOAA Thanks for conducting the GSI regression tests from the HAFS end. Thanks also to @BijuThomas-NOAA for testing from the HAFS workflow side. Meanwhile, I agree with @DavidHuber-NOAA that the issue @BijuThomas-NOAA encountered on Jet is probably a separate issue, and he has created issue #661 for it. With that, we will approve this PR from the HAFS side. Thanks a lot! |
@ShunLiu-NOAA and @hu5970: if you are OK with this PR, would you please review and approve. Once we get sufficient reviews we can schedule this PR for merger into |
@ShunLiu-NOAA , @hu5970 , @CoryMartin-NOAA , I would like to merge this PR into |
Thank you @DavidHuber-NOAA for your effort in porting the GSI to spack-stack and the prep work for identifying and resolving the older compiler issues. |
Description
The latest version of spack-stack, 1.4.1, is installed on all platforms. This PR points all of the modulefiles on tier-1 machines and S4 to their respective spack-stack installations. This fixes #589, #574, and #563.
Type of change
How Has This Been Tested?
The ctests were run on Orion and Hera. As noted in #589, differences were seen when calling CRTM and NCIO functions and subroutines. These were traced to the different compiler options used for their respective builds between hpc-stack and spack-stack. Recompiling these libraries with spack-stack netCDF 4.9.2 and hpc-stack compiler options resulted in a 1-to-1 comparison.
The GSI was also built on Jet and S4 as well as on Hera with GNU compilers. I will also run build tests on Gaea tomorrow when it comes back up from maintenance. Before #571 was merged, regression tests had also been performed on Jet, Gaea, and Cheyenne, with some differences noted (possibly due to the different build options).
Checklist
DUE DATE for this PR is 6 weeks from when this PR is marked ready for review.