Add PIO namelist control for CICE #2145

Merged

DeniseWorthen
Collaborator

@DeniseWorthen DeniseWorthen commented Feb 22, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

Adds namelist configuration of PIO options for CICE and switches from the serial netCDF CICE_IO API to PIO. All netCDF I/O in CICE therefore goes through a different API (i.e., under infrastructure, the code in io_pio2 is built and used instead of the code in io_netcdf). Also adds netCDF and PIO error checking.

The available IO types are:

PIO_IOTYPE_PNETCDF=1 Parallel Netcdf (parallel)
PIO_IOTYPE_NETCDF=2 Netcdf3 Classic format (serial)
PIO_IOTYPE_NETCDF4C=3 NetCDF4 (HDF5) compressed format (serial)
PIO_IOTYPE_NETCDF4P=4 NetCDF4 (HDF5) parallel

Independent PIO settings are used for restart and history files. New export variables are added to default_vars to allow all available settings to be used. Settings that are not relevant to a given PIO IO type are ignored. If invalid settings are given, appropriate values are substituted.

The available settings are

(history,restart)_format:

  • Options are cdf1, cdf2, cdf5, hdf5 (netCDF4), pnetcdf1, pnetcdf2, pnetcdf5. For cdf and pnetcdf, the suffixes 1, 2 and 5 refer to netcdf3-classic, netcdf3-64bit-offset and netcdf3-64bit-data, respectively.

(history,restart)_iotasks:

  • The subset of compute tasks which should also be used for IO. If not provided, defaults to one quarter of the available compute tasks.

(history,restart)_stride:

  • The stride of IO tasks across available compute tasks. If not given, defaults to 4.

(history,restart)_rearranger:

  • For pnetcdf, either box or subset. Box is the default, but subset scales better. For small task counts (fewer than ~100 compute tasks), testing shows little difference between box and subset. Subset shows clear advantages at high processor counts. See CICE IO tests.

(history,restart)_root:

  • The processor (task) at which the stride starts.

(history,restart)_chunksize, (history,restart)_deflate:

  • For hdf5, the chunking and compression level
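
To make the iotasks, stride and root settings concrete, the following is a minimal Python sketch of which compute-task ranks would end up doing IO. It is an illustration only, not the CICE or PIO implementation: the function name `io_task_ranks` is hypothetical, the placement rule (root, root + stride, root + 2*stride, ...) is an assumption about how the stride is applied, and the clamping is a stand-in for the "appropriate values are substituted" behaviour described above.

```python
def io_task_ranks(ntasks, iotasks=None, stride=None, root=0):
    """Return the compute-task ranks that would also perform IO.

    Illustrative assumptions (not taken from the CICE source):
      * IO tasks sit at root, root + stride, root + 2*stride, ...
      * a missing or invalid stride falls back to 4
      * a missing or invalid iotasks falls back to ntasks // 4 (at least 1)
      * iotasks is reduced if the requested layout would run past ntasks - 1
    """
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    if root < 0 or root >= ntasks:
        root = 0  # assumed fallback for an out-of-range root
    if stride is None or stride < 1:
        stride = 4
    if iotasks is None or iotasks < 1:
        iotasks = max(1, ntasks // 4)
    # Largest number of IO tasks that fits between root and the last rank.
    max_fit = (ntasks - 1 - root) // stride + 1
    iotasks = min(iotasks, max_fit)
    return [root + i * stride for i in range(iotasks)]


if __name__ == "__main__":
    print(io_task_ranks(16))                      # defaults: [0, 4, 8, 12]
    print(io_task_ranks(16, iotasks=2, root=1))   # [1, 5]
```

With 16 compute tasks and no explicit settings, the sketch picks 4 IO tasks spaced 4 ranks apart, which matches the defaults stated in the list above.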

Testing

The switch to PIO was not expected to change results relative to a baseline generated using the netCDF API. However, because of issues with the IC files, changes were found in some, but not all, tests using CICE.

All tests using CICE should therefore get new baselines to ensure they are now consistent w/ using the PIO interface.

A test was run to verify that a baseline generated using PIO did reproduce itself. All tests using CICE passed.

Commit Message:

Update to CICE-Consortium/CICE aca8357. Adds implementation of namelist PIO options for CICE

Priority:

  • Critical Bugfix: Reason
  • High: Reason
  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

UFSWM Blocking Dependencies:

  • None.

Changes

Regression Test Changes (Please commit test_changes.list):

  • PR Adds New Tests/Baselines.
  • PR Updates/Changes Baselines.
  • No Baseline Changes.

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@DeniseWorthen DeniseWorthen marked this pull request as ready for review April 4, 2024 19:55
@DeniseWorthen
Collaborator Author

@junwang-noaa This PR will not resolve the issue w/ the ice ICs since that will require some development work on the downscaling function, but it will enable PIO for CICE.

tests/parm/ice_in.IN (review comment; outdated, resolved)
* fix CICE for bad merge which missed an update w/in a CESMCOUPLED ifdef block
junwang-noaa
junwang-noaa previously approved these changes Apr 8, 2024
@FernandoAndrade-NOAA FernandoAndrade-NOAA added Baseline Updates Current baselines will be updated. Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. labels Apr 8, 2024
@jkbk2004
Collaborator

jkbk2004 commented Apr 10, 2024

@FernandoAndrade-NOAA I think if there was a timeout and this states that the job may hang/fail, it probably did.
I think this should be brought up to the Hera SA's.

I'll go ahead and let them know, same error with more connection failures on the reruns and eventual timeout:

   Error:      Resource temporarily unavailable (11)
    101 220: --------------------------------------------------------------------------
    102 203: [h2c54:283633] 2 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
    103 203: [h2c54:283633] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
    104 184: [h2c34][[52227,0],184][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(248) failed: Connection reset by peer (104)
    105 193: [h2c34][[52227,0],193][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(180) failed: Connection reset by peer (104)
    106 195: [h2c34][[52227,0],195][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(256) failed: Connection reset by peer (104)

@RatkoVasic-NOAA @ulmononian this is another troubleshooting case related to the pdlib side. We will double-check against develop and on Hercules, and will keep you posted.

@FernandoAndrade-NOAA
Collaborator

Testing against develop and again with spack-stack 1.6.0 on hercules both resulted in segmentation faults.
develop: /work2/noaa/epic/nandoam/stmp/nandoam/FV3_RT/rt_1379422/cpld_debug_pdlib_p8_gnu/err
develop with spack stack 1.6.0: /work2/noaa/epic/nandoam/stmp/nandoam/FV3_RT/rt_1366408/cpld_debug_pdlib_p8_gnu/err

@jkbk2004
Collaborator

Crashing points to the line https://github.com/NOAA-EMC/MOM6/blob/ab7bd14d209592d55490e75dbfaa61cb4a62df97/config_src/infra/FMS2/MOM_io_infra.F90#L905

@jkbk2004
Collaborator

/scratch1/NCEPDEV/stmp2/Jong.Kim/FV3_RT/rt_1375704/cpld_debug_pdlib_p8_gnu runs ok on hera with develop branch. @FernandoAndrade-NOAA can you rerun the case with pr?

@DeniseWorthen
Collaborator Author

I successfully ran the pdlib debug gnu test on Hera on the 4th. See

/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_1718506/cpld_debug_pdlib_p8_gnu

Has something changed on the system since then?

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Apr 10, 2024

Today, I get the same failure reported above, but I'll note it seems to be coming from the WAV PETS (180-257).

221: [h6c35:883934] 6 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
221: [h6c35:883934] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
228: [h6c35:883941] 5 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
228: [h6c35:883941] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
237: [h6c35:883950] 5 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
237: [h6c35:883950] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
203: [h6c35:883916] 3 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
203: [h6c35:883916] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
218: [h6c35:883931] 1 more process has sent help message help-mpi-btl-tcp.txt / client connect fail
218: [h6c35:883931] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
220: [h6c35:883933] 3 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
220: [h6c35:883933] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
222: [h6c35:883935] 9 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
222: [h6c35:883935] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
226: [h6c35:883939] 1 more process has sent help message help-mpi-btl-tcp.txt / client connect fail
226: [h6c35:883939] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
238: [h6c35:883951] 7 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
238: [h6c35:883951] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

@DeniseWorthen
Collaborator Author

I'm testing w/ a higher task count for Waves and it is currently running. I'm getting the sense that R8 is more sensitive to memory use. I saw it in a recent ufs-utils PR that I have been testing.

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Apr 10, 2024

The following change allows the cpld_debug_pdlib_p8_gnu test to complete on Hera/R8/GNU

diff --git a/tests/tests/cpld_debug_pdlib_p8 b/tests/tests/cpld_debug_pdlib_p8
index fa7be3c0..45cb9c15 100644
--- a/tests/tests/cpld_debug_pdlib_p8
+++ b/tests/tests/cpld_debug_pdlib_p8
@@ -73,7 +73,7 @@ OCN_tasks=$OCN_tasks_cpl_unstr
 ICE_tasks=$ICE_tasks_cpl_unstr
 WAV_tasks=$WAV_tasks_cpl_unstr
 # bump resources for debug test
-WAV_tasks="$(($WAV_tasks_cpl_unstr + 18))"
+WAV_tasks="$(($WAV_tasks_cpl_unstr + 40))"

The times reported in the log were

PASS -- COMPILE 's2sw_pdlib_debug_gnu' [04:07, 02:28]
PASS -- TEST 'cpld_debug_pdlib_p8_gnu' [18:04, 11:52](1311 MB)

@FernandoAndrade-NOAA
Collaborator

/scratch1/NCEPDEV/stmp2/Jong.Kim/FV3_RT/rt_1375704/cpld_debug_pdlib_p8_gnu runs ok on hera with develop branch. @FernandoAndrade-NOAA can you rerun the case with pr?

It's passing for me now, I'm not sure what may have been the cause of this issue. I'll push hera up shortly.

@FernandoAndrade-NOAA
Collaborator

Ok, all logs have been pushed up, we can continue with the merge process in CICE.

@DeniseWorthen
Collaborator Author

@BrianCurtis-NOAA Would you please create an issue for removing the WCOSS2 setting in default_vars, once we know PIO is correctly working?

@FernandoAndrade-NOAA FernandoAndrade-NOAA merged commit 8a5f711 into ufs-community:develop Apr 11, 2024
2 checks passed
zhanglikate added a commit to zhanglikate/ufs-weather-model that referenced this pull request May 3, 2024
commit f234a3e
Author: Ufuk Turunçoğlu <[email protected]>
Date:   Tue Apr 30 11:35:25 2024 -0600

    Fix for land component model (ufs-community#2191)

    * UFSWM - fix fully coupled land component configuration
      * NOAHMP - get fixed information from surface file

commit 04bbc15
Author: jiandewang <[email protected]>
Date:   Thu Apr 25 14:52:00 2024 -0400

    update MOM6 to its main repo. 20240401 commit (ufs-community#2241)

    * UFSWM -
      * MOM6 - update MOM6 to its main repo. 20240401 commit (NCAR-candidate-20240319)

commit b6c576d
Author: Daniel Sarmiento <[email protected]>
Date:   Tue Apr 23 12:24:22 2024 -0400

    Merged global namelist (ufs-community#2173)

    * UFSWM - global_control.nml_IN has been added as the new regression test namelist template for all global regression tests. The namelist now uses pointers (i.e. @[abc]) for variables and default values have been added to the default_vars.sh script. A new section in default_vars.sh has been added (export_tiled) to account for tiled RTs that pulls the correct parameter files using the ATMRES variable.
    Regression tests have been modified to account for these changes. Tests that were not compatible with the GFSv17_p8 core have been disabled for now. They will be turned on as they are updated from GFSv16 to GFSv17.

commit 5d2ca19
Author: WenMeng-NOAA <[email protected]>
Date:   Fri Apr 19 13:59:12 2024 -0400

    Update upp submodule (ufs-community#2213)

    * UFSWM - Update inline post
      * FV3 - Update upp submodule for inline post

commit 47c0099
Author: Brian Curtis <[email protected]>
Date:   Wed Apr 17 15:59:48 2024 -0400

    Add bash linting to CI. Cleanup .sh scripts a bit. Address .sh bugs. Adds -v Verbose option. (ufs-community#2218)  Remove nowarn Intel compiler flag (ufs-community#2225)

    * UFSWM
    - Add bash linting to CI:
      - uses superlinter to check for consistent bash code writing
    - Cleans up .sh scripts to comply with superlinter
    - Cleans up .sh scripts to be more consistent, easier to read.
    - Add's -v verbose option if debugging outputs needed, otherwise simplifies rt.sh run echo's.
    - Addresses smaller bugs
      - quota/timeout search logic adjusted.
      - check for dirs existing (DISKNM, STMP, PTMP) before starting.
      - adjustments/cleanup to ecflow/rocoto sections
      - rt.sh will attempt to start ecflow, and only stop ecflow if it started from rt.sh.
      - fix for issue where run_dir will not delete properly.
    * FV3: Address compiler warnings
      * atmos_cubed_sphere: Address compiler warnings.

commit 4f32a4b
Author: Rick Grubin <[email protected]>
Date:   Mon Apr 15 07:21:08 2024 -0600

    Document ATMW / ATMAERO / HAFS WM configurations (ufs-community#2160)

    * UFSWM
      * doc/Userguide
        * source
          * conf.py
          * Configurations.rst
          * FAQ.rst
          * InputsOutputs.rst
          * Introduction.rst

commit ac4445d
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Apr 15 08:59:42 2024 -0400

    Bump idna from 3.6 to 3.7 in /doc/UsersGuide (ufs-community#2234)

    *doc/UserGuide
       *requirements.txt - updates inda version from 3.6 to 3.7

commit 281b32f
Author: Samuel Trahan (NOAA contractor) <[email protected]>
Date:   Mon Apr 15 08:38:01 2024 -0400

    bug fixes: kchunk3d ignored, hailwat uninitialized in dycore, tile_num wrong for nests (ufs-community#2201)

    * UFSWM - None.
      * FV3 - Write component will use kchunk3d. Model init sends the right tile number to CCPP.
        * atmos_cubed_sphere - Initialize the hailwat variable. Pass global_tile index to model.

commit 8a5f711
Author: Denise Worthen <[email protected]>
Date:   Thu Apr 11 13:32:26 2024 -0400

    Add PIO namelist control for CICE (ufs-community#2145)

    Update to CICE-Consortium/CICE aca8357. Adds implementation of namelist PIO options for CICE

commit 45c8b2a
Author: JONG KIM <[email protected]>
Date:   Thu Apr 4 19:49:13 2024 -0400

    Hotfix/cubed sphere hash fix: HAILCAST diagnostic code (units issue) (ufs-community#2223)

    cubed_sphere hash update: f060e85 for a bug- fix in the HAILCAST diagnostic code (units issue)

commit 26e6db6
Author: Denise Worthen <[email protected]>
Date:   Wed Apr 3 19:57:08 2024 -0400

    Enable cpl_scalars export from ATM and NoahMP for use by CMEPS (ufs-community#2175)

      * CMEPS - allow additional dimension in cpl_scalars for CSG and regional ATM domains for use in mediator history files
      * CMEPS - fix mapping mask for lnd->atm
      * FV3 - add export of cpl_scalars
      * NOAHMP - add export of cpl_scalars

commit 1411b90
Author: Dusan Jovic <[email protected]>
Date:   Mon Apr 1 18:04:44 2024 -0400

    Update module_write_netcdf to avoid hangs in RRFS runs (ufs-community#2193)

    * UFSWM - Update module_write_netcdf to avoid hangs in RRFS runs
      * FV3 - Update module_write_netcdf to avoid hangs in RRFS runs

commit 87c27b9
Author: Matthew Masarik <[email protected]>
Date:   Fri Mar 29 15:23:42 2024 -0400

    WW3 feature:  Langmuir turbulence parameterization (ufs-community#2195)

      * WW3 - Langmuir turbulence parameterization

commit c54e986
Author: Samuel Trahan (NOAA contractor) <[email protected]>
Date:   Wed Mar 27 16:11:03 2024 -0400

    regression test system bug fixes, eliminate MOM6 warnings (ufs-community#2197), add xr_cnvcld flag to FV3 (ufs-community#2185) (ufs-community#2202)

    * UFSWM - atparse.bash: correctly handle input that doesn't end with an end-of-line character. Fix some bugs in Rocoto support and clean up rt.sh.
      * FV3 - namelist flag xr_cnvcld to control if suspended grid-mean convective cloud condensate should be included in cloud fraction and optical depth calculation in radiation in the GFS suite
        * ccpp - physics-level changes to implement new namelist variable
      * MOM6 - update MOM6 code to eliminate all compiler warnings