model crashed at day 91.75 when compiled with S2S #2320

Closed
ShanSunNOAA opened this issue Jun 11, 2024 · 49 comments

@ShanSunNOAA
Collaborator

Description

The UFS model crashed at time step 13320 with the following error:

341: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
... ...
345: ufs_model.x 0000000005CA5A13 ice_pio_mp_ice_pi 221 ice_pio.F90
345: ufs_model.x 0000000005C85EFE ice_history_write 173 ice_history_write.F90
345: ufs_model.x 0000000005A1225A ice_history_mp_ac 4130 ice_history.F90
345: ufs_model.x 0000000005C7B813 cice_runmod_mp_ci 369 CICE_RunMod.F90
345: ufs_model.x 00000000059BED10 ice_comp_nuopc_mp 1179 ice_comp_nuopc.F90
345: ufs_model.x 00000000006A1B08 Unknown Unknown Unknown

See details in /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log.

To Reproduce:

The experiment used the following hash of https://github.com/ufs-community/ufs-weather-model
commit 5bec704
Author: Brian Curtis [email protected]
Date: Fri May 31 14:52:06 2024 -0400

Additional context

The model was able to run 9 months successfully with an earlier hash in April:
commit 45c8b2a
Author: JONG KIM [email protected]
Date: Thu Apr 4 19:49:13 2024 -0400

Output

The log containing the error message is available at /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log. I'd be happy to provide more details if needed. Thanks.

@ShanSunNOAA added the bug label on Jun 11, 2024
@junwang-noaa
Collaborator

@DeniseWorthen may I ask if there is any restriction on CICE file names for long forecast times?

@DeniseWorthen
Collaborator

@junwang-noaa This doesn't seem to be an issue related to filename length. Files are always named yyyy-mm-dd-secsofday in the model. The line it points to in ice_pio makes me think this is more likely related to the previous sym-link issue.

Do you have the associated ice_diag.d file? That is the log that ice writes out.

@ShanSunNOAA
Collaborator Author

@DeniseWorthen @junwang-noaa Thanks for the quick responses from both of you. I think I found out why. It is this line in cice_wrapper_mod.F90: write(filename,'(a,i3.3)')'log.ice.f',int(hour), which needs to be (i4.4) to accommodate 4-digit hours. I am testing it now to make sure this is the only change that is needed.

Thanks!

@junwang-noaa
Collaborator

It might be the cause. I saw a file 'log.ice.f***' in the run directory. We may still run into problems for climate runs if using "i4.4". In fv3atm, we use:

        ndig = max(log10(nf_hours+0.5)+1., 3.)
        write(cform, '("(I",I1,".",I1,")")') ndig, ndig
        write(cfhour, cform) nf_hours

to get the number of digits for the forecast hour.
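
As a hedged sketch (not code from either repo; ndig and cform are borrowed from the fv3atm snippet above, everything else here is assumed), the same digit-count trick applied to the log.ice filename could look like:

        program log_name_sketch
          implicit none
          real              :: hour
          integer           :: ndig
          character(len=16) :: cform
          character(len=64) :: filename
          hour = 2202.                                        ! e.g. day 91.75 * 24
          ndig = max(int(log10(hour+0.5))+1, 3)               ! at least 3 digits
          write(cform,'("(a,i",i1,".",i1,")")') ndig, ndig    ! builds "(a,i4.4)"
          write(filename, cform) 'log.ice.f', int(hour)       ! -> "log.ice.f2202"
          print *, trim(filename)
        end program log_name_sketch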

@ShanSunNOAA
Collaborator Author

@junwang-noaa Thanks for the info. FV3 uses "max" for the forecast hours so the model won't crash. Can we borrow the same formula to get the forecast hour in CICE?

@DeniseWorthen
Collaborator

Both ATM and CICE write to a specific file name. In MOM, this was just added as a write to stdout. We should probably fix that. For MOM and CICE, we'd need to "find" the number of hours from the driver clock, since currently fhmax is only in model_configure.
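
A minimal sketch of that idea (assumed code, not the current MOM or CICE cap; the subroutine name is invented here): pull the elapsed forecast hours from the ESMF driver clock instead of reading fhmax.

        subroutine elapsed_fcst_hours(clock, hours, rc)
          use ESMF
          type(ESMF_Clock), intent(in)  :: clock
          integer,          intent(out) :: hours, rc
          type(ESMF_Time)         :: currTime, startTime
          type(ESMF_TimeInterval) :: elapsed
          ! elapsed time since the start of the run, converted to whole hours
          call ESMF_ClockGet(clock, currTime=currTime, startTime=startTime, rc=rc)
          elapsed = currTime - startTime
          call ESMF_TimeIntervalGet(elapsed, h=hours, rc=rc)
        end subroutine elapsed_fcst_hours

The hour count could then feed the digit-count formula above without needing fhmax from model_configure.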

@DeniseWorthen
Collaborator

@ShanSunNOAA Unless you're actually compiling runtime statistics (i.e., the time for each model advance), I would just set RunTimeLog = false in the allcomp part of ufs.configure for now.

@ShanSunNOAA
Collaborator Author

@DeniseWorthen Thanks for the info. I cannot find RunTimeLog in the log file generated by the workflow. Where do I set it in the workflow configuration? Thanks!

@junwang-noaa
Collaborator

@DeniseWorthen In FV3ATM, we compute the number of digits from the forecast hour, not fhmax. Can we use "hour" in ufs_logfhour to compute the format of fhour in the log file name? Also, it looks to me that RunTimeLog is not used to control writing the CICE log file; I see the code in nuopc/cmeps/CICE_RunMod.F90:

         if (my_task == master_task) then
            do ns = 1,nstreams
               if (write_history(ns) .and. histfreq(ns) .eq. 'h') then
                  call calendar_sec2hms(msec,hh,mm,ss)
                  write(logmsg,'(6(i4,2x))')myear,mmonth,mday,hh,mm,ss
                  call ufs_logfhour(trim(logmsg),timesecs/c3600)
               end if
            end do
         end if
      end subroutine ice_step

@ShanSunNOAA
Collaborator Author

Unfortunately, after modifying the log.ice filename, the model still crashed at day 91.75, with the error:
340: ufs_model.x 0000000005B97903 ice_pio_mp_ice_pi 221 ice_pio.F90
340: ufs_model.x 0000000005B77DEE ice_history_write 173 ice_history_write.F90
340: ufs_model.x 000000000590414A ice_history_mp_ac 4130 ice_history.F90
340: ufs_model.x 0000000005B6D703 cice_runmod_mp_ci 369 CICE_RunMod.F90
340: ufs_model.x 00000000058B0C00 ice_comp_nuopc_mp 1179 ice_comp_nuopc.F90

See /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/logs/2015052100/gfsfcst.log. The ice_diag.d is available at /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/gfs.20150521/00/model_data/ice/history/. The RUNDIRS is available at /scratch1/BMC/gsd-fv3-dev/NCEPDEV/global/Shan.Sun/stmp2/RUNDIRS/hydro1/gfsfcst.2015052100/fcst.1576533/.

Any suggestions on how to run beyond 3 months would be greatly appreciated. How do I turn off RunTimeLog? Thanks.

@DeniseWorthen
Collaborator

@ShanSunNOAA Sorry for the confusion. Jun is right. The feature for runtime logging as well as the logging to mark the end of the history write was put in at the same time and I didn't think through my answer.

I don't believe CICE has a concept of hours being anything other than 1-24. Let me take a look. In any case, that doesn't seem to be the issue, since your new run failed also. If it is still pointing to the ice_pio line, then I still suspect it has something to do w/ how PIO is trying to write the file.

@DeniseWorthen
Collaborator

@ShanSunNOAA I've been able to reproduce the error using the low-resolution C48 coupled model in the RTs. In this case there are no sym-linked files, so that takes one issue out of the mix. It failed at day 86.

73: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x150a7a430bcc]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x150a79e0adf1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x150a79ad9eb9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x176584) [0x150a7999e584]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x17a9f9) [0x150a799a29f9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x199a60) [0x150a799c1a60]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x150a799997ec]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b4387) [0x150a79adc387]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x150a799376e1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPI_File_open+0x17d) [0x150a7a4492bd]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpio_create+0x199) [0x150a73e592c9]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpi_create+0x4e7) [0x150a73daf4a7]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile_int+0x2e6) [0x150a7c436696]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile+0x41) [0x150a7c432451]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpiof.so(piolib_mod_mp_createfile_+0x25e) [0x150a7c1caabe]

Jun suggested a couple of further tests. I'll keep you updated.

@ShanSunNOAA
Collaborator Author

@DeniseWorthen Thank you for your prompt response and the clever approach. Please let me know if there is anything I can assist with.

@DeniseWorthen
Collaborator

I added the following to the cice_wrapper:

index d0aafbb..4668f03 100644
--- a/cicecore/drivers/nuopc/cmeps/cice_wrapper_mod.F90
+++ b/cicecore/drivers/nuopc/cmeps/cice_wrapper_mod.F90
@@ -72,7 +72,7 @@ subroutine ufs_logfhour(msg,hour)
     real(dbl_kind),    intent(in)    :: hour
     character(len=char_len)          :: filename
     integer(int_kind)                :: nunit
-    write(filename,'(a,i3.3)')'log.ice.f',int(hour)
+    write(filename,'(a,i4.4)')'log.ice.f',int(hour)
     open(newunit=nunit,file=trim(filename))
     write(nunit,'(a)')'completed: cice'
     write(nunit,'(a,f10.3)')'forecast hour:',hour

And did the following tests:

  1. Restarted from the above run at fh=2040. The model completed 100d, so I don't think there is any model behaviour causing a problem near ~90 days.
  2. I made a run with CICE history turned off, and the model ran out to 100d. So the issue is definitely in the CICE history write.
  3. I made a run adding the following to the job_card (see https://convergecfd.com/getting-started-guide-v3/environment-configuration-erro.htm)
export I_MPI_SHM_HEAP_VSIZE=16384

and the model ran out to 100 days.

@ShanSunNOAA
Collaborator Author

@DeniseWorthen Thanks for your investigation. Where do I adjust or add "I_MPI_SHM_HEAP_VSIZE" when running UFS in the workflow environment, i.e., config.ice or config.fcst? Thanks.

@DeniseWorthen
Collaborator

@ShanSunNOAA I don't know where to add this in the workflow. It is not an ice configuration specifically, and I don't know where the equivalent of the job card gets generated. For the RTs, I added it just after OMP_STACKSIZE:

/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_2767165/dt900.shmheap/job_card

@junwang-noaa
Collaborator

Another solution is to use Intel 2023.2:

https://community.intel.com/t5/Intel-MPI-Library/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/m-p/1370537

Hera admin confirmed that Intel/2023.2.1 and impi/2023.2.1 are available on Hera; it's worth trying the new version of the Intel compiler to see if this issue is resolved.

@DeniseWorthen
Collaborator

@junwang-noaa This requires a spack-stack to be built w/ these compilers, correct?

@junwang-noaa
Collaborator

Yes, a spack-stack issue has been created.

@ShanSunNOAA
Collaborator Author

Forgot to mention that Jessica suggested adding "export I_MPI_SHM_HEAP_VSIZE=16384" in the HERA.env.
This has been working well for me, using the workflow to run UFS for up to 9 months. Thank you all.

@DeniseWorthen
Collaborator

I updated the model by one commit (to the 1.6 SS) and used the updated Intel compiler build provided in JCSDA/spack-stack#1147:

diff --git a/modulefiles/ufs_hera.intel.lua b/modulefiles/ufs_hera.intel.lua
index 993e0537..d27136ce 100644
--- a/modulefiles/ufs_hera.intel.lua
+++ b/modulefiles/ufs_hera.intel.lua
@@ -2,12 +2,12 @@ help([[
 loads UFS Model prerequisites for Hera/Intel
 ]])

-prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/Core")
+prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/ue-intel-2023.2.0/install/modulefiles/Core")

-stack_intel_ver=os.getenv("stack_intel_ver") or "2021.5.0"
+stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
 load(pathJoin("stack-intel", stack_intel_ver))

-stack_impi_ver=os.getenv("stack_impi_ver") or "2021.5.1"
+stack_impi_ver=os.getenv("stack_impi_ver") or "2021.10.0"
 load(pathJoin("stack-intel-oneapi-mpi", stack_impi_ver))

 cmake_ver=os.getenv("cmake_ver") or "3.23.1"

I re-ran the cpld c48 case w/o the export I_MPI_SHM_HEAP_VSIZE=16384. The model completed fh 2400, the same as the shmheap test case.

@junwang-noaa
Collaborator

Great! So this does resolve the issue! Thanks for testing.

@ShanSunNOAA
Collaborator Author

@DeniseWorthen Thanks for testing this solution without increasing "I_MPI_SHM_HEAP_VSIZE". That is great news.

@DeniseWorthen
Collaborator

DeniseWorthen commented Jun 28, 2024

I've deleted my previous comment about letting the intel 2023.2 test run as long as possible. I was using the old executable. I will try a clean test.

@DeniseWorthen
Collaborator

I repeated my test using the 2023.2 intel w/o the heap_vsize variable, and it completed fh3168 (132 days) before failing w/ the same error:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr

I'll try a second test adding the vsize variable.

@DeniseWorthen
Collaborator

The 2023.2 intel case with the addition of the heap_vsize variable completed fh6816 (284 days) before timing out.

@junwang-noaa
Collaborator

@DeniseWorthen Do we know if setting "I_MPI_SHM_HEAP_VSIZE=16384" alone will allow the test case to run through 284 days?

@DeniseWorthen
Collaborator

Using the intel 2023.2 compile w/o the heap size variable, I ran for 100 days with the ProfileMemory=true setting and plotted the VmPeak value upon exiting ModelAdvance (for the atm, this was after phase 2) relative to the first value (i.e., vmpeak(t)/vmpeak(t=0)). I plotted every other PE for ATM (top), OCN (mid), and ICE (bottom):

(Figures: relative VmPeak vs. forecast time for ATM, OCN, and ICE PEs.)

@DeniseWorthen
Collaborator

I did two tests, one with the ATM C48 test and one w/ the C96 P8 test. The C48 atm test doesn't use MERRA and uses low-res fix files (all .grb). The C96 atm test uses MERRA and the tiled fix files. Both show the same jump on the 15th of the month. The C48 (left) and C96 (right):

(Figures: memory profiles for the C48 and C96 ATM tests.)

@XiaqiongZhou-NOAA
Contributor

XiaqiongZhou-NOAA commented Jul 30, 2024

I have a similar problem on WCOSS2. The coupled forecasts always crash after reaching 135 days. export I_MPI_SHM_HEAP_VSIZE=16384 is added in wcoss2.env, but it does not help.
The following is the error message:
nid003457.cactus.wcoss2.ncep.noaa.gov 0: in atmos_model update, fhzero= 6.00000000000000 fhour= 3240.000
0.0000000E+00
nid003457.cactus.wcoss2.ncep.noaa.gov 0: zeroing coupling accumulated fields at kdt= 9720
zeroing coupling accumulated fields at kdt= 9720
write out restart at n_atmsteps= 9719 seconds= 11664000
integration length= 3239.667
nid003457.cactus.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 2, n_atmsteps = 9719 time is 0.106315
aft fcst run output time= 11664000 FBcount= 8 na= 9720
nid004044.cactus.wcoss2.ncep.noaa.gov 373: MPICH Notice [Rank 373] [job id d90a271d-396d-409c-a11c-1a8f6d64b40b] [Mon Jul 29 20:29:51 2024] [nid004044] - Abort(70347535) (rank 373 in comm 0): Fatal error in PMPI_Allreduce: Other MPI error, error stack:
PMPI_Allreduce(472)..................: MPI_Allreduce(sbuf=0x7ffe305bd28c, rbuf=0x7ffe305bd29c, count=1, datatype=MPI_INT, op=op=0x98000000, comm=comm=0xc4004298) failed
PMPI_Allreduce(457)..................:
MPIR_CRAY_Allreduce(478).............:
MPIDI_CRAY_Setup_Shared_Mem_Coll(838):
MPIR_Comm_dup_impl(51)...............:
MPII_Comm_copy(544)..................:
MPIR_Get_contextid_sparse_group(679).: Too many communicators (0/2048 free on this process; ignore_id=0)

@LarissaReames-NOAA
Collaborator

LarissaReames-NOAA commented Sep 12, 2024

As a new member of the SFS team, I was asked by @junwang-noaa to look into this issue. With the help of @DeniseWorthen I was able to get a few test runs in with memory profiling.

I first wanted to make sure this memory issue wasn't present in the hash that @ShanSunNOAA reported as working previously (45c8b2a). Because the module files in that branch are no longer available, I used the modules in the current develop branch, along with a test using the same intel 2023.2.0 installation that Denise used above. The same memory behavior is present in the C48 ATM test at this commit with both compilers (2021 first, 2023 second), though the profiles aren't identical.
(Figures: C48 ATM memory profiles, intel 2021 and intel 2023.)

These results made me think this is perhaps a library issue associated (somehow) with NetCDF reads. I decided to try a run of the regional_control ATM test, which isn't run long enough to read new climo files on the 15th of the month but does read lateral boundary condition files (also in NetCDF format) every 180 minutes. We see similar memory jumps at every LBC read. Again, intel 2021 first, 2023 second.
(Figures: regional_control memory profiles, intel 2021 and intel 2023.)

Assuming this memory behavior is the source of the long-run failures, it looks like this may not be a code issue, since @ShanSunNOAA was able to use this hash successfully in the past, but with different modules (and perhaps on a non-Rocky8 OS on Hera). I have a C48 run going right now with the intel 2023 executable and a full forecast length of 3200 hours to see how far it makes it. I'll report back once it's complete and will also open a new issue specific to these memory problems, as they seem more pervasive than a single coding bug.

EDIT: Note I am using more than the standard allocation of 8 processes on a single node for the C48 tests to decrease runtime. I'll make sure to switch back to a single node when I check for long run failure.

@LarissaReames-NOAA
Collaborator

Based on the test from my previous comment, I decided to start investigating other sources of the Assertion Failed message that @ShanSunNOAA originally reported by running S2S regression tests. After lots of unstable forecasts with cpld_control_c48, I've been able to run several tests using cpld_control_nowave_noaero_p8 at #8a5f711, the first hash after Shan's working version.

I first checked whether the Assertion Failed error occurs with Shan's working hash (it does not), and I can confirm that it does occur at #8a5f711. When I set CICE_IO = "NetCDF" instead of the default "PIO" (used in the failed runs and introduced in WM PR #2145), with restart/history_format='hdf5' or 'default' in ice_in, the failure disappears and the model runs to the 8-hour max wall clock time. As additional tests, I ran again with CICE_IO="PIO" and restart/history_format="pnetcdf1" and "pnetcdf5", and both of these runs also fail. This points fairly clearly to either the PIO library or its implementation as the source of the Assertion Failed errors.
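
For reference, a hedged sketch of the ice_in entries being toggled here (namelist group and layout assumed, not copied from a run directory):

        &setup_nml
          ! format values mentioned in this thread:
          ! 'default', 'hdf5', 'pnetcdf1', 'pnetcdf2', 'pnetcdf5'
          history_format = 'hdf5'
          restart_format = 'hdf5'
        /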

I did one last test this afternoon to see whether these are accumulating errors or whether something specific in various ICs might be triggering it. I left the ICE restart frequency the same (every 12 hours) but doubled the ICE history frequency from every 6 hours to every 3 hours. The model failed in 60% of the time (1210 FHR vs 2010 FHR) after writing 504 files (404 history & 100 restart), whereas the 6-hour-frequency run wrote 501 files (335 history, 167 restart). This suggests there is some accumulating memory issue that is compounded every time an ICE history or restart file is written.

@DeniseWorthen
Collaborator

Following today's infrastructure meeting, I set up the cpld_control_c48 test case on Gaea. I used dt_atmos=900 and set the CICE history write to hourly output. I turned off restart writing for all components.

The model wrote 675 hourly output files (28 days) before failing w/ the following. Rank 17 is one of the CICE ranks (petlist 16 19). This is similar to the message posted above for WCOSS2.

17: MPICH ERROR [Rank 17] [job id 135188771.0] [Mon Oct  7 17:37:30 2024] [c5n1294] - Abort(806965007) (rank 17 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
17: PMPI_Comm_split(513)................: MPI_Comm_split(comm=0xc400314e, color=1, key=0, new_comm=0x7ffe5bbb2d74) failed
17: PMPI_Comm_split(494)................:
17: MPIR_Comm_split_impl(268)...........:
17: MPIR_Get_contextid_sparse_group(610): Too many communicators (0/2048 free on this process; ignore_id=0)
17:
17: aborting job:
17: Fatal error in PMPI_Comm_split: Other MPI error, error stack:
17: PMPI_Comm_split(513)................: MPI_Comm_split(comm=0xc400314e, color=1, key=0, new_comm=0x7ffe5bbb2d74) failed
17: PMPI_Comm_split(494)................:
17: MPIR_Comm_split_impl(268)...........:
17: MPIR_Get_contextid_sparse_group(610): Too many communicators (0/2048 free on this process; ignore_id=0)
srun: error: c5n1294: task 17: Exited with exit code 255

@DeniseWorthen
Collaborator

I repeated my test on Gaea using CICE hash 6449f40, which predates both PIO-related commits (7a4b95e and aca8357). I compiled w/ PIO and used the previous namelist options for pnetcdf. The model stopped with an error identical to the one I posted above. This indicates that whatever the cause of this failure, it predates the recent updates to PIO in CICE.

@DeniseWorthen
Collaborator

@ShanSunNOAA Just an update. The folks at the CICE Consortium were able to replicate this in standalone CICE and were able to commit a fix. We'll be updating CICE soon w/ the fix. See the associated CICE issue 94.

@ShanSunNOAA
Collaborator Author

ShanSunNOAA commented Nov 15, 2024 via email

@LarissaReames-NOAA
Collaborator

@ShanSunNOAA They've pushed the fix for CICE that should solve your issue. Do you want to try your forecast one more time with the top of develop and without any of the previously mentioned fixes (i.e., no stack size changes, compiled with the default CICE PIO configuration) just to confirm that this now works? Then we can close this issue.

@XiaqiongZhou-NOAA
Contributor

@larissa:
I will test it on WCOSS2 with this fix.

@ShanSunNOAA
Collaborator Author

ShanSunNOAA commented Dec 4, 2024 via email

@LarissaReames-NOAA
Collaborator

@ShanSunNOAA Yep, that sounds right. Just run the test that originally caused this issue but with the top of develop of ufs-weather-model on Hera.

@ShanSunNOAA
Collaborator Author

ShanSunNOAA commented Dec 4, 2024 via email

@ShanSunNOAA
Collaborator Author

ShanSunNOAA commented Dec 7, 2024 via email

@LarissaReames-NOAA
Collaborator

Well, I'm not super familiar with global workflow. I don't think any changes were ever made to the ufs-weather-model repository to correct for the issues with CICE. You should still see "history_format = pnetcdf2" in the ice_in file in your forecast run directory, indicating that you compiled, ran, and wrote CICE files using the previously problematic PIO option. I believe the only change that may have been made was to the job submission scripts to increase "I_MPI_SHM_HEAP_VSIZE" as discussed early on in this issue. I would check your job submission script and make sure that variable isn't exported.

@XiaqiongZhou-NOAA
Contributor

@LarissaReames-NOAA , I can confirm that the CICE update solved the issue on WCOSS2. To run the new ufs-weather-model with the global workflow, "local CICE_hist_suffix="'x','x','x','x','x'"" should be added in ~ush/parsing_namelists_CICE.sh; that is the only change that has to be made. I found the new model is slower than the old one, but this needs to be confirmed, and I will leave it alone for now.

@junwang-noaa
Collaborator

@XiaqiongZhou-NOAA Thanks for testing. I will close the issue.
