Issues in mesoscale det restart #601
-
@PerryShafran-NOAA Do you have the log from the initial interrupted run?
-
@malloryprow The log file listed above is the interrupted run. In this case I ran only the interrupted run so I could compare what's in the restart directory vs the working directory. They differ significantly, which is likely why we have the smaller final file when we do the restart run.
-
I think I found it. Look at /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159901212.cbqs01/grid2obs/stats/METplus_job_scripts/generate/job130 to follow what I am saying. The problem is that mesoscale runs with these different [...]. When mesoscale_util gets called to run copy_data_to_restart, it is being run outside of the loop. If you put the copy_to_restart call within the loop, you should be good.
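To illustrate the suggestion, here is a minimal sketch of copying to the restart directory inside the loop rather than once after it. The names `run_all`, `run_one`, and `copy_outputs_to_restart` are hypothetical stand-ins, not the actual mesoscale_util functions, and the real job scripts are shell, so treat this only as a picture of the control flow.

```python
import shutil
from pathlib import Path


def copy_outputs_to_restart(src_dir, restart_dir):
    """Copy every file under src_dir into restart_dir, preserving layout.

    Hypothetical helper; stands in for the real copy-to-restart call.
    """
    src_dir, restart_dir = Path(src_dir), Path(restart_dir)
    copied = []
    for src in sorted(src_dir.rglob("*")):
        if src.is_file():
            dest = restart_dir / src.relative_to(src_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            copied.append(dest)
    return copied


def run_all(settings, run_one, work_dir, restart_dir):
    """Run one step per setting and copy its output before the next step.

    Because the copy is inside the loop, an interruption loses at most
    the step that was in flight, not every step run so far.
    """
    for setting in settings:
        run_one(setting, work_dir)                      # produce output
        copy_outputs_to_restart(work_dir, restart_dir)  # save it right away
```

This sketch re-copies the whole working directory on every pass for simplicity; a real version would likely copy only the output of the step that just finished.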
-
Yes, that is possible. From what I saw yesterday, if files have not been fully copied to the restart directory but that job is marked as completed, then there is a problem.
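One way to guard against partially copied files is to copy to a temporary name first and rename only once the copy has finished. The sketch below assumes a POSIX filesystem where the temporary file and the destination are in the same directory; `atomic_copy` is a hypothetical helper, not something that exists in EVS.

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_copy(src, dest):
    """Copy src to dest so that dest is either complete or absent.

    The data is written to a temporary file next to dest, then renamed
    into place with os.replace, which is atomic on POSIX filesystems.
    """
    src, dest = Path(src), Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=dest.parent, prefix=dest.name + ".", suffix=".tmp"
    )
    os.close(fd)
    try:
        shutil.copy2(src, tmp_name)   # may be interrupted part-way
        os.replace(tmp_name, dest)    # the final name appears only here
    except BaseException:
        # Clean up the partial temporary file on any failure we can catch.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return dest
```

If the job is killed outright, a stale `.tmp` file may be left behind, but a truncated file should never appear under the final name.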
-
@malloryprow Returning to this issue here: I have now put mpmd into the system and have worked, with Marcel's help, to incorporate restart back into the mpmd code. However, this issue remains: with restart, the final stat file is still smaller than it is without restart.
The _save file is the output file without restart (and matches what is found in emc.vpppg for the date). The other one was created when the script was submitted with a wallclock limit of 15 minutes. Then the wallclock was reset back to 1 hr and the script was submitted again without clearing the stats directory. As you can see, it's significantly smaller. I do not know why stat files do not make it to the restart directory. In many cases, the working directory had more files than the noscrub restart directory. It seems that when the job ends, some small stat files have been created but were not copied, because the job ended after the file was generated but before the copy. I'd like some more help looking into this issue. Perry
-
Hi, everyone!
I need some assistance in diagnosing an issue here. With help from @MarcelCaron-NOAA, I installed restart in the mesoscale stats jobs for NAM and RAP. I noticed that when I do a restart job, the final stat file is much smaller than it would be for a full job. I found out why: not all the stat files from the stmp working directory make it over to the restart directory.
Compare the following two directories:
working directory:
/lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159901212.cbqs01/grid2obs/METplus_output/raob/point_stat/rap.20241030
restart directory:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/stats/mesoscale/atmos.20241030/rap/grid2obs/restart/c07/METplus_output/raob/point_stat/rap.20241030
These are the two directories after the job was killed after 7 minutes. Note that the working directory has 1397 stat files in it, while the restart directory has only 290 files in it.
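For anyone who wants to see exactly which files are missing rather than just the counts, a small sketch like the following could be pointed at the two directories above. The function name `missing_from_restart` and the `*.stat` pattern are my assumptions; adjust the pattern if the file names differ.

```python
from pathlib import Path


def missing_from_restart(working_dir, restart_dir, pattern="*.stat"):
    """Return names of files present in working_dir but not in restart_dir.

    Only file names matching `pattern` directly inside each directory are
    compared; file contents and sizes are not checked.
    """
    working = {p.name for p in Path(working_dir).glob(pattern)}
    restart = {p.name for p in Path(restart_dir).glob(pattern)}
    return sorted(working - restart)
```

Calling it with the working directory first and the restart directory second returns the sorted list of stat files that never made it over, which may help show whether the missing files belong to particular jobs.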
The codebase can be found here:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS
Relevant job file is here:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS/dev/drivers/scripts/stats/mesoscale/jevs_mesoscale_rap_grid2obs_stats.sh
This job file is usually run with a -v vhr=07 setting.
The latest job log is here:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS/dev/drivers/scripts/stats/mesoscale/jevs_mesoscale_rap_grid2obs_stats_00.o159901212
I had made changes in the ush/mesoscale directory, and thus, checking out some of those scripts might be helpful as I might have missed or deleted a line somewhere that I shouldn't have. I'm not 100% familiar with these scripts as I wasn't the original developer, but I'm figuring stuff out little by little. Nevertheless, I feel stuck here and if anyone could offer some assistance/guidance, that would be great.
I think there may be similar issues in the NAM run, but I'll run that now to offer an additional data point. NAM and RAP use the same ush scripts, though they have different ex-scripts.
Thanks, all!
Perry