Restore error checking in regression test system. (Combined PR#2357 and PR#2265) #2335

SamuelTrahanNOAA · 2024-06-21T19:46:33Z

Commit Queue Requirements:

Fill out all sections of this template.
N/A No subcomponents. ~~All sub component pull requests have been reviewed by their code managers.~~
Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
Commit 'test_changes.list' from previous step

Description:

The regression test system ignores all errors in all jobs. A job where fv3.exe crashed is treated the same as a job where it ran and produced different results. Compilation job errors are also ignored. This leads to several problems:

Test jobs are executed even if the compile job fails. (This bug is restricted to the Rocoto-based system.)
Jobs with prerequisites (such as restart tests) run even if their prerequisite fails.
A job that failed to copy input data or had syntax errors won't be caught until the entire workflow completes.
Temporary system issues require rerunning the entire workflow instead of only affected jobs.

In this new version of the regression test system:

Errors are caught, and the metascheduler treats the job as failed.
Dependencies are honored; if a job fails, anything that depends on it won't run.
Jobs that run to completion but have changed results are considered to have succeeded. This behavior is unchanged.
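
The behavior described above can be sketched roughly as follows. This is an illustrative Python model only, not code from this PR (the real system is a set of shell scripts driven by a metascheduler such as Rocoto or ecFlow). The names `Job` and `run_workflow` and the status strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Job:
    """Hypothetical stand-in for one compile or test job."""
    name: str
    action: Callable[[], int]  # returns an exit status; 0 means the job ran
    depends_on: List[str] = field(default_factory=list)


def run_workflow(jobs: List[Job]) -> Dict[str, str]:
    """Run jobs in order and return a status string per job.

    Assumes `jobs` is already ordered so that prerequisites come first.
    """
    status: Dict[str, str] = {}
    for job in jobs:
        # Dependencies are honored: skip unless every prerequisite succeeded.
        if any(status.get(dep) != "succeeded" for dep in job.depends_on):
            status[job.name] = "not submitted"
            continue
        # Errors are caught: a nonzero exit status marks the job as failed.
        status[job.name] = "succeeded" if job.action() == 0 else "failed"
    return status


if __name__ == "__main__":
    jobs = [
        Job("compile", lambda: 1),                         # compile step fails
        Job("test_a", lambda: 0, depends_on=["compile"]),  # never submitted
        Job("standalone", lambda: 0),                      # runs to completion
    ]
    print(run_workflow(jobs))
```

In this model, whether a completed job's results match the baseline is deliberately not part of the job status, mirroring the third point above.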

New Self-Test System

A new tests/error-test.conf has several jobs that test whether rt.sh fails and succeeds properly. This is a semi-automated self-test. It contains:

fail_to_copy - This test will fail in run_test.sh before running the job_card. That means ecFlow will be unable to submit the job, and other metaschedulers will see the job start and abort with exit status 1.
fail_to_run - A test that fails inside the job_card. All metaschedulers should see the job abort with exit status 1.
fail_to_compile - A compile job that will always fail to compile.
dependency_unmet - Depends on fail_to_compile, so it should not be submitted.
atm_dyn32 and control_c48.v2.sfc - These should succeed and match the baselines.
atm_dyn64 and control_c48 - These should succeed but not match the baselines, because they use 64-bit dynamics instead of 32-bit.
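
One way to read the list above is as a table of expected outcomes. The sketch below encodes it in Python purely for illustration. The outcome labels and the `check_self_test` helper are hypothetical; they are not part of tests/error-test.conf or rt.sh, and the labels simplify the per-metascheduler details described above.

```python
# Expected outcome of each self-test job, taken from the list above.
# The labels on the right are illustrative, not strings emitted by rt.sh.
EXPECTED = {
    "fail_to_copy": "job fails",           # fails in run_test.sh before the job_card
    "fail_to_run": "job fails",            # fails inside the job_card
    "fail_to_compile": "job fails",        # compile job that always fails
    "dependency_unmet": "not submitted",   # depends on fail_to_compile
    "atm_dyn32": "matches baseline",
    "control_c48.v2.sfc": "matches baseline",
    "atm_dyn64": "differs from baseline",  # 64-bit dynamics instead of 32-bit
    "control_c48": "differs from baseline",
}


def check_self_test(observed):
    """Return the sorted names of jobs whose observed outcome is unexpected."""
    return sorted(
        name for name, want in EXPECTED.items() if observed.get(name) != want
    )
```

An empty list from `check_self_test` would mean every job behaved as the self-test expects; any name returned points at a job whose failure (or success) was not detected correctly.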

All tests are variations on control_c48, since it's one of the cheapest and oldest tests in rt.conf and is unlikely to go away. (control_c48 tests a core functionality: super-low-res GFS.)

Commit Message:

* UFSWM - restore error checking to regression test system and add a self-test suite

Priority:

Normal

Git Tracking

UFSWM:

Closes modify report of test failures to clearly indicate when a test failed to compare because it did not run #2330

Sub component Pull Requests:

N/A

UFSWM Blocking Dependencies:

This branch correctly detects failing tests. That means the regression tests will keep failing until this bug fix is merged:

Hotfix to update with cubed sphere bug fix #2362

Changes

Regression Test Changes (Please commit test_changes.list):

No Baseline Changes.

Input data Changes:

None.

Library Changes/Upgrades:

No Updates

Testing Log:


SamuelTrahanNOAA · 2024-06-21T19:52:46Z

Pinging @DeniseWorthen who authored the relevant issue. Also, @DusanJovic-NOAA who authored the original regression test system.

SamuelTrahanNOAA · 2024-06-21T19:55:03Z

I've been testing this on top of #2326. The regression test system has proven itself completely unusable for development due to the lack of error checking. Updating UPP and modulefiles required many changes that caused subsets of the tests to fail. A regression test system that cannot differentiate between a test with changed results and a test that could not run at all is not useful for development.

jkbk2004 · 2024-06-27T12:38:22Z

@SamuelTrahanNOAA Can you follow up and clean up the super-linter complaint?

SamuelTrahanNOAA · 2024-06-27T15:46:45Z

After updating this branch, I'm getting out-of-memory errors from some jobs on Hera when using Rocoto.

SamuelTrahanNOAA · 2024-06-27T18:49:15Z

> After updating this branch, I'm getting out-of-memory errors from some jobs on Hera when using Rocoto.

The jobs succeeded on the second attempt. This may have been a temporary system issue.


jkbk2004 · 2024-07-11T12:07:29Z

@SamuelTrahanNOAA can you sync up the branch? We can start working on this PR. We also want to combine #2265 and #2357 into this PR.

BrianCurtis-NOAA · 2024-07-11T12:10:59Z

@jkbk2004 There are a TON of FIXME in these code changes. This should not be combined into anything until those are addressed.

jkbk2004 · 2024-07-11T12:18:20Z

@BrianCurtis-NOAA I agree; we will test the new error-checking functionality thoroughly. #2265 and #2357 are very minor PRs. Let's see how the tests go. @zach1221 @FernandoAndrade-NOAA let's confirm how error checking runs on each machine.

SamuelTrahanNOAA · 2024-07-11T13:46:52Z

I've only tested this with Rocoto. For other workflow managers, I used educated guesses when modifying the scripts. The ecFlow system has logic and scripts outside of the tests/*.sh files. I don't know how that'll break things.

zach1221 · 2024-07-11T14:54:21Z

> I've only tested this with Rocoto. For other workflow managers, I used educated guesses when modifying the scripts. The ecFlow system has logic and scripts outside of the tests/*.sh files. I don't know how that'll break things.

@SamuelTrahanNOAA can you please sync up your PR? @FernandoAndrade-NOAA and I can then test ecflow across the rdhpcs.

zach1221 · 2024-07-16T21:10:12Z

@FernandoAndrade-NOAA @BrianCurtis-NOAA I think we're ready to begin the rest of the tests now.

SamuelTrahanNOAA · 2024-07-17T21:41:59Z

Make sure you run the opnReqTest since this PR changes the regression test system itself. Anything related to testing should be tested.

BrianCurtis-NOAA · 2024-07-17T22:53:37Z

still no Acorn, can skip.

zach1221 · 2024-07-17T23:09:55Z

> Make sure you run the opnReqTest since this PR changes the regression test system itself. Anything related to testing should be tested.

@SamuelTrahanNOAA ORTs were run successfully. Logs are posted above for control_p8, regional_control and a cpld_control case.
@jkbk2004 the derecho queue really hasn't moved for me all day. Do we want to skip?

SamuelTrahanNOAA · 2024-07-17T23:14:53Z

> the derecho queue really hasn't moved for me all day. Do we want to skip?

Derecho was down for unplanned maintenance for six hours. You could log in, but your jobs would not run. It came out of maintenance about two hours ago.

zach1221 · 2024-07-18T02:50:32Z

Looks like the derecho jobs are going through now.

zach1221 · 2024-07-18T14:16:49Z

Ok we should be ready now to proceed with merging.

DeniseWorthen · 2024-07-19T21:24:55Z

@SamuelTrahanNOAA Thanks for all your work on this. I just ran a PR where I expected two failures. One was because an input file wasn't in the correct subdirectory in the input-data, and one because the answers were going to be different.

The first of these reported a failure FAILED: RUN DID NOT COMPLETE, which is great. But the second, where the run completed and all the comparisons were made (they were all different), still reported the failure as FAILED: UNABLE TO COMPLETE COMPARISON. Would that be what you expect?

SamuelTrahanNOAA · 2024-07-20T02:23:42Z

I didn't improve the wording of the comparison messages. It would be great if they were more detailed than "unable to complete comparison."

DeniseWorthen · 2024-07-20T10:10:32Z

OK, thanks. I do think the message needs to indicate that it failed comparison, not that the comparison didn't complete. Maybe @BrianCurtis-NOAA can take a look at that.
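
The distinction being asked for here could look something like the sketch below. It is illustrative Python, not the actual rt.sh reporting code. The function name, its parameters, and the third message string are hypothetical; the first two strings are the ones quoted in this thread.

```python
from typing import Optional


def failure_message(run_completed: bool,
                    comparison_completed: bool,
                    results_match: bool) -> Optional[str]:
    """Return a failure message, or None if the test passed."""
    if not run_completed:
        return "FAILED: RUN DID NOT COMPLETE"
    if not comparison_completed:
        # Assumed meaning: the comparison itself could not be carried out.
        return "FAILED: UNABLE TO COMPLETE COMPARISON"
    if not results_match:
        # Hypothetical wording for a run whose results were compared and differ.
        return "FAILED: RESULTS DIFFER FROM BASELINE"
    return None
```

With three separate outcomes, a run that completed and was fully compared but produced different answers would no longer share a message with a comparison that could not be made.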

Benjamin Cash and others added 7 commits May 3, 2024 12:31

Add detection for Frontera login nodes

8b12b4c

Add detection for Frontera compute nodes

babaef4

ufs frontera module files as created by Ufuk for ufs coastal

6d54abf

add frontera to module setup

db5766b

restore error checking to workflow and tweak some jobs to fail, to test the features

7c6316c

job-failing functionality (for test purposes) moved to right script

c7cbada

ignore some regression test system flag and temporary files that should not be committed

325c261

use pipefail to detect if job card fails

7bb328c

Merge remote-tracking branch 'upstream/develop' into error-checking

fd212fc

SamuelTrahanNOAA added 3 commits June 27, 2024 15:48

make linter happy

bae897a

try again to make linter happy

05835b8

set pipefail again for linter

eef6e73

SamuelTrahanNOAA and others added 6 commits June 27, 2024 20:02

run_compile.sh: do not duplicate redirect_out_err

b8ba251

correct a comment

e536905

Merge remote-tracking branch 'upstream/develop' into error-checking

f816cd2

bug fix to failed test detection with rocoto

bacf7a9

Merge branch 'develop' into feature/detect_frontera

42fc086

jkbk2004 mentioned this pull request Jul 8, 2024

Feature/detect frontera #2265

Closed

14 tasks

Merge remote-tracking branch 'upstream/develop' into error-checking

69619cf

zach1221 added the Ready for Commit Queue label Jul 16, 2024

zach1221 changed the title ~~Restore error checking in regression test system.~~ Restore error checking in regression test system. (Combined PR#2357 and PR#2265) Jul 17, 2024

zach1221 added and removed the jenkins-ort label Jul 17, 2024

FernandoAndrade-NOAA and others added 5 commits July 17, 2024 02:33

add gaea RT log passed

8530a2a

add jet RT log passed

9f76caa

add control_p8_gnu ORT logs: passed

cf9663d

add cpld_control_gnu ORT logs: passed

e010b2f

add regional_control_gnu ORT logs: passed

31eab83

SamuelTrahanNOAA mentioned this pull request Jul 17, 2024

sync with head of NOAA-EMC UPP develop #2326

Merged

14 tasks

zach1221 and others added 3 commits July 17, 2024 11:08

add hercules RT logs: passed

0252b91

add orion RT logs: passed

099d33c

WCOSS2 RT Log: Passed

2f6d279

add derecho RT logs: passed

c36b15a

zach1221 requested review from BrianCurtis-NOAA and jkbk2004 July 18, 2024 14:16

BrianCurtis-NOAA approved these changes Jul 18, 2024

jkbk2004 approved these changes Jul 18, 2024

zach1221 merged commit 6a6ce43 into ufs-community:develop Jul 18, 2024
3 checks passed

climbfuji mentioned this pull request Aug 5, 2024

sporadic floating point errors in FV3/atmos_cubed_sphere/model/a2b_edge.F90 for nested configurations #2360

Closed
