-
Notifications
You must be signed in to change notification settings - Fork 252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore error checking in regression test system. (Combined PR#2357 and PR#2265) #2335
Restore error checking in regression test system. (Combined PR#2357 and PR#2265) #2335
Conversation
…ld not be committed
Pinging @DeniseWorthen who authored the relevant issue. Also, @DusanJovic-NOAA who authored the original regression test system. |
I've been testing this on top of #2326. The regression test has proven itself completely unusable for development due to the lack of error checking. Updating UPP and modulefiles required many changes that caused subsets of the tests to fail. A regression test system that is unable to differentiate between a test with changed results, and a test that could not run at all, is not useful for development. |
@SamuelTrahanNOAA Can you follow up to clean the super-linter complaint ? |
After updating this branch, I'm getting out-of-memory errors from some jobs on Hera when using Rocoto. |
The jobs succeeded on the second attempt. This may have been a temporary system issue. |
Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4. - [Commits](certifi/python-certifi@2024.02.02...2024.07.04) --- updated-dependencies: - dependency-name: certifi dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]>
@SamuelTrahanNOAA can you sync up branch? we can start working on this pr. We also want to combine #2265 and #2357 to this pr as well. |
@jkbk2004 There are a TON of FIXME in these code changes. This should not be combined into anything until those are addressed. |
@BrianCurtis-NOAA I agree we will do enough test of this new functionality of error checking. #2265 and #2357 are very minor PRs. Lets see how test goes. @zach1221 @FernandoAndrade-NOAA let's confirm how error checking runs on each machine. |
I've only tested this with Rocoto. For other workflow managers, I used educated guesses when modifying the scripts. The ecFlow system has logic and scripts outside of the tests/*.sh files. I don't know how that'll break things. |
@SamuelTrahanNOAA can you please sync up your PR? @FernandoAndrade-NOAA and I can then test ecflow across the rdhpcs. |
@FernandoAndrade-NOAA @BrianCurtis-NOAA I think we're ready to begin the rest of the tests now. |
Make sure you run the opnReqTest since this PR changes the regression test system itself. Anything related to testing should be tested. |
still no Acorn, can skip. |
@SamuelTrahanNOAA ORTs were run successfully. Logs are posted above for control_p8, regional_control and a cpld_control case. |
Derecho was down for unplanned maintenance for six hours. You could log in, but your jobs would not run. It came out of maintenance about two hours ago. |
Looks like the derecho jobs are going through now. |
Ok we should be ready now to proceed with merging. |
@SamuelTrahanNOAA Thanks for all your work on this. I just ran a PR where I expected two failures. One was because an input file wasn't in the correct subdirectory in the The first of these reported a failure |
I didn't improve the wording of the comparison message. It would be great if they were more detailed than "unable to complete comparison." |
OK, thanks. I do think the message needs to indicate that it failed comparison, not that the comparison didn't complete . Maybe @BrianCurtis-NOAA can take a look at that. |
Commit Queue Requirements:
All sub component pull requests have been reviewed by their code managers.Description:
The regression test system ignores all errors in all jobs. A job where fv3.exe crashed is the same as a job where it ran, and produced different results. Also, compilation job errors are ignored. This leads to several problems:
In this new version of the regression test system:
New Self-Test System
A new
tests/error-test.conf
has several jobs to test whether the rt.sh fails and succeeds properly. This is a semi-automated self-test. It contains:fail_to_copy
- This test will fail in run_test.sh before running the job_card. That means ecflow will be unable to submit the job, and other metaschedulers will see the job start and abort with exit status 1fail_to_run
- A test that fails inside the job_card. All metaschedulers should see the job abort with exit status 1.fail_to_compile
- A compile job that will always fail to compile.dependency_unmet
- depends on thefail_to_compile
, so it should not be submitted.atm_dyn32
andcontrol_c48.v2.sfc
- These should succeed and match the baselines.atm_dyn64
andcontrol_c48
- Should succeed, but not match the baselines, because they use 64-bit dynamics instead of 32-bit.All tests are variations on the control_c48 since it's one of the cheapest and oldest tests in rt.conf, but is unlikely to go away. (The control_c48 tests a core functionality: super-low-res GFS.)
Commit Message:
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
N/A
UFSWM Blocking Dependencies:
This branch correctly detects failing tests. That means the regression tests will keep failing until this bug fix is merged:
Changes
Regression Test Changes (Please commit test_changes.list):
Input data Changes:
Library Changes/Upgrades:
Testing Log: