Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken GPU tests for CLUBB code #1226

Open
wants to merge 4 commits into
base: cam_development
Choose a base branch
from

Conversation

sjsprecious
Copy link
Collaborator

This PR fixes the broken ERS tests due to the recent GPU changes of CLUBB code (PR #1175).

Note that I need a new ccs_config tag from ESMCI/ccs_config_cesm#204 to complete this PR.

Closes #1220

@sjsprecious sjsprecious marked this pull request as draft January 15, 2025 18:38
@sjsprecious sjsprecious requested a review from nusbaume January 15, 2025 18:38
@sjsprecious sjsprecious added BFB bit for bit tag bug-fix This PR was created to fix a specific bug. labels Jan 15, 2025
@huebleruwm
Copy link

This fix confused me for a while, but I think I can explain it. If we have the setting clubb_config_flags%ipdf_call_placement = ipdf_post_advance_fields (which is default) we need 5 members of pdf_params_zm_chnk (w_1,w_2,varnce_w_1,varnce_w_2,mixt_frac) to be set before calling clubb. On the CPU this is handled just by pdf_params_zm_chnk being a module variable whose member arrays get allocated once at the beginning of the run and never deallocated, but I hadn't realized this and added these arrays to acc create to only allocate space for them, rather than copyin their values. So @sjsprecious, your change does seem like a good fix, and we should probably move pdf_params_zm_chnk(lchnk)%mixt_frac from the create list to the copy list as well.

I am unclear on why this didn't break the ECT test though. The values of those 5 member arrays should be different on the GPU than the CPU due to the lack of correct copying, and I would expect that to change answers.

@sjsprecious
Copy link
Collaborator Author

Thanks @huebleruwm for your feedback.

The five variables mentioned by you are initialized here and I think it should be fine to use acc create without any additional data copy (just like your original implementation). Are these variables used somewhere else before this initialization? If so, then it may make sense that acc create is not enough.

In addition, I guess even if we need the initial values of those variables, we should just do a acc copyin? But what only works so far is acc copy, which means we have to copy those variables back to CPU. This seems unnecessary to me. Do you have a thought on this?

@huebleruwm
Copy link

They're only initialized if it's the first restart step, otherwise we rely on the values from the previous timestep being maintained. It might be a more robust solution to remove the is_first_restart_step() part of the if statment here, so that these are initialized from the pbuf stored values, rather than relying on the module variable. That would allow us to keep the 5 fields as acc create.

Right now we would need them as acc copy, because they're updated each timestep by the GPU code, and only "saved" by keeping them in the module variable pdf_params_zm, which is only kept by the CPU timestep to timestep.

@sjsprecious
Copy link
Collaborator Author

Ah I see. Thanks for your clarification. That is clear to me now.

Would you like to implement the more robust solution or move on with my fix here? For the former one, you can either issue a PR to my branch or issue a different PR here to drop mine. For the latter one, I can simply add the mixt_frac variable suggested by you and let Jesse start the review process.

Regarding the ECT test, I am also surprised that the GPU code can pass given that it seems like a code bug. Are these five variables for diagnostic purpose only and not output to the history file? If so, they may not be used for ECT at all.

@huebleruwm
Copy link

huebleruwm commented Jan 21, 2025

I think it's simplest to go with your fix and move mixt_frac to the acc copy list. I can always make the more robust fix later and test it better in our branch first.

Those variables aren't just diagnostic and should affect the output. I took a look at where they're used internally though, it's to set the bounds of a complicated clipping routine that is rarely called (actually the same place where the initial ECT test breaking bug was). So I suppose it's possible that we need to run more timesteps before the initial value of those variables ever become important, and the ECT test just doesn't run enough timesteps to tease out the bug.

@sjsprecious
Copy link
Collaborator Author

Thanks @huebleruwm . That sounds good. I have added the mixt_frac variable to the acc copy list and move this PR out of draft state for review.

Your comments on the ECT make sense to me. If that clipping routine and five variables are rarely called/used, the output may not be changed significantly within a few time steps between CPU and GPU (though the answer is not BFB). Thus the ECT test won't treat it as an error. As you said, maybe we need to run a longer ECT test to capture such kind of bug.

@sjsprecious sjsprecious marked this pull request as ready for review January 21, 2025 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB bit for bit tag bug-fix This PR was created to fix a specific bug. CoupledEval3
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

4 participants