job crashes early in hdfio #1422

Open
freyso opened this issue May 21, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@freyso
Contributor

freyso commented May 21, 2024

Summary

A SPHInX (restart) job fails to run due to a failure in hdfio. The error message is "ValueError: Objects can be only recovered from hdf5 if TYPE is given".

I cannot tell if this is related to restart.

pyiron Version and Platform

cmti

Expected Behavior

Job runs.

Actual Behavior

The job crashes during execution with the following error.out:

> Traceback (most recent call last):
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/runpy.py", line 196, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/cli/__main__.py", line 3, in <module>
>     main()
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/cli/control.py", line 59, in main
>     args.cli(args)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/cli/wrapper.py", line 37, in main
>     job_wrapper_function(
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 161, in job_wrapper_function
>     job = JobWrapper(
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 64, in __init__
>     self.job = pr.load(int(job_id))
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/project/jobloader.py", line 104, in __call__
>     return super().__call__(job_specifier, convert_to_object=convert_to_object)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/project/jobloader.py", line 75, in __call__
>     return self._project.load_from_jobpath(
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/project/generic.py", line 1001, in load_from_jobpath
>     job = job.to_object()
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/jobs/job/core.py", line 596, in to_object
>     return self.project_hdf5.to_object(object_type, **qwargs)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/storage/hdfio.py", line 1142, in to_object
>     return _to_object(self, class_name, **kwargs)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/storage/hdfio.py", line 117, in _to_object
>     raise ValueError("Objects can be only recovered from hdf5 if TYPE is given")
> ValueError: Objects can be only recovered from hdf5 if TYPE is given

Steps to Reproduce

?? Deleting and setting up the job again produces the error again.

@freyso freyso added the bug Something isn't working label May 21, 2024
@samwaseda
Member

Hm, there's not a single line coming from SPHInX in the error message. Do you have a small code snippet to reproduce the error?

@pmrv
Contributor

pmrv commented May 21, 2024

Could it be that there's a stray entry in the database from a time when you deleted the job files manually outside of pyiron?
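
One way to test this hypothesis is to compare the database entries against the files on disk. The following is only a minimal sketch: `find_stray_entries` is a hypothetical helper, not part of pyiron, and it assumes that each job is stored as `<job_name>.h5` directly inside its project directory and that you have already extracted `(job_name, project_dir)` pairs from the job table yourself.

```python
from pathlib import Path


def find_stray_entries(entries):
    """Return the (job_name, project_dir) pairs that have no HDF5 file on disk.

    Hypothetical helper. Assumes the job file is <project_dir>/<job_name>.h5.
    """
    stray = []
    for job_name, project_dir in entries:
        job_file = Path(project_dir) / f"{job_name}.h5"
        if not job_file.is_file():
            stray.append((job_name, project_dir))
    return stray
```

Any pair returned here would be a candidate for a stale database entry left behind after a manual deletion.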

@samwaseda
Member

Can you also maybe try to see whether a different version of pyiron helps? It might help us figure out which changes could have caused the problem.

@freyso
Contributor Author

freyso commented May 22, 2024

Changing to pyiron/2024-05-20 seemed to help. I was on pyiron/latest before, which apparently is NOT latest. Is it possible that the pyiron version used on the cluster is incompatible with the pyiron/latest on the login node?

This is a VERY frustrating experience I am having here. Loads of incomprehensible warnings. Error messages with zero information value. 'Objects can be only recovered from hdf5 if TYPE is given' is essentially a 'Something error occurred'.

I am closing the ticket; there is nothing to gain here anymore.

@freyso freyso closed this as completed May 22, 2024
@jan-janssen
Member

> Changing to pyiron/2024-05-20 seemed to help. I was on pyiron/latest before, which apparently is NOT latest. Is it possible that the pyiron version used on the cluster is incompatible with the pyiron/latest on the login node?

@niklassiemer Can you comment on this?

@samwaseda
Member

Hmmm, to my taste the issue got closed a bit too early. If there are updates, I would appreciate it if you posted them here.

@niklassiemer
Member

> > Changing to pyiron/2024-05-20 seemed to help. I was on pyiron/latest before, which apparently is NOT latest. Is it possible that the pyiron version used on the cluster is incompatible with the pyiron/latest on the login node?
>
> @niklassiemer Can you comment on this?

pyiron/latest is indeed, after all, the hand-updated version with Python 3.10, which was somewhat older than the docker-stack build from yesterday. However, the version on the cluster and the one on the login node should not differ! The kernel chosen in the notebook should also be loaded on the compute node, because the environment is preserved. If this is not the case, I need to know so that I can find a solution!
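
To check whether both sides really see the same environment, one could run the same small script on the login node and inside a job on the compute node and compare the output. This is a sketch using only the standard library; `environment_report` is a hypothetical helper, and the default package names are an assumption about what is installed in the kernel.

```python
import sys
from importlib import metadata


def environment_report(packages=("pyiron_base", "pyiron_atomistics")):
    """Collect interpreter path, Python version, and installed package versions."""
    report = {"python": sys.version.split()[0], "executable": sys.executable}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment
            report[name] = None
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the two reports differ in `executable` or in a package version, the environment was not carried over to the compute node.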

@freyso
Contributor Author

freyso commented May 24, 2024

Got the problem again, with the new kernel. So it's not about the python kernel.

I solved the problem again, this time by avoiding a minus sign in the job name. I may have done this last time, too.

Is it possible that the appearance of a minus sign in the job name causes issues? It seems reproducible.
E20Vnm-test - fails in hdfio
E20Vnm_neutral - runs.
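
Until the cause is confirmed, a defensive check on the job name before creating the job may help. The sketch below is an assumption about what counts as "safe" (letters, digits and underscores, not starting with a digit); it is not taken from pyiron, and both function names are hypothetical.

```python
import re

# Assumed safe pattern: a letter or underscore, followed by letters, digits or underscores
_SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_safe_job_name(name):
    """Return True if the name contains only characters assumed to be safe."""
    return _SAFE_NAME.fullmatch(name) is not None


def suggest_job_name(name):
    """Replace every unsafe character with an underscore."""
    candidate = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if not candidate or candidate[0].isdigit():
        candidate = "job_" + candidate
    return candidate
```

With the two names above, `is_safe_job_name("E20Vnm-test")` is `False` and `is_safe_job_name("E20Vnm_neutral")` is `True`; `suggest_job_name("E20Vnm-test")` gives `E20Vnm_test`.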

@freyso freyso reopened this May 24, 2024
@freyso
Contributor Author

freyso commented May 24, 2024

Another thought: this could be an inconsistency in the name normalization. In the HDF5 file name, '-' seems to be replaced by 'm', while in the job table the '-' is still there. The working directory becomes E20Vnmmtest_hdf/E20Vnm-test/, a mixture of both.
This confused me at some point, which is why I had changed from minus to underscore. Yet minus is more convenient for me to type, so chances are high that I will do this again.
Also, when I remove the job via pr.remove_job, the _hdf5 directory stays in place.
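
The suspected mismatch can be written down as a small sketch. It only mirrors what was observed in this thread: the '-' to 'm' replacement and the `_hdf5` directory suffix are assumptions based on those observations, not on the pyiron_base source, and `observed_names` is a hypothetical helper.

```python
def observed_names(job_name):
    """Reproduce the three spellings of a job name reported in this thread."""
    normalized = job_name.replace("-", "m")  # assumption: '-' becomes 'm'
    return {
        "hdf5_file": f"{normalized}.h5",  # normalized name
        "job_table": job_name,  # original name, '-' kept
        "working_dir": f"{normalized}_hdf5/{job_name}",  # mixture of both
    }


def names_agree(job_name):
    """True if the normalized and the original name are identical."""
    return job_name.replace("-", "m") == job_name
```

For `E20Vnm-test` the three entries disagree, while for `E20Vnm_neutral` they all use the same spelling. A lookup that uses one spelling to find data stored under another would fit the "TYPE is given" error, but that is a guess.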

@niklassiemer
Member

Thanks for coming back to this! This could indeed be a reason! I opened an issue on pyiron_base.


5 participants