handle automatic caching for RETURNN configs #349

Closed
JackTemaki opened this issue Dec 6, 2022 · 5 comments

@JackTemaki
Contributor

JackTemaki commented Dec 6, 2022

So far we have relied on RETURNN's internal cache manager access, which is implemented e.g. for HDFDataset or OggZipDataset. Now, when training LMs, I added a caching function manually to the config and added the cf call directly via CodeWrapper and DelayedFormat.

While this was the fastest way to get the training to work, I do not really like it. I would prefer that the ReturnnConfig itself handles this, meaning that it writes the def cf definition and updates the paths accordingly, without any influence on the hash and completely independent of the setup pipeline (legacy, returnn_common, etc.).

The question is whether we want to rely on the internal cached marking or find another way. We could e.g. simply force this for all Paths we find in the config. I am open to suggestions.
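As a rough illustration of the "force it for all Paths" option, the sketch below walks a nested config dictionary and replaces every path object with a wrapper that serializes to a cf(...) call. `FilePath`, `CachedCall` and `wrap_paths` are hypothetical names made up for this sketch; `FilePath` only stands in for `tk.Path`, and none of this is existing `ReturnnConfig` behavior.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FilePath:
    """Hypothetical stand-in for tk.Path; only holds a path string."""

    path: str


@dataclass(frozen=True, repr=False)
class CachedCall:
    """Hypothetical wrapper that serializes to a cf(...) call."""

    path: FilePath

    def __repr__(self) -> str:
        # In this sketch, repr() is what ends up in the written config text.
        return f"cf({self.path.path!r})"


def wrap_paths(obj: Any) -> Any:
    """Recursively replace every FilePath in dicts, lists and tuples."""
    if isinstance(obj, FilePath):
        return CachedCall(obj)
    if isinstance(obj, dict):
        return {key: wrap_paths(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(wrap_paths(value) for value in obj)
    return obj


# Example config entry; the file path is made up.
config = {
    "train": {"class": "LmDataset", "corpus_file": FilePath("/work/lm/train.txt.gz")},
    "batch_size": 1000,
}
wrapped = wrap_paths(config)
print(f"train = {wrapped['train']!r}")
```

Non-path values such as `batch_size` pass through unchanged, so only the serialized form of the paths differs.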

@michelwi
Contributor

michelwi commented Dec 6, 2022

But keep in mind that the cache manager is an i6-specific thing, and we should try to keep our recipes generic enough so that they are also applicable on other clusters (ITC, Paderborn, AppTek).

@JackTemaki
Contributor Author

> But keep in mind that the cache manager is an i6-specific thing, and we should try to keep our recipes generic enough so that they are also applicable on other clusters (ITC, Paderborn, AppTek).

Yes sure!

@albertz
Member

albertz commented Dec 6, 2022

Related: #310

@albertz
Member

albertz commented Dec 6, 2022

I'm not sure whether there is a good and generic way to do this automatically.

Also, e.g. in #310, for ExternSprintDataset, I use lambdas quite a lot, because I don't want the code (the cf function) to be executed when the config is loaded (I potentially want to load many RETURNN configs efficiently without any such side effects), but only when it is really used. See the example in #310 (comment). There I introduce _DelayedCodeFormat and then use cf explicitly.

Instead of using cf explicitly, we could introduce a generic wrapper, like DelayedCachedFile or so, which generates such cf(filename) code when serialized, and we can just define its Sisyphus hash (_sis_hash) to be the same as that of the file ref itself. That way, replacing any tk.Path by DelayedCachedFile would not change the hash.

DelayedCachedFile could also apply cf only optionally, when the user enables it, and otherwise output the file itself.
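A minimal standalone sketch of what such a wrapper could look like. It does not import Sisyphus: `FilePath` is a hypothetical stand-in for `tk.Path`, its SHA-256 hash is only a placeholder for whatever Sisyphus really computes, and `get()` and `use_cache` are assumed names for the serialization hook and the opt-in switch.

```python
import hashlib


class FilePath:
    """Hypothetical stand-in for tk.Path."""

    def __init__(self, path: str):
        self.path = path

    def _sis_hash(self) -> bytes:
        # Placeholder hash; the real Sisyphus hashing is more involved.
        return hashlib.sha256(self.path.encode("utf8")).digest()


class DelayedCachedFile:
    """Sketch: serializes to cf(<path>) but hashes like the wrapped path."""

    def __init__(self, path: FilePath, use_cache: bool = True):
        self.path = path
        self.use_cache = use_cache

    def _sis_hash(self) -> bytes:
        # Delegate to the wrapped path, so swapping a path for this
        # wrapper leaves the hash unchanged.
        return self.path._sis_hash()

    def get(self) -> str:
        """Return the code string to write into the config."""
        if self.use_cache:
            return f"cf({self.path.path!r})"
        # Caching disabled: emit the plain filename literal.
        return repr(self.path.path)
```

Because `use_cache` is not part of the hash in this sketch, toggling it would not change the hash either, which matches the intent that caching should have no hash influence.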

Or you can move this logic to the cf function, and just return the filename itself when caching is not enabled.

@JackTemaki
Contributor Author

I will close this for now: we have the caching in RETURNN itself for heavy data like HDFs or ogg-zips, and there are sufficient options in the serialization helpers to wrap paths with caching.
