handle automatic caching for RETURNN configs #349

JackTemaki · 2022-12-06T16:46:09Z

So far we have relied on the RETURNN-internal cache manager access that is implemented e.g. for HDFDataset or OggZipDataset. Now, when training LMs, I added a caching function manually to the config and added the cf call directly via CodeWrapper and DelayedFormat.

While this was the fastest way to get the training to work, I do not really like this approach. I would prefer that the ReturnnConfig itself handles this, meaning that it writes the def cf definition and updates the paths accordingly, without any potential influence on the hash and completely independent of the setup pipeline (legacy, returnn_common, etc.).

The question is whether we want to rely on the internal "cached" marking or find another way. We could, e.g., simply force this for all Paths we find in the config. I am open to suggestions.

michelwi · 2022-12-06T16:57:42Z

But keep in mind that the cache manager is an i6-specific thing, and we should try to keep our recipes generic enough that they are also applicable on other clusters (ITC, Paderborn, AppTek).

JackTemaki · 2022-12-06T17:02:08Z

> But keep in mind that the cache manager is an i6-specific thing, and we should try to keep our recipes generic enough that they are also applicable on other clusters (ITC, Paderborn, AppTek).

Yes sure!

albertz · 2022-12-06T20:42:23Z

Related: #310

albertz · 2022-12-06T20:50:51Z

I'm not sure whether there is a good and generic way to do this automatically.

Also, e.g. in #310, for ExternSprintDataset, I use lambdas quite a lot, because I don't want the code (the cf function) to be executed when the config is loaded, but only when it is really used. The reason is that I potentially want to load many RETURNN configs efficiently without any such side effects. See the example in #310 (comment). There I introduce _DelayedCodeFormat. Then I use cf explicitly.
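To illustrate the point about lambdas, here is a minimal sketch in plain Python. It is not the code from #310; the cf stand-in, the "/local/cache" prefix and the config keys are made up for the example. It only shows that a value wrapped in a lambda triggers no cf call at load time.

```python
# Records every cf invocation so the side effect is visible.
calls = []


def cf(filename):
    """Stand-in for the cache manager call; a real one would copy the file."""
    calls.append(filename)
    return "/local/cache" + filename


# Eager: cf runs as soon as the config module is loaded.
eager_config = {"lm_file": cf("/data/lm.txt")}

# Lazy: cf runs only when the consumer actually calls the lambda.
lazy_config = {"lm_file": lambda: cf("/data/lm.txt")}
```

After loading, only the eager entry has caused a cf call. The lazy entry resolves the path the first time lazy_config["lm_file"]() is called.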

Instead of using cf explicitly, we could introduce a generic wrapper, like DelayedCachedFile or so, which generates such cf(filename) code when serialized, and we can just define the Sisyphus hash (_sis_hash) as the same as the file ref itself. That way, replacing any tk.Path by DelayedCachedFile would not change the hash.

DelayedCachedFile could also apply cf only optionally, when the user enables it, and otherwise output the filename itself.
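A rough sketch of what such a wrapper could look like. This is only an illustration under assumptions: FakePath is a stand-in for tk.Path so the snippet runs without Sisyphus, the enable_caching flag is a made-up name, and it assumes the config serialization writes repr() of the object (as a code wrapper would) and that the hash is taken from a _sis_hash method.

```python
class FakePath:
    """Stand-in for tk.Path (assumption); only what this sketch needs."""

    def __init__(self, path):
        self.path = path

    def get_path(self):
        return self.path

    def _sis_hash(self):
        return b"Path:" + self.path.encode("utf8")


class DelayedCachedFile:
    """Hypothetical wrapper that serializes to cf(filename) code."""

    def __init__(self, path, enable_caching=True):
        self.path = path
        self.enable_caching = enable_caching

    def __repr__(self):
        # Emitted into the config: either a cf(...) call or the plain filename.
        filename = self.path.get_path()
        if self.enable_caching:
            return "cf(%r)" % filename
        return repr(filename)

    def _sis_hash(self):
        # Same hash as the wrapped path, so swapping it in changes nothing.
        return self.path._sis_hash()
```

With this, repr(DelayedCachedFile(FakePath("/data/lm.txt.gz"))) gives cf('/data/lm.txt.gz'), and with enable_caching=False it gives just the quoted filename. The hash is identical to the wrapped path's hash in both cases.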

Or you can move this logic to the cf function, and just return the filename itself when caching is not enabled.

JackTemaki · 2023-09-22T15:26:46Z

I will close this for now. We have caching in RETURNN itself for heavy data like HDFs or ogg-zips, and there are sufficient options in the serialization helpers to wrap paths with caching.

JackTemaki assigned albertz, JackTemaki, mmz33, michelwi and Atticus1806 Dec 6, 2022

JackTemaki closed this as completed Sep 22, 2023
