This repo contains example usage of some useful libraries for data generation/collection and deep learning in a research setting. I've been using these tools personally in many applications/projects and have distilled some common use cases here.
To install:

```bash
pip install -e .
```
- Modify `root_dir` in `cfg/generate_data.yaml` to specify where you'd like to keep output files.
- Run `python scripts/generate_data.py`.

The generated data is a simple linear function with Gaussian noise. We'll train an MLP to fit this data in the next section.
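For concreteness, a data model like the one described here (a linear function plus Gaussian noise) can be sketched in a few lines; the slope, intercept, and noise level below are made up and are not necessarily the values used by `scripts/generate_data.py`:

```python
import numpy as np

def generate_linear_data(n_samples, noise_std=0.1, seed=0):
    """Sample (x, y) pairs from y = w*x + b plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    w, b = 2.0, -1.0  # hypothetical slope/intercept
    x = rng.uniform(-1.0, 1.0, size=(n_samples, 1))
    y = w * x + b + rng.normal(0.0, noise_std, size=x.shape)
    return x.astype(np.float32), y.astype(np.float32)

x, y = generate_linear_data(n_samples=1000)
```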
This script uses Hydra to parse the config and generate the folder structure for data saving and logging, and it uses AsyncSavers to save data shards in a separate process.
The `save_every` field in the config is set to `100` and `n_samples` is set to `1000`, so 10 shards will be saved. The `tag` field gives a non-unique identifier for the data generation run.
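To make the Hydra + sharding flow concrete, here is a rough sketch of what such an entry point can look like, assuming the script lives in `scripts/` alongside a top-level `cfg/` directory. `@hydra.main` is Hydra's real decorator, but the naive in-process shard writing below is only illustrative: in this repo the saving is delegated to AsyncSavers, which writes shards from a separate process, and the shard filename/format shown here is an assumption.

```python
import os
import pickle

import hydra
import numpy as np
from omegaconf import DictConfig

@hydra.main(config_path="../cfg", config_name="generate_data")
def main(cfg: DictConfig) -> None:
    # With save_every=100 and n_samples=1000, this writes 10 shards.
    os.makedirs(cfg.root_dir, exist_ok=True)
    buffer, shard_idx = [], 0
    for _ in range(cfg.n_samples):
        x = np.random.uniform(-1.0, 1.0)
        y = 2.0 * x - 1.0 + np.random.normal(0.0, 0.1)  # noisy linear sample
        buffer.append((x, y))
        if len(buffer) == cfg.save_every:
            path = os.path.join(cfg.root_dir, f"{cfg.tag}_shard_{shard_idx}.pkl")
            with open(path, "wb") as f:
                pickle.dump(buffer, f)
            buffer, shard_idx = [], shard_idx + 1

if __name__ == "__main__":
    main()
```

Any config field can also be overridden from the command line with Hydra's `key=value` syntax (e.g. `python scripts/generate_data.py tag=my_run`).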
- Run `bash run/generate_all_data.sh` to generate data for 5 random seeds. See the file for how to set the `tag` via command-line arguments.
- Modify `root_dir` in `cfg/train_model.yaml` appropriately.
- Modify `wandb.logger.{entity, project, group}` with your Weights and Biases account/project details.
- Run `python scripts/train_model.py` to train the MLP to fit the generated data.
This script uses Hydra to parse configs and generate the local logging folder, PyTorch Lightning for training, and Weights and Biases for logging and uploading the trained model.
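The Weights and Biases side of this is typically just a `WandbLogger` handed to the Lightning `Trainer`; a minimal sketch, with placeholder values standing in for `wandb.logger.{entity, project, group}` from the config and a made-up epoch count, looks like:

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Placeholder values standing in for cfg.wandb.logger.{entity, project, group}.
logger = WandbLogger(
    entity="my-entity",
    project="my-project",
    group="linear-mlp",
    log_model=True,  # upload model checkpoints to W&B
)

trainer = pl.Trainer(max_epochs=10, logger=logger)
# trainer.fit(model, datamodule=datamodule)  # model/data come from model.py / data.py
```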
See `data_learning_boilerplate/data.py` for an example of how to load the shards saved by AsyncSavers.
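As a rough illustration of the shard-loading idea (the actual on-disk format is determined by AsyncSavers and handled in `data.py`; the list-of-(x, y)-pickles layout below just mirrors the sketch above):

```python
import glob
import os
import pickle

import numpy as np
from torch.utils.data import Dataset

class ShardDataset(Dataset):
    """Reads every shard in a directory and concatenates the samples."""

    def __init__(self, shard_dir):
        samples = []
        for path in sorted(glob.glob(os.path.join(shard_dir, "*.pkl"))):
            with open(path, "rb") as f:
                samples.extend(pickle.load(f))  # each shard holds a list of (x, y)
        xs, ys = zip(*samples)
        self.x = np.asarray(xs, dtype=np.float32).reshape(-1, 1)
        self.y = np.asarray(ys, dtype=np.float32).reshape(-1, 1)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
```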
See `data_learning_boilerplate/model.py` for an implementation of a PyTorch Lightning module.
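And a minimal LightningModule for this regression problem could look roughly like the following; the layer sizes, learning rate, and optimizer are illustrative rather than the settings used in `model.py`:

```python
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPRegressor(pl.LightningModule):
    """Small MLP trained with an MSE loss to fit the noisy linear data."""

    def __init__(self, hidden_dim=64, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self(x), y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
```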
This is relevant when generating large amounts of data with simulations or some other data collection process.
Common Problems:
- Data size is greater than memory size.
- Writing to disk is slow.
- Hard to reproduce/debug.

Solutions:
1. Save data in shards.
2. Save data in a parallel process (sketched below).
3. Save configs/tags.

Tools:
- AsyncSavers - for 1, 2
- Hydra - for 3
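The "save data in a parallel process" pattern can be sketched with just the standard library; this only illustrates the general idea and is not AsyncSavers' actual API:

```python
import multiprocessing as mp
import pickle

def writer(queue, path_template):
    """Runs in a separate process: drains the queue and writes shards to disk."""
    shard_idx = 0
    while True:
        shard = queue.get()
        if shard is None:  # sentinel meaning "no more shards"
            break
        with open(path_template.format(shard_idx), "wb") as f:
            pickle.dump(shard, f)
        shard_idx += 1

if __name__ == "__main__":
    queue = mp.Queue()
    proc = mp.Process(target=writer, args=(queue, "shard_{}.pkl"))
    proc.start()
    for _ in range(10):
        # Stand-in for an expensive simulation / data collection step.
        shard = [(i, 2 * i) for i in range(100)]
        queue.put(shard)
    queue.put(None)  # tell the writer to finish
    proc.join()
```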
Common Problems:
- A lot of boilerplate code just to get started.
- Messy logs when runs are spread across machines/users.
- Hard to reproduce/debug.

Solutions:
1. Use a framework to abstract away the unimportant logic.
2. Shareable, cloud-based logging.
3. Save configs/tags/models.

Tools:
- PyTorch Lightning - for 1, 3
- Weights and Biases - for 2, 3
- Hydra Configs - for 3
- AsyncSavers
  - Save data in shards
  - Save data in a parallel process
- Hydra
  - Composable configs
  - Output directory management
  - Override YAML configs via the command line (see the sketch after this list)
- PyTorch Lightning
  - Abstracts away unimportant training logic
  - Easily scales to multi-GPU, cluster-scale training
- Weights and Biases
  - Much more customizable than TensorBoard
  - Offloads logging to the cloud
  - Logs shared across multiple machines/users
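As a small illustration of the Hydra features above, configs can also be composed and overridden programmatically using the same `key=value` override syntax as on the command line; the config directory and field names here follow the assumptions made earlier and may differ from the actual files:

```python
from hydra import compose, initialize

# Programmatic alternative to @hydra.main, handy in notebooks and tests.
with initialize(config_path="cfg"):
    cfg = compose(config_name="generate_data", overrides=["tag=debug", "save_every=50"])
    print(cfg.tag, cfg.save_every)
```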
These libraries have many other useful features that are not explored in this template. This particular way of combining them is simply what I prefer and what has worked in my own research projects; you'll probably need to modify things for your own use cases. My goal is for this to serve as a template/reference for anyone interested in using these tools in research-like projects.