This repo contains example usage of some useful libraries for data generation/collection and deep learning in a research setting. I've been using these tools personally in many applications/projects and have distilled some common use cases here.
To install:

```bash
pip install -e .
```
- Modify `root_dir` in `cfg/generate_data.yaml` to specify where you'd like to keep output files.
- Run `python scripts/generate_data.py`.

The generated data is a simple linear function with Gaussian noise. We'll train an MLP to fit this data in the next section.
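For concreteness, a data model like the one described here (a linear function plus Gaussian noise) can be sketched in a few lines; the slope, intercept, and noise level below are made up and are not necessarily the values used by `scripts/generate_data.py`:

```python
import numpy as np

def generate_linear_data(n_samples, noise_std=0.1, seed=0):
    """Sample (x, y) pairs from y = w*x + b plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    w, b = 2.0, -1.0  # hypothetical slope/intercept
    x = rng.uniform(-1.0, 1.0, size=(n_samples, 1))
    y = w * x + b + rng.normal(0.0, noise_std, size=x.shape)
    return x.astype(np.float32), y.astype(np.float32)

x, y = generate_linear_data(n_samples=1000)
```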
This script uses Hydra to parse the config and generate the folder structure for data saving and logging, and it uses AsyncSavers to save data shards in a separate process.
The `save_every` field in the config is set to `100` and `n_samples` is set to `1000`, so 10 shards will be saved. The `tag` field gives a non-unique identifier for the data generation run.
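To make the Hydra + sharding flow concrete, here is a rough sketch of what such an entry point can look like, assuming the script lives in `scripts/` alongside a top-level `cfg/` directory. `@hydra.main` is Hydra's real decorator, but the naive in-process shard writing below is only illustrative: in this repo the saving is delegated to AsyncSavers, which writes shards from a separate process, and the shard filename/format shown here is an assumption.

```python
import os
import pickle

import hydra
import numpy as np
from omegaconf import DictConfig

@hydra.main(config_path="../cfg", config_name="generate_data")
def main(cfg: DictConfig) -> None:
    # With save_every=100 and n_samples=1000, this writes 10 shards.
    os.makedirs(cfg.root_dir, exist_ok=True)
    buffer, shard_idx = [], 0
    for _ in range(cfg.n_samples):
        x = np.random.uniform(-1.0, 1.0)
        y = 2.0 * x - 1.0 + np.random.normal(0.0, 0.1)  # noisy linear sample
        buffer.append((x, y))
        if len(buffer) == cfg.save_every:
            path = os.path.join(cfg.root_dir, f"{cfg.tag}_shard_{shard_idx}.pkl")
            with open(path, "wb") as f:
                pickle.dump(buffer, f)
            buffer, shard_idx = [], shard_idx + 1

if __name__ == "__main__":
    main()
```

Any config field can also be overridden from the command line with Hydra's `key=value` syntax (e.g. `python scripts/generate_data.py tag=my_run`).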
- Run `bash run/generate_all_data.sh` to generate data for 5 random seeds. See the file for how to set the `tag` via command-line arguments.
- Modify `root_dir` in `cfg/train_model.yaml` appropriately.
- Modify `wandb.logger.{entity, project, group}` with your Weights and Biases account/project details.
- Run `python scripts/train_model.py` to train the MLP to fit the generated data.
This script uses Hydra to parse configs and generate the local logging folder, PyTorch Lightning for training, and Weights and Biases for logging and uploading the trained model.
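The Weights and Biases side of this is typically just a `WandbLogger` handed to the Lightning `Trainer`; a minimal sketch, with placeholder values standing in for `wandb.logger.{entity, project, group}` from the config and a made-up epoch count, looks like:

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Placeholder values standing in for cfg.wandb.logger.{entity, project, group}.
logger = WandbLogger(
    entity="my-entity",
    project="my-project",
    group="linear-mlp",
    log_model=True,  # upload model checkpoints to W&B
)

trainer = pl.Trainer(max_epochs=10, logger=logger)
# trainer.fit(model, datamodule=datamodule)  # model/data come from model.py / data.py
```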
See `data_learning_boilerplate/data.py` for an example of how to load the shards saved by AsyncSavers.
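As a rough illustration of the shard-loading idea (the actual on-disk format is determined by AsyncSavers and handled in `data.py`; the list-of-(x, y)-pickles layout below just mirrors the sketch above):

```python
import glob
import os
import pickle

import numpy as np
from torch.utils.data import Dataset

class ShardDataset(Dataset):
    """Reads every shard in a directory and concatenates the samples."""

    def __init__(self, shard_dir):
        samples = []
        for path in sorted(glob.glob(os.path.join(shard_dir, "*.pkl"))):
            with open(path, "rb") as f:
                samples.extend(pickle.load(f))  # each shard holds a list of (x, y)
        xs, ys = zip(*samples)
        self.x = np.asarray(xs, dtype=np.float32).reshape(-1, 1)
        self.y = np.asarray(ys, dtype=np.float32).reshape(-1, 1)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
```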
See `data_learning_boilerplate/model.py` for an implementation of a PyTorch Lightning module.
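And a minimal LightningModule for this regression problem could look roughly like the following; the layer sizes, learning rate, and optimizer are illustrative rather than the settings used in `model.py`:

```python
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPRegressor(pl.LightningModule):
    """Small MLP trained with an MSE loss to fit the noisy linear data."""

    def __init__(self, hidden_dim=64, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self(x), y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
```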
This is relevant when generating large amounts of data with simulations or some other data collection process.
Common Problems:
- Data size is greater than memory size.
- Writing to disk is slow.
- Hard to reproduce/debug.

Solutions:
1. Save data in shards.
2. Save data in a parallel process (sketched below).
3. Save configs/tags.

Tools:
- AsyncSavers - for 1, 2
- Hydra - for 3
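The "save data in a parallel process" pattern can be sketched with just the standard library; this only illustrates the general idea and is not AsyncSavers' actual API:

```python
import multiprocessing as mp
import pickle

def writer(queue, path_template):
    """Runs in a separate process: drains the queue and writes shards to disk."""
    shard_idx = 0
    while True:
        shard = queue.get()
        if shard is None:  # sentinel meaning "no more shards"
            break
        with open(path_template.format(shard_idx), "wb") as f:
            pickle.dump(shard, f)
        shard_idx += 1

if __name__ == "__main__":
    queue = mp.Queue()
    proc = mp.Process(target=writer, args=(queue, "shard_{}.pkl"))
    proc.start()
    for _ in range(10):
        # Stand-in for an expensive simulation / data collection step.
        shard = [(i, 2 * i) for i in range(100)]
        queue.put(shard)
    queue.put(None)  # tell the writer to finish
    proc.join()
```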
Common Problems:
- A lot of boilerplate code just to get started.
- Messy logs when runs are spread across machines/users.
- Hard to reproduce/debug.

Solutions:
1. Use a framework to abstract away the unimportant logic.
2. Shareable, cloud-based logging.
3. Save configs/tags/models.

Tools:
- PyTorch Lightning - for 1, 3
- Weights and Biases - for 2, 3
- Hydra Configs - for 3
- AsyncSavers
  - Save data in shards
  - Save data in a parallel process
- Hydra
  - Composable configs
  - Output directory management
  - Override YAML configs via the command line (see the sketch after this list)
- PyTorch Lightning
  - Abstracts away unimportant training logic
  - Easily scales to multi-GPU, cluster-scale training
- Weights and Biases
  - Much more customizable than TensorBoard
  - Offloads logging to the cloud
  - Logs shared across multiple machines/users
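As a small illustration of the Hydra features above, configs can also be composed and overridden programmatically using the same `key=value` override syntax as on the command line; the config directory and field names here follow the assumptions made earlier and may differ from the actual files:

```python
from hydra import compose, initialize

# Programmatic alternative to @hydra.main, handy in notebooks and tests.
with initialize(config_path="cfg"):
    cfg = compose(config_name="generate_data", overrides=["tag=debug", "save_every=50"])
    print(cfg.tag, cfg.save_every)
```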
These libraries have many other useful features that are not explored in this template. This particular way of combining them is simply what I prefer and what has worked in my own research projects; you'll probably need to modify things for your own use cases. My goal is for this to serve as a template/reference for anyone interested in using these tools in research-like projects.