v1.7.0
What's Changed
The primary change in this release is that PyTorch is now the main training backend.
The CMS status was presented at https://indico.cern.ch/event/1399688/#1-ml-for-pf.
- switch pytorch training to tfds array-record datasets by @farakiko in #228
- Timing the ONNX model, retrain CMS-GNNLSH-TF by @jpata in #229
- fixes for pytorch, CMS t1tttt dataset, update response plots by @jpata in #232
- fix pytorch multi-GPU training hang by @farakiko in #233
- feat: specify number of samples as cmd line arg in pytorch training and testing by @erwulff in #237
- Automatically name training dir in pytorch pipeline by @erwulff in #238
- pytorch backend major update by @farakiko in #240
- Update dist.barrier() and fix stale epochs for torch backend by @farakiko in #249
- multi-bin loss in TF, plot fixes by @jpata in #234
- PyTorch distributed num-workers>0 fix by @farakiko in #252
- speedup of the pytorch GNN-LSH model by @jpata in #245
- Implement HPO for PyTorch pipeline. by @erwulff in #246
- fix tensorboard error by @farakiko in #254
- fix config files by @erwulff in #255
- making the 3d-padded models more efficient in pytorch by @jpata in #256
- Fix pytorch inference after #256 by @jpata in #257
- Update training.py by @jpata in #261
- Reduce the number of data loader workers per dataset in pytorch by @farakiko in #262
- fix inference by @farakiko in #264
- Implementing configurable checkpointing. by @erwulff in #263
- restore onnx export in pytorch by @jpata in #265
- remove outdated forward_batch from pytorch by @jpata in #266
- Separate multiparticlegun samples from singleparticle gun samples by @farakiko in #267
- compare all three models in pytorch by @jpata in #268
- Allows testing on a given --load-checkpoint by @farakiko in #269
- added clic evaluation notebook by @jpata in #272
- Fix --load-checkpoint bug by @farakiko in #270
- Implement CometML logging to PyTorch training pipeline. by @erwulff in #273
- Add command line argument to choose experiments dir in PyTorch training pipeline by @erwulff in #274
- Implement multi-gpu training in HPO with Ray Tune and Ray Train by @erwulff in #277
- Better CometML logging + Ray Train vs DDP comparison by @erwulff in #278
- Fix checkpoint loading by @erwulff in #280
- Learning rate schedules and Mamba layer by @erwulff in #282
- use modern optimizer, revert multi-bin loss in TF by @jpata in #253
- track individual particle loss components, speedup inference by @jpata in #284
- Update the jet pt threshold to be the same as the PF paper by @farakiko in #283
- towards v1.7: new CMS datasets, CLIC hit-based datasets, TF backward-compat optimizations by @jpata in #285
- fix torch no grad by @jpata in #290
- pytorch regression output layer configurability by @jpata in #291
- Implement resume-from-checkpoint in HPO by @erwulff in #293
- enable FlashAttention in pytorch, update to torch 2.2.0 by @jpata in #292
- fix pad_power_of_two by @jpata in #296
- Feat val freq by @erwulff in #298
- normalize loss, reparametrize network by @jpata in #297
- fix up configs by @jpata in #300
- clean up loading by @jpata in #301
- Fix unpacking for 3d padded batch, update plot style by @jpata in #306
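Several of the fixes above concern padded batches (the 3d-padded models in #256 and the `pad_power_of_two` fix in #296). As a rough illustration of the idea, not the project's actual implementation, padding a sequence up to the next power-of-two length can be sketched as (helper names are hypothetical):

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def pad_to_power_of_two(seq, pad_value=0):
    """Pad a list with pad_value so its length is the next power of two.

    Padding all batches to a small set of fixed shapes avoids recompiling
    or re-tracing kernels for every distinct sequence length.
    """
    target = next_power_of_two(max(len(seq), 1))
    return seq + [pad_value] * (target - len(seq))

print(next_power_of_two(100))          # 128
print(pad_to_power_of_two([1, 2, 3]))  # [1, 2, 3, 0]
```

Bucketing lengths this way trades a little wasted compute on the padding for far fewer distinct input shapes seen by the model.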
Full Changelog: v1.6...v1.7.0