I am trying to use the DECIModule with distributed training on 4 GPUs. Saving the model in the distributed case yields an empty module, whereas when training on one GPU I save the expected model via sem_module.
Is there a known approach to saving the model properly in the distributed training case? I am using the default PyTorch DDP strategy.
import pytorch_lightning as pl
import torch
from pytorch_lightning.callbacks import TQDMProgressBar

trainer = pl.Trainer(
accelerator="gpu",
devices=4, # distribute training
max_epochs=1000,
fast_dev_run=test_run,
callbacks=[TQDMProgressBar(refresh_rate=19), checkpoint_callback],
enable_checkpointing=True,
)
# Training the model
trainer.fit(lightning_module, datamodule=data_module)
torch.save(lightning_module.sem_module, "model.pt")
We haven't used multi-GPU training, so we are relying on the Lightning functionality. I'm not sure how Lightning handles this, but usually with DDP your module gets nested under a .module member. If you do manage to make any progress on getting it to work, create a PR and we can incorporate it.
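For what it's worth, here is a minimal, untested sketch of two workarounds, assuming the variable names from the snippet above (trainer, lightning_module, data_module) and that sem_module is a plain nn.Module attribute: either let the Trainer write a Lightning checkpoint (only rank zero writes the file), or guard the raw torch.save with trainer.is_global_zero and unwrap a possible DDP .module layer as suggested above.
# Minimal sketch (untested with DECIModule): two ways to persist the trained weights
# when running under DDP. Variable names follow the snippet in the issue description.

trainer.fit(lightning_module, datamodule=data_module)

# Option 1: let Lightning write a full checkpoint; only the global-zero rank writes
# the file, and sem_module can be extracted after reloading the checkpoint.
trainer.save_checkpoint("deci_checkpoint.ckpt")

# Option 2: save the raw submodule on the global-zero rank only, unwrapping a DDP
# `.module` wrapper if one is present (no-op if the submodule is not wrapped).
if trainer.is_global_zero:
    sem_module = lightning_module.sem_module
    sem_module = getattr(sem_module, "module", sem_module)
    torch.save(sem_module, "model.pt")
With the plain "ddp" strategy the code after trainer.fit runs in every process, so the rank-zero guard avoids several processes writing the same file; with spawn-based strategies the in-memory module in the launching process may not reflect the trained weights at all, in which case reloading from the Lightning checkpoint is the safer route.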