Model not saving when using distributed training #65

eugfomitcheva · 2023-09-20T18:11:34Z

Hi -

I am trying to use the DECIModule with distributed training on 4 GPUs. In the distributed case, saving the model yields an empty module, whereas when training on a single GPU, saving sem_module gives the expected model.

Is there a known approach to saving the model properly in the distributed training case? I am using the default PyTorch DDP strategy.

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,  # distribute training
        max_epochs=1000,
        fast_dev_run=test_run,
        callbacks=[TQDMProgressBar(refresh_rate=19), checkpoint_callback],
        enable_checkpointing=True,
    )

    # Training the model
    trainer.fit(lightning_module, datamodule=data_module)
    torch.save(lightning_module.sem_module, "model.pt")


confoundry · 2023-09-22T09:37:39Z

Hi @eugfomitcheva,

We haven't used multi-GPU training ourselves, so we are relying on Lightning's functionality here. I'm not sure how Lightning handles it, but with DDP the module usually ends up nested under a .module attribute of the wrapper. If you do manage to make progress on getting it to work, please open a PR and we can incorporate it.
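
To illustrate the .module nesting mentioned above, here is a minimal sketch in plain PyTorch. It is not a confirmed fix for this issue, and it assumes the cause is the wrapper nesting and/or every DDP process writing the same file. The helper names (unwrap_ddp, is_rank_zero, save_state_dict_on_rank_zero) are hypothetical and not part of this repository or of Lightning.

```python
import torch
from torch import nn


def unwrap_ddp(module: nn.Module) -> nn.Module:
    """Return the inner model if `module` is a DDP/DataParallel wrapper."""
    wrappers = (nn.parallel.DistributedDataParallel, nn.DataParallel)
    if isinstance(module, wrappers):
        return module.module
    return module


def is_rank_zero() -> bool:
    """True outside distributed runs, or on global rank 0 inside one."""
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return torch.distributed.get_rank() == 0
    return True


def save_state_dict_on_rank_zero(module: nn.Module, path: str) -> bool:
    """Write the unwrapped module's state_dict from rank 0 only.

    Returns True if this process wrote the file, False otherwise.
    """
    if not is_rank_zero():
        return False
    inner = unwrap_ddp(module)
    # Copy tensors to CPU so the file can be loaded on a machine without GPUs.
    state = {k: v.detach().cpu() for k, v in inner.state_dict().items()}
    torch.save(state, path)
    return True
```

In the script above this would be called as save_state_dict_on_rank_zero(lightning_module.sem_module, "model.pt") after trainer.fit(...). Saving the state_dict rather than the pickled module is a deliberate choice in this sketch: it avoids serializing wrapper objects and keeps the parameter names free of the "module." prefix.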
