Multi-GPU Training with DataParallel Results in RuntimeError #1

Open
ronigold opened this issue Sep 21, 2023 · 1 comment

Description
I am trying to perform multi-GPU training using the DataParallel wrapper from PyTorch. When I try to run the fit method, I encounter a RuntimeError saying that the parameters and buffers must be on the same device.

Here's a snippet of the code that I am using:

# Initialize learner and model
learn = Learner(...)
learn.model = ...

# Attempt to use DataParallel
model = nn.DataParallel(learn.model, device_ids=[1, 2, 3])
learn.model = model

# Update DataLoader device
learn.dls.device = torch.device("cuda:1")

# Clear cache
torch.cuda.empty_cache()

# Start training
learn.fit(1)

Error Message
The error message I receive is:

RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:3
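
For context, `nn.DataParallel` expects every parameter and buffer of the wrapped module to already live on `device_ids[0]` when the forward pass runs, which is the condition the message above reports as violated. Below is a minimal sketch of satisfying that precondition, assuming a plain `nn.Module` that can be moved with `.to()`. The helper name `wrap_for_data_parallel` is hypothetical and is not part of PyTorch or of this library, and a model that was loaded sharded across GPUs or quantized may not be movable this way.

```python
from typing import Sequence

import torch
import torch.nn as nn


def wrap_for_data_parallel(model: nn.Module, device_ids: Sequence[int]) -> nn.Module:
    """Hypothetical helper: move `model` to device_ids[0], then wrap it.

    Assumes `device_ids` is non-empty. If the requested GPUs are not all
    present, the model is returned unwrapped on the CPU instead.
    """
    gpus_present = torch.cuda.is_available() and torch.cuda.device_count() > max(device_ids)
    if not gpus_present:
        return model.to("cpu")

    # DataParallel checks at forward time that all parameters and buffers
    # sit on device_ids[0], so move the whole module there before wrapping.
    primary = torch.device(f"cuda:{device_ids[0]}")
    model = model.to(primary)
    return nn.DataParallel(model, device_ids=list(device_ids))
```

In the snippet above this would correspond to something like `learn.model = wrap_for_data_parallel(learn.model, [1, 2, 3])`, though whether the Learner accepts a wrapped model is not confirmed here.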

Environment
PyTorch version: (e.g., 1.9.0)
Library version: (e.g., 0.2.0)
CUDA/cuDNN version: (e.g., CUDA 11.8, cuDNN 8.2.1)
GPU models and configuration: (e.g., 4x Tesla T4)
Operating System: (e.g., Ubuntu 18.04)

Additional Context
I've tried to set both the model and the DataLoader to the same device, but without success. It seems that the model parameters and the DataLoader end up on different devices during training, which causes the error.

Would appreciate any guidance on how to resolve this issue or if it's something that needs to be addressed in the library.


ronigold commented Sep 21, 2023

Update:

I was able to train as shown in the notebook, not on multiple GPUs but on a single GPU with 16 GB of memory, by adding quantization to the model:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; compute runs in bfloat16
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_id = 'meta-llama/Llama-2-7b-hf'

llama_base = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,
    token=TOKEN,  # Add your token here (TOKEN must be defined beforehand)
    quantization_config=nf4_config
)

I've looked through the base code a bit, but I'd like to make sure:
When I call the fit method, does a regular training loop run behind the scenes, rather than one based on DeepSpeed or LoRA?
I ask because it's quite surprising that I was able to train on a single GPU, even with the quantization.
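
As a rough sanity check on the memory side, here is a back-of-the-envelope estimate of the weight footprint alone. It is only a sketch: it assumes a nominal 7 billion parameters and ignores activations, gradients, optimizer state, and the small overhead of quantization constants. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: nominal parameter count for a "7B" model.
NOMINAL_PARAMS = 7e9

bf16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)  # 2 bytes per parameter
nf4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)    # 0.5 bytes per parameter

print(f"bf16 weights: {bf16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the weights shrink from roughly 13 GiB to roughly 3.3 GiB, which would explain how loading fits in 16 GB, but it does not by itself say what the fit method does with gradients and optimizer state.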
