Issue summary
I am training GoogleNet (v1) on the ImageNet dataset with Caffe. For single-GPU (MI25) training I use a batch size of 128. When I move to multi-GPU training on hipCaffe with four MI25 cards, the total GPU memory is four times larger (16 GB x 4), so a batch of 512 images (128 images/batch/card) should fit. In my tests, however, the batch size cannot be increased at all; even 192 (a multiple of 64) fails with "error: 'hipErrorMemoryAllocation' (1002)".
Since the batch size is stuck at 128, a rough estimate puts the four-card training time at 3 to 3.5x longer than on a 4x P100 system (batch_size=512).
Are there any environment parameters I should set before training that would allow a larger batch size in multi-GPU training?
As a cross-check, on one of my NVIDIA 4x P100 servers the batch size can be increased as long as more cards are used. The batch sizes quoted above are based on my experience training the same network on the same dataset with NVIDIA P100 (16 GB) and V100 (16 GB) cards.
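For reference, the batch size in these experiments is the one set in the data layer of the bvlc_googlenet train prototxt. A minimal sketch of that layer (the path and values below are illustrative, not the exact stock file):

    layer {
      name: "data"
      type: "Data"
      top: "data"
      top: "label"
      include { phase: TRAIN }
      data_param {
        source: "examples/imagenet/ilsvrc12_train_lmdb"  # illustrative LMDB path
        batch_size: 128  # value varied between 128, 192, and 512 in these tests
        backend: LMDB
      }
    }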
Steps to reproduce
Use the bvlc_googlenet training network that ships under the hipCaffe installation path, with the ImageNet dataset downloaded from the official ImageNet website.
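A single-card run along these lines reproduces the baseline (a sketch only, assuming a Make-built hipCaffe tree and that the prototxt data paths already point at your ImageNet LMDBs):

    cd ~/hipCaffe                 # assumed checkout location
    ./build/tools/caffe train \
        --solver=models/bvlc_googlenet/solver.prototxt \
        --gpu 0                   # single MI25; batch_size is taken from train_val.prototxt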
Your system configuration
Operating system: Ubuntu 16.04.3
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7.12
Other:
miopen-hip 1.1.4
miopengemm 1.1.5
rocm-libs 1.6.180
Server: Inventec P47
GPU: AMD MI25 x4
CPU: AMD EPYC 7601 x2
Memory: 512GB
Thanks for the feedback. If I'm understanding your comments correctly, I believe I just reproduced your setup, but I didn't hit OOM errors.
First, reboot and try re-running your workload.
If that doesn't work, can you please send the results of hipInfo? See this directory: /opt/rocm/hip/samples/1_Utils/hipInfo. Also, can you show how you are running this 4-GPU workload?
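For example, something along these lines should build and run it (assuming the sample's stock Makefile is present):

    cd /opt/rocm/hip/samples/1_Utils/hipInfo
    make          # builds the hipInfo binary with hipcc
    ./hipInfo     # prints per-device properties, including total global memory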
Thanks,
Jeff
PS - Here's an example of how you might accomplish a 4-GPU run. You'll have to point the prototxt files to wherever you have ImageNet data located.
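Roughly like this (a sketch only; adjust the paths for your local hipCaffe build and ImageNet LMDBs):

    cd ~/hipCaffe
    ./build/tools/caffe train \
        --solver=models/bvlc_googlenet/solver.prototxt \
        --gpu 0,1,2,3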
Hi @parallelo, thanks for your feedback. The system, together with the MI25 cards, is currently at the SC17 show. I will provide an update when I get the system back.