Dear all, I am not sure whether the issue I am having is a bug, but I am getting segmentation faults with toy code that runs fine on other machines, yet not on our DGX-1 box (details of the model we have can be found here: https://developer.nvidia.com/blog/dgx-1-fastest-deep-learning-system/). The machine has 8 V100 GPUs, and these have P2P access in sets of four; the diagram shows the possible P2P communications (green arrows). This is the output of the machine configuration via the NVIDIA tools, in agreement with the diagram:

> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU5) : No
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU6) : No
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU7) : No
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU4) : No
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU6) : No
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU7) : No
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU4) : No
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU5) : No
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU7) : No
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU4) : No
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU5) : No
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU6) : No
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU1) : No
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU2) : No
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU3) : No
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU0) : No
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU2) : No
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU3) : No
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU0) : No
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU1) : No
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU3) : No
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU0) : No
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU1) : No
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU2) : No
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU6) : Yes
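As a cross-check, the same peer-access matrix can be queried directly from the CUDA runtime, independently of MXNet. The sketch below is not part of the original script; it assumes libcudart.so is on the loader path and only uses the standard cudaGetDeviceCount and cudaDeviceCanAccessPeer runtime calls, printing a matrix in the same 'v'/'.' style as the kvstore log further down.

```python
# Hedged sketch (not from the original post): print the GPU peer-access matrix
# straight from the CUDA runtime via ctypes.
import ctypes

# The unversioned name assumes a dev install; otherwise use e.g. "libcudart.so.11.0".
cudart = ctypes.CDLL("libcudart.so")

count = ctypes.c_int()
cudart.cudaGetDeviceCount(ctypes.byref(count))

for src in range(count.value):
    row = []
    for dst in range(count.value):
        if src == dst:
            row.append(".")  # a device is never listed as its own peer in this matrix
            continue
        can = ctypes.c_int(0)
        cudart.cudaDeviceCanAccessPeer(ctypes.byref(can), src, dst)
        row.append("v" if can.value else ".")
    print("".join(row))
```

On this box the printout should agree with the 'v'/'.' matrix the kvstore emits in the error log below: two fully connected groups of four, plus one extra link per GPU into the other group.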
Diagnostics of the machine:

mxnet was installed via pip: pip install -t . --no-cache-dir --upgrade mxnet-cu112 --pre -f https://dist.mxnet.io/python. The error is consistent across the CUDA versions I tried: 10.1, 11.0, 11.2. I don't know if this is a bug that I should report. When I use the following code with GPUs [0,1,2,3] it works fine, and the same with GPUs [4,5,6,7]; but if I use all GPUs I get a segmentation fault (the model doesn't matter, the problem persists for various models):

import mxnet as mx
mx.npx.set_np()
import numpy as np
from mxnet import gluon, nd, autograd
#from gluoncv import utils
import time, os, math, argparse
from mxprosthesis.nn.loss.ftnmt_loss import *
NClasses=10
parser = argparse.ArgumentParser(description='CEECNet cifar10 tests')
parser.add_argument('--root', type=str, default=r'/home/dia021/Software/.mxnet/datasets/cifar10',
help='root directory that contains data')
parser.add_argument('--batch-size', type=int, default=24*8,
help='batch size for training and testing (default:64)')
parser.add_argument('--crop-size', type=int, default=256, # this is not the best solution, but ...
help='crop size of input image, for memory efficiency(default:256)')
parser.add_argument('--epochs', type=int, default=1000,
help='number of epochs to train (default: 600)')
parser.add_argument('--lr', type=float, default=0.001,
help='learning rate (default: 0.001)')
parser.add_argument('--cuda', action='store_true', default=True,
help='Train on GPU with CUDA')
parser.add_argument('--nfilters_init',type=int, default=64,
help='XX nfilters_init, default::32')
parser.add_argument('--model',type=str, default='FracTALResNeXt',
help='Model base for feature extraction, default::FracTALResNet')
parser.add_argument('--depth',type=int, default=4,
help='XX depth, default::3')
parser.add_argument('--ftdepth',type=int, default=0,
help='ftnmt depth, default::0')
parser.add_argument('--nlayers',type=list, default=4,
help='XX widths, default::2')
parser.add_argument('--nheads_start',type=int, default=64//4,
help='XX nheads_start, default::{}'.format(16))
parser.add_argument('--name-load-params',type=str, default=None,
help='name-load-params, for restart, default=None')
opt = parser.parse_args()
import sys
sys.path.append(opt.root)
# Data augmentation definitions
from mxnet.gluon.data.vision import transforms
# Model definition
from mxprosthesis.models.classification.weirdnet.weird_dn_features import *
from mxnet.gluon import nn
class CEECNet(HybridBlock):
    def __init__(self,NClasses=10, nfilters_init=opt.nfilters_init, nfilters_bottleneck=opt.nfilters_init, bottleneck_shrinkage=4, depth=opt.depth, nlayers=opt.nlayers,norm_type='GroupNorm', norm_groups=8, nheads_start=opt.nheads_start,model=opt.model,ftdepth=opt.ftdepth,**kwargs):
        super().__init__(**kwargs)
        self.conv_first = Conv2DNormed(channels=nfilters_init,kernel_size=1,padding=0)
        self.convs = WeirdNet(nfilters=nfilters_init,depth=depth,nlayers=nlayers,model=model,ftdepth=ftdepth,nheads=nheads_start,**kwargs)
self.flatten = nn.Flatten()
self.fc1 = nn.Dense(units=1024,use_bias=False) # in_units = 16*5*5
self.fc1bn = nn.BatchNorm(axis=-1)
self.fc2 = nn.Dense(units=512,use_bias=False) # in_units = 120
self.fc2bn = nn.BatchNorm(axis=-1)
# @@@@@@@@@@@ Here 10 represents the 10 classes of cifar10 @@@@@@@@@@
self.fc3 = gluon.nn.Dense(units=NClasses) # in units = 84
def set_ftdepth(self,ftdepth):
self.convs.set_ftdepth(ftdepth)
def forward(self, input):
#print (input.shape)
x = self.conv_first(input)
x = self.convs(x)
x = mx.npx.relu(x)
x = self.flatten(x) # transforms to x.shape[0], np.prod(x.shape[1:])
x = self.fc1(x)
x = mx.npx.relu(self.fc1bn(x))
x = self.fc2(x)
x = mx.npx.relu(self.fc2bn(x))
x = self.fc3(x)
#x = mx.npx.softmax(x,axis=-1)
return x
flname_write = r'Results/'+ opt.model + r'_EvolvingFracTAL' +r'.txt'
# ================== SAVING best model ==================================
import datetime, os
stamp = datetime.datetime.now().strftime('%Y-%m-%d-Time-%H:%M:%S_')
flname_save_weights = r'Results/' + stamp + opt.model+ '_EvolvLoss_best_model.params'
# =========================================================================
# Decide on cuda:
if opt.cuda and mx.util.get_gpu_count():
ctx = [mx.gpu(i) for i in range(mx.util.get_gpu_count())] # <=== finds all gpus and fails
else:
ctx = [mx.cpu()]
# ctx = [mx.gpu(i) for i in range(4,8)] # <=== WORKS
# Define model
net = CEECNet() # Similar with wide resnet16_10 in params ~17M, not similar in depth though!!!
import re
if opt.name_load_params is not None:
net.load_parameters(opt.name_load_params,ctx=ctx)
epoch_start = int(re.sub(r'^(.*)(epoch-)','',opt.name_load_params).replace('.params','') )
epoch_start = epoch_start + 1 # Start from + 1 to avoid overwriting weights.
else:
net.initialize(ctx=ctx)
epoch_start=0
net.hybridize(static_alloc=True, static_shape=True) # ZoomZoom!!
# Data augmentation definitions
transform_train = transforms.Compose([
# Randomly crop an area, and then resize it to be 32x32
transforms.RandomResizedCrop(32),
# Randomly flip the image horizontally
transforms.RandomFlipLeftRight(),
# Randomly jitter the brightness, contrast and saturation of the image
transforms.RandomColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
# Transpose the image from height*width*num_channels to num_channels*height*width
# and map values from [0, 255] to [0,1]
transforms.ToTensor(),
# Normalize the image with mean and standard deviation calculated across all images
transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])
])
transform_test = transforms.Compose([
transforms.Resize(32),
transforms.ToTensor(),
transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])
])
# Datasets/DataLoaders
dataset_train = gluon.data.vision.CIFAR10(root=opt.root,train=True).transform_first(transform_train)
dataset_test = gluon.data.vision.CIFAR10(root=opt.root,train=False).transform_first(transform_test)
datagen_train = gluon.data.DataLoader(dataset_train,batch_size=opt.batch_size,shuffle=True,num_workers=16,pin_memory=True)
datagen_test = gluon.data.DataLoader(dataset_test,batch_size=opt.batch_size,shuffle=False,num_workers=16,pin_memory=True)
# Adam parameters
optimizer = 'Adam'
lr = opt.lr
# *********************************************************************************************
# Epochs in which we want to step
steps_epochs = [350,450]
# assuming we keep partial batches, see `last_batch` parameter of DataLoader
iterations_per_epoch = math.ceil(len(dataset_train) / opt.batch_size)
# iterations just before starts of epochs (iterations are 1-indexed)
steps_iterations = [s*iterations_per_epoch for s in steps_epochs]
scheduler = mx.lr_scheduler.MultiFactorScheduler(base_lr=lr, step= steps_iterations, factor=0.1)
# **********************************************************************************************
optimizer_params = {'learning_rate': lr,'lr_scheduler':scheduler}
#optimizer_params = {'learning_rate': lr} # Doing manual schedhuling
trainer = gluon.Trainer(net.collect_params(), optimizer, optimizer_params)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
#all_losses = [ftnmt_loss(depth=i,axis=1) for i in [0,10,20]]
#for tloss in all_losses:
# tloss.hybridize()
#import pysnooper
# development metric:
def test(tctx, tnet, tdatagen_dev):
metric1 = gluon.metric.Accuracy()
metric2 = gluon.metric.PCC()
print ("\nstarted testing ...")
for idx, data in enumerate(tdatagen_dev):
print("\rRunning:: {}/{}".format(idx+1,len(tdatagen_dev)),end='',flush=True)
#data = gluon.utils.split_and_load(batch[0], ctx_list=ctx, batch_axis=0)
imgs, labels = data
#imgs = imgs.as_in_context(tctx)
imgs = gluon.utils.split_and_load(imgs, ctx_list=ctx, batch_axis=0)
#outputs = nd.concatenate(outputs,axis=0)
with mx.autograd.predict_mode():
preds = [tnet(timgs).as_in_context(mx.cpu()) for timgs in imgs]
preds = mx.np.concatenate(preds,axis=0)
metric1.update(preds=preds, labels=labels)
#with pysnooper.snoop():
metric2.update(preds=preds, labels=labels)
mx.npx.waitall() # necessary to avoid memory flooding
return metric1.get(), metric2.get()
# ResNetv2 training - bblocks_init = 4
epochs = opt.epochs
history = []
flag_step1=True
flag_step2=True
def train(epochs,ctx,flname_write):
global flag_step1
global flag_step2
train_metric = gluon.metric.Accuracy()
with open(flname_write,"w") as f:
print('epoch','train_acc','val_acc','val_pcc','train_loss',file=f,flush=True)
ref_metric = 1000
for epoch in range(epochs):
tic = time.time()
train_metric.reset()
train_loss = 0
# Loop through each batch of training data
for i, (data,label) in enumerate(datagen_train):
print("\rWithin epoch completion:: {}/{}".format(i+1,len(datagen_train)),end='',flush=True)
# Extract data and label
data = gluon.utils.split_and_load(data,ctx_list=ctx)
label = gluon.utils.split_and_load(label,ctx_list=ctx)
#if epoch < 250 :
# loss_fn = all_losses[0]
#elif epoch >=250 and epoch < 350:
# loss_fn = all_losses[1]
#else :
# loss_fn = all_losses[2]
# AutoGrad
with autograd.record():
outputs = [net(tdata) for tdata in data]
losses = [loss_fn(tout,tlabel).mean() for tout, tlabel in zip(outputs,label)]
# necessary to avoid memory flooding
mx.npx.waitall()
# Backpropagation
for l in losses:
l.backward()
# Optimize
trainer.step(opt.batch_size) # This is the batch_size
# Update metrics
train_loss += sum(losses).item()/len(ctx)
#train_loss += sum(losses)/len(ctx)
label = [l.as_in_context(mx.cpu()) for l in label]
label = mx.np.concatenate(label,axis=0)
outputs = [out.as_in_context(mx.cpu()) for out in outputs]
outputs = mx.np.concatenate(outputs,axis=0)
train_metric.update(labels=label, preds=outputs)
train_loss = train_loss / len(datagen_train) # Normalize to 0,1
name, train_mse = train_metric.get()
# Evaluate on Validation data
nd.waitall() # necessary to avoid cuda malloc
(name, val_mse), (name2, val_mse2) = test(ctx, net, datagen_test)
# Print metrics
# print both on screen and in file
print("\n")
            print('epoch={} train_acc={} val_acc={} val_pcc={} train_loss={} time={}'.format(epoch, train_mse, val_mse, val_mse2, train_loss, time.time()-tic))
print(epoch, train_mse, val_mse, val_mse2, train_loss, file=f,flush=True)
net.save_parameters(flname_save_weights.replace('best_model','epoch-{}'.format(epoch)))
if val_mse < ref_metric:
# Save best model parameters, according to minimum val_mse
net.save_parameters(flname_save_weights)
ref_metric = val_mse
if __name__=='__main__':
#tout = test(ctx,net,datagen_test)
#print ("Passed first test")
    train(opt.epochs, ctx, flname_write)

Error example when using all GPUs:

(base) dia021@dgx1-wa2:/raid/dia021/Software/mxprosthesis/tests/runs/CorrectMultiHeadAttention/WeirdNet$ python train_x_parallel.py
================================================
Using feature extraction units::FracTALResNeXt
------------------------------------------------
depth:= 0, nlayers WeirdUpDn::4, nfilters: 64, nheads::16
depth:= 1, nlayers WeirdUpDn::4, nfilters: 64, nheads::16
depth:= 2, nlayers WeirdUpDn::4, nfilters: 64, nheads::16
depth:= 3, nlayers WeirdUpDn::4, nfilters: 64, nheads::16
[16:22:46] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
[16:22:46] ../src/base.cc:80: cuDNN lib mismatch: linked-against version 8101 != compiled-against version 8100. Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
[16:22:48] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:50] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:53] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:55] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:57] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:23:01] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:23:03] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
Within epoch completion:: 1/261[16:23:14] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU_PINNED
[16:23:27] ../src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[16:23:55] ../src/kvstore/././comm.h:757: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[16:23:55] ../src/kvstore/././comm.h:766: .vvvv...
[16:23:55] ../src/kvstore/././comm.h:766: v.vv.v..
[16:23:55] ../src/kvstore/././comm.h:766: vv.v..v.
[16:23:55] ../src/kvstore/././comm.h:766: vvv....v
[16:23:55] ../src/kvstore/././comm.h:766: v....vvv
[16:23:55] ../src/kvstore/././comm.h:766: .v..v.vv
[16:23:55] ../src/kvstore/././comm.h:766: ..v.vv.v
[16:23:55] ../src/kvstore/././comm.h:766: ...vvvv.
Traceback (most recent call last):
File "train_x_parallel.py", line 294, in <module>
train(opt.epochs, ctx, flname_write)
File "train_x_parallel.py", line 261, in train
train_loss += sum(losses).item()/len(ctx)
File "/raid/dia021/Software/mxnet/numpy/multiarray.py", line 1264, in item
return self.asnumpy().item(*args)
File "/raid/dia021/Software/mxnet/ndarray/ndarray.py", line 2607, in asnumpy
check_call(_LIB.MXNDArraySyncCopyToCPU(
File "/raid/dia021/Software/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "../include/mshadow/./stream_gpu-inl.h", line 91
CUDA: Check failed: e == cudaSuccess (700 vs. 0) : an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
(base) dia021@dgx1-wa2:/raid/dia021/Software/mxprosthesis/tests/runs/CorrectMultiHeadAttention/WeirdNet$

I am now trying Horovod for the parallel training, to see if that solves the issue, but if there is a quick fix or suggestion I would be grateful to the community for help. I have also tried setting the flags before training; it did not fix the issue.
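For reference, here is a hedged sketch of how the two knobs hinted at in the log above are typically set from inside such a script: exposing only one fully P2P-connected set of four GPUs via CUDA_VISIBLE_DEVICES, or disabling the kvstore's GPU peer-to-peer copies via MXNET_ENABLE_GPU_P2P. The exact flags tried in the original run are not shown in the post, and, as noted, they did not fix the crash in this case; the variables must be set before MXNet touches the GPUs.

```python
# Hedged workaround sketch (not the original author's exact commands).
import os

# Option A: expose only one fully P2P-connected set of four GPUs.
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

# Option B: keep all eight GPUs but make the kvstore fall back to copies
# through host memory instead of direct GPU-to-GPU (P2P) copies.
os.environ["MXNET_ENABLE_GPU_P2P"] = "0"

import mxnet as mx  # import mxnet only after the environment variables are set

ctx = [mx.gpu(i) for i in range(mx.util.get_gpu_count())]
```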
Kind regards,
Replies: 1 comment
I solved the problem by using Docker containers; it's working wonders there :).