Dear all, I am not sure whether the issue I am having is a bug, but I am getting segmentation faults with toy code that runs fine on other machines, yet not on our DGX-1 box (details of the model we have can be found here: https://developer.nvidia.com/blog/dgx-1-fastest-deep-learning-system/). The machine has 8 V100 GPUs, and these have P2P access in sets of four; the diagram shows the possible P2P communications (green arrows). This is the output of the machine configuration via the NVIDIA tools, in agreement with the diagram:

> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU5) : No
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU6) : No
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU7) : No
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU4) : No
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU6) : No
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU7) : No
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU4) : No
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU5) : No
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU7) : No
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU4) : No
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU5) : No
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU6) : No
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU1) : No
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU2) : No
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU3) : No
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU4) -> Tesla V100-SXM2-16GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU0) : No
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU2) : No
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU3) : No
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU5) -> Tesla V100-SXM2-16GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU0) : No
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU1) : No
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU3) : No
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU6) -> Tesla V100-SXM2-16GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU0) : No
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU1) : No
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU2) : No
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU7) -> Tesla V100-SXM2-16GB (GPU6) : Yes
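As a cross-check, the same peer-access matrix can be queried directly from the CUDA runtime, independently of MXNet. The sketch below is not part of the original script; it assumes libcudart.so is on the loader path and only uses the standard cudaGetDeviceCount and cudaDeviceCanAccessPeer runtime calls, printing a matrix in the same 'v'/'.' style as the kvstore log further down.

```python
# Hedged sketch (not from the original post): print the GPU peer-access matrix
# straight from the CUDA runtime via ctypes.
import ctypes

# The unversioned name assumes a dev install; otherwise use e.g. "libcudart.so.11.0".
cudart = ctypes.CDLL("libcudart.so")

count = ctypes.c_int()
cudart.cudaGetDeviceCount(ctypes.byref(count))

for src in range(count.value):
    row = []
    for dst in range(count.value):
        if src == dst:
            row.append(".")  # a device is never listed as its own peer in this matrix
            continue
        can = ctypes.c_int(0)
        cudart.cudaDeviceCanAccessPeer(ctypes.byref(can), src, dst)
        row.append("v" if can.value else ".")
    print("".join(row))
```

On this box the printout should agree with the 'v'/'.' matrix the kvstore emits in the error log below: two fully connected groups of four, plus one extra link per GPU into the other group.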
Diagnostics of the machine:

mxnet was installed via pip: pip install -t . --no-cache-dir --upgrade mxnet-cu112 --pre -f https://dist.mxnet.io/python. The error is consistent across the CUDA versions I tried: 10.1, 11.0, 11.2. I don't know if this is a bug that I should report. When I use the following code with GPUs [0,1,2,3] it works fine, and the same with GPUs [4,5,6,7]; but if I use all GPUs I get a segmentation fault (the model doesn't matter, the problem persists for various models):

import mxnet as mx
mx.npx.set_np()
import numpy as np
from mxnet import gluon, nd, autograd
#from gluoncv import utils
import time, os, math, argparse
from mxprosthesis.nn.loss.ftnmt_loss import *
NClasses=10
parser = argparse.ArgumentParser(description='CEECNet cifar10 tests')
parser.add_argument('--root', type=str, default=r'/home/dia021/Software/.mxnet/datasets/cifar10',
help='root directory that contains data')
parser.add_argument('--batch-size', type=int, default=24*8,
help='batch size for training and testing (default:64)')
parser.add_argument('--crop-size', type=int, default=256, # this is not the best solution, but ...
help='crop size of input image, for memory efficiency(default:256)')
parser.add_argument('--epochs', type=int, default=1000,
help='number of epochs to train (default: 600)')
parser.add_argument('--lr', type=float, default=0.001,
help='learning rate (default: 0.001)')
parser.add_argument('--cuda', action='store_true', default=True,
help='Train on GPU with CUDA')
parser.add_argument('--nfilters_init',type=int, default=64,
help='XX nfilters_init, default::32')
parser.add_argument('--model',type=str, default='FracTALResNeXt',
help='Model base for feature extraction, default::FracTALResNet')
parser.add_argument('--depth',type=int, default=4,
help='XX depth, default::3')
parser.add_argument('--ftdepth',type=int, default=0,
help='ftnmt depth, default::0')
parser.add_argument('--nlayers',type=list, default=4,
help='XX widths, default::2')
parser.add_argument('--nheads_start',type=int, default=64//4,
help='XX nheads_start, default::{}'.format(16))
parser.add_argument('--name-load-params',type=str, default=None,
help='name-load-params, for restart, default=None')
opt = parser.parse_args()
import sys
sys.path.append(opt.root)
# Data augmentation definitions
from mxnet.gluon.data.vision import transforms
# Model definition
from mxprosthesis.models.classification.weirdnet.weird_dn_features import *
from mxnet.gluon import nn
class CEECNet(HybridBlock):
    def __init__(self,NClasses=10, nfilters_init=opt.nfilters_init, nfilters_bottleneck=opt.nfilters_init, bottleneck_shrinkage=4, depth=opt.depth, nlayers=opt.nlayers,norm_type='GroupNorm', norm_groups=8, nheads_start=opt.nheads_start,model=opt.model,ftdepth=opt.ftdepth,**kwargs):
        super().__init__(**kwargs)
        self.conv_first = Conv2DNormed(channels=nfilters_init,kernel_size=1,padding=0)
        self.convs = WeirdNet(nfilters=nfilters_init,depth=depth,nlayers=nlayers,model=model,ftdepth=ftdepth,nheads=nheads_start,**kwargs)
self.flatten = nn.Flatten()
self.fc1 = nn.Dense(units=1024,use_bias=False) # in_units = 16*5*5
self.fc1bn = nn.BatchNorm(axis=-1)
self.fc2 = nn.Dense(units=512,use_bias=False) # in_units = 120
self.fc2bn = nn.BatchNorm(axis=-1)
# @@@@@@@@@@@ Here 10 represents the 10 classes of cifar10 @@@@@@@@@@
self.fc3 = gluon.nn.Dense(units=NClasses) # in units = 84
def set_ftdepth(self,ftdepth):
self.convs.set_ftdepth(ftdepth)
def forward(self, input):
#print (input.shape)
x = self.conv_first(input)
x = self.convs(x)
x = mx.npx.relu(x)
x = self.flatten(x) # transforms to x.shape[0], np.prod(x.shape[1:])
x = self.fc1(x)
x = mx.npx.relu(self.fc1bn(x))
x = self.fc2(x)
x = mx.npx.relu(self.fc2bn(x))
x = self.fc3(x)
#x = mx.npx.softmax(x,axis=-1)
return x
flname_write = r'Results/'+ opt.model + r'_EvolvingFracTAL' +r'.txt'
# ================== SAVING best model ==================================
import datetime, os
stamp = datetime.datetime.now().strftime('%Y-%m-%d-Time-%H:%M:%S_')
flname_save_weights = r'Results/' + stamp + opt.model+ '_EvolvLoss_best_model.params'
# =========================================================================
# Decide on cuda:
if opt.cuda and mx.util.get_gpu_count():
ctx = [mx.gpu(i) for i in range(mx.util.get_gpu_count())] # <=== finds all gpus and fails
else:
ctx = [mx.cpu()]
# ctx = [mx.gpu(i) for i in range(4,8)] # <=== WORKS
# Define model
net = CEECNet() # Similar with wide resnet16_10 in params ~17M, not similar in depth though!!!
import re
if opt.name_load_params is not None:
net.load_parameters(opt.name_load_params,ctx=ctx)
epoch_start = int(re.sub(r'^(.*)(epoch-)','',opt.name_load_params).replace('.params','') )
epoch_start = epoch_start + 1 # Start from + 1 to avoid overwriting weights.
else:
net.initialize(ctx=ctx)
epoch_start=0
net.hybridize(static_alloc=True, static_shape=True) # ZoomZoom!!
# Data augmentation definitions
transform_train = transforms.Compose([
# Randomly crop an area, and then resize it to be 32x32
transforms.RandomResizedCrop(32),
# Randomly flip the image horizontally
transforms.RandomFlipLeftRight(),
# Randomly jitter the brightness, contrast and saturation of the image
transforms.RandomColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
# Transpose the image from height*width*num_channels to num_channels*height*width
# and map values from [0, 255] to [0,1]
transforms.ToTensor(),
# Normalize the image with mean and standard deviation calculated across all images
transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])
])
transform_test = transforms.Compose([
transforms.Resize(32),
transforms.ToTensor(),
transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])
])
# Datasets/DataLoaders
dataset_train = gluon.data.vision.CIFAR10(root=opt.root,train=True).transform_first(transform_train)
dataset_test = gluon.data.vision.CIFAR10(root=opt.root,train=False).transform_first(transform_test)
datagen_train = gluon.data.DataLoader(dataset_train,batch_size=opt.batch_size,shuffle=True,num_workers=16,pin_memory=True)
datagen_test = gluon.data.DataLoader(dataset_test,batch_size=opt.batch_size,shuffle=False,num_workers=16,pin_memory=True)
# Adam parameters
optimizer = 'Adam'
lr = opt.lr
# *********************************************************************************************
# Epochs in which we want to step
steps_epochs = [350,450]
# assuming we keep partial batches, see `last_batch` parameter of DataLoader
iterations_per_epoch = math.ceil(len(dataset_train) / opt.batch_size)
# iterations just before starts of epochs (iterations are 1-indexed)
steps_iterations = [s*iterations_per_epoch for s in steps_epochs]
scheduler = mx.lr_scheduler.MultiFactorScheduler(base_lr=lr, step= steps_iterations, factor=0.1)
# **********************************************************************************************
optimizer_params = {'learning_rate': lr,'lr_scheduler':scheduler}
#optimizer_params = {'learning_rate': lr} # Doing manual schedhuling
trainer = gluon.Trainer(net.collect_params(), optimizer, optimizer_params)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
#all_losses = [ftnmt_loss(depth=i,axis=1) for i in [0,10,20]]
#for tloss in all_losses:
# tloss.hybridize()
#import pysnooper
# development metric:
def test(tctx, tnet, tdatagen_dev):
metric1 = gluon.metric.Accuracy()
metric2 = gluon.metric.PCC()
print ("\nstarted testing ...")
for idx, data in enumerate(tdatagen_dev):
print("\rRunning:: {}/{}".format(idx+1,len(tdatagen_dev)),end='',flush=True)
#data = gluon.utils.split_and_load(batch[0], ctx_list=ctx, batch_axis=0)
imgs, labels = data
#imgs = imgs.as_in_context(tctx)
imgs = gluon.utils.split_and_load(imgs, ctx_list=ctx, batch_axis=0)
#outputs = nd.concatenate(outputs,axis=0)
with mx.autograd.predict_mode():
preds = [tnet(timgs).as_in_context(mx.cpu()) for timgs in imgs]
preds = mx.np.concatenate(preds,axis=0)
metric1.update(preds=preds, labels=labels)
#with pysnooper.snoop():
metric2.update(preds=preds, labels=labels)
mx.npx.waitall() # necessary to avoid memory flooding
return metric1.get(), metric2.get()
# ResNetv2 training - bblocks_init = 4
epochs = opt.epochs
history = []
flag_step1=True
flag_step2=True
def train(epochs,ctx,flname_write):
global flag_step1
global flag_step2
train_metric = gluon.metric.Accuracy()
with open(flname_write,"w") as f:
print('epoch','train_acc','val_acc','val_pcc','train_loss',file=f,flush=True)
ref_metric = 1000
for epoch in range(epochs):
tic = time.time()
train_metric.reset()
train_loss = 0
# Loop through each batch of training data
for i, (data,label) in enumerate(datagen_train):
print("\rWithin epoch completion:: {}/{}".format(i+1,len(datagen_train)),end='',flush=True)
# Extract data and label
data = gluon.utils.split_and_load(data,ctx_list=ctx)
label = gluon.utils.split_and_load(label,ctx_list=ctx)
#if epoch < 250 :
# loss_fn = all_losses[0]
#elif epoch >=250 and epoch < 350:
# loss_fn = all_losses[1]
#else :
# loss_fn = all_losses[2]
# AutoGrad
with autograd.record():
outputs = [net(tdata) for tdata in data]
losses = [loss_fn(tout,tlabel).mean() for tout, tlabel in zip(outputs,label)]
# necessary to avoid memory flooding
mx.npx.waitall()
# Backpropagation
for l in losses:
l.backward()
# Optimize
trainer.step(opt.batch_size) # This is the batch_size
# Update metrics
train_loss += sum(losses).item()/len(ctx)
#train_loss += sum(losses)/len(ctx)
label = [l.as_in_context(mx.cpu()) for l in label]
label = mx.np.concatenate(label,axis=0)
outputs = [out.as_in_context(mx.cpu()) for out in outputs]
outputs = mx.np.concatenate(outputs,axis=0)
train_metric.update(labels=label, preds=outputs)
train_loss = train_loss / len(datagen_train) # Normalize to 0,1
name, train_mse = train_metric.get()
# Evaluate on Validation data
nd.waitall() # necessary to avoid cuda malloc
(name, val_mse), (name2, val_mse2) = test(ctx, net, datagen_test)
# Print metrics
# print both on screen and in file
print("\n")
            print('epoch={} train_acc={} val_acc={} val_pcc={} train_loss={} time={}'.format(epoch, train_mse, val_mse, val_mse2, train_loss, time.time()-tic))
print(epoch, train_mse, val_mse, val_mse2, train_loss, file=f,flush=True)
net.save_parameters(flname_save_weights.replace('best_model','epoch-{}'.format(epoch)))
if val_mse < ref_metric:
# Save best model parameters, according to minimum val_mse
net.save_parameters(flname_save_weights)
ref_metric = val_mse
if __name__=='__main__':
#tout = test(ctx,net,datagen_test)
#print ("Passed first test")
    train(opt.epochs, ctx, flname_write)

Error example when using all GPUs:

(base) dia021@dgx1-wa2:/raid/dia021/Software/mxprosthesis/tests/runs/CorrectMultiHeadAttention/WeirdNet$ python train_x_parallel.py
================================================
Using feature extraction units::FracTALResNeXt
------------------------------------------------
depth:= 0, nlayers WeirdUpDn::4, nfilters: 64, nheads::16
depth:= 1, nlayers WeirdUpDn::4, nfilters: 64, nheads::16
depth:= 2, nlayers WeirdUpDn::4, nfilters: 64, nheads::16
depth:= 3, nlayers WeirdUpDn::4, nfilters: 64, nheads::16
[16:22:46] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
[16:22:46] ../src/base.cc:80: cuDNN lib mismatch: linked-against version 8101 != compiled-against version 8100. Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
[16:22:48] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:50] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:53] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:55] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:57] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:22:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:23:01] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[16:23:03] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
Within epoch completion:: 1/261[16:23:14] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU_PINNED
[16:23:27] ../src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[16:23:55] ../src/kvstore/././comm.h:757: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[16:23:55] ../src/kvstore/././comm.h:766: .vvvv...
[16:23:55] ../src/kvstore/././comm.h:766: v.vv.v..
[16:23:55] ../src/kvstore/././comm.h:766: vv.v..v.
[16:23:55] ../src/kvstore/././comm.h:766: vvv....v
[16:23:55] ../src/kvstore/././comm.h:766: v....vvv
[16:23:55] ../src/kvstore/././comm.h:766: .v..v.vv
[16:23:55] ../src/kvstore/././comm.h:766: ..v.vv.v
[16:23:55] ../src/kvstore/././comm.h:766: ...vvvv.
Traceback (most recent call last):
File "train_x_parallel.py", line 294, in <module>
train(opt.epochs, ctx, flname_write)
File "train_x_parallel.py", line 261, in train
train_loss += sum(losses).item()/len(ctx)
File "/raid/dia021/Software/mxnet/numpy/multiarray.py", line 1264, in item
return self.asnumpy().item(*args)
File "/raid/dia021/Software/mxnet/ndarray/ndarray.py", line 2607, in asnumpy
check_call(_LIB.MXNDArraySyncCopyToCPU(
File "/raid/dia021/Software/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "../include/mshadow/./stream_gpu-inl.h", line 91
CUDA: Check failed: e == cudaSuccess (700 vs. 0) : an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
[16:24:01] ../src/resource.cc:297: Ignore CUDA Error [16:24:01] ../src/storage/././storage_manager_helpers.h:148: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
(base) dia021@dgx1-wa2:/raid/dia021/Software/mxprosthesis/tests/runs/CorrectMultiHeadAttention/WeirdNet$

I am now trying Horovod for the parallel training, to see if that solves the issue, but if there is a quick fix or suggestion I would be grateful to the community for help. I have also tried setting the flags before training; it did not fix the issue.
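For reference, here is a hedged sketch of how the two knobs hinted at in the log above are typically set from inside such a script: exposing only one fully P2P-connected set of four GPUs via CUDA_VISIBLE_DEVICES, or disabling the kvstore's GPU peer-to-peer copies via MXNET_ENABLE_GPU_P2P. The exact flags tried in the original run are not shown in the post, and, as noted, they did not fix the crash in this case; the variables must be set before MXNet touches the GPUs.

```python
# Hedged workaround sketch (not the original author's exact commands).
import os

# Option A: expose only one fully P2P-connected set of four GPUs.
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

# Option B: keep all eight GPUs but make the kvstore fall back to copies
# through host memory instead of direct GPU-to-GPU (P2P) copies.
os.environ["MXNET_ENABLE_GPU_P2P"] = "0"

import mxnet as mx  # import mxnet only after the environment variables are set

ctx = [mx.gpu(i) for i in range(mx.util.get_gpu_count())]
```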
Kind regards,
Replies: 1 comment
I solved the problem by using Docker containers; it's working wonders there :).