
pytorch version #13

Closed
ai1361720220000 opened this issue Apr 19, 2021 · 18 comments

@ai1361720220000

ai1361720220000 commented Apr 19, 2021

Hello, thanks for sharing this nice project. Could you tell me which version of PyTorch you used?

I replaced the line train_set = TableBank(root_dir=args.train_data_path) with
train_set = PubLayNet(root_dir=args.train_data_path)
and ran "python main_publaynet.py",
but I get the following error and can't find a way to solve it.
[Apr-19 08:58] [INFO] sys.argv: ['main_publaynet.py'] <main, (): 555>
[Apr-19 08:58] [WARNING] Unknown arguments: [] <main, (): 558>
Use GPU: 0 for training
Use GPU: 1 for training
[Apr-19 08:58] [INFO] ================================== <mp_main, main_worker(): 345>
[Apr-19 08:58] [INFO] ================================== <mp_main, main_worker(): 345>
[Apr-19 08:58] [INFO] Create dataset with root_dir=/home/notebook/data/group/projects/Document_Layout_Analysis_dataset/publaynet/train <mp_main, main_worker(): 346>
[Apr-19 08:58] [INFO] Create dataset with root_dir=/home/notebook/data/group/projects/Document_Layout_Analysis_dataset/publaynet/train <mp_main, main_worker(): 346>
loading annotations into memory...
loading annotations into memory...
Done (t=64.58s)
creating index...
Done (t=64.64s)
creating index...
index created!
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [ 1474 334229]
[Apr-19 08:59] [INFO] Create data_loader.. with batch_size = 2 <mp_main, main_worker(): 372>
[Apr-19 08:59] [INFO] Start training.. <mp_main, main_worker(): 381>
index created!
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [ 1474 334229]
[Apr-19 08:59] [INFO] Create data_loader.. with batch_size = 2 <mp_main, main_worker(): 372>
[Apr-19 08:59] [INFO] Start training.. <mp_main, main_worker(): 381>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 398, in _send_bytes
self._send(buf)
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
[Apr-19 09:01] [CRITICAL]

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 398, in main_worker
args=args
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 427, in train_one_epoch
images = list(image.to(device) for image in images)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 427, in <genexpr>
images = list(image.to(device) for image in images)
AttributeError: 'Image' object has no attribute 'to'
<main, (): 568>
Traceback (most recent call last):
File "main_publaynet.py", line 562, in <module>
main(args)
File "main_publaynet.py", line 230, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 398, in main_worker
args=args
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 427, in train_one_epoch
images = list(image.to(device) for image in images)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 427, in <genexpr>
images = list(image.to(device) for image in images)
AttributeError: 'Image' object has no attribute 'to'

[Apr-19 09:32] [INFO] sys.argv: ['main_publaynet.py'] <main, (): 555>
[Apr-19 09:32] [WARNING] Unknown arguments: [] <main, (): 558>
Use GPU: 1 for training
Use GPU: 0 for training
[Apr-19 09:32] [INFO] ================================== <mp_main, main_worker(): 345>
[Apr-19 09:32] [INFO] Create dataset with root_dir=/home/notebook/data/group/projects/Document_Layout_Analysis_dataset/publaynet/train <mp_main, main_worker(): 346>
[Apr-19 09:32] [INFO] ================================== <mp_main, main_worker(): 345>
[Apr-19 09:32] [INFO] Create dataset with root_dir=/home/notebook/data/group/projects/Document_Layout_Analysis_dataset/publaynet/train <mp_main, main_worker(): 346>
loading annotations into memory...
loading annotations into memory...
Done (t=59.31s)
creating index...
Done (t=61.41s)
creating index...
index created!
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [ 1474 334229]
[Apr-19 09:33] [INFO] Create data_loader.. with batch_size = 2 <mp_main, main_worker(): 372>
[Apr-19 09:33] [INFO] Start training.. <mp_main, main_worker(): 381>
index created!
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [ 1474 334229]
[Apr-19 09:33] [INFO] Create data_loader.. with batch_size = 2 <mp_main, main_worker(): 372>
[Apr-19 09:33] [INFO] Start training.. <mp_main, main_worker(): 381>
/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:2854: UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. If you wish to keep the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change "
/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:2854: UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. If you wish to keep the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change "
/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
It: 0 [ 0/83926] eta: 63 days, 1:26:42 lr: 0.001000 loss: 11.7813 (11.7813) loss_box_reg: 0.1196 (0.1196) loss_classifier: 0.4231 (0.4231) loss_mask: 5.5672 (5.5672) loss_objectness: 5.1233 (5.1233) loss_rpn_box_reg: 0.5482 (0.5482) time: 64.9191 data: 56.2254 max mem: 2146
It: 0 [ 0/83926] eta: 60 days, 12:14:23 lr: 0.001000 loss: 11.7813 (11.7813) loss_box_reg: 0.1196 (0.1196) loss_classifier: 0.4231 (0.4231) loss_mask: 5.5672 (5.5672) loss_objectness: 5.1233 (5.1233) loss_rpn_box_reg: 0.5482 (0.5482) time: 62.2937 data: 61.6931 max mem: 2147
[Apr-19 09:34] [WARNING] transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered <mp_main, train_one_epoch(): 457>
Traceback (most recent call last):
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 449, in train_one_epoch
losses.backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f001540d536 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f0015650fbe in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f00153fdabd in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: + 0x523542 (0x7effd4603542 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x5235e6 (0x7effd46035e6 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x1a311e (0x55b59911711e in /opt/conda/bin/python)
frame #6: + 0x10e91c (0x55b59908291c in /opt/conda/bin/python)
frame #7: + 0xfdfc8 (0x55b599071fc8 in /opt/conda/bin/python)
frame #8: + 0x10f147 (0x55b599083147 in /opt/conda/bin/python)
frame #9: + 0x10f15d (0x55b59908315d in /opt/conda/bin/python)
frame #10: + 0x10f15d (0x55b59908315d in /opt/conda/bin/python)
frame #11: + 0xf6457 (0x55b59906a457 in /opt/conda/bin/python)
frame #12: + 0xf64c3 (0x55b59906a4c3 in /opt/conda/bin/python)
frame #13: + 0xf6446 (0x55b59906a446 in /opt/conda/bin/python)
frame #14: + 0x1dc943 (0x55b599150943 in /opt/conda/bin/python)
frame #15: _PyEval_EvalFrameDefault + 0x2a59 (0x55b599143229 in /opt/conda/bin/python)
frame #16: _PyFunction_FastCallKeywords + 0xfb (0x55b5990d920b in /opt/conda/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x6a0 (0x55b599140e70 in /opt/conda/bin/python)
frame #18: _PyFunction_FastCallKeywords + 0xfb (0x55b5990d920b in /opt/conda/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x416 (0x55b599140be6 in /opt/conda/bin/python)
frame #20: _PyEval_EvalCodeWithName + 0x2f9 (0x55b5990892b9 in /opt/conda/bin/python)
frame #21: _PyFunction_FastCallKeywords + 0x387 (0x55b5990d9497 in /opt/conda/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x14ea (0x55b599141cba in /opt/conda/bin/python)
frame #23: _PyEval_EvalCodeWithName + 0x2f9 (0x55b5990892b9 in /opt/conda/bin/python)
frame #24: PyEval_EvalCodeEx + 0x44 (0x55b59908a1d4 in /opt/conda/bin/python)
frame #25: PyEval_EvalCode + 0x1c (0x55b59908a1fc in /opt/conda/bin/python)
frame #26: + 0x22bf44 (0x55b59919ff44 in /opt/conda/bin/python)
frame #27: PyRun_StringFlags + 0x7d (0x55b5991ab21d in /opt/conda/bin/python)
frame #28: PyRun_SimpleStringFlags + 0x3f (0x55b5991ab27f in /opt/conda/bin/python)
frame #29: + 0x23737d (0x55b5991ab37d in /opt/conda/bin/python)
frame #30: _Py_UnixMain + 0x3c (0x55b5991ab6fc in /opt/conda/bin/python)
frame #31: __libc_start_main + 0xe7 (0x7f0020b32b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #32: + 0x1dc3c0 (0x55b5991503c0 in /opt/conda/bin/python)

[Apr-19 09:34] [CRITICAL]

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 398, in main_worker
args=args
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 424, in train_one_epoch
iter_num=args.iter_num
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/utils.py", line 216, in log_every
for obj in iterable:
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in _pin_memory_loop
data = pin_memory(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
return data.pin_memory()
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

<main, (): 568>
Traceback (most recent call last):
File "main_publaynet.py", line 562, in <module>
main(args)
File "main_publaynet.py", line 230, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 398, in main_worker
args=args
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 424, in train_one_epoch
iter_num=args.iter_num
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/utils.py", line 216, in log_every
for obj in iterable:
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in _pin_memory_loop
data = pin_memory(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
return data.pin_memory()
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
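For readers hitting the first traceback: the AttributeError ('Image' object has no attribute 'to') means the dataset's __getitem__ returns PIL images rather than tensors, so `.to(device)` fails. The real fix would call torchvision.transforms.functional.to_tensor in the dataset's transform; the sketch below uses numpy as a stand-in (function name and the numpy substitution are mine, not from the repo) to show the shape/scale conversion involved:

```python
# Hypothetical sketch: convert a PIL image (HWC uint8) into a CHW float
# array scaled to [0, 1], mimicking torchvision's to_tensor. In the actual
# training code you would return a torch tensor so .to(device) works.
import numpy as np
from PIL import Image

def pil_to_chw_float(img):
    """HWC uint8 PIL image -> CHW float32 array in [0, 1]."""
    arr = np.asarray(img, dtype=np.float32) / 255.0   # HWC, scaled
    if arr.ndim == 2:                                  # grayscale: add channel axis
        arr = arr[:, :, None]
    return np.transpose(arr, (2, 0, 1))                # HWC -> CHW

img = Image.new("RGB", (4, 3), color=(255, 0, 0))
print(pil_to_chw_float(img).shape)  # (3, 3, 4)
```

Applying this kind of conversion inside the dataset (or a collate/transform step) is what makes the later `image.to(device)` call valid.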

@phamquiluan
Owner

TBH, I don't remember; please try torch 1.4 and 1.5 first. Let me know if that doesn't help.

@ai1361720220000
Author

> TBH, I don't remember; please try torch 1.4 and 1.5 first. Let me know if that doesn't help.

The problem above happened with torch 1.5.0.

@phamquiluan
Owner

@ai1361720220000 please try with 1.4

@ai1361720220000
Author

> @ai1361720220000 please try with 1.4

OK, thanks. I will let you know whether 1.4 solves the problem.

@ai1361720220000
Author

> @ai1361720220000 please try with 1.4

I used one GPU to train the model with torch 1.5.0. It succeeded for the first 180 steps, but then it stopped training due to a nan loss:

[Apr-20 03:07] [CRITICAL] Loss is nan, stopping training <__main__, train_one_epoch(): 445> [Apr-20 03:07] [CRITICAL] {'loss_classifier': tensor(-0.0729, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0.2256, device='cuda:0', grad_fn=<DivBackward0>), 'loss_mask': tensor(nan, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_objectness': tensor(0.0368, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(0.1943, device='cuda:0', grad_fn=<DivBackward0>)} <__main__, train_one_epoch(): 446>
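A side note on that log: loss_classifier is negative (-0.0729), which is unusual for an NLL-style loss and may itself hint at a label or annotation problem rather than the learning rate. Independent of the cause, a small guard like the sketch below (function name is hypothetical; the real training loop works with torch scalars, which `float()` also accepts) reports which component went non-finite before stopping:

```python
# Hypothetical sketch: report non-finite loss components before aborting,
# so the log shows *which* term (e.g. loss_mask) produced the nan.
import math

def check_losses(loss_dict):
    """Return the total loss, or None when any component is non-finite."""
    bad = {k: float(v) for k, v in loss_dict.items()
           if not math.isfinite(float(v))}
    if bad:
        print("non-finite loss components:", bad)
        return None
    return sum(float(v) for v in loss_dict.values())

print(check_losses({"loss_mask": float("nan"), "loss_box_reg": 0.2}))  # None
```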

@phamquiluan
Owner

Try reducing the learning rate.

@ai1361720220000
Author

> Try reducing the learning rate.

I changed the lr to 0.00001, but it still only trained for 10 steps before hitting a nan loss.

@phamquiluan
Owner

[image]

Check this; it might come from your dataset.

@ai1361720220000
Author

> [image]
>
> Check this; it might come from your dataset.

I used the PubLayNet dataset. Does that dataset have the issue above?

@phamquiluan
Owner

You'd better check it yourself.

@ai1361720220000
Author

> You'd better check it yourself.

OK, thanks~

@phamquiluan
Owner

@ai1361720220000 please let me know the result 🙇

@ai1361720220000
Author

ai1361720220000 commented Apr 20, 2021

> @ai1361720220000 please let me know the result 🙇

I find there are many images that do not satisfy condition 4 in the picture above.
In addition, why didn't you resize the input batch images to the same size in publaynet.py?
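If the suspicion is bad annotations, one cheap pre-training check is to scan the COCO-format annotation file for degenerate boxes (zero or near-zero width/height), a common trigger for nan losses and CUDA illegal-memory errors in Mask R-CNN training. A sketch (path and threshold are assumptions, not from the repo):

```python
# Hypothetical sketch: list annotations whose COCO bbox [x, y, w, h] has a
# width or height below min_size, i.e. boxes that become degenerate after
# conversion to [x1, y1, x2, y2].
import json

def find_degenerate_boxes(ann_path, min_size=1.0):
    """Return (image_id, ann_id, bbox) for every box thinner than min_size."""
    with open(ann_path) as f:
        coco = json.load(f)
    return [(a["image_id"], a["id"], a["bbox"])
            for a in coco["annotations"]
            if a["bbox"][2] < min_size or a["bbox"][3] < min_size]
```

Running this once over the training annotations is much cheaper than discovering the problem 180 steps into training.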

@phamquiluan
Owner

I don't remember; give your hypothesis a try.

@phamquiluan
Owner

@ai1361720220000 any progress?

@ai1361720220000
Author

ai1361720220000 commented Apr 21, 2021

> @ai1361720220000 any progress?

The MaskRCNN class normalizes and resizes the input images and targets during its forward step, before they are fed to the Mask R-CNN network, so there is no need to resize or normalize in the dataloader.

def forward(self, images, targets=None):
        # type: (List[Tensor], Optional[List[Dict[str, Tensor]]])
        images = [img for img in images]
        for i in range(len(images)):
            image = images[i]
            target_index = targets[i] if targets is not None else None

            if image.dim() != 3:
                raise ValueError("images is expected to be a list of 3d tensors "
                                 "of shape [C, H, W], got {}".format(image.shape))
            image = self.normalize(image)
            image, target_index = self.resize(image, target_index)
            images[i] = image
            if targets is not None and target_index is not None:
                targets[i] = target_index

@phamquiluan
Owner

@ai1361720220000 Are you still struggling with the error?

@phamquiluan
Owner

Hi @ai1361720220000 , how are you doing 😄
