pytorch version #13
Comments
Tbh, I don't remember. Please try torch 1.4 and 1.5 first, and let me know if that does not help.
The problem above happened with torch 1.5.0.
@ai1361720220000 please try with 1.4.
OK, thanks. I will tell you whether 1.4 solves the problem.
I used one GPU to train the model with torch 1.5.0. It succeeded for the first 180 steps, but after step 180 it stopped training due to NaN loss.
Try reducing the learning rate.
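For context, a minimal sketch of what "reduce the learning rate" typically looks like in a setup like this; the model constructor, the class count, and the lr=0.001 starting point (visible in the log below) are assumptions about this repo, not its exact code:

```python
import torch
import torchvision

# Hypothetical sketch: rebuild the optimizer with a smaller learning rate
# than the lr=0.001 seen in the training log. Gradient clipping is another
# common guard against a loss diverging to NaN.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=6)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=1e-4, momentum=0.9, weight_decay=1e-4)

# Inside the training loop, after losses.backward():
#     torch.nn.utils.clip_grad_norm_(params, max_norm=10.0)
#     optimizer.step()
```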
You'd better check it yourself.
OK, thanks~
@ai1361720220000 please let me know the result 🙇
I found that many of the images do not match condition 4 in the picture above.
I don't remember; give it a try with your hypothesis.
@ai1361720220000 any progress?
The MaskRCNN class normalizes and resizes the input images and labels during the forward step, before feeding them to the Mask R-CNN network, so there is no need to perform resizing and normalization in the dataloader.
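To make the point above concrete, here is a short sketch using torchvision's detection API, which this training code appears to build on; the class count and the transform values shown are torchvision's defaults and my assumptions, not necessarily what this repo configures:

```python
import torchvision

# torchvision's Mask R-CNN wraps a GeneralizedRCNNTransform that normalizes
# and resizes inputs inside forward(), so the dataset can return plain
# un-normalized tensors in [0, 1] at their original size.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    num_classes=6,                      # assumption: 5 PubLayNet classes + background
    min_size=800, max_size=1333,        # torchvision's default resize bounds
    image_mean=[0.485, 0.456, 0.406],   # torchvision's default ImageNet statistics
    image_std=[0.229, 0.224, 0.225],
)
print(model.transform)  # GeneralizedRCNNTransform(Normalize(...), Resize(...))
```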
@ai1361720220000 Are you still struggling with the error?
Hi @ai1361720220000, how are you doing? 😄
Hello, thanks for sharing this nice project. Could you tell me which version of PyTorch you used?
I replaced
train_set = TableBank(root_dir=args.train_data_path)
with
train_set = PubLayNet(root_dir=args.train_data_path)
and ran "python main_publaynet.py", but the error below comes up and I can't find a way to solve it.
[Apr-19 08:58] [INFO] sys.argv: ['main_publaynet.py'] <main, (): 555>
[Apr-19 08:58] [WARNING] Unknown arguments: [] <main, (): 558>
Use GPU: 0 for training
Use GPU: 1 for training
[Apr-19 08:58] [INFO] ================================== <mp_main, main_worker(): 345>
[Apr-19 08:58] [INFO] ================================== <mp_main, main_worker(): 345>
[Apr-19 08:58] [INFO] Create dataset with root_dir=/home/notebook/data/group/projects/Document_Layout_Analysis_dataset/publaynet/train <mp_main, main_worker(): 346>
[Apr-19 08:58] [INFO] Create dataset with root_dir=/home/notebook/data/group/projects/Document_Layout_Analysis_dataset/publaynet/train <mp_main, main_worker(): 346>
loading annotations into memory...
loading annotations into memory...
Done (t=64.58s)
creating index...
Done (t=64.64s)
creating index...
index created!
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [ 1474 334229]
[Apr-19 08:59] [INFO] Create data_loader.. with batch_size = 2 <mp_main, main_worker(): 372>
[Apr-19 08:59] [INFO] Start training.. <mp_main, main_worker(): 381>
index created!
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [ 1474 334229]
[Apr-19 08:59] [INFO] Create data_loader.. with batch_size = 2 <mp_main, main_worker(): 372>
[Apr-19 08:59] [INFO] Start training.. <mp_main, main_worker(): 381>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 398, in _send_bytes
self._send(buf)
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
[Apr-19 09:01] [CRITICAL]
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 398, in main_worker
args=args
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 427, in train_one_epoch
images = list(image.to(device) for image in images)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 427, in
images = list(image.to(device) for image in images)
AttributeError: 'Image' object has no attribute 'to'
<main, (): 568>
Traceback (most recent call last):
File "main_publaynet.py", line 562, in
main(args)
File "main_publaynet.py", line 230, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 398, in main_worker
args=args
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 427, in train_one_epoch
images = list(image.to(device) for image in images)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 427, in
images = list(image.to(device) for image in images)
AttributeError: 'Image' object has no attribute 'to'
[Apr-19 09:32] [INFO] sys.argv: ['main_publaynet.py'] <main, (): 555>
[Apr-19 09:32] [WARNING] Unknown arguments: [] <main, (): 558>
Use GPU: 1 for training
Use GPU: 0 for training
[Apr-19 09:32] [INFO] ================================== <mp_main, main_worker(): 345>
[Apr-19 09:32] [INFO] Create dataset with root_dir=/home/notebook/data/group/projects/Document_Layout_Analysis_dataset/publaynet/train <mp_main, main_worker(): 346>
[Apr-19 09:32] [INFO] ================================== <mp_main, main_worker(): 345>
[Apr-19 09:32] [INFO] Create dataset with root_dir=/home/notebook/data/group/projects/Document_Layout_Analysis_dataset/publaynet/train <mp_main, main_worker(): 346>
loading annotations into memory...
loading annotations into memory...
Done (t=59.31s)
creating index...
Done (t=61.41s)
creating index...
index created!
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [ 1474 334229]
[Apr-19 09:33] [INFO] Create data_loader.. with batch_size = 2 <mp_main, main_worker(): 372>
[Apr-19 09:33] [INFO] Start training.. <mp_main, main_worker(): 381>
index created!
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [ 1474 334229]
[Apr-19 09:33] [INFO] Create data_loader.. with batch_size = 2 <mp_main, main_worker(): 372>
[Apr-19 09:33] [INFO] Start training.. <mp_main, main_worker(): 381>
/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:2854: UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. If you wish to keep the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change "
/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:2854: UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. If you wish to keep the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change "
/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
It: 0 [ 0/83926] eta: 63 days, 1:26:42 lr: 0.001000 loss: 11.7813 (11.7813) loss_box_reg: 0.1196 (0.1196) loss_classifier: 0.4231 (0.4231) loss_mask: 5.5672 (5.5672) loss_objectness: 5.1233 (5.1233) loss_rpn_box_reg: 0.5482 (0.5482) time: 64.9191 data: 56.2254 max mem: 2146
It: 0 [ 0/83926] eta: 60 days, 12:14:23 lr: 0.001000 loss: 11.7813 (11.7813) loss_box_reg: 0.1196 (0.1196) loss_classifier: 0.4231 (0.4231) loss_mask: 5.5672 (5.5672) loss_objectness: 5.1233 (5.1233) loss_rpn_box_reg: 0.5482 (0.5482) time: 62.2937 data: 61.6931 max mem: 2147
[Apr-19 09:34] [WARNING] transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered <mp_main, train_one_epoch(): 457>
Traceback (most recent call last):
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 449, in train_one_epoch
losses.backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f001540d536 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f0015650fbe in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f00153fdabd in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: + 0x523542 (0x7effd4603542 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x5235e6 (0x7effd46035e6 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x1a311e (0x55b59911711e in /opt/conda/bin/python)
frame #6: + 0x10e91c (0x55b59908291c in /opt/conda/bin/python)
frame #7: + 0xfdfc8 (0x55b599071fc8 in /opt/conda/bin/python)
frame #8: + 0x10f147 (0x55b599083147 in /opt/conda/bin/python)
frame #9: + 0x10f15d (0x55b59908315d in /opt/conda/bin/python)
frame #10: + 0x10f15d (0x55b59908315d in /opt/conda/bin/python)
frame #11: + 0xf6457 (0x55b59906a457 in /opt/conda/bin/python)
frame #12: + 0xf64c3 (0x55b59906a4c3 in /opt/conda/bin/python)
frame #13: + 0xf6446 (0x55b59906a446 in /opt/conda/bin/python)
frame #14: + 0x1dc943 (0x55b599150943 in /opt/conda/bin/python)
frame #15: _PyEval_EvalFrameDefault + 0x2a59 (0x55b599143229 in /opt/conda/bin/python)
frame #16: _PyFunction_FastCallKeywords + 0xfb (0x55b5990d920b in /opt/conda/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x6a0 (0x55b599140e70 in /opt/conda/bin/python)
frame #18: _PyFunction_FastCallKeywords + 0xfb (0x55b5990d920b in /opt/conda/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x416 (0x55b599140be6 in /opt/conda/bin/python)
frame #20: _PyEval_EvalCodeWithName + 0x2f9 (0x55b5990892b9 in /opt/conda/bin/python)
frame #21: _PyFunction_FastCallKeywords + 0x387 (0x55b5990d9497 in /opt/conda/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x14ea (0x55b599141cba in /opt/conda/bin/python)
frame #23: _PyEval_EvalCodeWithName + 0x2f9 (0x55b5990892b9 in /opt/conda/bin/python)
frame #24: PyEval_EvalCodeEx + 0x44 (0x55b59908a1d4 in /opt/conda/bin/python)
frame #25: PyEval_EvalCode + 0x1c (0x55b59908a1fc in /opt/conda/bin/python)
frame #26: + 0x22bf44 (0x55b59919ff44 in /opt/conda/bin/python)
frame #27: PyRun_StringFlags + 0x7d (0x55b5991ab21d in /opt/conda/bin/python)
frame #28: PyRun_SimpleStringFlags + 0x3f (0x55b5991ab27f in /opt/conda/bin/python)
frame #29: + 0x23737d (0x55b5991ab37d in /opt/conda/bin/python)
frame #30: _Py_UnixMain + 0x3c (0x55b5991ab6fc in /opt/conda/bin/python)
frame #31: __libc_start_main + 0xe7 (0x7f0020b32b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #32: + 0x1dc3c0 (0x55b5991503c0 in /opt/conda/bin/python)
[Apr-19 09:34] [CRITICAL]
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 398, in main_worker
args=args
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 424, in train_one_epoch
iter_num=args.iter_num
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/utils.py", line 216, in log_every
for obj in iterable:
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in next
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in _pin_memory_loop
data = pin_memory(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
return data.pin_memory()
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
<main, (): 568>
Traceback (most recent call last):
File "main_publaynet.py", line 562, in
main(args)
File "main_publaynet.py", line 230, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 398, in main_worker
args=args
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/main_publaynet.py", line 424, in train_one_epoch
iter_num=args.iter_num
File "/OCR/PubLayNet_pytorch/PubLayNet-master/training_code/utils.py", line 216, in log_every
for obj in iterable:
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in next
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in _pin_memory_loop
data = pin_memory(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in
return [pin_memory(sample) for sample in data]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
return data.pin_memory()
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
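For what it's worth, the first traceback's AttributeError: 'Image' object has no attribute 'to' means the dataset is handing back PIL images, and image.to(device) only works on tensors. A hypothetical sketch of the usual remedy follows; the ToTensorTransform wrapper and the transforms keyword on the PubLayNet constructor are assumptions, not confirmed repo API:

```python
import torchvision.transforms.functional as F

# Hypothetical transform: convert the PIL image returned by the dataset's
# __getitem__ into a CHW float tensor in [0, 1] so that image.to(device) works.
class ToTensorTransform:
    def __call__(self, image, target):
        return F.to_tensor(image), target

# Usage sketch (dataset class and keyword name are assumptions):
# train_set = PubLayNet(root_dir=args.train_data_path, transforms=ToTensorTransform())
```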