[Bug] fcenet model gets stuck after the first iteration during training #2043

tmargaryan-aligntech opened this issue May 2, 2024 · 1 comment

tmargaryan-aligntech commented May 2, 2024

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmocr

Environment

System environment:
sys.platform: win32
Python: 3.10.9 (tags/v3.10.9:1dd9be6, Dec 6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
CUDA available: True
MUSA available: False
numpy_random_seed: 0
GPU 0: Tesla V100-SXM2-16GB
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2
NVCC: Cuda compilation tools, release 11.2, V11.2.152
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27051 for x64
GCC: n/a
PyTorch: 1.11.0+cu113
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.12.0+cu113
OpenCV: 4.8.1
MMEngine: 0.10.3

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 0
Distributed launcher: none
Distributed training: False
GPU number: 1
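
The runtime environment above reports mp_start_method 'fork' while sys.platform is win32. As a side check that is not part of the original run, the sketch below lists the multiprocessing start methods the Python standard library offers on the current platform; on Windows the standard library provides only 'spawn'. The helper names `available_start_methods` and `is_supported` are hypothetical and exist only for this illustration. Whether MMEngine applies or ignores the configured value on Windows is not shown in this report, so treat the result as a hint for further debugging rather than a diagnosis.

```python
import multiprocessing as mp


def available_start_methods():
    """Return the start methods this interpreter supports on this platform."""
    return mp.get_all_start_methods()


def is_supported(method):
    """Check whether a start method name (e.g. 'fork') is usable here."""
    return method in available_start_methods()


if __name__ == "__main__":
    # 'fork' is the value shown in the runtime environment of this report.
    configured = "fork"
    print("available:", available_start_methods())
    print(f"{configured!r} supported:", is_supported(configured))
```

Running this in the same Python environment used for training shows whether the configured start method exists on that machine at all.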

Reproduces the problem - code sample

Here is my config:

2024/05/02 19:45:57 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: win32
    Python: 3.10.9 (tags/v3.10.9:1dd9be6, Dec  6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 0
    GPU 0: Tesla V100-SXM2-16GB
    CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2
    NVCC: Cuda compilation tools, release 11.2, V11.2.152
    MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27051 for x64
    GCC: n/a
    PyTorch: 1.11.0+cu113
    PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.12.0+cu113
    OpenCV: 4.8.1
    MMEngine: 0.10.3

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 0
    Distributed launcher: none
    Distributed training: False
    GPU number: 1
------------------------------------------------------------

2024/05/02 19:45:57 - mmengine - INFO - Config:
auto_scale_lr = dict(base_batch_size=16)
data_root = None
default_hooks = dict(
    checkpoint=dict(
        interval=5,
        rule='greater',
        save_best='icdar/hmean',
        type='CheckpointHook'),
    logger=dict(interval=5, type='LoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    sync_buffer=dict(type='SyncBuffersHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(
        draw_gt=False,
        draw_pred=False,
        enable=False,
        interval=1,
        show=False,
        type='VisualizationHook'))
default_scope = 'mmocr'
det_test = dict(
    ann_file='test.json',
    data_prefix=dict(img_path='test_imgs/'),
    data_root=None,
    pipeline=None,
    test_mode=True,
    type='OCRDataset')
det_train = dict(
    ann_file='train.json',
    data_prefix=dict(img_path='train_imgs/'),
    data_root=None,
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=None,
    type='OCRDataset')
det_val = dict(
    ann_file='train.json',
    data_prefix=dict(img_path='train_imgs/'),
    data_root=None,
    pipeline=None,
    test_mode=True,
    type='OCRDataset')
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
find_unused_parameters = True
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=True, type='LogProcessor', window_size=10)
model = dict(
    backbone=dict(
        depth=50,
        frozen_stages=-1,
        init_cfg=dict(checkpoint='torchvision://resnet50', type='Pretrained'),
        norm_cfg=dict(requires_grad=True, type='BN'),
        norm_eval=False,
        num_stages=4,
        out_indices=(
            1,
            2,
            3,
        ),
        style='pytorch',
        type='mmdet.ResNet'),
    data_preprocessor=dict(
        bgr_to_rgb=True,
        mean=[
            123.675,
            116.28,
            103.53,
        ],
        pad_size_divisor=32,
        std=[
            58.395,
            57.12,
            57.375,
        ],
        type='TextDetDataPreprocessor'),
    det_head=dict(
        fourier_degree=5,
        in_channels=256,
        module_loss=dict(num_sample=50, type='FCEModuleLoss'),
        postprocessor=dict(
            alpha=1.2,
            beta=1.0,
            num_reconstr_points=50,
            scales=(
                8,
                16,
                32,
            ),
            score_thr=0.3,
            text_repr_type='quad',
            type='FCEPostprocessor'),
        type='FCEHead'),
    neck=dict(
        act_cfg=None,
        add_extra_convs='on_output',
        in_channels=[
            512,
            1024,
            2048,
        ],
        num_outs=3,
        out_channels=256,
        relu_before_extra_convs=True,
        type='mmdet.FPN'),
    type='FCENet')
optim_wrapper = dict(
    optimizer=dict(lr=1e-05, momentum=0.9, type='SGD', weight_decay=0.0005),
    type='OptimWrapper')
param_scheduler = None
randomness = dict(seed=0)
resume = False
test_cfg = dict(type='TestLoop')
test_dataloader = dict(
    batch_size=1,
    dataset=dict(
        datasets=[
            dict(
                ann_file='test.json',
                data_prefix=dict(img_path='test_imgs/'),
                data_root='C:/Data/detection',
                pipeline=None,
                test_mode=True,
                type='OCRDataset'),
        ],
        pipeline=[
            dict(
                color_type='color_ignore_orientation',
                type='LoadImageFromFile'),
            dict(keep_ratio=True, scale=(
                2260,
                2260,
            ), type='Resize'),
            dict(
                type='LoadOCRAnnotations',
                with_bbox=True,
                with_label=True,
                with_polygon=True),
            dict(
                meta_keys=(
                    'img_path',
                    'ori_shape',
                    'img_shape',
                    'scale_factor',
                ),
                type='PackTextDetInputs'),
        ],
        type='ConcatDataset'),
    num_workers=1,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))
test_evaluator = dict(type='HmeanIOUMetric')
test_list = [
    dict(
        ann_file='test.json',
        data_prefix=dict(img_path='test_imgs/'),
        data_root=None,
        pipeline=None,
        test_mode=True,
        type='OCRDataset'),
]
test_pipeline = [
    dict(color_type='color_ignore_orientation', type='LoadImageFromFile'),
    dict(keep_ratio=True, scale=(
        1280,
        960,
    ), type='Resize'),
    dict(
        type='LoadOCRAnnotations',
        with_bbox=True,
        with_label=True,
        with_polygon=True),
    dict(type='FixInvalidPolygon'),
    dict(
        meta_keys=(
            'img_path',
            'ori_shape',
            'img_shape',
            'scale_factor',
        ),
        type='PackTextDetInputs'),
]
train_cfg = dict(max_epochs=1500, type='EpochBasedTrainLoop', val_interval=1)
train_dataloader = dict(
    batch_size=10,
    dataset=dict(
        datasets=[
            dict(
                ann_file='train.json',
                data_prefix=dict(img_path='train_imgs/'),
                data_root='C:/Data/detection',
                pipeline=None,
                test_mode=False,
                type='OCRDataset'),
        ],
        pipeline=[
            dict(
                color_type='color_ignore_orientation',
                type='LoadImageFromFile'),
            dict(
                type='LoadOCRAnnotations',
                with_bbox=True,
                with_label=True,
                with_polygon=True),
            dict(
                keep_ratio=True,
                ratio_range=(
                    0.75,
                    2.5,
                ),
                scale=(
                    800,
                    800,
                ),
                type='RandomResize'),
            dict(
                crop_ratio=0.5,
                iter_num=1,
                min_area_ratio=0.2,
                type='TextDetRandomCropFlip'),
            dict(
                prob=0.8,
                transforms=[
                    dict(min_side_ratio=0.3, type='RandomCrop'),
                ],
                type='RandomApply'),
            dict(
                prob=0.5,
                transforms=[
                    dict(
                        max_angle=30,
                        pad_with_fixed_color=False,
                        type='RandomRotate',
                        use_canvas=True),
                ],
                type='RandomApply'),
            dict(
                prob=[
                    0.6,
                    0.4,
                ],
                transforms=[
                    [
                        dict(keep_ratio=True, scale=800, type='Resize'),
                        dict(target_scale=800, type='SourceImagePad'),
                    ],
                    dict(keep_ratio=False, scale=800, type='Resize'),
                ],
                type='RandomChoice'),
            dict(direction='horizontal', prob=0.5, type='RandomFlip'),
            dict(
                brightness=0.12549019607843137,
                contrast=0.5,
                op='ColorJitter',
                saturation=0.5,
                type='TorchVisionWrapper'),
            dict(
                meta_keys=(
                    'img_path',
                    'ori_shape',
                    'img_shape',
                    'scale_factor',
                ),
                type='PackTextDetInputs'),
        ],
        type='ConcatDataset'),
    num_workers=8,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
train_list = [
    dict(
        ann_file='train.json',
        data_prefix=dict(img_path='train_imgs/'),
        data_root=None,
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=None,
        type='OCRDataset'),
]
train_pipeline = [
    dict(color_type='color_ignore_orientation', type='LoadImageFromFile'),
    dict(
        type='LoadOCRAnnotations',
        with_bbox=True,
        with_label=True,
        with_polygon=True),
    dict(type='FixInvalidPolygon'),
    dict(
        keep_ratio=True,
        ratio_range=(
            0.75,
            2.5,
        ),
        scale=(
            800,
            800,
        ),
        type='RandomResize'),
    dict(
        crop_ratio=0.5,
        iter_num=1,
        min_area_ratio=0.2,
        type='TextDetRandomCropFlip'),
    dict(
        prob=0.8,
        transforms=[
            dict(min_side_ratio=0.3, type='RandomCrop'),
        ],
        type='RandomApply'),
    dict(
        prob=0.5,
        transforms=[
            dict(
                max_angle=30,
                pad_with_fixed_color=False,
                type='RandomRotate',
                use_canvas=True),
        ],
        type='RandomApply'),
    dict(
        prob=[
            0.6,
            0.4,
        ],
        transforms=[
            [
                dict(keep_ratio=True, scale=800, type='Resize'),
                dict(target_scale=800, type='SourceImagePad'),
            ],
            dict(keep_ratio=False, scale=800, type='Resize'),
        ],
        type='RandomChoice'),
    dict(direction='horizontal', prob=0.5, type='RandomFlip'),
    dict(
        brightness=0.12549019607843137,
        contrast=0.5,
        op='ColorJitter',
        saturation=0.5,
        type='TorchVisionWrapper'),
    dict(
        meta_keys=(
            'img_path',
            'ori_shape',
            'img_shape',
            'scale_factor',
        ),
        type='PackTextDetInputs'),
]
val_cfg = dict(type='ValLoop')
val_dataloader = dict(
    batch_size=1,
    dataset=dict(
        datasets=[
            dict(
                ann_file='val.json',
                data_prefix=dict(img_path='val_imgs/'),
                data_root='C:/Data/detection',
                pipeline=None,
                test_mode=False,
                type='OCRDataset'),
        ],
        pipeline=[
            dict(
                color_type='color_ignore_orientation',
                type='LoadImageFromFile'),
            dict(keep_ratio=True, scale=(
                2260,
                2260,
            ), type='Resize'),
            dict(
                type='LoadOCRAnnotations',
                with_bbox=True,
                with_label=True,
                with_polygon=True),
            dict(
                meta_keys=(
                    'img_path',
                    'ori_shape',
                    'img_shape',
                    'scale_factor',
                ),
                type='PackTextDetInputs'),
        ],
        type='ConcatDataset'),
    num_workers=1,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))
val_evaluator = dict(type='HmeanIOUMetric')
val_list = [
    dict(
        ann_file='train.json',
        data_prefix=dict(img_path='train_imgs/'),
        data_root=None,
        pipeline=None,
        test_mode=True,
        type='OCRDataset'),
]
vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend'),
]
visualizer = dict(
    name=
    'time.struct_time(tm_year=2024, tm_mon=5, tm_mday=2, tm_hour=19, tm_min=45, tm_sec=54, tm_wday=3, tm_yday=123, tm_isdst=0)',
    type='TextDetLocalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend'),
    ])
work_dir = 'work_dirs/fcenet_resnet50_fpn_1500e_totaltext/'

2024/05/02 19:46:05 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
2024/05/02 19:46:05 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) RuntimeInfoHook                    
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
before_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DistSamplerSeedHook                
 -------------------- 
before_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) SyncBuffersHook                    
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_val:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
before_val_epoch:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) SyncBuffersHook                    
 -------------------- 
before_val_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_val_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) VisualizationHook                  
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_val_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_val:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
after_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_test:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
before_test_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_test_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_test_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) VisualizationHook                  
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_test_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_test:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
after_run:
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
2024/05/02 19:46:08 - mmengine - INFO - load model from: torchvision://resnet50
2024/05/02 19:46:08 - mmengine - INFO - Loads checkpoint by torchvision backend from path: torchvision://resnet50
2024/05/02 19:46:08 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

Name of parameter - Initialization information

backbone.conv1.weight - torch.Size([64, 3, 7, 7]): 
PretrainedInit: load from torchvision://resnet50 

backbone.bn1.weight - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.bn1.bias - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.conv1.weight - torch.Size([64, 64, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.bn1.weight - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.bn1.bias - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.conv2.weight - torch.Size([64, 64, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.bn2.weight - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.bn2.bias - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.conv3.weight - torch.Size([256, 64, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.bn3.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.bn3.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.downsample.0.weight - torch.Size([256, 64, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.downsample.1.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.0.downsample.1.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.conv1.weight - torch.Size([64, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.bn1.weight - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.bn1.bias - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.conv2.weight - torch.Size([64, 64, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.bn2.weight - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.bn2.bias - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.conv3.weight - torch.Size([256, 64, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.bn3.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.1.bn3.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.conv1.weight - torch.Size([64, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.bn1.weight - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.bn1.bias - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.conv2.weight - torch.Size([64, 64, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.bn2.weight - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.bn2.bias - torch.Size([64]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.conv3.weight - torch.Size([256, 64, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.bn3.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer1.2.bn3.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.conv1.weight - torch.Size([128, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.bn1.weight - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.bn1.bias - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.conv2.weight - torch.Size([128, 128, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.bn2.weight - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.bn2.bias - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.conv3.weight - torch.Size([512, 128, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.bn3.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.bn3.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.downsample.0.weight - torch.Size([512, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.downsample.1.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.0.downsample.1.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.conv1.weight - torch.Size([128, 512, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.bn1.weight - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.bn1.bias - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.conv2.weight - torch.Size([128, 128, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.bn2.weight - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.bn2.bias - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.conv3.weight - torch.Size([512, 128, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.bn3.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.1.bn3.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.conv1.weight - torch.Size([128, 512, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.bn1.weight - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.bn1.bias - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.conv2.weight - torch.Size([128, 128, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.bn2.weight - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.bn2.bias - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.conv3.weight - torch.Size([512, 128, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.bn3.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.2.bn3.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.conv1.weight - torch.Size([128, 512, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.bn1.weight - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.bn1.bias - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.conv2.weight - torch.Size([128, 128, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.bn2.weight - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.bn2.bias - torch.Size([128]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.conv3.weight - torch.Size([512, 128, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.bn3.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer2.3.bn3.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.conv1.weight - torch.Size([256, 512, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.bn1.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.bn1.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.conv2.weight - torch.Size([256, 256, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.bn2.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.bn2.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.conv3.weight - torch.Size([1024, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.bn3.weight - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.bn3.bias - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.downsample.0.weight - torch.Size([1024, 512, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.downsample.1.weight - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.0.downsample.1.bias - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.conv1.weight - torch.Size([256, 1024, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.bn1.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.bn1.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.conv2.weight - torch.Size([256, 256, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.bn2.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.bn2.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.conv3.weight - torch.Size([1024, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.bn3.weight - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.1.bn3.bias - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.conv1.weight - torch.Size([256, 1024, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.bn1.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.bn1.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.conv2.weight - torch.Size([256, 256, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.bn2.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.bn2.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.conv3.weight - torch.Size([1024, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.bn3.weight - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.2.bn3.bias - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.conv1.weight - torch.Size([256, 1024, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.bn1.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.bn1.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.conv2.weight - torch.Size([256, 256, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.bn2.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.bn2.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.conv3.weight - torch.Size([1024, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.bn3.weight - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.3.bn3.bias - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.conv1.weight - torch.Size([256, 1024, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.bn1.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.bn1.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.conv2.weight - torch.Size([256, 256, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.bn2.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.bn2.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.conv3.weight - torch.Size([1024, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.bn3.weight - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.4.bn3.bias - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.conv1.weight - torch.Size([256, 1024, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.bn1.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.bn1.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.conv2.weight - torch.Size([256, 256, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.bn2.weight - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.bn2.bias - torch.Size([256]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.conv3.weight - torch.Size([1024, 256, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.bn3.weight - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer3.5.bn3.bias - torch.Size([1024]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.conv1.weight - torch.Size([512, 1024, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.bn1.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.bn1.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.conv2.weight - torch.Size([512, 512, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.bn2.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.bn2.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.conv3.weight - torch.Size([2048, 512, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.bn3.weight - torch.Size([2048]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.bn3.bias - torch.Size([2048]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.downsample.0.weight - torch.Size([2048, 1024, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.downsample.1.weight - torch.Size([2048]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.0.downsample.1.bias - torch.Size([2048]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.conv1.weight - torch.Size([512, 2048, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.bn1.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.bn1.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.conv2.weight - torch.Size([512, 512, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.bn2.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.bn2.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.conv3.weight - torch.Size([2048, 512, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.bn3.weight - torch.Size([2048]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.1.bn3.bias - torch.Size([2048]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.conv1.weight - torch.Size([512, 2048, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.bn1.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.bn1.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.conv2.weight - torch.Size([512, 512, 3, 3]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.bn2.weight - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.bn2.bias - torch.Size([512]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.conv3.weight - torch.Size([2048, 512, 1, 1]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.bn3.weight - torch.Size([2048]): 
PretrainedInit: load from torchvision://resnet50 

backbone.layer4.2.bn3.bias - torch.Size([2048]): 
PretrainedInit: load from torchvision://resnet50 

neck.lateral_convs.0.conv.weight - torch.Size([256, 512, 1, 1]): 
XavierInit: gain=1, distribution=uniform, bias=0 

neck.lateral_convs.0.conv.bias - torch.Size([256]): 
The value is the same before and after calling `init_weights` of FCENet  

neck.lateral_convs.1.conv.weight - torch.Size([256, 1024, 1, 1]): 
XavierInit: gain=1, distribution=uniform, bias=0 

neck.lateral_convs.1.conv.bias - torch.Size([256]): 
The value is the same before and after calling `init_weights` of FCENet  

neck.lateral_convs.2.conv.weight - torch.Size([256, 2048, 1, 1]): 
XavierInit: gain=1, distribution=uniform, bias=0 

neck.lateral_convs.2.conv.bias - torch.Size([256]): 
The value is the same before and after calling `init_weights` of FCENet  

neck.fpn_convs.0.conv.weight - torch.Size([256, 256, 3, 3]): 
XavierInit: gain=1, distribution=uniform, bias=0 

neck.fpn_convs.0.conv.bias - torch.Size([256]): 
The value is the same before and after calling `init_weights` of FCENet  

neck.fpn_convs.1.conv.weight - torch.Size([256, 256, 3, 3]): 
XavierInit: gain=1, distribution=uniform, bias=0 

neck.fpn_convs.1.conv.bias - torch.Size([256]): 
The value is the same before and after calling `init_weights` of FCENet  

neck.fpn_convs.2.conv.weight - torch.Size([256, 256, 3, 3]): 
XavierInit: gain=1, distribution=uniform, bias=0 

neck.fpn_convs.2.conv.bias - torch.Size([256]): 
The value is the same before and after calling `init_weights` of FCENet  

det_head.out_conv_cls.weight - torch.Size([4, 256, 3, 3]): 
NormalInit: mean=0, std=0.01, bias=0 

det_head.out_conv_cls.bias - torch.Size([4]): 
NormalInit: mean=0, std=0.01, bias=0 

det_head.out_conv_reg.weight - torch.Size([22, 256, 3, 3]): 
NormalInit: mean=0, std=0.01, bias=0 

det_head.out_conv_reg.bias - torch.Size([22]): 
NormalInit: mean=0, std=0.01, bias=0 
2024/05/02 19:46:08 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
2024/05/02 19:46:08 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
2024/05/02 19:46:08 - mmengine - INFO - Checkpoints will be saved to D:\AlignProjects\almarkocr\research\mmocr\trainer_det\work_dirs\fcenet_resnet50_fpn_1500e_totaltext.
2024/05/02 19:46:38 - mmengine - INFO - Epoch(train)    [1][5/8]  lr: 1.0000e-05  eta: 19:44:55  time: 5.9271  data_time: 4.9296  memory: 11810  loss: 7.8055  loss_text: 2.1384  loss_center: 2.1940  loss_reg_x: 1.6825  loss_reg_y: 1.7907
2024/05/02 19:46:39 - mmengine - INFO - Exp name: fcenet_resnet50_fpn_1500e_totaltext_20240502_194554

Reproduces the problem - command or script

My images are 1024x1024, and I am confident there is no data issue. I have tried batch sizes of 1, 2, 4, and 8 and hit the same problem each time. The DBNetPP model trains fine on the same machine with the same data and reaches good accuracy.

Reproduces the problem - error message

There is no error. The process gets stuck.

EDIT: After 14 hours, here are additional logs:

2024/05/02 19:46:38 - mmengine - INFO - Epoch(train)    [1][5/8]  lr: 1.0000e-05  eta: 19:44:55  time: 5.9271  data_time: 4.9296  memory: 11810  loss: 7.8055  loss_text: 2.1384  loss_center: 2.1940  loss_reg_x: 1.6825  loss_reg_y: 1.7907
2024/05/02 19:46:39 - mmengine - INFO - Exp name: fcenet_resnet50_fpn_1500e_totaltext_20240502_194554
2024/05/03 01:51:10 - mmengine - INFO - Epoch(val)    [1][ 5/80]    eta: 3 days, 19:07:41  time: 4374.1597  data_time: 0.8993  memory: 11810  
2024/05/03 08:46:18 - mmengine - INFO - Epoch(val)    [1][10/80]    eta: 3 days, 18:57:28  time: 4677.8364  data_time: 0.4499  memory: 1394  
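
To put a number on the slowdown, the log lines above can be parsed directly. The snippet below is a minimal sketch, not part of MMOCR or MMEngine: `PATTERN` and `parse_line` are hypothetical helpers, and the regular expression only assumes the log layout shown in this issue (the train line is shortened to the fields that are used).

```python
import re
from datetime import datetime

# Log lines copied from this issue (loss fields of the train line omitted).
LOG_LINES = [
    "2024/05/02 19:46:38 - mmengine - INFO - Epoch(train)    [1][5/8]  lr: 1.0000e-05  eta: 19:44:55  time: 5.9271  data_time: 4.9296  memory: 11810",
    "2024/05/03 01:51:10 - mmengine - INFO - Epoch(val)    [1][ 5/80]    eta: 3 days, 19:07:41  time: 4374.1597  data_time: 0.8993  memory: 11810",
    "2024/05/03 08:46:18 - mmengine - INFO - Epoch(val)    [1][10/80]    eta: 3 days, 18:57:28  time: 4677.8364  data_time: 0.4499  memory: 1394",
]

# Hypothetical pattern: timestamp, mode (train/val), step counter, and the
# "time" / "data_time" fields. \btime avoids matching inside "data_time".
PATTERN = re.compile(
    r"^(?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"Epoch\((?P<mode>\w+)\)\s+\[\d+\]\[\s*(?P<step>\d+)/(?P<total>\d+)\]"
    r".*?\btime: (?P<time>[\d.]+)\s+data_time: (?P<data_time>[\d.]+)"
)


def parse_line(line):
    """Return the parsed fields of one log line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "stamp": datetime.strptime(m["stamp"], "%Y/%m/%d %H:%M:%S"),
        "mode": m["mode"],
        "step": int(m["step"]),
        "time": float(m["time"]),
        "data_time": float(m["data_time"]),
    }


records = [r for r in map(parse_line, LOG_LINES) if r is not None]
val = [r for r in records if r["mode"] == "val"]

# Wall-clock time between the first and last validation log entries.
elapsed = (val[-1]["stamp"] - val[0]["stamp"]).total_seconds()
steps = val[-1]["step"] - val[0]["step"]
wall_per_iter = elapsed / steps

for r in records:
    # Share of each iteration that is NOT spent waiting on the data loader.
    non_data = r["time"] - r["data_time"]
    print(f'{r["mode"]:5s} step {r["step"]:3d}: time={r["time"]:.1f}s, non-data={non_data:.1f}s')
print(f"val wall-clock: {elapsed:.0f}s over {steps} iterations ({wall_per_iter:.1f}s each)")
```

The two validation entries are 24908 seconds apart for 5 iterations, while `data_time` stays under one second, which suggests the time is spent after the batch is loaded (model forward or postprocessing) rather than in data loading.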

Additional information

No response

@256-7421142

Did you run into this problem when reproducing FCENet? #2044
