swin_transformer_v2.py RuntimeError Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor) #376

Open
wants to merge 2 commits into
base: main

woongjoonchoi commented Jan 15, 2025

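When running a Swin Transformer V2 model on a GPU, the following error occurs: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)

The bound torch.tensor(1. / 0.01) is created on the CPU while self.logit_scale lives on cuda:0, so torch.clamp rejects the tensor passed as max. I fixed this in models/swin_transformer_v2.py, line 159.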

Before:
logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01))).exp()
After:

    logit_scale_device = self.logit_scale.device
    logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01).to(logit_scale_device))).exp()
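The same idea can be sketched outside the model, assuming only that logit_scale is a tensor or parameter that may live on any device. The helper names `clamped_logit_scale` and `clamped_logit_scale_float` are hypothetical and are not part of the repository. The second helper shows an alternative that passes a plain Python float as `max`, which sidesteps the device question entirely:

```python
import math

import torch


def clamped_logit_scale(logit_scale: torch.Tensor) -> torch.Tensor:
    """Clamp the log-scale at log(1 / 0.01) and exponentiate.

    Hypothetical helper for illustration. The bound is created on the same
    device as `logit_scale`, so the call works on CPU and on any CUDA device.
    """
    max_log = torch.log(torch.tensor(1.0 / 0.01, device=logit_scale.device))
    return torch.clamp(logit_scale, max=max_log).exp()


def clamped_logit_scale_float(logit_scale: torch.Tensor) -> torch.Tensor:
    """Same computation, with the bound passed as a Python float."""
    return torch.clamp(logit_scale, max=math.log(1.0 / 0.01)).exp()


if __name__ == "__main__":
    # Use the GPU when one is present; the helpers behave the same on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    scale = torch.full((3, 1, 1), 10.0, device=device)
    print(clamped_logit_scale(scale).flatten())
    print(clamped_logit_scale_float(scale).flatten())
```

Creating the bound with `device=logit_scale.device` avoids a separate CPU allocation followed by a copy. Either variant should behave the same as the change proposed here.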

To reproduce the error:

python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval --cfg ./configs/swinv2/swinv2_tiny_patch4_window8_256.yaml --resume ././Swin-model-1k/swinv2/swinv2_tiny_patch4_window8_256.pth --data-path imagenet
WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.

WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.

WARNING: CUDA backtraces will not be collected because CPU sampling is disabled.
/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launch.py:208: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

main()
Tutel has not been installed. To use Swin-MoE, please install Tutel; otherwise, just ignore this.
To use FusedLAMB or FusedAdam, please install apex.
=> merge config from ./configs/swinv2/swinv2_tiny_patch4_window8_256.yaml
RANK and WORLD_SIZE in environ: 0/1
[rank0]:[W115 16:39:15.575750744 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[2025-01-15 16:39:15 swinv2_tiny_patch4_window8_256](main.py 434): INFO Full config saved to output/swinv2_tiny_patch4_window8_256/default/config.json
[2025-01-15 16:39:15 swinv2_tiny_patch4_window8_256](main.py 437): INFO AMP_ENABLE: true
AMP_OPT_LEVEL: ''
AUG:
  AUTO_AUGMENT: rand-m9-mstd0.5-inc1
  COLOR_JITTER: 0.4
  CUTMIX: 1.0
  CUTMIX_MINMAX: null
  MIXUP: 0.8
  MIXUP_MODE: batch
  MIXUP_PROB: 1.0
  MIXUP_SWITCH_PROB: 0.5
  RECOUNT: 1
  REMODE: pixel
  REPROB: 0.25
BASE:
- ''
DATA:
  BATCH_SIZE: 128
  CACHE_MODE: part
  DATASET: imagenet
  DATA_PATH: imagenet
  IMG_SIZE: 256
  INTERPOLATION: bicubic
  MASK_PATCH_SIZE: 32
  MASK_RATIO: 0.6
  NUM_WORKERS: 8
  PIN_MEMORY: true
  ZIP_MODE: false
ENABLE_AMP: false
EVAL_MODE: true
FUSED_LAYERNORM: false
FUSED_WINDOW_PROCESS: false
LOCAL_RANK: 0
MODEL:
  DROP_PATH_RATE: 0.2
  DROP_RATE: 0.0
  LABEL_SMOOTHING: 0.1
  NAME: swinv2_tiny_patch4_window8_256
  NUM_CLASSES: 1000
  PRETRAINED: ''
  RESUME: ././Swin-model-1k/swinv2/swinv2_tiny_patch4_window8_256.pth
  SIMMIM:
    NORM_TARGET:
      ENABLE: false
      PATCH_SIZE: 47
  SWIN:
    APE: false
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    IN_CHANS: 3
    MLP_RATIO: 4.0
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    PATCH_NORM: true
    PATCH_SIZE: 4
    QKV_BIAS: true
    QK_SCALE: null
    WINDOW_SIZE: 7
  SWINV2:
    APE: false
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    IN_CHANS: 3
    MLP_RATIO: 4.0
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    PATCH_NORM: true
    PATCH_SIZE: 4
    PRETRAINED_WINDOW_SIZES:
    - 0
    - 0
    - 0
    - 0
    QKV_BIAS: true
    WINDOW_SIZE: 8
  SWIN_MLP:
    APE: false
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    IN_CHANS: 3
    MLP_RATIO: 4.0
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    PATCH_NORM: true
    PATCH_SIZE: 4
    WINDOW_SIZE: 7
  SWIN_MOE:
    APE: false
    AUX_LOSS_WEIGHT: 0.01
    CAPACITY_FACTOR: 1.25
    COSINE_ROUTER: false
    COSINE_ROUTER_DIM: 256
    COSINE_ROUTER_INIT_T: 0.5
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    GATE_NOISE: 1.0
    INIT_STD: 0.02
    IN_CHANS: 3
    IS_GSHARD_LOSS: false
    MLP_FC2_BIAS: true
    MLP_RATIO: 4.0
    MOE_BLOCKS:
    - - -1
    - - -1
    - - -1
    - - -1
    MOE_DROP: 0.0
    NORMALIZE_GATE: false
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    NUM_LOCAL_EXPERTS: 1
    PATCH_NORM: true
    PATCH_SIZE: 4
    PRETRAINED_WINDOW_SIZES:
    - 0
    - 0
    - 0
    - 0
    QKV_BIAS: true
    QK_SCALE: null
    TOP_VALUE: 1
    USE_BPR: true
    WINDOW_SIZE: 7
  TYPE: swinv2
OUTPUT: output/swinv2_tiny_patch4_window8_256/default
PRINT_FREQ: 10
SAVE_FREQ: 1
SEED: 0
TAG: default
TEST:
  CROP: true
  SEQUENTIAL: false
  SHUFFLE: false
THROUGHPUT_MODE: false
TRAIN:
  ACCUMULATION_STEPS: 1
  AUTO_RESUME: true
  BASE_LR: 0.000125
  CLIP_GRAD: 5.0
  EPOCHS: 300
  LAYER_DECAY: 1.0
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    GAMMA: 0.1
    MULTISTEPS: []
    NAME: cosine
    WARMUP_PREFIX: true
  MIN_LR: 1.25e-06
  MOE:
    SAVE_MASTER: false
  OPTIMIZER:
    BETAS:
    - 0.9
    - 0.999
    EPS: 1.0e-08
    MOMENTUM: 0.9
    NAME: adamw
  START_EPOCH: 0
  USE_CHECKPOINT: false
  WARMUP_EPOCHS: 20
  WARMUP_LR: 1.25e-07
  WEIGHT_DECAY: 0.05

[2025-01-15 16:39:15 swinv2_tiny_patch4_window8_256](main.py 438): INFO {"cfg": "./configs/swinv2/swinv2_tiny_patch4_window8_256.yaml", "opts": null, "batch_size": null, "data_path": "imagenet", "zip": false, "cache_mode": "part", "pretrained": null, "resume": "././Swin-model-1k/swinv2/swinv2_tiny_patch4_window8_256.pth", "accumulation_steps": null, "use_checkpoint": false, "disable_amp": false, "amp_opt_level": null, "output": "output", "tag": null, "eval": true, "throughput": false, "fused_window_process": false, "fused_layernorm": false, "optim": null}
local rank 0 / global rank 0 successfully build train dataset
local rank 0 / global rank 0 successfully build val dataset
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 93): INFO Creating model:swinv2/swinv2_tiny_patch4_window8_256
/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/functional.py:534: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3595.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 95): INFO SwinTransformerV2(
(patch_embed): PatchEmbed(
(proj): Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4))
(norm): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
)
(pos_drop): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0): BasicLayer(
dim=96, input_resolution=(64, 64), depth=2
(blocks): ModuleList(
(0): SwinTransformerBlock(
dim=96, input_resolution=(64, 64), num_heads=3, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=96, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=3
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=3, bias=False)
)
(qkv): Linear(in_features=96, out_features=288, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=96, out_features=96, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): Identity()
(norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=96, out_features=384, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=384, out_features=96, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
dim=96, input_resolution=(64, 64), num_heads=3, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=96, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=3
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=3, bias=False)
)
(qkv): Linear(in_features=96, out_features=288, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=96, out_features=96, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=96, out_features=384, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=384, out_features=96, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(64, 64), dim=96
(reduction): Linear(in_features=384, out_features=192, bias=False)
(norm): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
)
)
(1): BasicLayer(
dim=192, input_resolution=(32, 32), depth=2
(blocks): ModuleList(
(0): SwinTransformerBlock(
dim=192, input_resolution=(32, 32), num_heads=6, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=192, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=6
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=6, bias=False)
)
(qkv): Linear(in_features=192, out_features=576, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=192, out_features=192, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=192, out_features=768, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=768, out_features=192, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
dim=192, input_resolution=(32, 32), num_heads=6, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=192, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=6
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=6, bias=False)
)
(qkv): Linear(in_features=192, out_features=576, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=192, out_features=192, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=192, out_features=768, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=768, out_features=192, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(32, 32), dim=192
(reduction): Linear(in_features=768, out_features=384, bias=False)
(norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
)
)
(2): BasicLayer(
dim=384, input_resolution=(16, 16), depth=6
(blocks): ModuleList(
(0): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(2): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(3): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(4): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(5): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(16, 16), dim=384
(reduction): Linear(in_features=1536, out_features=768, bias=False)
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(3): BasicLayer(
dim=768, input_resolution=(8, 8), depth=2
(blocks): ModuleList(
(0-1): 2 x SwinTransformerBlock(
dim=768, input_resolution=(8, 8), num_heads=24, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=768, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=24
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=24, bias=False)
)
(qkv): Linear(in_features=768, out_features=2304, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
)
)
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(avgpool): AdaptiveAvgPool1d(output_size=1)
(head): Linear(in_features=768, out_features=1000, bias=True)
)
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 98): INFO number of params: 28347154
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 101): INFO number of GFLOPs: 5.925697536
/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/utils.py:203: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
self._scaler = torch.cuda.amp.GradScaler()
All checkpoints founded in output/swinv2_tiny_patch4_window8_256/default: []
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 151): INFO no checkpoint found in output/swinv2_tiny_patch4_window8_256/default, ignoring auto resume
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](utils.py 19): INFO ==============> Resuming form ././Swin-model-1k/swinv2/swinv2_tiny_patch4_window8_256.pth....................
/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/utils.py:24: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(config.MODEL.RESUME, map_location='cpu')
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](utils.py 26): INFO
/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/main.py:308: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with torch.cuda.amp.autocast(enabled=config.AMP_ENABLE):
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/main.py", line 440, in <module>
[rank0]: main(config)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/main.py", line 155, in main
[rank0]: acc1, acc5, loss = validate(config, data_loader_val, model)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/main.py", line 314, in validate
[rank0]: output = model(images)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 627, in forward
[rank0]: x = self.forward_features(x)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 619, in forward_features
[rank0]: x = layer(x)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 434, in forward
[rank0]: x = blk(x)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 292, in forward
[rank0]: attn_windows = self.attn(x_windows, mask=self.attn_mask) # nW*B, window_size*window_size, C
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 159, in forward
[rank0]: logit_scale = torch.clamp(self.logit_scale, max=torch.log( torch.tensor(1. / 0.01) ) ).exp()
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)
[rank0]:[W115 16:39:20.040987096 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0115 16:39:21.135000 14576 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 14603) of binary: /home/oongjoon/Desktop/Github/flashattn/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launch.py", line 208, in <module>
main()
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/typing_extensions.py", line 2853, in wrapper
return arg(*args, **kwargs)
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launch.py", line 204, in main
launch(args)
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in launch
run(args)
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-15_16:39:21
host : oongjoon-System-Product-Name
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 14603)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Generated:

woongjoonchoi changed the title from "swin_transformer_v2.py error RuntimeError fixed" to "swin_transformer_v2.py RuntimeError Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)" on Jan 15, 2025
Error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)

Modified:

Before:
logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01).cuda())).exp()

After:

        logit_scale_device = self.logit_scale.device
        logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01).to(logit_scale_device))).exp()
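
For reference, the arithmetic behind the bound can be checked without PyTorch. This is a sketch using only the standard library; `MAX_LOG_SCALE` and `effective_scale` are hypothetical names for illustration. Since log(1 / 0.01) = log(100), clamping before the call to exp() caps the resulting scale at 100:

```python
import math

# Upper bound used in the clamp: log(1 / 0.01) = log(100), about 4.605.
MAX_LOG_SCALE = math.log(1.0 / 0.01)


def effective_scale(log_scale: float) -> float:
    """Scalar analogue of clamp(log_scale, max=log(100)).exp()."""
    return math.exp(min(log_scale, MAX_LOG_SCALE))


if __name__ == "__main__":
    # Values below the bound pass through; larger values are capped at 100.
    for value in (0.0, math.log(10.0), 10.0):
        print(value, effective_scale(value))
```

Because the bound is a constant, it does not need to be a tensor at all, which is why passing a Python float as max is also a device-independent option.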