Wrap vec size 8 with USE_ROCM #1795

Open
wants to merge 2 commits into
base: 2.5_perf_fix

update

f69cc7e
ROCm Repo Management API / Tests / Tests / Test Inductor / Run pytorch_inductor failed Dec 16, 2024 in 0s

failed: 12, skipped: 22976, passed: 25930

Details

AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 11408, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 988, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 127, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py:977: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:296.)
  t = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:167: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
Standard out
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4), ('benchmarking.TritonBenchmarker.triton_do_bench', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
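
A note on reading the mismatch report above. This is a minimal sketch, assuming the comparison follows the elementwise rule documented for `torch.testing.assert_close`, namely `|actual - expected| <= atol + rtol * |expected|`. The helper names `is_close` and `relative_difference` are hypothetical, and the sample element is made up for illustration; only the tolerances and the 0.0078125 difference come from the log. It shows why a relative difference of `inf` can appear: when the reference value is exactly 0 and the compiled output is not, the ratio is unbounded, and only `atol` can absorb the error.

```python
import math

def is_close(actual, expected, rtol, atol):
    """Elementwise closeness test: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

def relative_difference(actual, expected):
    """|actual - expected| / |expected|; infinite when expected is 0 and actual is not."""
    diff = abs(actual - expected)
    if expected == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return diff / abs(expected)

# Tolerances reported in the failure message.
RTOL, ATOL = 0.016, 1e-05

# Hypothetical element: the reference is exactly 0 and the output is off by 0.0078125.
expected, actual = 0.0, 0.0078125
print(is_close(actual, expected, RTOL, ATOL))
print(relative_difference(actual, expected))
```

With a zero reference the `rtol` term contributes nothing, so an absolute error of 0.0078125 is compared against `atol = 1e-05` alone and fails.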

AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 11408, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 988, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 127, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard out
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]

AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 11408, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 988, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 127, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py:977: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:296.)
  t = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:167: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
Standard out
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4), ('benchmarking.TritonBenchmarker.triton_do_bench', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]

AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 11408, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 988, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 127, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard out
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]

GPUTests.test_scaled_dot_product_efficient_attention_cuda

RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_scaled_dot_product_efficient_attention_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 10300, in test_scaled_dot_product_efficient_attention
    self.common(
  File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 589, in check_model_gpu
    check_model(
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 413, in check_model
    correct = ref_model(*ref_inputs, **ref_kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 10296, in fn
    return aten._scaled_dot_product_efficient_attention(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_scaled_dot_product_efficient_attention_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
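
The error above lists the GPU targets that the accelerated SDPA path accepts. As a rough sketch, a pre-check against that list could look like the following. The helper `is_listed_target` is hypothetical, and the exact-match comparison is an assumption; the real check inside PyTorch may normalise the architecture string differently. The target strings themselves are copied from the error message.

```python
# Targets copied from the "[AOTriton] Accelerated SDPA only supports ..." message.
SUPPORTED_TARGETS = (
    "gfx90a:sramecc+:xnack-",
    "gfx942:sramecc+:xnack-",
    "gfx1100",
)

def is_listed_target(arch_name: str) -> bool:
    """Return True if arch_name exactly matches one of the listed targets.

    Exact string matching is an assumption made for this sketch.
    """
    return arch_name.strip() in SUPPORTED_TARGETS

# Example with made-up inputs: one listed target, one that is not listed.
print(is_listed_target("gfx1100"))
print(is_listed_target("gfx1030"))
```

On a ROCm build, the architecture string of the current device should be readable from `torch.cuda.get_device_properties(0).gcnArchName`, which could be passed to a helper like this to skip the test on unsupported hardware instead of letting it fail.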

GPUTests.test_scaled_dot_product_efficient_attention_cuda

RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
Exception raised from _efficient_attention_forward at /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/attention.hip:1140 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::native::_efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<long>, std::optional<long>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#8 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from RegisterCUDA.cpp:0
#9 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___efficient_attention_forward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from RegisterCUDA.cpp:0
#10 at::_ops::_efficient_attention_forward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#11 at::_efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<long>, std::optional<long>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#12 at::native::_scaled_dot_product_efficient_attention_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from ??:0
#13 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_efficient_attention(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from RegisterCUDA.cpp:0
#14 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_efficient_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from RegisterCUDA.cpp:0
#15 at::_ops::_scaled_dot_product_efficient_attention::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from ??:0
#16 torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_efficient_attention(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from VariableType_3.cpp:0
#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>), &torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_efficient_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableType_3.cpp:0
#18 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] from register_c10_ops.cpp:0
#19 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) from ??:0
#20 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) from ??:0
#21 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#218}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#218}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#22 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#23 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
#24 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#25 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5917
#26 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#27 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#28 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#29 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#30 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#32 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#36 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#41 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#45 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#53 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#55 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#56 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#58 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#59 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#60 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#61 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#64 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#65 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#66 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#67 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#68 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#69 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#70 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#71 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#72 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#73 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#74 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#75 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#76 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#77 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#78 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#79 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#80 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#81 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#82 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#83 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#84 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#85 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#86 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#88 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#90 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#91 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#92 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#94 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#95 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#96 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#97 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#98 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#99 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#100 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#101 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#102 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#103 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#104 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#105 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#106 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#107 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#108 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#109 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#110 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#111 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#112 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#113 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#114 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#115 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#116 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#117 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#118 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#119 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#120 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#121 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#122 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#123 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#124 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#125 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#126 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#127 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#128 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#129 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#130 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#131 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#132 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#133 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#134 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#135 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#136 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
#137 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
#138 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
#139 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
#140 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
#141 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
#142 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
#143 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
#144 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
#145 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
#146 _start from ??:0


To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_scaled_dot_product_efficient_attention_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 10300, in test_scaled_dot_product_efficient_attention
    self.common(
  File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 589, in check_model_gpu
    check_model(
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 413, in check_model
    correct = ref_model(*ref_inputs, **ref_kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 10296, in fn
    return aten._scaled_dot_product_efficient_attention(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
Exception raised from _efficient_attention_forward at /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/attention.hip:1140 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::native::_efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<long>, std::optional<long>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#8 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from RegisterCUDA.cpp:0
#9 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___efficient_attention_forward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from RegisterCUDA.cpp:0
#10 at::_ops::_efficient_attention_forward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#11 at::_efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<long>, std::optional<long>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#12 at::native::_scaled_dot_product_efficient_attention_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from ??:0
#13 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_efficient_attention(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from RegisterCUDA.cpp:0
#14 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_efficient_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from RegisterCUDA.cpp:0
#15 at::_ops::_scaled_dot_product_efficient_attention::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from ??:0
#16 torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_efficient_attention(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from VariableType_3.cpp:0
#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>), &torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_efficient_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableType_3.cpp:0
#18 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] from register_c10_ops.cpp:0
#19 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) from ??:0
#20 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) from ??:0
#21 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#218}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#218}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#22 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#23 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
#24 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#25 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5917
#26 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#27 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#28 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#29 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#30 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#32 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#36 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#41 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#45 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#53 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#55 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#56 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#58 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#59 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#60 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#61 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#64 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#65 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#66 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#67 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#68 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#69 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#70 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#71 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#72 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#73 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#74 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#75 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#76 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#77 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#78 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#79 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#80 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#81 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#82 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#83 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#84 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#85 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#86 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#88 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#90 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#91 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#92 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#94 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#95 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#96 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#97 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#98 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#99 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#100 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#101 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#102 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#103 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#104 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#105 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#106 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#107 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#108 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#109 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#110 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#111 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#112 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#113 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#114 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#115 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#116 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#117 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#118 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#119 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#120 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#121 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#122 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#123 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#124 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#125 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#126 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#127 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#128 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#129 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#130 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#131 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#132 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#133 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#134 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#135 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#136 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
#137 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
#138 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
#139 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
#140 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
#141 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
#142 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
#143 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
#144 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
#145 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
#146 _start from ??:0


More test results are not shown here; view them on Jenkins.