Wrap vec size 8 with USE_ROCM
#1795
+4
−0
Open
ROCm Repo Management API / Tests / Tests / Test Inductor / Run pytorch_inductor
failed
Dec 16, 2024 in 0s
failed: 12, skipped: 22976, passed: 25930
Details
AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda
AssertionError: Tensor-likes are not close!
Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 11408, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 988, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 127, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py:977: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:296.)
t = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:167: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
Standard out
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4), ('benchmarking.TritonBenchmarker.triton_do_bench', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
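The numbers in this failure are easier to read with the closeness rule written out. The sketch below is a plain-Python illustration, not the code in `common_utils.py`. It assumes the elementwise rule documented for `torch.testing.assert_close`, namely `|actual - expected| <= atol + rtol * |expected|`. The helper names and the element values are hypothetical and were chosen only to reproduce the reported magnitudes.

```python
import math

def is_close(actual, expected, atol, rtol):
    """Elementwise rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

def relative_difference(actual, expected):
    """|actual - expected| / |expected|; infinite when expected is 0 and the values differ."""
    diff = abs(actual - expected)
    if expected == 0:
        return math.inf if diff > 0 else 0.0
    return diff / abs(expected)

# Tolerances as printed in the failure message.
ATOL, RTOL = 1e-05, 0.016

# Reported greatest absolute difference; it equals 2**-7.
abs_diff = 0.0078125

# Hypothetical elements: the same absolute error passes next to a reference
# value of 1.0 but fails next to a reference value of 0.0, where only atol applies.
print(is_close(1.0 + abs_diff, 1.0, ATOL, RTOL))
print(is_close(abs_diff, 0.0, ATOL, RTOL))

# A nonzero error against a zero reference gives an infinite relative difference.
print(relative_difference(abs_diff, 0.0))

# Share of mismatched elements reported in the message.
mismatch_pct = 100 * 2851 / 196608
print(f"{mismatch_pct:.1f}%")
```

Under this rule, a "Greatest relative difference: inf" line means at least one mismatched element has a reference value of exactly zero, so for that element the check reduces to the absolute tolerance alone.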
AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda
AssertionError: Tensor-likes are not close!
Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 11408, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 988, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 127, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard out
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda
AssertionError: Tensor-likes are not close!
Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 11408, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 988, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 127, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestNonABICompatibleCuda.test_sdpa_2_non_abi_compatible_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py:977: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:296.)
t = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:167: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
Standard out
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4), ('benchmarking.TritonBenchmarker.triton_do_bench', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda
AssertionError: Tensor-likes are not close!
Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 11408, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 988, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 127, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleCuda.test_sdpa_2_abi_compatible_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard out
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 4)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
GPUTests.test_scaled_dot_product_efficient_attention_cuda
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_scaled_dot_product_efficient_attention_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 10300, in test_scaled_dot_product_efficient_attention
    self.common(
  File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 589, in check_model_gpu
    check_model(
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 413, in check_model
    correct = ref_model(*ref_inputs, **ref_kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 10296, in fn
    return aten._scaled_dot_product_efficient_attention(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_scaled_dot_product_efficient_attention_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
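The `[AOTriton]` error above names the architectures it accepts, so a test guard could compare the runner's architecture string against that list before calling the efficient-attention op. The sketch below is a hypothetical helper, not AOTriton's or PyTorch's actual check. The architecture list is copied from the error message. Comparing only the base `gfx` name is an assumption made for illustration, and how the architecture string is obtained on a given runner is left out.

```python
# Architectures copied verbatim from the [AOTriton] error message.
SUPPORTED_ARCHS = (
    "gfx90a:sramecc+:xnack-",
    "gfx942:sramecc+:xnack-",
    "gfx1100",
)

def base_arch(arch_name: str) -> str:
    """Keep the gfx name and drop feature suffixes such as ':sramecc+:xnack-'."""
    return arch_name.split(":", 1)[0]

def is_listed_arch(arch_name: str) -> bool:
    """Hypothetical guard: match on the base gfx name only (an assumption,
    not the rule AOTriton itself applies)."""
    return base_arch(arch_name) in {base_arch(a) for a in SUPPORTED_ARCHS}

# An architecture from the list is accepted.
print(is_listed_arch("gfx90a:sramecc+:xnack-"))
# An architecture that is not in the list (example value) is rejected.
print(is_listed_arch("gfx1030"))
```

A guard of this kind would turn the `RuntimeError` into an explicit skip on unlisted hardware, which separates this failure from the numerical mismatches in the `test_sdpa_2` entries.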
GPUTests.test_scaled_dot_product_efficient_attention_cuda
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
Exception raised from _efficient_attention_forward at /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/attention.hip:1140 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::native::_efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<long>, std::optional<long>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#8 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from RegisterCUDA.cpp:0
#9 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___efficient_attention_forward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from RegisterCUDA.cpp:0
#10 at::_ops::_efficient_attention_forward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#11 at::_efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<long>, std::optional<long>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#12 at::native::_scaled_dot_product_efficient_attention_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from ??:0
#13 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_efficient_attention(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from RegisterCUDA.cpp:0
#14 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_efficient_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from RegisterCUDA.cpp:0
#15 at::_ops::_scaled_dot_product_efficient_attention::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from ??:0
#16 torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_efficient_attention(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from VariableType_3.cpp:0
#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>), &torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_efficient_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableType_3.cpp:0
#18 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] from register_c10_ops.cpp:0
#19 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) from ??:0
#20 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) from ??:0
#21 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#218}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#218}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#22 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#23 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
#24 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#25 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5917
#26 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#27 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#28 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#29 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#30 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#32 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#36 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#41 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#45 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#53 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#55 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#56 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#58 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#59 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#60 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#61 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#64 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#65 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#66 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#67 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#68 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#69 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#70 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#71 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#72 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#73 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#74 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#75 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#76 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#77 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#78 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#79 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#80 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#81 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#82 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#83 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#84 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#85 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#86 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#88 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#90 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#91 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#92 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#94 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#95 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#96 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#97 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#98 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#99 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#100 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#101 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#102 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#103 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#104 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#105 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#106 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#107 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#108 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#109 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#110 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#111 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#112 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#113 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#114 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#115 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#116 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#117 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#118 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#119 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#120 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#121 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#122 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#123 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#124 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#125 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#126 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#127 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#128 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#129 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#130 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#131 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#132 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#133 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#134 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#135 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#136 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
#137 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
#138 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
#139 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
#140 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
#141 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
#142 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
#143 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
#144 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
#145 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
#146 _start from ??:0
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_scaled_dot_product_efficient_attention_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 10300, in test_scaled_dot_product_efficient_attention
self.common(
File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 589, in check_model_gpu
check_model(
File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 413, in check_model
correct = ref_model(*ref_inputs, **ref_kwargs)
File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 10296, in fn
return aten._scaled_dot_product_efficient_attention(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
Exception raised from _efficient_attention_forward at /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/attention.hip:1140 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::native::_efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<long>, std::optional<long>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#8 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from RegisterCUDA.cpp:0
#9 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___efficient_attention_forward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from RegisterCUDA.cpp:0
#10 at::_ops::_efficient_attention_forward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<c10::SymInt>, std::optional<c10::SymInt>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#11 at::_efficient_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::optional<long>, std::optional<long>, double, long, bool, std::optional<double>, std::optional<at::Tensor> const&, std::optional<long>) from ??:0
#12 at::native::_scaled_dot_product_efficient_attention_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from ??:0
#13 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_efficient_attention(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from RegisterCUDA.cpp:0
#14 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_efficient_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from RegisterCUDA.cpp:0
#15 at::_ops::_scaled_dot_product_efficient_attention::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from ??:0
#16 torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_efficient_attention(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>) from VariableType_3.cpp:0
#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double>), &torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_efficient_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, bool, double, bool, std::optional<double> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableType_3.cpp:0
#18 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] from register_c10_ops.cpp:0
#19 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) from ??:0
#20 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) from ??:0
#21 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#218}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#218}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#22 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#23 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
#24 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#25 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5917
#26 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#27 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#28 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#29 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#30 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#32 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#36 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#41 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#45 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#53 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#55 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#56 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#58 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#59 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#60 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#61 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#64 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#65 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#66 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#67 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#68 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#69 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#70 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#71 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#72 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#73 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#74 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#75 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#76 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#77 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#78 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#79 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#80 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#81 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#82 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#83 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#84 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#85 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#86 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#88 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#90 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#91 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#92 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#94 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#95 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#96 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#97 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#98 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#99 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#100 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#101 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#102 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#103 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#104 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#105 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#106 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#107 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#108 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#109 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#110 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#111 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#112 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#113 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#114 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#115 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#116 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#117 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#118 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#119 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#120 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#121 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#122 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#123 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#124 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#125 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#126 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#127 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#128 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#129 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#130 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#131 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#132 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#133 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#134 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#135 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#136 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
#137 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
#138 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
#139 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
#140 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
#141 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
#142 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
#143 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
#144 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
#145 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
#146 _start from ??:0
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_scaled_dot_product_efficient_attention_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
More test results are not shown here; view them on Jenkins.