Array crash using RelWithDebInfo with CUDA. #1440

Open
BradWhitlock opened this issue Oct 8, 2024 · 1 comment

@BradWhitlock
Member

I think I was able to make a reproducer for an axom::Array crash. Build the following example program on rzansel. It will build in RelWithDebInfo mode and will crash in the axom::Array constructor trying to default-initialize memory with placement new on the host (but the memory should be on the GPU).

In thinking about this some more, having axom::Array constructors in a compiled .cpp file might be the real problem. In the original scenario, I did not do anything special in my CMakeLists.txt to indicate that the file has GPU code; I only added the source file to my library target. In this example, the bug.cpp file is marked as CUDA code so axom::CUDA_EXEC will be available. The execute.cpp file where the error happens is not marked as a GPU file, though some code in axom::Array might need it to be.
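To make the suspected failure mode concrete, here is a minimal host-only sketch of default-initializing elements with placement new. It is not Axom's implementation; `hostDefaultInit` and `sumAfterHostInit` are hypothetical names used only for illustration. The loop writes through the pointer on the CPU, so it is only valid when the pointer refers to host-accessible memory. Handing it a device-only pointer (for example, one returned by `cudaMalloc`) would be expected to crash in the same way as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical helper, not Axom's implementation: value-initialize n elements
// of raw storage from host code.
template <typename T>
void hostDefaultInit(T* data, std::size_t n)
{
  for(std::size_t i = 0; i < n; ++i)
  {
    // Placement new writes through the pointer on the CPU, so the pointer
    // must refer to host-accessible memory.
    ::new(static_cast<void*>(data + i)) T();
  }
}

// Runs the helper on a host allocation and returns the sum of the elements.
inline long sumAfterHostInit(std::size_t n)
{
  if(n == 0)
  {
    return 0;
  }
  long* data = static_cast<long*>(std::malloc(n * sizeof(long)));
  assert(data != nullptr);
  hostDefaultInit(data, n);  // safe: malloc returns host memory
  long sum = 0;
  for(std::size_t i = 0; i < n; ++i)
  {
    sum += data[i];
  }
  std::free(data);
  return sum;
}
```

Calling `sumAfterHostInit(6)` exercises the host path only; a GPU-aware array would have to take a different initialization path when the allocator ID refers to device memory.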

bug.cpp

#include <axom/config.hpp>
#include <axom/core.hpp>

#include <iostream>

// Forward declaration
namespace axom
{
namespace mir
{
namespace views
{
void buildShapeMap(axom::Array<axom::IndexType> &indices,
                   axom::Array<axom::IndexType> &values,
                   int allocatorID);
} // end namespace views
} // end namespace mir
} // end namespace axom

template <typename ExecSpace>
void execute()
{
  axom::Array<axom::IndexType> indices, values;
  int allocatorID = axom::execution_space<ExecSpace>::allocatorID();
  axom::mir::views::buildShapeMap(/*node,*/ indices, values, allocatorID);

  std::cout << "indices.size=" << indices.size() << std::endl;
  std::cout << "values.size=" << values.size() << std::endl;
}

int main(int argc, char *argv[])
{
#if defined(AXOM_USE_CUDA)
  constexpr int BLOCK_SIZE = 256;
  using cuda_exec = axom::CUDA_EXEC<BLOCK_SIZE>;
  execute<cuda_exec>();
#endif

  return 0;
}

execute.cpp

#include <axom/config.hpp>
#include <axom/core.hpp>

#include <vector>

// Forward declaration
namespace axom
{
namespace mir
{
namespace views
{
void buildShapeMap(axom::Array<axom::IndexType> &indices,
                   axom::Array<axom::IndexType> &values,
                   int allocatorID)
{
  std::vector<axom::IndexType> srcIndices{{0,1,2,3,4,5}};
  std::vector<axom::IndexType> srcValues{{5,4,3,2,1,0}};
  const axom::IndexType n = srcValues.size();

  // The error would manifest here when calling Array::Array(n, n, allocatorID).
  indices = axom::Array<axom::IndexType>(n, n, allocatorID);
  values = axom::Array<axom::IndexType>(n, n, allocatorID);
  axom::copy(indices.data(), srcIndices.data(), n * sizeof(axom::IndexType));
  axom::copy(values.data(), srcValues.data(), n * sizeof(axom::IndexType));
}

} // end namespace views
} // end namespace mir
} // end namespace axom

The CMakeLists.txt file below was hacked together from the QuickStart guide, with some additions of mine so that it would work with CUDA. I had hoped that Axom would take care of that, but it didn't.

CMakeLists.txt

cmake_minimum_required(VERSION 3.21)

set(CMAKE_C_COMPILER "/usr/tce/packages/clang/clang-ibm-10.0.1-gcc-8.3.1/bin/clang" CACHE PATH "")
set(CMAKE_CXX_COMPILER "/usr/tce/packages/clang/clang-ibm-10.0.1-gcc-8.3.1/bin/clang++" CACHE PATH "")
#set(CMAKE_BUILD_TYPE Debug)
#set(CMAKE_BUILD_TYPE Release)
set(CMAKE_BUILD_TYPE RelWithDebInfo)

project(bug)

# Point to an installed RelWithDebInfo Axom
set(AXOM_DIR /usr/WS2/whitlocb/Axom/axom_mir/axom/[email protected]_cuda-relwithdebinfo)

#------------------------------------------------------------------------------
# Check for AXOM_DIR and use CMake's find_package to import axom's targets
#------------------------------------------------------------------------------
if(NOT DEFINED AXOM_DIR OR NOT EXISTS ${AXOM_DIR}/lib/cmake/axom-config.cmake)
    message(FATAL_ERROR "Missing required 'AXOM_DIR' variable pointing to an installed axom")
endif()

if (ENABLE_CUDA)
    enable_language(CUDA)
endif()

if (ENABLE_HIP)
    if (NOT ROCM_PATH)
        find_path(ROCM_PATH
            hip
            $ENV{ROCM_DIR}
            $ENV{ROCM_PATH}
            $ENV{HIP_PATH}
            ${HIP_PATH}/..
            ${HIP_ROOT_DIR}/../
            ${ROCM_ROOT_DIR}
            /opt/rocm)
    endif()
    set(CMAKE_PREFIX_PATH "${CMAKE_PREFIX_PATH};${ROCM_PATH}")
    find_package(hip REQUIRED CONFIG PATHS ${ROCM_PATH})
endif()

include(CMakeFindDependencyMacro)

# 70=Volta
set(CMAKE_CUDA_ARCHITECTURES "70")

find_dependency(axom REQUIRED
                NO_DEFAULT_PATH 
                PATHS ${AXOM_DIR}/lib/cmake)


add_executable(bug bug.cpp execute.cpp)

set_source_files_properties(bug.cpp PROPERTIES
    LANGUAGE CUDA
)

set_target_properties(bug PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
)
target_link_libraries(bug axom::core ${CUDA_LIBRARIES})

rhornung67 added the bug, GPU, and Reviewed labels on Oct 21, 2024
rhornung67 added this to the FY25 Development milestone on Dec 16, 2024

@publixsubfan
Contributor

Looking at this further, the underlying issue is that the compiled axom::Array definition is different between the bug.cpp file compiled as CUDA and the execute.cpp file compiled as C++. This results in an ODR violation, since we can't satisfy the following constraint:

There can be more than one definition in a program of each of the following: ... , templated entity (template or member of template, but not full template specialization), as long as all following conditions are satisfied:
...

  • Each definition consists of the same sequence of tokens

Since the ArrayBase.hpp header contains an #if defined(AXOM_GPUCC) guard, the two definitions are different. We should just remove that guard. Compiling with a non-CUDA compiler would then error out, but I think the only real fix here is to compile all of your source files (at least the ones that use Axom) as CUDA sources.
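As a rough illustration of how a preprocessor guard changes a template's token sequence, consider the sketch below. `DEMO_GPUCC` and `DemoArrayOps` are hypothetical stand-ins, not Axom names, and the real guarded code in ArrayBase.hpp is not reproduced here. A translation unit built with `-DDEMO_GPUCC` and one built without it would each instantiate `DemoArrayOps<int>::initPath()` from different tokens, and the linker would silently keep only one of them.

```cpp
#include <cassert>
#include <string>

// DEMO_GPUCC is a hypothetical stand-in for AXOM_GPUCC. Define it (for
// example with -DDEMO_GPUCC) to mimic a translation unit built as CUDA.

template <typename T>
struct DemoArrayOps
{
  // The body of this member changes with the macro, so translation units
  // built with different settings see different token sequences for the
  // same templated entity.
  static std::string initPath()
  {
#if defined(DEMO_GPUCC)
    return "device-aware";
#else
    return "host-only";
#endif
  }
};

// Reports which definition this translation unit compiled.
inline std::string compiledPath()
{
  return DemoArrayOps<int>::initPath();
}

// Sanity check: every translation unit compiles exactly one of the two bodies.
inline void checkCompiledPath()
{
  const std::string path = compiledPath();
  assert(path == "host-only" || path == "device-aware");
}
```

In the reproducer, the analogous mismatch arises because bug.cpp is marked as `LANGUAGE CUDA` while execute.cpp is compiled as plain C++. Marking both files as CUDA sources in `set_source_files_properties` would give every translation unit the same definition.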
