building examples with CUDA on RTX 4070 #280

Open
nibblelab opened this issue Nov 21, 2024 · 14 comments
@nibblelab

Hi,

I'm trying to compile the library examples with CUDA 12 on an RTX 4070 GPU, but I'm getting this error:

[ 36%] Building NVCC (Device) object examples/CMakeFiles/solver_cuda.dir/solver_cuda_generated_solver.cu.o
nvcc fatal   : Unsupported gpu architecture 'compute_30'
CMake Error at solver_cuda_generated_solver.cu.o.RelWithDebInfo.cmake:220 (message):
  Error generating
  /home/johnatas/MFLab/Code/amgcl/build/examples/CMakeFiles/solver_cuda.dir//./solver_cuda_generated_solver.cu.o

The GPU architectures supported by nvcc are:

$ nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90

If I change CMakeLists.txt to select the architecture from the GPU itself by modifying

cuda_select_nvcc_arch_flags(CUDA_ARCH_FLAGS ${CUDA_TARGET_ARCH}) 

to

cuda_select_nvcc_arch_flags(CUDA_ARCH_FLAGS Auto) 

it will select a supported architecture and compile the solver_cuda module (with some warnings), but the overall build fails due to a series of compilation errors like:

[ 37%] Linking CXX executable solver_cuda
[ 37%] Built target solver_cuda
[ 38%] Building NVCC (Device) object examples/CMakeFiles/schur_pressure_correction_cuda.dir/schur_pressure_correction_cuda_generated_schur_pressure_correction.cu.o
...
avx512fp16intrin.h(101): error: more than one conversion function from "__half" to "<error-type>" applies
...
avx512fp16intrin.h(3187): error: return value type does not match the function type
...

How can I fix this to compile amgcl examples with CUDA 12?

My system settings:

  • Ubuntu 24.04 LTS
  • GCC 12.2
  • CUDA 12.0 - RTX 4070
  • CMake 3.27.1
  • Boost 1.81.0
  • OpenMPI 4.1.5
  • Eigen3 3.4.0
  • Hwloc 2.9.0
@ddemidov (Owner)

There is a configuration option for the target GPU architecture: https://github.com/ddemidov/amgcl/blob/master/CMakeLists.txt#L161

It has some outdated archs listed there; try setting it to just the one you need.
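
For example (untested on my side, since I don't have an Ada card), you should be able to override the cached value at configure time instead of editing the file:

cmake -DCUDA_TARGET_ARCH="Ada" ..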

@nibblelab (Author)

Thanks for the reply.

Changing the following line circumvents the architecture problem, but generates the other errors I've pointed out.

I've changed CUDA_TARGET_ARCH to the architectures supported by the GPU, as follows:

set(CUDA_TARGET_ARCH "Pascal Volta Turing Ampere Ada" CACHE STRING "Target architecture(s) for CUDA")

and again, it circumvents the architecture problem, but the compilation errors show up again, though in another module:

[ 68%] Linking CXX executable runtime_sdd_cuda
[ 68%] Built target runtime_sdd_cuda
[ 69%] Building NVCC (Device) object examples/mpi/CMakeFiles/runtime_sdd_3d_cuda.dir/runtime_sdd_3d_cuda_generated_runtime_sdd_3d.cu.o
. . .
avx512fp16intrin.h(38): error: vector_size attribute requires an arithmetic or enum type
. . .
avx512fp16intrin.h(62): error: more than one conversion function from "__half" to "<error-type>" applies:
. . .
avx512fp16intrin.h(4289): error: return value type does not match the function type

Those errors were generated with the Pascal architecture, which is the oldest one that the RTX 4070 supports.

@ddemidov (Owner)

ddemidov commented Nov 21, 2024

Can you try keeping just "Ada" in the arch list?

Sorry, I've just noticed that you tried with "Auto". I guess I'll need to do some digging, as I don't have a device to test this on.

@nibblelab (Author)

Thank you.

I've also been using a recompiled version of GCC as part of a CFD software environment, hence my interest in AMGCL. I'm updating my environment and the NVIDIA CUDA toolkit as well, to verify whether the problem is related to GCC, as that is a possibility (avx512fp16intrin.h is a GCC header, not an NVCC one).

For now, I've disabled the CUDA part in CMakeLists.txt to test the CPU part of the library.

@nibblelab (Author)

I've managed to solve the problem here. I was using the CUDA package from the Ubuntu repo, and that was the source of the header problem. Using the CUDA install from the NVIDIA HPC repo solved it, and I was finally able to compile and run amgcl with CUDA 12.x on the RTX 4070.

It's still necessary to make this change in CMakeLists.txt to ensure a compatible architecture is used:

cuda_select_nvcc_arch_flags(CUDA_ARCH_FLAGS Auto) 

I've tested the Poisson problem from the tutorials (https://amgcl.readthedocs.io/en/latest/tutorial/poisson3Db.html) on CUDA/GPU and on the CPU and found the results interesting. But the problem is rather small, and the CUDA version ends up being slower due to the overhead of CPU-GPU communication.

I've changed the Stokes tutorial (https://amgcl.readthedocs.io/en/latest/tutorial/Stokes.html) to use the GPU, using the Poisson CUDA code as an example, but I'm having problems converting the binary matrices used in the Stokes tutorial to the MatrixMarket format used in the Poisson one. Is there any tool to make this conversion, or a tutorial on how to use the binary matrices/vectors with amgcl on CUDA?

@ddemidov (Owner)

ddemidov commented Dec 6, 2024

That's great, thanks for letting me know!

There are ./examples/mm2bin and ./examples/bin2mm utilities:

./bin2mm --help
Options:
  -h [ --help ]         Show this help.
  -d [ --dense ]        Matrix is dense (use it with the RHS file).
  -i [ --input ] arg    Input binary file.
  -o [ --output ] arg   Ouput matrix in the MatrixMarket format.
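
For example, the conversion could look something like this (untested; the file names are just the ucube ones from the Stokes tutorial data, adjust to your case):

./examples/bin2mm -i ucube_3_A.bin -o ucube_3_A.mm
./examples/bin2mm -d -i ucube_3_b.bin -o ucube_3_b.mm

The -d / --dense flag is meant for the dense RHS file, as noted in the help above.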

Also, if you just want to test a 3D Poisson problem, you could run ./examples/solver -n 32 and ./examples/solver_cuda -n 32 to test the generated system for a 32x32x32 grid.

@nibblelab (Author)

Thanks for the reply again. I've used bin2mm to convert the ucube matrices and vectors, and used this code to run a CUDA version of ucube:

#include <vector>
#include <iostream>
#include <amgcl/backend/cuda.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/bicgstab.hpp>
#include <amgcl/io/mm.hpp>
#include <amgcl/profiler.hpp>

int main(int argc, char *argv[]) {
    // The matrix and the RHS file names should be in the command line options:
    if (argc < 3) {
        std::cerr << "Usage: " << argv[0] << " <matrix.mtx> <rhs.mtx>" << std::endl;
        return 1;
    }

    // Show the name of the GPU we are using:
    int device;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    std::cout << prop.name << std::endl;

    // The profiler:
    amgcl::profiler<> prof("UCube4");

    // Read the system matrix and the RHS:
    ptrdiff_t rows, cols;
    std::vector<ptrdiff_t> ptr, col;
    std::vector<double> val, rhs;

    prof.tic("read");
    std::tie(rows, cols) = amgcl::io::mm_reader(argv[1])(ptr, col, val);
    std::cout << "Matrix " << argv[1] << ": " << rows << "x" << cols << std::endl;

    std::tie(rows, cols) = amgcl::io::mm_reader(argv[2])(rhs);
    std::cout << "RHS " << argv[2] << ": " << rows << "x" << cols << std::endl;
    prof.toc("read");

    // We use the tuple of CRS arrays to represent the system matrix.
    // Note that std::tie creates a tuple of references, so no data is actually
    // copied here:
    auto A = std::tie(rows, ptr, col, val);

    // Compose the solver type
    typedef amgcl::backend::cuda<double> Backend;
    typedef amgcl::make_solver<
        amgcl::amg<
            Backend,
            amgcl::coarsening::smoothed_aggregation,
            amgcl::relaxation::spai0
            >,
        amgcl::solver::bicgstab<Backend>
        > Solver;

    // We need to initialize the CUSPARSE library and pass the handle to AMGCL
    // in backend parameters:
    Backend::params bprm;
    cusparseCreate(&bprm.cusparse_handle);

    // There is no way to pass the backend parameters without passing the
    // solver parameters, so we also need to create those. But we can leave
    // them with the default values:
    Solver::params prm;

    // Initialize the solver with the system matrix:
    prof.tic("setup");
    Solver solve(A, prm, bprm);
    prof.toc("setup");

    // Show the mini-report on the constructed solver:
    std::cout << solve << std::endl;

    // Solve the system with the zero initial approximation.
    // The RHS and the solution vectors should reside in the GPU memory:
    int iters;
    double error;
    thrust::device_vector<double> f(rhs);
    thrust::device_vector<double> x(rows, 0.0);

    prof.tic("solve");
    std::tie(iters, error) = solve(f, x);
    prof.toc("solve");

    // Output the number of iterations, the relative error,
    // and the profiling data:
    std::cout << "Iters: " << iters << std::endl
              << "Error: " << error << std::endl
              << prof << std::endl;
}

With this code I was able to run ucube with CUDA and, setting aside the time spent transferring the two sets of matrices and vectors to the GPU, the GPU version was faster than the CPU one. But I found the solution error to be very different between the CPU version and the GPU version.

For example, using the pair ucube_3_A.bin, ucube_3_b.bin with stokes_ucube produces this error: 3.40046e-09,
while using the converted pair ucube_3_A.mm, ucube_3_b.mm with stokes_ucube_cuda (the code above) produces this error: 0.00693082.

What am I doing wrong?

@nibblelab (Author)

Also, there is another question. When running both stokes_ucuba_cuda and the benchmark solver_cuda, I've noticed that the GPU versions spend a lot of time setting up the matrices and vectors on the GPU, and in the overall run this tends to reduce the gain from the GPU solver.

I kind of expected some overhead when using the GPU, because of the additional communication over the PCIe bus, which tends to be slower than the CPU-RAM path, but what I've found is that this communication overhead can severely impact the benefit of using the GPU.

To back my claims with data, I've run solver_cuda with 32x32x32, 64x64x64, 128x128x128 and 256x256x256 meshes on my machine, a Ryzen 9 5900X 12c/24t (max 4.9 GHz), 64 GB DDR4-3200 RAM, RTX 4070 12 GB. Here is what I've found:

[Image: timing comparison of the CPU and GPU runs for the four mesh sizes]

Except for the smallest mesh, the GPU solver is faster than the CPU solver. But when I consider the whole time to run the simulation, which includes the time spent communicating with the GPU, the gain from the GPU solver is greatly reduced by the CPU-GPU communication.

Since I'm new to this GPU solver thing, is this behavior expected or am I doing something wrong?

@ddemidov (Owner)

I'd say this is expected, especially if the solver is expensive to set up (e.g. when an ILU-type relaxation is used). If you cannot reuse the preconditioner across many time steps, then you may want to look at one that is simpler to set up, even if it takes more iterations to solve.
Also, you could look at other AMG implementations. For example, NVIDIA's AMGX does both the setup and solution phases on the GPU, so it could perform better in your case.
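
For reference, here is a minimal sketch of what reusing the setup could look like, assuming the matrix stays the same between steps and only the RHS changes (it reuses Solver, A, prm, bprm, and rows from the code above; nsteps and next_rhs are just placeholders):

// The setup (the expensive part) is done once, outside the time loop:
Solver solve(A, prm, bprm);

thrust::device_vector<double> f(rows), x(rows, 0.0);

for (int step = 0; step < nsteps; ++step) {
    // Only the new RHS is copied to the GPU each step; the preconditioner is reused:
    thrust::copy(next_rhs.begin(), next_rhs.end(), f.begin());

    int iters;
    double error;
    std::tie(iters, error) = solve(f, x);
}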

@ddemidov (Owner)

ddemidov commented Jan 18, 2025

But I found the solution error to be very different between the CPU version and the GPU version.

This can sometimes be explained by numerical instability. For example, a parallel reduction (the vector sum in the inner product implementation) may give slightly different results depending on the order of operations, which may lead to different solutions, especially if the system matrix is badly conditioned.
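
One way to check which run is actually closer to the true solution is to copy the GPU result back to the host and compute the true relative residual ||b - A x|| / ||b||. A small host-side sketch, reusing the rows, ptr, col, val, rhs, and x variables from the code above:

#include <cmath> // for std::sqrt, in addition to the headers already included above

// Copy the solution back from the GPU:
std::vector<double> X(rows);
thrust::copy(x.begin(), x.end(), X.begin());

// Compute the true residual r = b - A*x on the host using the CRS arrays:
double r2 = 0, b2 = 0;
for (ptrdiff_t i = 0; i < rows; ++i) {
    double Ax = 0;
    for (ptrdiff_t j = ptr[i]; j < ptr[i+1]; ++j)
        Ax += val[j] * X[col[j]];
    double r = rhs[i] - Ax;
    r2 += r * r;
    b2 += rhs[i] * rhs[i];
}
std::cout << "True relative residual: " << std::sqrt(r2 / b2) << std::endl;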

@nibblelab (Author)

But I found the solution error to be very different between the CPU version and the GPU version.

This can sometimes be explained by numerical instability. For example, a parallel reduction (the vector sum in the inner product implementation) may give slightly different results depending on the order of operations, which may lead to different solutions, especially if the system matrix is badly conditioned.

Thanks for the reply again. But I have new questions.

Is the code I've used to port ucube to the GPU wrong? It is the snippet I've shown in this thread, where I replaced parts of stokes_ucube with parts of solver_cuda.

About the error: in the benchmark case the solution error from the CPU and the GPU is pretty much the same, so in that case the matrix was well conditioned and there was no numerical instability. But for ucube the matrix is badly conditioned, and this generates this kind of problem. Is that right? Assuming that is the case, how could I deal with it?

I'm asking because I'm integrating amgcl into a CFD code that, more often than not, runs cases with very badly conditioned matrices. I know this because we currently use PETSc, and it very commonly has a hard time solving the linear systems.

@nibblelab (Author)

I'd say this is expected, especially if the solver is expensive to set up (e.g. when an ILU-type relaxation is used). If you cannot reuse the preconditioner across many time steps, then you may want to look at one that is simpler to set up, even if it takes more iterations to solve. Also, you could look at other AMG implementations. For example, NVIDIA's AMGX does both the setup and solution phases on the GPU, so it could perform better in your case.

Thanks again for the reply. Is there any preconditioner I could use in amgcl to test this?

I'll look at testing AMGX too. I had been planning to test that library later because, as far as I've checked, it only runs on the GPU, and I need an AMG library that works both on the CPU and on the GPU, hence my interest in amgcl. But I'll check AMGX as well. Thanks again.

@ddemidov (Owner)

Is the code I've used to port ucube to the GPU wrong? It is the snippet I've shown in this thread, where I replaced parts of stokes_ucube with parts of solver_cuda.

The code looks fine to me.

Is that right? Assuming that is the case, how could I deal with it?

Usually you need a preconditioner that is well suited or even tailored to the problem. In the case of ucube, you could use the Schur pressure correction composite preconditioner from the same tutorial the problem comes from.
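
Something along these lines might work as a starting point. This is an untested sketch: the composition follows the Stokes tutorial, with the cuda backend and spai0 relaxation substituted, and the pmask setup is an assumption that depends on how the pressure unknowns are ordered in your system.

// In addition to the headers already used in your code above:
#include <amgcl/preconditioner/schur_pressure_correction.hpp>
#include <amgcl/relaxation/as_preconditioner.hpp>
#include <amgcl/solver/preonly.hpp>
#include <amgcl/solver/fgmres.hpp>

typedef amgcl::backend::cuda<double> Backend;

// Inner solver for the "flow" (velocity) block:
typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::spai0
        >,
    amgcl::solver::preonly<Backend>
    > USolver;

// Inner solver for the pressure block:
typedef amgcl::make_solver<
    amgcl::relaxation::as_preconditioner<Backend, amgcl::relaxation::spai0>,
    amgcl::solver::preonly<Backend>
    > PSolver;

// Outer solver: fgmres preconditioned with Schur pressure correction:
typedef amgcl::make_solver<
    amgcl::preconditioner::schur_pressure_correction<USolver, PSolver>,
    amgcl::solver::fgmres<Backend>
    > Solver;

Solver::params prm;
prm.precond.pmask.resize(rows);
for (ptrdiff_t i = 0; i < rows; ++i)
    prm.precond.pmask[i] = (i % 4 == 3); // assumption: pressure is the 4th unknown of each node

Backend::params bprm;
cusparseCreate(&bprm.cusparse_handle);

Solver solve(A, prm, bprm);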

@nibblelab (Author)

Thanks again for the replies.

I've tried to implement the Schur pressure correction preconditioner for CUDA, using the CPU version as a reference, but the code doesn't even compile.

Then, reading the docs for the ucube problem, there are instructions for using the VexCL library for a hybrid CPU-GPU solution. There is also a ported version of ucube for VexCL. So I downloaded the VexCL library and recompiled amgcl with it, but when I try to run the code, it crashes with a CUDA driver API error:

[Image: screenshot of the CUDA driver API error]

stokes_ucube_cuda ldd:

[Image: screenshot of the ldd output]

How could I fix this error?

Also, I've tested AMGX and it indeed has a lower setup time. I'll consider integrating AMGX into the CFD code as well.
