vectorize-pr #856

Open: wants to merge 88 commits into base: branch-24.03

Conversation
Conversation

@ipdemes (Contributor) commented Mar 24, 2023:

This PR adds support for np.vectorize to cuNumeric.
It depends on nv-legate/legate#640.

This implementation has the following limitations:

  1. It requires all output arrays or scalars to have the same type and shape. This is something we will probably have to keep requiring; at least, I don't see a way to support anything else in a performant way.

  2. It requires all input arrays to have the same shape. This is something we can fix with broadcasting.

  3. There is currently no support for "excluded" and "signature".



I had to manually add typings for Numba to avoid pre-commit errors. I don't think I did it right and have asked @bryevdv to look at it.

And thank you @bryevdv for helping with fixing the documentation issues.

@ipdemes ipdemes changed the title Vectorize vectorize-pr Mar 28, 2023
@ipdemes ipdemes requested a review from bryevdv April 3, 2023 22:03
@manopapad (Contributor) commented:

It looks like there's a lot of work needed to "lift" the original element-wise code to an operation that operates on the entire array. Would it be possible to instead ask Numba to compile the original element-wise code to a device function, as is done in cuDF (https://github.com/rapidsai/cudf/blob/branch-23.06/python/cudf/cudf/utils/cudautils.py#L251), and then just call that function per-element inside a kernel?

@ipdemes (Contributor, Author) commented Apr 4, 2023:

@manopapad: I don't think we can do that: I believe the PTX code generated by Numba would only work on densely allocated data. I had to add manual index calculation here for the general case.

@manopapad (Contributor) commented:

Collecting here the gist of some offline discussions with @gmarkall and @muraj:

With device=True, numba.cuda.compile_ptx produces a device function instead of a kernel:

Example using device=True
prm-login:/gpfs/fs1/mpapadakis/cunumeric> cat c.py
import numpy as np
import math
from numba import float64
from numba.cuda import compile_ptx
def foo(x):
    return math.sqrt(x)
(ptx, _) = compile_ptx(foo, (float64,), device=True)
print(ptx)
mpapadakis@prm-dgx-05:/gpfs/fs1/mpapadakis/cunumeric$ python c.py
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-29618528
// Cuda compilation tools, release 11.2, V11.2.152
// Based on NVVM 7.0.1
//

.version 7.2
.target sm_53
.address_size 64

	// .globl	_ZN8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd
.common .global .align 8 .u64 _ZN08NumbaEnv8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd;

.visible .func  (.param .b32 func_retval0) _ZN8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd(
	.param .b64 _ZN8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd_param_0,
	.param .b64 _ZN8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd_param_1
)
{
	.reg .b32 	%r<2>;
	.reg .f64 	%fd<3>;
	.reg .b64 	%rd<2>;


	ld.param.u64 	%rd1, [_ZN8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd_param_0];
	ld.param.f64 	%fd1, [_ZN8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd_param_1];
	sqrt.rn.f64 	%fd2, %fd1;
	st.f64 	[%rd1], %fd2;
	mov.u32 	%r1, 0;
	st.param.b32 	[func_retval0+0], %r1;
	ret;

}

My idea was that we could use this instead of producing a full kernel, pass the generated PTX to the C++ code, and call it within our GPU tasks, e.g. using the ParallelLoop policy:

  __CUDA_HD__ void operator()(const size_t idx, SparseTag) const noexcept
  {
    auto p = pitches.unflatten(idx, rect.lo);
    kernel(&out[p], in0[p], in1[p], in2[p]);
  }

Note that the generated device functions follow a specific ABI, which we'll need to follow when calling them (I may or may not be doing it correctly in the snippet above).

Once we have the PTX for the device function, there is the question of how to wrap it in our generic kernel. We will presumably need to have a template for different number of dimensions, data types, and number of inputs/outputs. This template would need to be dynamically instantiated somehow for each new UDF.
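
To make the idea concrete, here is a very rough sketch of such a wrapper; all names and the signature are assumptions for illustration, not actual cuNumeric code, and a separate instantiation would be needed per dimension count, dtype combination, and input/output arity.

#include <cstddef>

// Assumed symbol for the Numba-generated device function (the real name is mangled).
extern "C" __device__ int numba_udf(double* out, double in0, double in1);

__global__ void udf_kernel(double* out, const double* in0, const double* in1,
                           size_t volume)
{
  const size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= volume) return;
  // For non-dense data, idx would first be unflattened into a multi-dimensional
  // point using the pitches, as in the accessor-based snippet above.
  numba_udf(&out[idx], in0[idx], in1[idx]);
}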

The actual combining of the two pieces could be done using the cuLink APIs (see e.g. https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/ptxjit; the global entry function would be compiled separately as an unlinked cubin with an external reference to the numba device function symbol). Or we could write/generate some PTX that does our kernel launching (same compute capability as the device function), textually splice-in the PTX coming from the UDF, and compile the whole thing. The latter would have the maximum potential for inlining, but maybe this isn't that relevant, given that LTO was added to cuLinkAdd* in CUDA 11.2. Either alternative might require some special handling in the cmake build.

The smoothest fit within our stack would be if the kernel itself could be written in C++, so that it can use all the Legion Accessor classes directly, which make it easier to support NumPy’s broadcasting, views etc. It is possible to create a kernel around the device function and pass that to numba (in fact, that is how cudf supports UDFs today), but in that case we can't reuse the C++ Accessor classes, and instead have to "reimplement" that on the python side (that is what Irina is doing today).

My hunch overall is that this mode, of asking Numba to generate scalar-only functions and using those in our own launching logic, is a better fit for our use case:

  • it would not require doing any source code parsing, which can easily break under us
  • it would fit with our OpenMP variants more easily

For completeness, let me list the reasons why (I think) the default full-kernel PTX compilation mode is not working out of the box for us:

  • The kernels that numba generates assume contiguous input arrays, which is not always the case in the presence of array views.
  • We don't want to launch the kernel on the python side. Instead we want the sub-tasks operating on the different parts of the array to launch their kernels on their local portions.

Some questions that I have about the feasibility of the approach:

  • What is the best way to combine the numba-generated PTX with our kernel (see above)?
  • Is there an equivalent numba path for generating CPU scalar-only functions, that we can use for the CPU variants?
  • Does this approach work with multiple outputs? It looks like the generated device function PTX assumes the two values are consecutive in memory.
    st.f64 	[%rd1], %fd2;
    st.f64 	[%rd1+8], %fd3;
    
    This would necessitate laying out the output arrays in AoS format, and colocating them.

@gmarkall commented Apr 5, 2023:

Thanks for the nice summary, @manopapad !

Note that the generated device functions follow a specific ABI, which we'll need to follow when calling them (I may or may not be doing it correctly in the snippet above).

I think you're following the ABI correctly in your snippet above (I can't see an error, anyway! 🙂)

Once we have the PTX for the device function, there is the question of how to wrap it in our generic kernel. We will presumably need to have a template for different number of dimensions, data types, and number of inputs/outputs. This template would need to be dynamically instantiated somehow for each new UDF.

This all makes sense to me.

The latter would have the maximum potential for inlining, but maybe this isn't that relevant, given that LTO was added to cuLinkAdd* in CUDA 11.2

For LTO it would be recommended to use the nvJitLink API in CUDA 12.0 onwards, as LTO was removed from the driver. I would like to have an option for compile_ptx to generate NVVM IR that can be used for LTO, but LTO-IR generation is not yet supported in NVVM (though it is something I have requested).

The smoothest fit within our stack would be if the kernel itself could be written in C++, so that it can use all the Legion Accessor classes directly, which make it easier to support NumPy’s broadcasting, views etc. It is possible to create a kernel around the device function and pass that to numba (in fact, that is how cudf supports UDFs today), but in that case we can't reuse the C++ Accessor classes, and instead have to "reimplement" that on the python side (that is what Irina is doing today).

It is possible to call C++ functions from Python kernels, but you need to do it through an extern "C" shim function, because Numba does not yet support the C++ ABI. I would like to make C++ functions directly callable from Python kernels, but have yet to implement that ABI support. Some shim functions for C++ string methods in cuDF are in shim.cu - the functions in this file are made available to Numba through the declare_device() method mentioned in the CUDA FFI docs.
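
As a hypothetical illustration of that shim pattern (these names are made up; the real examples live in cuDF's shim.cu), the shim wraps a C++ device function behind a C-ABI entry point that Numba can be pointed at via declare_device():

namespace mylib {
__device__ double scale_and_offset(double x);  // assumed existing C++ device function
}

extern "C" __device__ int shim_scale_and_offset(double* ret, double x)
{
  *ret = mylib::scale_and_offset(x);
  return 0;  // return-code convention: 0 signals "no exception"
}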

  • The kernels that numba generates assume contiguous input arrays, which is not always the case in the presence of array views.

This is the case if you type the input arguments as pointers to the data, rather than as arrays. It is not covered in the ABI documentation (because I wanted to find a more efficient way to support this), but if you give the type as an array type (e.g. types.float32[:] for a non-contiguous array of float32s), then as long as you construct the appropriate arguments for Numba, you can pass a non-contiguous array. As an example, suppose you did this with a float32[:] (a 1D, not-necessarily-contiguous array) and were passing a strided array that skips over every other element; you'd then need to give the following arguments for it in the argument list:

  • meminfo - a null: unused, for a "meminfo" object pointer.
  • parent - another null: also unused; refers to a CPython "parent" object.
  • nitems - an integer giving the number of items in the array (or view).
  • itemsize - an integer specifying the size of each item in bytes.
  • data - a pointer to the data.
  • shape - the length of the first (only) dimension, which will be equal to nitems.
  • strides - the strides of the first (only) dimension, which would be 8 if we're striding every other element of a 32-bit float.

Supposing instead you were passing a 2D array with the type declared as float32[:, :], the arguments would be the same, except shape and strides would be replaced by 4 arguments:

  • shape0 - the size of the first dimension
  • shape1 - the size of the second dimension
  • stride0 - the stride of elements in the first dimension
  • stride1 - the stride of elements in the second dimension

and nitems would be equal to shape0 * shape1. The pattern continues this way for increasing numbers of dimensions.

Although it's not documented, if you wanted to rely on this I could add it to the documentation - this ABI is pretty baked into Numba at present, so it should be fairly stable - I don't think it's changed since 2015.
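
For reference, a hypothetical C-side declaration matching the parameter list described above, for a device function typed with a single float32[:] argument, might look as follows; the function name, the return-slot type, and the exact integer widths are assumptions, and the real mangled name comes from the generated PTX.

#include <cstdint>

extern "C" __device__ int numba_udf_1d(
  float*  ret,        // slot for the return value (per the device ABI)
  void*   meminfo,    // unused; pass NULL
  void*   parent,     // unused; pass NULL
  int64_t nitems,     // number of items in the view
  int64_t itemsize,   // size of each item in bytes (4 for float32)
  float*  data,       // pointer to the first element of the view
  int64_t shape0,     // extent of the only dimension (equals nitems)
  int64_t stride0);   // byte stride between consecutive elements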

What is the best way to combine the numba-generated PTX with our kernel (see above)?

I've not yet looked at the PR code (just tried to answer / comment here first) but for a first pass I think the approach of calling the PTX from your kernel and linking the PTX into your unlinked cubin will be the most straightforward way to get started, then an alternative approach can be explored later.

Is there an equivalent numba path for generating CPU scalar-only functions, that we can use for the CPU variants?

Not exactly - let me try to work out a close equivalent and post it in a follow-up.

Does this approach work with multiple outputs? It looks like the generated device function PTX assumes the two values are consecutive in memory.

I think your assessment is correct here, but I think your comments imply that this doesn't fit exactly into cuNumeric as it is - what would be the ideal way for multiple outputs to be handled from the cuNumeric perspective?

@gmarkall commented Apr 5, 2023:

A quick prototype of a compile_ptx() equivalent for the CPU looks like:

import math
from numba import float64
from numba.core import compiler, sigutils
from numba.core.compiler_lock import global_compiler_lock
from numba.core.registry import cpu_target


@global_compiler_lock
def compile_asm(func, sig):
    typingctx = cpu_target.typing_context
    targetctx = cpu_target.target_context

    flags = compiler.Flags()
    flags.no_cpython_wrapper = True
    flags.no_cfunc_wrapper = True

    args, return_type = sigutils.normalize_signature(sig)

    cres = compiler.compile_extra(
        typingctx=typingctx,
        targetctx=targetctx,
        func=func,
        args=args,
        return_type=return_type,
        flags=flags,
        locals={},
    )

    return cres.library.get_asm_str(), return_type


def foo(x):
    return math.sqrt(x)


asm, _ = compile_asm(foo, (float64,))
print(asm)

and produces:

	.text
	.file	"<string>"
	.globl	_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd
	.p2align	4, 0x90
	.type	_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,@function
_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd:
	vsqrtsd	%xmm0, %xmm0, %xmm0
	vmovsd	%xmm0, (%rdi)
	xorl	%eax, %eax
	retq
.Lfunc_end0:
	.size	_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd, .Lfunc_end0-_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd

	.type	_ZN08NumbaEnv8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,@object
	.comm	_ZN08NumbaEnv8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,8,8
	.section	".note.GNU-stack","",@progbits

Note that this followed the x86_64 System V ABI for a function with prototype:

retcode_t func(ret_type *ret, excinfo **exc, <args>);

This is slightly different to the CUDA target in that there is an extra parameter for exception info, which I think you can ignore (assuming you're not going to support Python exceptions from compiled code in cuNumeric).

In the example above, the return code of 0 was given to indicate no exception, the result was stored in the value pointed-to by %rdi (because it's the first pointer argument) and the "Python argument" x was in %xmm0 as it is the first floating-point argument.
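
A minimal sketch of what calling through that ABI could look like from C++ (the typedef and names are assumptions; the exception info is simply ignored):

#include <cstdint>

using scalar_udf_t = std::int32_t (*)(double* ret, void** excinfo, double x);

double call_scalar_udf(scalar_udf_t udf, double x)
{
  double result  = 0.0;
  void*  excinfo = nullptr;
  std::int32_t status = udf(&result, &excinfo, x);  // 0 means "no exception"
  (void)status;  // Python exceptions are not supported here
  return result;
}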

If this looks like it's going in the right direction, we can firm up the best way to go about implementing / integrating this.

@ipdemes (Contributor, Author) commented Apr 5, 2023:

@manopapad Thank you for starting this conversation.
@gmarkall : thank you for your replies

I think the approach of calling the PTX from your kernel and linking the PTX into your unlinked cubin will be the most straightforward way to get started, then an alternative approach can be explored later.
Just to confirm I understand this correctly, do you mean something like this (using Manolis's example)?

 __CUDA_HD__ void operator()(const size_t idx, SparseTag) const noexcept
 {
   auto p = pitches.unflatten(idx, rect.lo);
   kernel(&out[p], in0[p], in1[p], in2[p]);
 }

 template <typename OUT, typename IN0, typename IN1, typename IN2>
 __CUDA_HD__ void kernel(OUT* out, IN0& in0, IN1& in1, IN2& in2)
 {
   // pack arguments for the numba-compiled UDF
   // call the numba-compiled UDF
 }

The actual combining of the two pieces could be done using the cuLink APIs (see e.g. https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/ptxjit; the global entry function would be compiled separately as an unlinked cubin with an external reference to the numba device function symbol)
Can someone point me to an example of how to do this? I was trying to search online, but wasn't able to find anything useful.

@manopapad (Contributor) commented:

@gmarkall Thank you so much for helping us out with this!

For LTO it would be recommended to use the nvJitLink API in CUDA 12.0 onwards, as LTO was removed from the driver. I would like to have an option for compile_ptx to generate NVVM IR that can be used for LTO, but LTO-IR generation is not yet supported in NVVM (though it is something I have requested).

I spent way too much time digging into this. Apparently once the code has made it to PTX, an uninlined device function will never be inlined into the calling kernel, no matter how the linking is done (e.g. the old cuLink APIs, the new nvJitLink APIs, or even textually combining the PTX snippets). Such an optimization is only possible when starting from CUDA C++ or LTO-IR. However, the situation is not as bad as if we were linking cubins, which would necessitate the use of a full ABI call (saving and restoring registers etc.). Instead the function call will be implemented using a relatively lightweight call, see the SASS on https://godbolt.org/z/sbvjxqeqP for an example.

Therefore, I suggest we go ahead and pre-compile the containing kernel as PTX (using something like nvcc -dc --ptx kernel.cu, and declaring the device function as an extern symbol), combine it at runtime with the device function PTX produced by numba, link those (using the cuLink APIs if CUDA<12.0, or the new nvJitLink APIs if CUDA>=12.0) then load the module with cuModuleLoadData. When LTO-IR generation becomes supported in numba, we can easily switch to that.
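
A minimal sketch of that runtime linking step, assuming the CUDA driver is initialized and a context is current (error checking elided, names illustrative):

#include <cuda.h>
#include <string>

CUfunction link_udf_kernel(const std::string& kernel_ptx,
                           const std::string& udf_ptx,
                           const char* kernel_name)
{
  CUlinkState state;
  cuLinkCreate(0, nullptr, nullptr, &state);

  // Both inputs are PTX; the kernel PTX declares the UDF as an extern symbol.
  cuLinkAddData(state, CU_JIT_INPUT_PTX, const_cast<char*>(udf_ptx.c_str()),
                udf_ptx.size() + 1, "udf.ptx", 0, nullptr, nullptr);
  cuLinkAddData(state, CU_JIT_INPUT_PTX, const_cast<char*>(kernel_ptx.c_str()),
                kernel_ptx.size() + 1, "kernel.ptx", 0, nullptr, nullptr);

  void*  cubin      = nullptr;
  size_t cubin_size = 0;
  cuLinkComplete(state, &cubin, &cubin_size);  // cubin is owned by the link state

  CUmodule module;
  cuModuleLoadData(&module, cubin);  // load before destroying the link state
  cuLinkDestroy(state);

  CUfunction func;
  cuModuleGetFunction(&func, module, kernel_name);
  return func;
}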

Does this approach work with multiple outputs? It looks like the generated device function PTX assumes the two values are consecutive in memory.

I think your assessment is correct here, but I think your comments imply that this doesn't fit exactly into cuNumeric as it is - what would be the ideal way for multiple outputs to be handled from the cuNumeric perspective?

The ideal in this situation would be to have each return value in the output become a separate pointer:

def fun(x):
  return (x + 1, x + 2)
# becomes
void fun(float* out1, float* out2, const float x) {
  *out1 = x + 1;
  *out2 = x + 2;
}

Then we could pass the two output arrays in SoA format, and pass one pointer from each to each function call. However, I would be hesitant to suggest implementing this mode, unless it has been requested by someone else. The way np.vectorize lifts multi-output functions to operate on distinct arrays is quite weird, so it's not reason enough (in my view) to change the (IMHO reasonable) way that numba currently handles tuple return values. We should be able to get around this by requesting colocation and AoS layout in the mapper.
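
As a hypothetical illustration of that colocation/AoS workaround (names and types are made up): since the device function writes its tuple result into consecutive slots, interleaving the two outputs per element puts each value where it belongs.

#include <cstddef>

struct Out2 {
  float first;   // first element of the returned tuple
  float second;  // second element of the returned tuple
};

extern "C" __device__ int numba_udf_pair(float* ret, float x);  // writes ret[0] and ret[1]

__device__ void apply_at(Out2* out, const float* in, std::size_t idx)
{
  numba_udf_pair(&out[idx].first, in[idx]);  // fills out[idx].first and out[idx].second
}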

If this looks like it's going in the right direction, we can firm up the best way to go about implementing / integrating this.

Looks good to me, but I'll let Irina see how easy it would be to actually use this :-)

@gmarkall One question, would the @numba.cfunc functionality be relevant here?

Just to confirm I understand this correctly, do you mean something like this (using Manolis's example)?

I think we can just call directly into the device function that numba generates:

extern __device__ void devfun(float*, float, float, float);

__CUDA_HD__ void operator()(const size_t idx, SparseTag) const noexcept
{
  auto p = pitches.unflatten(idx, rect.lo);
  devfun(&out[p], in0[p], in1[p], in2[p]);
}

The actual name of the device function will be something auto-generated like _ZN8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd, so we'll need to change one of the two PTX snippets to make the names match.

Can someone point me to an example of how to do this? I was trying to search online, but wasn't able to find anything useful.

This CUDA sample uses the old cuLink APIs, and this one uses the new nvJitLink APIs.

@manopapad (Contributor) commented:

link those (using the cuLink APIs if CUDA<12.0, or the new nvJitLink APIs if CUDA>=12.0)

Actually for now we can just always use cuLink, since that will work with PTX, and we can add nvJitLink when we need to handle LTO-IR.

@gmarkall commented:

@gmarkall Thank you so much for helping us out with this!

Glad to help out where I can 🙂

I spent way too much time digging into this. Apparently once the code has made it to PTX, an uninlined device function will never be inlined into the calling kernel, no matter how the linking is done (e.g. the old cuLink APIs, the new nvJitLink APIs, or even textually combining the PTX snippets). Such an optimization is only possible when starting from CUDA C++ or LTO-IR. However, the situation is not as bad as if we were linking cubins, which would necessitate the use of a full ABI call (saving and restoring registers etc.). Instead the function call will be implemented using a relatively lightweight call, see the SASS on https://godbolt.org/z/sbvjxqeqP for an example.

Thanks for all that digging - I had been a little surprised at the performance of uninlined PTX functions in general with Numba, in that there didn't seem to be as much of a performance penalty as I was expecting - the lightweight call explains what I'd been observing in the past.

Therefore, I suggest we go ahead and pre-compile the containing kernel as PTX (using something like nvcc -dc --ptx kernel.cu, and declaring the device function as an extern symbol), combine it at runtime with the device function PTX produced by numba, link those (using the cuLink APIs if CUDA<12.0, or the new nvJitLink APIs if CUDA>=12.0) then load the module with cuModuleLoadData. When LTO-IR generation becomes supported in numba, we can easily switch to that.

That makes sense, and sounds pretty close to the approach used in cuDF for e.g. string UDFs and groupby-apply that make use of C++ device functions (and the planned approach for when LTO-IR is available).

Then we could pass the two output arrays in SoA format, and pass one pointer from each to each function call. However, I would be hesitant to suggest implementing this mode, unless it has been requested by someone else.

I think I've had a request for something similar on the Numba discourse, so I'd like to spend a little time looking into whether this can be implemented in a relatively straightforward way on top of the most recent changes in Numba 0.57 (RC is out at the moment, release due soon) - will post back here if I come across a straightforward way to do things.

@gmarkall One question, would the @numba.cfunc functionality be relevant here?

It looks like it provides the right sort of ABI (or close to it) but I wonder whether it does too much, in that you get back a function pointer so you can call the code loaded into the process - does that take away too much control from you, and you need to handle compilation / loading on the node(s)?

The actual name of the device function will be something auto-generated like _ZN8__main__3fooB2v1B96cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKUBDUB3kNVDaQRKChQ_2bSEFA_2fkGdcqwkAEd, so we'll need to change one of the two PTX snippets to make the names match.

In the past I've hacked around this by searching the PTX for things like .globl to find the name and then doing a string replace with what I want the function to be called (or using the name I found directly).
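
A rough sketch of that string-replace approach (illustrative only, not an existing Numba or cuNumeric helper): find the mangled device-function symbol via its ".visible .func" declaration and rename every occurrence so it matches the extern declaration in the wrapping kernel.

#include <regex>
#include <stdexcept>
#include <string>

std::string rename_device_function(std::string ptx, const std::string& new_name)
{
  // Matches ".visible .func (optional return param) <mangled-name>".
  static const std::regex decl(
    R"(\.visible\s+\.func\s*(?:\([^)]*\)\s*)?([A-Za-z0-9_$]+))");
  std::smatch m;
  if (!std::regex_search(ptx, m, decl)) {
    throw std::runtime_error("no device function found in PTX");
  }
  const std::string old_name = m[1].str();
  for (std::size_t pos = 0; (pos = ptx.find(old_name, pos)) != std::string::npos;
       pos += new_name.size()) {
    ptx.replace(pos, old_name.size(), new_name);
  }
  return ptx;
}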

@manopapad (Contributor) commented:

@gmarkall One question, would the @numba.cfunc functionality be relevant here?

It looks like it provides the right sort of ABI (or close to it) but I wonder whether it does too much, in that you get back a function pointer so you can call the code loaded into the process - does that take away too much control from you, and you need to handle compilation / loading on the node(s)?

That's a good point. Looking back at your earlier suggestion, it looks like the choices are:

  • Have Numba give us assembly code for the scalar function (e.g. using Graham's prototype code from vectorize-pr #856 (comment)), and leave it to the individual tasks to assemble (presumably calling an LLVM API to do that).

  • Have Numba go all the way to generating a function pointer (e.g. using the @numba.cfunc API). This spares us from needing to call an assembler, but the function pointer is only valid on the same process where Numba is running.

I feel like the second option is less fuss. One approach I can imagine would involve tagging each UDF with a "global" ID, that all processes agree on (just an auto-incrementing counter or something), but each process can have a local function pointer associated with that. We would need to make a CFFI call to the C++ layer, to cache the function pointer for each tag in a static map, so that the point tasks that use that UDF later will find the actual code to call. We wouldn't need to involve Legion at all in the caching, since this is purely process-local information.

I should note that the UDF would only get registered on processes where Numba is running, so there is the theoretical risk that a point task will not find a function pointer registered, but in Legate we typically have a Python interpreter running on every process, so this scenario won't be an issue for us.
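
A minimal sketch of such a process-local registry; the entry points and names here are hypothetical, not existing cuNumeric/Legate APIs.

#include <cstdint>
#include <unordered_map>

// Assumed scalar-UDF signature following the CPU ABI discussed earlier.
using ScalarUDF = std::int32_t (*)(double* ret, void** excinfo, double x);

// Purely process-local: each process caches the pointer Numba produced locally,
// keyed by the globally agreed-upon UDF id.
static std::unordered_map<std::int64_t, ScalarUDF>& udf_registry()
{
  static std::unordered_map<std::int64_t, ScalarUDF> registry;
  return registry;
}

extern "C" void cunumeric_register_udf(std::int64_t udf_id, std::uintptr_t fn_ptr)
{
  udf_registry()[udf_id] = reinterpret_cast<ScalarUDF>(fn_ptr);
}

extern "C" ScalarUDF cunumeric_lookup_udf(std::int64_t udf_id)
{
  auto it = udf_registry().find(udf_id);
  return it == udf_registry().end() ? nullptr : it->second;
}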

@marcinz marcinz changed the base branch from branch-23.05 to branch-23.07 May 18, 2023 20:26
@marcinz marcinz changed the base branch from branch-23.07 to branch-23.09 July 18, 2023 15:42
@marcinz marcinz changed the base branch from branch-23.09 to branch-23.11 September 26, 2023 00:36
@marcinz marcinz changed the base branch from branch-23.11 to branch-24.01 November 9, 2023 17:13
@manopapad (Contributor) left a review comment:

Committing some old comments I had; these may or may not be applicable if we switch to using device functions.

Comment on lines +51 to +62
pa.bool_: ty.bool_,
pa.int8: ty.int8,
pa.int16: ty.int16,
pa.int32: ty.int32,
pa.int64: ty.int64, # np.int is int
pa.uint8: ty.uint8,
pa.uint16: ty.uint16,
pa.uint32: ty.uint32,
pa.uint64: ty.uint64, # np.uint is np.uint64
pa.float16: ty.float16,
pa.float32: ty.float32,
pa.float64: ty.float64,

Suggested change
pa.bool_: ty.bool_,
pa.int8: ty.int8,
pa.int16: ty.int16,
pa.int32: ty.int32,
pa.int64: ty.int64, # np.int is int
pa.uint8: ty.uint8,
pa.uint16: ty.uint16,
pa.uint32: ty.uint32,
pa.uint64: ty.uint64, # np.uint is np.uint64
pa.float16: ty.float16,
pa.float32: ty.float32,
pa.float64: ty.float64,

Python-level arithmetic values can only be bool/int/float/complex. You'll probably also need to remove the import pyarrow at the top.

Comment on lines +117 to +120
def convert_to_cunumeric_dtype(dtype: Any) -> Any:
if dtype in CUNUMERIC_TYPE_MAP:
return CUNUMERIC_TYPE_MAP[dtype]
raise TypeError("dtype is not supported")

Suggested change
def convert_to_cunumeric_dtype(dtype: Any) -> Any:
if dtype in CUNUMERIC_TYPE_MAP:
return CUNUMERIC_TYPE_MAP[dtype]
raise TypeError("dtype is not supported")
def convert_to_cunumeric_dtype(dtype: type) -> ty.Dtype:
if dtype in CUNUMERIC_TYPE_MAP:
return CUNUMERIC_TYPE_MAP[dtype]
raise TypeError(f"{dtype} is not supported")

fprintf(stderr, "UDF function wasn't generated yet");
LEGATE_ABORT;
}
return udf_caches_[hash];

Suggested change
return udf_caches_[hash];
return finder->second;

{
auto finder = udf_caches_.find(hash);
if (udf_caches_.end() == finder) {
fprintf(stderr, "UDF function wasn't generated yet");

Suggested change
fprintf(stderr, "UDF function wasn't generated yet");
fprintf(stderr, "UDF function has not been generated yet");

@@ -113,6 +121,11 @@ class Pitches<0, C_ORDER> {
point[0] += index;
return point;
}
__CUDA_HD__
inline const size_t* data(void) { return &pitches[0]; }

Can we just return NULL in this case?

Comment on lines +78 to +79
auto device_pitches = create_buffer<int64_t>(Point<1>(DIM - 1), Memory::Kind::Z_COPY_MEM);
auto device_strides = create_buffer<int64_t>(Point<1>(DIM), Memory::Kind::Z_COPY_MEM);

Suggested change
auto device_pitches = create_buffer<int64_t>(Point<1>(DIM - 1), Memory::Kind::Z_COPY_MEM);
auto device_strides = create_buffer<int64_t>(Point<1>(DIM), Memory::Kind::Z_COPY_MEM);
auto device_pitches = create_buffer<uint64_t>(Point<1>(DIM - 1), Memory::Kind::Z_COPY_MEM);
auto device_strides = create_buffer<uint64_t>(Point<1>(DIM), Memory::Kind::Z_COPY_MEM);

nitpick, but that's what the function signature expects

Comment on lines +107 to +112
CUresult status = cuLaunchKernel(
func, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, 0, stream, NULL, config);
if (status != CUDA_SUCCESS) {
fprintf(stderr, "Failed to launch a CUDA kernel\n");
assert(false);
}

Suggested change
CUresult status = cuLaunchKernel(
func, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, 0, stream, NULL, config);
if (status != CUDA_SUCCESS) {
fprintf(stderr, "Failed to launch a CUDA kernel\n");
assert(false);
}
CHECK_CUDA(cuLaunchKernel(
func, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, 0, stream, NULL, config));

Driver and runtime error codes are aligned, so it should be acceptable to reuse CHECK_CUDA here.

{
int64_t ptx_hash = context.scalars()[0].value<int64_t>();
std::string ptx = context.scalars()[1].value<std::string>();
Processor point = legate::Processor::get_executing_processor();

This value is not used.

Comment on lines +64 to +72
#if CUDA_VERSION >= 6050
const char *name, *str;
assert(cuGetErrorName(result, &name) == CUDA_SUCCESS);
assert(cuGetErrorString(result, &str) == CUDA_SUCCESS);
fprintf(stderr, "CU: cuModuleLoadDataEx = %d (%s): %s\n", result, name, str);
#else
fprintf(stderr, "CU: cuModuleLoadDataEx = %d\n", result);
#endif
exit(-1);

Suggested change
#if CUDA_VERSION >= 6050
const char *name, *str;
assert(cuGetErrorName(result, &name) == CUDA_SUCCESS);
assert(cuGetErrorString(result, &str) == CUDA_SUCCESS);
fprintf(stderr, "CU: cuModuleLoadDataEx = %d (%s): %s\n", result, name, str);
#else
fprintf(stderr, "CU: cuModuleLoadDataEx = %d\n", result);
#endif
exit(-1);
CHECK_CUDA(result);

I believe we can fall back to the existing CUDA error reporting routines, after we've printed out the JIT-specific logs.

Comment on lines +56 to +57
"ERROR: Device side asserts are not supported by the "
"CUDA driver for MAC OSX, see NVBugs 1628896.\n");

Suggested change
"ERROR: Device side asserts are not supported by the "
"CUDA driver for MAC OSX, see NVBugs 1628896.\n");
"ERROR: Device side asserts are not supported by the "
"CUDA driver for MAC OSX.\n");

The nvbug is not going to be accessible to most users (I know it's also referenced in Legion, but same comment applies there too).

@marcinz marcinz changed the base branch from branch-24.01 to branch-24.03 February 22, 2024 01:06
Labels: category:new-feature (PR introduces a new feature and will be classified as such in release notes)

5 participants