vectorize-pr #856
base: branch-24.03
Conversation
Co-authored-by: Bryan Van de Ven <[email protected]>
It looks like there's a lot of work needed to "lift" the original element-wise code to an operation that operates on the entire array. Would it be possible to instead ask numba to compile the original element-wise code to a device function, like it's done in cudf https://github.com/rapidsai/cudf/blob/branch-23.06/python/cudf/cudf/utils/cudautils.py#L251, then just call that function per-element inside a kernel?
@manopapad: I don't think we can do it: I believe the PTX code generated by Numba would only work on densely allocated data. I had to add manual index calculation here for the general case.
Collecting here the gist of some offline discussions with @gmarkall and @muraj. Example using device=True:
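A minimal sketch of that mode, assuming numba.cuda.compile_ptx with device=True (my own example, not the original snippet from the discussion):

```python
from numba import cuda, float64

def add_scalar(x, y):
    return x + y

# device=True asks Numba for a PTX *device function* (callable from another
# kernel) instead of a full kernel; compile_ptx returns the PTX text and the
# resolved return type.
ptx, resty = cuda.compile_ptx(add_scalar, (float64, float64), device=True)
print(resty)       # float64
print(ptx[:300])   # the .visible .func declaration appears near the top
```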
My idea was that we could use this instead of producing a full kernel: pass the generated PTX to the C++ code, and call it within our GPU tasks, e.g. using the
Note that the generated device functions follow a specific ABI, which we'll need to follow when calling them (I may or may not be doing it correctly in the snippet above).

Once we have the PTX for the device function, there is the question of how to wrap it in our generic kernel. We will presumably need a template for different numbers of dimensions, data types, and numbers of inputs/outputs. This template would need to be dynamically instantiated somehow for each new UDF. The actual combining of the two pieces could be done using the

The smoothest fit within our stack would be if the kernel itself could be written in C++, so that it can use all the Legion Accessor classes directly, which make it easier to support NumPy’s broadcasting, views etc. It is possible to create a kernel around the device function and pass that to numba (in fact, that is how cudf supports UDFs today), but in that case we can't reuse the C++ Accessor classes, and instead have to "reimplement" that on the python side (that is what Irina is doing today).

My hunch overall is that this mode, of asking numba to generate scalar-only functions and using those in our own launching logic, is a better fit for our usecase:
For completeness, let me list the reasons why (I think) the default full-kernel PTX compilation mode is not working out of the box for us:
Some questions that I have about the feasibility of the approach:
Thanks for the nice summary, @manopapad !
I think you're following the ABI correctly in your snippet above (I can't see an error, anyway! 🙂)
This all makes sense to me.
For LTO it would be recommended to use the nvJitLink API in CUDA 12.0 onwards, as LTO was removed from the driver. I would like to have an option for
It is possible to call C++ functions from Python kernels, but you need to do it through an
This is the case if you type the input arguments as being pointers to the data, rather than as arrays. It is not documented in the ABI documentation (because I wanted to find a more efficient way to support this) but if you gave the type as an array type (e.g.
Supposing instead you were passing a 2D array with the type declared as
Although it's not documented, if you wanted to rely on this I could add it to the documentation - this ABI is pretty baked into Numba at present so it should be pretty stable - I don't think it's changed since 2015.
I've not yet looked at the PR code (just tried to answer / comment here first) but for a first pass I think the approach of calling the PTX from your kernel and linking the PTX into your unlinked cubin will be the most straightforward way to get started, then an alternative approach can be explored later.
Not exactly - let me try to work out a close equivalent and post it in a follow-up.
I think your assessment is correct here, but I think your comments imply that this doesn't fit exactly into cuNumeric as it is - what would be the ideal way for multiple outputs to be handled from the cuNumeric perspective?
A quick prototype of a close equivalent, compiling to CPU assembly:

import math
from numba import float64
from numba.core import compiler, sigutils
from numba.core.compiler_lock import global_compiler_lock
from numba.core.registry import cpu_target
@global_compiler_lock
def compile_asm(func, sig):
    typingctx = cpu_target.typing_context
    targetctx = cpu_target.target_context
    flags = compiler.Flags()
    flags.no_cpython_wrapper = True
    flags.no_cfunc_wrapper = True
    args, return_type = sigutils.normalize_signature(sig)
    cres = compiler.compile_extra(
        typingctx=typingctx,
        targetctx=targetctx,
        func=func,
        args=args,
        return_type=return_type,
        flags=flags,
        locals={},
    )
    return cres.library.get_asm_str(), return_type


def foo(x):
    return math.sqrt(x)


asm, _ = compile_asm(foo, (float64,))
print(asm)

and produces:

.text
.file "<string>"
.globl _ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd
.p2align 4, 0x90
.type _ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,@function
_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd:
vsqrtsd %xmm0, %xmm0, %xmm0
vmovsd %xmm0, (%rdi)
xorl %eax, %eax
retq
.Lfunc_end0:
.size _ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd, .Lfunc_end0-_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd
.type _ZN08NumbaEnv8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,@object
.comm _ZN08NumbaEnv8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,8,8
.section ".note.GNU-stack","",@progbits

Note that this followed the x86_64 System V ABI for a function with prototype:

retcode_t func(ret_type *ret, excinfo **exc, <args>);

This is slightly different to the CUDA target in that there is an extra parameter for exception info, which I think you can ignore (assuming you're not going to support Python exceptions from compiled code in cuNumeric). In the example above, the return code of

If this looks like it's going in the right direction, we can firm up the best way to go about implementing / integrating this.
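To make the calling convention concrete, here is a hedged ctypes sketch of how a caller might invoke a function with that prototype for the foo example above (the int32 status type, the success value, and the way the address is obtained are assumptions on my part, not documented Numba behaviour):

```python
import ctypes

# retcode_t func(ret_type *ret, excinfo **exc, <args>)
# For foo(x: float64) -> float64 this is roughly:
FooProto = ctypes.CFUNCTYPE(
    ctypes.c_int32,                   # retcode_t status (assumed 0 == success)
    ctypes.POINTER(ctypes.c_double),  # double *ret
    ctypes.POINTER(ctypes.c_void_p),  # excinfo **exc (ignored here)
    ctypes.c_double,                  # x
)

def call_foo(address, x):
    """Call a compiled foo located at `address`, following the ABI above."""
    foo = FooProto(address)
    ret = ctypes.c_double()
    exc = ctypes.c_void_p()
    status = foo(ctypes.byref(ret), ctypes.byref(exc), x)
    assert status == 0
    return ret.value
```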
@manopapad Thank you for starting this conversation.
@gmarkall Thank you so much for helping us out with this!
I spent way too much time digging into this. Apparently once the code has made it to PTX, an uninlined device function will never be inlined into the calling kernel, no matter how the linking is done (e.g. the old

Therefore, I suggest we go ahead and pre-compile the containing kernel as PTX (using something like
The ideal in this situation would be to have each return value in the output become a separate pointer:
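A minimal sketch of what I mean, with a hypothetical two-output UDF (the names and the commented lowered signature are illustrative only, not something Numba generates today):

```python
# Hypothetical element-wise UDF with two return values.
def sum_diff(x, y):
    return x + y, x - y

# Ideally this would lower to a device function along the lines of
#   retcode_t sum_diff(double *out_sum, double *out_diff, double x, double y);
# i.e. one output pointer per returned value.
```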
Then we could pass the two output arrays in SoA format, and pass one pointer from each to each function call. However, I would be hesitant to suggest implementing this mode, unless it has been requested by someone else. The way
Looks good to me, but I'll let Irina see how easy it would be to actually use this :-) @gmarkall One question: would the
I think we can just call directly into the device function that numba generates:
The actual name of the device function will be something auto-generated like
This CUDA sample uses the old

Actually for now we can just always use
Glad to help out where I can 🙂
Thanks for all that digging - I had been a little surprised at the performance of uninlined PTX functions in general with Numba, in that there didn't seem to be as much of a performance penalty as I was expecting - the lightweight call explains what I'd been observing in the past.
That makes sense, and sounds pretty close to the approach used in cuDF for e.g. string UDFs and groupby-apply that make use of C++ device functions (and the planned approach for when LTO-IR is available).
I think I've had a request for something similar on the Numba discourse, so I'd like to spend a little time looking into whether this can be implemented in a relatively straightforward way on top of the most recent changes in Numba 0.57 (RC is out at the moment, release due soon) - will post back here if I come across a straightforward way to do things.
It looks like it provides the right sort of ABI (or close to it) but I wonder whether it does too much, in that you get back a function pointer so you can call the code loaded into the process - does that take away too much control from you, and you need to handle compilation / loading on the node(s)?
In the past I've hacked around this by searching the PTX for things like
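For what it's worth, a sketch of that kind of hack in Python, assuming Numba-style PTX where device functions show up in .visible .func declarations (the regex is illustrative, not exhaustive):

```python
import re

def find_device_function_names(ptx):
    """Pull the mangled names out of .visible .func declarations in PTX text."""
    pattern = re.compile(
        r"^\.visible\s+\.func\s*(?:\([^)]*\)\s*)?([\w$]+)\s*\(",
        re.MULTILINE,
    )
    return pattern.findall(ptx)
```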
That's a good point. Looking back at your earlier suggestion, it looks like the choices are:
I feel like the second option is less fuss. One approach I can imagine would involve tagging each UDF with a "global" ID that all processes agree on (just an auto-incrementing counter or something), but each process can have a local function pointer associated with that. We would need to make a CFFI call to the C++ layer to cache the function pointer for each tag in a static map, so that the point tasks that use that UDF later will find the actual code to call. We wouldn't need to involve Legion at all in the caching, since this is purely process-local information. I should note that the UDF would only get registered on processes where Numba is running, so there is the theoretical risk that a point task will not find a function pointer registered, but in Legate we typically have a Python interpreter running on every process, so this scenario won't be an issue for us.
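A rough sketch of the process-local bookkeeping this would need, with a plain dict standing in for the static map that would really live on the C++ side behind a CFFI call (all names here are hypothetical):

```python
import itertools
from typing import Callable, Dict

# Globally agreed, auto-incrementing UDF ids; every process registers UDFs in
# the same order, so the ids line up across processes.
_udf_counter = itertools.count()

# Stand-in for the static map on the C++ side, keyed by UDF id.
_local_udf_pointers: Dict[int, int] = {}

def register_udf(get_function_pointer: Callable[[], int]) -> int:
    """Assign a global id to a UDF and cache this process's function pointer."""
    udf_id = next(_udf_counter)
    _local_udf_pointers[udf_id] = get_function_pointer()
    return udf_id

def lookup_udf(udf_id: int) -> int:
    """What a point task would do: look up the locally cached pointer by id."""
    return _local_udf_pointers[udf_id]
```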
Committing some old comments I had; these may or may not be applicable if we switch to using device functions.
pa.bool_: ty.bool_,
pa.int8: ty.int8,
pa.int16: ty.int16,
pa.int32: ty.int32,
pa.int64: ty.int64,  # np.int is int
pa.uint8: ty.uint8,
pa.uint16: ty.uint16,
pa.uint32: ty.uint32,
pa.uint64: ty.uint64,  # np.uint is np.uint64
pa.float16: ty.float16,
pa.float32: ty.float32,
pa.float64: ty.float64,
Python-level arithmetic values can only be bool/int/float/complex. You'll probably also need to remove the import pyarrow at the top.
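For illustration, the reduced mapping might then look something like this (a minimal sketch; the SCALAR_TYPE_MAP name is hypothetical and NumPy dtypes stand in for whatever ty.* values the file actually uses):

```python
import numpy as np

# Python-level arithmetic values can only be bool/int/float/complex,
# so only four entries are needed.
SCALAR_TYPE_MAP = {
    bool: np.dtype(np.bool_),
    int: np.dtype(np.int64),
    float: np.dtype(np.float64),
    complex: np.dtype(np.complex128),
}
```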
def convert_to_cunumeric_dtype(dtype: Any) -> Any:
    if dtype in CUNUMERIC_TYPE_MAP:
        return CUNUMERIC_TYPE_MAP[dtype]
    raise TypeError("dtype is not supported")
Suggested change:

def convert_to_cunumeric_dtype(dtype: type) -> ty.Dtype:
    if dtype in CUNUMERIC_TYPE_MAP:
        return CUNUMERIC_TYPE_MAP[dtype]
    raise TypeError(f"{dtype} is not supported")
fprintf(stderr, "UDF function wasn't generated yet");
LEGATE_ABORT;
}
return udf_caches_[hash];
Suggested change:

return finder->second;
{
auto finder = udf_caches_.find(hash);
if (udf_caches_.end() == finder) {
fprintf(stderr, "UDF function wasn't generated yet");
Suggested change:

fprintf(stderr, "UDF function has not been generated yet");
@@ -113,6 +121,11 @@ class Pitches<0, C_ORDER> {
point[0] += index;
return point;
}
__CUDA_HD__
inline const size_t* data(void) { return &pitches[0]; }
Can we just return NULL in this case?
auto device_pitches = create_buffer<int64_t>(Point<1>(DIM - 1), Memory::Kind::Z_COPY_MEM);
auto device_strides = create_buffer<int64_t>(Point<1>(DIM), Memory::Kind::Z_COPY_MEM);
Suggested change:

auto device_pitches = create_buffer<uint64_t>(Point<1>(DIM - 1), Memory::Kind::Z_COPY_MEM);
auto device_strides = create_buffer<uint64_t>(Point<1>(DIM), Memory::Kind::Z_COPY_MEM);
nitpick, but that's what the function signature expects
CUresult status = cuLaunchKernel(
  func, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, 0, stream, NULL, config);
if (status != CUDA_SUCCESS) {
  fprintf(stderr, "Failed to launch a CUDA kernel\n");
  assert(false);
}
Suggested change:

CHECK_CUDA(cuLaunchKernel(
  func, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, 0, stream, NULL, config));
Driver and runtime error codes are aligned, so it should be acceptable to reuse CHECK_CUDA here.
{
int64_t ptx_hash = context.scalars()[0].value<int64_t>();
std::string ptx = context.scalars()[1].value<std::string>();
Processor point = legate::Processor::get_executing_processor();
This value is not used.
#if CUDA_VERSION >= 6050
const char *name, *str;
assert(cuGetErrorName(result, &name) == CUDA_SUCCESS);
assert(cuGetErrorString(result, &str) == CUDA_SUCCESS);
fprintf(stderr, "CU: cuModuleLoadDataEx = %d (%s): %s\n", result, name, str);
#else
fprintf(stderr, "CU: cuModuleLoadDataEx = %d\n", result);
#endif
exit(-1);
Suggested change:

CHECK_CUDA(result);
I believe we can fall back to the existing CUDA error reporting routines, after we've printed out the JIT-specific logs.
"ERROR: Device side asserts are not supported by the " | ||
"CUDA driver for MAC OSX, see NVBugs 1628896.\n"); |
"ERROR: Device side asserts are not supported by the " | |
"CUDA driver for MAC OSX, see NVBugs 1628896.\n"); | |
"ERROR: Device side asserts are not supported by the " | |
"CUDA driver for MAC OSX.\n"); |
The nvbug is not going to be accessible to most users (I know it's also referenced in Legion, but same comment applies there too).
This PR adds support for np.vectorize to cuNumeric. It depends on nv-legate/legate#640.

This implementation has the following limitations:
I had to manually add typings for Numba to avoid pre-commit errors. I don’t think I did it right and asked @bryevdv to look at it.
And thank you @bryevdv for helping to fix the documentation issues.
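For context, a minimal usage sketch of the np.vectorize API this PR adds (the exact argument handling may differ from NumPy given the limitations above):

```python
import cunumeric as np

def saxpy(a, x, y):
    return a * x + y

vf = np.vectorize(saxpy)

a = np.full(5, 2.0)
x = np.arange(5, dtype=np.float64)
y = np.ones(5)
print(vf(a, x, y))   # element-wise: [1. 3. 5. 7. 9.]
```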