vectorize-pr #856
base: branch-24.03
Conversation
Co-authored-by: Bryan Van de Ven <[email protected]>
It looks like there's a lot of work needed to "lift" the original element-wise code to an operation that operates on the entire array. Would it be possible to instead ask numba to compile the original element-wise code to a device function, like it's done in cudf https://github.com/rapidsai/cudf/blob/branch-23.06/python/cudf/cudf/utils/cudautils.py#L251, then just call that function per-element inside a kernel?
@manopapad: I don't think we can do it: I believe the PTX code generated by Numba would only work on densely allocated data. I had to add manual index calculation here for the general case.
Collecting here the gist of some offline discussions with @gmarkall and @muraj. Example using device=True:
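A minimal sketch of that mode, assuming numba.cuda.compile_ptx with device=True (my own example, not the original snippet from the discussion):

```python
from numba import cuda, float64

def add_scalar(x, y):
    return x + y

# device=True asks Numba for a PTX *device function* (callable from another
# kernel) instead of a full kernel; compile_ptx returns the PTX text and the
# resolved return type.
ptx, resty = cuda.compile_ptx(add_scalar, (float64, float64), device=True)
print(resty)       # float64
print(ptx[:300])   # the .visible .func declaration appears near the top
```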
My idea was that we could use this instead of producing a full kernel: pass the generated PTX to the C++ code, and call it within our GPU tasks, e.g. using the
Note that the generated device functions follow a specific ABI, which we'll need to follow when calling them (I may or may not be doing it correctly in the snippet above).

Once we have the PTX for the device function, there is the question of how to wrap it in our generic kernel. We will presumably need a template for different numbers of dimensions, data types, and numbers of inputs/outputs. This template would need to be dynamically instantiated somehow for each new UDF. The actual combining of the two pieces could be done using the

The smoothest fit within our stack would be if the kernel itself could be written in C++, so that it can use all the Legion Accessor classes directly, which make it easier to support NumPy’s broadcasting, views etc. It is possible to create a kernel around the device function and pass that to numba (in fact, that is how cudf supports UDFs today), but in that case we can't reuse the C++ Accessor classes, and instead have to "reimplement" that on the python side (that is what Irina is doing today).

My hunch overall is that this mode, of asking numba to generate scalar-only functions and using those in our own launching logic, is a better fit for our usecase:
For completeness, let me list the reasons why (I think) the default full-kernel PTX compilation mode is not working out of the box for us:
Some questions that I have about the feasibility of the approach:
Thanks for the nice summary, @manopapad !
I think you're following the ABI correctly in your snippet above (I can't see an error, anyway! 🙂)
This all makes sense to me.
For LTO it would be recommended to use the nvJitLink API in CUDA 12.0 onwards, as LTO was removed from the driver. I would like to have an option for
It is possible to call C++ functions from Python kernels, but you need to do it through an
This is the case if you type the input arguments as being pointers to the data, rather than as arrays. It is not documented in the ABI documentation (because I wanted to find a more efficient way to support this) but if you gave the type as an array type (e.g.
Supposing instead you were passing a 2D array with the type declared as
Although it's not documented, if you wanted to rely on this I could add it to the documentation - this ABI is pretty baked into Numba at present so it should be pretty stable - I don't think it's changed since 2015.
I've not yet looked at the PR code (just tried to answer / comment here first) but for a first pass I think the approach of calling the PTX from your kernel and linking the PTX into your unlinked cubin will be the most straightforward way to get started, then an alternative approach can be explored later.
Not exactly - let me try to work out a close equivalent and post it in a follow-up.
I think your assessment is correct here, but I think your comments imply that this doesn't fit exactly into cuNumeric as it is - what would be the ideal way for multiple outputs to be handled from the cuNumeric perspective?
A quick prototype of a close equivalent, compiling to CPU assembly:

import math
from numba import float64
from numba.core import compiler, sigutils
from numba.core.compiler_lock import global_compiler_lock
from numba.core.registry import cpu_target
@global_compiler_lock
def compile_asm(func, sig):
    typingctx = cpu_target.typing_context
    targetctx = cpu_target.target_context
    flags = compiler.Flags()
    flags.no_cpython_wrapper = True
    flags.no_cfunc_wrapper = True
    args, return_type = sigutils.normalize_signature(sig)
    cres = compiler.compile_extra(
        typingctx=typingctx,
        targetctx=targetctx,
        func=func,
        args=args,
        return_type=return_type,
        flags=flags,
        locals={},
    )
    return cres.library.get_asm_str(), return_type


def foo(x):
    return math.sqrt(x)


asm, _ = compile_asm(foo, (float64,))
print(asm)

and produces:

.text
.file "<string>"
.globl _ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd
.p2align 4, 0x90
.type _ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,@function
_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd:
vsqrtsd %xmm0, %xmm0, %xmm0
vmovsd %xmm0, (%rdi)
xorl %eax, %eax
retq
.Lfunc_end0:
.size _ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd, .Lfunc_end0-_ZN8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd
.type _ZN08NumbaEnv8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,@object
.comm _ZN08NumbaEnv8__main__3fooB2v1B30c8tJTC_2fWQI8IW1CiAAYKRrSBJTQBEd,8,8
.section ".note.GNU-stack","",@progbits

Note that this followed the x86_64 System V ABI for a function with prototype:

retcode_t func(ret_type *ret, excinfo **exc, <args>);

This is slightly different to the CUDA target in that there is an extra parameter for exception info, which I think you can ignore (assuming you're not going to support Python exceptions from compiled code in cuNumeric). In the example above, the return code of

If this looks like it's going in the right direction, we can firm up the best way to go about implementing / integrating this.
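To make the calling convention concrete, here is a hedged ctypes sketch of how a caller might invoke a function with that prototype for the foo example above (the int32 status type, the success value, and the way the address is obtained are assumptions on my part, not documented Numba behaviour):

```python
import ctypes

# retcode_t func(ret_type *ret, excinfo **exc, <args>)
# For foo(x: float64) -> float64 this is roughly:
FooProto = ctypes.CFUNCTYPE(
    ctypes.c_int32,                   # retcode_t status (assumed 0 == success)
    ctypes.POINTER(ctypes.c_double),  # double *ret
    ctypes.POINTER(ctypes.c_void_p),  # excinfo **exc (ignored here)
    ctypes.c_double,                  # x
)

def call_foo(address, x):
    """Call a compiled foo located at `address`, following the ABI above."""
    foo = FooProto(address)
    ret = ctypes.c_double()
    exc = ctypes.c_void_p()
    status = foo(ctypes.byref(ret), ctypes.byref(exc), x)
    assert status == 0
    return ret.value
```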
@manopapad Thank you for starting this conversation.
@gmarkall Thank you so much for helping us out with this!
I spent way too much time digging into this. Apparently once the code has made it to PTX, an uninlined device function will never be inlined into the calling kernel, no matter how the linking is done (e.g. the old

Therefore, I suggest we go ahead and pre-compile the containing kernel as PTX (using something like
The ideal in this situation would be to have each return value in the output become a separate pointer:
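A minimal sketch of what I mean, with a hypothetical two-output UDF (the names and the commented lowered signature are illustrative only, not something Numba generates today):

```python
# Hypothetical element-wise UDF with two return values.
def sum_diff(x, y):
    return x + y, x - y

# Ideally this would lower to a device function along the lines of
#   retcode_t sum_diff(double *out_sum, double *out_diff, double x, double y);
# i.e. one output pointer per returned value.
```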
Then we could pass the two output arrays in SoA format, and pass one pointer from each to each function call. However, I would be hesitant to suggest implementing this mode, unless it has been requested by someone else. The way
Looks good to me, but I'll let Irina see how easy it would be to actually use this :-) @gmarkall One question: would the
I think we can just call directly into the device function that numba generates:
The actual name of the device function will be something auto-generated like
This CUDA sample uses the old

Actually for now we can just always use
Glad to help out where I can 🙂
Thanks for all that digging - I had been a little surprised at the performance of uninlined PTX functions in general with Numba, in that there didn't seem to be as much of a performance penalty as I was expecting - the lightweight call explains what I'd been observing in the past.
That makes sense, and sounds pretty close to the approach used in cuDF for e.g. string UDFs and groupby-apply that make use of C++ device functions (and the planned approach for when LTO-IR is available).
I think I've had a request for something similar on the Numba discourse, so I'd like to spend a little time looking into whether this can be implemented in a relatively straightforward way on top of the most recent changes in Numba 0.57 (RC is out at the moment, release due soon) - will post back here if I come across a straightforward way to do things.
It looks like it provides the right sort of ABI (or close to it) but I wonder whether it does too much, in that you get back a function pointer so you can call the code loaded into the process - does that take away too much control from you, and you need to handle compilation / loading on the node(s)?
In the past I've hacked around this by searching the PTX for things like
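For what it's worth, a sketch of that kind of hack in Python, assuming Numba-style PTX where device functions show up in .visible .func declarations (the regex is illustrative, not exhaustive):

```python
import re

def find_device_function_names(ptx):
    """Pull the mangled names out of .visible .func declarations in PTX text."""
    pattern = re.compile(
        r"^\.visible\s+\.func\s*(?:\([^)]*\)\s*)?([\w$]+)\s*\(",
        re.MULTILINE,
    )
    return pattern.findall(ptx)
```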
That's a good point. Looking back at your earlier suggestion, it looks like the choices are:
I feel like the second option is less fuss. One approach I can imagine would involve tagging each UDF with a "global" ID that all processes agree on (just an auto-incrementing counter or something), but each process can have a local function pointer associated with that. We would need to make a CFFI call to the C++ layer to cache the function pointer for each tag in a static map, so that the point tasks that use that UDF later will find the actual code to call. We wouldn't need to involve Legion at all in the caching, since this is purely process-local information. I should note that the UDF would only get registered on processes where Numba is running, so there is the theoretical risk that a point task will not find a function pointer registered, but in Legate we typically have a Python interpreter running on every process, so this scenario won't be an issue for us.
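A rough sketch of the process-local bookkeeping this would need, with a plain dict standing in for the static map that would really live on the C++ side behind a CFFI call (all names here are hypothetical):

```python
import itertools
from typing import Callable, Dict

# Globally agreed, auto-incrementing UDF ids; every process registers UDFs in
# the same order, so the ids line up across processes.
_udf_counter = itertools.count()

# Stand-in for the static map on the C++ side, keyed by UDF id.
_local_udf_pointers: Dict[int, int] = {}

def register_udf(get_function_pointer: Callable[[], int]) -> int:
    """Assign a global id to a UDF and cache this process's function pointer."""
    udf_id = next(_udf_counter)
    _local_udf_pointers[udf_id] = get_function_pointer()
    return udf_id

def lookup_udf(udf_id: int) -> int:
    """What a point task would do: look up the locally cached pointer by id."""
    return _local_udf_pointers[udf_id]
```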
Committing some old comments I had; these may or may not be applicable if we switch to using device functions.
pa.bool_: ty.bool_,
pa.int8: ty.int8,
pa.int16: ty.int16,
pa.int32: ty.int32,
pa.int64: ty.int64,  # np.int is int
pa.uint8: ty.uint8,
pa.uint16: ty.uint16,
pa.uint32: ty.uint32,
pa.uint64: ty.uint64,  # np.uint is np.uint64
pa.float16: ty.float16,
pa.float32: ty.float32,
pa.float64: ty.float64,
Python-level arithmetic values can only be bool/int/float/complex. You'll probably also need to remove the import pyarrow at the top.
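For illustration, the reduced mapping might then look something like this (a minimal sketch; the SCALAR_TYPE_MAP name is hypothetical and NumPy dtypes stand in for whatever ty.* values the file actually uses):

```python
import numpy as np

# Python-level arithmetic values can only be bool/int/float/complex,
# so only four entries are needed.
SCALAR_TYPE_MAP = {
    bool: np.dtype(np.bool_),
    int: np.dtype(np.int64),
    float: np.dtype(np.float64),
    complex: np.dtype(np.complex128),
}
```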
def convert_to_cunumeric_dtype(dtype: Any) -> Any:
    if dtype in CUNUMERIC_TYPE_MAP:
        return CUNUMERIC_TYPE_MAP[dtype]
    raise TypeError("dtype is not supported")
Suggested change:

def convert_to_cunumeric_dtype(dtype: type) -> ty.Dtype:
    if dtype in CUNUMERIC_TYPE_MAP:
        return CUNUMERIC_TYPE_MAP[dtype]
    raise TypeError(f"{dtype} is not supported")
fprintf(stderr, "UDF function wasn't generated yet");
LEGATE_ABORT;
}
return udf_caches_[hash];
Suggested change:

return finder->second;
{
auto finder = udf_caches_.find(hash);
if (udf_caches_.end() == finder) {
fprintf(stderr, "UDF function wasn't generated yet");
Suggested change:

fprintf(stderr, "UDF function has not been generated yet");
@@ -113,6 +121,11 @@ class Pitches<0, C_ORDER> {
point[0] += index;
return point;
}
__CUDA_HD__
inline const size_t* data(void) { return &pitches[0]; }
Can we just return NULL in this case?
auto device_pitches = create_buffer<int64_t>(Point<1>(DIM - 1), Memory::Kind::Z_COPY_MEM);
auto device_strides = create_buffer<int64_t>(Point<1>(DIM), Memory::Kind::Z_COPY_MEM);
Suggested change:

auto device_pitches = create_buffer<uint64_t>(Point<1>(DIM - 1), Memory::Kind::Z_COPY_MEM);
auto device_strides = create_buffer<uint64_t>(Point<1>(DIM), Memory::Kind::Z_COPY_MEM);
nitpick, but that's what the function signature expects
CUresult status = cuLaunchKernel(
  func, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, 0, stream, NULL, config);
if (status != CUDA_SUCCESS) {
  fprintf(stderr, "Failed to launch a CUDA kernel\n");
  assert(false);
}
Suggested change:

CHECK_CUDA(cuLaunchKernel(
  func, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, 0, stream, NULL, config));
Driver and runtime error codes are aligned, so it should be acceptable to reuse CHECK_CUDA here.
{
int64_t ptx_hash = context.scalars()[0].value<int64_t>();
std::string ptx = context.scalars()[1].value<std::string>();
Processor point = legate::Processor::get_executing_processor();
This value is not used.
#if CUDA_VERSION >= 6050
const char *name, *str;
assert(cuGetErrorName(result, &name) == CUDA_SUCCESS);
assert(cuGetErrorString(result, &str) == CUDA_SUCCESS);
fprintf(stderr, "CU: cuModuleLoadDataEx = %d (%s): %s\n", result, name, str);
#else
fprintf(stderr, "CU: cuModuleLoadDataEx = %d\n", result);
#endif
exit(-1);
Suggested change:

CHECK_CUDA(result);
I believe we can fall back to the existing CUDA error reporting routines, after we've printed out the JIT-specific logs.
"ERROR: Device side asserts are not supported by the " | ||
"CUDA driver for MAC OSX, see NVBugs 1628896.\n"); |
"ERROR: Device side asserts are not supported by the " | |
"CUDA driver for MAC OSX, see NVBugs 1628896.\n"); | |
"ERROR: Device side asserts are not supported by the " | |
"CUDA driver for MAC OSX.\n"); |
The nvbug is not going to be accessible to most users (I know it's also referenced in Legion, but same comment applies there too).
This PR adds support for np.vectorize to cuNumeric. It depends on nv-legate/legate#640.

This implementation has the following limitations:
I had to manually add typings for Numba to avoid pre-commit errors. I don’t think I did it right and asked @bryevdv to look at it.
And thank you @bryevdv for helping to fix the documentation issues.
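For context, a minimal usage sketch of the np.vectorize API this PR adds (the exact argument handling may differ from NumPy given the limitations above):

```python
import cunumeric as np

def saxpy(a, x, y):
    return a * x + y

vf = np.vectorize(saxpy)

a = np.full(5, 2.0)
x = np.arange(5, dtype=np.float64)
y = np.ones(5)
print(vf(a, x, y))   # element-wise: [1. 3. 5. 7. 9.]
```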