Start / Stop / Dump trace hooks for NCCL profiler for a tracing ecosystem integration #1210
base: master
Conversation
NCCL provides a profiling/tracing capability to record various operations during collectives, including setting up buffers, sending data to and from the GPU, etc. This change enables the application layer to control NCCL profiling through a start/stop interface.

Limitations of the existing profiler:
* It uses a compile-time flag and traces the whole application, so it does not support start and stop APIs.
* Does not annotate the start and stop of the overall collective or provide the collective name.
* Missing chunk/data size measurement.

Enhancements in this change:
* Add NCCL API markers.
* Improve cleanup for profiler and collective event buffers.
* Make trace dumping not dependent on collective markings.
* Future enhancements will include sampling to enable always-on collection.
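Roughly, the kind of application-level control being proposed might look like the sketch below. The `ncclProfilerStart`/`ncclProfilerStop` names follow the discussion further down in this thread; the dump call, its signature, and the output path are hypothetical placeholders and may not match the exact API added by this PR.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical lifecycle hooks: the start/stop names follow the discussion in
// this thread, while the dump call, its signature, and the output path are
// placeholders and may not match the actual API in this PR.
extern "C" ncclResult_t ncclProfilerStart();
extern "C" ncclResult_t ncclProfilerStop();
extern "C" ncclResult_t ncclProfilerDump(const char* path);

// Trace only the collectives we care about instead of the whole application.
void allReduceWithTracing(ncclComm_t comm, float* buf, size_t count,
                          cudaStream_t stream) {
    ncclProfilerStart();
    ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);
    ncclProfilerStop();
    ncclProfilerDump("/tmp/nccl_profile.json");  // hypothetical output location
}
```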
Can you explain in more detail how this integrates into Kineto? I'm failing to see how Kineto would call ncclProfilerStart/ncclProfilerStop, and also how it would get its profiling data back into the global profiling (we generate a file, but will this file become part of the global profiling?).
Hey Sylvain, thanks for taking the time out to understand this.
We plan to integrate these start/stop methods with our other tooling as well, although Kineto is the main user. For Kineto, at a high level, we implement the … Now when you wrap your model in a … Today our …
So consolidating the data with the GPU trace (aka global profiling) was not an initial design goal. The generated file is stored at the same location as the GPU trace.
@sjeaugey Hey, do let me know your thoughts on adding these methods to the profiler. We don't really have to change the functionality of the stock profiler that comes with the library if you don't want to, but just having a better interface allows cheap, quick, basic, out-of-the-box collective tracing for everything that sits above NCCL in the stack (and that's a lot of prod-like things) without having to re-compile each of them with a plugin (not so prod-like) :) We can always defer any code that improves the specialization of a default profiler to be added as a plugin.
@sanrise The ability to seamlessly profile NCCL collectives with Kineto/PyTorch would be absolutely amazing! However, I'm a little skeptical about using the existing proxy profiler for this. I'll try to summarize the potential problems below, and propose a hopefully more robust solution based on a different approach.
First, the existing profiler has to be changed in some way because it simply doesn't work for anything non-trivial. I tried to fix this (see this issue, this PR, and this commit message for more details); actually, while digging through old PRs I just found out I was not the first person to attempt this. Those PRs were never merged, and I believe the reason is that the existing profiler was never intended to be used for prod-like things (see this and this, for example). The NCCL maintainers simply don't have the bandwidth to review large changes to something that is only meant to serve as a proof of concept (which is very understandable).

Second, the proxy profiler events only trigger for the NCCL channels that are proxied over the actual network. Anything happening over NVLink, for example, is invisible to the proxy profiler. The kernels used in NCCL do a lot of performance-sensitive operations on the GPU even when there's no network involved (more on that at the end of this comment), and it would be really nice to have visibility into this, too.

Third, perhaps the most fundamental problem is that the proxy profiler events are inherently hard to match with the actual collective kernels launched from PyTorch, since the CPU proxy thread(s) are asynchronous with respect to both GPU kernels and the other CPU threads. From a cursory reading, this PR assigns an event triggered by a proxy thread to the most recently recorded collective.

With all this in mind, what if we profiled the actual NCCL kernels instead of the proxy thread? I have a proof-of-concept implementation in my private NCCL fork which essentially does this:
The diff to add this to NCCL is surprisingly small – a hundred lines, tops. The host callback can then do anything it wants with the raw timings obtained after the kernel finishes. I just collect them in host memory to eventually enrich the raw Kineto trace. This basically gives you full visibility into what is happening within the pink NCCL kernel blocks in the trace.

Having said that, I definitely don't want to hijack this PR with unrelated proposals. Just integrating the existing proxy profiler into PyTorch in some form would also be great. But if this idea of fine-grained kernel profiling sounds interesting to PyTorch/NCCL developers, I can try to create a separate PR (or issue) to discuss this further.
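The general shape of this approach (a minimal sketch, not the actual code in the fork mentioned below) is: have the kernel write GPU timestamps into pinned host memory, then enqueue a host callback on the same stream to pick the raw timings up once the kernel has finished. The kernel, buffer layout, and names here are illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy "collective-like" kernel: block 0 records a GPU timestamp before and
// after the work, writing straight into pinned host memory, mimicking the
// idea of instrumenting NCCL kernels with begin/end timestamps.
__global__ void instrumentedKernel(float* data, int n, unsigned long long* ts) {
    if (blockIdx.x == 0 && threadIdx.x == 0) ts[0] = clock64();  // work begins
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        data[i] *= 2.0f;  // stand-in for the real collective work
    }
    if (blockIdx.x == 0 && threadIdx.x == 0) ts[1] = clock64();  // work ends
}

// Host callback: runs once all preceding work on the stream has completed.
// It must not call the CUDA API, so it only reads the already-filled buffer
// (here we just print; a profiler would append to its trace instead).
void CUDART_CB collectTimings(void* userData) {
    const unsigned long long* ts = static_cast<unsigned long long*>(userData);
    std::printf("instrumented span: %llu GPU clock ticks\n", ts[1] - ts[0]);
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // Pinned, device-mapped host buffer holding the raw timestamps.
    unsigned long long* tsHost = nullptr;
    cudaHostAlloc((void**)&tsHost, 2 * sizeof(unsigned long long), cudaHostAllocMapped);
    unsigned long long* tsDev = nullptr;
    cudaHostGetDevicePointer((void**)&tsDev, tsHost, 0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    instrumentedKernel<<<64, 256, 0, stream>>>(d, n, tsDev);
    // Enqueue the callback right after the kernel on the same stream.
    cudaLaunchHostFunc(stream, collectTimings, tsHost);
    cudaStreamSynchronize(stream);

    cudaFreeHost(tsHost);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return 0;
}
```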
@dfyz It looks really cool! How much performance loss will this trace capability have on NCCL? In addition, how can we use the feature you mentioned above? Could you share a PR? This is useful for analyzing performance-related issues, thanks.
I didn't see any noticeable performance impact on the kernels themselves, but the host callback I wrote to process the timings from the kernels is very naive. It does have a noticeable impact on performance when the number of ranks (and hence the number of timestamps collected) is very large. In extreme cases, it could double the effective time needed to run the collectives.
There is no easy way, because my proof-of-concept never evolved beyond that (mostly because I wasn't sure it was of interest to anyone). It works for my purposes, but the code quality is not that great and only some NCCL kernels can be traced (more exactly, the ring-based ones). Having said that, I just published this PoC to my NCCL fork, hoping it might serve as inspiration to someone:
NCCL already provides a profiling/tracing capability to record various operations during collectives, including setting up buffers, sending data to and from the GPU, etc. Today it uses a compile-time flag, traces the whole application, and does not support any start and stop knobs.
Having such control allows the application layer to gain temporary visibility into collective communication. This lets us introduce abstractions within PyTorch's (libkineto) profiler to do things like (a) report NCCL traces along with GPU traces for only a subset of training iterations, or (b) simply trace collectives simulated using torch.distributed, gaining efficient, fine-grained visibility into collective operations.
This change will add such lifecycle management interfaces (start/stop/dump-trace) to the existing profiler along with other improvements to the collection.
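To make (a) concrete, a higher-level profiler could window collection to a few training iterations. The sketch below assumes the same hypothetical hook names as the earlier sketch, and `runOneTrainingStep` is a placeholder for whatever issues the collectives each iteration.

```cuda
#include <nccl.h>

// Hypothetical hooks as in the sketch above; runOneTrainingStep is a
// placeholder for whatever issues the NCCL collectives each iteration.
extern "C" ncclResult_t ncclProfilerStart();
extern "C" ncclResult_t ncclProfilerStop();
extern "C" ncclResult_t ncclProfilerDump(const char* path);
void runOneTrainingStep(int iter);

// Collect NCCL traces only for iterations [warmup, warmup + active), the way
// a schedule-driven profiler such as Kineto might window its collection.
void trainingLoop(int numIters, int warmup, int active) {
    for (int iter = 0; iter < numIters; ++iter) {
        if (iter == warmup) ncclProfilerStart();  // open the traced window
        runOneTrainingStep(iter);
        if (iter == warmup + active - 1) {
            ncclProfilerStop();                   // close the traced window
            ncclProfilerDump("/tmp/nccl_profile_window.json");  // hypothetical path
        }
    }
}
```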
Enhancements
* Add NCCL API markers.
* Improve cleanup for profiler and collective event buffers.
* Make trace dumping not dependent on collective markings.
* Future enhancements will include sampling to enable always-on collection.
Example:
Upon Kineto integration, the PyTorch profiler module can generate NCCL traces. This file was generated by the profiler:
In the above example, we simply profiled this PyTorch snippet:
(These changes have been authored and iterated on by @briancoutinho and various engineers at Meta over a period of time)