Profiler: Show memory state on deferred allocation OOM #1797
Comments
@manopapad I think the real challenge of this is picking a visualization tool. I can dump all the data out of Legion to make that picture, say with graphviz or matplotlib, but there are going to be hundreds, if not thousands, of instances and holes to report, so I think we need a more dynamic visualization tool for rendering it: a static, fully zoomed-in representation is not going to be comprehensible to a human. Users need to see the whole layout at a large scale and then be able to zoom in on the things they want to look at. Do you have thoughts on how you'd want to do that? Alternatively, we can do a text-based representation for now and just have a tool that reports the largest holes in sorted order and the total size of all holes.
Yes, we can start with a text dump for now, and iterate on the actual visualization. Maybe @bryevdv has a good idea. One more thing to note: in Legate we would also like to include additional information in this visualization, e.g. which user-level object corresponds to each field, so we would need to dump additional information on top of this.
So my plan was to add a method to the mapper runtime for dumping memory state. Any mapper could invoke it at any time to dump the state of a particular memory: you don't have to wait until you are OOM, but can do it as many times as you want throughout your run. I'm not promising that it will be fast, as it will finish writing to the file and close the file before returning, but there's nothing stopping you from using it periodically. What would you add to that function call to record what you want, and how would you write the tool to parse it?
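A minimal sketch of how a mapper might use such a call, assuming a hypothetical method name and signature (the actual proposed API is not reproduced in this thread):

```cpp
#include "legion.h"

using namespace Legion;
using namespace Legion::Mapping;

// Hypothetical usage only: 'dump_memory_state' is a placeholder for the
// proposed MapperRuntime method, not part of Legion's actual API today.
static void report_memory_usage(MapperRuntime *runtime,
                                const MapperContext ctx,
                                Memory target)
{
  // Per the description above, the call finishes writing and closes the
  // file before returning, so it is safe (if not fast) to invoke it
  // periodically or right before giving up on a failed allocation.
  runtime->dump_memory_state(ctx, target, "memory_state.dat");
}
```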
I don't think we would add extra information to the call directly, but we would possibly include extra information in the output file. In particular, we'd want to record which Legate-level Stores correspond to which Legion fields, and record relevant information on the Stores that would help a user track values back to their code.
Well, that's what I mean: you'll need to pass some kind of data to the call, because Legion is going to be the thing writing the output file. You can't add anything yourself (at least not directly) to the output file. The file format will be a black box, because I want to reserve the right to change the format in the future as the internals of Legion change. There will be an agreed-upon format between Legion and the tool that parses the file given a particular Legion commit (similar to how Legion Prof and Legion Spy work today). That means that if you want to pass data yourself in some form, it's going to need to be passed through Legion.
I would much prefer to avoid this pattern, and just document (and version) the output format. This way multiple tools can read it, and it's not just the one tool that you provide. And you don't even have the responsibility to provide and maintain the one tool. We should honestly do the same for the Legion profiler output (if we're not already), but that's a different discussion.
We would either append data to the end of the same file, or dump our info into a separate file that we ship together with Legion's output.
I'm on the fence about this. We don't have multiple tools that read the output right now, and for Legion Prof at least we don't particularly want them: there is a lot of business logic that goes into parsing it. I think it's reasonable, to the extent that the user is expected to provide this data directly, to document any formats and version them. That's fair. But I think the status quo for Legion Prof is still the best trade-off all around, given the constraints and the inherent complexity of the problem we are solving. This is not the type of situation where open standards help, because there is (again) so much business logic we need to deal with. So overall I'm with Mike on this one in terms of how it would likely be implemented. We can always add modes or passes to Legion Prof to do whatever data manipulation you need to extract what you want from the logs themselves.
@elliottslaughter I see your point regarding documenting the Legion Profiler format (I might disagree, but I need to educate myself more before I can express a meaningful opinion). But do you hold the same opinion for the proposed standalone "memory state dump" information? This is a new set of information, not necessarily expected to integrate with the existing Legion profiler, so I suggest we build it up using a documented format, rather than having a single tool that knows how to parse it. Then we can use the well-documented format to build a tool that shows information specific to Legate's semantics (see #1797 (comment)), rather than trying to cram everything into the one tool that Legion provides.
I'll separate the discussion on Legion Prof from the discussion on the new tool for out-of-memory conditions. First, on the Legion Prof front, I pretty strongly agree with Elliott that we should have one tool for parsing and organizing the profiling data. I've made the case to Elliott (and I think I've mostly convinced him) that the runtime should just log stuff in the fastest way possible to minimize profiling overheads, and it should be up to the parsing tool to put the pieces back together offline, where overhead doesn't matter. Additionally, there are lots of "connections" to be made between disparate logging statements. All this adds up to some really non-trivial code with semantics that are difficult to get right and that change often. I don't think anyone should be replicating our effort to do that: there should be just one Legion Prof tool, for which people can write different backends that extract the specific information they want.

This new tool that we're discussing for OOM conditions is different, but it also shares some similarities with Legion Prof. The one thing this new tool has going for it is that all the information will be logged at a single point in time (when we actually run out of memory). This means that there will be fewer "connections" and semantics that we need to worry about. However, there are still some gotchas. For example, how do you name instances? Everything user-visible in the mapping interface is named by Realm instance IDs, but those are recycled when instances are deleted, so they are not unique. A mapper might even be holding references to instances whose IDs have since been recycled.

So where does that leave us? I feel like we still want a Legion-specific component of this tool that allows the runtime to dump whatever format it wants and then for others to be able to query it, similar to Legion Prof. The question then is: do clients pass semantic information into that tool and log through its interface, or do they log their own data independently and then have some way to create associations with the things they know about? If someone can describe a way to create such associations in a sane way, then I wouldn't mind supporting the independent pathway, where you log separately and then use the Legion tool to parse and extract what you want, doing some kind of a "join" with your own data to make the needed connections and report what you need.
Fair enough
This doesn't matter to the interface we're talking about here. You will be printing out based on the "internal" state of the runtime, so every instance ID you're printing is presumably "live". Making sure that the user isn't printing out garbage-collected instance names in their "side-logging" is not your problem.
Sure, with you there. And I'm not asking that everything in the log format be mappable back to application-level things. If you just document what each instance is being used for, then we can "pick and choose" what we combine with user-level info, and dump the rest "as-is". Another way to say this: if you can make the format self-describing (e.g. JSON with an associated schema), then just do that and avoid introducing another middleman tool. IMHO there's not the same requirement here as with the main profiling to be as lean as possible, so you can get away with this.

As I write this, I realize that some explanatory information, like the association from task ID to task name / provenance, will also need to be made available to the processing tool (whether that's the user's tool directly, or an intermediate Legion tool). You probably don't want to be dumping the task name on every entry that references that task ID.
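For illustration only, a self-describing entry in such a dump could look something like the following; every field name and value here is invented, not an actual Legion format:

```json
{
  "schema_version": 1,
  "memory": { "kind": "GPU_FB_MEM", "capacity_bytes": 17179869184 },
  "instances": [
    {
      "instance_id": "0x4000000000000007",
      "offset": 0,
      "size_bytes": 268435456,
      "field_ids": [1048577, 1048578]
    }
  ],
  "holes": [
    { "offset": 268435456, "size_bytes": 1048576 }
  ]
}
```

With something like this, a Legate-level tool could enrich the entries it recognizes (e.g. by field ID) and render everything else unchanged.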
How are you going to "combine" things if all the runtime names are in terms of internal things like distributed IDs/unique events, and all your instance names are in terms of Realm instance IDs (which are not unique)? You're not going to have any way to build the associations that you care about, because you won't be able to map your instances to the things Legion is telling you about.

I don't think I'm going to be dumping any task IDs in this tool, as Legion doesn't track tasks/provenances for instances. If you want, you can log those on the side.

We'll be smart about deduplicating strings.
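As a purely hypothetical sketch of that side-logging idea combined with string deduplication, Legate could emit a companion file that records each string once and keys its records on Legion field IDs, which a post-processing step could then join against the runtime's dump (all names and values below are invented):

```json
{
  "strings": [ "my_store", "user_script.py:42" ],
  "fields": {
    "1048577": { "store_name": 0, "provenance": 1 }
  }
}
```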
Separating out a side discussion from #1739.
In order to get a full picture of memory usage, we would need to visualize a number of different objects that take up space on a Realm memory, some of which are only visible internally to the Runtime:
- PhysicalInstances
- DeferredBuffers / DeferredValues
- Future instances

We also need a way to let the mapper request this logging (today e.g. the DefaultMapper simply aborts on deferred allocation failure).