-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(shortfin-sd) Adds program isolation optionality and fibers_per_device. #360
Conversation
5014b1f
to
a48fc23
Compare
a48fc23
to
5b8dbdd
Compare
This sets HIP_VISIBLE_DEVICES from the .yaml -- it can be exposed as a pytest option as well (it's an option for the server CLI) but this is easiest and most obvious. #342 describes why this method was chosen for the runner (it's a little faster to init) |
ctest --timeout 30 --output-on-failure --test-dir build | ||
pytest tests/apps/sd/e2e_test.py -v -s --system=amdgpu | ||
HIP_VISIBLE_DEVICES=0 pytest tests/apps/sd/e2e_test.py -v -s --system=amdgpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sets HIP_VISIBLE_DEVICES from the .yaml -- it can be exposed as a pytest option as well (it's an option for the server CLI) but this is easiest and most obvious. #342 describes why this method was chosen for the runner (it's a little faster to init)
I think @saienduri also did something with HIP_VISIBLE_DEVICES
for having multiple github actions runner instances on the same node, with one runner per GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runner setup doesn't seem to be restricting the visible devices -- with no specified visibility, a test failed due to a known multi-gpu bug. Thats OK as long as we don't mind using an environment variable here. If we switch to a runner that manages that variable, and multi-gpu still isn't fixed, we can remove the usage here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@monorimet can you try using ROCR_VISIBLE_DEVICES? That's what we've been using in IREE and SHARK-TestSuite
worker = sysman.ls.create_worker(f"{name}-inference-{device.name}-{i}") | ||
fiber = sysman.ls.create_fiber(worker, devices=[device]) | ||
self.workers.append(worker) | ||
self.fibers.append(fiber) | ||
self.locks.append(asyncio.Lock()) | ||
self.fiber_status.append(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not in love with this, but haven't dreamed up a different way about it yet. The context manager was making per-call difficult to toggle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we're missing a data structure. If I read this right, you just want to be able to pick a free fiber, right? There should really be some kind of a Pool or something which we don't have yet, but you can fake it with a simple fiber idle list: all available fibers go on the idle_list. Then when you need one, you pop and have the fiber put itself back when done. Something like that. You'd typically use a data structure that can yield if none are available but I think you are somehow never managing to underflow here? The underflow blocking could be faked today with a Queue or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's give this a go and see where it takes us. I think we'll end up landing on a simpler set of options but we can break it down then.
worker = sysman.ls.create_worker(f"{name}-inference-{device.name}-{i}") | ||
fiber = sysman.ls.create_fiber(worker, devices=[device]) | ||
self.workers.append(worker) | ||
self.fibers.append(fiber) | ||
self.locks.append(asyncio.Lock()) | ||
self.fiber_status.append(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we're missing a data structure. If I read this right, you just want to be able to pick a free fiber, right? There should really be some kind of a Pool or something which we don't have yet, but you can fake it with a simple fiber idle list: all available fibers go on the idle_list. Then when you need one, you pop and have the fiber put itself back when done. Something like that. You'd typically use a data structure that can yield if none are available but I think you are somehow never managing to underflow here? The underflow blocking could be faked today with a Queue or something like that.
No description provided.