DRA: Handle extended resource requests via DRA Driver #5004

Open
4 tasks
klueska opened this issue Dec 17, 2024 · 6 comments
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Comments

@klueska
Contributor

klueska commented Dec 17, 2024

Enhancement Description

  • One-line enhancement description (can be used as a release note):
    Allow DRA drivers to honor requests made via the extended resource API (e.g. nvidia.com/gpu: 2), rather than requiring a standard device plugin to serve them.

  • Kubernetes Enhancement Proposal:

    • TBD
    • Incremental PRs:
      • TBD
  • Discussion Link:

  • Primary contact (assignee):
    @klueska, @pohly, @johnbelamaric

  • Responsible SIGs:
    /sig node
    /wg device-management

  • Enhancement target (which target equals which milestone):

    • Alpha release target: 1.33
    • Beta release target: 1.34
    • Stable release target: 1.35
  • Alpha

    • KEP (k/enhancements) update PR(s):
      • TBD
    • Code (k/k) update PR(s):
      • TBD
    • Docs (k/website) update PR(s):
      • TBD
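
To make the starting point concrete, here is a minimal sketch of what "a request made via the extended resource API" looks like in a pod manifest, and how such requests could be picked out of it. The pod is a plain dict rather than a client object, the image name is a placeholder, and `is_extended_resource` is a simplified stand-in written for this example (not the actual Kubernetes validation logic): it treats any domain-qualified name outside `kubernetes.io` as extended.

```python
# Sketch only: helper names and the simplified naming rule are assumptions.

def is_extended_resource(name: str) -> bool:
    """Simplified check: a domain-qualified name outside kubernetes.io."""
    if "/" not in name:
        return False  # native resources such as cpu or memory
    domain = name.split("/", 1)[0]
    return domain != "kubernetes.io" and not domain.endswith(".kubernetes.io")


def extended_requests(pod: dict) -> dict:
    """Return {container name: {resource name: count}} for extended resources."""
    found = {}
    for container in pod["spec"]["containers"]:
        limits = container.get("resources", {}).get("limits", {})
        picked = {n: int(q) for n, q in limits.items() if is_extended_resource(n)}
        if picked:
            found[container["name"]] = picked
    return found


# An existing-style manifest that asks for two GPUs via the extended resource API.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-pod"},
    "spec": {"containers": [{
        "name": "main",
        "image": "registry.example/cuda-app:latest",  # placeholder image
        "resources": {"limits": {"cpu": "2", "nvidia.com/gpu": 2}},
    }]},
}

print(extended_requests(pod))
```

The goal stated above is that a manifest like this keeps working unchanged when the node runs a DRA driver instead of a device plugin.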
@k8s-ci-robot added the sig/node and wg/device-management labels Dec 17, 2024
@johnbelamaric
Member

+1 yes please!

@johnbelamaric
Member

johnbelamaric commented Dec 17, 2024

We need to sort out the requirements. A few initial questions:

  1. For newly created pods, I think it's clear we want this to be transparent. Existing manifests that use the extended resource API should continue to work as before, without modification.
  2. Can we handle this invisibly in the driver layer, or do we need to have DRA invoked at the control plane level and select the specific devices? If we don't, we will likely have a race condition - unless the scheduler can do some magical accounting (which seems possible).
  3. How do we handle upgrades? If we have a node running a device plugin, and we switch to the DRA driver (or we upgrade to a driver that supports both), do you have to delete the pods? Do they automatically adopt the devices? If so, how do we write those back to the allocation logic, since no DRA claim exists?
  4. What happens if there are pods in a deployment, and some land on nodes with a device plugin and some on nodes with DRA drivers?
  5. We talked about letting specific device classes be advertised as specific extended resources. This could mean the existing resource names get mapped to specific device classes by the admin. It could also mean we have a convention like deviceclass.k8s.io/foo: 4 for extended resource names. How do these choices interplay with the questions above?
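
As a rough illustration of the two options in question 5, the sketch below resolves an extended resource name to a device class either through an admin-supplied mapping or through the `deviceclass.k8s.io/<name>` convention. Nothing here is a decided design: `resolve_device_class`, the mapping format, and the `gpu.nvidia.com` class name are assumptions made for the example.

```python
from typing import Optional

# Hypothetical convention prefix taken from the example in question 5.
CONVENTION_DOMAIN = "deviceclass.k8s.io"


def resolve_device_class(resource_name: str, admin_mapping: dict) -> Optional[str]:
    """Map an extended resource name to a device class name, or None."""
    # Option A: the admin explicitly maps an existing resource name.
    if resource_name in admin_mapping:
        return admin_mapping[resource_name]
    # Option B: the name itself encodes the class, e.g. deviceclass.k8s.io/foo.
    domain, sep, suffix = resource_name.partition("/")
    if sep and domain == CONVENTION_DOMAIN and suffix:
        return suffix
    # Anything else is left to whatever handles it today (e.g. a device plugin).
    return None


# Illustrative mapping; the class name is an assumption.
mapping = {"nvidia.com/gpu": "gpu.nvidia.com"}

print(resolve_device_class("nvidia.com/gpu", mapping))
print(resolve_device_class("deviceclass.k8s.io/foo", mapping))
print(resolve_device_class("example.com/widget", mapping))
```

Option A keeps existing manifests working as-is (question 1), while option B would require manifests to adopt the new names. Giving the explicit mapping precedence over the convention is just one possible choice.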

@lengrongfu
Member

Could each DRA driver implement a webhook that creates a ResourceClaimTemplate when a pod is created and rewrites how the pod requests its resources?
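
For concreteness, a webhook along these lines might rewrite a pod roughly as sketched below. This is only an illustration of the idea, not a proposed implementation: `to_dra` and the naming scheme are made up for the example, manifests are plain dicts, the field names follow my understanding of `resource.k8s.io/v1beta1` and should be checked against the API version in use, and quantities are assumed to be plain integers.

```python
import copy


def to_dra(pod: dict, device_class_for: dict):
    """Return (rewritten pod, ResourceClaimTemplate manifests to create)."""
    pod = copy.deepcopy(pod)  # leave the caller's manifest untouched
    templates = []
    pod_claims = pod["spec"].setdefault("resourceClaims", [])
    for container in pod["spec"]["containers"]:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        for res_name in list(limits):
            device_class = device_class_for.get(res_name)
            if device_class is None:
                continue  # not a resource this driver takes over
            count = int(limits.pop(res_name))
            resources.get("requests", {}).pop(res_name, None)
            # Hypothetical naming scheme, unique within this pod.
            claim_name = f"{container['name']}-{len(pod_claims)}"
            template_name = f"{pod['metadata']['name']}-{claim_name}"
            templates.append({
                "apiVersion": "resource.k8s.io/v1beta1",  # assumed version
                "kind": "ResourceClaimTemplate",
                "metadata": {"name": template_name},
                "spec": {"spec": {"devices": {"requests": [{
                    "name": "devices",
                    "deviceClassName": device_class,
                    "allocationMode": "ExactCount",
                    "count": count,
                }]}}},
            })
            pod_claims.append({
                "name": claim_name,
                "resourceClaimTemplateName": template_name,
            })
            resources.setdefault("claims", []).append({"name": claim_name})
    if not pod_claims:
        pod["spec"].pop("resourceClaims")
    return pod, templates


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-pod"},
    "spec": {"containers": [{
        "name": "main",
        "image": "registry.example/cuda-app:latest",  # placeholder image
        "resources": {"limits": {"cpu": "2", "nvidia.com/gpu": 2}},
    }]},
}

# The device class name is illustrative.
new_pod, templates = to_dra(pod, {"nvidia.com/gpu": "gpu.nvidia.com"})
print(new_pod["spec"]["resourceClaims"])
print(templates[0]["spec"]["spec"]["devices"]["requests"])
```

Note that every driver would need to ship and operate something like this itself, and the rewritten pod no longer matches the manifest the user submitted.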

@klueska
Contributor Author

klueska commented Jan 7, 2025

@lengrongfu that is what this KEP would be designed to avoid. There would be integrated scheduler support for all drivers, rather than requiring each DRA driver to provide a webhook.

@alculquicondor
Member

Open questions (from SIG Scheduling meeting):

  • How to handle resource quotas
  • Scheduling throughput (API requests and overall processing).

@ffromani
Contributor

/cc

Projects
Status: Triage

6 participants