
Batch tasks using Dask in Argo #120

Merged: 16 commits merged into master from workflow/dask-task on Jan 6, 2023

Conversation

@jpolchlo (Collaborator) commented Jan 3, 2023

Overview

This PR provides some infrastructure for running Dask jobs independently of Jupyter. This enables long-running jobs that would otherwise require staying logged into Jupyter, and it opens up a workflow based on standard Python scripts rather than Jupyter notebooks. I've been using a standard form for my scripts so that the cluster can be configured via the Argo job submission interface:

import logging
import os

import dask_gateway

logger = logging.getLogger("DaskWorkflow")
gw = dask_gateway.Gateway(auth="jupyterhub")

try:
    # Cluster size and resources come from environment variables supplied by
    # the Argo workflow parameters.
    opts = gw.cluster_options()
    opts.worker_memory = int(os.environ['DASK_OPTS__WORKER_MEMORY'])
    opts.worker_cores = int(os.environ['DASK_OPTS__WORKER_CORES'])
    opts.scheduler_memory = int(os.environ['DASK_OPTS__SCHEDULER_MEMORY'])
    opts.scheduler_cores = int(os.environ['DASK_OPTS__SCHEDULER_CORES'])
    cluster = gw.new_cluster(opts)
    cluster.scale(int(os.environ['DASK_OPTS__N_WORKERS']))
    client = cluster.get_client()

    logger.warning(f"Client dashboard: {client.dashboard_link}")

    # Client code goes here
finally:
    gw.stop_cluster(client.cluster.name)

Closes #112

Checklist

  • Ran nbautoexport export . in /opt/src/notebooks and committed the generated scripts. This is to make reviewing notebooks easier. (Note the export will happen automatically after saving notebooks from the Jupyter web app.)
  • Documentation updated if needed
  • PR has a name that won't get you publicly shamed for vagueness

Notes

This workflow will eventually be added to the cluster configs as a ClusterWorkflowTemplate, but that will be handled by azavea/kubernetes-deployment#34.

Testing Instructions

  • Start a new workflow
  • Copy in the contents of run-dask-job.yaml into the manual editor
  • Adjust the parameter values in the parameters tab to configure the size of the cluster and the source code location (currently the latter must be specified as an HTTP(S) URL)
  • Run the workflow (there will be a 3–6 minute delay for the Dask resources to come online)
  • If you'd like to monitor the progress, grab the client dashboard URL from the pod logs for the task; append the value to https://jupyter.noaa.azavea.com
  • Since the workflow does not specify any garbage collection, delete the workflow when you're done to avoid stacking up old pods

@jpolchlo requested review from rajadain and vlulla on January 4, 2023 14:53
@vlulla (Contributor) commented Jan 4, 2023

This looks great! I have a minor observation to share: in my exploration of distributed Dask (using SSHCluster), I learned that Client forks a process [1]. I have also learned that it is considered best practice to initialize Client() in the __main__ block. I am not familiar enough with how try/finally interacts with Python interpreter initialization to know whether your setup completely sidesteps this issue, but I thought the finding was worth sharing, hence this comment.

Anyway, this looks great! I am going to emulate this in my Argo workflows and seek your advice on any issues that I run into.

Footnotes

  1. https://github.com/dask/distributed/issues/516#issuecomment-306468605

@vlulla (Contributor) left a comment

Looks good!

@jpolchlo (Collaborator, Author) commented Jan 4, 2023

Is the gist of your comment that I ought to modify the template as follows?

import logging
import os

import dask_gateway

logger = logging.getLogger("DaskWorkflow")

def main():
    gw = dask_gateway.Gateway(auth="jupyterhub")

    try:
        opts = gw.cluster_options()
        opts.worker_memory = int(os.environ['DASK_OPTS__WORKER_MEMORY'])
        opts.worker_cores = int(os.environ['DASK_OPTS__WORKER_CORES'])
        opts.scheduler_memory = int(os.environ['DASK_OPTS__SCHEDULER_MEMORY'])
        opts.scheduler_cores = int(os.environ['DASK_OPTS__SCHEDULER_CORES'])
        cluster = gw.new_cluster(opts)
        cluster.scale(int(os.environ['DASK_OPTS__N_WORKERS']))
        client = cluster.get_client()

        logger.warning(f"Client dashboard: {client.dashboard_link}")

        # Client code goes here
    finally:
        gw.stop_cluster(client.cluster.name)

if __name__ == "__main'":
    main()

It's worth noting that Dask Distributed works differently from Dask Gateway, and the Gateway-based setup shouldn't be relying on threads/processes in the same way. I haven't encountered any difficulty starting a Client from the template as presented (which was not in a main block).

@vlulla (Contributor) commented Jan 4, 2023

Indeed, that is the gist of my comment. Additionally, I think modifying it this way makes the script work correctly when experimenting in a non-Argo environment.

By the way, there's a minor typo: it ought to be "__main__" instead of "__main'".

Thanks for considering my point!

@jpolchlo (Collaborator, Author) commented Jan 4, 2023

Oops! Typo. Thanks. I adjusted the template in the README and the base flow example.

@jpolchlo (Collaborator, Author) commented Jan 4, 2023

As an additional point, it should be noted that without more complex logic, such an example template won't be interchangeable between the cloud environment and a local Dask Distributed environment, since they require different imports and setup, I think.
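
For what it's worth, a minimal sketch of the kind of switching logic that would be needed might look like the following. The DASK_ENV variable here is hypothetical and purely illustrative, not something this PR introduces:

import os


def make_cluster():
    # Hypothetical switch: "local" uses a LocalCluster for development,
    # anything else provisions a cluster through Dask Gateway as in the
    # template above.
    if os.environ.get("DASK_ENV", "gateway") == "local":
        from dask.distributed import LocalCluster
        return LocalCluster(n_workers=int(os.environ.get("DASK_OPTS__N_WORKERS", "4")))
    else:
        import dask_gateway
        gw = dask_gateway.Gateway(auth="jupyterhub")
        opts = gw.cluster_options()
        opts.worker_memory = int(os.environ['DASK_OPTS__WORKER_MEMORY'])
        opts.worker_cores = int(os.environ['DASK_OPTS__WORKER_CORES'])
        cluster = gw.new_cluster(opts)
        cluster.scale(int(os.environ['DASK_OPTS__N_WORKERS']))
        return cluster

Teardown would also differ between the two (gw.stop_cluster for Gateway versus cluster.close() locally), which is part of the extra logic in question.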

@vlulla (Contributor) commented Jan 4, 2023

Yes, point taken!

@rajadain (Collaborator) commented Jan 4, 2023

This could be made a little more explicit like this:

import logging
import os

import dask_gateway

logger = logging.getLogger("DaskWorkflow")

def run_on_cluster(fn):
    gw = dask_gateway.Gateway(auth="jupyterhub")

    try:
        opts = gw.cluster_options()
        opts.worker_memory = int(os.environ['DASK_OPTS__WORKER_MEMORY'])
        opts.worker_cores = int(os.environ['DASK_OPTS__WORKER_CORES'])
        opts.scheduler_memory = int(os.environ['DASK_OPTS__SCHEDULER_MEMORY'])
        opts.scheduler_cores = int(os.environ['DASK_OPTS__SCHEDULER_CORES'])
        cluster = gw.new_cluster(opts)
        cluster.scale(int(os.environ['DASK_OPTS__N_WORKERS']))
        client = cluster.get_client()

        logger.warning(f"Client dashboard: {client.dashboard_link}")

        fn()
    finally:
        gw.stop_cluster(client.cluster.name)


def client_code():
    # Client code goes here
    pass


def main():
    run_on_cluster(client_code)


if __name__ == "__main__":
    main()

Going to try to run the example on the cluster now.

@jpolchlo (Collaborator, Author) commented Jan 6, 2023

@rajadain I took your advice (a bit) and modularized the template a bit more. Your version adds two levels of indirection into the client code, which I simplified somewhat; check the modified README. Was there a particular reason you wanted to pass the client code in as a function argument (the higher-order-function approach)?

@rajadain (Collaborator) commented Jan 6, 2023

Was there a particular reason you wanted to pass the client code in as a function argument (the higher-order-function approach)?

Just for clarity, so the client code is free of distraction. Your solution works well!
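
Incidentally, the same separation can also be expressed by using the wrapper as a decorator, so the client function stays free of cluster plumbing. This is a sketch only, along the lines of the snippet above, and not what landed in the README:

import functools
import logging
import os

import dask_gateway

logger = logging.getLogger("DaskWorkflow")


def on_cluster(fn):
    # Decorator variant of run_on_cluster (illustrative sketch only).
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        gw = dask_gateway.Gateway(auth="jupyterhub")
        opts = gw.cluster_options()
        opts.worker_memory = int(os.environ['DASK_OPTS__WORKER_MEMORY'])
        opts.worker_cores = int(os.environ['DASK_OPTS__WORKER_CORES'])
        opts.scheduler_memory = int(os.environ['DASK_OPTS__SCHEDULER_MEMORY'])
        opts.scheduler_cores = int(os.environ['DASK_OPTS__SCHEDULER_CORES'])
        cluster = gw.new_cluster(opts)
        try:
            cluster.scale(int(os.environ['DASK_OPTS__N_WORKERS']))
            client = cluster.get_client()
            logger.warning(f"Client dashboard: {client.dashboard_link}")
            return fn(*args, **kwargs)
        finally:
            gw.stop_cluster(cluster.name)
    return wrapper


@on_cluster
def client_code():
    pass  # Client code goes here


if __name__ == "__main__":
    client_code()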

@jpolchlo merged commit ca54709 into master on Jan 6, 2023
@jpolchlo deleted the workflow/dask-task branch on January 6, 2023 23:09
Development

Successfully merging this pull request may close these issues: Calculate base flow for NWM stream reaches (#112)

3 participants