Add uses bulkdata argument to paasta spark run #3995

timmow · 2024-12-16T16:11:36Z

This adds a uses-bulkdata argument to paasta spark run so that
https://github.yelpcorp.com/sysgit/yelpsoa-configs/pull/52010 will work as expected.

I'm not checking here whether the /nail/bulkdata volume is already specified in the spark config, e.g.
spark.kubernetes.executor.volumes.hostPath.0.mount.path=/nail/bulkdata

Doing this and also setting uses_bulkdata to True would result in multiple docker volumes being set, which would cause a failure.

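This follows on from [this conversation in slack](https://yelp.slack.com/archives/CA8BWU65D/p1729768030212919) and will allow us to complete [this project](https://yelpwiki.yelpcorp.com/display/PRODENG/Project+Incredible+Bulk).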
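
For illustration only, the check that the description says is being skipped could look roughly like the sketch below. It scans a Spark conf dict for any executor volume whose mount path is /nail/bulkdata. The helper name `spark_conf_mounts_bulkdata` and the sample inputs are hypothetical and are not part of paasta; only the key layout is taken from the example key in the description.

```python
import re
from typing import Mapping

BULKDATA_PATH = "/nail/bulkdata"

# Matches keys such as spark.kubernetes.executor.volumes.hostPath.0.mount.path
_MOUNT_PATH_KEY = re.compile(
    r"^spark\.kubernetes\.executor\.volumes\.[^.]+\.[^.]+\.mount\.path$"
)


def spark_conf_mounts_bulkdata(spark_conf: Mapping[str, str]) -> bool:
    """Hypothetical helper: True if any executor volume mounts /nail/bulkdata."""
    return any(
        bool(_MOUNT_PATH_KEY.match(key)) and value.rstrip("/") == BULKDATA_PATH
        for key, value in spark_conf.items()
    )
```

A caller could use a check like this to refuse, or silently skip, adding a second bulkdata volume when the user has already mounted one through Spark options.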

nemacysts

lgtm (and i think it's perfectly fine to not check if there's a spark config for mounting /nail/bulkdata since it looks like that's not something anyone is currently doing - and i doubt any of our spark users would add such a mount un-prompted)

that said i'll let someone from ml compute ship since they own this file in its entirety :)

chi-yelp

Lgtm. Agree that, as Luis mentioned, there's no need to check whether the user uses Spark options to mount bulkdata, and I also double-checked that no one is doing that.

SuperMatt · 2024-12-17T08:11:15Z

paasta_tools/cli/cmds/spark_run.py

+        "--uses-bulkdata",
+        help="Mount /nail/bulkdata in the container",
+        action="store_true",
+        default=False,
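
As a standalone sketch of how a flag declared this way parses, assuming a plain argparse parser (`build_parser` and the `spark-run-sketch` program name are hypothetical stand-ins, not the real spark_run.py setup):

```python
import argparse


def build_parser(default: bool = False) -> argparse.ArgumentParser:
    # Hypothetical stand-in for the spark-run argument parser.
    parser = argparse.ArgumentParser(prog="spark-run-sketch")
    parser.add_argument(
        "--uses-bulkdata",
        help="Mount /nail/bulkdata in the container",
        action="store_true",
        default=default,
    )
    return parser


# Flag omitted: falls back to the default.
absent = build_parser().parse_args([])
# Flag passed: store_true sets the value to True.
present = build_parser().parse_args(["--uses-bulkdata"])
# Default flipped to True: the value is True even without the flag.
always_on = build_parser(default=True).parse_args([])
```

One thing to keep in mind with `action="store_true"`: if the default is True, the parsed value is True whether or not the flag is passed, so there is no way to turn it off from the command line without a separate negative flag.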


Should we not set the default to true for now, then roll out my change to add the flag everywhere, and then set the default to false?

timmow · 2024-12-18

good point - I thought by default we were getting the /nail/bulkdata mount from the host, but I can't actually see where that is happening. These changes are to configure_and_run_docker_container, which calls run_docker_container, which then calls os.execlpe("paasta_docker_wrapper", *docker_run_cmd, merged_env), which is defined here. None of these seem to include system_volumes from /etc/paasta/volumes.json, so I'm actually unsure why spark-run is mounting the bulkdata volume at all currently.

@nemacysts / @chi-yelp do you have any ideas where the /nail/bulkdata mount is happening? Or is there any way I can test this like I did with #3893 in the paasta playground?

chi-yelp · 2024-12-18

hmm, I checked using the following command and it seems /nail/bulkdata isn't mounted:

./.tox/py38-linux/bin/paasta spark-run --aws-profile=dev --cmd bash

timmow · 2025-01-08

so volumes are found from the instance config here, which is then passed to spark conf here and then later referenced here

Meaning it is important that I check if /nail/bulkdata is in the list of volumes before adding it. And the command @chi-yelp shared is not mounting bulkdata because it's not specifying a service / instance that uses bulkdata - @chi-yelp was that command run on my branch?
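
A minimal sketch of the "check before adding" idea from the comment above. It assumes volumes are dicts with `hostPath`, `containerPath` and `mode` keys and that bulkdata is mounted read-only; both are assumptions for illustration, and `add_bulkdata_if_missing` is a hypothetical name rather than a paasta function.

```python
from typing import Dict, List

Volume = Dict[str, str]

# Assumed shape and mode of the bulkdata volume (not taken from paasta).
BULKDATA_VOLUME: Volume = {
    "hostPath": "/nail/bulkdata",
    "containerPath": "/nail/bulkdata",
    "mode": "RO",
}


def add_bulkdata_if_missing(volumes: List[Volume]) -> List[Volume]:
    """Return a new list with the bulkdata volume appended unless it is already mounted."""
    already_mounted = any(
        v.get("containerPath") == BULKDATA_VOLUME["containerPath"] for v in volumes
    )
    if already_mounted:
        return list(volumes)
    return [*volumes, dict(BULKDATA_VOLUME)]
```

The function returns a new list rather than mutating its input, so an instance config's own volume list is left untouched.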

chi-yelp · 2025-01-13

Sorry for the late reply, yes I ran the command on the branch of this PR after creating the virtualenv with make dev.

The mount paths will be deduplicated here by a dict, but I think it's still good to check the input anyway
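
The dict-based deduplication mentioned above can be sketched as follows. This is not the linked paasta code: it assumes each volume is a dict with a `containerPath` key, and it keeps the last entry seen for a given path, which may or may not match what the real code does. `dedupe_by_container_path` is a hypothetical name.

```python
from typing import Dict, List

Volume = Dict[str, str]


def dedupe_by_container_path(volumes: List[Volume]) -> List[Volume]:
    """Collapse volumes that share a containerPath, keeping the last one seen."""
    by_path: Dict[str, Volume] = {}
    for volume in volumes:
        # A later volume with the same mount path overwrites the earlier one.
        by_path[volume["containerPath"]] = volume
    return list(by_path.values())
```

Because two entries for the same path can differ in `hostPath` or `mode`, silent deduplication hides which one wins, which is one reason an explicit input check can still be worthwhile.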

This makes the change to paasta spark run so that https://github.yelpcorp.com/sysgit/yelpsoa-configs/pull/52010 will work as expected. This works by adding the uses_bulkdata key to the instance config if the spark job has the key present and set to true. I have added this arg to the tests so that they pass; however, we're not explicitly testing that this functionality works. See #3995 for more information about why we're doing this.

This makes the change to paasta spark run so that https://github.yelpcorp.com/sysgit/yelpsoa-configs/pull/52010 will work as expected. This works by adding the uses_bulkdata key to the instance config if the spark job has the key present and set to true. I have added this arg to the tests so that they pass, and also created a test so that we can check all the different ways that uses_bulkdata can be set, either on paasta spark-run as an argument, or in the instance config. See #3995 for more information about why we're doing this.
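
To make "all the different ways that uses_bulkdata can be set" concrete, here is a small sketch of one possible resolution rule. It assumes the setting is enabled if either the command-line argument or the instance config asks for it; that OR semantics is an assumption, and `resolve_uses_bulkdata` is a hypothetical name, not the function used in the PR.

```python
from typing import Any, Mapping


def resolve_uses_bulkdata(cli_flag: bool, instance_config: Mapping[str, Any]) -> bool:
    """Enable bulkdata if the CLI flag or the instance config requests it (assumed rule)."""
    return bool(cli_flag) or bool(instance_config.get("uses_bulkdata", False))


# The four combinations a test for this behaviour would need to cover.
cases = [
    (False, {}),
    (True, {}),
    (False, {"uses_bulkdata": True}),
    (True, {"uses_bulkdata": False}),
]
results = [resolve_uses_bulkdata(flag, config) for flag, config in cases]
```

Under this rule only the first case, with neither source set, leaves bulkdata unmounted.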

timmow · 2025-01-20T15:49:24Z

closing in favor of #4005

timmow requested review from nemacysts, SuperMatt and chi-yelp December 16, 2024 16:12

nemacysts reviewed Dec 16, 2024

chi-yelp previously approved these changes Dec 16, 2024

SuperMatt reviewed Dec 17, 2024

timmow added 2 commits January 13, 2025 06:12

wip

c9de580

timmow dismissed chi-yelp’s stale review via c9de580 January 13, 2025 14:12

timmow force-pushed the u/tmower/uses-bulkdata-spark-run-PERES-5194 branch from 0b43710 to c9de580 Compare January 13, 2025 14:12

SuperMatt mentioned this pull request Jan 14, 2025

Add uses_bulkdata argument to paasta spark run instance_config #4005

Open

timmow closed this Jan 20, 2025

timmow commented Dec 16, 2024

nemacysts left a comment

chi-yelp left a comment

SuperMatt Dec 17, 2024

timmow Dec 18, 2024

chi-yelp Dec 18, 2024 (edited)

timmow Jan 8, 2025

chi-yelp Jan 13, 2025

timmow commented Jan 20, 2025
