6 scheduling #36

vlerkin · 2024-11-07T17:25:47Z

What happens in the PR:

The logic of event watcher was separated in an observer class; the logic of log watching stayed in a log handler class, but the initialization was changed to subscribe to the event in case jobless feature was configured;
The new class KubernetesScheduler was created to handle logic when jobs must be unsuspended and how (ordered);
scheduler endpoint was modified, logic to set a value for start_suspended parameter was added;
schedule method from k8s launcher has a new start_suspended parameter, it's value is passed when called inside the api; also new methods were added: unsuspend_job patches existing suspended job suspend=False, get_running_jobs_count returns the number of jobs that are currently running, list_suspended_jobs returns the list of jobs where spec.suspend is true, _get_job_name extracts the job name from the metadata, it is then used for unsuspend function;

The big picture:
Event watcher connects to the k8s api and receives the stream of events, it then notifies the subscribers if a new event is received and passes it to the provided callback. The subscriber - KubernetesScheduler - receives event in a handle_pod_event method, this method reacts to the changes in job statuses, and if job completed running or failed it calls another method - check_and_unsuspend_jobs - that checks capacity and unsuspends jobs until the number of allowed parallel jobs is reached, while doing this it relies on another method - get_next_suspended_job_id - to unsuspend the most recent job, to keep the order in which jobs were initially scheduled.
When the job is scheduled, based on the number of currently active jobs and max_proc provided in the config (default is 4), the job runs or goes to the queue of suspended jobs (native k8s queue). Then events that change the number of active jobs trigger the logic of KubernetesScheduler class that unsuspend suspended jobs until the desired state (num of parallel jobs) is achieved.

…logic for observer in a RecourceWatcher class; added method to stop a thread gracefully

…that handles the logic to unsuspend jobs and get the next in order according to the creation timestamp; modify schedule endpoint to start jobs suspended if there is already enogh jobs running; modify corresponding function in k8s launcher; add to k8s launcher methods to unsuspend job, to get current number of running jobs, to list suspended jobs and a private method to get job name to be used for unsuspend function

…source watcher instance to enable_joblogs to subscribe to the event watcher if the log feature is configured; delete logic about event watcher from main; pass container for list objects function instead of container name; remove start methon from log handler class; modify joblogs init to subscribe to event watcher

wvengen

Ah, nice you were able to come up with something so quickly already!
I looked at it from a high level, and noticed that this is currently implemented for Kubernetes only (that makes sense), and also setup in such a way that it needs refactoring for Docker. I would think of the scheduler as something that could work for both Docker and Kubernetes, especially the scheduling decisions. Also, there is now k8s-specific code in the main file (e.g. the import), and the kubernetes scheduler, this makes the code somewhat spaghetti: there are specific implementation-specific classes where responsibility is meant to be delegated. If you need to access the scheduler in the main file, use a generic scheduler, and make the docker-based parts not implemented. I think that would give a much cleaner design.

Also, I would consider making the launcher responsible for scheduling. And then have the scheduler talk to the launcher to actually start jobs.

I'm not yet sure if we should allow running without the scheduler, or if it would always be active.

kubernetes.yaml

wvengen · 2024-11-08T08:33:12Z

Hope my feedback was at an angle that helps you at this stage. In any case, well done, keep it going!

p.s. the CI error looks like it could be cause by Kubernetes-specific things having entered into the main api code, which wouldn't work when running with Docker.

vlerkin · 2024-11-11T08:12:26Z

Working on Docker implementation to be added to this PR

…rs and run more from the queue of created jobs when capacity is available; add backgroung thread that sleeps for 5 sec and triggers the function that starts additional containers up to capacity; add a method to gracefully stop the background thread that might be used in the future to stop the thread when app stops; encapsulate k8s and docker related schedule functionality in corresponding launchers and keep api.py launcher agnostic; add max_proc to config for docker

…nd patch is sufficient

wvengen

Great to see a working version! Quite readable :)
I think it needs a little cleanup, but you're getting there, I think.

wvengen · 2024-11-12T11:21:49Z

scrapyd_k8s.sample-docker.conf

@@ -16,6 +16,9 @@ repository   = scrapyd_k8s.repository.Local
 # Since this is the Docker example, we choose Docker here.
 launcher     = scrapyd_k8s.launcher.Docker

+# Maximum number of jobs running in parallel
+max_proc     = 1
+


not sure yet if we want to enable this by default
do you know what scrapyd has in its default configuration?

In scrapyd the max_proc parameter is not default. I think there are reasons to consider making it default because it enhances cluster stability but on the other hand, if a user did not think of optimal resource usage this batch limiting can lead to a smaller output at a given time. So it's beneficial if we need to be conscious about recourses of the prod setting but not very handy if we want to go all in and extract as much data as possible and faster.

Thanks for your thoughts. I think we'd like the default to be the minimum needed, while still having the important elements (e.g. resource limits are important for scheduling). In a cluster there are already limits, so I think it should be fine without. Similarity to scrapyd is also a plus, so I'd favour not having this set by default.

Both implementations are now optional and are activated if a user provided max_proc in the scrapyd section, it is also described in details in CONFIG.md

wvengen · 2024-11-12T11:22:30Z

scrapyd_k8s/api.py

@@ -7,6 +7,7 @@
 from natsort import natsort_keygen, ns

 from .config import Config
+from .k8s_resource_watcher import ResourceWatcher


reference to k8s-specific term k8s_resource_watcher, this file should use the configured backend and launchers instead

It is required for conditional start of joblogs feature in the run method and this method does not rely on a specific launcher, that is why I can't think of a clean way how to remove it from the api.py. But I am open for suggestions if you see a better way.

If the ResourceWatcher is not Kubernetes-specific, perhaps the k8s_ could be dropped from the filename?

Looking at the code, I do see k8s-specific things in there.
I think the launcher needs to be responsible for this somehow. Then the launcher can decide what the k8s and Docker specific parts are.

(sorry, I'm not diving fully now into the whole code, otherwise I could have given a more direct answer making more sense perhaps - can you think of a way to separate the k8s part here?)

I do agree that it shouldn't be here and we don't want to violate SOLID, I will think how to refactor this.

Resolved as a part of PR #38

scrapyd_k8s/k8s_resource_watcher.py

wvengen · 2024-11-12T11:24:13Z

scrapyd_k8s/k8s_resource_watcher.py

+            try:
+                subscriber(event)
+            except Exception as e:
+                logger.exception(f"Error notifying subscriber {subscriber.__name__}: {e}")


why catch this? when do you expect this?

This exception is added so we can separate the exception originated in particular subscriber from watcher and from other subscribers. And we also protect watcher from crashing due to problems in subscribers which is external in terms of the design.
If there is an unexpected edge case for subscriber it is nice to catch it and understand where it comes from, also easier to debug.

Broad except clauses are generally an anti-pattern. I'm not sure yet what to think about this. There is an argument for, but it can also be a sign that the error conditions have not been thought through. So my question would be: when do you expect an exception here? Are these legit cases (e.g. maybe timeouts, then the question is, how do we handle timeouts - does it need to be handled here or a level down)?
If there is a crash of scrapyd-k8s, it will restart, and it is clearer that there is an unexpected error in the application.

@leewesleyv any thoughts?

I think the broad Exception should be part of this try-except clause since it is unpredictable what callbacks are executed (not really since we only have 2 currently, but in terms of future proofness). Also, one of the callbacks also raises a bare Exception when an issue is encountered. We also need to make sure that when one of the subscribers fails, we do send the event to the other subscribers.

However, it might be better to also re-raise that bare Exception as a custom exception so we can do more specific error handling and logging. Same goes for the other callback. Maybe we can design a custom exception that can be raised in the callback that we can then catch while executing these callbacks.

Thank you for your input, colleagues!
What I see here is:

Keep the broad exception for the Observer since we cannot guarantee what subscribers do and we do not want one subscriber to mess up all the features that rely on the Observer and Observer itself.

Think of custom exceptions for subscribers. - this is something I am going to work on.

wvengen · 2024-11-12T11:25:13Z

scrapyd_k8s/k8s_scheduler.py

@@ -0,0 +1,96 @@
+import logging


Can we fit this in the directory structure? I wouldn't expect this in the src root.

For me, this functionality seems related to the kubernetes launcher.

k8s_scheduler is not the part of launcher because it is 1) a subscriber and should have this functionality to subscribe to the observer, 2) be the part which is initialized optionally when a user wants to limit the number of parallel job (launcher is not optional), 3) contains higher-level logic that uses low-level methods and helper methods from the launcher.

If you don't like that this file is located in the root, I can relocate it to the directory that belongs to this feature, say, limit_jobs or something like this, you can pick any name you like and I will add it.

scrapyd_k8s/launcher/k8s.py

wvengen · 2024-11-12T11:28:00Z

scrapyd_k8s/launcher/k8s.py

+            return suspended_jobs
+        except Exception as e:
+            logger.exception(f"Error listing suspended jobs: {e}")
+            return []


Is there a situation where you would want to get both the running and suspended job count? Then it could be nice to do one call to list_namespaced_job to obtain both.

Yes, good point, working on it

So, I've been thinking about this, but those are two very different flows that require the latest cluster state when called, but I can make the code a bit more DRY.

wvengen · 2024-11-12T11:28:34Z

scrapyd_k8s/launcher/k8s.py

+        if not jobs.items:
+            logger.error(f"No job found with job_id={job_id}")
+            return None
+        return jobs.items[0].metadata.name


when you're listing jobs, would you also get the name already?

…nnect loop for event watcher; make number of reconnect attempts, backoff time and a coefficient for exponential growth configurable via config; add backoff_time, reconnection_attempts and backoff_coefficient as attributes to the resource watcher init; add resource_version as a param to w.stream so a failed stream can read from the last resource it was able to catch; add urllib3.exceptions.ProtocolError and handle reconnection after some exponential backoff time to avoid api flooding; add config as a param for init for resource watcher; modify config in kubernetes.yaml and k8s config to contain add backoff_time, reconnection_attempts and backoff_coefficient

…and a label selector to make the code in listjobs, get_running_jobs and list_suspended_jobs DRY; refactor listjobs to use the helper function with the existing _parse_job as a filter_func parameter

…unction because list jobs uses a different logic

vlerkin · 2024-11-20T20:42:51Z

I have problems because I separated this PR partially and now I have a multiverse which I need to refactor to the only source of truth. Going to spend some uncertain amount of time on that.

wvengen · 2024-11-21T06:31:08Z

The way I would do this:

Continue working on this PR, until you need the functionality developed in the other PR (or until it has been merged).
Interactive rebase on the branch of the other PR. Filter out the commits you had here that you rewrote in the other branch.
There may be little or much work to do in resolving conflicts. If it is really many, in various commits, you may consider another route (see below).
Test, done.

Of this is much work in many commits, you may consider first doing an interactive rebase of this PR, to simplify it, and reduce the number of commits (that each may need amending).

Yes, this is a bit of work, but something I come across now and then, in various projects.
Sorry for the complexity!

vlerkin · 2024-11-21T07:29:02Z

Thank you for the advice!
I was thinking of dropping the commit with the merging main to this branch, then make the code work so the tests run if needed. Then merging with that other branch that refactored the observer further and make the code of both branches work together and then check if there are any conflicts with main and resolving those. This is a bit longer way than simply redoing the merge with the main branch but I messed up the last one because I lost track of changes, so gradually rebuilding this branch is a bit easier for me.

No worries, this is me who messed up merging, complexity is part of the job:D Learning to make more granular commits and cleaner PRs the hard way:D

…nnect loop for event watcher; make number of reconnect attempts, backoff time and a coefficient for exponential growth configurable via config; add backoff_time, reconnection_attempts and backoff_coefficient as attributes to the resource watcher init; add resource_version as a param to w.stream so a failed stream can read from the last resource it was able to catch; add urllib3.exceptions.ProtocolError and handle reconnection after some exponential backoff time to avoid api flooding; add config as a param for init for resource watcher; modify config in kubernetes.yaml and k8s config to contain add backoff_time, reconnection_attempts and backoff_coefficient

… connection to the k8s was achieved so only sequential failures detected; add exception handling to watch_pods to handle failure in urllib3, when source version is old and not available anymore, and when stream is ended; remove k8s resource watcher initialization from run function in api.py and move it to k8s.py launcher as _init_resource_watcher; refactor existing logic from joblogs/__init__.py to keep it in _init_resource_watcher and enable_joblogs in k8s launcher

…ed to re-establish connection to the Kubernetes wather

…k to the CONFIG.md in the README.md; remove variables for reconnection_attempts, backoff_time and backoff_coefficient fron the sample config since default values are provided in the code.

… a package that has an enable function in launcher/k8s.py which also part of resource watcher initialization; initialize the scheduler if max_proc was provided in the scrapyd section of the config file; refactor related methods in the launcher to use extra functionality for job number limiting only if max_proc is provided

…; remove max_proc from config file since by default we want to runn all scheduled jobs in parallel; add a section about max_proc to the CONFIG.md

… propagate the exception

…o handle and propagate the critical ones

…on_timestamps to the jobs that do not have them, so they are proccessed at the end of the queue; add unit tests for k8s_scheduler class

vlerkin · 2024-11-27T10:38:02Z

I modified one of the methods in the scheduler (get_next_suspended_job_id) to handle cases if a job does not have a creation_timestamp. It is not expected but if someone used a custom resource and forgot to add this field or made any other error, the job will get the timestamp assigned and will be processed like other jobs in the queue.

Also, there are now unit tests that cover different scenarios for the scheduler.

If you have any other comments for improvements, let me know!

…for docker implementation of the scheduler

leewesleyv · 2024-12-04T12:24:52Z

scrapyd_k8s/joblogs/log_handler_k8s.py

@@ -8,12 +8,20 @@

 logger = logging.getLogger(__name__)

+# Custom Exceptions


I'd move these custom exceptions to a separate file.

Good point! If I make a file with custom exceptions within joblogs module, is it fine? I just feel like making a general file for custom exceptions in the root directory for this optional feature is not a good idea. What do you think?

Fine for sure for now. If we add more exceptions later we could reconsider to make a more central one then.

vlerkin added 4 commits November 1, 2024 16:48

extracted resource watching logic into a separate class; implemented …

4c0a82d

…logic for observer in a RecourceWatcher class; added method to stop a thread gracefully

Merge branch 'main' into 6-scheduling

9dae80d

vlerkin requested a review from wvengen November 7, 2024 17:25

change enable_joblogs method signature in docker

963f279

wvengen reviewed Nov 8, 2024

View reviewed changes

kubernetes.yaml Outdated Show resolved Hide resolved

vlerkin added 3 commits November 11, 2024 14:50

refactor api.py and launcher/k8s.py to keep api.py launcher agnostic

7dc24bc

remove update from RBAC because we perform patching of the resource a…

01b120b

…nd patch is sufficient

wvengen reviewed Nov 12, 2024

View reviewed changes

wvengen mentioned this pull request Nov 18, 2024

6 scheduling observer extraction #38

Merged

vlerkin added 2 commits November 20, 2024 20:20

create a helper finction _filter_jobs that accepts a filter function …

2d5e6bf

…and a label selector to make the code in listjobs, get_running_jobs and list_suspended_jobs DRY; refactor listjobs to use the helper function with the existing _parse_job as a filter_func parameter

change the signature of the filtering method to also accept a parse f…

36217d9

…unction because list jobs uses a different logic

vlerkin force-pushed the 6-scheduling branch from 1a5201c to 36217d9 Compare November 21, 2024 09:03

vlerkin added 5 commits November 21, 2024 14:09

added a CONFIG.md file with detailed explanations about parameters us…

6a64485

…ed to re-establish connection to the Kubernetes wather

move section about config file from README.md to CONFIG.md; add a lin…

3604267

…k to the CONFIG.md in the README.md; remove variables for reconnection_attempts, backoff_time and backoff_coefficient fron the sample config since default values are provided in the code.

Merge branch 'main' into 6-scheduling

6394633

vlerkin force-pushed the 6-scheduling branch from c8b35ad to 6394633 Compare November 21, 2024 17:38

vlerkin added 3 commits November 22, 2024 12:06

implement the feature to limit running jobs to be optional for Docker…

e23bfce

…; remove max_proc from config file since by default we want to runn all scheduled jobs in parallel; add a section about max_proc to the CONFIG.md

add exceptions to the KubernetesScheduler class, for the init method,…

17d14b6

… propagate the exception

vlerkin added 3 commits November 25, 2024 15:50

Merge remote-tracking branch 'upstream/main' into 6-scheduling

a30c531

add custom exceptions to the class for log handling; add exceptions t…

8d8032c

…o handle and propagate the critical ones

modify method that fetches nez job id to be unsuspended to add creati…

331090f

…on_timestamps to the jobs that do not have them, so they are proccessed at the end of the queue; add unit tests for k8s_scheduler class

make the proper test structure according to the standards; add tests …

16a8276

…for docker implementation of the scheduler

leewesleyv reviewed Dec 4, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

6 scheduling #36

6 scheduling #36

vlerkin commented Nov 7, 2024

wvengen left a comment •

edited

Loading

wvengen commented Nov 8, 2024

vlerkin commented Nov 11, 2024

wvengen left a comment

wvengen Nov 12, 2024

vlerkin Nov 12, 2024 •

edited

Loading

wvengen Nov 14, 2024

vlerkin Nov 22, 2024

wvengen Nov 12, 2024

vlerkin Nov 12, 2024

wvengen Nov 12, 2024

wvengen Nov 12, 2024

vlerkin Nov 13, 2024

vlerkin Nov 19, 2024

wvengen Nov 12, 2024

vlerkin Nov 13, 2024 •

edited

Loading

wvengen Nov 14, 2024

leewesleyv Nov 14, 2024

vlerkin Nov 22, 2024

wvengen Nov 12, 2024

wvengen Nov 12, 2024

vlerkin Nov 21, 2024

wvengen Nov 12, 2024

vlerkin Nov 19, 2024

vlerkin Nov 20, 2024

wvengen Nov 12, 2024

vlerkin commented Nov 20, 2024

wvengen commented Nov 21, 2024

vlerkin commented Nov 21, 2024

vlerkin commented Nov 27, 2024

leewesleyv Dec 4, 2024

vlerkin Dec 5, 2024

leewesleyv Dec 6, 2024

		@@ -8,12 +8,20 @@

		logger = logging.getLogger(__name__)

		# Custom Exceptions

6 scheduling #36

Are you sure you want to change the base?

6 scheduling #36

Conversation

vlerkin commented Nov 7, 2024

wvengen left a comment • edited Loading

Choose a reason for hiding this comment

wvengen commented Nov 8, 2024

vlerkin commented Nov 11, 2024

wvengen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vlerkin Nov 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vlerkin Nov 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vlerkin commented Nov 20, 2024

wvengen commented Nov 21, 2024

vlerkin commented Nov 21, 2024

vlerkin commented Nov 27, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wvengen left a comment •

edited

Loading

vlerkin Nov 12, 2024 •

edited

Loading

vlerkin Nov 13, 2024 •

edited

Loading