Replies: 8 comments 11 replies
-
I'm going to argue for a less generic and more email-centered table, something along these lines:
Then you'd need two celery tasks:
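For illustration only (the table, columns and task names below are all made up, not the ones actually proposed), an email-centred table plus a scheduling task and a per-email sending task might look roughly like this:

```python
from datetime import date, datetime
from typing import Optional

from celery import Celery
from sqlalchemy import Date, DateTime, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

app = Celery("lms")  # app name assumed for this sketch


class Base(DeclarativeBase):
    pass


class InstructorEmailDigest(Base):
    """One row per instructor per day, recording whether that digest was sent."""

    __tablename__ = "instructor_email_digest"

    h_userid: Mapped[str] = mapped_column(String, primary_key=True)
    digest_date: Mapped[date] = mapped_column(Date, primary_key=True)
    sent_at: Mapped[Optional[datetime]] = mapped_column(DateTime, nullable=True)


@app.task
def schedule_digests(digest_date: str) -> None:
    """Work out which digests need sending and queue one task per email."""
    ...  # query for rows with sent_at=None and call send_digest.delay() for each


@app.task
def send_digest(h_userid: str, digest_date: str) -> None:
    """Send a single digest email and stamp its row's sent_at."""
    ...  # skip if the row is already stamped, otherwise send and update it
```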
-
Couple of corrections to this bit:
-
A couple of thoughts. You don't need to use the DB as the queue in order to use the DB inside a queued task. You can start a task with an id. The first thing it does is open the DB and look up its details. If the details say it's done (or maybe if they are there at all), the job just exits cleanly without doing anything. In our case we don't need any details apart from "done". The rest of the info for the task can live in Celery as normal, and therefore change easily without any DB migrations. The minimum required to achieve this (assuming present = done) I think is:

```python
from datetime import datetime


class TaskDone:
    key: str
    expires_at: datetime
```

The key needs to be something which is repeatably generated to be the same for the same task. So maybe the start date the task is running for + the email hashed, for example.
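Something like this, say (the exact scheme here is just one possibility):

```python
import hashlib
from datetime import date


def task_done_key(run_date: date, email: str) -> str:
    """Build a key that comes out the same every time for the same task.

    Scheme assumed here: the date the task is running for plus a hash of
    the recipient's email address.
    """
    email_hash = hashlib.sha256(email.lower().encode("utf-8")).hexdigest()
    return f"{run_date.isoformat()}:{email_hash}"


# The same inputs always produce the same key:
key = task_done_key(date(2024, 1, 1), "teacher@example.com")
assert key == task_done_key(date(2024, 1, 1), "teacher@example.com")
```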
I think this would need to change to a job which works out what to send, and individual jobs to send them. This allows each individual task to fail or work independently with the system above. This is a problem we can choose to not have. Queuing tasks to Celery should be extremely fast. If the big task fails, that's fine: you can run it again, generate all the tasks, and they should all quit when they see the record in the DB. You could be smarter and check the DB first and not queue.

We pretty much already have this in the event table, but without expiry. However, this is tripping my heebie-jeebie sensor: I don't think we want to build a second generic mechanism if it's actually only required to do something very specific. The smallest thing which works is probably best.
-
It occurs to me you can make this pretty generic by making the key a hash of the inputs to the task (where the task is deterministic). The only wrinkle here would be a situation where we end up sending exactly the same details two days in a row. You could work around this by:
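For instance (just a sketch, with the hashing details assumed), the key could be a hash of the task name and its inputs, and folding the date into those inputs is one way to keep two days' identical sends distinct:

```python
import hashlib
import json


def key_for(task_name: str, *args, **kwargs) -> str:
    """Derive a repeatable key from a task's name and inputs.

    The same task called with the same arguments always hashes to the same
    key, so a second identical run can see that the work is already done.
    """
    payload = json.dumps([task_name, args, kwargs], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Identical inputs give an identical key; including the date changes the key
# from one day to the next even if everything else is the same.
assert key_for("send_email", "someone@example.com", day="2024-01-01") == \
       key_for("send_email", "someone@example.com", day="2024-01-01")
assert key_for("send_email", "someone@example.com", day="2024-01-01") != \
       key_for("send_email", "someone@example.com", day="2024-01-02")
```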
-
It's not really about deduping, but about idempotence. You make it so it's fine to run your task multiple times. Celery won't run the exact same job (successfully) more than once, but it will happily run as many different jobs as you ask for with the same arguments. They aren't the same job to Celery. So if you don't know if it's worked, and queue it again, it's a new job. That's what we would gain here: Celery guarantees it will run at least once successfully; we step in to ensure it runs at most once successfully.
I think this is close to a useful way of thinking about this. We accept this is a numbers game, you can't really win. Whatever situation you set up, you can think of a scenario that it doesn't work for. What you can do is massively reduce the time window, and so the probability of failure, up to a point where you say it's fine. The less we have to pay to cross that threshold the better. So if (made up numbers incoming, if we have some it would be great to update with them):

The total length of the job is 505s (because we don't queue). At the moment a failure anywhere in that time causes us a problem. Assuming you're right that having individual jobs for emails would pretty much make them work as we'd like, then that change alone would take our "vulnerable" time down to 5.12s, which solves 99% of the problem right there. I suspect this might be enough for practical purposes. Adding idempotence (or some version of it) could then only improve the vulnerable time by the 1% that's left, at best.
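A quick sanity check of those (made-up) numbers:

```python
# The breakdown behind the 505s and 5.12s totals isn't shown here; this just
# checks the ratio the argument rests on.
total_window = 505.0      # seconds during which a failure is a problem today
per_email_window = 5.12   # seconds left once each email is its own task

print(f"{per_email_window / total_window:.1%} of the window remains")
# -> "1.0% of the window remains", i.e. roughly 99% of the exposure is gone
```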
-
I've just realised I'm not 100% sure about the current setup of tasks. Do we have an overarching task currently which is kicking off the chunked tasks? Do we have any estimates / guesstimates for how long each part of the process takes?
-
Ah, I wonder if we're talking about reinventing Celery result backends? Maybe we should just use that. It supports SQLAlchemy. We've talked about potentially needing a results backend for other cases: for example if we want to move some of LMS's currently in-line network requests into Celery tasks. Worth looking into anyway.
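For reference, pointing Celery at its SQLAlchemy ("database") result backend is just a config setting; Celery creates and manages its own result tables in that database (the connection URL below is a placeholder):

```python
from celery import Celery

app = Celery("lms")  # app name assumed for this sketch

# Store task results in the given database via SQLAlchemy instead of discarding them.
app.conf.result_backend = "db+postgresql://user:password@localhost/lms"
```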
-
I've updated the first comment at the top of this discussion with an outline of a solution based on the discussion so far. @marcospri @jon-betts could you give that a read and make any comments that you have? One question I've left unanswered is: should we hash the keys in the
-
(I've edited this post to reflect the solution that I think we've come to in the discussion below.)
Problem
The LMS app's Celery tasks that send emails can potentially end up sending the same email to the same user multiple times, particularly in this scenario:

1. A `send_instructor_email_digests()` task gets called and begins sending emails to a batch of 50 users.
2. The Celery worker gets terminated part-way through (without acknowledging the `send_instructor_email_digests()` task before terminating).
3. The `send_instructor_email_digests()` task gets called again and begins sending the same emails to the same batch of 50 users again. Any of these 50 users whose emails had already been sent by the first call to `send_instructor_email_digests()` in step 1 will now be sent a second time, resulting in duplicate emails being sent.

The above scenario would happen if the `send_instructor_email_digests()` task used late acknowledgment (the `acks_late=True` or `task_acks_late=True` setting). If the task doesn't use late acknowledgment then RabbitMQ won't redeliver the message in step 3 and, instead of some users receiving duplicate emails, you'll have some users not receiving their emails at all (the users whose emails hadn't been sent yet when the worker was terminated).
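For reference, late acknowledgment can be turned on either globally or per task (a minimal sketch; the task signature here is made up):

```python
from celery import Celery

app = Celery("lms")              # app name assumed
app.conf.task_acks_late = True   # global: messages are acked only after the task finishes


@app.task(acks_late=True)        # or per task, overriding the global default
def send_instructor_email_digests(*args, **kwargs):
    """With acks_late the message is only acked after this function returns,
    so RabbitMQ redelivers it if the worker dies mid-run."""
```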
Extracting the email sending into a separate task won't help with this problem. For example, suppose we split out a `send_email()` task that just sends one email per task call. Then instead of generating and sending a batch of 50 emails the `send_instructor_email_digests()` task would have to generate the data for 50 emails and then call 50 instances of the new `send_email()` task. You just get the same problem: if `send_instructor_email_digests()` schedules 25 of the 50 `send_email()` tasks and then gets terminated, then when `send_instructor_email_digests()` gets run again (by RabbitMQ message redelivery) it'll call `send_email()` again for those first 25 emails and send them again. (Or, if it uses early acknowledgment, `send_email()` will never get called for the last 25 emails and they'll never be sent.) This doesn't mean that we won't extract a separate single-email task, because doing so may have other benefits, but it's irrelevant to the fundamental problem of avoiding duplicate emails.
This also applies to `send_instructor_email_digest_tasks()`: the higher-level periodic task that finds all the users to be emailed, groups them into batches, and calls `send_instructor_email_digests()` for each batch. If `send_instructor_email_digest_tasks()` gets terminated mid-execution then it'll either make duplicate calls to `send_instructor_email_digests()` when it gets re-run (late acknowledgment: entire batches of users will get duplicate emails) or fail to make some of the `send_instructor_email_digests()` calls that it should have made (early acknowledgment: entire batches will not get their emails at all).

Solution
Original Slack thread mentioning this idea.
Keep a log of emails sent in a database table: very soon after sending each individual email, commit a DB transaction that adds a row to this new database table logging the send that happened. Before sending any email, check this new database table to make sure that the email hasn't already been sent.

This will make it safe for both the `send_instructor_email_digest_tasks()` and `send_instructor_email_digests()` tasks to be re-run because they won't re-send any emails that they've previously logged in the DB table. So the issue described above with RabbitMQ message redelivery will no longer be a problem.

This will also allow us to simplify these tasks and their tests: both tasks currently use manual Celery retries (the calls to `self.retry()`) with custom arguments to avoid re-sending emails in the case of a Celery retry (note: a Celery retry is not the same thing as a RabbitMQ redelivery). This would no longer be necessary and we could use the much simpler automatic retries.

This won't be perfect: you can imagine the code sending an email and then getting terminated before it commits the DB transaction logging the send, with the result that this email will get re-sent. But it's the best we can do and hopefully it's good enough: it should be rare for a worker termination to happen at exactly the wrong time, after sending an email but before committing the DB transaction.
Proposed new DB table:
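As a sketch (the table and column names and types here are assumptions based on the discussion above), it could be a SQLAlchemy model along these lines:

```python
from datetime import datetime

from sqlalchemy import DateTime, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class TaskDone(Base):
    """One row per task (here: per email) that has already been done."""

    __tablename__ = "task_done"

    # Deterministic identifier for the completed task, e.g.
    # "instructor_email_digest:<h_userid>:<YYYY-MM-DD>".
    key: Mapped[str] = mapped_column(String, primary_key=True)

    # When a periodic clean-up task is allowed to delete this row.
    expires_at: Mapped[datetime] = mapped_column(DateTime)
```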
The `key` just needs to be any string that uniquely identifies the task that has been done and that can be regenerated again deterministically in order to ask whether the task has already been done. For the instructor email digests feature this can be the task type, the `h_userid` of the instructor and the current date (to the day: no time): `"instructor_email_digest:<h_userid>:<YYYY-MM-DD>"`.
A periodic task will delete all rows whose `expires_at` time has passed.
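Putting the pieces together, a rough sketch of the send path and the clean-up task, continuing from the `TaskDone` model sketched above (the session handling, the `send_email()` helper and the 30-day expiry are all assumptions):

```python
from datetime import datetime, timedelta

from sqlalchemy import delete


def send_instructor_email_digest(session, h_userid: str, day: str) -> None:
    key = f"instructor_email_digest:{h_userid}:{day}"

    # 1. Check the log first: if this email has already been sent, do nothing.
    if session.get(TaskDone, key):
        return

    # 2. Send the email (send_email() stands in for whatever builds and sends it).
    send_email(h_userid, day)

    # 3. Log the send and commit as soon as possible afterwards, so a re-run of
    #    the task (e.g. after a RabbitMQ redelivery) sees the row and skips it.
    session.add(TaskDone(key=key, expires_at=datetime.utcnow() + timedelta(days=30)))
    session.commit()


def delete_expired_task_done_rows(session) -> None:
    """The periodic clean-up task: drop rows whose expiry time has passed."""
    session.execute(delete(TaskDone).where(TaskDone.expires_at < datetime.utcnow()))
    session.commit()
```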