Replies: 8 comments 11 replies
-
I'm going to argue for a less generic and more email-centered table, something along these lines:
Then you'd need two celery tasks:
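For illustration only (the table, columns and task names below are all made up, not the ones actually proposed), an email-centred table plus a scheduling task and a per-email sending task might look roughly like this:

```python
from datetime import date, datetime
from typing import Optional

from celery import Celery
from sqlalchemy import Date, DateTime, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

app = Celery("lms")  # app name assumed for this sketch


class Base(DeclarativeBase):
    pass


class InstructorEmailDigest(Base):
    """One row per instructor per day, recording whether that digest was sent."""

    __tablename__ = "instructor_email_digest"

    h_userid: Mapped[str] = mapped_column(String, primary_key=True)
    digest_date: Mapped[date] = mapped_column(Date, primary_key=True)
    sent_at: Mapped[Optional[datetime]] = mapped_column(DateTime, nullable=True)


@app.task
def schedule_digests(digest_date: str) -> None:
    """Work out which digests need sending and queue one task per email."""
    ...  # query for rows with sent_at=None and call send_digest.delay() for each


@app.task
def send_digest(h_userid: str, digest_date: str) -> None:
    """Send a single digest email and stamp its row's sent_at."""
    ...  # skip if the row is already stamped, otherwise send and update it
```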
-
Couple of corrections to this bit:
-
A couple of thoughts. You don't need to use the DB as the queue in order to use the DB inside a queued task. You can start a task with an id. The first thing it does is open the DB and look up its details. If the details say it's done (or maybe if they are there at all), the job just exits cleanly without doing anything. In our case we don't need any details apart from "done". The rest of the info for the task can live in Celery as normal, and therefore change easily without any DB migrations. The minimum required to achieve this (assuming present = done) I think is:

```python
from datetime import datetime


class TaskDone:
    key: str
    expires_at: datetime
```

The key needs to be something which is repeatably generated to be the same for the same task. So maybe the start date the task is running for + the email hashed, for example.
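Something like this, say (the exact scheme here is just one possibility):

```python
import hashlib
from datetime import date


def task_done_key(run_date: date, email: str) -> str:
    """Build a key that comes out the same every time for the same task.

    Scheme assumed here: the date the task is running for plus a hash of
    the recipient's email address.
    """
    email_hash = hashlib.sha256(email.lower().encode("utf-8")).hexdigest()
    return f"{run_date.isoformat()}:{email_hash}"


# The same inputs always produce the same key:
key = task_done_key(date(2024, 1, 1), "teacher@example.com")
assert key == task_done_key(date(2024, 1, 1), "teacher@example.com")
```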
I think this would need to change to a job which works out what to send, and individual jobs to send them. This allows each individual task to fail or work independently with the system above. This is a problem we can choose to not have. Queuing tasks to Celery should be extremely fast. If the big task fails, that's fine: you can run it again, generate all the tasks, and they should all quit when they see the record in the DB. You could be smarter and check the DB first and not queue.

We pretty much already have this in the event table, but without expiry. However, this is tripping my heebie-jeebie sensor: I don't think we want to build a second generic mechanism if it's actually only required to do something very specific. The smallest thing which works is probably best.
-
It occurs to me you can make this pretty generic by making the key a hash of the inputs to the task (where the task is deterministic). The only wrinkle here would be a situation where we end up sending exactly the same details two days in a row. You could work around this by:
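For instance (just a sketch, with the hashing details assumed), the key could be a hash of the task name and its inputs, and folding the date into those inputs is one way to keep two days' identical sends distinct:

```python
import hashlib
import json


def key_for(task_name: str, *args, **kwargs) -> str:
    """Derive a repeatable key from a task's name and inputs.

    The same task called with the same arguments always hashes to the same
    key, so a second identical run can see that the work is already done.
    """
    payload = json.dumps([task_name, args, kwargs], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Identical inputs give an identical key; including the date changes the key
# from one day to the next even if everything else is the same.
assert key_for("send_email", "someone@example.com", day="2024-01-01") == \
       key_for("send_email", "someone@example.com", day="2024-01-01")
assert key_for("send_email", "someone@example.com", day="2024-01-01") != \
       key_for("send_email", "someone@example.com", day="2024-01-02")
```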
-
It's not really about deduping, but about idempotence. You make it so it's fine to run your task multiple times. Celery won't run the exact same job (successfully) more than once, but it will happily run as many different jobs as you ask for with the same arguments. They aren't the same job to Celery. So if you don't know if it's worked, and queue it again, it's a new job. That's what we would gain here: Celery guarantees it will run at least once successfully; we step in to ensure it runs at most once successfully.
I think this is close to a useful way of thinking about this. We accept this is a numbers game, you can't really win. Whatever situation you set up, you can think of a scenario that it doesn't work for. What you can do is massively reduce the time window, and so the probability of failure, up to a point where you say it's fine. The less we have to pay to cross that threshold the better. So if (made up numbers incoming, if we have some it would be great to update with them):

The total length of the job is 505s (because we don't queue). At the moment a failure anywhere in that time causes us a problem. Assuming you're right that having individual jobs for emails would pretty much make them work as we'd like, then that change alone would take our "vulnerable" time down to 5.12s, which solves 99% of the problem right there. I suspect this might be enough for practical purposes. Adding idempotence (or some version of it) could then only improve the vulnerable time by the 1% that's left, at best.
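A quick sanity check of those (made-up) numbers:

```python
# The breakdown behind the 505s and 5.12s totals isn't shown here; this just
# checks the ratio the argument rests on.
total_window = 505.0      # seconds during which a failure is a problem today
per_email_window = 5.12   # seconds left once each email is its own task

print(f"{per_email_window / total_window:.1%} of the window remains")
# -> "1.0% of the window remains", i.e. roughly 99% of the exposure is gone
```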
-
I've just realised I'm not 100% sure about the current setup of tasks. Do we have an overarching task currently which is kicking off the chunked tasks? Do we have any estimates / guesstimates for how long each part of the process takes?
-
Ah, I wonder if we're talking about reinventing Celery result backends? Maybe we should just use that. It supports SQLAlchemy. We've talked about potentially needing a results backend for other cases: for example if we want to move some of LMS's currently in-line network requests into Celery tasks. Worth looking into anyway.
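For reference, pointing Celery at its SQLAlchemy ("database") result backend is just a config setting; Celery creates and manages its own result tables in that database (the connection URL below is a placeholder):

```python
from celery import Celery

app = Celery("lms")  # app name assumed for this sketch

# Store task results in the given database via SQLAlchemy instead of discarding them.
app.conf.result_backend = "db+postgresql://user:password@localhost/lms"
```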
-
I've updated the first comment at the top of this discussion with an outline of a solution based on the discussion so far. @marcospri @jon-betts could you give that a read and make any comments that you have? One question I've left unanswered is: should we hash the keys in the
-
(I've edited this post to reflect the solution that I think we've come to in the discussion below.)
Problem
The LMS app's Celery tasks that send emails can potentially end up sending the same email to the same user multiple times, particularly in this scenario:

1. A `send_instructor_email_digests()` task gets called and begins sending emails to a batch of 50 users.
2. The Celery worker gets terminated part-way through (without acknowledging the `send_instructor_email_digests()` task before terminating).
3. The `send_instructor_email_digests()` task gets called again and begins sending the same emails to the same batch of 50 users again. Any of these 50 users whose emails had already been sent by the first call to `send_instructor_email_digests()` in step 1 will now be sent a second time, resulting in duplicate emails being sent.

The above scenario would happen if the `send_instructor_email_digests()` task used late acknowledgment (the `acks_late=True` or `task_acks_late=True` setting). If the task doesn't use late acknowledgment then RabbitMQ won't redeliver the message in step 3 and, instead of some users receiving duplicate emails, you'll have some users not receiving their emails at all (the users whose emails hadn't been sent yet when the worker was terminated).
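For reference, late acknowledgment can be turned on either globally or per task (a minimal sketch; the task signature here is made up):

```python
from celery import Celery

app = Celery("lms")              # app name assumed
app.conf.task_acks_late = True   # global: messages are acked only after the task finishes


@app.task(acks_late=True)        # or per task, overriding the global default
def send_instructor_email_digests(*args, **kwargs):
    """With acks_late the message is only acked after this function returns,
    so RabbitMQ redelivers it if the worker dies mid-run."""
```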
Extracting the email sending into a separate task won't help with this problem. For example, suppose we split out a `send_email()` task that just sends one email per task call. Then instead of generating and sending a batch of 50 emails the `send_instructor_email_digests()` task would have to generate the data for 50 emails and then call 50 instances of the new `send_email()` task. You just get the same problem: if `send_instructor_email_digests()` schedules 25 of the 50 `send_email()` tasks and then gets terminated, then when `send_instructor_email_digests()` gets run again (by RabbitMQ message redelivery) it'll call `send_email()` again for those first 25 emails and send them again. (Or, if it uses early acknowledgment, `send_email()` will never get called for the last 25 emails and they'll never be sent.) This doesn't mean that we won't extract a separate single-email task, because doing so may have other benefits, but it's irrelevant to the fundamental problem of avoiding duplicate emails.
This also applies to `send_instructor_email_digest_tasks()`: the higher-level periodic task that finds all the users to be emailed, groups them into batches, and calls `send_instructor_email_digests()` for each batch. If `send_instructor_email_digest_tasks()` gets terminated mid-execution then it'll either make duplicate calls to `send_instructor_email_digests()` when it gets re-run (late acknowledgment: entire batches of users will get duplicate emails) or fail to make some of the `send_instructor_email_digests()` calls that it should have made (early acknowledgment: entire batches will not get their emails at all).

Solution
Original Slack thread mentioning this idea.
Keep a log of emails sent in a database table: very soon after sending each individual email, commit a DB transaction that adds a row to this new database table logging the send that happened. Before sending any email, check this new database table to make sure that the email hasn't already been sent.

This will make it safe for both the `send_instructor_email_digest_tasks()` and `send_instructor_email_digests()` tasks to be re-run because they won't re-send any emails that they've previously logged in the DB table. So the issue described above with RabbitMQ message redelivery will no longer be a problem.

This will also allow us to simplify these tasks and their tests: both tasks currently use manual Celery retries (the calls to `self.retry()`) with custom arguments to avoid re-sending emails in the case of a Celery retry (note: a Celery retry is not the same thing as a RabbitMQ redelivery). This would no longer be necessary and we could use the much simpler automatic retries.

This won't be perfect: you can imagine the code sending an email and then getting terminated before it commits the DB transaction logging the send, with the result that this email will get re-sent. But it's the best we can do and hopefully it's good enough: it should be rare for a worker termination to happen at exactly the wrong time, after sending an email but before committing the DB transaction.
Proposed new DB table:
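As a sketch (the table and column names and types here are assumptions based on the discussion above), it could be a SQLAlchemy model along these lines:

```python
from datetime import datetime

from sqlalchemy import DateTime, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class TaskDone(Base):
    """One row per task (here: per email) that has already been done."""

    __tablename__ = "task_done"

    # Deterministic identifier for the completed task, e.g.
    # "instructor_email_digest:<h_userid>:<YYYY-MM-DD>".
    key: Mapped[str] = mapped_column(String, primary_key=True)

    # When a periodic clean-up task is allowed to delete this row.
    expires_at: Mapped[datetime] = mapped_column(DateTime)
```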
The `key` just needs to be any string that uniquely identifies the task that has been done and that can be regenerated again deterministically in order to ask whether the task has already been done. For the instructor email digests feature this can be the task type, the `h_userid` of the instructor and the current date (to the day: no time): `"instructor_email_digest:<h_userid>:<YYYY-MM-DD>"`.
A periodic task will delete all rows whose `expires_at` time has passed.
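Putting the pieces together, a rough sketch of the send path and the clean-up task, continuing from the `TaskDone` model sketched above (the session handling, the `send_email()` helper and the 30-day expiry are all assumptions):

```python
from datetime import datetime, timedelta

from sqlalchemy import delete


def send_instructor_email_digest(session, h_userid: str, day: str) -> None:
    key = f"instructor_email_digest:{h_userid}:{day}"

    # 1. Check the log first: if this email has already been sent, do nothing.
    if session.get(TaskDone, key):
        return

    # 2. Send the email (send_email() stands in for whatever builds and sends it).
    send_email(h_userid, day)

    # 3. Log the send and commit as soon as possible afterwards, so a re-run of
    #    the task (e.g. after a RabbitMQ redelivery) sees the row and skips it.
    session.add(TaskDone(key=key, expires_at=datetime.utcnow() + timedelta(days=30)))
    session.commit()


def delete_expired_task_done_rows(session) -> None:
    """The periodic clean-up task: drop rows whose expiry time has passed."""
    session.execute(delete(TaskDone).where(TaskDone.expires_at < datetime.utcnow()))
    session.commit()
```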