Video deletion backlog/maintenance #1431

Open
becky-gilbert opened this issue Jul 1, 2024 · 0 comments
becky-gilbert commented Jul 1, 2024

Summary

Due to problems with our system for video deletion in S3 (#1423, #1430), we have a backlog of videos in S3 that are not in our DB and need to be deleted. We may also want to consider adding a task to check the S3 videos against those in our DB, so that any lingering S3 videos that should be deleted are cleaned up as part of regular maintenance.

Description

We recently found a problem with our S3 video deletion process, and as a result we will need to address the backlog of video files (~300) in S3 that should've been deleted. We can do this by:

  1. getting the file names from the "Video.DoesNotExist" Sentry error that is generated when a file could not be deleted, and/or
  2. comparing the video file names from S3 with those in our DB and removing any from S3 that do not exist in our DB.

One open question is whether to do this "manually" (i.e. triggered and monitored by a dev, though it could be partially automated with a script that generates the list of files and then deletes them via the AWS CLI), or via a fully-automated Celery task. If we go the Celery route, we would need to put some safeguards in place to ensure that we never accidentally delete valid videos (e.g. if there were a database connection problem).
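The semi-automated option could be sketched as a small helper that computes the set difference between the two lists of file names; the function name and sample keys below are purely illustrative:

```python
def find_orphaned_keys(s3_keys, db_keys):
    """Return S3 object keys that have no matching record in the database.

    Both arguments are iterables of video file names; the result is
    sorted so a dev can review the list before anything is deleted.
    """
    return sorted(set(s3_keys) - set(db_keys))

# Example: two lingering S3 files that no longer exist in the DB.
orphans = find_orphaned_keys(
    s3_keys=["vid_a.mp4", "vid_b.mp4", "vid_c.mp4"],
    db_keys=["vid_b.mp4"],
)
# orphans == ["vid_a.mp4", "vid_c.mp4"]
```

The resulting list could then be written to a file, reviewed, and deleted via the AWS CLI (one `aws s3 rm` per reviewed key).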

Proposal

I suggest we make this a fully-automated Celery task that does the following for all video storage buckets:

  • Get list of all video file names from bucket (perhaps filter by date created, and only grab those older than e.g. 1 year)
  • Get list of all videos that are currently queued for cloud deletion (as part of our 7-day soft deletion, e.g. for deleted preview data)
  • For each S3 video, if it does not exist in the database and is not already queued for deletion, then delete it immediately.
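The steps above might look roughly like the following sketch. The helper names (`iter_bucket_keys`, `video_exists`, `is_queued_for_deletion`, `delete_object`) are assumptions standing in for the real S3 and ORM calls, which keeps the core loop testable on its own:

```python
from datetime import datetime, timedelta, timezone

def clean_orphaned_videos(iter_bucket_keys, video_exists,
                          is_queued_for_deletion, delete_object,
                          min_age_days=365, now=None):
    """Delete S3 videos with no DB record that are not already queued.

    iter_bucket_keys yields (key, last_modified) pairs from the bucket;
    the other callables wrap the DB existence check, the soft-deletion
    queue check, and the actual S3 delete. Returns the deleted keys.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    deleted = []
    for key, last_modified in iter_bucket_keys():
        if last_modified > cutoff:
            continue  # only touch files older than the age threshold
        if video_exists(key) or is_queued_for_deletion(key):
            continue  # still referenced, or already being handled
        delete_object(key)
        deleted.append(key)
    return deleted
```

In the real task, `iter_bucket_keys` would presumably wrap a boto3 `list_objects_v2` paginator and `video_exists` a `Video.objects.filter(...).exists()` query, with the whole thing run once per storage bucket.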

Implementation notes:

  • We could get the full sets of all S3 videos and all DB videos and compare them in one go, but that would be memory-intensive. Hence the much slower but memory-light approach of just getting the S3 video list and iterating through that.
  • We could queue each orphaned video for soft deletion by adding it to the delete_video_from_cloud queue, but there's probably no point in doing that. These cases differ from videos deleted via user actions (deleting preview responses, checking the 'withdraw' box in the exit survey), where it is possible for users to realize that they made an error and get in touch with us about trying to recover data.
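One possible shape for the safeguard against the database-connection failure mode mentioned above: treat any DB error as "the video exists" so the task skips rather than deletes, and abort the whole run if errors accumulate. A minimal sketch, assuming an injectable `lookup` callable (the exception handling and threshold are illustrative, not the actual implementation):

```python
def safe_video_exists(key, lookup, errors, max_errors=5):
    """Wrap the DB existence check so failures never cause a deletion.

    On any lookup error, record it and report the video as existing
    (i.e. skip it); abort the entire run once too many errors pile up,
    since that suggests a systemic problem such as a lost DB connection.
    """
    try:
        return lookup(key)
    except Exception as exc:  # in practice, a narrower DB error class
        errors.append(exc)
        if len(errors) >= max_errors:
            raise RuntimeError("too many DB errors; aborting cleanup") from exc
        return True  # fail safe: never delete when uncertain
```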