
Support exporting source media to google drive #115

Open · wants to merge 22 commits into main from pm-save-media-google-drive
Conversation

philmcmahon (Contributor)

@philmcmahon philmcmahon commented Jan 8, 2025

What does this change?

This PR adds a new option to the export page to export the original source media to Google Drive. As part of this change I've redesigned the export page - it now looks like this:

[Screenshot: redesigned export page, 2025-01-08]

The export has several stages - see the screenshots below.

On the assumption that, following this change, users will frequently be exporting more than one file, all exported files (even if there's just one) are now stored in a subfolder of the 'Guardian Transcribe Tool' folder, suffixed with the date/time in case there are multiple files with the same name.
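The timestamp-suffixed subfolder naming could be sketched like this (a hypothetical helper, not the PR's actual code; the exact suffix format in the PR may differ):

```typescript
// Sketch: build a per-export subfolder name by suffixing the base name with
// a timestamp, so repeated exports of same-named files don't collide.
// Function name and format are illustrative.
const exportFolderName = (baseName: string, now: Date = new Date()): string => {
  // ISO timestamp with colons replaced so the name is filesystem/Drive friendly,
  // e.g. "Interview 2025-01-08T16-59-58"
  const stamp = now.toISOString().slice(0, 19).replace(/:/g, "-");
  return `${baseName} ${stamp}`;
};
```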

I decided to use a lambda to perform the export to Google Drive. This has the advantages of being easier to set up than an ECS task, and faster to start. The disadvantages are that we can only export files up to 10GB (the maximum ephemeral storage), and we only have 15 minutes to do the upload. In my (limited) testing, the lambda exported a 1.2GB file in 70 seconds, so I suspect we'll be limited more by the maximum file size than by the timeout - but only just.
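A pre-flight check against the 10GB ephemeral-storage ceiling might look like the sketch below (constant and function names are hypothetical, not the PR's identifiers):

```typescript
// Hypothetical pre-flight check mirroring the limits described above:
// Lambda ephemeral storage caps exports at 10GB, so reject larger files
// before starting any download/upload work.
const MAX_EXPORT_BYTES = 10 * 1024 ** 3; // 10GB ephemeral storage ceiling

const checkExportable = (sizeInBytes: number): { ok: boolean; reason?: string } => {
  if (sizeInBytes > MAX_EXPORT_BYTES) {
    const sizeGb = (sizeInBytes / 1024 ** 3).toFixed(1);
    return { ok: false, reason: `File is ${sizeGb}GB, above the 10GB lambda limit` };
  }
  return { ok: true };
};
```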

I had to use a separate lambda function for this rather than the API itself because API Gateway has a 30s timeout, and once the lambda returns an HTTP response it gets terminated. There are workarounds for this, but I couldn't find anything that works nicely with serverless-express, so I decided to create a separate function (with the added advantage that our API lambda doesn't need loads of memory/disk space).
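The fire-and-forget pattern this describes could be sketched as follows: the API builds a Lambda invocation with `InvocationType: "Event"` (asynchronous), so it can respond within API Gateway's 30s window while the export lambda runs for up to 15 minutes. The payload shape and function name here are illustrative assumptions, not the PR's actual interface:

```typescript
// Hypothetical request shape passed from the API to the export lambda
interface ExportRequest {
  transcriptId: string;
  driveFolderId: string;
}

// Build the parameters for an asynchronous Lambda invocation.
// "Event" means the caller doesn't wait for the upload to finish.
const buildInvocation = (req: ExportRequest) => ({
  FunctionName: "media-export", // hypothetical function name
  InvocationType: "Event" as const,
  Payload: JSON.stringify(req),
});
```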

There is error reporting for the case where the file is too large. I still need to add an error for when the lambda times out whilst performing the export - I might leave that for a future PR though.

The feature relies on the file extension to tell Google Drive what type the file is - this seems to work reasonably well. A future improvement could run Apache Tika or something similar on the file to determine the file type.
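A minimal sketch of extension-based MIME detection, as described above (the mapping and helper name are illustrative, not the PR's code):

```typescript
// Map of common media extensions to MIME types; deliberately incomplete.
const MIME_BY_EXTENSION: Record<string, string> = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  mp4: "video/mp4",
  mov: "video/quicktime",
};

// Infer a MIME type from the file extension, falling back to a generic
// binary type when the extension is missing or unknown.
const mimeTypeForFile = (fileName: string): string => {
  const ext = fileName.split(".").pop()?.toLowerCase() ?? "";
  return MIME_BY_EXTENSION[ext] ?? "application/octet-stream";
};
```

This is where something like Apache Tika would be more robust, since it inspects the file's contents rather than trusting the name.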

In theory the uploadFileToGoogleDrive function should stream the file 128MB at a time; in practice, I found that the function ran out of memory when uploading a 1.2GB file with the lambda set to only 512MB. This needs more investigation - for now I have set the memory to 2GB. I think it's worth getting in as is, because my 1.2GB test file came from a 1h30 YouTube video, and I suspect many videos will be under that length. It might be a bit of fun, though, to try to work out how memory management in Node works.
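For reference, the intended chunked-streaming behaviour could be sketched like this (a hypothetical helper, not the PR's uploadFileToGoogleDrive; in practice memory stays bounded only if the consumer awaits each chunk before the stream reads the next one, which may be where the observed blow-up comes from):

```typescript
import { createReadStream } from "node:fs";

// 128MB chunk size, as described above
const CHUNK_SIZE = 128 * 1024 * 1024;

// Read a file in fixed-size chunks, awaiting the consumer between reads so
// that roughly one chunk is resident in memory at a time. Returns the total
// number of bytes streamed.
async function streamInChunks(
  path: string,
  onChunk: (chunk: Buffer) => Promise<void>,
  chunkSize: number = CHUNK_SIZE,
): Promise<number> {
  let total = 0;
  const stream = createReadStream(path, { highWaterMark: chunkSize });
  for await (const chunk of stream) {
    // Awaiting here applies back-pressure: the stream pauses until the
    // consumer (e.g. a Drive upload call) has finished with this chunk.
    await onChunk(chunk as Buffer);
    total += (chunk as Buffer).length;
  }
  return total;
}
```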

How to test

This is currently live on CODE; you can try it out here: https://transcribe.code.dev-gutools.co.uk/

Screenshots

[Screenshots: the stages of the export flow, 2025-01-10]

@philmcmahon philmcmahon requested a review from a team as a code owner January 8, 2025 17:20
@philmcmahon philmcmahon marked this pull request as draft January 9, 2025 10:03
@philmcmahon philmcmahon force-pushed the pm-save-media-google-drive branch from d6baa92 to 0ef1107 on January 9, 2025 12:35
@philmcmahon philmcmahon marked this pull request as ready for review January 10, 2025 15:34