diff --git a/Doc/Case_study.md b/Doc/Case_study.md index e021d495..070fb2bc 100644 --- a/Doc/Case_study.md +++ b/Doc/Case_study.md @@ -1,7 +1,7 @@ # BFI National Archive RAWcooked Case Study **by Joanna White, Knowledge & Collections Developer** -At the [BFI National Archive](https://www.bfi.org.uk/bfi-national-archive) we have been encoding DPX sequences to FFV1 Matroska since late 2019. In that time our RAWcooked workflow has evolved with the development of RAWcooked, DPX resolutions and flavours and changes in our encoding project priorities. Today we have a fairly hands-off automated workflow which handles 2K and 4K image sequences. This workflow is built on some of the flags developed in RAWcooked by Media Area and written in a mix of Bash shell and Python3 scripts ([BFI Data & Digital Preservation GitHub](https://github.com/bfidatadigipres/dpx_encoding)). In addition to RAWcooked we use other Media Area tools to complete necessary stages of this workflow. Our encoding processes do not include any alpha channel or audio file processing, but RAWcooked is capable of encoding both into the completed FFV1 Matroska. +At the [BFI National Archive](https://www.bfi.org.uk/bfi-national-archive) we have been encoding DPX sequences to FFV1 Matroska since late 2019. In that time our RAWcooked workflow has evolved with the development of RAWcooked, DPX resolutions, flavours and changes in our encoding project priorities. Today we have a fairly hands-off automated workflow which handles 2K and 4K image sequences. This workflow is built on some of the flags developed in RAWcooked by Media Area and written in a mix of Bash shell and Python3 scripts ([BFI Data & Digital Preservation GitHub](https://github.com/bfidatadigipres/dpx_encoding)). In addition to RAWcooked we use other Media Area tools to complete necessary stages of this workflow. 
Our encoding processes do not include any alpha channel or audio file processing, but RAWcooked is capable of encoding both into the completed FFV1 Matroska. This case study is broken into the following sections: * [Server configurations](#server_config) @@ -45,9 +45,9 @@ When encoding 2K RGB we generally reach between 3 and 10 frames per second (fps) For each image sequence processed the metadata of the first DPX is collected and saved into the sequence folder, along with total folder size in bytes and a list of all contents of the sequence. We collect this information using [Media Area's MediaInfo software](https://mediaarea.net/MediaInfo) and capture the output into script variables. -Next the first file within the image sequence is checked against a [Media Area's MediaConch software](https://mediaarea.net/MediaConch) policy for the file ([BFI's DPX policy](https://github.com/bfidatadigipres/dpx_encoding/blob/main/rawcooked_dpx_policy.xml)). If it passes then we know it can be encoded by RAWcooked and by our current licence. Any that fail are assessed for possible RAWcooked licence expansion or possible anomalies in the DPX. +Next the first file within the image sequence is checked against a DPX policy created using [Media Area's MediaConch software](https://mediaarea.net/MediaConch) - ([BFI's DPX policy](https://github.com/bfidatadigipres/dpx_encoding/blob/main/rawcooked_dpx_policy.xml)). If it passes then we know it can be encoded by RAWcooked and by our current licence. Any that fail are assessed for possible RAWcooked licence expansion or possible anomalies in the DPX. -The frame pixel size and colourspace of the sequence are used to calculate the potential reduction rate of the RAWcooked encode based on previous reduction experience. We make an assumption that 2K RGB will always be atleast one third smaller, so calculate a 1.3TB sequences to make a 1TB FFV1 Matroska. 
For 2K Luma and all 4K we must assume that very small size reductions could occur so map 1TB to 1TB. This step is necessary to control file ingest sizes to our Digital Preservation Infrastructure where we currently have a maximum verifiable ingest file size of 1TB. Where a sequence is over 1TB we have Python scripts to split that DPX sequence across additional folders depending on total size. +The frame pixel size and colourspace of the sequence are used to calculate the potential reduction rate of the RAWcooked encode based on previous reduction experience. We make an assumption that 2K RGB will always be at least one third smaller, so we calculate that a 1.3TB sequence will make a 1TB FFV1 Matroska. For 2K Luma and all 4K we must assume that very small size reductions could occur so map 1TB to 1TB. This step is necessary to control file ingest sizes to our Digital Preservation Infrastructure where we currently have a maximum verifiable ingest file size of 1TB. Where a sequence is over 1TB we have Python scripts to split that DPX sequence across additional folders depending on total size. | RAWcooked 2K RGB | RAWcooked Luma & RAWcooked 4K | | -------------------- | ----------------------------- | @@ -56,7 +56,7 @@ The frame pixel size and colourspace of the sequence are used to calculate the p ### Encoding the image sequence -To encode our image sequences we use the ```--all``` flag released in RAWcooked v21. This flag was a sponsorship development by [NYPL](https://www.nypl.org/), and sees several preservation essential flags merged into this one simple flag. Most imporantly it includes the creation of checksum hashes for every image file in the sequence, with this data being saved into the RAWcooked reversibility file and embedded into the Matroska wrapper. This ensures that when decoded the retrieved sequence can be verified as bit-identical to the original source sequence. +To encode our image sequences we use the ```--all``` flag released in RAWcooked v21. 
This flag was a sponsorship development by [NYPL](https://www.nypl.org/), and sees several preservation-essential flags merged into one. Most importantly it includes the creation of checksum hashes for every image file in the sequence, with this data being saved into the RAWcooked reversibility file and embedded into the Matroska wrapper. This ensures that when decoded the retrieved sequence can be verified as bit-identical to the original source sequence. Our RAWcooked encode command: ``` @@ -67,7 +67,7 @@ rawcooked -y --all --no-accept-gaps -s 5281680 path/sequence_name/ -o path/seque | ---------------------- | ------------------------------------------ | | ```rawcooked``` | Calls the software | | ```-y``` | Answers 'yes' to software questions | -| ```-all``` | Preservation command with CRC-32 hashes | +| ```--all``` | Preservation command with CRC-32 hashes | | ```--no-accept-gaps``` | Exit with warning if sequence gaps found | | | --all command defaults to accepting gaps | | ```-s 5281680``` | Set max attachment size to 5MB | @@ -75,14 +75,14 @@ rawcooked -y --all --no-accept-gaps -s 5281680 path/sequence_name/ -o path/seque | ```>>``` | Capture console output to text file | | ```2>&1``` | stderr and stdout messages captured in log | -This command is generally launched from within a Bash script, and is passed to [GNU Parallel](https://www.gnu.org/software/parallel/) to run multiple encodes at the same time. This software makes it very simple to fix a specific number of encodes specified by the ```--jobs``` flag. Parallelisation is the act of processing jobs in parallel, dividing up the work to save time. If not run in parallel a computer will usually process jobs serially, one after another. As well as parallelisation, FFmpeg usinges multi-threading to create the FFV1 file. The FFV1 codec has slices through each frame (64 slice minimum in RAWcooked frame) which allows for granular checksum verification, but also allows FFmpeg multi-threading. 
Each slice block is split into different processing tasks and run across your CPU threads, so for our server that works as 64 separate tasks per thread, one slice per frame of the FFV1 file. +This command is generally launched from within a Bash script, and is passed to [GNU Parallel](https://www.gnu.org/software/parallel/) to run concurrent encodings. This software makes it very simple to fix a specific number of encodes using the ```--jobs``` flag. Parallelisation is the act of processing jobs in parallel, dividing up the work across threads to maximise efficiency. If not run in parallel a computer will usually process jobs serially, one after another. As well as parallelisation, FFmpeg uses multi-threading to create the FFV1 file. The FFV1 codec has slices through each frame (usually between 64 and 576 slices) which allows for granular checksum verification, but also allows FFmpeg multi-threading. Each slice block is split into different processing tasks and run across your CPU threads, so for our server that works as 64 separate tasks per thread, one slice per frame of the FFV1 file. By listing all the image sequence paths in one text file you can launch a parallel command like this to run 5 parallel encodes: ``` cat sequence_list.txt | parallel --jobs 5 "rawcooked -y --all --no-accept-gaps -s 5281680 {} -o {}.mkv >> {}.mkv.txt 2>&1" ``` -We always capture our console logs for every encode. The ```2>&1``` ensures any error messages are output alongside the usual standard console messages for the software. These are essential for us to review if a problem is found with an encode. Over time they also provide a very clear record of changes encountered in FFmpeg and RAWcooked software, and valuable metadata of our own image sequence files. These logs have been critical in identifying unanticipated edge cases with some DPX encodings, allowing for impact assessment by Media Area. 
We would definitely encourage all RAWcooked users to capture and retain this information as part of their long-term preservation of their RAWcooked sequences. +We always capture our console logs for every encode job. The ```2>&1``` ensures any error messages are output alongside the usual standard console messages for the software. These are essential for us to review if a problem is found with an FFV1 Matroska. Over time they also provide a very clear record of changes encountered in FFmpeg and RAWcooked software, and valuable metadata of our own image sequence files. These logs have been critical in identifying unanticipated edge cases with some DPX encodings, allowing for impact assessment by Media Area. We would definitely encourage all RAWcooked users to capture and retain this information as part of the long-term preservation of RAWcooked encoded sequences. ### Encoding log assessment @@ -156,7 +156,7 @@ Reversibility was checked, no issue detected. ``` -If an encoding has completed then in this last section you might see different types of human readable message including: +If an encoding has completed then in this last section you might see different types of messages that include: * Warnings about the image sequence files * Errors experienced during encoding * Information about the RAWcooked encode (shown above) @@ -183,13 +183,13 @@ Error: unsupported DPX (non conforming) alternate end of line non padding Please contact info@mediaarea.net if you want support of such content. ``` -The automation scripts used at the BFI National Archive look for any messages that have 'Error' in them. If any are found the FFV1 Matroska is deleted and the sequence is queued for a repeated encode attempt. Likewise, if the completion statement suggests a failure then the FFV1 is deleted and the sequence is queued for a repeat encode. 
A successful completion statement should always read: +The automation scripts used at the BFI National Archive look for any messages that have 'Error' in them. If any are found the FFV1 Matroska is deleted and the sequence is queued for a repeated encoding attempt. Likewise, if the completion statement suggests a failure then the FFV1 is deleted and the sequence is queued for a repeat encode. A successful completion statement should always read: ```Reversibility was checked, no issue detected.``` There is one error message that triggers a specific type of re-encode: ```Error: the reversibility file is becoming big | Error: undecodable file is becoming too big``` -For this error we know that we need to re-encode our image sequence with the additional flag ```--output-version 2``` which writes the large reversibility data to the FFV1 Matroska once encoding has completed. FFmpeg has an upper size limit of 1GB for attachments. If there is lots of additional data stored in your DPX file headers then this flag will ensure that your FFV1 Matroska completes fine and the data remains verifiably reversible. FFV1 Matroskas that are encoded using the ```--output-version 2``` flag are not backward compatible with RAWcooked version before V 21.09. +For this error we know that we need to re-encode our image sequence with the additional flag ```--output-version 2``` which writes the large reversibility data to the FFV1 Matroska once encoding has completed. FFmpeg has an upper size limit of 1GB for attachments. If there is additional data stored in your DPX file headers (not zero padding) then this flag will ensure that this data is stored safely into the reversibility data and that the FFV1 Matroska remains verifiably reversible. FFV1 Matroskas that are encoded using the ```--output-version 2``` flag are not backward compatible with RAWcooked versions before V21.09. 
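The log assessment rules described above could be sketched in Python roughly as follows. This is a minimal illustration, not the BFI's own script: the function name `assess_log` and the three outcome labels are hypothetical, and the matched strings are copied from the messages quoted in this case study, so they are assumed to match what your RAWcooked version prints.

```python
# Completion statement as quoted in this case study. Matching on this prefix
# accepts both the "no issue detected" and "no issues detected" wordings.
SUCCESS_PREFIX = "Reversibility was checked, no issue"

# Error messages that this case study says call for a re-encode
# with the additional --output-version 2 flag.
LARGE_REVERSIBILITY_ERRORS = (
    "Error: the reversibility file is becoming big",
    "Error: undecodable file is becoming too big",
)


def assess_log(log_text):
    """Classify a captured RAWcooked console log.

    Returns one of three hypothetical labels:
      'retry_output_v2' - re-encode with --output-version 2
      'retry'           - delete the FFV1 Matroska and queue a repeat encode
      'pass'            - success statement found and no 'Error' messages
    """
    lines = [line.strip() for line in log_text.splitlines()]

    # The large reversibility data error takes priority over other errors.
    if any(line.startswith(LARGE_REVERSIBILITY_ERRORS) for line in lines):
        return "retry_output_v2"

    # Any other message containing 'Error' means the encode is repeated.
    if any("Error" in line for line in lines):
        return "retry"

    # Without the completion statement the encode is treated as failed.
    if any(line.startswith(SUCCESS_PREFIX) for line in lines):
        return "pass"
    return "retry"
```

A script could call `assess_log(open(log_path).read())` on each captured log, where `log_path` is whichever text file the encode output was redirected to, and then act on the returned label.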
### FFV1 Matroska validation @@ -204,24 +204,24 @@ Again the stderr and stdout messages are captured to a log, and this log is chec ### FFV1 Matroska decode to image sequence -We have automation scripts that return an FFV1 Matroska back to the original image sequence. These are essential for our film preseration colleagues who may need to perform grading or enhancement work on preserved films. For this we use the ```--all``` command again which can select decode when an FFV1 Matroska is supplied. +We have automation scripts that return an FFV1 Matroska back to the original image sequence. These are essential for our film preservation colleagues who may need to perform grading or enhancement work on preserved films. For this we use the ```--all``` command again which automatically selects decode when an FFV1 Matroska is supplied. This simple script runs this command: ``` rawcooked -y --all path/sequence_name.mkv -o path/decode_sequence >> path/sequence_name.txt 2>&1 ``` -It decodes the FFV1 Matroska back to image sequence, checks the logs for ```Reversibility was checked, no issue detected``` and reports the outcome to a script log. +It decodes the FFV1 Matroska back to its original form as a DPX image sequence, checks the logs for ```Reversibility was checked, no issue detected``` and reports the outcome to a script log. --- ## Conclusion We began using RAWcooked to convert 3 petabytes of 2K DPX sequence data to FFV1 Matroska for our *Unlocking Film Heritage* project. This lossless compression to FFV1 has saved us an estimated 1600TB of storage space, which has saved thousands of pounds of additional magnetic storage tape purchases. Undoubtedly this software offers amazing financial incentives with all the benefits of open standards and open-source tools. It also creates a viewable video file of an otherwise invisible DPX scan, so useful for viewing the unseen technology of film. 
-Today, our workflow runs 24/7 performing automated encoding of business-as-usual DPX sequences with relatively little overview. There is a need for manual intervention when repeated errors are encountered. This is usually indicated when an image sequences doesn't make it to our Digital Preservation Infrastructure. Most often this is caused by a new image sequence 'flavour' that we do not have covered by our RAWcooked licence, or sometimes it can indicate a problem with either RAWcooked or FFmpeg while encoding a specific DPX scan. There can be many differences found in DPX metadata depending on the scanning technology. Where errors are found by our automations these are reported to an error log named after the image seqeuence. +Today, our workflow runs 24/7 performing automated encoding of business-as-usual DPX sequences with relatively little overview. There is a need for manual intervention when repeated errors are encountered. This is usually indicated in error logs or when an image sequence doesn't make it to our Digital Preservation Infrastructure. Most often this is caused by a new image sequence 'flavour' that we do not have covered by our RAWcooked licence, or sometimes it can indicate a problem with either RAWcooked or FFmpeg while encoding a specific DPX scan. There can be many differences found in DPX metadata depending on the scanning technology used. Where errors are found by our automations these are reported to an error log named after the image sequence. -When we solely encoded 2K sequences we found we could run multiple parallel processes with good efficiency, seeing as many as 32 concurrent encodings running at once. This was before we implemented the ```--all``` command which calculates checksums adding them to the reversibility data and runs a checksum comparison of the Matroska after encoding has completed which expands the encoding process. 
When introducing this command we reduced our concurrency, particularly as our workflow introduced a final ```--check``` pass against the Matroska file that automated the deletion of the DPX sequence, when successful. We generally set 6 to 8 concurrent encodings on our busier QNAP storage, and 2 concurrent encodings on other storage. +Our 2K workflows could run multiple parallel processes with good efficiency, seeing as many as 32 concurrent encodings running at once against a single storage device. This was before we implemented the ```--all``` command which calculates checksums, adds them to the reversibility data and runs a checksum comparison of the Matroska after encoding has completed, which extends the encoding process. When introducing this command we reduced our concurrency, particularly as our workflow introduced a final ```--check``` pass against the Matroska file that automated the deletion of the DPX sequence, when successful. We also expanded our storage devices for RAWcooking and currently have 8 storage devices (a mix of Isilon, QNAPs and G-Rack NAS) generally set for between 2 and 8 concurrent encodings with the aim of not exceeding 32. -In recent years we have seen a shift from majority 2K DPX to majority 4K DPX with mostly 12- or 16-bit depths. Very recently we have found ```--all``` encoding and parallel ```--check``` efficiency increases when running just two parallel encodings at any given time. Below are some recent 4K DPX encoding times using RAWcooked's ```--all``` command with a maximum of two parallel encodings, and where we can assume another single ```--check``` run is underway from the server: +In recent years we have seen a shift from majority 2K DPX to majority 4K DPX with mostly 12- or 16-bit depths. To maintain the speed of DPX throughput it is better to limit our parallel encodings to two DPX sequences per storage device at any given time. 
Below are some recent 4K DPX encoding times using RAWcooked's ```--all``` command with a maximum of just two parallel encodings per server targeting a single QNAP storage, and where we can assume a single ```--check``` run is underway also: * Parallel 4K RGB 16-bit DPX (699.4 GB) - MKV duration 5:10 (639.8 GB) - encoding time 5:17:00 - MKV 8.5% smaller than DPX * Parallel 4K RGB 16-bit DPX (723.1 GB) - MKV duration 5:20 (648.9 GB) - encoding time 5:40:07 - MKV 10.25% smaller than DPX @@ -232,24 +232,24 @@ In recent years we have seen a shift from majority 2K DPX to majority 4K DPX wit * Parallel 4K RGB 12-bit DPX (887.3 GB) - MKV duration 10:54 (208.7 GB) - encoding time 5:02:00 - MKV 76.5% smaller than DPX ** ** Where the MKV is significantly smaller than the DPX then a black and white filter will have been applied to an RGB scan, as in these cases. -A separate 2K solo and parallel encoding test revealed much quicker encoding times for >10 minute sequences, again using the ```--all``` command and where we can assume another single ```--check``` run is also working in parallel: +A separate 2K solo and parallel encoding test revealed much quicker encoding times for >10 minute sequences, again using the ```--all``` command against a single QNAP storage, and where we can assume another single ```--check``` run is also working in parallel: * Solo 2K RGB 12-bit DPX (341 GB) - MKV duration 16:16 - encoding time 1:20:00 - MKV 22.5% smaller than DPX * Solo 2K RGB 16-bit DPX (126 GB) - MKV duration 11:42 - encoding time 1:02:00 - MKV was 30.6% smaller than the DPX * Parallel 2K RGB 16-bit DPX (367 GB) - MKV duration 11:34 - encoding time 2:40:00 - MKV was 27.6% smaller than the DPX * Parallel 2K RGB 16-bit DPX (325 GB) - MKV duration 10:15 - encoding time 2:21:00 - MKV was 24.4% smaller than the DPX -It provides us with great reassurance to implement the ```--all``` command and we remain highly satisfied with RAWcooked encoding of DPX sequences despite the reduction in our 
concurrent encodings. The embedded DPX hashes which ```all``` includes are critical for long-term preservation of the digitised film. In addition there are checksums embedded in the slices of every video frame (upto 576 per frame so 576 checksums per video frame) allowing granular analysis of any problems found with digital FFV1 preservation files, should they arise. This is thanks to the FFV1 codec, and it allows us to pinpoint exactly where digital damage may have ocurred. This means we can easily replace the impacted DPX files with duplicates from our duplicate preservation copies. Open-source RAWcooked, FFV1 and Matroska allow open access to their source code which means reduced likelihood of obsolescence long into the future. Finally, we plan to begin testing RAWcooked encoding of TIFF image sequences with the intention of encoding DCDM image sequences to FFV1 also. +It provides us with great reassurance to implement the ```--all``` command and we remain highly satisfied with RAWcooked encoding of DPX sequences despite the reduction in our concurrent encodings. The embedded DPX hashes which ```--all``` includes are critical for long-term preservation of the digitised film. In addition there are checksums embedded in the slices of every video frame (up to 576 checksums *per* video frame) allowing granular analysis of any problems found with digital FFV1 preservation files, should they arise. This is thanks to the FFV1 codec, and it allows us to pinpoint exactly where digital damage may have occurred. This means we can easily replace the impacted DPX files using our duplicate preservation copies. Open-source RAWcooked, FFV1 and Matroska allow open access to their source code which means reduced likelihood of obsolescence long into the future. Finally, we plan to begin testing RAWcooked encoding of TIFF image sequences with the intention of encoding DCDM image sequences to FFV1 also. 
### Useful test approaches -When any system upgrades occur we like to run reversibility test to ensure RAWcooked is still operating as we would expect. This is usually in response to RAWcooked software updates, FFmpeg updates, but also for updates to our operating system. To perform a reversibility test, a cross-section of image sequences are encoded using our usual ```--all``` command, and then decoded again fully. The image sequences of both the original and decoded version then have whole file MD5 checksums generated for every and saved into one manifest for the source and one for the decoded version. These manifests are then ```diff``` checked to ensure that every single image file is identical. +When any system upgrades occur we like to run a reversibility test to ensure RAWcooked is still operating as we would expect. This is usually in response to RAWcooked software updates, FFmpeg updates, but also for updates to our operating system. To perform a reversibility test, a cross-section of image sequences is encoded using our usual ```--all``` command, and then decoded again fully. The image sequences of both the original and decoded version then have whole file MD5 checksums generated for every DPX which are written into a manifest for the source DPX and a manifest for the decoded DPX. These manifests are then ```diff``` checked to ensure that every single image file has identical checksums. -To have confidence in the --check feature, which confirms for us a DPX sequence can be deleted, we ran several --check command tests that included editing test FFV1 Matroska metadata using hex editor software, and altering test DPX files in the same way while partially encoded. The encoding/check features always identified these data breakages correctly which helped build our confidence in the --all and --check flags. 
+To have confidence in the ```--check``` feature, which confirms for us a DPX sequence can be deleted, we ran several ```--check``` command tests that included editing test FFV1 Matroska metadata using hex editor software, and altering test DPX files in the same way during the encoding run. The encoding/check features always identified these data breakages correctly which helped build our confidence in the ```--all``` and ```--check``` flags. When we encounter an error there are a few commands used that make reporting the issue a little easier at the [Media Area RAWcooked GitHub issue tracker](https://github.com/MediaArea/RAWcooked/issues). ``` -rawcooked -d -y -all --accept-gaps +rawcooked -d -y --all --no-accept-gaps ``` Adding the ```-d``` flag doesn't run the encoding but returns the command that would be sent to FFmpeg. This flag also leaves the reversibility data available as a text file and this is useful for sending to Media Area to help with finding errors. ``` @@ -261,15 +261,15 @@ echo $? ``` This command should be run directly after a failed RAWcooked encode, and it will tell you the exit code returned from that terminated run. -The results of these three enquiries is always a brilliant way to open an Issue enquiry for Media Area and will help ensure swift diagnose for your problem. It may also be necessary to supply a DPX sequence, and your ```head``` command can be used again to extract the header data. +The results of these three enquiries are always a great help when opening an Issue enquiry for Media Area, aiding diagnosis of your problem. It may also be necessary to supply a DPX image, and your ```head``` command can be used again to extract the header data. ## Additional resources -* [BFI National Archive DPX Preservation Workflows](https://digitensions.home.blog/2019/11/08/dpx-preservation-workflow/) * [Media Area's RAWcooked GitHub page](https://github.com/MediaArea/RAWcooked) * ['No Time To Wait! 
5' presentation by Joanna White about the BFI's evolving RAWcooked use](https://www.youtube.com/watch?v=Mgo_DKHJEfI) * [BFI National Archive RAWcooked cheat sheet for optimization](https://github.com/bfidatadigipres/dpx_encoding/blob/main/RAWcooked_Cheat_Sheet.pdf) +* [BFI National Archive DPX Preservation Workflows](https://digitensions.home.blog/2019/11/08/dpx-preservation-workflow/) * [Further conference presentations about BFI National Archive use of RAWcooked, by Joanna White](https://youtu.be/4cG5RL_CZqg?si=w-iEICSfXqBco5NB) * [RAWCooking With Gas: A Film Digitization and QC Workflow-in-progress by Genevieve Havemeyer-King](https://youtu.be/-cJxq7Vr3Nk?si=BjPWzsZ7LRKMVZNF) * [Introduction to FFV1 and Matroska for film scans by Kieran O’Leary](https://kieranjol.wordpress.com/2016/10/07/introduction-to-ffv1-and-matroska-for-film-scans/)