How to save images with FlateDecode? #337

BartMiki · 2025-01-17T15:40:32Z


Hello! I'm working on an application to clean up scanned PDF files. The idea is to remove shadows in the background and straighten the text so that the PDFs can be fed into OCR software or printed. For testing, I loaded a PDF document and saved it as a new file. However, the saved files are much larger than the ones that were read (input ~5MB, output ~18MB). I uploaded the generated file to an online PDF optimizer and inspected the result. It is around 8MB, so much better than my 18MB output. The difference is that the images have FlateDecode in their filters:

pdf_image.get_filters()=['FlateDecode', 'DCTDecode']

Is there a way to force the use of FlateDecode for PDF images in pypdfium2?

mara004 · 2025-01-17T15:54:28Z

Maintainer

So basically, you're extracting images from an existing PDF, processing them, and then putting them into a new PDF?
Can you mention the pypdfium2 APIs that you are using to add the images? What were the filters in the pypdfium2 output - just DCTDecode? Was the PDF optimizer lossless, or did it merely re-encode the image with different settings?

The thing is, adding flate compression on top of DCT usually results in only marginal size improvements, so it sounds as if the optimizer might have been lossy.
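To illustrate why flate on top of DCT tends to gain little, here is a minimal sketch using only the standard library. It does not touch a real JPEG: random bytes stand in for entropy-coded DCT output (an assumption, since real JPEG streams carry some headers and redundancy), and a run-heavy byte pattern stands in for a raw thresholded bitmap. The name `flate_ratio` is made up for this example.

```python
import os
import zlib

# Assumption: entropy-coded JPEG data is close to random,
# so random bytes are used as a rough stand-in for DCT output.
jpeg_like = os.urandom(200_000)

# Stand-in for a raw thresholded 8-bit bitmap: long runs of white and black.
raw_like = (b"\xff" * 900 + b"\x00" * 100) * 200


def flate_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


jpeg_ratio = flate_ratio(jpeg_like)
raw_ratio = flate_ratio(raw_like)

print(f"already-compressed stand-in: {jpeg_ratio:.3f}")
print(f"raw bitmap stand-in:         {raw_ratio:.3f}")
```

The first ratio stays at roughly 1.0, while the second drops to a small fraction, which is consistent with a large size reduction coming from something other than the added flate layer.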

2 replies

BartMiki Jan 23, 2025
Author

I dug deeper into the image that was compressed online, and it looks like it was saved with 1 bit per pixel instead of 8 bits per pixel. That is fine, as I use OpenCV to threshold the image anyway. It also explains why the compression achieved by the online tool is so high.
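For a sense of scale, the difference between 8 and 1 bit per pixel can be worked out before any compression is applied. The sketch below assumes a hypothetical A4 scan at 300 dpi (2480 x 3508 pixels) and rows padded to whole bytes; `raw_image_bytes` and `pack_row_1bpp` are illustrative helpers, not part of pypdfium2 or OpenCV.

```python
def raw_image_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes, with each row padded to a whole byte."""
    row_bytes = (width * bits_per_pixel + 7) // 8
    return row_bytes * height


def pack_row_1bpp(row: bytes) -> bytes:
    """Pack one row of thresholded 8-bit samples into 1 bit per pixel.

    Samples >= 128 become 1 bits; the most significant bit is the leftmost pixel.
    """
    packed = bytearray((len(row) + 7) // 8)
    for x, sample in enumerate(row):
        if sample >= 128:
            packed[x // 8] |= 0x80 >> (x % 8)
    return bytes(packed)


# Hypothetical page size: A4 at 300 dpi.
gray_size = raw_image_bytes(2480, 3508, 8)
mono_size = raw_image_bytes(2480, 3508, 1)

print(gray_size, mono_size, gray_size / mono_size)
```

Under these assumptions the raw data shrinks by a factor of 8 (about 8.7 MB down to about 1.1 MB per page) before any filter is applied.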

Nevertheless, I still wonder whether it is possible to apply filters to images in an output PDF. I'm new to PDF editing, so maybe this is not a relevant question, but what are good practices in pypdfium2 for keeping the PDF as small as possible while retaining the original information intact?

Thankfully, I don't need PDF output. I'm now using TIFF to pack multiple images into a single file, and I get a much smaller output size.
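The thread does not say which library writes the TIFF. As one possible approach, here is a minimal sketch that assumes Pillow with its bundled libtiff support; the blank pages, page size, and file name are placeholders for real thresholded scans.

```python
import os
import tempfile

from PIL import Image

# Placeholder pages: mode "1" stores 1 bit per pixel, as a thresholded scan would.
pages = [Image.new("1", (2480, 3508), color=1) for _ in range(3)]

out_path = os.path.join(tempfile.mkdtemp(), "scans.tiff")

# save_all + append_images writes every page into one multi-page TIFF.
# "group4" (CCITT Group 4) is a lossless codec for bilevel images.
pages[0].save(
    out_path,
    save_all=True,
    append_images=pages[1:],
    compression="group4",
)

with Image.open(out_path) as tiff:
    frame_count = tiff.n_frames
    first_mode = tiff.mode

print(frame_count, first_mode, os.path.getsize(out_path))
```

With OpenCV output, each thresholded array could be converted first with `Image.fromarray(array).convert("1")`.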

mara004 Jan 23, 2025
Maintainer

Unfortunately pdfium's public API is rather limited when it comes to images.

It works fine for JPEG, but otherwise it only provides the FPDF_BITMAP entrypoint, which does not support binary or CMYK images, or images with higher bit-depth. If you're using PdfBitmap.from_pil() and PdfImage.set_bitmap(), these will be transcoded to grayscale, RGB, or 8-bit respectively.¹ Also you can't choose the encoding (IIRC pdfium will just flate compress the bitmap data).

The wrappers are just there to expose what pdfium can do, but again, I agree they're limited, so you may be better off with img2pdf or similar, especially when working with binary images (which seems to be your use case).

¹ The docs mention: "Due to the restricted number of color formats and bit depths supported by PDFium's bitmap implementation, this may be a lossy operation."
