Different validation results with GreenField and PDFBox parser #1253

tknall · 2022-05-23T13:58:30Z

Validating a PDF/A-2b compliant PDF document with embedded CID TrueType font subset leads to different results, depending on the underlying parser engine:

veraPDF 1.21.159 (PDFBox): validationReport-veraPDF-1.21.159-PDFBox.xml
veraPDF 1.21.161 (GreenField): validationReport-veraPDF-1.21.161-GreenField.xml

While GreenField approves PDF/A-2b compliance (as do other validators like callas pdfaPilot / Adobe Acrobat preflight), the PDFBox instance fails validation with this error message: "A CID Font subset does not define CIDSet entry in its Descriptor dictionary"

When inspecting the demo file Hello_World_PDFA-2b.pdf, we cannot reproduce the issue, since the allegedly missing CIDSet entry is present.

P.S. Fun fact: the demo file has been created using PDFBox (2.0.25)

Which one is right? PDFBox or Greenfield?


bdoubrov · 2022-06-08T08:59:24Z

The issue comes from a difference between the internal font engines of veraPDF Greenfield and PDFBox. In more detail, Greenfield assumes that the glyph with GID=0 is always implicitly present in CID-based fonts, while PDFBox assumes that this glyph does not exist.

To fix this issue we need to patch the internals of PDFBox. The behavior of veraPDF greenfield is correct.
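
To illustrate how a single differing assumption about GID=0 can flip a validation result, here is a minimal Python sketch. It is not the code of veraPDF or PDFBox: the function names, the exact-match comparison, and the sample data are assumptions made only for illustration, and CID is taken to equal GID (identity CIDToGIDMap). The only part taken from the PDF specification is the CIDSet layout, a table of bits indexed by CID with the high-order bit of each byte first.

```python
def cids_in_cidset(cidset):
    """Return the set of CIDs whose bit is set in a CIDSet stream.

    Bit layout: the most significant bit of byte 0 is CID 0,
    the next bit is CID 1, and so on.
    """
    return {
        index * 8 + bit
        for index, byte in enumerate(cidset)
        for bit in range(8)
        if byte & (0x80 >> bit)
    }


def cidset_matches(cidset, glyphs_seen, gid0_implicit):
    """Hypothetical check: does the CIDSet list exactly the glyphs the engine sees?

    gid0_implicit models the assumption that glyph 0 is always present,
    even if the font engine did not report it explicitly.
    """
    present = set(glyphs_seen)
    if gid0_implicit:
        present.add(0)
    return cids_in_cidset(cidset) == present


# Made-up sample: CIDSet marks CIDs 0, 1 and 3; the engine reports glyphs 1 and 3.
sample_cidset = bytes([0b11010000])
sample_glyphs = [1, 3]

print(cidset_matches(sample_cidset, sample_glyphs, gid0_implicit=True))
print(cidset_matches(sample_cidset, sample_glyphs, gid0_implicit=False))
```

With the made-up sample, the check passes when glyph 0 is treated as implicitly present and fails when it is not, which mirrors the kind of disagreement described above.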

bwegge · 2023-02-07T03:00:25Z

I have observed a similar difference between the two parsers when verifying the attached pdf file.
pdf-a-unicode.pdf

With Greenfield, veraPDF complains about missing glyph-to-Unicode mappings (for the lowercase \mu, which should be mapped to U+1D707 with recent newpx font packages), whereas the PDFBox version confirms compliance with PDF/A-2u. I am a bit unsure which version to trust (who verifies the PDF verifier?), but I hope the PDFBox result is the correct one.

The source code for the attached PDF uses the newpx font, which recently added Unicode mappings:

\documentclass{scrartcl}
\usepackage{newpxtext,newpxmath}
\usepackage[a-2u]{pdfx}
\begin{document}
Some greek characters seem to miss unicode mappings: $\mu$  % <- verification fails; comment out to succeed
Works for others: $\Sigma$
\end{document}

bdoubrov · 2023-02-10T12:21:42Z

Hi @bwegge, thanks for reporting this issue. It is not a simple one, which is why PDFBox and Greenfield give different results.

In short, there is a syntax error in the Unicode mapping of the newpx font. The PDFBox and Greenfield parsers treat this error differently, which resulted in an extra validation error in the latter case.

In more detail, the Unicode mapping in PDF fonts is defined via the so-called /ToUnicode entry, which specifies how character codes from the PDF page content are mapped to Unicode. Here is the problematic ToUnicode map:
ToUnicode.txt

In particular, this line:

<0a> <1a> <d835def9>

which says that the single-byte character codes in the range from 0A to 1A in the PDF page content have to be mapped to consecutive Unicode characters, starting at the UTF-16 sequence "D835DEF9" (U+1D6F9). This syntax is described in the PDF 1.7 spec (ISO 32000-1, clause 9.10.3). However, there is an additional format requirement that says:

When defining ranges of this type, the value of the last byte in the string shall be less than or equal to 255 − (srcCode2 − srcCode1). This ensures that the last byte of the string shall not be incremented past 255; otherwise, the result of mapping is undefined.

This requirement is clearly violated here, and thus the unicode mapping becomes undefined.
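
The arithmetic behind this can be checked with a short Python sketch. The function names are made up for illustration and this is not the veraPDF implementation. It only applies the quoted rule to the line above.

```python
def check_bfrange(src_lo_hex, src_hi_hex, dst_hex):
    """Apply the rule: last byte of dst <= 255 - (srcCode2 - srcCode1).

    Returns (is_valid, limit, last_byte).
    """
    src_lo = int(src_lo_hex, 16)
    src_hi = int(src_hi_hex, 16)
    last_byte = bytes.fromhex(dst_hex)[-1]
    limit = 255 - (src_hi - src_lo)
    return last_byte <= limit, limit, last_byte


def first_code_point(dst_hex):
    """Decode the UTF-16BE destination string and return its first code point."""
    return ord(bytes.fromhex(dst_hex).decode("utf-16-be")[0])


valid, limit, last_byte = check_bfrange("0a", "1a", "d835def9")
print(valid, limit, last_byte)  # False 239 249

# The range starts at U+1D6F9; the last byte F9 can be incremented 6 times
# before passing FF, so source code 0x11 is the first one that overflows.
print(f"U+{first_code_point('d835def9'):04X}")  # U+1D6F9
first_overflowing_code = 0x0A + (255 - last_byte) + 1
print(hex(first_overflowing_code))  # 0x11
```

The last byte is 249 (F9) while the limit is 255 − 16 = 239, so the rule is violated. Source code 24 (0x18), which the font maps to u1D707, lies past the first overflowing code 0x11 and therefore in the part of the range where the mapping is undefined.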

We have adjusted the Greenfield parser implementation so that it:

- reports this error in the ToUnicode mapping
- has the same further behavior as the PDFBox one.

As far as we can see, this is also how Adobe Acrobat handles this particular format error. So, as of the latest dev build of veraPDF, both PDFBox and Greenfield will report that no PDF/A issues were found in your document, but Greenfield will additionally report the above error in the embedded Unicode mapping. This error is shown as a log message in the console and can optionally be included in the validation report.

bwegge · 2023-02-10T14:54:45Z

Hi Boris, thanks a lot for your reply and for looking into the issue. Could the actual problem also be caused by the pdflatex compiler (or whatever tool assembles the CMap), in case it (sometimes inadvertently) merges adjacent codes into ranges? Since the newpx font package in /usr/share/texlive/texmf-dist/fonts/type1/public/newpx/NewPXMI_gnu.pfb specifies the mappings individually on separate lines, it seems to be correct on their part (i.e., not causing a byte overflow):

dup 23/u1D706 put
dup 24/u1D707 put
dup 25/u1D708 put

(I am no expert and have no idea whether the mappings in the produced PDF are actually taken from this file or another; it just seemed likely to pick the one in the type1 folder, since I use the T1 option for inputenc.)

More specifically: In (do_)write_tounicode in
https://github.com/TeX-Live/texlive-source/blob/4f771e41a6c3799e9d16e44633c7fa95dc41f1bc/texk/web2c/pdftexdir/tounicode.c#L382 as well as
https://github.com/TeX-Live/texlive-source/blob/4f771e41a6c3799e9d16e44633c7fa95dc41f1bc/texk/web2c/luatexdir/font/tounicode.c#L394, it seems that ranges are identified with adjacent unicode codes, but I don't see any check for an overflow of the last unicode byte. Is it possible that the issue comes from this merging of adjacent codes without the check for the additional format requirement?
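
For illustration, a range-merging step that respects the additional format requirement could look like the following Python sketch. This is not the TeX Live code. The function names and the input format (a dict from source code to Unicode code point) are assumptions. It merges adjacent codes into ranges but starts a new range whenever the last byte of the UTF-16BE destination would be incremented past 255.

```python
def utf16be_hex(code_point):
    """Hex string of the UTF-16BE encoding of one code point."""
    return chr(code_point).encode("utf-16-be").hex()


def build_bfranges(mapping):
    """Merge {source_code: code_point} into (lo, hi, dst_hex) ranges.

    A code is appended to the current range only if it is adjacent in both
    source code and code point AND the last destination byte stays <= 255.
    """
    ranges = []  # entries are (lo, hi, start_code_point)
    for code in sorted(mapping):
        cp = mapping[code]
        if ranges:
            lo, hi, start_cp = ranges[-1]
            offset = code - lo
            start_last_byte = bytes.fromhex(utf16be_hex(start_cp))[-1]
            adjacent = code == hi + 1 and cp == start_cp + offset
            fits = start_last_byte + offset <= 255
            if adjacent and fits:
                ranges[-1] = (lo, code, start_cp)
                continue
        ranges.append((code, code, cp))
    return [(lo, hi, utf16be_hex(cp)) for lo, hi, cp in ranges]


# Same mapping as the problematic line: codes 0x0A..0x1A -> U+1D6F9 onwards.
mapping = {code: 0x1D6F9 + (code - 0x0A) for code in range(0x0A, 0x1B)}
for lo, hi, dst in build_bfranges(mapping):
    print(f"<{lo:02x}> <{hi:02x}> <{dst}>")
# <0a> <10> <d835def9>
# <11> <1a> <d835df00>
```

With this split, each range satisfies the rule: the first ends at last byte FF, and the second starts again at last byte 00.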

bdoubrov · 2024-05-24T10:09:00Z

Support for the PDFBox version will stop after the next release, 1.28. It is strongly recommended to switch to the Greenfield version, which has continued long-term support.

bdoubrov assigned MaximPlusov May 27, 2022

bdoubrov added this to the 1.22 milestone Jun 3, 2022

bdoubrov removed this from the 1.22 milestone Jul 1, 2022

MaximPlusov mentioned this issue Feb 10, 2023

Fixing toUnicode interval parsing veraPDF/veraPDF-parser#551

Merged

MaximPlusov mentioned this issue Apr 13, 2023

PdfA3a Rule 6.2.11.7-1 ignores Chinese unicodes. #1321

Closed

MaximPlusov removed their assignment May 22, 2024
