Different validation results with GreenField and PDFBox parser #1253
Comments
The issue lies in the difference between the internal font engines of veraPDF Greenfield and PDFBox. In more detail, Greenfield assumes that the glyph with GID=0 is always implicitly present in CID-based fonts, while PDFBox assumes that such a glyph does not exist. To fix this issue we need to patch the internals of PDFBox. The behavior of veraPDF Greenfield is correct.
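To illustrate how such an assumption can flip a CIDSet check, here is a toy sketch in Python. It is not the code of veraPDF or PDFBox; the function names and the sample data are hypothetical, and it assumes an identity CID-to-GID mapping. The only fact taken from the PDF specification is the CIDSet bit layout (one bit per CID, high-order bit of the first byte is CID 0).

```python
def cids_in_cidset(data: bytes) -> set:
    """Decode a CIDSet stream: bit k (high-order bit first) marks CID k as present."""
    return {
        byte_index * 8 + bit
        for byte_index, byte in enumerate(data)
        for bit in range(8)
        if byte & (0x80 >> bit)
    }


def cidset_matches(cidset_data: bytes, glyphs_found, assume_gid0_present: bool) -> bool:
    """Compare the CIDSet against the glyphs a (hypothetical) parser found in the font.

    assume_gid0_present models the two policies described in the comment above:
    True  -> glyph 0 is treated as implicitly present,
    False -> glyph 0 counts only if the parser actually found it.
    """
    present = set(glyphs_found)
    if assume_gid0_present:
        present.add(0)
    return cids_in_cidset(cidset_data) == present


# Hypothetical subset: CIDSet byte 0xB0 = 1011 0000 marks CIDs 0, 2 and 3,
# while the parser only reports glyphs 2 and 3 from the font program.
sample_cidset = b"\xB0"
sample_glyphs = {2, 3}
```

With this sample data, `cidset_matches(sample_cidset, sample_glyphs, True)` agrees with the CIDSet, while the same call with `False` reports a mismatch, which is the kind of divergence described above.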
I have observed a similar difference between the two parsers when verifying the attached PDF file. With Greenfield, veraPDF complains about missing glyph-to-Unicode mappings (for the lowercase \mu, which should be mapped to U+1D707 with recent newpx font packages), whereas the PDFBox version confirms compliance with PDF/A-2u. I am a bit unsure which version to trust (who verifies the PDF verifier?), but I hope the PDFBox result is the correct one. The source code for the attached PDF uses the newpx font, which recently included Unicode mappings:
Hi @bwegge, thanks for reporting this issue. It is not a simple one, and this is why we have a difference between PDFBox and Greenfield. In short, there is a syntax error in the Unicode mapping of the newpx font. This error is treated differently by the PDFBox and Greenfield parsers, which resulted in an extra validation error in the latter case. In more detail, the Unicode mapping in PDF fonts is defined via the so-called /ToUnicode entry, which defines how character codes from the PDF page description are mapped to Unicode. Here is the problematic ToUnicode map: In particular, this line:
which says that the byte codes in the range from 0A to 1A in the PDF page content have to be mapped to consecutive Unicode characters starting at UTF-16 "D835DEF9". This syntax is described in the PDF 1.7 spec (ISO 32000-1, clause 9.10.3). However, there is an additional format requirement that says:
This requirement is clearly violated here, and thus the unicode mapping becomes undefined. We have adjusted the Greenfield parser implementation so that it:
As far as we can see, this is also how Adobe Acrobat handles this particular format error. So, as of the latest dev build of veraPDF, both PDFBox and Greenfield will report that there are no PDF/A issues found in your document. But Greenfield will additionally report the above error in the embedded Unicode mapping. This error is shown as a log message in the console and can also be optionally included in the validation report.
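The range problem discussed in this comment can be checked with a few lines of arithmetic. The sketch below is in Python and assumes that the requirement referred to above is the last-byte bound from ISO 32000-1, clause 9.10.3: the last byte of the destination string must be less than or equal to 255 - (srcCode2 - srcCode1), otherwise the mapping is undefined. All function names are hypothetical, and this is not how veraPDF or PDFBox implement the check.

```python
def bfrange_last_byte_ok(src_lo: int, src_hi: int, dst_hex: str) -> bool:
    """Assumed ISO 32000-1 rule: last destination byte <= 255 - (src_hi - src_lo)."""
    dst = bytes.fromhex(dst_hex)
    return dst[-1] <= 255 - (src_hi - src_lo)


def first_overflowing_code(src_lo: int, src_hi: int, dst_hex: str):
    """Return the first source code whose last destination byte would pass 255, or None."""
    last = bytes.fromhex(dst_hex)[-1]
    for code in range(src_lo, src_hi + 1):
        if last + (code - src_lo) > 255:
            return code
    return None


def lenient_target(src_lo: int, code: int, dst_hex: str) -> str:
    """A lenient reading that lets the increment carry into the previous byte.

    This is NOT defined by the spec; it only shows what a producer that merged
    adjacent codes into one range may have meant.
    """
    raw = bytes.fromhex(dst_hex)
    value = int.from_bytes(raw, "big") + (code - src_lo)
    return value.to_bytes(len(raw), "big").decode("utf-16-be")
```

For the range from the comment (codes 0A to 1A, destination "D835DEF9"), the last byte is 0xF9 = 249, while the bound is 255 - 16 = 239, so `bfrange_last_byte_ok(0x0A, 0x1A, "D835DEF9")` is false, and the first code that overflows is 0x11. Under the lenient reading, code 0x18 would come out as U+1D707, the code point the reporter expects for \mu; whether \mu actually sits at code 0x18 in this font is not confirmed in the thread.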
Hi Boris, thanks a lot for your reply and for looking into the issue. Could the actual problem also be caused by the pdflatex compiler (or whatever tool assembles the CMap) in case it (sometimes inadvertently) merges adjacent codes into ranges? Since the newpx font package in /usr/share/texlive/texmf-dist/fonts/type1/public/newpx/NewPXMI_gnu.pfb specifies the mappings individually on separate lines, it seems to be correct on their part (i.e., not causing a byte overflow):
(I am no expert and have no idea whether the mappings in the produced PDF are actually taken from this file or another; it just seemed likely to pick the one in the type1 folder since I use the T1 option for More specifically: In
Support for the PDFBox version will stop after the next release, 1.28. We strongly recommend switching to the Greenfield version, which has continued long-term support.
Validating a PDF/A-2b compliant PDF document with embedded CID TrueType font subset leads to different results, depending on the underlying parser engine:
While GreenField approves PDF/A-2b compliance (as do other validators like callas pdfaPilot / Adobe Acrobat preflight), the PDFBox instance fails validation with this error message: "A CID Font subset does not define CIDSet entry in its Descriptor dictionary"
When inspecting the demo file Hello_World_PDFA-2b.pdf we cannot reproduce the issue since the allegedly missing CIDSet entry is present:
P.S. Fun fact: the demo file has been created using PDFBox (2.0.25)
Which one is right? PDFBox or Greenfield?