Different validation results with GreenField and PDFBox parser #1253

Open
tknall opened this issue May 23, 2022 · 5 comments

Comments

@tknall

tknall commented May 23, 2022

Validating a PDF/A-2b compliant PDF document with embedded CID TrueType font subset leads to different results, depending on the underlying parser engine:

While GreenField approves PDF/A-2b compliance (as do other validators like callas pdfaPilot / Adobe Acrobat preflight), the PDFBox instance fails validation with this error message: "A CID Font subset does not define CIDSet entry in its Descriptor dictionary"

When inspecting the demo file Hello_World_PDFA-2b.pdf we cannot reproduce the issue since the allegedly missing CIDSet entry is present:

Hello_World_PDFA-2b-structure

P.S. Fun fact: the demo file has been created using PDFBox (2.0.25)

Which one is right? PDFBox or Greenfield?

@bdoubrov bdoubrov added this to the 1.22 milestone Jun 3, 2022
@bdoubrov
Contributor

bdoubrov commented Jun 8, 2022

The issue lies in the difference between the internal font engines of veraPDF greenfield and PDFBox. In more detail, greenfield assumes that the glyph with GID=0 is always implicitly present in CID-based fonts, while PDFBox assumes that no such glyph exists.

To fix this issue we need to patch the internals of PDFBox. The behavior of veraPDF greenfield is correct.
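To illustrate how the GID=0 assumption alone can flip a CIDSet check, here is a minimal sketch. It is not the veraPDF or PDFBox code: the function names, the `gid0_implicit` flag and the sample data are hypothetical. Only the bit layout of the CIDSet stream (one bit per CID, high-order bit first) is taken from ISO 32000-1.

```python
def cids_in_cidset(cidset: bytes) -> set[int]:
    """Decode a CIDSet stream: the bit for CID n sits in byte n // 8, high-order bit first."""
    return {
        byte_index * 8 + bit
        for byte_index, byte in enumerate(cidset)
        for bit in range(8)
        if byte & (0x80 >> bit)
    }


def cidset_matches(cidset: bytes, glyphs_in_font: set[int], gid0_implicit: bool) -> bool:
    """Compare the CIDSet with the glyphs a font engine believes are in the font.

    gid0_implicit is a hypothetical switch modelling the two assumptions
    described above: glyph 0 always present, or glyph 0 never present.
    """
    present = set(glyphs_in_font)
    if gid0_implicit:
        present.add(0)
    else:
        present.discard(0)
    return cids_in_cidset(cidset) == present


# Hypothetical subset: the CIDSet marks CIDs 0, 1 and 2 (bits 1110 0000).
sample_cidset = bytes([0xE0])
sample_glyphs = {1, 2}

same_with_gid0 = cidset_matches(sample_cidset, sample_glyphs, gid0_implicit=True)
same_without_gid0 = cidset_matches(sample_cidset, sample_glyphs, gid0_implicit=False)
```

Under these assumptions the same file passes when glyph 0 is treated as implicitly present and fails when it is treated as absent.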

@bdoubrov bdoubrov removed this from the 1.22 milestone Jul 1, 2022
@bwegge

bwegge commented Feb 7, 2023

I have observed a similar difference between the two parsers when verifying the attached pdf file.
pdf-a-unicode.pdf

With greenfield, veraPDF complains about missing glyph-to-Unicode mappings (for the lower case \mu, which should be mapped to U+1D707 with recent newpx font packages), whereas the PDFBox version confirms compliance with PDF/A-2u. I am a bit unsure which version to trust (who verifies the PDF verifier?), but I hope the PDFBox result is the correct one.

The source code for the attached pdf uses the newpx font which recently included unicode mappings:

\documentclass{scrartcl}
\usepackage{newpxtext,newpxmath}
\usepackage[a-2u]{pdfx}
\begin{document}
Some greek characters seem to miss unicode mappings: $\mu$  % <- verification fails; comment out to succeed
Works for others: $\Sigma$
\end{document}


@bdoubrov
Contributor

Hi @bwegge Thanks for reporting this issue. It is not a simple one, and this is why we have a difference between PDFBox and greenfield.

In short, there is a syntax error in the Unicode mapping of the newpx font. The PDFBox and greenfield parsers treat this error differently, which resulted in an extra validation error in the latter case.

In more detail, the Unicode mapping in PDF fonts is defined via the so-called /ToUnicode entry, which defines how character codes from the PDF page description are mapped to Unicode. Here is the problematic ToUnicode map:
ToUnicode.txt

In particular, this line:

<0a> <1a> <d835def9>

which says that the byte codes in the range from 0A to 1A in the PDF page content have to be mapped to consecutive Unicode characters, starting at the UTF-16 value "D835DEF9" and incrementing its last byte for each following code. This syntax is described in the PDF 1.7 spec (ISO 32000-1, clause 9.10.3). However, there is an additional format requirement that says:

When defining ranges of this type, the value of the last byte in the string shall be less than or equal to 255 − (srcCode2 − srcCode1). This ensures that the last byte of the string shall not be incremented past 255; otherwise, the result of mapping is undefined.

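This requirement is clearly violated here: the range spans 1A − 0A = 16 codes beyond the first, so the last byte may be at most 255 − 16 = 239 (EF), but it is F9 (249). The Unicode mapping thus becomes undefined.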
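The check can be reproduced with a few lines of Python. This is only a sketch of the rule quoted above, not the veraPDF implementation, and both function names are made up for illustration:

```python
def bfrange_last_byte_ok(src_lo: str, src_hi: str, dst: str) -> bool:
    """Check the last-byte rule of ISO 32000-1, clause 9.10.3, for one range entry.

    All arguments are hex strings without the angle brackets.
    """
    span = int(src_hi, 16) - int(src_lo, 16)
    last_byte = bytes.fromhex(dst)[-1]
    return last_byte <= 255 - span


def utf16be_hex_to_codepoint(dst: str) -> int:
    """Decode a UTF-16BE hex string that holds exactly one character."""
    return ord(bytes.fromhex(dst).decode("utf-16-be"))


# The entry from the problematic ToUnicode map: <0a> <1a> <d835def9>
entry_ok = bfrange_last_byte_ok("0a", "1a", "d835def9")
first_target = utf16be_hex_to_codepoint("d835def9")
print(entry_ok, f"U+{first_target:X}")
```

The range starts at U+1D6F9, and the mapping for \mu (U+1D707, UTF-16 D835 DF07) would need the last byte to go past FF, which is exactly the case the rule forbids.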

We have adjusted the Greenfield parser implementation so that it:

  • reports this error in the ToUnicode mapping
  • otherwise behaves the same way as the PDFBox parser.

As far as we can see, this is also how Adobe Acrobat handles this particular format error. So, as of the latest dev build of veraPDF, both PDFBox and Greenfield will report that there are no PDF/A issues found in your document. But Greenfield will additionally report the above error in the embedded Unicode mapping. This error is shown as a log message in the console and can optionally be included in the validation report.

@bwegge

bwegge commented Feb 10, 2023

Hi Boris, thanks a lot for your reply and for looking into the issue. Could the actual problem also be caused by the pdflatex compiler (or whatever tool assembles the CMap), in case it (sometimes unwittingly) merges adjacent codes into ranges? Since the newpx font package in /usr/share/texlive/texmf-dist/fonts/type1/public/newpx/NewPXMI_gnu.pfb specifies the mappings individually on separate lines, it seems to be correct on their part (i.e., not causing a byte overflow):

dup 23/u1D706 put
dup 24/u1D707 put
dup 25/u1D708 put

(I am no expert and have no idea if the mappings in the produced PDF are actually taken from this file or another; it just seemed likely to pick the one in the type1 folder since I use the T1 option for inputenc.)
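As a rough illustration, assuming these lines are encoding entries that assign the glyph names u1D706 to u1D708 to the codes 23 to 25, the individual mappings translate into UTF-16 values as follows. The helper name and the output format are my own choices, not something pdfTeX produces in this form:

```python
def glyph_name_to_utf16be_hex(name: str) -> str:
    """Turn a 'uXXXXX' glyph name into the UTF-16BE hex string of its code point."""
    if not name.startswith("u"):
        raise ValueError(f"not a 'u' + hex glyph name: {name!r}")
    code_point = int(name[1:], 16)
    return chr(code_point).encode("utf-16-be").hex()


# The three entries quoted above: character code -> glyph name.
encoding = {23: "u1D706", 24: "u1D707", 25: "u1D708"}

# One single-code mapping per line, which never triggers the range rule.
bfchar_lines = [
    f"<{code:02x}> <{glyph_name_to_utf16be_hex(name)}>"
    for code, name in sorted(encoding.items())
]
print("\n".join(bfchar_lines))
```

Written as single-code entries like this, the last byte is never incremented, so no overflow can occur.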

More specifically: In (do_)write_tounicode in
https://github.com/TeX-Live/texlive-source/blob/4f771e41a6c3799e9d16e44633c7fa95dc41f1bc/texk/web2c/pdftexdir/tounicode.c#L382 as well as
https://github.com/TeX-Live/texlive-source/blob/4f771e41a6c3799e9d16e44633c7fa95dc41f1bc/texk/web2c/luatexdir/font/tounicode.c#L394, it seems that ranges are built from adjacent Unicode code points, but I don't see any check for an overflow of the last Unicode byte. Is it possible that the issue comes from this merging of adjacent codes without a check for the additional format requirement?
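To make the question concrete, here is a sketch of a merge step that would respect the format requirement. It is not the code from tounicode.c, and the function name and data layout are hypothetical. The idea is to extend a range only while all bytes except the last stay identical, so the last byte can never be incremented past FF:

```python
def merge_into_ranges(mapping: dict[int, str]) -> list[tuple[int, int, str]]:
    """Group consecutive codes into (first_code, last_code, first_target_hex) ranges.

    mapping: character code -> UTF-16BE hex string of its Unicode target.
    A range is extended only if the new target differs from the range start
    in its last byte alone, by exactly the code offset.
    """
    ranges: list[tuple[int, int, str]] = []
    for code in sorted(mapping):
        target = bytes.fromhex(mapping[code])
        if ranges:
            first, last, start_hex = ranges[-1]
            start = bytes.fromhex(start_hex)
            extends = (
                code == last + 1
                and len(target) == len(start)
                and target[:-1] == start[:-1]  # a carry into earlier bytes ends the range
                and target[-1] == start[-1] + (code - first)
            )
            if extends:
                ranges[-1] = (first, code, start_hex)
                continue
        ranges.append((code, code, mapping[code]))
    return ranges


# Codes 0A..1A mapped to U+1D6F9 and the 16 code points that follow it.
mapping = {
    0x0A + i: chr(0x1D6F9 + i).encode("utf-16-be").hex()
    for i in range(17)
}
ranges = merge_into_ranges(mapping)
for first, last, start_hex in ranges:
    print(f"<{first:02x}> <{last:02x}> <{start_hex}>")
```

With this rule the 17 codes end up in two ranges, split where the last byte wraps from FF to 00, instead of one range that violates the requirement.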

@bdoubrov
Contributor

Support for the PDFBox version will stop after the next release, 1.28. We strongly recommend switching to the Greenfield version, which has continued long-term support.
