Issue with PDF Interpretation in Built-in Pipeline #11

zhangxiaoxing · 2024-10-14T06:09:47Z

Hi everyone,

I’ve encountered an issue where the built-in pipeline does not interpret some PDF files correctly. Specifically, the title, abstract, and most of the body text are missed entirely, while part of the supplementary text and some template text are picked up. This suggests that upload and file handling work as expected and that the problem lies in the text extraction step.

Unfortunately, my current sample files are behind a paywall, so I cannot share them directly. However, I will try to find some public-domain samples to illustrate the issue more clearly.

To work around this, I’m looking for recommendations for tools or libraries that can pre-process these PDF files so that the text is correctly extracted and formatted for later processing by LLMs. Layout preservation and graphics are not needed for the current use case.
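For illustration, here is a minimal sketch of what such a pre-processing step could look like. It assumes the third-party `pypdf` package as one candidate extractor; this is only an assumption for the example, not something the pipeline is known to use, and it may well fail on the same files. The helper names (`extract_pdf_text`, `clean_for_llm`, `missing_phrases`) are hypothetical. The idea is to extract plain text, flatten layout artifacts, and then check whether phrases known to be in the paper (such as the title) actually survived extraction.

```python
import re


def extract_pdf_text(path: str) -> str:
    """Return the text of every page, joined by blank lines.

    Assumes the third-party pypdf package is installed (pip install pypdf).
    """
    from pypdf import PdfReader  # imported lazily so the helpers below work without it

    reader = PdfReader(path)
    # extract_text() may yield an empty string, e.g. for image-only pages
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)


def clean_for_llm(text: str) -> str:
    """Flatten layout artifacts, since layout is not needed for this use case."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated at line ends
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # turn single newlines into spaces
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


def missing_phrases(text: str, expected: list[str]) -> list[str]:
    """Return the expected phrases (e.g. the title) that do not appear in text."""
    haystack = text.lower()
    return [phrase for phrase in expected if phrase.lower() not in haystack]
```

A possible use would be `missing_phrases(clean_for_llm(extract_pdf_text("paper.pdf")), ["<paper title>"])`, where `paper.pdf` and the phrase are placeholders. A non-empty result would indicate that the extractor missed that text, in which case another extractor could be tried in place of `extract_pdf_text` while keeping the same cleanup and check.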

Any suggestions or insights would be greatly appreciated! Thanks in advance for your help.
