Issue with PDF Interpretation in Built-in Pipeline #11

Open
zhangxiaoxing opened this issue Oct 14, 2024 · 1 comment

Hi everyone,

I’ve run into an issue where the built-in pipeline does not interpret some PDF files correctly. Specifically, the title, abstract, and most of the body text are missing from the extracted output, although part of the supplementary text and some template text are picked up. This suggests that upload and file handling work as expected and that the problem lies in the text extraction step.

Unfortunately, my current sample files are behind a paywall, so I cannot share them directly. However, I will try to find some public-domain samples to illustrate the issue more clearly.

To work around this, I’m looking for recommendations for tools or libraries that can pre-process these PDF files, so that the text is extracted correctly and formatted for later processing by LLMs. Layout preservation and graphics are not needed for the current use case.
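
For illustration only, here is a minimal sketch of what the formatting half of such a pre-processing step might look like. It assumes some external extractor has already produced one raw string per page; the function name `normalize_extracted_text`, the per-page input format, and the `min_repeat_fraction` heuristic are all hypothetical and not part of the built-in pipeline. It uses only the Python standard library and does not address the extraction failure itself.

```python
import re
from collections import Counter


def normalize_extracted_text(pages, min_repeat_fraction=0.6):
    """Clean per-page raw text into one plain string for downstream LLM use.

    `pages` is a list of strings, one per PDF page (assumed input format).
    """
    # Count on how many pages each non-empty, stripped line appears.
    line_pages = Counter()
    for page in pages:
        line_pages.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    # Heuristic (assumption): a line seen on most pages is a running
    # header/footer. Skipped for very short documents, where it is unreliable.
    repeated = set()
    if len(pages) >= 3:
        threshold = max(2, int(len(pages) * min_repeat_fraction))
        repeated = {ln for ln, n in line_pages.items() if n >= threshold}

    cleaned_pages = []
    for page in pages:
        lines = [ln.strip() for ln in page.splitlines()]
        text = "\n".join(ln for ln in lines if ln not in repeated)
        # Rejoin words split across a line break ("extrac-\ntion").
        # Caveat: this also merges genuine compounds such as "well-\nknown".
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
        # Single newlines become spaces; blank lines stay as paragraph breaks.
        text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
        text = re.sub(r"[ \t]+", " ", text)
        text = re.sub(r"\n{2,}", "\n\n", text)
        cleaned_pages.append(text.strip())

    return "\n\n".join(p for p in cleaned_pages if p)
```

Because layout and graphics are not needed here, the sketch deliberately flattens everything to paragraphs of plain text. Whichever extraction library ends up being used would supply the `pages` list.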

Any suggestions or insights would be greatly appreciated! Thanks in advance for your help.
