Hi everyone,
I’ve encountered an issue where some PDF files are not correctly interpreted by the built-in pipeline. Specifically, the title, abstract, and most of the body text are missed entirely, although part of the supplementary text and some template text are picked up. This suggests that upload and file handling work as expected and that the problem lies in the text extraction step.
Unfortunately, my current sample files are behind a paywall, so I cannot share them directly. However, I will try to find some public-domain samples to illustrate the issue more clearly.
To work around this, I’m looking for recommendations on candidate tools or libraries that can pre-process these PDF files. The goal is to ensure that the text is correctly extracted and formatted for later processing by LLMs. Features like layout preservation and graphics are not necessary for the current use case.
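To make "formatted for later processing" a bit more concrete, here is a minimal sketch of the kind of cleanup step I have in mind once some tool has produced raw text. It is only an illustration under assumptions: the function name `normalize_extracted_text` is hypothetical, it uses nothing beyond the Python standard library, and it does not perform the PDF extraction itself.

```python
import re


def normalize_extracted_text(raw: str) -> str:
    """Turn raw extracted text into plain paragraphs (hypothetical helper)."""
    # Normalize line endings so the rules below only deal with "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split by a hyphen at a line break ("extrac-\ntion").
    # Caveat: this also merges genuine compounds that happen to break
    # at the hyphen, e.g. "well-\nknown" becomes "wellknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for paragraph in paragraphs:
        # Collapse line wraps and repeated spaces inside a paragraph.
        flat = " ".join(paragraph.split())
        if flat:
            cleaned.append(flat)

    return "\n\n".join(cleaned)
```

Layout is deliberately discarded here, since only the running text matters for this use case.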
Any suggestions or insights would be greatly appreciated! Thanks in advance for your help.