Refactoring PDF loaders: 02 PyMuPDF #29063
base: master
Add file_path with PurePath
Add CloudBlobLoader in __init__
Replace Dict/List with dict/list
039819c to 3beda82
@eyurtsev I rebased the code onto master ;-)
Great, will take a look in the AM.
Left two major comments, a few stylistic comments, and some nits.
Let's tackle the two major comments:
- Define the standardized structure of metadata
- Create a dedicated ImageParser which is a blob parser
libs/community/langchain_community/document_loaders/parsers/pdf.py
@@ -78,6 +203,192 @@ def extract_from_images_with_rapidocr(
    return text


# Type to change the function to convert images to text.
CONVERT_IMAGE_TO_TEXT = Optional[Callable[[Iterable[np.ndarray]], Iterator[str]]]
MAJOR:
Why not use an ImageBlobParser with the regular Blob-to-Document interface? It would allow reusing the image logic for images that do not originate from PDFs (e.g., for a web crawler).
A PDF parser would then accept an image parser as part of its initializer:
class PDFParser(...):
    def __init__(self, ... *, ..., image_blob_parser: Optional[BlobParser] = None):
        pass
If the image_blob_parser is provided, then it'll be used for OCR purposes.
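The suggestion above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: `Blob`, `Document`, `FakeOcrImageParser`, `SketchPDFParser`, and `parse_page` are hypothetical stand-ins, and the real LangChain classes have richer interfaces.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional, Protocol


@dataclass
class Blob:
    """Hypothetical stand-in for a raw blob: bytes plus a source."""

    data: bytes
    source: Optional[str] = None


@dataclass
class Document:
    """Hypothetical stand-in for a parsed document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class BlobParser(Protocol):
    """Anything that turns a blob into documents."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]: ...


class FakeOcrImageParser:
    """Pretends to OCR an image by decoding its bytes as UTF-8 text."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(
            page_content=blob.data.decode("utf-8"),
            metadata={"source": blob.source},
        )


class SketchPDFParser:
    """PDF parser that delegates image handling to an optional blob parser."""

    def __init__(self, *, image_blob_parser: Optional[BlobParser] = None) -> None:
        self.image_blob_parser = image_blob_parser

    def parse_page(self, text: str, images: list[Blob]) -> str:
        # Without an image parser, images are ignored and only the text is kept.
        if self.image_blob_parser is None:
            return text
        parts = [text]
        for image in images:
            parts.extend(
                doc.page_content for doc in self.image_blob_parser.lazy_parse(image)
            )
        return "\n".join(parts)
```

Because the image parser only depends on the blob interface, the same object could be handed to a non-PDF loader (the web crawler case mentioned above) without changes.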
I've done it! There are now 3 ImageBlobParsers!
0d99673 to 3fe4ec5
4342991 to 760267b
9fc89e0 to d30b26d
6765dbf to df1d4d5
df1d4d5 to 91234f0
all_text = _merge_text_and_extras(extras, text_from_page)

if not all_text:
    # logger.warning(
commented out code
    )
except ImportError:
    raise ImportError(
        "`rapidocr-onnxruntime` package not found, please install it with "
nit: incorrect import error
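One way to keep the error message in sync with the import is to pass both names through a single helper. This is only a sketch of the idea; `import_or_hint` is a hypothetical helper that does not appear in the PR, and the package names in the usage note are placeholders.

```python
import importlib
from types import ModuleType


def import_or_hint(module_name: str, pip_name: str) -> ModuleType:
    """Import `module_name`, or raise an ImportError naming the package to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The hint is built from the same arguments as the import,
        # so it cannot drift to a different package name.
        raise ImportError(
            f"`{pip_name}` package not found, please install it with "
            f"`pip install {pip_name}`"
        ) from exc
```

A call such as `import_or_hint("some_module", "some-package")` then reports `some-package` in the message whenever `some_module` is missing.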
logger.debug("Image text: %s", content.replace("\n", "\\n"))
yield Document(
    page_content=content,
    metadata={"source": blob.source},
Propagate blob metadata as well?
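Propagating the metadata could look like the sketch below. `build_image_metadata` is a hypothetical helper, and letting `source` take precedence over a same-named key in the blob metadata is an assumption, not something decided in this thread.

```python
from typing import Any, Optional


def build_image_metadata(
    source: Optional[str], blob_metadata: dict[str, Any]
) -> dict[str, Any]:
    """Return a new dict with the blob's metadata plus its source.

    The input dict is copied, not mutated. If the blob metadata already has a
    "source" key, the explicit `source` argument wins (an assumption).
    """
    return {**blob_metadata, "source": source}
```

The result would then be passed as `metadata=` when yielding the `Document`.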
)


class MultimodalBlobParser(ImageBlobParser):
LLMImageBlobParser
or something like that?
Goal is to communicate that this is being done by a multi modal llm
def __init__(
    self,
    *,
    format: Literal["text", "markdown", "html"] = "text",
This format is fairly surprising to see as part of the API -- but I think I'm OK with it.
logger = logging.getLogger(__name__)


class ImageBlobParser(BaseBlobParser):
Could we rename to BaseImageBlobParser or mark as private so it's clear that it's abstract?
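To illustrate the rename, here is a minimal sketch of an explicitly abstract base using the standard `abc` module. The method names `parse` and `_analyze_image` and the `EchoImageParser` subclass are hypothetical and chosen for the example; the PR's actual class derives from BaseBlobParser and works on blobs.

```python
from abc import ABC, abstractmethod


class BaseImageBlobParser(ABC):
    """Abstract base: subclasses only implement the image-to-text step."""

    def parse(self, image_bytes: bytes) -> str:
        # Shared post-processing lives in the base class.
        return self._analyze_image(image_bytes).strip()

    @abstractmethod
    def _analyze_image(self, image_bytes: bytes) -> str:
        """Return the text extracted from one image."""


class EchoImageParser(BaseImageBlobParser):
    """Hypothetical concrete parser that just decodes the bytes as text."""

    def _analyze_image(self, image_bytes: bytes) -> str:
        return image_bytes.decode("utf-8")
```

With `@abstractmethod` in place, instantiating `BaseImageBlobParser` directly raises `TypeError`, so the `Base` prefix and the runtime behavior agree.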
return pytesseract.image_to_string(img, lang="+".join(self.langs)).strip()


_prompt_images_to_description = (
nit: style to follow google conventions a bit more closely
- _prompt_images_to_description = (
+ _PROMPT_IMAGES_TO_DESCRIPTION = (
@@ -0,0 +1,149 @@
import base64
I think this makes sense -- would you be willing to do a documentation pass for the API reference for this file? You could push the entire code through ChatGPT and ask for Google-style docstrings. It'll probably do a reasonable job.
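For reference, a Google-style docstring has `Args:`, `Returns:`, and `Raises:` sections. The function below is a hypothetical example written for this comment (it mirrors the `"+".join(self.langs)` call in the Tesseract parser) and is not part of the PR.

```python
def join_langs(langs: list[str]) -> str:
    """Join language codes into a single Tesseract-style language string.

    Args:
        langs: Language codes, for example ``["eng", "fra"]``.

    Returns:
        The codes joined with ``+``, for example ``"eng+fra"``.

    Raises:
        ValueError: If ``langs`` is empty.
    """
    if not langs:
        raise ValueError("langs must contain at least one language code")
    return "+".join(langs)
```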
Insert image, if possible, between two paragraphs.
In this way, a paragraph can be continued on the next page.
"""
parser = self.parser
The change in load in this file is still a breaking change; see the comment on the left-hand side of the PR review.
Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"
Description: Update PyMuPDFParser/Loader
Twitter handle: pprados
This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This specific part focuses on preparing the update of all parsers.
For more details, see PR 28970.
@eyurtsev this is the continuation of the PDFLoader modifications.