Refactoring PDF loaders: 02 PyMuPDF #29063

Open

pprados wants to merge 11 commits into langchain-ai:master from pprados:pprados/02-pymupdf

Contributor

pprados commented Jan 7, 2025

Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"
Description: Update PyMuPDFParser/Loader
Twitter handle: pprados

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This specific part focuses on preparing the update of all parsers.

For more details, see PR 28970.

@eyurtsev it's the continuation of PDFLoader modifications.

pprados marked this pull request as ready for review

January 7, 2025 09:16

dosubot bot added size:XXL community Ɑ: doc loader labels

ccurme assigned eyurtsev

pprados added 7 commits

January 7, 2025 17:08


          Prepare the integration of new versions of PDFLoader.

21759e2

Add file_path with PurePath
Add CloudBlobLoader in __init__
Replace Dict/List to dict/list


          Fix Line too long


          Fix Line too long

668dc9c


          Fix Line too long

7a5b5c5


          Fix Line too long

6340ded


          Update PyMuPDF


          Fix tu

3beda82

pprados force-pushed the pprados/02-pymupdf branch from 039819c to 3beda82 Compare

January 7, 2025 16:09

Contributor Author

pprados commented Jan 7, 2025

@eyurtsev I rebased the code onto master ;-)

eyurtsev reviewed

Collaborator

eyurtsev left a comment

Great, will take a look in the AM.

pprados mentioned this pull request

Refactoring PDF loaders: all #28970

Draft

2 tasks

eyurtsev reviewed

Collaborator

eyurtsev left a comment

Left two major comments, a few stylistic comments, and some nits.

Let's tackle the two major comments:

Define the standardized structure of metadata
Create a dedicated ImageParser which is a blob parser

libs/community/langchain_community/document_loaders/parsers/pdf.py Outdated

@@ -78,6 +203,192 @@ def extract_from_images_with_rapidocr(
                   return text
+              # Type to change the function to convert images to text.
+              CONVERT_IMAGE_TO_TEXT = Optional[Callable[[Iterable[np.ndarray]], Iterator[str]]]

Collaborator

eyurtsev Jan 9, 2025

MAJOR:

Why not use an ImageBlobParser with the regular Blob-to-Document interface? It would allow reusing the image logic for images that do not originate from PDFs (e.g., in a web crawler).

A PDF parser would then accept a parser as part of its initializer:

class PDFParser(...):
   def __init__(self, ...  *, ..., image_blob_parser: Optional[BlobParser] = None):
      pass

If the image_blob_parser is provided, then it'll be used for OCR purposes.
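A minimal, self-contained sketch of the injection pattern suggested above. The `Blob`, `Document`, and `BlobParser` types here are simplified stand-ins written for illustration, not the LangChain classes, and `UppercaseImageParser` plus the `text|image` payload format are purely hypothetical:

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional, Protocol


@dataclass
class Blob:
    """Stand-in for a raw-data blob (hypothetical, not the LangChain class)."""

    data: bytes
    mimetype: str = "application/octet-stream"
    source: Optional[str] = None
    metadata: dict = field(default_factory=dict)


@dataclass
class Document:
    """Stand-in for a parsed document (hypothetical)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class BlobParser(Protocol):
    """Anything that turns a Blob into Documents."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]: ...


class UppercaseImageParser:
    """Toy 'OCR' parser: treats the image bytes as text and upper-cases them."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        text = blob.data.decode("utf-8").upper()
        yield Document(page_content=text, metadata={"source": blob.source})


class PDFParser:
    """Toy PDF parser that delegates image handling to an optional blob parser."""

    def __init__(self, *, image_blob_parser: Optional[BlobParser] = None):
        self.image_blob_parser = image_blob_parser

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # Toy page model (assumption): the payload is b"<text>|<image bytes>".
        text, _, image = blob.data.partition(b"|")
        parts = [text.decode("utf-8")]
        if self.image_blob_parser is not None and image:
            # Wrap the embedded image in its own Blob so any image parser can be reused.
            image_blob = Blob(data=image, mimetype="image/png", source=blob.source)
            parts.extend(
                doc.page_content
                for doc in self.image_blob_parser.lazy_parse(image_blob)
            )
        yield Document(page_content="\n".join(parts), metadata={"source": blob.source})
```

Because the image parser only sees a `Blob`, the same object could in principle be handed to a crawler or any other loader that produces image blobs.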

Contributor Author

pprados Jan 10, 2025

I've done it! There are now 3 ImageBlobParsers!

pprados added 3 commits

January 9, 2025 16:48


          Fix review - step 1

743a83e


          Fix all remarques

b623750


          Merge remote-tracking branch 'upstream/master' into pprados/02-pymupdf

20f5a41

pprados marked this pull request as draft

January 10, 2025 12:45

pprados force-pushed the pprados/02-pymupdf branch from 0d99673 to 3fe4ec5 Compare

January 10, 2025 13:40

pprados force-pushed the pprados/02-pymupdf branch 2 times, most recently from 4342991 to 760267b Compare

January 10, 2025 14:05

pprados force-pushed the pprados/02-pymupdf branch 3 times, most recently from 9fc89e0 to d30b26d Compare

January 10, 2025 14:47

pprados force-pushed the pprados/02-pymupdf branch 2 times, most recently from 6765dbf to df1d4d5 Compare

January 10, 2025 15:09


          Fix remarques

91234f0

pprados force-pushed the pprados/02-pymupdf branch from df1d4d5 to 91234f0 Compare

January 10, 2025 15:37

pprados marked this pull request as ready for review

January 10, 2025 15:46

dosubot bot added the 🤖:docs label

eyurtsev reviewed

libs/community/langchain_community/document_loaders/parsers/pdf.py Show resolved Hide resolved

libs/community/langchain_community/document_loaders/parsers/pdf.py

+                      all_text = _merge_text_and_extras(extras, text_from_page)
+                      if not all_text:
+                          # logger.warning(

Collaborator

eyurtsev Jan 10, 2025

commented out code

libs/community/langchain_community/document_loaders/parsers/images.py

+                              )
+                      except ImportError:
+                          raise ImportError(
+                              "`rapidocr-onnxruntime` package not found, please install it with "

Collaborator

eyurtsev Jan 11, 2025

nit: incorrect import error
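The review does not say which package name the message should carry. As a hedged sketch of one way to keep the imported module and the install hint from drifting apart, a small helper (the name `import_optional` is hypothetical) can build the message from the same arguments it imports with:

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency or raise an ImportError naming the right package."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The install hint is derived from pip_name, so it cannot name another package.
        raise ImportError(
            f"`{pip_name}` package not found, please install it with "
            f"`pip install {pip_name}`"
        ) from exc
```

A caller would pass both the import name and the distribution name, since the two often differ.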

libs/community/langchain_community/document_loaders/parsers/images.py

+                              logger.debug("Image text: %s", content.replace("\n", "\\n"))
+                              yield Document(
+                                  page_content=content,
+                                  metadata={"source": blob.source},

Collaborator

eyurtsev Jan 11, 2025

Propagate blob metadata as well?
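One possible reading of this suggestion, sketched with simplified stand-in `Blob` and `Document` dataclasses (assumptions for illustration, not the LangChain classes): copy the blob's metadata into the document and then set `source`, so the source always reflects the blob.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Blob:
    """Stand-in blob (hypothetical)."""

    data: bytes
    source: Optional[str] = None
    metadata: dict = field(default_factory=dict)


@dataclass
class Document:
    """Stand-in document (hypothetical)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def to_document(content: str, blob: Blob) -> Document:
    """Build a Document that carries the blob's metadata plus its source."""
    # A new dict is created, so the blob's own metadata is left untouched.
    metadata = {**blob.metadata, "source": blob.source}
    return Document(page_content=content, metadata=metadata)
```

Whether `source` should override a `source` key already present in the blob metadata is a design choice the review leaves open; this sketch lets the blob's `source` attribute win.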

libs/community/langchain_community/document_loaders/parsers/images.py

		)


		class MultimodalBlobParser(ImageBlobParser):

Collaborator

eyurtsev Jan 11, 2025

LLMImageBlobParser or something like that?

The goal is to communicate that this is being done by a multimodal LLM.

libs/community/langchain_community/document_loaders/parsers/images.py

+                  def __init__(
+                      self,
+                      *,
+                      format: Literal["text", "markdown", "html"] = "text",

Collaborator

eyurtsev Jan 11, 2025

This format is fairly surprising to see as part of the API -- but I think I'm OK with it.

libs/community/langchain_community/document_loaders/parsers/images.py

		logger = logging.getLogger(__name__)


		class ImageBlobParser(BaseBlobParser):

Collaborator

eyurtsev Jan 11, 2025

Could we rename to BaseImageBlobParser or mark it as private so it's clear that it's abstract?
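A small sketch of what the rename could look like, assuming the shared logic lives in the base class and each backend implements a single hook. The method names `parse` and `_analyze_image` and the `EchoImageBlobParser` subclass are hypothetical, chosen only to illustrate the pattern:

```python
from abc import ABC, abstractmethod


class BaseImageBlobParser(ABC):
    """Abstract base: shared post-processing, backend-specific image analysis."""

    def parse(self, image_bytes: bytes) -> str:
        # Shared logic lives here; subclasses only supply the analysis step.
        return self._analyze_image(image_bytes).strip()

    @abstractmethod
    def _analyze_image(self, image_bytes: bytes) -> str:
        """Return the text extracted from the image."""


class EchoImageBlobParser(BaseImageBlobParser):
    """Toy backend that decodes the bytes as UTF-8 text."""

    def _analyze_image(self, image_bytes: bytes) -> str:
        return image_bytes.decode("utf-8")
```

With `ABC` and `@abstractmethod`, instantiating the base class directly raises `TypeError`, which makes the "abstract" intent enforceable rather than just implied by the name.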

libs/community/langchain_community/document_loaders/parsers/images.py

		return pytesseract.image_to_string(img, lang="+".join(self.langs)).strip()


		_prompt_images_to_description = (

Collaborator

eyurtsev Jan 11, 2025

nit: style to follow Google conventions a bit more closely

Suggested change

-_prompt_images_to_description = (
+_PROMPT_IMAGES_TO_DESCRIPTION = (

libs/community/langchain_community/document_loaders/parsers/images.py

		@@ -0,0 +1,149 @@
		import base64

Collaborator

eyurtsev Jan 11, 2025

I think this makes sense -- would you be willing to do a documentation pass for the API reference for this file? You could push the entire code through ChatGPT and ask for Google-style docstrings. It'll probably do a reasonable job.
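For reference, a Google-style docstring uses `Args:`, `Returns:`, and `Raises:` sections. The function below is a hypothetical example written to show the layout, not code from this PR:

```python
def join_image_texts(texts: list[str], separator: str = "\n") -> str:
    """Join the text extracted from several images into one string.

    Args:
        texts: Text extracted from each image, in page order.
        separator: String inserted between consecutive non-empty texts.

    Returns:
        The non-empty texts joined by ``separator``.

    Raises:
        ValueError: If ``texts`` is empty.
    """
    if not texts:
        raise ValueError("texts must contain at least one entry")
    # Skip images that produced no text so the output has no blank gaps.
    return separator.join(text for text in texts if text)
```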

libs/community/langchain_community/document_loaders/pdf.py

+                      Insert image, if possible, between two paragraphs.
+                      In this way, a paragraph can be continued on the next page.
+                      """
+                      parser = self.parser

Collaborator

eyurtsev Jan 11, 2025

The change to load in this file is still a breaking change; see the comment on the left-hand side of the PR review.

Labels

community Ɑ: doc loader 🤖:docs size:XXL