Refactoring PDF loaders: 02 PyMuPDF #29063

Open
wants to merge 11 commits into
base: master

Contributor

@pprados pprados commented Jan 7, 2025

  • Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"

  • Description: Update PyMuPDFParser/Loader

  • Twitter handle: pprados

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This specific part focuses on preparing the update of all parsers.

For more details, see PR 28970.

@eyurtsev this is the continuation of the PDFLoader modifications.


@pprados pprados marked this pull request as ready for review January 7, 2025 09:16
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Jan 7, 2025
Contributor Author

pprados commented Jan 7, 2025

@eyurtsev I rebased the code onto master ;-)

Collaborator

@eyurtsev eyurtsev left a comment

Great, will take a look in the AM.

@pprados pprados mentioned this pull request Jan 8, 2025
2 tasks
Collaborator

@eyurtsev eyurtsev left a comment

Left two major comments, a few stylistic comments, and some nits.

Let's tackle the two major comments:

  1. Define the standardized structure of metadata
  2. Create a dedicated ImageParser which is a blob parser

@@ -78,6 +203,192 @@ def extract_from_images_with_rapidocr(
    return text


# Type to change the function to convert images to text.
CONVERT_IMAGE_TO_TEXT = Optional[Callable[[Iterable[np.ndarray]], Iterator[str]]]
Collaborator

MAJOR:

Why not use an ImageBlobParser with the regular Blob-to-Document interface? It'll allow reusing the image logic for images that do not originate from PDFs (e.g., to re-use it for a web crawler).


A PDF parser would then accept a parser as part of the initializer:

class PDFParser(...):
   def __init__(self, ...  *, ..., image_blob_parser: Optional[BlobParser] = None):
      pass

If the image_blob_parser is provided, then it'll be used for OCR purposes.
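
The suggestion above can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: Blob, Document, BlobParser, FakeOcrImageParser, and parse_page are simplified stand-ins invented for the example, so the real langchain classes and method names may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional, Protocol


@dataclass
class Blob:
    """Stand-in for a raw-bytes container (hypothetical, not the library class)."""

    data: bytes
    source: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    """Stand-in for a parsed text document (hypothetical)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BlobParser(Protocol):
    """Anything that can turn a Blob into Documents."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        ...


class FakeOcrImageParser:
    """Toy image parser: pretends the image bytes are already UTF-8 text."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(blob.data.decode("utf-8"), {"source": blob.source})


class PDFParser:
    """Toy PDF parser that delegates image handling to an injected parser."""

    def __init__(self, *, image_blob_parser: Optional[BlobParser] = None) -> None:
        self.image_blob_parser = image_blob_parser

    def parse_page(self, text: str, images: List[Blob]) -> str:
        parts = [text]
        # Images are only converted to text when a parser was injected.
        if self.image_blob_parser is not None:
            for image in images:
                for doc in self.image_blob_parser.lazy_parse(image):
                    parts.append(doc.page_content)
        return "\n".join(parts)


image = Blob(data=b"text found in image", source="page1.png")
without_ocr = PDFParser().parse_page("page text", [image])
with_ocr = PDFParser(image_blob_parser=FakeOcrImageParser()).parse_page(
    "page text", [image]
)
print(without_ocr)
print(with_ocr)
```

Because the image parser only depends on the Blob-to-Document interface, the same object could in principle be handed to a non-PDF caller such as a web crawler, which is the reuse the comment is asking for.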

Contributor Author

I've done it! There are now 3 ImageBlobParsers!

@pprados pprados force-pushed the pprados/02-pymupdf branch from df1d4d5 to 91234f0 Compare January 10, 2025 15:37
@pprados pprados marked this pull request as ready for review January 10, 2025 15:46
@dosubot dosubot bot added the 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder label Jan 10, 2025
all_text = _merge_text_and_extras(extras, text_from_page)

if not all_text:
    # logger.warning(
Collaborator

commented out code

    )
except ImportError:
    raise ImportError(
        "`rapidocr-onnxruntime` package not found, please install it with "
Collaborator

nit: incorrect import error

logger.debug("Image text: %s", content.replace("\n", "\\n"))
yield Document(
    page_content=content,
    metadata={"source": blob.source},
Collaborator

Propagate blob metadata as well?
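
One way to read this suggestion, as a hedged sketch: merge the blob's own metadata into the document metadata instead of keeping only the source. The helper name build_image_metadata is hypothetical, and letting the explicit source override a conflicting key is an assumption, not something the review specifies.

```python
from typing import Any, Dict, Optional


def build_image_metadata(
    source: Optional[str], blob_metadata: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge blob metadata with the source.

    Assumption: the explicit source wins if the blob metadata also
    carries a "source" key.
    """
    return {**blob_metadata, "source": source}


meta = build_image_metadata("scan.png", {"page": 3, "source": "stale"})
print(meta)
```

If the opposite precedence is wanted, swapping the order inside the dict literal is enough.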

)


class MultimodalBlobParser(ImageBlobParser):
Collaborator

LLMImageBlobParser or something like that?

The goal is to communicate that this is being done by a multimodal LLM.

def __init__(
    self,
    *,
    format: Literal["text", "markdown", "html"] = "text",
Collaborator

This format is fairly surprising to see as part of the API -- but I think I'm OK with it.

logger = logging.getLogger(__name__)


class ImageBlobParser(BaseBlobParser):
Collaborator

Could we rename to BaseImageBlobParser or mark as private so it's clear that it's abstract?
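
A small sketch of what the rename could communicate, assuming the base class is built on Python's abc module. The hook name _analyze_image and the UppercaseImageParser subclass are hypothetical and exist only to make the example runnable; they are not taken from the PR.

```python
from abc import ABC, abstractmethod


class BaseImageBlobParser(ABC):
    """Abstract base: the "Base" prefix plus ABC signal it is not used directly."""

    def parse(self, image_bytes: bytes) -> str:
        # Shared post-processing lives in the base class.
        return self._analyze_image(image_bytes).strip()

    @abstractmethod
    def _analyze_image(self, image_bytes: bytes) -> str:
        """Convert one image to text; each concrete parser implements this."""


class UppercaseImageParser(BaseImageBlobParser):
    """Toy concrete parser, used only for the demonstration."""

    def _analyze_image(self, image_bytes: bytes) -> str:
        return image_bytes.decode("utf-8").upper()


try:
    BaseImageBlobParser()  # abstract classes cannot be instantiated
    instantiated = True
except TypeError:
    instantiated = False

result = UppercaseImageParser().parse(b" hi ")
print(instantiated, result)
```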

    return pytesseract.image_to_string(img, lang="+".join(self.langs)).strip()


_prompt_images_to_description = (
Collaborator

nit: style to follow Google conventions a bit more closely

Suggested change
- _prompt_images_to_description = (
+ _PROMPT_IMAGES_TO_DESCRIPTION = (

@@ -0,0 +1,149 @@
import base64
Collaborator

I think this makes sense -- would you be willing to do a documentation pass for the API reference for this file? You could push the entire code through ChatGPT and ask for Google-style docstrings. It'll probably do a reasonable job.
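
For reference, a Google-style docstring uses Args, Returns, and Raises sections. The function below is a hypothetical helper written only to show the layout; it is not part of the file under review, although it uses the same base64 module that file imports.

```python
import base64


def encode_image(data: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL.

    Args:
        data: Raw bytes of the image.
        mime_type: MIME type to embed in the URL prefix.

    Returns:
        A ``data:`` URL containing the base64-encoded image.

    Raises:
        ValueError: If ``data`` is empty.
    """
    if not data:
        raise ValueError("data must not be empty")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


url = encode_image(b"abc")
print(url)
```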

Insert image, if possible, between two paragraphs.
In this way, a paragraph can be continued on the next page.
"""
parser = self.parser
Collaborator

The change to load in this file is still a breaking change; see the comment on the left-hand side of the PR review.

Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder size:XXL This PR changes 1000+ lines, ignoring generated files.
Projects
Status: In review

2 participants