Regions missed by layout analysis are absent from the parsed output #1441

Open
caijijuhe opened this issue Jan 7, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@caijijuhe

Description of the bug

Snipaste_2025-01-07_14-35-41
Layout analysis misses some regions, so when the characters are later filled in with PyMuPDF, the content in those regions is discarded.

How to reproduce the bug

Snipaste_2025-01-07_14-37-22
Could PyMuPDF's get_blocks be used as a cross-check to supplement the layout analysis results, adding back the text regions that layout analysis missed?
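
A minimal sketch of the cross-check proposed above, for use as a diagnostic outside the pipeline. It is not part of magic-pdf. The function names, the `min_coverage` threshold, and the assumption that layout boxes and PyMuPDF blocks share one coordinate space are all hypothetical. The block tuples follow the shape PyMuPDF returns from `page.get_text("blocks")`: `(x0, y0, x1, y1, text, block_no, block_type)`, where `block_type` 0 means text.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def covered_fraction(block: BBox, layout_boxes: Sequence[BBox]) -> float:
    """Largest fraction of the block's area covered by any single layout box."""
    bx0, by0, bx1, by1 = block
    area = max(bx1 - bx0, 0.0) * max(by1 - by0, 0.0)
    if area == 0:
        return 1.0  # degenerate block: treat as covered, nothing to miss
    best = 0.0
    for lx0, ly0, lx1, ly1 in layout_boxes:
        w = min(bx1, lx1) - max(bx0, lx0)
        h = min(by1, ly1) - max(by0, ly0)
        if w > 0 and h > 0:
            best = max(best, (w * h) / area)
    return best


def find_uncovered_text_blocks(
    pdf_blocks: Sequence[tuple],
    layout_boxes: Sequence[BBox],
    min_coverage: float = 0.5,  # hypothetical threshold, tune per document
) -> List[Tuple[BBox, str]]:
    """Return text blocks that no layout box covers by at least min_coverage."""
    missed = []
    for x0, y0, x1, y1, text, _block_no, block_type in pdf_blocks:
        if block_type != 0 or not text.strip():
            continue  # skip image blocks and empty text
        bbox = (x0, y0, x1, y1)
        if covered_fraction(bbox, layout_boxes) < min_coverage:
            missed.append((bbox, text.strip()))
    return missed


if __name__ == "__main__":
    # Hand-written sample data standing in for one page.
    sample_blocks = [
        (10, 10, 110, 30, "Title\n", 0, 0),
        (10, 200, 110, 220, "Missed paragraph\n", 1, 0),
    ]
    sample_layout = [(0, 0, 120, 40)]
    for bbox, text in find_uncovered_text_blocks(sample_blocks, sample_layout):
        print(bbox, text)
```

With a real file, `pdf_blocks` could come from `fitz.open(path)[page_index].get_text("blocks")`. Layout boxes produced from a rendered page image would first need scaling into PDF points. The output only flags candidate regions for inspection; it does not decide whether they belong in the result.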

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.10.x

Device mode

cuda

caijijuhe added the bug label Jan 7, 2025
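@myhloli
Collaborator

myhloli commented Jan 9, 2025

In principle, the layout result is the foundation for every later stage of the pipeline and is treated as trusted, whereas PyMuPDF's block information is not. We therefore will not use PyMuPDF block information to repair the layout blocks. Cases like this can only be fixed by iterating on the layout model.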

@myhloli
Collaborator

myhloli commented Jan 9, 2025

Could you provide a sample that reproduces this?

@ufxelv80

I've run into the same problem: content is missing after parsing.
image

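@myhloli
Collaborator

myhloli commented Jan 10, 2025

@ufxelv80 Yours is a different case. The one above is a missed detection by the layout model; in yours, the element was classified as a footer. Elements close to the top or bottom edge of the page may be treated as headers or footers and discarded.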
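
To tell the two cases apart on the user's side, a rough geometric check can flag blocks that sit near the page's top or bottom edge and are therefore candidates for header or footer classification. This is only an illustrative sketch: the function name and the `margin_ratio` value are hypothetical, and it is not the criterion magic-pdf applies, which is not stated in this thread.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def edge_zone(bbox: BBox, page_height: float, margin_ratio: float = 0.08) -> Optional[str]:
    """Return "header" or "footer" if the box lies fully inside an edge band.

    margin_ratio is a hypothetical band size, expressed as a fraction of
    the page height. Returns None for boxes outside both bands.
    """
    _x0, y0, _x1, y1 = bbox
    margin = page_height * margin_ratio
    if y1 <= margin:
        return "header"
    if y0 >= page_height - margin:
        return "footer"
    return None


if __name__ == "__main__":
    # An A4 page is about 842 points tall.
    print(edge_zone((50, 800, 500, 830), page_height=842))
    print(edge_zone((50, 300, 500, 400), page_height=842))
```

A lost block that falls inside one of these bands is more likely a header or footer discard than a layout miss, which would be worth stating when reporting the sample.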

@caijijuhe
Author

> Could you provide a sample that reproduces this?

How should I send it to you?

@myhloli
Collaborator

myhloli commented Jan 10, 2025

> > Could you provide a sample that reproduces this?
>
> How should I send it to you?

[email protected]
