-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use TextLine objects instead of the full plain text in coordinate extraction #58
Conversation
This comes in handy with the changes I am thinking about for label studio with regards to coordinate extraction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with some open questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether it makes sense to create a new python file for this. Could we thematically group it to line.py as this is the code that creates the objects defined in line.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'm more in favor of more files with fewer lines of code per file, and maybe group files into subdirectories/packages/modules when we start having too many files in one place.
But if you feel strongly about it, we could also put these in one file, they are certainly closely related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow up ticket regarding this is created on Jira.
src/stratigraphy/extract.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with that change. I am just thinking if there is any use case where we'd need the page / doc object inside process page. But should that be the case we can always add the page object again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, but at this point in time, I don't see a need for it.
Also, it feels more consistent now with the geometric_lines
, which are derived from the page object in a similar way, but were already passed as a separate parameter to the process_page
method.
The extraction of the coordinates values themselves still happens from plain text, not from the TextLine objects. We could improve on this in a future implementation, to use the TextLine objects everywhere. The main advantage of this would be, that we would also know where on the PDF page the coordinates were found, and we could use this information for the visualisation.
There are no changes in the KPIs for both the Zurich dataset and the Geoquat validation dataset.