Use TextLine objects instead of the full plain text in coordinate extraction #58

Merged
merged 7 commits into from
Jun 7, 2024

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo commented Jun 5, 2024

  • Move the creation of TextLine objects for a given PDF page into a separate method that is used by both layer extraction and coordinate extraction.
  • Instead of finding the coordinate key in the plain document text, find it in the TextLine objects, so that we also know where it is located on the PDF page.
  • Instead of looking for coordinates within 100 characters of the coordinate key (in the default "reading order"), look for coordinates to the right of and/or immediately below the coordinate key. This should be a little more robust, as it depends less on the order in which text is defined in the document.
  • Instead of concatenating the text of all pages at the beginning of the coordinate extraction, iterate over the pages and return on the first page where a valid coordinate is found. (This also tells us on which page the coordinate was found, which is necessary for a future visualisation.)
  • Improve logging a little bit.
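
As a rough illustration of the "to the right and/or immediately below" search, here is a minimal sketch. It is not the code from this PR: `Rect`, `TextLine` (a simplified stand-in for the project's class) and `lines_near_key` are hypothetical names, the `max_below` tolerance is an assumed value, and the y axis is assumed to grow downwards as is common for PDF page coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box; y grows downwards (hypothetical stand-in)."""

    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class TextLine:
    """Simplified stand-in for the project's TextLine: text plus position."""

    text: str
    rect: Rect


def lines_near_key(key: TextLine, lines: list[TextLine], max_below: float = 30.0) -> list[TextLine]:
    """Return the lines to the right of the key or immediately below it.

    max_below is an assumed tolerance in page units, not a value from the PR.
    """
    result = []
    for line in lines:
        if line is key:
            continue
        # To the right: vertical overlap with the key and starting past its right edge.
        same_row = line.rect.y0 < key.rect.y1 and line.rect.y1 > key.rect.y0
        to_the_right = same_row and line.rect.x0 >= key.rect.x1
        # Immediately below: starts within max_below under the key and overlaps it horizontally.
        gap = line.rect.y0 - key.rect.y1
        overlaps_x = line.rect.x0 < key.rect.x1 and line.rect.x1 > key.rect.x0
        directly_below = 0 <= gap <= max_below and overlaps_x
        if to_the_right or directly_below:
            result.append(line)
    return result
```

Because the filter only looks at positions, it gives the same result regardless of the order in which the lines appear in the input list, which is the robustness argument made in the third bullet.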

The extraction of the coordinate values themselves still happens from plain text, not from the TextLine objects. A future implementation could improve on this by using the TextLine objects everywhere. The main advantage would be that we would also know where on the PDF page the coordinates were found, and could use this information for the visualisation.
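
The page-by-page flow with plain-text value extraction could be sketched as follows. Everything here is an assumption made for illustration: the function names, the example key `"Koordinaten"`, and the regular expression (two numbers of 6 to 7 digits with optional grouping separators), which is not the pattern used in the project. "Valid" is reduced to "parsable" in this sketch.

```python
import re
from typing import Optional

# Illustrative pattern only: two numbers of 6-7 digits, optionally grouped by
# apostrophes, dots or spaces, separated by "/", "," or ";".
NUMBER = r"(\d{1,3}(?:['. ]?\d{3}){1,2})"
COORDINATE_PATTERN = re.compile(NUMBER + r"\s*[/,;]\s*" + NUMBER)


def parse_coordinates(text: str) -> Optional[tuple[int, int]]:
    """Extract the first pair of numbers from plain text, or None if there is none."""
    match = COORDINATE_PATTERN.search(text)
    if match is None:
        return None
    first, second = (int(re.sub(r"\D", "", group)) for group in match.groups())
    return first, second


def find_coordinates(pages: list[list[str]], key: str = "Koordinaten") -> Optional[tuple[int, int, int]]:
    """Return (page_number, first, second) for the first page with a parsable coordinate."""
    for page_number, lines in enumerate(pages, start=1):
        for index, line in enumerate(lines):
            if key.lower() not in line.lower():
                continue
            # Simplification: the key line plus the next line stand in for the
            # text found to the right of and/or immediately below the key.
            nearby_text = " ".join(lines[index:index + 2])
            coordinates = parse_coordinates(nearby_text)
            if coordinates is not None:
                # Stop at the first page that yields a result, so the page number is known.
                return (page_number, *coordinates)
    return None
```

Returning the page number together with the values is what makes a later visualisation possible; with concatenated text from all pages, that information would be lost.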

The KPIs are unchanged for both the Zurich dataset and the Geoquat validation dataset.

github-actions bot commented Jun 5, 2024

Coverage Report

File                              Stmts  Miss  Cover  Missing
src/stratigraphy
   __init__.py                        8     1    88%  11
   extract.py                       188   188     0%  3–482
   get_files.py                      21    21     0%  3–48
   line_detection.py                 26    26     0%  3–76
   main.py                           94    94     0%  3–237
src/stratigraphy/util
   coordinate_extraction.py         121    19    84%  29, 49, 53, 57, 61–69, 90, 186, 284, 288, 300–303
   dataclasses.py                    32     3    91%  37–39
   depthcolumn.py                   208    67    68%  26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 313, 323, 352, 373, 376–387, 402–403, 448–490
   depthcolumnentry.py               20     4    80%  12, 15, 27, 34
   description_block_splitter.py     70     2    97%  24, 139
   draw.py                           74    74     0%  3–226
   duplicate_detection.py            51    51     0%  3–146
   extract_text.py                   27     2    93%  38–39
   find_depth_columns.py             89     6    93%  41–42, 70, 82, 175–176
   find_description.py               63    28    56%  27–35, 50–63, 79–95, 172–175
   geometric_line_utilities.py       86     2    98%  82, 132
   interval.py                      107    52    51%  25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188
   language_detection.py             18    18     0%  3–45
   layer_identifier_column.py        91    91     0%  3–227
   line.py                           49     4    92%  25, 42, 51, 98
   linesquadtree.py                  46     1    98%  76
   plot_utils.py                     43    43     0%  3–120
   predictions.py                   186   186     0%  3–386
   textblock.py                      74     8    89%  27, 51, 63, 75, 98, 119, 127, 155
   util.py                           40    18    55%  22, 40–47, 61–63, 87–88, 100–105
TOTAL                              1832  1009    45%

Tests Skipped Failures Errors Time
58 0 💤 0 ❌ 0 🔥 0.710s ⏱️

redur commented Jun 6, 2024

This comes in handy for the changes I am considering for Label Studio with regard to coordinate extraction.

@redur redur left a comment

Approved with some open questions.

Contributor

I am not sure whether it makes sense to create a new Python file for this. Could we group it thematically with line.py, as this is the code that creates the objects defined in line.py?

Contributor Author

In general I'm more in favor of more files with fewer lines of code per file, and of grouping files into subdirectories/packages/modules once we start having too many files in one place.
But if you feel strongly about it, we could also put these in one file; they are certainly closely related.

Contributor

A follow-up ticket for this has been created in Jira.

Contributor

I am fine with that change. I am just wondering whether there is any use case where we'd need the page / doc object inside process_page. But should that be the case, we can always add the page object again.

Contributor Author

Good point, but at this point in time, I don't see a need for it.

Also, it feels more consistent now with the geometric_lines, which are derived from the page object in a similar way, but were already passed as a separate parameter to the process_page method.

@redur redur assigned redur and stijnvermeeren-swisstopo and unassigned redur Jun 6, 2024
@stijnvermeeren-swisstopo stijnvermeeren-swisstopo merged commit 3e912ee into main Jun 7, 2024
3 checks passed
@stijnvermeeren-swisstopo stijnvermeeren-swisstopo deleted the coordinates-with-position branch June 7, 2024 11:33