Use TextLine objects instead of the full plain text in coordinate extraction #58

stijnvermeeren-swisstopo · 2024-06-05T08:58:14Z

Move the creation of TextLine objects for a given PDF page into a separate method that is used by both layer extraction and coordinate extraction.
Instead of finding the coordinate key in the plain document text, find it in the TextLine objects, so that we also know where it is located on the PDF page.
Instead of looking for coordinates within 100 characters of the coordinate key (in the default "reading order"), look for coordinates to the right of and/or immediately below the coordinate key. This should be a little more robust and less dependent on the order in which text is defined in the document.
Instead of concatenating the text of all pages at the beginning of the coordinate extraction, iterate over the pages and return on the first page where a valid coordinate is found. (This also lets us know on which page the coordinate was found, which is necessary for a future visualisation.)
Improve logging a little bit.
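
To illustrate the idea behind the spatial search and the page-by-page iteration, here is a minimal, self-contained sketch. It is not the code from this PR: the `Rect` and `TextLine` classes, the key and coordinate patterns, the function names, and the distance threshold are all simplified assumptions made up for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rect:
    """Bounding box in page coordinates (assumption: y grows downwards)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class TextLine:
    """Hypothetical stand-in for the project's TextLine: text plus position."""
    text: str
    rect: Rect


# Illustrative patterns only; the real key list and coordinate regex differ.
KEY_PATTERN = re.compile(r"koordinaten|coordonn[ée]es|coordinates", re.IGNORECASE)
COORDINATE_PATTERN = re.compile(r"(\d{3})[.' ]?(\d{3})\s*/\s*(\d{3})[.' ]?(\d{3})")


def lines_near_key(key: TextLine, lines: list[TextLine]) -> list[TextLine]:
    """Keep lines to the right of the key or immediately below it."""
    key_height = key.rect.y1 - key.rect.y0
    max_gap = 2 * key_height  # assumed threshold for "immediately below"
    nearby = []
    for line in lines:
        same_row = line.rect.y0 < key.rect.y1 and line.rect.y1 > key.rect.y0
        to_the_right = same_row and line.rect.x0 >= key.rect.x0
        gap = line.rect.y0 - key.rect.y1
        just_below = 0 <= gap <= max_gap and line.rect.x1 > key.rect.x0
        if to_the_right or just_below:
            nearby.append(line)
    return nearby


def find_coordinates_on_page(lines: list[TextLine]) -> Optional[tuple[int, int]]:
    """Search near each coordinate key; values are still parsed from plain text."""
    for key in lines:
        if not KEY_PATTERN.search(key.text):
            continue
        candidate_text = " ".join(line.text for line in lines_near_key(key, lines))
        match = COORDINATE_PATTERN.search(candidate_text)
        if match:
            east = int(match.group(1) + match.group(2))
            north = int(match.group(3) + match.group(4))
            return east, north
    return None


def find_coordinates(pages: list[list[TextLine]]) -> Optional[tuple[int, tuple[int, int]]]:
    """Return (page_number, coordinates) for the first page with a match."""
    for page_number, lines in enumerate(pages, start=1):
        result = find_coordinates_on_page(lines)
        if result is not None:
            return page_number, result
    return None
```

Because the search is driven by bounding boxes rather than character offsets, a coordinate-like number elsewhere on the page (for example in a header above the key) is ignored even if it happens to be adjacent in the text stream, and the caller learns which page the match came from.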

The extraction of the coordinate values themselves still happens from plain text, not from the TextLine objects. A future implementation could improve on this by using the TextLine objects everywhere. The main advantage would be that we would also know where on the PDF page the coordinates were found, and could use this information for the visualisation.

There are no changes in the KPIs for either the Zurich dataset or the Geoquat validation dataset.

github-actions · 2024-06-05T09:18:29Z

Coverage Report

File	Stmts	Miss	Cover	Missing
src/stratigraphy
__init__.py	8	1	88%	11
extract.py	188	188	0%	3–482
get_files.py	21	21	0%	3–48
line_detection.py	26	26	0%	3–76
main.py	94	94	0%	3–237
src/stratigraphy/util
coordinate_extraction.py	121	19	84%	29, 49, 53, 57, 61–69, 90, 186, 284, 288, 300–303
dataclasses.py	32	3	91%	37–39
depthcolumn.py	208	67	68%	26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 313, 323, 352, 373, 376–387, 402–403, 448–490
depthcolumnentry.py	20	4	80%	12, 15, 27, 34
description_block_splitter.py	70	2	97%	24, 139
draw.py	74	74	0%	3–226
duplicate_detection.py	51	51	0%	3–146
extract_text.py	27	2	93%	38–39
find_depth_columns.py	89	6	93%	41–42, 70, 82, 175–176
find_description.py	63	28	56%	27–35, 50–63, 79–95, 172–175
geometric_line_utilities.py	86	2	98%	82, 132
interval.py	107	52	51%	25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188
language_detection.py	18	18	0%	3–45
layer_identifier_column.py	91	91	0%	3–227
line.py	49	4	92%	25, 42, 51, 98
linesquadtree.py	46	1	98%	76
plot_utils.py	43	43	0%	3–120
predictions.py	186	186	0%	3–386
textblock.py	74	8	89%	27, 51, 63, 75, 98, 119, 127, 155
util.py	40	18	55%	22, 40–47, 61–63, 87–88, 100–105
TOTAL	1832	1009	45%

Tests	Skipped	Failures	Errors	Time
58	0 💤	0 ❌	0 🔥	0.710s ⏱️

src/stratigraphy/util/coordinate_extraction.py

redur · 2024-06-06T08:54:59Z

This comes in handy with the changes I am thinking about for label studio with regards to coordinate extraction.

redur

Approved with some open questions.

redur · 2024-06-06T08:51:10Z

src/stratigraphy/util/extract_text.py

I am not sure whether it makes sense to create a new Python file for this. Could we group it thematically with line.py, as this is the code that creates the objects defined in line.py?

In general I'm more in favor of more files with fewer lines of code per file, and maybe group files into subdirectories/packages/modules when we start having too many files in one place.
But if you feel strongly about it, we could also put these in one file, they are certainly closely related.

A follow-up ticket for this has been created on Jira.

redur · 2024-06-06T08:52:23Z

src/stratigraphy/extract.py

I am fine with that change. I am just wondering whether there is any use case where we'd need the page / doc object inside process_page. But should that be the case, we can always add the page object again.

Good point, but at this point in time, I don't see a need for it.

Also, it feels more consistent now with the geometric_lines, which are derived from the page object in a similar way, but were already passed as a separate parameter to the process_page method.

stijnvermeeren-swisstopo added 3 commits June 4, 2024 16:28

use lines instead of raw text for finding coordinate keys

43338f5

improved logging

c490723

fix unit tests for coordinate extraction

b264b34

stijnvermeeren-swisstopo requested a review from redur June 5, 2024 09:23

stijnvermeeren-swisstopo added 3 commits June 5, 2024 11:55

bugfix (indentation)

4982205

allow coordinate key at end of line + add unit test + spelling fix

6dce022

cleanup

46724d0

redur reviewed Jun 6, 2024

redur approved these changes Jun 6, 2024

update docstring

43c2ea6

redur assigned redur and stijnvermeeren-swisstopo and unassigned redur Jun 6, 2024

stijnvermeeren-swisstopo merged commit 3e912ee into main Jun 7, 2024
3 checks passed

stijnvermeeren-swisstopo deleted the coordinates-with-position branch June 7, 2024 11:33
