
Feat/improve coordinate extraction #48

Merged · 13 commits merged into main from feat/improve_coordinate_extraction on May 28, 2024

Conversation

@redur (Contributor) commented on May 24, 2024

  • Added additional metrics for the coordinate extraction evaluation.
  • Adjusted the extraction logic to ensure that the X & Y values are close together in the text.
  • Changed the evaluation to allow small deviations caused by the approximate conversion between the LV03 and LV95 coordinate systems.
  • Added an additional check for the validity of the extracted coordinates (a rough sketch of these checks follows the outcome list below).

Outcome:

  • (Almost) all corner cases are covered by the improved extraction, i.e. the corner cases listed in the Jira ticket and similar cases. Both coordinate accuracy and coordinate precision increased.
  • There does not seem to be any need to use the PDF coordinates (i.e. the location of the text in the PDF) to determine the correct coordinates.
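
For illustration only, here is a minimal sketch of the kind of validity and tolerance checks described above. The names, ranges and tolerance are assumptions for this sketch and do not mirror the actual code in coordinate_extraction.py.

```python
# Hypothetical sketch only; constants and names are illustrative, not taken from the PR.

# Rough plausibility ranges for Swiss coordinates (LV95 values carry a leading 2/1 prefix).
LV95_EAST, LV95_NORTH = (2_450_000, 2_850_000), (1_050_000, 1_350_000)
LV03_EAST, LV03_NORTH = (450_000, 850_000), (50_000, 350_000)


def is_plausible_swiss_coordinate(east: float, north: float) -> bool:
    """Validity check: both values must fall into the LV95 ranges or both into the LV03 ranges."""
    in_lv95 = LV95_EAST[0] <= east <= LV95_EAST[1] and LV95_NORTH[0] <= north <= LV95_NORTH[1]
    in_lv03 = LV03_EAST[0] <= east <= LV03_EAST[1] and LV03_NORTH[0] <= north <= LV03_NORTH[1]
    return in_lv95 or in_lv03


def coordinates_match(extracted: tuple[float, float], ground_truth: tuple[float, float],
                      tolerance_m: float = 2.0) -> bool:
    """Evaluation: tolerate a small deviation to absorb the approximate LV03 <-> LV95 conversion."""
    return all(abs(value - truth) <= tolerance_m for value, truth in zip(extracted, ground_truth))
```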

@redur self-assigned this on May 24, 2024
github-actions bot commented on May 24, 2024

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| **src/stratigraphy** | | | | |
| `__init__.py` | 8 | 1 | 88% | 11 |
| extract.py | 211 | 211 | 0% | 3–507 |
| get_files.py | 21 | 21 | 0% | 3–48 |
| line_detection.py | 26 | 26 | 0% | 3–76 |
| main.py | 91 | 91 | 0% | 3–232 |
| **src/stratigraphy/util** | | | | |
| coordinate_extraction.py | 116 | 20 | 83% | 25, 45, 49, 53, 57–65, 86, 171, 191, 280, 283–284, 288, 300 |
| dataclasses.py | 32 | 3 | 91% | 37–39 |
| depthcolumn.py | 206 | 67 | 67% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 310, 314, 343, 364, 367–378, 393–394, 439–481 |
| depthcolumnentry.py | 20 | 4 | 80% | 12, 15, 27, 34 |
| description_block_splitter.py | 70 | 2 | 97% | 24, 139 |
| draw.py | 73 | 73 | 0% | 3–225 |
| duplicate_detection.py | 32 | 32 | 0% | 3–81 |
| find_depth_columns.py | 89 | 6 | 93% | 39–40, 68, 80, 173–174 |
| find_description.py | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| geometric_line_utilities.py | 87 | 2 | 98% | 83, 133 |
| interval.py | 107 | 52 | 51% | 25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188 |
| language_detection.py | 18 | 18 | 0% | 3–43 |
| layer_identifier_column.py | 91 | 91 | 0% | 3–227 |
| line.py | 49 | 26 | 47% | 25, 42, 51, 65–95, 98 |
| linesquadtree.py | 46 | 1 | 98% | 76 |
| plot_utils.py | 44 | 44 | 0% | 3–121 |
| predictions.py | 186 | 186 | 0% | 3–386 |
| textblock.py | 74 | 8 | 89% | 27, 51, 63, 75, 98, 119, 127, 155 |
| util.py | 40 | 22 | 45% | 15–18, 22, 26, 40–47, 61–63, 87–88, 100–105 |
| **TOTAL** | **1800** | **1035** | **42%** | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 57 | 0 💤 | 0 ❌ | 0 🔥 | 0.672s ⏱️ |

Adjust extraction logic to ensure X & Y coordinate are close in the text.
@redur force-pushed the feat/improve_coordinate_extraction branch from 99d60ac to 51b1466 on May 24, 2024 11:36
@redur added the "enhancement" (New feature or request) label on May 24, 2024
@redur (Contributor, Author) commented on May 24, 2024

Remaining mistakes on the geoquad dataset:
3425.pdf -> see below
A1150.pdf -> OCR issue
A1163.pdf -> typo on the PDF
A1254.pdf -> OCR issue
A185.pdf -> strange (unexpected) coordinate format
A8425.pdf -> OCR issue
A8432.pdf -> OCR issue

Ground truth wrong:
A1141.pdf -> typo in the ground truth
A361.pdf -> deviation of 4
A494.pdf -> typo in the ground truth
A7068.pdf -> deviation of 28 in the ground truth
A7069.pdf -> deviation of 23
B542.pdf -> deviation of 62

3425.pdf -> The coordinates are written on an underlined dotted line, and the dots from that line are recognized as "." characters in between the numbers. Weakening the recognition pattern to handle this case leads to additional false positives (even though "X" and "Y" appear in the matched string).

I suggest we correct the ground truth, but do not allow larger deviations from the correct value.

Precision is 84% on geoquad (and, accounting for the wrong ground truth entries, even higher).
Precision is 91% on the Zurich dataset (I did not check all profiles for correct ground truth there).

@stijnvermeeren-swisstopo (Contributor) commented:
This looks like a nice accuracy improvement!

The sentence "Adjust extraction logic to ensure X & Y values are close together." is not really accurate, if I understand correctly, as we are not checking that they are close together, but rather that they each fall within a specific range.

I'm not yet fully sure about the decision not to consider the location of the text in the PDF at all. You are right that with the additional constraints implemented here, considering the PDF coordinates does not really seem necessary to achieve a satisfactory accuracy on our data. On the other hand, I feel like this would still be a "cheap" way to give us some extra certainty about the extracted coordinates. Also, we might want to support other countries / coordinate systems in the future, where the coordinate values might be less constrained, and considering the PDF coordinates might become essential.
I'm not fully decided yet on whether I would like to implement this now, or keep this as an idea for the future. I'll review the results more after the weekend, and then make a decision.
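
As a purely illustrative aside (not something this PR implements): one "cheap" way to bring in the PDF layout would be to check that the matched easting and northing tokens are also close to each other on the page. The sketch below assumes PyMuPDF (fitz) as the PDF library; the helper name and the distance threshold are made up.

```python
import fitz  # PyMuPDF


def tokens_are_spatially_close(page: fitz.Page, east_token: str, north_token: str,
                               max_distance: float = 100.0) -> bool:
    """Return True if some occurrence of the two tokens lies within max_distance points on the page."""
    words = page.get_text("words")  # tuples: (x0, y0, x1, y1, word, block_no, line_no, word_no)
    east_rects = [fitz.Rect(w[:4]) for w in words if w[4] == east_token]
    north_rects = [fitz.Rect(w[:4]) for w in words if w[4] == north_token]
    for e in east_rects:
        for n in north_rects:
            # gap between the two bounding boxes along each axis
            dx = max(0.0, max(e.x0, n.x0) - min(e.x1, n.x1))
            dy = max(0.0, max(e.y0, n.y0) - min(e.y1, n.y1))
            if (dx ** 2 + dy ** 2) ** 0.5 <= max_distance:
                return True
    return False
```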

A Contributor commented on the following diff in evaluate_metadata:

@@ -316,8 +317,8 @@ def evaluate_metadata(self, metadata_ground_truth: dict):
ground_truth_east = int(metadata_ground_truth["coordinates"]["E"])
ground_truth_west = int(metadata_ground_truth["coordinates"]["N"])

north, not west

@redur (Contributor, Author) commented on May 27, 2024

> This looks like a nice accuracy improvement!
>
> The sentence "Adjust extraction logic to ensure X & Y values are close together." is not really accurate, if I understand correctly, as we are not checking that they are close together, but rather that they each fall within a specific range.
>
> I'm not yet fully sure about the decision not to consider the location of the text in the PDF at all. You are right that with the additional constraints implemented here, considering the PDF coordinates does not really seem necessary to achieve a satisfactory accuracy on our data. On the other hand, I feel like this would still be a "cheap" way to give us some extra certainty about the extracted coordinates. Also, we might want to support other countries / coordinate systems in the future, where the coordinate values might be less constrained, and considering the PDF coordinates might become essential.
> I'm not fully decided yet on whether I would like to implement this now, or keep this as an idea for the future. I'll review the results more after the weekend, and then make a decision.

Thanks for the review!
With "X and Y close together" I mean that we don't allow too much text (too many characters) between the X and Y values. There is an implicit closeness condition, since the OCR engine places text that is close in the PDF also close together in the extracted text. Of course this is not fully bulletproof, but it works very well on all the different formats that we have in our data set.

Regarding "other" countries. Here I see the problem, that we have specific classes for our Coordinates with custom logic. E.g. adding the 1 and 2, switching the coordinates etc. I fear that another country would have to create their own coordinate class. Or we manage to parametrize the logic in the yaml file (which I believe would be hard).

Either way, it would probably be best to:

  • Create a file that only contains the coordinate classes.
  • Add documentation on how to add your own coordinate class.
  • Add a means to use a custom coordinate class without adjusting the source code (maybe we leave this for the future and open an issue).

Ideally, these coordinate classes could also contain some detection logic (inherited from the base class). This way, another country could easily adjust it to their needs and liking by overriding the base class in their custom coordinate implementation (a hypothetical sketch of this structure follows below).
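
A hypothetical sketch of what such a structure could look like. None of these class or method names exist in the repository; LV95 is only used as the example subclass, and the detection regex is a toy.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Coordinate(ABC):
    """Hypothetical base class; country-specific subclasses override detection and validity."""
    east: float
    north: float

    @abstractmethod
    def is_valid(self) -> bool:
        """Country-specific plausibility check for the extracted values."""

    @classmethod
    @abstractmethod
    def from_text(cls, text: str) -> "Coordinate | None":
        """Country-specific detection logic; return None if nothing was found."""


class LV95Coordinate(Coordinate):
    """Swiss LV95 example: eastings start with 2, northings with 1."""

    def is_valid(self) -> bool:
        return 2_000_000 < self.east < 3_000_000 and 1_000_000 < self.north < 2_000_000

    @classmethod
    def from_text(cls, text: str) -> "Coordinate | None":
        # Toy detection: two seven-digit numbers with the expected prefixes, close together.
        match = re.search(r"(2\d{6})\D{1,12}(1\d{6})", text)
        return cls(east=float(match.group(1)), north=float(match.group(2))) if match else None
```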

The only thing I am not sure about is the evaluation: it is quite specific to our case, since it allows for small deviations due to the approximate conversion between the Swiss coordinate systems.

@redur (Contributor, Author) commented on May 27, 2024

Improve comment about "X & Y are close"

@redur (Contributor, Author) commented on May 27, 2024

Collect ideas regarding refactoring in a ticket.

@redur merged commit 8eed0bb into main on May 28, 2024
3 checks passed
@redur deleted the feat/improve_coordinate_extraction branch on May 28, 2024 15:30