
Feat/improve coordinate extraction #48

Merged · 13 commits merged into main from feat/improve_coordinate_extraction on May 28, 2024

Conversation

@redur (Contributor) commented on May 24, 2024

  • Added additional metrics for the coordinate extraction evaluation.
  • Adjusted the extraction logic to ensure that the X & Y values are close together in the text.
  • Changed the evaluation to allow small deviations caused by the approximate conversion between the LV03 and LV95 coordinate systems.
  • Added an additional check for the validity of the extracted coordinates (a rough sketch of these checks follows the outcome list below).

Outcome:

  • (Almost) all corner cases are covered by the improved extraction, i.e. the corner cases listed in the Jira ticket and similar cases. Both coordinate accuracy and coordinate precision increased.
  • There does not seem to be any need to use the PDF coordinates (i.e. the location of the text in the PDF) to determine the correct coordinates.
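
For illustration only, here is a minimal sketch of the kind of validity and tolerance checks described above. The names, ranges and tolerance are assumptions for this sketch and do not mirror the actual code in coordinate_extraction.py.

```python
# Hypothetical sketch only; constants and names are illustrative, not taken from the PR.

# Rough plausibility ranges for Swiss coordinates (LV95 values carry a leading 2/1 prefix).
LV95_EAST, LV95_NORTH = (2_450_000, 2_850_000), (1_050_000, 1_350_000)
LV03_EAST, LV03_NORTH = (450_000, 850_000), (50_000, 350_000)


def is_plausible_swiss_coordinate(east: float, north: float) -> bool:
    """Validity check: both values must fall into the LV95 ranges or both into the LV03 ranges."""
    in_lv95 = LV95_EAST[0] <= east <= LV95_EAST[1] and LV95_NORTH[0] <= north <= LV95_NORTH[1]
    in_lv03 = LV03_EAST[0] <= east <= LV03_EAST[1] and LV03_NORTH[0] <= north <= LV03_NORTH[1]
    return in_lv95 or in_lv03


def coordinates_match(extracted: tuple[float, float], ground_truth: tuple[float, float],
                      tolerance_m: float = 2.0) -> bool:
    """Evaluation: tolerate a small deviation to absorb the approximate LV03 <-> LV95 conversion."""
    return all(abs(value - truth) <= tolerance_m for value, truth in zip(extracted, ground_truth))
```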

@redur self-assigned this on May 24, 2024
github-actions bot commented on May 24, 2024

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| **src/stratigraphy** | | | | |
| `__init__.py` | 8 | 1 | 88% | 11 |
| extract.py | 211 | 211 | 0% | 3–507 |
| get_files.py | 21 | 21 | 0% | 3–48 |
| line_detection.py | 26 | 26 | 0% | 3–76 |
| main.py | 91 | 91 | 0% | 3–232 |
| **src/stratigraphy/util** | | | | |
| coordinate_extraction.py | 116 | 20 | 83% | 25, 45, 49, 53, 57–65, 86, 171, 191, 280, 283–284, 288, 300 |
| dataclasses.py | 32 | 3 | 91% | 37–39 |
| depthcolumn.py | 206 | 67 | 67% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 310, 314, 343, 364, 367–378, 393–394, 439–481 |
| depthcolumnentry.py | 20 | 4 | 80% | 12, 15, 27, 34 |
| description_block_splitter.py | 70 | 2 | 97% | 24, 139 |
| draw.py | 73 | 73 | 0% | 3–225 |
| duplicate_detection.py | 32 | 32 | 0% | 3–81 |
| find_depth_columns.py | 89 | 6 | 93% | 39–40, 68, 80, 173–174 |
| find_description.py | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| geometric_line_utilities.py | 87 | 2 | 98% | 83, 133 |
| interval.py | 107 | 52 | 51% | 25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188 |
| language_detection.py | 18 | 18 | 0% | 3–43 |
| layer_identifier_column.py | 91 | 91 | 0% | 3–227 |
| line.py | 49 | 26 | 47% | 25, 42, 51, 65–95, 98 |
| linesquadtree.py | 46 | 1 | 98% | 76 |
| plot_utils.py | 44 | 44 | 0% | 3–121 |
| predictions.py | 186 | 186 | 0% | 3–386 |
| textblock.py | 74 | 8 | 89% | 27, 51, 63, 75, 98, 119, 127, 155 |
| util.py | 40 | 22 | 45% | 15–18, 22, 26, 40–47, 61–63, 87–88, 100–105 |
| **TOTAL** | **1800** | **1035** | **42%** | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 57 | 0 💤 | 0 ❌ | 0 🔥 | 0.672s ⏱️ |

Adjust extraction logic to ensure X & Y coordinate are close in the text.
@redur force-pushed the feat/improve_coordinate_extraction branch from 99d60ac to 51b1466 on May 24, 2024 11:36
@redur added the "enhancement" (New feature or request) label on May 24, 2024
@redur (Contributor, Author) commented on May 24, 2024

Remaining mistakes on the geoquad dataset:
3425.pdf -> see below
A1150.pdf -> OCR issue
A1163.pdf -> typo on the PDF
A1254.pdf -> OCR issue
A185.pdf -> strange (unexpected) coordinate format
A8425.pdf -> OCR issue
A8432.pdf -> OCR issue

Ground truth wrong:
A1141.pdf -> typo in the ground truth
A361.pdf -> deviation of 4
A494.pdf -> typo in the ground truth
A7068.pdf -> deviation of 28 in the ground truth
A7069.pdf -> deviation of 23
B542.pdf -> deviation of 62

3425.pdf -> The coordinates are written on an underlined dotted line, and the dots from that line are recognized as "." characters in between the numbers. Weakening the recognition pattern to handle this case leads to additional false positives (even though "X" and "Y" appear in the matched string).

I suggest we correct the ground truth, but do not allow larger deviations from the correct value.

Precision is 84% on geoquad (and, accounting for the wrong ground truth entries, even higher).
Precision is 91% on the Zurich dataset (I did not check all profiles for correct ground truth there).

@stijnvermeeren-swisstopo (Contributor) commented:
This looks like a nice accuracy improvement!

The sentence "Adjust extraction logic to ensure X & Y values are close together." is not really accurate, if I understand correctly, as we are not checking that they are close together, but rather that they each fall within a specific range.

I'm not yet fully sure about the decision not to consider the location of the text in the PDF at all. You are right that with the additional constraints implemented here, considering the PDF coordinates does not really seem necessary to achieve a satisfactory accuracy on our data. On the other hand, I feel like this would still be a "cheap" way to give us some extra certainty about the extracted coordinates. Also, we might want to support other countries / coordinate systems in the future, where the coordinate values might be less constrained, and considering the PDF coordinates might become essential.
I'm not fully decided yet on whether I would like to implement this now, or keep this as an idea for the future. I'll review the results more after the weekend, and then make a decision.
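
As a purely illustrative aside (not something this PR implements): one "cheap" way to bring in the PDF layout would be to check that the matched easting and northing tokens are also close to each other on the page. The sketch below assumes PyMuPDF (fitz) as the PDF library; the helper name and the distance threshold are made up.

```python
import fitz  # PyMuPDF


def tokens_are_spatially_close(page: fitz.Page, east_token: str, north_token: str,
                               max_distance: float = 100.0) -> bool:
    """Return True if some occurrence of the two tokens lies within max_distance points on the page."""
    words = page.get_text("words")  # tuples: (x0, y0, x1, y1, word, block_no, line_no, word_no)
    east_rects = [fitz.Rect(w[:4]) for w in words if w[4] == east_token]
    north_rects = [fitz.Rect(w[:4]) for w in words if w[4] == north_token]
    for e in east_rects:
        for n in north_rects:
            # gap between the two bounding boxes along each axis
            dx = max(0.0, max(e.x0, n.x0) - min(e.x1, n.x1))
            dy = max(0.0, max(e.y0, n.y0) - min(e.y1, n.y1))
            if (dx ** 2 + dy ** 2) ** 0.5 <= max_distance:
                return True
    return False
```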

A Contributor commented on the following diff in evaluate_metadata:

@@ -316,8 +317,8 @@ def evaluate_metadata(self, metadata_ground_truth: dict):
ground_truth_east = int(metadata_ground_truth["coordinates"]["E"])
ground_truth_west = int(metadata_ground_truth["coordinates"]["N"])

north, not west

@redur (Contributor, Author) commented on May 27, 2024

> This looks like a nice accuracy improvement!
>
> The sentence "Adjust extraction logic to ensure X & Y values are close together." is not really accurate, if I understand correctly, as we are not checking that they are close together, but rather that they each fall within a specific range.
>
> I'm not yet fully sure about the decision not to consider the location of the text in the PDF at all. You are right that with the additional constraints implemented here, considering the PDF coordinates does not really seem necessary to achieve a satisfactory accuracy on our data. On the other hand, I feel like this would still be a "cheap" way to give us some extra certainty about the extracted coordinates. Also, we might want to support other countries / coordinate systems in the future, where the coordinate values might be less constrained, and considering the PDF coordinates might become essential.
> I'm not fully decided yet on whether I would like to implement this now, or keep this as an idea for the future. I'll review the results more after the weekend, and then make a decision.

Thanks for the review!
With "X and Y close together" I mean that we don't allow too much text (too many characters) between the X and Y values. There is an implicit closeness condition, since the OCR engine places text that is close in the PDF also close together in the extracted text. Of course this is not fully bulletproof, but it works very well on all the different formats that we have in our data set.

Regarding "other" countries. Here I see the problem, that we have specific classes for our Coordinates with custom logic. E.g. adding the 1 and 2, switching the coordinates etc. I fear that another country would have to create their own coordinate class. Or we manage to parametrize the logic in the yaml file (which I believe would be hard).

Either way, it would probably be best to:

  • Create a file that only contains the coordinate classes.
  • Add documentation on how to add your own coordinate class.
  • Add a means to use a custom coordinate class without adjusting the source code (maybe we leave this for the future and open an issue).

Ideally, these coordinate classes could also contain some detection logic (inherited from the base class). This way, another country could easily adjust it to their needs and liking by overriding the base class in their custom coordinate implementation (a hypothetical sketch of this structure follows below).
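
A hypothetical sketch of what such a structure could look like. None of these class or method names exist in the repository; LV95 is only used as the example subclass, and the detection regex is a toy.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Coordinate(ABC):
    """Hypothetical base class; country-specific subclasses override detection and validity."""
    east: float
    north: float

    @abstractmethod
    def is_valid(self) -> bool:
        """Country-specific plausibility check for the extracted values."""

    @classmethod
    @abstractmethod
    def from_text(cls, text: str) -> "Coordinate | None":
        """Country-specific detection logic; return None if nothing was found."""


class LV95Coordinate(Coordinate):
    """Swiss LV95 example: eastings start with 2, northings with 1."""

    def is_valid(self) -> bool:
        return 2_000_000 < self.east < 3_000_000 and 1_000_000 < self.north < 2_000_000

    @classmethod
    def from_text(cls, text: str) -> "Coordinate | None":
        # Toy detection: two seven-digit numbers with the expected prefixes, close together.
        match = re.search(r"(2\d{6})\D{1,12}(1\d{6})", text)
        return cls(east=float(match.group(1)), north=float(match.group(2))) if match else None
```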

The only thing I am not sure about is the evaluation: it is quite specific to our case, since it allows for small deviations due to the approximate conversion between the Swiss coordinate systems.

@redur (Contributor, Author) commented on May 27, 2024

Improve comment about "X & Y are close"

@redur (Contributor, Author) commented on May 27, 2024

Collect ideas regarding refactoring in a ticket.

@redur merged commit 8eed0bb into main on May 28, 2024
3 checks passed
@redur deleted the feat/improve_coordinate_extraction branch on May 28, 2024 15:30