-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/improve coordinate extraction #48
Conversation
Adjust extraction logic to ensure X & Y coordinate are close in the text.
99d60ac
to
51b1466
Compare
Remaining mistakes on the geoquad dataset: GroundTruth Wrong 3425.pdf --> The coordinates are written on a underlined dotted line. The dots from the line are recognized as "." in between the numbers. Weakening recognition statement leads to additional false positives (even though there is "X" and "Y" in the matched string) I suggest we correct the ground truth. But do not allow larger deviations from the correct value. Precision is 84% on geoquad (and considering the wrong ground truth even higher) |
This looks like a nice accuracy improvement! The sentence "Adjust extraction logic to ensure X & Y values are close together." is not really accurate, if I understand correctly, as we are not checking that the are close together, but rather that they each fall within a specific range. I'm not yet fully sure about the decision not to consider the location of the text in the PDF at all. You are right that with the additional constraints implemented here, consdering at the PDF coordinates does not really seem necessary to achieve a satisfactory accuracy on our data. On the other hand, I feel like this would still be a "cheap" way to give us some extra certainty about the extracted coordinates. Also, we might want to support other countries / coordinate systems in the future, where the coordinate values might be less constrained, and considering the PDF coordinates might become essential. |
src/stratigraphy/util/predictions.py
Outdated
@@ -316,8 +317,8 @@ def evaluate_metadata(self, metadata_ground_truth: dict): | |||
ground_truth_east = int(metadata_ground_truth["coordinates"]["E"]) | |||
ground_truth_west = int(metadata_ground_truth["coordinates"]["N"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
north
, nicht west
Thanks for the review! Regarding "other" countries. Here I see the problem, that we have specific classes for our Coordinates with custom logic. E.g. adding the 1 and 2, switching the coordinates etc. I fear that another country would have to create their own coordinate class. Or we manage to parametrize the logic in the yaml file (which I believe would be hard). Either way, probably it would be best to:
Ideally, these coordinate classes could also contain some detection logic (inherited from the base class). This way, another country could easily adjust it to their needs and liking by overwriding the base class in their custom coordinate implementation. The only thing I am not sure about is the evaluation. It is quite specific to our case allowing for deviations due to swiss coordinate conversion. |
Improve comment about "X & Y are close" |
Collect ideas regarding refactoring in a ticket. |
…ion-review Feat/improve coordinate extraction review
Outcome: