Improve duplicate detection to use depth information #49
Conversation
The behaviour for A525 page 1-2 is somewhat suboptimal. We have a situation like
Page 1               Page 2     (duplicate detected)
a
b
c ------------------ C
d ------------------ D
e                    E
f                    F
                     G
                     H
Layers "e" and "f" are included in the output for both page 1 and page 2, even though we could infer from the previously detected duplicates higher up ("c" and "d") that these layers must be duplicates as well.
Something for a follow-up ticket probably, as I don't currently see an easy fix for this.
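One direction for such a follow-up, sketched here under heavy assumptions (the function and page representations are illustrative, not the project's actual API): once an anchor pair of duplicates is found ("d" matching "D" above), the layers that follow the anchor on page 1 could be paired positionally with the layers that follow it on page 2.

```python
def infer_trailing_duplicates(page1_layers, page2_layers, anchor1, anchor2):
    """Given the indices of the last detected duplicate pair, pair up the
    remaining page-1 layers with the page-2 layers that follow the anchor.

    Hypothetical helper; layer lists and index bookkeeping are assumptions.
    """
    tail1 = page1_layers[anchor1 + 1:]
    tail2 = page2_layers[anchor2 + 1:]
    # zip stops at the shorter tail, so extra page-2 layers ("G", "H") are kept.
    return list(zip(tail1, tail2))


page1 = ["a", "b", "c", "d", "e", "f"]
page2 = ["C", "D", "E", "F", "G", "H"]
# "d" (index 3 on page 1) was detected as a duplicate of "D" (index 1 on page 2):
pairs = infer_trailing_duplicates(page1, page2, 3, 1)
# pairs == [("e", "E"), ("f", "F")]
```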
    and current_depth_interval["end"].get("value") == previous_depth_interval["end"].get("value")
):
    duplicate_condition = True
    print("Duplicate condition met")
Use logger instead of print
By the way, not related to this PR, but I was thinking it would be useful if the logger printed a timestamp for each log statement as well. Would that be possible?
Like it is now?
2024-05-28 15:32:48 INFO Processing file: data/data_v2/validation/2537.pdf
2024-05-28 15:32:48 INFO Swapping coordinates.
2024-05-28 15:32:48 INFO Processing page 1
There is a mistake indeed. I am not sure whether the current approach works super well for these long profiles. We're somewhat sensitive to mistakes in the depth column. I would suggest we accept this error for now.

Regarding 2: Another solution would be to define logging in the __init__.py file and import stratigraphy at the beginning. Then we factor that out of the main code.
LGTM.
Follow-up ticket was created: https://jira.swisstopo.ch/browse/LGVISIUM-44
The score remains essentially unaffected (less than a 0.1% improvement), but the duplicate detection itself is improved.
Check borehole profile A525.pdf to see the changes.
Overall logic:
If there is depth information for a given layer --> use that information to detect duplicates.
If there is no depth information for a given layer --> use template matching.
I argue to keep template matching in the logic, as there are borehole profiles that do not have depth columns, and many layers may not have any depth information assigned. Older borehole profiles in particular sometimes come with only a visual representation of depth information, and I believe template matching is still the best way to go in that case.
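The decision logic above might be sketched as follows. This is a simplified illustration, not the PR's implementation: `is_duplicate`, `template_match`, and the layer dictionary layout are assumptions (the depth-interval comparison mirrors the snippet quoted earlier in this conversation).

```python
def template_match(current_layer: dict, previous_layer: dict) -> bool:
    """Placeholder for the visual template-matching comparison (assumption:
    reduced here to a plain text comparison for illustration)."""
    return current_layer.get("text") == previous_layer.get("text")


def is_duplicate(current_layer: dict, previous_layer: dict) -> bool:
    """Decide whether current_layer duplicates previous_layer."""
    current = current_layer.get("depth_interval")
    previous = previous_layer.get("depth_interval")
    if current and previous:
        # Depth information is available for the layer: use it.
        return (
            current["start"].get("value") == previous["start"].get("value")
            and current["end"].get("value") == previous["end"].get("value")
        )
    # No depth information: fall back to template matching.
    return template_match(current_layer, previous_layer)
```

With this shape, layers carrying depth intervals are compared by their boundaries, and only depth-less layers fall through to the (more expensive) template matching.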