Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve duplicate detection to use depth information #49

Merged
merged 5 commits into from
May 29, 2024

Conversation

redur
Copy link
Contributor

@redur redur commented May 27, 2024

The score remains unaffected (less than 0.1% improvement), the duplicate detection is now improved.
Check borehole profile A525.pdf to see the changes.

Overall logic:
If there is depth information for a given layer --> use that information to detect duplicates.
If there is no depth information for a given layer --> use template matching.

I argue to keep template matching in the logic, as there are borehole profiles that do not have depth columns, and many layers may not have any depth-information assigned. Especially older borehole profiles, sometimes come with a visual representation of depth information and I believe template matching is still the best way to go in that case.

@redur redur self-assigned this May 27, 2024
Copy link

github-actions bot commented May 27, 2024

Coverage

Coverage Report
FileStmtsMissCoverMissing
src/stratigraphy
   __init__.py8188%11
   extract.py2102100%3–506
   get_files.py21210%3–48
   line_detection.py26260%3–76
   main.py91910%3–232
src/stratigraphy/util
   coordinate_extraction.py1162083%25, 45, 49, 53, 57–65, 86, 171, 191, 280, 283–284, 288, 300
   dataclasses.py32391%37–39
   depthcolumn.py2066767%26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 310, 314, 343, 364, 367–378, 393–394, 439–481
   depthcolumnentry.py20480%12, 15, 27, 34
   description_block_splitter.py70297%24, 139
   draw.py73730%3–225
   duplicate_detection.py51510%3–146
   find_depth_columns.py89693%39–40, 68, 80, 173–174
   find_description.py632856%27–35, 50–63, 79–95, 172–175
   geometric_line_utilities.py86298%82, 132
   interval.py1075251%25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188
   language_detection.py18180%3–43
   layer_identifier_column.py91910%3–227
   line.py492647%25, 42, 51, 65–95, 98
   linesquadtree.py46198%76
   plot_utils.py43430%3–120
   predictions.py1861860%3–386
   textblock.py74889%27, 51, 63, 75, 98, 119, 127, 155
   util.py402245%15–18, 22, 26, 40–47, 61–63, 87–88, 100–105
TOTAL1816105242% 

Tests Skipped Failures Errors Time
57 0 💤 0 ❌ 0 🔥 0.605s ⏱️

@redur redur added the enhancement New feature or request label May 27, 2024
Copy link
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour for A525 page 1-2 is somewhat suboptimal. We have a situation like

Page 1  duplicate detected   Page 2
 a
 b
 c      ------------------    C 
 d      ------------------    D
 e                            E
 f                            F
                              G
                              H

Layers "e" and "f" are included in the output for both page 1 as well as page 2, even though we could infer from the previously detected duplicates higher up ("c" and "d"), that these layers must be duplicates as well.

Something for a follow-up ticket probably, as I don't currently see an easy fix for this.

and current_depth_interval["end"].get("value") == previous_depth_interval["end"].get("value")
):
duplicate_condition = True
print("Duplicate condition met")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use logger instead of print.

By the way, not related to this PR, but I was thinking that it would be useful if the logger prints a timestamp for each log statement as well. Would that be possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like it is now?

2024-05-28 15:32:48 INFO     Processing file: data/data_v2/validation/2537.pdf
2024-05-28 15:32:48 INFO     Swapping coordinates.
2024-05-28 15:32:48 INFO     Processing page 1

src/stratigraphy/util/duplicate_detection.py Outdated Show resolved Hide resolved
@redur
Copy link
Contributor Author

redur commented May 28, 2024

The behaviour for A525 page 1-2 is somewhat suboptimal. We have a situation like

Page 1  duplicate detected   Page 2
 a
 b
 c      ------------------    C 
 d      ------------------    D
 e                            E
 f                            F
                              G
                              H

Layers "e" and "f" are included in the output for both page 1 as well as page 2, even though we could infer from the previously detected duplicates higher up ("c" and "d"), that these layers must be duplicates as well.

Something for a follow-up ticket probably, as I don't currently see an easy fix for this.

There is a mistake indeed. I am not sure whether the current approach works super well for these long profiles. We're somewhat sensitive to mistakes in the depth column.

I would suggest we accept this error for now.

@stijnvermeeren-swisstopo
Copy link
Contributor

  1. Could you create a follow-up ticket for the A525.pdf case?
  2. Is it necessary to repeat the definition of the log format in every file, or is it possible to define that in a single place?

@redur
Copy link
Contributor Author

redur commented May 29, 2024

  1. Could you create a follow-up ticket for the A525.pdf case?
  2. Is it necessary to repeat the definition of the log format in every file, or is it possible to define that in a single place?

Regarding 2:
It is possible to remove the config statements. We just need to make sure, that the config is defined before any logging statements are done. In our case, we probably are always going to execute main.py and it's sufficient to define logging therein. I adjusted it this way.

Another solution would be do define logging in the init.py file and import stratigraphy at the beginning. Then we factor that out of the main code.

Copy link
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Follow-up ticket was created: https://jira.swisstopo.ch/browse/LGVISIUM-44

@redur redur merged commit f575b6d into main May 29, 2024
3 checks passed
@redur redur deleted the feat/improve_duplicate_detectin branch May 29, 2024 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants