Improve depth entry recognition for cases such as '.80'. #66

Merged
redur merged 3 commits into main from feat/correct_.NN
Jun 25, 2024

Conversation

redur
Contributor

@redur redur commented Jun 24, 2024

And here is the PR to also recognize depth entries such as '.80' and the like.
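As a rough illustration of what recognizing an entry like '.80' involves, here is a minimal sketch. It is not the code from this PR: the pattern `DEPTH_PATTERN` and the function `parse_depth` are hypothetical names, and accepting a comma as decimal separator is an assumption.

```python
import re
from typing import Optional

# Hypothetical pattern (not taken from the repository): either a decimal
# number whose integer part may be missing (".80", "1.20", "0,5"),
# or a plain integer ("12").
DEPTH_PATTERN = re.compile(r"\d*[.,]\d+|\d+")


def parse_depth(text: str) -> Optional[float]:
    """Return the depth as a float, or None if the text is not a depth value."""
    token = text.strip()
    if DEPTH_PATTERN.fullmatch(token) is None:
        return None
    # Assumption: a comma is used as a decimal separator, never for thousands.
    return float(token.replace(",", "."))


print(parse_depth(".80"))   # 0.8
print(parse_depth("1.20"))  # 1.2
print(parse_depth("abc"))   # None
```

The key point is that the integer part is optional (`\d*`), so '.80' is read as 0.80 instead of being rejected.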

github-actions bot commented Jun 24, 2024

Coverage

Coverage Report
File                                  Stmts  Miss  Cover  Missing
src/stratigraphy
   __init__.py                            8     1    88%  11
   extract.py                           188   188     0%  3–482
   get_files.py                          21    21     0%  3–48
   line_detection.py                     26    26     0%  3–76
   main.py                               94    94     0%  3–237
src/stratigraphy/util
   boundarydepthcolumnvalidator.py       41    20    51%  47, 57, 60, 81–84, 109–127, 139–148
   coordinate_extraction.py             127     7    94%  31, 62, 75–76, 80, 205, 328
   dataclasses.py                        32     3    91%  37–39
   depthcolumn.py                       194    64    67%  26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 192, 229, 248–256, 267, 272, 279, 310, 315–322, 337–338, 381–423
   depthcolumnentry.py                   26     7    73%  12, 15, 29–30, 33, 45, 52
   description_block_splitter.py         70     2    97%  24, 139
   draw.py                               80    80     0%  3–244
   duplicate_detection.py                51    51     0%  3–146
   extract_text.py                       27     2    93%  38–39
   find_depth_columns.py                 91     6    93%  42–43, 71, 83, 176–177
   find_description.py                   63    28    56%  27–35, 50–63, 79–95, 172–175
   geometric_line_utilities.py           86     2    98%  82, 132
   interval.py                          106    55    48%  24–27, 31–34, 39, 44, 47, 57–59, 99–145, 166, 171–187
   language_detection.py                 18    18     0%  3–45
   layer_identifier_column.py            91    91     0%  3–227
   line.py                               49     4    92%  25, 42, 51, 98
   linesquadtree.py                      46     1    98%  76
   plot_utils.py                         43    43     0%  3–120
   predictions.py                       187   187     0%  3–387
   textblock.py                          74     8    89%  27, 51, 63, 75, 98, 119, 127, 155
   util.py                               40    18    55%  22, 40–47, 61–63, 87–88, 100–105
TOTAL                                  1879  1027    45%

Tests Skipped Failures Errors Time
61 0 💤 0 ❌ 0 🔥 1.101s ⏱️

@stijnvermeeren-swisstopo
Contributor

stijnvermeeren-swisstopo commented Jun 24, 2024

I made two changes:

  • Apply each correction to a depth value individually, instead of applying all possible corrections at once. For example, from a value of "40", we would like to try both "0.40" and "10", whereas the original implementation would only try "0.10". This currently makes no difference for our datasets, but I think applying corrections individually makes more sense, especially as we will likely add more possible corrections in the future.
  • Allow more than one depth value to be corrected in a depth column. This improves the results for 680244005-bp.pdf (where two corrections of the 4->1 type are needed). For 16005.pdf, however, the multiple corrections cause a column of 3 different values to be incorrectly recognized as a depth column (though this does not affect the F1 score, as this document scores 0% anyway). Maybe allowing a number of corrections proportional to the total number of entries in the depth column would be most robust? However, with so few documents where this makes a difference at all, I don't think it's worthwhile to implement such a rule with an additional parameter that needs to be tuned. @redur what do you think: should we stick to allowing one correction per column for now, or shall we drop that limit?
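To make the two changes concrete, here is a minimal sketch. It is not the repository's implementation: all function names, the `max_corrections` parameter, and the "strictly increasing" validity check are assumptions made for illustration. Each correction is tried on its own, and up to `max_corrections` entries of a column may be corrected.

```python
from itertools import combinations, product
from typing import Callable, List, Optional


def fix_missing_decimal(value: str) -> Optional[str]:
    # "40" -> "0.40": assume a leading "0." was lost (hypothetical rule).
    return "0." + value if value.isdigit() else None


def fix_four_as_one(value: str) -> Optional[str]:
    # "40" -> "10": assume a leading "1" was misread as "4" (hypothetical rule).
    return "1" + value[1:] if value.startswith("4") else None


CORRECTIONS: List[Callable[[str], Optional[str]]] = [fix_missing_decimal, fix_four_as_one]


def candidate_values(value: str) -> List[float]:
    """Apply each correction on its own; never chain them."""
    corrected = (correct(value) for correct in CORRECTIONS)
    return [float(c) for c in corrected if c is not None]


def is_strictly_increasing(values: List[float]) -> bool:
    return all(a < b for a, b in zip(values, values[1:]))


def correct_column(entries: List[str], max_corrections: int = 1) -> Optional[List[float]]:
    """Return a strictly increasing column using at most max_corrections fixes."""
    values = [float(e) for e in entries]
    if is_strictly_increasing(values):
        return values
    for n in range(1, max_corrections + 1):
        for positions in combinations(range(len(entries)), n):
            options = [candidate_values(entries[i]) for i in positions]
            for replacements in product(*options):
                trial = list(values)
                for i, v in zip(positions, replacements):
                    trial[i] = v
                if is_strictly_increasing(trial):
                    return trial
    return None


print(candidate_values("40"))  # [0.4, 10.0]
print(correct_column(["4.0", "2.0", "3.0", "44.0", "15.0"], max_corrections=1))  # None
print(correct_column(["4.0", "2.0", "3.0", "44.0", "15.0"], max_corrections=2))  # [1.0, 2.0, 3.0, 14.0, 15.0]
```

The last two calls show the trade-off under discussion: a column that needs two 4->1 fixes is only recovered when the limit is raised, but a higher limit also makes it easier for an arbitrary column of numbers to pass as a depth column.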

@redur redur closed this Jun 25, 2024
@redur redur reopened this Jun 25, 2024
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

Documentation was added as discussed.

@redur redur merged commit f686692 into main Jun 25, 2024
3 checks passed
@redur redur deleted the feat/correct_.NN branch June 25, 2024 15:25