Close #LGVISIUM-83: Extract coordinates with non integer values #88

Merged


Contributor

@dcleres dcleres commented Oct 1, 2024

Made it possible to extract coordinates with decimal (non-integer) values by adapting the regex. Adapted the tests to cover this case and added a new test file for it.

@dcleres dcleres self-assigned this Oct 1, 2024
@@ -138,10 +138,10 @@ def draw_metadata(
     """
     # TODO associate correctness with the extracted coordinates in a better way
     coordinate_color = "green" if is_coordinate_correct else "red"
-    coordinate_rect = fitz.Rect([5, 5, 200, 25])
+    coordinate_rect = fitz.Rect([5, 5, 250, 30])
Contributor Author

I needed to increase the size of the boxes; otherwise, the coordinates would not be printed because the box was too small.

github-actions bot commented Oct 1, 2024

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| src/stratigraphy | | | | |
| __init__.py | 8 | 1 | 88% | 11 |
| extract.py | 186 | 186 | 0% | 3–483 |
| get_files.py | 19 | 19 | 0% | 3–47 |
| main.py | 119 | 119 | 0% | 3–310 |
| src/stratigraphy/data_extractor | | | | |
| data_extractor.py | 50 | 3 | 94% | 32, 62, 98 |
| src/stratigraphy/depthcolumn | | | | |
| boundarydepthcolumnvalidator.py | 41 | 20 | 51% | 47, 57, 60, 81–84, 110–128, 140–149 |
| depthcolumn.py | 194 | 64 | 67% | 25, 29, 50, 56, 59–60, 84, 87, 94, 101, 109–110, 120, 137–153, 191, 228, 247–255, 266, 271, 278, 309, 314–321, 336–337, 380–422 |
| depthcolumnentry.py | 28 | 6 | 79% | 17, 21, 36, 39, 56, 65 |
| find_depth_columns.py | 106 | 19 | 82% | 42–43, 73, 86, 180–181, 225–245 |
| src/stratigraphy/layer | | | | |
| layer_identifier_column.py | 74 | 52 | 30% | 16–17, 20, 28, 43, 47, 51, 59–63, 66, 74, 91–96, 99, 112, 125–126, 148–158, 172–199 |
| src/stratigraphy/lines | | | | |
| geometric_line_utilities.py | 86 | 2 | 98% | 81, 131 |
| line.py | 51 | 4 | 92% | 25, 50, 60, 110 |
| linesquadtree.py | 46 | 1 | 98% | 75 |
| src/stratigraphy/metadata | | | | |
| coordinate_extraction.py | 108 | 5 | 95% | 30, 64, 94–95, 107 |
| src/stratigraphy/text | | | | |
| description_block_splitter.py | 70 | 2 | 97% | 24, 139 |
| extract_text.py | 29 | 3 | 90% | 19, 53–54 |
| find_description.py | 64 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| textblock.py | 80 | 9 | 89% | 28, 56, 64, 89, 101, 124, 145, 154, 183 |
| src/stratigraphy/util | | | | |
| dataclasses.py | 32 | 3 | 91% | 37–39 |
| interval.py | 104 | 55 | 47% | 29–32, 37–40, 46, 52, 56, 66–68, 107–153, 174, 180–196 |
| predictions.py | 107 | 107 | 0% | 3–282 |
| util.py | 39 | 17 | 56% | 41, 69–76, 90–92, 116–117, 129–133 |
| TOTAL | 1641 | 725 | 56% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 82 | 0 💤 | 0 ❌ | 0 🔥 | 6.302s ⏱️ |

Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

Having looked at it a little more, I think that I do have a solution that keeps the performance status-quo without requiring the additional preprocessing hack. I would use the following regex:

r"(?:([12])[\.\s'‘’]{0,2})?(\d{3})[\.\s'‘’]{0,2}(\d{3})(?:\.(\d{1,}))?"

The difference is that IF we extract decimal digits, THEN the decimal point is required. The current regex has the decimal point always as optional, regardless of whether we have some decimal digits or not.
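To make the difference concrete, here is a minimal sketch that applies the proposed regex with Python's `re` module. The helper `parse_coordinate` and the `LOOSE_PATTERN` variant are hypothetical names introduced only for this illustration; `LOOSE_PATTERN` is not the regex from the PR, just an assumed stand-in for "decimal point always optional".

```python
import re

# The regex proposed in the review comment above.
COORDINATE_PATTERN = re.compile(
    r"(?:([12])[\.\s'‘’]{0,2})?(\d{3})[\.\s'‘’]{0,2}(\d{3})(?:\.(\d{1,}))?"
)

# Hypothetical looser variant (an assumption, for contrast only):
# the decimal point is optional even when decimal digits follow.
LOOSE_PATTERN = re.compile(
    r"(?:([12])[\.\s'‘’]{0,2})?(\d{3})[\.\s'‘’]{0,2}(\d{3})(?:\.?(\d{1,}))?"
)


def parse_coordinate(text, pattern=COORDINATE_PATTERN):
    """Return the first coordinate-like number in `text` as a float, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    prefix, thousands, units, decimals = match.groups()
    digits = (prefix or "") + thousands + units
    if decimals:
        # Decimal digits are only captured together with a decimal point
        # when the stricter pattern is used.
        digits += "." + decimals
    return float(digits)


print(parse_coordinate("2'600'000.5"))               # decimal part kept
print(parse_coordinate("600 000"))                   # integer coordinate
print(parse_coordinate("600000123"))                 # trailing digits ignored
print(parse_coordinate("600000123", LOOSE_PATTERN))  # trailing digits read as decimals
```

With the stricter pattern, digits that directly follow the six coordinate digits without a decimal point are not captured as decimals, whereas the looser variant would absorb them into the value.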

src/stratigraphy/metadata/coordinate_extraction.py (outdated, resolved)
src/stratigraphy/metadata/coordinate_extraction.py (outdated, resolved)
tests/test_coordinate_extraction.py (outdated, resolved)
tests/test_coordinate_extraction.py (outdated, resolved)
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

LGTM!

@dcleres dcleres merged commit 2771506 into main Oct 7, 2024
3 checks passed
@dcleres dcleres deleted the LGVISIUM-83-Extract-coordinates-with-non-integer-values branch October 8, 2024 09:06
2 participants