Close LGVISIUM-63: Extraction of the groundwater logo using computer vision #83

Contributor

@dcleres dcleres commented Sep 20, 2024

This PR adds template matching to the code. This should make it possible to find groundwater information even when the respective keywords are absent.

In this PR, I also reviewed the keywords and removed those that were not used. Furthermore, I added a list of false-positive (FP) keywords, i.e. keywords that were leading to false positives in the detections.

github-actions bot commented Sep 20, 2024

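Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| src/stratigraphy/__init__.py | 8 | 1 | 88% | 11 |
| src/stratigraphy/extract.py | 186 | 186 | 0% | 3–483 |
| src/stratigraphy/get_files.py | 19 | 19 | 0% | 3–47 |
| src/stratigraphy/main.py | 117 | 117 | 0% | 3–308 |
| src/stratigraphy/data_extractor/data_extractor.py | 57 | 3 | 95% | 33, 66, 103 |
| src/stratigraphy/depthcolumn/boundarydepthcolumnvalidator.py | 41 | 20 | 51% | 47, 57, 60, 81–84, 110–128, 140–149 |
| src/stratigraphy/depthcolumn/depthcolumn.py | 194 | 64 | 67% | 25, 29, 50, 56, 59–60, 84, 87, 94, 101, 109–110, 120, 137–153, 191, 228, 247–255, 266, 271, 278, 309, 314–321, 336–337, 380–422 |
| src/stratigraphy/depthcolumn/depthcolumnentry.py | 28 | 6 | 79% | 17, 21, 36, 39, 56, 65 |
| src/stratigraphy/depthcolumn/find_depth_columns.py | 106 | 19 | 82% | 42–43, 73, 86, 180–181, 225–245 |
| src/stratigraphy/layer/layer_identifier_column.py | 74 | 52 | 30% | 16–17, 20, 28, 43, 47, 51, 59–63, 66, 74, 91–96, 99, 112, 125–126, 148–158, 172–199 |
| src/stratigraphy/lines/geometric_line_utilities.py | 86 | 2 | 98% | 81, 131 |
| src/stratigraphy/lines/line.py | 51 | 4 | 92% | 25, 50, 60, 110 |
| src/stratigraphy/lines/linesquadtree.py | 46 | 1 | 98% | 75 |
| src/stratigraphy/metadata/coordinate_extraction.py | 108 | 5 | 95% | 30, 64, 94–95, 107 |
| src/stratigraphy/text/description_block_splitter.py | 70 | 2 | 97% | 24, 139 |
| src/stratigraphy/text/extract_text.py | 29 | 3 | 90% | 19, 53–54 |
| src/stratigraphy/text/find_description.py | 64 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| src/stratigraphy/text/textblock.py | 80 | 9 | 89% | 28, 56, 64, 89, 101, 124, 145, 154, 183 |
| src/stratigraphy/util/dataclasses.py | 32 | 3 | 91% | 37–39 |
| src/stratigraphy/util/interval.py | 104 | 55 | 47% | 29–32, 37–40, 46, 52, 56, 66–68, 107–153, 174, 180–196 |
| src/stratigraphy/util/predictions.py | 107 | 107 | 0% | 3–282 |
| src/stratigraphy/util/util.py | 39 | 17 | 56% | 41, 69–76, 90–92, 116–117, 129–133 |
| TOTAL | 1652 | 723 | 56% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 82 | 0 💤 | 0 ❌ | 0 🔥 | 6.265s ⏱️ |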

@dcleres dcleres changed the title Lgvisium 63 extraction of the groundwater logo using computer vision Close LGVISIUM-63: Extraction of the groundwater logo using computer vision Sep 20, 2024
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

I would not enable this functionality by default at the moment, as it is just too slow and does not lead to an improvement that is significant enough to justify this slow-down. So maybe we should add a config parameter that can be used to decide whether to apply template matching or not. Then we can keep the code for now, and create a follow-up ticket to look into ways to optimize it.

One potential idea would be to use the OpenCV implementation of template matching (cv2.matchTemplate) instead of the scikit-image one. Some people claim that the former is faster (e.g. https://www.reddit.com/r/opencv/comments/g8kdcs/question_the_speed_of_matchtemplate/).
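For readers unfamiliar with the technique under discussion, here is a minimal pure-Python sketch of template matching by normalized cross-correlation. It is not the code from this PR, which relies on a library implementation; the function name `match_template` and the toy `page` and `symbol` arrays are made up for illustration. The nested loops also hint at why the naive approach is slow: every window position costs one pass over the whole template.

```python
from math import sqrt

Image = list[list[float]]


def match_template(image: Image, template: Image) -> tuple[float, tuple[int, int]]:
    """Return (best_score, (row, col)) of the best normalized cross-correlation match."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])

    # Center the template once; its norm is reused for every window.
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_centered = [v - t_mean for v in t_vals]
    t_norm = sqrt(sum(v * v for v in t_centered))

    best_score, best_pos = -1.0, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = [image[r + dr][c + dc] for dr in range(th) for dc in range(tw)]
            w_mean = sum(window) / len(window)
            w_centered = [v - w_mean for v in window]
            w_norm = sqrt(sum(v * v for v in w_centered))
            if t_norm == 0 or w_norm == 0:
                continue  # flat patch: the correlation is undefined, skip it
            score = sum(a * b for a, b in zip(t_centered, w_centered)) / (t_norm * w_norm)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos


# Toy example: a 5x6 "page" with a small symbol pasted at row 2, column 3.
symbol = [[1.0, 0.0], [1.0, 1.0]]
page = [[0.0] * 6 for _ in range(5)]
page[2][3], page[3][3], page[3][4] = 1.0, 1.0, 1.0

score, position = match_template(page, symbol)
```

A detection would then typically be accepted only if `score` exceeds some threshold; the threshold value is a tuning choice that this sketch leaves open.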

src/stratigraphy/groundwater/utility.py
Comment on lines 144 to 145
search_left_factor: float = 3 # NOTE: check files 267125334-bp.pdf, 267125338-bp.pdf, and 267125339-bp.pdf if this
# value is too high, as it might lead to false positives
Contributor

Yes, this, in combination with the new search_above_factor, indeed seems to lead to too many false positives (see e.g. 267125029-bp.pdf). But maybe the ongoing work in https://jira.swisstopo.ch/browse/LGVISIUM-77 will already make this more robust again?

Why exactly was it necessary to increase this value? I don't really understand what the files 267125334-bp.pdf, 267125338-bp.pdf, and 267125339-bp.pdf have to do with it.

Contributor Author

The issue I was facing with the 267125334-bp.pdf, 267125338-bp.pdf, and 267125339-bp.pdf borehole profiles was that false positives were generated if the left search factor was too large. In these profiles, the algorithm would pick up the depth column and extract data from it.

I think the best option performance-wise would be to use the default values from the main branch.
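To make the false-positive mechanism concrete, here is a rough sketch under a stated assumption: that `search_left_factor` widens the search area to the left of a keyword by that many multiples of the keyword's bounding-box width. The thread does not show how the factor is actually applied, so the `Rect` class, the `search_region` helper, and all coordinates below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in page coordinates (hypothetical helper type)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    def overlaps(self, other: "Rect") -> bool:
        return (
            self.x0 < other.x1
            and other.x0 < self.x1
            and self.y0 < other.y1
            and other.y0 < self.y1
        )


def search_region(keyword: Rect, search_left_factor: float) -> Rect:
    """Extend the keyword box to the left by a multiple of its own width (assumed semantics)."""
    return Rect(
        keyword.x0 - search_left_factor * keyword.width,
        keyword.y0,
        keyword.x1,
        keyword.y1,
    )


# Made-up layout: a depth-column entry sits to the left of a groundwater keyword.
depth_entry = Rect(40, 100, 70, 112)
keyword = Rect(200, 100, 260, 112)

narrow = search_region(keyword, search_left_factor=2)  # left edge at x = 80
wide = search_region(keyword, search_left_factor=3)    # left edge at x = 20

narrow_hits_depth_column = narrow.overlaps(depth_entry)
wide_hits_depth_column = wide.overlaps(depth_entry)
```

In this toy layout only the wider region reaches the depth-column entry, which is the kind of situation in which a depth value could be mistaken for a groundwater measurement.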

@dcleres dcleres closed this in ef84413 Oct 7, 2024
@dcleres dcleres reopened this Oct 8, 2024
Contributor Author

dcleres commented Oct 14, 2024

I addressed the comments you raised, @stijnvermeeren-swisstopo. Thank you very much for the review. I added the possibility of running the template matching on demand by setting the environment variable IS_SEARCHING_GROUNDWATER_ILLUSTRATION.
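A minimal sketch of how such an opt-in switch could be read is shown here. Only the variable name IS_SEARCHING_GROUNDWATER_ILLUSTRATION comes from this PR; the helper `is_template_matching_enabled` and the set of accepted truthy values are assumptions, not the repository's actual parsing logic.

```python
import os

ENV_FLAG = "IS_SEARCHING_GROUNDWATER_ILLUSTRATION"


def is_template_matching_enabled(environ=None) -> bool:
    """Return True only if the flag holds a truthy value (assumed: 1/true/yes/on)."""
    env = os.environ if environ is None else environ
    return env.get(ENV_FLAG, "").strip().lower() in {"1", "true", "yes", "on"}


# Passing a plain dict keeps the example independent of the real environment.
enabled = is_template_matching_enabled({ENV_FLAG: "True"})
disabled_by_default = is_template_matching_enabled({})
```

Defaulting to disabled when the variable is unset matches the review request that the slow template-matching step should not run by default.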

Even without running the template matching, the metrics improved:

[Screenshot 2024-10-14 at 14 28 38]

Main branch run used for comparison: vaunted-mink-342

@dcleres dcleres marked this pull request as ready for review October 14, 2024 12:31
src/stratigraphy/groundwater/groundwater_extraction.py
    lines, page_number, terrain_elevation
)
if found_groundwater:
    logger.info("Confidence list: %s", confidence_list)
Contributor Author

Do you find this logging helpful, @stijnvermeeren-swisstopo? In a previous iteration, we removed the logging for the case where no groundwater was found. We can also remove the logging for the case where groundwater is found, as I mostly use it for debugging purposes.
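One possible middle ground, sketched here as an assumption rather than as what the PR ended up doing, is to keep the message but emit it at DEBUG level so that it only appears when debugging output is switched on. The logger name `groundwater_demo` and the `report_confidence` helper are hypothetical.

```python
import logging

logger = logging.getLogger("groundwater_demo")  # hypothetical logger name


def report_confidence(found_groundwater: bool, confidence_list: list[float]) -> None:
    """Log the confidence list at DEBUG level, and only when groundwater was found."""
    if found_groundwater:
        logger.debug("Confidence list: %s", confidence_list)


class ListHandler(logging.Handler):
    """Collect formatted messages in memory so the effect of the log level is visible."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

logger.setLevel(logging.INFO)   # normal runs: the DEBUG message is dropped
report_confidence(True, [0.9])
messages_at_info = list(handler.messages)

logger.setLevel(logging.DEBUG)  # debugging runs: the message is kept
report_confidence(True, [0.9])
messages_at_debug = list(handler.messages)
```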

Contributor Author

dcleres commented Oct 14, 2024

@stijnvermeeren-swisstopo I believe I have implemented all the changes we discussed today. The main change is that the template matching now lives in its own separate file.

Contributor Author

dcleres commented Oct 17, 2024

@stijnvermeeren-swisstopo Thank you for your review. I added groundwater_illustration_matching to the TOML installation script (pyproject.toml).

Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

LGTM!

@dcleres dcleres merged commit 204a50d into main Oct 17, 2024
3 checks passed
@dcleres dcleres deleted the LGVISIUM-63-Extraction-of-the-Groundwater-logo-using-Computer-Vision branch October 21, 2024 07:56