Feat/improve is valid #52

Merged: 3 commits into main from feat/improve_is_valid on May 31, 2024
@redur commented May 31, 2024

Improve is_valid criterion.

The noise check within the is_valid criterion now treats short depth columns (few entries) more strictly than long depth columns (many entries).
This is achieved by making the check depend quadratically on the number of entries.

F1 improvement on Zurich dataset by 0.6%.
Similar F1 on geoquat dataset.
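
To illustrate the idea, here is a minimal sketch of a noise check whose tolerance grows quadratically with the number of entries. The function name `passes_noise_check`, the `noise_count` argument and the `scale` constant are assumptions made for illustration; the actual formula and thresholds in this PR may differ.

```python
def passes_noise_check(n_entries: int, noise_count: int, scale: float = 0.1) -> bool:
    """Return True if the column is clean enough to count as a depth column.

    Hypothetical rule: the tolerated number of noise words grows with the
    square of the number of entries, so short columns tolerate almost no
    noise while long columns tolerate proportionally more.
    """
    tolerated_noise = scale * n_entries**2
    return noise_count <= tolerated_noise


# A short column (3 entries) with 2 noise words: tolerance is about 0.9.
print(passes_noise_check(n_entries=3, noise_count=2))   # False
# A longer column (10 entries) with the same noise: tolerance is about 10.
print(passes_noise_check(n_entries=10, noise_count=2))  # True
```

With a linear rule the tolerated noise per entry would be constant; the quadratic form makes the tolerated noise per entry proportional to the column length, which is what makes short columns the stricter case.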

@@ -25,7 +25,7 @@ def depth_column_entries(all_words: list[TextWord], include_splits: bool) -> lis
     for word in sorted(all_words, key=lambda word: word.rect.y0):
         try:
             input_string = word.text.strip().replace(",", ".")
-            regex = re.compile(r"^-?([0-9]+(\.[0-9]+)?)[müMN\\.]*$")
+            regex = re.compile(r"^-?\.?([0-9]+(\.[0-9]+)?)[müMN\\.]*$")
Contributor Author

We now support numbers such as .40 that sometimes occur.

Contributor

The input .40 is now extracted with value 40, not as 0.40. Is that really what we want?

I would also suggest adding this as a test case in test_find_depth_columns.py.

Contributor Author

I checked again: in older borehole profiles with this "handwritten style", a '-' is sometimes recognized as a '.'. In that case the current behaviour is exactly what we want.

But I also found an occurrence of '.80'. See here: A531.pdf

For our dataset, it is for now better to use the current behaviour.
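
The behaviour discussed in this thread can be reproduced with the two patterns from the diff. This is a minimal sketch: `parse_depth` is a hypothetical helper, and it assumes the value is read from the first capture group, as the reviewer's observation about `.40` implies.

```python
import re
from typing import Optional

# Patterns copied from the diff: before and after this PR.
OLD_REGEX = re.compile(r"^-?([0-9]+(\.[0-9]+)?)[müMN\\.]*$")
NEW_REGEX = re.compile(r"^-?\.?([0-9]+(\.[0-9]+)?)[müMN\\.]*$")


def parse_depth(text: str, regex: re.Pattern) -> Optional[float]:
    """Hypothetical helper: return the first capture group as a float, or None."""
    input_string = text.strip().replace(",", ".")
    match = regex.match(input_string)
    return float(match.group(1)) if match else None


print(parse_depth(".40", OLD_REGEX))    # None: the old pattern rejects a leading dot
print(parse_depth(".40", NEW_REGEX))    # 40.0: the leading dot is outside the group
print(parse_depth("12,5m", NEW_REGEX))  # 12.5: unchanged for ordinary entries
```

Because the optional `\.?` sits outside the capture group, a leading dot is discarded rather than read as a decimal point, which matches the case where the dot is a misrecognized '-'.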


github-actions bot commented May 31, 2024

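Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| src/stratigraphy/__init__.py | 8 | 1 | 88% | 11 |
| src/stratigraphy/extract.py | 210 | 210 | 0% | 3–506 |
| src/stratigraphy/get_files.py | 21 | 21 | 0% | 3–48 |
| src/stratigraphy/line_detection.py | 26 | 26 | 0% | 3–76 |
| src/stratigraphy/main.py | 91 | 91 | 0% | 3–232 |
| src/stratigraphy/util/coordinate_extraction.py | 116 | 20 | 83% | 25, 45, 49, 53, 57–65, 86, 171, 191, 280, 283–284, 288, 300 |
| src/stratigraphy/util/dataclasses.py | 32 | 3 | 91% | 37–39 |
| src/stratigraphy/util/depthcolumn.py | 208 | 67 | 68% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 313, 323, 352, 373, 376–387, 402–403, 448–490 |
| src/stratigraphy/util/depthcolumnentry.py | 20 | 4 | 80% | 12, 15, 27, 34 |
| src/stratigraphy/util/description_block_splitter.py | 70 | 2 | 97% | 24, 139 |
| src/stratigraphy/util/draw.py | 73 | 73 | 0% | 3–225 |
| src/stratigraphy/util/duplicate_detection.py | 51 | 51 | 0% | 3–146 |
| src/stratigraphy/util/find_depth_columns.py | 89 | 6 | 93% | 41–42, 70, 82, 175–176 |
| src/stratigraphy/util/find_description.py | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| src/stratigraphy/util/geometric_line_utilities.py | 86 | 2 | 98% | 82, 132 |
| src/stratigraphy/util/interval.py | 107 | 52 | 51% | 25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188 |
| src/stratigraphy/util/language_detection.py | 18 | 18 | 0% | 3–43 |
| src/stratigraphy/util/layer_identifier_column.py | 91 | 91 | 0% | 3–227 |
| src/stratigraphy/util/line.py | 49 | 26 | 47% | 25, 42, 51, 65–95, 98 |
| src/stratigraphy/util/linesquadtree.py | 46 | 1 | 98% | 76 |
| src/stratigraphy/util/plot_utils.py | 43 | 43 | 0% | 3–120 |
| src/stratigraphy/util/predictions.py | 186 | 186 | 0% | 3–386 |
| src/stratigraphy/util/textblock.py | 74 | 8 | 89% | 27, 51, 63, 75, 98, 119, 127, 155 |
| src/stratigraphy/util/util.py | 40 | 22 | 45% | 15–18, 22, 26, 40–47, 61–63, 87–88, 100–105 |
| TOTAL | 1818 | 1052 | 42% | |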

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 58 | 0 💤 | 0 ❌ | 0 🔥 | 0.611s ⏱️ |

@@ -36,8 +36,8 @@ def detect_language_of_document(doc: fitz.Document) -> str:
     try:
         language = detect(text)
     except LangDetectException:
-        language = "de"
+        language = "de"  # TODO: default language should be read from config
Contributor Author

This is bothering me at the moment. Right now you have to change the code to extend the extraction to other languages; this should be possible through the config files. I believe language is hard-coded in other places too, in the form of keywords (e.g. coordinate extraction).

I will open an issue for it.
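
One possible shape for the config-driven fallback, as a rough sketch only: the function name `detect_language_or_default`, the `default_language` config key and the injected `detect` callable are all assumptions, not part of this PR. The detector is passed in so the example runs without langdetect installed.

```python
from typing import Callable


def detect_language_or_default(
    text: str,
    detect: Callable[[str], str],
    config: dict,
    detection_error: type = Exception,
) -> str:
    """Return the detected language, or the configured default if detection fails.

    `detect` stands in for a detector such as langdetect's `detect`;
    `detection_error` would be LangDetectException in the PR's code.
    The `default_language` key is hypothetical.
    """
    try:
        return detect(text)
    except detection_error:
        # Fall back to the configured language instead of a hard-coded "de".
        return config.get("default_language", "de")


def failing_detector(text: str) -> str:
    """Stub detector that always fails, to exercise the fallback path."""
    raise ValueError("no features in text")


config = {"default_language": "fr"}
print(detect_language_or_default("", failing_detector, config, ValueError))  # fr
print(detect_language_or_default("Kies, sandig", lambda text: "de", config))  # de
```

Keeping `"de"` as the last-resort value in `config.get` preserves today's behaviour when the key is missing from the config.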

@redur force-pushed the feat/improve_is_valid branch from dc1798b to c9804e9 on May 31, 2024 06:52
@redur merged commit 71bfee2 into main on May 31, 2024
3 checks passed
@redur deleted the feat/improve_is_valid branch on May 31, 2024 13:49