Feat/improve is valid #52

redur · 2024-05-31T06:41:17Z

Improve is_valid criterion.

The noise check within the is_valid criterion now checks short depth columns (few entries)
more strictly than long depth columns (many entries).
This is achieved by making the check depend quadratically on the number of entries.
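
A minimal sketch of one way such a quadratic dependence could look, for illustration only. The function name `is_noise_acceptable`, the parameters `threshold` and `offset`, and their default values are assumptions and are not taken from this PR; the actual check lives in depthcolumn.py and may differ.

```python
def is_noise_acceptable(
    n_entries: int,
    noise_count: int,
    threshold: float = 1.25,  # hypothetical scaling factor
    offset: float = 2.5,  # hypothetical shift, so very short columns tolerate almost no noise
) -> bool:
    """Return True if the amount of noise is tolerable for a column of this length.

    The allowed noise grows with the square of the number of entries, so a
    short column is checked much more strictly than a long one.
    """
    # Clamp at zero so the bound does not grow again for n_entries < offset.
    effective_entries = max(n_entries - offset, 0.0)
    allowed_noise = threshold * effective_entries**2
    return noise_count <= allowed_noise


# A column with 3 entries tolerates no noise words at all ...
print(is_noise_acceptable(n_entries=3, noise_count=1))
# ... while a column with 10 entries tolerates the same noise easily.
print(is_noise_acceptable(n_entries=10, noise_count=1))
```

With these assumed values, the allowed noise is about 0.3 words for 3 entries and about 70 words for 10 entries, which is the "stricter for short columns" effect described above.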

F1 improves by 0.6% on the Zurich dataset.
F1 on the geoquat dataset is similar.

redur · 2024-05-31T06:42:40Z

src/stratigraphy/util/find_depth_columns.py

@@ -25,7 +25,7 @@ def depth_column_entries(all_words: list[TextWord], include_splits: bool) -> lis
    for word in sorted(all_words, key=lambda word: word.rect.y0):
        try:
            input_string = word.text.strip().replace(",", ".")
-            regex = re.compile(r"^-?([0-9]+(\.[0-9]+)?)[müMN\\.]*$")
+            regex = re.compile(r"^-?\.?([0-9]+(\.[0-9]+)?)[müMN\\.]*$")


We now support numbers such as .40 that sometimes occur.

The input .40 is now extracted with value 40, not as 0.40. Is that really what we want?

I would also suggest adding this as a test case in test_find_depth_columns.py.

I checked again: in older borehole profiles that have this "handwritten style", a '-' is sometimes recognized as a '.'. In that case the new behavior is exactly what we want.

But I also found an occurrence of '.80'. See here: A531.pdf

For our dataset, the current behaviour is the better choice for now.

github-actions · 2024-05-31T06:42:42Z

Coverage Report

File	Stmts	Miss	Cover	Missing
src/stratigraphy
__init__.py	8	1	88%	11
extract.py	210	210	0%	3–506
get_files.py	21	21	0%	3–48
line_detection.py	26	26	0%	3–76
main.py	91	91	0%	3–232
src/stratigraphy/util
coordinate_extraction.py	116	20	83%	25, 45, 49, 53, 57–65, 86, 171, 191, 280, 283–284, 288, 300
dataclasses.py	32	3	91%	37–39
depthcolumn.py	208	67	68%	26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 313, 323, 352, 373, 376–387, 402–403, 448–490
depthcolumnentry.py	20	4	80%	12, 15, 27, 34
description_block_splitter.py	70	2	97%	24, 139
draw.py	73	73	0%	3–225
duplicate_detection.py	51	51	0%	3–146
find_depth_columns.py	89	6	93%	41–42, 70, 82, 175–176
find_description.py	63	28	56%	27–35, 50–63, 79–95, 172–175
geometric_line_utilities.py	86	2	98%	82, 132
interval.py	107	52	51%	25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188
language_detection.py	18	18	0%	3–43
layer_identifier_column.py	91	91	0%	3–227
line.py	49	26	47%	25, 42, 51, 65–95, 98
linesquadtree.py	46	1	98%	76
plot_utils.py	43	43	0%	3–120
predictions.py	186	186	0%	3–386
textblock.py	74	8	89%	27, 51, 63, 75, 98, 119, 127, 155
util.py	40	22	45%	15–18, 22, 26, 40–47, 61–63, 87–88, 100–105
TOTAL	1818	1052	42%

Tests	Skipped	Failures	Errors	Time
58	0 💤	0 ❌	0 🔥	0.611s ⏱️

redur · 2024-05-31T06:44:10Z

src/stratigraphy/util/language_detection.py

@@ -36,8 +36,8 @@ def detect_language_of_document(doc: fitz.Document) -> str:
    try:
        language = detect(text)
    except LangDetectException:
-        language = "de"
+        language = "de"  # TODO: default language should be read from config


This is bothering me at the moment. Right now you need to change the code to extend the extraction to other languages. This should be possible from the config files. I believe there are other places where the language is hard-coded in the form of keywords (e.g. coordinate extraction).

I will open an issue for it.
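
A rough sketch of what reading the default from config could look like. The config key `default_language`, the function `detect_language_with_default` and the stand-in exception `DetectionError` are assumptions for illustration; in the real code the detector would be langdetect's `detect` and the exception its `LangDetectException`.

```python
from typing import Callable, Mapping


class DetectionError(Exception):
    """Stand-in for langdetect's LangDetectException, to keep the sketch self-contained."""


def detect_language_with_default(
    text: str,
    detect: Callable[[str], str],
    config: Mapping[str, str],
) -> str:
    """Detect the language of `text`, falling back to a configurable default."""
    # Assumption: the config exposes a "default_language" key; "de" keeps
    # today's behaviour when the key is missing.
    default_language = config.get("default_language", "de")
    try:
        return detect(text)
    except DetectionError:
        return default_language


def failing_detector(text: str) -> str:
    """Simulates a detector that cannot classify the text."""
    raise DetectionError("no features in text")


print(detect_language_with_default("", failing_detector, {"default_language": "fr"}))  # fr
print(detect_language_with_default("", failing_detector, {}))  # de
```

Passing the detector in as an argument is only a convenience for the sketch; it also makes the fallback easy to test without real documents.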

stijnvermeeren-swisstopo · 2024-05-31T11:37:05Z

src/stratigraphy/util/find_depth_columns.py

@@ -25,7 +25,7 @@ def depth_column_entries(all_words: list[TextWord], include_splits: bool) -> lis
    for word in sorted(all_words, key=lambda word: word.rect.y0):
        try:
            input_string = word.text.strip().replace(",", ".")
-            regex = re.compile(r"^-?([0-9]+(\.[0-9]+)?)[müMN\\.]*$")
+            regex = re.compile(r"^-?\.?([0-9]+(\.[0-9]+)?)[müMN\\.]*$")


The input .40 is now extracted with value 40, not as 0.40. Is that really what we want?

I would also suggest adding this as a test case in test_find_depth_columns.py.

Add TODO comments regarding language configuration.

9f25f7d

redur commented May 31, 2024

Improve is_valid criterion.

c9804e9

redur force-pushed the feat/improve_is_valid branch from dc1798b to c9804e9 Compare May 31, 2024 06:52

redur requested a review from stijnvermeeren-swisstopo May 31, 2024 08:42

stijnvermeeren-swisstopo requested changes May 31, 2024

Add documentation and tests for find_depth_column.

0f2fc22

redur requested a review from stijnvermeeren-swisstopo May 31, 2024 13:30

stijnvermeeren-swisstopo approved these changes May 31, 2024

redur merged commit 71bfee2 into main May 31, 2024
3 checks passed

redur deleted the feat/improve_is_valid branch May 31, 2024 13:49
