
Feat/improve depth entry recognition #65

Merged: 4 commits from feat/improve_depth_entry_recognition into main on Jun 24, 2024

Conversation

@redur (Contributor) commented Jun 20, 2024

Tune some parameters to improve depth entry recognition.
Introduce new checks:

  • check if depths are strictly increasing.
  • check if correcting a common OCR mistake makes a depth column valid.

Overall, more depth entries are now recognized. For the Zurich dataset, even the layer detection is improved; for the Geoquat dataset, the layer detection rate is approximately the same.

```diff
@@ -147,7 +147,7 @@ def evaluate_borehole_extraction(predictions: dict, number_of_truth_values: dict
     coordinate_metrics, coordinate_document_level_metrics = evaluate_metadata(predictions)
     metrics = {**layer_metrics, **coordinate_metrics}
     document_level_metrics = pd.merge(
-        layer_document_level_metrics, coordinate_document_level_metrics, on="document_name"
+        layer_document_level_metrics, coordinate_document_level_metrics, on="document_name", how="outer"
```
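The effect of the fix can be shown with a toy pandas example (made-up data, not the project's actual metrics): the default inner join drops any document missing from either frame, while an outer join keeps all of them.

```python
import pandas as pd

layer = pd.DataFrame({"document_name": ["a.pdf", "b.pdf"], "layer_f1": [0.9, 0.8]})
coord = pd.DataFrame({"document_name": ["a.pdf"], "coord_ok": [True]})

# The default inner join silently drops b.pdf, which has no coordinates.
inner = pd.merge(layer, coord, on="document_name")
assert list(inner["document_name"]) == ["a.pdf"]

# An outer join keeps every document; missing coordinate metrics become NaN.
outer = pd.merge(layer, coord, on="document_name", how="outer")
assert sorted(outer["document_name"]) == ["a.pdf", "b.pdf"]
```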
@redur (Contributor, Author) commented:
This is just fixing a bug where we only kept rows with coordinates in the document_level_metrics

```diff
@@ -262,14 +262,14 @@ def depth_intervals(self) -> list[BoundaryInterval]:
         return depth_intervals

     def significant_arithmetic_progression(self) -> bool:
-        if len(self.entries) < 6:
+        if len(self.entries) < 7:
```
@redur (Contributor, Author) commented:

This simply improved the scores.

A reviewer (Contributor) replied:

Keeping the segment length at 6 and simplifying the threshold to abs(scale_pearson_correlation_coef) >= 0.9999 regardless of segment length seems to work even better. (Implemented in my commit.)
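The simplified rule described above can be sketched like this. The function name mirrors the one in the diff, but the body is an illustrative reconstruction under stated assumptions, not the repository code: entries in arithmetic progression are perfectly linear in their index, so a sliding segment whose Pearson correlation with its index is at least 0.9999 in absolute value counts as significant.

```python
import numpy as np


def significant_arithmetic_progression(entries: list[float], segment_length: int = 6) -> bool:
    """Slide a fixed-length window over the entries and test whether any
    segment is (near-)perfectly linear in its index, i.e. an arithmetic
    progression, using the Pearson correlation coefficient."""
    if len(entries) < segment_length:
        return False
    for start in range(len(entries) - segment_length + 1):
        segment = np.array(entries[start:start + segment_length])
        corr = np.corrcoef(np.arange(segment_length), segment)[0, 1]
        if abs(corr) >= 0.9999:
            return True
    return False
```

A regularly spaced column such as 0.5, 1.0, 1.5, 2.0, 2.5, 3.0 passes, while a geometric sequence like 1, 2, 4, 8, 16, 32 does not, since its correlation with the index is well below the threshold.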

github-actions bot commented Jun 20, 2024

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| **src/stratigraphy** | | | | |
| __init__.py | 8 | 1 | 88% | 11 |
| extract.py | 188 | 188 | 0% | 3–482 |
| get_files.py | 21 | 21 | 0% | 3–48 |
| line_detection.py | 26 | 26 | 0% | 3–76 |
| main.py | 94 | 94 | 0% | 3–237 |
| **src/stratigraphy/util** | | | | |
| boundarydepthcolumnvalidator.py | 37 | 16 | 57% | 47, 57, 60, 81–84, 102–118, 130–132 |
| coordinate_extraction.py | 127 | 7 | 94% | 31, 62, 75–76, 80, 205, 328 |
| dataclasses.py | 32 | 3 | 91% | 37–39 |
| depthcolumn.py | 192 | 64 | 67% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 192, 229, 245–253, 264, 269, 276, 307, 312–319, 334–335, 378–420 |
| depthcolumnentry.py | 26 | 7 | 73% | 12, 15, 29–30, 33, 45, 52 |
| description_block_splitter.py | 70 | 2 | 97% | 24, 139 |
| draw.py | 80 | 80 | 0% | 3–244 |
| duplicate_detection.py | 51 | 51 | 0% | 3–146 |
| extract_text.py | 27 | 2 | 93% | 38–39 |
| find_depth_columns.py | 91 | 6 | 93% | 42–43, 71, 83, 176–177 |
| find_description.py | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| geometric_line_utilities.py | 86 | 2 | 98% | 82, 132 |
| interval.py | 106 | 55 | 48% | 24–27, 31–34, 39, 44, 47, 57–59, 99–145, 166, 171–187 |
| language_detection.py | 18 | 18 | 0% | 3–45 |
| layer_identifier_column.py | 91 | 91 | 0% | 3–227 |
| line.py | 49 | 4 | 92% | 25, 42, 51, 98 |
| linesquadtree.py | 46 | 1 | 98% | 76 |
| plot_utils.py | 43 | 43 | 0% | 3–120 |
| predictions.py | 187 | 187 | 0% | 3–387 |
| textblock.py | 74 | 8 | 89% | 27, 51, 63, 75, 98, 119, 127, 155 |
| util.py | 40 | 18 | 55% | 22, 40–47, 61–63, 87–88, 100–105 |
| **TOTAL** | 1873 | 1023 | 45% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 61 | 0 💤 | 0 ❌ | 0 🔥 | 0.952s ⏱️ |

```diff
-        return corr_coef and corr_coef > 0.99
+        return (
+            corr_coef and corr_coef > np.min([1.0382 - len(self.entries) * 0.01, 0.9985]) and corr_coef > 0.95
+        )
```
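The length-dependent threshold in this diff can be tabulated with a small helper (the helper name is ours, not the repository's; the formula is taken verbatim from the diff):

```python
import numpy as np


def corr_threshold(num_entries: int) -> float:
    # Reconstructed from the diff: short columns need near-perfect correlation,
    # capped at 0.9985, while the separate `corr_coef > 0.95` clause in the
    # full condition acts as an absolute floor for long columns.
    return float(np.min([1.0382 - num_entries * 0.01, 0.9985]))
```

For example, a column with 4 entries must exceed a correlation of 0.9982, while for 10 entries the formula drops to 0.9382 and the 0.95 floor in the full condition takes over.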
@redur (Contributor, Author) commented:

Admittedly, this is probably overfitted. It will be interesting to see the real scores once we have a test set.

Comes with improved depth interval recognition. This commit accounts for:
- Tuned parameters in depthcolumn.py.
- A check that depth columns have strictly increasing depth entries.
- A check whether correcting a common OCR mistake makes a depth column valid.

Also fixed a bug in the document_level_metrics.csv file; previously, only borehole profiles
including coordinates were present in that file.
@redur redur force-pushed the feat/improve_depth_entry_recognition branch from f45c853 to ee459a5 Compare June 20, 2024 07:07
@stijnvermeeren-swisstopo (Contributor) left a review:

Thanks for the work!
The code for correcting OCR mistakes does not make a big difference yet, but offers more potential for further improvements in the future.

With the additional changes that I've implemented (which I believe make the logic slightly simpler as well), I now get the following numbers (overall F1 score):

Zurich dataset:

  • main branch: 87.16%
  • Renato's version: 87.91%
  • Stijn's version: 87.90%

Geoquat validation dataset:

  • main branch: 49.60%
  • Renato's version: 49.43%
  • Stijn's version: 49.53%

While I believe that some further optimization of the parameters involved might be possible, the improvements will be marginal, and probably not worth the effort. Therefore, I would propose to merge the PR like this.

@redur redur merged commit b96efa8 into main Jun 24, 2024
3 checks passed
@redur redur deleted the feat/improve_depth_entry_recognition branch June 24, 2024 06:52