Improve duplicate detection to use depth information #49

redur · 2024-05-27T07:59:44Z

The score is essentially unchanged (less than 0.1% improvement), but duplicate detection is now improved.
Check borehole profile A525.pdf to see the changes.

Overall logic:
If there is depth information for a given layer --> use that information to detect duplicates.
If there is no depth information for a given layer --> use template matching.
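
A minimal sketch of the decision logic above, not the code in duplicate_detection.py. The nested `["end"].get("value")` access mirrors the diff under review; the function names, the `"depth_interval"` and `"start"` keys, and the `template_match` callback are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

Depths = Tuple[Optional[float], Optional[float]]


def depth_value(layer: dict, bound: str) -> Optional[float]:
    """Return the depth value at 'start' or 'end', or None if it is missing."""
    interval = layer.get("depth_interval") or {}
    return (interval.get(bound) or {}).get("value")


def layer_depths(layer: dict) -> Depths:
    """Collect the (start, end) depth values of a layer; either may be None."""
    return depth_value(layer, "start"), depth_value(layer, "end")


def is_duplicate(
    current: dict,
    previous: dict,
    template_match: Callable[[dict, dict], bool],
) -> bool:
    """Compare depths when both layers carry any, else fall back to template matching."""
    current_depths = layer_depths(current)
    previous_depths = layer_depths(previous)

    # Assumption: a layer "has depth information" if at least one bound is known.
    current_has_depth = any(v is not None for v in current_depths)
    previous_has_depth = any(v is not None for v in previous_depths)

    if current_has_depth and previous_has_depth:
        return current_depths == previous_depths
    return template_match(current, previous)
```

Passing the template matcher in as a callback keeps the sketch independent of how the image comparison is actually implemented.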

I would argue for keeping template matching in the logic, since some borehole profiles do not have depth columns, and many layers may not have any depth information assigned. Older borehole profiles in particular sometimes come with a visual representation of depth information, and I believe template matching is still the best way to go in that case.

github-actions · 2024-05-27T08:01:09Z

Coverage Report

File	Stmts	Miss	Cover	Missing
src/stratigraphy
__init__.py	8	1	88%	11
extract.py	210	210	0%	3–506
get_files.py	21	21	0%	3–48
line_detection.py	26	26	0%	3–76
main.py	91	91	0%	3–232
src/stratigraphy/util
coordinate_extraction.py	116	20	83%	25, 45, 49, 53, 57–65, 86, 171, 191, 280, 283–284, 288, 300
dataclasses.py	32	3	91%	37–39
depthcolumn.py	206	67	67%	26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 310, 314, 343, 364, 367–378, 393–394, 439–481
depthcolumnentry.py	20	4	80%	12, 15, 27, 34
description_block_splitter.py	70	2	97%	24, 139
draw.py	73	73	0%	3–225
duplicate_detection.py	51	51	0%	3–146
find_depth_columns.py	89	6	93%	39–40, 68, 80, 173–174
find_description.py	63	28	56%	27–35, 50–63, 79–95, 172–175
geometric_line_utilities.py	86	2	98%	82, 132
interval.py	107	52	51%	25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188
language_detection.py	18	18	0%	3–43
layer_identifier_column.py	91	91	0%	3–227
line.py	49	26	47%	25, 42, 51, 65–95, 98
linesquadtree.py	46	1	98%	76
plot_utils.py	43	43	0%	3–120
predictions.py	186	186	0%	3–386
textblock.py	74	8	89%	27, 51, 63, 75, 98, 119, 127, 155
util.py	40	22	45%	15–18, 22, 26, 40–47, 61–63, 87–88, 100–105
TOTAL	1816	1052	42%

Tests	Skipped	Failures	Errors	Time
57	0 💤	0 ❌	0 🔥	0.605s ⏱️

stijnvermeeren-swisstopo

The behaviour for A525 page 1-2 is somewhat suboptimal. We have a situation like

Page 1  duplicate detected   Page 2
 a
 b
 c      ------------------    C 
 d      ------------------    D
 e                            E
 f                            F
                              G
                              H

Layers "e" and "f" are included in the output for both page 1 and page 2, even though we could infer from the duplicates already detected higher up ("c" and "d") that these layers must be duplicates as well.

Something for a follow-up ticket probably, as I don't currently see an easy fix for this.
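
To make the inference described above concrete, here is a hypothetical sketch, not a proposed fix. It assumes the layers on both pages line up one-to-one once an overlap starts, which errors in the depth column or in layer extraction can easily break. The function name and the index-pair representation are invented for illustration.

```python
from typing import List, Set, Tuple


def infer_overlap(n_previous: int, n_current: int, matches: List[Tuple[int, int]]) -> Set[int]:
    """Return indices of current-page layers implied to be duplicates.

    matches holds (previous_page_index, current_page_index) pairs that the
    pairwise duplicate check has already detected.
    """
    offsets = {prev - cur for prev, cur in matches}
    if len(offsets) != 1:
        # No matches, or matches that disagree on the alignment:
        # do not extrapolate, keep only what was detected directly.
        return {cur for _, cur in matches}

    offset = offsets.pop()
    first_current = min(cur for _, cur in matches)
    # Every current-page layer that still has a counterpart on the
    # previous page under this alignment is treated as a duplicate.
    last_exclusive = min(n_current, n_previous - offset)
    return set(range(first_current, last_exclusive))


# Page 1 holds a..f (indices 0..5), page 2 holds C..H (indices 0..5).
# Detected pairs: c/C and d/D.
implied = infer_overlap(n_previous=6, n_current=6, matches=[(2, 0), (3, 1)])
```

Under the stated assumption, `implied` covers C, D, E and F on page 2, so "e" and "f" would no longer be emitted twice.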

stijnvermeeren-swisstopo · 2024-05-28T09:58:31Z

src/stratigraphy/util/duplicate_detection.py

+                    and current_depth_interval["end"].get("value") == previous_depth_interval["end"].get("value")
+                ):
+                    duplicate_condition = True
+                    print("Duplicate condition met")


Use logger instead of print.

By the way, not related to this PR, but I was thinking that it would be useful if the logger prints a timestamp for each log statement as well. Would that be possible?

Like it is now?

2024-05-28 15:32:48 INFO Processing file: data/data_v2/validation/2537.pdf
2024-05-28 15:32:48 INFO Swapping coordinates.
2024-05-28 15:32:48 INFO Processing page 1
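
For reference, a sketch of a standard-library `logging` setup that yields lines in the shape shown above, with the `print` call from the diff replaced by a logger call. The format strings and the helper name `ends_match` are assumptions and are not taken from the repository.

```python
import logging

# Assumed format: timestamp, level, message, separated by single spaces.
LOG_FORMAT = "%(asctime)s %(levelname)s %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

logging.basicConfig(format=LOG_FORMAT, datefmt=DATE_FORMAT, level=logging.INFO)
logger = logging.getLogger(__name__)


def ends_match(current_depth_interval: dict, previous_depth_interval: dict) -> bool:
    """Check whether both intervals report the same end value, logging a match."""
    same_end = (
        current_depth_interval["end"].get("value")
        == previous_depth_interval["end"].get("value")
    )
    if same_end:
        # logger.info replaces the print call flagged in the review.
        logger.info("Duplicate condition met")
    return same_end
```

Unlike `print`, the logger call can be filtered by level and carries the timestamp without any change at the call site.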

redur · 2024-05-28T13:48:31Z

There is a mistake indeed. I am not sure whether the current approach works super well for these long profiles. We're somewhat sensitive to mistakes in the depth column.

I would suggest we accept this error for now.

stijnvermeeren-swisstopo · 2024-05-28T14:24:27Z

1. Could you create a follow-up ticket for the A525.pdf case?
2. Is it necessary to repeat the definition of the log format in every file, or is it possible to define that in a single place?

redur · 2024-05-29T06:44:55Z

Regarding 2:
It is possible to remove the config statements. We just need to make sure that the config is defined before any logging statements are executed. In our case, we will probably always execute main.py, so it is sufficient to configure logging there. I adjusted it this way.

Another solution would be to define logging in the __init__.py file and import stratigraphy at the beginning. That would factor it out of the main code.
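
A sketch of the single-place configuration, assuming the standard `logging` module: the entry point configures the root logger once, and every other module only requests a named logger. The `setup_logging` name and the format string are hypothetical; the logger name merely mirrors the module path from this PR.

```python
import logging


def setup_logging(level: int = logging.INFO) -> None:
    """Configure the root logger once; meant to run at the top of the entry point."""
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        level=level,
        force=True,  # replace handlers installed earlier (Python 3.8+)
    )


# In a library module: no basicConfig call, only a module-level logger.
module_logger = logging.getLogger("stratigraphy.util.duplicate_detection")

# In the entry point, before any log statement is emitted.
setup_logging()

# The record propagates up to the root logger, whose handler formats it.
module_logger.info("Duplicate condition met")
```

Because module loggers have no handlers of their own, the order matters only in the way described above: records emitted before `setup_logging()` runs would not get the configured format.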

stijnvermeeren-swisstopo

LGTM.

Follow-up ticket was created: https://jira.swisstopo.ch/browse/LGVISIUM-44

Improve duplicate detection to use depth information

459c5d9

redur self-assigned this May 27, 2024

Minor updates; improve docstrings.

ce5d453

redur requested a review from stijnvermeeren-swisstopo May 27, 2024 08:13

redur added the enhancement label (New feature or request) May 27, 2024

stijnvermeeren-swisstopo requested changes May 28, 2024

redur added 2 commits May 28, 2024 15:31

Update logging behavior.

5788e4d

Correct type hint

b38ab1b

Remove logging config from files except main.

3477db9

redur requested a review from stijnvermeeren-swisstopo May 29, 2024 06:53

stijnvermeeren-swisstopo approved these changes May 29, 2024

redur merged commit f575b6d into main May 29, 2024
3 checks passed

redur deleted the feat/improve_duplicate_detectin branch May 29, 2024 07:47
