Improve depth entry recognition for cases such as '.80'. #66

redur · 2024-06-24T08:57:53Z

And here is the PR to also recognize '.80' and similar entries.

github-actions · 2024-06-24T08:58:41Z

Coverage Report

File	Stmts	Miss	Cover	Missing
src/stratigraphy
__init__.py	8	1	88%	11
extract.py	188	188	0%	3–482
get_files.py	21	21	0%	3–48
line_detection.py	26	26	0%	3–76
main.py	94	94	0%	3–237
src/stratigraphy/util
boundarydepthcolumnvalidator.py	41	20	51%	47, 57, 60, 81–84, 109–127, 139–148
coordinate_extraction.py	127	7	94%	31, 62, 75–76, 80, 205, 328
dataclasses.py	32	3	91%	37–39
depthcolumn.py	194	64	67%	26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 192, 229, 248–256, 267, 272, 279, 310, 315–322, 337–338, 381–423
depthcolumnentry.py	26	7	73%	12, 15, 29–30, 33, 45, 52
description_block_splitter.py	70	2	97%	24, 139
draw.py	80	80	0%	3–244
duplicate_detection.py	51	51	0%	3–146
extract_text.py	27	2	93%	38–39
find_depth_columns.py	91	6	93%	42–43, 71, 83, 176–177
find_description.py	63	28	56%	27–35, 50–63, 79–95, 172–175
geometric_line_utilities.py	86	2	98%	82, 132
interval.py	106	55	48%	24–27, 31–34, 39, 44, 47, 57–59, 99–145, 166, 171–187
language_detection.py	18	18	0%	3–45
layer_identifier_column.py	91	91	0%	3–227
line.py	49	4	92%	25, 42, 51, 98
linesquadtree.py	46	1	98%	76
plot_utils.py	43	43	0%	3–120
predictions.py	187	187	0%	3–387
textblock.py	74	8	89%	27, 51, 63, 75, 98, 119, 127, 155
util.py	40	18	55%	22, 40–47, 61–63, 87–88, 100–105
TOTAL	1879	1027	45%

Tests	Skipped	Failures	Errors	Time
61	0 💤	0 ❌	0 🔥	1.101s ⏱️

stijnvermeeren-swisstopo · 2024-06-24T11:51:08Z

I made two changes:

1. Apply individual corrections to a single depth value, instead of trying all possible corrections at once. For example, from a value of "40", we would like to try the possibilities "0.40" as well as "10", whereas the original implementation would only try "0.10" (both corrections applied together). This currently does not make a difference for our datasets, but I think it makes more sense to apply corrections individually, especially as we will likely add more possible corrections in the future.
2. Allow more than one depth value to be corrected in a depth column. This improves the results for 680244005-bp.pdf (where two corrections of the 4->1 type are needed), while for 16005.pdf the multiple corrections cause a column of 3 different values to be incorrectly recognized as a depth column (though this does not change the F1 score, as this document scores 0% anyway). Maybe allowing a number of corrections proportional to the total number of entries in the depth column would be most robust? However, with so few documents where this makes a difference at all, I don't really think it is worthwhile implementing such a rule with an additional parameter that needs to be tuned. @redur what do you think, should we stick to allowing one correction per column for now, or shall we drop that limit?
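
To make the two points above concrete, here is a minimal sketch of the idea, not the code from this PR. The function names (`prepend_zero_point`, `four_to_one`, `alternatives`, `allowed_corrections`) and the `entries_per_correction` parameter are hypothetical and chosen only for illustration; the actual implementation in the repository may be structured differently.

```python
from typing import Callable, List, Optional

# Hypothetical correction rules; each returns a corrected string or None
# when the rule does not apply to the given text.

def prepend_zero_point(text: str) -> Optional[str]:
    """'.80' -> '0.80' and '40' -> '0.40' (assumed lost leading '0.')."""
    if text.startswith("."):
        return "0" + text
    if text.isdigit():
        return "0." + text
    return None


def four_to_one(text: str) -> Optional[str]:
    """'40' -> '10' (assumed OCR confusion of a leading '1' with '4')."""
    if text.startswith("4"):
        return "1" + text[1:]
    return None


CORRECTIONS: List[Callable[[str], Optional[str]]] = [prepend_zero_point, four_to_one]


def alternatives(text: str) -> List[str]:
    """Apply each correction on its own, never chained, and collect the results."""
    results: List[str] = []
    for correct in CORRECTIONS:
        corrected = correct(text)
        if corrected is not None and corrected not in results:
            results.append(corrected)
    return results


def allowed_corrections(n_entries: int, entries_per_correction: int = 5) -> int:
    """Hypothetical proportional limit: one correction per N entries, at least one."""
    return max(1, n_entries // entries_per_correction)


print(alternatives("40"))   # the two individual candidates, not the chained "0.10"
print(alternatives(".80"))
print(allowed_corrections(3), allowed_corrections(10))
```

With a limit like `allowed_corrections`, a three-entry column such as the one in 16005.pdf would still get only one correction, while longer columns could receive more. Whether that is worth the extra tunable parameter is exactly the open question above.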

stijnvermeeren-swisstopo

Documentation was added as discussed.

Improve depth entry recognition for cases such as '.80'.

8132e75

potentially consider multiple alternative values per depth entry

13f3077

redur closed this Jun 25, 2024

redur reopened this Jun 25, 2024

documentation for correct_OCR_mistakes

5407194

stijnvermeeren-swisstopo approved these changes Jun 25, 2024


redur merged commit f686692 into main Jun 25, 2024
3 checks passed

redur deleted the feat/correct_.NN branch June 25, 2024 15:25
