
LGVISIUM-66: refactor computation of metrics #76

Merged: 6 commits merged into main from LGVISIUM-66/metrics-refactoring on Sep 13, 2024

Conversation

@stijnvermeeren-swisstopo (Contributor) commented Sep 4, 2024

My first attempt at making everything a bit more coherent.

Through the refactoring, it became apparent that metrics are not computed very consistently.

  • For layers and depth intervals, we track the macro average of the metrics, i.e. we first compute precision, recall and F1 score for each individual document, and then take the average of those values.
    • Moreover, the overall F1 score is computed from the macro-averaged precision and the macro-averaged recall. This can differ from the macro-averaged F1 score (i.e. the average of the F1 scores of the individual documents)! I don't think this is a very usual metric, and we should probably compute a proper macro-averaged F1 score instead.
  • For metadata and groundwater, we use the micro average, i.e. we count the TP, FP and FN across all documents and compute the overall precision, recall and F1 metrics from those counts.
  • Considering that we are mostly interested in the question "how much work does an expert have to do to fix mistakes during quality control", and that this effort tends to be larger for documents with a lot of data, I would say that the micro average might be the better metric for us (see the sketch after this list).
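
To make the difference concrete, here is a minimal sketch of the two aggregation strategies, assuming per-document TP/FP/FN counts. This is illustrative only and is not the code from the PR:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw true-positive / false-positive / false-negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0


def macro_f1(per_document: list[tuple[int, int, int]]) -> float:
    """Macro average: compute F1 per document, then average the F1 scores."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in per_document) / len(per_document)


def micro_f1(per_document: list[tuple[int, int, int]]) -> float:
    """Micro average: pool the counts across all documents, then compute F1 once."""
    tp = sum(counts[0] for counts in per_document)
    fp = sum(counts[1] for counts in per_document)
    fn = sum(counts[2] for counts in per_document)
    return f1(tp, fp, fn)


# A small, perfectly extracted document dominates the macro average,
# while a large, noisier document dominates the micro average.
docs = [(1, 0, 0), (50, 40, 40)]  # (tp, fp, fn) per document
print(f"macro F1: {macro_f1(docs):.2f}")  # ≈ 0.78
print(f"micro F1: {micro_f1(docs):.2f}")  # ≈ 0.56
```

Under this view, the micro average weights each extracted value equally across the dataset, which matches the "correction effort" argument above.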

However, for now, I've left all the metrics exactly as they were computed previously. The code in the evaluate method in score.py now at least gives a clear overview of how each metric is computed.

github-actions bot commented Sep 4, 2024

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| **src/stratigraphy** | | | | |
| __init__.py | 8 | 1 | 88% | 11 |
| extract.py | 188 | 188 | 0% | 3–491 |
| get_files.py | 19 | 19 | 0% | 3–47 |
| line_detection.py | 26 | 26 | 0% | 3–76 |
| main.py | 108 | 108 | 0% | 3–250 |
| **src/stratigraphy/coordinates** | | | | |
| coordinate_extraction.py | 108 | 5 | 95% | 30, 64, 83–84, 96 |
| **src/stratigraphy/data_extractor** | | | | |
| data_extractor.py | 50 | 3 | 94% | 32, 62, 98 |
| **src/stratigraphy/util** | | | | |
| boundarydepthcolumnvalidator.py | 41 | 20 | 51% | 47, 57, 60, 81–84, 110–128, 140–149 |
| dataclasses.py | 32 | 3 | 91% | 37–39 |
| depthcolumn.py | 194 | 64 | 67% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 192, 229, 248–256, 267, 272, 279, 310, 315–322, 337–338, 381–423 |
| depthcolumnentry.py | 28 | 6 | 79% | 17, 21, 36, 39, 56, 65 |
| description_block_splitter.py | 70 | 2 | 97% | 25, 140 |
| draw.py | 118 | 118 | 0% | 3–350 |
| duplicate_detection.py | 51 | 51 | 0% | 3–146 |
| extract_text.py | 29 | 3 | 90% | 20, 54–55 |
| find_depth_columns.py | 91 | 6 | 93% | 42–43, 73, 86, 180–181 |
| find_description.py | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| geometric_line_utilities.py | 86 | 2 | 98% | 82, 132 |
| interval.py | 104 | 55 | 47% | 25–28, 33–36, 42, 48, 52, 62–64, 101–147, 168, 174–190 |
| language_detection.py | 18 | 18 | 0% | 3–45 |
| layer_identifier_column.py | 91 | 91 | 0% | 3–234 |
| line.py | 51 | 4 | 92% | 26, 51, 61, 111 |
| linesquadtree.py | 46 | 1 | 98% | 76 |
| plot_utils.py | 43 | 43 | 0% | 3–120 |
| predictions.py | 155 | 155 | 0% | 3–363 |
| textblock.py | 80 | 9 | 89% | 29, 57, 65, 90, 102, 125, 146, 155, 184 |
| util.py | 39 | 17 | 56% | 22, 40–47, 61–63, 87–88, 100–104 |
| **TOTAL** | 1937 | 1046 | 46% | |

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | ---: |
| 79 | 0 💤 | 0 ❌ | 0 🔥 | 5.454s ⏱️ |

@dcleres (Contributor) left a comment

I think the computation of the metrics is much clearer, and I am happy to see that the code became much lighter when it comes to computing the metrics. I added a few minor comments and questions that we can try to address together.
I still have to think a bit about which metrics (micro or macro) are the most relevant for us, but at the moment I tend to agree with what you wrote on JIRA.

5 review comments on src/stratigraphy/benchmark/score.py (outdated, resolved)
@stijnvermeeren-swisstopo marked this pull request as ready for review on September 13, 2024, 09:34
@dcleres (Contributor) left a comment

There are a few minor comments that would be nice to address. Thank you very much for incorporating my previous comments.

        else:
            return 0

    @property

@dcleres (Contributor) commented:

IMO this could be a one-liner:

    return self.tp / (self.tp + self.fn) if (self.tp + self.fn) > 0 else 0

@stijnvermeeren-swisstopo (Contributor, Author) replied:

That's true, but in my humble opinion this would actually make the code less readable and understandable; that's of course just a subjective opinion.
If you don't feel strongly about this, I would keep the current code style.
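
For comparison, a small sketch of the two styles under discussion; the surrounding Metrics class here is a minimal stand-in, and only the reviewer's one-liner appears verbatim in the thread:

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Minimal stand-in for the Metrics class in score.py (illustrative only)."""

    tp: int = 0
    fp: int = 0
    fn: int = 0

    @property
    def recall(self) -> float:
        """Explicit branching, the style kept in the PR."""
        if self.tp + self.fn > 0:
            return self.tp / (self.tp + self.fn)
        else:
            return 0

    @property
    def recall_one_liner(self) -> float:
        """The reviewer's suggested conditional expression, equivalent in behavior."""
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) > 0 else 0
```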

        else:
            return 0

    @property

@dcleres (Contributor) commented:

Could also be a one-liner.


    def macro_recall(self) -> float:
        """Compute the macro recall score."""
        if self.metrics:

@dcleres (Contributor) commented:

Could also be a one-liner.


    def macro_precision(self) -> float:
        """Compute the macro precision score."""
        if self.metrics:

@dcleres (Contributor) commented:

Could also be a one-liner.


    def macro_f1(self) -> float:
        """Compute the macro F1 score."""
        if self.metrics:

@dcleres (Contributor) commented:

Could also be a one-liner.
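
For illustration, one-liner versions of these macro_* properties could look like the sketch below. This is an assumption, not the PR code: it presumes self.metrics is the dict[str, Metrics] shown in the next thread and that each Metrics object exposes precision, recall and f1 properties.

```python
class DatasetMetrics:
    """Minimal stand-in for the class under review (illustrative only)."""

    def __init__(self):
        # Assumed mapping of filename -> Metrics, as seen in the thread below.
        self.metrics: dict[str, "Metrics"] = {}

    @property
    def macro_precision(self) -> float:
        """Compute the macro precision score."""
        return sum(m.precision for m in self.metrics.values()) / len(self.metrics) if self.metrics else 0

    @property
    def macro_recall(self) -> float:
        """Compute the macro recall score."""
        return sum(m.recall for m in self.metrics.values()) / len(self.metrics) if self.metrics else 0

    @property
    def macro_f1(self) -> float:
        """Compute the macro F1 score."""
        return sum(m.f1 for m in self.metrics.values()) / len(self.metrics) if self.metrics else 0
```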

"""Keeps track of a particular metrics for all documents in a dataset."""

def __init__(self):
self.metrics: dict[str, Metrics] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense for the metric to carry the filename as an attribute? This would make it possible here to only use a list of Metrics objects that already know their filename, and avoid creating the dict just to keep track of the filenames. I also had that issue in the refactoring of the metadata metrics and was also not sure about the best solution.

@stijnvermeeren-swisstopo (Contributor, Author) replied:

I think I'd like to keep the filename separate from the Metrics class for two reasons:

  • Currently the Metrics class is also used to compute the metrics for the entire dataset (DatasetMetrics.overall_metrics()); in this case there is no specific filename that can be associated with the Metrics object.
  • In many places where a Metrics object is instantiated, we currently don't have access to the filename. I don't really want to pass additional parameters around just for this.

That does not mean that we cannot improve the coupling between metrics and filename, but I think it should probably happen at a different level in the code. Maybe we can have a look at it in the context of your refactoring PR?
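
As a rough sketch of the two designs being weighed here (names and structure are illustrative, not taken from the PR):

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Minimal stand-in for the Metrics class (illustrative only)."""

    tp: int = 0
    fp: int = 0
    fn: int = 0


class DatasetMetrics:
    """Design kept in the PR: the filename is the dict key, and Metrics stays
    filename-agnostic, so the same class can also hold dataset-wide counts."""

    def __init__(self):
        self.metrics: dict[str, Metrics] = {}


@dataclass
class DocumentMetrics(Metrics):
    """Alternative raised in review: carry the filename on the metrics object.
    Every call site creating a Metrics would then need the filename, and the
    dataset-wide aggregate has no natural filename to assign."""

    filename: str | None = None
```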

@dcleres (Contributor) replied:

Yes, I tend to share your opinion. Then all good for me!

@dcleres (Contributor) commented Sep 13, 2024

[Screenshot: 2024-09-13 at 13:50:57]

@stijnvermeeren-swisstopo While running the code on the main and LGVISIUM-66 branches, I saw that the results on mlflow are the same, but fields like groundwater_depth_fn no longer exist. I assume this change was made on purpose?

@stijnvermeeren-swisstopo (Contributor, Author) replied:

@dcleres On purpose, as discussed on Teams ;)


@stijnvermeeren-swisstopo merged commit ee065a4 into main on Sep 13, 2024
3 checks passed
@stijnvermeeren-swisstopo deleted the LGVISIUM-66/metrics-refactoring branch on September 13, 2024, 12:25