
LGVISIUM-66: refactor computation of metrics #76

Merged: 6 commits merged into main from LGVISIUM-66/metrics-refactoring on Sep 13, 2024

Conversation

@stijnvermeeren-swisstopo (Contributor) commented Sep 4, 2024

My first attempt at making everything a bit more coherent.

Through the refactoring, it became apparent that metrics are not computed very consistently.

  • For layers and depth intervals, we track the macro average of the metrics, i.e. we first compute precision, recall and F1 score for each individual document, and then take the average of those values.
    • Moreover, the overall F1 score is computed from the macro-averaged precision and the macro-averaged recall. This can differ from the macro-averaged F1 score (i.e. the average of the F1 scores of the individual documents)! I don't think this is a very usual metric, and we should probably compute a proper macro-averaged F1 score instead.
  • For metadata and groundwater, we use the micro average, i.e. we count the TP, FP and FN across all documents and compute the overall precision, recall and F1 metrics from those counts.
  • Considering that we are mostly interested in the question "how much work does an expert have to do to fix mistakes during quality control", and that this effort tends to be larger for documents with a lot of data, I would say that the micro average might be the better metric for us (see the sketch after this list).
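
To make the difference concrete, here is a minimal sketch of the two aggregation strategies, assuming per-document TP/FP/FN counts. This is illustrative only and is not the code from the PR:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw true-positive / false-positive / false-negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0


def macro_f1(per_document: list[tuple[int, int, int]]) -> float:
    """Macro average: compute F1 per document, then average the F1 scores."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in per_document) / len(per_document)


def micro_f1(per_document: list[tuple[int, int, int]]) -> float:
    """Micro average: pool the counts across all documents, then compute F1 once."""
    tp = sum(counts[0] for counts in per_document)
    fp = sum(counts[1] for counts in per_document)
    fn = sum(counts[2] for counts in per_document)
    return f1(tp, fp, fn)


# A small, perfectly extracted document dominates the macro average,
# while a large, noisier document dominates the micro average.
docs = [(1, 0, 0), (50, 40, 40)]  # (tp, fp, fn) per document
print(f"macro F1: {macro_f1(docs):.2f}")  # ≈ 0.78
print(f"micro F1: {micro_f1(docs):.2f}")  # ≈ 0.56
```

Under this view, the micro average weights each extracted value equally across the dataset, which matches the "correction effort" argument above.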

However, for now, I've left all the metrics exactly as they were computed previously. The code in the evaluate method in score.py now at least gives a clear overview of how each metric is computed.

github-actions bot commented Sep 4, 2024

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| **src/stratigraphy** | | | | |
| __init__.py | 8 | 1 | 88% | 11 |
| extract.py | 188 | 188 | 0% | 3–491 |
| get_files.py | 19 | 19 | 0% | 3–47 |
| line_detection.py | 26 | 26 | 0% | 3–76 |
| main.py | 108 | 108 | 0% | 3–250 |
| **src/stratigraphy/coordinates** | | | | |
| coordinate_extraction.py | 108 | 5 | 95% | 30, 64, 83–84, 96 |
| **src/stratigraphy/data_extractor** | | | | |
| data_extractor.py | 50 | 3 | 94% | 32, 62, 98 |
| **src/stratigraphy/util** | | | | |
| boundarydepthcolumnvalidator.py | 41 | 20 | 51% | 47, 57, 60, 81–84, 110–128, 140–149 |
| dataclasses.py | 32 | 3 | 91% | 37–39 |
| depthcolumn.py | 194 | 64 | 67% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 192, 229, 248–256, 267, 272, 279, 310, 315–322, 337–338, 381–423 |
| depthcolumnentry.py | 28 | 6 | 79% | 17, 21, 36, 39, 56, 65 |
| description_block_splitter.py | 70 | 2 | 97% | 25, 140 |
| draw.py | 118 | 118 | 0% | 3–350 |
| duplicate_detection.py | 51 | 51 | 0% | 3–146 |
| extract_text.py | 29 | 3 | 90% | 20, 54–55 |
| find_depth_columns.py | 91 | 6 | 93% | 42–43, 73, 86, 180–181 |
| find_description.py | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| geometric_line_utilities.py | 86 | 2 | 98% | 82, 132 |
| interval.py | 104 | 55 | 47% | 25–28, 33–36, 42, 48, 52, 62–64, 101–147, 168, 174–190 |
| language_detection.py | 18 | 18 | 0% | 3–45 |
| layer_identifier_column.py | 91 | 91 | 0% | 3–234 |
| line.py | 51 | 4 | 92% | 26, 51, 61, 111 |
| linesquadtree.py | 46 | 1 | 98% | 76 |
| plot_utils.py | 43 | 43 | 0% | 3–120 |
| predictions.py | 155 | 155 | 0% | 3–363 |
| textblock.py | 80 | 9 | 89% | 29, 57, 65, 90, 102, 125, 146, 155, 184 |
| util.py | 39 | 17 | 56% | 22, 40–47, 61–63, 87–88, 100–104 |
| **TOTAL** | 1937 | 1046 | 46% | |

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | ---: |
| 79 | 0 💤 | 0 ❌ | 0 🔥 | 5.454s ⏱️ |

@dcleres (Contributor) left a comment

I think the computation of the metrics is much clearer, and I am happy to see that the code became much lighter when it comes to computing the metrics. I added a few minor comments and questions that we can try to address together.
I still have to think a bit about which metrics (micro or macro) are the most relevant for us, but at the moment I tend to agree with what you wrote on JIRA.

5 review comments on src/stratigraphy/benchmark/score.py (outdated, resolved)
@stijnvermeeren-swisstopo marked this pull request as ready for review on September 13, 2024, 09:34
@dcleres (Contributor) left a comment

There are a few minor comments that would be nice to address. Thank you very much for incorporating my previous comments.

        else:
            return 0

    @property

@dcleres (Contributor) commented:

IMO this could be a one-liner:

    return self.tp / (self.tp + self.fn) if (self.tp + self.fn) > 0 else 0

@stijnvermeeren-swisstopo (Contributor, Author) replied:

That's true, but in my humble opinion this would actually make the code less readable and understandable; that's of course just a subjective opinion.
If you don't feel strongly about this, I would keep the current code style.
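
For comparison, a small sketch of the two styles under discussion; the surrounding Metrics class here is a minimal stand-in, and only the reviewer's one-liner appears verbatim in the thread:

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Minimal stand-in for the Metrics class in score.py (illustrative only)."""

    tp: int = 0
    fp: int = 0
    fn: int = 0

    @property
    def recall(self) -> float:
        """Explicit branching, the style kept in the PR."""
        if self.tp + self.fn > 0:
            return self.tp / (self.tp + self.fn)
        else:
            return 0

    @property
    def recall_one_liner(self) -> float:
        """The reviewer's suggested conditional expression, equivalent in behavior."""
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) > 0 else 0
```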

        else:
            return 0

    @property

@dcleres (Contributor) commented:

Could also be a one-liner.


    def macro_recall(self) -> float:
        """Compute the macro recall score."""
        if self.metrics:

@dcleres (Contributor) commented:

Could also be a one-liner.


    def macro_precision(self) -> float:
        """Compute the macro precision score."""
        if self.metrics:

@dcleres (Contributor) commented:

Could also be a one-liner.


    def macro_f1(self) -> float:
        """Compute the macro F1 score."""
        if self.metrics:

@dcleres (Contributor) commented:

Could also be a one-liner.
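
For illustration, one-liner versions of these macro_* properties could look like the sketch below. This is an assumption, not the PR code: it presumes self.metrics is the dict[str, Metrics] shown in the next thread and that each Metrics object exposes precision, recall and f1 properties.

```python
class DatasetMetrics:
    """Minimal stand-in for the class under review (illustrative only)."""

    def __init__(self):
        # Assumed mapping of filename -> Metrics, as seen in the thread below.
        self.metrics: dict[str, "Metrics"] = {}

    @property
    def macro_precision(self) -> float:
        """Compute the macro precision score."""
        return sum(m.precision for m in self.metrics.values()) / len(self.metrics) if self.metrics else 0

    @property
    def macro_recall(self) -> float:
        """Compute the macro recall score."""
        return sum(m.recall for m in self.metrics.values()) / len(self.metrics) if self.metrics else 0

    @property
    def macro_f1(self) -> float:
        """Compute the macro F1 score."""
        return sum(m.f1 for m in self.metrics.values()) / len(self.metrics) if self.metrics else 0
```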

"""Keeps track of a particular metrics for all documents in a dataset."""

def __init__(self):
self.metrics: dict[str, Metrics] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense for the metric to carry the filename as an attribute? This would make it possible here to only use a list of Metrics objects that already know their filename, and avoid creating the dict just to keep track of the filenames. I also had that issue in the refactoring of the metadata metrics and was also not sure about the best solution.

@stijnvermeeren-swisstopo (Contributor, Author) replied:

I think I'd like to keep the filename separate from the Metrics class for two reasons:

  • Currently the Metrics class is also used to compute the metrics for the entire dataset (DatasetMetrics.overall_metrics()); in this case there is no specific filename that can be associated with the Metrics object.
  • In many places where a Metrics object is instantiated, we currently don't have access to the filename. I don't really want to pass additional parameters around just for this.

That does not mean that we cannot improve the coupling between metrics and filename, but I think it should probably happen at a different level in the code. Maybe we can have a look at it in the context of your refactoring PR?
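
As a rough sketch of the two designs being weighed here (names and structure are illustrative, not taken from the PR):

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Minimal stand-in for the Metrics class (illustrative only)."""

    tp: int = 0
    fp: int = 0
    fn: int = 0


class DatasetMetrics:
    """Design kept in the PR: the filename is the dict key, and Metrics stays
    filename-agnostic, so the same class can also hold dataset-wide counts."""

    def __init__(self):
        self.metrics: dict[str, Metrics] = {}


@dataclass
class DocumentMetrics(Metrics):
    """Alternative raised in review: carry the filename on the metrics object.
    Every call site creating a Metrics would then need the filename, and the
    dataset-wide aggregate has no natural filename to assign."""

    filename: str | None = None
```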

@dcleres (Contributor) replied:

Yes, I tend to share your opinion. Then all good for me!

@dcleres (Contributor) commented Sep 13, 2024

[Screenshot: 2024-09-13 at 13:50:57]

@stijnvermeeren-swisstopo While running the code on the main and LGVISIUM-66 branches, I saw that the results on mlflow are the same, but fields like groundwater_depth_fn no longer exist. I assume this change was made on purpose?

@stijnvermeeren-swisstopo (Contributor, Author) replied:

@dcleres On purpose, as discussed on Teams ;)


@stijnvermeeren-swisstopo merged commit ee065a4 into main on Sep 13, 2024
3 checks passed
@stijnvermeeren-swisstopo deleted the LGVISIUM-66/metrics-refactoring branch on September 13, 2024, 12:25