Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close #LGVISIUM-73: Create a metadata object and look into the file organisation #80

Conversation

dcleres
Copy link
Contributor

@dcleres dcleres commented Sep 13, 2024

I tried to move the different files from the utils directory into meaningful new directories that are grouped by the extracted borehole feature. Furthermore, I refactored the BoreholeMetadata class and created a way to evaluate the metadata on its own. There will be a follow-up ticket for the layers and other parts of the pipeline. The goal I was pursuing was to aim for no dictionaries that are filled on the fly but to create data classes that have attributes for the different features.

I also tried to conceptually differentiate between classes that hold data and classes that compute and evaluate the extracted information.

Copy link

github-actions bot commented Sep 13, 2024

Coverage

Coverage Report
FileStmtsMissCoverMissing
src/stratigraphy
   __init__.py8188%11
   extract.py1861860%3–483
   get_files.py19190%3–47
   main.py1191190%3–310
src/stratigraphy/data_extractor
   data_extractor.py50394%32, 62, 98
src/stratigraphy/depthcolumn
   boundarydepthcolumnvalidator.py412051%47, 57, 60, 81–84, 110–128, 140–149
   depthcolumn.py1946467%25, 29, 50, 56, 59–60, 84, 87, 94, 101, 109–110, 120, 137–153, 191, 228, 247–255, 266, 271, 278, 309, 314–321, 336–337, 380–422
   depthcolumnentry.py28679%17, 21, 36, 39, 56, 65
   find_depth_columns.py1061982%42–43, 73, 86, 180–181, 225–245
src/stratigraphy/layer
   layer_identifier_column.py745230%16–17, 20, 28, 43, 47, 51, 59–63, 66, 74, 91–96, 99, 112, 125–126, 148–158, 172–199
src/stratigraphy/lines
   geometric_line_utilities.py86298%81, 131
   line.py51492%25, 50, 60, 110
   linesquadtree.py46198%75
src/stratigraphy/metadata
   coordinate_extraction.py108595%30, 64, 83–84, 96
src/stratigraphy/text
   description_block_splitter.py70297%24, 139
   extract_text.py29390%19, 53–54
   find_description.py642856%27–35, 50–63, 79–95, 172–175
   textblock.py80989%28, 56, 64, 89, 101, 124, 145, 154, 183
src/stratigraphy/util
   dataclasses.py32391%37–39
   interval.py1045547%29–32, 37–40, 46, 52, 56, 66–68, 107–153, 174, 180–196
   predictions.py1071070%3–282
   util.py391756%41, 69–76, 90–92, 116–117, 129–133
TOTAL164172556% 

Tests Skipped Failures Errors Time
79 0 💤 0 ❌ 0 🔥 5.379s ⏱️

…es-dataextraction into LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-file-organisation
@dcleres
Copy link
Contributor Author

dcleres commented Sep 16, 2024

Screenshot 2024-09-16 at 14 43 34

Overall, the idea is that the refactored architecture follows the following pattern. The FilePrediction object we currently have as a large dict should become an object that hosts sub-objects such as the BoreholeMetadata object. This BoreholeMetadata can then be evaluated individually using a BoreholeMetadataEvaluator.

Grouping the different objects into classes make it easier to keep track of the fields in the object as they need to be declarer in advance and not on of the fly as we currently do in the dict objects.

@dcleres dcleres force-pushed the LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-file-organisation branch from 3f2c095 to 0bcd51f Compare September 16, 2024 13:21
@dcleres
Copy link
Contributor Author

dcleres commented Sep 16, 2024

Screenshot 2024-09-16 at 16 01 20

Need to investigate why the metrics are different. Currently, checked the formula but did not find anything obvious.

Copy link
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the use case of the "metadata pipeline" (executing only metadata extraction) yet. Seems to be a significant amount of code to maintain, for something that we will very rarely use, unless it's significantly faster (but currently it does not seem to be?).

Apart from that, the code structure seems to become significantly cleaner; I like it.

What would be the plan for further development in this direction? Apply the same structure for groundwater and for layers as well?

src/stratigraphy/util/util.py Outdated Show resolved Hide resolved
src/stratigraphy/util/util.py Outdated Show resolved Hide resolved
src/stratigraphy/evaluation/evaluation_dataclasses.py Outdated Show resolved Hide resolved
src/stratigraphy/evaluation/evaluation_dataclasses.py Outdated Show resolved Hide resolved
src/stratigraphy/layer/layer.py Outdated Show resolved Hide resolved
src/stratigraphy/metadata/metadata.py Outdated Show resolved Hide resolved
@dcleres
Copy link
Contributor Author

dcleres commented Sep 16, 2024

@stijnvermeeren-swisstopo thank you very much for the initial review of the draft PR.

I wanted to provide some quick answers to some of your questions:

I'm not sure about the use case of the "metadata pipeline" (executing only metadata extraction) yet.

This actually comes from the issue description in Jira I do believe. The issue states:

Refactoring the extraction pipeline to extract a) layers and b) metadata separately. This refactoring should lead into three commands.
`boreholes-extract-all`,
`boreholes-extract-layers`
`boreholes-extract-metadata` 

Do you see another way to be able to launch the boreholes-extract-metadata without the addional code to maintain? If yes, I am happy to use that as I agree with you: It adds quite a bit of code.

What would be the plan for further development in this direction? Apply the same structure for groundwater and for layers as well?

Indeed, the goal of the child issue(s) from this one would be to redefine the different building blocks of the pipeline (layers, groundwater, ...) in the same way. I had a functional version two weeks ago, but it was a bit messy and tricky to review and merge with the changes we made back then. I hope that once we agree on the structure of the Metadata, we can move forward a bit faster with the other ones.

src/stratigraphy/main.py Outdated Show resolved Hide resolved
src/stratigraphy/main.py Show resolved Hide resolved
src/stratigraphy/metadata/elevation_extraction.py Outdated Show resolved Hide resolved
src/stratigraphy/evaluation/evaluation_dataclasses.py Outdated Show resolved Hide resolved
Copy link
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elevation and coordinates are always drawn with a red line in the visualisations, even when the values are correct.

src/stratigraphy/benchmark/score.py Outdated Show resolved Hide resolved
src/stratigraphy/benchmark/score.py Outdated Show resolved Hide resolved
src/stratigraphy/evaluation/evaluation_dataclasses.py Outdated Show resolved Hide resolved
src/stratigraphy/evaluation/evaluation_dataclasses.py Outdated Show resolved Hide resolved
Comment on lines 133 to 138
"""Get the document level metrics."""
# Collect all the data frames in a list
frames = [metadata.get_document_level_metrics() for metadata in self.borehole_metadata_metrics]

# Concatenate them once at the end
return pd.concat(frames, ignore_index=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not return the input documents in the original order. Where does the order get lost?

Maybe the implementation approach from DatasetMetricsCatalog.document_level_metrics_df is more robust?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot reproduce the issues you are mentioning.

I tried something along these lines:

        document_level_metrics = pd.DataFrame(columns=["document_name", "elevation", "coordinate"])
        for metadata in self.borehole_metadata_metrics:
            document_level_metrics = document_level_metrics.merge(metadata.get_document_level_metrics(), how="outer")

But this actually sorted the filenames, and the order was lost. Is there a reason why the other is important here in your opinion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had this implementation

document_level_metrics = pd.DataFrame(columns=["document_name", "elevation", "coordinate"])
for metadata in self.borehole_metadata_metrics:
    document_level_metrics = pd.concat([document_level_metrics, metadata.get_document_level_metrics()])

but this raises a deprecation warning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's important to me the have the files in order, so that different version of the CSV file (from different runs) can be easily compared, as well as easily comparing the document-level-metrics with the PNG visualisations (which are listed in alphabetical order in MLFlow).

I'm working on a suggestion for a fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I already suggested a new fix. Maybe you can have a look at it. There is even an assertion that makes sure the files are in order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You assertion ensures that the order is unchanged (i.e. execution order), but this is not necessarily alphabetical, at least on my system it isn't.

@dcleres dcleres marked this pull request as ready for review September 18, 2024 11:33
@dcleres dcleres merged commit 3fa9f21 into main Sep 18, 2024
3 checks passed
@dcleres dcleres deleted the LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-file-organisation branch September 23, 2024 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants