Close #LGVISIUM-73: Create a metadata object and look into the file organisation #80

dcleres · 2024-09-13T06:56:46Z

I moved the files from the utils directory into new directories grouped by the extracted borehole feature. Furthermore, I refactored the BoreholeMetadata class and created a way to evaluate the metadata on its own. There will be a follow-up ticket for the layers and other parts of the pipeline. The goal was to replace dictionaries that are filled on the fly with data classes that have attributes for the different features.

I also tried to conceptually differentiate between classes that hold data and classes that compute and evaluate the extracted information.
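As a minimal sketch of the difference between a dictionary filled on the fly and a data class with declared attributes (this is not the code from this PR; the class name `MetadataSketch` and the fields `elevation` and `coordinates` are assumptions chosen for illustration):

```python
from dataclasses import dataclass
from typing import Optional

# Before: a dict filled on the fly. A typo in a key goes unnoticed.
metadata_dict = {}
metadata_dict["elevation"] = 452.3
metadata_dict["elevaton"] = 460.0  # silently creates a second, misspelled key


# After: fields are declared up front (hypothetical names, illustration only).
@dataclass
class MetadataSketch:
    elevation: Optional[float] = None
    coordinates: Optional[tuple[float, float]] = None


metadata = MetadataSketch(elevation=452.3)

try:
    MetadataSketch(elevaton=460.0)  # misspelled field name
    typo_rejected = False
except TypeError:
    # The generated __init__ rejects unknown keyword arguments.
    typo_rejected = True
```

With declared fields, a misspelled attribute fails at construction time instead of producing a silently inconsistent object.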

github-actions · 2024-09-13T06:57:49Z

Coverage Report

File	Stmts	Miss	Cover	Missing
src/stratigraphy
__init__.py	8	1	88%	11
extract.py	186	186	0%	3–483
get_files.py	19	19	0%	3–47
main.py	119	119	0%	3–310
src/stratigraphy/data_extractor
data_extractor.py	50	3	94%	32, 62, 98
src/stratigraphy/depthcolumn
boundarydepthcolumnvalidator.py	41	20	51%	47, 57, 60, 81–84, 110–128, 140–149
depthcolumn.py	194	64	67%	25, 29, 50, 56, 59–60, 84, 87, 94, 101, 109–110, 120, 137–153, 191, 228, 247–255, 266, 271, 278, 309, 314–321, 336–337, 380–422
depthcolumnentry.py	28	6	79%	17, 21, 36, 39, 56, 65
find_depth_columns.py	106	19	82%	42–43, 73, 86, 180–181, 225–245
src/stratigraphy/layer
layer_identifier_column.py	74	52	30%	16–17, 20, 28, 43, 47, 51, 59–63, 66, 74, 91–96, 99, 112, 125–126, 148–158, 172–199
src/stratigraphy/lines
geometric_line_utilities.py	86	2	98%	81, 131
line.py	51	4	92%	25, 50, 60, 110
linesquadtree.py	46	1	98%	75
src/stratigraphy/metadata
coordinate_extraction.py	108	5	95%	30, 64, 83–84, 96
src/stratigraphy/text
description_block_splitter.py	70	2	97%	24, 139
extract_text.py	29	3	90%	19, 53–54
find_description.py	64	28	56%	27–35, 50–63, 79–95, 172–175
textblock.py	80	9	89%	28, 56, 64, 89, 101, 124, 145, 154, 183
src/stratigraphy/util
dataclasses.py	32	3	91%	37–39
interval.py	104	55	47%	29–32, 37–40, 46, 52, 56, 66–68, 107–153, 174, 180–196
predictions.py	107	107	0%	3–282
util.py	39	17	56%	41, 69–76, 90–92, 116–117, 129–133
TOTAL	1641	725	56%

Tests	Skipped	Failures	Errors	Time
79	0 💤	0 ❌	0 🔥	5.379s ⏱️

dcleres · 2024-09-16T12:46:41Z

Overall, the idea is that the refactored architecture follows the following pattern. The FilePrediction object we currently have as a large dict should become an object that hosts sub-objects such as the BoreholeMetadata object. This BoreholeMetadata can then be evaluated individually using a BoreholeMetadataEvaluator.

Grouping the different objects into classes makes it easier to keep track of the fields in each object, as they need to be declared in advance and not on the fly as we currently do in the dict objects.
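The pattern described above could look roughly like this. The class names FilePrediction, BoreholeMetadata and BoreholeMetadataEvaluator come from this discussion, but every field, method and the exact-match scoring rule below are assumptions made for the sketch, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoreholeMetadata:
    """Holds extracted data only (fields are assumptions for this sketch)."""

    elevation: Optional[float] = None
    coordinates: Optional[tuple[float, float]] = None


@dataclass
class FilePrediction:
    """Container that replaces the large dict and hosts sub-objects."""

    file_name: str
    metadata: BoreholeMetadata


class BoreholeMetadataEvaluator:
    """Computes metrics; it holds the ground truth but no extracted data."""

    def __init__(self, ground_truth: dict[str, BoreholeMetadata]):
        self.ground_truth = ground_truth

    def evaluate(self, prediction: FilePrediction) -> dict[str, bool]:
        # Hypothetical scoring rule: exact match per field.
        expected = self.ground_truth[prediction.file_name]
        return {
            "elevation": prediction.metadata.elevation == expected.elevation,
            "coordinates": prediction.metadata.coordinates == expected.coordinates,
        }


ground_truth = {"a.pdf": BoreholeMetadata(elevation=452.3, coordinates=(2600000.0, 1200000.0))}
prediction = FilePrediction("a.pdf", BoreholeMetadata(elevation=452.3))
result = BoreholeMetadataEvaluator(ground_truth).evaluate(prediction)
```

Keeping the evaluator separate means the data class stays a plain record, and the metadata can be scored without running the rest of the pipeline.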

dcleres · 2024-09-16T14:02:22Z

I need to investigate why the metrics are different. So far I have checked the formula but did not find anything obvious.

stijnvermeeren-swisstopo

I'm not sure about the use case of the "metadata pipeline" (executing only metadata extraction) yet. Seems to be a significant amount of code to maintain, for something that we will very rarely use, unless it's significantly faster (but currently it does not seem to be?).

Apart from that, the code structure seems to become significantly cleaner; I like it.

What would be the plan for further development in this direction? Apply the same structure for groundwater and for layers as well?

src/stratigraphy/util/util.py

src/stratigraphy/evaluation/evaluation_dataclasses.py

src/stratigraphy/layer/layer.py

src/stratigraphy/metadata/metadata.py

dcleres · 2024-09-16T15:20:49Z

@stijnvermeeren-swisstopo thank you very much for the initial review of the draft PR.

I wanted to provide some quick answers to some of your questions:

> I'm not sure about the use case of the "metadata pipeline" (executing only metadata extraction) yet.

I believe this comes from the issue description in Jira. The issue states:

> Refactoring the extraction pipeline to extract a) layers and b) metadata separately. This refactoring should lead into three commands.
> `boreholes-extract-all`,
> `boreholes-extract-layers`
> `boreholes-extract-metadata`

Do you see another way to launch boreholes-extract-metadata without the additional code to maintain? If so, I am happy to use it, as I agree with you: it adds quite a bit of code.
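One possible way to keep the extra code small, sketched here purely as an assumption and not as what this PR does, is a single shared pipeline function with flags, wrapped by three thin entry points. The function names and the placeholder extraction results are hypothetical:

```python
def run_pipeline(files, *, layers: bool, metadata: bool) -> dict:
    """Shared pipeline body; the flags select which parts run."""
    results = {}
    for name in files:
        entry = {}
        if metadata:
            entry["metadata"] = f"metadata of {name}"  # placeholder for real extraction
        if layers:
            entry["layers"] = f"layers of {name}"  # placeholder for real extraction
        results[name] = entry
    return results


# Thin wrappers, one per command named in the issue description.
def extract_all(files):
    return run_pipeline(files, layers=True, metadata=True)


def extract_layers(files):
    return run_pipeline(files, layers=True, metadata=False)


def extract_metadata(files):
    return run_pipeline(files, layers=False, metadata=True)
```

Each command then adds only a few lines, and the maintenance burden stays in one function.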

> What would be the plan for further development in this direction? Apply the same structure for groundwater and for layers as well?

Indeed, the goal of the child issue(s) from this one would be to redefine the different building blocks of the pipeline (layers, groundwater, ...) in the same way. I had a functional version two weeks ago, but it was a bit messy and tricky to review and merge with the changes we made back then. I hope that once we agree on the structure of the Metadata, we can move forward a bit faster with the other ones.

src/stratigraphy/main.py

src/stratigraphy/metadata/elevation_extraction.py

src/stratigraphy/evaluation/evaluation_dataclasses.py

stijnvermeeren-swisstopo

Elevation and coordinates are always drawn with a red line in the visualisations, even when the values are correct.

src/stratigraphy/benchmark/score.py

src/stratigraphy/evaluation/evaluation_dataclasses.py

stijnvermeeren-swisstopo · 2024-09-18T07:36:17Z

src/stratigraphy/evaluation/evaluation_dataclasses.py

+        """Get the document level metrics."""
+        # Collect all the data frames in a list
+        frames = [metadata.get_document_level_metrics() for metadata in self.borehole_metadata_metrics]
+
+        # Concatenate them once at the end
+        return pd.concat(frames, ignore_index=True)


This does not return the input documents in the original order. Where does the order get lost?

Maybe the implementation approach from DatasetMetricsCatalog.document_level_metrics_df is more robust?

I cannot reproduce the issues you are mentioning.

I tried something along these lines:

    document_level_metrics = pd.DataFrame(columns=["document_name", "elevation", "coordinate"])
    for metadata in self.borehole_metadata_metrics:
        document_level_metrics = document_level_metrics.merge(metadata.get_document_level_metrics(), how="outer")

But this actually sorted the filenames, and the order was lost. Is there a reason why the order is important here in your opinion?

I initially had this implementation

    document_level_metrics = pd.DataFrame(columns=["document_name", "elevation", "coordinate"])
    for metadata in self.borehole_metadata_metrics:
        document_level_metrics = pd.concat([document_level_metrics, metadata.get_document_level_metrics()])

but this raises a deprecation warning.

It's important to me to have the files in order, so that different versions of the CSV file (from different runs) can be easily compared, and so that the document-level metrics can be easily compared with the PNG visualisations (which are listed in alphabetical order in MLFlow).

I'm working on a suggestion for a fix.

I see. I already suggested a new fix. Maybe you can have a look at it. There is even an assertion that makes sure the files are in order.

Your assertion ensures that the order is unchanged (i.e. execution order), but this is not necessarily alphabetical; at least on my system it isn't.
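A minimal sketch of how alphabetical order could be made explicit, assuming each per-document frame has a `document_name` column as in the snippets above (the sample frames are made up, and this is not necessarily the fix that was merged):

```python
import pandas as pd

# Hypothetical per-document frames, produced in execution order (not alphabetical).
frames = [
    pd.DataFrame({"document_name": ["b.pdf"], "elevation": [1], "coordinate": [0]}),
    pd.DataFrame({"document_name": ["a.pdf"], "elevation": [0], "coordinate": [1]}),
    pd.DataFrame({"document_name": ["c.pdf"], "elevation": [1], "coordinate": [1]}),
]

# Concatenate once instead of growing an empty frame in a loop,
# then sort explicitly so the result does not depend on execution order.
document_level_metrics = (
    pd.concat(frames, ignore_index=True)
    .sort_values("document_name")
    .reset_index(drop=True)
)
```

Sorting after the concatenation keeps the output stable across runs and systems, which is what makes the CSV files comparable.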

dcleres added 5 commits September 10, 2024 11:45

Close #LGVISIUM-73: Moved the util files into more specific directori…

e7cbdd6

…es & refactored metadata object

Added refactored metadata class

ff36d85

BAckup for today

6002790

Added code to evaluate the metadata individually

540f82a

minor edits

a613f41

dcleres requested a review from stijnvermeeren-swisstopo September 13, 2024 06:56

dcleres self-assigned this Sep 13, 2024

Merge branch 'main' of https://github.com/swisstopo/swissgeol-borehol…

aa171af

…es-dataextraction into LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-file-organisation

dcleres added 3 commits September 16, 2024 14:51

Edited the pipeline command

af0fd12

Fixed typo in the eval file

750194d

Added groundtruth file

0bcd51f

dcleres force-pushed the LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-file-organisation branch from 3f2c095 to 0bcd51f Compare September 16, 2024 13:21

Minor improvements

d2509c8

stijnvermeeren-swisstopo requested changes Sep 16, 2024

minor changes

a7a98a5

dcleres added 7 commits September 17, 2024 14:07

Merge branch 'main' of https://github.com/swisstopo/swissgeol-borehol…

f9411d5

…es-dataextraction into LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-file-organisation

Addressed the comments made during the PR

d2920e3

Addressed the comments raised during the review

4c1dc87

Merge branch 'LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-…

a92c2da

…file-organisation' of https://github.com/swisstopo/swissgeol-boreholes-dataextraction into LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-file-organisation

Edited the yaml file for the CI

bedf096

Removed duplicated code

3b4c61b

Fixed the issue with the metrics

3b0773d

stijnvermeeren-swisstopo requested changes Sep 18, 2024

src/stratigraphy/main.py (outdated, resolved)

src/stratigraphy/main.py (resolved)

src/stratigraphy/metadata/elevation_extraction.py (outdated, resolved)

src/stratigraphy/evaluation/evaluation_dataclasses.py (outdated, resolved)

stijnvermeeren-swisstopo requested changes Sep 18, 2024

Address the elevation metric difference by passing none to provide th…

da8ba85

…e full page

dcleres added 2 commits September 18, 2024 11:20

Addressed review comments and fix the issue with the drawing of the c…

ffd3fef

…oordinates and elevation correctness

Updated the document level metric computation

af75c57

dcleres marked this pull request as ready for review September 18, 2024 11:33

stijnvermeeren-swisstopo added 2 commits September 18, 2024 15:26

update comments

1cc4818

code review

6e7ef2a

stijnvermeeren-swisstopo approved these changes Sep 18, 2024

dcleres merged commit 3fa9f21 into main Sep 18, 2024
3 checks passed

dcleres deleted the LGVISIUM-73-Create-a-Metadata-Object-and-look-into-the-file-organisation branch September 23, 2024 12:03

dcleres added a commit that referenced this pull request Sep 26, 2024

Close #80: Refactoring of the groundwater evaluation and classes

c2864d0
