Close #LGVISIUM-73: Create a metadata object and look into the file organisation #80
Conversation
Overall, the idea is that the refactored architecture follows this pattern: grouping the different objects into classes makes it easier to keep track of the fields in each object, as they need to be declared in advance and not on the fly as we currently do.
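A minimal sketch of the idea (the field names are illustrative, not the actual schema):

from dataclasses import dataclass

# Today: a dictionary whose keys appear on the fly, scattered across the code
metadata = {}
metadata["elevation"] = 420.0  # nothing declares which keys exist

# Refactored: a data class whose fields are declared up front
@dataclass
class BoreholeMetadata:
    elevation: float | None = None
    coordinates: tuple[float, float] | None = None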
I'm not sure about the use case of the "metadata pipeline" (executing only metadata extraction) yet. It seems to be a significant amount of code to maintain for something that we will very rarely use, unless it's significantly faster (but currently it does not seem to be?).
Apart from that, the code structure seems to become significantly cleaner; I like it.
What would be the plan for further development in this direction? Apply the same structure for groundwater and for layers as well?
@stijnvermeeren-swisstopo thank you very much for the initial review of the draft PR. I wanted to provide some quick answers to some of your questions:
This actually comes from the issue description in Jira, I believe. The issue states:
Do you see another way to be able to launch the …
Indeed, the goal of the child issue(s) from this one would be to redefine the different building blocks of the pipeline (layers, groundwater, ...) in the same way. I had a functional version two weeks ago, but it was a bit messy and tricky to review and merge with the changes we made back then. I hope that once we agree on the structure of the Metadata, we can move forward a bit faster with the other ones.
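To make the direction concrete, a hypothetical sketch of what the same pattern could look like for the other building blocks (class and field names are purely illustrative):

from dataclasses import dataclass

@dataclass
class Groundwater:
    depth: float | None = None
    date: str | None = None

@dataclass
class Layer:
    material_description: str | None = None
    start_depth: float | None = None
    end_depth: float | None = None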
Elevation and coordinates are always drawn with a red line in the visualisations, even when the values are correct.
"""Get the document level metrics.""" | ||
# Collect all the data frames in a list | ||
frames = [metadata.get_document_level_metrics() for metadata in self.borehole_metadata_metrics] | ||
|
||
# Concatenate them once at the end | ||
return pd.concat(frames, ignore_index=True) |
This does not return the input documents in the original order. Where does the order get lost?
Maybe the implementation approach from DatasetMetricsCatalog.document_level_metrics_df is more robust?
I cannot reproduce the issues you are mentioning.
I tried something along these lines:
document_level_metrics = pd.DataFrame(columns=["document_name", "elevation", "coordinate"])
for metadata in self.borehole_metadata_metrics:
    document_level_metrics = document_level_metrics.merge(metadata.get_document_level_metrics(), how="outer")
But this actually sorted the filenames, and the order was lost. Is there a reason why the order is important here, in your opinion?
I initially had this implementation:
document_level_metrics = pd.DataFrame(columns=["document_name", "elevation", "coordinate"])
for metadata in self.borehole_metadata_metrics:
    document_level_metrics = pd.concat([document_level_metrics, metadata.get_document_level_metrics()])
but this raises a deprecation warning.
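For context, this is presumably the pandas FutureWarning about concatenation with empty or all-NA entries: the loop starts from an empty DataFrame, so the very first pd.concat call already hits the deprecated case. Collecting the frames in a list and calling pd.concat once, as in the snippet above, avoids both the warning and the repeated copying.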
It's important to me to have the files in order, so that different versions of the CSV file (from different runs) can be easily compared, and so that the document-level metrics can be easily compared with the PNG visualisations (which are listed in alphabetical order in MLflow).
I'm working on a suggestion for a fix.
I see. I already suggested a new fix. Maybe you can have a look at it. There is even an assertion that makes sure the files are in order.
Your assertion ensures that the order is unchanged (i.e. execution order), but this is not necessarily alphabetical; at least on my system it isn't.
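For illustration, one way to guarantee an alphabetical result regardless of execution order would be to sort explicitly after concatenating (a sketch building on the list-and-concat snippet above; document_name is assumed to be the relevant column, as in the earlier examples):

frames = [metadata.get_document_level_metrics() for metadata in self.borehole_metadata_metrics]
document_level_metrics = pd.concat(frames, ignore_index=True)
# Sorting by document name makes the output deterministic and alphabetical,
# independent of the order in which the documents were processed.
return document_level_metrics.sort_values("document_name", ignore_index=True)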
I tried to move the different files from the utils directory into meaningful new directories that are grouped by the extracted borehole feature. Furthermore, I refactored the BoreholeMetadata class and created a way to evaluate the metadata on its own. There will be a follow-up ticket for the layers and other parts of the pipeline. The goal I was pursuing was to have no dictionaries that are filled on the fly, but instead data classes that have attributes for the different features. I also tried to conceptually differentiate between classes that hold data and classes that compute and evaluate the extracted information.
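As an illustration of that separation (everything except the BoreholeMetadata name is a hypothetical sketch): one class only holds the extracted data, while a second class computes the evaluation against the ground truth.

from dataclasses import dataclass

@dataclass
class BoreholeMetadata:
    """Holds extracted data; no evaluation logic."""
    elevation: float | None = None
    coordinates: tuple[float, float] | None = None

class MetadataEvaluator:
    """Computes and evaluates; holds no extracted data of its own."""

    def __init__(self, metadata: BoreholeMetadata, ground_truth: dict):
        self.metadata = metadata
        self.ground_truth = ground_truth

    def evaluate(self) -> dict:
        # Compare each extracted field against the ground truth value.
        return {
            "elevation": self.metadata.elevation == self.ground_truth.get("elevation"),
            "coordinates": self.metadata.coordinates == self.ground_truth.get("coordinates"),
        }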