Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add motivation to README #71

Merged
merged 3 commits into from
Aug 13, 2024
Merged

Add motivation to README #71

merged 3 commits into from
Aug 13, 2024

Conversation

stijnvermeeren-swisstopo
Copy link
Contributor

No description provided.

Copy link

github-actions bot commented Aug 12, 2024

Coverage

Coverage Report
FileStmtsMissCoverMissing
src/stratigraphy
   __init__.py8188%11
   extract.py1881880%3–482
   get_files.py19190%3–47
   line_detection.py26260%3–76
   main.py95950%3–237
src/stratigraphy/util
   boundarydepthcolumnvalidator.py412051%47, 57, 60, 81–84, 109–127, 139–148
   coordinate_extraction.py127794%31, 62, 75–76, 80, 205, 328
   dataclasses.py32391%37–39
   depthcolumn.py1946467%26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 192, 229, 248–256, 267, 272, 279, 310, 315–322, 337–338, 381–423
   depthcolumnentry.py26773%12, 15, 29–30, 33, 45, 52
   description_block_splitter.py70297%24, 139
   draw.py80800%3–244
   duplicate_detection.py51510%3–146
   extract_text.py27293%38–39
   find_depth_columns.py91693%42–43, 71, 83, 176–177
   find_description.py632856%27–35, 50–63, 79–95, 172–175
   geometric_line_utilities.py86298%82, 132
   interval.py1065548%24–27, 31–34, 39, 44, 47, 57–59, 99–145, 166, 171–187
   language_detection.py18180%3–45
   layer_identifier_column.py91910%3–227
   line.py49492%25, 42, 51, 98
   linesquadtree.py46198%76
   plot_utils.py43430%3–120
   predictions.py1301300%3–272
   textblock.py74889%27, 51, 63, 75, 98, 119, 127, 155
   util.py401855%22, 40–47, 61–63, 87–88, 100–105
TOTAL182196947% 

Tests Skipped Failures Errors Time
61 0 💤 0 ❌ 0 🔥 0.979s ⏱️

README.md Outdated

Note that the project is under active development and there is no release to this date, nor has the project reached a maturity such that it could be used.
The Federal Office of Topography swisstopo is Switzerland's geoinformation centre. The Swiss Geological Survey at swisstopo is the federal competence centre for the collection, analysis, storage and provision of geological data of national interest.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing comma:

The Federal Office of Topography swisstopo is Switzerland's geoinformation centre. The Swiss Geological Survey at swisstopo is the federal competence centre for the collection, analysis, storage, and provision of geological data of national interest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a universally accepted rule :) -> https://en.wikipedia.org/wiki/Serial_comma
However, I don't feel very strongly about it, and I can add the comma.

A general note: could you try using the "add a suggestion" Github feature for suggesting minor changes like this one? That would make it easier to find and to accept the suggested changes.
image

@dcleres
Copy link
Contributor

dcleres commented Aug 12, 2024

Thank you for improving the README.md. It is much clearer now.

Do you think we should also reference the two related label studio forks?

Copy link
Contributor

@dcleres dcleres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. The content is good and very clear. I spotted a few typos that you could address before merging. Thank you for clarifying this part of the project!

README.md Outdated
Only German and French borehole profiles are supported as of now.
Therefore, the goal of this project is to automate the extraction of structured data from borehole profiles as much as possible. As far as swisstopo is concerned, the use case is to integrate the data extraction pipeline with the application boreholes.swissgeol.ch ([Github](https://github.com/swisstopo/swissgeol-boreholes-suite)), where a user interface for efficient quality control of the automatically extracted data will also be implemented.

All code and documentation is published in this Github repository as open source software. All other persons, companies or agencies who manage borehole data of their own, are welcome to use the data extraction pipeline on their own data and to contribute to the project with their own improvements/additions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github -> GitHub

* Lithology / stratigraphy
* **Depths** (upper and lower bound of each layer)
* **Material descriptions** (as plain text)
* USCS classification, color, consistency, plasticity...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not italic or bold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not being actively worked on (at least not planned for the rest of the year). Same for the others.

* **Depths** (upper and lower bound of each layer)
* **Material descriptions** (as plain text)
* USCS classification, color, consistency, plasticity...
* Geological interpretations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not italic or bold?

* Geological interpretations
* Other
* _Hydrogeology (ground water levels)_
* Instrumentation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not italic or bold?

* Other
* _Hydrogeology (ground water levels)_
* Instrumentation
* Casing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not italic or bold?

* _Hydrogeology (ground water levels)_
* Instrumentation
* Casing
* Borehole geometry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not italic or bold?

README.md Outdated

### Related work

Existing work related to this project is mostly focussed on the extraction and classification of specific properties from textual geological descriptions. Notable examples include [GEOBERTje](https://www.arxiv.org/abs/2407.10991) (Belgium), [geo-ner-model](https://github.com/BritishGeologicalSurvey/geo-ner-model) (UK), [GeoVec](https://www.sciencedirect.com/science/article/pii/S0098300419306533) und [dh2loop](https://github.com/Loop3D/dh2loop) (Australia). The scope of this project is considerable wider, in particular regarding the goal of understanding borehole profiles in various languages and with an unknown layout, where the document structure first needs to be understood, before the relevant text fragments can be identified and extracted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two spaces: descriptions. Notable

README.md Outdated

Existing work related to this project is mostly focussed on the extraction and classification of specific properties from textual geological descriptions. Notable examples include [GEOBERTje](https://www.arxiv.org/abs/2407.10991) (Belgium), [geo-ner-model](https://github.com/BritishGeologicalSurvey/geo-ner-model) (UK), [GeoVec](https://www.sciencedirect.com/science/article/pii/S0098300419306533) und [dh2loop](https://github.com/Loop3D/dh2loop) (Australia). The scope of this project is considerable wider, in particular regarding the goal of understanding borehole profiles in various languages and with an unknown layout, where the document structure first needs to be understood, before the relevant text fragments can be identified and extracted.

The automatic data extraction pipeline can be considered to belong to the field or [automatic/intelligent document processing](https://en.wikipedia.org/wiki/Document_processing). As such, it involves a combination of methods from multiple fields in data science and machine learning, in particular computer vision (e.g. object detection, line detection) and natural language processing (large language models, named entity recognition). Some of these have already been implemented (e.g. the [Line Segment Detector](https://docs.opencv.org/3.4/db/d73/classcv_1_1LineSegmentDetector.html) algorithm), others are planned as future work.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

belong to the field of

@stijnvermeeren-swisstopo
Copy link
Contributor Author

Do you think we should also reference the two related label studio forks?

Unless we decide that we actually want to maintain those forks and put them into production after all, I would not explicitly mention them here.

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo merged commit a2ae1cb into main Aug 13, 2024
3 checks passed
@stijnvermeeren-swisstopo stijnvermeeren-swisstopo deleted the readme-additions branch August 13, 2024 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants