Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/include new profiles #32

Merged
merged 7 commits into from
Apr 11, 2024
Merged

Feat/include new profiles #32

merged 7 commits into from
Apr 11, 2024

Conversation

redur
Copy link
Contributor

@redur redur commented Apr 9, 2024

Added new set of borehole profiles, and with it, multilingual support.

  • Adjusted GroundTruth to work with new format. --> old format will not work any more. There is a functionality in the notebook that converts old ground_truth.json into the new one.
  • Added language detection.
  • Translated including and excluding expression into french.
  • Also comes with the EDA based on a different branch. Probably should be dropped here. Kept for the conversion script and the generation of test/train/eval splits. Can be dropped after the review.

@redur redur self-assigned this Apr 9, 2024
Copy link
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would indeed remove the notebook and not merge it into the main code base.

What's the easiest way for us to share our train/validation/test splits? Do you recommend that I execute the notebook myself, using the exact same random seed, or is there a more convenient way?

text = text.replace("\n", " ")

# remove all numbers and special characters from text
return "".join(e for e in text if (e.isalnum() or e.isspace()) and not e.isdigit())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this keep accented letter such as öäüéàè that might be important for detecting German/French, or only ASCII?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I'll have to check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

öäüéàè and the like are kept.

@redur redur force-pushed the feat/include_new_profiles branch from 7665e04 to af24308 Compare April 9, 2024 14:12
@redur
Copy link
Contributor Author

redur commented Apr 9, 2024

I would indeed remove the notebook and not merge it into the main code base.

What's the easiest way for us to share our train/validation/test splits? Do you recommend that I execute the notebook myself, using the exact same random seed, or is there a more convenient way?

I would upload them to a bucket and source data from there directly.

Other from that, it's probably good to execute the notebook with the same seed.

Copy link

github-actions bot commented Apr 9, 2024

Coverage

Coverage Report
FileStmtsMissCoverMissing
src/stratigraphy
   extract.py1891890%3–460
   line_detection.py29290%3–88
   main.py78780%3–174
src/stratigraphy/util
   dataclasses.py30390%34–36
   depthcolumn.py2046767%26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 187, 210, 226–234, 244, 249, 256, 263, 268, 286, 296, 299–306, 321–322, 365–407
   depthcolumnentry.py20480%12, 15, 27, 34
   description_block_splitter.py70297%24, 139
   draw.py73730%3–199
   find_depth_columns.py82495%57–58, 149–150
   find_description.py39685%27–34, 111–114
   geometric_line_utilities.py1233770%74–88, 111–115, 214–237, 261, 311
   interval.py1015249%24–27, 31–34, 39, 44, 47, 90–136, 156, 161–177
   language_detection.py18180%3–43
   line.py492647%25, 42, 51, 65–95, 98
   plot_utils.py44440%3–111
   textblock.py68790%31, 43, 55, 78, 99, 107, 135
   util.py402245%15–18, 22, 26, 40–47, 61–63, 87–88, 100–105
TOTAL126266148% 

Tests Skipped Failures Errors Time
45 0 💤 0 ❌ 0 🔥 1.027s ⏱️

@redur
Copy link
Contributor Author

redur commented Apr 10, 2024

TBD: Remove Notebook and create scripts for the data generation.

@redur redur mentioned this pull request Apr 10, 2024
@redur redur force-pushed the feat/include_new_profiles branch from 3ffe1b3 to 32e512f Compare April 11, 2024 07:11
@redur redur force-pushed the feat/include_new_profiles branch from 2da00d6 to 5668fe9 Compare April 11, 2024 07:58
Copy link
Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

For now, it's good to have the convert_ground_script here for reproducibility, but in future we can probable remove that script again, as we won't need to deal with the old format at all anymore?

@redur
Copy link
Contributor Author

redur commented Apr 11, 2024

LGTM.

For now, it's good to have the convert_ground_script here for reproducibility, but in future we can probable remove that script again, as we won't need to deal with the old format at all anymore?

I agree with this statement.

@redur redur merged commit 98a9034 into main Apr 11, 2024
3 checks passed
@redur redur deleted the feat/include_new_profiles branch April 11, 2024 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants