Feat/include new profiles #32
I would indeed remove the notebook and not merge it into the main code base.
What's the easiest way for us to share our train/validation/test splits? Do you recommend that I execute the notebook myself, using the exact same random seed, or is there a more convenient way?
text = text.replace("\n", " ")

# remove all numbers and special characters from text
return "".join(e for e in text if (e.isalnum() or e.isspace()) and not e.isdigit())
Does this keep accented letters such as öäüéàè, which might be important for detecting German/French, or only ASCII?
Good question. I'll have to check.
öäüéàè and the like are kept.
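A quick way to check this is to run the filter on a mixed German/French sample. The sketch below wraps the two lines from the diff in a hypothetical function named `clean_text` (the real function name is not visible in this excerpt) and uses a made-up sample string. Python's `str.isalnum()` is Unicode-aware, so accented letters and ß pass the filter, while digits and punctuation are dropped.

```python
def clean_text(text: str) -> str:
    """Hypothetical wrapper around the filter shown in the diff."""
    text = text.replace("\n", " ")
    # keep letters (including accented ones) and whitespace; drop digits and punctuation
    return "".join(e for e in text if (e.isalnum() or e.isspace()) and not e.isdigit())


# Made-up sample with umlauts, ß, French accents, digits, and punctuation.
sample = "Größe 12 m,\nété à Zürich!"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Removed characters are not replaced by a space, so dropping a number can leave two consecutive spaces behind.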
7665e04 to af24308
I would upload them to a bucket and source the data from there directly. Other than that, it's probably good to execute the notebook with the same seed.
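For the same-seed option, a minimal sketch of a reproducible split could look like the following. It is not the notebook's actual code: `split_dataset`, the seed value, the 80/10/10 fractions, and the file names are all assumptions for illustration. Sorting before shuffling and using a local `random.Random` instance means the result depends only on the seed and the set of items, not on directory listing order or on global random state.

```python
import random


def split_dataset(items, seed=42, train_frac=0.8, val_frac=0.1):
    """Deterministically shuffle items and cut them into train/validation/test.

    Hypothetical helper; seed and fractions are illustrative defaults.
    """
    ordered = sorted(items)    # fix the input order before shuffling
    rng = random.Random(seed)  # local RNG, independent of the global random state
    rng.shuffle(ordered)
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Made-up file names standing in for the borehole profiles.
files = [f"profile_{i:03d}.pdf" for i in range(20)]
train, val, test = split_dataset(files)
print(len(train), len(val), len(test))
```

Even with a fixed seed, identical splits are only expected when everyone runs the same code on the same file list with a compatible Python version, so storing the resulting file lists alongside the data in the bucket is the more robust of the two options.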
TBD: Remove the notebook and create scripts for the data generation.
3ffe1b3 to 32e512f
2da00d6 to 5668fe9
LGTM.
For now, it's good to have the convert_ground_script here for reproducibility, but in the future we can probably remove that script again, as we won't need to deal with the old format at all anymore?
I agree with this statement.
Added a new set of borehole profiles and, with it, multilingual support.