Feat/improve is valid #52
@@ -36,8 +36,8 @@ def detect_language_of_document(doc: fitz.Document) -> str:
     try:
         language = detect(text)
     except LangDetectException:
-        language = "de"
+        language = "de"  # TODO: default language should be read from config
This is bothering me at the moment. Right now you have to change the code to extend the extraction to other languages; this should be possible from the config files. I believe there are other places where the language is hard-coded in the form of keywords (e.g. coordinate extraction). I will open an issue for it.

-    if language not in ["de", "fr"]:
+    if language not in ["de", "fr"]:  # TODO: This should be read from the config
         language = "de"
     return language
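To illustrate the two TODOs above, here is a minimal sketch of how the default and the supported languages could come from configuration instead of being hard-coded. `LanguageConfig`, its field names, and `resolve_language` are hypothetical and not part of this PR; the detector is passed in as a callable so the sketch does not depend on `langdetect`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LanguageConfig:
    """Hypothetical config section; the field names are assumptions, not the project's schema."""

    default_language: str = "de"
    supported_languages: tuple[str, ...] = ("de", "fr")


def resolve_language(text: str, detect: Callable[[str], str], config: LanguageConfig) -> str:
    """Detect the language of `text`, falling back to the configured default."""
    try:
        language = detect(text)
    except Exception:
        # Assumption: real code would catch only the detector's own error
        # (LangDetectException in this PR) rather than every exception.
        return config.default_language
    if language not in config.supported_languages:
        return config.default_language
    return language
```

With something like this, adding a language would mean editing the config values rather than the function body. How the values get loaded from the project's config files is left open here.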
We now support numbers such as `.40`, which sometimes occur.

The input `.40` is now extracted with the value `40`, not as `0.40`. Is that really what we want? I would also suggest adding this as a test case in `test_find_depth_columns.py`.

I checked again: it is rather that a '-' is sometimes recognized as a '.' in older borehole profiles that have this "handwritten style". In that case the behaviour is exactly what we want. I did also find an occurrence of '.80', though (see A531.pdf). For our dataset, it is better to keep the current behaviour for now.
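
To make the trade-off discussed in this thread concrete, here is a rough sketch of the two possible readings of a token such as `.40`. The function `parse_depth_value`, its `leading_dot_is_noise` flag, and the regular expression are illustrative assumptions, not the code in this PR; the default mirrors the current behaviour described above, where the leading dot is treated as a misrecognized '-' and dropped.

```python
import re
from typing import Optional

# Optional stray leading dot, followed by an ordinary decimal number.
_DEPTH_PATTERN = re.compile(r"^(?P<noise>\.)?(?P<number>\d+(?:\.\d+)?)$")


def parse_depth_value(token: str, leading_dot_is_noise: bool = True) -> Optional[float]:
    """Parse a depth token; return None if it does not look like a number.

    Hypothetical helper: with leading_dot_is_noise=True, ".40" is read as 40.0
    (the dot is assumed to be OCR noise); with False it is read as 0.40.
    """
    match = _DEPTH_PATTERN.match(token.strip())
    if match is None:
        return None
    number = match.group("number")
    if match.group("noise") and not leading_dot_is_noise:
        if "." in number:
            # Something like ".4.5" has no sensible decimal reading.
            return None
        return float("0." + number)
    return float(number)
```

A test case along the lines suggested above could then pin down whichever reading is chosen:

```python
assert parse_depth_value(".40") == 40.0
assert parse_depth_value(".40", leading_dot_is_noise=False) == 0.4
```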