-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/improve is valid #52
Conversation
@@ -25,7 +25,7 @@ def depth_column_entries(all_words: list[TextWord], include_splits: bool) -> lis | |||
for word in sorted(all_words, key=lambda word: word.rect.y0): | |||
try: | |||
input_string = word.text.strip().replace(",", ".") | |||
regex = re.compile(r"^-?([0-9]+(\.[0-9]+)?)[müMN\\.]*$") | |||
regex = re.compile(r"^-?\.?([0-9]+(\.[0-9]+)?)[müMN\\.]*$") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now support numbers such as .40
that sometimes occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input .40
is now extracted with value 40
, not as 0.40
. Is that really what we want?
I would also suggest adding this as a test case in test_find_depth_columns.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked again, and actually it's rather that sometimes a '-' is recognized as a '.' in older borehole profiles that have this "handwritten style". Then the behavior is totally desired.
But I also found an occurrence of '.80'. See here: A531.pdf
For our dataset, it is for now better to use the current behaviour.
@@ -36,8 +36,8 @@ def detect_language_of_document(doc: fitz.Document) -> str: | |||
try: | |||
language = detect(text) | |||
except LangDetectException: | |||
language = "de" | |||
language = "de" # TODO: default language should be read from config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is bothering me for the moment. Right now you need to adjust the code to extend the extraction to other languages. This should be doable from the config files. I believe there are other places where language is hard-coded in form of keywords. (e.g. coordinate extraction)
I will open an issue for it.
The noise check within the is_valid criterion is now adjusted to check short depth columns (i.e. few entries) more strictly than longer depth columns (i.e. more entries). This is achieved by applying a quadratic behavior onto the number of entries.
dc1798b
to
c9804e9
Compare
@@ -25,7 +25,7 @@ def depth_column_entries(all_words: list[TextWord], include_splits: bool) -> lis | |||
for word in sorted(all_words, key=lambda word: word.rect.y0): | |||
try: | |||
input_string = word.text.strip().replace(",", ".") | |||
regex = re.compile(r"^-?([0-9]+(\.[0-9]+)?)[müMN\\.]*$") | |||
regex = re.compile(r"^-?\.?([0-9]+(\.[0-9]+)?)[müMN\\.]*$") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input .40
is now extracted with value 40
, not as 0.40
. Is that really what we want?
I would also suggest adding this as a test case in test_find_depth_columns.py
.
Improve is_valid criterion.
The noise check within the is_valid criterion is now adjusted to check
short depth columns (i.e. few entries) more strictly than longer depth columns (i.e. more entries).
This is achieved by applying a quadratic behavior onto the number of entries.
F1 improvement on Zurich dataset by 0.6%.
Similar F1 on geoquat dataset.