Feat/improve depth entry recognition #65
@@ -147,7 +147,7 @@ def evaluate_borehole_extraction(predictions: dict, number_of_truth_values: dict
     coordinate_metrics, coordinate_document_level_metrics = evaluate_metadata(predictions)
     metrics = {**layer_metrics, **coordinate_metrics}
     document_level_metrics = pd.merge(
-        layer_document_level_metrics, coordinate_document_level_metrics, on="document_name"
+        layer_document_level_metrics, coordinate_document_level_metrics, on="document_name", how="outer"
This just fixes a bug where only rows with coordinates were kept in document_level_metrics.
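For anyone reading along, here is a minimal sketch of why the default merge dropped rows. The two small frames and every column name except `document_name` are made up for illustration; they are not the project's real metrics.

```python
import pandas as pd

# Hypothetical per-document metrics. Only "document_name" comes from the diff;
# the other column names and all values are invented for this example.
layer_df = pd.DataFrame({"document_name": ["a.pdf", "b.pdf"], "layer_f1": [0.9, 0.8]})
coord_df = pd.DataFrame({"document_name": ["a.pdf"], "coordinate_ok": [1.0]})

# pd.merge defaults to how="inner": documents missing from either frame are dropped.
inner = pd.merge(layer_df, coord_df, on="document_name")

# how="outer" keeps every document and fills the missing metrics with NaN.
outer = pd.merge(layer_df, coord_df, on="document_name", how="outer")

print(inner)
print(outer)
```

With the outer join, documents without coordinates stay in the table and simply carry NaN in the coordinate columns.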
src/stratigraphy/util/depthcolumn.py
@@ -262,14 +262,14 @@ def depth_intervals(self) -> list[BoundaryInterval]:
         return depth_intervals

     def significant_arithmetic_progression(self) -> bool:
-        if len(self.entries) < 6:
+        if len(self.entries) < 7:
This simply improved the scores.
Keeping the segment length at 6 and simplifying the threshold to abs(scale_pearson_correlation_coef) >= 0.9999
regardless of segment length seems to work even better. (Implemented in my commit.)
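A rough sketch of what such a fixed-threshold check could look like. This is not the code from the commit: it assumes that a "segment" is a window of 6 consecutive entries and that the correlation is taken between the entry values and their positions in the window. The helper names `pearson` and `has_arithmetic_segment` are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def has_arithmetic_segment(values: list[float], segment_length: int = 6, threshold: float = 0.9999) -> bool:
    """True if any window of `segment_length` consecutive values is (almost) an arithmetic progression.

    Assumption: values are correlated against their index within the window.
    """
    if len(values) < segment_length:
        return False
    indices = [float(i) for i in range(segment_length)]
    for start in range(len(values) - segment_length + 1):
        window = values[start : start + segment_length]
        if abs(pearson(window, indices)) >= threshold:
            return True
    return False
```

Evenly spaced values such as a printed scale (1, 2, 3, ...) correlate almost perfectly with their index, whereas real layer depths usually do not, which is why a single strict threshold can be enough here.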
src/stratigraphy/util/depthcolumn.py
            return corr_coef and corr_coef > 0.99

        return (
            corr_coef and corr_coef > np.min([1.0382 - len(self.entries) * 0.01, 0.9985]) and corr_coef > 0.95
Admittedly, this is probably overfitted. It will be interesting to see the real scores once we have a test set.
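To make the length-dependent threshold easier to reason about, here is a small sketch that evaluates it for a few column lengths. The function name `required_correlation` is made up; the constants are the ones from the diff. Since the diff requires `corr_coef` to exceed both the `np.min(...)` term and 0.95, the effective bound is the larger of the two.

```python
def required_correlation(n_entries: int) -> float:
    """Effective lower bound on corr_coef implied by the condition in the diff."""
    # Linear term, capped at 0.9985 for very short columns.
    length_dependent = min(1.0382 - n_entries * 0.01, 0.9985)
    # The extra `corr_coef > 0.95` acts as a floor for long columns.
    return max(length_dependent, 0.95)


for n in (3, 4, 6, 8, 9, 20):
    print(n, round(required_correlation(n), 4))
```

Under these constants the cap of 0.9985 applies only below 4 entries, the bound then falls by 0.01 per extra entry, and from 9 entries on the 0.95 floor takes over.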
Comes with improved depth interval recognition. This commit accounts for:
- Tuned parameters in the depthcolumn.py file.
- Check whether depth columns have strictly increasing depth entries.
- Check if correcting a common OCR mistake makes a depth column valid.
Also fixed a bug in the document_level_metrics.csv file; previously, only borehole profiles that included coordinates were present in that file.
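The two new checks can be sketched roughly as follows. This is an illustration, not the code from the commit. The commit message does not say which OCR mistake is corrected, so the correction is passed in as a callable, and the "lost decimal point" correction in the usage lines is purely hypothetical. The names `is_strictly_increasing` and `repair_column` are also made up.

```python
from typing import Callable, Optional


def is_strictly_increasing(depths: list[float]) -> bool:
    """True if every depth is larger than the one before it."""
    return all(a < b for a, b in zip(depths, depths[1:]))


def repair_column(depths: list[float], correct: Callable[[float], float]) -> Optional[list[float]]:
    """Return a valid column, correcting at most one entry; None if that is not possible.

    `correct` stands in for whatever OCR correction is applied (an assumption here).
    """
    if is_strictly_increasing(depths):
        return list(depths)
    for i, value in enumerate(depths):
        candidate = depths[:i] + [correct(value)] + depths[i + 1 :]
        if is_strictly_increasing(candidate):
            return candidate
    return None


# Hypothetical OCR mistake: a decimal point was lost, so 2.5 was read as 25.
print(repair_column([1.0, 25.0, 3.0, 4.5], lambda v: v / 10))
print(repair_column([3.0, 2.0, 1.0], lambda v: v / 10))
```

Accepting a correction only when it makes the whole column valid keeps the check conservative: entries are never altered in a column that is already strictly increasing.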
f45c853
to
ee459a5
Thanks for the work!
The code for correcting OCR mistakes does not make a big difference yet, but offers more potential for further improvements in the future.
With the additional changes that I've implemented (which I believe make the logic slightly simpler as well), I now get the following numbers (overall F1 score):
Zurich dataset:
- main branch: 87.16%
- Renato's version: 87.91%
- Stijn's version: 87.90%
Geoquat validation dataset:
- main branch: 49.60%
- Renato's version: 49.43%
- Stijn's version: 49.53%
While I believe that some further optimization of the parameters involved might be possible, the improvements will be marginal, and probably not worth the effort. Therefore, I would propose to merge the PR like this.
Tune some parameters to improve depth entry recognition.
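Introduce new checks:
- Check whether depth columns have strictly increasing depth entries.
- Check whether correcting a common OCR mistake makes a depth column valid.
Overall, more depth entries are now recognized. For the Zurich dataset, even the layer detection is improved. For the Geoquat dataset, the layer detection rate is approximately the same.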