
Feat/improve depth entry recognition #65

Merged: 4 commits from feat/improve_depth_entry_recognition into main on Jun 24, 2024

Conversation

@redur (Contributor) commented Jun 20, 2024

Tune some parameters to improve depth entry recognition.
Introduce new checks:

  • check if depths are strictly increasing.
  • check if correcting a common OCR mistake makes a depth column valid.

Overall, more depth entries are now recognized. For the Zurich dataset, even the layer detection is improved; for the Geoquat dataset, the layer detection rate is approximately the same.

```diff
@@ -147,7 +147,7 @@ def evaluate_borehole_extraction(predictions: dict, number_of_truth_values: dict
     coordinate_metrics, coordinate_document_level_metrics = evaluate_metadata(predictions)
     metrics = {**layer_metrics, **coordinate_metrics}
     document_level_metrics = pd.merge(
-        layer_document_level_metrics, coordinate_document_level_metrics, on="document_name"
+        layer_document_level_metrics, coordinate_document_level_metrics, on="document_name", how="outer"
```
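The effect of the fix can be shown with a toy pandas example (made-up data, not the project's actual metrics): the default inner join drops any document missing from either frame, while an outer join keeps all of them.

```python
import pandas as pd

layer = pd.DataFrame({"document_name": ["a.pdf", "b.pdf"], "layer_f1": [0.9, 0.8]})
coord = pd.DataFrame({"document_name": ["a.pdf"], "coord_ok": [True]})

# The default inner join silently drops b.pdf, which has no coordinates.
inner = pd.merge(layer, coord, on="document_name")
assert list(inner["document_name"]) == ["a.pdf"]

# An outer join keeps every document; missing coordinate metrics become NaN.
outer = pd.merge(layer, coord, on="document_name", how="outer")
assert sorted(outer["document_name"]) == ["a.pdf", "b.pdf"]
```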
@redur (Contributor, Author) commented:
This is just fixing a bug where we only kept rows with coordinates in the document_level_metrics

```diff
@@ -262,14 +262,14 @@ def depth_intervals(self) -> list[BoundaryInterval]:
         return depth_intervals

     def significant_arithmetic_progression(self) -> bool:
-        if len(self.entries) < 6:
+        if len(self.entries) < 7:
```
@redur (Contributor, Author) commented:

This simply improved the scores.

A reviewer (Contributor) replied:

Keeping the segment length at 6 and simplifying the threshold to abs(scale_pearson_correlation_coef) >= 0.9999 regardless of segment length seems to work even better. (Implemented in my commit.)
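The simplified rule described above can be sketched like this. The function name mirrors the one in the diff, but the body is an illustrative reconstruction under stated assumptions, not the repository code: entries in arithmetic progression are perfectly linear in their index, so a sliding segment whose Pearson correlation with its index is at least 0.9999 in absolute value counts as significant.

```python
import numpy as np


def significant_arithmetic_progression(entries: list[float], segment_length: int = 6) -> bool:
    """Slide a fixed-length window over the entries and test whether any
    segment is (near-)perfectly linear in its index, i.e. an arithmetic
    progression, using the Pearson correlation coefficient."""
    if len(entries) < segment_length:
        return False
    for start in range(len(entries) - segment_length + 1):
        segment = np.array(entries[start:start + segment_length])
        corr = np.corrcoef(np.arange(segment_length), segment)[0, 1]
        if abs(corr) >= 0.9999:
            return True
    return False
```

A regularly spaced column such as 0.5, 1.0, 1.5, 2.0, 2.5, 3.0 passes, while a geometric sequence like 1, 2, 4, 8, 16, 32 does not, since its correlation with the index is well below the threshold.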

github-actions bot commented Jun 20, 2024

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| **src/stratigraphy** | | | | |
| __init__.py | 8 | 1 | 88% | 11 |
| extract.py | 188 | 188 | 0% | 3–482 |
| get_files.py | 21 | 21 | 0% | 3–48 |
| line_detection.py | 26 | 26 | 0% | 3–76 |
| main.py | 94 | 94 | 0% | 3–237 |
| **src/stratigraphy/util** | | | | |
| boundarydepthcolumnvalidator.py | 37 | 16 | 57% | 47, 57, 60, 81–84, 102–118, 130–132 |
| coordinate_extraction.py | 127 | 7 | 94% | 31, 62, 75–76, 80, 205, 328 |
| dataclasses.py | 32 | 3 | 91% | 37–39 |
| depthcolumn.py | 192 | 64 | 67% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 192, 229, 245–253, 264, 269, 276, 307, 312–319, 334–335, 378–420 |
| depthcolumnentry.py | 26 | 7 | 73% | 12, 15, 29–30, 33, 45, 52 |
| description_block_splitter.py | 70 | 2 | 97% | 24, 139 |
| draw.py | 80 | 80 | 0% | 3–244 |
| duplicate_detection.py | 51 | 51 | 0% | 3–146 |
| extract_text.py | 27 | 2 | 93% | 38–39 |
| find_depth_columns.py | 91 | 6 | 93% | 42–43, 71, 83, 176–177 |
| find_description.py | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| geometric_line_utilities.py | 86 | 2 | 98% | 82, 132 |
| interval.py | 106 | 55 | 48% | 24–27, 31–34, 39, 44, 47, 57–59, 99–145, 166, 171–187 |
| language_detection.py | 18 | 18 | 0% | 3–45 |
| layer_identifier_column.py | 91 | 91 | 0% | 3–227 |
| line.py | 49 | 4 | 92% | 25, 42, 51, 98 |
| linesquadtree.py | 46 | 1 | 98% | 76 |
| plot_utils.py | 43 | 43 | 0% | 3–120 |
| predictions.py | 187 | 187 | 0% | 3–387 |
| textblock.py | 74 | 8 | 89% | 27, 51, 63, 75, 98, 119, 127, 155 |
| util.py | 40 | 18 | 55% | 22, 40–47, 61–63, 87–88, 100–105 |
| **TOTAL** | 1873 | 1023 | 45% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 61 | 0 💤 | 0 ❌ | 0 🔥 | 0.952s ⏱️ |

```diff
-        return corr_coef and corr_coef > 0.99
+        return (
+            corr_coef and corr_coef > np.min([1.0382 - len(self.entries) * 0.01, 0.9985]) and corr_coef > 0.95
+        )
```
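The length-dependent threshold in this diff can be tabulated with a small helper (the helper name is ours, not the repository's; the formula is taken verbatim from the diff):

```python
import numpy as np


def corr_threshold(num_entries: int) -> float:
    # Reconstructed from the diff: short columns need near-perfect correlation,
    # capped at 0.9985, while the separate `corr_coef > 0.95` clause in the
    # full condition acts as an absolute floor for long columns.
    return float(np.min([1.0382 - num_entries * 0.01, 0.9985]))
```

For example, a column with 4 entries must exceed a correlation of 0.9982, while for 10 entries the formula drops to 0.9382 and the 0.95 floor in the full condition takes over.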
@redur (Contributor, Author) commented:

Admittedly, this is probably overfitted. It will be interesting to see the real scores once we have a test set.

Comes with improved depth interval recognition. This commit accounts for:
- Tuned parameters in depthcolumn.py.
- A check that depth columns have strictly increasing depth entries.
- A check whether correcting a common OCR mistake makes a depth column valid.

Also fixed a bug in the document_level_metrics.csv file; previously, only borehole profiles
including coordinates were present in that file.
@redur redur force-pushed the feat/improve_depth_entry_recognition branch from f45c853 to ee459a5 Compare June 20, 2024 07:07
@stijnvermeeren-swisstopo (Contributor) left a review:

Thanks for the work!
The code for correcting OCR mistakes does not make a big difference yet, but offers more potential for further improvements in the future.

With the additional changes that I've implemented (which I believe make the logic slightly simpler as well), I now get the following numbers (overall F1 score):

Zurich dataset:

  • main branch: 87.16%
  • Renato's version: 87.91%
  • Stijn's version: 87.90%

Geoquat validation dataset:

  • main branch: 49.60%
  • Renato's version: 49.43%
  • Stijn's version: 49.53%

While I believe that some further optimization of the parameters involved might be possible, the improvements will be marginal, and probably not worth the effort. Therefore, I would propose to merge the PR like this.

@redur redur merged commit b96efa8 into main Jun 24, 2024
3 checks passed
@redur redur deleted the feat/improve_depth_entry_recognition branch June 24, 2024 06:52