Feat: Material Block recognition for Geneva Layout #45

Merged
merged 4 commits into from
May 22, 2024

Contributor

@redur redur commented May 21, 2024

Adds functionality to recognize material description blocks in the Geneva borehole profile layout.
The code detects so-called layer index columns, which indicate the presence of a material description rect. The positions of their entries mark where to split the material description rect into material blocks.
Note: This commit includes some visualization code, which should be dropped in a follow-up commit.
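
As a rough illustration of the splitting idea described above, here is a minimal sketch. It is not the code from this PR: `Line` and `split_into_blocks` are hypothetical names, and text lines are reduced to a text and a vertical position. It assumes that each description line belongs to the last layer index entry located at or above it.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for a text line: its text and vertical position."""

    text: str
    y: float


def split_into_blocks(description_lines: list[Line], index_entries: list[Line]) -> list[list[Line]]:
    """Group description lines into one block per layer index entry.

    A line is assigned to the last index entry at or above it (smaller or equal y).
    Lines above the first index entry are ignored in this sketch.
    """
    entries = sorted(index_entries, key=lambda e: e.y)
    blocks: list[list[Line]] = [[] for _ in entries]
    for line in sorted(description_lines, key=lambda ln: ln.y):
        owner = None
        for i, entry in enumerate(entries):
            if entry.y <= line.y:
                owner = i  # keep the last entry that starts at or above this line
        if owner is not None:
            blocks[owner].append(line)
    return blocks
```

A real implementation would probably need some vertical tolerance, since index entries and description lines are rarely aligned exactly on a scanned page.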
Comment on lines 46 to 89
fr:
  including_expressions:
    - sol
    - végétal
    - végétal # remove accents generally; ocr might be wrong
    - dallage
    - terre
    - bitume
    - bitumineux
    - grave d'infrastructure
    - grave d'infrastructure # what happens if we remove this?
    - sable
    - limon
    - gravier
    - asphalte
    - humus
    - humus # hummus maybe?
    - brun
    - gris
    - grise
    - mou
    - dur
    - dure
    - ferme
    - racine
    - revetement
    - pierre
    - beige
    - beton
    - craie
    - marne
    - materiau de base
    - materiau
    - matrice sableuse
    - enrobé
    - enrobé # accent --> check what happens if it's removed
    - terrain
    - remblais
    - remblai
    - molasse
    - phase
    - formations
    - limoneuse
    - argileuse
    - argileux
    - mousse
  excluding_expressions:
    - monsieur
    - fin
Contributor Author

@redur redur May 21, 2024

I checked our expressions with a native French speaker who also has experience with NLP for French. We looked at borehole profiles together and extended the list. He also suggested removing all accents, because OCR frequently makes errors there.

Those are the reasons for the changes and comments here.
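
To make the accent suggestion concrete, here is a minimal sketch of accent-insensitive matching using only the standard library. The helper names `strip_accents` and `contains_expression` are hypothetical and not part of this PR. The sketch assumes that both the OCR text and the configured expressions are normalized the same way before comparison.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. 'végétal' becomes 'vegetal'."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD decomposition, accents are separate combining characters (category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def contains_expression(line: str, expressions: list[str]) -> bool:
    """Check whether any expression occurs in the line, ignoring accents and case."""
    normalized = strip_accents(line).lower()
    return any(strip_accents(expr).lower() in normalized for expr in expressions)
```

With this kind of normalization, a configured entry matches whether or not OCR preserved the accent, so the list would only need one spelling per word.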

github-actions bot commented May 21, 2024

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| src/stratigraphy | | | | |
| `__init__.py` | 8 | 1 | 88% | 11 |
| `extract.py` | 214 | 214 | 0% | 3–522 |
| `get_files.py` | 21 | 21 | 0% | 3–48 |
| `line_detection.py` | 26 | 26 | 0% | 3–76 |
| `main.py` | 91 | 91 | 0% | 3–232 |
| src/stratigraphy/util | | | | |
| `coordinate_extraction.py` | 128 | 31 | 76% | 30, 50, 54, 58–66, 143, 163, 235–241, 250–252, 268–282 |
| `dataclasses.py` | 32 | 3 | 91% | 37–39 |
| `depthcolumn.py` | 206 | 67 | 67% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 199, 238, 254–262, 274, 279, 286, 310, 314, 343, 364, 367–378, 393–394, 439–481 |
| `depthcolumnentry.py` | 20 | 4 | 80% | 12, 15, 27, 34 |
| `description_block_splitter.py` | 70 | 2 | 97% | 24, 139 |
| `draw.py` | 73 | 73 | 0% | 3–225 |
| `duplicate_detection.py` | 32 | 32 | 0% | 3–81 |
| `find_depth_columns.py` | 82 | 4 | 95% | 56–57, 151–152 |
| `find_description.py` | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| `geometric_line_utilities.py` | 87 | 2 | 98% | 83, 133 |
| `interval.py` | 107 | 52 | 51% | 25–28, 32–35, 40, 45, 48, 100–146, 167, 172–188 |
| `language_detection.py` | 18 | 18 | 0% | 3–43 |
| `layer_identifier_column.py` | 61 | 61 | 0% | 3–162 |
| `line.py` | 49 | 26 | 47% | 25, 42, 51, 65–95, 98 |
| `linesquadtree.py` | 46 | 1 | 98% | 76 |
| `plot_utils.py` | 44 | 44 | 0% | 3–121 |
| `predictions.py` | 187 | 187 | 0% | 3–385 |
| `textblock.py` | 74 | 8 | 89% | 27, 51, 63, 75, 98, 119, 127, 155 |
| `util.py` | 40 | 22 | 45% | 15–18, 22, 26, 40–47, 61–63, 87–88, 100–105 |
| TOTAL | 1779 | 1018 | 43% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 58 | 0 💤 | 0 ❌ | 0 🔥 | 0.590s ⏱️ |

Comment on lines 126 to 136
# Visualization: To be dropped before merging to main.
for layer_index_column in layer_index_columns:
    fitz.utils.draw_rect(
        page, layer_index_column.rect() * page.derotation_matrix, color=fitz.utils.getColor("blue")
    )
for block in blocks:
    fitz.utils.draw_rect(page, block.rect * page.derotation_matrix, color=fitz.utils.getColor("red"))
fitz.utils.draw_rect(
    page, material_description_rect * page.derotation_matrix, color=fitz.utils.getColor("blue")
)
page.parent.save(DATAPATH / "_temp" / "output.pdf", garbage=4, deflate=True, clean=True)
Contributor Author

This will need to be dropped. It is useful for inspecting the recognized layer index columns as well as the material description rect.

)
page.parent.save(DATAPATH / "_temp" / "output.pdf", garbage=4, deflate=True, clean=True)

return predictions, json_filtered_pairs
Contributor Author

For now, I just exit here if there's a layer index column.

Some thoughts:
I believe it should be possible to treat a layer index column as a special case of a depth column and continue the "normal" way. I think that would be a "cleaner" way to move forward.
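
One way this idea could look, purely as a sketch: a shared base class that holds the entries and their boundary positions, with the column-specific validity check left to subclasses. Everything here is an assumption for illustration. `Entry`, `Column`, the simplified `DepthColumn` and `LayerIndexColumn`, and the validity rules are hypothetical and do not reflect the actual classes in the repository.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical column entry: the recognized text and its vertical position."""

    value: str
    y: float


@dataclass
class Column(ABC):
    """Shared behaviour: both column types yield boundaries for splitting descriptions."""

    entries: list[Entry] = field(default_factory=list)

    def boundaries(self) -> list[float]:
        return sorted(entry.y for entry in self.entries)

    @abstractmethod
    def is_valid(self) -> bool: ...


class DepthColumn(Column):
    def is_valid(self) -> bool:
        # Assumed rule: depths must be numeric and strictly increasing down the page.
        try:
            depths = [float(e.value) for e in sorted(self.entries, key=lambda e: e.y)]
        except ValueError:
            return False
        return all(a < b for a, b in zip(depths, depths[1:]))


class LayerIndexColumn(Column):
    def is_valid(self) -> bool:
        # Assumed rule: at least two index entries are needed to form a column.
        return len(self.entries) >= 2
```

With such a structure, the downstream code that splits the material description would only depend on `boundaries()` and would not need an early exit for one of the column types.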

Contributor

Agreed

from stratigraphy.util.line import TextLine


class LayerIndexColumn:
Contributor Author

Even at this stage, this is quite similar to a depth column.

return blocks


def matching_blocks(all_lines: list[TextLine], line_index: int, next_layer_index: TextLine | None) -> list[TextBlock]:
Contributor Author

Could become a method of LayerIndexColumn.

Contributor

If next_layer_index is not just an index but of type TextLine | None, then next_layer is probably a better name?

Contributor Author

I did not create a LayerIndexEntry class analogous to the entries of depth columns. The entries of the layer index column are simply TextLine objects.

What I mean here is the next layer index "entry", which is why I called it next_layer_index. Does that make sense?

Moving forward, if we treat LayerIndexColumn as a special case of DepthColumn, we might want to create a LayerIndexEntry object as well, and then this will be clearer. What do you think?

Contributor

@stijnvermeeren-swisstopo stijnvermeeren-swisstopo left a comment

I think that this is already a pretty decent first implementation, though obviously there is room to simplify the code (by sharing more code between the "index column" and "depth column" implementations).

I would propose merging this (after fixing the minor comments) and then improving the code in a next iteration.


@redur redur self-assigned this May 22, 2024
@redur redur merged commit e156378 into main May 22, 2024
3 checks passed
@redur redur deleted the feat/improve_geneva_layout branch May 22, 2024 09:22