Feat: Material Block recognition for Geneva Layout #45
Adds functionality to recognize material description blocks in the Geneva borehole profile layout. The code detects so-called layer index columns. Their presence indicates a material description rect, and the positions of their entries determine how to split that rect into material blocks. Note: this commit includes some visualization code, which should be dropped in a follow-up commit.
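The splitting step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Line`, `split_into_blocks`, and the `tolerance` parameter are hypothetical names, and each line is reduced to its text and top y-coordinate.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for a text line: its text and top y-coordinate."""

    text: str
    y0: float


def split_into_blocks(
    description_lines: list[Line], index_entries: list[Line], tolerance: float = 0.0
) -> list[list[Line]]:
    """Group description lines by the layer index entry they fall under.

    A line belongs to entry i if its y0 lies between the y0 of entry i and the
    y0 of entry i + 1 (both shifted up by `tolerance`). Lines above the first
    entry are dropped in this sketch.
    """
    entries = sorted(index_entries, key=lambda e: e.y0)
    blocks: list[list[Line]] = [[] for _ in entries]
    for line in sorted(description_lines, key=lambda ln: ln.y0):
        for i, entry in enumerate(entries):
            # The last entry has no successor, so its block is open-ended.
            next_y = entries[i + 1].y0 if i + 1 < len(entries) else float("inf")
            if entry.y0 - tolerance <= line.y0 < next_y - tolerance:
                blocks[i].append(line)
                break
    return blocks
```

The `tolerance` argument is an assumption on my side: a description line may start slightly above its index entry, so a small offset keeps it in the right block.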
fr:
  including_expressions:
    - sol
    - végétal
    - végétal # remove accents generally; ocr might be wrong
    - dallage
    - terre
    - bitume
    - bitumineux
    - grave d'infrastructure
    - grave d'infrastructure # what happens if we remove this?
    - sable
    - limon
    - gravier
    - asphalte
    - humus
    - humus # hummus maybe?
    - brun
    - gris
    - grise
    - mou
    - dur
    - dure
    - ferme
    - racine
    - revetement
    - pierre
    - beige
    - beton
    - craie
    - marne
    - materiau de base
    - materiau
    - matrice sableuse
    - enrobé
    - enrobé # accent --> check what happens if it's removed
    - terrain
    - remblais
    - remblai
    - molasse
    - phase
    - formations
    - limoneuse
    - argileuse
    - argileux
    - mousse
  excluding_expressions:
    - monsieur
    - fin
I checked our expressions with a native French speaker who also has experience doing NLP on French text. We looked at borehole profiles together and extended the list. He also suggested removing all accents, because OCR frequently gets them wrong.
Those are the reasons for the changes and comments here.
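One way to apply the accent-removal suggestion is to normalize both the OCR text and the configured expressions before matching. The sketch below uses only the standard library; `strip_accents` and `contains_expression` are hypothetical helper names and not part of this PR.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed (e.g. é becomes e)."""
    # NFKD splits an accented character into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_expression(line: str, expressions: list[str]) -> bool:
    """Check whether any expression occurs in the line, ignoring accents and case."""
    normalized = strip_accents(line).lower()
    return any(strip_accents(expr).lower() in normalized for expr in expressions)
```

With this kind of normalization, the list can hold either the accented or the unaccented spelling, and an OCR result that lost or kept an accent still matches.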
src/stratigraphy/extract.py
Outdated
# Visualization: To be dropped before merging to main.
for layer_index_column in layer_index_columns:
    fitz.utils.draw_rect(
        page, layer_index_column.rect() * page.derotation_matrix, color=fitz.utils.getColor("blue")
    )
for block in blocks:
    fitz.utils.draw_rect(page, block.rect * page.derotation_matrix, color=fitz.utils.getColor("red"))
fitz.utils.draw_rect(
    page, material_description_rect * page.derotation_matrix, color=fitz.utils.getColor("blue")
)
page.parent.save(DATAPATH / "_temp" / "output.pdf", garbage=4, deflate=True, clean=True)
This will need to be dropped. For now it is useful for inspecting the recognized layer index columns and the material description rect.
)
page.parent.save(DATAPATH / "_temp" / "output.pdf", garbage=4, deflate=True, clean=True)

return predictions, json_filtered_pairs
For now, I simply return here if a layer index column is found.
Some thoughts:
I believe it should be possible to treat a layer index column as a special case of a depth column and continue through the "normal" code path. That would be a cleaner basis for moving forward.
Agreed
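To make the "special case of a depth column" idea concrete, here is one possible shape for a shared base class. Everything in it is an assumption for illustration: `Rect`, `Entry`, `Column`, and both `is_valid` criteria are invented here and do not reflect the existing `stratigraphy` classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical minimal rectangle (the real code uses fitz.Rect)."""

    x0: float
    y0: float
    x1: float
    y1: float

    def union(self, other: "Rect") -> "Rect":
        return Rect(
            min(self.x0, other.x0), min(self.y0, other.y0),
            max(self.x1, other.x1), max(self.y1, other.y1),
        )


@dataclass(frozen=True)
class Entry:
    """Hypothetical column entry: a bounding box plus the recognized text."""

    rect: Rect
    text: str


class Column(ABC):
    """Behaviour shared by depth columns and layer index columns."""

    def __init__(self, entries: list[Entry]):
        if not entries:
            raise ValueError("a column needs at least one entry")
        self.entries = list(entries)

    def rect(self) -> Rect:
        """Bounding box around all entries of the column."""
        result = self.entries[0].rect
        for entry in self.entries[1:]:
            result = result.union(entry.rect)
        return result

    @abstractmethod
    def is_valid(self) -> bool:
        """Column-specific plausibility check."""


class DepthColumn(Column):
    def is_valid(self) -> bool:
        # Illustrative criterion only: depths must strictly increase.
        depths = [float(e.text) for e in self.entries]
        return len(depths) >= 2 and all(a < b for a, b in zip(depths, depths[1:]))


class LayerIndexColumn(Column):
    def is_valid(self) -> bool:
        # Illustrative criterion only: at least two index entries.
        return len(self.entries) >= 2
```

With a layout like this, downstream code that only needs `rect()` and `entries` would not have to distinguish between the two column types.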
from stratigraphy.util.line import TextLine


class LayerIndexColumn:
Even at this stage, this is quite similar to a depth column.
    return blocks


def matching_blocks(all_lines: list[TextLine], line_index: int, next_layer_index: TextLine | None) -> list[TextBlock]:
Could become a method of LayerIndexColumn.
If next_layer_index is not just an index but of type TextLine | None, then probably next_layer is a better name?
I did not create a LayerIndexEntry class similar to the entries of depth columns. The entries of the layer index column are simply TextLine objects.
What I mean here is the next layer index "entry", which is why I called it next_layer_index. Does that make sense?
Going forward, if we treat LayerIndexColumn as a special case of a DepthColumn, we might want to create a LayerIndexEntry object as well, which would make this clearer. What do you think?
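For illustration, a dedicated entry type could look roughly like the sketch below. `LayerIndexEntry` does not exist in this PR; its fields and the `entry_pairs` helper are assumptions made only to show how a parameter typed as an entry (rather than a bare TextLine) would read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerIndexEntry:
    """Hypothetical entry of a layer index column."""

    label: str  # text printed in the index column, e.g. "1" or "2a"
    y0: float  # top of the entry on the page
    y1: float  # bottom of the entry on the page


def entry_pairs(
    entries: list[LayerIndexEntry],
) -> list[tuple[LayerIndexEntry, Optional[LayerIndexEntry]]]:
    """Pair each entry with the entry below it; the last entry is paired with None."""
    ordered = sorted(entries, key=lambda e: e.y0)
    following: list[Optional[LayerIndexEntry]] = [*ordered[1:], None]
    return list(zip(ordered, following))
```

A function such as matching_blocks could then take a `next_entry: LayerIndexEntry | None` argument, which would avoid the ambiguity between "index" as a position in a list and "index" as a layer label.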
I think this is already a pretty decent first implementation, though there is obviously room to simplify the code (by sharing more code between the "index column" and "depth column" implementations).
I would propose merging this (after fixing the minor comments) and then improving the code in a next iteration.