Commit

Merge pull request #105 from swisstopo/LGVISIUM-102/LayerIdentifierColumn

LGVISIUM-102: common parent class "Sidebar" for LayerIdentifierColumn and DepthColumn
stijnvermeeren-swisstopo authored Nov 22, 2024
2 parents 2ffcdef + 474111a commit b66b08e
Showing 32 changed files with 1,613 additions and 1,830 deletions.
100 changes: 3 additions & 97 deletions README.md
@@ -124,103 +124,9 @@ Use `boreholes-extract-all --help` to see all options for the extraction script.

4. **Check the results**

Once the script has finished running, you can check the results in the `data/output/draw` directory. The result is a `predictions.json` file as well as a png file for each page of each PDF in the specified input directory.

### Output Structure
The `predictions.json` file contains the results of a data extraction process from PDF files. Each key in the JSON object is the name of a PDF file, and the value is a list of extracted items in a dictionary like object. The extracted items for now are the material descriptions in their correct order (given by their depths).

Example: predictions.json
```json
{
"685256002-bp.pdf": { # file name
"language": "de",
"metadata": {
"coordinates": null
},
"layers": [ # a layer corresponds to a material layer in the borehole profile
{
"material_description": { # all information about the complete description of the material of the layer
"text": "grauer, siltig-sandiger Kies (Auffullung)",
"rect": [
232.78799438476562,
130.18496704101562,
525.6640014648438,
153.54295349121094
],
"lines": [
{
"text": "grauer, siltig-sandiger Kies (Auffullung)",
"rect": [
232.78799438476562,
130.18496704101562,
525.6640014648438,
153.54295349121094
],
"page": 1
}
],
"page": 1
},
"depth_interval": { # information about the depth of the layer
"start": null,
"end": {
"value": 0.4,
"rect": [
125.25399780273438,
140.2349853515625,
146.10398864746094,
160.84498596191406
],
"page": 1
}
}
},
...
],
"depths_materials_column_pairs": [ # information about where on the pdf the information for material description as well as depths are taken.
{
"depth_column": {
"rect": [
119.05999755859375,
140.2349853515625,
146.8470001220703,
1014.4009399414062
],
"entries": [
{
"value": 0.4,
"rect": [
125.25399780273438,
140.2349853515625,
146.10398864746094,
160.84498596191406
],
"page": 1
},
{
"value": 0.6,
"rect": [
125.21800231933594,
153.8349609375,
146.0679931640625,
174.44496154785156
],
"page": 1
},
...
]
}
}
],
"page_dimensions": [
{
"height": 1192.0999755859375,
"width": 842.1500244140625
}
]
},
}
```
The script produces output in two different formats:
- A file `data/output/predictions.json` that contains all extracted data in a machine-readable format. The structure of this file is documented in [README.predictions-json.md](README.predictions-json.md).
- A PNG image of each processed PDF page in the `data/output/draw` directory, where the extracted data is highlighted.
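As a rough illustration of how the machine-readable output might be consumed, the following sketch loads the file and counts the extracted layers per PDF. The helper names `load_predictions` and `layers_per_file` are hypothetical and not part of the repository; the sketch only assumes the default output path and the top-level structure (file name as key, `layers` as a list) described in README.predictions-json.md.

```python
import json


def load_predictions(path="data/output/predictions.json"):
    """Load the extraction results; the top-level keys are PDF file names."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def layers_per_file(predictions):
    """Map each PDF file name to the number of layers extracted from it."""
    # Hypothetical helper: treats a missing "layers" key as "no layers found".
    return {
        file_name: len(data.get("layers", []))
        for file_name, data in predictions.items()
    }
```

Calling `layers_per_file(load_predictions())` after a run should give a quick overview of which PDFs produced any layers at all.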

# Developer Guidance
## Project Structure
128 changes: 128 additions & 0 deletions README.predictions-json.md
@@ -0,0 +1,128 @@
# `predictions.json` output structure
The `predictions.json` file contains the results of a data extraction process in a machine-readable format. By default, the file is written to `data/output/predictions.json`.

Each key in the JSON object is the name of a PDF file. The extracted data is listed as an object with the following keys:
- `metadata`
  - `elevation`: the detected elevation (if any) and the location in the PDF where it was extracted from.
  - `coordinates`: the detected coordinates (if any) and the location in the PDF where they were extracted from.
  - `language`: the language that was detected for the document.
  - `page_dimensions`: the dimensions of each page in the PDF, measured in PDF points.
- `layers`: a list of objects, where each object represents a layer of the borehole profile, using the following keys:
  - `material_description`: the text of the material description, both as a single value and line by line, and the locations in the PDF where the text and the individual lines were extracted from.
  - `depth_interval`: the measured depth of the upper and lower limits of the layer, and the location in the PDF where they were extracted from.
- `bounding_boxes`: a list of objects, one for each (part of a) borehole profile in the PDF, that lists some bounding boxes that can be used for visualizations. Each object has the following keys:
  - `sidebar_rect`: the area of the page that contains a "sidebar" (if any), which contains depths or other data displayed to the side of material descriptions.
  - `depth_column_entries`: list of locations of the entries in the depth column (if any).
  - `material_description_rect`: the area of the page that contains all material descriptions.
  - `page`: the number of the page of the PDF.
- `groundwater`: a list of objects, one for each groundwater measurement that was extracted from the PDF. Each object has the following keys:
  - `date`: extracted date for the groundwater measurement (if any) as a string in YYYY-MM-DD format.
  - `depth`: the measured depth (in m) of the groundwater measurement.
  - `elevation`: the elevation (in m above sea level) of the groundwater measurement.
  - `page` and `rect`: the location in the PDF where the groundwater measurement was extracted from.

All page numbers are counted starting at 1.

All bounding boxes are measured with PDF points as the unit, and with the top-left of the page as the origin.
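Because the origin is the top-left of the page, mapping a bounding box onto a rasterized page image only requires scaling. The sketch below is a minimal illustration under two assumptions that the text above does not spell out: that each `rect` is ordered `[x0, y0, x1, y1]`, and that the image was rendered at a known resolution. The function name `rect_to_pixels` and its `dpi` parameter are hypothetical. The only fixed fact used is that one PDF point is 1/72 inch.

```python
PDF_POINTS_PER_INCH = 72  # one PDF point is 1/72 inch


def rect_to_pixels(rect, dpi=144):
    """Scale a rect from PDF points to integer pixel coordinates.

    Assumes the rect is ordered [x0, y0, x1, y1] and that the target image
    was rendered at `dpi` dots per inch with the same top-left origin.
    """
    scale = dpi / PDF_POINTS_PER_INCH
    x0, y0, x1, y1 = rect
    return (
        round(x0 * scale),
        round(y0 * scale),
        round(x1 * scale),
        round(y1 * scale),
    )
```

At 72 dpi the scale factor is 1, so pixel coordinates match the PDF points up to rounding.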

## Example output
```yaml
{
"B366.pdf": { # file name
"metadata": {
"elevation": {
"elevation": 355.35,
"page": 1,
"rect": [27.49843978881836, 150.2817840576172, 159.42971801757812, 160.76754760742188]
},
"coordinates": {
"E": 659490.0,
"N": 257200.0,
"rect": [28.263830184936523, 179.63882446289062, 150.3379364013672, 188.7487335205078],
"page": 1
},
"language": "de",
"page_dimensions": [
{
"width": 591.956787109375,
"height": 1030.426025390625
},
{
"width": 588.009521484375,
"height": 792.114990234375
}
]
},
"layers": [
{
"material_description": {
"text": "beiger, massig-dichter, stark dolomitisierter Kalk, mit Muschelresten",
"lines": [
{
"text": "beiger, massig-dichter, stark",
"page": 1,
"rect": [258.5303039550781, 345.9997253417969, 379.9410705566406, 356.1011657714844]
},
{
"text": "dolomitisierter Kalk, mit",
"page": 1,
"rect": [258.2362060546875, 354.4559326171875, 363.0706787109375, 364.295654296875]
},
{
"text": "Muschelresten",
"page": 1,
"rect": [258.48748779296875, 363.6712341308594, 313.03204345703125, 371.3343505859375]
}
],
"page": 1,
"rect": [258.2362060546875, 345.9997253417969, 379.9410705566406, 371.3343505859375]
},
"depth_interval": {
"start": {
"value": 1.5,
"rect": [200.63790893554688, 331.3035888671875, 207.83108520507812, 338.30450439453125]
},
"end": {
"value": 6.0,
"rect": [201.62551879882812, 374.30560302734375, 210.0361328125, 380.828857421875]
}
}
},
# ... (more layers)
],
"bounding_boxes": [
{
"sidebar_rect": [198.11251831054688, 321.8956298828125, 210.75906372070312, 702.2628173828125],
"depth_column_entries": [
[200.1201171875, 321.8956298828125, 208.59901428222656, 328.6802062988281],
[200.63790893554688, 331.3035888671875, 207.83108520507812, 338.30450439453125],
[201.62551879882812, 374.30560302734375, 210.0361328125, 380.828857421875],
[199.86251831054688, 434.51556396484375, 210.10894775390625, 441.4538879394531],
[198.11251831054688, 557.5472412109375, 210.35877990722656, 563.9244995117188],
[198.28451538085938, 582.0216674804688, 209.76953125, 588.7603759765625],
[198.7814178466797, 616.177001953125, 209.50042724609375, 622.502197265625],
[198.6378173828125, 663.2830810546875, 210.75906372070312, 669.5428466796875],
[198.26901245117188, 695.974609375, 209.12693786621094, 702.2628173828125]
],
"material_description_rect": [256.777099609375, 345.9997253417969, 392.46051025390625, 728.2700805664062],
"page": 1
},
{
"sidebar_rect": null,
"depth_column_entries": [],
"material_description_rect": [192.3216094970703, 337.677978515625, 291.1827392578125, 633.6331176757812],
"page": 2
}
],
"groundwater": [
{
"date": "1979-11-29",
"depth": 19.28,
"elevation": 336.07,
"page": 1,
"rect": [61.23963928222656, 489.3185119628906, 94.0096435546875, 513.6478881835938]
}
]
}
}
```
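To show how the structure above can be traversed, here is a small sketch that yields the depth interval and description text of each layer for a single file entry. The generator name `layer_summaries` is hypothetical, and the sketch assumes that an undetected `start` or `end` is serialized as `null` (or omitted) rather than in some other form.

```python
def layer_summaries(file_prediction):
    """Yield (start, end, text) for each layer of one file's predictions.

    `start` and `end` are None when the corresponding depth is null or
    absent (an assumption about how undetected depths are serialized).
    """
    for layer in file_prediction.get("layers", []):
        interval = layer.get("depth_interval") or {}
        start = (interval.get("start") or {}).get("value")
        end = (interval.get("end") or {}).get("value")
        text = layer["material_description"]["text"]
        yield start, end, text
```

For the `B366.pdf` example, the first tuple produced would contain the depths 1.5 and 6.0 together with the full material description text.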
1 change: 0 additions & 1 deletion pyproject.toml
@@ -12,7 +12,6 @@ dependencies = [
"boto3",
"pandas",
"levenshtein",
"pathlib",
"python-dotenv",
"setuptools",
"tqdm",
24 changes: 10 additions & 14 deletions src/stratigraphy/annotations/draw.py
@@ -8,8 +8,7 @@
import pandas as pd
from dotenv import load_dotenv
from stratigraphy.data_extractor.data_extractor import FeatureOnPage
from stratigraphy.depthcolumn.depthcolumn import DepthColumn
from stratigraphy.depths_materials_column_pairs.depths_materials_column_pairs import DepthsMaterialsColumnPairs
from stratigraphy.depths_materials_column_pairs.bounding_boxes import BoundingBoxes
from stratigraphy.groundwater.groundwater_extraction import Groundwater
from stratigraphy.layer.layer import Layer
from stratigraphy.metadata.coordinate_extraction import Coordinate
@@ -55,7 +54,7 @@ def draw_predictions(
for file_prediction in predictions.file_predictions_list:
logger.info("Drawing predictions for file %s", file_prediction.file_name)

depths_materials_column_pairs = file_prediction.depths_materials_columns_pairs
bounding_boxes = file_prediction.bounding_boxes
coordinates = file_prediction.metadata.coordinates
elevation = file_prediction.metadata.elevation

@@ -98,7 +97,7 @@ def draw_predictions(
draw_depth_columns_and_material_rect(
shape,
page.derotation_matrix,
[pair for pair in depths_materials_column_pairs if pair.page == page_number],
[bboxes for bboxes in bounding_boxes if bboxes.page == page_number],
)
draw_material_descriptions(
shape,
@@ -245,7 +244,7 @@ def draw_material_descriptions(shape: fitz.Shape, derotation_matrix: fitz.Matrix


def draw_depth_columns_and_material_rect(
shape: fitz.Shape, derotation_matrix: fitz.Matrix, depths_materials_column_pairs: list[DepthsMaterialsColumnPairs]
shape: fitz.Shape, derotation_matrix: fitz.Matrix, bounding_boxes: list[BoundingBoxes]
):
"""Draw depth columns as well as the material rects on a pdf page.
@@ -257,25 +256,22 @@ def draw_depth_columns_and_material_rect(
Args:
shape (fitz.Shape): The shape object for drawing.
derotation_matrix (fitz.Matrix): The derotation matrix of the page.
depths_materials_column_pairs (list): List of depth column entries.
bounding_boxes (list[BoundingBoxes]): List of bounding boxes for depth column and material descriptions.
"""
for pair in depths_materials_column_pairs:
depth_column: DepthColumn = pair.depth_column
material_description_rect = pair.material_description_rect

if depth_column: # Draw rectangle for depth columns
for bboxes in bounding_boxes:
if bboxes.sidebar_bbox: # Draw rectangle for depth columns
shape.draw_rect(
fitz.Rect(depth_column.rect()) * derotation_matrix,
fitz.Rect(bboxes.sidebar_bbox.rect) * derotation_matrix,
)
shape.finish(color=fitz.utils.getColor("green"))
for depth_column_entry in depth_column.entries: # Draw rectangle for depth column entries
for depth_column_entry in bboxes.depth_column_entry_bboxes: # Draw rectangle for depth column entries
shape.draw_rect(
fitz.Rect(depth_column_entry.rect) * derotation_matrix,
)
shape.finish(color=fitz.utils.getColor("purple"))

shape.draw_rect( # Draw rectangle for material description column
fitz.Rect(material_description_rect) * derotation_matrix,
bboxes.material_description_bbox.rect * derotation_matrix,
)
shape.finish(color=fitz.utils.getColor("red"))


1 comment on commit b66b08e

@github-actions

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| **src/stratigraphy** | | | | |
| `__init__.py` | 8 | 1 | 88% | 11 |
| `extract.py` | 167 | 167 | 0% | 3–446 |
| `get_files.py` | 19 | 19 | 0% | 3–47 |
| `main.py` | 114 | 114 | 0% | 3–307 |
| **src/stratigraphy/benchmark** | | | | |
| `metrics.py` | 59 | 42 | 29% | 22–25, 29–32, 36–39, 46–49, 53–54, 58, 65–74, 78–91, 96–133 |
| **src/stratigraphy/data_extractor** | | | | |
| `data_extractor.py` | 74 | 4 | 95% | 33, 46, 123, 168 |
| **src/stratigraphy/depthcolumn** | | | | |
| `depthcolumnentry.py` | 68 | 12 | 82% | 21, 25, 37, 71–72, 86, 95, 107–109, 130, 143 |
| **src/stratigraphy/depths_materials_column_pairs** | | | | |
| `bounding_boxes.py` | 30 | 10 | 67% | 23, 32, 50, 60, 72–78 |
| `material_description_rect_with_sidebar.py` | 18 | 8 | 56% | 27–41 |
| **src/stratigraphy/evaluation** | | | | |
| `evaluation_dataclasses.py` | 49 | 11 | 78% | 52, 71–74, 90, 104, 125–131, 147 |
| `groundwater_evaluator.py` | 48 | 1 | 98% | 77 |
| `layer_evaluator.py` | 66 | 46 | 30% | 29–30, 35–39, 47, 69–95, 105–113, 128–149 |
| `metadata_evaluator.py` | 37 | 14 | 62% | 46–65, 86–93 |
| `utility.py` | 16 | 7 | 56% | 43–52 |
| **src/stratigraphy/groundwater** | | | | |
| `groundwater_extraction.py` | 156 | 99 | 37% | 52, 94, 127–132, 140, 167–171, 186–206, 217–306, 322–354 |
| `utility.py` | 39 | 33 | 15% | 10–17, 30–47, 59–73, 88–102 |
| **src/stratigraphy/layer** | | | | |
| `layer.py` | 37 | 13 | 65% | 26, 29, 37, 52–72 |
| **src/stratigraphy/lines** | | | | |
| `geometric_line_utilities.py` | 86 | 2 | 98% | 81, 131 |
| `line.py` | 51 | 4 | 92% | 25, 50, 60, 110 |
| `linesquadtree.py` | 46 | 1 | 98% | 75 |
| **src/stratigraphy/metadata** | | | | |
| `coordinate_extraction.py` | 106 | 4 | 96% | 29, 93–94, 106 |
| `elevation_extraction.py` | 90 | 60 | 33% | 34–39, 47, 55, 63, 79–87, 124–138, 150–153, 165–197, 212–220, 228–232 |
| `language_detection.py` | 18 | 13 | 28% | 17–23, 37–45 |
| `metadata.py` | 66 | 24 | 64% | 27, 83, 101–127, 146–155, 195–198, 206 |
| **src/stratigraphy/sidebar** | | | | |
| `a_above_b_sidebar.py` | 95 | 40 | 58% | 39, 45, 64–72, 83, 88, 95, 108, 113–120, 135–136, 178–219 |
| `a_above_b_sidebar_extractor.py` | 29 | 3 | 90% | 45–47 |
| `a_above_b_sidebar_validator.py` | 41 | 20 | 51% | 48, 58, 61, 81–84, 109–127, 139–148 |
| `a_to_b_sidebar.py` | 47 | 16 | 66% | 37, 40, 53–54, 71, 99–114 |
| `layer_identifier_sidebar.py` | 51 | 32 | 37% | 23–24, 27, 59–78, 94–110, 122, 135 |
| `layer_identifier_sidebar_extractor.py` | 42 | 33 | 21% | 30–40, 54–86 |
| `sidebar.py` | 40 | 1 | 98% | 84 |
| **src/stratigraphy/text** | | | | |
| `description_block_splitter.py` | 70 | 2 | 97% | 24, 139 |
| `extract_text.py` | 29 | 3 | 90% | 19, 53–54 |
| `find_description.py` | 41 | 8 | 80% | 26–34, 111–114 |
| `textblock.py` | 90 | 11 | 88% | 22, 27, 39, 44, 71, 79, 104, 116, 139, 160, 189 |
| **src/stratigraphy/util** | | | | |
| `dataclasses.py` | 32 | 3 | 91% | 37–39 |
| `interval.py` | 114 | 65 | 43% | 28–31, 36–39, 45, 51, 55, 94–140, 157, 163–179, 196–214 |
| `predictions.py` | 72 | 34 | 53% | 72, 95–115, 143–187 |
| `util.py` | 34 | 13 | 62% | 41, 69–76, 90–92, 116–117 |
| **TOTAL** | 2377 | 993 | 58% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 99 | 0 💤 | 0 ❌ | 0 🔥 | 8.287s ⏱️ |
