Close LGVISIUM-52: Remove grouping by page #70

Merged
merged 8 commits into from
Aug 20, 2024
35 changes: 35 additions & 0 deletions .vscode/launch.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python: Run boreholes-extract-all",
"type": "debugpy",
"request": "launch",
"module": "src.stratigraphy.main",
"args": [
"-i", "data/zurich",
"-g", "data/zurich_ground_truth.json"
],
"cwd": "${workspaceFolder}",
"justMyCode": true,
"python": "./swisstopo/bin/python3",
},
{
"name": "Python: Run label studio to GT",
"type": "debugpy",
"request": "launch",
"module": "src.scripts.label_studio_annotation_to_ground_truth",
"args": [
// "-a", "/Users/david.cleres/Downloads/project-2-at-2024-08-15-12-37-dd0f900a.json",
"-a", "/Users/david.cleres/Downloads/project-2-at-2024-08-15-13-55-e7d6ebf7.json",
"-o", "data/label-studio/zurich_ground_truth.json"
],
"cwd": "${workspaceFolder}",
"justMyCode": true,
"python": "./swisstopo/bin/python3",
}
]
}
16 changes: 16 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,16 @@
{
"cSpell.words": [
"DATAPATH",
"depthcolumn",
"depthcolumnentry",
"dotenv",
"fitz",
"mlflow",
"pixmap",
"pyproject",
"swissgeol",
"swisstopo",
"textblock",
"venv"
]
}
146 changes: 76 additions & 70 deletions README.md
@@ -6,7 +6,7 @@ Boreholes Data Extraction is a pipeline to extract structured data from borehole

The Federal Office of Topography swisstopo is Switzerland's geoinformation centre. The Swiss Geological Survey at swisstopo is the federal competence centre for the collection, analysis, storage, and provision of geological data of national interest.

Data from boreholes is an essential source for our knowledge about the subsurface. In order to manage and publish borehole data of national interest, swisstopo has developed the application boreholes.swissgeol.ch (currently for internal use only), part of the [swissgeol.ch](https://swissgeol.ch) platform. As of August 2024, over 30.000 boreholes are registered in the application database, a number that is rapidly increasing thanks to an improved data exchange with cantonal offices, other government agencies and federal corporations such as the Swiss Federal Railways SBB. In the coming years, the number of boreholes in the database is expected to keep increasing to a multiple of the current size. Data is being added from both boreholes that were recently constructued and documented, as well as from older boreholes that were until now only documented in separate databases or in analogue archives. Data from older boreholes can still be very relevant, as geology only changes very slowly, and newer data is often unavailable (and expensive to collect).
Data from boreholes is an essential source for our knowledge about the subsurface. In order to manage and publish borehole data of national interest, swisstopo has developed the application boreholes.swissgeol.ch (currently for internal use only), part of the [swissgeol.ch](https://swissgeol.ch) platform. As of August 2024, over 30.000 boreholes are registered in the application database, a number that is rapidly increasing thanks to an improved data exchange with cantonal offices, other government agencies and federal corporations such as the Swiss Federal Railways SBB. In the coming years, the number of boreholes in the database is expected to keep increasing to a multiple of the current size. Data is being added from both boreholes that were recently constructed and documented, as well as from older boreholes that were until now only documented in separate databases or in analogue archives. Data from older boreholes can still be very relevant, as geology only changes very slowly, and newer data is often unavailable (and expensive to collect).

In order to use the collected borehole data efficiently, it is critical that both metadata as well as geological information is digitally stored in a structured database. However, the relevant data for most boreholes that are received by swisstopo, is contained in PDF-files that lack a standardized structure. Older data is often only available in the form of a scanned image, obtained from a printed document or from a microfiche. Manually entering all the relevant data from these various sources into a structured database is not feasible, given the large amount of boreholes and the continuous influx of new data.

@@ -110,85 +110,91 @@ Example: predictions.json
{
"685256002-bp.pdf": { # file name
"language": "de",
"metadata": {"coordinates": {"E": 117146, "N": 100388}},
"page_1": {
"layers": [ # a layer corresponds to a material layer in the borehole profile
{
"material_description": { # all information about the complete description of the material of the layer
"text": "grauer, siltig-sandiger Kies (Auffullung)",
"metadata": {
"coordinates": null
},
"layers": [ # a layer corresponds to a material layer in the borehole profile
{
"material_description": { # all information about the complete description of the material of the layer
"text": "grauer, siltig-sandiger Kies (Auffullung)",
"rect": [
232.78799438476562,
130.18496704101562,
525.6640014648438,
153.54295349121094
],
"lines": [
{
"text": "grauer, siltig-sandiger Kies (Auffullung)",
"rect": [
232.78799438476562,
130.18496704101562,
525.6640014648438,
153.54295349121094
],
"page": 1
}
],
"page": 1
},
"depth_interval": { # information about the depth of the layer
"start": null,
"end": {
"value": 0.4,
"rect": [
232.78799438476562,
130.18496704101562,
525.6640014648438,
153.54295349121094
125.25399780273438,
140.2349853515625,
146.10398864746094,
160.84498596191406
],
"lines": [
{
"text": "grauer, siltig-sandiger Kies (Auffullung)",
"rect": [
232.78799438476562,
130.18496704101562,
525.6640014648438,
153.54295349121094
]
}
]
},
"depth_interval": { # information about the depth of the layer
"start": null,
"end": {
"page": 1
}
}
},
...
],
"depths_materials_column_pairs": [ # information about where on the pdf the information for material description as well as depths are taken.
{
"depth_column": {
"rect": [
119.05999755859375,
140.2349853515625,
146.8470001220703,
1014.4009399414062
],
"entries": [
{
"value": 0.4,
"rect": [
125.25399780273438,
140.2349853515625,
146.10398864746094,
160.84498596191406
]
}
}
},
...
],
"depths_materials_column_pairs": [ # information about where on the pdf the information for material description as well as depths are taken.
{
"depth_column": {
"rect": [
119.05999755859375,
140.2349853515625,
146.8470001220703,
1014.4009399414062
],
"entries": [
{
"value": 0.4,
"rect": [
125.25399780273438,
140.2349853515625,
146.10398864746094,
160.84498596191406
]
},
{
"value": 0.6,
"rect": [
125.21800231933594,
153.8349609375,
146.0679931640625,
174.44496154785156
]
}
]
},
"material_description_rect": [
231.22500610351562,
130.18496704101562,
540.6109619140625,
897.7429809570312
],
"page": 1
},
{
"value": 0.6,
"rect": [
125.21800231933594,
153.8349609375,
146.0679931640625,
174.44496154785156
],
"page": 1
},
...
]
}
]
}
}
}
],
"page_dimensions": [
{
"height": 1192.0999755859375,
"width": 842.1500244140625
}
]
},
}
```
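With the page-level grouping removed, each layer records its own `page` field instead of living under a `page_N` key. A minimal sketch of how a consumer could regroup layers by page, assuming the structure shown in the example above; the helper name `layers_by_page` is hypothetical and not part of the repository:

```python
from collections import defaultdict


def layers_by_page(file_prediction: dict) -> dict[int, list[dict]]:
    """Group the flat "layers" list by the page of each material description."""
    grouped: dict[int, list[dict]] = defaultdict(list)
    for layer in file_prediction.get("layers", []):
        # Assumption: every material description carries a "page" entry,
        # as in the README example.
        page = layer["material_description"]["page"]
        grouped[page].append(layer)
    return dict(grouped)


# Trimmed-down entry following the example above (most fields omitted for illustration).
example = {
    "language": "de",
    "metadata": {"coordinates": None},
    "layers": [
        {"material_description": {"text": "grauer, siltig-sandiger Kies (Auffullung)", "page": 1}},
    ],
    "page_dimensions": [{"height": 1192.0999755859375, "width": 842.1500244140625}],
}

grouped = layers_by_page(example)
```

Code that previously iterated over `page_1`, `page_2`, and so on could call such a helper once per file and keep its page-wise logic unchanged.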

50 changes: 34 additions & 16 deletions src/scripts/label_studio_annotation_to_ground_truth.py
@@ -5,12 +5,13 @@
import logging
from collections import defaultdict
from pathlib import Path
from typing import Any

import click
import fitz
from stratigraphy.util.coordinate_extraction import Coordinate
from stratigraphy.util.interval import AnnotatedInterval
from stratigraphy.util.predictions import BoreholeMetaData, FilePredictions, LayerPrediction, PagePredictions
from stratigraphy.util.predictions import BoreholeMetaData, FilePredictions, LayerPrediction
from stratigraphy.util.textblock import MaterialDescription

logger = logging.getLogger(__name__)
@@ -35,11 +36,15 @@ def convert_annotations_to_ground_truth(annotation_file_path: Path, output_path:
for prediction in file_predictions:
ground_truth = {**ground_truth, **prediction.convert_to_ground_truth()}

# check if the output path exists
if not output_path.parent.exists():
output_path.parent.mkdir(parents=True)

with open(output_path, "w") as f:
json.dump(ground_truth, f, indent=4)
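
Review note: the existence check before `mkdir` can be folded into the call itself. A minimal sketch, assuming the same `output_path: Path` argument; `write_ground_truth` is a hypothetical stand-in for the tail of `convert_annotations_to_ground_truth`, not a function from the repository:

```python
import json
import tempfile
from pathlib import Path


def write_ground_truth(ground_truth: dict, output_path: Path) -> None:
    """Write the ground truth as JSON, creating missing parent directories."""
    # exist_ok=True makes the call a no-op when the directory already exists,
    # so no separate exists() check is needed.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(ground_truth, f, indent=4)


# Usage with a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "label-studio" / "zurich_ground_truth.json"
    write_ground_truth({"example.pdf": {"layers": []}}, target)
    written = json.loads(target.read_text())
```

This also avoids the small race between the `exists()` check and the `mkdir()` call.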


def create_from_label_studio(annotation_results: dict):
def create_from_label_studio(annotation_results: dict) -> list[FilePredictions]:
"""Create predictions class for a file given the annotation results from Label Studio.

This method is meant to import annotations from label studio. The primary use case is to
@@ -57,11 +62,11 @@ def create_from_label_studio(annotation_results: dict):
list[FilePredictions]: A list of FilePredictions objects, one for each file present in the
annotation_results.
"""
file_pages = defaultdict(list)
file_predictions = defaultdict(list)
metadata = {}
for annotation in annotation_results:
# get page level information
file_name, page_index = _get_file_name_and_page_index(annotation)
file_name, _ = _get_file_name_and_page_index(annotation)
page_width = annotation["annotations"][0]["result"][0]["original_width"]
page_height = annotation["annotations"][0]["result"][0]["original_height"]

@@ -141,22 +146,27 @@ def create_from_label_studio(annotation_results: dict):
coordinate_text = coordinates.popitem()[1]["text"]
# TODO: we could extract the rectangle as well. For conversion to ground truth this does not matter.
metadata[file_name] = BoreholeMetaData(coordinates=_get_coordinates_from_text(coordinate_text))
file_pages[file_name].append(
PagePredictions(layers=layers, page_number=page_index, page_width=page_width, page_height=page_height)
)

file_predictions = []
for file_name, page_predictions in file_pages.items():
file_predictions.append(
FilePredictions(

# create the page prediction object
if file_name in file_predictions:
# append the page predictions to the existing file predictions
file_predictions[file_name].layers.extend(layers)
file_predictions[file_name].page_sizes.append({"width": page_width, "height": page_height})
else:
# create a new file prediction object if it does not exist yet
file_predictions[file_name] = FilePredictions(
layers=layers,
file_name=f"{file_name}.pdf",
pages=page_predictions,
language="unknown",
metadata=metadata.get(file_name),
page_sizes=[{"width": page_width, "height": page_height}],
)
) # TODO: language should not be required here.

return file_predictions
file_predictions_list = []
for _, file_prediction in file_predictions.items():
file_predictions_list.append(file_prediction) # TODO: language should not be required here.

return file_predictions_list
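
Review note: the per-file accumulation above can be illustrated without the project's classes. A sketch using a hypothetical `FileAccumulator` dataclass as a stand-in for `FilePredictions` (the real class has more fields and a different constructor); the tuple layout passed to `accumulate` is likewise an assumption made for the example:

```python
from dataclasses import dataclass, field


@dataclass
class FileAccumulator:
    """Hypothetical stand-in for FilePredictions: one object per file, not per page."""

    file_name: str
    layers: list[str] = field(default_factory=list)
    page_sizes: list[dict[str, float]] = field(default_factory=list)


def accumulate(pages: list[tuple[str, list[str], float, float]]) -> list[FileAccumulator]:
    """Merge per-page results (file name, layers, width, height) into per-file objects."""
    by_file: dict[str, FileAccumulator] = {}
    for file_name, layers, width, height in pages:
        # setdefault stores a new file-level object only the first time a file name is seen.
        entry = by_file.setdefault(file_name, FileAccumulator(file_name=f"{file_name}.pdf"))
        entry.layers.extend(layers)
        entry.page_sizes.append({"width": width, "height": height})
    return list(by_file.values())


result = accumulate([
    ("685256002-bp", ["layer a"], 842.15, 1192.1),
    ("685256002-bp", ["layer b", "layer c"], 842.15, 1192.1),
])
```

A plain `dict` is used here rather than `defaultdict(list)`, since the stored values are objects and the default list factory would never be the right value for a missing key.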


def _get_coordinates_from_text(text: str) -> Coordinate | None:
@@ -186,7 +196,15 @@ def _get_start_end_from_text(text: str) -> tuple[float, float]:
return float(start), float(end)


def _get_file_name_and_page_index(annotation):
def _get_file_name_and_page_index(annotation: dict[str, Any]) -> tuple[str, int]:
"""Extract the file name and page index from the annotation.

Args:
annotation (dict): The annotation dictionary. Exported from Label Studio.

Returns:
tuple[str, int]: The file name and the page index (zero-based).
"""
file_name = annotation["data"]["ocr"].split("/")[-1]
file_name = file_name.split(".")[0]
return file_name.split("_")
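
Review note: as written, `file_name.split("_")` returns a list of strings, which does not match the annotated `tuple[str, int]`. A sketch of one way to satisfy the annotation, assuming the Label Studio task path ends in `<file name>_<page index>.<extension>` (that naming scheme is inferred from the split on `_`, not confirmed by the source); `parse_file_name_and_page_index` is a hypothetical name:

```python
from typing import Any


def parse_file_name_and_page_index(annotation: dict[str, Any]) -> tuple[str, int]:
    """Return the file name and the integer page index encoded in the OCR image path."""
    stem = annotation["data"]["ocr"].split("/")[-1].split(".")[0]
    # rpartition splits on the last underscore, so file names that contain
    # underscores themselves are kept intact.
    file_name, _, page_index = stem.rpartition("_")
    return file_name, int(page_index)


# The path below is made up for illustration.
parsed = parse_file_name_and_page_index({"data": {"ocr": "/data/upload/685256002-bp_0.png"}})
```

Callers that unpack two values keep working, and the page index arrives as an `int` as the docstring promises.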
3 changes: 2 additions & 1 deletion src/stratigraphy/benchmark/score.py
@@ -252,7 +252,8 @@ def evaluate_layer_extraction(predictions: dict, number_of_truth_values: dict) -


def create_predictions_objects(
predictions: dict, ground_truth_path: Path | None
predictions: dict,
ground_truth_path: Path | None,
) -> tuple[dict[FilePredictions], dict]:
"""Create predictions objects from the predictions and evaluate them against the ground truth.