Feat/configure pipeline with arguments #21

Merged · 20 commits · Apr 5, 2024
README.md: 6 additions & 7 deletions
@@ -40,18 +40,18 @@ To execute the data extraction pipeline, follow these steps:

`conda activate boreholes-dev`

2. **Run the main script**
2. **Run the extraction script**

The main script for the extraction pipeline is located at `src/stratigraphy/main.py`. Run this script to start the extraction process.
The main script for the extraction pipeline is located at `src/stratigraphy/main.py`. A CLI command is provided to run this script.

This script will source all PDFs from the `data/Benchmark` directory and create PNG files in the `data/Benchmark/extract` directory.
Run `boreholes-extract-layers` to start the extraction. With the default options, the command sources all PDFs from the `data/Benchmark` directory and creates PNG files in the `data/Benchmark/extract` directory.

Use `boreholes-extract-layers --help` to see all options for the extraction script; a sketch of how such a command might be defined is shown after these steps.

3. **Check the results**

Once the script has finished running, you can check the results in the `data/Benchmark/extract` directory. The result is a `predictions.json` file as well as a png file for each page of each PDF in the `data/Benchmark` directory.

Please note that for now the pipeline assumes that all PDF files to be analyzed are placed in the `data/Benchmark` directory. If you want to analyze different files, please place them in this directory.
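To make the new command-line interface more concrete, below is a minimal sketch of how a `click`-based `start_pipeline` command might look in `src/stratigraphy/main.py`. The option names and defaults (`--input-directory`, `--out-directory`) are assumptions for illustration only; use `boreholes-extract-layers --help` to see the real options.

```python
# Hypothetical sketch of the click entry point; option names are illustrative,
# not the project's actual flags.
from pathlib import Path

import click


@click.command()
@click.option(
    "--input-directory",
    type=click.Path(exists=True, path_type=Path),
    default=Path("data/Benchmark"),
    help="Directory containing the PDF files to analyze.",
)
@click.option(
    "--out-directory",
    type=click.Path(path_type=Path),
    default=Path("data/Benchmark/extract"),
    help="Directory where predictions.json and the PNG files are written.",
)
def start_pipeline(input_directory: Path, out_directory: Path) -> None:
    """Run the borehole layer extraction pipeline."""
    # Source the PDFs, run the extraction, and write the outputs.
    ...
```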

### Output Structure
The `predictions.json` file contains the results of the data extraction process from the PDF files. Each key in the JSON object is the name of a PDF file, and the value is a list of extracted items, each represented as a dictionary-like object. For now, the extracted items are the material descriptions in their correct order (as given by their depths).
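As a rough illustration of this structure, the snippet below loads the file and iterates over its entries. The per-item field names (`material_description`, `depth_interval`) and the file location are assumptions, not a documented schema; inspect your own `predictions.json` for the actual keys.

```python
# Illustrative only: field names inside each extracted item are assumed.
import json
from pathlib import Path

predictions_path = Path("data/Benchmark/extract/predictions.json")  # assumed location
with open(predictions_path) as in_file:
    predictions = json.load(in_file)  # {"<pdf name>": [<extracted item>, ...], ...}

for pdf_name, items in predictions.items():
    print(pdf_name)
    for item in items:
        # Hypothetical keys, shown only to sketch the shape of the data.
        print(item.get("material_description"), item.get("depth_interval"))
```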

@@ -149,7 +149,7 @@ The project structure and the most important files are as follows:
- `src/` : The source code of the project.
- `stratigraphy/` : The main package of the project.
- `main.py` : The main script of the project. This script runs the data extraction pipeline.
- `line_detection.py`: This script runs the line detection on provided sample pdfs. Will be deprecated in the future.
- `line_detection.py`: Contains functionality for line detection on PDF pages.
- `util/` : Utility scripts and modules.
- `benchmark/` : Scripts to evaluate the data extraction.
- `data/` : The data used by the project.
@@ -164,7 +164,6 @@ The project structure and the most important files are as follows:

- `main.py` : This is the main script of the project. It runs the data extraction pipeline, which analyzes the PDF files in the `data/Benchmark` directory and saves the results in the `predictions.json` file.

- `line_detection.py` : Runs the line detection algorithm on pdfs using `lsd` from opencv. It is meant to find all lines that potentially separate two material descriptions. It is incorporated in the script `main.py` and will be deprecated as a standalone script in the future.
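For context on the line detection mentioned above, here is a minimal sketch of OpenCV's LSD detector applied to an already rendered page image; the real `line_detection.py` may preprocess, parameterize, and filter the segments differently.

```python
# Minimal LSD sketch; assumes the PDF page has already been rendered to a BGR image.
import cv2
import numpy as np


def detect_line_segments(page_image: np.ndarray) -> np.ndarray:
    """Return detected line segments as an (N, 4) array of x1, y1, x2, y2."""
    gray = cv2.cvtColor(page_image, cv2.COLOR_BGR2GRAY)
    lsd = cv2.createLineSegmentDetector()
    lines, _widths, _precisions, _nfas = lsd.detect(gray)
    if lines is None:
        return np.empty((0, 4))
    return lines.reshape(-1, 4)
```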

## Experiment Tracking
We perform experiment tracking using MLFlow. Each developer has their own local MLFlow instance.
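As a minimal sketch of what a tracked run against such a local instance could look like (the experiment name and logged values are placeholders, not the pipeline's actual ones):

```python
# Placeholder experiment name, parameters, and metrics; adjust to the real pipeline.
import mlflow

mlflow.set_tracking_uri("file:./mlruns")  # local, per-developer tracking store
mlflow.set_experiment("boreholes-extraction")

with mlflow.start_run():
    mlflow.log_param("input_directory", "data/Benchmark")
    mlflow.log_metric("f1", 0.0)  # e.g. the overall F1 from the evaluation
```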
environment-dev.yml: 2 additions & 1 deletion
@@ -10,14 +10,15 @@ dependencies:
- pathlib==1.0.1
- opencv==4.9.0
- python-dotenv==1.0.1
- pytest==8.1.1
- click==8.1.7
- pip
# dev dependencies
- matplotlib==3.8.0
- isort==5.13.2
- jupyterlab==4.1.3
- black==24.2.0
- pre-commit==3.6.2
- pytest==8.1.1
- pip:
# prod pip dependencies; needs to be a strict copy of environment-prod.yml
- amazon-textract-textractor
environment-prod.yml: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ dependencies:
- pathlib==1.0.1
- opencv==4.9.0
- python-dotenv==1.0.1
- click==8.1.7
- pip
- pip:
- amazon-textract-textractor
pyproject.toml: 3 additions & 0 deletions
@@ -7,6 +7,9 @@ requires-python = ">=3.10"
dependencies = [
]

[project.scripts]
boreholes-extract-layers = "stratigraphy.main:start_pipeline"

[tool.ruff.lint]
select = [
# pydocstyle
src/stratigraphy/benchmark/score.py: 8 additions & 20 deletions
@@ -1,6 +1,5 @@
"""Evaluate the predictions against the ground truth."""

import json
import logging
import os
from pathlib import Path
@@ -9,7 +8,6 @@
from dotenv import load_dotenv
from stratigraphy import DATAPATH
from stratigraphy.benchmark.ground_truth import GroundTruth
from stratigraphy.util.draw import draw_predictions
from stratigraphy.util.util import parse_text

load_dotenv()
@@ -56,32 +54,20 @@ def f1(precision: float, recall: float) -> float:
return 0


def evaluate_matching(
predictions_path: Path, ground_truth_path: Path, directory: Path, out_directory: Path
) -> tuple[dict, pd.DataFrame]:
def evaluate_matching(predictions: dict, number_of_truth_values: dict) -> tuple[dict, pd.DataFrame]:
"""Calculate F1, precision and recall for the predictions.

Calculate F1, precision and recall for the individual documents as well as overall.
The individual document metrics are returned as a DataFrame.

Args:
predictions_path (Path): Path to the predictions.json file.
ground_truth_path (Path): Path to the ground truth annotated data.
directory (Path): Path to the directory containing the pdf files.
out_directory (Path): Path to the directory where the evaluation images should be saved.
predictions (dict): The predictions.
number_of_truth_values (dict): The number of ground truth values per file.

Returns:
tuple[dict, pd.DataFrame]: A tuple containing the overall F1, precision and recall as a dictionary and the
individual document metrics as a DataFrame.
"""
ground_truth = GroundTruth(ground_truth_path)
with open(predictions_path) as in_file:
predictions = json.load(in_file)

predictions, number_of_truth_values = _add_ground_truth_to_predictions(predictions, ground_truth)

draw_predictions(predictions, directory, out_directory)

document_level_metrics = {
"document_name": [],
"F1": [],
@@ -135,16 +121,18 @@ def evaluate_matching(
}, pd.DataFrame(document_level_metrics)


def _add_ground_truth_to_predictions(predictions: dict, ground_truth: GroundTruth) -> (dict, dict):
def add_ground_truth_to_predictions(predictions: dict, ground_truth_path: Path) -> tuple[dict, dict]:
"""Add the ground truth to the predictions.

Args:
predictions (dict): The predictions.
ground_truth (GroundTruth): The ground truth.
ground_truth_path (Path): The path to the ground truth file.

Returns:
(dict, dict): The predictions with the ground truth added, and the number of ground truth values per file.
tuple[dict, dict]: The predictions with the ground truth added, and the number of ground truth values per file.
"""
ground_truth = GroundTruth(ground_truth_path)

number_of_truth_values = {}
for file, file_predictions in predictions.items():
ground_truth_for_file = ground_truth.for_file(file)
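With this refactor, loading the predictions, attaching the ground truth, and drawing the evaluation images move out of `evaluate_matching` and into the caller (presumably `main.py`). A hedged sketch of how the pieces might be wired together, with the file paths as assumptions:

```python
# Sketch only: paths and the call site are assumptions; function names and
# signatures follow the refactored score.py shown above.
import json
from pathlib import Path

from stratigraphy.benchmark.score import add_ground_truth_to_predictions, evaluate_matching
from stratigraphy.util.draw import draw_predictions

predictions_path = Path("data/Benchmark/extract/predictions.json")  # assumed
ground_truth_path = Path("data/Benchmark/ground_truth.json")        # assumed
pdf_directory = Path("data/Benchmark")
out_directory = Path("data/Benchmark/extract")

with open(predictions_path) as in_file:
    predictions = json.load(in_file)

# Ground truth is now attached outside evaluate_matching.
predictions, number_of_truth_values = add_ground_truth_to_predictions(predictions, ground_truth_path)

# Drawing the per-page evaluation images also happens at the call site now.
draw_predictions(predictions, pdf_directory, out_directory)

# Overall metrics (F1 = 2 * precision * recall / (precision + recall)) plus a
# per-document DataFrame.
metrics, document_level_metrics = evaluate_matching(predictions, number_of_truth_values)
print(metrics)
```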