Remove line_detection script
redur committed Apr 3, 2024
1 parent 3778b88 commit f8d6db8
Showing 2 changed files with 2 additions and 39 deletions.
5 changes: 1 addition & 4 deletions README.md
@@ -52,8 +52,6 @@ To execute the data extraction pipeline, follow these steps:

Once the script has finished running, you can check the results in the `data/Benchmark/extract` directory. The result is a `predictions.json` file as well as a png file for each page of each PDF in the `data/Benchmark` directory.

-Please note that for now the pipeline assumes that all PDF files to be analyzed are placed in the `data/Benchmark` directory. If you want to analyze different files, please place them in this directory.
-
### Output Structure
The `predictions.json` file contains the results of a data extraction process from PDF files. Each key in the JSON object is the name of a PDF file, and the value is a list of extracted items in a dictionary like object. The extracted items for now are the material descriptions in their correct order (given by their depths).

@@ -151,7 +149,7 @@ The project structure and the most important files are as follows:
- `src/` : The source code of the project.
- `stratigraphy/` : The main package of the project.
- `main.py` : The main script of the project. This script runs the data extraction pipeline.
-- `line_detection.py`: This script runs the line detection on provided sample pdfs. Will be deprecated in the future.
+- `line_detection.py`: Contains functionalities for line detection on pdf pages.
- `util/` : Utility scripts and modules.
- `benchmark/` : Scripts to evaluate the data extraction.
- `data/` : The data used by the project.
@@ -166,7 +164,6 @@ The project structure and the most important files are as follows:

- `main.py` : This is the main script of the project. It runs the data extraction pipeline, which analyzes the PDF files in the `data/Benchmark` directory and saves the results in the `predictions.json` file.

-- `line_detection.py` : Runs the line detection algorithm on pdfs using `lsd` from opencv. It is meant to find all lines that potentially separate two material descriptions. It is incorporated in the script `main.py` and will be deprecated as a standalone script in the future.

## Experiment Tracking
We perform experiment tracking using MLFlow. Each developer has his own local MLFlow instance.
36 changes: 1 addition & 35 deletions src/stratigraphy/line_detection.py
@@ -9,15 +9,14 @@
from dotenv import load_dotenv
from numpy.typing import ArrayLike

-from stratigraphy import DATAPATH
from stratigraphy.util.dataclasses import Line
from stratigraphy.util.geometric_line_utilities import (
    drop_vertical_lines,
    merge_parallel_lines_approximately,
    merge_parallel_lines_efficiently,
)
from stratigraphy.util.plot_utils import plot_lines
-from stratigraphy.util.util import flatten, line_from_array, read_params
+from stratigraphy.util.util import line_from_array, read_params

load_dotenv()

@@ -111,36 +110,3 @@ def draw_lines_on_pdfs(input_directory: Path, line_detection_params: dict):
import mlflow

mlflow.log_image(img, f"pages/{filename}_page_{page_index}_lines.png")
-
-
-if __name__ == "__main__":
-    # Some test pdfs
-    selected_pdfs = [
-        "270124083-bp.pdf",
-        "268124307-bp.pdf",
-        "268125268-bp.pdf",
-        "267125378-bp.pdf",
-        "268124435-bp.pdf",
-        "267123060-bp.pdf",
-        "268124635-bp.pdf",
-        "675230002-bp.pdf",
-        "268125592-bp.pdf",
-        "267124070-bp.pdf",
-        "699248001-bp.pdf",
-    ]
-
-    if mlflow_tracking:
-        import mlflow
-
-        mlflow.set_experiment("LineDetection")
-        mlflow.start_run()
-        mlflow.log_params(flatten(line_detection_params))
-    lines = {}
-    for pdf in selected_pdfs:
-        doc = fitz.open(DATAPATH / "Benchmark" / pdf)
-
-        for page in doc:
-            lines[pdf] = extract_lines(page, line_detection_params)
-            img = plot_lines(page, lines[pdf], scale_factor=line_detection_params["pdf_scale_factor"])
-            if mlflow_tracking:
-                mlflow.log_image(img, f"lines_{pdf}.png")
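
Note for anyone who relied on the removed `__main__` block: an ad-hoc line detection run can still be driven from outside the package. The following is only a minimal sketch, not part of this commit. The helper name `detect_lines_per_page` and its parameters are hypothetical. The document opener and the line extractor are passed in as callables so the snippet runs without PyMuPDF or the `stratigraphy` package installed.

```python
from pathlib import Path
from typing import Any, Callable, Iterable


def detect_lines_per_page(
    pdf_paths: Iterable[Path],
    open_document: Callable[[Path], Iterable[Any]],
    extract_lines: Callable[[Any, dict], list],
    line_detection_params: dict,
) -> dict[str, list[list]]:
    """Run a line extractor on every page of every given PDF.

    Hypothetical helper: returns a mapping from PDF file name to a list
    holding one list of detected lines per page.
    """
    results: dict[str, list[list]] = {}
    for pdf_path in pdf_paths:
        # open_document is assumed to return an iterable of pages.
        pages = open_document(pdf_path)
        results[pdf_path.name] = [
            extract_lines(page, line_detection_params) for page in pages
        ]
    return results
```

In the project itself, one would presumably pass `fitz.open` as `open_document` and the `extract_lines` function from `src/stratigraphy/line_detection.py` as `extract_lines`, using the `extract_lines(page, line_detection_params)` call shape seen in the removed block. Unlike the removed block, which reassigned `lines[pdf]` on each page and therefore kept only the last page's result, this sketch keeps one entry per page.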
