Close LGVISIUM-52: Remove grouping by page #70

Merged: 8 commits, Aug 20, 2024

dcleres commented Jul 31, 2024

Remove the page-based logic in the code. The prediction.json now has the following shape:

{
    "685256002-bp.pdf": {
        "language": "de",
        "metadata": {
            "coordinates": null
        },
        "layers": [
            {
                "material_description": {
                    "text": "grauersiltigsandigerkiesauffullung",
                    "rect": [
                        232.78799438476562,
                        130.18496704101562,
                        525.6640014648438,
                        153.54295349121094
                    ],
                    "lines": [
                        {
                            "text": "grauer, siltig-sandiger Kies (Auffullung)",
                            "rect": [
                                232.78799438476562,
                                130.18496704101562,
                                525.6640014648438,
                                153.54295349121094
                            ],
                            "page": 1
                        }
                    ],
                    "page": 1
                },
                "depth_interval": {
                    "start": null,
                    "end": {
                        "value": 0.4,
                        "rect": [
                            125.25399780273438,
                            140.2349853515625,
                            146.10398864746094,
                            160.84498596191406
                        ],
                        "page": 1
                    }
                }
            },
...
            {
                "material_description": {
                    "text": "dunkelgrauermagererztsandigerlehm",
                    "rect": [
                        246.42300415039062,
                        708.9439697265625,
                        438.72283935546875,
                        722.6839599609375
                    ],
                    "lines": [
                        {
                            "text": "dunkelgrauer, magerer, z.T. sandiger Lehm",
                            "rect": [
                                246.42300415039062,
                                708.9439697265625,
                                438.72283935546875,
                                722.6839599609375
                            ],
                            "page": 1
                        }
                    ],
                    "page": 1
                },
                "depth_interval": {
                    "start": {
                        "value": 9.9,
                        "rect": [
                            163.56900024414062,
                            684.2750244140625,
                            177.468994140625,
                            698.0150146484375
                        ],
                        "page": 1
                    },
                    "end": {
                        "value": 12.0,
                        "rect": [
                            160.0050048828125,
                            745.1210327148438,
                            177.5189971923828,
                            757.4869995117188
                        ],
                        "page": 1
                    }
                }
            }
        ],
        "depths_materials_column_pairs": [
            {
                "depth_column": {
                    "rect": [
                        160.0050048828125,
                        410.78302001953125,
                        178.76199340820312,
                        757.4869995117188
                    ],
                    "entries": [
                        {
                            "value": 0.2,
                            "rect": [
                                163.0070037841797,
                                410.78302001953125,
                                176.90699768066406,
                                424.52301025390625
                            ],
                            "page": 1
                        },
                        {
                            "value": 2.9,
                            "rect": [
                                162.36900329589844,
                                485.98602294921875,
                                176.2689971923828,
                                499.72601318359375
                            ],
                            "page": 1
                        },
                        {
                            "value": 3.1,
                            "rect": [
                                163.23500061035156,
                                495.28399658203125,
                                177.13499450683594,
                                509.02398681640625
                            ],
                            "page": 1
                        },
                        {
                            "value": 3.7,
                            "rect": [
                                163.31199645996094,
                                509.58197021484375,
                                177.2119903564453,
                                523.3219604492188
                            ],
                            "page": 1
                        },
                        {
                            "value": 4.0,
                            "rect": [
                                163.27000427246094,
                                517.2839965820312,
                                177.1699981689453,
                                531.0239868164062
                            ],
                            "page": 1
                        },
                        {
                            "value": 4.3,
                            "rect": [
                                163.1320037841797,
                                526.6389770507812,
                                177.03199768066406,
                                540.3789672851562
                            ],
                            "page": 1
                        },
                        {
                            "value": 5.1,
                            "rect": [
                                163.4720001220703,
                                548.4299926757812,
                                178.76199340820312,
                                563.5440063476562
                            ],
                            "page": 1
                        },
                        {
                            "value": 5.7,
                            "rect": [
                                163.5570068359375,
                                567.0780029296875,
                                177.45700073242188,
                                580.8179931640625
                            ],
                            "page": 1
                        },
                        {
                            "value": 7.0,
                            "rect": [
                                163.42300415039062,
                                602.1010131835938,
                                177.322998046875,
                                615.8410034179688
                            ],
                            "page": 1
                        },
                        {
                            "value": 7.7,
                            "rect": [
                                163.625,
                                622.0120239257812,
                                177.52499389648438,
                                635.7520141601562
                            ],
                            "page": 1
                        },
                        {
                            "value": 8.6,
                            "rect": [
                                163.5500030517578,
                                647.9769897460938,
                                177.4499969482422,
                                661.7169799804688
                            ],
                            "page": 1
                        },
                        {
                            "value": 9.9,
                            "rect": [
                                163.56900024414062,
                                684.2750244140625,
                                177.468994140625,
                                698.0150146484375
                            ],
                            "page": 1
                        },
                        {
                            "value": 12.0,
                            "rect": [
                                160.0050048828125,
                                745.1210327148438,
                                177.5189971923828,
                                757.4869995117188
                            ],
                            "page": 1
                        }
                    ]
                },
                "material_description_rect": [
                    244.97000122070312,
                    414.0959777832031,
                    594.2720947265625,
                    722.6839599609375
                ]
            }
        ],
        "page_height": [
            1192.0
        ],
        "page_width": [
            842.0
        ]
    }
}

Behind the scenes, the page-based logic in the code was removed and replaced with file-based logic.
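As a rough illustration of consuming the file-keyed structure, the sketch below regroups the flat layer list by page for callers that still want a per-page view. The helper name `layers_by_page` and the trimmed sample are assumptions made for illustration and are not part of the repository; only the field names are taken from the example above.

```python
import json
from collections import defaultdict

# Trimmed sample in the shape shown above (rects and lines omitted for brevity).
SAMPLE = """
{
    "685256002-bp.pdf": {
        "language": "de",
        "layers": [
            {
                "material_description": {"text": "grauersiltigsandigerkiesauffullung", "page": 1},
                "depth_interval": {"start": null, "end": {"value": 0.4, "page": 1}}
            },
            {
                "material_description": {"text": "dunkelgrauermagererztsandigerlehm", "page": 1},
                "depth_interval": {"start": {"value": 9.9, "page": 1}, "end": {"value": 12.0, "page": 1}}
            }
        ]
    }
}
"""


def layers_by_page(file_prediction: dict) -> dict:
    """Hypothetical helper: regroup a file's flat layer list by page number."""
    grouped = defaultdict(list)
    for layer in file_prediction["layers"]:
        # Each material description carries its own page number.
        grouped[layer["material_description"]["page"]].append(layer)
    return dict(grouped)


predictions = json.loads(SAMPLE)
for filename, file_prediction in predictions.items():
    for page, layers in layers_by_page(file_prediction).items():
        for layer in layers:
            interval = layer["depth_interval"]
            # "start" or "end" can be null when no depth was detected.
            start = interval["start"]["value"] if interval["start"] else None
            end = interval["end"]["value"] if interval["end"] else None
            print(filename, page, start, end)
```

Because the page number is stored on each element rather than used as a grouping key, a per-page view becomes a derived structure that callers build only when they need it.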

Additional features:

  • Added the launch.json file that makes it possible to run the borehole extraction in debug mode.
  • Added docstrings to some public functions

dcleres added the enhancement label Jul 31, 2024
dcleres self-assigned this Jul 31, 2024
dcleres force-pushed the LGVISIUM-52-remove-grouping-by-page branch from e6059c1 to 29fa7b9 July 31, 2024 11:49

github-actions bot commented Jul 31, 2024

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| src/stratigraphy | | | | |
| __init__.py | 8 | 1 | 88% | 11 |
| extract.py | 188 | 188 | 0% | 3–491 |
| get_files.py | 19 | 19 | 0% | 3–47 |
| line_detection.py | 26 | 26 | 0% | 3–76 |
| main.py | 104 | 104 | 0% | 3–247 |
| src/stratigraphy/util | | | | |
| boundarydepthcolumnvalidator.py | 41 | 20 | 51% | 47, 57, 60, 81–84, 110–128, 140–149 |
| coordinate_extraction.py | 127 | 7 | 94% | 31, 62, 75–76, 80, 205, 328 |
| dataclasses.py | 32 | 3 | 91% | 37–39 |
| depthcolumn.py | 194 | 64 | 67% | 26, 30, 51, 57, 60–61, 85, 88, 95, 102, 110–111, 121, 138–154, 192, 229, 248–256, 267, 272, 279, 310, 315–322, 337–338, 381–423 |
| depthcolumnentry.py | 28 | 6 | 79% | 17, 21, 36, 39, 56, 65 |
| description_block_splitter.py | 70 | 2 | 97% | 25, 140 |
| draw.py | 84 | 84 | 0% | 3–253 |
| duplicate_detection.py | 51 | 51 | 0% | 3–146 |
| extract_text.py | 27 | 2 | 93% | 39–40 |
| find_depth_columns.py | 91 | 6 | 93% | 42–43, 73, 86, 180–181 |
| find_description.py | 63 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| geometric_line_utilities.py | 86 | 2 | 98% | 82, 132 |
| interval.py | 104 | 55 | 47% | 25–28, 33–36, 42, 48, 52, 62–64, 101–147, 168, 174–190 |
| language_detection.py | 18 | 18 | 0% | 3–45 |
| layer_identifier_column.py | 91 | 91 | 0% | 3–234 |
| line.py | 51 | 4 | 92% | 26, 51, 61, 111 |
| linesquadtree.py | 46 | 1 | 98% | 76 |
| plot_utils.py | 43 | 43 | 0% | 3–120 |
| predictions.py | 117 | 117 | 0% | 3–269 |
| textblock.py | 80 | 9 | 89% | 29, 57, 65, 90, 102, 125, 146, 155, 184 |
| util.py | 39 | 17 | 56% | 22, 40–47, 61–63, 87–88, 100–104 |
| TOTAL | 1828 | 968 | 47% | |

Tests Skipped Failures Errors Time
61 0 💤 0 ❌ 0 🔥 0.920s ⏱️


stijnvermeeren-swisstopo left a comment

I pushed two commits:

  • one fixing 2 TODOs in predictions.py (many thanks for highlighting those!)
  • one addressing the last point in the JIRA ticket:
    • While we are at it, can we remove the postprocessing (lowercase, removing whitespace) from the "text" field of the material description (e.g. "text": "kiesmitvielsandwenigsiltundtonbraungrau")? It would make more sense to me if this postprocessing is only done for the purposes of evaluation / comparing with ground truth, but the predictions.json file should actually contain the original text from the document without postprocessing (as is already the case for the individual lines).
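The separation requested in that ticket point could look roughly like the sketch below: the stored text stays untouched, and a normalized copy is built only when comparing against ground truth. The function names are hypothetical, and stripping every non-alphanumeric character is an assumption that happens to reproduce the sample strings in the PR description; the repository's actual normalization may differ.

```python
def normalize_for_evaluation(text: str) -> str:
    """Hypothetical helper: lowercase and drop every non-alphanumeric character."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def descriptions_match(predicted: str, ground_truth: str) -> bool:
    """Compare normalized copies; the stored strings are never modified."""
    return normalize_for_evaluation(predicted) == normalize_for_evaluation(ground_truth)


original = "grauer, siltig-sandiger Kies (Auffullung)"
# The prediction keeps the text exactly as it appears in the document.
prediction = {"material_description": {"text": original}}

print(normalize_for_evaluation(prediction["material_description"]["text"]))
print(descriptions_match(original, "Grauer siltig-sandiger Kies, Auffullung"))
```

Keeping normalization on the evaluation side means the output file can be read by people and by downstream tools such as Label Studio without losing punctuation or spacing.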

Remaining points from the ticket:

  • Importing the extracted data in Label Studio still works
  • README documentation is updated
    • The README currently contains a JSON example with the old structure

I'm not a big fan of the fact that we had to add page_number=1 in so many unit tests that are not really concerned with page number logic at all. However, I don't really see an immediate solution to this. Unless you see an easy way to avoid this, I'm happy to accept it like this.

The other change requests that I've added are all pretty minor, I think.

src/stratigraphy/util/coordinate_extraction.py (outdated, resolved)
src/stratigraphy/main.py (outdated, resolved)
src/stratigraphy/util/find_depth_columns.py (outdated, resolved)
src/stratigraphy/util/layer_identifier_column.py (outdated, resolved)
src/stratigraphy/util/extract_text.py (resolved)
src/stratigraphy/util/draw.py (outdated, resolved)
src/stratigraphy/util/predictions.py (outdated, resolved)

dcleres commented Aug 15, 2024

During the review, I found that src/scripts/label_studio_annotation_to_ground_truth.py also needs to be adapted to the new format. I will start working on this.

dcleres commented Aug 15, 2024

Thank you for your comments on the PR. I resolved all the issues you highlighted that are relevant to this part of the code. The changes required in the Label Studio repository will be merged into that repository.

> I'm not a big fan of the fact that we had to add page_number=1 in so many unit tests that are not really concerned with page number logic at all. However, I don't see an immediate solution to this. Unless you see an easy way to avoid this, I'm happy to accept it as is.

Regarding this comment, I fully agree with you, but I also have not found a fundamentally better way to solve this issue. I edited the tests and added a constant called PAGE_NUMBER that mitigates this issue to some extent.
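A minimal sketch of the idea, assuming a stand-in `Entry` dataclass rather than the repository's real classes: the shared `PAGE_NUMBER` constant is what the comment describes, while the `make_entry` factory is a hypothetical further step (not something this PR does) that would remove the argument from most call sites.

```python
from dataclasses import dataclass

# Single shared value for tests that do not exercise page logic.
PAGE_NUMBER = 1


@dataclass
class Entry:
    """Hypothetical stand-in for an object that requires a page number."""

    value: float
    page_number: int


def make_entry(value: float, page_number: int = PAGE_NUMBER) -> Entry:
    """Hypothetical test factory: callers pass a page only when it matters."""
    return Entry(value=value, page_number=page_number)


def test_entries_are_sorted_by_value():
    entries = [make_entry(2.9), make_entry(0.2), make_entry(3.1)]
    ordered = sorted(entries, key=lambda entry: entry.value)
    assert [entry.value for entry in ordered] == [0.2, 2.9, 3.1]
    # The page number is present but never spelled out in the test body.
    assert all(entry.page_number == PAGE_NUMBER for entry in ordered)


test_entries_are_sorted_by_value()
```

A test for multi-page files could then pass an explicit `page_number=2` to the same factory, which makes the few page-aware tests stand out from the rest.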

While doing this, I also realized that we are not performing any tests on files with more than one page. @stijnvermeeren-swisstopo, should a test be added for this as well?

dcleres force-pushed the LGVISIUM-52-remove-grouping-by-page branch from 129a8e2 to e8221f6 August 15, 2024 11:48
@stijnvermeeren-swisstopo

> While doing this, I also realized that we are not performing any tests on files with more than one page. @stijnvermeeren-swisstopo, should a test be added for this as well?

That would certainly be useful as well, but it's not a top priority. If you want to work on that, I would do it in a separate PR, so that we can close this one now.

stijnvermeeren-swisstopo left a comment

I pushed a minor simplification of the page dimensions code (4404eff).

Everything looks good to me now.

dcleres merged commit 98b17ef into main Aug 20, 2024
3 checks passed
dcleres deleted the LGVISIUM-52-remove-grouping-by-page branch August 20, 2024 13:51