S2-VLUE

Overview

S2-VLUE, the Semantic Scholar Visual Layout-enhanced Scientific Text Understanding Evaluation benchmark suite, is created to evaluate scientific document understanding and parsing with visual layout information.

It consists of three datasets: GROTOAP2, DocBank, and S2-VL. We modify the existing GROTOAP2[1] and DocBank[2] datasets, adding visual layout information and converting them to a format that is compatible with HuggingFace Datasets. S2-VL is a newly curated dataset that addresses three major drawbacks in existing work: 1) annotation quality, 2) VILA creation, and 3) domain coverage. It contains human annotations for papers from 19 scientific disciplines. We provide scripts for downloading the source PDF files and for converting them to a similar HuggingFace Datasets format.

Download & Usage

Download the exported JSON (for training language models)

cd <vila-root>/datasets
bash ./download.sh <dataset-name> #grotoap2, docbank, s2-vl or all

Download the source PDFs or screenshots

Dataset Details

The S2-VL dataset

During the data release process, we unfortunately found that a small portion of the PDFs in our dataset (22 out of 87) had additional copyright constraints of which we had been unaware, so we cannot directly release the data corresponding to these papers. As a result, the downloadable version contains only the data created from the remaining 65 papers.

If you are interested in the version of the dataset used for training and evaluation in our paper, please fill out this Google Form to request access. If you haven't heard from us within 2 weeks, please feel free to contact Shannon.

Recreating the dataset from PDFs and annotations

We also provide the full code for recreating the dataset, from the PDFs and annotation files to the JSON files used for training models. Please check the instructions in s2-vl-utils/README.md.

Dataset Curation Details

Please find a detailed description of the labeling schemas and categories in the following documents:

*The algorithm category is removed due to its small number of instances.

The VILA-enhanced DocBank Dataset

Dataset Details

Statistics of the Datasets

| | GROTOAP2 | DocBank | S2-VL-ver1 |
| --- | --- | --- | --- |
| Train Test Split | 83k/18k/18k | 398k/50k/50k | * |
| Annotation Method | Automatic | Automatic | Human Annotation |
| Paper Domain | Life Science | Math/Physics/CS | 19 Disciplines |
| VILA Structure | PDF parsing | Vision model | Gold Label / Detection methods |
| # of Categories | 22 | 12 | 15 |

| Metric | Statistic | GROTOAP2 | DocBank | S2-VL-ver1* |
| --- | --- | --- | --- | --- |
| Tokens per Page | Average | 1203 | 838 | 790 |
| | Std | 591 | 503 | 453 |
| | 95th Percentile | 2307 | 1553 | 1591 |
| Text Lines per Page | Average | 90 | 60 | 64 |
| | Std | 51 | 34 | 54 |
| | 95th Percentile | 171 | 125 | 154 |
| Text Blocks per Page | Average | 12 | 15 | 22 |
| | Std | 16 | 8 | 36 |
| | 95th Percentile | 37 | 30 | 68 |
| Tokens per Text Line | Average | 17 | 16 | 14 |
| | Std | 12 | 43 | 10 |
| | 95th Percentile | 38 | 38 | 30 |
| Tokens per Text Block | Average | 90 | 57 | 48 |
| | Std | 184 | 138 | 121 |
| | 95th Percentile | 431 | 210 | 249 |

\* This is calculated based on S2-VL-ver1 with all 87 papers.
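
For readers who want to compute comparable statistics on their own data, the following is a minimal sketch of the three summary values reported per metric (average, standard deviation, 95th percentile). It is not the script used to produce the numbers above: the function name `summarize` is hypothetical, and the choice of population standard deviation and a nearest-rank percentile is an assumption, since this README does not state which variants were used.

```python
import math
import statistics


def summarize(counts):
    """Return (average, std, 95th percentile) for a list of counts,
    e.g. the number of tokens on each page.

    Assumptions (not confirmed by the dataset authors):
    - std is the population standard deviation;
    - the percentile uses the nearest-rank definition.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    ordered = sorted(counts)
    average = statistics.fmean(ordered)
    std = statistics.pstdev(ordered)
    # Nearest-rank: smallest value with at least 95% of the data at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    p95 = ordered[rank - 1]
    return average, std, p95
```

Calling `summarize` on a list of per-page token counts would give one "Tokens per Page" column; the same function applies to lines per page, blocks per page, and so on, as long as the counts are collected at the matching granularity.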

File Structures

  1. The organization of the dataset files :
    grotoap2 # DocBank is similar
    ├─ labels.json       
    ├─ train-token.json
    ├─ dev-token.json           
    ├─ test-token.json           
    └─ train-test-split.json
  2. What's in each file?
    1. labels.json
      {"0": "Title",
       "1": "Author",
       ...
      }
    2. train-test-split.json
      {
          "train": [
              "pdf-file-name", ...
          ],
          "test": ["pdf-file-name", ...]
      }
    3. train-token.json, dev-token.json, or test-token.json: please see the detailed schema explanation in the schema-token.json file.
  3. Special notes on the folder structure for S2-VL: since the dataset size is small, we use 5-fold cross validation in the paper. The released version has a similar structure:
    s2-vl-ver1
    ├─ 0  # 5-fold Cross validation                           
    │  ├─ labels.json               
    │  ├─ test-token.json           
    │  ├─ train-test-split.json     
    │  └─ train-token.json          
    ├─ 1  # fold-1, has the same files as the other folds
    │  ├─ labels.json               
    │  ├─ test-token.json           
    │  ├─ train-test-split.json     
    │  └─ train-token.json          
    ├─ 2                            
    ├─ 3                            
    └─ 4
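
The file layout described above can be read with a few lines of Python. The sketch below loads `labels.json` and `train-test-split.json` from one dataset folder and walks the five S2-VL fold folders (`0` to `4`). The helper names `load_fold` and `check_folds` are hypothetical and not part of this repository, and the train/test overlap check is only a sanity check we assume is desirable, not a documented guarantee. The token files are left untouched because their schema is described separately in schema-token.json.

```python
import json
from pathlib import Path


def load_fold(fold_dir):
    """Load the label map and the train/test PDF lists from one folder.

    Relies only on the structure shown in this README:
    labels.json maps string ids to category names, and
    train-test-split.json has "train" and "test" lists of PDF file names.
    """
    fold_dir = Path(fold_dir)
    with open(fold_dir / "labels.json", encoding="utf-8") as f:
        raw_labels = json.load(f)
    # JSON object keys are strings; convert them to integer label ids.
    id2label = {int(key): name for key, name in raw_labels.items()}
    with open(fold_dir / "train-test-split.json", encoding="utf-8") as f:
        split = json.load(f)
    return id2label, split["train"], split["test"]


def check_folds(root, n_folds=5):
    """Return (n_train, n_test) per fold and fail if a PDF is in both lists."""
    sizes = []
    for fold in range(n_folds):
        _, train, test = load_fold(Path(root) / str(fold))
        overlap = set(train) & set(test)
        if overlap:
            raise ValueError(f"fold {fold}: PDFs in both splits: {sorted(overlap)}")
        sizes.append((len(train), len(test)))
    return sizes
```

For example, `load_fold("s2-vl-ver1/0")` would return the label map and the two PDF lists for fold 0, and `check_folds("s2-vl-ver1")` would summarize all five folds, assuming the data was downloaded into a folder with that name.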

Reference

  1. The GROTOAP2 Dataset:

  2. The Original DocBank Dataset:

Citation

@article{Shen2021IncorporatingVL,
  title={Incorporating Visual Layout Structures for Scientific Text Classification},
  author={Zejiang Shen and Kyle Lo and Lucy Lu Wang and Bailey Kuehl and Daniel S. Weld and Doug Downey},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.00676},
  url={https://arxiv.org/abs/2106.00676}
}