S2-VLUE, the Semantic Scholar Visual Layout-enhanced Scientific Text Understanding Evaluation benchmark suite, is created to evaluate scientific document understanding and parsing with visual layout information.
It consists of three datasets: GROTOAP2, DocBank, and S2-VL. We modify the existing datasets GROTOAP2[1] and DocBank[2], adding visual layout information and converting them to a format that is compatible with HuggingFace Datasets. The S2-VL dataset is a newly curated dataset that addresses three major drawbacks in existing work: 1) annotation quality, 2) VILA creation, and 3) domain coverage. It contains human annotations for papers from 19 scientific disciplines. We provide scripts for downloading the source PDF files as well as converting them to a similar HuggingFace Datasets format.
cd <vila-root>/datasets
bash ./download.sh <dataset-name> #grotoap2, docbank, s2-vl or all
- GROTOAP2 (downloading paper PDFs)
  - Please follow the instructions from the GROTOAP2 Project README.
- DocBank (downloading paper page screenshots)
  - Please follow the instructions from the home page of the DocBank Project.
- S2-VL (downloading paper PDFs)
  - Please check the instructions in s2-vl-utils/README.md.
During the data release process, we unfortunately found that a small portion of the PDFs in our dataset (22 out of 87) had additional copyright constraints of which we had been unaware. This meant that we could not directly release the data corresponding to these papers. As a result, the downloadable version contains only the data created from the remaining 65 papers.
If you are interested in the version of the dataset used for training and evaluation in our paper, please fill out this Google Form to request access (if you haven't heard from us within 2 weeks, please feel free to contact Shannon).
We also provide the full code to help you recreate the dataset from PDFs and annotation files to the JSON files for training models. Please check the instructions in s2-vl-utils/README.md.
Please find a detailed description of the labeling schemas and categories in the following documents:
- Labeling Instruction
- S2-VL Category Definition
  - We labeled both layout and semantic categories in S2-VL (see the document above), but only the 15* layout categories are used in this evaluation benchmark.
- The 19 Scientific Disciplines
*The algorithm category is removed due to its small number of instances.
| | GROTOAP2 | DocBank | S2-VL-ver1 |
|---|---|---|---|
| Train Test Split | 83k/18k/18k | 398k/50k/50k | * |
| Annotation Method | Automatic | Automatic | Human Annotation |
| Paper Domain | Life Science | Math/Physics/CS | 19 Disciplines |
| VILA Structure | PDF parsing | Vision model | Gold Label / Detection methods |
| # of Categories | 22 | 12 | 15 |
| | GROTOAP2 | DocBank | S2-VL-ver1* |
|---|---|---|---|
| Tokens per Page | | | |
| Average | 1203 | 838 | 790 |
| Std | 591 | 503 | 453 |
| 95th Percentile | 2307 | 1553 | 1591 |
| Text Lines per Page | | | |
| Average | 90 | 60 | 64 |
| Std | 51 | 34 | 54 |
| 95th Percentile | 171 | 125 | 154 |
| Text Blocks per Page | | | |
| Average | 12 | 15 | 22 |
| Std | 16 | 8 | 36 |
| 95th Percentile | 37 | 30 | 68 |
| Tokens per Text Line | | | |
| Average | 17 | 16 | 14 |
| Std | 12 | 43 | 10 |
| 95th Percentile | 38 | 38 | 30 |
| Tokens per Text Block | | | |
| Average | 90 | 57 | 48 |
| Std | 184 | 138 | 121 |
| 95th Percentile | 431 | 210 | 249 |

*This is calculated based on the S2-VL-ver1 with all 87 papers.
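For readers who want to produce comparable numbers on their own data, the following is a minimal sketch of how the three summary statistics in the table (average, standard deviation, 95th percentile) could be computed from a list of per-page counts. The source does not state which standard deviation estimator or percentile interpolation was used, so the population standard deviation and linear interpolation here are assumptions. The function names and the sample counts are hypothetical.

```python
import math
import statistics


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(counts):
    """Summarize a list of counts (e.g. tokens per page)."""
    return {
        "average": statistics.mean(counts),
        # Assumption: population standard deviation; the source does not specify.
        "std": statistics.pstdev(counts),
        "p95": percentile(counts, 95),
    }


# Made-up counts for illustration only, not taken from any of the datasets.
tokens_per_page = [100, 200, 300, 400, 500]
summary = summarize(tokens_per_page)
print(summary)
```

The same `summarize` helper could be applied to text lines per page, text blocks per page, and so on, once those counts are extracted from the token files.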
- The organization of the dataset files:

      grotoap2  # DocBank is similar
      ├─ labels.json
      ├─ train-token.json
      ├─ dev-token.json
      ├─ test-token.json
      └─ train-test-split.json
- What's in each file?
  - labels.json

        {"0": "Title", "1": "Author", ... }

  - train-test-split.json

        { "train": [ "pdf-file-name", ... ], "test": ["pdf-file-name", ...] }

  - train-token.json, dev-token.json, or test-token.json

    Please see the detailed schema explanation in the schema-token.json file.
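As a minimal sketch of how the two small metadata files might be read, assuming the layouts shown above (labels.json maps stringified integer IDs to category names, and train-test-split.json maps split names to lists of PDF file names). The helper names `load_labels` and `load_split` and the sample file contents are hypothetical and are not part of the released code. The token files are not parsed here, since their schema is documented separately in schema-token.json.

```python
import json
import tempfile
from pathlib import Path


def load_labels(dataset_dir):
    """Read labels.json and return {int label id: category name}."""
    with open(Path(dataset_dir) / "labels.json", encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so convert them back to integers.
    return {int(label_id): name for label_id, name in raw.items()}


def load_split(dataset_dir):
    """Read train-test-split.json and return {split name: [pdf file names]}."""
    with open(Path(dataset_dir) / "train-test-split.json", encoding="utf-8") as f:
        return json.load(f)


# Demo on a throwaway directory with made-up contents (not real dataset files).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "labels.json").write_text(json.dumps({"0": "Title", "1": "Author"}))
    (tmp_dir / "train-test-split.json").write_text(
        json.dumps({"train": ["paper-a"], "test": ["paper-b"]})
    )
    id2label = load_labels(tmp_dir)
    split = load_split(tmp_dir)

# The inverse mapping is often handy when encoding labels for training.
label2id = {name: label_id for label_id, name in id2label.items()}
print(id2label)
print(sorted(split))
```

In practice, `dataset_dir` would point at a downloaded folder such as the grotoap2 directory shown in the listing above.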
- Special notes on the folder structure for S2-VL: since the dataset size is small, we use 5-fold cross validation in the paper. The released version has a similar structure:

      s2-vl-ver1
      ├─ 0  # 5-fold cross validation
      │  ├─ labels.json
      │  ├─ test-token.json
      │  ├─ train-test-split.json
      │  └─ train-token.json
      ├─ 1  # fold-1, has the same files as the other folds
      │  ├─ labels.json
      │  ├─ test-token.json
      │  ├─ train-test-split.json
      │  └─ train-token.json
      ├─ 2
      ├─ 3
      └─ 4
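The fold layout above can be walked programmatically. The sketch below builds the expected file paths for folds 0 through 4 and reports any that are missing, which may be useful as a sanity check after downloading. The file names come from the listing above; the helper names `fold_paths` and `missing_files` are hypothetical, and the demo runs against a temporary directory with placeholder files rather than the real dataset.

```python
import tempfile
from pathlib import Path

# File names taken from the folder listing above.
FOLD_FILES = (
    "labels.json",
    "test-token.json",
    "train-test-split.json",
    "train-token.json",
)
NUM_FOLDS = 5


def fold_paths(root, num_folds=NUM_FOLDS):
    """Return {fold index: {file name: path}} for the cross-validation layout."""
    root = Path(root)
    return {
        fold: {name: root / str(fold) / name for name in FOLD_FILES}
        for fold in range(num_folds)
    }


def missing_files(root, num_folds=NUM_FOLDS):
    """List every expected file that is absent under `root`."""
    return [
        path
        for files in fold_paths(root, num_folds).values()
        for path in files.values()
        if not path.is_file()
    ]


# Demo: create placeholder files in a temporary directory, then delete one.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "s2-vl-ver1"
    for files in fold_paths(root).values():
        for path in files.values():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text("{}")
    # Remove one file to show how a gap would be reported.
    (root / "3" / "test-token.json").unlink()
    missing = [p.relative_to(root).as_posix() for p in missing_files(root)]

print(missing)
```

A training loop for cross validation would iterate over `fold_paths(root)` and train and evaluate once per fold, using each fold's own train and test files.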
- The GROTOAP2 Dataset:
  - Paper: https://www.dlib.org/dlib/november14/tkaczyk/11tkaczyk.html
  - Original download link: http://cermine.ceon.pl/grotoap2/
  - Licence: Open Access license
- The Original DocBank Dataset:
  - Paper: https://arxiv.org/pdf/2006.01038.pdf
  - Original download link: https://github.com/doc-analysis/DocBank
  - Licence: Apache-2.0
@article{Shen2021IncorporatingVL,
title={Incorporating Visual Layout Structures for Scientific Text Classification},
author={Zejiang Shen and Kyle Lo and Lucy Lu Wang and Bailey Kuehl and Daniel S. Weld and Doug Downey},
journal={ArXiv},
year={2021},
volume={abs/2106.00676},
url={https://arxiv.org/abs/2106.00676}
}