After running `get_data.sh` as described in the README, a `data` folder will be populated containing the following files:
- `claims.jsonl`: Claims in SciFact-Open, annotated with evidence.
- `claims_metadata.jsonl`: Metadata associated with each claim.
- `corpus.jsonl`: The full SciFact-Open corpus of 500K research abstracts from S2ORC.
- `corpus_candidates.jsonl`: Subset of documents from `corpus.jsonl` that were retrieved for at least one claim.
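For quick orientation before the detailed descriptions, here is a minimal sketch (standard library only, assuming the `data` folder sits in the current working directory) of how these files can be read. The `load_jsonl` helper is ours, not part of the release:

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, one record per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

data_dir = Path("data")  # Folder created by get_data.sh.
claims = load_jsonl(data_dir / "claims.jsonl")
candidates = load_jsonl(data_dir / "corpus_candidates.jsonl")
print(f"{len(claims)} claims, {len(candidates)} candidate abstracts")
```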
Detailed descriptions of each file are included below.
`claims.jsonl` is a `.jsonl` file with one claim per line, annotated with evidence. The schema is as follows:
{
    "id": number,                                     # An integer claim ID.
    "claim": string,                                  # The text of the claim.
    "evidence": {                                     # The evidence for the claim.
        [doc_id]: {                                   # The rationales for a single document, keyed by S2ORC ID.
            "provenance": enum("citation" | "pooling"),   # Evidence source. See below.
            "label": enum("SUPPORT" | "CONTRADICT"),      # Veracity label.
            "sentences": number[],                        # Evidence "highlights".
            "model_ranks": {[model_name]: number} | None  # Ranks for pooled evidence.
        }
    }
}
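To illustrate how this schema is used in practice, here is a short sketch that walks one claim's `evidence` dictionary; the loading line assumes the `data` folder described above:

```python
import json

claims = [json.loads(line) for line in open("data/claims.jsonl")]

claim = claims[0]
print(claim["id"], claim["claim"])
for doc_id, ev in claim["evidence"].items():  # Keyed by S2ORC document ID.
    print(f"  doc {doc_id}: {ev['label']} via {ev['provenance']}, highlights {ev['sentences']}")
```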
Note that this format is altered slightly from the format for SciFact-Orig; see below for details.
A few of these fields warrant explanation.
- `provenance`: As described in Sec. 3 of the paper, evidence in SciFact-Open comes from two sources:
  - Evidence from SciFact-Orig, obtained via citation links.
  - Evidence added in SciFact-Open via pooling.

  The `provenance` field indicates the source of each piece of evidence in the final SciFact-Open dataset.
- `sentences`: Evidence "highlights" justifying each labeling decision. The source of these highlights differs for `citation` vs. `pooling` evidence:
  - For `citation` evidence, highlights were hand-annotated by experts and are high-quality.
  - For `pooling` evidence, highlights were not hand-annotated. Instead, we include the predicted highlights from the model(s) that identified the evidence. These predictions may be incorrect, but are included for convenience and to facilitate future research.
- `model_ranks`: For pooled data collection, predicted evidence is ranked by model confidence, and the `k` most-confident predictions from each model are annotated; we used `k=250`.
  - For `pooling` evidence, this field provides the rank assigned to this piece of evidence by each model.
  - For `citation` evidence, this field is set to `None`.
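To make these fields concrete, here is a hedged sketch that skips citation evidence and, for pooled evidence, reports the best (lowest) rank any model assigned; it assumes `data/claims.jsonl` as above:

```python
import json

claims = [json.loads(line) for line in open("data/claims.jsonl")]

for claim in claims:
    for doc_id, ev in claim["evidence"].items():
        if ev["provenance"] == "citation":
            continue  # Hand-annotated highlights; model_ranks is None here.
        # Pooled evidence: ranks above k=250 belong to models that ranked this evidence lower.
        best_model, best_rank = min(ev["model_ranks"].items(), key=lambda kv: kv[1])
        print(f"claim {claim['id']}, doc {doc_id}: best rank {best_rank} ({best_model})")
```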
Below, we provide an example of a claim with two pieces of evidence: one from SciFact-Orig, and one identified via pooling.
{
    "id": 170,
    "claim": "Bariatric surgery reduces resolution of diabetes.",
    "evidence": {
        "5824985": {
            "provenance": "citation",
            "label": "CONTRADICT",
            "sentences": [10, 15],
            "model_ranks": None
        },
        "7854739": {
            "provenance": "pooling",
            "label": "CONTRADICT",
            "sentences": [2, 5, 6],
            "model_ranks": {
                "multivers_10": 4556,
                "multivers_20": 3881,
                "paragraph_joint": 947,
                "vert5erini": 120
            }
        }
    }
}
The format here is slightly different from the format for the original SciFact dataset. In that work, a single document could be associated with multiple evidence sets, each of which provides sufficient justification to support or refute the claim. For instance:
{
    # SciFact-Orig
    "id": 263,
    "claim": "Citrullinated proteins externalized in neutrophil extracellular traps act indirectly to disrupt the inflammatory cycle.",
    "evidence": {
        "30041340": [    # Multiple evidence sets associated with a single document.
            {"sentences": [0, 1], "label": "CONTRADICT"},
            {"sentences": [11], "label": "CONTRADICT"}
        ]
    }
}
However, we didn't end up using this evidence set structure in our modeling or evaluation. Therefore, in this work, we simplify things a bit by "flattening" evidence sets, as follows:
{
    # SciFact-Open
    "id": 263,
    "claim": "Citrullinated proteins externalized in neutrophil extracellular traps act indirectly to disrupt the inflammatory cycle.",
    "evidence": {
        "30041340": {    # Flattened evidence, including all highlights for this document.
            "sentences": [0, 1, 11],
            "label": "CONTRADICT"
        }
    }
}
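For reference, here is a small sketch of this flattening, taking SciFact-Orig-style evidence sets for one document as input; the helper name is ours, not part of the release:

```python
def flatten_evidence_sets(evidence_sets):
    """Merge SciFact-Orig evidence sets for one document into a single flat record."""
    sentences = sorted({sent for es in evidence_sets for sent in es["sentences"]})
    labels = {es["label"] for es in evidence_sets}
    # The flattened format keeps one label per document; fail loudly if sets ever disagree.
    assert len(labels) == 1, "Conflicting labels within a single document."
    return {"sentences": sentences, "label": labels.pop()}

# The SciFact-Orig example above flattens to {'sentences': [0, 1, 11], 'label': 'CONTRADICT'}.
print(flatten_evidence_sets([
    {"sentences": [0, 1], "label": "CONTRADICT"},
    {"sentences": [11], "label": "CONTRADICT"},
]))
```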
Each line of `claims_metadata.jsonl` includes metadata for a single claim, ordered as in `claims.jsonl`. This information isn't needed for any modeling task related to SciFact-Open, but may be useful to those designing and analyzing datasets. As background, claims in SciFact were created by re-writing citation sentences found in documents in the S2ORC corpus. We refer to the document used as the source of a claim as its "source document".
{
    "id": number,                     # Same IDs as in `claims.jsonl`.
    "source_doc_id": number,          # The S2ORC ID of the claim's source document.
    "source_metadata": {              # Metadata on the claim's source document.
        "paper_id": string,
        "title": string,
        "year": number,
        "arxiv_id": string | None,
        "pmc_id": string | None,
        "pubmed_id": string | None,
        "venue": string | None,
        "journal": string | None
    },
    "cited_doc_ids": number[]         # Documents cited in the source citation sentence.
}
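As an illustration of how this file lines up with `claims.jsonl`, here is a sketch that joins the two on `id`; the variable names are ours:

```python
import json

claims = [json.loads(line) for line in open("data/claims.jsonl")]
metadata = [json.loads(line) for line in open("data/claims_metadata.jsonl")]
meta_by_id = {m["id"]: m for m in metadata}

for claim in claims[:5]:
    source = meta_by_id[claim["id"]]["source_metadata"]
    print(f"claim {claim['id']}: rewritten from '{source['title']}' ({source['year']})")
```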
Each line of `corpus.jsonl` represents a document from the SciFact-Open corpus.
{
    "doc_id": number,             # The document's S2ORC ID.
    "title": string,              # The title.
    "abstract": string[],         # The abstract, written as a list of sentences.
    "metadata": {
        "paper_id": string,       # Redundant with `doc_id` above.
        "title": string,
        "year": number,
        "venue": string,
        "s2orc_version": string
    },
    "scifact_orig": bool          # True if this document was part of the SciFact-Orig corpus.
}
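To connect the corpus back to the evidence annotations, here is a sketch that builds a `doc_id` lookup and prints a claim's highlighted abstract sentences. It assumes, as in the original SciFact, that each index in `sentences` points into the document's `abstract` list, and that evidence keys are string versions of the numeric `doc_id`:

```python
import json

claims = [json.loads(line) for line in open("data/claims.jsonl")]
corpus = [json.loads(line) for line in open("data/corpus.jsonl")]
docs_by_id = {doc["doc_id"]: doc for doc in corpus}

claim = claims[0]
for doc_id, ev in claim["evidence"].items():
    doc = docs_by_id[int(doc_id)]  # Evidence keys are strings; corpus doc_ids are numbers.
    print(doc["title"])
    for idx in ev["sentences"]:
        print("  -", doc["abstract"][idx])
```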
`corpus_candidates.jsonl` contains the subset of ~12K documents from `corpus.jsonl` that either contain evidence indicated via citation links (from SciFact-Orig) or were identified via pooling as potential evidence candidates (from SciFact-Open).
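Because every annotated evidence document should fall in this subset, a quick sanity check is possible; here is a hedged sketch, again assuming the `data` folder from earlier:

```python
import json

claims = [json.loads(line) for line in open("data/claims.jsonl")]
candidates = [json.loads(line) for line in open("data/corpus_candidates.jsonl")]

candidate_ids = {doc["doc_id"] for doc in candidates}
evidence_ids = {int(doc_id) for claim in claims for doc_id in claim["evidence"]}
missing = evidence_ids - candidate_ids
print(f"{len(candidate_ids)} candidates; {len(missing)} evidence documents not among them (ideally 0)")
```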