Wikia/Wikipedia-NER-and-EL-Dataset-Creator

  • Create datasets from Wikia/Wikipedia that can be used for both named entity recognition (NER) and entity linking (EL).

  • Sample Dataset is available here. See also preprocessed data examples.

Sample ja-wiki dataset.

Create en-wiki dataset.

  • Work in progress on the feature/FixEnParseBug branch.

Environment Setup for Preprocessing.

$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt
# Install wikiextractor==3.0.5 from source (https://github.com/attardi/wikiextractor) to enable the --json option.

Dataset Preparation

For Wikia

  • Download [worldname]_pages_current.xml from the Wikia statistics page and place it in ./dataset/.

    • For example, if you are interested in Virtual Youtuber, download virtualyoutuber_pages_current.xml dump from here.

For Wikipedia

Sample Script for Creating EL Dataset.

$ sh ./scripts/vtuber.sh

Parameters for Creating Dataset

  • -augmentation_with_title_set_string_match (Default:True)

    • When this parameter is True, we first build a title set from all pages in the Wikia .xml dump. Any string in the text that exactly matches a title in this set is then treated as an annotated mention.
  • -in_document_augmentation_with_its_title (Default:True)

    • When this parameter is True, we add further annotations by distant supervision, using the title of the page in which the mention appears.

    • For example, the Anakin Skywalker page refers to him as Anakin or Skywalker without an anchor link.

    • With this parameter on, such mentions are treated as annotated ones linked to that page.

  • -spacy_model (Default: en_core_web_md)

    • Specify the spaCy model used for sentence boundary detection (SBD).

    • Note: SBD with spaCy is conducted only when -multiprocessing is False.

  • -language (Default: en)

    • Specify language of document.

    • When en is selected and -multiprocessing is False, spaCy is used for SBD.

    • When en is selected and -multiprocessing is True, pysbd is used for SBD.

    • When ja is selected, konoha is used for SBD.

  • -multiprocessing (Default: False)

    • If True, the documents produced by wikiextractor preprocessing are processed in parallel.
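The two augmentation options above (-augmentation_with_title_set_string_match and -in_document_augmentation_with_its_title) can be pictured with a minimal sketch. This is not the repository's implementation: the function names, the returned fields, the word-boundary matching, and the choice to split a page title on whitespace are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Set


def find_title_mentions(sentence: str, titles: Set[str]) -> List[Dict[str, object]]:
    """Hypothetical helper: find exact occurrences of known titles in a sentence."""
    mentions: List[Dict[str, object]] = []
    # Try longer titles first so "Anakin Skywalker" wins over "Anakin".
    for title in sorted(titles, key=len, reverse=True):
        # \b assumes space-delimited text such as English; it is not
        # suitable as-is for Japanese.
        pattern = r"\b" + re.escape(title) + r"\b"
        for match in re.finditer(pattern, sentence):
            overlaps = any(
                match.start() < m["end"] and m["start"] < match.end()
                for m in mentions
            )
            if overlaps:
                continue  # keep the longer match found earlier
            mentions.append(
                {
                    "mention": match.group(),
                    "start": match.start(),
                    "end": match.end(),  # exclusive offset
                    "entity": title,
                }
            )
    return sorted(mentions, key=lambda m: m["start"])


def find_in_document_mentions(sentence: str, page_title: str) -> List[Dict[str, object]]:
    """Hypothetical helper: link partial mentions of a page's own title to that page."""
    # Assumption: the surface forms are the full title plus each of its tokens.
    surface_forms = {page_title, *page_title.split()}
    hits = find_title_mentions(sentence, surface_forms)
    for hit in hits:
        hit["entity"] = page_title  # every hit refers to the page itself
    return hits
```

Calling find_title_mentions with a title set collected from the whole dump corresponds to the first option, while find_in_document_mentions is applied only to sentences of the page named by page_title, which corresponds to the second.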

License

  • Dataset was constructed using Wikias (from FANDOM) and Wikipedia. This dataset is licensed under the Creative Commons Attribution-Share Alike License (CC-BY-SA).

Preprocessed data example from Wikia.

annotation.json

key | its_content
--- | ---
document_title | Title of the page in which the annotation appears.
anchor_sent | Sentence with the mention wrapped in <a> and </a>. This anchor can be used for entity linking.
annotation_doc_entity_title | Title of the entity the mention links to once disambiguated. Redirects are also considered.
mention | Surface form of the mention exactly as it appears in the sentence.
original_sentence | Sentence without anchors.
original_sentence_mention_start | Start position of the mention span in the original sentence (0-based character offset).
original_sentence_mention_end | End position of the mention span in the original sentence (exclusive character offset).
[
    {
        "document_title": "Melissa Kinrenka",
        "anchor_sent": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of <a> Nijisanji </a>.",
        "annotation_doc_entity_title": "Nijisanji",
        "mention": "Nijisanji",
        "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "original_sentence_mention_start": 75,
        "original_sentence_mention_end": 84
    },
    {
        "document_title": "Melissa Kinrenka",
        "anchor_sent": "<a> Melissa Kinrenka </a> (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "annotation_doc_entity_title": "Melissa Kinrenka",
        "mention": "Melissa Kinrenka",
        "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "original_sentence_mention_start": 0,
        "original_sentence_mention_end": 16
    },
    ...
]
...
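As a quick sanity check on these fields, the following sketch verifies that the start and end offsets slice the mention out of original_sentence, and that removing the anchor tags from anchor_sent recovers original_sentence. The record is copied from the sample above; the helper names are hypothetical, and the anchor-stripping rule is an assumption inferred from the two sample records only.

```python
import json

# One record copied verbatim from the sample annotation.json above.
records = json.loads("""
[
    {
        "document_title": "Melissa Kinrenka",
        "anchor_sent": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of <a> Nijisanji </a>.",
        "annotation_doc_entity_title": "Nijisanji",
        "mention": "Nijisanji",
        "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "original_sentence_mention_start": 75,
        "original_sentence_mention_end": 84
    }
]
""")


def span_matches_mention(record: dict) -> bool:
    """Check that [start, end) in original_sentence is exactly the mention."""
    start = record["original_sentence_mention_start"]
    end = record["original_sentence_mention_end"]
    return record["original_sentence"][start:end] == record["mention"]


def strip_anchor(anchor_sent: str) -> str:
    """Remove the '<a> ' and ' </a>' markers (format assumed from the samples)."""
    return anchor_sent.replace("<a> ", "").replace(" </a>", "")


for record in records:
    assert span_matches_mention(record)
    assert strip_anchor(record["anchor_sent"]) == record["original_sentence"]
```

For a real file, the records would be loaded with json.load from wherever annotation.json was written; that path depends on the script settings and is not shown here.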

doc_title2sents.json

  • Maps each redirect-resolved title to the sentences of its page, after sentence splitting.
{
    "Furen E Lustario": [
        "Furen E Lustario (フレン・E・ルスタリオ) is a female Japanese Virtual YouTuber and member of Nijisanji.",
        "A female knight of the Corvus Empire.",
        "Introduction Video.",
        "Furen's introduction.",
        "Personality.",
        "Furen lacks a surprising amount of common sense.",
        "It has been displayed in at least two streams that she cannot tell from left to right.",
        ...
    ],
    "Ibrahim": [
        "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
        "A former oil tycoon from the Corvus Empire.",
        "Since the value of oil has fallen, he now makes a living from a hot spring that he accidentally dug up.",
        "History.",
        "Background.",
        "Ibrahim made his YouTube debut on 1 February 2020.",
        ...
    ],
    ...
}
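One possible use of this mapping is to build a short description for each entity from its leading sentences, for example as candidate text for an entity linking model. The sketch below assumes only the title-to-sentence-list structure shown above; the function name and the choice of two sentences are illustrative, and the sample lists are truncated copies of the example.

```python
import json
from typing import Dict, List

# Truncated copy of the sample doc_title2sents.json above.
doc_title2sents: Dict[str, List[str]] = json.loads("""
{
    "Furen E Lustario": [
        "Furen E Lustario (フレン・E・ルスタリオ) is a female Japanese Virtual YouTuber and member of Nijisanji.",
        "A female knight of the Corvus Empire.",
        "Introduction Video."
    ],
    "Ibrahim": [
        "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
        "A former oil tycoon from the Corvus Empire."
    ]
}
""")


def build_descriptions(title2sents: Dict[str, List[str]], n_sentences: int = 2) -> Dict[str, str]:
    """Join the first n_sentences of each page into one description string."""
    return {
        title: " ".join(sentences[:n_sentences])
        for title, sentences in title2sents.items()
    }


descriptions = build_descriptions(doc_title2sents)
for title, description in descriptions.items():
    print(f"{title}: {description}")
```

Section headings such as "Introduction Video." appear as ordinary sentences in the lists, so a real pipeline may want to filter them before building descriptions.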

WIP

  • Add Entity Type to doc_title2sents.json for each entity.

Contact

  • izuna385(_atmark)gmail.com
  • PRs and issues are welcome!