Skip to content

Latest commit

 

History

History
79 lines (69 loc) · 4.58 KB

README.md

File metadata and controls

79 lines (69 loc) · 4.58 KB

KINNEWS-and-KIRNEWS

Data, Embeddings, Stopword lists, code, and baselines for COLING 2020 paper titled "KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi" by Rubungo Andre Niyongabo, Hong Qu, Julia Kreutzer, and Li Huang.

This paper introduces Kinyarwanda and Kirundi news classification datasets (KINNEWS and KIRNEWS,respectively), which were both collected from Rwanda and Burundi news websites and newspapers, for low-resource monolingual and cross-lingual multiclass classification tasks. Along with the datasets, we provide statistics, guidelines for preprocessing, pretrained word embeddings, and monolingual and cross-lingual baseline models.

Note: Please, when using any of the resources provided here, remember to cite our paper.

Data

Download the datasets

  • The raw and cleaned versions of KINNEWS can be downloaded from here (21,268 articles, 14 classes, 45.07MB(raw) and 38.85MB(cleaned))
  • The raw and cleaned versions of KIRNEWS can be downloaded from here (4,612 articles, 12 classes, 9.31MB(raw) and 7.77MB(cleaned))

Datasets description

Each dataset is in camma-separated-value (csv) format, with columns that are described bellow (Note that in the cleaned versions we only remain with 'label','title', and 'content' columns):

Field Description
label Numerical labels that range from 1 to 14
en_label English labels
kin_label Kinyarwanda labels
kir_label Kirundi labels
url The link to the news source
title The title of the news article
content The full content of the news article

Word embeddings

Download pre-trained word embeddings

  • The Kinyarwanda embeddings can be downloaded form here (59.88MB for 100d and 29.94MB for 50d)
  • The Kirundi embeddings can be downloaded from here (17.98MB for 100d and 8.96MB for 50d)

Training your own embeddings

To train you own word vectors, check out code/embeddings/word2vec_training.py file or refer to this gensim documentation.

Stopwords

To use our stopwords you may just copy the whole stopset_kin for Kinyarwanda and stopset_kir for Kirundi into your code or import them directly from KKLTK package, which is more recommended.

Leaderboard (baselines)

Monolingual

KINNEWS

Model Accuracy(%)
BiGRU(W2V-Kin-50*) 88.65
SVM(TF-IDF) 88.53
BiGRU(W2V-Kin-100) 88.29
CNN(W2V-Kin-50) 87.55
CNN(W2V-Kin-100) 87.54
LR(TF-IDF) 87.14
MNB(TF-IDF) 82.70
Char-CNN 71.70

KIRNEWS

Model Accuracy(%)
SVM(TF-IDF) 90.14
CNN(W2V-Kin-100) 88.01
BiGRU(W2V-Kin-100) 86.61
LR(TF-IDF) 86.13
BiGRU(W2V-Kin-50) 85.86
CNN(W2V-Kin-50) 85.75
MNB(TF-IDF) 82.67
Char-CNN 69.23

Cross-lingual

Model Train set Test set Accuracy(%)
MNB(TF-IDF) KINNEWS KIRNEWS 73.46
SVM(TF-IDF) KINNEWS KIRNEWS 72.70
LR(TF-IDF) KINNEWS KIRNEWS 68.26
BiGRU(W2V-Kin-50) KINNEWS KIRNEWS 67.54
BiGRU(W2V-Kin-100*) KINNEWS KIRNEWS 65.06
CNN(W2V-Kin-100) KINNEWS KIRNEWS 61.72
CNN(W2V-Kin-50) KINNEWS KIRNEWS 60.64
Char-CNN KINNEWS KIRNEWS 49.60
Model Train set Test set Accuracy(%)
CNN(W2V-Kin-100) KIRNEWS KIRNEWS 88.01
BiGRU(W2V-Kin-100) KIRNEWS KIRNEWS 86.61
CNN(W2V-Kin-50) KIRNEWS KIRNEWS 85.75
BiGRU(W2V-Kin-50) KIRNEWS KIRNEWS 83.38