Skip to content

mt-upc/OccGen_dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OccGen_dataset

OccGen dataset and its metadata, we have two usecases for our toolkit. The translations_HR_data.csv which represents the parallel data of EN, ES, AR, RU. The translations_LR_data.csv which represents the parallel data of EN, SW.

All the previous data is annotated and is ready to be used for evaluation.

The OccGen toolkit is available in this github link: https://github.com/mt-upc/OccGen_toolkit/

Training data is released under the following links:

En-Ar 122k: https://drive.google.com/file/d/18SP18I-9wb0H_ibrdP9eBc3hVfgk4q9X/view?usp=sharing

En-Es 1M: https://drive.google.com/file/d/1wahfqTiEgD89wHqnZuq2X4syDk15EgTz/view?usp=sharing

En-Ru 433k:https://drive.google.com/file/d/16KnpUWtbCkoXOEIwUJKFuYqK8woQPZDg/view?usp=sharing

Gender distribution in our training data can be seen in the below heatmap:

heatmap_genders

Gender definitions extracted from Wikipedia.

About

OCCGEN dataset and its metadata

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published