OccGen_dataset

This repository contains the OccGen dataset and its metadata. It covers two use cases for our toolkit: translations_HR_data.csv contains parallel data for EN, ES, AR, and RU, and translations_LR_data.csv contains parallel data for EN and SW.

Both files are annotated and ready to be used for evaluation.
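As a quick way to inspect the two evaluation files, the following is a minimal sketch that uses only the Python standard library. It assumes the files are comma-delimited, UTF-8 encoded, and start with a header row; none of this is stated in this README, so adjust as needed. The helper names `load_rows` and `describe` are hypothetical and are not part of the OccGen toolkit. The script prints the column names it finds rather than assuming any schema.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row.

    Assumption: comma-delimited, UTF-8, first row is a header.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def describe(rows):
    """Return (row count, column names) without assuming any schema."""
    columns = list(rows[0].keys()) if rows else []
    return len(rows), columns


if __name__ == "__main__":
    # File names as given in this README; run from the repository root.
    for name in ("translations_HR_data.csv", "translations_LR_data.csv"):
        if Path(name).exists():
            count, columns = describe(load_rows(name))
            print(f"{name}: {count} rows, columns = {columns}")
        else:
            print(f"{name}: not found in the current directory")
```

Checking the printed column names first is a reasonable step before passing the files to any evaluation script, since the annotation fields are not documented here.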

The OccGen toolkit is available on GitHub: https://github.com/mt-upc/OccGen_toolkit/

The training data is available at the following links:

En-Ar 122k: https://drive.google.com/file/d/18SP18I-9wb0H_ibrdP9eBc3hVfgk4q9X/view?usp=sharing

En-Es 1M: https://drive.google.com/file/d/1wahfqTiEgD89wHqnZuq2X4syDk15EgTz/view?usp=sharing

En-Ru 433k: https://drive.google.com/file/d/16KnpUWtbCkoXOEIwUJKFuYqK8woQPZDg/view?usp=sharing

The heatmap below shows the gender distribution in our training data:

Gender definitions are extracted from Wikipedia.
