Create test sets for all languages. #158

Open
cdleong opened this issue Jun 24, 2021 · 8 comments

Comments

@cdleong
Contributor

cdleong commented Jun 24, 2021

Following #157, check which languages are not covered in https://github.com/juliakreutzer/masakhane/tree/master/jw300_utils/test, and create custom test sets for those. @juliakreutzer I think I can give this a go, but do I need to open a pull request against... your forked version of masakhane-mt?

Alternate language code list, looks the same: https://opus.nlpl.eu/opusapi/?languages=True&corpus=JW300
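A minimal sketch of the coverage check described above, under stated assumptions: the full list of JW300 language codes has already been loaded into a Python list, and the existing test files follow a `test.en-<code>.<side>` naming pattern. That pattern and the function names `covered_codes` and `uncovered_languages` are hypothetical and not confirmed by the repository, so the regex would need adjusting to the actual file layout.

```python
import re

# ASSUMPTION: existing test files are named like "test.en-af.af" / "test.en-af.en".
# This pattern is a guess at the layout, not taken from the repository.
TEST_FILE_RE = re.compile(r"^test\.en-(?P<code>[^.]+)\.[^.]+$")


def covered_codes(filenames):
    """Collect the target-language codes that already have a test file."""
    codes = set()
    for name in filenames:
        match = TEST_FILE_RE.match(name)
        if match:
            codes.add(match.group("code"))
    return codes


def uncovered_languages(all_codes, filenames, source="en"):
    """Return the sorted codes from all_codes that have no test file yet.

    The source language itself is skipped, since test sets pair it
    with every other language.
    """
    covered = covered_codes(filenames)
    return sorted(
        code for code in set(all_codes)
        if code != source and code not in covered
    )
```

For example, calling `uncovered_languages(codes, os.listdir("jw300_utils/test"))` on a local checkout would list the codes that still need a custom test set, assuming the naming pattern above holds.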

@juliakreutzer
Collaborator

Yes, or we incorporate the code here completely.

@cdleong
Contributor Author

cdleong commented Jun 24, 2021

Ah, well, we'd have to update the notebooks as well, since they point directly to the forked version.

@cdleong
Contributor Author

cdleong commented Jun 24, 2021

Some languages, e.g. ady, lack alignment files for English: https://opus.nlpl.eu/JW300.php

@cdleong
Contributor Author

cdleong commented Jun 24, 2021

test_letter_a_new.zip
Did every language code that starts with the letter "a". Here are the ones that weren't already in there.

@cdleong
Contributor Author

cdleong commented Jun 24, 2021

Got to bfi before I started actually practicing "quality at a glance" and looking at the data. Turns out bfi is just... English data?

@cdleong
Contributor Author

cdleong commented Jun 24, 2021

Oh, it's "British Sign Language". What the heck? https://en.wikipedia.org/wiki/British_Sign_Language

@cdleong
Contributor Author

cdleong commented Jun 24, 2021

test_ba_thru_btg_new.zip
Codes ba through btg that were not already in the global test set.

@juliakreutzer
Collaborator

Oh yeah, maybe we should keep a blacklist of all the language codes that have issues according to the tables in the appendix of https://arxiv.org/abs/2103.12028.
Btw, the sources of the test set were selected based on how many African languages they had been translated into, so there is a bias towards frequent/general sentences. This is important to keep in mind as we extend the test set to more languages, since the initial selection of languages shaped which sentences were chosen.
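One way the exclusion list suggested above could look, as a rough sketch rather than a settled design: a mapping from language code to the reason it is skipped, applied before any test set is generated. The names `EXCLUDED_CODES` and `split_by_exclusion` are hypothetical, and the single entry comes from this thread (bfi turned out to contain English text); entries based on the paper's appendix tables would have to be filled in by hand.

```python
# ASSUMPTION: illustrative exclusion list. Only "bfi" is taken from this
# thread; any further entries would come from the paper's appendix tables.
EXCLUDED_CODES = {
    "bfi": "British Sign Language; the JW300 text looks like English",
}


def split_by_exclusion(candidate_codes, excluded=EXCLUDED_CODES):
    """Split codes into (kept, dropped), preserving input order.

    dropped is a list of (code, reason) pairs so the skipped
    languages can be logged or reviewed later.
    """
    kept = []
    dropped = []
    for code in candidate_codes:
        if code in excluded:
            dropped.append((code, excluded[code]))
        else:
            kept.append(code)
    return kept, dropped
```

Keeping the reason next to each code makes it easier to revisit an exclusion later, for instance if a corpus is fixed upstream.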


2 participants