Create back-translation.md #463

Draft
wants to merge 1 commit into
base: master

liashahnazaryan
Contributor

Description

Fixes #81

Type of PR

  • Creates the article [Back-translation]

Checklist:

Collaborator

@cefoo left a comment

Thanks so much for your hard work, @liashahnazaryan!

I hope my comments make sense. Let me know what you think. :)

title: Back-translation
description:
description: Back-translating or back-copying target language sentences to augment parallel data
Collaborator

Perhaps this definition is not clear. Should we say something like "Translation of the machine translation output back into its input language"?

Collaborator

Another option would be something like "Creating parallel data by translating monolingual data from the target language to the source language".

---

**Back-translation** is the process of translating the monolingual data in the target language into the source language and then back into the target language.
Collaborator

I'd simplify "the monolingual data in the target language" to make it less wordy.

What do you think of something like this?

Back-translation is the process of using machine translation to translate machine translation output again in order to generate synthetic data.

Steps

1- First, the machine translation system translates a text from one language to another language.
2- Then, the system uses the machine translation output as input, and translates it back into the original language.
3- Finally, the system translates the resulting synthetic input text again into the output language.

Collaborator

Do you think creating an image would make it easier to understand?

Collaborator

I disagree with this description, because it makes it sound as though the data is then translated from source to target. But the training data is only translated from target to source.

Translation from source to target only happens at inference time, but that's a given.
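
To make the direction concrete, here is a minimal sketch of the data flow described above. It assumes a hypothetical `translate_tgt_to_src` callable standing in for a trained target-to-source model; the toy German-to-English dictionary and all names are invented for illustration and are not part of the article.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    translate_tgt_to_src: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each authentic target sentence with a synthetic source sentence.

    Translation runs only from target to source; the target side of each
    pair is the original, human-written monolingual text.
    """
    pairs = []
    for target in target_sentences:
        synthetic_source = translate_tgt_to_src(target)
        pairs.append((synthetic_source, target))
    return pairs


# Hypothetical stand-in for a real target-to-source (German-to-English) model.
TOY_DE_TO_EN = {"Guten Morgen": "Good morning", "Danke": "Thank you"}


def toy_reverse_model(sentence: str) -> str:
    # Falls back to copying the sentence when the toy dictionary has no entry.
    return TOY_DE_TO_EN.get(sentence, sentence)


monolingual_target = ["Guten Morgen", "Danke"]
synthetic_pairs = back_translate(monolingual_target, toy_reverse_model)
```

The resulting `(synthetic source, authentic target)` pairs would then be added to the real parallel data used to train the source-to-target system. No source-to-target translation happens while the pairs are being built.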

---

**Back-translation** is the process of translating the monolingual data in the target language into the source language and then back into the target language.
The goal is to generate synthetic [parallel data](/customisation/parallel-data.md) that can be used to train machine translation systems.
Collaborator

This is a great place to introduce the "augment" term!
Perhaps we can say that synthetic data is necessary to augment parallel data so that there is more data to train machine translation systems and improve quality.

Collaborator

"Necessary" is not always true. We can be less opinionated.

The goal is to generate synthetic [parallel data](/customisation/parallel-data.md) that can be used to train machine translation systems.

**Back-copying** is a similar technique to back-translation.
The process involves using an existing translation to create a new parallel sentence pair in the opposite direction.
Collaborator

I don't think it's clear how it differs from back-translation. Is the "existing" translation human-generated?

@bittlingmayer
Collaborator

I don't want to make articles longer, but I see a few things missing:

  • the motivation: there is a lot more monolingual data out there than parallel data
  • the effect: if overdone, helps fluency more than accuracy, since it's kind of like a target-side language model
  • adoption: Google, Microsoft, DeepL etc. use this technique heavily
  • references: can we link some papers or articles? there was one year at WMT where this really became known

@liashahnazaryan
Contributor Author

I'll make this a draft to work on the changes and add the suggested parts.

@liashahnazaryan marked this pull request as draft April 7, 2023 19:13