Create back-translation.md #463
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,18 @@ | ||
--- | ||
parent: Customisation | ||
layout: coming_soon | ||
title: Back-translation | ||
description: | ||
description: Back-translating or back-copying target language sentences to augment parallel data | ||
--- | ||
|
||
**Back-translation** is the process of translating the monolingual data in the target language into the source language and then back into the target language. | ||
Review comment: I'd simplify "the monolingual data in the target language" to make it less wordy. What do you think of something like this? Back-translation is the process of using machine translation to translate again a machine translation output to generate synthetic data. Steps: 1- First, the machine translation system translates a text from one language to another language.
Review comment: Do you think creating an image would make it easier to understand?
Review comment: I disagree with this description, because it makes it sound like then it is translated from source to target. But the training data is only translated from target to source. Translation from source to target only happens at inference time, but that's a given.
||
The goal is to generate synthetic [parallel data](/customisation/parallel-data.md) that can be used to train machine translation systems. | ||
Review comment: This is a great place to introduce the "augment" term!
Review comment: "necessary" is not always true, we can be less opinionated.
||
|
||
**Back-copying** is a similar technique to back-translation. | ||
The process involves using an existing translation to create a new parallel sentence pair in the opposite direction. | ||
Review comment: I don't think it's clear how it differs with
||
|
||
### Challenges | ||
|
||
Back-translation is challenging for language pairs with significantly different syntactic and semantic features, resulting in low-quality parallel data. | ||
|
||
Back-copying generates parallel data that is identical in the source and target languages. | ||
The lack of diversity can result in a machine translation system that is overly reliant on the [training data](/customisation/training-data.md) and performs poorly on unseen data. |
Review comment: Perhaps this definition is not clear. Should we say something like "Translation of the machine translation output back into its input language"?
Review comment: Another option would be something like "Creating parallel data by translating monolingual data from the target language to the source language".
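To make the distinction discussed in this thread concrete, here is a minimal Python sketch. It assumes back-translation means translating monolingual target-language sentences into the source language (target to source only), and that back-copying means reusing the target sentence unchanged as the source side, as the Challenges section describes. The names `back_translate`, `back_copy`, and `toy_reverse_model` are hypothetical and written for illustration only; the toy lexicon stands in for a real target-to-source machine translation system.

```python
from typing import Callable, Iterable, List, Tuple

# A parallel sentence pair, ordered as (source, target).
SentencePair = Tuple[str, str]


def back_translate(
    target_sentences: Iterable[str],
    translate_to_source: Callable[[str], str],
) -> List[SentencePair]:
    """Pair each real target sentence with a machine-generated source sentence."""
    pairs = []
    for target in target_sentences:
        synthetic_source = translate_to_source(target)  # target -> source only
        pairs.append((synthetic_source, target))
    return pairs


def back_copy(target_sentences: Iterable[str]) -> List[SentencePair]:
    """Pair each target sentence with an unchanged copy of itself as the source."""
    return [(target, target) for target in target_sentences]


def toy_reverse_model(sentence: str) -> str:
    """Hypothetical stand-in for a real target-to-source MT system."""
    lexicon = {"hallo": "hello", "welt": "world"}
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())


if __name__ == "__main__":
    monolingual_target = ["Hallo Welt"]
    print(back_translate(monolingual_target, toy_reverse_model))
    print(back_copy(monolingual_target))
```

In both cases the target side is genuine text and only the source side is synthetic, which is why the resulting pairs can augment training data for the source-to-target direction.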