Create back-translation.md #463

Draft: wants to merge 1 commit into base: master
16 changes: 14 additions & 2 deletions customisation/back-translation.md
@@ -1,6 +1,18 @@
---
parent: Customisation
layout: coming_soon
title: Back-translation
description:
description: Back-translating or back-copying target language sentences to augment parallel data
Collaborator

Perhaps this definition is not clear. Should we say something like "Translation of the machine translation output back into its input language"?

Collaborator

Another option would be something like "Creating parallel data by translating monolingual data from the target language to the source language".

---

**Back-translation** is the process of translating the monolingual data in the target language into the source language and then back into the target language.
Collaborator

I'd simplify "the monolingual data in the target language" to make it less wordy.

What do you think of something like this?

Back-translation is the process of using machine translation to translate a machine translation output again in order to generate synthetic data.

Steps

1. First, the machine translation system translates a text from one language to another language.
2. Then, the system uses the machine translation output as input, and translates it back into the original language.
3. Finally, the system translates the resulting synthetic input text again into the output language.

Collaborator

Do you think creating an image would make it easier to understand?

Collaborator

I disagree with this description, because it makes it sound as if the data is then translated from source to target. But the training data is only translated from target to source.

Translation from source to target only happens at inference time, but that's a given.
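
To make the direction discussed above concrete, here is a minimal Python sketch. It assumes a hypothetical `translate_target_to_source` callable standing in for whatever target-to-source system is available; the toy dictionary model and the German/English sentences are illustrative only and do not come from this PR. Only the source side of each pair is synthetic, while the target side stays as the original monolingual sentence.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_sentences: Iterable[str],
    translate_target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic_source, original_target) training pairs."""
    pairs = []
    for target in target_sentences:
        # The only translation step runs from target to source.
        synthetic_source = translate_target_to_source(target)
        pairs.append((synthetic_source, target))
    return pairs


# Hypothetical stand-in for a real target-to-source MT system.
TOY_DE_TO_EN = {"Guten Morgen": "Good morning", "Danke": "Thank you"}


def toy_reverse_model(sentence: str) -> str:
    # Falls back to the input when the toy lookup has no entry.
    return TOY_DE_TO_EN.get(sentence, sentence)


synthetic_pairs = back_translate(["Guten Morgen", "Danke"], toy_reverse_model)
```

The resulting pairs would then be added to the genuine parallel data when training the source-to-target system; no source-to-target translation happens while building the data.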

The goal is to generate synthetic [parallel data](/customisation/parallel-data.md) that can be used to train machine translation systems.
Collaborator

This is a great place to introduce the "augment" term!
Perhaps we can say that synthetic data is necessary to augment parallel data so that there is more data to train machine translation systems and improve quality.

Collaborator

"necessary" is not always true, we can be less opinionated.


**Back-copying** is a similar technique to back-translation.
The process involves using an existing translation to create a new parallel sentence pair in the opposite direction.
Collaborator

I don't think it's clear how it differs from back-translation. Is the "existing" translation human-generated?


### Challenges

Back-translation is challenging for language pairs with significantly different syntactic and semantic features, resulting in low-quality parallel data.

Back-copying generates parallel data that is identical in the source and target languages.
The lack of diversity can result in a machine translation system that is overly reliant on the [training data](/customisation/training-data.md) and performs poorly on unseen data.
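
As a rough illustration of the diversity concern, the sketch below assumes that back-copying means copying each target-language sentence unchanged to the source side, which is how the sentence above describes it. The helpers `back_copy` and `identical_fraction` and the example sentences are hypothetical and are not part of this PR.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source, target)


def back_copy(target_sentences: Iterable[str]) -> List[Pair]:
    """Pair every target sentence with an unchanged copy of itself."""
    return [(sentence, sentence) for sentence in target_sentences]


def identical_fraction(pairs: List[Pair]) -> float:
    """Share of pairs whose source and target sides are the same string."""
    if not pairs:
        return 0.0
    return sum(1 for source, target in pairs if source == target) / len(pairs)


genuine = [("Good morning", "Guten Morgen")]
copied = back_copy(["Danke", "Bitte", "Hallo"])
augmented = genuine + copied
share_identical = identical_fraction(augmented)
```

A check like `identical_fraction` is one possible way to monitor how much of an augmented corpus consists of identical pairs before training on it.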