This repository contains the code for finetuning a pretrained multilingual model with an image-grounded Emergent Communication task. It is the official repository for the ICLR EmeComm Workshop paper "Emergent Communication Fine-tuning (EC-FT) for Pretrained Language Models" and the July 2022 preprint "Learning to Translate by Learning to Communicate".
The image-to-image results (i2i-ec) from the July 2022 preprint should be replicable at the tag preprint_jul22_i2i-ec, and the text-to-image results (t2i-ec) at the tag preprint_jul22_t2i-ec.
The code was loosely based on the following paper:
Yaoyiran Li, Edoardo Maria Ponti, Ivan Vulić, and Anna Korhonen. 2020. Emergent Communication Pretraining for Few-Shot Machine Translation. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020). LINK
and is under development by the CLMBR Lab at the University of Washington, led by Shane Steinert-Threlkeld.
The source code is built around PyTorch and has the following main dependencies:
- Python 3.9
- transformers 4.0.1
For the full list of dependencies, see requirements.txt.
conda create -n unmt python=3.9
conda activate unmt
pip install -r requirements.txt
pip install torch==1.12.1+cu102 --extra-index-url https://download.pytorch.org/whl/cu102
Important Note: We develop this project with torch==1.12.1+cu102; please make sure this exact package version is used for reproducibility.
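As a quick sanity check (a minimal sketch, not part of the repository's scripts), you can verify that the installed build matches the expected version:

import torch

# The reported results were produced with torch 1.12.1 built against CUDA 10.2.
assert torch.__version__ == "1.12.1+cu102", f"unexpected torch build: {torch.__version__}"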
To reproduce the latest results of this project, go to the `communication-translation` folder and run the relevant script from `RunScripts`.
This project uses structured configs implemented with hydra, located in the `Configs` directory. We briefly explain how the configs are organized in this project and refer the user to the original hydra documentation for an explanation of what "structured configs" means.
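As a rough illustration of the idea (the dataclass and its fields below are hypothetical, not the repository's actual schema), a hydra structured config is a Python dataclass registered with the ConfigStore, against which the YAML files of a config group can be validated:

from dataclasses import dataclass

from hydra.core.config_store import ConfigStore


@dataclass
class TrainEvalConfig:
    # hypothetical fields, for illustration only
    learning_rate: float = 3e-5
    max_steps: int = 100000


cs = ConfigStore.instance()
# YAML files in the `train_eval` group can now be type-checked against this schema
cs.store(group="train_eval", name="base_train_eval", node=TrainEvalConfig)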
Our pipeline mainly consists of two parts: backtranslation (`Configs/backtranslate`, "BT") and emergent communication (`Configs/ec`, "EC"). Backtranslation follows the iterative backtranslation process in the mBART paper, but applied to different language pairs.
Backtranslation (BT) is the bulk of each experiment, and BT runs for the same number of steps in all experiments. One can view EC as a very lightweight training stage (about 30 minutes of EC versus 12 hours of BT) inserted into the backtranslation process. Given a language pair and an image-embedding source, experiments mainly vary along two dimensions: 1. where EC is inserted in the BT process, and 2. which type of EC is inserted (T2I or I2I).
|===== BT =====||===== Optional: EC ======||=========== BT ============|
With that in mind, we organize configs as follows. We will use BT as an example:
Configs/backtranslate/
├── bt_baseline.yaml          # baseline that only does BT
├── data                      # pointing to different language data files
│   ├── en-de.yaml
│   ├── en-ne.yaml
│   ├── en-si.yaml
│   └── en-zh.yaml
├── train_eval                # different training configs for backtranslation
│   ├── baseline.yaml
│   ├── initial.yaml
│   └── secondary.yaml
│   # different configs for different experiments;
│   # each config essentially combines sub-configs in `data` and `train_eval`
├── i2i_bt_initial.yaml
├── i2i_bt_no_initBT.yaml
├── i2i_bt_secondary.yaml
├── t2i_bt_initial.yaml
├── t2i_bt_secondary.yaml
└── t2i_bt.yaml
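For example, an experiment config can be composed and inspected programmatically with hydra's compose API (a minimal sketch for hydra ≥ 1.1; treating `data` as a config group in the override below is an assumption based on the layout above):

from hydra import compose, initialize
from omegaconf import OmegaConf

# Compose the i2i "initial" experiment config and swap in the en-ne data sub-config.
with initialize(config_path="Configs/backtranslate"):
    cfg = compose(config_name="i2i_bt_initial", overrides=["data=en-ne"])
    print(OmegaConf.to_yaml(cfg))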
Since we ran our experiments on a server managed by condor, we include many `*.cmd` files at the repository root.
Source code can be largely automatically formatted using yapf. Make sure you have yapf installed (it is included in requirements.txt).
pip install yapf
The repository style can be changed as needed, but for now the configuration can be found in setup.cfg. To automatically format source code in place, use the following command:
yapf -ir src_file_or_directory
We recommend running this command after you add any code, and before you commit.
Please also follow other style best practices that yapf does not enforce:
- Always break up lines over 80 characters (in most text editors you can display a ruler to check)
- Name variables with full, descriptive words, space permitting
- Include one blank line at the end of every file
- Organize imports into the following three groups, alphabetizing within each group (and within each group, putting Python library imports before external package ones); a concrete example with real module names is shown after this list
import a
import a as b
from a import b
- Comment any code you add
- "Imperative" style is preferred, e.g.
# save variable to cache
- "Imperative" style is preferred, e.g.
COCO image features are obtained from Translagent.
We release our data and models on the Hugging Face Hub.
The data is under Data/; we additionally use a language model to regularize training. That model is essentially an mBART decoder finetuned on different languages of CC-100 (stored under Output/mbart_lm_lr6e-6).
Part of the code is based on Translagent.
The datasets for our experiments include MS COCO for Emergent Communication pretraining,