---
title: Reproducible Extraction of Cross-lingual Topics using R.
output: github_document
---
# rectr <img src="man/figures/rectr_logo.png" align="right" height="200" />
Please cite this package as:
*Chan, C.H., Zeng, J., Wessler, H., Jungblut, M., Welbers, K., Bajjalieh, J., van Atteveldt, W., & Althaus, S. (2020) Reproducible Extraction of Cross-lingual Topics. Communication Methods & Measures. DOI: [10.1080/19312458.2020.1812555](https://doi.org/10.1080/19312458.2020.1812555)*
The `rectr` package contains an example dataset `wiki`, with English and German Wikipedia articles about programming languages and locations in Germany. The package uses the corpus data structure from the `quanteda` package.
```{r}
require(rectr)
require(tibble)
require(dplyr)
wiki
```
Currently, this package supports [aligned fastText](https://github.com/facebookresearch/fastText/tree/master/alignment) from Facebook Research and [Multilingual BERT (MBERT)](https://github.com/google-research/bert/blob/master/multilingual.md) from Google Research. For easier integration, the PyTorch version of MBERT from [Transformers](https://github.com/huggingface/transformers) is used.
# Multilingual BERT
## Step 1: Set up your conda environment
```{r, eval = FALSE}
## set up a conda environment (default name: rectr_condaenv)
mbert_env_setup()
```
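To check that the environment was created, you can list the conda environments known to `reticulate` (this assumes `rectr` sets up the environment via `reticulate`/conda, as the default name above suggests):
```{r, eval = FALSE}
## optional sanity check: "rectr_condaenv" should appear in this list
reticulate::conda_list()
```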
## Step 1: Download MBERT model
```{r, eval = FALSE}
## the model is downloaded to your current directory by default
download_mbert(noise = TRUE)
```
## Step 2: Create corpus
Create a multilingual corpus
```{r}
wiki_corpus <- create_corpus(wiki$content, wiki$lang)
```
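Before building the dfm, a quick sanity check on the input (plain base R, not part of the `rectr` API) shows how many articles there are per language:
```{r, eval = FALSE}
## number of articles per language in the example dataset
table(wiki$lang)
```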
## Step 2: Create bag-of-embeddings dfm
Create a multilingual dfm
```{r, eval = FALSE}
## default
wiki_dfm <- transform_dfm_boe(wiki_corpus, noise = TRUE)
wiki_dfm
```
```{r, include = FALSE}
wiki_dfm <- readRDS("man/figures/wiki_dfm.RDS")
```
## Step 3: Filter dfm
Filter the dfm for language differences
```{r}
wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 2)
wiki_dfm_filtered
```
## Step 4: Estimate GMM
Estimate a Gaussian Mixture Model
```{r}
wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm
```
The document-topic matrix is available in `wiki_gmm$theta`.
Rank the articles according to theta1.
```{r}
wiki %>% mutate(theta1 = wiki_gmm$theta[,1]) %>% arrange(theta1) %>% select(title, lang, theta1) %>% print(n = 400)
```
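For a hard clustering view, each article can also be assigned to its most probable topic. A minimal sketch using base R and `dplyr` on top of `wiki_gmm$theta` (not part of the `rectr` API):
```{r, eval = FALSE}
## assign each article to its highest-theta topic and
## tabulate how the topics are distributed across the two languages
wiki %>%
  mutate(topic = apply(wiki_gmm$theta, 1, which.max)) %>%
  count(lang, topic)
```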
# Aligned fastText
## Step 1: Download word embeddings
Download and preprocess the aligned fastText word embeddings from Facebook. Make sure you have at least 5 GB of disk space and a reasonable amount of RAM. It took around 20 minutes on my machine.
```{r, eval = FALSE}
get_ft("en")
get_ft("de")
```
## Step 1: Read the downloaded word embeddings
```{r}
emb <- read_ft(c("en", "de"))
```
## Step 2: Create corpus
Create a multilingual corpus
```{r}
wiki_corpus <- create_corpus(wiki$content, wiki$lang)
```
## Step 2: Create bag-of-embeddings dfm
Create a multilingual dfm
```{r}
require(future)
plan(multisession)
wiki_dfm <- transform_dfm_boe(wiki_corpus, emb, .progress = TRUE)
wiki_dfm
```
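The `future` plan set above enables parallel processing during the dfm transformation; once the dfm has been computed, you can switch back to sequential execution if you prefer:
```{r, eval = FALSE}
## revert to sequential execution after the parallel step
plan(sequential)
```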
## Step 3: Filter dfm
Filter the dfm for language differences
```{r}
wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 2)
wiki_dfm_filtered
```
## Step 4: Estimate GMM
Estimate a Gaussian Mixture Model
```{r}
wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm
```
The document-topic matrix is available in `wiki_gmm$theta`.
Rank the articles according to theta1.
```{r}
wiki %>% mutate(theta1 = wiki_gmm$theta[,1]) %>% arrange(theta1) %>% select(title, lang, theta1) %>% print(n = 400)
```
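If you also kept the MBERT solution from above under a different name (say `wiki_gmm_mbert`, a hypothetical object name; the code above overwrites `wiki_gmm`), the two document-topic matrices can be compared directly, for example by correlating the theta columns (a strongly negative value simply means the topic labels are flipped):
```{r, eval = FALSE}
## hypothetical comparison: correlate the fastText thetas with a previously
## saved MBERT solution stored as wiki_gmm_mbert
cor(wiki_gmm$theta[, 1], wiki_gmm_mbert$theta[, 1])
```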
# SessionInfo
```{r}
sessionInfo()
```