---
title: Reproducible Extraction of Cross-lingual Topics using R.
output: github_document
---
# rectr <img src="man/figures/rectr_logo.png" align="right" height="200" />
Please cite this package as:
*Chan, C.H., Zeng, J., Wessler, H., Jungblut, M., Welbers, K., Bajjalieh, J., van Atteveldt, W., & Althaus, S. (2020) Reproducible Extraction of Cross-lingual Topics. Communication Methods & Measures. DOI: [10.1080/19312458.2020.1812555](https://doi.org/10.1080/19312458.2020.1812555)*
The `rectr` package contains an example dataset `wiki`, with English and German Wikipedia articles about programming languages and locations in Germany. The package uses the corpus data structure from the `quanteda` package.
```{r}
require(rectr)
require(tibble)
require(dplyr)
wiki
```
Currently, this package supports [aligned fastText](https://github.com/facebookresearch/fastText/tree/master/alignment) from Facebook Research and [Multilingual BERT (MBERT)](https://github.com/google-research/bert/blob/master/multilingual.md) from Google Research. For easier integration, the PyTorch version of MBERT from [Transformers](https://github.com/huggingface/transformers) is used.
# Multilingual BERT
## Step 1: Set up your conda environment
```{r, eval = FALSE}
## set up a conda environment (default name: rectr_condaenv)
mbert_env_setup()
```
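To check that the environment was created, you can list the conda environments known to `reticulate` (this assumes `rectr` sets up the environment via `reticulate`/conda, as the default name above suggests):
```{r, eval = FALSE}
## optional sanity check: "rectr_condaenv" should appear in this list
reticulate::conda_list()
```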
## Step 1: Download MBERT model
```{r, eval = FALSE}
## the model is downloaded to your current directory by default
download_mbert(noise = TRUE)
```
## Step 2: Create corpus
Create a multilingual corpus
```{r}
wiki_corpus <- create_corpus(wiki$content, wiki$lang)
```
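Before building the dfm, a quick sanity check on the input (plain base R, not part of the `rectr` API) shows how many articles there are per language:
```{r, eval = FALSE}
## number of articles per language in the example dataset
table(wiki$lang)
```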
## Step 2: Create bag-of-embeddings dfm
Create a multilingual dfm
```{r, eval = FALSE}
## default
wiki_dfm <- transform_dfm_boe(wiki_corpus, noise = TRUE)
wiki_dfm
```
```{r, include = FALSE}
wiki_dfm <- readRDS("man/figures/wiki_dfm.RDS")
```
## Step 3: Filter dfm
Filter the dfm for language differences
```{r}
wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 2)
wiki_dfm_filtered
```
## Step 4: Estimate GMM
Estimate a Gaussian Mixture Model
```{r}
wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm
```
The document-topic matrix is available in `wiki_gmm$theta`.
Rank the articles according to theta1.
```{r}
wiki %>% mutate(theta1 = wiki_gmm$theta[,1]) %>% arrange(theta1) %>% select(title, lang, theta1) %>% print(n = 400)
```
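For a hard clustering view, each article can also be assigned to its most probable topic. A minimal sketch using base R and `dplyr` on top of `wiki_gmm$theta` (not part of the `rectr` API):
```{r, eval = FALSE}
## assign each article to its highest-theta topic and
## tabulate how the topics are distributed across the two languages
wiki %>%
  mutate(topic = apply(wiki_gmm$theta, 1, which.max)) %>%
  count(lang, topic)
```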
# Aligned fastText
## Step 1: Download word embeddings
Download and preprocess the aligned fastText word embeddings from Facebook. Make sure you have at least 5 GB of disk space and a reasonable amount of RAM. It took around 20 minutes on my machine.
```{r, eval = FALSE}
get_ft("en")
get_ft("de")
```
## Step 1: Read the downloaded word embeddings
```{r}
emb <- read_ft(c("en", "de"))
```
## Step 2: Create corpus
Create a multilingual corpus
```{r}
wiki_corpus <- create_corpus(wiki$content, wiki$lang)
```
## Step 2: Create bag-of-embeddings dfm
Create a multilingual dfm
```{r}
require(future)
plan(multisession)
wiki_dfm <- transform_dfm_boe(wiki_corpus, emb, .progress = TRUE)
wiki_dfm
```
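The `future` plan set above enables parallel processing during the dfm transformation; once the dfm has been computed, you can switch back to sequential execution if you prefer:
```{r, eval = FALSE}
## revert to sequential execution after the parallel step
plan(sequential)
```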
## Step 3: Filter dfm
Filter the dfm for language differences
```{r}
wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 2)
wiki_dfm_filtered
```
## Step 4: Estimate GMM
Estimate a Gaussian Mixture Model
```{r}
wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm
```
The document-topic matrix is available in `wiki_gmm$theta`.
Rank the articles according to theta1.
```{r}
wiki %>% mutate(theta1 = wiki_gmm$theta[,1]) %>% arrange(theta1) %>% select(title, lang, theta1) %>% print(n = 400)
```
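If you also kept the MBERT solution from above under a different name (say `wiki_gmm_mbert`, a hypothetical object name; the code above overwrites `wiki_gmm`), the two document-topic matrices can be compared directly, for example by correlating the theta columns (a strongly negative value simply means the topic labels are flipped):
```{r, eval = FALSE}
## hypothetical comparison: correlate the fastText thetas with a previously
## saved MBERT solution stored as wiki_gmm_mbert
cor(wiki_gmm$theta[, 1], wiki_gmm_mbert$theta[, 1])
```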
# SessionInfo
```{r}
sessionInfo()
```