-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path03_Rmd_smartpca.Rmd
208 lines (154 loc) · 10.9 KB
/
03_Rmd_smartpca.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
---
title: "Principal Components Analysis (PCA)"
output: html_document
editor_options:
chunk_output_type: console
---
```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE)
```
```{r}
library(magrittr)
```
Principal components analysis (PCA) is one of the most useful techniques to visualise genetic diversity in a dataset. The methodology is not restricted to genetic data, but in general allows breaking down high-dimensional datasets to two or more dimensions for visualisation in a two-dimensional space.
## Genotype Data
This lesson is also our first contact with the genotype data used in this and most of the following lessons. The dataset that we will work with contains 1,340 individuals, each represented by 593,124 single nucleotide polymorphisms (SNPs). Those SNPs have exactly two different alleles, and each individual has one of four possible values at each genotype: homozygous reference, heterozygous, homozygous alternative, or missing. Those four values are encoded 2, 1, 0 and 9 respectively.
The data is laid out as a matrix, with columns indicating individuals, and rows indicating SNPs. The data itself comes in the so-called \"EIGENSTRAT\" format, which is defined in the [Eigensoft package](https://github.com/DReichLab/EIG) used by many tools used in this workshop. In this format, a genotype dataset consists of three files, usually with the following file endings:
* `*.snp`: The file containing the SNP positions. It consists of six columns: SNP-name, chromosome, genetic positions, physical position, reference allele, alternative allele.
* `*.ind`: The file containing the names of the individuals. It consists of three columns: Individual Name, Sex (encoded as M(ale), F(emale), or U(nknown)), and population name.
* `*.geno`: The file containing the genotype matrix, with individuals laid out from left to right, and SNP positions laid out from top to bottom.
In the following, we will explore the files using R in this Rmarkdown document.
The data that we want to analyse is stored at `data/popgen_course`. Let's list the contents of that directory:
```{r}
list.files("data/popgen_course/")
```
Let's explore those files a bit. Here are the first 20 individuals:
```{r}
individuals <- readr::read_delim(
"data/popgen_course/genotypes_small.ind",
delim = " ",
trim_ws = T,
col_names = c(
"name",
"sex",
"population"
)
)
individuals %>% head(20)
```
And here the first 20 SNP rows:
```{r}
snps <- readr::read_delim(
"data/popgen_course/genotypes_small.snp",
delim = " ",
trim_ws = T,
col_names = c(
"SNP_name",
"chromosome",
"genetic_position",
"physical_position",
"reference_allele",
"alternative_allele"
)
)
```
And here are the first 20 genotypes of the first 50 individuals:
```{r}
geno <- readr::read_lines(
"data/popgen_course/genotypes_small.geno",
n_max = 20
)
geno %>% substr(1, 50)
```
Counting how many individuals and SNPs there are:
```{r}
nrow(individuals)
nrow(snps)
```
And now we check that the first row of the `*.geno` file indeed contains the same number of columns:
```{r}
nchar(geno[1])
```
Now counting the number of rows in the `*.geno`-file (this takes a few seconds, as the file is several hundred MB large):
```{r}
R.utils::countLines("data/popgen_course/genotypes_small.geno") %>% as.integer()
```
Great, the number of rows and columns agrees with the numbers indicated in the `*.ind` and `*.snp` file! Now we're counting how many different populations there are. Let's first see the first 10 populations in the sorted list, alongside the number of individuals in each group:
```{r}
individuals %>%
dplyr::group_by(population) %>%
dplyr::count()
```
## How PCA works
To understand how PCA works, consider a single individual and its representation by its 593,124 markers. Formally, each individual is a point in a 593,124-dimensional space, where each dimension can take only the three possible genotypes indicated above, or have missing data. To visualise this high-dimensional dataset, we would like to project it down to two dimensions. But as there are many ways to project the shadow of a three-dimensional object on a two dimensional plane, there are many (and even more) ways to project a 593,124-dimensional cloud of points to two dimensions. What PCA does is figuring out the \"best\" way to do this project in order to visualise the major components of variance in the data.
For actually running the analysis, we use a software called `smartPCA` from the [Eigensoft package](https://github.com/DReichLab/EIG). As many other tools from this and related packages, `smartPCA` reads in a parameter file which specifies its input and output files and options. In our case, we want the parameter file to have the following content:
```
genotypename: data/popgen_course/genotypes_small.geno
snpname: data/popgen_course/genotypes_small.snp
indivname: data/popgen_course/genotypes_small.ind
evecoutname: pca.WestEurasia.evec
evaloutname: pca.WestEurasia.eval
poplistname: data/popgen_course/WestEurasia.poplist.txt
lsqproject: YES
numoutevec: 4
numthreads: 1
```
Here, the first three parameters specify the input genotype files. The next two rows specify two output file names, typically with ending `*.evec` and `*.eval`. The parameter line beginning with `poplistname` contains a file with a list of populations used for calculating the principal components (see below). The option `lsqproject` is important for applications including ancient DNA with lots of missing data, which I will not elaborate on. For the purpose of this workshop, you should use `lsqproject: YES`. The next option `numoutevec` specifies the number of principal components that we compute, the last option `numthreads` the number of CPUs to use for this run. We use just one since we're working together on the same computer, so cannot afford everyone running on lots of CPUs.
## Population lists vs. Projection
The parameter named `poplistname` is a very crucial one. It specifies the populations whose individuals are used to calculate the principal components. Why not just all of them you ask? For two reasons: First, there are simply too many of them and we don't want to use all of them, since the computation would take too long. More importantly, however, we generally try to avoid using ancient samples to compute principal components, to avoid specific ancient-DNA related artefacts affecting the computation. Finally, the list of populations to use for PCA should be informed by your question. If you're investigating African population structure, in makes no sense to put Asian or European individuals in your population list, since then the main axes of genetic differentiation would not be inside of Africa, but between Africans and Non-Africans.
So what happens to individuals that are not in populations listed in the population list? Well, fortunately, they are not just ignored, but \"projected\". This means that after the principal components have been computed, *all* individuals (not just the one in the list) are projected onto these principal components. That way, we can visualise ancient populations in the context of modern genetic variation. While that may sound a bit problematic at first (Some variation in ancient populations is not represented well by modern populations), but it turns out to be nevertheless one of the most useful tools for this purpose. The advantage of avoiding ancient-DNA artefacts and batch effects to affect the visualisation outweighs the disadvantage of missing some private genetic variation components in the ancient populations themselves. Of course, that argument breaks down once the analysed populations become too ancient and detached from modern genetic variation. But for our purposes it will work just fine.
For this workshop, I prepared two population lists::
```
data/popgen_course/WestEurasia.poplist.txt
data/popgen_course/AllEurasia.poplist.txt
```
As you can tell from the names of the files, they specify two sets of modern populations representing West Eurasia or all of Europe and Asia, respectively.
I recommend to look through both of the population lists and google some population names that you don't recognise to get a feeling for the ethnic groups represented here.
## Running `smartPCA`
Now go ahead and open a new text file using your Jupyter Browser, you can name it anything you like. For the sake of a concrete name, let's call it `pca.WestEurasia.params.txt`. Text files in Jupyter are opened in a text editor, so you can then simply copy-paste the above lines into the new file.
```{r}
readr::write_lines(c(
"genotypename: data/popgen_course/genotypes_small.geno",
"snpname: data/popgen_course/genotypes_small.snp",
"indivname: data/popgen_course/genotypes_small.ind",
"evecoutname: pca.WestEurasia.evec",
"evaloutname: pca.WestEurasia.eval",
"poplistname: data/popgen_course/WestEurasia.poplist.txt",
"lsqproject: YES",
"numoutevec: 4",
"numthreads: 1"
),
path = "pca.WestEurasia.params.txt"
)
```
Let's see whether it worked, by printing out the contents of that file into your notebook:
```{r}
readr::read_lines(
"pca.WestEurasia.params.txt"
)
```
Great, so that's our parameter file for running `smartPCA`.
**Note:** that we specified two output files in our parameter file, here called `pca.WestEurasia.evec` and `pca.WestEurasia.eval`. You can actually put any names you want in there. But beware of relative vs. absolute paths. File names starting with `/` are considered \"absolute\", that is, taken to go from the root of the file system. In contrast, filenames not starting with `/` are considered \"relative\" to the current working directory. If you forgot which directory you're in, run `pwd`.
**Note:** The option `poplistname` is a crucial one. Here you need to specify which populations are used to compute the eigenvectors of the principal components analysis. In our case, I have prepared two population list files: `data/popgen_course/WestEurasia.poplist.txt` and `data/popgen_course/AllEurasia.poplist.txt`. Pick one of the two to carry on.
Good, now we can run `smartPCA`. To do that, it's more convenient to use the terminal than a Rmarkdown file. So open a terminal and run
```
smartpca -p pca.WestEurasia.params.txt
```
This will typically run for about 30 minutes and output lots of logging output to the screen.
In a similar manner we can prepare a parameter file for the AllEurasia population list. This is how it should look:
```
genotypename: data/popgen_course/genotypes_small.geno
snpname: data/popgen_course/genotypes_small.snp
indivname: data/popgen_course/genotypes_small.ind
evecoutname: pca.AllEurasia.evec
evaloutname: pca.AllEurasia.eval
poplistname: data/popgen_course/AllEurasia.poplist.txt
lsqproject: YES
numoutevec: 4
numthreads: 1
```
And similar to the command above, we can run pca on the AllEurasia population list via:
```
smartpca -p pca.AllEurasia.params.txt
```
which will run slightly longer than the first one because there are more populations.