-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathHOWTOLD.Rmd
executable file
·268 lines (204 loc) · 9.53 KB
/
HOWTOLD.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
---
title: "Linkage disequilibrium of SBCC barleys"
output:
html_document:
fig_caption: yes
highlight: zenburn
theme: cerulean
toc: yes
toc_depth: 4
pdf_document:
fig_caption: yes
highlight: zenburn
toc: yes
toc_depth: 4
word_document: default
date: '`r format(Sys.Date(),"%d/%b/%Y")`'
---
```{r knitr setup, include=FALSE, eval=TRUE, echo=FALSE, warning=FALSE}
library(knitr)
knitr::opts_chunk$set(eval=TRUE, cache=FALSE, message=FALSE, warning=FALSE,
comment = "", results="markup")
```
This document describes how linkage disequilibrium (LD) was computed from SNPs of 135 landraces of the Spanish Barley Core Collection (SBCC) by E Igartua and B Contreras.
r2 is calculated in four ways: the standard and three new measures of r2, correcting for population structure, with strategies similar to the Q, K and Q+K models in GWAS, r2s, r2v, and r2vs, respectively. These measures were first described by Mangin et al in
[https://www.nature.com/articles/hdy201173](https://www.nature.com/articles/hdy201173)
and are implemented in R package [LDcorSV](https://cran.r-project.org/web/packages/LDcorSV/index.html).
<!--Mangin, B., Siberchicot, A., Nicolas, S., Doligez, A., This, P., & Cierco-Ayrolles, C. (2012). Novel measures of linkage disequilibrium that correct the bias due to population structure and relatedness. Heredity, 108(3), 285.-->
## Formatting SNPs
The initial set of 9,920 Infinium and GBS markers in file [*9920_SNPs_SBCC_50K.tsv*](raw/9920_SNPs_SBCC_50K.tsv)
was converted as follows, allowing up to 10% missing data per position and accepting only biallelic loci (n=8455, after removing two htzg markers), just as done for the STRUCTURE analysis (file [*SBCC_9K_SNPs.structure.tsv*](SBCC_9K_SNPs.structure.tsv) was produced). The final list of SNPs is in file [*SBCC_LD_order.tsv*](LD/SBCC_LD_order.tsv).
Genotype files were prepared for each chromosome, including an artificial chromosome for markers of uncertain location,
with genotypes coded as 0, 2 & NA.
These CSV files are called [*Geno1H*](LD/Geno1H.csv), [*Geno2H*](LD/Geno2H.csv), [*Geno3H*](LD/Geno3H.csv), [*Geno4H*](LD/Geno4H.csv), [*Geno5H*](LD/Geno5H.csv), [*Geno6H*](LD/Geno6H.csv), [*Geno7H*](LD/Geno7H.csv) and
[*Genounk*](LD/Genounk_random.csv).
## Imputing missing data
Before moving on we can take these files, one per chromosome, and impute missing genotype calls
with R package [linkim](https://cran.r-project.org/package=linkim), a linkage-based method for
imputing missing diploid genotypes published with [doi:10.4172/2329-9533.1000115](https://doi.org/10.4172/2329-9533.1000115).
This method has some stochastic component as independent runs provide different results.
For this reason the block code below has eval=FALSE so that it is compatible with LFMM results in
protocol [HOWTOLFMM](./HOWTOsnpsLFMM.html).
```{r impute, eval=FALSE}
library(linkim)
# set barley chr names
chrnames = c("1H","2H","3H","4H","5H","6H","7H")
# import data for all chromosomes (1H to 7H)
Geno = c()
Geno[[1]] = read.csv('LD/Geno1H.csv', header=TRUE, check.names = F, sep=";", row.names = 1)
Geno[[2]] = read.csv('LD/Geno2H.csv', header=TRUE, check.names = F, sep=";", row.names = 1)
Geno[[3]] = read.csv('LD/Geno3H.csv', header=TRUE, check.names = F, sep=";", row.names = 1)
Geno[[4]] = read.csv('LD/Geno4H.csv', header=TRUE, check.names = F, sep=";", row.names = 1)
Geno[[5]] = read.csv('LD/Geno5H.csv', header=TRUE, check.names = F, sep=";", row.names = 1)
Geno[[6]] = read.csv('LD/Geno6H.csv', header=TRUE, check.names = F, sep=";", row.names = 1)
Geno[[7]] = read.csv('LD/Geno7H.csv', header=TRUE, check.names = F, sep=";", row.names = 1)
# read corresponding map info, with markers in the same order as in CSV files
mapmarkers = read.table(file="LD/SBCC_LD_order.tsv",header=T,sep="\t",check.names = F)
# filename for imputed outfile
out_imp_TSV = "LD/SBCC_9K_SNPs.imputed.tsv"
unlink(out_imp_TSV)
for (c in 1:7){
Genotypes = Geno[[c]]
# convert allele calls 2 to 1
Genotypes[Genotypes==2] <- 1
# get markers from column names and sort them
chrmarkers = mapmarkers[ mapmarkers$chr==c, ]
chrmarkers = chrmarkers[ complete.cases(chrmarkers), ]
#chrmarkers = as.numeric(chrmarkers[ , c("cM") ]) # does not work with repeated cM positions
chrmarkers = as.numeric(chrmarkers[ , c("bp") ])
chrmarkers = chrmarkers / 1000000
# impute
full.Genotypes = link.im(data=Genotypes,r=chrmarkers) #,trace = T)
# transpose and print data
full.Genotypes = t(full.Genotypes)
write.table(full.Genotypes,file=out_imp_TSV,sep="\t",row.names=T,col.names=F,quote=F,append=T)
}
```
A header is then added to this file:
```{r, engine='bash'}
perl -F";" -lane 'next if(/^;/); $h.="\t$F[0]"; END{print "marker$h"}' LD/Geno1H.csv > LD/header.tsv
cat LD/header.tsv LD/SBCC_9K_SNPs.imputed.tsv > SBCC_9K_SNPs.imputed.tsv
```
These operations produce file [*SBCC_9K_SNPs.imputed.tsv*](SBCC_9K_SNPs.imputed.tsv) (n=7477).
## Kinship matrix and population structure
A kinship matrix was modified from a square matrix of distances produced by TASSEL and saved as [*WAIS*](LD/WAIS.csv).
In addition, a file with probabilities of membership to K=4 subpopulations was prepared from STRUCTURE output and saved as [*POP*](LD/POP.csv). This is the Q file with 3 columns.
## Computing LD
```{r LD, eval=FALSE}
library(LDcorSV)
# set window length
Wl = 9
# import data for unmapped markers (chr8) as a control, chromosomes 1H to 7H were already loaded
Geno[[8]] = read.csv('LD/Genounk_random.csv', header=TRUE, row.names=1, sep=";", dec=".", na.strings="NA")
# read kinship/covariance matrix and population memberships
KIN<-read.csv('LD/WAIS.csv', header=TRUE, row.names=1, sep=";", dec=".", na.strings="NA")
POP<-read.csv('LD/POP.csv', header=TRUE, row.names=1, sep=";", dec=".", na.strings="NA")
for (c in 1:8){ # 7 real + 1 fake chr for unmapped markers
# set names of output files
outTSVr2s = paste("LD/r2s_",c,"H.tsv", sep="")
outTSVr2 = paste("LD/r2_",c,"H.tsv", sep="")
unlink(c(outTSVr2,outTSVr2s))
# set appropriate chr and last element considering window length
Genotypes = Geno[[c]]
lasti = ncol(Genotypes)-(Wl-1)
lastv = (Wl * (Wl-1))/2 # see https://rdrr.io/cran/LDcorSV/man/LD.Measures.html
for (i in 1:lasti){
Geno.Wl = Genotypes[,i:(i+Wl-1)]
LD = LD.Measures(Geno.Wl,V=KIN,S=POP,data ="G",supinfo=TRUE,na.presence=TRUE)
outLD = LD[,3:6]
mean_values = apply(outLD[1:lastv,],2,mean)
# kinship-corrected LD
r2s = mean_values[3]
write(r2s, file=outTSVr2s, append=T)
# raw LD
r2 = mean_values[1]
write(r2, file=outTSVr2, append=T)
}
}
```
## Validating and exporting LD estimates
```{r LDeval, eval=FALSE}
library(qqman)
# output file names
outTSVr2 = "LD/SBCC_r2.tsv"
outTSVr2s = "LD/SBCC_r2s.tsv"
## r2
# prepare vars required to export LD values to a single file
r2 = c()
idx = 0
lasti = (Wl-1)/2
# prepare plots
par(mfrow=c(2,4))
for (c in 1:8){
# markers anchored to chromosomes vs unsorted markers (chr=8)
LDdata = read.table(file=paste("LD/r2_",c,"H.tsv", sep=""))
boxplot(LDdata,ylim=c(0,1),ylab="r2",main=paste(c,"H (n=",nrow(LDdata),")",sep=""))
# add head void values where LD window was not computed
for (i in 1:lasti){
idx = idx+1
r2[idx] = NA
}
# add bulk LD markers
for (i in 1:nrow(LDdata)){
idx = idx+1
r2[idx] = LDdata[i,]
}
# add tail void values where LD window was not computed
for (i in 1:lasti){
idx = idx+1
r2[idx] = NA
}
}
# actually save to file, leaving out unmapped markers
allLDtable = cbind(r2,mapmarkers)
allLDtable_map = subset(allLDtable, (!is.na(allLDtable[,3])))
write.table(allLDtable_map,file=outTSVr2,sep="\t",row.names=F,col.names=T,quote=F)
# make manhattan plot
mht.data = allLDtable_map[ , c("chr","cM","r2","SNPidentifier")]
mht.data = subset(mht.data, (!is.na(mht.data[,3])))
png("LD/SBCC_r2.png",width=1200)
manhattan(mht.data,chr="chr",bp="cM",p="r2",snp="SNPidentifier",logp=F,
col = c("gray10", "gray60"), chrlabs=chrnames,
suggestiveline=F,genomewideline=F,ylab="r2")
dev.off()
## r2s (structure-corrected)
# prepare vars required to export LD values to a single file
r2s = c()
idx = 0
# prepare plots
par(mfrow=c(2,4))
for (c in 1:8){
# markers anchored to chromosomes vs unsorted markers (chr=8)
LDdata = read.table(file=paste("LD/r2s_",c,"H.tsv", sep=""))
boxplot(LDdata,ylim=c(0,1),ylab="r2s",main=paste(c,"H (n=",nrow(LDdata),")",sep=""))
# add head void values where LD window was not computed
for (i in 1:lasti){
idx = idx+1
r2s[idx] = NA
}
# add bulk LD markers
for (i in 1:nrow(LDdata)){
idx = idx+1
r2s[idx] = LDdata[i,]
}
# add tail void values where LD window was not computed
for (i in 1:lasti){
idx = idx+1
r2s[idx] = NA
}
}
# actually save to file, leaving out unmapped markers
allLDstable = cbind(r2s,mapmarkers)
allLDstable_map = subset(allLDstable, (!is.na(allLDstable[,3])))
write.table(allLDstable_map,file=outTSVr2s,sep="\t",row.names=F,col.names=T,quote=F)
# make manhattan plot
mht.data = allLDstable_map[ , c("chr","cM","r2s","SNPidentifier")]
mht.data = subset(mht.data, (!is.na(mht.data[,3])))
png("LD/SBCC_r2s.png",width=1200)
manhattan(mht.data,chr="chr",bp="cM",p="r2s",snp="SNPidentifier",logp=F,
col = c("gray10", "gray60"), chrlabs=chrnames,
suggestiveline=F,genomewideline=F,ylab="r2s")
dev.off()
```
![Manhattan plots of SBCC linkage disequilibrium (windows of 9 markers)](LD/SBCC_r2.png)
![Manhattan plots of SBCC linkage disequilibrium (windows of 9 markers, structure-corrected)](LD/SBCC_r2s.png)