```{r, echo = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
```
```{r, echo = F, eval = F}
# (PART) THE BASICS: MODELS, PROBABILITY, BAYES' RULE, AND R {-}
# In this part of the project, we mainly reproduce figures. But we will fit models with increasing regularity as the chapters progress.
```
# Introduction: Credibility, Models, and Parameters
> The goal of this chapter is to introduce the conceptual framework of Bayesian data analysis. Bayesian data analysis has two foundational ideas. The first idea is that Bayesian inference is reallocation of credibility across possibilities. The second foundational idea is that the possibilities, over which we allocate credibility, are parameter values in meaningful mathematical models. [@kruschkeDoingBayesianData2015, p. 15]
## Bayesian inference is reallocation of credibility across possibilities
The first step toward making Figure 2.1 is putting together a data object. To help with that, we'll open up the **tidyverse**.
```{r, message = F, warning = F}
library(tidyverse)
d <-
crossing(iteration = 1:3,
stage = factor(c("Prior", "Posterior"),
levels = c("Prior", "Posterior"))) %>%
expand(nesting(iteration, stage),
Possibilities = LETTERS[1:4]) %>%
mutate(Credibility = c(rep(.25, times = 4),
0, rep(1/3, times = 3),
0, rep(1/3, times = 3),
rep(c(0, .5), each = 2),
rep(c(0, .5), each = 2),
rep(0, times = 3), 1))
```
When making data with many repetitions in the rows, it's good to have the `tidyr::expand()` function up your sleeve. Go [here](https://tidyr.tidyverse.org/reference/expand.html) to learn more.
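If the `expand()` and `nesting()` workflow is new to you, here's a toy example (the `letter`, `number`, and `rep` columns are made up for illustration). `crossing()` returns all combinations of its inputs, whereas variables wrapped in `nesting()` contribute only the combinations already present in the data.
```{r}
# all combinations of `letter` and `number`
crossing(letter = c("a", "b"),
         number = 1:2)
# `nesting()` preserves only the (letter, number) pairs that already exist
tibble(letter = c("a", "b"),
       number = 1:2) %>%
  expand(nesting(letter, number),
         rep = 1:2)
```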
We can take a look at the top few rows of the data with the `head()` function.
```{r}
head(d)
```
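For a more compact overview of a data frame, you could also use the `glimpse()` function, which displays one column per row.
```{r}
glimpse(d)
```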
Before we attempt Figure 2.1, we'll need two supplemental data frames. The first one, `text`, will supply the coordinates for the annotation in the plot. The second, `arrow`, will supply the coordinates for the arrows.
```{r}
text <-
tibble(Possibilities = "B",
Credibility = .75,
label = str_c(LETTERS[1:3], " is\nimpossible"),
iteration = 1:3,
stage = factor("Posterior",
levels = c("Prior", "Posterior")))
arrow <-
tibble(Possibilities = LETTERS[1:3],
iteration = 1:3) %>%
expand(nesting(Possibilities, iteration),
Credibility = c(0.6, 0.01)) %>%
mutate(stage = factor("Posterior", levels = c("Prior", "Posterior")))
```
Now we're ready to code our version of Figure 2.1.
```{r, fig.width = 6, fig.height = 4}
d %>%
ggplot(aes(x = Possibilities, y = Credibility)) +
geom_col(color = "grey30", fill = "grey30") +
# annotation in the bottom row
geom_text(data = text,
aes(label = label)) +
# arrows in the bottom row
geom_line(data = arrow,
arrow = arrow(length = unit(0.30, "cm"),
ends = "first", type = "closed")) +
facet_grid(stage ~ iteration) +
theme(axis.ticks.x = element_blank(),
panel.grid = element_blank(),
strip.text.x = element_blank())
```
We will take a similar approach to make our version of Figure 2.2. But this time, we'll define our supplemental data sets directly within `geom_text()` and `geom_line()`. It's good to have both methods up your sleeve. Also notice how we fed our primary data set directly into `ggplot()` without saving it as an object first.
```{r, fig.width = 2.25, fig.height = 4}
# primary data
crossing(stage = factor(c("Prior", "Posterior"),
levels = c("Prior", "Posterior")),
Possibilities = LETTERS[1:4]) %>%
mutate(Credibility = c(rep(0.25, times = 4),
rep(0, times = 3),
1)) %>%
# plot!
ggplot(aes(x = Possibilities, y = Credibility)) +
geom_col(color = "grey30", fill = "grey30") +
# annotation in the bottom panel
geom_text(data = tibble(
Possibilities = "B",
Credibility = .8,
label = "D is\nresponsible",
stage = factor("Posterior", levels = c("Prior", "Posterior"))
), aes(label = label)
) +
# the arrow
geom_line(data = tibble(
Possibilities = LETTERS[c(4, 4)],
Credibility = c(.25, .99),
stage = factor("Posterior", levels = c("Prior", "Posterior"))
),
arrow = arrow(length = unit(0.30, "cm"), ends = "last", type = "closed"),
color = "grey92") +
facet_wrap(~ stage, ncol = 1) +
theme(axis.ticks.x = element_blank(),
panel.grid = element_blank())
```
### Data are noisy and inferences are probabilistic.
Now on to Figure 2.3. I'm pretty sure the curves in the plot are Gaussian, which we'll make with the `dnorm()` function. After a little trial and error, their standard deviations look to be 1.2. However, it's tricky placing those curves alongside the probabilities, because the probabilities for the four discrete sizes (i.e., 1 through 4) are in a different metric than the Gaussian density curves. Since the probability metric for the four discrete sizes is the primary metric of the plot, we need to rescale the curves with a little algebra, which we do in the data code below. After that, the code for the plot is relatively simple.
```{r, fig.width = 2.25, fig.height = 2.5}
# data
tibble(mu = 1:4,
p = .25) %>%
expand(nesting(mu, p),
x = seq(from = -2, to = 6, by = .1)) %>%
mutate(density = dnorm(x, mean = mu, sd = 1.2)) %>%
mutate(d_max = max(density)) %>%
mutate(rescale = p / d_max) %>%
mutate(density = density * rescale) %>%
# plot!
ggplot(aes(x = x)) +
geom_col(data = . %>% distinct(mu, p),
aes(x = mu, y = p),
fill = "grey67", width = 1/3) +
geom_line(aes(y = density, group = mu)) +
scale_x_continuous(breaks = 1:4) +
scale_y_continuous(breaks = 0:5 / 5) +
coord_cartesian(xlim = c(0, 5),
ylim = c(0, 1)) +
labs(title = "Prior",
x = "Possibilities",
y = "Credibility") +
theme(axis.ticks.x = element_blank(),
panel.grid = element_blank())
```
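In case you'd like to confirm the rescaling algebra, the maximum density of a Gaussian is $1 / (\sigma \sqrt{2 \pi})$, which depends only on the standard deviation. With $\sigma = 1.2$ that's about 0.33, so multiplying the densities by `rescale` makes each curve peak at its `p` value of .25. Here's a quick check:
```{r}
# peak density of a Gaussian with sd = 1.2, evaluated at its mean
dnorm(0, mean = 0, sd = 1.2)
# the same value, from the closed-form formula
1 / (1.2 * sqrt(2 * pi))
```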
We can use the same basic method to make the bottom panel of Figure 2.3. Kruschke gave the relative probability values on page 21. Notice that they sum exactly to 1. The only other notable change from the previous plot is the addition of a `geom_point()` section, the data for which we defined on the fly.
```{r, fig.width = 2.25, fig.height = 2.5}
tibble(mu = 1:4,
p = c(.11, .56, .31, .02)) %>%
expand(nesting(mu, p),
x = seq(from = -2, to = 6, by = .1)) %>%
mutate(density = dnorm(x, mean = mu, sd = 1.2)) %>%
mutate(d_max = max(density)) %>%
mutate(rescale = p / d_max) %>%
mutate(density = density * rescale) %>%
# plot!
ggplot() +
geom_col(data = . %>% distinct(mu, p),
aes(x = mu, y = p),
fill = "grey67", width = 1/3) +
geom_line(aes(x = x, y = density, group = mu)) +
geom_point(data = tibble(x = c(1.75, 2.25, 2.75), y = 0),
aes(x = x, y = y),
size = 3, color = "grey33", alpha = 3/4) +
scale_x_continuous(breaks = 1:4) +
scale_y_continuous(breaks = 0:5 / 5) +
coord_cartesian(xlim = c(0, 5),
ylim = c(0, 1)) +
labs(title = "Posterior",
x = "Possibilities",
y = "Credibility") +
theme(axis.ticks.x = element_blank(),
panel.grid = element_blank())
```
> In summary, the essence of Bayesian inference is reallocation of credibility across possibilities. The distribution of credibility initially reflects prior knowledge about the possibilities, which can be quite vague. Then new data are observed, and the credibility is re-allocated. Possibilities that are consistent with the data garner more credibility, while possibilities that are not consistent with the data lose credibility. Bayesian analysis is the mathematics of re-allocating credibility in a logically coherent and precise way. (p. 22)
## Possibilities are parameter values in descriptive models
"A key step in Bayesian analysis is defining the set of possibilities over which credibility is allocated. *This is not a trivial step*, because there might always be possibilities beyond the ones we include in the initial set" (p. 22, *emphasis* added).
In the last section, we used the `dnorm()` function to make curves following the normal distribution (a.k.a. the Gaussian distribution). Here we'll do that again, but also use the `rnorm()` function to simulate actual data from that same normal distribution. Behold Figure 2.4.a.
```{r, fig.width = 3, fig.height = 2.5}
# set the seed to make the simulation reproducible
set.seed(2)
# simulate the data with `rnorm()`
d <- tibble(x = rnorm(2000, mean = 10, sd = 5))
# plot!
ggplot(data = d, aes(x = x)) +
geom_histogram(aes(y = ..density..),
binwidth = 1, fill = "grey67",
color = "grey92", size = 1/10) +
geom_line(data = tibble(x = seq(from = -6, to = 26, by = .01)),
aes(x = x, y = dnorm(x, mean = 10, sd = 5)),
color = "grey33") +
coord_cartesian(xlim = c(-5, 25)) +
labs(subtitle = "The candidate normal distribution\nhas a mean of 10 and SD of 5.",
x = "Data Values",
y = "Data Probability") +
theme(panel.grid = element_blank())
```
Did you notice how we made the data for the density curve within `geom_line()`? That's one way to do it, and in our next plot we'll take a more elegant approach with the `stat_function()` function. Here's our Figure 2.4.b.
```{r, fig.width = 3, fig.height = 2.5}
ggplot(data = d, aes(x = x)) +
geom_histogram(aes(y = ..density..),
binwidth = 1, fill = "grey67",
color = "grey92", size = 1/8) +
stat_function(fun = dnorm, n = 101, args = list(mean = 8, sd = 6),
color = "grey33", linetype = 2) +
coord_cartesian(xlim = c(-5, 25)) +
labs(subtitle = "The candidate normal distribution\nhas a mean of 8 and SD of 6.",
x = "Data Values",
y = "Data Probability") +
theme(panel.grid = element_blank())
```
## The steps of Bayesian data analysis
> In general, Bayesian analysis of data follows these steps:
>
> 1. Identify the data relevant to the research questions. What are the measurement scales of the data? Which data variables are to be predicted, and which data variables are supposed to act as predictors?
> 2. Define a descriptive model for the relevant data. The mathematical form and its parameters should be meaningful and appropriate to the theoretical purposes of the analysis.
> 3. Specify a prior distribution on the parameters. The prior must pass muster with the audience of the analysis, such as skeptical scientists.
> 4. Use Bayesian inference to re-allocate credibility across parameter values. Interpret the posterior distribution with respect to theoretically meaningful issues (assuming that the model is a reasonable description of the data; see next step).
> 5. Check that the posterior predictions mimic the data with reasonable accuracy (i.e., conduct a "posterior predictive check"). If not, then consider a different descriptive model.
>
> Perhaps the best way to explain these steps is with a realistic example of Bayesian data analysis. The discussion that follows is abbreviated for purposes of this introductory chapter, with many technical details suppressed. (p. 25)
I will show you a few more details than Kruschke did in the text. But just as he did, we'll cover this workflow in much more detail in the chapters to come.
In order to recreate Figure 2.5, we need to generate the data and fit a model to those data. In his `HtWtDataGenerator.R` script, Kruschke provided the code for a function that will generate height/weight data of the kind in his text. Here is the code in full:
```{r HtWtDataGenerator}
HtWtDataGenerator <- function(nSubj, rndsd = NULL, maleProb = 0.50) {
# Random height, weight generator for males and females. Uses parameters from
# Brainard, J. & Burmaster, D. E. (1992). Bivariate distributions for height and
# weight of men and women in the United States. Risk Analysis, 12(2), 267-275.
# Kruschke, J. K. (2011). Doing Bayesian data analysis:
# A Tutorial with R and BUGS. Academic Press / Elsevier.
# Kruschke, J. K. (2014). Doing Bayesian data analysis, 2nd Edition:
# A Tutorial with R, JAGS and Stan. Academic Press / Elsevier.
# require(MASS)
# Specify parameters of multivariate normal (MVN) distributions.
# Men:
HtMmu <- 69.18
HtMsd <- 2.87
lnWtMmu <- 5.14
lnWtMsd <- 0.17
Mrho <- 0.42
Mmean <- c(HtMmu, lnWtMmu)
Msigma <- matrix(c(HtMsd^2, Mrho * HtMsd * lnWtMsd,
Mrho * HtMsd * lnWtMsd, lnWtMsd^2), nrow = 2)
# Women cluster 1:
HtFmu1 <- 63.11
HtFsd1 <- 2.76
lnWtFmu1 <- 5.06
lnWtFsd1 <- 0.24
Frho1 <- 0.41
prop1 <- 0.46
Fmean1 <- c(HtFmu1, lnWtFmu1)
Fsigma1 <- matrix(c(HtFsd1^2, Frho1 * HtFsd1 * lnWtFsd1,
Frho1 * HtFsd1 * lnWtFsd1, lnWtFsd1^2), nrow = 2)
# Women cluster 2:
HtFmu2 <- 64.36
HtFsd2 <- 2.49
lnWtFmu2 <- 4.86
lnWtFsd2 <- 0.14
Frho2 <- 0.44
prop2 <- 1 - prop1
Fmean2 <- c(HtFmu2, lnWtFmu2)
Fsigma2 <- matrix(c(HtFsd2^2, Frho2 * HtFsd2 * lnWtFsd2,
Frho2 * HtFsd2 * lnWtFsd2, lnWtFsd2^2), nrow = 2)
# Randomly generate data values from those MVN distributions.
if (!is.null(rndsd)) {set.seed(rndsd)}
datamatrix <- matrix(0, nrow = nSubj, ncol = 3)
colnames(datamatrix) <- c("male", "height", "weight")
maleval <- 1; femaleval <- 0 # arbitrary coding values
for (i in 1:nSubj) {
# Flip coin to decide sex
sex <- sample(c(maleval, femaleval), size = 1, replace = TRUE,
prob = c(maleProb, 1 - maleProb))
if (sex == maleval) {datum = MASS::mvrnorm(n = 1, mu = Mmean, Sigma = Msigma)}
if (sex == femaleval) {
Fclust = sample(c(1, 2), size = 1, replace = TRUE, prob = c(prop1, prop2))
if (Fclust == 1) {datum = MASS::mvrnorm(n = 1, mu = Fmean1, Sigma = Fsigma1)}
if (Fclust == 2) {datum = MASS::mvrnorm(n = 1, mu = Fmean2, Sigma = Fsigma2)}
}
datamatrix[i, ] = c(sex, round(c(datum[1], exp(datum[2])), 1))
}
return(datamatrix)
} # end function
```
Now that we have the `HtWtDataGenerator()` function, all we need to do is determine how many cases to simulate and the probability any given case will be male. These are controlled by the `nSubj` and `maleProb` arguments.
```{r, message = F, warning = F}
# set your seed to make the data generation reproducible
set.seed(2)
d <-
HtWtDataGenerator(nSubj = 57, maleProb = .5) %>%
as_tibble()
d %>%
head()
```
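As a quick sanity check of the simulation (this summary is ours, not in Kruschke's text), we might compare the average heights and weights by sex.
```{r}
d %>%
  group_by(male) %>%
  summarise(mean_height = mean(height),
            mean_weight = mean(weight),
            n = n())
```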
We're about ready for the model, which we will fit with the Hamiltonian Monte Carlo (HMC) method via the [**brms** package](https://CRAN.R-project.org/package=brms). We'll introduce **brms** more fully in [Chapter 8][~~JAGS~~ brms], and you can go [here](https://github.com/paul-buerkner/brms) for an introduction, too. In the meantime, let's just load the package and see what happens.
```{r, message = F, warning = F}
library(brms)
```
The traditional use of [diffuse and noninformative priors is discouraged with HMC, as is the uniform distribution for sigma](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations). Instead, we'll use weakly-regularizing priors for the intercept and slope and a half Cauchy with a fairly large scale parameter for $\sigma$.
```{r fit2.1}
fit2.1 <-
brm(data = d,
family = gaussian,
weight ~ 1 + height,
prior = c(prior(normal(0, 100), class = Intercept),
prior(normal(0, 100), class = b),
prior(cauchy(0, 10), class = sigma)),
chains = 4, cores = 4, iter = 2000, warmup = 1000,
seed = 2,
file = "fits/fit02.01")
```
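As an aside, once the model is fit you can confirm the priors it actually used with the `prior_summary()` function.
```{r}
prior_summary(fit2.1)
```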
If you wanted a quick model summary, you could execute `print(fit2.1)`. Again, we'll walk through that and other diagnostics in greater detail starting in [Chapter 8][~~JAGS~~ brms]. For now, here's how we might make Figure 2.5.a.
```{r, fig.width = 3.25, fig.height = 3}
# extract the posterior draws
draws <- as_draws_df(fit2.1)
# this will subset the output
n_lines <- 150
# plot!
draws %>%
slice(1:n_lines) %>%
ggplot() +
geom_abline(aes(intercept = b_Intercept, slope = b_height, group = .draw),
color = "grey50", size = 1/4, alpha = .3) +
geom_point(data = d,
aes(x = height, y = weight),
shape = 1) +
# the `eval(substitute(paste()))` trick came from: https://www.r-bloggers.com/value-of-an-r-object-in-an-expression/
labs(subtitle = eval(substitute(paste("Data with", n_lines, "credible regression lines"))),
x = "Height in inches",
y = "Weight in pounds") +
coord_cartesian(xlim = c(55, 80),
ylim = c(50, 250)) +
theme(panel.grid = element_blank())
```
For Figure 2.5.b., we'll mark off the mode and 95% highest density interval (HDI) with help from the handy `stat_histinterval()` function, provided by the [**tidybayes** package](http://mjskay.github.io/tidybayes) [@R-tidybayes].
```{r, warning = F, message = F, fig.width = 3, fig.height = 3}
library(tidybayes)
draws %>%
ggplot(aes(x = b_height, y = 0)) +
stat_histinterval(point_interval = mode_hdi, .width = .95,
fill = "grey67", slab_color = "grey92",
breaks = 40, slab_size = .2, outline_bars = T) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 8)) +
labs(title = "The posterior distribution",
subtitle = "The mode and 95% HPD intervals are\nthe dot and horizontal line at the bottom.",
x = expression(beta[1]~(slope))) +
theme(panel.grid = element_blank())
```
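If you'd rather have the mode and HDI as numeric output than as a plot annotation, you can feed the draws directly into `mode_hdi()`.
```{r}
draws %>%
  mode_hdi(b_height)
```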
To make Figure 2.6, we use the `brms::predict()` function, which we'll cover more fully in the pages to come.
```{r, fig.width = 4, fig.height = 4}
nd <- tibble(height = seq(from = 53, to = 81, length.out = 20))
predict(fit2.1, newdata = nd) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = height)) +
geom_pointrange(aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
color = "grey67", shape = 20) +
geom_point(data = d,
aes(y = weight),
alpha = 2/3) +
labs(subtitle = "Data with the percentile-based 95% intervals and\nthe means of the posterior predictions",
x = "Height in inches",
y = "Weight in inches") +
theme(panel.grid = element_blank())
```
The posterior predictions might be easier to depict with a ribbon and line, instead.
```{r, fig.width = 4, fig.height = 4}
nd <- tibble(height = seq(from = 53, to = 81, length.out = 30))
predict(fit2.1, newdata = nd) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = height)) +
geom_ribbon(aes(ymin = Q2.5, ymax = Q97.5), fill = "grey75") +
geom_line(aes(y = Estimate),
color = "grey92") +
geom_point(data = d,
aes(y = weight),
alpha = 2/3) +
labs(subtitle = "Data with the percentile-based 95% intervals and\nthe means of the posterior predictions",
x = "Height in inches",
y = "Weight in inches") +
theme(panel.grid = element_blank())
```
"We have seen the five steps of Bayesian analysis in a fairly realistic example. This book explains how to do this sort of analysis for many different applications and types of descriptive models" (p. 30). Are you intrigued and excited? You should be. Welcome to Bayes, friends!
## Session info {-}
```{r}
sessionInfo()
```
```{r, echo = F}
# remove our objects
rm(d, text, arrow, HtWtDataGenerator, fit2.1, draws, n_lines, nd)
```
```{r, echo = F, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```