-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07.Rmd
1655 lines (1244 loc) · 79.4 KB
/
07.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
```{r, echo = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 100)
```
# Markov Chain Monte Carlo
> This chapter introduces the methods we will use for producing accurate approximations to Bayesian posterior distributions for realistic applications. The class of methods is called Markov chain Monte Carlo (MCMC), for reasons that will be explained later in the chapter. It is MCMC algorithms and software, along with fast computer hardware, that allow us to do Bayesian data analysis for realistic applications that would have been effectively impossible $30$ years ago. [@kruschkeDoingBayesianData2015, p. 144]
Statistician David Draper covered some of the history of MCMC in his lecture, [*Bayesian Statistical Reasoning*](https://www.youtube.com/watch?v=072Q18nX91I&frags=pl%2Cwn).
## Approximating a distribution with a large sample
> The concept of representing a distribution by a large representative sample is foundational for the approach we take to Bayesian analysis of complex models. The idea is applied intuitively and routinely in everyday life and in science. For example, polls and surveys are founded on this concept: By randomly sampling a subset of people from a population, we estimate the underlying tendencies in the entire population. The larger the sample, the better the estimation. What is new in the present application is that the population from which we are sampling is a mathematically defined distribution, such as a posterior probability distribution. (p. 145)
Like in Chapters 4 and 6, we need to define the `hdi_of_icdf()` function.
```{r}
hdi_of_icdf <- function(name, width = .95, tol = 1e-8, ... ) {
incredible_mass <- 1.0 - width
interval_width <- function(low_tail_prob, name, width, ...) {
name(width + low_tail_prob, ...) - name(low_tail_prob, ...)
}
opt_info <- optimize(interval_width, c(0, incredible_mass),
name = name, width = width,
tol = tol, ...)
hdi_lower_tail_prob <- opt_info$minimum
return(c(name(hdi_lower_tail_prob, ...),
name(width + hdi_lower_tail_prob, ...)))
}
```
Our `hdi_of_icdf()` function will compute the analytic 95% highest density intervals (HDIs) for the distribution under consideration in Figure 7.1, $\operatorname{Beta}(\theta | 15, 7)$.
```{r}
h <-
hdi_of_icdf(name = qbeta,
shape1 = 15,
shape2 = 7)
h
```
Using an equation from [Chapter 6][Specifying a beta prior.], $\omega = (a − 1) / (a + b − 2)$, we can compute the corresponding mode.
```{r}
(omega <- (15 - 1) / (15 + 7 - 2))
```
To get the density in the upper left panel of Figure 7.1, we'll make use of the `dbeta()` function and of our `h[1:2]` and `omega` values.
```{r, fig.width = 4, fig.height = 2.5, warning = F, message = F}
library(tidyverse)
library(cowplot)
tibble(theta = seq(from = 0, to = 1, length.out = 100)) %>%
mutate(density = dbeta(theta, shape1 = 15, shape2 = 7)) %>%
ggplot() +
geom_area(aes(x = theta, y = density),
fill = "steelblue") +
geom_segment(aes(x = h[1], xend = h[2], y = 0, yend = 0),
size = .75) +
geom_point(aes(x = omega, y = 0),
size = 1.5, shape = 19) +
annotate(geom = "text", x = .675, y = .4,
label = "95% HDI", color = "white") +
scale_x_continuous(expression(theta),
breaks = c(0, h, omega, 1),
labels = c("0", h %>% round(2), omega, "1")) +
ggtitle("Exact distribution") +
ylab(expression(p(theta))) +
theme_cowplot()
```
Note how we're continuing to use `theme_cowplot()`, which we introduced in the last chapter. The remaining panels in Figure 7.1 require we simulate the data.
```{r}
set.seed(7)
d <-
tibble(n = c(500, 5000, 50000)) %>%
mutate(theta = map(n, ~rbeta(., shape1 = 15, shape2 = 7))) %>%
unnest(theta) %>%
mutate(key = str_c("Sample N = ", n))
head(d)
```
With the data in hand, we're ready to plot the remaining panels for Figure 7.1. This time, we'll use the handy `stat_pointinterval()` function from the [**tidybayes** package](https://github.com/mjskay/tidybayes) to mark off the mode and 95% HDIs.
```{r, fig.width = 10, fig.height = 2.75, warning = F, message = F}
library(tidybayes)
d %>%
ggplot(aes(x = theta, y = 0)) +
stat_histinterval(point_interval = mode_hdi, .width = .95,
breaks = 30, fill = "steelblue") +
scale_x_continuous(expression(theta), limits = c(0, 1)) +
scale_y_continuous(NULL, breaks = NULL) +
theme_cowplot() +
facet_wrap(~ key, ncol = 3, scales = "free")
```
If we want the exact values for the mode and 95% HDIs, we can use the `tidybayes::mode_hdi()` function.
```{r}
d %>%
group_by(key) %>%
mode_hdi(theta)
```
If you wanted a better sense of the phenomena, you could do a simulation. We'll make a custom simulation function to compute the modes from many random draws from our $\operatorname{Beta}(\theta | 15, 7)$ distribution, with varying $N$ values.
```{r}
my_mode_simulation <- function(seed) {
set.seed(seed)
tibble(n = c(500, 5000, 50000)) %>%
mutate(theta = map(n, ~rbeta(., shape1 = 15, shape2 = 7))) %>%
unnest(theta) %>%
mutate(key = str_c("Sample N = ", n)) %>%
group_by(key) %>%
mode_hdi(theta)
}
```
Here we put our `my_mode_simulation()` function to work.
```{r sim, fig.width = 8, fig.height = 2.75, warning = F, message = F, cache = T}
# we need an index of the values we set our seed with in our `my_mode_simulation()` function
sim <-
tibble(seed = 1:1e3) %>%
group_by(seed) %>%
# inserting our subsamples
mutate(modes = map(seed, my_mode_simulation)) %>%
# unnesting allows us to access our model results
unnest(modes)
sim %>%
ggplot(aes(x = theta, y = key)) +
geom_vline(xintercept = .7, color = "white") +
stat_histinterval(.width = c(.5, .95), breaks = 20, fill = "steelblue") +
labs(title = expression("Variability of the mode for simulations of "*beta(theta*'|'*15*', '*7)*", the true mode of which is .7"),
subtitle = "For each sample size, the dot is the median, the inner thick line is the percentile-based 50% interval,\nand the outer thin line the percentile-based 95% interval. Although the central tendency\napproximates the true value for all three conditions, the variability of the mode estimate is inversely\nrelated to the sample size.",
x = "mode",
y = NULL) +
coord_cartesian(xlim = c(.6, .8),
ylim = c(1.25, 3.5)) +
theme_cowplot(font_size = 11.5) +
theme(axis.text.y = element_text(hjust = 0),
axis.ticks.y = element_blank())
```
## A simple case of the Metropolis algorithm
> Our goal in Bayesian inference is to get an accurate representation of the posterior distribution. One way to do that is to sample a large number of representative points from the posterior. The question then becomes this: How can we sample a large number of representative values from a distribution? (p. 146).
The answer, my friends, is MCMC.
### A politician stumbles upon the Metropolis algorithm.
I’m not going to walk out Kruschke’s politician example in any detail, here. But if we denote $P_\text{proposed}$ as the population of the proposed island and $P_\text{current}$ as the population of the current island, then
$$p_\text{move} = \frac{P_\text{proposed}}{P_\text{current}}.$$
"What's amazing about this heuristic is that it works: In the long run, the probability that the politician is on any one of the islands exactly matches the relative population of the island" (p. 147)!
### A random walk.
The code below will allow us to reproduce Kruschke's random walk. To give credit where it's due, this is a mild amendment to the code from Chapter 8 of McElreath's [-@mcelreathStatisticalRethinkingBayesian2015] text, [*Statistical rethinking: A Bayesian course with examples in R and Stan*](https://xcelab.net/rm/statistical-rethinking/).
```{r random_walk, cache = T}
set.seed(7)
num_days <- 5e4
positions <- rep(0, num_days)
current <- 4
for (i in 1:num_days) {
# record current position
positions[i] <- current
# flip coin to generate proposal
proposal <- current + sample(c(-1, 1), size = 1)
# now make sure he loops around from 7 back to 1
if (proposal < 1) proposal <- 7
if (proposal > 7) proposal <- 1
# move?
prob_accept_the_proposal <- proposal/current
current <- ifelse(runif(1) < prob_accept_the_proposal, proposal, current)
}
```
If you missed it, `positions` is the main product of our simulation. Here we'll put `positions` in a tibble and reproduce the top portion of Figure 7.2.
```{r, fig.width = 6, fig.height = 2}
tibble(theta = positions) %>%
ggplot(aes(x = theta)) +
geom_bar(fill = "steelblue") +
scale_x_continuous(expression(theta), breaks = 1:7) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
theme_cowplot()
```
Did you notice that `scale_y_continuous()` line in the code? Claus Wilke, the author of the **cowplot** package, has a lot of thoughts on data visualization. He even wrote a [-@wilkeFundamentalsDataVisualization2019] book on it: [*Fundamentals of data visualization*](https://clauswilke.com/dataviz/). In his [-@Wilke2020Themes] [*Themes*](https://wilkelab.org/cowplot/articles/themes.html) vignette, Wilke recommended against allowing for space between the bottoms of the bars in a bar plot and the $x$-axis line. The **ggplot2** default is to allow for such a space. Here we followed Wilke and suppressed that space with `expand = expansion(mult = c(0, 0.05))`. You can learn more about the `ggplot2::expansion()` function [here](https://ggplot2.tidyverse.org/reference/expansion.html).
Here's the middle portion of Figure 7.2.
```{r, fig.width = 6, fig.height = 2.5}
tibble(t = 1:5e4,
theta = positions) %>%
slice(1:500) %>%
ggplot(aes(x = theta, y = t)) +
geom_path(size = 1/4, color = "steelblue") +
geom_point(size = 1/2, alpha = 1/2, color = "steelblue") +
scale_x_continuous(expression(theta), breaks = 1:7) +
scale_y_log10("Time Step", breaks = c(1, 2, 5, 20, 100, 500)) +
theme_cowplot()
```
And now we make the bottom.
```{r, fig.width = 6, fig.height = 2}
tibble(x = 1:7,
y = 1:7) %>%
ggplot(aes(x = x, y = y)) +
geom_col(width = .2, fill = "steelblue") +
scale_x_continuous(expression(theta), breaks = 1:7) +
scale_y_continuous(expression(p(theta)), expand = expansion(mult = c(0, 0.05))) +
theme_cowplot()
```
> Notice that the sampled relative frequencies closely mimic the actual relative populations in the bottom panel! In fact, a sequence generated this way will converge, as the sequence gets longer, to an arbitrarily close approximation of the actual relative probabilities. (p. 149)
### General properties of a random walk.
> The tajectory shown in Figure 7.2 is just one possible sequence of positions when the movement heuristic is applied. At each time step, the direction of the proposed move is random, and if the relative probability of the proposed position is less than that of the current position, then acceptance of the proposed move is also random. Because of the randomness, if the process were started over again, then the specific trajectory would almost certainly be different. Regardless of the specific trajectory, in the long run the relative frequency of visits mimics the target distribution.
>
> Figure 7.3 shows the probability of being in each position as a function of time. (p. 149)
I was initially stumped on how to reproduce the simulation depicted in Figure 7.3. However, fellow enthusiast [Cardy Moten III](https://github.com/cmoten) kindly [shared a solution](https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse/issues/14) which was itself based on Kruschke's blog post from 2012, [*Metropolis algorithm: Discrete position probabilities*](https://doingbayesiandataanalysis.blogspot.com/2012/08/metropolis-algorithm-discrete-position.html). Here's a mild reworking of their solutions. First, we simulate.
```{r}
nslots <- 7
p_target <- 1:7
p_target <- p_target / sum(p_target)
# construct the transition matrix
proposal_matrix <- matrix(0, nrow = nslots, ncol = nslots)
for(from_idx in 1:nslots) {
for(to_idx in 1:nslots) {
if(to_idx == from_idx - 1) {proposal_matrix[from_idx, to_idx] <- 0.5}
if(to_idx == from_idx + 1) {proposal_matrix[from_idx, to_idx] <- 0.5}
}
}
# construct the acceptance matrix
acceptance_matrix <- matrix(0, nrow = nslots, ncol = nslots)
for(from_idx in 1:nslots) {
for(to_idx in 1:nslots) {
acceptance_matrix[from_idx, to_idx] <- min(p_target[to_idx] / p_target[from_idx], 1)
}
}
# compute the matrix of move probabilities
move_matrix <- proposal_matrix * acceptance_matrix
# compute the transition matrix, including the probability of staying in place
transition_matrix <- move_matrix
for (diag_idx in 1:nslots) {
transition_matrix[diag_idx, diag_idx] = 1.0 - sum(move_matrix[diag_idx, ])
}
# specify starting position vector:
position_vec <- rep(0, nslots)
position_vec[round(nslots / 2)] <- 1.0
p <- list()
data <-
tibble(position = 1:nslots,
prob = position_vec)
# loop through the requisite time indexes
# update the data and transition vector
for(time_idx in 1:99) {
p[[time_idx]] <- data
# update the position vec
position_vec <- position_vec %*% transition_matrix
# update the data
data <- NULL
data <-
tibble(position = 1:nslots,
prob = t(position_vec))
}
```
Now we wrangle and plot.
```{r, fig.width = 10, fig.height = 8}
p %>%
as_tibble_col() %>%
mutate(facet = str_c("italic(t)==", 1:99)) %>%
slice(c(1:14, 99)) %>%
unnest(value) %>%
bind_rows(
tibble(position = 1:nslots,
prob = p_target,
facet = "target")
) %>%
mutate(facet = factor(facet, levels = c(str_c("italic(t)==", c(1:14, 99)), "target"))) %>%
# plot!
ggplot(aes(x = position, y = prob, fill = facet == "target")) +
geom_col(width = .2) +
scale_fill_manual(values = c("steelblue", "goldenrod2"), breaks = NULL) +
scale_x_continuous(expression(theta), breaks = 1:7) +
scale_y_continuous(expression(italic(p)(theta)), expand = expansion(mult = c(0, 0.05))) +
theme_cowplot() +
facet_wrap(~ facet, scales = "free_y", labeller = label_parsed)
```
### Why we care.
Through the simple magic of the random walk procedure,
> we are able to do *indirectly* something we could not necessarily do directly: We can generate random samples from the target distribution. Moreover, we can generate those random samples from the target distribution even when the target distribution is not normalized.
>
> This technique is profoundly useful when the target distribution $P(\theta)$ is a posterior proportional to $p(D | \theta) p(\theta)$. Merely by evaluating $p(D | \theta) p(\theta)$, without normalizing it by $p(D)$, we can generate random representative values from the posterior distribution. This result is wonderful because the method obviates direct computation of the evidence $p(D)$, which, as you'll recall, is one of the most difficult aspects of Bayesian inference. By using MCMC techniques, we can do Bayesian inference in rich and complex models. It has only been with the development of MCMC algorithms and software that Bayesian inference is applicable to complex data analysis, and it has only been with the production of fast and cheap computer hardware that Bayesian inference is accessible to a wide audience. (p. 152, *emphasis* in the original)
## The Metropolis algorithm more generally
"The procedure described in the previous section was just a special case of a more general procedure known as the Metropolis algorithm, named after the first author of a famous article [@metropolisEquationStateCalculations1953]" (p. 156).
Here's how to generate a proposed jump from a zero-mean normal distribution with a standard deviation of 0.2.
```{r}
rnorm(1, mean = 0, sd = 0.2)
```
To get a sense of what draws from `rnorm()` looks like in the long run, we might plot.
```{r, fig.width = 5, fig.height = 3}
mu <- 0
sigma <- 0.2
# how many proposals would you like?
n <- 500
set.seed(7)
tibble(proposed_jump = rnorm(n, mean = mu, sd = sigma)) %>%
ggplot(aes(x = proposed_jump, y = 0)) +
geom_jitter(width = 0, height = .1,
size = 1/2, alpha = 1/2, color = "steelblue") +
# this is the idealized distribution
stat_function(fun = dnorm, args = list(mean = mu, sd = sigma),
color = "steelblue") +
scale_x_continuous(breaks = seq(from = -0.6, to = 0.6, length.out = 7)) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Jump proposals",
subtitle = "The blue line shows the data generating distribution.") +
theme_cowplot()
```
Anyway,
> having generated a proposed new position, the algorithm then decides whether or not to accept the proposal. The decision rule is exactly what was already specified in Equation 7.1. In detail, this is accomplished by computing the ratio $p_\text{move} = P(\theta_\text{proposed}) / P(\theta_\text{current})$. Then a random number from the uniform interval $[0, 1]$ is generated; in R, this can be accomplished with the command `runif(1)`. If the random number is between $0$ and pmove, then the move is accepted. (p. 157)
We'll see what that might look like in the next section. In the meantime, here's how to use `runif()`.
```{r}
runif(1)
```
Just for kicks, here's what that looks like in bulk.
```{r, fig.width = 5, fig.height = 2}
# how many proposals would you like?
n <- 500
set.seed(7)
tibble(draw = runif(n)) %>%
ggplot(aes(x = draw, y = 0)) +
geom_jitter(width = 0, height = 1/4,
size = 1/2, alpha = 1/2, color = "steelblue") +
stat_function(fun = dunif,
color = "steelblue") +
scale_y_continuous(NULL, breaks = NULL, limits = c(-1/3, 5/3)) +
labs(title = "Uniform draws",
subtitle = "The blue line shows the data generating distribution.") +
theme_cowplot()
```
We do not see a concentration towards the mean, this time. The draws are uniformly distributed across the parameter space.
### Metropolis algorithm applied to Bernoulli likelihood and beta prior.
You can find Kruschke's code in the `BernMetrop.R` file. I'm going to break it up a little.
```{r}
# specify the data, to be used in the likelihood function.
my_data <- c(rep(0, 6), rep(1, 14))
# define the Bernoulli likelihood function, p(D|theta).
# the argument theta could be a vector, not just a scalar
likelihood <- function(theta, data) {
z <- sum(data)
n <- length(data)
p_data_given_theta <- theta^z * (1 - theta)^(n - z)
# the theta values passed into this function are generated at random,
# and therefore might be inadvertently greater than 1 or less than 0.
# the likelihood for theta > 1 or for theta < 0 is zero
p_data_given_theta[theta > 1 | theta < 0] <- 0
return(p_data_given_theta)
}
# define the prior density function.
prior_d <- function(theta) {
p_theta <- dbeta(theta, 1, 1)
# the theta values passed into this function are generated at random,
# and therefore might be inadvertently greater than 1 or less than 0.
# the prior for theta > 1 or for theta < 0 is zero
p_theta[theta > 1 | theta < 0] = 0
return(p_theta)
}
# define the relative probability of the target distribution,
# as a function of vector theta. for our application, this
# target distribution is the unnormalized posterior distribution
target_rel_prob <- function(theta, data) {
target_rel_prob <- likelihood(theta, data) * prior_d(theta)
return(target_rel_prob)
}
# specify the length of the trajectory, i.e., the number of jumps to try:
traj_length <- 50000 # this is just an arbitrary large number
# initialize the vector that will store the results
trajectory <- rep(0, traj_length)
# specify where to start the trajectory:
trajectory[1] <- 0.01 # another arbitrary value
# specify the burn-in period
burn_in <- ceiling(0.0 * traj_length) # arbitrary number, less than `traj_length`
# initialize accepted, rejected counters, just to monitor performance:
n_accepted <- 0
n_rejected <- 0
```
That first part follows what Kruschke put in his script. I'm going to bundel the next large potion in a fucntion, `my_metropolis()` which will make it easier to plug the code into the `purrr::map()` function.
```{r}
my_metropolis <- function(proposal_sd) {
# now generate the random walk. the 't' index is time or trial in the walk.
# specify seed to reproduce same random walk
set.seed(47405)
## I'm taking this section out and will replace it
# # specify standard deviation of proposal distribution
# proposal_sd <- c(0.02, 0.2, 2.0)[2]
## end of the section I took out
for (t in 1:(traj_length - 1)) {
current_position <- trajectory[t]
# use the proposal distribution to generate a proposed jump
proposed_jump <- rnorm(1, mean = 0, sd = proposal_sd)
# compute the probability of accepting the proposed jump
prob_accept <- min(1,
target_rel_prob(current_position + proposed_jump, my_data)
/ target_rel_prob(current_position, my_data))
# generate a random uniform value from the interval [0, 1] to
# decide whether or not to accept the proposed jump
if (runif(1) < prob_accept) {
# accept the proposed jump
trajectory[t + 1] <- current_position + proposed_jump
# increment the accepted counter, just to monitor performance
if (t > burn_in) {n_accepted <- n_accepted + 1}
} else {
# reject the proposed jump, stay at current position
trajectory[t + 1] <- current_position
# increment the rejected counter, just to monitor performance
if (t > burn_in) {n_rejected <- n_rejected + 1}
}
}
# extract the post-burn_in portion of the trajectory
accepted_traj <- trajectory[(burn_in + 1) : length(trajectory)]
tibble(accepted_traj = accepted_traj,
n_accepted = n_accepted,
n_rejected = n_rejected)
# end of Metropolis algorithm
}
```
Now we have `my_metropolis()`, we can run the analysis based on the three `proposal_sd` values, nesting the results in a tibble.
```{r metropolis_sim, cache = T}
d <-
tibble(proposal_sd = c(0.02, 0.2, 2.0)) %>%
mutate(accepted_traj = map(proposal_sd, my_metropolis)) %>%
unnest(accepted_traj)
glimpse(d)
```
Now we have `d` in hand, here's the top portion of Figure 7.4.
```{r, fig.width = 10, fig.height = 2.75}
d <-
d %>%
mutate(proposal_sd = str_c("Proposal SD = ", proposal_sd),
iter = rep(1:50000, times = 3))
d %>%
ggplot(aes(x = accepted_traj, y = 0)) +
stat_histinterval(point_interval = mode_hdi, .width = .95,
fill = "steelblue", slab_color = "white", outline_bars = T,
breaks = 40, normalize = "panels") +
scale_x_continuous(expression(theta), breaks = 0:5 * 0.2) +
scale_y_continuous(NULL, breaks = NULL) +
theme_cowplot() +
panel_border() +
facet_wrap(~ proposal_sd, ncol = 3)
```
The modes are the points and the lines depict the 95% HDIs. Also, did you notice our use of the `cowplot::panel_border()` function? The settings from `theme_cowplot()` can make it difficult to differentiate among subplots when faceting. By throwing in a call to `panel_border()` after `theme_cowplot()`, we added in lightweight panel borders.
Here's the middle of Figure 7.4.
```{r, fig.width = 10, fig.height = 2.75, warning = F}
d %>%
ggplot(aes(x = accepted_traj, y = iter)) +
geom_path(size = 1/4, color = "steelblue") +
geom_point(size = 1/2, alpha = 1/2, color = "steelblue") +
scale_x_continuous(expression(theta), breaks = 0:5 * 0.2, limits = c(0, 1)) +
scale_y_continuous("Step in Chain", limits = c(49900, 50000)) +
ggtitle("End of Chain") +
theme_cowplot() +
panel_border() +
facet_wrap(~ proposal_sd, ncol = 3)
```
The bottom:
```{r, fig.width = 10, fig.height = 2.75, warning = F}
d %>%
ggplot(aes(x = accepted_traj, y = iter)) +
geom_path(size = 1/4, color = "steelblue") +
geom_point(size = 1/2, alpha = 1/2, color = "steelblue") +
scale_x_continuous(expression(theta), breaks = 0:5 * 0.2, limits = c(0, 1)) +
scale_y_continuous("Step in Chain", limits = c(1, 100)) +
ggtitle("End of Chain") +
theme_cowplot() +
panel_border() +
facet_wrap(~ proposal_sd, ncol = 3)
```
> Regardless of the which proposal distribution in Figure 7.4 is used, the Metropolis algorithm will eventually produce an accurate representation of the posterior distribution, as is suggested by the histograms in the upper row of Figure 7.4. What differs is the efficiency of achieving a good approximation. (p. 160)
Before we move on, you may have noticed the top row of Kruschke's Figure 7.4 contains the estimates for the effective sample size (Eff.Sz.) for each chain, which Kruschke briefly mentioned on page 160. We'll walk out effective sample sizes, later. But for now we can compute them with some of the helper functions from the **posterior** package [@R-posterior]. As it turns out, there are several kinds of effective sample size. With regards to the helper functions from **posterior**, I suspect the one closest to what Kruschke used in the text will be `ess_mean()`. Here's how you might use `ess_mean()` to compute the Eff.Sz. estimate for each of our three chains.
```{r}
d %>%
select(proposal_sd, accepted_traj) %>%
group_by(proposal_sd) %>%
summarise(Eff.Sz. = posterior::ess_mean(accepted_traj))
```
### Summary of Metropolis algorithm.
> The motivation for methods like the Metropolis algorithm is that they provide a high-resolution picture of the posterior distribution, even though in complex models we cannot explicitly solve the mathematical integral in Bayes' rule. The idea is that we get a handle on the posterior distribution by generating a large sample of representative values. The larger the sample, the more accurate is our approximation. As emphasized previously, this is a sample of representative credible parameter values from the posterior distribution; it is not a resampling of data (there is a fixed data set).
>
> The cleverness of the method is that representative parameter values can be randomly sampled from complicated posterior distributions without solving the integral in Bayes' rule, and by using only simple proposal distributions for which efficient random number generators already exist. (p. 161)
## Toward Gibbs sampling: Estimating two coin biases
"The Metropolis method is very useful, but it can be inefficient. Other methods can be more efficient in some situations" (p. 162).
### Prior, likelihood and posterior for two biases.
> We are considering situations in which there are two underlying biases, namely $\theta_1$ and $\theta_2$, for the two coins. We are trying to determine what we should believe about these biases after we have observed some data from the two coins. Recall that [Kruschke used] the term "bias" as the name of the parameter $\theta$, and not to indicate that the value of $\theta$ deviates from $0.5$....
>
> What we have to do next is specify a particular mathematical form for the prior distribution. We will work through the mathematics of a particular case for two reasons: First, it will allow us to explore graphical displays of two-dimensional parameter spaces, which will inform our intuitions about Bayes' rule and sampling from the posterior distribution. Second, the mathematics will set the stage for a specific example of Gibbs sampling. Later in the book when we do applied Bayesian analysis, we will *not* be doing any of this sort of mathematics. We are doing the math now, for simple cases, to understand how the methods work so we can properly interpret their outputs in realistically complex cases. (pp. 163--165, *emphasis* in the original)
### The posterior via exact formal analysis.
The plots in the left column of Figure 7.5 are outside of my skill set. I believe they are referred to as wireframe plots and it's my understanding that **ggplot2** does not support wireframe plots at this time. However, I can reproduce versions of the right hand column. For our initial attempt for the upper right corner, we'll simulate.
```{r betas, cache = T, fig.height = 3}
set.seed(7)
betas <-
tibble(theta_1 = rbeta(1e5, shape1 = 2, shape2 = 2),
theta_2 = rbeta(1e5, shape1 = 2, shape2 = 2))
betas %>%
ggplot(aes(x = theta_1, y = theta_2)) +
stat_density_2d() +
labs(x = expression(theta[1]),
y = expression(theta[2])) +
coord_equal() +
theme_cowplot()
```
Instead of the contour lines, one might use color to depict the density variable.
```{r, fig.height = 3}
betas %>%
ggplot(aes(x = theta_1, y = theta_2, fill = stat(density))) +
stat_density_2d(geom = "raster", contour = F) +
scale_fill_viridis_c(option = "A") +
labs(x = expression(theta[1]),
y = expression(theta[2])) +
coord_equal() +
theme_cowplot()
```
Remember how we talked about suppressing the unsightly white space between the bottom of bar-plot bars and the $x$-axis? Well, look at all that unsightly white space between the axes and the boundaries of the parameter space in our bivariate Beta plot. We can further flex our `expansion()` skills to get rid of those in the next plot. Speaking of which, we might make a more precise version of that plot with the `dbeta()` function. This approach is also more in line with the title of this subsection: *The posterior via exact formal analysis*.
```{r, fig.height = 3}
theta_sequence <- seq(from = 0, to = 1, by = .01)
tibble(theta_1 = theta_sequence,
theta_2 = theta_sequence) %>%
mutate(prior_1 = dbeta(x = theta_1, shape1 = 2, shape2 = 2),
prior_2 = dbeta(x = theta_2, shape1 = 2, shape2 = 2)) %>%
expand(nesting(theta_1, prior_1), nesting(theta_2, prior_2)) %>%
ggplot(aes(x = theta_1, y = theta_2, fill = prior_1 * prior_2)) +
geom_tile() +
scale_fill_viridis_c("joint prior density", option = "A") +
scale_x_continuous(expression(theta[1]), expand = expansion(mult = 0)) +
scale_y_continuous(expression(theta[2]), expand = expansion(mult = 0)) +
coord_equal() +
theme_cowplot()
```
Look at that--no more unsightly white space! We'll need the `bernoulli_likelihood()` function from back in Chapter 6 for the middle right of Figure 7.5.
```{r}
bernoulli_likelihood <- function(theta, data) {
# theta = success probability parameter ranging from 0 to 1
# data = the vector of data (i.e., a series of 0's and 1's)
n <- length(data)
z <- sum(data)
return(theta^z * (1 - theta)^(n - sum(data)))
}
```
With our trusty `bernoulli_likelihood()` function in hand, we're almost ready to compute and plot the likelihood. We just need to define our data.
```{r}
# set the parameters
# coin 1
n1 <- 8
z1 <- 6
# coin 2
n2 <- 7
z2 <- 2
# use the parameters to make the data
theta_1_data <- rep(0:1, times = c(n1 - z1, z1))
theta_2_data <- rep(0:1, times = c(n2 - z2, z2))
# take a look
theta_1_data
theta_2_data
```
Note how these data sequences are of different sample sizes $(N_1 = 8; N_2 = 7 )$. Though it doesn't matter much for the formal analysis approach, this will become very important when we fit the model with **brms**. But for right now, we're finally ready to make a version of the middle right panel of Figure 7.5.
```{r, fig.width = 5.5, fig.height = 3}
tibble(theta_1 = theta_sequence,
theta_2 = theta_sequence) %>%
mutate(likelihood_1 = bernoulli_likelihood(theta = theta_sequence,
data = theta_1_data),
likelihood_2 = bernoulli_likelihood(theta = theta_sequence,
data = theta_2_data)) %>%
expand(nesting(theta_1, likelihood_1), nesting(theta_2, likelihood_2)) %>%
ggplot(aes(x = theta_1, y = theta_2, fill = likelihood_1 * likelihood_2)) +
geom_tile() +
scale_fill_viridis_c("joint likelihood", option = "A") +
scale_x_continuous(expression(theta[1]), expand = expansion(mult = 0)) +
scale_y_continuous(expression(theta[2]), expand = expansion(mult = 0)) +
coord_equal() +
theme_cowplot()
```
Here's how to make the two-dimensional posterior in the lower right panel of Figure 7.5.
```{r, fig.width = 4.5, fig.height = 3}
# this is a redo from two plots up, but saved as `d_prior`
d_prior <-
tibble(theta_1 = theta_sequence,
theta_2 = theta_sequence) %>%
mutate(prior_1 = dbeta(x = theta_1, shape1 = 2, shape2 = 2),
prior_2 = dbeta(x = theta_2, shape1 = 2, shape2 = 2)) %>%
expand(nesting(theta_1, prior_1), nesting(theta_2, prior_2))
# this is a redo from one plot up, but saved as `d_likelihood`
d_likelihood <-
tibble(theta_1 = theta_sequence,
theta_2 = theta_sequence) %>%
mutate(likelihood_1 = bernoulli_likelihood(theta = theta_sequence,
data = theta_1_data),
likelihood_2 = bernoulli_likelihood(theta = theta_sequence,
data = theta_2_data)) %>%
expand(nesting(theta_1, likelihood_1), nesting(theta_2, likelihood_2))
# here we combine `d_prior` and `d_likelihood`
d_prior %>%
left_join(d_likelihood, by = c("theta_1", "theta_2")) %>%
# we need the marginal likelihood, the denominator in Bayes' rule
mutate(marginal_likelihood = sum(prior_1 * prior_2 * likelihood_1 * likelihood_2)) %>%
# finally, the two-dimensional posterior
mutate(posterior = (prior_1 * prior_2 * likelihood_1 * likelihood_2) / marginal_likelihood) %>%
# plot!
ggplot(aes(x = theta_1, y = theta_2, fill = posterior)) +
geom_tile() +
scale_fill_viridis_c(expression(italic(p)(theta[1]*', '*theta[2]*'|'*D)), option = "A") +
scale_x_continuous(expression(theta[1]), expand = expansion(mult = 0)) +
scale_y_continuous(expression(theta[2]), expand = expansion(mult = 0)) +
coord_equal() +
theme_cowplot()
```
That last plot, my friends, is a depiction of
$$p(\theta_1, \theta_2 | D) = \frac{p(D | \theta_1, \theta_2) p(\theta_1, \theta_2)}{p(D)}.$$
### The posterior via the Metropolis algorithm.
I initially skipped over this section because the purpose of this book is to explore Kruschke's material with **brms**, which does not use the Metropolis algorithm (which really is primarily of historic interest, at this point). However, fellow enthusiast [Omid Ghasemi](https://github.com/OmidGhasemi21) worked it through and kindly shared his solution. The workflow, below, is based heavily on his, with a few small adjustments.
To start off, we'll refresh our two data sources and define a few custom functions.
```{r}
# we've already defined these, but here they are again
theta_1_data <- rep(0:1, times = c(n1 - z1, z1))
theta_2_data <- rep(0:1, times = c(n2 - z2, z2))
# define the bivariate Bernoulli likelihood
bivariate_bernoulli_likelihood <- function(theta1, data1, theta2, data2) {
z1 <- sum(data1)
n1 <- length(data1)
z2 <- sum(data2)
n2 <- length(data2)
p_data_given_theta <- (theta1^z1 * (1 - theta1)^(n1 - z1)) * (theta2^z2 * (1 - theta2)^(n2 - z2))
p_data_given_theta[theta1 > 1 | theta1 < 0] <- 0
p_data_given_theta[theta2 > 1 | theta2 < 0] <- 0
return(p_data_given_theta)
}
# we need to update the prior density function from above
prior_d <- function(theta1, theta2) {
p_theta <- dbeta(theta1, 1, 1) * dbeta(theta2, 1, 1)
p_theta[theta1 > 1 | theta1 < 0] = 0
p_theta[theta2 > 1 | theta2 < 0] = 0
return(p_theta)
}
# we also need to update how we define the relative probability of the target distribution
target_rel_prob <- function(theta1, data1, theta2, data2) {
l <- bivariate_bernoulli_likelihood(theta1, data1, theta2, data2)
p <- prior_d(theta1, theta2)
target_rel_prob <- l * p
return(target_rel_prob)
}
```
The next bit defines how we'll apply the Metropolis algorithm to our bivariate Bernoulli data. Although the guts contain a lot of moving parts, there are only two parameters at the top level. The `traj_length` argument is set to 50,000, which will be our default number of MCMC draws. Of greater interest is the `proposal_sd` argument. From the text, we read:
> Recall that the Metropolis algorithm is a random walk through the parameter space that starts at some arbitrary point. We propose a jump to a new point in parameter space, with the proposed jump randomly generated from a proposal distribution from which it is easy to generate values. For our present purposes, the proposal distribution is a *bivariate* normal. (p. 168, *emphasis* in the original)
For this exercise, the bivariate normal proposal distribution is centered at zero with an adjustable standard deviation. In the text, Kruschke compared the results for $\operatorname{Normal}(0, 0.02)$ and $\operatorname{Normal}(0, 0.2)$. For our `my_bivariate_metropolis()` function, the `proposal_sd` argument controls that $\sigma$ parameter.
```{r}
my_bivariate_metropolis <- function(proposal_sd = 0.02,
# specify the length of the trajectory (i.e., the number of jumps to try)
traj_length = 50000) {
# initialize the vector that will store the results
trajectory1 <- rep(0, traj_length)
trajectory2 <- rep(0, traj_length)
# specify where to start the trajectory:
trajectory1[1] <- 0.5 # another arbitrary value
trajectory2[1] <- 0.5 # another arbitrary value
# specify the burn-in period
burn_in <- ceiling(0.0 * traj_length) # arbitrary number, less than `traj_length`
# initialize accepted, rejected counters, just to monitor performance:
n_accepted <- 0
n_rejected <- 0
for (t in 1:(traj_length - 1)) {
current_position1 <- trajectory1[t]
current_position2 <- trajectory2[t]
# use the proposal distribution to generate a proposed jump
proposed_jump1 <- rnorm(1, mean = 0, sd = proposal_sd)
proposed_jump2 <- rnorm(1, mean = 0, sd = proposal_sd)
# compute the probability of accepting the proposed jump
prob_accept <- min(1,
target_rel_prob(current_position1 + proposed_jump1, theta_1_data,
current_position2 + proposed_jump2, theta_2_data)
/ target_rel_prob(current_position1, theta_1_data, current_position2, theta_2_data))
# generate a random uniform value from the interval [0, 1] to
# decide whether or not to accept the proposed jump
if (runif(1) < prob_accept) {
# accept the proposed jump
trajectory1[t + 1] <- current_position1 + proposed_jump1
trajectory2[t + 1] <- current_position2 + proposed_jump2
# increment the accepted counter, just to monitor performance
if (t > burn_in) {n_accepted <- n_accepted + 1}
} else {
# reject the proposed jump, stay at current position
trajectory1[t + 1] <- current_position1
trajectory2[t + 1] <- current_position2
# increment the rejected counter, just to monitor performance
if (t > burn_in) {n_rejected <- n_rejected + 1}
}
}
# extract the post-burn_in portion of the trajectory
accepted_traj1 <- trajectory1[(burn_in + 1) : length(trajectory1)]
accepted_traj2 <- trajectory2[(burn_in + 1) : length(trajectory2)]
# collect the results
metrop_2d_data <-
tibble(iter = rep(1:traj_length),
accepted_traj1 = accepted_traj1,
accepted_traj2 = accepted_traj2,
n_accepted = n_accepted,
n_rejected = n_rejected)
return(metrop_2d_data)
}
```
Now we've defined `my_bivariate_metropolis()` let's apply it to our data with `proposal_sd == 0.02` and `proposal_sd == 0.2`. We'll save the results as `mh`.
```{r}
mh <-
tibble(proposal_sd = c(0.02, 0.2)) %>%
mutate(mh = map(proposal_sd, my_bivariate_metropolis)) %>%
unnest(mh)
mh
```
If you look at the top of Figure 7.6, you'll see Kruschke summarized his results with the acceptance rate, $N_\text{acc} / N_\text{pro}$. Here are ours.
```{r}
mh %>%
group_by(proposal_sd) %>%
slice(1) %>%
summarise(acceptance_rate = n_accepted / (n_accepted + n_rejected))
```
We can compute our effective sample sizes using the `effectiveSize()` function from the [**coda** package](https://CRAN.R-project.org/package=coda) [@R-coda; @plummerCODA2006].
```{r, warning = F, message = F}
library(coda)
mh %>%
group_by(proposal_sd) %>%
summarise(ess_theta_1 = effectiveSize(accepted_traj1),
ess_theta_2 = effectiveSize(accepted_traj2))
```
We really won't use the **coda** package in this ebook beyond this chapter and the next. But do note it has a lot to offer and Kruschke used it a bit in his code.
Recall we might also compute the effective sample size estimates with functions from **posterior**, like at the end of [Section 7.3.1][Metropolis algorithm applied to Bernoulli likelihood and beta prior.]. Here are the results using `ess_mean()` and `ess_bulk()`.
```{r}
# posterior::ess_mean()
mh %>%
group_by(proposal_sd) %>%
summarise(ess_theta_1 = posterior::ess_mean(accepted_traj1),
ess_theta_2 = posterior::ess_mean(accepted_traj2))
# posterior::ess_bulk()
mh %>%
group_by(proposal_sd) %>%
summarise(ess_theta_1 = posterior::ess_bulk(accepted_traj1),
ess_theta_2 = posterior::ess_bulk(accepted_traj2))
```
Different functions from different packages yield different results. Exciting, eh? Welcome to applied statistics, friends. Anyway, now we make our version of Figure 7.6.
```{r, fig.height = 5, warning = F}
mh %>%
filter(iter < 1000) %>%
ggplot(aes(x = accepted_traj1, y = accepted_traj2)) +
geom_path(size = 1/8, alpha = 1/2, color = "steelblue") +
geom_point(alpha = 1/4, color = "steelblue") +
scale_x_continuous(expression(theta[1]), breaks = 0:5 / 5, expand = c(0, 0), limits = c(0, 1)) +
scale_y_continuous(expression(theta[2]), breaks = 0:5 / 5, expand = c(0, 0), limits = c(0, 1)) +
coord_equal() +
theme_cowplot() +
panel_border() +
theme(panel.spacing.x = unit(0.75, "cm")) +
facet_wrap(~ proposal_sd, labeller = label_both)
```
> In the limit of infinite random walks, the Metropolis algorithm yields arbitrarily accurate representations of the underlying posterior distribution. The left and right panels of Figure 7.6 would eventually converge to an identical and highly accurate approximation to the posterior distribution. But in the real world of finite random walks, we care about how efficiently the algorithm generates an accurate representative sample. We prefer to use the proposal distribution from the right panel of Figure 7.6 because it will, typically, produce a more accurate approximation of the posterior than the proposal distribution from left panel, for the same number of proposed jumps. (p. 170)
### ~~Gibbs~~ Hamiltonian Monte Carlo sampling.
Figure 7.7 is still out of my skill set. But let's fit the model with our primary package, **brms**. First we need to load **brms**.
```{r, warning = F, message = F}
library(brms)
```
These, recall, are the data.
```{r}
theta_1_data
theta_2_data
```
Kruschke said he was starting us out simply. From a regression perspective, we are getting ready to fit an intercepts-only multivariate Bernoulli model, which isn't the simplest of things to code into **brms**. Plus, this particular pair of data sets presents a complication we won't usually have to contend with in this book: The data vectors are different lengths. Remember how we pointed that out in [Section 7.4.2][The posterior via exact formal analysis.]? The issue is that whereas **brms** has extensive multivariate capacities [@Bürkner2022Multivariate], they're usually designed for data with equal sample sizes (i.e., when the rows in the two columns of a data frame are of the same number). Since these are Bernoulli data, we have two options at our disposal:
* employ the `resp_subset()` helper function or
* fit an aggregated binomial[^2] model.
Since each has its strengths and weaknesses, we'll split this section up and fit the model both ways.
#### Uneven multivariate Benoulli via the `resp_subset()` approach.
Though **brms** can receive data from a few different formats, our approach throughout this ebook will usually be with data frames or tibbles. Here's how we might combine our two data vectors, `theta_1_data` and `theta_2_data`, into a single tibble called `d`.
```{r}
d <-
tibble(y1 = theta_1_data,
y2 = c(theta_2_data, NA))
# what is this?
d
```
Because the second data vector was one unit shorter than the first, we had to compensate by adding an eighth cell, which we coded as `NA`, the universal indicator for missing values within the **R** ecosystem.
We'll still need one more data column, though. Using the `if_else()` function, we will make a `subset` column with will be coded `TRUE` for all non-`NA` values in the `y2` columns, and `FALSE` whenever `is.na(y2)`.
```{r}
d <-
d %>%
mutate(subset = if_else(is.na(y2), FALSE, TRUE))
# what is this?
d
```