-
Notifications
You must be signed in to change notification settings - Fork 43
/
Copy path05.Rmd
323 lines (236 loc) · 16.7 KB
/
05.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
```{r, echo = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
```
# Bayes' Rule
"Bayes' rule is merely the mathematical relation between the prior allocation of credibility and the posterior reallocation of credibility conditional on data" [@kruschkeDoingBayesianData2015, pp. 99--100].
## Bayes' rule
> Thomas Bayes (1702–1761) was a mathematician and Presbyterian minister in England. His famous theorem was published posthumously in 1763, thanks to the extensive editorial efforts of his friend, Richard Price [@bayesLIIEssaySolving1763]. The simple rule has vast ramifications for statistical inference, and therefore as long as his name is attached to the rule, we'll continue to see his name in textbooks. But Bayes himself probably was not fully aware of these ramifications, and many historians argue that it is Bayes' successor, Pierre-Simon Laplace (1749–1827), whose name should really label this type of analysis, because it was Laplace who independently rediscovered and extensively developed the methods [e.g., @daleHistoryInverseProbability2012; @mcgrayneTheoryThatWould2011]. (p. 100)
I do recommend checking out McGrayne's book It's an easy and entertaining read. For a sneak preview, why not [listen to her](https://www.youtube.com/watch?v=8oD6eBkjF9o) discuss the main themes she covered in the book?
### Derived from definitions of conditional probability.
With Equations 5.5 and 5.6, Kruschke gave us Bayes' rule in terms of $c$ and $r$. Equation 5.5 was
$$p(c|r) = \frac{p(r|c) \; p(c)}{p(r)}.$$
Since $p(r) = \sum_{c^*}p(r|c^*)p(c^*)$, we can re-express that as Equation 5.6:
$$p(c|r) = \frac{p(r|c) \; p(c)}{\sum_{c^*}p(r|c^*) \; p(c^*)},$$
where $c^*$ "in the denominator is a variable that takes on all possible values" of $c$ (p. 101).
## Applied to parameters and data
Here we get those equations re-expressed in the terms data analysts tend to think with, parameters (i.e., $\theta$) and data (i.e., $D$):
\begin{align*}
p(\theta|D) & = \frac{p(D|\theta) \; p(\theta)}{p(D)} \;\; \text{and since} \\
p(D) & = \sum\limits_{\theta^*}p(D|\theta^*) \; p(\theta^*), \;\; \text{it's also the case that} \\
p(\theta|D) & = \frac{p(D|\theta) \; p(\theta)}{\sum\limits_{\theta^*}p(D|\theta^*) \; p(\theta^*)}.
\end{align*}
As in the previous section where we spoke in terms of $r$ and $c$, our updated $\theta^*$ notation is meant to indicate all possible values of $\theta$. For practice, it's worth repeating how Kruschke broke this down with Equation 5.7,
$$
\underbrace{p(\theta|D)}_\text{posterior} \; = \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; / \; \underbrace{p(D)}_\text{evidence}.
$$
> The "prior," $p(\theta)$, is the credibility of the $\theta$ values without the data $D$. The "posterior," $p(\theta|D)$, is the credibility of $\theta$ values with the data $D$ taken into account. The "likelihood," $p(D|\theta)$, is the probability that the data could be generated by the model with parameter value $\theta$. The "evidence" for the model, $p(D)$, is the overall probability of the data according to the model, determined by averaging across all possible parameter values weighted by the strength of belief in those parameter values. (pp. 106--107)
And don't forget, "evidence" is short for "marginal likelihood," which is the term we'll use in some of our code, below.
## Complete examples: Estimating bias in a coin
As we begin to work with Kruschke's coin example, we should clarify that
> when [Kruschke refered] to the "bias" in a coin, [he] sometimes [referred] to its underlying probability of coming up heads. Thus, *when a coin is fair, it has a "bias" of* $\textit{0.5}$. Other times, [Kruschke used] the term "bias" in its colloquial sense of a *departure from fairness*, as in "head biased" or "tail biased." Although [Kruschke tried] to be clear about which meaning is intended, there will be times that you will have to rely on context to determine whether "bias" means the probability of heads or departure from fairness. (p. 108, *emphasis* in the original)
In this ebook, I will generally avoid Kruschke's idiosyncratic use of the term "bias." Though be warned: it may pop up from time to time.
As the the coin example at hand, here's a way to expres Kruschke's initial prior distribution in a data frame and then make Figure 5.1.a.
```{r, message = F, warning = F, fig.width = 4, fig.height = 2}
library(tidyverse)
# make the data frame for the prior
d <-
tibble(theta = seq(from = 0, to = 1, by = .1),
prior = c(0:5, 4:0) * 0.04)
# plot
d %>%
ggplot(aes(x = theta, y = prior)) +
geom_col(width = .025, color = "grey50", fill = "grey50") +
scale_x_continuous(expression(theta), breaks = seq(from = 0, to = 1, by = .2)) +
labs(title = "Prior",
y = expression(p(theta))) +
theme(panel.grid = element_blank())
```
If you were curious, it is indeed the case that those `prior` values sum to 1.
```{r}
d %>%
summarise(s = sum(prior))
```
On page 109, Kruschke proposed the Bernoulli distribution for coin-tossing data, which he defined in Equation 5.10 as
$$p(y | \theta) = \theta^y (1 - \theta)^{(1 - y)}.$$
We can express it as a function in **R** like this.
```{r}
bernoulli <- function(theta, y) {
return(theta^y * (1 - theta)^(1 - y))
}
```
To get a sense of how it works, consider a single coin flip which comes up. In this example we wanted heads, which means the trial was a success. In the `bernoulli()` function, we'll refer to that single successful trial by setting `y = 1`. We can then compute the likelihood of different values of $\theta$ by inserting those values one at a time or in bulk into the `theta` argument. Here we'll look at a sequene of 11 candidate $\theta$ values, which we'll save in a vector called `theta_sequence`.
```{r}
theta_sequence <- seq(from = 0, to = 1, by = .1)
bernoulli(theta = theta_sequence, y = 1)
```
Those values are the probability of the data (`y = 1`) given different values of theta (`theta = theta_sequence`). In terms of Bayes' rule from the previous section, we call that the likelihood $p(D|\theta)$. Here's how we might compute the likelihood within the context of our `d` data frame.
```{r}
d <-
d %>%
mutate(likelihood = bernoulli(theta = theta, y = 1))
d
```
Now our `d` data contains information about the likelihood, we can use those results to make the middle panel of Figure 5.1.
```{r, fig.width = 4, fig.height = 2}
d %>%
ggplot(aes(x = theta, y = likelihood)) +
geom_col(width = .025, color = "grey50", fill = "grey50") +
scale_x_continuous(expression(theta), breaks = seq(from = 0, to = 1, by = .2)) +
labs(title = "Likelihood",
y = expression(p(D*'|'*theta))) +
theme(panel.grid = element_blank())
```
In order to compute $p(D)$ (i.e., the *evidence* or the *marginal likelihood*), we'll need to multiply our respective prior and likelihood values for each point in our theta sequence and then sum all that up. That sum will be our *marginal likelihood*. Then we can compute the posterior $p(\theta | D)$ by dividing the product of the prior and the likelihood by the marginal likelihood and make Figure 5.1.c.
```{r, message = F, warning = F, fig.width = 4, fig.height = 2}
# compute
d <-
d %>%
mutate(marginal_likelihood = sum(prior * likelihood)) %>%
mutate(posterior = (prior * likelihood) / marginal_likelihood)
# plot
d %>%
ggplot(aes(x = theta, y = posterior)) +
geom_col(width = .025, color = "grey50", fill = "grey50") +
scale_x_continuous(expression(theta), breaks = seq(from = 0, to = 1, by = .2)) +
labs(title = "Posterior",
y = expression(p(theta*'|'*D))) +
theme(panel.grid = element_blank())
```
> The posterior is a compromise between the prior distribution and the likelihood function. Sometimes this is loosely stated as a compromise between the prior and the data. The compromise favors the prior to the extent that the prior distribution is sharply peaked and the data are few. The compromise favors the likelihood function (i.e., the data) to the extent that the prior distribution is flat and the data are many. (p. 112)
### Influence of sample size on the posterior.
Our warmup in the last section was limited in that we computed the posterior based on a single data point `y = 1`. In order to follow along with this section, we're going to have to update our Bernoulli likelihood function so it can accommodate more than a single trial. In anticipation of [Chapter 6][Inferring a Binomial Probability via Exact Mathematical Analysis], we'll call our more general function the `bernoulli_likelihood()`.
```{r}
bernoulli_likelihood <- function(theta, data) {
# `theta` = success probability parameter ranging from 0 to 1
# `data` = the vector of data (i.e., a series of 0s and 1s)
n <- length(data)
return(theta^sum(data) * (1 - theta)^(n - sum(data)))
}
```
Now we can compute the likelihood for a range of theta values given a vector of coin-flip data. To practice, we'll make a vector of $N = 4$ coin flips (three tails and one head) called `small_data`, and compute the likelihood for those data given our `theta_sequence` from above.
```{r}
# define the data
small_data <- rep(0:1, times = c(3, 1))
# compute the likelihood over a range of theta values
bernoulli_likelihood(theta = theta_sequence, data = small_data)
```
To make Figure 5.2, we'll move away from the coarse 11-point `theta_sequence` to a denser 1,001-point sequence of $\theta$ values. Here's the work required to make our version of the left portion of Figure 5.2.
```{r, fig.width = 4, fig.height = 4}
# make the primary data set
d <-
tibble(theta = seq(from = 0, to = 1, by = .001),
Prior = c(seq(from = 0, to = 1, length.out = 501),
seq(from = 0.998, to = 0, length.out = 500))) %>%
mutate(Prior = Prior / sum(Prior),
Likelihood = bernoulli_likelihood(theta = theta, data = small_data)) %>%
mutate(marginal_likelihood = sum(Prior * Likelihood)) %>%
mutate(Posterior = (Prior * Likelihood) / marginal_likelihood)
# wrangle
d %>%
select(theta, Prior, Likelihood, Posterior) %>%
pivot_longer(-theta) %>%
mutate(name = factor(name, levels = c("Prior", "Likelihood", "Posterior"))) %>%
# plot
ggplot(aes(x = theta, y = value)) +
geom_area(fill = "grey67") +
scale_x_continuous(expression(theta), breaks = seq(from = 0, to = 1, by = .2)) +
ylab("probability density") +
theme(panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free_y", ncol = 1)
```
As Kruschke remarked on page 112, the mode of the posterior is at $\theta = .4$.
```{r}
d %>%
filter(Posterior == max(Posterior)) %>%
select(theta, Posterior)
```
We just follow the same overall procedure to make the right portion of Figure 5.2. The only difference is how we switch from `small_data` to the $N = 40$ `large_data`.
```{r, fig.width = 4, fig.height = 4}
large_data <- rep(0:1, times = c(30, 10))
# compute
d <-
d %>%
mutate(Likelihood = bernoulli_likelihood(theta = theta, data = large_data)) %>%
mutate(marginal_likelihood = sum(Prior * Likelihood)) %>%
mutate(Posterior = (Prior * Likelihood) / marginal_likelihood)
# wrangle
d %>%
select(theta, Prior, Likelihood, Posterior) %>%
pivot_longer(-theta) %>%
mutate(name = factor(name, levels = c("Prior", "Likelihood", "Posterior"))) %>%
# plot
ggplot(aes(x = theta, y = value)) +
geom_area(fill = "grey67") +
scale_x_continuous(expression(theta), breaks = seq(from = 0, to = 1, by = .2)) +
ylab("probability density") +
theme(panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free_y", ncol = 1)
```
Now the mode of the posterior is lower at $\theta = .268$.
```{r}
d %>%
filter(Posterior == max(Posterior)) %>%
select(theta, Posterior)
```
With just an $N = 40$, the likelihood already dominated the posterior. But this is also a function of our fairly gentle prior. "In general, the more data we have, the more precise is the estimate of the parameter(s) in the model. Larger sample sizes yield greater precision or certainty of estimation" (p. 113).
### Influence of prior on the posterior.
It's not immediately obvious how Kruschke made his prior distributions for Figure 5.3. However, hidden away in his `BernGridExample.R` file he indicated that to get the distribution for the left side of Figure 5.3, you simply raise the prior from the left of Figure 5.2 to the 0.1 power.
```{r, fig.width = 4, fig.height = 4}
small_data <- rep(0:1, times = c(3, 1))
tibble(theta = seq(from = 0, to = 1, by = .001),
Prior = c(seq(from = 0, to = 1, length.out = 501),
seq(from = 0.998, to = 0, length.out = 500))) %>%
# here's the important line of code
mutate(Prior = Prior^0.1 / sum(Prior^0.1),
Likelihood = bernoulli_likelihood(theta = theta, data = small_data)) %>%
mutate(marginal_likelihood = sum(Prior * Likelihood)) %>%
mutate(Posterior = (Prior * Likelihood) / marginal_likelihood) %>%
select(theta, Prior, Likelihood, Posterior) %>%
pivot_longer(-theta) %>%
mutate(name = factor(name, levels = c("Prior", "Likelihood", "Posterior"))) %>%
ggplot(aes(x = theta, y = value)) +
geom_area(fill = "grey67") +
scale_x_continuous(expression(theta), breaks = seq(from = 0, to = 1, by = .2)) +
ylab("probability density") +
theme(panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free_y", ncol = 1)
```
With a vague flat prior and a small data set, the posterior well tend to look a lot like the likelihood. With the right side of Figure 5.3, we consider a larger data set and a stronger prior. Also, note our tricky `Prior` code.
```{r, fig.width = 4.5, fig.height = 4}
large_data <- rep(0:1, times = c(30, 10))
tibble(theta = seq(from = 0, to = 1, by = .001),
Prior = c(seq(from = 0, to = 1, length.out = 501),
seq(from = 0.998, to = 0, length.out = 500))) %>%
# here's the important line of code
mutate(Prior = (Prior / sum(Prior))^10,
Likelihood = bernoulli_likelihood(theta = theta, data = large_data)) %>%
mutate(marginal_likelihood = sum(Prior * Likelihood)) %>%
mutate(Posterior = (Prior * Likelihood) / marginal_likelihood) %>%
select(theta, Prior, Likelihood, Posterior) %>%
pivot_longer(-theta) %>%
mutate(name = factor(name, levels = c("Prior", "Likelihood", "Posterior"))) %>%
ggplot(aes(x = theta, y = value)) +
geom_area(fill = "grey67") +
scale_x_continuous(expression(theta), breaks = seq(from = 0, to = 1, by = .2)) +
ylab("probability density") +
theme(panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free_y", ncol = 1)
```
> Bayesian inference is intuitively rational: With a strongly informed prior that uses a lot of previous data to put high credibility over a narrow range of parameter values, it takes a lot of novel contrary data to budge beliefs away from the prior. But with a weakly informed prior that spreads credibility over a wide range of parameter values, it takes relatively little data to shift the peak of the posterior distribution toward the data (although the posterior will be relatively wide and uncertain). (p. 114)
## Why Bayesian inference can be difficult
> Determining the posterior distribution directly from Bayes' rule involves computing the evidence (a.k.a. marginal likelihood) in Equations 5.8 and 5.9. In the usual case of continuous parameters, the integral in Equation 5.9 can be impossible to solve analytically. Historically, the difficulty of the integration was addressed by restricting models to relatively simple likelihood functions with corresponding formulas for prior distributions, called *conjugate* priors, that "played nice" with the likelihood function to produce a tractable integral. (p. 115, *emphasis* in the original)
However, the simple model + conjugate prior approach has its limitations. As we'll see, we often want to fit complex models without shackling ourselves with conjugate priors—which can be quite a pain to work with. Happily,
> another kind of approximation involves randomly sampling a large number of representative combinations of parameter values from the posterior distribution. In recent decades, many such algorithms have been developed, generally referred to as Markov chain Monte Carlo (MCMC) methods. What makes these methods so useful is that they can generate representative parameter-value combinations from the posterior distribution of complex models *without* computing the integral in Bayes' rule. It is the development of these MCMC methods that has allowed Bayesian statistical methods to gain practical use. (pp. 115--116, *emphasis* in the original)
## Session info {-}
```{r}
sessionInfo()
```
```{r, eval = F, echo = F}
# remove our objects
rm(d, bernoulli, theta_sequence, bernoulli_likelihood, small_data, large_data)
```
```{r, echo = F, message = F, warning = F, results = "hide"}
ggplot2::theme_set(ggplot2::theme_grey())
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```