-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path25.Rmd
830 lines (626 loc) · 40.5 KB
/
25.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
```{r, echo = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
```
# Tools in the Trunk
"This chapter includes some important topics that apply to many different models throughout the book... The sections can be read independently of each other and at any time" [@kruschkeDoingBayesianData2015, p. 721].
## Reporting a Bayesian analysis
> Bayesian data analyses are not yet standard procedure in many fields of research, and no conventional format for reporting them has been established. Therefore, the researcher who reports a Bayesian analysis must be sensitive to the background knowledge of his or her specific audience, and must frame the description accordingly. (p. 721)
At the time of this writing (mid 2022), this is still generally the case. See @aczelDiscussionPointsBayesian2020, [*Discussion points for Bayesian inference*](https://www.researchgate.net/publication/338849264_Discussion_points_for_Bayesian_inference), for a recent discussion from several Bayesian scholars. For a take from a familiar voice, see Kruschke's [-@kruschke2021BayesianAnalysisReporting] [*Bayesian analysis reporting guidelines*](https://doi.org/10.1038/s41562-021-01177-7).
### Essential points.
> Recall the basic steps of a Bayesian analysis from [Section 2.3][The steps of Bayesian data analysis] (p. 25): Identify the data, define a descriptive model, specify a prior, compute the posterior distribution, interpret the posterior distribution, and, check that the model is a reasonable description of the data. Those steps are in logical order, with each step building on the previous step. That logical order should be preserved in the report of the analysis. (p. 722)
Kruschke then gave recommendations for motivating Bayesian inference. His [-@kruschkeBayesianNewStatistics2018] paper with Liddell, [*The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective*](https://link.springer.com/content/pdf/10.3758/s13423-016-1221-4.pdf), might be helpful in this regard. Many of the other points Kruschke made in this section (e.g., adequately reporting the data structure, the priors, evidence for convergence) can be handled by adopting open science practices.
If your data and research questions are simple and straightforward, you might find it easy to detail these and other concerns in the primary manuscript. The harsh reality is many journals place tight constraints on word and/or page limits. If your projects are not of the simple and straightforward type, supplemental materials are your friend. Regardless of a journal's policy on hosting supplemental materials on the official journal website, you can detail your data, priors, MCMC diagnostics, and all the other fine-grained details of your analysis in supplemental documents hosted in publicly-accessible repositories like the [Open Science Framework (OSF)](https://osf.io/) or [GutHub](https://github.com/). If possible, do consider making your data openly available. Regardless of the status of your data, please consider making all your **R** scripts available as supplementary material. To reiterate from Chapter 3, I strongly recommend checking out [R Notebooks](https://bookdown.org/yihui/rmarkdown/notebook.html) for that purpose. They are a type of R Markdown document with augmentations that make them more useful for working scientists. You can learn more about them [here](https://rstudio-pubs-static.s3.amazonaws.com/256225_63ebef4029dd40ef8e3679f6cf200a5a.html) and [here](https://www.r-bloggers.com/why-i-love-r-notebooks-2/). And for a more comprehensive overview, check out Xie, Allaire, and Grolemund's [-@xieMarkdownDefinitiveGuide2022] [*R markdown: The definitive guide*](https://bookdown.org/yihui/rmarkdown/).
### Optional points.
For more thoughts on robustness checks, check out a couple Gelman's blog posts, [*What's the point of a robustness check?*](https://statmodeling.stat.columbia.edu/2017/11/29/whats-point-robustness-check/) and [*Robustness checks are a joke*](https://statmodeling.stat.columbia.edu/2018/11/14/robustness-checks-joke/), along with the action in the comments section.
In addition to posterior predictive checks, which are great [see @kruschkePosteriorPredictiveChecks2013], consider prior predictive checks, too. For a great introduction to the topic, check out Gabry, Simpson, Vehtari, Betancourt, and Gelman's [-@gabry2019visualization] [*Visualization in Bayesian workflow*](https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/rssa.12378).
### Helpful points.
For more ideas on open data, check out Rouder's [-@rouderWhatWhyHow2016] [*The what, why, and how of born-open data*](https://link.springer.com/article/10.3758/s13428-015-0630-z). You might also check out Klein and colleagues' [-@kleinPracticalGuideTransparency2018] [*A practical guide for transparency in psychological science*](https://lirias.kuleuven.be/1999530?limo=0) and Martone, Garcia-Castro, and VandenBos's [-@martoneDataSharingPsychology2018] [*Data sharing in psychology*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5920518/pdf/nihms935471.pdf).
As to posting your model fits, this could be done in any number of ways, including as official supplemental materials hosted by the journal, on GitHub, or on the OSF. At a base level, this means saving your fits as external files. We've already been modeling this with our `brm()` code throughout this book. With the `save` argument, we saved the model fits within the [`fits` folder on GitHub](https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse/tree/master/fits). You might adopt a similar approach. But do be warned: **brms** fit objects contain a copy of the data used to create them. For example, here's how we might reload `fit24.1` from last chapter.
```{r, message = F}
fit24.1 <- readRDS("fits/fit24.01.rds")
```
By indexing the fit object with `$data`, you can see the data.
```{r, warning = F, message = F}
library(tidyverse)
library(brms)
fit24.1$data %>%
glimpse()
```
Here's a quick way to remove the data from the fit object.
```{r}
fit24.1$data <- NULL
```
Confirm it worked.
```{r}
fit24.1$data
```
Happily, the rest of the information is still there for you. E.g., here's the summary.
```{r}
summary(fit24.1)
```
## Functions for computing highest density intervals
You can find a copy of Kruschke's scripts, including `DBDA2E-utilities.R`, at [https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse/tree/master/data.R](https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse/tree/master/data.R).
### R code for computing HDI of a grid approximation.
> We can imagine the grid approximation of a distribution as a landscape of poles sticking up from each point on the parameter grid, with the height of each pole indicating the probability mass at that discrete point. We can imagine the highest density region by visualizing a rising tide: We gradually flood the landscape, monitoring the total mass of the poles that protrude above water, stopping the flood when 95% (say) of the mass remains protruding. The waterline at that moment defines the highest density region [e.g., @hyndmanComputingGraphingHighest1996]. (p. 725)
We can use Kruschke's `HDIofGrid()` function for such a task.
```{r}
HDIofGrid <- function(probMassVec, credMass = 0.95) {
# Arguments:
# probMassVec is a vector of probability masses at each grid point.
# credMass is the desired mass of the HDI region.
# Return value:
# A list with components:
# indices is a vector of indices that are in the HDI
# mass is the total mass of the included indices
# height is the smallest component probability mass in the HDI
# Example of use: For determining HDI of a beta(30,12) distribution
# approximated on a grid:
# > probDensityVec = dbeta( seq(0,1,length=201) , 30 , 12 )
# > probMassVec = probDensityVec / sum( probDensityVec )
# > HDIinfo = HDIofGrid( probMassVec )
# > show( HDIinfo )
sortedProbMass <- sort(probMassVec, decreasing = TRUE)
HDIheightIdx <- min(which(cumsum(sortedProbMass) >= credMass))
HDIheight <- sortedProbMass[HDIheightIdx]
HDImass <- sum(probMassVec[probMassVec >= HDIheight])
return(list(indices = which(probMassVec >= HDIheight),
mass = HDImass, height = HDIheight))
}
```
I found Kruschke's description of his `HDIofGrid()` a bit opaque. Happily, we can understand this function with a little help from an example posted at [https://rdrr.io/github/kyusque/DBDA2E-utilities/man/HDIofGrid.html](https://rdrr.io/github/kyusque/DBDA2E-utilities/man/HDIofGrid.html).
```{r}
prob_density_vec <- dbeta(seq(0, 1, length = 201), 30, 12)
prob_mass_vec <- prob_density_vec / sum(prob_density_vec)
HDI_info <- HDIofGrid(prob_mass_vec)
show(HDI_info)
```
To walk that through a bit, `prob_density_vec` is a vector of density values for $\operatorname{Beta} (30, 12)$ based on 201 evenly-spaced values spanning the parameter space for $\theta$ (i.e., from 0 to 1). In the second line, we converted those density values to the probability metric by dividing each by their sum, which we then saved as `prob_mass_vec`. In the third line we shoved those probability values into Kruschke's `HDIofGrid()` and saved the results as `HDI_info`. The output of the fourth line, `show(HDI_info)`, showed us the results (i.e., the contents of `HDI_info`).
As to those results, the values in saved as `$indices` are the row numbers for all cases in `prob_mass_vec` that were within the HDI. The value in `$mass` showed the actual width of the HDI. Because we're only working with finite samples (i.e., `length = 201`), we won't likely get a perfect 95% HDI. The value in `$height` is the density value for *the waterline that defines the highest density region*. A plot might make that less abstract.
```{r, fig.width = 7, fig.height = 2.75, warning = F, message = F}
library(ggthemes)
theme_set(
theme_base() +
theme(plot.background = element_rect(color = "transparent"))
)
# wrangle
tibble(row = 1:length(prob_density_vec),
theta = seq(0, 1, length = length(prob_density_vec)),
density = prob_mass_vec,
cred = if_else(row %in% HDI_info$indices, 1, 0)) %>%
# plot
ggplot(aes(x = theta, y = density)) +
# HDI
geom_area(aes(fill = cred == 1)) +
# density line
geom_line(color = "skyblue", size = 1) +
# waterline
geom_hline(yintercept = HDI_info$height, linetype = 3, size = 1/4) +
# fluff
annotate(geom = "text", x = .2, y = 0.0046,
label = '"waterline" that defines all points\ninside the highest density region') +
annotate(geom = "text", x = .715, y = 0.01,
label = "95.28% HDI", color = "black", size = 5) +
scale_fill_manual(values = c("transparent", "grey67")) +
xlab(expression(theta))
```
You might have noticed our `theme_set()` lines at the top of that code block. For the plots in this chapter, we'll give a nod to the source material and make them with a similar aesthetic to Kruschke's figures.
### HDI of unimodal distribution is shortest interval.
> The algorithms [in the next sections] find the HDI by searching among candidate intervals of mass $M$. The shortest one found is declared to be the HDI. It is an approximation, of course. See @chenMonteCarloEstimation1999 for more details, and Chen, He, Shao, and Xu [-@chenMonteCarloGap2003] for dealing with the unusual situation of multimodal distributions. (p. 727)
### R code for computing HDI of a MCMC sample.
In this section, Kruschke provided the code for his `HDIofMCMC()` function. We recreate it, below, with a few mild formatting changes. The `ggthemes::theme_base()` theme gets us most of the way there.
```{r}
HDIofMCMC <- function(sampleVec, credMass = .95) {
# Computes highest density interval from a sample of representative values,
# estimated as shortest credible interval.
# Arguments:
# sampleVec
# is a vector of representative values from a probability distribution.
# credMass
# is a scalar between 0 and 1, indicating the mass within the credible
# interval that is to be estimated.
# Value:
# HDIlim is a vector containing the limits of the HDI
sortedPts <- sort(sampleVec)
ciIdxInc <- ceiling(credMass * length(sortedPts))
nCIs <- length(sortedPts) - ciIdxInc
ciWidth <- rep(0, nCIs)
for (i in 1:nCIs) {
ciWidth[i] <- sortedPts[i + ciIdxInc] - sortedPts[i]
}
HDImin <- sortedPts[which.min(ciWidth)]
HDImax <- sortedPts[which.min(ciWidth) + ciIdxInc]
HDIlim <- c(HDImin, HDImax)
return(HDIlim)
}
```
Let's continue working with `fit24.1` to see how Kruschke's `HDIofMCMC()` works. First we need to extract the posterior draws.
```{r}
draws <- as_draws_df(fit24.1)
```
Here's how you might use the function to get the HDIs for the intercept parameter.
```{r}
HDIofMCMC(draws$b_Intercept)
```
Kruschke's `HDIofMCMC()` works very much the same as the summary functions from **tidybayes**. For example, here's good old `tidybayes::mode_hdi()`.
```{r, warning = F, message = F}
library(tidybayes)
mode_hdi(draws$b_Intercept)
```
If you'd like to use **tidybayes** to just pull the HDIs without the extra information, just use the `hdi()` function.
```{r}
hdi(draws$b_Intercept)
```
Just in case you're curious, Kruschke's `HDIofMCMC()` function returns the same information as `tidybayes::hdi()`. Let's confirm.
```{r}
HDIofMCMC(draws$b_Intercept) == hdi(draws$b_Intercept)
```
Identical.
### R code for computing HDI of a function.
> The function described in this section finds the HDI of a unimodal probability density function that is specified mathematically in R. For example, the function can find HDI's of normal densities or of beta densities or of gamma densities, because those densities are specified as functions in R. (p. 728)
If you recall, we've been using this function off and on since Chapter 4. Here is it, again, with mildly reformatted code and parameter names.
```{r}
hdi_of_icdf <- function(name, width = .95, tol = 1e-8, ... ) {
# Arguments:
# `name` is R's name for the inverse cumulative density function
# of the distribution.
# `width` is the desired mass of the HDI region.
# `tol` is passed to R's optimize function.
# Return value:
# Highest density iterval (HDI) limits in a vector.
# Example of use: For determining HDI of a beta(30, 12) distribution, type
# `hdi_of_icdf(qbeta, shape1 = 30, shape2 = 12)`
# Notice that the parameters of the `name` must be explicitly stated;
# e.g., `hdi_of_icdf(qbeta, 30, 12)` does not work.
# Adapted and corrected from Greg Snow's TeachingDemos package.
incredible_mass <- 1.0 - width
interval_width <- function(low_tail_prob, name, width, ...) {
name(width + low_tail_prob, ...) - name(low_tail_prob, ...)
}
opt_info <- optimize(interval_width, c(0, incredible_mass),
name = name, width = width,
tol = tol, ...)
hdi_lower_tail_prob <- opt_info$minimum
return(c(name(hdi_lower_tail_prob, ...),
name(width + hdi_lower_tail_prob, ...)))
}
```
Here's how it works for the standard normal distribution.
```{r}
hdi_of_icdf(qnorm, mean = 0, sd = 1)
```
By default, it returns 95% HDIs. Here's how it'd work if you wanted the 80% intervals for $\operatorname{Beta}(2, 2)$.
```{r}
hdi_of_icdf(qbeta, shape1 = 2, shape2 = 2, width = .8)
```
## Reparameterization
> There are situations in which one parameterization is intuitive to express a distribution, but a different parameterization is required for mathematical convenience. For example, we may think intuitively of the standard deviation of a normal distribution, but have to parameterize the distribution in terms of the precision (i.e., reciprocal of the variance). (p. 729)
The details in the rest of this section are beyond the scope of this project.
## Censored data in ~~JAGS~~ brms
"In many situations some data are censored, which means that their values are known only within a certain range" (p. 732) Happily, **brms** is capable of handling censored variables. The setup is a little different from how Kruschke described for JAGS. From the `brmsformula` section of the [**brms** reference manual](https://CRAN.R-project.org/package=brms/brms.pdf) [@brms2022RM], we read:
> With the exception of categorical, ordinal, and mixture families, left, right, and interval censoring can be modeled through `y | cens(censored) ~ predictors`. The censoring variable (named `censored` in this example) should contain the values `'left'`, `'none'`, `'right'`, and `'interval'` (or equivalently `-1`, `0`, `1`, and `2`) to indicate that the corresponding observation is left censored, not censored, right censored, or interval censored. For interval censored data, a second variable (let's call it `y2`) has to be passed to `cens`. In this case, the formula has the structure `y | cens(censored,y2) ~ predictors`. While the lower bounds are given in `y`, the upper bounds are given in `y2` for interval censored data. Intervals are assumed to be open on the left and closed on the right: `(y,y2]`.
We'll make sense of all this in just a moment. First, let's see how Kruschke described the example in the text.
> To illustrate why it is important to include censored data in the analysis, consider a case in which $N = 500$ values are generated randomly from a normal distribution with $\mu = 100$ and $\sigma = 15$. Suppose that values above $106$ are censored, as are values in the interval between $94$ and $100$. For the censored values, all we know is the interval in which they occurred, but not their exact value. (p. 732)
I'm not aware that we have access to Kruschke's censored data, so we'll just make our own based on his description. We'll start off by simulating the idealized uncensored data, `y`, based on $\operatorname{Normal}(100, 15)$.
```{r}
n <- 500
set.seed(25)
d <- tibble(y = rnorm(n, mean = 100, sd = 15))
```
To repeat, Kruschke described two kinds of censoring:
* "values above 106 are censored,"
* "as are values in the interval between 94 and 100."
`r emo::ji("warning")` Rather than examine both kinds of censoring at once, as in the text, I'm going to slow down and break this section up. First, we'll explore right censoring. Second, we'll explore interval censoring. For our grand finale, we'll combine the two, as in the text.
### Right censoring: When you're uncertain beyond a single threshold.
In the case where the "values above 106 are censored," we need to save a single threshold value. We'll call it `t1`.
```{r}
t1 <- 106
```
Now we can use our `t1` threshold value to make a right censored version of the `y` column called `y1`. We'll also make a character value listing out a row's censoring status in nominal terms.
```{r}
d <-
d %>%
mutate(y1 = if_else(y > t1, t1, y),
cen1 = if_else(y > t1, "right", "none"))
d
```
When the values in `y1` are not censored, we see `cen1 == "none"`. For all cases where the original `y` value exceeded the `t1` threshold (i.e., 106), `y1 == 106` and `cen1 == "right"`. We are using the terms `"none"` and `"right"` because they are based on the block quote from the **brms** reference manual. For plotting purposes, we'll make a new variable, `y_na`, that only has values for which `cen1 == "none"`.
```{r}
d <-
d %>%
mutate(y_na = ifelse(cen1 == "none", y, NA))
d
```
Now let's get a better sense of the data with a few histograms.
```{r, fig.width = 6, fig.height = 2.5, warning = F}
d %>%
pivot_longer(-cen1) %>%
mutate(name = factor(name, levels = c("y", "y1", "y_na"))) %>%
ggplot(aes(x = value)) +
geom_histogram(size = .25, binwidth = 2, boundary = 106,
fill = "skyblue4", color = "white") +
xlab(NULL) +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, ncol = 3)
```
The original un-censored data are in `y`. The censored data, for which all values above 106 were simply recoded 106, is `y1`. The `y_na` column shows what the distribution would look like if the censored values were omitted.
Here's how we might fit a model which only uses the uncensored values, those in `y_na`.
```{r fit25.1a, warning = F, message = F, results = "hide"}
# define the stanvars
mean_y <- mean(d$y_na, na.rm = T)
sd_y <- sd(d$y_na, na.rm = T)
stanvars <-
stanvar(mean_y, name = "mean_y") +
stanvar(sd_y, name = "sd_y")
# fit the model
fit25.1a <-
brm(data = d,
family = gaussian,
y_na ~ 1,
prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
file = "fits/fit25.01a")
```
Check the summary for the naïve model.
```{r}
print(fit25.1a)
```
Relative to the true data-generating process for the original variable `y`, $\operatorname{Normal}(100, 15)$, those parameters look pretty biased. Now let's practice fitting our first censored model. Here we use the `y1 | cens(cen1)` syntax in the left side of the model `formula` to indicate that the criterion variable, `y1`, has been censored according to the process as defined in the `cens()` function. Within `cens()` we inserted our nominal variable, `cens1`, which indicates which cases are censored (`"right"`) or not (`"none"`).
This model is one of the rare occasions where we'll set our initial values for the model intercept. In my first few attempts, `brm()` had great difficulty initializing the chains using the default initial values. We'll help it out by setting them at `mean_y`. Recall that when you set custom initial values in **brms**, you save them in a list with the number of lists equaling the number of HMC chains. Because we're using the default `chains = 4`, well need four lists of intercept start values, `mean_y`. You can set them to different values, if you'd like.
```{r fit25.2a, warning = F, message = F, results = "hide"}
inits <- list(Intercept = mean_y)
inits_list <- list(inits, inits, inits, inits)
fit25.2a <-
brm(data = d,
family = gaussian,
y1 | cens(cen1) ~ 1,
prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
# here we insert our start values for the intercept
inits = inits_list,
file = "fits/fit25.02a")
```
Now check the summary for the model accounting for the censoring.
```{r}
print(fit25.2a)
```
To get a better sense of what these models are doing, we might do a posterior predictive check with `predict()`.
```{r, fig.width = 6, fig.height = 4.5, warning = F}
# original
p1 <-
d %>%
pivot_longer(cols = c(y, y_na)) %>%
ggplot(aes(x = value)) +
geom_histogram(size = .25, binwidth = 2, boundary = 106,
fill = "skyblue4", color = "white") +
labs(subtitle = "Data",
x = NULL) +
scale_x_continuous(breaks = 3:5 * 25) +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, ncol = 1)
# posterior simulations
set.seed(25)
p2 <-
predict(fit25.1a,
summary = F,
ndraws = 4) %>%
data.frame() %>%
mutate(sim = 1:n()) %>%
pivot_longer(-sim) %>%
ggplot(aes(x = value)) +
geom_histogram(size = .25, binwidth = 2, boundary = 106,
fill = "skyblue", color = "white") +
geom_vline(xintercept = t1, linetype = 2) +
labs(subtitle = "Simulations",
x = NULL) +
scale_x_continuous(breaks = 3:5 * 25) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = range(d$y)) +
facet_wrap(~ sim, ncol = 2, labeller = label_both)
# combine
library(patchwork)
p1 + p2 +
plot_layout(widths = c(1, 2)) +
plot_annotation(title = "PP check for the naïve model, fit25.1a")
```
On the left column, the have the original uncensored data `y` and the `y_na` data where the censored values were removed. The four lighter histograms on the right are individual simulations based on our naïve model, `fit25.1a`, which only used the `y_na` data. In all cases, the simulated data are biased to be too small compared with the original data, `y`. They also have the undesirable quality that they don't even match up with the `y_na` data, which drops sharply off at the 106 threshold, whereas the simulated data all contain values exceeding that point (depicted by the dashed vertical lines). Now let's see what happens when we execute a few posterior predictive checks with our first censored model, `fit25.2a`.
```{r, fig.width = 6, fig.height = 4.5, warning = F}
p1 <-
d %>%
pivot_longer(cols = c(y, y1)) %>%
ggplot(aes(x = value)) +
geom_histogram(size = .25, binwidth = 2, boundary = 106,
fill = "skyblue4", color = "white") +
labs(subtitle = "Data",
x = NULL) +
scale_x_continuous(breaks = 3:5 * 25) +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, ncol = 1, scales = "free_y")
set.seed(25)
p2 <-
predict(fit25.2a,
summary = F,
ndraws = 4) %>%
data.frame() %>%
mutate(sim = 1:n()) %>%
pivot_longer(-sim) %>%
ggplot(aes(x = value)) +
geom_histogram(size = .25, binwidth = 2, boundary = 106,
fill = "skyblue", color = "white") +
labs(subtitle = "Simulations",
x = NULL) +
scale_x_continuous(breaks = 3:5 * 25) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = range(d$y)) +
facet_wrap(~ sim, ncol = 2, labeller = label_both)
# combine
p1 + p2 +
plot_layout(widths = c(1, 2)) +
plot_annotation(title = "PP check for the censored model, fit25.2a")
```
Now the posterior predictive draws match up nicely from the original uncensored data `y` in terms of central tendency, dispersion, and their overall shapes. This is the magic of a model that accounts for right (or left) censoring. It can take all those stacked-up 106 values and approximate the underlying distribution, had they been more accurately recorded.
### Interval censoring: When your values are somewhere in the middle.
Time to move onto interval censoring. Kruschke's example in the text included values censored in the interval between 94 and 100. This will require two more thresholds, which we'll call `t2` and `t3`.
```{r}
t2 <- 94
t3 <- 100
```
We should revisit part of the block quote from the **brms** reference manual, above:
> For interval censored data, a second variable (let's call it `y2`) has to be passed to `cens`. In this case, the formula has the structure `y | cens(censored,y2) ~ predictors`. While the lower bounds are given in `y`, the upper bounds are given in `y2` for interval censored data. Intervals are assumed to be open on the left and closed on the right: `(y,y2]`
It's a little unclear, to me, if this is how Kruschke defined his intervals, but since we're working with **brms** we'll just use this convention. Thus, we will define "values in the interval between 94 and 100" as `y >= t2 & y < t3`. Here we'll follow the **brms** convention and define a new variable `y2` for our lower bound and `y3` for our upper bound. Additionally, the new `cen2` variable will tell us whether a case is interval censored (`"interval"`) or not (`"none"`). We have also redefined `y_na` in terms of our `cen2` variable.
```{r}
d <-
d %>%
mutate(y2 = if_else(y >= t2 & y < t3, t2, y),
y3 = if_else(y >= t2 & y < t3, t3, y),
cen2 = if_else(y >= t2 & y < t3, "interval", "none")) %>%
mutate(y_na = ifelse(cen2 == "none", y, NA))
d
```
Now let's get a better sense of the new data with a few histograms.
```{r, fig.width = 4.25, fig.height = 4.5, warning = F}
d %>%
pivot_longer(cols = c(y, y2, y3, y_na)) %>%
mutate(name = factor(name, levels = c("y", "y2", "y3", "y_na"))) %>%
ggplot(aes(x = value)) +
geom_histogram(size = .25, binwidth = 2, boundary = 106,
fill = "skyblue4", color = "white") +
labs(subtitle = "Our data have been updated",
x = NULL) +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, ncol = 2)
```
Now we fit two models. The first, `fit25.1b`, will only used the values not interval censored, `y_na`. The second model, `fit25.2b`, will use the `cens()` function to fit account for the interval censored data with the `y2`, `y3`, and `cen2` columns. As before, we'll want to use `inits` to help the censored model run smoothly.
```{r fit25.1b, warning = F, message = F, results = "hide"}
# naïve
fit25.1b <-
brm(data = d,
family = gaussian,
y_na ~ 1,
prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
file = "fits/fit25.01b")
# censored
fit25.2b <-
brm(data = d,
family = gaussian,
y2 | cens(cen2, y3) ~ 1,
prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
# here we insert our start values for the intercept
inits = inits_list,
file = "fits/fit25.02b")
```
Review the model summaries.
```{r}
print(fit25.1b)
print(fit25.2b)
```
In this case, both models did a reasonable job approximating the original data-generating parameters, $\operatorname{Normal}(100, 15)$. However, the naïve model `fit25.1b` did so at a considerable loss of precision, as indicated by the posterior standard deviations. To give a better sense, here are the model parameters in an interval plot.
```{r, fig.width = 7, fig.height = 1.75, warning = F}
# wrangle
bind_rows(as_draws_df(fit25.1b),
as_draws_df(fit25.2b)) %>%
mutate(fit = rep(c("fit25.1b", "fit25.2b"), each = n() / 2)) %>%
pivot_longer(b_Intercept:sigma) %>%
mutate(parameter = if_else(name == "b_Intercept", "mu", "sigma")) %>%
# plot
ggplot(aes(x = value, y = fit)) +
stat_interval(.width = c(.5, .95)) +
scale_color_manual("ETI",
values = c("skyblue2", "skyblue4"),
labels = c("95%", "50%")) +
labs(x = "marginal posterior",
y = NULL) +
theme(axis.ticks.y = element_blank()) +
facet_wrap(~ parameter, scales = "free_x", labeller = label_parsed)
```
One could also argue the parameters for the censored model `fit25.2b` were a little less biased. However, I'd be leery of making that as a general claim based on this example, alone.
### Complex censoring: When the going gets tough...
Okay, let's put what we've learned together and practice with data that are both right AND interval censored. This is going to require some tricky coding on our part. Since interval censoring is in play, we'll need two `y` variables, again. When the data are indeed interval censored, their lower and upper bounds will be depicted by `y4` and `y5`, respectively. When the data are right censored, `y4` will contain the threshold 106 and `y5` will contain the original value from `y`. All this tricky information will be indexed in our nominal variable `cen3`. Once again, we update our `y_na` variable, too.
```{r}
d <-
d %>%
mutate(y4 = if_else(y >= t2 & y < t3, t2,
if_else(y > t1, t1, y)),
y5 = if_else(y >= t2 & y < t3, t3, y),
cen3 = if_else(y >= t2 & y < t3, "interval",
if_else(y > t1, "right", "none"))) %>%
mutate(y_na = ifelse(cen3 == "none", y, NA))
d
```
Explore the new data with histograms.
```{r, fig.width = 4.25, fig.height = 4.5, warning = F}
d %>%
pivot_longer(cols = c(y, y4, y5, y_na)) %>%
mutate(name = factor(name, levels = c("y", "y4", "y5", "y_na"))) %>%
ggplot(aes(x = value)) +
geom_histogram(size = .25, binwidth = 2, boundary = 106,
fill = "skyblue4", color = "white") +
labs(subtitle = "Our data have been updated, again",
x = NULL) +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, ncol = 2)
```
In the text, Kruschke reported he had 255 uncensored values (p. 732). Here's the breakdown of our data.
```{r}
d %>%
count(cen3)
```
We got really close! Now we fit our two models: `fit25.1c`, which ignores the censoring problem and just drops those cases, and `fit25.2c`, which makes slick use of the `cens()` function.
```{r fit25.1c, warning = F, message = F, results = "hide"}
# naïve
fit25.1c <-
brm(data = d,
family = gaussian,
y_na ~ 1,
prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
file = "fits/fit25.01c")
# censored
fit25.2c <-
brm(data = d,
family = gaussian,
y4 | cens(cen3, y5) ~ 1,
prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
# here we insert our start values for the intercept
inits = inits_list,
file = "fits/fit25.02c")
```
Compare the model summaries.
```{r}
print(fit25.1c)
print(fit25.2c)
```
The results for the naïve model are a mess and those from the censored model look great! Also note how the latter used all avaliable cases, whereas the former dropped nearly half of them.
Before we can make our version of Figure 25.4, we'll need to extract the posterior draws. We'll start with `fit25.1c`.
```{r}
draws <-
as_draws_df(fit25.1c) %>%
mutate(mu = b_Intercept,
`(mu-100)/sigma` = (b_Intercept - 100) / sigma)
head(draws)
```
These subplots look a lot like those from back in [Section 16.2][Outliers and robust estimation: The $t$ distribution]. Since this is the last plot from the book, it seems like we should go all out. To reduce some of the code redundancy with the six subplots of the marginal posteriors, we'll make a custom geom, `geom_hist()`.
```{r}
geom_hist <- function(xintercept = xintercept, binwidth = binwidth, ...) {
list(
geom_histogram(fill = "skyblue", color = "white", size = .2, binwidth = binwidth, boundary = 106),
geom_vline(xintercept = xintercept, color = "skyblue3", size = 1/2, linetype = 3),
stat_pointinterval(aes(y = 0), point_interval = mode_hdi, .width = .95),
scale_y_continuous(NULL, breaks = NULL)
)
}
```
Now we have our `geom_hist()`, here are the first three histograms for the marginal posteriors from `fit25.1c`.
```{r}
p1 <-
draws %>%
ggplot(aes(x = mu)) +
geom_hist(xintercept = 100, binwidth = 0.1) +
xlab(expression(mu))
p3 <-
draws %>%
ggplot(aes(x = sigma)) +
geom_hist(xintercept = 15, binwidth = 0.08) +
xlab(expression(sigma))
p4 <-
draws %>%
ggplot(aes(x = `(mu-100)/sigma`)) +
geom_hist(xintercept = 0, binwidth = 0.015) +
xlab(expression((mu-100)/sigma))
```
The histogram of the censored data with the posterior predictive density curves superimposed will take a little more work.
```{r, fig.width = 3, fig.height = 2}
n_lines <- 50
p2 <-
draws %>%
slice(1:n_lines) %>%
expand(nesting(.draw, mu, sigma),
y_na = seq(from = 40, to = 120, by = 1)) %>%
mutate(density = dnorm(x = y_na, mean = mu, sd = sigma)) %>%
ggplot(aes(x = y_na)) +
geom_histogram(data = d,
aes(y = stat(density)),
size = .25, binwidth = 2, boundary = 106,
fill = "skyblue4", color = "white") +
geom_line(aes(y = density, group = .draw),
size = 1/4, alpha = 1/3, color = "skyblue") +
scale_x_continuous("y", limits = c(40, 110)) +
scale_y_continuous(NULL, breaks = NULL)
```
Now extract the posterior draws from our censored model, `fit25.2`, and repeat the process.
```{r}
draws <-
as_draws_df(fit25.2c) %>%
mutate(mu = b_Intercept,
`(mu-100)/sigma` = (b_Intercept - 100) / sigma)
p5 <-
draws %>%
ggplot(aes(x = mu)) +
geom_hist(xintercept = 100, binwidth = 0.1) +
xlab(expression(mu))
p7 <-
draws %>%
ggplot(aes(x = sigma)) +
geom_hist(xintercept = 15, binwidth = 0.1) +
xlab(expression(sigma))
p8 <-
draws %>%
ggplot(aes(x = `(mu-100)/sigma`)) +
geom_hist(xintercept = 0, binwidth = 0.01) +
xlab(expression((mu-100)/sigma))
p6 <-
draws %>%
slice(1:n_lines) %>%
expand(nesting(.draw, mu, sigma),
y_na = seq(from = 40, to = 120, by = 1)) %>%
mutate(density = dnorm(x = y_na, mean = mu, sd = sigma)) %>%
ggplot(aes(x = y_na)) +
geom_histogram(data = d,
aes(y = stat(density)),
size = .25, binwidth = 2, boundary = 106,
fill = "skyblue4", color = "white") +
geom_line(aes(y = density, group = .draw),
size = 1/4, alpha = 1/3, color = "skyblue") +
scale_x_continuous("y", limits = c(40, 110)) +
scale_y_continuous(NULL, breaks = NULL)
```
Finally, combine the subplots, and annotate a bit.
```{r, fig.width = 6, fig.height = 8.5, warning = F}
((p1 | p2) / (p3 | p4) / (p5 | p6) / (p7 | p8)) +
plot_annotation(title = "This is our final plot, together.",
caption = expression(atop(italic("Upper quartet")*": Censored data omitted from analysis; parameter estimates are too small. ", italic("Lower quartet")*": Censored data imputed in known bins; parameter estimates are accurate."))) &
theme(plot.caption = element_text(hjust = 0))
```
To learn more about censored data, check out the nice lecture by Gordon Fox, [*Introduction to analysis of censored and truncated data*](https://www.youtube.com/watch?v=aPN10YYrC1M).
### Bonus: Truncation.
Though Kruschke didn't cover it, truncation is often confused with censoring. When you have actually collected all your values, but some are just less certain than you'd like, you have a censoring issue. That is, you can think of censoring as an issue with *measurement precision*. Truncation has to do with *data collection*. When there's some level in your criterion variable that you haven't collected data on, that's a truncation issue. For example, imagine you wanted to predict scores on an IQ test with socioeconomic status (SES). There might be a relation, there. But now imagine you only collected data on people with an IQ above 110. Those IQ data are severely left truncated. If you only care about the relation of SES and IQ for those with high IQ scores, your data are fine. But if you want to make general statements across the full range of IQ values, standard regression methods will likely produce biased results. Happily, **brms** allows users to accommodate truncated criterion variables with the `trunc()` function, which works in a similar way to the `cens()` function. For details on fitting truncated regression models with `trunc()`, check out the *Additional response information* subsection of the `brmsformula` section of the [**brms** reference manual](https://CRAN.R-project.org/package=brms/brms.pdf). The [end of the Fox lecture](https://youtu.be/aPN10YYrC1M?t=6153), above, briefly covers truncated regression. You can also find a [brief vignette on truncation](https://stats.idre.ucla.edu/r/dae/truncated-regression/) by the good folks at the UCLA Institute for Digital Research and Education.
## What Next?
"If you have made it this far and you are looking for more, you might peruse posts at [Kruschke's] blog, [https://doingbayesiandataanalysis.blogspot.com/](https://doingbayesiandataanalysis.blogspot.com/), and search there for topics that interest you." In addition to the other references Kruschke mentioned, you might also check out McElreath's [*Statistical rethinking*](https://xcelab.net/rm/statistical-rethinking/), both first [-@mcelreathStatisticalRethinkingBayesian2015] and second [-@mcelreathStatisticalRethinkingBayesian2020] editions. Much like this ebook, I have recoded both editions of *Statistical rethinking* in a **bookdown** form [@kurzStatisticalRethinkingBrms2020; @kurzStatisticalRethinkingSecondEd2021]. Click [here](https://bookdown.org/content/3890/) for the first and [here](https://bookdown.org/content/4857/) for the second. You can find other pedagogical material at my academic blog, [https://solomonkurz.netlify.com/post/](https://solomonkurz.netlify.com/post/). For assistance on **brms**-related issues, check out the **brms** section on the Stan forums at [https://discourse.mc-stan.org/c/interfaces/brms/36](https://discourse.mc-stan.org/c/interfaces/brms/36). Nicenboim, Schad, and Vasishth also have a nice new [-@nicenboim2021introduction] ebook called [*An introduction to Bayesian data analysis for cognitive science*](https://vasishth.github.io/bayescogsci/book/), which highlights **brms** in a way that could compliment the material we have practiced together.
Happy modeling, friends!
## Session info {-}
```{r}
sessionInfo()
```
```{r, echo = F}
# remove our objects
rm(fit24.1, HDIofGrid, prob_density_vec, prob_mass_vec, HDI_info, HDIofMCMC, draws, hdi_of_icdf, n, d, t1, mean_y, sd_y, stanvars, fit25.1a, inits, inits_list, fit25.2a, p1, p2, t2, t3, fit25.1b, fit25.2b, fit25.1c, fit25.2c, geom_hist, p3, p4, n_lines, p5, p7, p8, p6)
```
```{r, echo = F, message = F, warning = F, results = "hide"}
ggplot2::theme_set(ggplot2::theme_grey())
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```