#16 from jhrcook/more-notes

Notes on additional sections not included in the course
jhrcook · Jan 31, 2022 · 44cbc5e · 44cbc5e
2 parents 21997ea + e799c4e
commit 44cbc5e
Show file tree

Hide file tree

Showing 43 changed files with 46,935 additions and 168 deletions.
diff --git a/_site.yml b/_site.yml
@@ -39,6 +39,20 @@ navbar:
           href: notes/11_normal-approx-freq-properties_bda3-04.html
         - text: "Section 12. Extended topics"
           href: notes/12_extended-topics.html
+        - text: "13. Notes on 'Ch 14. Introduction to regression models'"
+          href: notes/13_intro-to-regression-models_bda3-14.Rmd
+        - text: "14. Notes on 'Ch 15. Hierarchical linear models'"
+          href: notes/14_hierarchical-linear-models_bda3-15.html
+        - text: "17. Notes on 'Ch 19. Parametric nonlinear models'"
+          href: notes/17_parametric-nonlinear-models_bda3-19.html
+        - text: "18. Notes on 'Ch 20. Basis function models'"
+          href: notes/18_basis-function-models_bda3-20.html
+        - text: "19. Notes on 'Ch 21. Gaussian process models'"
+          href: notes/19_gaussian-processes_bda3-21.html
+        - text: "20. Notes on 'Ch 22. Finite mixture models'"
+          href: notes/20_finite-mixture-models_bda3-22.html
+        - text: "21. Notes on 'Ch 23. Dirichlet process models'"
+          href: notes/21_Dirichlet-process-models_bda3-23.html
     - text: "Exercises"
       menu:
         - text: "Chapter 1"
@@ -47,6 +61,8 @@ navbar:
           href: book-exercises/bda-exercises-02.html
         - text: "Chapter 7"
           href: book-exercises/bda-exercises-07.html
+        - text: "Chapter 19"
+          href: book-exercises/bda-exercises-19.html
     - text: "Assignments"
       menu:
         - text: "Assignment 1"

diff --git a/book-exercises/bda-exercises-19.Rmd b/book-exercises/bda-exercises-19.Rmd
@@ -0,0 +1,329 @@
+---
+title: "Chapter 19 Exercises - Reproducing the the 'serial dilution assay'"
+date: "2021-12-10"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300)
+# dpi = 400, fig.width = 7, fig.height = 4.5, fig.retina = TRUE
+
+rfiles <- list.files(here::here("src"), full.names = TRUE, pattern = "R$")
+for (rfile in rfiles) {
+  source(rfile)
+}
+
+set.seed(0)
+```
+
+> I turned this exercise into a blog post on my [webiste](https://joshuacook.netlify.app).
+
+In chapter 17 "Parametric nonlinear models" of *Bayesian Data Analysis*[^1] by Gelman *et al.*, the authors present an example of fitting a curve to a [serial dilution](https://en.wikipedia.org/wiki/Serial_dilution) standard curve and using it to estimate unknown concentrations.
+Below, I build the model with Stan and fit it using MCMC.
+Unfortunately, I was unable to find the original data in Gelman's original publication of the model[^2].
+The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data.
+
+> The source code for this post is in a [repository](https://github.com/jhrcook/bayesian-data-analysis-course) of my work for Aki Vehtari's Bayesian Data Analysis [course](https://avehtari.github.io/BDA_course_Aalto/index.html).
+
+[^1]: Gelman, Andrew, et al. Bayesian Data Analysis. CRC Press Etc., 2015.
+[^2]: Gelman A, Chew GL, Shnaidman M. Bayesian analysis of serial dilution assays. Biometrics. 2004 Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. https://pubmed.ncbi.nlm.nih.gov/15180666/
+
+## Setup
+
+```{r setup-show, message=FALSE, warning=FALSE}
+library(rstan)
+library(tidybayes)
+library(patchwork)
+library(tidyverse)
+
+options(mc.cores = parallel::detectCores())
+rstan_options(auto_write = TRUE)
+
+theme_set(
+  theme_classic() +
+    theme(
+      panel.grid.major = element_line(),
+      strip.background = element_blank(),
+      plot.title = element_text(hjust = 0.5)
+    )
+)
+
+SNS_BLUE <- "#1F77B4"
+STAN_RED <- "#B2171D"
+```
+
+As mentioned above, I couldn't find the original data, so I copied it from the book's figure 19.3 on page 473.
+
+```{r}
+dilution_standards_data <- tibble::tribble(
+  ~conc, ~dilution, ~y,
+  0.64, 1, c(101.8, 121.4),
+  0.32, 1 / 2, c(105.2, 114.1),
+  0.16, 1 / 4, c(92.7, 93.3),
+  0.08, 1 / 8, c(72.4, 61.1),
+  0.04, 1 / 16, c(57.6, 50.0),
+  0.02, 1 / 32, c(38.5, 35.1),
+  0.01, 1 / 64, c(26.6, 25.0),
+  0, 0, c(14.7, 14.2),
+) %>%
+  mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
+  unnest(c(y, rep))
+
+knitr::kable(dilution_standards_data)
+```
+
+The following plot shows the two standard dilution curves.
+They are quite similar.
+
+```{r}
+data_plot <- dilution_standards_data %>%
+  ggplot(aes(x = conc, y = y, color = rep)) +
+  geom_line(alpha = 0.5, linetype = 2) +
+  geom_point(alpha = 0.8) +
+  scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+  scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+  scale_color_brewer(type = "qual", palette = "Set1") +
+  theme(
+    legend.position = c(0.8, 0.2),
+    legend.background = element_blank()
+  ) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Serial dilution standard curve",
+    color = "replicate"
+  )
+
+data_plot
+```
+
+## Modeling
+
+### Model specification
+
+The model uses a normal likelihood to describe the posterior distribution $p(y|x)$.
+The mean of the likelihood is defined for a given concentration $x$ using the standard equation used in the field:
+
+$$
+\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + (x/\beta_3)^{-\beta_4}} \\
+$$
+
+
+The model is a scaled and shifted logistic curve.
+This structure results in the following interpretations for $\beta$, all of which are restricted to positive values:
+
+1. $\beta_1$: color intensity when the concentration is 0
+2. $\beta_2$: increase to saturation
+3. $\beta_3$: the inflection point of the curve
+4. $\beta_4$: rate of saturation
+
+Below are the prior distributions for $\beta$. Note that they are are drastically different scales - this is critical to help the model fit the data.
+
+$$
+\beta_1 \sim N(10, 2.5) \\
+\beta_2 \sim N(100, 5) \\
+\beta_3 \sim N(0, 1) \\
+\beta_4 \sim N(0, 2.5)
+$$
+
+The measurement error of the model, representing the variance in the model's likelihood is defined as follows:
+
+$$
+\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} \rgroup^{2\alpha} \sigma^2_y
+$$
+
+Here, $\alpha$, restricted to lie between 0 and 1, allows the variance to be higher for larger measurement values.
+$A$ is a constant (set to 30 by the authors) that allows $\sigma_y$ to be more easily interpreted as the variance from "typical" measurements.
+Below are the priors for the new variables in the model.
+
+$$
+\alpha \sim \text{Beta}(1, 1) \qquad
+\sigma \sim |N(0, 2.5)|
+$$
+
+### In Stan
+
+Below is the Stan code for the model.
+It looks very similar to the mathematical description of the model, a nice feature of the Stan probabilistic programming language.
+
+The centrality and variance of the likelihood are calculated separately as `g` and `tau` so they can be used in the `model` and `generated quantities` block without duplicating the code.
+The `log_lik` is calculated so that PSIS-LOO cross validation can be estimated.
+I also included the ability to provide new data to make predictions over as `xnew`.
+
+```{r}
+dilution_model_file <- here::here("models", "serial-dilution.stan")
+writeLines(readLines(dilution_model_file))
+```
+
+### Sampling
+
+As mentioned above, specifically defining the prior distributions for each $\beta$ is necessary for MCMC to accurately sample from the posterior.
+With those helping restrict the range of their values, the model fit very well.
+
+```{r}
+xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
+model_data <- list(
+  N = nrow(dilution_standards_data),
+  A = 30,
+  x = dilution_standards_data$conc,
+  y = dilution_standards_data$y,
+  M = length(xnew),
+  xnew = xnew
+)
+
+dilution_model <- stan(
+  dilution_model_file,
+  model_name = "serial-dilution",
+  data = model_data,
+  refresh = 1000
+)
+```
+
+### Posterior distributions
+
+The next step is to analyze the posterior draws of the model.
+We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing ("fuzzy caterpillars") and checking diagnostic values such as $\widehat{R}$ and $n_\text{eff}$.
+The trace plots are shown below followed by a table of the posteriors with the diagnostic values.
+Everything looks good suggesting MCMC was successful.
+
+```{r}
+model_pars <- c("beta", "alpha", "sigma")
+rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+  theme(legend.position = "bottom")
+```
+
+```{r}
+print(dilution_model, pars = model_pars)
+```
+
+The following density plots show the posterior distributions of the model parameters $\beta$, $\alpha$, and $\sigma$.
+
+```{r}
+rstan::stan_dens(
+  dilution_model,
+  pars = model_pars,
+  separate_chains = FALSE,
+  alpha = 0.6
+) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0, 0.02)))
+```
+
+### Posterior predictive check
+
+Below is a plot of the posterior predictive distributions of the model on the original data.
+1,000 individual simulations are plotted in blue and the mean in black.
+The simulated curves visually appear to correspond well with the observed data indicating the model has good fit.
+
+```{r}
+dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
+  as.data.frame() %>%
+  as_tibble() %>%
+  set_names(seq(1, ncol(.))) %>%
+  mutate(draw = 1:n()) %>%
+  pivot_longer(-c(draw), names_to = "idx") %>%
+  left_join(
+    dilution_standards_data %>% mutate(idx = as.character(1:n())),
+    by = "idx"
+  )
+```
+
+```{r}
+plt_n_draws <- 1000
+plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
+
+ppc_mean <- dilution_post_pred %>%
+  group_by(conc) %>%
+  summarize(value = mean(value)) %>%
+  ungroup()
+
+dilution_post_pred %>%
+  filter(draw %in% !!plt_draws) %>%
+  mutate(grp = glue::glue("{draw}-{rep}")) %>%
+  ggplot(aes(x = conc, y = value)) +
+  geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
+  geom_line(group = "a", data = ppc_mean, color = "black") +
+  geom_point(data = ppc_mean, color = "black") +
+  geom_line(
+    aes(y = y, group = rep),
+    data = dilution_standards_data,
+    color = STAN_RED
+  ) +
+  geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Posterior predictive distribution"
+  )
+```
+
+I also had the model make posterior predictions on concentrations across the observed range at smaller step-sizes.
+The mean and 89% HDI are shown in blue below along with the observed data in red.
+The inset plot is a zoomed-in view of the posterior predictive distribution at the lower concentrations.
+
+```{r}
+ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
+ynew_hdi <- apply(
+  rstan::extract(dilution_model, pars = "ynew")$ynew,
+  2,
+  bayestestR::hdi,
+  ci = 0.89
+)
+ynew_ppc <- tibble(
+  conc = xnew,
+  ynew_mean = ynew_mean,
+  ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
+  ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
+)
+```
+
+```{r}
+plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
+  ppc_df %>%
+    ggplot(aes(x = conc, y = ynew_mean)) +
+    geom_ribbon(
+      aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
+      fill = SNS_BLUE,
+      alpha = 0.5
+    ) +
+    geom_line(group = "a") +
+    geom_line(
+      aes(y = y, group = rep),
+      data = obs_df,
+      color = STAN_RED
+    ) +
+    geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
+    scale_x_continuous(expand = expansion(c(0, 0))) +
+    scale_y_continuous(expand = expansion(c(0.02, 0.02)))
+}
+
+ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Posterior predictive distribution"
+  )
+
+sub_max <- 0.04
+sub_ppc_plot <- plot_posterior_pred(
+  ynew_ppc %>% filter(conc <= sub_max),
+  dilution_standards_data %>% filter(conc <= sub_max),
+  pt_size = 0.6
+) +
+  theme(axis.title = element_blank())
+
+ppc_plot +
+  inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
+```
+
+---
+
+## Session info
+
+```{r}
+sessionInfo()
+```