diff --git a/_site.yml b/_site.yml index ea51d26..b948e2c 100644 --- a/_site.yml +++ b/_site.yml @@ -39,6 +39,20 @@ navbar: href: notes/11_normal-approx-freq-properties_bda3-04.html - text: "Section 12. Extended topics" href: notes/12_extended-topics.html + - text: "13. Notes on 'Ch 14. Introduction to regression models'" + href: notes/13_intro-to-regression-models_bda3-14.Rmd + - text: "14. Notes on 'Ch 15. Hierarchical linear models'" + href: notes/14_hierarchical-linear-models_bda3-15.html + - text: "17. Notes on 'Ch 19. Parametric nonlinear models'" + href: notes/17_parametric-nonlinear-models_bda3-19.html + - text: "18. Notes on 'Ch 20. Basis function models'" + href: notes/18_basis-function-models_bda3-20.html + - text: "19. Notes on 'Ch 21. Gaussian process models'" + href: notes/19_gaussian-processes_bda3-21.html + - text: "20. Notes on 'Ch 22. Finite mixture models'" + href: notes/20_finite-mixture-models_bda3-22.html + - text: "21. Notes on 'Ch 23. Dirichlet process models'" + href: notes/21_Dirichlet-process-models_bda3-23.html - text: "Exercises" menu: - text: "Chapter 1" @@ -47,6 +61,8 @@ navbar: href: book-exercises/bda-exercises-02.html - text: "Chapter 7" href: book-exercises/bda-exercises-07.html + - text: "Chapter 19" + href: book-exercises/bda-exercises-19.html - text: "Assignments" menu: - text: "Assignment 1" diff --git a/book-exercises/bda-exercises-19.Rmd b/book-exercises/bda-exercises-19.Rmd new file mode 100644 index 0000000..a54dfd3 --- /dev/null +++ b/book-exercises/bda-exercises-19.Rmd @@ -0,0 +1,329 @@ +--- +title: "Chapter 19 Exercises - Reproducing the the 'serial dilution assay'" +date: "2021-12-10" +output: distill::distill_article +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300) +# dpi = 400, fig.width = 7, fig.height = 4.5, fig.retina = TRUE + +rfiles <- list.files(here::here("src"), full.names = TRUE, pattern = "R$") +for (rfile in rfiles) { + source(rfile) +} + +set.seed(0) +``` + +> I turned this exercise into a blog post on my [webiste](https://joshuacook.netlify.app). + +In chapter 17 "Parametric nonlinear models" of *Bayesian Data Analysis*[^1] by Gelman *et al.*, the authors present an example of fitting a curve to a [serial dilution](https://en.wikipedia.org/wiki/Serial_dilution) standard curve and using it to estimate unknown concentrations. +Below, I build the model with Stan and fit it using MCMC. +Unfortunately, I was unable to find the original data in Gelman's original publication of the model[^2]. +The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data. + +> The source code for this post is in a [repository](https://github.com/jhrcook/bayesian-data-analysis-course) of my work for Aki Vehtari's Bayesian Data Analysis [course](https://avehtari.github.io/BDA_course_Aalto/index.html). + +[^1]: Gelman, Andrew, et al. Bayesian Data Analysis. CRC Press Etc., 2015. +[^2]: Gelman A, Chew GL, Shnaidman M. Bayesian analysis of serial dilution assays. Biometrics. 2004 Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. 
https://pubmed.ncbi.nlm.nih.gov/15180666/ + +## Setup + +```{r setup-show, message=FALSE, warning=FALSE} +library(rstan) +library(tidybayes) +library(patchwork) +library(tidyverse) + +options(mc.cores = parallel::detectCores()) +rstan_options(auto_write = TRUE) + +theme_set( + theme_classic() + + theme( + panel.grid.major = element_line(), + strip.background = element_blank(), + plot.title = element_text(hjust = 0.5) + ) +) + +SNS_BLUE <- "#1F77B4" +STAN_RED <- "#B2171D" +``` + +As mentioned above, I couldn't find the original data, so I copied it from the book's figure 19.3 on page 473. + +```{r} +dilution_standards_data <- tibble::tribble( + ~conc, ~dilution, ~y, + 0.64, 1, c(101.8, 121.4), + 0.32, 1 / 2, c(105.2, 114.1), + 0.16, 1 / 4, c(92.7, 93.3), + 0.08, 1 / 8, c(72.4, 61.1), + 0.04, 1 / 16, c(57.6, 50.0), + 0.02, 1 / 32, c(38.5, 35.1), + 0.01, 1 / 64, c(26.6, 25.0), + 0, 0, c(14.7, 14.2), +) %>% + mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>% + unnest(c(y, rep)) + +knitr::kable(dilution_standards_data) +``` + +The following plot shows the two standard dilution curves. +They are quite similar. + +```{r} +data_plot <- dilution_standards_data %>% + ggplot(aes(x = conc, y = y, color = rep)) + + geom_line(alpha = 0.5, linetype = 2) + + geom_point(alpha = 0.8) + + scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) + + scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) + + scale_color_brewer(type = "qual", palette = "Set1") + + theme( + legend.position = c(0.8, 0.2), + legend.background = element_blank() + ) + + labs( + x = "concentration", + y = "y", + title = "Serial dilution standard curve", + color = "replicate" + ) + +data_plot +``` + +## Modeling + +### Model specification + +The model uses a normal likelihood to describe the posterior distribution $p(y|x)$. +The mean of the likelihood is defined for a given concentration $x$ using the standard equation used in the field: + +$$ +\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + (x/\beta_3)^{-\beta_4}} \\ +$$ + + +The model is a scaled and shifted logistic curve. +This structure results in the following interpretations for $\beta$, all of which are restricted to positive values: + +1. $\beta_1$: color intensity when the concentration is 0 +2. $\beta_2$: increase to saturation +3. $\beta_3$: the inflection point of the curve +4. $\beta_4$: rate of saturation + +Below are the prior distributions for $\beta$. Note that they are are drastically different scales - this is critical to help the model fit the data. + +$$ +\beta_1 \sim N(10, 2.5) \\ +\beta_2 \sim N(100, 5) \\ +\beta_3 \sim N(0, 1) \\ +\beta_4 \sim N(0, 2.5) +$$ + +The measurement error of the model, representing the variance in the model's likelihood is defined as follows: + +$$ +\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} \rgroup^{2\alpha} \sigma^2_y +$$ + +Here, $\alpha$, restricted to lie between 0 and 1, allows the variance to be higher for larger measurement values. +$A$ is a constant (set to 30 by the authors) that allows $\sigma_y$ to be more easily interpreted as the variance from "typical" measurements. +Below are the priors for the new variables in the model. + +$$ +\alpha \sim \text{Beta}(1, 1) \qquad +\sigma \sim |N(0, 2.5)| +$$ + +### In Stan + +Below is the Stan code for the model. +It looks very similar to the mathematical description of the model, a nice feature of the Stan probabilistic programming language. 
+ +The centrality and variance of the likelihood are calculated separately as `g` and `tau` so they can be used in the `model` and `generated quantities` block without duplicating the code. +The `log_lik` is calculated so that PSIS-LOO cross validation can be estimated. +I also included the ability to provide new data to make predictions over as `xnew`. + +```{r} +dilution_model_file <- here::here("models", "serial-dilution.stan") +writeLines(readLines(dilution_model_file)) +``` + +### Sampling + +As mentioned above, specifically defining the prior distributions for each $\beta$ is necessary for MCMC to accurately sample from the posterior. +With those helping restrict the range of their values, the model fit very well. + +```{r} +xnew <- seq(0, max(dilution_standards_data$conc), 0.001) +model_data <- list( + N = nrow(dilution_standards_data), + A = 30, + x = dilution_standards_data$conc, + y = dilution_standards_data$y, + M = length(xnew), + xnew = xnew +) + +dilution_model <- stan( + dilution_model_file, + model_name = "serial-dilution", + data = model_data, + refresh = 1000 +) +``` + +### Posterior distributions + +The next step is to analyze the posterior draws of the model. +We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing ("fuzzy caterpillars") and checking diagnostic values such as $\widehat{R}$ and $n_\text{eff}$. +The trace plots are shown below followed by a table of the posteriors with the diagnostic values. +Everything looks good suggesting MCMC was successful. + +```{r} +model_pars <- c("beta", "alpha", "sigma") +rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0.02, 0.02))) + + theme(legend.position = "bottom") +``` + +```{r} +print(dilution_model, pars = model_pars) +``` + +The following density plots show the posterior distributions of the model parameters $\beta$, $\alpha$, and $\sigma$. + +```{r} +rstan::stan_dens( + dilution_model, + pars = model_pars, + separate_chains = FALSE, + alpha = 0.6 +) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0, 0.02))) +``` + +### Posterior predictive check + +Below is a plot of the posterior predictive distributions of the model on the original data. +1,000 individual simulations are plotted in blue and the mean in black. +The simulated curves visually appear to correspond well with the observed data indicating the model has good fit. 
+ +```{r} +dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>% + as.data.frame() %>% + as_tibble() %>% + set_names(seq(1, ncol(.))) %>% + mutate(draw = 1:n()) %>% + pivot_longer(-c(draw), names_to = "idx") %>% + left_join( + dilution_standards_data %>% mutate(idx = as.character(1:n())), + by = "idx" + ) +``` + +```{r} +plt_n_draws <- 1000 +plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws) + +ppc_mean <- dilution_post_pred %>% + group_by(conc) %>% + summarize(value = mean(value)) %>% + ungroup() + +dilution_post_pred %>% + filter(draw %in% !!plt_draws) %>% + mutate(grp = glue::glue("{draw}-{rep}")) %>% + ggplot(aes(x = conc, y = value)) + + geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) + + geom_line(group = "a", data = ppc_mean, color = "black") + + geom_point(data = ppc_mean, color = "black") + + geom_line( + aes(y = y, group = rep), + data = dilution_standards_data, + color = STAN_RED + ) + + geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0.02, 0.02))) + + labs( + x = "concentration", + y = "y", + title = "Posterior predictive distribution" + ) +``` + +I also had the model make posterior predictions on concentrations across the observed range at smaller step-sizes. +The mean and 89% HDI are shown in blue below along with the observed data in red. +The inset plot is a zoomed-in view of the posterior predictive distribution at the lower concentrations. + +```{r} +ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean) +ynew_hdi <- apply( + rstan::extract(dilution_model, pars = "ynew")$ynew, + 2, + bayestestR::hdi, + ci = 0.89 +) +ynew_ppc <- tibble( + conc = xnew, + ynew_mean = ynew_mean, + ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]), + ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]]) +) +``` + +```{r} +plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) { + ppc_df %>% + ggplot(aes(x = conc, y = ynew_mean)) + + geom_ribbon( + aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi), + fill = SNS_BLUE, + alpha = 0.5 + ) + + geom_line(group = "a") + + geom_line( + aes(y = y, group = rep), + data = obs_df, + color = STAN_RED + ) + + geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0.02, 0.02))) +} + +ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) + + labs( + x = "concentration", + y = "y", + title = "Posterior predictive distribution" + ) + +sub_max <- 0.04 +sub_ppc_plot <- plot_posterior_pred( + ynew_ppc %>% filter(conc <= sub_max), + dilution_standards_data %>% filter(conc <= sub_max), + pt_size = 0.6 +) + + theme(axis.title = element_blank()) + +ppc_plot + + inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5) +``` + +--- + +## Session info + +```{r} +sessionInfo() +``` diff --git a/book-exercises/bda-exercises-19.html b/book-exercises/bda-exercises-19.html new file mode 100644 index 0000000..af4c12b --- /dev/null +++ b/book-exercises/bda-exercises-19.html @@ -0,0 +1,3123 @@ + + + + + + + + + + + + + + + + + + + + Chapter 19 Exercises - Reproducing the the 'serial dilution assay' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Chapter 19 Exercises - Reproducing the ‘serial dilution assay’

+ + + +
+ + +
+
+

I turned this exercise into a blog post on my website.

+
+

In chapter 19 “Parametric nonlinear models” of Bayesian Data Analysis1 by Gelman et al., the authors present an example of fitting a curve to a serial dilution standard curve and using it to estimate unknown concentrations. Below, I build the model with Stan and fit it using MCMC. Unfortunately, I was unable to find the original data in Gelman’s original publication of the model2. The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data.

+
+

The source code for this post is in a repository +of my work for Aki Vehtari’s Bayesian Data Analysis course.

+
+

Setup

+
+
+
library(rstan)
+library(tidybayes)
+library(patchwork)
+library(tidyverse)
+
+options(mc.cores = parallel::detectCores())
+rstan_options(auto_write = TRUE)
+
+theme_set(
+  theme_classic() +
+    theme(
+      panel.grid.major = element_line(),
+      strip.background = element_blank(),
+      plot.title = element_text(hjust = 0.5)
+    )
+)
+
+SNS_BLUE <- "#1F77B4"
+STAN_RED <- "#B2171D"
+
+
+
+

As mentioned above, I couldn’t find the original data, so I copied it +from the book’s figure 19.3 on page 473.

+
+
+
dilution_standards_data <- tibble::tribble(
+  ~conc, ~dilution, ~y,
+  0.64, 1, c(101.8, 121.4),
+  0.32, 1 / 2, c(105.2, 114.1),
+  0.16, 1 / 4, c(92.7, 93.3),
+  0.08, 1 / 8, c(72.4, 61.1),
+  0.04, 1 / 16, c(57.6, 50.0),
+  0.02, 1 / 32, c(38.5, 35.1),
+  0.01, 1 / 64, c(26.6, 25.0),
+  0, 0, c(14.7, 14.2),
+) %>%
+  mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
+  unnest(c(y, rep))
+
+knitr::kable(dilution_standards_data)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
concdilutionyrep
0.641.000000101.8a
0.641.000000121.4b
0.320.500000105.2a
0.320.500000114.1b
0.160.25000092.7a
0.160.25000093.3b
0.080.12500072.4a
0.080.12500061.1b
0.040.06250057.6a
0.040.06250050.0b
0.020.03125038.5a
0.020.03125035.1b
0.010.01562526.6a
0.010.01562525.0b
0.000.00000014.7a
0.000.00000014.2b
+
+

The following plot shows the two standard dilution curves. They are +quite similar.

+
+
+
data_plot <- dilution_standards_data %>%
+  ggplot(aes(x = conc, y = y, color = rep)) +
+  geom_line(alpha = 0.5, linetype = 2) +
+  geom_point(alpha = 0.8) +
+  scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+  scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+  scale_color_brewer(type = "qual", palette = "Set1") +
+  theme(
+    legend.position = c(0.8, 0.2),
+    legend.background = element_blank()
+  ) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Serial dilution standard curve",
+    color = "replicate"
+  )
+
+data_plot
+
+
+

+
+

Modeling

+

Model specification

+

The model uses a normal likelihood to describe the distribution of the measurements, \(p(y|x)\). The mean of the likelihood is defined for a given concentration \(x\) using the standard equation used in the field:

+

\[ +\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + +(x/\beta_3)^{-\beta_4}} \\ +\]

+

The model is a scaled and shifted logistic curve. This structure +results in the following interpretations for \(\beta\), all of which are restricted to +positive values:

+
1. \(\beta_1\): color intensity when the concentration is 0
2. \(\beta_2\): increase to saturation
3. \(\beta_3\): the inflection point of the curve
4. \(\beta_4\): rate of saturation
+
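To make these interpretations concrete, here is a small R sketch (my addition, with parameter values chosen only for illustration) that evaluates the mean function: the curve starts near \(\beta_1\) at zero concentration, sits halfway to saturation at \(x = \beta_3\), and levels off near \(\beta_1 + \beta_2\).

```r
# Mean function of the model: a scaled and shifted logistic curve.
# The beta values below are illustrative, not estimates from the fit.
g <- function(x, beta) {
  beta[1] + beta[2] / (1 + (x / beta[3])^(-beta[4]))
}

beta_example <- c(10, 100, 0.05, 1)
g(0, beta_example)     # ~ beta[1]: color intensity at zero concentration
g(0.05, beta_example)  # halfway between beta[1] and beta[1] + beta[2] at x = beta[3]
g(100, beta_example)   # approaches beta[1] + beta[2] at saturation
```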

Below are the prior distributions for \(\beta\). Note that they are on drastically different scales; this is critical to help the model fit the data.

+

\[ +\beta_1 \sim N(10, 2.5) \\ +\beta_2 \sim N(100, 5) \\ +\beta_3 \sim N(0, 1) \\ +\beta_4 \sim N(0, 2.5) +\]

+

The measurement error of the model, representing the variance in the model’s likelihood, is defined as follows:

+

\[ +\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} +\rgroup^{2\alpha} \sigma^2_y +\]

+

Here, \(\alpha\), restricted to lie +between 0 and 1, allows the variance to be higher for larger measurement +values. \(A\) is a constant (set to 30 +by the authors) that allows \(\sigma_y\) to be more easily interpreted as +the variance from “typical” measurements. Below are the priors for the +new variables in the model.

+

\[ +\alpha \sim \text{Beta}(1, 1) \qquad +\sigma \sim |N(0, 2.5)| +\]

+
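As a quick numeric illustration of the error model (again my addition, with assumed values for the parameters), the sketch below evaluates \(\tau\) at a near-background and a near-saturation intensity; with \(\alpha > 0\) the variance grows with the expected measurement, while \(\alpha = 0\) recovers a constant variance \(\sigma_y^2\).

```r
# Measurement-error model: the variance scales with the expected intensity g.
# alpha and sigma_y here are illustrative values, not fitted estimates.
tau <- function(g, alpha, sigma_y, A = 30) {
  (g / A)^(2 * alpha) * sigma_y^2
}

tau(g = 15, alpha = 0.7, sigma_y = 1.3)   # near-background intensity: small variance
tau(g = 110, alpha = 0.7, sigma_y = 1.3)  # near-saturation intensity: larger variance
tau(g = 110, alpha = 0, sigma_y = 1.3)    # alpha = 0: constant variance sigma_y^2
```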

In Stan

+

Below is the Stan code for the model. It looks very similar to the +mathematical description of the model, a nice feature of the Stan +probabilistic programming language.

+

The centrality and variance of the likelihood are calculated +separately as g and tau so they can be used in +the model and generated quantities block +without duplicating the code. The log_lik is calculated so +that PSIS-LOO cross validation can be estimated. I also included the +ability to provide new data to make predictions over as +xnew.

+
+
+
dilution_model_file <- here::here("models", "serial-dilution.stan")
+writeLines(readLines(dilution_model_file))
+
+
+
#> data {
+#>   int<lower=0> N;           // number of data points
+#>   int<lower=0> A;           // constant used in model of measurement error
+#>   vector<lower=0>[N] x;     // concentration values
+#>   vector<lower=0>[N] y;     // observed color intensity
+#>   int<lower=0> M;           // number of new x values
+#>   vector<lower=0>[M] xnew;  // new x values
+#> }
+#>
+#> parameters {
+#>   vector<lower=0>[4] beta;
+#>   real<lower=0,upper=1> alpha;
+#>   real<lower=0> sigma;
+#> }
+#>
+#> transformed parameters {
+#>   vector<lower=0>[N] g;
+#>   vector<lower=0>[N] tau;
+#>
+#>   for (i in 1:N) {
+#>     g[i] = beta[1] + beta[2] / (1 + (x[i] / beta[3]) ^ (-beta[4]));
+#>     tau[i] = ((g[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+#>   }
+#> }
+#>
+#> model {
+#>   // Priors
+#>   alpha ~ beta(1, 1);
+#>   beta[1] ~ normal(10, 2.5);
+#>   beta[2] ~ normal(100, 5);
+#>   beta[3] ~ normal(0, 1);
+#>   beta[4] ~ normal(0, 2.5);
+#>   sigma ~ normal(0, 2.5);
+#>
+#>   // Likelihood
+#>   for (i in 1:N) {
+#>     y[i] ~ normal(g[i], tau[i]);
+#>   }
+#> }
+#>
+#> generated quantities {
+#>   vector[N] ypred;
+#>   vector[N] log_lik;
+#>
+#>   vector[M] g_hat;
+#>   vector[M] tau_hat;
+#>   vector[M] ynew;
+#>
+#>   for (i in 1:N) {
+#>     ypred[i] = normal_rng(g[i], tau[i]);
+#>     log_lik[i] = normal_lpdf(y[i] | g[i], tau[i]);
+#>   }
+#>
+#>   for (i in 1:M) {
+#>     g_hat[i] = beta[1] + beta[2] / (1 + (xnew[i] / beta[3]) ^ (-beta[4]));
+#>     tau_hat[i] = ((g_hat[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+#>     ynew[i] = normal_rng(g_hat[i], tau_hat[i]);
+#>   }
+#> }
+
+

Sampling

+

As mentioned above, specifically defining the prior distributions for +each \(\beta\) is necessary for MCMC to +accurately sample from the posterior. With those helping restrict the +range of their values, the model fit very well.

+
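Before sampling, a quick way to see that these priors keep the standard curve in a plausible range is to draw parameter values from them and plot the implied curves. This prior check is my addition, not part of the original analysis; it takes absolute values of the normal draws as a rough stand-in for the positivity constraints.

```r
# Prior sanity check (illustrative only): sample beta from the priors and
# plot the standard curves they imply over the observed concentration range.
draw_prior_curve <- function(x) {
  beta <- abs(c(rnorm(1, 10, 2.5), rnorm(1, 100, 5), rnorm(1, 0, 1), rnorm(1, 0, 2.5)))
  beta[1] + beta[2] / (1 + (x / beta[3])^(-beta[4]))
}

x_grid <- seq(0, 0.64, length.out = 200)
prior_curves <- purrr::map_dfr(1:50, function(i) {
  tibble(draw = i, conc = x_grid, y = draw_prior_curve(x_grid))
})

ggplot(prior_curves, aes(x = conc, y = y, group = draw)) +
  geom_line(alpha = 0.3, color = SNS_BLUE) +
  labs(x = "concentration", y = "prior mean curve", title = "Standard curves implied by the priors")
```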
+
+
xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
+model_data <- list(
+  N = nrow(dilution_standards_data),
+  A = 30,
+  x = dilution_standards_data$conc,
+  y = dilution_standards_data$y,
+  M = length(xnew),
+  xnew = xnew
+)
+
+dilution_model <- stan(
+  dilution_model_file,
+  model_name = "serial-dilution",
+  data = model_data,
+  refresh = 1000
+)
+
+
+
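Because the generated quantities block writes out `log_lik`, PSIS-LOO can be estimated from the fitted model with the loo package. The original post does not run this step; a minimal sketch would look something like the following.

```r
# Estimate PSIS-LOO from the pointwise log-likelihood (sketch; not run in the post).
library(loo)

log_lik <- extract_log_lik(dilution_model, parameter_name = "log_lik", merge_chains = FALSE)
r_eff <- relative_eff(exp(log_lik))
loo_estimate <- loo(log_lik, r_eff = r_eff)
print(loo_estimate)
```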
+

Posterior distributions

+

The next step is to analyze the posterior draws of the model. We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing (“fuzzy caterpillars”) and checking diagnostic values such as \(\widehat{R}\) and \(n_\text{eff}\). The trace plots are shown below, followed by a table of the posteriors with the diagnostic values. Everything looks good, suggesting MCMC was successful.

+
+
+
model_pars <- c("beta", "alpha", "sigma")
+rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+  theme(legend.position = "bottom")
+
+
+

+
+
+
+
print(dilution_model, pars = model_pars)
+
+
+
#> Inference for Stan model: serial-dilution.
+#> 4 chains, each with iter=2000; warmup=1000; thin=1;
+#> post-warmup draws per chain=1000, total post-warmup draws=4000.
+#>
+#>           mean se_mean   sd  2.5%   25%    50%    75%  97.5% n_eff
+#> beta[1]  14.33    0.02 0.55 13.12 14.08  14.36  14.61  15.33  1074
+#> beta[2] 102.70    0.10 4.30 94.48 99.67 102.61 105.60 111.32  1755
+#> beta[3]   0.06    0.00 0.01  0.05  0.06   0.06   0.07   0.08  1633
+#> beta[4]   1.13    0.00 0.08  0.97  1.07   1.12   1.18   1.29  1870
+#> alpha     0.72    0.01 0.17  0.33  0.61   0.74   0.85   0.98  1074
+#> sigma     1.33    0.01 0.24  0.98  1.16   1.30   1.45   1.92   893
+#>         Rhat
+#> beta[1]    1
+#> beta[2]    1
+#> beta[3]    1
+#> beta[4]    1
+#> alpha      1
+#> sigma      1
+#>
+#> Samples were drawn using NUTS(diag_e) at Sat Jan 22 11:53:33 2022.
+#> For each parameter, n_eff is a crude measure of effective sample size,
+#> and Rhat is the potential scale reduction factor on split chains (at
+#> convergence, Rhat=1).
+
+
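The table above can also be checked programmatically by pulling \(\widehat{R}\) and \(n_\text{eff}\) from the fit summary; a small sketch (my addition) is below.

```r
# Pull the convergence diagnostics for the main parameters from the fit summary.
fit_summary <- rstan::summary(dilution_model, pars = model_pars)$summary
fit_summary[, c("n_eff", "Rhat")]

# Flag any parameter whose Rhat is noticeably above 1.
any(fit_summary[, "Rhat"] > 1.01)
```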

The following density plots show the posterior distributions of the +model parameters \(\beta\), \(\alpha\), and \(\sigma\).

+
+
+
rstan::stan_dens(
+  dilution_model,
+  pars = model_pars,
+  separate_chains = FALSE,
+  alpha = 0.6
+) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0, 0.02)))
+
+
+

+
+

Posterior predictive check

+

Below is a plot of the posterior predictive distributions of the model on the original data. 1,000 individual simulations are plotted in blue and the mean in black. The simulated curves visually appear to correspond well with the observed data, indicating the model has a good fit.

+
+
+
dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
+  as.data.frame() %>%
+  as_tibble() %>%
+  set_names(seq(1, ncol(.))) %>%
+  mutate(draw = 1:n()) %>%
+  pivot_longer(-c(draw), names_to = "idx") %>%
+  left_join(
+    dilution_standards_data %>% mutate(idx = as.character(1:n())),
+    by = "idx"
+  )
+
+
+
+
+
+
plt_n_draws <- 1000
+plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
+
+ppc_mean <- dilution_post_pred %>%
+  group_by(conc) %>%
+  summarize(value = mean(value)) %>%
+  ungroup()
+
+dilution_post_pred %>%
+  filter(draw %in% !!plt_draws) %>%
+  mutate(grp = glue::glue("{draw}-{rep}")) %>%
+  ggplot(aes(x = conc, y = value)) +
+  geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
+  geom_line(group = "a", data = ppc_mean, color = "black") +
+  geom_point(data = ppc_mean, color = "black") +
+  geom_line(
+    aes(y = y, group = rep),
+    data = dilution_standards_data,
+    color = STAN_RED
+  ) +
+  geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Posterior predictive distribution"
+  )
+
+
+

+
+

I also had the model make posterior predictions on concentrations +across the observed range at smaller step-sizes. The mean and 89% HDI +are shown in blue below along with the observed data in red. The inset +plot is a zoomed-in view of the posterior predictive distribution at the +lower concentrations.

+
+
+
ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
+ynew_hdi <- apply(
+  rstan::extract(dilution_model, pars = "ynew")$ynew,
+  2,
+  bayestestR::hdi,
+  ci = 0.89
+)
+ynew_ppc <- tibble(
+  conc = xnew,
+  ynew_mean = ynew_mean,
+  ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
+  ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
+)
+
+
+
+
+
+
plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
+  ppc_df %>%
+    ggplot(aes(x = conc, y = ynew_mean)) +
+    geom_ribbon(
+      aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
+      fill = SNS_BLUE,
+      alpha = 0.5
+    ) +
+    geom_line(group = "a") +
+    geom_line(
+      aes(y = y, group = rep),
+      data = obs_df,
+      color = STAN_RED
+    ) +
+    geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
+    scale_x_continuous(expand = expansion(c(0, 0))) +
+    scale_y_continuous(expand = expansion(c(0.02, 0.02)))
+}
+
+ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Posterior predictive distribution"
+  )
+
+sub_max <- 0.04
+sub_ppc_plot <- plot_posterior_pred(
+  ynew_ppc %>% filter(conc <= sub_max),
+  dilution_standards_data %>% filter(conc <= sub_max),
+  pt_size = 0.6
+) +
+  theme(axis.title = element_blank())
+
+ppc_plot +
+  inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
+
+
+

+
+
+

Session info

+
+
+
sessionInfo()
+
+
+
#> R version 4.1.2 (2021-11-01)
+#> Platform: x86_64-apple-darwin17.0 (64-bit)
+#> Running under: macOS Big Sur 10.16
+#>
+#> Matrix products: default
+#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
+#>
+#> locale:
+#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
+#>
+#> attached base packages:
+#> [1] stats     graphics  grDevices datasets  utils     methods
+#> [7] base
+#>
+#> other attached packages:
+#>  [1] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.7
+#>  [4] purrr_0.3.4          readr_2.0.1          tidyr_1.1.3
+#>  [7] tibble_3.1.3         tidyverse_1.3.1      patchwork_1.1.1
+#> [10] tidybayes_3.0.1      rstan_2.21.2         ggplot2_3.3.5
+#> [13] StanHeaders_2.21.0-7
+#>
+#> loaded via a namespace (and not attached):
+#>  [1] fs_1.5.0             matrixStats_0.61.0   lubridate_1.7.10
+#>  [4] insight_0.14.4       RColorBrewer_1.1-2   httr_1.4.2
+#>  [7] rprojroot_2.0.2      tensorA_0.36.2       tools_4.1.2
+#> [10] backports_1.2.1      bslib_0.2.5.1        utf8_1.2.2
+#> [13] R6_2.5.0             DBI_1.1.1            colorspace_2.0-2
+#> [16] ggdist_3.0.0         withr_2.4.2          tidyselect_1.1.1
+#> [19] gridExtra_2.3        prettyunits_1.1.1    processx_3.5.2
+#> [22] downlit_0.2.1        curl_4.3.2           compiler_4.1.2
+#> [25] rvest_1.0.1          cli_3.0.1            arrayhelpers_1.1-0
+#> [28] xml2_1.3.2           bayestestR_0.11.0    labeling_0.4.2
+#> [31] posterior_1.1.0      sass_0.4.0           scales_1.1.1
+#> [34] checkmate_2.0.0      callr_3.7.0          digest_0.6.27
+#> [37] rmarkdown_2.10       pkgconfig_2.0.3      htmltools_0.5.1.1
+#> [40] highr_0.9            dbplyr_2.1.1         rlang_0.4.11
+#> [43] readxl_1.3.1         rstudioapi_0.13      jquerylib_0.1.4
+#> [46] farver_2.1.0         generics_0.1.0       svUnit_1.0.6
+#> [49] jsonlite_1.7.2       distill_1.2          distributional_0.2.2
+#> [52] inline_0.3.19        magrittr_2.0.1       loo_2.4.1
+#> [55] Rcpp_1.0.7           munsell_0.5.0        fansi_0.5.0
+#> [58] abind_1.4-5          lifecycle_1.0.0      stringi_1.7.3
+#> [61] yaml_2.2.1           pkgbuild_1.2.0       grid_4.1.2
+#> [64] parallel_4.1.2       crayon_1.4.1         lattice_0.20-45
+#> [67] haven_2.4.3          hms_1.1.0            knitr_1.33
+#> [70] ps_1.6.0             pillar_1.6.2         codetools_0.2-18
+#> [73] clisymbols_1.2.0     stats4_4.1.2         reprex_2.0.1
+#> [76] glue_1.4.2           evaluate_0.14        V8_3.4.2
+#> [79] renv_0.14.0          RcppParallel_5.1.4   modelr_0.1.8
+#> [82] vctrs_0.3.8          tzdb_0.1.2           cellranger_1.1.0
+#> [85] gtable_0.3.0         datawizard_0.2.1     assertthat_0.2.1
+#> [88] xfun_0.25            broom_0.7.9          coda_0.19-4
+#> [91] ellipsis_0.3.2       here_1.0.1
+
+
+
+
+
1. Gelman, Andrew, et al. Bayesian Data Analysis. CRC Press Etc., 2015.↩︎
2. Gelman A, Chew GL, Shnaidman M. Bayesian analysis of serial dilution assays. Biometrics. 2004 Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. https://pubmed.ncbi.nlm.nih.gov/15180666/↩︎
+
+ + +
+ +
+
+ + + + + + + +
+ + + + + + + diff --git a/docs/about.html b/docs/about.html index f55643a..7b34ee0 100644 --- a/docs/about.html +++ b/docs/about.html @@ -2396,6 +2396,13 @@

${suggestion.title}

Section 10. Decision analysis Section 11. Normal approximation & Frequency properties Section 12. Extended topics +13. Notes on 'Ch 14. Introduction to regression models' +14. Notes on 'Ch 15. Hierarchical linear models' +17. Notes on 'Ch 19. Parametric nonlinear models' +18. Notes on 'Ch 20. Basis function models' +19. Notes on 'Ch 21. Gaussian process models' +20. Notes on 'Ch 22. Finite mixture models' +21. Notes on 'Ch 23. Dirichlet process models'

The website

-

This site is built with ‘distill’ and deployed on GitHub pages. The main point of the website is to display my course notes in a easy to read and search format. It isn’t perfect nor optimal, but there is a good balance of functionality and simplicity.

+

This site is built with ‘distill’ and deployed on GitHub Pages. The main point of the website is to display my course notes in an easy-to-read and searchable format. It isn’t perfect or optimal, but there is a good balance of functionality and simplicity.

About me

-

My name is Joshua Cook and I am (at the time of taking this course and writing this About page) a graduate student at Harvard Medical School. My research is on cancer genetics and I have a specific love of Bayesian modelling. You can peruse more of my projects and other work on my GitHub profile at jhrcook or my website.

-
+

My name is Joshua Cook and I am (at the time of taking this course +and writing this About page) a graduate student at Harvard Medical +School. My research is on cancer genetics and I have a specific love of +Bayesian modelling. You can peruse more of my projects and other work on +my GitHub profile at jhrcook or +my website.

+
diff --git a/docs/book-exercises/bda-exercises-19.Rmd b/docs/book-exercises/bda-exercises-19.Rmd new file mode 100644 index 0000000..a54dfd3 --- /dev/null +++ b/docs/book-exercises/bda-exercises-19.Rmd @@ -0,0 +1,329 @@ +--- +title: "Chapter 19 Exercises - Reproducing the the 'serial dilution assay'" +date: "2021-12-10" +output: distill::distill_article +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300) +# dpi = 400, fig.width = 7, fig.height = 4.5, fig.retina = TRUE + +rfiles <- list.files(here::here("src"), full.names = TRUE, pattern = "R$") +for (rfile in rfiles) { + source(rfile) +} + +set.seed(0) +``` + +> I turned this exercise into a blog post on my [webiste](https://joshuacook.netlify.app). + +In chapter 17 "Parametric nonlinear models" of *Bayesian Data Analysis*[^1] by Gelman *et al.*, the authors present an example of fitting a curve to a [serial dilution](https://en.wikipedia.org/wiki/Serial_dilution) standard curve and using it to estimate unknown concentrations. +Below, I build the model with Stan and fit it using MCMC. +Unfortunately, I was unable to find the original data in Gelman's original publication of the model[^2]. +The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data. + +> The source code for this post is in a [repository](https://github.com/jhrcook/bayesian-data-analysis-course) of my work for Aki Vehtari's Bayesian Data Analysis [course](https://avehtari.github.io/BDA_course_Aalto/index.html). + +[^1]: Gelman, Andrew, et al. Bayesian Data Analysis. CRC Press Etc., 2015. +[^2]: Gelman A, Chew GL, Shnaidman M. Bayesian analysis of serial dilution assays. Biometrics. 2004 Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. https://pubmed.ncbi.nlm.nih.gov/15180666/ + +## Setup + +```{r setup-show, message=FALSE, warning=FALSE} +library(rstan) +library(tidybayes) +library(patchwork) +library(tidyverse) + +options(mc.cores = parallel::detectCores()) +rstan_options(auto_write = TRUE) + +theme_set( + theme_classic() + + theme( + panel.grid.major = element_line(), + strip.background = element_blank(), + plot.title = element_text(hjust = 0.5) + ) +) + +SNS_BLUE <- "#1F77B4" +STAN_RED <- "#B2171D" +``` + +As mentioned above, I couldn't find the original data, so I copied it from the book's figure 19.3 on page 473. + +```{r} +dilution_standards_data <- tibble::tribble( + ~conc, ~dilution, ~y, + 0.64, 1, c(101.8, 121.4), + 0.32, 1 / 2, c(105.2, 114.1), + 0.16, 1 / 4, c(92.7, 93.3), + 0.08, 1 / 8, c(72.4, 61.1), + 0.04, 1 / 16, c(57.6, 50.0), + 0.02, 1 / 32, c(38.5, 35.1), + 0.01, 1 / 64, c(26.6, 25.0), + 0, 0, c(14.7, 14.2), +) %>% + mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>% + unnest(c(y, rep)) + +knitr::kable(dilution_standards_data) +``` + +The following plot shows the two standard dilution curves. +They are quite similar. 
+ +```{r} +data_plot <- dilution_standards_data %>% + ggplot(aes(x = conc, y = y, color = rep)) + + geom_line(alpha = 0.5, linetype = 2) + + geom_point(alpha = 0.8) + + scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) + + scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) + + scale_color_brewer(type = "qual", palette = "Set1") + + theme( + legend.position = c(0.8, 0.2), + legend.background = element_blank() + ) + + labs( + x = "concentration", + y = "y", + title = "Serial dilution standard curve", + color = "replicate" + ) + +data_plot +``` + +## Modeling + +### Model specification + +The model uses a normal likelihood to describe the posterior distribution $p(y|x)$. +The mean of the likelihood is defined for a given concentration $x$ using the standard equation used in the field: + +$$ +\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + (x/\beta_3)^{-\beta_4}} \\ +$$ + + +The model is a scaled and shifted logistic curve. +This structure results in the following interpretations for $\beta$, all of which are restricted to positive values: + +1. $\beta_1$: color intensity when the concentration is 0 +2. $\beta_2$: increase to saturation +3. $\beta_3$: the inflection point of the curve +4. $\beta_4$: rate of saturation + +Below are the prior distributions for $\beta$. Note that they are are drastically different scales - this is critical to help the model fit the data. + +$$ +\beta_1 \sim N(10, 2.5) \\ +\beta_2 \sim N(100, 5) \\ +\beta_3 \sim N(0, 1) \\ +\beta_4 \sim N(0, 2.5) +$$ + +The measurement error of the model, representing the variance in the model's likelihood is defined as follows: + +$$ +\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} \rgroup^{2\alpha} \sigma^2_y +$$ + +Here, $\alpha$, restricted to lie between 0 and 1, allows the variance to be higher for larger measurement values. +$A$ is a constant (set to 30 by the authors) that allows $\sigma_y$ to be more easily interpreted as the variance from "typical" measurements. +Below are the priors for the new variables in the model. + +$$ +\alpha \sim \text{Beta}(1, 1) \qquad +\sigma \sim |N(0, 2.5)| +$$ + +### In Stan + +Below is the Stan code for the model. +It looks very similar to the mathematical description of the model, a nice feature of the Stan probabilistic programming language. + +The centrality and variance of the likelihood are calculated separately as `g` and `tau` so they can be used in the `model` and `generated quantities` block without duplicating the code. +The `log_lik` is calculated so that PSIS-LOO cross validation can be estimated. +I also included the ability to provide new data to make predictions over as `xnew`. + +```{r} +dilution_model_file <- here::here("models", "serial-dilution.stan") +writeLines(readLines(dilution_model_file)) +``` + +### Sampling + +As mentioned above, specifically defining the prior distributions for each $\beta$ is necessary for MCMC to accurately sample from the posterior. +With those helping restrict the range of their values, the model fit very well. 
+ +```{r} +xnew <- seq(0, max(dilution_standards_data$conc), 0.001) +model_data <- list( + N = nrow(dilution_standards_data), + A = 30, + x = dilution_standards_data$conc, + y = dilution_standards_data$y, + M = length(xnew), + xnew = xnew +) + +dilution_model <- stan( + dilution_model_file, + model_name = "serial-dilution", + data = model_data, + refresh = 1000 +) +``` + +### Posterior distributions + +The next step is to analyze the posterior draws of the model. +We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing ("fuzzy caterpillars") and checking diagnostic values such as $\widehat{R}$ and $n_\text{eff}$. +The trace plots are shown below followed by a table of the posteriors with the diagnostic values. +Everything looks good suggesting MCMC was successful. + +```{r} +model_pars <- c("beta", "alpha", "sigma") +rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0.02, 0.02))) + + theme(legend.position = "bottom") +``` + +```{r} +print(dilution_model, pars = model_pars) +``` + +The following density plots show the posterior distributions of the model parameters $\beta$, $\alpha$, and $\sigma$. + +```{r} +rstan::stan_dens( + dilution_model, + pars = model_pars, + separate_chains = FALSE, + alpha = 0.6 +) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0, 0.02))) +``` + +### Posterior predictive check + +Below is a plot of the posterior predictive distributions of the model on the original data. +1,000 individual simulations are plotted in blue and the mean in black. +The simulated curves visually appear to correspond well with the observed data indicating the model has good fit. + +```{r} +dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>% + as.data.frame() %>% + as_tibble() %>% + set_names(seq(1, ncol(.))) %>% + mutate(draw = 1:n()) %>% + pivot_longer(-c(draw), names_to = "idx") %>% + left_join( + dilution_standards_data %>% mutate(idx = as.character(1:n())), + by = "idx" + ) +``` + +```{r} +plt_n_draws <- 1000 +plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws) + +ppc_mean <- dilution_post_pred %>% + group_by(conc) %>% + summarize(value = mean(value)) %>% + ungroup() + +dilution_post_pred %>% + filter(draw %in% !!plt_draws) %>% + mutate(grp = glue::glue("{draw}-{rep}")) %>% + ggplot(aes(x = conc, y = value)) + + geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) + + geom_line(group = "a", data = ppc_mean, color = "black") + + geom_point(data = ppc_mean, color = "black") + + geom_line( + aes(y = y, group = rep), + data = dilution_standards_data, + color = STAN_RED + ) + + geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0.02, 0.02))) + + labs( + x = "concentration", + y = "y", + title = "Posterior predictive distribution" + ) +``` + +I also had the model make posterior predictions on concentrations across the observed range at smaller step-sizes. +The mean and 89% HDI are shown in blue below along with the observed data in red. +The inset plot is a zoomed-in view of the posterior predictive distribution at the lower concentrations. 
+ +```{r} +ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean) +ynew_hdi <- apply( + rstan::extract(dilution_model, pars = "ynew")$ynew, + 2, + bayestestR::hdi, + ci = 0.89 +) +ynew_ppc <- tibble( + conc = xnew, + ynew_mean = ynew_mean, + ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]), + ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]]) +) +``` + +```{r} +plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) { + ppc_df %>% + ggplot(aes(x = conc, y = ynew_mean)) + + geom_ribbon( + aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi), + fill = SNS_BLUE, + alpha = 0.5 + ) + + geom_line(group = "a") + + geom_line( + aes(y = y, group = rep), + data = obs_df, + color = STAN_RED + ) + + geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0.02, 0.02))) +} + +ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) + + labs( + x = "concentration", + y = "y", + title = "Posterior predictive distribution" + ) + +sub_max <- 0.04 +sub_ppc_plot <- plot_posterior_pred( + ynew_ppc %>% filter(conc <= sub_max), + dilution_standards_data %>% filter(conc <= sub_max), + pt_size = 0.6 +) + + theme(axis.title = element_blank()) + +ppc_plot + + inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5) +``` + +--- + +## Session info + +```{r} +sessionInfo() +``` diff --git a/docs/book-exercises/bda-exercises-19.html b/docs/book-exercises/bda-exercises-19.html new file mode 100644 index 0000000..af4c12b --- /dev/null +++ b/docs/book-exercises/bda-exercises-19.html @@ -0,0 +1,3123 @@ + + + + + + + + + + + + + + + + + + + + Chapter 19 Exercises - Reproducing the the 'serial dilution assay' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Chapter 19 Exercises - Reproducing the ‘serial dilution assay’

+ + + +
+ + +
+
+

I turned this exercise into a blog post on my website.

+
+

In chapter 19 “Parametric nonlinear models” of Bayesian Data Analysis1 by Gelman et al., the authors present an example of fitting a curve to a serial dilution standard curve and using it to estimate unknown concentrations. Below, I build the model with Stan and fit it using MCMC. Unfortunately, I was unable to find the original data in Gelman’s original publication of the model2. The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data.

+
+

The source code for this post is in a repository +of my work for Aki Vehtari’s Bayesian Data Analysis course.

+
+

Setup

+
+
+
library(rstan)
+library(tidybayes)
+library(patchwork)
+library(tidyverse)
+
+options(mc.cores = parallel::detectCores())
+rstan_options(auto_write = TRUE)
+
+theme_set(
+  theme_classic() +
+    theme(
+      panel.grid.major = element_line(),
+      strip.background = element_blank(),
+      plot.title = element_text(hjust = 0.5)
+    )
+)
+
+SNS_BLUE <- "#1F77B4"
+STAN_RED <- "#B2171D"
+
+
+
+

As mentioned above, I couldn’t find the original data, so I copied it +from the book’s figure 19.3 on page 473.

+
+
+
dilution_standards_data <- tibble::tribble(
+  ~conc, ~dilution, ~y,
+  0.64, 1, c(101.8, 121.4),
+  0.32, 1 / 2, c(105.2, 114.1),
+  0.16, 1 / 4, c(92.7, 93.3),
+  0.08, 1 / 8, c(72.4, 61.1),
+  0.04, 1 / 16, c(57.6, 50.0),
+  0.02, 1 / 32, c(38.5, 35.1),
+  0.01, 1 / 64, c(26.6, 25.0),
+  0, 0, c(14.7, 14.2),
+) %>%
+  mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
+  unnest(c(y, rep))
+
+knitr::kable(dilution_standards_data)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
concdilutionyrep
0.641.000000101.8a
0.641.000000121.4b
0.320.500000105.2a
0.320.500000114.1b
0.160.25000092.7a
0.160.25000093.3b
0.080.12500072.4a
0.080.12500061.1b
0.040.06250057.6a
0.040.06250050.0b
0.020.03125038.5a
0.020.03125035.1b
0.010.01562526.6a
0.010.01562525.0b
0.000.00000014.7a
0.000.00000014.2b
+
+

The following plot shows the two standard dilution curves. They are +quite similar.

+
+
+
data_plot <- dilution_standards_data %>%
+  ggplot(aes(x = conc, y = y, color = rep)) +
+  geom_line(alpha = 0.5, linetype = 2) +
+  geom_point(alpha = 0.8) +
+  scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+  scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+  scale_color_brewer(type = "qual", palette = "Set1") +
+  theme(
+    legend.position = c(0.8, 0.2),
+    legend.background = element_blank()
+  ) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Serial dilution standard curve",
+    color = "replicate"
+  )
+
+data_plot
+
+
+

+
+

Modeling

+

Model specification

+

The model uses a normal likelihood to describe the distribution of the measurements, \(p(y|x)\). The mean of the likelihood is defined for a given concentration \(x\) using the standard equation used in the field:

+

\[ +\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + +(x/\beta_3)^{-\beta_4}} \\ +\]

+

The model is a scaled and shifted logistic curve. This structure +results in the following interpretations for \(\beta\), all of which are restricted to +positive values:

+
1. \(\beta_1\): color intensity when the concentration is 0
2. \(\beta_2\): increase to saturation
3. \(\beta_3\): the inflection point of the curve
4. \(\beta_4\): rate of saturation
+

Below are the prior distributions for \(\beta\). Note that they are on drastically different scales; this is critical to help the model fit the data.

+

\[ +\beta_1 \sim N(10, 2.5) \\ +\beta_2 \sim N(100, 5) \\ +\beta_3 \sim N(0, 1) \\ +\beta_4 \sim N(0, 2.5) +\]

+

The measurement error of the model, representing the variance in the model’s likelihood, is defined as follows:

+

\[ +\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} +\rgroup^{2\alpha} \sigma^2_y +\]

+

Here, \(\alpha\), restricted to lie +between 0 and 1, allows the variance to be higher for larger measurement +values. \(A\) is a constant (set to 30 +by the authors) that allows \(\sigma_y\) to be more easily interpreted as +the variance from “typical” measurements. Below are the priors for the +new variables in the model.

+

\[ +\alpha \sim \text{Beta}(1, 1) \qquad +\sigma \sim |N(0, 2.5)| +\]

+

In Stan

+

Below is the Stan code for the model. It looks very similar to the +mathematical description of the model, a nice feature of the Stan +probabilistic programming language.

+

The centrality and variance of the likelihood are calculated +separately as g and tau so they can be used in +the model and generated quantities block +without duplicating the code. The log_lik is calculated so +that PSIS-LOO cross validation can be estimated. I also included the +ability to provide new data to make predictions over as +xnew.

+
+
+
dilution_model_file <- here::here("models", "serial-dilution.stan")
+writeLines(readLines(dilution_model_file))
+
+
+
#> data {
+#>   int<lower=0> N;           // number of data points
+#>   int<lower=0> A;           // constant used in model of measurement error
+#>   vector<lower=0>[N] x;     // concentration values
+#>   vector<lower=0>[N] y;     // observed color intensity
+#>   int<lower=0> M;           // number of new x values
+#>   vector<lower=0>[M] xnew;  // new x values
+#> }
+#>
+#> parameters {
+#>   vector<lower=0>[4] beta;
+#>   real<lower=0,upper=1> alpha;
+#>   real<lower=0> sigma;
+#> }
+#>
+#> transformed parameters {
+#>   vector<lower=0>[N] g;
+#>   vector<lower=0>[N] tau;
+#>
+#>   for (i in 1:N) {
+#>     g[i] = beta[1] + beta[2] / (1 + (x[i] / beta[3]) ^ (-beta[4]));
+#>     tau[i] = ((g[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+#>   }
+#> }
+#>
+#> model {
+#>   // Priors
+#>   alpha ~ beta(1, 1);
+#>   beta[1] ~ normal(10, 2.5);
+#>   beta[2] ~ normal(100, 5);
+#>   beta[3] ~ normal(0, 1);
+#>   beta[4] ~ normal(0, 2.5);
+#>   sigma ~ normal(0, 2.5);
+#>
+#>   // Likelihood
+#>   for (i in 1:N) {
+#>     y[i] ~ normal(g[i], tau[i]);
+#>   }
+#> }
+#>
+#> generated quantities {
+#>   vector[N] ypred;
+#>   vector[N] log_lik;
+#>
+#>   vector[M] g_hat;
+#>   vector[M] tau_hat;
+#>   vector[M] ynew;
+#>
+#>   for (i in 1:N) {
+#>     ypred[i] = normal_rng(g[i], tau[i]);
+#>     log_lik[i] = normal_lpdf(y[i] | g[i], tau[i]);
+#>   }
+#>
+#>   for (i in 1:M) {
+#>     g_hat[i] = beta[1] + beta[2] / (1 + (xnew[i] / beta[3]) ^ (-beta[4]));
+#>     tau_hat[i] = ((g_hat[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+#>     ynew[i] = normal_rng(g_hat[i], tau_hat[i]);
+#>   }
+#> }
+
+

Sampling

+

As mentioned above, specifically defining the prior distributions for +each \(\beta\) is necessary for MCMC to +accurately sample from the posterior. With those helping restrict the +range of their values, the model fit very well.

+
+
+
xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
+model_data <- list(
+  N = nrow(dilution_standards_data),
+  A = 30,
+  x = dilution_standards_data$conc,
+  y = dilution_standards_data$y,
+  M = length(xnew),
+  xnew = xnew
+)
+
+dilution_model <- stan(
+  dilution_model_file,
+  model_name = "serial-dilution",
+  data = model_data,
+  refresh = 1000
+)
+
+
+
+

Posterior distributions

+

The next step is to analyze the posterior draws of the model. We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing (“fuzzy caterpillars”) and checking diagnostic values such as \(\widehat{R}\) and \(n_\text{eff}\). The trace plots are shown below, followed by a table of the posteriors with the diagnostic values. Everything looks good, suggesting MCMC was successful.

+
+
+
model_pars <- c("beta", "alpha", "sigma")
+rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+  theme(legend.position = "bottom")
+
+
+

+
+
+
+
print(dilution_model, pars = model_pars)
+
+
+
#> Inference for Stan model: serial-dilution.
+#> 4 chains, each with iter=2000; warmup=1000; thin=1;
+#> post-warmup draws per chain=1000, total post-warmup draws=4000.
+#>
+#>           mean se_mean   sd  2.5%   25%    50%    75%  97.5% n_eff
+#> beta[1]  14.33    0.02 0.55 13.12 14.08  14.36  14.61  15.33  1074
+#> beta[2] 102.70    0.10 4.30 94.48 99.67 102.61 105.60 111.32  1755
+#> beta[3]   0.06    0.00 0.01  0.05  0.06   0.06   0.07   0.08  1633
+#> beta[4]   1.13    0.00 0.08  0.97  1.07   1.12   1.18   1.29  1870
+#> alpha     0.72    0.01 0.17  0.33  0.61   0.74   0.85   0.98  1074
+#> sigma     1.33    0.01 0.24  0.98  1.16   1.30   1.45   1.92   893
+#>         Rhat
+#> beta[1]    1
+#> beta[2]    1
+#> beta[3]    1
+#> beta[4]    1
+#> alpha      1
+#> sigma      1
+#>
+#> Samples were drawn using NUTS(diag_e) at Sat Jan 22 11:53:33 2022.
+#> For each parameter, n_eff is a crude measure of effective sample size,
+#> and Rhat is the potential scale reduction factor on split chains (at
+#> convergence, Rhat=1).
+
+

The following density plots show the posterior distributions of the +model parameters \(\beta\), \(\alpha\), and \(\sigma\).

+
+
+
rstan::stan_dens(
+  dilution_model,
+  pars = model_pars,
+  separate_chains = FALSE,
+  alpha = 0.6
+) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0, 0.02)))
+
+
+

+
+

Posterior predictive check

+

Below is a plot of the posterior predictive distributions of the model on the original data. 1,000 individual simulations are plotted in blue and the mean in black. The simulated curves visually appear to correspond well with the observed data, indicating the model has a good fit.

+
+
+
dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
+  as.data.frame() %>%
+  as_tibble() %>%
+  set_names(seq(1, ncol(.))) %>%
+  mutate(draw = 1:n()) %>%
+  pivot_longer(-c(draw), names_to = "idx") %>%
+  left_join(
+    dilution_standards_data %>% mutate(idx = as.character(1:n())),
+    by = "idx"
+  )
+
+
+
+
+
+
plt_n_draws <- 1000
+plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
+
+ppc_mean <- dilution_post_pred %>%
+  group_by(conc) %>%
+  summarize(value = mean(value)) %>%
+  ungroup()
+
+dilution_post_pred %>%
+  filter(draw %in% !!plt_draws) %>%
+  mutate(grp = glue::glue("{draw}-{rep}")) %>%
+  ggplot(aes(x = conc, y = value)) +
+  geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
+  geom_line(group = "a", data = ppc_mean, color = "black") +
+  geom_point(data = ppc_mean, color = "black") +
+  geom_line(
+    aes(y = y, group = rep),
+    data = dilution_standards_data,
+    color = STAN_RED
+  ) +
+  geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
+  scale_x_continuous(expand = expansion(c(0, 0))) +
+  scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Posterior predictive distribution"
+  )
+
+
+

+
+

I also had the model make posterior predictions on concentrations +across the observed range at smaller step-sizes. The mean and 89% HDI +are shown in blue below along with the observed data in red. The inset +plot is a zoomed-in view of the posterior predictive distribution at the +lower concentrations.

+
+
+
ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
+ynew_hdi <- apply(
+  rstan::extract(dilution_model, pars = "ynew")$ynew,
+  2,
+  bayestestR::hdi,
+  ci = 0.89
+)
+ynew_ppc <- tibble(
+  conc = xnew,
+  ynew_mean = ynew_mean,
+  ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
+  ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
+)
+
+
+
+
+
+
plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
+  ppc_df %>%
+    ggplot(aes(x = conc, y = ynew_mean)) +
+    geom_ribbon(
+      aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
+      fill = SNS_BLUE,
+      alpha = 0.5
+    ) +
+    geom_line(group = "a") +
+    geom_line(
+      aes(y = y, group = rep),
+      data = obs_df,
+      color = STAN_RED
+    ) +
+    geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
+    scale_x_continuous(expand = expansion(c(0, 0))) +
+    scale_y_continuous(expand = expansion(c(0.02, 0.02)))
+}
+
+ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
+  labs(
+    x = "concentration",
+    y = "y",
+    title = "Posterior predictive distribution"
+  )
+
+sub_max <- 0.04
+sub_ppc_plot <- plot_posterior_pred(
+  ynew_ppc %>% filter(conc <= sub_max),
+  dilution_standards_data %>% filter(conc <= sub_max),
+  pt_size = 0.6
+) +
+  theme(axis.title = element_blank())
+
+ppc_plot +
+  inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
+
+
+

+
+
+

Session info

+
+
+
sessionInfo()
+
+
+
#> R version 4.1.2 (2021-11-01)
+#> Platform: x86_64-apple-darwin17.0 (64-bit)
+#> Running under: macOS Big Sur 10.16
+#>
+#> Matrix products: default
+#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
+#>
+#> locale:
+#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
+#>
+#> attached base packages:
+#> [1] stats     graphics  grDevices datasets  utils     methods
+#> [7] base
+#>
+#> other attached packages:
+#>  [1] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.7
+#>  [4] purrr_0.3.4          readr_2.0.1          tidyr_1.1.3
+#>  [7] tibble_3.1.3         tidyverse_1.3.1      patchwork_1.1.1
+#> [10] tidybayes_3.0.1      rstan_2.21.2         ggplot2_3.3.5
+#> [13] StanHeaders_2.21.0-7
+#>
+#> loaded via a namespace (and not attached):
+#>  [1] fs_1.5.0             matrixStats_0.61.0   lubridate_1.7.10
+#>  [4] insight_0.14.4       RColorBrewer_1.1-2   httr_1.4.2
+#>  [7] rprojroot_2.0.2      tensorA_0.36.2       tools_4.1.2
+#> [10] backports_1.2.1      bslib_0.2.5.1        utf8_1.2.2
+#> [13] R6_2.5.0             DBI_1.1.1            colorspace_2.0-2
+#> [16] ggdist_3.0.0         withr_2.4.2          tidyselect_1.1.1
+#> [19] gridExtra_2.3        prettyunits_1.1.1    processx_3.5.2
+#> [22] downlit_0.2.1        curl_4.3.2           compiler_4.1.2
+#> [25] rvest_1.0.1          cli_3.0.1            arrayhelpers_1.1-0
+#> [28] xml2_1.3.2           bayestestR_0.11.0    labeling_0.4.2
+#> [31] posterior_1.1.0      sass_0.4.0           scales_1.1.1
+#> [34] checkmate_2.0.0      callr_3.7.0          digest_0.6.27
+#> [37] rmarkdown_2.10       pkgconfig_2.0.3      htmltools_0.5.1.1
+#> [40] highr_0.9            dbplyr_2.1.1         rlang_0.4.11
+#> [43] readxl_1.3.1         rstudioapi_0.13      jquerylib_0.1.4
+#> [46] farver_2.1.0         generics_0.1.0       svUnit_1.0.6
+#> [49] jsonlite_1.7.2       distill_1.2          distributional_0.2.2
+#> [52] inline_0.3.19        magrittr_2.0.1       loo_2.4.1
+#> [55] Rcpp_1.0.7           munsell_0.5.0        fansi_0.5.0
+#> [58] abind_1.4-5          lifecycle_1.0.0      stringi_1.7.3
+#> [61] yaml_2.2.1           pkgbuild_1.2.0       grid_4.1.2
+#> [64] parallel_4.1.2       crayon_1.4.1         lattice_0.20-45
+#> [67] haven_2.4.3          hms_1.1.0            knitr_1.33
+#> [70] ps_1.6.0             pillar_1.6.2         codetools_0.2-18
+#> [73] clisymbols_1.2.0     stats4_4.1.2         reprex_2.0.1
+#> [76] glue_1.4.2           evaluate_0.14        V8_3.4.2
+#> [79] renv_0.14.0          RcppParallel_5.1.4   modelr_0.1.8
+#> [82] vctrs_0.3.8          tzdb_0.1.2           cellranger_1.1.0
+#> [85] gtable_0.3.0         datawizard_0.2.1     assertthat_0.2.1
+#> [88] xfun_0.25            broom_0.7.9          coda_0.19-4
+#> [91] ellipsis_0.3.2       here_1.0.1
+
+
+
+
+
    +
  1. Gelman, Andrew, et al. Bayesian Data +Analysis. CRC Press Etc., 2015.↩︎

  2. +
  3. Gelman A, Chew GL, Shnaidman M. +Bayesian analysis of serial dilution assays. Biometrics. 2004 +Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. +https://pubmed.ncbi.nlm.nih.gov/15180666/↩︎

  4. +
+
+ + +
+ +
+
+ + + + + + + +
+ + + + + + + diff --git a/docs/index.html b/docs/index.html index 6b08346..1e6247e 100644 --- a/docs/index.html +++ b/docs/index.html @@ -2397,6 +2397,13 @@

${suggestion.title}

Section 10. Decision analysis Section 11. Normal approximation & Frequency properties Section 12. Extended topics +13. Notes on 'Ch 14. Introduction to regression models' +14. Notes on 'Ch 15. Hierarchical linear models' +17. Notes on 'Ch 19. Parametric nonlinear models' +18. Notes on 'Ch 20. Basis function models' +19. Notes on 'Ch 21. Gaussian process models' +20. Notes on 'Ch 22. Finite mixture models' +21. Notes on 'Ch 23. Dirichlet process models'
@@ -2458,126 +2467,282 @@

Bayesian Data Analysis course

Resources

How to study

-

The following are recommendations from the course creators on how to take the course.

+

The following are recommendations from the course creators on how to +take the course.

The recommended way to go through the material is:

    -
  1. Read the reading instructions for a chapter in the chapter notes.
  2. -
  3. Read the chapter in BDA3 and check that you find the terms listed in the reading instructions.
  4. -
  5. Watch the corresponding video lecture to get explanations for most important parts.
  6. +
  7. Read the reading instructions for a chapter in the chapter +notes.
  8. +
  9. Read the chapter in BDA3 and check that you find the terms listed in +the reading instructions.
  10. +
  11. Watch the corresponding video +lecture to get explanations for most important parts.
  12. Read corresponding additional information in the chapter notes.
  13. -
  14. Run the corresponding demos in R demos or Python demos.
  15. -
  16. Read the exercise instructions and make the corresponding assignments. Demo codes in R demos and Python demos have a lot of useful examples for handling data and plotting figures. If you have problems, visit TA sessions or ask in course slack channel.
  17. -
  18. If you want to learn more, make also self study exercises listed below.
  19. +
  20. Run the corresponding demos in R demos or Python demos.
  21. +
  22. Read the exercise instructions and make the corresponding +assignments. Demo codes in R demos and Python demos have a lot of useful +examples for handling data and plotting figures. If you have problems, +visit TA sessions or ask in course slack channel.
  23. +
  24. If you want to learn more, make also self study exercises listed +below.
-

Sections

+

Course sections

++++++ - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + - - - - + + + + + + +
SectionNotesBook exercisesAssignmentsSectionNotesBook exercisesAssignments
1. Course Introductionnotesexercisesassignment 11. Course +Introductionnotesexercisesassignment 1
2. Basics of Bayesian Inferencenotesexercisesassignment 22. Basics of Bayesian +Inferencenotesexercisesassignment 2
3. Multidimensional Posteriornotesexercisesassignment 33. Multidimensional +Posteriornotes(none)assignment 3
4. Monte Carlonotesexercisesassignment 44. Monte Carlonotes(none)assignment 4
5. Markov chain Monte Carlonotesexercisesassignment 55. Markov chain Monte +Carlonotes(none)assignment 5
6. HMC, NUTS, and Stannotesexercisesassignment 66. HMC, NUTS, and +Stannotes(none)assignment 6
7. Hierarchical models and exchangeabilitynotesexercisesassignment 77. Hierarchical models and +exchangeabilitynotesexercisesassignment 7
8. Model checking & Cross-validationnotes(none)(none)8. Model checking & +Cross-validationnotes(none)(none)
9. Model comparison and selectionnotes(none)assignment 89. Model comparison and +selectionnotes(none)assignment 8
10. Decision analysisnotes(none)assignment 910. Decision +analysisnotes(none)assignment 9
11. Normal approximation & Frequency propertiesnotes(none)(none)11. Normal approximation & +Frequency propertiesnotes(none)(none)
12. Extended topicsnotes(none)(none)12. Extended topicsnotes(none)(none)
+

Additional notes

+ ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ChapterTitleNotesAdditional work
14Introduction to regression modelsnotes
15Hierarchical linear modelsnotes
16Generalized linear models
18Models for missing data
19Parametric nonlinear modelsnotesserial dilution +model
20Basis function modelsnotes
21Gaussian process modelsnotes
22Finite mixture modelsnotes
23Dirichlet process modelsnotes

Stan models

+
  • serial dilution assay for +chapter 19 exercises
  • +
    diff --git a/docs/models/serial-dilution.rds b/docs/models/serial-dilution.rds new file mode 100644 index 0000000..43fa9ea Binary files /dev/null and b/docs/models/serial-dilution.rds differ diff --git a/docs/models/serial-dilution.stan b/docs/models/serial-dilution.stan new file mode 100644 index 0000000..1dab7e7 --- /dev/null +++ b/docs/models/serial-dilution.stan @@ -0,0 +1,59 @@ +data { + int N; // number of data points + int A; // constant used in model of measurement error + vector[N] x; // concentration values + vector[N] y; // observed color intensity + int M; // number of new x values + vector[M] xnew; // new x values +} + +parameters { + vector[4] beta; + real alpha; + real sigma; +} + +transformed parameters { + vector[N] g; + vector[N] tau; + + for (i in 1:N) { + g[i] = beta[1] + beta[2] / (1 + (x[i] / beta[3]) ^ (-beta[4])); + tau[i] = ((g[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0); + } +} + +model { + // Priors + alpha ~ beta(1, 1); + beta[1] ~ normal(10, 2.5); + beta[2] ~ normal(100, 5); + beta[3] ~ normal(0, 1); + beta[4] ~ normal(0, 2.5); + sigma ~ normal(0, 2.5); + + // Likelihood + for (i in 1:N) { + y[i] ~ normal(g[i], tau[i]); + } +} + +generated quantities { + vector[N] ypred; + vector[N] log_lik; + + vector[M] g_hat; + vector[M] tau_hat; + vector[M] ynew; + + for (i in 1:N) { + ypred[i] = normal_rng(g[i], tau[i]); + log_lik[i] = normal_lpdf(y[i] | g[i], tau[i]); + } + + for (i in 1:M) { + g_hat[i] = beta[1] + beta[2] / (1 + (xnew[i] / beta[3]) ^ (-beta[4])); + tau_hat[i] = ((g_hat[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0); + ynew[i] = normal_rng(g_hat[i], tau_hat[i]); + } +} diff --git a/docs/notes/13_intro-to-regression-models_bda3-14.Rmd b/docs/notes/13_intro-to-regression-models_bda3-14.Rmd new file mode 100644 index 0000000..48c861f --- /dev/null +++ b/docs/notes/13_intro-to-regression-models_bda3-14.Rmd @@ -0,0 +1,78 @@ +--- +title: "13. Notes on 'Ch 14. Introduction to regression models'" +date: "2021-12-06" +output: distill::distill_article +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>") +``` + +> These are just notes on a single chapter of *BDA3* that were not part ofthe course. + +## Chapter 14. Introduction to regression models + +### 14.1 Conditional modeling + +- question: how does one quantity $y$ vary as a function of another quantity or vector of quantities $x$? + - conditional distribution of $y$ given $x$ parameterized as $p(y|\theta,x)$ +- key statistical modeling issues: + 1. defining $y$ and $x$ so that $y$ is reasonably linear as a function of the columns of $X$ + - may need to transform $x$ + 2. set priors on the model parameters + +### 14.2 Bayesian analysis of classical regression + +- simplest case: *ordinary linear regression* + - observation errors are independent and have equal variance + +$$ +y | \beta, \sigma, X \sim \text{N}(X \beta, \sigma^2 I) +$$ + +#### Posterior predictive distribution for new data + +- posterior predictive distribution has two sources of uncertainty: + 1. the inherent variability in the model represented by $\sigma$ in $y$ + 2. posterior uncertainty in $\beta$ and $\sigma$ +- draw a random sample $\tilde{y}$ from the posterior predictive distribution: + - draw $(\beta, \sigma)$ from their posteriors + - draw $\tilde{y} \sim \text{N}(\tilde{X} \beta, \sigma^2 I)$ + +## 14.4 Goals of regression analysis + +- at least three goals: + 1. understand the behavior of $y$ given $x$ + 2. predict $y$ given $x$ + 3. 
causal inference; predict how $y$ would change if $x$ were changed + +## 14.5 Assembling the matrix of explanitory variables + +### Identifiability and collinearity + +- "the parameters in a classical regression cannot be uniquely estimated if there are more parameters than data points or, more generally, if the columns of the matrix $X$ of explanatory variables are not linearly independent" (pg 365) + +### Nonlinear relations + +- may need to transform variables +- can include more than one transformation in the model as separate covariates +- GLMs and non-linear models are discussed in later chapters + +### Indicator variables + +- include a categorical variable in a regression using a indicator variable + - separate effect for each category + - or model as related with a hierarchical model + +### Interactions + +- "If the response to a unit change in $x_i$ depends in what value another predictor $x_j$ has been fixed at, then it is necessary to include *interaction* terms in the model" (pg 367) + - $(x_i - \bar{x_i})(x_j - \bar{x_j})$ + +## 14.6 Regularization and dimension reduction + +- see lecture notes on regularization for more updated recommendations +- "Bayesian regularization": + - location and scale of the prior + - analytic form of the prior (e.g. normal vs. Laplacian vs. Cauchy) + - how the posterior inference is summarized diff --git a/docs/notes/13_intro-to-regression-models_bda3-14.html b/docs/notes/13_intro-to-regression-models_bda3-14.html new file mode 100644 index 0000000..a7c86c7 --- /dev/null +++ b/docs/notes/13_intro-to-regression-models_bda3-14.html @@ -0,0 +1,2607 @@ + + + + + + + + + + + + + + + + + + + + 13. Notes on 'Ch 14. Introduction to regression models' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +

    13. Notes on ‘Ch 14. Introduction to regression models’

    + + + +
    + + +
    +
    +

    These are just notes on a single chapter of BDA3 that were not part ofthe course.

    +
    +

    Chapter 14. Introduction to regression models

    +

    14.1 Conditional modeling

    +
      +
    • question: how does one quantity \(y\) vary as a function of another quantity or vector of quantities \(x\)? +
        +
      • conditional distribution of \(y\) given \(x\) parameterized as \(p(y|\theta,x)\)
      • +
    • +
    • key statistical modeling issues: +
        +
      1. defining \(y\) and \(x\) so that \(y\) is reasonably linear as a function of the columns of \(X\)
      2. +
      +
        +
      • may need to transform \(x\)
      • +
      +
        +
      1. set priors on the model parameters
      2. +
    • +
    +

    14.2 Bayesian analysis of classical regression

    +
      +
    • simplest case: ordinary linear regression +
        +
      • observation errors are independent and have equal variance
      • +
    • +
    +

    \[ +y | \beta, \sigma, X \sim \text{N}(X \beta, \sigma^2 I) +\]

    +

    Posterior predictive distribution for new data

    +
      +
    • posterior predictive distribution has two sources of uncertainty: +
        +
      1. the inherent variability in the model represented by \(\sigma\) in \(y\)
      2. +
      3. posterior uncertainty in \(\beta\) and \(\sigma\)
      4. +
    • +
    • draw a random sample \(\tilde{y}\) from the posterior predictive distribution: +
        +
      • draw \((\beta, \sigma)\) from their posteriors
      • +
      • draw \(\tilde{y} \sim \text{N}(\tilde{X} \beta, \sigma^2 I)\)
      • +
    • +
    +

    14.4 Goals of regression analysis

    +
      +
    • at least three goals: +
        +
      1. understand the behavior of \(y\) given \(x\)
      2. +
      3. predict \(y\) given \(x\)
      4. +
      5. causal inference; predict how \(y\) would change if \(x\) were changed
      6. +
    • +
    +

    14.5 Assembling the matrix of explanitory variables

    +

    Identifiability and collinearity

    +
      +
    • “the parameters in a classical regression cannot be uniquely estimated if there are more parameters than data points or, more generally, if the columns of the matrix \(X\) of explanatory variables are not linearly independent” (pg 365)
    • +
    +

    Nonlinear relations

    +
      +
    • may need to transform variables
    • +
    • can include more than one transformation in the model as separate covariates
    • +
    • GLMs and non-linear models are discussed in later chapters
    • +
    +

    Indicator variables

    +
      +
    • include a categorical variable in a regression using a indicator variable +
        +
      • separate effect for each category
      • +
      • or model as related with a hierarchical model
      • +
    • +
    +

    Interactions

    +
      +
    • “If the response to a unit change in \(x_i\) depends in what value another predictor \(x_j\) has been fixed at, then it is necessary to include interaction terms in the model” (pg 367) +
        +
      • \((x_i - \bar{x_i})(x_j - \bar{x_j})\)
      • +
    • +
    +

    14.6 Regularization and dimension reduction

    +
      +
    • see lecture notes on regularization for more updated recommendations
    • +
    • “Bayesian regularization”: +
        +
      • location and scale of the prior
      • +
      • analytic form of the prior (e.g. normal vs. Laplacian vs. Cauchy)
      • +
      • how the posterior inference is summarized
      • +
      +
    • +
    + + +
    + +
    +
    + + + + + + + +
    + + + + + + + diff --git a/docs/notes/14_hierarchical-linear-models_bda3-15.Rmd b/docs/notes/14_hierarchical-linear-models_bda3-15.Rmd new file mode 100644 index 0000000..8f15df1 --- /dev/null +++ b/docs/notes/14_hierarchical-linear-models_bda3-15.Rmd @@ -0,0 +1,18 @@ +--- +title: "14. Notes on 'Ch 15. Hierarchical linear models'" +date: "2021-12-09" +output: distill::distill_article +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>") +``` + +> These are just notes on a single chapter of *BDA3* that were not part ofthe course. +> Also, these notes are rather sparse because I have a bit of experience with these models already. + +## Chapter 15. Hierarchical linear models + +- "Hierarchical regression models are useful as soon as there are predictors at different levels of variation" (pg 381) + - or for data obtained by stratified/clustered sampling +- general recommendation to start simple and build up the model's complexity diff --git a/docs/notes/14_hierarchical-linear-models_bda3-15.html b/docs/notes/14_hierarchical-linear-models_bda3-15.html new file mode 100644 index 0000000..81ef91e --- /dev/null +++ b/docs/notes/14_hierarchical-linear-models_bda3-15.html @@ -0,0 +1,2520 @@ + + + + + + + + + + + + + + + + + + + + 14. Notes on 'Ch 15. Hierarchical linear models' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +

    14. Notes on ‘Ch 15. Hierarchical linear models’

    + + + +
    + + +
    +
    +

    These are just notes on a single chapter of BDA3 that were not part ofthe course. Also, these notes are rather sparse because I have a bit of experience with these models already.

    +
    +

    Chapter 15. Hierarchical linear models

    +
      +
    • “Hierarchical regression models are useful as soon as there are predictors at different levels of variation” (pg 381) +
        +
      • or for data obtained by stratified/clustered sampling
      • +
    • +
    • general recommendation to start simple and build up the model’s complexity
    • +
    +
    + + +
    + +
    +
    + + + + + +
    + + + + + + + diff --git a/docs/notes/17_parametric-nonlinear-models_bda3-19.Rmd b/docs/notes/17_parametric-nonlinear-models_bda3-19.Rmd new file mode 100644 index 0000000..a286d06 --- /dev/null +++ b/docs/notes/17_parametric-nonlinear-models_bda3-19.Rmd @@ -0,0 +1,94 @@ +--- +title: "17. Notes on 'Ch 19. Parametric nonlinear models'" +date: "2021-12-09" +output: distill::distill_article +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>") +``` + +> These are just notes on a single chapter of *BDA3* that were not part of the course. + +## Chapter 19. Parametric nonlinear models + +- examples: + - a ratio; $\text{E}(y = \frac{a_1 + b_1 x_1}{a_2 + b_2 x_2}$ + - sum of nonlinear functions: $\text{E}(y) = A_1 e^{-\alpha_1x} + A_2 e^{-\alpha_2x}$ +- parameters of a nonlinear model are often harder to interpret + - often requires custom visualization techniques +- **"Generally each new modeling problem must be tackled afresh."** (pg 471) + - these models are less systematic than linear modeling + +### 19.1 Example: serial dilution assay + +- estimate 10 unknown concentrations of an allergen based off of serial dilutions of a known standard + +#### The model + +- **Notation**: + - parameters of interest: concentrations of unknown samples $\theta_1, \dots, \theta_10$ + - known concentration of the standard $\theta_0$ + - dilution of measure $i$ as $x_i$ and color intensity (measurement) as $y_i$ +- **Curve of expected measurements given the concentration** + - use the following equation that is standard in the field + - parameters: + - $\beta_1$: color intensity at the limit of 0 concentration + - $\beta_2$: the increase to saturation + - $\beta_3$: concentration at which the gradient of the curve turns + - $\beta_4$: rate at which saturation occurs + +$$ +\text{E}(y | x, \beta) = + g(x, \beta) = + \beta_1 + \frac{\beta_2}{1 + (x / \beta_3)^{-\beta_4}} +$$ + +- **Measurement error** + - modeled as normally distributed with unequal variances + - parameters: + - $\alpha$: models the pattern that variances are higher for larger measurements + - restricted $[0, 1]$ + - $A$ is a arbitrary constant to scale the data so $\sigma$ can be interpreted as the deviation from "typical" values + - $\sigma$: deviation of a measure from the "typical" + +$$ +y_i \sim \text{N}(g(x_i, \beta), (\frac{g(x_i, \beta)}{A})^{2\alpha} \sigma_y^2) +$$ + +- **Dilution errors** + - two possible sources: + 1. *initial dilution*: the accuracy of the creation of the initial standard concentration + 2. 
*serial dilutions*: error in creation of the subsequent dilutions (low enough to ignore for this analysis) + - use a normal model on the log scale of the initial dilution error + - parameters: + - $\theta_0$: known concentration of the standard solution + - $d_0^\text{init}$: known initial dilution of the standard that is called for + - without error, the concentration of the initial solution would be $d_0^\text{init} \theta_0$ + - $x_0^\text{init}$: the *actual* (unknown) concentration of the initial dilution + +$$ +\log(x_0^\text{init}) \sim \text{N}(\log(d_0^\text{init} \cdot \theta_0), (\sigma^\text{init})^2) +$$ +- **Dilution errors** (cont) + - there is no initial dilution for the unknown samples being tested + - therefore, the unknown initial concentration for sample $j$ is $x^\text{init} = \theta_j$ for $j = 1, \dots, 10$ + - for the dilutions of the unknown samples, set $x_i = d_i \cdot x_{j(i)}^\text{init}$ + - $j(i)$ is the sample $j$ corresponding to measurement $i$ + - $d_i$ is the dilution of measurement $i$ relative to the initial concentration + +#### Prior distributions + +- priors used as described by book (are likely different than what would be recommended now): + - $\log(\beta) \sim U(-\infty, \infty)$ + - $\sigma_y \sim U(0, \infty)$ + - $\alpha \sim U(0,1)$ + - $p(\log \theta_j) \propto 1$ for each unknown $j = 1, \dots, 10$ +- cannot estimate $\sigma^\text{init}$ because we only have a single standard + - use a fixed value of 0.02 based on a previous analysis of different plates + +### 19.2 Example: population toxicokinetics + +- this is a more complex model +- uses a physiological model with parameters that cannot be solely determined using the data + - requires informative priors based on previous studies diff --git a/docs/notes/17_parametric-nonlinear-models_bda3-19.html b/docs/notes/17_parametric-nonlinear-models_bda3-19.html new file mode 100644 index 0000000..625c447 --- /dev/null +++ b/docs/notes/17_parametric-nonlinear-models_bda3-19.html @@ -0,0 +1,2629 @@ + + + + + + + + + + + + + + + + + + + + 17. Notes on 'Ch 19. Parametric nonlinear models' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +

    17. Notes on ‘Ch 19. Parametric nonlinear models’

    + + + +
    + + +
    +
    +

    These are just notes on a single chapter of BDA3 that were not part of the course.

    +
    +

    Chapter 19. Parametric nonlinear models

    +
      +
    • examples: +
        +
      • a ratio; \(\text{E}(y = \frac{a_1 + b_1 x_1}{a_2 + b_2 x_2}\)
      • +
      • sum of nonlinear functions: \(\text{E}(y) = A_1 e^{-\alpha_1x} + A_2 e^{-\alpha_2x}\)
      • +
    • +
    • parameters of a nonlinear model are often harder to interpret +
        +
      • often requires custom visualization techniques
      • +
    • +
    • “Generally each new modeling problem must be tackled afresh.” (pg 471) +
        +
      • these models are less systematic than linear modeling
      • +
    • +
    +

    19.1 Example: serial dilution assay

    +
      +
    • estimate 10 unknown concentrations of an allergen based off of serial dilutions of a known standard
    • +
    +

    The model

    +
      +
    • Notation: +
        +
      • parameters of interest: concentrations of unknown samples \(\theta_1, \dots, \theta_10\)
      • +
      • known concentration of the standard \(\theta_0\)
      • +
      • dilution of measure \(i\) as \(x_i\) and color intensity (measurement) as \(y_i\)
      • +
    • +
    • Curve of expected measurements given the concentration +
        +
      • use the following equation that is standard in the field
      • +
      • parameters: +
          +
        • \(\beta_1\): color intensity at the limit of 0 concentration
        • +
        • \(\beta_2\): the increase to saturation
        • +
        • \(\beta_3\): concentration at which the gradient of the curve turns
        • +
        • \(\beta_4\): rate at which saturation occurs
        • +
      • +
    • +
    +

    \[ +\text{E}(y | x, \beta) = + g(x, \beta) = + \beta_1 + \frac{\beta_2}{1 + (x / \beta_3)^{-\beta_4}} +\]

    +
      +
    • Measurement error +
        +
      • modeled as normally distributed with unequal variances
      • +
      • parameters: +
          +
        • \(\alpha\): models the pattern that variances are higher for larger measurements +
            +
          • restricted \([0, 1]\)
          • +
        • +
        • \(A\) is a arbitrary constant to scale the data so \(\sigma\) can be interpreted as the deviation from “typical” values
        • +
        • \(\sigma\): deviation of a measure from the “typical”
        • +
      • +
    • +
    +

    \[ +y_i \sim \text{N}(g(x_i, \beta), (\frac{g(x_i, \beta)}{A})^{2\alpha} \sigma_y^2) +\]

    +
      +
    • Dilution errors +
        +
      • two possible sources: +
          +
        1. initial dilution: the accuracy of the creation of the initial standard concentration
        2. +
        3. serial dilutions: error in creation of the subsequent dilutions (low enough to ignore for this analysis)
        4. +
      • +
      • use a normal model on the log scale of the initial dilution error
      • +
      • parameters: +
          +
        • \(\theta_0\): known concentration of the standard solution
        • +
        • \(d_0^\text{init}\): known initial dilution of the standard that is called for +
            +
          • without error, the concentration of the initial solution would be \(d_0^\text{init} \theta_0\)
          • +
        • +
        • \(x_0^\text{init}\): the actual (unknown) concentration of the initial dilution
        • +
      • +
    • +
    +

    \[ +\log(x_0^\text{init}) \sim \text{N}(\log(d_0^\text{init} \cdot \theta_0), (\sigma^\text{init})^2) +\] - Dilution errors (cont) - there is no initial dilution for the unknown samples being tested - therefore, the unknown initial concentration for sample \(j\) is \(x^\text{init} = \theta_j\) for \(j = 1, \dots, 10\) - for the dilutions of the unknown samples, set \(x_i = d_i \cdot x_{j(i)}^\text{init}\) - \(j(i)\) is the sample \(j\) corresponding to measurement \(i\) - \(d_i\) is the dilution of measurement \(i\) relative to the initial concentration

    +

    Prior distributions

    +
      +
    • priors used as described by book (are likely different than what would be recommended now): +
        +
      • \(\log(\beta) \sim U(-\infty, \infty)\)
      • +
      • \(\sigma_y \sim U(0, \infty)\)
      • +
      • \(\alpha \sim U(0,1)\)
      • +
      • \(p(\log \theta_j) \propto 1\) for each unknown \(j = 1, \dots, 10\)
      • +
    • +
    • cannot estimate \(\sigma^\text{init}\) because we only have a single standard +
        +
      • use a fixed value of 0.02 based on a previous analysis of different plates
      • +
    • +
    +

    19.2 Example: population toxicokinetics

    +
      +
    • this is a more complex model
    • +
    • uses a physiological model with parameters that cannot be solely determined using the data +
        +
      • requires informative priors based on previous studies
      • +
      +
    • +
    + + +
    + +
    +
    + + + + + + + +
    + + + + + + + diff --git a/docs/notes/18_basis-function-models_bda3-20.Rmd b/docs/notes/18_basis-function-models_bda3-20.Rmd new file mode 100644 index 0000000..181d612 --- /dev/null +++ b/docs/notes/18_basis-function-models_bda3-20.Rmd @@ -0,0 +1,64 @@ +--- +title: "18. Notes on 'Ch 20. Basis function models'" +date: "2022-01-12" +output: distill::distill_article +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>") +``` + +> These are just notes on a single chapter of *BDA3* that were not part of the course. + +## Chapter 20. Basis function models + +- chapter 19 focused on nonlinear models with $\text{E}(y|X,\beta) = \mu(X_i,\phi)$ where $\mu$ is a parametric nonlinear function of unknowns $\phi$ +- in this and following chapters, consider models where $\mu$ is also unknown + +### 20.1 Splines and weighted sums of basis functions + +- replace $X_i \beta$ with $\mu(X_i)$ where $\mu(\cdot)$ is some class of nonlinear functions + - different options for modeling $\mu$ including with basis function expansions or Gaussian processes (next chapter) +- basis function approach: $\mu(x) = \sum_{h=1}^{H} \beta_h b_h(x)$ + - $b_h$: set of basis functions + - $\beta_h$: vector of basis coefficients +- common choices for basis functions are: + - **Gaussian radial basis functions**: multiple centers of the basis functions with a width parameter controlling a set of Gaussian functions + - **B-spline**: a piecewise continuous function based on a set of knots + - knots locations control the flexibility of the basis + - knots cn be placed uniformly or non-uniformly (e.g. based on the density of the data) + - can use a "free knot approach" with a prior on the number and location of knots, but is computationally demanding + - instead can use priors on the coefficients $\beta$ to shrink values to near 0 + +## 20.2 Basis selection and shrinkage coefficients + +- common to not know which basis functions are really needed +- can use a variable selection approach to allow the model to estimate the "importance" of each basis function + - can then either select the best model from the posterior or average over all possible models by weighting each basis by its importance +- possible for some bias based on initial choice of the basis functions + - implied prior information on the smoothness and shape of the model + - can include multiple types of basis functions in the initial collection + +### Shrinkage priors + +- allowing basis function coefficients to be zero with positive probability represents a challenge for sampling from the posterior + - with many basis functions, it is computationally infeasible to visit all possible states +- may be better to use shrinkage priors instead + - there are various options discussed in the book and there are likely others recommended now + +## 20.3 Non-normal models and regression surfaces + +### Other error distributions + +- may want to model data that is not a continuous output variable $y$ or does not have Gaussian residuals + - can modify the residual densities with different prior distributions to accommodate outliers + - can use the basis function and its coefficients $\eta_i = w_i \beta$ as the linear component in a GLM + +### Multivariate regression surfaces + +- careful with *curse of dimensionality* +- one option is to assume additive of the covariates so can model as the sum of univariate regression functions + - this does not always make sense and a different approach using a tensor product is described at the end 
of the section +- can use informative priors to help restrict the search space + - is a form of including prior information, so still proper Bayesian results and measures of uncertainty + - e.g. if we know *a priori* that the mean response variable is non-decreasing, restrict the coefficients $\beta$ to be non-negative diff --git a/docs/notes/18_basis-function-models_bda3-20.html b/docs/notes/18_basis-function-models_bda3-20.html new file mode 100644 index 0000000..fc1f146 --- /dev/null +++ b/docs/notes/18_basis-function-models_bda3-20.html @@ -0,0 +1,2595 @@ + + + + + + + + + + + + + + + + + + + + 18. Notes on 'Ch 20. Basis function models' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +

    18. Notes on ‘Ch 20. Basis function models’

    + + + +
    + + +
    +
    +

    These are just notes on a single chapter of BDA3 that were not part of the course.

    +
    +

    Chapter 20. Basis function models

    +
      +
    • chapter 19 focused on nonlinear models with \(\text{E}(y|X,\beta) = \mu(X_i,\phi)\) where \(\mu\) is a parametric nonlinear function of unknowns \(\phi\)
    • +
    • in this and following chapters, consider models where \(\mu\) is also unknown
    • +
    +

    20.1 Splines and weighted sums of basis functions

    +
      +
    • replace \(X_i \beta\) with \(\mu(X_i)\) where \(\mu(\cdot)\) is some class of nonlinear functions +
        +
      • different options for modeling \(\mu\) including with basis function expansions or Gaussian processes (next chapter)
      • +
    • +
    • basis function approach: \(\mu(x) = \sum_{h=1}^{H} \beta_h b_h(x)\) +
        +
      • \(b_h\): set of basis functions
      • +
      • \(\beta_h\): vector of basis coefficients
      • +
    • +
    • common choices for basis functions are: +
        +
      • Gaussian radial basis functions: multiple centers of the basis functions with a width parameter controlling a set of Gaussian functions
      • +
      • B-spline: a piecewise continuous function based on a set of knots +
          +
        • knots locations control the flexibility of the basis
        • +
        • knots cn be placed uniformly or non-uniformly (e.g. based on the density of the data)
        • +
        • can use a “free knot approach” with a prior on the number and location of knots, but is computationally demanding
        • +
        • instead can use priors on the coefficients \(\beta\) to shrink values to near 0
        • +
      • +
    • +
    +

    20.2 Basis selection and shrinkage coefficients

    +
      +
    • common to not know which basis functions are really needed
    • +
    • can use a variable selection approach to allow the model to estimate the “importance” of each basis function +
        +
      • can then either select the best model from the posterior or average over all possible models by weighting each basis by its importance
      • +
    • +
    • possible for some bias based on initial choice of the basis functions +
        +
      • implied prior information on the smoothness and shape of the model
      • +
      • can include multiple types of basis functions in the initial collection
      • +
    • +
    +

    Shrinkage priors

    +
      +
    • allowing basis function coefficients to be zero with positive probability represents a challenge for sampling from the posterior +
        +
      • with many basis functions, it is computationally infeasible to visit all possible states
      • +
    • +
    • may be better to use shrinkage priors instead +
        +
      • there are various options discussed in the book and there are likely others recommended now
      • +
    • +
    +

    20.3 Non-normal models and regression surfaces

    +

    Other error distributions

    +
      +
    • may want to model data that is not a continuous output variable \(y\) or does not have Gaussian residuals +
        +
      • can modify the residual densities with different prior distributions to accommodate outliers
      • +
      • can use the basis function and its coefficients \(\eta_i = w_i \beta\) as the linear component in a GLM
      • +
    • +
    +

    Multivariate regression surfaces

    +
      +
    • careful with curse of dimensionality
    • +
    • one option is to assume additive of the covariates so can model as the sum of univariate regression functions +
        +
      • this does not always make sense and a different approach using a tensor product is described at the end of the section
      • +
    • +
    • can use informative priors to help restrict the search space +
        +
      • is a form of including prior information, so still proper Bayesian results and measures of uncertainty
      • +
      • e.g. if we know a priori that the mean response variable is non-decreasing, restrict the coefficients \(\beta\) to be non-negative
      • +
      +
    • +
    + + +
    + +
    +
    + + + + + + + +
    + + + + + + + diff --git a/docs/notes/19_gaussian-processes_bda3-21.Rmd b/docs/notes/19_gaussian-processes_bda3-21.Rmd new file mode 100644 index 0000000..6dac644 --- /dev/null +++ b/docs/notes/19_gaussian-processes_bda3-21.Rmd @@ -0,0 +1,147 @@ +--- +title: "19. Notes on 'Ch 21. Gaussian process models'" +date: "2022-01-13" +output: distill::distill_article +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>") + +library(glue) +library(ggtext) +library(tidyverse) + +theme_set( + theme_bw() + + theme(axis.ticks = element_blank(), strip.background = element_blank()) +) +``` + + +> These are just notes on a single chapter of *BDA3* that were not part of the course. + +## Chapter 21. Gaussian process models + +- *Gaussian process* (GP): "flexible class of models for which any finite-dimensional marginal distribution is Gaussian" (pg. 501) + - "can be viewed as a potentially infinite-dimensional generalization of Gaussian distribution" (pg. 501) + +### 21.1 Gaussian process regression + +- realizations from a GP correspond to random functions + - good prior for an unknown regression function $\mu(x)$ +- $\mu \sim \text{GP}(m,k)$ + - $m$: mean function + - $k$: covariance function +- $\mu$ is a random function ("stochastic process") where the values at any $n$ pooints $x_1, \dots, x_n$ are drawn from the $n-dimensional$ normal distribution + - with mean $m$ and covariance $K$: + +$$ +\mu(x_1), \dots, \mu(x_n) \sim \text{N}((m(x_1), \dots, m(x_n)), K(x_1, \dots, x_n)) +$$ + +- the GP $\mu \sim \text{GP}(m,k)$ is nonparametric with infinitely many parameters + - the mean function $m$ represents an inital guess at the regression function + - the covariance function $k$ represents the covariance between the process at any two points + - controls the smoothness of realizations from the GP and degree of shrinkage towards the mean +- below is an example of realizations from a GP with mean function 0 and the *squared exponential* (a.k.a. 
exponentiated quadratic, Gaussian) covariance function with different parameters + +$$ +k(x, x^\prime) = \tau^2 \exp(-\frac{|x-x^\prime|^2}{2l^2}) +$$ + +```{r} +squared_exponential_cov <- function(x, tau, l) { + n <- length(x) + k <- matrix(0, nrow = n, ncol = n) + denom <- 2 * (l^2) + for (i in 1:n) { + for (j in 1:n) { + a <- x[i] + b <- x[j] + k[i, j] <- tau^2 * exp(-(abs(a - b)^2) / (denom)) + } + } + return(k) +} + +my_gaussian_process <- function(x, tau, l, n = 3) { + m <- rep(0, length(x)) + k <- squared_exponential_cov(x = x, tau = tau, l = l) + gp_samples <- mvtnorm::rmvnorm(n = n, mean = m, sigma = k) + return(gp_samples) +} + +tidy_gp <- function(x, tau, l, n = 3) { + my_gaussian_process(x = x, tau = tau, l = l, n = n) %>% + as.data.frame() %>% + as_tibble() %>% + set_names(x) %>% + mutate(sample_idx = as.character(1:n())) %>% + pivot_longer(-sample_idx, names_to = "x", values_to = "y") %>% + mutate(x = as.numeric(x)) +} + +set.seed(0) +x <- seq(0, 10, by = 0.1) +gp_samples <- tibble(tau = c(0.25, 0.5, 0.25, 0.5), l = c(0.5, 0.5, 2, 2)) %>% + mutate(samples = purrr::map2(tau, l, ~ tidy_gp(x = x, tau = .x, l = .y, n = 3))) %>% + unnest(samples) + +gp_samples %>% + mutate(grp = glue("\u03C4 = {tau}, \u2113 = {l}")) %>% + ggplot(aes(x = x, y = y)) + + facet_wrap(vars(grp), nrow = 2) + + geom_line(aes(color = sample_idx)) + + scale_x_continuous(expand = expansion(c(0, 0))) + + scale_y_continuous(expand = expansion(c(0.02, 0.02))) + + scale_color_brewer(type = "qual", palette = "Set1") + + theme(legend.position = "none", axis.text.y = element_markdown()) + + labs(x = "x", y = "\u03BC(x)") +``` + +#### Covariance functions + +- "Different covariance functions can be used to add structural prior assumptions like smoothness, nonstationarity, periodicity, and multi-scale or hierarchical structures." (pg. 
502) + - sums and products of GPs are also GPs so can combine them in the same model +- can also use "anisotropic" GPs covariance functions for multiple predictors + +#### Inference + +- computing the mean and covariance in the $n$-variate normal conditional posterior for $\tilde{\mu}$ involves a matrix inversion that requires $O(n^3)$ computation + - this needs to be repeated for each MCMC step + - limits the size of the data set and number of covariates in a model + +#### Covariance function approximations + +- there are approximations to the GP that can speed up computation + - generally work by reducing the matrix inversion burden + +### 21.3 Latent Gaussian process models + +- with non-Gaussian likelihoods, the GP prior becomes a latent function $f$ which determines the likelihood $p(y|f,\phi)$ through a link function + +### 21.4 Functional data analysis + +- *functional data analysis*: considers responses and predictors not a scalar/vector-valued random variables but as random functions with infinitely-many points + - GPs fit this need well with little modification + +### 21.5 Density estimation and regression + +- can get more flexibility by modeling the conditional observation model as a nonparametric GP + - so far have used a GP as a prior for a function controlling location or shape of a parametric observation model + - one solution is the *logistic Gaussian process* (LGP) or a Dirichlet process (covered in a later chapter) + +#### Density estimation + +- LGP generates a random surface from a GP and then transforms the surface to the space of probability densities + - with 1D, the surface is just a curve + - use the continuous logistic transformation to constrain to non-negative and integrate to 1 +- there is illustrative example in the book on page 513 + +#### Density regression + +- generalize the LPG to density regression by putting a prior on the collection of conditional densities + +#### Latent-variable regression + +- an alternative to LPG diff --git a/docs/notes/19_gaussian-processes_bda3-21.html b/docs/notes/19_gaussian-processes_bda3-21.html new file mode 100644 index 0000000..599ebf8 --- /dev/null +++ b/docs/notes/19_gaussian-processes_bda3-21.html @@ -0,0 +1,2709 @@ + + + + + + + + + + + + + + + + + + + + 19. Notes on 'Ch 21. Gaussian process models' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +

    19. Notes on ‘Ch 21. Gaussian process models’

    + + + +
    + + +
    +
    +

    These are just notes on a single chapter of BDA3 that were +not part of the course.

    +
    +

    Chapter 21. Gaussian +process models

    +
      +
    • Gaussian process (GP): “flexible class of models for which +any finite-dimensional marginal distribution is Gaussian” (pg. 501) +
        +
      • “can be viewed as a potentially infinite-dimensional generalization +of Gaussian distribution” (pg. 501)
      • +
    • +
    +

    21.1 Gaussian process +regression

    +
      +
    • realizations from a GP correspond to random functions +
        +
      • good prior for an unknown regression function \(\mu(x)\)
      • +
    • +
    • \(\mu \sim \text{GP}(m,k)\) +
        +
      • \(m\): mean function
      • +
      • \(k\): covariance function
      • +
    • +
    • \(\mu\) is a random function +(“stochastic process”) where the values at any \(n\) pooints \(x_1, \dots, x_n\) are drawn from the \(n-dimensional\) normal distribution +
        +
      • with mean \(m\) and covariance +\(K\):
      • +
    • +
    +

    \[ +\mu(x_1), \dots, \mu(x_n) \sim \text{N}((m(x_1), \dots, m(x_n)), K(x_1, +\dots, x_n)) +\]

    +
      +
    • the GP \(\mu \sim \text{GP}(m,k)\) +is nonparametric with infinitely many parameters +
        +
      • the mean function \(m\) represents +an inital guess at the regression function
      • +
      • the covariance function \(k\) +represents the covariance between the process at any two points +
          +
        • controls the smoothness of realizations from the GP and degree of +shrinkage towards the mean
        • +
      • +
    • +
    • below is an example of realizations from a GP with mean function 0 +and the squared exponential (a.k.a. exponentiated quadratic, +Gaussian) covariance function with different parameters
    • +
    +

    \[ +k(x, x^\prime) = \tau^2 \exp(-\frac{|x-x^\prime|^2}{2l^2}) +\]

    +
    +
    +
    squared_exponential_cov <- function(x, tau, l) {
    +  n <- length(x)
    +  k <- matrix(0, nrow = n, ncol = n)
    +  denom <- 2 * (l^2)
    +  for (i in 1:n) {
    +    for (j in 1:n) {
    +      a <- x[i]
    +      b <- x[j]
    +      k[i, j] <- tau^2 * exp(-(abs(a - b)^2) / (denom))
    +    }
    +  }
    +  return(k)
    +}
    +
    +my_gaussian_process <- function(x, tau, l, n = 3) {
    +  m <- rep(0, length(x))
    +  k <- squared_exponential_cov(x = x, tau = tau, l = l)
    +  gp_samples <- mvtnorm::rmvnorm(n = n, mean = m, sigma = k)
    +  return(gp_samples)
    +}
    +
    +tidy_gp <- function(x, tau, l, n = 3) {
    +  my_gaussian_process(x = x, tau = tau, l = l, n = n) %>%
    +    as.data.frame() %>%
    +    as_tibble() %>%
    +    set_names(x) %>%
    +    mutate(sample_idx = as.character(1:n())) %>%
    +    pivot_longer(-sample_idx, names_to = "x", values_to = "y") %>%
    +    mutate(x = as.numeric(x))
    +}
    +
    +set.seed(0)
    +x <- seq(0, 10, by = 0.1)
    +gp_samples <- tibble(tau = c(0.25, 0.5, 0.25, 0.5), l = c(0.5, 0.5, 2, 2)) %>%
    +  mutate(samples = purrr::map2(tau, l, ~ tidy_gp(x = x, tau = .x, l = .y, n = 3))) %>%
    +  unnest(samples)
    +
    +gp_samples %>%
    +  mutate(grp = glue("\u03C4 = {tau}, \u2113 = {l}")) %>%
    +  ggplot(aes(x = x, y = y)) +
    +  facet_wrap(vars(grp), nrow = 2) +
    +  geom_line(aes(color = sample_idx)) +
    +  scale_x_continuous(expand = expansion(c(0, 0))) +
    +  scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
    +  scale_color_brewer(type = "qual", palette = "Set1") +
    +  theme(legend.position = "none", axis.text.y = element_markdown()) +
    +  labs(x = "x", y = "\u03BC(x)")
    +
    +
    +

    +
    +

    Covariance functions

    +
      +
    • “Different covariance functions can be used to add structural prior +assumptions like smoothness, nonstationarity, periodicity, and +multi-scale or hierarchical structures.” (pg. 502) +
        +
      • sums and products of GPs are also GPs so can combine them in the +same model
      • +
    • +
    • can also use “anisotropic” GPs covariance functions for multiple +predictors
    • +
    +

    Inference

    +
      +
    • computing the mean and covariance in the \(n\)-variate normal conditional posterior +for \(\tilde{\mu}\) involves a matrix +inversion that requires \(O(n^3)\) +computation +
        +
      • this needs to be repeated for each MCMC step
      • +
      • limits the size of the data set and number of covariates in a +model
      • +
    • +
    +

    Covariance function +approximations

    +
      +
    • there are approximations to the GP that can speed up computation +
        +
      • generally work by reducing the matrix inversion burden
      • +
    • +
    +

    21.3 Latent Gaussian process +models

    +
      +
    • with non-Gaussian likelihoods, the GP prior becomes a latent +function \(f\) which determines the +likelihood \(p(y|f,\phi)\) through a +link function
    • +
    +

    21.4 Functional data analysis

    +
      +
    • functional data analysis: considers responses and +predictors not a scalar/vector-valued random variables but as random +functions with infinitely-many points +
        +
      • GPs fit this need well with little modification
      • +
    • +
    +

    21.5 Density estimation and +regression

    +
      +
    • can get more flexibility by modeling the conditional observation +model as a nonparametric GP +
        +
      • so far have used a GP as a prior for a function controlling location +or shape of a parametric observation model
      • +
      • one solution is the logistic Gaussian process (LGP) or a +Dirichlet process (covered in a later chapter)
      • +
    • +
    +

    Density estimation

    +
      +
    • LGP generates a random surface from a GP and then transforms the +surface to the space of probability densities +
        +
      • with 1D, the surface is just a curve
      • +
      • use the continuous logistic transformation to constrain to +non-negative and integrate to 1
      • +
    • +
    • there is illustrative example in the book on page 513
    • +
    +

    Density regression

    +
      +
    • generalize the LPG to density regression by putting a prior on the +collection of conditional densities
    • +
    +

    Latent-variable regression

    +
      +
    • an alternative to LPG
    • +
    +
    + + +
    + +
    +
    + + + + + + + +
    + + + + + + + diff --git a/docs/notes/20_finite-mixture-models_bda3-22.Rmd b/docs/notes/20_finite-mixture-models_bda3-22.Rmd new file mode 100644 index 0000000..abfae8b --- /dev/null +++ b/docs/notes/20_finite-mixture-models_bda3-22.Rmd @@ -0,0 +1,91 @@ +--- +title: "20. Notes on 'Ch 22. Finite mixture models'" +date: "2022-01-18" +output: distill::distill_article +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>") +``` + +> These are just notes on a single chapter of *BDA3* that were not part of the course. + +## Chapter 22. Finite mixture models + +- "when measurements of a random variable are taken under two different conditions" (pg. 519) +- where the data contains multiple subpopulations where each has a different, relatively simple model +- basic mixture modeling principle is to introduce *unobserved indicators* $z$ to specify the mixture component for an observation + - can think of a mixture indicator as missing data + +### 22.1 Setting up and interpreting mixture models + +#### Finite mixtures + +- want to model the distribution of $y = (y_1, \dots, y_n)$ or $y|x$ as a mixture of $H$ components + - for each component $h \in (1, \dots, H)$, the distribution $f_h(y_i | \theta_h)$ depends on a parameter vector $\theta_h$ + - $\lambda_h$ denotes the proportion of the population in component $h$ + - $\sum_{h=1}^{H} \lambda_h = 1$ + - common to assume all mixture components have the same parametric form + - thus, the sampling distribution of $y$ is: + +$$ +p(y_i | \theta, \lambda) = \lambda_1 f(y_i | \theta_1) + \dots + \lambda_H f(y_i | \theta_H) +$$ + +- can think of the mixture distribution probabilities $\lambda$ as priors over the parameters $\theta_h$ + - or as a description of the variation in $\theta$ in a population + - akin to a hierarchical model +- introduce the indicator variables $z_{ih}$ where $z_{ih} = 1$ if the $i$th data point is drawn from component $h$ and 0 otherwise + - the $lambda$ values are used to determine $z$ + - can think of $lambda$ as a hyperprior over $z$ + - joint distribution of the observed data $y$ and the unobserved indicators $z$ conditions on the model parameters: + - only one $z_{ih}$ can be 1 for each $i$ + +$$ +\begin{aligned} +p(y, z | \theta, \lambda) &= p(z | \lambda) p(y | z, \theta) \\ + &= \prod_{i=1}^n \prod_{h=1}^H (\lambda_h f(y_i | \theta_h))^{z_{i,h}} +\end{aligned} +$$ + +#### Continuous mixtures + +- generalize the finite mixture to allow probability of an observation belongs to a class +- hierarchical models are a form a continuous mixture model + - each observed value $y_i$ is modeled as coming from a mixture of models defined by the probability of values for $\theta$ +- in the book, the focus is on finite mixtures and "minor modifications" are generally required to form a continuous distribution + +#### Identifiabilitiy of the mxixture likelihood + +- **all finite mixture models are nonidentifiable: the distribution is unchanged if the group labels are permuted** +- in many cases, purposeful, informative priors can solve the issue + +#### Priors distributions + +- the priors for a finite mixture model's parameters $\theta$ and $\lambda$ are usually the product of the two independent priors on each variable + - because the vector of mixture indicators $z_i = (z_{i,1}, \dots, z_{i,H})$ is multinomial with parameter $\lambda$, a common prior for $\lambda$ is the Dirichlet + - $\lambda \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_H)$ + - $\theta = (\theta_1, \dots, \theta_H)$ is the 
vector of parameters for each component's sub-model + - some can be shared across components (i.e. equal variance for a group of normal distributions) + +#### Number of mixture components + +- can model $H$ as unknown but is computationally expensive +- usually can just build models with different $H$ and compare their goodness of fit + - compare the posterior predictive distributions with a "suitably chosen" test quantity + +#### Mixtures as true models or approximating distributions + +- two classes of thought[^1]: + 1. *theoretical:* a mixture model is "a realistic characterization of the true data-generating mechanism" (pg. 522) + 2. *pragmatic:* "trying to infer latent subpopulations is an intrinsically ill-defined statistical problem, but finite mixture models are nonetheless useful" (pg. 523) + +[^1]: I came up with the names of the two schools of thought as descriptive titles; they do not appear in the book. + +### 22.4 Unspecifed number of mixture components + +- can assign a Poisson distribution as a on $H$ (the number of groups/components in the mixture model) + - computationally intensive + - more common to just fit the model with different $H$ and compare with some statistic and a penalty for model complexity + - WAIC is theoretically justified, but ignores the uncertainty over $H$ + - LOO-CV may be even better diff --git a/docs/notes/20_finite-mixture-models_bda3-22.html b/docs/notes/20_finite-mixture-models_bda3-22.html new file mode 100644 index 0000000..454345e --- /dev/null +++ b/docs/notes/20_finite-mixture-models_bda3-22.html @@ -0,0 +1,2682 @@ + + + + + + + + + + + + + + + + + + + + 20. Notes on 'Ch 22. Finite mixture models' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
diff --git a/docs/notes/21_Dirichlet-process-models_bda3-23.Rmd b/docs/notes/21_Dirichlet-process-models_bda3-23.Rmd new file mode 100644 index 0000000..95aa2ef --- /dev/null +++ b/docs/notes/21_Dirichlet-process-models_bda3-23.Rmd @@ -0,0 +1,350 @@
---
title: "21. Notes on 'Ch 23. Dirichlet process models'"
date: "2022-01-20"
output: distill::distill_article
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")

library(tidyverse)

theme_set(
  theme_bw() +
    theme(
      strip.background = element_blank(),
      axis.ticks = element_blank()
    )
)
```

> These are just notes on a single chapter of *BDA3* that were not part of the course.

## Chapter 23. Dirichlet process models

- **Dirichlet process**: an infinite-dimensional generalization of the Dirichlet distribution
  - used as a prior on unknown distributions
  - can extend finite component mixture models to infinite mixture models

### 23.1 Bayesian histograms

- the histogram as a simple form of density estimation
  - demonstrates a flexible parametric version that motivates the nonparametric approach in the following section
- prespecified knots: $\xi = (\xi_0, \dots, \xi_k)$ with $\xi_{h-1} < \xi_h$
- probability model for the density (a histogram):
  - where $\pi = (\pi_1, \dots, \pi_k)$ is an unknown probability vector

$$
f(y) = \sum_{h=1}^k 1_{\xi_{h-1} < y \le \xi_h} \frac{\pi_h}{(\xi_h - \xi_{h-1})}
$$

- prior for the probabilities $\pi$ as a Dirichlet distribution:

$$
p(\pi|a) = \frac{\Gamma(\sum_{h=1}^k a_h)}{\prod_{h=1}^k \Gamma(a_h)} \prod_{h=1}^k \pi_h^{a_h - 1}
$$

- replace the hyperparameter vector: $a = \alpha \pi_0$ where $\pi_0$ is:

$$
\text{E}(\pi|a) = \pi_0 = \left( \frac{a_1}{\sum_h a_h}, \dots, \frac{a_k}{\sum_h a_h} \right)
$$

- the posterior for $\pi$ becomes:
  - where $n_h$ is the number of observations $y$ in the $h$th bin

$$
p(\pi | y) \propto \prod_{h=1}^k \pi_h^{a_h + n_h - 1} = \text{Dirichlet}(a_1 + n_1, \dots, a_k + n_k)
$$

- this histogram estimator does well but is sensitive to the specification of the knots (a small numerical sketch follows)
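Here is a quick numerical sketch of my own (not from the book) of the conjugate update above: count the observations in each bin, add the counts to the Dirichlet prior parameters $a = \alpha \pi_0$, and take one posterior draw via normalized gamma variates. The data, knots, $\alpha$, and $\pi_0$ are all arbitrary choices.

```{r}
# Sketch (not from the book): conjugate Dirichlet update for the Bayesian
# histogram. Data, knots, alpha, and pi0 are arbitrary choices.
set.seed(0)
y_obs <- rgamma(200, shape = 2, rate = 1)
knots <- seq(0, ceiling(max(y_obs)), by = 0.5)
k <- length(knots) - 1

n_h <- as.vector(table(cut(y_obs, breaks = knots))) # counts per bin
alpha <- 5
pi0 <- rep(1 / k, k)                                # uniform prior guess
a_post <- alpha * pi0 + n_h                         # Dirichlet(a + n) posterior

# One posterior draw of pi using normalized gamma variates, converted to a
# density by dividing the bin probabilities by the bin widths.
g <- rgamma(k, shape = a_post, rate = 1)
pi_draw <- g / sum(g)
density_draw <- pi_draw / diff(knots)
round(density_draw, 2)
```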
### 23.2 Dirichlet process prior distributions

#### Definition and basic properties

- goal is to not need to prespecify the bins of the histogram
- let:
  - $\Omega$: sample space
  - $B_1, \dots, B_k$: measurable subsets of $\Omega$
- if $\Omega = \Re$, then $B_1, \dots, B_k$ are non-overlapping intervals that partition the real line into a finite number of bins
- $P$: unknown probability measure on $(\Omega, \mathcal{B})$
  - $\mathcal{B}$: "collection of all possible subsets of the sample space $\Omega$"
  - $P$ assigns probabilities to the subsets $\mathcal{B}$
  - probability for a set of bins $B_1, \dots, B_k$ partitioning $\Omega$:

$$
(P(B_1), \dots, P(B_k)) = \left( \int_{B_1} f(y) dy, \dots, \int_{B_k} f(y) dy \right)
$$

- $P$ is a *random probability measure* (RPM), so the bin probabilities are random variables
- a good prior for the bin probabilities is the Dirichlet distribution \@ref(eq:dirichlet-prior)
  - where $P_0$ is a base probability measure providing the initial guess at $P$
  - where $\alpha$ is a prior concentration parameter
    - controls shrinkage of $P$ towards $P_0$

$$
P(B_1), \dots, P(B_k) \sim \text{Dirichlet}(\alpha P_0(B_1), \dots, \alpha P_0(B_k))
(\#eq:dirichlet-prior)
$$

- difference with the previous Bayesian histogram: this only specifies that bin $B_k$ is assigned probability $P(B_k)$, not how the probability mass is distributed across the bin $B_k$
  - thus, for a fixed set of bins, this equation does not fully specify the prior for $P$
  - need to eliminate the sensitivity to the choice of bins by assuming the prior holds for all possible partitions $B_1, \dots, B_k$ for all $k$
    - then it is a fully specified prior for $P$
- there "must exist a random probability measure $P$ such that the probabilities assigned to any measurable partition $B_1, \dots, B_k$ by $P$ is $\text{Dirichlet}(\alpha P_0(B_1), \dots, \alpha P_0(B_k))$"
  - the resulting $P$ is a Dirichlet process:
    - $P \sim \text{DP}(\alpha P_0)$
    - $\alpha > 0$: a scalar precision parameter
    - $P_0$: baseline probability measure, also on $(\Omega, \mathcal{B})$
- implications of the DP:
  - the marginal random probability assigned to any subset $B$ is a beta distribution
    - $P(B) \sim \text{Beta}(\alpha P_0(B), \alpha (1-P_0(B)))$ for all $B \in \mathcal{B}$
  - the prior for $P$ is centered on $P_0$: $\text{E}(P(B)) = P_0(B)$
  - $\alpha$ controls the variance
    - $\text{var}(P(B)) = \frac{P_0(B)(1 - P_0(B))}{1 + \alpha}$
- get the posterior for $P$:
  - let $y_i \stackrel{iid}{\sim} P$ for $i = 1, \dots, n$
  - $P \sim \text{DP}(\alpha P_0)$
    - $P$ denotes the probability measure *and* its corresponding distribution
  - from \@ref(eq:dirichlet-prior), for any partition $B_1, \dots, B_k$:

$$
P(B_1), \dots, P(B_k) | y_1, \dots, y_n \sim
  \text{Dirichlet} \left(
    \alpha P_0(B_1) + \sum_{i=1}^n 1_{y_i \in B_1}, \dots, \alpha P_0(B_k) + \sum_{i=1}^n 1_{y_i \in B_k}
  \right)
$$

- this can be converted to the following:

$$
P | y_1, \dots, y_n \sim \text{DP} \left( \alpha P_0 + \sum_i \delta_{y_i} \right)
$$

- finally, the posterior expectation of $P$:

$$
\text{E}(P(B) | y^n) =
  \left(\frac{\alpha}{\alpha + n} \right) P_0(B) +
  \left(\frac{n}{\alpha + n} \right) \sum_{i=1}^n \frac{1}{n} \delta_{y_i}
(\#eq:postp)
$$

- the DP is a model similar to a random histogram but without dependence on the bins
- cons of a DP prior:
  - lack of smoothness
  - induces negative correlation between $P(B_1)$ and $P(B_2)$ for any two disjoint bins with no account for the distance between them
  - realizations from the DP are discrete distributions
    - with $P \sim \text{DP}(\alpha P_0)$, $P$ is atomic and has nonzero weight only on a set of atoms, not a continuous density on the real line

#### Stick-breaking construction

- a more intuitive understanding of the DP
- induce $P \sim \text{DP}(\alpha P_0)$ by letting:

$$
\begin{aligned}
P(\cdot) &= \sum_{h=1}^{\infty} \pi_h \delta_{\theta_h}(\cdot) \\
\pi_h &= V_h \prod_{l<h} (1 - V_l) \\
V_h &\sim \text{Beta}(1, \alpha) \\
\theta_h &\sim P_0
\end{aligned}
$$

- where:
  - $P_0$: base distribution
  - $\delta_\theta$: degenerate distribution with all of its mass at $\theta$
  - $(\theta_h)_{h=1}^{\infty}$: the atoms, generated independently from $P_0$
  - $\pi_h$: probability mass at atom $\theta_h$
    - the weights are generated by the stick-breaking process, which ensures they sum to 1
- the stick-breaking process:
  - start with a stick of length 1
    - represents the total probability allocated to all the atoms
  - break off a random piece of length $V_1$ determined by a draw from $\text{Beta}(1, \alpha)$
  - set $\pi_1 = V_1$ as the probability weight of the randomly generated first atom $\theta_1 \sim P_0$
  - break off another piece of the remaining stick (now of length $1-V_1$): $V_2 \sim \text{Beta}(1, \alpha)$
  - set $\pi_2 = V_2(1-V_1)$ as the probability weight of the next atom $\theta_2 \sim P_0$
  - repeat until the stick is fully used
- implications:
  - as the process continues, the stick gets shorter, so the lengths allocated to later-indexed atoms decrease stochastically
  - the rate of decrease of the stick length depends on $\alpha$
    - $\alpha$ near 0 leads to high weights early on
- below are realizations of the stick-breaking process
  - set $P_0$ as a standard normal distribution and vary $\alpha$

```{r}
stick_breaking_process <- function(alpha, n = 1000) {
  theta <- rnorm(n, 0, 1) # P0 = standard normal
  vs <- rbeta(n, 1, alpha)
  pi <- rep(0, n)
  pi[1] <- vs[1]
  stick <- 1.0 - vs[1]
  for (h in 2:n) {
    pi[h] <- vs[h] * stick
    stick <- stick - pi[h]
  }
  return(list(theta = theta, pi = pi))
}

dp_realization <- stick_breaking_process(10)
sum(dp_realization$pi)
```

```{r}
set.seed(549)

tibble(alpha = c(0.5, 1, 5, 10)) %>%
  mutate(
    dp = purrr::map(alpha, stick_breaking_process, n = 1000),
    theta = purrr::map(dp, ~ .x$theta),
    pi = purrr::map(dp, ~ .x$pi)
  ) %>%
  select(-dp) %>%
  unnest(c(theta, pi)) %>%
  ggplot(aes(x = theta, y = pi)) +
  facet_wrap(vars(alpha), nrow = 2, scales = "fixed") +
  geom_col(width = 0.05) +
  scale_x_continuous(expand = expansion(c(0.02, 0.02))) +
  scale_y_continuous(expand = expansion(c(0, 0.02))) +
  labs(x = "\u03B8", y = "\u03C0")
```
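As a bridge to the mixture models in the next section, here is a small sketch of my own (not from the book): reuse `stick_breaking_process()` as the mixing measure and smooth each atom with a normal kernel, giving one random density $f(y) = \sum_h \pi_h \, \text{N}(y \mid \theta_h, 0.3^2)$. The kernel standard deviation and $\alpha = 2$ are arbitrary choices.

```{r}
# Sketch (not from the book): one draw from a DP mixture of normals built from
# a stick-breaking realization. Kernel sd (0.3) and alpha = 2 are arbitrary.
set.seed(123)
dp <- stick_breaking_process(alpha = 2, n = 500)

dp_mixture_density <- function(y, dp, kernel_sd = 0.3) {
  sapply(y, function(yi) sum(dp$pi * dnorm(yi, mean = dp$theta, sd = kernel_sd)))
}

tibble(y = seq(-4, 4, length.out = 300)) %>%
  mutate(density = dp_mixture_density(y, dp)) %>%
  ggplot(aes(x = y, y = density)) +
  geom_line() +
  labs(x = "y", y = "density", title = "One draw from a DP mixture of normals")
```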
### 23.3 Dirichlet process mixtures

#### Specification and Polya urns

- **"the DP is more appropriately used as a prior for an unknown mixture of distributions"** (pg 549)
- in the case of density estimation, a general kernel mixture model can be specified as \@ref(eq:kernel-density)
  - $\mathcal{K}(\cdot | \theta)$: a kernel
  - $\theta$: location and possibly scale parameters
  - $P$: "mixing measure"

$$
f(y|P) = \int \mathcal{K}(y|\theta) d P(\theta)
(\#eq:kernel-density)
$$

- treating $P$ as discrete results in a finite mixture model
- setting a prior on $P$ creates an infinite mixture model
  - prior: $P \sim \pi_\mathcal{P}$
    - $\mathcal{P}$: space of all probability measures on $(\Omega, \mathcal{B})$
    - $\pi_\mathcal{P}$: the prior over the space defined by $\mathcal{P}$
- setting $\pi_\mathcal{P}$ to a DP prior results in a DP mixture model
  - from \@ref(eq:postp) and \@ref(eq:kernel-density), a DP prior on $P$ results in \@ref(eq:dp-kernel)
  - where:
    - $\pi \sim \text{stick}(\alpha)$ denotes that the probability weights come from the stick-breaking process with parameter $\alpha$
    - $\theta_h \sim P_0$ independently for each $h=1, \dots, \infty$

$$
f(y) = \sum_{h=1}^\infty \pi_h \mathcal{K}(y | \theta_h^*)
(\#eq:dp-kernel)
$$

- equation \@ref(eq:dp-kernel) resembles a finite mixture model except that the number of mixture components is set to infinity
  - does not mean there will be an infinite number of components
  - instead, the model is flexible enough to add more mixture components if necessary
- consider the following specification
  - the issue is how to conduct posterior computation with a DP mixture (DPM) because $P$ has infinitely many parameters

$$
y_i \sim \mathcal{K}(\theta_i) \text{,} \quad
\theta_i \sim P \text{,} \quad
P \sim \text{DP}(\alpha P_0)
$$

- can marginalize out $P$ to get an induced prior on the subject-specific parameters $\theta^n = (\theta_1, \dots, \theta_n)$
- results in the *Polya urn* predictive rule:

$$
\theta_i | \theta_1, \dots, \theta_{i-1} \sim \left( \frac{\alpha}{\alpha + i - 1} \right) P_0(\theta_i) +
\sum_{j=1}^{i-1} \left( \frac{1}{\alpha + i - 1} \right) \delta_{\theta_j}
(\#eq:polya-urn)
$$

- *Chinese restaurant process* as a metaphor for the Polya urn predictive rule \@ref(eq:polya-urn):
  - consider a restaurant with infinitely many tables
  - the first customer sits at a table with dish $\theta_1^*$
  - the second customer sits at the first table with probability $\frac{1}{1+\alpha}$ or a new table with probability $\frac{\alpha}{1+\alpha}$
  - repeat this process for the $i$th customer:
    - sit at an occupied table $j$ with probability $\frac{c_j}{i - 1 + \alpha}$, where $c_j$ is the number of customers already at the table
    - sit at a new table with probability $\frac{\alpha}{i - 1 + \alpha}$
  - interpretation:
    - each table represents a cluster of subjects
    - the number of clusters depends on the number of subjects
    - makes sense to have the possibility of more clusters with more subjects (instead of a fixed number of clusters as in a finite mixture model)
- there is a description of the sampling process for the posterior of $\theta_i | \theta_{-i}$

#### Hyperprior distribution

- $\alpha$: the DP precision parameter
  - plays a role in controlling the prior on the number of clusters (see the short simulation below)
  - small, fixed $\alpha$ favors allocation to few clusters (relative to the sample size)
    - if $\alpha=1$, the prior indicates that two randomly selected subjects have a 50/50 chance of belonging to the same cluster
  - alternatively, can set a hyperprior on $\alpha$ and let the data inform its value
    - common to use a gamma distribution: $\Gamma(a_\alpha, b_\alpha)$
    - the authors indicate that setting a hyperprior on $\alpha$ tends to work well in practice
- $P_0$: base probability measure for the DP
  - can think of $P_0$ as setting the prior for the cluster locations
  - can put hyperpriors on the parameters of $P_0$
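A short simulation of my own (not from the book) to make the role of $\alpha$ concrete: generate table assignments from the Chinese restaurant process for $n = 100$ subjects and record the average number of occupied tables (clusters) for a few values of $\alpha$.

```{r}
# Simulation sketch (not from the book): the prior number of clusters induced
# by the Chinese restaurant process for n subjects at different alpha values.
simulate_crp_n_clusters <- function(n, alpha) {
  counts <- numeric(0) # customers seated at each occupied table
  for (i in seq_len(n)) {
    # probability of each occupied table, then of opening a new table
    probs <- c(counts, alpha) / (i - 1 + alpha)
    choice <- sample(length(probs), size = 1, prob = probs)
    if (choice > length(counts)) {
      counts <- c(counts, 1) # open a new table (cluster)
    } else {
      counts[choice] <- counts[choice] + 1
    }
  }
  length(counts)
}

set.seed(123)
tibble(alpha = c(0.5, 1, 5, 10)) %>%
  mutate(
    avg_n_clusters = purrr::map_dbl(
      alpha, ~ mean(replicate(200, simulate_crp_n_clusters(n = 100, alpha = .x)))
    )
  )
```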
"The real attraction of Dirichlet process mixture (DPM) models is that they can be used much more broadly for relaxing parametric assumptions in hierarchical models." (pg 557) +- consider the linear regression with a nonparametric error distribution \@ref(eq:linreg-nonparam-error) + - $X_i = (X_{i1}, \dots, X_{ip})$: vector of predictors + - $\epsilon_i$: error term with distribution $f$ + +$$ +y_i = X_i \beta + \epsilon_i \text{,} \quad \epsilon_i \sim f +(\#eq:linreg-nonparam-error) +$$ + +- can relax the assumption that $f$ has parametric form + +$$ +\epsilon_i \sim N(0, \phi_i^{-1}) \text{,} \quad \phi \sim P \text{,} \quad P \sim \text{DP}(\alpha P_0) +$$ + +#### Nonparametric models for parameters that vary by group + +- consider hierarchical linear models with varying coefficients + - can account for uncertainty about the distribution of coefficients by placing a DP or DPM priors on them + - where: + - $y_i = (y_{i1}, \dots, y_{in_i})$: repeated measurements for item $i$ + - $\mu_i$: subject-specific mean + - $\epsilon_{ij}$: observation specific residual + +$$ +y_{ij} = \mu_i + \epsilon_{ij} \text{,} \quad \mu_i \sim f \text{,} \quad \epsilon_{ij} \sim g +(\#eq:hier-linreg-dp) +$$ + +- typically for \@ref(eq:hier-linreg-dp): + - $f \equiv N(\mu, \phi^{-1})$ + - $g \equiv N(0, \sigma^2)$ +- can let more flexibility in characterizing variability among subjects: + - $\mu_i \sim P \text{,} \quad P \sim \text{DP}(\alpha P_0)$ + - the DP prior induces a *latent class* model + - the subjects are grouped into an unknown number of clusters \@ref(eq:latent-class-model) + - $S_i \in \{1, \dots, \infty\}$: latent class index + - $\pi_h$: probability of allocation to latent class $h$ + +$$ +\mu_i = \mu^*_{S_i} \text{,} \quad \Pr(S_i = h) = \pi_h \text{,} \quad h = 1, 2, \dots +(\#eq:latent-class-model) +$$ + +> There is more to this chapter, but it is currently beyond my understanding. +> I hope to return to this chapter later with a better understanding of Dirichlet processes in the future. + +--- + +```{r} +sessionInfo() +``` diff --git a/docs/notes/21_Dirichlet-process-models_bda3-23.html b/docs/notes/21_Dirichlet-process-models_bda3-23.html new file mode 100644 index 0000000..eef4c24 --- /dev/null +++ b/docs/notes/21_Dirichlet-process-models_bda3-23.html @@ -0,0 +1,3108 @@ + + + + + + + + + + + + + + + + + + + + 21. Notes on 'Ch 23. Dirichlet process models' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +