diff --git a/_site.yml b/_site.yml
index ea51d26..b948e2c 100644
--- a/_site.yml
+++ b/_site.yml
@@ -39,6 +39,20 @@ navbar:
href: notes/11_normal-approx-freq-properties_bda3-04.html
- text: "Section 12. Extended topics"
href: notes/12_extended-topics.html
+ - text: "13. Notes on 'Ch 14. Introduction to regression models'"
+ href: notes/13_intro-to-regression-models_bda3-14.Rmd
+ - text: "14. Notes on 'Ch 15. Hierarchical linear models'"
+ href: notes/14_hierarchical-linear-models_bda3-15.html
+ - text: "17. Notes on 'Ch 19. Parametric nonlinear models'"
+ href: notes/17_parametric-nonlinear-models_bda3-19.html
+ - text: "18. Notes on 'Ch 20. Basis function models'"
+ href: notes/18_basis-function-models_bda3-20.html
+ - text: "19. Notes on 'Ch 21. Gaussian process models'"
+ href: notes/19_gaussian-processes_bda3-21.html
+ - text: "20. Notes on 'Ch 22. Finite mixture models'"
+ href: notes/20_finite-mixture-models_bda3-22.html
+ - text: "21. Notes on 'Ch 23. Dirichlet process models'"
+ href: notes/21_Dirichlet-process-models_bda3-23.html
- text: "Exercises"
menu:
- text: "Chapter 1"
@@ -47,6 +61,8 @@ navbar:
href: book-exercises/bda-exercises-02.html
- text: "Chapter 7"
href: book-exercises/bda-exercises-07.html
+ - text: "Chapter 19"
+ href: book-exercises/bda-exercises-19.html
- text: "Assignments"
menu:
- text: "Assignment 1"
diff --git a/book-exercises/bda-exercises-19.Rmd b/book-exercises/bda-exercises-19.Rmd
new file mode 100644
index 0000000..a54dfd3
--- /dev/null
+++ b/book-exercises/bda-exercises-19.Rmd
@@ -0,0 +1,329 @@
+---
+title: "Chapter 19 Exercises - Reproducing the the 'serial dilution assay'"
+date: "2021-12-10"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300)
+# dpi = 400, fig.width = 7, fig.height = 4.5, fig.retina = TRUE
+
+rfiles <- list.files(here::here("src"), full.names = TRUE, pattern = "R$")
+for (rfile in rfiles) {
+ source(rfile)
+}
+
+set.seed(0)
+```
+
+> I turned this exercise into a blog post on my [website](https://joshuacook.netlify.app).
+
+In chapter 19 "Parametric nonlinear models" of *Bayesian Data Analysis*[^1] by Gelman *et al.*, the authors present an example of fitting a curve to a [serial dilution](https://en.wikipedia.org/wiki/Serial_dilution) standard curve and using it to estimate unknown concentrations.
+Below, I build the model with Stan and fit it using MCMC.
+Unfortunately, I was unable to find the original data in Gelman's original publication of the model[^2].
+The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data.
+
+> The source code for this post is in a [repository](https://github.com/jhrcook/bayesian-data-analysis-course) of my work for Aki Vehtari's Bayesian Data Analysis [course](https://avehtari.github.io/BDA_course_Aalto/index.html).
+
+[^1]: Gelman, Andrew, et al. Bayesian Data Analysis. CRC Press, 2015.
+[^2]: Gelman A, Chew GL, Shnaidman M. Bayesian analysis of serial dilution assays. Biometrics. 2004 Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. https://pubmed.ncbi.nlm.nih.gov/15180666/
+
+## Setup
+
+```{r setup-show, message=FALSE, warning=FALSE}
+library(rstan)
+library(tidybayes)
+library(patchwork)
+library(tidyverse)
+
+options(mc.cores = parallel::detectCores())
+rstan_options(auto_write = TRUE)
+
+theme_set(
+ theme_classic() +
+ theme(
+ panel.grid.major = element_line(),
+ strip.background = element_blank(),
+ plot.title = element_text(hjust = 0.5)
+ )
+)
+
+SNS_BLUE <- "#1F77B4"
+STAN_RED <- "#B2171D"
+```
+
+As mentioned above, I couldn't find the original data, so I copied it from the book's figure 19.3 on page 473.
+
+```{r}
+dilution_standards_data <- tibble::tribble(
+ ~conc, ~dilution, ~y,
+ 0.64, 1, c(101.8, 121.4),
+ 0.32, 1 / 2, c(105.2, 114.1),
+ 0.16, 1 / 4, c(92.7, 93.3),
+ 0.08, 1 / 8, c(72.4, 61.1),
+ 0.04, 1 / 16, c(57.6, 50.0),
+ 0.02, 1 / 32, c(38.5, 35.1),
+ 0.01, 1 / 64, c(26.6, 25.0),
+ 0, 0, c(14.7, 14.2),
+) %>%
+ mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
+ unnest(c(y, rep))
+
+knitr::kable(dilution_standards_data)
+```
+
+The following plot shows the two standard dilution curves.
+They are quite similar.
+
+```{r}
+data_plot <- dilution_standards_data %>%
+ ggplot(aes(x = conc, y = y, color = rep)) +
+ geom_line(alpha = 0.5, linetype = 2) +
+ geom_point(alpha = 0.8) +
+ scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+ scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+ scale_color_brewer(type = "qual", palette = "Set1") +
+ theme(
+ legend.position = c(0.8, 0.2),
+ legend.background = element_blank()
+ ) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Serial dilution standard curve",
+ color = "replicate"
+ )
+
+data_plot
+```
+
+## Modeling
+
+### Model specification
+
+The model uses a normal likelihood to describe the distribution of measurements $y$ at a given concentration $x$.
+The mean of the likelihood is defined by the standard curve equation used in the field:
+
+$$
+\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + (x/\beta_3)^{-\beta_4}} \\
+$$
+
+
+The model is a scaled and shifted logistic curve.
+This structure results in the following interpretations for $\beta$, all of which are restricted to positive values:
+
+1. $\beta_1$: color intensity when the concentration is 0
+2. $\beta_2$: increase to saturation
+3. $\beta_3$: the inflection point of the curve
+4. $\beta_4$: rate of saturation
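+
+To make the curve concrete, below is a small R sketch of the mean function $g(x, \beta)$ evaluated at a few hand-picked parameter values (chosen only for illustration, not taken from the fit below):
+
+```{r}
+# Mean of the likelihood: a scaled and shifted logistic curve.
+g_mean <- function(x, beta) {
+  beta[1] + beta[2] / (1 + (x / beta[3])^(-beta[4]))
+}
+
+# Illustrative (hypothetical) parameter values: baseline 10, rise of 100,
+# inflection point at 0.05, saturation rate 1.
+g_mean(c(0.01, 0.05, 0.64), beta = c(10, 100, 0.05, 1))
+```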
+
+Below are the prior distributions for $\beta$. Note that they are on drastically different scales; this is critical for helping the model fit the data.
+
+$$
+\beta_1 \sim N(10, 2.5) \\
+\beta_2 \sim N(100, 5) \\
+\beta_3 \sim N(0, 1) \\
+\beta_4 \sim N(0, 2.5)
+$$
+
+The measurement error of the model, representing the variance in the model's likelihood, is defined as follows:
+
+$$
+\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} \rgroup^{2\alpha} \sigma^2_y
+$$
+
+Here, $\alpha$, restricted to lie between 0 and 1, allows the variance to be higher for larger measurement values.
+$A$ is a constant (set to 30 by the authors) that allows $\sigma_y$ to be more easily interpreted as the variance from "typical" measurements.
+Below are the priors for the new variables in the model.
+
+$$
+\alpha \sim \text{Beta}(1, 1) \qquad
+\sigma \sim |N(0, 2.5)|
+$$
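+
+As a quick illustration of the error model, the following R sketch computes $\tau$ at a low and a high expected intensity using hand-picked (hypothetical) values for $\alpha$ and $\sigma_y$; the spread grows with the expected measurement:
+
+```{r}
+# Measurement variance as a function of the expected intensity g.
+tau_var <- function(g, alpha, sigma_y, A = 30) {
+  (g / A)^(2 * alpha) * sigma_y^2
+}
+
+# Hypothetical values: the variance is larger at the higher expected intensity.
+tau_var(g = c(15, 100), alpha = 0.7, sigma_y = 1.3)
+```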
+
+### In Stan
+
+Below is the Stan code for the model.
+It looks very similar to the mathematical description of the model, a nice feature of the Stan probabilistic programming language.
+
+The mean and variance of the likelihood are calculated separately as `g` and `tau` so they can be used in both the `model` and `generated quantities` blocks without duplicating code.
+The `log_lik` values are calculated so that PSIS-LOO cross-validation can be estimated.
+I also included the ability to pass in new data, `xnew`, to make predictions for.
+
+```{r}
+dilution_model_file <- here::here("models", "serial-dilution.stan")
+writeLines(readLines(dilution_model_file))
+```
+
+### Sampling
+
+As mentioned above, carefully specifying the prior distribution for each $\beta$ is necessary for MCMC to sample accurately from the posterior.
+With the priors restricting each parameter to a plausible range, the model fit very well.
+
+```{r}
+xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
+model_data <- list(
+ N = nrow(dilution_standards_data),
+ A = 30,
+ x = dilution_standards_data$conc,
+ y = dilution_standards_data$y,
+ M = length(xnew),
+ xnew = xnew
+)
+
+dilution_model <- stan(
+ dilution_model_file,
+ model_name = "serial-dilution",
+ data = model_data,
+ refresh = 1000
+)
+```
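+
+Because the model also returns `log_lik`, a PSIS-LOO estimate can be computed with the loo package (listed in the session info below). The following is a sketch of the standard loo workflow and is not evaluated here:
+
+```{r, eval=FALSE}
+# PSIS-LOO cross-validation estimate from the pointwise log-likelihood.
+log_lik <- loo::extract_log_lik(
+  dilution_model, parameter_name = "log_lik", merge_chains = FALSE
+)
+r_eff <- loo::relative_eff(exp(log_lik))
+loo::loo(log_lik, r_eff = r_eff)
+```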
+
+### Posterior distributions
+
+The next step is to analyze the posterior draws of the model.
+We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing ("fuzzy caterpillars") and checking diagnostic values such as $\widehat{R}$ and $n_\text{eff}$.
+The trace plots are shown below followed by a table of the posteriors with the diagnostic values.
+Everything looks good, suggesting MCMC was successful.
+
+```{r}
+model_pars <- c("beta", "alpha", "sigma")
+rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ theme(legend.position = "bottom")
+```
+
+```{r}
+print(dilution_model, pars = model_pars)
+```
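+
+The same diagnostics can also be extracted programmatically from the fit object; for example (a sketch, not evaluated here):
+
+```{r, eval=FALSE}
+# Pull n_eff and Rhat for the main parameters out of the fit summary.
+fit_summary <- rstan::summary(dilution_model, pars = model_pars)$summary
+fit_summary[, c("n_eff", "Rhat")]
+```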
+
+The following density plots show the posterior distributions of the model parameters $\beta$, $\alpha$, and $\sigma$.
+
+```{r}
+rstan::stan_dens(
+ dilution_model,
+ pars = model_pars,
+ separate_chains = FALSE,
+ alpha = 0.6
+) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0, 0.02)))
+```
+
+### Posterior predictive check
+
+Below is a plot of the posterior predictive distributions of the model on the original data.
+1,000 individual simulations are plotted in blue and the mean in black.
+The simulated curves visually appear to correspond well with the observed data, indicating the model fits well.
+
+```{r}
+dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
+ as.data.frame() %>%
+ as_tibble() %>%
+ set_names(seq(1, ncol(.))) %>%
+ mutate(draw = 1:n()) %>%
+ pivot_longer(-c(draw), names_to = "idx") %>%
+ left_join(
+ dilution_standards_data %>% mutate(idx = as.character(1:n())),
+ by = "idx"
+ )
+```
+
+```{r}
+plt_n_draws <- 1000
+plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
+
+ppc_mean <- dilution_post_pred %>%
+ group_by(conc) %>%
+ summarize(value = mean(value)) %>%
+ ungroup()
+
+dilution_post_pred %>%
+ filter(draw %in% !!plt_draws) %>%
+ mutate(grp = glue::glue("{draw}-{rep}")) %>%
+ ggplot(aes(x = conc, y = value)) +
+ geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
+ geom_line(group = "a", data = ppc_mean, color = "black") +
+ geom_point(data = ppc_mean, color = "black") +
+ geom_line(
+ aes(y = y, group = rep),
+ data = dilution_standards_data,
+ color = STAN_RED
+ ) +
+ geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Posterior predictive distribution"
+ )
+```
+
+I also had the model make posterior predictions on concentrations across the observed range at smaller step-sizes.
+The mean and 89% HDI are shown in blue below along with the observed data in red.
+The inset plot is a zoomed-in view of the posterior predictive distribution at the lower concentrations.
+
+```{r}
+ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
+ynew_hdi <- apply(
+ rstan::extract(dilution_model, pars = "ynew")$ynew,
+ 2,
+ bayestestR::hdi,
+ ci = 0.89
+)
+ynew_ppc <- tibble(
+ conc = xnew,
+ ynew_mean = ynew_mean,
+ ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
+ ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
+)
+```
+
+```{r}
+plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
+ ppc_df %>%
+ ggplot(aes(x = conc, y = ynew_mean)) +
+ geom_ribbon(
+ aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
+ fill = SNS_BLUE,
+ alpha = 0.5
+ ) +
+ geom_line(group = "a") +
+ geom_line(
+ aes(y = y, group = rep),
+ data = obs_df,
+ color = STAN_RED
+ ) +
+ geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02)))
+}
+
+ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Posterior predictive distribution"
+ )
+
+sub_max <- 0.04
+sub_ppc_plot <- plot_posterior_pred(
+ ynew_ppc %>% filter(conc <= sub_max),
+ dilution_standards_data %>% filter(conc <= sub_max),
+ pt_size = 0.6
+) +
+ theme(axis.title = element_blank())
+
+ppc_plot +
+ inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
+```
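+
+The book example goes on to use the fitted standard curve to estimate unknown concentrations from new measurements. As a rough sketch of that step, the mean curve $g(x, \beta)$ can be inverted algebraically and applied to the posterior means of $\beta$. This point estimate ignores the measurement-error model and the full posterior uncertainty, so it is only an approximation, and the intensity values below are hypothetical:
+
+```{r}
+# Invert g(x, beta) to get a point-estimate concentration from an intensity y.
+estimate_conc <- function(y, beta) {
+  beta[3] * (beta[2] / (y - beta[1]) - 1)^(-1 / beta[4])
+}
+
+# Posterior means of beta applied to a few hypothetical intensity measurements.
+beta_hat <- apply(rstan::extract(dilution_model, pars = "beta")$beta, 2, mean)
+estimate_conc(c(30, 60, 90), beta_hat)
+```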
+
+---
+
+## Session info
+
+```{r}
+sessionInfo()
+```
diff --git a/book-exercises/bda-exercises-19.html b/book-exercises/bda-exercises-19.html
new file mode 100644
index 0000000..af4c12b
--- /dev/null
+++ b/book-exercises/bda-exercises-19.html
@@ -0,0 +1,3123 @@
+  Chapter 19 Exercises - Reproducing the 'serial dilution assay'
+
Chapter 19 Exercises - Reproducing the ‘serial dilution assay’
+
+
+
+
+
+
+
+
+I turned this exercise into a blog post on my website.
+
+
In chapter 19 “Parametric nonlinear models” of Bayesian Data
+Analysis by Gelman et al., the
+authors present an example of fitting a curve to a serial dilution
+standard curve and using it to estimate unknown concentrations. Below, I
+build the model with Stan and fit it using MCMC. Unfortunately, I was
+unable to find the original data in Gelman’s original publication of the
+model. The best I could do was copy the
+data for the standard curve from a table in the book and build the model
+to fit that data.
+
+The source code for this post is in a repository
+of my work for Aki Vehtari’s Bayesian Data Analysis course.
+
+
Setup
+
+
As mentioned above, I couldn’t find the original data, so I copied it
+from the book’s figure 19.3 on page 473.
+
+
+
dilution_standards_data <- tibble::tribble(
+ ~conc, ~dilution, ~y,
+ 0.64, 1, c(101.8, 121.4),
+ 0.32, 1 / 2, c(105.2, 114.1),
+ 0.16, 1 / 4, c(92.7, 93.3),
+ 0.08, 1 / 8, c(72.4, 61.1),
+ 0.04, 1 / 16, c(57.6, 50.0),
+ 0.02, 1 / 32, c(38.5, 35.1),
+ 0.01, 1 / 64, c(26.6, 25.0),
+ 0, 0, c(14.7, 14.2),
+) %>%
+ mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
+ unnest(c(y, rep))
+
+knitr::kable(dilution_standards_data)
+
+| conc | dilution |     y | rep |
+|-----:|---------:|------:|:----|
+| 0.64 | 1.000000 | 101.8 | a   |
+| 0.64 | 1.000000 | 121.4 | b   |
+| 0.32 | 0.500000 | 105.2 | a   |
+| 0.32 | 0.500000 | 114.1 | b   |
+| 0.16 | 0.250000 |  92.7 | a   |
+| 0.16 | 0.250000 |  93.3 | b   |
+| 0.08 | 0.125000 |  72.4 | a   |
+| 0.08 | 0.125000 |  61.1 | b   |
+| 0.04 | 0.062500 |  57.6 | a   |
+| 0.04 | 0.062500 |  50.0 | b   |
+| 0.02 | 0.031250 |  38.5 | a   |
+| 0.02 | 0.031250 |  35.1 | b   |
+| 0.01 | 0.015625 |  26.6 | a   |
+| 0.01 | 0.015625 |  25.0 | b   |
+| 0.00 | 0.000000 |  14.7 | a   |
+| 0.00 | 0.000000 |  14.2 | b   |
+
The following plot shows the two standard dilution curves. They are
+quite similar.
+
+
+
data_plot <- dilution_standards_data %>%
+ ggplot(aes(x = conc, y = y, color = rep)) +
+ geom_line(alpha = 0.5, linetype = 2) +
+ geom_point(alpha = 0.8) +
+ scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+ scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+ scale_color_brewer(type = "qual", palette = "Set1") +
+ theme(
+ legend.position = c(0.8, 0.2),
+ legend.background = element_blank()
+ ) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Serial dilution standard curve",
+ color = "replicate"
+ )
+
+data_plot
+
+
+
+
+
Modeling
+
Model specification
+
The model uses a normal likelihood to describe the distribution of
+measurements \(y\) at a given concentration \(x\). The mean of
+the likelihood is defined by the standard curve equation used in the
+field:
+
\[
+\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 +
+(x/\beta_3)^{-\beta_4}} \\
+\]
+
The model is a scaled and shifted logistic curve. This structure
+results in the following interpretations for \(\beta\), all of which are restricted to
+positive values:
+
+- \(\beta_1\): color intensity when
+the concentration is 0
+- \(\beta_2\): increase to
+saturation
+- \(\beta_3\): the inflection point
+of the curve
+- \(\beta_4\): rate of
+saturation
+
+
Below are the prior distributions for \(\beta\). Note that they are on drastically
+different scales; this is critical for helping the model fit the data.
+
\[
+\beta_1 \sim N(10, 2.5) \\
+\beta_2 \sim N(100, 5) \\
+\beta_3 \sim N(0, 1) \\
+\beta_4 \sim N(0, 2.5)
+\]
+
The measurement error of the model, representing the variance in the
+model’s likelihood, is defined as follows:
+
\[
+\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A}
+\rgroup^{2\alpha} \sigma^2_y
+\]
+
Here, \(\alpha\), restricted to lie
+between 0 and 1, allows the variance to be higher for larger measurement
+values. \(A\) is a constant (set to 30
+by the authors) that allows \(\sigma_y\) to be more easily interpreted as
+the variance from “typical” measurements. Below are the priors for the
+new variables in the model.
+
\[
+\alpha \sim \text{Beta}(1, 1) \qquad
+\sigma \sim |N(0, 2.5)|
+\]
+
In Stan
+
Below is the Stan code for the model. It looks very similar to the
+mathematical description of the model, a nice feature of the Stan
+probabilistic programming language.
+
The mean and variance of the likelihood are calculated
+separately as g and tau so they can be used in both the
+model and generated quantities blocks without duplicating code.
+The log_lik values are calculated so that PSIS-LOO cross-validation
+can be estimated. I also included the ability to pass in new data,
+xnew, to make predictions for.
+
+
+
dilution_model_file <- here::here("models", "serial-dilution.stan")
+writeLines(readLines(dilution_model_file))
+
+
+
#> data {
+#> int<lower=0> N; // number of data points
+#> int<lower=0> A; // constant used in model of measurement error
+#> vector<lower=0>[N] x; // concentration values
+#> vector<lower=0>[N] y; // observed color intensity
+#> int<lower=0> M; // number of new x values
+#> vector<lower=0>[M] xnew; // new x values
+#> }
+#>
+#> parameters {
+#> vector<lower=0>[4] beta;
+#> real<lower=0,upper=1> alpha;
+#> real<lower=0> sigma;
+#> }
+#>
+#> transformed parameters {
+#> vector<lower=0>[N] g;
+#> vector<lower=0>[N] tau;
+#>
+#> for (i in 1:N) {
+#> g[i] = beta[1] + beta[2] / (1 + (x[i] / beta[3]) ^ (-beta[4]));
+#> tau[i] = ((g[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+#> }
+#> }
+#>
+#> model {
+#> // Priors
+#> alpha ~ beta(1, 1);
+#> beta[1] ~ normal(10, 2.5);
+#> beta[2] ~ normal(100, 5);
+#> beta[3] ~ normal(0, 1);
+#> beta[4] ~ normal(0, 2.5);
+#> sigma ~ normal(0, 2.5);
+#>
+#> // Likelihood
+#> for (i in 1:N) {
+#> y[i] ~ normal(g[i], tau[i]);
+#> }
+#> }
+#>
+#> generated quantities {
+#> vector[N] ypred;
+#> vector[N] log_lik;
+#>
+#> vector[M] g_hat;
+#> vector[M] tau_hat;
+#> vector[M] ynew;
+#>
+#> for (i in 1:N) {
+#> ypred[i] = normal_rng(g[i], tau[i]);
+#> log_lik[i] = normal_lpdf(y[i] | g[i], tau[i]);
+#> }
+#>
+#> for (i in 1:M) {
+#> g_hat[i] = beta[1] + beta[2] / (1 + (xnew[i] / beta[3]) ^ (-beta[4]));
+#> tau_hat[i] = ((g_hat[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+#> ynew[i] = normal_rng(g_hat[i], tau_hat[i]);
+#> }
+#> }
+
+
Sampling
+
As mentioned above, carefully specifying the prior distribution for
+each \(\beta\) is necessary for MCMC to
+sample accurately from the posterior. With the priors restricting each
+parameter to a plausible range, the model fit very well.
+
+
+
xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
+model_data <- list(
+ N = nrow(dilution_standards_data),
+ A = 30,
+ x = dilution_standards_data$conc,
+ y = dilution_standards_data$y,
+ M = length(xnew),
+ xnew = xnew
+)
+
+dilution_model <- stan(
+ dilution_model_file,
+ model_name = "serial-dilution",
+ data = model_data,
+ refresh = 1000
+)
+
+
+
+
Posterior distributions
+
The next step is to analyze the posterior draws of the model. We can
+check the success of MCMC by visualizing the traces of the chains,
+looking for good mixing (“fuzzy caterpillars”) and checking diagnostic
+values such as \(\widehat{R}\) and
+\(n_\text{eff}\). The trace plots are
+shown below followed by a table of the posteriors with the diagnostic
+values. Everything looks good, suggesting MCMC was successful.
+
+
+
model_pars <- c("beta", "alpha", "sigma")
+rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ theme(legend.position = "bottom")
+
+
+
+
+
+
+
print(dilution_model, pars = model_pars)
+
+
+
#> Inference for Stan model: serial-dilution.
+#> 4 chains, each with iter=2000; warmup=1000; thin=1;
+#> post-warmup draws per chain=1000, total post-warmup draws=4000.
+#>
+#> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
+#> beta[1] 14.33 0.02 0.55 13.12 14.08 14.36 14.61 15.33 1074
+#> beta[2] 102.70 0.10 4.30 94.48 99.67 102.61 105.60 111.32 1755
+#> beta[3] 0.06 0.00 0.01 0.05 0.06 0.06 0.07 0.08 1633
+#> beta[4] 1.13 0.00 0.08 0.97 1.07 1.12 1.18 1.29 1870
+#> alpha 0.72 0.01 0.17 0.33 0.61 0.74 0.85 0.98 1074
+#> sigma 1.33 0.01 0.24 0.98 1.16 1.30 1.45 1.92 893
+#> Rhat
+#> beta[1] 1
+#> beta[2] 1
+#> beta[3] 1
+#> beta[4] 1
+#> alpha 1
+#> sigma 1
+#>
+#> Samples were drawn using NUTS(diag_e) at Sat Jan 22 11:53:33 2022.
+#> For each parameter, n_eff is a crude measure of effective sample size,
+#> and Rhat is the potential scale reduction factor on split chains (at
+#> convergence, Rhat=1).
+
+
The following density plots show the posterior distributions of the
+model parameters \(\beta\), \(\alpha\), and \(\sigma\).
+
+
+
rstan::stan_dens(
+ dilution_model,
+ pars = model_pars,
+ separate_chains = FALSE,
+ alpha = 0.6
+) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0, 0.02)))
+
+
+
+
+
Posterior predictive check
+
Below is a plot of the posterior predictive distributions of the
+model on the original data. 1,000 individual simulations are plotted in
+blue and the mean in black. The simulated curves visually appear to
+correspond well with the observed data, indicating the model fits
+well.
+
+
+
dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
+ as.data.frame() %>%
+ as_tibble() %>%
+ set_names(seq(1, ncol(.))) %>%
+ mutate(draw = 1:n()) %>%
+ pivot_longer(-c(draw), names_to = "idx") %>%
+ left_join(
+ dilution_standards_data %>% mutate(idx = as.character(1:n())),
+ by = "idx"
+ )
+
+
+
+
+
+
plt_n_draws <- 1000
+plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
+
+ppc_mean <- dilution_post_pred %>%
+ group_by(conc) %>%
+ summarize(value = mean(value)) %>%
+ ungroup()
+
+dilution_post_pred %>%
+ filter(draw %in% !!plt_draws) %>%
+ mutate(grp = glue::glue("{draw}-{rep}")) %>%
+ ggplot(aes(x = conc, y = value)) +
+ geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
+ geom_line(group = "a", data = ppc_mean, color = "black") +
+ geom_point(data = ppc_mean, color = "black") +
+ geom_line(
+ aes(y = y, group = rep),
+ data = dilution_standards_data,
+ color = STAN_RED
+ ) +
+ geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Posterior predictive distribution"
+ )
+
+
+
+
+
I also had the model make posterior predictions on concentrations
+across the observed range at smaller step-sizes. The mean and 89% HDI
+are shown in blue below along with the observed data in red. The inset
+plot is a zoomed-in view of the posterior predictive distribution at the
+lower concentrations.
+
+
+
ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
+ynew_hdi <- apply(
+ rstan::extract(dilution_model, pars = "ynew")$ynew,
+ 2,
+ bayestestR::hdi,
+ ci = 0.89
+)
+ynew_ppc <- tibble(
+ conc = xnew,
+ ynew_mean = ynew_mean,
+ ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
+ ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
+)
+
+
+
+
+
+
plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
+ ppc_df %>%
+ ggplot(aes(x = conc, y = ynew_mean)) +
+ geom_ribbon(
+ aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
+ fill = SNS_BLUE,
+ alpha = 0.5
+ ) +
+ geom_line(group = "a") +
+ geom_line(
+ aes(y = y, group = rep),
+ data = obs_df,
+ color = STAN_RED
+ ) +
+ geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02)))
+}
+
+ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Posterior predictive distribution"
+ )
+
+sub_max <- 0.04
+sub_ppc_plot <- plot_posterior_pred(
+ ynew_ppc %>% filter(conc <= sub_max),
+ dilution_standards_data %>% filter(conc <= sub_max),
+ pt_size = 0.6
+) +
+ theme(axis.title = element_blank())
+
+ppc_plot +
+ inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
+
+
+
+
+
+
Session info
+
+
+
#> R version 4.1.2 (2021-11-01)
+#> Platform: x86_64-apple-darwin17.0 (64-bit)
+#> Running under: macOS Big Sur 10.16
+#>
+#> Matrix products: default
+#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
+#>
+#> locale:
+#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
+#>
+#> attached base packages:
+#> [1] stats graphics grDevices datasets utils methods
+#> [7] base
+#>
+#> other attached packages:
+#> [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
+#> [4] purrr_0.3.4 readr_2.0.1 tidyr_1.1.3
+#> [7] tibble_3.1.3 tidyverse_1.3.1 patchwork_1.1.1
+#> [10] tidybayes_3.0.1 rstan_2.21.2 ggplot2_3.3.5
+#> [13] StanHeaders_2.21.0-7
+#>
+#> loaded via a namespace (and not attached):
+#> [1] fs_1.5.0 matrixStats_0.61.0 lubridate_1.7.10
+#> [4] insight_0.14.4 RColorBrewer_1.1-2 httr_1.4.2
+#> [7] rprojroot_2.0.2 tensorA_0.36.2 tools_4.1.2
+#> [10] backports_1.2.1 bslib_0.2.5.1 utf8_1.2.2
+#> [13] R6_2.5.0 DBI_1.1.1 colorspace_2.0-2
+#> [16] ggdist_3.0.0 withr_2.4.2 tidyselect_1.1.1
+#> [19] gridExtra_2.3 prettyunits_1.1.1 processx_3.5.2
+#> [22] downlit_0.2.1 curl_4.3.2 compiler_4.1.2
+#> [25] rvest_1.0.1 cli_3.0.1 arrayhelpers_1.1-0
+#> [28] xml2_1.3.2 bayestestR_0.11.0 labeling_0.4.2
+#> [31] posterior_1.1.0 sass_0.4.0 scales_1.1.1
+#> [34] checkmate_2.0.0 callr_3.7.0 digest_0.6.27
+#> [37] rmarkdown_2.10 pkgconfig_2.0.3 htmltools_0.5.1.1
+#> [40] highr_0.9 dbplyr_2.1.1 rlang_0.4.11
+#> [43] readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.4
+#> [46] farver_2.1.0 generics_0.1.0 svUnit_1.0.6
+#> [49] jsonlite_1.7.2 distill_1.2 distributional_0.2.2
+#> [52] inline_0.3.19 magrittr_2.0.1 loo_2.4.1
+#> [55] Rcpp_1.0.7 munsell_0.5.0 fansi_0.5.0
+#> [58] abind_1.4-5 lifecycle_1.0.0 stringi_1.7.3
+#> [61] yaml_2.2.1 pkgbuild_1.2.0 grid_4.1.2
+#> [64] parallel_4.1.2 crayon_1.4.1 lattice_0.20-45
+#> [67] haven_2.4.3 hms_1.1.0 knitr_1.33
+#> [70] ps_1.6.0 pillar_1.6.2 codetools_0.2-18
+#> [73] clisymbols_1.2.0 stats4_4.1.2 reprex_2.0.1
+#> [76] glue_1.4.2 evaluate_0.14 V8_3.4.2
+#> [79] renv_0.14.0 RcppParallel_5.1.4 modelr_0.1.8
+#> [82] vctrs_0.3.8 tzdb_0.1.2 cellranger_1.1.0
+#> [85] gtable_0.3.0 datawizard_0.2.1 assertthat_0.2.1
+#> [88] xfun_0.25 broom_0.7.9 coda_0.19-4
+#> [91] ellipsis_0.3.2 here_1.0.1
+
+
diff --git a/docs/about.html b/docs/about.html
index f55643a..7b34ee0 100644
--- a/docs/about.html
+++ b/docs/about.html
@@ -2396,6 +2396,13 @@ ${suggestion.title}
Section 10. Decision analysis
Section 11. Normal approximation & Frequency properties
Section 12. Extended topics
+13. Notes on 'Ch 14. Introduction to regression models'
+14. Notes on 'Ch 15. Hierarchical linear models'
+17. Notes on 'Ch 19. Parametric nonlinear models'
+18. Notes on 'Ch 20. Basis function models'
+19. Notes on 'Ch 21. Gaussian process models'
+20. Notes on 'Ch 22. Finite mixture models'
+21. Notes on 'Ch 23. Dirichlet process models'
@@ -2444,16 +2452,27 @@
${suggestion.title}
About
-
Some additional details about this website and it’s author.
+
+Some additional details about this website and its
+author.
The website
-
This site is built with ‘distill’ and deployed on GitHub pages. The main point of the website is to display my course notes in a easy to read and search format. It isn’t perfect nor optimal, but there is a good balance of functionality and simplicity.
+
This site is built with ‘distill’ and deployed on
+GitHub pages. The main point of the website is to display my course
+notes in an easy-to-read and searchable format. It isn’t perfect or optimal,
+but there is a good balance of functionality and simplicity.
About me
-
My name is Joshua Cook and I am (at the time of taking this course and writing this About page) a graduate student at Harvard Medical School. My research is on cancer genetics and I have a specific love of Bayesian modelling. You can peruse more of my projects and other work on my GitHub profile at jhrcook or my website.
-
+
My name is Joshua Cook and I am (at the time of taking this course
+and writing this About page) a graduate student at Harvard Medical
+School. My research is on cancer genetics and I have a specific love of
+Bayesian modelling. You can peruse more of my projects and other work on
+my GitHub profile at jhrcook or
+my website.
+
diff --git a/docs/book-exercises/bda-exercises-19.Rmd b/docs/book-exercises/bda-exercises-19.Rmd
new file mode 100644
index 0000000..a54dfd3
--- /dev/null
+++ b/docs/book-exercises/bda-exercises-19.Rmd
@@ -0,0 +1,329 @@
+---
+title: "Chapter 19 Exercises - Reproducing the the 'serial dilution assay'"
+date: "2021-12-10"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300)
+# dpi = 400, fig.width = 7, fig.height = 4.5, fig.retina = TRUE
+
+rfiles <- list.files(here::here("src"), full.names = TRUE, pattern = "R$")
+for (rfile in rfiles) {
+ source(rfile)
+}
+
+set.seed(0)
+```
+
+> I turned this exercise into a blog post on my [website](https://joshuacook.netlify.app).
+
+In chapter 19 "Parametric nonlinear models" of *Bayesian Data Analysis*[^1] by Gelman *et al.*, the authors present an example of fitting a curve to a [serial dilution](https://en.wikipedia.org/wiki/Serial_dilution) standard curve and using it to estimate unknown concentrations.
+Below, I build the model with Stan and fit it using MCMC.
+Unfortunately, I was unable to find the original data in Gelman's original publication of the model[^2].
+The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data.
+
+> The source code for this post is in a [repository](https://github.com/jhrcook/bayesian-data-analysis-course) of my work for Aki Vehtari's Bayesian Data Analysis [course](https://avehtari.github.io/BDA_course_Aalto/index.html).
+
+[^1]: Gelman, Andrew, et al. Bayesian Data Analysis. CRC Press, 2015.
+[^2]: Gelman A, Chew GL, Shnaidman M. Bayesian analysis of serial dilution assays. Biometrics. 2004 Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. https://pubmed.ncbi.nlm.nih.gov/15180666/
+
+## Setup
+
+```{r setup-show, message=FALSE, warning=FALSE}
+library(rstan)
+library(tidybayes)
+library(patchwork)
+library(tidyverse)
+
+options(mc.cores = parallel::detectCores())
+rstan_options(auto_write = TRUE)
+
+theme_set(
+ theme_classic() +
+ theme(
+ panel.grid.major = element_line(),
+ strip.background = element_blank(),
+ plot.title = element_text(hjust = 0.5)
+ )
+)
+
+SNS_BLUE <- "#1F77B4"
+STAN_RED <- "#B2171D"
+```
+
+As mentioned above, I couldn't find the original data, so I copied it from the book's figure 19.3 on page 473.
+
+```{r}
+dilution_standards_data <- tibble::tribble(
+ ~conc, ~dilution, ~y,
+ 0.64, 1, c(101.8, 121.4),
+ 0.32, 1 / 2, c(105.2, 114.1),
+ 0.16, 1 / 4, c(92.7, 93.3),
+ 0.08, 1 / 8, c(72.4, 61.1),
+ 0.04, 1 / 16, c(57.6, 50.0),
+ 0.02, 1 / 32, c(38.5, 35.1),
+ 0.01, 1 / 64, c(26.6, 25.0),
+ 0, 0, c(14.7, 14.2),
+) %>%
+ mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
+ unnest(c(y, rep))
+
+knitr::kable(dilution_standards_data)
+```
+
+The following plot shows the two standard dilution curves.
+They are quite similar.
+
+```{r}
+data_plot <- dilution_standards_data %>%
+ ggplot(aes(x = conc, y = y, color = rep)) +
+ geom_line(alpha = 0.5, linetype = 2) +
+ geom_point(alpha = 0.8) +
+ scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+ scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+ scale_color_brewer(type = "qual", palette = "Set1") +
+ theme(
+ legend.position = c(0.8, 0.2),
+ legend.background = element_blank()
+ ) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Serial dilution standard curve",
+ color = "replicate"
+ )
+
+data_plot
+```
+
+## Modeling
+
+### Model specification
+
+The model uses a normal likelihood to describe the distribution of measurements $y$ at a given concentration $x$.
+The mean of the likelihood is defined by the standard curve equation used in the field:
+
+$$
+\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + (x/\beta_3)^{-\beta_4}} \\
+$$
+
+
+The model is a scaled and shifted logistic curve.
+This structure results in the following interpretations for $\beta$, all of which are restricted to positive values:
+
+1. $\beta_1$: color intensity when the concentration is 0
+2. $\beta_2$: increase to saturation
+3. $\beta_3$: the inflection point of the curve
+4. $\beta_4$: rate of saturation
+
+Below are the prior distributions for $\beta$. Note that they are on drastically different scales; this is critical for helping the model fit the data.
+
+$$
+\beta_1 \sim N(10, 2.5) \\
+\beta_2 \sim N(100, 5) \\
+\beta_3 \sim N(0, 1) \\
+\beta_4 \sim N(0, 2.5)
+$$
+
+The measurement error of the model, representing the variance in the model's likelihood, is defined as follows:
+
+$$
+\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} \rgroup^{2\alpha} \sigma^2_y
+$$
+
+Here, $\alpha$, restricted to lie between 0 and 1, allows the variance to be higher for larger measurement values.
+$A$ is a constant (set to 30 by the authors) that allows $\sigma_y$ to be more easily interpreted as the variance from "typical" measurements.
+Below are the priors for the new variables in the model.
+
+$$
+\alpha \sim \text{Beta}(1, 1) \qquad
+\sigma \sim |N(0, 2.5)|
+$$
+
+### In Stan
+
+Below is the Stan code for the model.
+It looks very similar to the mathematical description of the model, a nice feature of the Stan probabilistic programming language.
+
+The mean and variance of the likelihood are calculated separately as `g` and `tau` so they can be used in both the `model` and `generated quantities` blocks without duplicating code.
+The `log_lik` values are calculated so that PSIS-LOO cross-validation can be estimated.
+I also included the ability to pass in new data, `xnew`, to make predictions for.
+
+```{r}
+dilution_model_file <- here::here("models", "serial-dilution.stan")
+writeLines(readLines(dilution_model_file))
+```
+
+### Sampling
+
+As mentioned above, carefully specifying the prior distribution for each $\beta$ is necessary for MCMC to sample accurately from the posterior.
+With the priors restricting each parameter to a plausible range, the model fit very well.
+
+```{r}
+xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
+model_data <- list(
+ N = nrow(dilution_standards_data),
+ A = 30,
+ x = dilution_standards_data$conc,
+ y = dilution_standards_data$y,
+ M = length(xnew),
+ xnew = xnew
+)
+
+dilution_model <- stan(
+ dilution_model_file,
+ model_name = "serial-dilution",
+ data = model_data,
+ refresh = 1000
+)
+```
+
+### Posterior distributions
+
+The next step is to analyze the posterior draws of the model.
+We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing ("fuzzy caterpillars") and checking diagnostic values such as $\widehat{R}$ and $n_\text{eff}$.
+The trace plots are shown below followed by a table of the posteriors with the diagnostic values.
+Everything looks good, suggesting MCMC was successful.
+
+```{r}
+model_pars <- c("beta", "alpha", "sigma")
+rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ theme(legend.position = "bottom")
+```
+
+```{r}
+print(dilution_model, pars = model_pars)
+```
+
+The following density plots show the posterior distributions of the model parameters $\beta$, $\alpha$, and $\sigma$.
+
+```{r}
+rstan::stan_dens(
+ dilution_model,
+ pars = model_pars,
+ separate_chains = FALSE,
+ alpha = 0.6
+) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0, 0.02)))
+```
+
+### Posterior predictive check
+
+Below is a plot of the posterior predictive distributions of the model on the original data.
+1,000 individual simulations are plotted in blue and the mean in black.
+The simulated curves visually appear to correspond well with the observed data, indicating the model fits well.
+
+```{r}
+dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
+ as.data.frame() %>%
+ as_tibble() %>%
+ set_names(seq(1, ncol(.))) %>%
+ mutate(draw = 1:n()) %>%
+ pivot_longer(-c(draw), names_to = "idx") %>%
+ left_join(
+ dilution_standards_data %>% mutate(idx = as.character(1:n())),
+ by = "idx"
+ )
+```
+
+```{r}
+plt_n_draws <- 1000
+plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
+
+ppc_mean <- dilution_post_pred %>%
+ group_by(conc) %>%
+ summarize(value = mean(value)) %>%
+ ungroup()
+
+dilution_post_pred %>%
+ filter(draw %in% !!plt_draws) %>%
+ mutate(grp = glue::glue("{draw}-{rep}")) %>%
+ ggplot(aes(x = conc, y = value)) +
+ geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
+ geom_line(group = "a", data = ppc_mean, color = "black") +
+ geom_point(data = ppc_mean, color = "black") +
+ geom_line(
+ aes(y = y, group = rep),
+ data = dilution_standards_data,
+ color = STAN_RED
+ ) +
+ geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Posterior predictive distribution"
+ )
+```
+
+I also had the model make posterior predictions on concentrations across the observed range at smaller step-sizes.
+The mean and 89% HDI are shown in blue below along with the observed data in red.
+The inset plot is a zoomed-in view of the posterior predictive distribution at the lower concentrations.
+
+```{r}
+ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
+ynew_hdi <- apply(
+ rstan::extract(dilution_model, pars = "ynew")$ynew,
+ 2,
+ bayestestR::hdi,
+ ci = 0.89
+)
+ynew_ppc <- tibble(
+ conc = xnew,
+ ynew_mean = ynew_mean,
+ ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
+ ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
+)
+```
+
+```{r}
+plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
+ ppc_df %>%
+ ggplot(aes(x = conc, y = ynew_mean)) +
+ geom_ribbon(
+ aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
+ fill = SNS_BLUE,
+ alpha = 0.5
+ ) +
+ geom_line(group = "a") +
+ geom_line(
+ aes(y = y, group = rep),
+ data = obs_df,
+ color = STAN_RED
+ ) +
+ geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02)))
+}
+
+ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Posterior predictive distribution"
+ )
+
+sub_max <- 0.04
+sub_ppc_plot <- plot_posterior_pred(
+ ynew_ppc %>% filter(conc <= sub_max),
+ dilution_standards_data %>% filter(conc <= sub_max),
+ pt_size = 0.6
+) +
+ theme(axis.title = element_blank())
+
+ppc_plot +
+ inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
+```
+
+---
+
+## Session info
+
+```{r}
+sessionInfo()
+```
diff --git a/docs/book-exercises/bda-exercises-19.html b/docs/book-exercises/bda-exercises-19.html
new file mode 100644
index 0000000..af4c12b
--- /dev/null
+++ b/docs/book-exercises/bda-exercises-19.html
@@ -0,0 +1,3123 @@
+  Chapter 19 Exercises - Reproducing the 'serial dilution assay'
+
Chapter 19 Exercises - Reproducing the ‘serial dilution assay’
+
+
+
+
+
+
+
+
+I turned this exercise into a blog post on my website.
+
+
In chapter 19 “Parametric nonlinear models” of Bayesian Data
+Analysis by Gelman et al., the
+authors present an example of fitting a curve to a serial dilution
+standard curve and using it to estimate unknown concentrations. Below, I
+build the model with Stan and fit it using MCMC. Unfortunately, I was
+unable to find the original data in Gelman’s original publication of the
+model. The best I could do was copy the
+data for the standard curve from a table in the book and build the model
+to fit that data.
+
+The source code for this post is in a repository
+of my work for Aki Vehtari’s Bayesian Data Analysis course.
+
+
Setup
+
+
As mentioned above, I couldn’t find the original data, so I copied it
+from the book’s figure 19.3 on page 473.
+
+
+
dilution_standards_data <- tibble::tribble(
+ ~conc, ~dilution, ~y,
+ 0.64, 1, c(101.8, 121.4),
+ 0.32, 1 / 2, c(105.2, 114.1),
+ 0.16, 1 / 4, c(92.7, 93.3),
+ 0.08, 1 / 8, c(72.4, 61.1),
+ 0.04, 1 / 16, c(57.6, 50.0),
+ 0.02, 1 / 32, c(38.5, 35.1),
+ 0.01, 1 / 64, c(26.6, 25.0),
+ 0, 0, c(14.7, 14.2),
+) %>%
+ mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
+ unnest(c(y, rep))
+
+knitr::kable(dilution_standards_data)
+
+| conc | dilution |     y | rep |
+|-----:|---------:|------:|:----|
+| 0.64 | 1.000000 | 101.8 | a   |
+| 0.64 | 1.000000 | 121.4 | b   |
+| 0.32 | 0.500000 | 105.2 | a   |
+| 0.32 | 0.500000 | 114.1 | b   |
+| 0.16 | 0.250000 |  92.7 | a   |
+| 0.16 | 0.250000 |  93.3 | b   |
+| 0.08 | 0.125000 |  72.4 | a   |
+| 0.08 | 0.125000 |  61.1 | b   |
+| 0.04 | 0.062500 |  57.6 | a   |
+| 0.04 | 0.062500 |  50.0 | b   |
+| 0.02 | 0.031250 |  38.5 | a   |
+| 0.02 | 0.031250 |  35.1 | b   |
+| 0.01 | 0.015625 |  26.6 | a   |
+| 0.01 | 0.015625 |  25.0 | b   |
+| 0.00 | 0.000000 |  14.7 | a   |
+| 0.00 | 0.000000 |  14.2 | b   |
+
The following plot shows the two standard dilution curves. They are
+quite similar.
+
+
+
data_plot <- dilution_standards_data %>%
+ ggplot(aes(x = conc, y = y, color = rep)) +
+ geom_line(alpha = 0.5, linetype = 2) +
+ geom_point(alpha = 0.8) +
+ scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+ scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
+ scale_color_brewer(type = "qual", palette = "Set1") +
+ theme(
+ legend.position = c(0.8, 0.2),
+ legend.background = element_blank()
+ ) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Serial dilution standard curve",
+ color = "replicate"
+ )
+
+data_plot
+
+
+
+
+
Modeling
+
Model specification
+
The model uses a normal likelihood to describe the distribution of
+measurements \(y\) at a given concentration \(x\). The mean of
+the likelihood is defined by the standard curve equation used in the
+field:
+
\[
+\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 +
+(x/\beta_3)^{-\beta_4}} \\
+\]
+
The model is a scaled and shifted logistic curve. This structure
+results in the following interpretations for \(\beta\), all of which are restricted to
+positive values:
+
+- \(\beta_1\): color intensity when
+the concentration is 0
+- \(\beta_2\): increase to
+saturation
+- \(\beta_3\): the inflection point
+of the curve
+- \(\beta_4\): rate of
+saturation
+
+
Below are the prior distributions for \(\beta\). Note that they are on drastically
+different scales; this is critical for helping the model fit the data.
+
\[
+\beta_1 \sim N(10, 2.5) \\
+\beta_2 \sim N(100, 5) \\
+\beta_3 \sim N(0, 1) \\
+\beta_4 \sim N(0, 2.5)
+\]
+
The measurement error of the model, representing the variance in the
+model’s likelihood, is defined as follows:
+
\[
+\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A}
+\rgroup^{2\alpha} \sigma^2_y
+\]
+
Here, \(\alpha\), restricted to lie
+between 0 and 1, allows the variance to be higher for larger measurement
+values. \(A\) is a constant (set to 30
+by the authors) that allows \(\sigma_y\) to be more easily interpreted as
+the variance from “typical” measurements. Below are the priors for the
+new variables in the model.
+
\[
+\alpha \sim \text{Beta}(1, 1) \qquad
+\sigma \sim |N(0, 2.5)|
+\]
+
In Stan
+
Below is the Stan code for the model. It looks very similar to the
+mathematical description of the model, a nice feature of the Stan
+probabilistic programming language.
+
The mean and variance of the likelihood are calculated
+separately as g and tau so they can be used in both the
+model and generated quantities blocks without duplicating code.
+The log_lik values are calculated so that PSIS-LOO cross-validation
+can be estimated. I also included the ability to pass in new data,
+xnew, to make predictions for.
+
+
+
dilution_model_file <- here::here("models", "serial-dilution.stan")
+writeLines(readLines(dilution_model_file))
+
+
+
#> data {
+#> int<lower=0> N; // number of data points
+#> int<lower=0> A; // constant used in model of measurement error
+#> vector<lower=0>[N] x; // concentration values
+#> vector<lower=0>[N] y; // observed color intensity
+#> int<lower=0> M; // number of new x values
+#> vector<lower=0>[M] xnew; // new x values
+#> }
+#>
+#> parameters {
+#> vector<lower=0>[4] beta;
+#> real<lower=0,upper=1> alpha;
+#> real<lower=0> sigma;
+#> }
+#>
+#> transformed parameters {
+#> vector<lower=0>[N] g;
+#> vector<lower=0>[N] tau;
+#>
+#> for (i in 1:N) {
+#> g[i] = beta[1] + beta[2] / (1 + (x[i] / beta[3]) ^ (-beta[4]));
+#> tau[i] = ((g[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+#> }
+#> }
+#>
+#> model {
+#> // Priors
+#> alpha ~ beta(1, 1);
+#> beta[1] ~ normal(10, 2.5);
+#> beta[2] ~ normal(100, 5);
+#> beta[3] ~ normal(0, 1);
+#> beta[4] ~ normal(0, 2.5);
+#> sigma ~ normal(0, 2.5);
+#>
+#> // Likelihood
+#> for (i in 1:N) {
+#> y[i] ~ normal(g[i], tau[i]);
+#> }
+#> }
+#>
+#> generated quantities {
+#> vector[N] ypred;
+#> vector[N] log_lik;
+#>
+#> vector[M] g_hat;
+#> vector[M] tau_hat;
+#> vector[M] ynew;
+#>
+#> for (i in 1:N) {
+#> ypred[i] = normal_rng(g[i], tau[i]);
+#> log_lik[i] = normal_lpdf(y[i] | g[i], tau[i]);
+#> }
+#>
+#> for (i in 1:M) {
+#> g_hat[i] = beta[1] + beta[2] / (1 + (xnew[i] / beta[3]) ^ (-beta[4]));
+#> tau_hat[i] = ((g_hat[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+#> ynew[i] = normal_rng(g_hat[i], tau_hat[i]);
+#> }
+#> }
+
+
Sampling
+
As mentioned above, carefully specifying the prior distribution for
+each \(\beta\) is necessary for MCMC to
+sample accurately from the posterior. With the priors restricting each
+parameter to a plausible range, the model fit very well.
+
+
+
xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
+model_data <- list(
+ N = nrow(dilution_standards_data),
+ A = 30,
+ x = dilution_standards_data$conc,
+ y = dilution_standards_data$y,
+ M = length(xnew),
+ xnew = xnew
+)
+
+dilution_model <- stan(
+ dilution_model_file,
+ model_name = "serial-dilution",
+ data = model_data,
+ refresh = 1000
+)
+
+
+
+
Posterior distributions
+
The next step is to analyze the posterior draws of the model. We can
+check the success of MCMC by visualizing the traces of the chains,
+looking for good mixing (“fuzzy caterpillars”) and checking diagnostic
+values such as \(\widehat{R}\) and
+\(n_\text{eff}\). The trace plots are
+shown below followed by a table of the posteriors with the diagnostic
+values. Everything looks good, suggesting MCMC was successful.
+
+
+
model_pars <- c("beta", "alpha", "sigma")
+rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ theme(legend.position = "bottom")
+
+
+
+
+
+
+
print(dilution_model, pars = model_pars)
+
+
+
#> Inference for Stan model: serial-dilution.
+#> 4 chains, each with iter=2000; warmup=1000; thin=1;
+#> post-warmup draws per chain=1000, total post-warmup draws=4000.
+#>
+#> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
+#> beta[1] 14.33 0.02 0.55 13.12 14.08 14.36 14.61 15.33 1074
+#> beta[2] 102.70 0.10 4.30 94.48 99.67 102.61 105.60 111.32 1755
+#> beta[3] 0.06 0.00 0.01 0.05 0.06 0.06 0.07 0.08 1633
+#> beta[4] 1.13 0.00 0.08 0.97 1.07 1.12 1.18 1.29 1870
+#> alpha 0.72 0.01 0.17 0.33 0.61 0.74 0.85 0.98 1074
+#> sigma 1.33 0.01 0.24 0.98 1.16 1.30 1.45 1.92 893
+#> Rhat
+#> beta[1] 1
+#> beta[2] 1
+#> beta[3] 1
+#> beta[4] 1
+#> alpha 1
+#> sigma 1
+#>
+#> Samples were drawn using NUTS(diag_e) at Sat Jan 22 11:53:33 2022.
+#> For each parameter, n_eff is a crude measure of effective sample size,
+#> and Rhat is the potential scale reduction factor on split chains (at
+#> convergence, Rhat=1).
+
+
The following density plots show the posterior distributions of the
+model parameters \(\beta\), \(\alpha\), and \(\sigma\).
+
+
+
rstan::stan_dens(
+ dilution_model,
+ pars = model_pars,
+ separate_chains = FALSE,
+ alpha = 0.6
+) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0, 0.02)))
+
+
+
+
+
Posterior predictive check
+
Below is a plot of the posterior predictive distributions of the
+model on the original data. 1,000 individual simulations are plotted in
+blue and the mean in black. The simulated curves visually appear to
+correspond well with the observed data, indicating the model fits
+well.
+
+
+
dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
+ as.data.frame() %>%
+ as_tibble() %>%
+ set_names(seq(1, ncol(.))) %>%
+ mutate(draw = 1:n()) %>%
+ pivot_longer(-c(draw), names_to = "idx") %>%
+ left_join(
+ dilution_standards_data %>% mutate(idx = as.character(1:n())),
+ by = "idx"
+ )
+
+
+
+
+
+
plt_n_draws <- 1000
+plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
+
+ppc_mean <- dilution_post_pred %>%
+ group_by(conc) %>%
+ summarize(value = mean(value)) %>%
+ ungroup()
+
+dilution_post_pred %>%
+ filter(draw %in% !!plt_draws) %>%
+ mutate(grp = glue::glue("{draw}-{rep}")) %>%
+ ggplot(aes(x = conc, y = value)) +
+ geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
+ geom_line(group = "a", data = ppc_mean, color = "black") +
+ geom_point(data = ppc_mean, color = "black") +
+ geom_line(
+ aes(y = y, group = rep),
+ data = dilution_standards_data,
+ color = STAN_RED
+ ) +
+ geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Posterior predictive distribution"
+ )
+
+
+
+
+
I also had the model make posterior predictions on concentrations
+across the observed range at smaller step-sizes. The mean and 89% HDI
+are shown in blue below along with the observed data in red. The inset
+plot is a zoomed-in view of the posterior predictive distribution at the
+lower concentrations.
+
+
+
ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
+ynew_hdi <- apply(
+ rstan::extract(dilution_model, pars = "ynew")$ynew,
+ 2,
+ bayestestR::hdi,
+ ci = 0.89
+)
+ynew_ppc <- tibble(
+ conc = xnew,
+ ynew_mean = ynew_mean,
+ ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
+ ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
+)
+
+
+
+
+
+
plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
+ ppc_df %>%
+ ggplot(aes(x = conc, y = ynew_mean)) +
+ geom_ribbon(
+ aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
+ fill = SNS_BLUE,
+ alpha = 0.5
+ ) +
+ geom_line(group = "a") +
+ geom_line(
+ aes(y = y, group = rep),
+ data = obs_df,
+ color = STAN_RED
+ ) +
+ geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02)))
+}
+
+ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
+ labs(
+ x = "concentration",
+ y = "y",
+ title = "Posterior predictive distribution"
+ )
+
+sub_max <- 0.04
+sub_ppc_plot <- plot_posterior_pred(
+ ynew_ppc %>% filter(conc <= sub_max),
+ dilution_standards_data %>% filter(conc <= sub_max),
+ pt_size = 0.6
+) +
+ theme(axis.title = element_blank())
+
+ppc_plot +
+ inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
+
+
+
+
+
+
Session info
+
+
+
#> R version 4.1.2 (2021-11-01)
+#> Platform: x86_64-apple-darwin17.0 (64-bit)
+#> Running under: macOS Big Sur 10.16
+#>
+#> Matrix products: default
+#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
+#>
+#> locale:
+#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
+#>
+#> attached base packages:
+#> [1] stats graphics grDevices datasets utils methods
+#> [7] base
+#>
+#> other attached packages:
+#> [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
+#> [4] purrr_0.3.4 readr_2.0.1 tidyr_1.1.3
+#> [7] tibble_3.1.3 tidyverse_1.3.1 patchwork_1.1.1
+#> [10] tidybayes_3.0.1 rstan_2.21.2 ggplot2_3.3.5
+#> [13] StanHeaders_2.21.0-7
+#>
+#> loaded via a namespace (and not attached):
+#> [1] fs_1.5.0 matrixStats_0.61.0 lubridate_1.7.10
+#> [4] insight_0.14.4 RColorBrewer_1.1-2 httr_1.4.2
+#> [7] rprojroot_2.0.2 tensorA_0.36.2 tools_4.1.2
+#> [10] backports_1.2.1 bslib_0.2.5.1 utf8_1.2.2
+#> [13] R6_2.5.0 DBI_1.1.1 colorspace_2.0-2
+#> [16] ggdist_3.0.0 withr_2.4.2 tidyselect_1.1.1
+#> [19] gridExtra_2.3 prettyunits_1.1.1 processx_3.5.2
+#> [22] downlit_0.2.1 curl_4.3.2 compiler_4.1.2
+#> [25] rvest_1.0.1 cli_3.0.1 arrayhelpers_1.1-0
+#> [28] xml2_1.3.2 bayestestR_0.11.0 labeling_0.4.2
+#> [31] posterior_1.1.0 sass_0.4.0 scales_1.1.1
+#> [34] checkmate_2.0.0 callr_3.7.0 digest_0.6.27
+#> [37] rmarkdown_2.10 pkgconfig_2.0.3 htmltools_0.5.1.1
+#> [40] highr_0.9 dbplyr_2.1.1 rlang_0.4.11
+#> [43] readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.4
+#> [46] farver_2.1.0 generics_0.1.0 svUnit_1.0.6
+#> [49] jsonlite_1.7.2 distill_1.2 distributional_0.2.2
+#> [52] inline_0.3.19 magrittr_2.0.1 loo_2.4.1
+#> [55] Rcpp_1.0.7 munsell_0.5.0 fansi_0.5.0
+#> [58] abind_1.4-5 lifecycle_1.0.0 stringi_1.7.3
+#> [61] yaml_2.2.1 pkgbuild_1.2.0 grid_4.1.2
+#> [64] parallel_4.1.2 crayon_1.4.1 lattice_0.20-45
+#> [67] haven_2.4.3 hms_1.1.0 knitr_1.33
+#> [70] ps_1.6.0 pillar_1.6.2 codetools_0.2-18
+#> [73] clisymbols_1.2.0 stats4_4.1.2 reprex_2.0.1
+#> [76] glue_1.4.2 evaluate_0.14 V8_3.4.2
+#> [79] renv_0.14.0 RcppParallel_5.1.4 modelr_0.1.8
+#> [82] vctrs_0.3.8 tzdb_0.1.2 cellranger_1.1.0
+#> [85] gtable_0.3.0 datawizard_0.2.1 assertthat_0.2.1
+#> [88] xfun_0.25 broom_0.7.9 coda_0.19-4
+#> [91] ellipsis_0.3.2 here_1.0.1
+
+
diff --git a/docs/index.html b/docs/index.html
index 6b08346..1e6247e 100644
--- a/docs/index.html
+++ b/docs/index.html
+<!-- generated site index: adds navbar links for notes 13-21 and the chapter 19 exercises,
+     reflows the "How to study" text, extends the course sections table with rows for
+     chapters 14-23 (notes links plus the serial dilution Stan model), and links the
+     serial dilution Stan program under "Stan models" -->
diff --git a/docs/models/serial-dilution.rds b/docs/models/serial-dilution.rds
new file mode 100644
index 0000000..43fa9ea
Binary files /dev/null and b/docs/models/serial-dilution.rds differ
diff --git a/docs/models/serial-dilution.stan b/docs/models/serial-dilution.stan
new file mode 100644
index 0000000..1dab7e7
--- /dev/null
+++ b/docs/models/serial-dilution.stan
@@ -0,0 +1,59 @@
+data {
+  int<lower=0> N;       // number of data points
+ int A; // constant used in model of measurement error
+ vector[N] x; // concentration values
+ vector[N] y; // observed color intensity
+ int M; // number of new x values
+ vector[M] xnew; // new x values
+}
+
+parameters {
+ vector[4] beta;
+  real<lower=0, upper=1> alpha;
+  real<lower=0> sigma;
+}
+
+transformed parameters {
+ vector[N] g;
+ vector[N] tau;
+
+ for (i in 1:N) {
+ g[i] = beta[1] + beta[2] / (1 + (x[i] / beta[3]) ^ (-beta[4]));
+ tau[i] = ((g[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+ }
+}
+
+model {
+ // Priors
+ alpha ~ beta(1, 1);
+ beta[1] ~ normal(10, 2.5);
+ beta[2] ~ normal(100, 5);
+ beta[3] ~ normal(0, 1);
+ beta[4] ~ normal(0, 2.5);
+ sigma ~ normal(0, 2.5);
+
+ // Likelihood
+ for (i in 1:N) {
+ y[i] ~ normal(g[i], tau[i]);
+ }
+}
+
+generated quantities {
+ vector[N] ypred;
+ vector[N] log_lik;
+
+ vector[M] g_hat;
+ vector[M] tau_hat;
+ vector[M] ynew;
+
+ for (i in 1:N) {
+ ypred[i] = normal_rng(g[i], tau[i]);
+ log_lik[i] = normal_lpdf(y[i] | g[i], tau[i]);
+ }
+
+ for (i in 1:M) {
+ g_hat[i] = beta[1] + beta[2] / (1 + (xnew[i] / beta[3]) ^ (-beta[4]));
+ tau_hat[i] = ((g_hat[i] / A) ^ (2.0 * alpha)) * (sigma ^ 2.0);
+ ynew[i] = normal_rng(g_hat[i], tau_hat[i]);
+ }
+}
diff --git a/docs/notes/13_intro-to-regression-models_bda3-14.Rmd b/docs/notes/13_intro-to-regression-models_bda3-14.Rmd
new file mode 100644
index 0000000..48c861f
--- /dev/null
+++ b/docs/notes/13_intro-to-regression-models_bda3-14.Rmd
@@ -0,0 +1,78 @@
+---
+title: "13. Notes on 'Ch 14. Introduction to regression models'"
+date: "2021-12-06"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 14. Introduction to regression models
+
+### 14.1 Conditional modeling
+
+- question: how does one quantity $y$ vary as a function of another quantity or vector of quantities $x$?
+ - conditional distribution of $y$ given $x$ parameterized as $p(y|\theta,x)$
+- key statistical modeling issues:
+ 1. defining $y$ and $x$ so that $y$ is reasonably linear as a function of the columns of $X$
+ - may need to transform $x$
+ 2. set priors on the model parameters
+
+### 14.2 Bayesian analysis of classical regression
+
+- simplest case: *ordinary linear regression*
+ - observation errors are independent and have equal variance
+
+$$
+y | \beta, \sigma, X \sim \text{N}(X \beta, \sigma^2 I)
+$$
+
+#### Posterior predictive distribution for new data
+
+- posterior predictive distribution has two sources of uncertainty:
+ 1. the inherent variability in the model represented by $\sigma$ in $y$
+ 2. posterior uncertainty in $\beta$ and $\sigma$
+- draw a random sample $\tilde{y}$ from the posterior predictive distribution:
+ - draw $(\beta, \sigma)$ from their posteriors
+ - draw $\tilde{y} \sim \text{N}(\tilde{X} \beta, \sigma^2 I)$
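+
+A minimal sketch (not from the book) of these two steps, where `beta_draws`, `sigma_draws`, and `X_new` are hypothetical stand-ins for posterior draws and a new design matrix:
+
+```{r}
+# stand-in posterior draws and new design matrix (purely illustrative)
+n_draws <- 1000
+p <- 3
+n_new <- 5
+beta_draws <- matrix(rnorm(n_draws * p), ncol = p)
+sigma_draws <- abs(rnorm(n_draws, mean = 1, sd = 0.1))
+X_new <- cbind(1, matrix(rnorm(n_new * (p - 1)), ncol = p - 1))
+
+# one row of y-tilde per posterior draw: y_tilde ~ N(X_new %*% beta, sigma^2 I)
+y_tilde <- t(sapply(seq_len(n_draws), function(s) {
+  mu <- as.vector(X_new %*% beta_draws[s, ])
+  rnorm(n_new, mean = mu, sd = sigma_draws[s])
+}))
+dim(y_tilde) # n_draws x n_new
+```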
+
+## 14.4 Goals of regression analysis
+
+- at least three goals:
+ 1. understand the behavior of $y$ given $x$
+ 2. predict $y$ given $x$
+ 3. causal inference; predict how $y$ would change if $x$ were changed
+
+## 14.5 Assembling the matrix of explanatory variables
+
+### Identifiability and collinearity
+
+- "the parameters in a classical regression cannot be uniquely estimated if there are more parameters than data points or, more generally, if the columns of the matrix $X$ of explanatory variables are not linearly independent" (pg 365)
+
+### Nonlinear relations
+
+- may need to transform variables
+- can include more than one transformation in the model as separate covariates
+- GLMs and non-linear models are discussed in later chapters
+
+### Indicator variables
+
+- include a categorical variable in a regression using an indicator variable
+ - separate effect for each category
+ - or model as related with a hierarchical model
+
+### Interactions
+
+- "If the response to a unit change in $x_i$ depends in what value another predictor $x_j$ has been fixed at, then it is necessary to include *interaction* terms in the model" (pg 367)
+ - $(x_i - \bar{x_i})(x_j - \bar{x_j})$
+
+## 14.6 Regularization and dimension reduction
+
+- see the lecture notes on regularization for more up-to-date recommendations
+- "Bayesian regularization":
+ - location and scale of the prior
+ - analytic form of the prior (e.g. normal vs. Laplacian vs. Cauchy)
+ - how the posterior inference is summarized
diff --git a/docs/notes/13_intro-to-regression-models_bda3-14.html b/docs/notes/13_intro-to-regression-models_bda3-14.html
new file mode 100644
index 0000000..a7c86c7
--- /dev/null
+++ b/docs/notes/13_intro-to-regression-models_bda3-14.html
@@ -0,0 +1,2607 @@
+<!-- generated HTML for notes/13_intro-to-regression-models_bda3-14.Rmd; content duplicates the Rmd source above -->
diff --git a/docs/notes/14_hierarchical-linear-models_bda3-15.Rmd b/docs/notes/14_hierarchical-linear-models_bda3-15.Rmd
new file mode 100644
index 0000000..8f15df1
--- /dev/null
+++ b/docs/notes/14_hierarchical-linear-models_bda3-15.Rmd
@@ -0,0 +1,18 @@
+---
+title: "14. Notes on 'Ch 15. Hierarchical linear models'"
+date: "2021-12-09"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+> Also, these notes are rather sparse because I have a bit of experience with these models already.
+
+## Chapter 15. Hierarchical linear models
+
+- "Hierarchical regression models are useful as soon as there are predictors at different levels of variation" (pg 381)
+ - or for data obtained by stratified/clustered sampling
+- general recommendation to start simple and build up the model's complexity
diff --git a/docs/notes/14_hierarchical-linear-models_bda3-15.html b/docs/notes/14_hierarchical-linear-models_bda3-15.html
new file mode 100644
index 0000000..81ef91e
--- /dev/null
+++ b/docs/notes/14_hierarchical-linear-models_bda3-15.html
@@ -0,0 +1,2520 @@
+<!-- generated HTML for notes/14_hierarchical-linear-models_bda3-15.Rmd; content duplicates the Rmd source above -->
diff --git a/docs/notes/17_parametric-nonlinear-models_bda3-19.Rmd b/docs/notes/17_parametric-nonlinear-models_bda3-19.Rmd
new file mode 100644
index 0000000..a286d06
--- /dev/null
+++ b/docs/notes/17_parametric-nonlinear-models_bda3-19.Rmd
@@ -0,0 +1,94 @@
+---
+title: "17. Notes on 'Ch 19. Parametric nonlinear models'"
+date: "2021-12-09"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 19. Parametric nonlinear models
+
+- examples:
+  - a ratio: $\text{E}(y) = \frac{a_1 + b_1 x_1}{a_2 + b_2 x_2}$
+ - sum of nonlinear functions: $\text{E}(y) = A_1 e^{-\alpha_1x} + A_2 e^{-\alpha_2x}$
+- parameters of a nonlinear model are often harder to interpret
+ - often requires custom visualization techniques
+- **"Generally each new modeling problem must be tackled afresh."** (pg 471)
+ - these models are less systematic than linear modeling
+
+### 19.1 Example: serial dilution assay
+
+- estimate 10 unknown concentrations of an allergen based on serial dilutions of a known standard
+
+#### The model
+
+- **Notation**:
+  - parameters of interest: concentrations of unknown samples $\theta_1, \dots, \theta_{10}$
+ - known concentration of the standard $\theta_0$
+ - dilution of measure $i$ as $x_i$ and color intensity (measurement) as $y_i$
+- **Curve of expected measurements given the concentration**
+ - use the following equation that is standard in the field
+ - parameters:
+ - $\beta_1$: color intensity at the limit of 0 concentration
+ - $\beta_2$: the increase to saturation
+ - $\beta_3$: concentration at which the gradient of the curve turns
+ - $\beta_4$: rate at which saturation occurs
+
+$$
+\text{E}(y | x, \beta) =
+ g(x, \beta) =
+ \beta_1 + \frac{\beta_2}{1 + (x / \beta_3)^{-\beta_4}}
+$$
+
+- **Measurement error**
+ - modeled as normally distributed with unequal variances
+ - parameters:
+ - $\alpha$: models the pattern that variances are higher for larger measurements
+ - restricted $[0, 1]$
+    - $A$ is an arbitrary constant to scale the data so $\sigma$ can be interpreted as the deviation from "typical" values
+ - $\sigma$: deviation of a measure from the "typical"
+
+$$
+y_i \sim \text{N}(g(x_i, \beta), (\frac{g(x_i, \beta)}{A})^{2\alpha} \sigma_y^2)
+$$
+
+- **Dilution errors**
+ - two possible sources:
+ 1. *initial dilution*: the accuracy of the creation of the initial standard concentration
+ 2. *serial dilutions*: error in creation of the subsequent dilutions (low enough to ignore for this analysis)
+ - use a normal model on the log scale of the initial dilution error
+ - parameters:
+ - $\theta_0$: known concentration of the standard solution
+ - $d_0^\text{init}$: known initial dilution of the standard that is called for
+ - without error, the concentration of the initial solution would be $d_0^\text{init} \theta_0$
+ - $x_0^\text{init}$: the *actual* (unknown) concentration of the initial dilution
+
+$$
+\log(x_0^\text{init}) \sim \text{N}(\log(d_0^\text{init} \cdot \theta_0), (\sigma^\text{init})^2)
+$$
+
+- **Dilution errors** (cont)
+ - there is no initial dilution for the unknown samples being tested
+ - therefore, the unknown initial concentration for sample $j$ is $x^\text{init} = \theta_j$ for $j = 1, \dots, 10$
+ - for the dilutions of the unknown samples, set $x_i = d_i \cdot x_{j(i)}^\text{init}$
+ - $j(i)$ is the sample $j$ corresponding to measurement $i$
+ - $d_i$ is the dilution of measurement $i$ relative to the initial concentration
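+
+A minimal sketch (not the book's data or fit) of the pieces above: the standard curve $g(x, \beta)$ with a $\pm 2$ standard deviation band implied by the heteroskedastic measurement error; all parameter values are made up for illustration.
+
+```{r}
+# hypothetical curve and error parameters (illustrative only)
+g <- function(x, beta) beta[1] + beta[2] / (1 + (x / beta[3])^(-beta[4]))
+beta <- c(15, 100, 0.05, 1.2)
+alpha <- 0.8
+A <- 30
+sigma <- 2
+
+x <- seq(0.001, 1, length.out = 200)
+mu <- g(x, beta)
+s <- sqrt((mu / A)^(2 * alpha) * sigma^2) # sd grows with the expected intensity
+
+plot(x, mu, type = "l", xlab = "concentration", ylab = "expected color intensity")
+lines(x, mu + 2 * s, lty = 2)
+lines(x, mu - 2 * s, lty = 2)
+```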
+
+#### Prior distributions
+
+- priors used as described in the book (likely different from what would be recommended now):
+ - $\log(\beta) \sim U(-\infty, \infty)$
+ - $\sigma_y \sim U(0, \infty)$
+ - $\alpha \sim U(0,1)$
+ - $p(\log \theta_j) \propto 1$ for each unknown $j = 1, \dots, 10$
+- cannot estimate $\sigma^\text{init}$ because we only have a single standard
+ - use a fixed value of 0.02 based on a previous analysis of different plates
+
+### 19.2 Example: population toxicokinetics
+
+- this is a more complex model
+- uses a physiological model with parameters that cannot be solely determined using the data
+ - requires informative priors based on previous studies
diff --git a/docs/notes/17_parametric-nonlinear-models_bda3-19.html b/docs/notes/17_parametric-nonlinear-models_bda3-19.html
new file mode 100644
index 0000000..625c447
--- /dev/null
+++ b/docs/notes/17_parametric-nonlinear-models_bda3-19.html
@@ -0,0 +1,2629 @@
+<!-- generated HTML for notes/17_parametric-nonlinear-models_bda3-19.Rmd; content duplicates the Rmd source above -->
diff --git a/docs/notes/18_basis-function-models_bda3-20.Rmd b/docs/notes/18_basis-function-models_bda3-20.Rmd
new file mode 100644
index 0000000..181d612
--- /dev/null
+++ b/docs/notes/18_basis-function-models_bda3-20.Rmd
@@ -0,0 +1,64 @@
+---
+title: "18. Notes on 'Ch 20. Basis function models'"
+date: "2022-01-12"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 20. Basis function models
+
+- chapter 19 focused on nonlinear models with $\text{E}(y|X,\beta) = \mu(X_i,\phi)$ where $\mu$ is a parametric nonlinear function of unknowns $\phi$
+- in this and following chapters, consider models where $\mu$ is also unknown
+
+### 20.1 Splines and weighted sums of basis functions
+
+- replace $X_i \beta$ with $\mu(X_i)$ where $\mu(\cdot)$ is some class of nonlinear functions
+ - different options for modeling $\mu$ including with basis function expansions or Gaussian processes (next chapter)
+- basis function approach: $\mu(x) = \sum_{h=1}^{H} \beta_h b_h(x)$
+ - $b_h$: set of basis functions
+ - $\beta_h$: vector of basis coefficients
+- common choices for basis functions are:
+ - **Gaussian radial basis functions**: multiple centers of the basis functions with a width parameter controlling a set of Gaussian functions
+ - **B-spline**: a piecewise continuous function based on a set of knots
+    - knot locations control the flexibility of the basis
+    - knots can be placed uniformly or non-uniformly (e.g. based on the density of the data)
+    - can use a "free knot approach" with a prior on the number and location of knots, but it is computationally demanding
+ - instead can use priors on the coefficients $\beta$ to shrink values to near 0
+
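+A minimal sketch (not from the book) of the weighted-sum-of-basis-functions idea above, using a B-spline basis from `splines::bs()`; the degrees of freedom and coefficients are arbitrary illustrations.
+
+```{r}
+library(splines)
+
+set.seed(0)
+x <- seq(0, 1, length.out = 200)
+B <- bs(x, df = 7, intercept = TRUE) # basis matrix; one column per b_h(x)
+beta <- rnorm(ncol(B))               # basis coefficients beta_h
+mu <- as.vector(B %*% beta)          # mu(x) = sum_h beta_h * b_h(x)
+
+plot(x, mu, type = "l", ylab = "mu(x)")
+```
+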
+## 20.2 Basis selection and shrinkage coefficients
+
+- common to not know which basis functions are really needed
+- can use a variable selection approach to allow the model to estimate the "importance" of each basis function
+ - can then either select the best model from the posterior or average over all possible models by weighting each basis by its importance
+- some bias is possible from the initial choice of basis functions
+ - implied prior information on the smoothness and shape of the model
+ - can include multiple types of basis functions in the initial collection
+
+### Shrinkage priors
+
+- allowing basis function coefficients to be zero with positive probability represents a challenge for sampling from the posterior
+ - with many basis functions, it is computationally infeasible to visit all possible states
+- may be better to use shrinkage priors instead
+ - there are various options discussed in the book and there are likely others recommended now
+
+## 20.3 Non-normal models and regression surfaces
+
+### Other error distributions
+
+- may want to model data where the output variable $y$ is not continuous or the residuals are not Gaussian
+ - can modify the residual densities with different prior distributions to accommodate outliers
+ - can use the basis function and its coefficients $\eta_i = w_i \beta$ as the linear component in a GLM
+
+### Multivariate regression surfaces
+
+- careful with *curse of dimensionality*
+- one option is to assume additivity of the covariates so the surface can be modeled as a sum of univariate regression functions
+ - this does not always make sense and a different approach using a tensor product is described at the end of the section
+- can use informative priors to help restrict the search space
+  - this is a form of prior information, so the results are still proper Bayesian inferences with measures of uncertainty
+ - e.g. if we know *a priori* that the mean response variable is non-decreasing, restrict the coefficients $\beta$ to be non-negative
diff --git a/docs/notes/18_basis-function-models_bda3-20.html b/docs/notes/18_basis-function-models_bda3-20.html
new file mode 100644
index 0000000..fc1f146
--- /dev/null
+++ b/docs/notes/18_basis-function-models_bda3-20.html
@@ -0,0 +1,2595 @@
+<!-- generated HTML for notes/18_basis-function-models_bda3-20.Rmd; content duplicates the Rmd source above -->
diff --git a/docs/notes/19_gaussian-processes_bda3-21.Rmd b/docs/notes/19_gaussian-processes_bda3-21.Rmd
new file mode 100644
index 0000000..6dac644
--- /dev/null
+++ b/docs/notes/19_gaussian-processes_bda3-21.Rmd
@@ -0,0 +1,147 @@
+---
+title: "19. Notes on 'Ch 21. Gaussian process models'"
+date: "2022-01-13"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+
+library(glue)
+library(ggtext)
+library(tidyverse)
+
+theme_set(
+ theme_bw() +
+ theme(axis.ticks = element_blank(), strip.background = element_blank())
+)
+```
+
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 21. Gaussian process models
+
+- *Gaussian process* (GP): "flexible class of models for which any finite-dimensional marginal distribution is Gaussian" (pg. 501)
+ - "can be viewed as a potentially infinite-dimensional generalization of Gaussian distribution" (pg. 501)
+
+### 21.1 Gaussian process regression
+
+- realizations from a GP correspond to random functions
+ - good prior for an unknown regression function $\mu(x)$
+- $\mu \sim \text{GP}(m,k)$
+ - $m$: mean function
+ - $k$: covariance function
+- $\mu$ is a random function ("stochastic process") where the values at any $n$ points $x_1, \dots, x_n$ are drawn from the $n$-dimensional normal distribution
+ - with mean $m$ and covariance $K$:
+
+$$
+\mu(x_1), \dots, \mu(x_n) \sim \text{N}((m(x_1), \dots, m(x_n)), K(x_1, \dots, x_n))
+$$
+
+- the GP $\mu \sim \text{GP}(m,k)$ is nonparametric with infinitely many parameters
+  - the mean function $m$ represents an initial guess at the regression function
+ - the covariance function $k$ represents the covariance between the process at any two points
+ - controls the smoothness of realizations from the GP and degree of shrinkage towards the mean
+- below is an example of realizations from a GP with mean function 0 and the *squared exponential* (a.k.a. exponentiated quadratic, Gaussian) covariance function with different parameters
+
+$$
+k(x, x^\prime) = \tau^2 \exp(-\frac{|x-x^\prime|^2}{2l^2})
+$$
+
+```{r}
+squared_exponential_cov <- function(x, tau, l) {
+ n <- length(x)
+ k <- matrix(0, nrow = n, ncol = n)
+ denom <- 2 * (l^2)
+ for (i in 1:n) {
+ for (j in 1:n) {
+ a <- x[i]
+ b <- x[j]
+ k[i, j] <- tau^2 * exp(-(abs(a - b)^2) / (denom))
+ }
+ }
+ return(k)
+}
+
+my_gaussian_process <- function(x, tau, l, n = 3) {
+ m <- rep(0, length(x))
+ k <- squared_exponential_cov(x = x, tau = tau, l = l)
+ gp_samples <- mvtnorm::rmvnorm(n = n, mean = m, sigma = k)
+ return(gp_samples)
+}
+
+tidy_gp <- function(x, tau, l, n = 3) {
+ my_gaussian_process(x = x, tau = tau, l = l, n = n) %>%
+ as.data.frame() %>%
+ as_tibble() %>%
+ set_names(x) %>%
+ mutate(sample_idx = as.character(1:n())) %>%
+ pivot_longer(-sample_idx, names_to = "x", values_to = "y") %>%
+ mutate(x = as.numeric(x))
+}
+
+set.seed(0)
+x <- seq(0, 10, by = 0.1)
+gp_samples <- tibble(tau = c(0.25, 0.5, 0.25, 0.5), l = c(0.5, 0.5, 2, 2)) %>%
+ mutate(samples = purrr::map2(tau, l, ~ tidy_gp(x = x, tau = .x, l = .y, n = 3))) %>%
+ unnest(samples)
+
+gp_samples %>%
+ mutate(grp = glue("\u03C4 = {tau}, \u2113 = {l}")) %>%
+ ggplot(aes(x = x, y = y)) +
+ facet_wrap(vars(grp), nrow = 2) +
+ geom_line(aes(color = sample_idx)) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ scale_color_brewer(type = "qual", palette = "Set1") +
+ theme(legend.position = "none", axis.text.y = element_markdown()) +
+ labs(x = "x", y = "\u03BC(x)")
+```
+
+#### Covariance functions
+
+- "Different covariance functions can be used to add structural prior assumptions like smoothness, nonstationarity, periodicity, and multi-scale or hierarchical structures." (pg. 502)
+ - sums and products of GPs are also GPs so can combine them in the same model
+- can also use "anisotropic" GP covariance functions for multiple predictors
+
+#### Inference
+
+- computing the mean and covariance in the $n$-variate normal conditional posterior for $\tilde{\mu}$ involves a matrix inversion that requires $O(n^3)$ computation
+ - this needs to be repeated for each MCMC step
+ - limits the size of the data set and number of covariates in a model
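+
+As a rough illustration of where the $O(n^3)$ cost comes from, below is a minimal sketch (not from the book) of the conditional posterior mean of $\mu$ at new points, reusing `squared_exponential_cov()` from above; the data, noise level, and hyperparameters are made up.
+
+```{r}
+set.seed(1)
+x_obs <- runif(30, 0, 10)
+y_obs <- sin(x_obs) + rnorm(30, sd = 0.2)
+x_new <- seq(0, 10, by = 0.2)
+sigma_noise <- 0.2
+
+k_all <- squared_exponential_cov(c(x_obs, x_new), tau = 1, l = 1)
+idx <- seq_along(x_obs)
+K <- k_all[idx, idx] + diag(sigma_noise^2, length(x_obs))
+K_star <- k_all[-idx, idx]
+
+# conditional mean of mu(x_new) with a zero mean function: K_* K^{-1} y.
+# the solve() of the n x n matrix K is the O(n^3) step repeated at each MCMC iteration.
+mu_new <- as.vector(K_star %*% solve(K, y_obs))
+plot(x_new, mu_new, type = "l", xlab = "x", ylab = "posterior mean of \u03BC(x)")
+points(x_obs, y_obs)
+```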
+
+#### Covariance function approximations
+
+- there are approximations to the GP that can speed up computation
+ - generally work by reducing the matrix inversion burden
+
+### 21.3 Latent Gaussian process models
+
+- with non-Gaussian likelihoods, the GP prior becomes a latent function $f$ which determines the likelihood $p(y|f,\phi)$ through a link function
+
+### 21.4 Functional data analysis
+
+- *functional data analysis*: treats responses and predictors not as scalar- or vector-valued random variables but as random functions with infinitely many points
+ - GPs fit this need well with little modification
+
+### 21.5 Density estimation and regression
+
+- can get more flexibility by modeling the conditional observation model as a nonparametric GP
+ - so far have used a GP as a prior for a function controlling location or shape of a parametric observation model
+ - one solution is the *logistic Gaussian process* (LGP) or a Dirichlet process (covered in a later chapter)
+
+#### Density estimation
+
+- LGP generates a random surface from a GP and then transforms the surface to the space of probability densities
+ - with 1D, the surface is just a curve
+ - use the continuous logistic transformation to constrain to non-negative and integrate to 1
+- there is an illustrative example in the book on page 513
+
+#### Density regression
+
+- generalize the LGP to density regression by putting a prior on the collection of conditional densities
+
+#### Latent-variable regression
+
+- an alternative to the LGP
diff --git a/docs/notes/19_gaussian-processes_bda3-21.html b/docs/notes/19_gaussian-processes_bda3-21.html
new file mode 100644
index 0000000..599ebf8
--- /dev/null
+++ b/docs/notes/19_gaussian-processes_bda3-21.html
@@ -0,0 +1,2709 @@
+<!-- generated HTML for notes/19_gaussian-processes_bda3-21.Rmd; content duplicates the Rmd source above -->
diff --git a/docs/notes/20_finite-mixture-models_bda3-22.Rmd b/docs/notes/20_finite-mixture-models_bda3-22.Rmd
new file mode 100644
index 0000000..abfae8b
--- /dev/null
+++ b/docs/notes/20_finite-mixture-models_bda3-22.Rmd
@@ -0,0 +1,91 @@
+---
+title: "20. Notes on 'Ch 22. Finite mixture models'"
+date: "2022-01-18"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 22. Finite mixture models
+
+- "when measurements of a random variable are taken under two different conditions" (pg. 519)
+- i.e., where the data contain multiple subpopulations, each with a different, relatively simple model
+- basic mixture modeling principle is to introduce *unobserved indicators* $z$ to specify the mixture component for an observation
+ - can think of a mixture indicator as missing data
+
+### 22.1 Setting up and interpreting mixture models
+
+#### Finite mixtures
+
+- want to model the distribution of $y = (y_1, \dots, y_n)$ or $y|x$ as a mixture of $H$ components
+ - for each component $h \in (1, \dots, H)$, the distribution $f_h(y_i | \theta_h)$ depends on a parameter vector $\theta_h$
+ - $\lambda_h$ denotes the proportion of the population in component $h$
+ - $\sum_{h=1}^{H} \lambda_h = 1$
+ - common to assume all mixture components have the same parametric form
+ - thus, the sampling distribution of $y$ is:
+
+$$
+p(y_i | \theta, \lambda) = \lambda_1 f(y_i | \theta_1) + \dots + \lambda_H f(y_i | \theta_H)
+$$
+
+- can think of the mixture distribution probabilities $\lambda$ as priors over the parameters $\theta_h$
+ - or as a description of the variation in $\theta$ in a population
+ - akin to a hierarchical model
+- introduce the indicator variables $z_{ih}$ where $z_{ih} = 1$ if the $i$th data point is drawn from component $h$ and 0 otherwise (simulated in the sketch below)
+  - the $\lambda$ values are used to determine $z$
+    - can think of $\lambda$ as a hyperprior over $z$
+  - joint distribution of the observed data $y$ and the unobserved indicators $z$, conditional on the model parameters:
+ - only one $z_{ih}$ can be 1 for each $i$
+
+$$
+\begin{aligned}
+p(y, z | \theta, \lambda) &= p(z | \lambda) p(y | z, \theta) \\
+ &= \prod_{i=1}^n \prod_{h=1}^H (\lambda_h f(y_i | \theta_h))^{z_{i,h}}
+\end{aligned}
+$$
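+
+A minimal simulation sketch (not from the book) of the finite mixture above: draw the latent indicators $z_i$ from $\lambda$, draw $y_i$ given its component, and compare the sample to the mixture density $\lambda_1 f(y|\theta_1) + \lambda_2 f(y|\theta_2)$; all parameter values are arbitrary illustrations.
+
+```{r}
+set.seed(1)
+lambda <- c(0.3, 0.7) # mixture proportions
+mu <- c(-2, 1)
+sigma <- c(1, 0.5)
+
+n <- 5000
+z <- sample(1:2, size = n, replace = TRUE, prob = lambda) # latent indicators z_i
+y <- rnorm(n, mean = mu[z], sd = sigma[z])
+
+hist(y, breaks = 60, freq = FALSE, main = "", xlab = "y")
+curve(
+  lambda[1] * dnorm(x, mu[1], sigma[1]) + lambda[2] * dnorm(x, mu[2], sigma[2]),
+  add = TRUE, lwd = 2
+)
+```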
+
+#### Continuous mixtures
+
+- generalize the finite mixture by allowing the mixing distribution over classes to be continuous
+- hierarchical models are a form of continuous mixture model
+ - each observed value $y_i$ is modeled as coming from a mixture of models defined by the probability of values for $\theta$
+- in the book, the focus is on finite mixtures and "minor modifications" are generally required to form a continuous distribution
+
+#### Identifiability of the mixture likelihood
+
+- **all finite mixture models are nonidentifiable: the distribution is unchanged if the group labels are permuted**
+- in many cases, purposeful, informative priors can solve the issue
+
+#### Prior distributions
+
+- the priors for a finite mixture model's parameters $\theta$ and $\lambda$ are usually the product of the two independent priors on each variable
+ - because the vector of mixture indicators $z_i = (z_{i,1}, \dots, z_{i,H})$ is multinomial with parameter $\lambda$, a common prior for $\lambda$ is the Dirichlet
+ - $\lambda \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_H)$
+ - $\theta = (\theta_1, \dots, \theta_H)$ is the vector of parameters for each component's sub-model
+    - some can be shared across components (e.g. equal variance for a group of normal distributions)
+
+#### Number of mixture components
+
+- can model $H$ as unknown but is computationally expensive
+- usually can just build models with different $H$ and compare their goodness of fit
+ - compare the posterior predictive distributions with a "suitably chosen" test quantity
+
+#### Mixtures as true models or approximating distributions
+
+- two classes of thought[^1]:
+ 1. *theoretical:* a mixture model is "a realistic characterization of the true data-generating mechanism" (pg. 522)
+ 2. *pragmatic:* "trying to infer latent subpopulations is an intrinsically ill-defined statistical problem, but finite mixture models are nonetheless useful" (pg. 523)
+
+[^1]: I came up with the names of the two schools of thought as descriptive titles; they do not appear in the book.
+
+### 22.4 Unspecified number of mixture components
+
+- can assign a Poisson distribution as a prior on $H$ (the number of groups/components in the mixture model)
+ - computationally intensive
+ - more common to just fit the model with different $H$ and compare with some statistic and a penalty for model complexity
+ - WAIC is theoretically justified, but ignores the uncertainty over $H$
+ - LOO-CV may be even better
diff --git a/docs/notes/20_finite-mixture-models_bda3-22.html b/docs/notes/20_finite-mixture-models_bda3-22.html
new file mode 100644
index 0000000..454345e
--- /dev/null
+++ b/docs/notes/20_finite-mixture-models_bda3-22.html
@@ -0,0 +1,2682 @@
+<!-- generated HTML for notes/20_finite-mixture-models_bda3-22.Rmd; content duplicates the Rmd source above -->
diff --git a/docs/notes/21_Dirichlet-process-models_bda3-23.Rmd b/docs/notes/21_Dirichlet-process-models_bda3-23.Rmd
new file mode 100644
index 0000000..95aa2ef
--- /dev/null
+++ b/docs/notes/21_Dirichlet-process-models_bda3-23.Rmd
@@ -0,0 +1,350 @@
+---
+title: "21. Notes on 'Ch 23. Dirichlet process models'"
+date: "2022-01-20"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+
+library(tidyverse)
+
+theme_set(
+ theme_bw() +
+ theme(
+ strip.background = element_blank(),
+ axis.ticks = element_blank()
+ )
+)
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 23. Dirichlet process models
+
+- **Dirichlet process**: an infinite-dimensional generalization of the Dirichlet distribution
+ - used as a prior on unknown distributions
+ - can extend finite component mixture models to infinite mixture models
+
+### 23.1 Bayesian histograms
+
+- the histogram as a simple form of density estimation
+  - demonstrate a flexible parametric version that motivates the nonparametric version in the following section
+- prespecified knots: $\xi = (\xi_0, \dots, \xi_k)$ with $\xi_{h-1} < \xi_h$
+- probability model for the density (a histogram):
+ - where $\pi = (\pi_1, \dots, \pi_k)$ is an unknown probability vector
+
+$$
+f(y) = \sum_{h=1}^k 1_{\xi_{h-1} < y \le \xi_h} \frac{\pi_h}{(\xi_h - \xi_{h-1})}
+$$
+
+- prior for the probabilities $\pi$ as a Dirichlet distribution:
+
+$$
+p(\pi|a) = \frac{\Gamma(\sum_{h=1}^k a_h)}{\prod_{h=1}^k \Gamma(a_h)} \prod_{h=1}^k \pi _h^{a_h - 1}
+$$
+
+- replace the hyperparameter vector: $a = \alpha \pi_0$ where $\pi_0$ is:
+
+$$
+\text{E}(\pi|a) = \pi_0 = \left( \frac{a_1}{\sum_h a_h}, \dots, \frac{a_k}{\sum_h a_h} \right)
+$$
+
+- the posterior for $\pi$ becomes:
+  - where $n_h$ is the number of observations $y$ in the $h$th bin
+
+$$
+p(\pi | y) \propto \prod_{h=1}^k \pi_h^{a_h + n_h - 1} = \text{Dirichlet}(a_1 + n_1, \dots, a_k + n_k)
+$$
+
+- this histogram estimator does well but is sensitive to the specification of the knots
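+
+A minimal sketch (not the book's example) of the conjugate update above: with made-up data, knots, and prior guess $\pi_0$, the posterior for the bin probabilities is $\text{Dirichlet}(\alpha \pi_{0,h} + n_h)$ and the posterior mean histogram follows directly.
+
+```{r}
+set.seed(1)
+y <- rnorm(200)             # observed data (illustrative)
+knots <- seq(-4, 4, by = 1) # xi_0, ..., xi_k
+alpha <- 5                  # prior concentration
+pi0 <- rep(1 / (length(knots) - 1), length(knots) - 1) # uniform prior guess
+
+n_h <- as.vector(table(cut(y, breaks = knots))) # bin counts
+a_post <- alpha * pi0 + n_h                     # Dirichlet posterior parameters
+
+post_mean <- a_post / sum(a_post) # E(pi_h | y)
+post_mean / diff(knots)           # posterior mean histogram heights
+```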
+
+### 23.2 Dirichlet process prior distributions
+
+#### Definition and basic properties
+
+- goal is to not need to prespecify the bins of the histogram
+- let:
+ - $\Omega$: sample space
+  - $B_1, \dots, B_k$: measurable subsets of $\Omega$
+- if $\Omega = \Re$, then $B_1, \dots, B_k$ are non-overlapping intervals that partition the real line into a finite number of bins
+- $P$: unknown probability measure of $(\Omega, \mathcal{B})$
+ - $\mathcal{B}$: "collection of all possible subsets of the sample space $\Omega$"
+ - $P$ assigns probabilities to the subsets $\mathcal{B}$
+ - probability for a set of bins $B_1, \dots, B_k$ partitioning $\Omega$:
+
+$$
+P(B_1), \dots, P(B_k) = \left( \int_{B_1} f(y) dy, \dots, \int_{B_k} f(y) dy \right)
+$$
+
+- $P$ is a *random probability measure* (RPM), so the bin probabilities are random variables
+- a good prior for the bin probabilities is the Dirichlet distribution in \@ref(eq:dirichlet-prior)
+ - where $P_0$ is a base probability measure providing the initial guess at $P$
+ - where $\alpha$ is a prior concentration parameter
+ - controls shrinkage of $P$ towards $P_0$
+
+$$
+P(B_1), \dots, P(B_k) \sim \text{Dirichlet}(\alpha P_0(B_1), \dots, \alpha P_0(B_k))
+(\#eq:dirichlet-prior)
+$$
+
+- difference with previous Bayesian histogram: only specifies that bin $B_k$ is assigned probability $P(B_k)$ and not how probability mass is distributed across the bin $B_k$
+  - thus, for a fixed set of bins, this equation does not fully specify the prior for $P$
+ - need to eliminate the sensitivity to the choice of bins by assuming the prior holds for all possible partitions $B_1, \dots, B_k$ for all $k$
+ - then it is a fully specified prior for $P$
+- "must exist a random probability measure $P$ such that the probabilities assigned to any measurable partition $B_1, \dots, B_k$ by $P$ is $\text{Dirichlet}(\alpha P_0(B_1), \dots, \alpha P_0(B_k))$"
+ - the resulting $P$ is a Dirichlet process:
+ - $P \sim \text{DP}(\alpha P_0)$
+ - $\alpha > 0$: a scalar precision parameter
+ - $P_0$: baseline probability measure also on $(\Omega, \mathcal{B})$
+- implications of DP:
+ - the marginal random probability assigned to any subset $B$ is a beta distribution
+    - $P(B) \sim \text{Beta}(\alpha P_0(B), \alpha (1-P_0(B)))$ for all $B \in \mathcal{B}$
+ - the prior for $P$ is centered on $P_0$: $E(P(B)) = P_0(B)$
+ - $\alpha$ controls variance
+ - $\text{var}(P(B)) = \frac{P_0(B)(1 - P_0(B))}{1 + \alpha}$
+
+- get posterior for $P$:
+  - let $y_i \stackrel{iid}{\sim} P$ for $i = 1, \dots, n$
+ - $P \sim \text{DP}(\alpha P_0)$
+ - $P$ denotes the probability measure *and* its corresponding distribution
+ - from \@ref(eq:dirichlet-prior), for any partition $B_1, \dots, B_k$:
+
+$$
+P(B_1), \dots, P(B_k) \mid y_1, \dots, y_n \sim
+ \text{Dirichlet} \left(
+ \alpha P_0(B_1) + \sum_{i=1}^n 1_{y_i \in B_1}, \dots, \alpha P_0(B_k) + \sum_{i=1}^n 1_{y_i \in B_k}
+ \right)
+$$
+
+- this can be converted to the following:
+
+$$
+P \mid y_1, \dots, y_n \sim \text{DP} \left( \alpha P_0 + \sum_{i=1}^n \delta_{y_i} \right)
+$$
+
+- finally, the posterior expectation of $P$:
+
+$$
+\text{E}(P(B) | y^n) =
+ \left(\frac{\alpha}{\alpha + n} \right) P_0(B) +
+ \left(\frac{n}{\alpha + n} \right) \sum_{i=1}^n \frac{1}{n} \delta_{y_i}
+(\#eq:postp)
+$$
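+
+A small numeric sketch (not from the book) of the posterior expectation \@ref(eq:postp): for a single set $B$, $\text{E}(P(B) \mid y)$ is a weighted average of the base probability $P_0(B)$ and the empirical proportion of observations in $B$; the base measure, $\alpha$, and data below are arbitrary illustrations.
+
+```{r}
+set.seed(1)
+y <- rnorm(50, mean = 1) # observed data (illustrative)
+alpha <- 10              # DP precision
+B <- c(0, 2)             # the set B = (0, 2]
+
+P0_B <- pnorm(B[2]) - pnorm(B[1])   # base measure probability, with P0 = N(0, 1)
+emp_B <- mean(y > B[1] & y <= B[2]) # empirical proportion in B
+n <- length(y)
+
+(alpha / (alpha + n)) * P0_B + (n / (alpha + n)) * emp_B
+```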
+
+- DP is a model similar to a random histogram but without dependence on the bins
+- cons of a DP prior:
+ - lack of smoothness
+ - induces negative correlation between $P(B_1)$ and $P(B_2)$ for any two disjoint bins with no account for the distance between them
+ - realizations from the DP are discrete distributions
+  - with $P \sim \text{DP}(\alpha P_0)$, $P$ is atomic and has nonzero weights only on a set of atoms, not a continuous density on the real line
+
+#### Stick-breaking construction
+
+- more intuitive understanding of DP
+- induce $P \sim \text{DP}(\alpha P_0)$ by letting:
+
+$$
+\begin{aligned}
+P(\cdot) &= \sum_{h=1}^{\infty} \pi_h \delta_{\theta_h}(\cdot) \\
+\pi_h &= V_h \prod_{l < h} (1 - V_l) \\
+V_h &\sim \text{Beta}(1, \alpha) \\
+\theta_h &\sim P_0
+\end{aligned}
+$$
+
+- the weights $\pi_h$ come from repeatedly "breaking a stick" of length 1, so they sum to 1; larger $\alpha$ spreads the probability over more atoms
+- below, draws from the stick-breaking construction are plotted for a few values of $\alpha$ (part of the original chunk was garbled in this diff, so `stick_breaking_process()` is a minimal reconstruction and the $\alpha$ values are illustrative)
+
+```{r}
+# minimal reconstruction: draw `n` stick-breaking weights `pi` and atoms
+# `theta` (here from a standard-normal base measure P0)
+stick_breaking_process <- function(alpha, n = 1000) {
+  v <- rbeta(n, 1, alpha)
+  pi <- v * cumprod(c(1, 1 - v[-n]))
+  theta <- rnorm(n)
+  list(theta = theta, pi = pi)
+}
+
+set.seed(0)
+tibble(alpha = c(1, 5, 15, 50)) %>%
+  mutate(
+    dp = purrr::map(alpha, stick_breaking_process, n = 1000),
+    theta = purrr::map(dp, ~ .x$theta),
+    pi = purrr::map(dp, ~ .x$pi)
+  ) %>%
+  select(-dp) %>%
+  unnest(c(theta, pi)) %>%
+  ggplot(aes(x = theta, y = pi)) +
+  facet_wrap(vars(alpha), nrow = 2, scales = "fixed") +
+  geom_col(width = 0.05) +
+  scale_x_continuous(expand = expansion(c(0.02, 0.02))) +
+  scale_y_continuous(expand = expansion(c(0, 0.02))) +
+  labs(x = "\u03B8", y = "\u03C0")
+```
+
+### 23.3 Dirichlet process mixtures
+
+#### Specification and Polya urns
+
+- **"the DP is more appropriately used as a prior for an unknown mixture of distributions"** (pg 549)
+- in the case of density estimation, a general kernel mixture model can be specified as \@ref(eq:kernel-density)
+ - $\mathcal{K}(\cdot | \theta)$: a kernel
+ - $\theta$: location and possibly scale parameters
+ - $P$: "mixing measure"
+
+$$
+f(y|P) = \int \mathcal{K}(y|\theta) d P(\theta)
+(\#eq:kernel-density)
+$$
+
+
+- treating $P$ as discrete results in a finite mixture model
+- setting a prior on $P$ creates an infinite mixture model
+ - prior: $P \sim \pi_\mathcal{P}$
+ - $\mathcal{P}$: space of all probability measures on $(\Omega, \mathcal{B})$
+ - $\pi_\mathcal{P}$: the prior over the space defined by $\mathcal{P}$
+- if $\pi_\mathcal{P}$ is set to be a DP prior, the result is a DP mixture model
+ - from \@ref(eq:postp) and \@ref(eq:kernel-density), a DP prior on $P$ results in \@ref(eq:dp-kernel)
+ - where:
+ - $\pi \sim \text{stick}(\alpha)$ denotes that the probability weights come from the stick-breaking process with parameter $\alpha$
+ - $\theta_h \sim P_0$ independently for each $h=1, \dots, \infty$
+
+$$
+f(y) = \sum_{h=1}^\infty \pi_h \mathcal{K}(y | \theta_h^*)
+(\#eq:dp-kernel)
+$$
+
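+As a concrete sketch of \@ref(eq:dp-kernel) (my own illustration, reusing `stick_breaking_process` from above with unit-variance normal kernels), a single realization of the random density $f$ can be evaluated on a grid:
+
+```{r}
+draw <- stick_breaking_process(alpha = 2, n = 500)
+y_grid <- seq(-4, 4, length.out = 200)
+# f(y) = sum_h pi_h * Normal(y | theta_h, 1)
+f_y <- sapply(y_grid, function(y) sum(draw$pi * dnorm(y, mean = draw$theta, sd = 1)))
+plot(y_grid, f_y, type = "l", xlab = "y", ylab = "f(y)")
+```
+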
+- equation \@ref(eq:dp-kernel) resembles a finite mixture model except that the number of mixture components is set to infinity
+ - this does not mean there will be an infinite number of components in practice
+ - instead, the model is flexible enough to add more mixture components as necessary
+- consider the following specification
+ - issue of how to conduct posterior computation with a DP mixture (DPM) because $P$ has infinitely many parameters
+
+$$
+y_i \sim \mathcal{K}(\theta_i) \text{,} \quad
+\theta_i \sim P \text{,} \quad
+P \sim \text{DP}(\alpha P_0)
+$$
+
+- can marginalize out $P$ to get an induced prior on the subject-specific parameters $\theta^n = (\theta_1, \dots, \theta_n)$
+- results in the *Polya urn* predictive rule:
+
+$$
+\theta_i | \theta_1, \dots, \theta_{i-1} \sim \left( \frac{\alpha}{\alpha + i - 1} \right) P_0 +
+\sum_{j=1}^{i-1} \left( \frac{1}{\alpha + i - 1} \right) \delta_{\theta_j}
+(\#eq:polya-urn)
+$$
+
+- *Chinese restaurant process* as a metaphor for the Polya urn predictive rule \@ref(eq:polya-urn) (simulated in the chunk after this list):
+ - consider a restaurant with infinitely many tables
+ - the first customer sits at a table with dish $\theta_1^*$
+ - the second customer sits at the first table with probability $\frac{1}{1+\alpha}$ or a new table with probability $\frac{\alpha}{1+\alpha}$
+ - repeat this process for the $i$th customer:
+  - sit at an already-occupied table $j$ with probability $\frac{c_j}{i - 1 + \alpha}$, where $c_j$ is the number of customers already at that table
+  - sit at a new table with probability $\frac{\alpha}{i - 1 + \alpha}$
+ - interpretation:
+ - each table represents a cluster of subjects
+ - the number of clusters depends on the number of subjects
+ - makes sense to have the possibility of more clusters with more subjects (instead of a fixed number of clusters as in a finite mixture model)
+- there is a description of the sampling process for the posterior of $\theta_i | \theta_{-i}$
+
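+A short simulation of the seating process makes the clustering behavior visible (the helper `crp_tables` is mine, not from the book):
+
+```{r}
+crp_tables <- function(n, alpha) {
+  tables <- integer(n)
+  tables[1] <- 1 # the first customer opens the first table
+  for (i in 2:n) {
+    counts <- tabulate(tables[1:(i - 1)]) # customers already at each table
+    probs <- c(counts, alpha) / (i - 1 + alpha) # occupied tables, then a new one
+    tables[i] <- sample(seq_along(probs), size = 1, prob = probs)
+  }
+  tables
+}
+
+set.seed(123)
+table(crp_tables(n = 100, alpha = 1)) # cluster sizes in one realization
+```
+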
+#### Hyperprior distribution
+
+- $\alpha$: the DP precision parameter
+ - plays a role in controlling the prior on the number of clusters
+ - small, fixed $\alpha$ favors allocation to few clusters (relative to sample size)
+ - if $\alpha=1$, the prior indicates that two randomly selected subjects have a 50/50 chance of belonging to the same cluster
+ - alternatively can set a hyperprior on $\alpha$ and let the data inform the value
+ - common to use a Gamma distribution: $\Gamma(a_\alpha, b_\alpha)$
+ - authors indicate that setting a hyperprior on $\alpha$ tends to work well in practice
+- $P_0$: base probability for the DP
+ - can think of $P_0$ as setting the prior for the cluster locations
+ - can put hyperpriors on $P_0$ parameters
+
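+To make the role of $\alpha$ concrete: under the Chinese restaurant process, the expected number of clusters among $n$ subjects is $\sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1}$, which grows roughly like $\alpha \log(1 + n / \alpha)$. A quick calculation (my own sketch) for $n = 100$:
+
+```{r}
+expected_clusters <- function(n, alpha) sum(alpha / (alpha + seq_len(n) - 1))
+alphas <- c(0.1, 1, 5, 25)
+setNames(sapply(alphas, function(a) expected_clusters(n = 100, alpha = a)), alphas)
+```
+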
+### 23.4 Beyond density estimation
+
+#### Nonparametric residual distributions
+
+- "The real attraction of Dirichlet process mixture (DPM) models is that they can be used much more broadly for relaxing parametric assumptions in hierarchical models." (pg 557)
+- consider the linear regression with a nonparametric error distribution \@ref(eq:linreg-nonparam-error)
+ - $X_i = (X_{i1}, \dots, X_{ip})$: vector of predictors
+ - $\epsilon_i$: error term with distribution $f$
+
+$$
+y_i = X_i \beta + \epsilon_i \text{,} \quad \epsilon_i \sim f
+(\#eq:linreg-nonparam-error)
+$$
+
+- can relax the assumption that $f$ has a specific parametric form
+
+$$
+\epsilon_i \sim N(0, \phi_i^{-1}) \text{,} \quad \phi_i \sim P \text{,} \quad P \sim \text{DP}(\alpha P_0)
+$$
+
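+A sketch of what this relaxation buys (my own, assuming a $\text{Gamma}(2, 2)$ base measure for the precisions and reusing the stick-breaking helper from above): the induced error distribution is a countable mixture of normals with different scales, which need not look Gaussian at all.
+
+```{r}
+set.seed(2)
+weights <- stick_breaking_process(alpha = 1, n = 500)$pi # stick-breaking weights
+phi <- rgamma(500, shape = 2, rate = 2) # precision atoms drawn from the base measure
+cluster <- sample(500, size = 2000, replace = TRUE, prob = weights)
+eps <- rnorm(2000, mean = 0, sd = 1 / sqrt(phi[cluster]))
+quantile(eps, c(0.01, 0.25, 0.5, 0.75, 0.99))
+```
+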
+#### Nonparametric models for parameters that vary by group
+
+- consider hierarchical linear models with varying coefficients
+ - can account for uncertainty about the distribution of the coefficients by placing DP or DPM priors on them
+ - where:
+ - $y_i = (y_{i1}, \dots, y_{in_i})$: repeated measurements for item $i$
+ - $\mu_i$: subject-specific mean
+ - $\epsilon_{ij}$: observation specific residual
+
+$$
+y_{ij} = \mu_i + \epsilon_{ij} \text{,} \quad \mu_i \sim f \text{,} \quad \epsilon_{ij} \sim g
+(\#eq:hier-linreg-dp)
+$$
+
+- typically for \@ref(eq:hier-linreg-dp):
+ - $f \equiv N(\mu, \phi^{-1})$
+ - $g \equiv N(0, \sigma^2)$
+- can allow more flexibility in characterizing variability among subjects:
+ - $\mu_i \sim P \text{,} \quad P \sim \text{DP}(\alpha P_0)$
+ - the DP prior induces a *latent class* model
+ - the subjects are grouped into an unknown number of clusters \@ref(eq:latent-class-model)
+ - $S_i \in \{1, \dots, \infty\}$: latent class index
+ - $\pi_h$: probability of allocation to latent class $h$
+
+$$
+\mu_i = \mu^*_{S_i} \text{,} \quad \Pr(S_i = h) = \pi_h \text{,} \quad h = 1, 2, \dots
+(\#eq:latent-class-model)
+$$
+
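+A final sketch (mine, again reusing the stick-breaking helper): drawing the group means $\mu_i$ from one realization of $P \sim \text{DP}(\alpha P_0)$ shows the induced latent-class structure, with many subjects sharing exactly the same $\mu^*$ value.
+
+```{r}
+set.seed(3)
+P_draw <- stick_breaking_process(alpha = 2, n = 500)
+S <- sample(500, size = 30, replace = TRUE, prob = P_draw$pi) # latent class indices
+mu <- P_draw$theta[S] # subject-specific means
+table(round(mu, 3)) # ties reveal the clusters
+```
+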
+> There is more to this chapter, but it is currently beyond my understanding.
+> I hope to return to this chapter once I have a better understanding of Dirichlet processes.
+
+---
+
+```{r}
+sessionInfo()
+```
diff --git a/docs/notes/21_Dirichlet-process-models_bda3-23.html b/docs/notes/21_Dirichlet-process-models_bda3-23.html
new file mode 100644
index 0000000..eef4c24
--- /dev/null
+++ b/docs/notes/21_Dirichlet-process-models_bda3-23.html
diff --git a/docs/project-guidelines.html b/docs/project-guidelines.html
index d41823c..816a095 100644
--- a/docs/project-guidelines.html
+++ b/docs/project-guidelines.html
index 5db4b12..62e929a 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -5,8 +5,8 @@
"title": "About",
"description": "Some additional details about this website and it's author.",
"author": [],
- "contents": "\nThe website\nThis site is built with ‘distill’ and deployed on GitHub pages. The main point of the website is to display my course notes in a easy to read and search format. It isn’t perfect nor optimal, but there is a good balance of functionality and simplicity.\nAbout me\nMy name is Joshua Cook and I am (at the time of taking this course and writing this About page) a graduate student at Harvard Medical School. My research is on cancer genetics and I have a specific love of Bayesian modelling. You can peruse more of my projects and other work on my GitHub profile at jhrcook or my website.\n\n\n\n",
- "last_modified": "2021-12-03T07:43:26-05:00"
+ "contents": "\nThe website\nThis site is built with ‘distill’ and deployed on\nGitHub pages. The main point of the website is to display my course\nnotes in a easy to read and search format. It isn’t perfect nor optimal,\nbut there is a good balance of functionality and simplicity.\nAbout me\nMy name is Joshua Cook and I am (at the time of taking this course\nand writing this About page) a graduate student at Harvard Medical\nSchool. My research is on cancer genetics and I have a specific love of\nBayesian modelling. You can peruse more of my projects and other work on\nmy GitHub profile at jhrcook or\nmy website.\n\n\n\n",
+ "last_modified": "2022-01-31T07:00:22-05:00"
},
{
"path": "index.html",
@@ -18,14 +18,14 @@
"url": "https://joshuacook.netlify.app"
}
],
- "contents": "\nResources\nCourse website\n2021 Schedule\nGitHub repo (my fork)\nBayesian Data Analysis (3e) (BDA3) (exercise solutions)\nChapter Notes\nVideo lectures or individually lists here\nLecture slides\nHow to study\n\nThe following are recommendations from the course creators on how to take the course.\n\nThe recommended way to go through the material is:\nRead the reading instructions for a chapter in the chapter notes.\nRead the chapter in BDA3 and check that you find the terms listed in the reading instructions.\nWatch the corresponding video lecture to get explanations for most important parts.\nRead corresponding additional information in the chapter notes.\nRun the corresponding demos in R demos or Python demos.\nRead the exercise instructions and make the corresponding assignments. Demo codes in R demos and Python demos have a lot of useful examples for handling data and plotting figures. If you have problems, visit TA sessions or ask in course slack channel.\nIf you want to learn more, make also self study exercises listed below.\nSections\nSection\nNotes\nBook exercises\nAssignments\n1. Course Introduction\nnotes\nexercises\nassignment 1\n2. Basics of Bayesian Inference\nnotes\nexercises\nassignment 2\n3. Multidimensional Posterior\nnotes\nexercises\nassignment 3\n4. Monte Carlo\nnotes\nexercises\nassignment 4\n5. Markov chain Monte Carlo\nnotes\nexercises\nassignment 5\n6. HMC, NUTS, and Stan\nnotes\nexercises\nassignment 6\n7. Hierarchical models and exchangeability\nnotes\nexercises\nassignment 7\n8. Model checking & Cross-validation\nnotes\n(none)\n(none)\n9. Model comparison and selection\nnotes\n(none)\nassignment 8\n10. Decision analysis\nnotes\n(none)\nassignment 9\n11. Normal approximation & Frequency properties\nnotes\n(none)\n(none)\n12. Extended topics\nnotes\n(none)\n(none)\nStan models\nDrug bioassay model for Assignment 6\n8 school SAT model\nDrownings for Assignment 7\nFactory machine measurements for Assignments 7 & 8:\npooled\nseparate\nhierarchical (also used in Assignment 9)\n\n",
- "last_modified": "2021-12-03T07:43:26-05:00"
+ "contents": "\nResources\nCourse\nwebsite\n2021\nSchedule\nGitHub\nrepo (my\nfork)\nBayesian Data\nAnalysis (3e) (BDA3) (exercise solutions)\nChapter\nNotes\nVideo\nlectures or individually lists here\nLecture\nslides\nHow to study\n\nThe following are recommendations from the course creators on how to\ntake the course.\n\nThe recommended way to go through the material is:\nRead the reading instructions for a chapter in the chapter\nnotes.\nRead the chapter in BDA3 and check that you find the terms listed in\nthe reading instructions.\nWatch the corresponding video\nlecture to get explanations for most important parts.\nRead corresponding additional information in the chapter notes.\nRun the corresponding demos in R demos or Python demos.\nRead the exercise instructions and make the corresponding\nassignments. Demo codes in R demos and Python demos have a lot of useful\nexamples for handling data and plotting figures. If you have problems,\nvisit TA sessions or ask in course slack channel.\nIf you want to learn more, make also self study exercises listed\nbelow.\nCourse sections\nSection\nNotes\nBook exercises\nAssignments\n1. Course\nIntroduction\nnotes\nexercises\nassignment 1\n2. Basics of Bayesian\nInference\nnotes\nexercises\nassignment 2\n3. Multidimensional\nPosterior\nnotes\n(none)\nassignment 3\n4. Monte Carlo\nnotes\n(none)\nassignment 4\n5. Markov chain Monte\nCarlo\nnotes\n(none)\nassignment 5\n6. HMC, NUTS, and\nStan\nnotes\n(none)\nassignment 6\n7. Hierarchical models and\nexchangeability\nnotes\nexercises\nassignment 7\n8. Model checking &\nCross-validation\nnotes\n(none)\n(none)\n9. Model comparison and\nselection\nnotes\n(none)\nassignment 8\n10. Decision\nanalysis\nnotes\n(none)\nassignment 9\n11. Normal approximation &\nFrequency properties\nnotes\n(none)\n(none)\n12. Extended topics\nnotes\n(none)\n(none)\nAdditional notes\nChapter\nTitle\nNotes\nAdditional work\n14\nIntroduction to regression models\nnotes\n\n15\nHierarchical linear models\nnotes\n\n16\nGeneralized linear models\n\n\n18\nModels for missing data\n\n\n19\nParametric nonlinear models\nnotes\nserial dilution\nmodel\n20\nBasis function models\nnotes\n\n21\nGaussian process models\nnotes\n\n22\nFinite mixture models\nnotes\n\n23\nDirichlet process models\nnotes\n\nStan models\nDrug bioassay model\nfor Assignment 6\n8 school SAT model\nDrownings for\nAssignment 7\nFactory machine measurements for Assignments 7 & 8:\npooled\nseparate\nhierarchical\n(also used in Assignment 9)\n\nserial dilution assay for\nchapter 19 exercises\n\n\n\n",
+ "last_modified": "2022-01-31T07:00:23-05:00"
},
{
"path": "project-guidelines.html",
"author": [],
- "contents": "\nProject work details\nProject work involves choosing a data set and performing a whole analysis according to all the parts of Bayesian workflow studied along the course.\nThe project work is meant to be done in period II.\nIn the beginning of the period II\nForm a group. We prefer groups of 3, but the project can be done in groups of 1-2.\nSelect a topic. You may ask in the course chat channel #project for opinion whether it’s a good topic and a good dataset. You can change the topic later.\nStart planning.\n\n\nThe main work for the project and the presentation will be done in the second half of the period II after all the workflow parts have been discussed in the course.\nThe online presentations will be made on the evaluation week after period II.\nAll suspected plagiarism will be reported and investigated. See more about the Aalto University Code of Academic Integrity and Handling Violations Thereof.\nProject schedule\nForm a group and pick a topic. Register the group before end of 8th Nov, 2021.\nGroups of 3 can reserve a presentation slot starting TBA.\nGroups of 2 can reserve a presentation slot starting TBA.\nGroups of 1 can reserve a presentation slot starting TBA.\nGroups that register late can reserve a presentation slot starting TBA.\nWork on the project. TA session queue is also for project questions.\nProject report deadline December 6, 2021. Submit in peergrade (separate “class”, the link will be added to MyCourses).\nProject report peer grading December 7-9, 2021 (so that you’ll get feedback for the report before the presentations).\nProject presentations December 13-17, 2021.\nGroups\nProject work is done in groups of 1-3 persons. Preferred group size is 3, because you learn more when you talk about the project with someone else.\nIf you don’t have a group, you can ask other students in the group chat channel #project. Tell what kind of data you are interested in (e.g. medicine, health, biological, engineering, political, business), whether you prefer R or Python, and whether you have already more concrete idea for the topic.\nGroups of 3 students can choose their presentation time slot before 1-2 student groups. 3 person group is expected to do a bit more work than 1-2 person groups.\nYou can do the project alone, but the amount of work is expected to the same for 2 person groups.\nTA sessions\nThe groups will get help for the project work in TA sessions. When there are no weekly assignments, the TA sessions are still organized for helping in the project work.\nEvaluation\nThe project work’s evaluation consists of:\npeer-graded project report (40%) (within peergrade submission 80% and feedack 20%)\npresentation and oral exam graded by the course staff (60%)\nclarity of slides + use of figures\nclarity of oral presentation + flow of the presentation\nall required parts included (not necessarily all in main slides, but it needs to be clear that all required steps were performed)\naccuracy of use of terms (oral exam)\nresponses to questions (oral exam)\n\nProject report\nIn the project report you practice presenting the problem and data analysis results, which means that minimal listing of code and figures is not a good report. There are different levels for how data analysis project could be reported. This report should be more than a summary of results without workflow steps. While describing the steps and decisions made during the workflow, to keep the report readable some of the diagnostic outputs and code can be put in the appendix. 
If you are uncertain you can ask TAs in TA sessions whether you are on a good level of amount of details.\nThe report should include\nIntroduction describing\nthe motivation\nthe problem\nand the main modeling idea.\nShowing some illustrative figure is recommended.\nDescription of the data and the analysis problem. Provide information where the data was obtained, and if it has been previously used in some online case study and how your analysis differs from the existing analyses.\nDescription of at least two models, for example:\nnon hierarchical and hierarchical,\nlinear and non linear,\nvariable selection with many models.\nInformative or weakly informative priors, and justification of their choices.\nStan, rstanarm or brms code.\nHow to the Stan model was run, that is, what options were used. This is also more clear as combination of textual explanation and the actual code line.\nConvergence diagnostics (\\(\\widehat{R}\\), ESS, divergences) and what was done if the convergence was not good with the first try.\nPosterior predictive checks and what was done to improve the model.\nModel comparison (e.g. with LOO-CV).\nPredictive performance assessment if applicable (e.g. classification accuracy) and evaluation of practical usefulness of the accuracy.\nSensitivity analysis with respect to prior choices (i.e. checking whether the result changes a lot if prior is changed)\nDiscussion of issues and potential improvements.\nConclusion what was learned from the data analysis.\nSelf-reflection of what the group learned while making the project.\nProject presentation\nIn addition to the submitted report, each project must be presented by the authoring group, according to the following guidelines:\nThe presentation should be high level but sufficiently detailed information should be readily available to help answering questions from the audience.\nThe duration of the presentation should be 10 minutes (groups of 1-2 students) or 15 minutes (groups of 3 students).\nAt the end of the presentation there will be an extra 5-10 minutes of questions by anyone in the audience or two members of the course staff who are present. The questions from lecturer/TAs can be considered as an oral exam questions, and if answers to these questions reveal weak knowledge of the methods and workflow steps which should be part of the project, that can reduce the grade.\nGrading will be done by the two members of the course staff using standardized grading instructions.\nSpecific recommendations for the presentations include:\nThe first slide should include project’s title and group members’ names.\nThe chosen statistical model(s), including observation model and priors, must be explained and justified,\nMake sure the font is big enough to be easily readable by the audience. This includes figure captions, legends and axis information,\nThe last slide should be a summary or take-home-messages and include contact information or link to a further information. (The grade will be reduced by one if the last slide has only something like “Thank you” or “Questions?”),\nIn general, the best presentations are often given by teams that have frequently attended TA sessions and gotten feedback, so we strongly recommend attending these sessions.\nMore details on the presentation sessions\nIf you don’t have microphone or video camera (e.g. 
in your laptop or mobile phone) then we’ll arrange your presentation on campus in period III.\nIf you reserved a presentation slot but need to cancel, do it asap.\nZoom meeting link for all time slots available in the course chat.\nAs we have many presentation in each slot join the meeting in time. Late arrivals will lower the grade. Very late arrivals will fail the presentation and can present in period III.\nPresenting group needs to have video and audio on.\nIt is easiest if just one from the group shares the slides, but it is expected that all group members present some part of the presentation orally.\nPresentation time is 10 min for 1-2 person groups and 15min for 3 person groups\nTime limit is strict. It’s good idea to practice the talk so that you get the timing right. Staff will announce 2min and 1min left and time ended. Going overtime reduces the grade.\nAfter the presentation there will be 5min for questions, answers, and feedback.\nEach student has to come up with at least one question during the session. Students can ask more questions. Questions by students are posted in chat, and they can be posted already during the presentation.\nStaff Will ask further questions (kind of oral exam)\nGrading of the project presentation takes int account\nclarity of slides + use of figures\nclarity of oral presentation + flow of the presentation\nall required parts included (not necessarily all in main slides, but it needs to be clear that all required steps were performed)\naccuracy of use of terms (oral exam)\nresponses to questions (oral exam)\n\nStudents will also self-evaluate their project. After the presentation each student who just presented sends a private message to one of the staff members with a self evaluation grade from themselves and for each group member (if applicable).\nData sets\nAs some data sets have been overused for these particular goals, note that the following ones are forbidden in this work (more can be added to this list so make sure to check it regularly):\nextremly common data sets like titanic, mtcars, iris\nBaseball batting (used by Bob Carpenter’s StanCon case study).\nData sets used in the course demos\nIt’s best to use a dataset for which there is no ready made analysis in internet, but if you choose a dataset used already in some online case study, provide the link to previous studies and report how your analysis differs from those (for example if someone has made non-Bayesian analysis and you do the full Bayesian analysis).\nDepending on the model and the structure of the data, a good data set would have more than 100 observations but less than 1 million. If you know an interesting big data set, you can use a smaller subset of the data to keep the computation times feasible. It would be good that the data has some structure, so that it is sensible to use multilevel/hierarchical models.\nModel requirements\nEvery parameter needs to have an explicit proper prior. Improper flat priors are not allowed.\nA hierarchical model is a model where the prior of certain parameter contain other parameters that are also estimated in the model. For instance, b ~ normal(mu, sigma), mu ~ normal(0, 1), sigma ~ exponential(1).\nDo not impose hard constrains on a parameter unless they are natural to them. uniform(a, b) should not be used unless the boundaries are really logical boundaries and values beyond the boundaries are completely impossible.\nAt least some models should include covariates. 
Modelling the outcome without predictors is likely too simple for the project.\nbrms can be used, but the Stan code must be included, briefly commented, and all priors need to be checked from the Stan code and adjusted to be weakly informative based on some justified explanation.\nSome examples\nThe following case study examples demonstrate how text, equations, figures, and code, and inference results can be included in one report. These examples don’t necessarily have all the workflow steps required in your report, but different steps are illustrated in different case studies and you can get good ideas for your report just by browsing through them.\nBDA R and Python demos are quite minimal in description of the data and discussion of the results, but show many diagnostics and basic plots.\nSome Stan case studies focus on some specific methods, but there are many case studies that are excellent examples for this course. They don’t include all the steps required in this course, but are good examples of writing. Some of them are longer or use more advanced models than required in this course.\nBayesian workflow for disease transmission modeling in Stan\nModel-based Inference for Causal Effects in Completely Randomized Experments\nTagging Basketball Events with HMM in Stan\nModel building and expansion for golf putting\nA Dyadic Item Response Theory Model\nPredator-Prey Population Dynamics: the Lotka-Volterra model in Stan\nSome StanCon case studies (scroll down) can also provide good project ideas.\n",
- "last_modified": "2021-12-03T07:43:27-05:00"
+ "contents": "\nProject work details\nProject work involves choosing a data set and performing a whole\nanalysis according to all the parts of Bayesian workflow studied along\nthe course.\nThe project work is meant to be done in period II.\nIn the beginning of the period II\nForm a group. We prefer groups of 3, but the project can be done in\ngroups of 1-2.\nSelect a topic. You may ask in the course chat channel\n#project for opinion whether it’s a good topic and a good\ndataset. You can change the topic later.\nStart planning.\n\n\nThe main work for the project and the presentation will be done in\nthe second half of the period II after all the workflow parts have been\ndiscussed in the course.\nThe online presentations will be made on the evaluation week after\nperiod II.\nAll suspected plagiarism will be reported and investigated. See more\nabout the Aalto\nUniversity Code of Academic Integrity and Handling Violations\nThereof.\nProject schedule\nForm a group and pick a topic. Register the group before end of 8th\nNov, 2021.\nGroups of 3 can reserve a presentation slot starting TBA.\nGroups of 2 can reserve a presentation slot starting TBA.\nGroups of 1 can reserve a presentation slot starting TBA.\nGroups that register late can reserve a presentation slot starting\nTBA.\nWork on the project. TA session queue is also for project\nquestions.\nProject report deadline December 6, 2021. Submit in peergrade\n(separate “class”, the link will be added to MyCourses).\nProject report peer grading December 7-9, 2021 (so that you’ll get\nfeedback for the report before the presentations).\nProject presentations December 13-17, 2021.\nGroups\nProject work is done in groups of 1-3 persons. Preferred group size\nis 3, because you learn more when you talk about the project with\nsomeone else.\nIf you don’t have a group, you can ask other students in the group\nchat channel #project. Tell what kind of data you are\ninterested in (e.g. medicine, health, biological, engineering,\npolitical, business), whether you prefer R or Python, and whether you\nhave already more concrete idea for the topic.\nGroups of 3 students can choose their presentation time slot before\n1-2 student groups. 3 person group is expected to do a bit more work\nthan 1-2 person groups.\nYou can do the project alone, but the amount of work is expected to\nthe same for 2 person groups.\nTA sessions\nThe groups will get help for the project work in TA sessions. When there are no\nweekly assignments, the TA sessions are still organized for helping in\nthe project work.\nEvaluation\nThe project work’s evaluation consists of:\npeer-graded project report (40%) (within peergrade submission 80%\nand feedack 20%)\npresentation and oral exam graded by the course staff (60%)\nclarity of slides + use of figures\nclarity of oral presentation + flow of the presentation\nall required parts included (not necessarily all in main slides, but\nit needs to be clear that all required steps were performed)\naccuracy of use of terms (oral exam)\nresponses to questions (oral exam)\n\nProject report\nIn the project report you practice presenting the problem and data\nanalysis results, which means that minimal listing of code and figures\nis not a good report. There are different levels for how data analysis\nproject could be reported. This report should be more than a summary of\nresults without workflow steps. 
While describing the steps and decisions\nmade during the workflow, to keep the report readable some of the\ndiagnostic outputs and code can be put in the appendix. If you are\nuncertain you can ask TAs in TA sessions whether you are on a good level\nof amount of details.\nThe report should include\nIntroduction describing\nthe motivation\nthe problem\nand the main modeling idea.\nShowing some illustrative figure is recommended.\nDescription of the data and the analysis problem. Provide\ninformation where the data was obtained, and if it has been previously\nused in some online case study and how your analysis differs from the\nexisting analyses.\nDescription of at least two models, for example:\nnon hierarchical and hierarchical,\nlinear and non linear,\nvariable selection with many models.\nInformative or weakly informative priors, and justification of their\nchoices.\nStan, rstanarm or brms code.\nHow to the Stan model was run, that is, what options were used. This\nis also more clear as combination of textual explanation and the actual\ncode line.\nConvergence diagnostics (\\(\\widehat{R}\\), ESS, divergences) and what\nwas done if the convergence was not good with the first try.\nPosterior predictive checks and what was done to improve the\nmodel.\nModel comparison (e.g. with LOO-CV).\nPredictive performance assessment if applicable (e.g. classification\naccuracy) and evaluation of practical usefulness of the accuracy.\nSensitivity analysis with respect to prior choices (i.e. checking\nwhether the result changes a lot if prior is changed)\nDiscussion of issues and potential improvements.\nConclusion what was learned from the data analysis.\nSelf-reflection of what the group learned while making the\nproject.\nProject presentation\nIn addition to the submitted report, each project must be presented\nby the authoring group, according to the following guidelines:\nThe presentation should be high level but sufficiently detailed\ninformation should be readily available to help answering questions from\nthe audience.\nThe duration of the presentation should be 10 minutes (groups of 1-2\nstudents) or 15 minutes (groups of 3 students).\nAt the end of the presentation there will be an extra 5-10 minutes\nof questions by anyone in the audience or two members of the course\nstaff who are present. The questions from lecturer/TAs can be considered\nas an oral exam questions, and if answers to these questions reveal weak\nknowledge of the methods and workflow steps which should be part of the\nproject, that can reduce the grade.\nGrading will be done by the two members of the course staff using\nstandardized grading instructions.\nSpecific recommendations for the presentations include:\nThe first slide should include project’s title and group members’\nnames.\nThe chosen statistical model(s), including observation model and\npriors, must be explained and justified,\nMake sure the font is big enough to be easily readable by the\naudience. This includes figure captions, legends and axis\ninformation,\nThe last slide should be a summary or take-home-messages and include\ncontact information or link to a further information. (The grade will be\nreduced by one if the last slide has only something like “Thank you” or\n“Questions?”),\nIn general, the best presentations are often given by teams that\nhave frequently attended TA sessions and gotten feedback, so we strongly\nrecommend attending these sessions.\nMore details on the presentation sessions\nIf you don’t have microphone or video camera (e.g. 
in your laptop or\nmobile phone) then we’ll arrange your presentation on campus in period\nIII.\nIf you reserved a presentation slot but need to cancel, do it\nasap.\nZoom meeting link for all time slots available in the course\nchat.\nAs we have many presentation in each slot join the meeting in time.\nLate arrivals will lower the grade. Very late arrivals will fail the\npresentation and can present in period III.\nPresenting group needs to have video and audio on.\nIt is easiest if just one from the group shares the slides, but it\nis expected that all group members present some part of the presentation\norally.\nPresentation time is 10 min for 1-2 person groups and 15min for 3\nperson groups\nTime limit is strict. It’s good idea to practice the talk so that\nyou get the timing right. Staff will announce 2min and 1min left and\ntime ended. Going overtime reduces the grade.\nAfter the presentation there will be 5min for questions, answers,\nand feedback.\nEach student has to come up with at least one question during the\nsession. Students can ask more questions. Questions by students are\nposted in chat, and they can be posted already during the\npresentation.\nStaff Will ask further questions (kind of oral exam)\nGrading of the project presentation takes int account\nclarity of slides + use of figures\nclarity of oral presentation + flow of the presentation\nall required parts included (not necessarily all in main slides, but\nit needs to be clear that all required steps were performed)\naccuracy of use of terms (oral exam)\nresponses to questions (oral exam)\n\nStudents will also self-evaluate their project. After the\npresentation each student who just presented sends a private message to\none of the staff members with a self evaluation grade from themselves\nand for each group member (if applicable).\nData sets\nAs some data sets have been overused for these particular goals, note\nthat the following ones are forbidden in this work (more can be added to\nthis list so make sure to check it regularly):\nextremly common data sets like titanic, mtcars, iris\nBaseball batting (used by Bob Carpenter’s StanCon case study).\nData sets used in the course demos\nIt’s best to use a dataset for which there is no ready made analysis\nin internet, but if you choose a dataset used already in some online\ncase study, provide the link to previous studies and report how your\nanalysis differs from those (for example if someone has made\nnon-Bayesian analysis and you do the full Bayesian analysis).\nDepending on the model and the structure of the data, a good data set\nwould have more than 100 observations but less than 1 million. If you\nknow an interesting big data set, you can use a smaller subset of the\ndata to keep the computation times feasible. It would be good that the\ndata has some structure, so that it is sensible to use\nmultilevel/hierarchical models.\nModel requirements\nEvery parameter needs to have an explicit proper prior. Improper\nflat priors are not allowed.\nA hierarchical model is a model where the prior of certain parameter\ncontain other parameters that are also estimated in the model. For\ninstance, b ~ normal(mu, sigma), mu ~ normal(0,\n1), sigma ~ exponential(1).\nDo not impose hard constrains on a parameter unless they are natural\nto them. uniform(a, b) should not be used unless the\nboundaries are really logical boundaries and values beyond the\nboundaries are completely impossible.\nAt least some models should include covariates. 
Modelling the\noutcome without predictors is likely too simple for the project.\nbrms can be used, but the Stan code must be included,\nbriefly commented, and all priors need to be checked from the Stan code\nand adjusted to be weakly informative based on some justified\nexplanation.\nSome examples\nThe following case study examples demonstrate how text, equations,\nfigures, and code, and inference results can be included in one report.\nThese examples don’t necessarily have all the workflow steps required in\nyour report, but different steps are illustrated in different case\nstudies and you can get good ideas for your report just by browsing\nthrough them.\nBDA R and Python demos are quite minimal in\ndescription of the data and discussion of the results, but show many\ndiagnostics and basic plots.\nSome Stan case\nstudies focus on some specific methods, but there are many case\nstudies that are excellent examples for this course. They don’t include\nall the steps required in this course, but are good examples of writing.\nSome of them are longer or use more advanced models than required in\nthis course.\nBayesian\nworkflow for disease transmission modeling in Stan\nModel-based\nInference for Causal Effects in Completely Randomized\nExperments\nTagging\nBasketball Events with HMM in Stan\nModel\nbuilding and expansion for golf putting\nA\nDyadic Item Response Theory Model\nPredator-Prey\nPopulation Dynamics: the Lotka-Volterra model in Stan\nSome StanCon\ncase studies (scroll down) can also provide good project ideas.\n",
+ "last_modified": "2022-01-31T07:00:24-05:00"
}
],
"collections": []
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 9b2cc8d..8192459 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -2,14 +2,14 @@
https://jhrcook.github.io/bayesian-data-analysis-course/about.html
- 2021-12-03T07:42:16-05:00
+ 2021-12-03T07:45:23-05:00
https://jhrcook.github.io/bayesian-data-analysis-course/
- 2021-12-03T07:07:04-05:00
+ 2022-01-22T11:57:32-05:00
https://jhrcook.github.io/bayesian-data-analysis-course/project-guidelines.html
- 2021-12-03T07:43:27-05:00
+ 2022-01-31T07:00:23-05:00
diff --git a/index.Rmd b/index.Rmd
index cc006c3..1faebb3 100644
--- a/index.Rmd
+++ b/index.Rmd
@@ -39,22 +39,36 @@ The recommended way to go through the material is:
6. Read the exercise instructions and make the corresponding assignments. Demo codes in R demos and Python demos have a lot of useful examples for handling data and plotting figures. If you have problems, visit TA sessions or ask in course slack channel.
7. If you want to learn more, make also self study exercises listed below.
-## Sections
+## Course sections
-| Section | Notes | Book exercises | Assignments |
-|---------|-------|----------------|-------------|
-| **1. Course Introduction** | [notes](notes/01_introduction_bda3-01.html) | [exercises](book-exercises/bda-exercises-01.html) | [assignment 1](assignments/jhcook-assignment-01.html) |
-| **2. Basics of Bayesian Inference** | [notes](notes/02_basics-of-Bayesian-inference_bda3-02.html) | [exercises](book-exercises/bda-exercises-02.html) | [assignment 2](assignments/jhcook-assignment-02.html) |
-| **3. Multidimensional Posterior** | [notes](notes/03_multidimensional-posterior_bda3-03.html) | exercises | [assignment 3](assignments/jhcook-assignment-03.html) |
-| **4. Monte Carlo** | [notes](notes/04_monte-carlo_bda3-10.html) | exercises | [assignment 4](assignments/jhcook-assignment-04.html) |
-| **5. Markov chain Monte Carlo** | [notes](notes/05_mcmc_bda3-11.html) | exercises | [assignment 5](assignments/jhcook-assignment-05.html) |
-| **6. HMC, NUTS, and Stan** | [notes](notes/06_hmc-nuts-stan_bda3-12.html) | exercises | [assignment 6](assignments/jhcook-assignment-06.html) |
-| **7. Hierarchical models and exchangeability** | [notes](notes/07_hierarchical-models_bda3-5.html) | [exercises](book-exercises/bda-exercises-07.html) | [assignment 7](assignments/jhcook-assignment-07.html) |
-| **8. Model checking & Cross-validation** | [notes](notes/08_model-checking-and-cv_bda3-6-7.html) | (none) | (none) |
-| **9. Model comparison and selection** | [notes](notes/09_model-selection_bda3-7.html) | (none) | [assignment 8](assignments/jhcook-assignment-08.html) |
-| **10. Decision analysis** | [notes](notes/10_decision-analysis_bda3-9.html) | (none) | [assignment 9](assignments/jhcook-assignment-09.html) |
-| **11. Normal approximation & Frequency properties** | [notes](notes/11_normal-approx-freq-properties_bda3-04.html) | (none) | (none) |
-| **12. Extended topics** | [notes](notes/12_extended-topics.html) | (none) | (none) |
+| Section | Notes | Book exercises | Assignments |
+|----------------------------------------------------:|:------------------------------------------------------------:|:-------------------------------------------------:|:-----------------------------------------------------:|
+| **1. Course Introduction** | [notes](notes/01_introduction_bda3-01.html) | [exercises](book-exercises/bda-exercises-01.html) | [assignment 1](assignments/jhcook-assignment-01.html) |
+| **2. Basics of Bayesian Inference** | [notes](notes/02_basics-of-Bayesian-inference_bda3-02.html) | [exercises](book-exercises/bda-exercises-02.html) | [assignment 2](assignments/jhcook-assignment-02.html) |
+| **3. Multidimensional Posterior** | [notes](notes/03_multidimensional-posterior_bda3-03.html) | (none) | [assignment 3](assignments/jhcook-assignment-03.html) |
+| **4. Monte Carlo** | [notes](notes/04_monte-carlo_bda3-10.html) | (none) | [assignment 4](assignments/jhcook-assignment-04.html) |
+| **5. Markov chain Monte Carlo** | [notes](notes/05_mcmc_bda3-11.html) | (none) | [assignment 5](assignments/jhcook-assignment-05.html) |
+| **6. HMC, NUTS, and Stan** | [notes](notes/06_hmc-nuts-stan_bda3-12.html) | (none) | [assignment 6](assignments/jhcook-assignment-06.html) |
+| **7. Hierarchical models and exchangeability** | [notes](notes/07_hierarchical-models_bda3-5.html) | [exercises](book-exercises/bda-exercises-07.html) | [assignment 7](assignments/jhcook-assignment-07.html) |
+| **8. Model checking & Cross-validation** | [notes](notes/08_model-checking-and-cv_bda3-6-7.html) | (none) | (none) |
+| **9. Model comparison and selection** | [notes](notes/09_model-selection_bda3-7.html) | (none) | [assignment 8](assignments/jhcook-assignment-08.html) |
+| **10. Decision analysis** | [notes](notes/10_decision-analysis_bda3-9.html) | (none) | [assignment 9](assignments/jhcook-assignment-09.html) |
+| **11. Normal approximation & Frequency properties** | [notes](notes/11_normal-approx-freq-properties_bda3-04.html) | (none) | (none) |
+| **12. Extended topics** | [notes](notes/12_extended-topics.html) | (none) | (none) |
+
+## Additional notes
+
+| Chapter | Title | Notes | Additional work |
+|--------:|---------------------------------------|:----------------------------------------------------------:|:-------------------------------------------------------------:|
+| 14 | **Introduction to regression models** | [notes](notes/13_intro-to-regression-models_bda3-14.html) | |
+| 15 | **Hierarchical linear models** | [notes](notes/14_hierarchical-linear-models_bda3-15.html) | |
+| 16 | **Generalized linear models** | | |
+| 18 | **Models for missing data** | | |
+| 19 | **Parametric nonlinear models** | [notes](notes/17_parametric-nonlinear-models_bda3-19.html) | [serial dilution model](book-exercises/bda-exercises-19.html) |
+| 20 | **Basis function models** | [notes](notes/18_basis-function-models_bda3-20.html) | |
+| 21 | **Gaussian process models** | [notes](notes/19_gaussian-processes_bda3-21.html) | |
+| 22 | **Finite mixture models** | [notes](notes/20_finite-mixture-models_bda3-22.html) | |
+| 23 | **Dirichlet process models** | [notes](notes/21_Dirichlet-process-models_bda3-23.html) | |
## Stan models
@@ -65,3 +79,4 @@ The recommended way to go through the material is:
- [pooled](models/assignment07_factories_pooled.stan)
- [separate](models/assignment07_factories_separate.stan)
- [hierarchical](models/assignment07_factories_hierarchical.stan) (also used in Assignment 9)
+- [serial dilution assay](models/serial-dilution.stan) for chapter 19 exercises
diff --git a/models/serial-dilution.rds b/models/serial-dilution.rds
new file mode 100644
index 0000000..43fa9ea
Binary files /dev/null and b/models/serial-dilution.rds differ
diff --git a/models/serial-dilution.stan b/models/serial-dilution.stan
new file mode 100644
index 0000000..1dab7e7
--- /dev/null
+++ b/models/serial-dilution.stan
@@ -0,0 +1,59 @@
+data {
+ int N; // number of data points
+ int A; // constant used in model of measurement error
+ vector[N] x; // concentration values
+ vector[N] y; // observed color intensity
+ int M; // number of new x values
+ vector[M] xnew; // new x values
+}
+
+parameters {
+ vector[4] beta;
+  real<lower=0, upper=1> alpha;  // bounded to match the beta(1, 1) prior
+  real<lower=0> sigma;           // measurement scale must be positive
+}
+
+transformed parameters {
+ vector[N] g;
+ vector[N] tau;
+
+ for (i in 1:N) {
+ g[i] = beta[1] + beta[2] / (1 + (x[i] / beta[3]) ^ (-beta[4]));
+    tau[i] = ((g[i] / A) ^ alpha) * sigma;  // measurement standard deviation, sqrt((g/A)^(2*alpha) * sigma^2)
+ }
+}
+
+model {
+ // Priors
+ alpha ~ beta(1, 1);
+ beta[1] ~ normal(10, 2.5);
+ beta[2] ~ normal(100, 5);
+ beta[3] ~ normal(0, 1);
+ beta[4] ~ normal(0, 2.5);
+ sigma ~ normal(0, 2.5);
+
+ // Likelihood
+ for (i in 1:N) {
+ y[i] ~ normal(g[i], tau[i]);
+ }
+}
+
+generated quantities {
+ vector[N] ypred;
+ vector[N] log_lik;
+
+ vector[M] g_hat;
+ vector[M] tau_hat;
+ vector[M] ynew;
+
+ for (i in 1:N) {
+ ypred[i] = normal_rng(g[i], tau[i]);
+ log_lik[i] = normal_lpdf(y[i] | g[i], tau[i]);
+ }
+
+ for (i in 1:M) {
+ g_hat[i] = beta[1] + beta[2] / (1 + (xnew[i] / beta[3]) ^ (-beta[4]));
+    tau_hat[i] = ((g_hat[i] / A) ^ alpha) * sigma;  // predictive standard deviation
+ ynew[i] = normal_rng(g_hat[i], tau_hat[i]);
+ }
+}
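To make the new Stan program concrete, here is a minimal `rstan` sketch of compiling and sampling `models/serial-dilution.stan`. Every value in the data list is a simulated stand-in (the `beta_true` parameters, the dilution grid, the scaling constant `A`, and the constant noise level are assumptions for illustration only), not the standard-curve table from the book.

```r
library(rstan)

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

# Simulated stand-in for the standard-curve data: evaluate g(x, beta) at a few
# dilutions of an assumed standard and add noise. All values are illustrative.
set.seed(0)
beta_true <- c(10, 100, 2, 1.5)      # hypothetical beta_1, ..., beta_4
x <- rep(100 / 3^(0:5), each = 2)    # two replicate measurements per dilution
g <- beta_true[1] + beta_true[2] / (1 + (x / beta_true[3])^(-beta_true[4]))
y <- rnorm(length(x), mean = g, sd = 2)

stan_data <- list(
  N = length(x), A = 30, x = x, y = y,
  M = 50, xnew = seq(0.1, 100, length.out = 50)
)

fit <- stan(
  file = here::here("models", "serial-dilution.stan"),
  data = stan_data,
  chains = 4, iter = 2000, seed = 0
)
print(fit, pars = c("beta", "alpha", "sigma"))
```

Posterior predictive draws along `xnew` could then be pulled out with `rstan::extract(fit, pars = "ynew")` and compared against the simulated curve.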
diff --git a/notes/13_intro-to-regression-models_bda3-14.Rmd b/notes/13_intro-to-regression-models_bda3-14.Rmd
new file mode 100644
index 0000000..48c861f
--- /dev/null
+++ b/notes/13_intro-to-regression-models_bda3-14.Rmd
@@ -0,0 +1,78 @@
+---
+title: "13. Notes on 'Ch 14. Introduction to regression models'"
+date: "2021-12-06"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 14. Introduction to regression models
+
+### 14.1 Conditional modeling
+
+- question: how does one quantity $y$ vary as a function of another quantity or vector of quantities $x$?
+ - conditional distribution of $y$ given $x$ parameterized as $p(y|\theta,x)$
+- key statistical modeling issues:
+ 1. defining $y$ and $x$ so that $y$ is reasonably linear as a function of the columns of $X$
+ - may need to transform $x$
+ 2. set priors on the model parameters
+
+### 14.2 Bayesian analysis of classical regression
+
+- simplest case: *ordinary linear regression*
+ - observation errors are independent and have equal variance
+
+$$
+y | \beta, \sigma, X \sim \text{N}(X \beta, \sigma^2 I)
+$$
+
+#### Posterior predictive distribution for new data
+
+- posterior predictive distribution has two sources of uncertainty:
+ 1. the inherent variability in the model represented by $\sigma$ in $y$
+ 2. posterior uncertainty in $\beta$ and $\sigma$
+- draw a random sample $\tilde{y}$ from the posterior predictive distribution:
+ - draw $(\beta, \sigma)$ from their posteriors
+ - draw $\tilde{y} \sim \text{N}(\tilde{X} \beta, \sigma^2 I)$
+
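As a concrete illustration of the two-step simulation just described, here is a small R sketch; the posterior draws of $\beta$ and $\sigma$ are hypothetical placeholders standing in for draws from a fitted regression model.

```r
# Posterior predictive simulation for ordinary linear regression (sketch).
# The 'posterior draws' below are hypothetical placeholders.
set.seed(0)
S <- 1000                                                # number of posterior draws
beta_draws <- cbind(rnorm(S, 1, 0.1), rnorm(S, 2, 0.2))  # stand-in draws of beta
sigma_draws <- abs(rnorm(S, 1, 0.1))                     # stand-in draws of sigma

X_new <- cbind(1, c(-1, 0, 1))   # new design matrix: intercept + one predictor

# For each posterior draw, simulate y-tilde ~ N(X_new %*% beta, sigma^2 I).
y_pred <- t(sapply(seq_len(S), function(s) {
  rnorm(nrow(X_new), mean = X_new %*% beta_draws[s, ], sd = sigma_draws[s])
}))
dim(y_pred)  # S predictive draws for each of the 3 new observations
```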
+## 14.4 Goals of regression analysis
+
+- at least three goals:
+ 1. understand the behavior of $y$ given $x$
+ 2. predict $y$ given $x$
+ 3. causal inference; predict how $y$ would change if $x$ were changed
+
+## 14.5 Assembling the matrix of explanatory variables
+
+### Identifiability and collinearity
+
+- "the parameters in a classical regression cannot be uniquely estimated if there are more parameters than data points or, more generally, if the columns of the matrix $X$ of explanatory variables are not linearly independent" (pg 365)
+
+### Nonlinear relations
+
+- may need to transform variables
+- can include more than one transformation in the model as separate covariates
+- GLMs and non-linear models are discussed in later chapters
+
+### Indicator variables
+
+- include a categorical variable in a regression using an indicator variable
+ - separate effect for each category
+ - or model as related with a hierarchical model
+
+### Interactions
+
+- "If the response to a unit change in $x_i$ depends in what value another predictor $x_j$ has been fixed at, then it is necessary to include *interaction* terms in the model" (pg 367)
+ - $(x_i - \bar{x_i})(x_j - \bar{x_j})$
+
+## 14.6 Regularization and dimension reduction
+
+- see lecture notes on regularization for more updated recommendations
+- "Bayesian regularization":
+ - location and scale of the prior
+ - analytic form of the prior (e.g. normal vs. Laplacian vs. Cauchy)
+ - how the posterior inference is summarized
diff --git a/notes/13_intro-to-regression-models_bda3-14.html b/notes/13_intro-to-regression-models_bda3-14.html
new file mode 100644
index 0000000..a7c86c7
--- /dev/null
+++ b/notes/13_intro-to-regression-models_bda3-14.html
@@ -0,0 +1,2607 @@
diff --git a/notes/14_hierarchical-linear-models_bda3-15.Rmd b/notes/14_hierarchical-linear-models_bda3-15.Rmd
new file mode 100644
index 0000000..8f15df1
--- /dev/null
+++ b/notes/14_hierarchical-linear-models_bda3-15.Rmd
@@ -0,0 +1,18 @@
+---
+title: "14. Notes on 'Ch 15. Hierarchical linear models'"
+date: "2021-12-09"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+> Also, these notes are rather sparse because I have a bit of experience with these models already.
+
+## Chapter 15. Hierarchical linear models
+
+- "Hierarchical regression models are useful as soon as there are predictors at different levels of variation" (pg 381)
+ - or for data obtained by stratified/clustered sampling
+- general recommendation to start simple and build up the model's complexity
diff --git a/notes/14_hierarchical-linear-models_bda3-15.html b/notes/14_hierarchical-linear-models_bda3-15.html
new file mode 100644
index 0000000..81ef91e
--- /dev/null
+++ b/notes/14_hierarchical-linear-models_bda3-15.html
@@ -0,0 +1,2520 @@
diff --git a/notes/17_parametric-nonlinear-models_bda3-19.Rmd b/notes/17_parametric-nonlinear-models_bda3-19.Rmd
new file mode 100644
index 0000000..a286d06
--- /dev/null
+++ b/notes/17_parametric-nonlinear-models_bda3-19.Rmd
@@ -0,0 +1,94 @@
+---
+title: "17. Notes on 'Ch 19. Parametric nonlinear models'"
+date: "2021-12-09"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 19. Parametric nonlinear models
+
+- examples:
+  - a ratio: $\text{E}(y) = \frac{a_1 + b_1 x_1}{a_2 + b_2 x_2}$
+ - sum of nonlinear functions: $\text{E}(y) = A_1 e^{-\alpha_1x} + A_2 e^{-\alpha_2x}$
+- parameters of a nonlinear model are often harder to interpret
+ - often requires custom visualization techniques
+- **"Generally each new modeling problem must be tackled afresh."** (pg 471)
+ - these models are less systematic than linear modeling
+
+### 19.1 Example: serial dilution assay
+
+- estimate 10 unknown concentrations of an allergen based on serial dilutions of a known standard
+
+#### The model
+
+- **Notation**:
+  - parameters of interest: concentrations of unknown samples $\theta_1, \dots, \theta_{10}$
+ - known concentration of the standard $\theta_0$
+ - dilution of measure $i$ as $x_i$ and color intensity (measurement) as $y_i$
+- **Curve of expected measurements given the concentration**
+ - use the following equation that is standard in the field
+ - parameters:
+ - $\beta_1$: color intensity at the limit of 0 concentration
+ - $\beta_2$: the increase to saturation
+ - $\beta_3$: concentration at which the gradient of the curve turns
+ - $\beta_4$: rate at which saturation occurs
+
+$$
+\text{E}(y | x, \beta) =
+ g(x, \beta) =
+ \beta_1 + \frac{\beta_2}{1 + (x / \beta_3)^{-\beta_4}}
+$$
+
+- **Measurement error**
+ - modeled as normally distributed with unequal variances
+ - parameters:
+ - $\alpha$: models the pattern that variances are higher for larger measurements
+ - restricted $[0, 1]$
+    - $A$: an arbitrary constant to scale the data so $\sigma$ can be interpreted as the deviation from "typical" values
+ - $\sigma$: deviation of a measure from the "typical"
+
+$$
+y_i \sim \text{N}(g(x_i, \beta), (\frac{g(x_i, \beta)}{A})^{2\alpha} \sigma_y^2)
+$$
+
+- **Dilution errors**
+ - two possible sources:
+ 1. *initial dilution*: the accuracy of the creation of the initial standard concentration
+ 2. *serial dilutions*: error in creation of the subsequent dilutions (low enough to ignore for this analysis)
+ - use a normal model on the log scale of the initial dilution error
+ - parameters:
+ - $\theta_0$: known concentration of the standard solution
+ - $d_0^\text{init}$: known initial dilution of the standard that is called for
+ - without error, the concentration of the initial solution would be $d_0^\text{init} \theta_0$
+ - $x_0^\text{init}$: the *actual* (unknown) concentration of the initial dilution
+
+$$
+\log(x_0^\text{init}) \sim \text{N}(\log(d_0^\text{init} \cdot \theta_0), (\sigma^\text{init})^2)
+$$
+
+- **Dilution errors** (cont)
+ - there is no initial dilution for the unknown samples being tested
+    - therefore, the unknown initial concentration for sample $j$ is $x_j^\text{init} = \theta_j$ for $j = 1, \dots, 10$
+ - for the dilutions of the unknown samples, set $x_i = d_i \cdot x_{j(i)}^\text{init}$
+ - $j(i)$ is the sample $j$ corresponding to measurement $i$
+ - $d_i$ is the dilution of measurement $i$ relative to the initial concentration
+
+#### Prior distributions
+
+- priors used as described by book (are likely different than what would be recommended now):
+ - $\log(\beta) \sim U(-\infty, \infty)$
+ - $\sigma_y \sim U(0, \infty)$
+ - $\alpha \sim U(0,1)$
+ - $p(\log \theta_j) \propto 1$ for each unknown $j = 1, \dots, 10$
+- cannot estimate $\sigma^\text{init}$ because we only have a single standard
+ - use a fixed value of 0.02 based on a previous analysis of different plates
+
+### 19.2 Example: population toxicokinetics
+
+- this is a more complex model
+- uses a physiological model with parameters that cannot be solely determined using the data
+ - requires informative priors based on previous studies
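For intuition about the curve $g(x, \beta)$ in section 19.1 above, the short sketch below evaluates it over a grid of concentrations; the parameter values are assumed purely for illustration and are not estimates from the assay data.

```r
# The standard-curve function from section 19.1 (illustrative parameters only).
g_curve <- function(x, beta) {
  beta[1] + beta[2] / (1 + (x / beta[3])^(-beta[4]))
}

beta_example <- c(10, 100, 2, 1.5)   # hypothetical beta_1, ..., beta_4
x_grid <- seq(0.01, 100, length.out = 200)

plot(
  x_grid, g_curve(x_grid, beta_example), type = "l", log = "x",
  xlab = "concentration (log scale)", ylab = "expected color intensity"
)
```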
diff --git a/notes/17_parametric-nonlinear-models_bda3-19.html b/notes/17_parametric-nonlinear-models_bda3-19.html
new file mode 100644
index 0000000..625c447
--- /dev/null
+++ b/notes/17_parametric-nonlinear-models_bda3-19.html
@@ -0,0 +1,2629 @@
diff --git a/notes/18_basis-function-models_bda3-20.Rmd b/notes/18_basis-function-models_bda3-20.Rmd
new file mode 100644
index 0000000..181d612
--- /dev/null
+++ b/notes/18_basis-function-models_bda3-20.Rmd
@@ -0,0 +1,64 @@
+---
+title: "18. Notes on 'Ch 20. Basis function models'"
+date: "2022-01-12"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 20. Basis function models
+
+- chapter 19 focused on nonlinear models with $\text{E}(y_i|X_i,\phi) = \mu(X_i,\phi)$ where $\mu$ is a parametric nonlinear function of unknowns $\phi$
+- in this and following chapters, consider models where $\mu$ is also unknown
+
+### 20.1 Splines and weighted sums of basis functions
+
+- replace $X_i \beta$ with $\mu(X_i)$ where $\mu(\cdot)$ is some class of nonlinear functions
+ - different options for modeling $\mu$ including with basis function expansions or Gaussian processes (next chapter)
+- basis function approach: $\mu(x) = \sum_{h=1}^{H} \beta_h b_h(x)$
+ - $b_h$: set of basis functions
+ - $\beta_h$: vector of basis coefficients
+- common choices for basis functions are:
+ - **Gaussian radial basis functions**: multiple centers of the basis functions with a width parameter controlling a set of Gaussian functions
+ - **B-spline**: a piecewise continuous function based on a set of knots
+    - knot locations control the flexibility of the basis
+    - knots can be placed uniformly or non-uniformly (e.g. based on the density of the data)
+ - can use a "free knot approach" with a prior on the number and location of knots, but is computationally demanding
+ - instead can use priors on the coefficients $\beta$ to shrink values to near 0
+
+## 20.2 Basis selection and shrinkage coefficients
+
+- common to not know which basis functions are really needed
+- can use a variable selection approach to allow the model to estimate the "importance" of each basis function
+ - can then either select the best model from the posterior or average over all possible models by weighting each basis by its importance
+- the initial choice of basis functions can introduce some bias
+ - implied prior information on the smoothness and shape of the model
+ - can include multiple types of basis functions in the initial collection
+
+### Shrinkage priors
+
+- allowing basis function coefficients to be zero with positive probability represents a challenge for sampling from the posterior
+ - with many basis functions, it is computationally infeasible to visit all possible states
+- may be better to use shrinkage priors instead
+ - there are various options discussed in the book and there are likely others recommended now
+
+## 20.3 Non-normal models and regression surfaces
+
+### Other error distributions
+
+- may want to model data that is not a continuous output variable $y$ or does not have Gaussian residuals
+ - can modify the residual densities with different prior distributions to accommodate outliers
+ - can use the basis function and its coefficients $\eta_i = w_i \beta$ as the linear component in a GLM
+
+### Multivariate regression surfaces
+
+- careful with *curse of dimensionality*
+- one option is to assume additivity of the covariates so the regression surface can be modeled as a sum of univariate regression functions
+ - this does not always make sense and a different approach using a tensor product is described at the end of the section
+- can use informative priors to help restrict the search space
+ - is a form of including prior information, so still proper Bayesian results and measures of uncertainty
+ - e.g. if we know *a priori* that the mean response variable is non-decreasing, restrict the coefficients $\beta$ to be non-negative
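As a sketch of the weighted-sum-of-basis-functions idea in section 20.1, the snippet below builds a B-spline basis with `splines::bs()` and forms $\mu(x) = \sum_h \beta_h b_h(x)$; the coefficient vector is random and purely illustrative.

```r
# Weighted sum of B-spline basis functions (sketch with made-up coefficients).
library(splines)

set.seed(0)
x <- seq(0, 1, length.out = 200)
B <- bs(x, df = 8, intercept = TRUE)   # 200 x 8 matrix of basis functions b_h(x)

beta_h <- rnorm(ncol(B))               # hypothetical basis coefficients
mu <- as.vector(B %*% beta_h)          # mu(x) = sum_h beta_h * b_h(x)

plot(x, mu, type = "l", ylab = expression(mu(x)))
matlines(x, B, lty = 2, col = "grey60")  # the individual basis functions
```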
diff --git a/notes/18_basis-function-models_bda3-20.html b/notes/18_basis-function-models_bda3-20.html
new file mode 100644
index 0000000..fc1f146
--- /dev/null
+++ b/notes/18_basis-function-models_bda3-20.html
@@ -0,0 +1,2595 @@
diff --git a/notes/19_gaussian-processes_bda3-21.Rmd b/notes/19_gaussian-processes_bda3-21.Rmd
new file mode 100644
index 0000000..6dac644
--- /dev/null
+++ b/notes/19_gaussian-processes_bda3-21.Rmd
@@ -0,0 +1,147 @@
+---
+title: "19. Notes on 'Ch 21. Gaussian process models'"
+date: "2022-01-13"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+
+library(glue)
+library(ggtext)
+library(tidyverse)
+
+theme_set(
+ theme_bw() +
+ theme(axis.ticks = element_blank(), strip.background = element_blank())
+)
+```
+
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 21. Gaussian process models
+
+- *Gaussian process* (GP): "flexible class of models for which any finite-dimensional marginal distribution is Gaussian" (pg. 501)
+ - "can be viewed as a potentially infinite-dimensional generalization of Gaussian distribution" (pg. 501)
+
+### 21.1 Gaussian process regression
+
+- realizations from a GP correspond to random functions
+ - good prior for an unknown regression function $\mu(x)$
+- $\mu \sim \text{GP}(m,k)$
+ - $m$: mean function
+ - $k$: covariance function
+- $\mu$ is a random function ("stochastic process") where the values at any $n$ points $x_1, \dots, x_n$ are drawn from the $n$-dimensional normal distribution
+ - with mean $m$ and covariance $K$:
+
+$$
+\mu(x_1), \dots, \mu(x_n) \sim \text{N}((m(x_1), \dots, m(x_n)), K(x_1, \dots, x_n))
+$$
+
+- the GP $\mu \sim \text{GP}(m,k)$ is nonparametric with infinitely many parameters
+  - the mean function $m$ represents an initial guess at the regression function
+ - the covariance function $k$ represents the covariance between the process at any two points
+ - controls the smoothness of realizations from the GP and degree of shrinkage towards the mean
+- below is an example of realizations from a GP with mean function 0 and the *squared exponential* (a.k.a. exponentiated quadratic, Gaussian) covariance function with different parameters
+
+$$
+k(x, x^\prime) = \tau^2 \exp(-\frac{|x-x^\prime|^2}{2l^2})
+$$
+
+```{r}
+squared_exponential_cov <- function(x, tau, l) {
+ n <- length(x)
+ k <- matrix(0, nrow = n, ncol = n)
+ denom <- 2 * (l^2)
+ for (i in 1:n) {
+ for (j in 1:n) {
+ a <- x[i]
+ b <- x[j]
+ k[i, j] <- tau^2 * exp(-(abs(a - b)^2) / (denom))
+ }
+ }
+ return(k)
+}
+
+my_gaussian_process <- function(x, tau, l, n = 3) {
+ m <- rep(0, length(x))
+ k <- squared_exponential_cov(x = x, tau = tau, l = l)
+ gp_samples <- mvtnorm::rmvnorm(n = n, mean = m, sigma = k)
+ return(gp_samples)
+}
+
+tidy_gp <- function(x, tau, l, n = 3) {
+ my_gaussian_process(x = x, tau = tau, l = l, n = n) %>%
+ as.data.frame() %>%
+ as_tibble() %>%
+ set_names(x) %>%
+ mutate(sample_idx = as.character(1:n())) %>%
+ pivot_longer(-sample_idx, names_to = "x", values_to = "y") %>%
+ mutate(x = as.numeric(x))
+}
+
+set.seed(0)
+x <- seq(0, 10, by = 0.1)
+gp_samples <- tibble(tau = c(0.25, 0.5, 0.25, 0.5), l = c(0.5, 0.5, 2, 2)) %>%
+ mutate(samples = purrr::map2(tau, l, ~ tidy_gp(x = x, tau = .x, l = .y, n = 3))) %>%
+ unnest(samples)
+
+gp_samples %>%
+ mutate(grp = glue("\u03C4 = {tau}, \u2113 = {l}")) %>%
+ ggplot(aes(x = x, y = y)) +
+ facet_wrap(vars(grp), nrow = 2) +
+ geom_line(aes(color = sample_idx)) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ scale_color_brewer(type = "qual", palette = "Set1") +
+ theme(legend.position = "none", axis.text.y = element_markdown()) +
+ labs(x = "x", y = "\u03BC(x)")
+```
+
+#### Covariance functions
+
+- "Different covariance functions can be used to add structural prior assumptions like smoothness, nonstationarity, periodicity, and multi-scale or hierarchical structures." (pg. 502)
+ - sums and products of GPs are also GPs so can combine them in the same model
+- can also use "anisotropic" GP covariance functions for multiple predictors
+
+#### Inference
+
+- computing the mean and covariance in the $n$-variate normal conditional posterior for $\tilde{\mu}$ involves a matrix inversion that requires $O(n^3)$ computation
+ - this needs to be repeated for each MCMC step
+ - limits the size of the data set and number of covariates in a model
+
+#### Covariance function approximations
+
+- there are approximations to the GP that can speed up computation
+ - generally work by reducing the matrix inversion burden
+
+### 21.3 Latent Gaussian process models
+
+- with non-Gaussian likelihoods, the GP prior becomes a latent function $f$ which determines the likelihood $p(y|f,\phi)$ through a link function
+
+### 21.4 Functional data analysis
+
+- *functional data analysis*: considers responses and predictors not as scalar/vector-valued random variables but as random functions with infinitely many points
+ - GPs fit this need well with little modification
+
+### 21.5 Density estimation and regression
+
+- can get more flexibility by modeling the conditional observation model as a nonparametric GP
+ - so far have used a GP as a prior for a function controlling location or shape of a parametric observation model
+ - one solution is the *logistic Gaussian process* (LGP) or a Dirichlet process (covered in a later chapter)
+
+#### Density estimation
+
+- LGP generates a random surface from a GP and then transforms the surface to the space of probability densities
+ - with 1D, the surface is just a curve
+ - use the continuous logistic transformation to constrain to non-negative and integrate to 1
+- there is an illustrative example in the book on page 513
+
+#### Density regression
+
+- generalize the LPG to density regression by putting a prior on the collection of conditional densities
+
+#### Latent-variable regression
+
+- an alternative to the LGP
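The $O(n^3)$ cost noted under "Inference" comes from solving a linear system with the $n \times n$ covariance matrix. Below is a self-contained sketch of GP-regression conditioning with a zero mean function, simulated data, and assumed hyperparameters (it re-implements a squared-exponential covariance so it runs independently of the chunk above).

```r
# GP regression conditioning (sketch): posterior mean/covariance at new points.
sq_exp_cov <- function(x1, x2, tau = 1, l = 1) {
  outer(x1, x2, function(a, b) tau^2 * exp(-(a - b)^2 / (2 * l^2)))
}

set.seed(0)
x_obs <- seq(0, 10, length.out = 20)
y_obs <- sin(x_obs) + rnorm(length(x_obs), sd = 0.2)  # simulated observations
x_new <- seq(0, 10, length.out = 100)
sigma_noise <- 0.2                                    # assumed noise scale

K_nn <- sq_exp_cov(x_obs, x_obs) + diag(sigma_noise^2, length(x_obs))
K_sn <- sq_exp_cov(x_new, x_obs)
K_ss <- sq_exp_cov(x_new, x_new)

K_inv <- solve(K_nn)                          # the O(n^3) step
post_mean <- K_sn %*% K_inv %*% y_obs         # E[mu(x_new) | y]
post_cov  <- K_ss - K_sn %*% K_inv %*% t(K_sn)

plot(x_obs, y_obs, pch = 19, xlab = "x", ylab = "y")
lines(x_new, post_mean, col = "blue")         # posterior mean function
```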
diff --git a/notes/19_gaussian-processes_bda3-21.html b/notes/19_gaussian-processes_bda3-21.html
new file mode 100644
index 0000000..599ebf8
--- /dev/null
+++ b/notes/19_gaussian-processes_bda3-21.html
@@ -0,0 +1,2709 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 19. Notes on 'Ch 21. Gaussian process models'
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
19. Notes on ‘Ch 21. Gaussian process models’
+
+
+
+
+
+
+
+
+These are just notes on a single chapter of BDA3 that were
+not part of the course.
+
+
Chapter 21. Gaussian
+process models
+
+- Gaussian process (GP): “flexible class of models for which
+any finite-dimensional marginal distribution is Gaussian” (pg. 501)
+
+- “can be viewed as a potentially infinite-dimensional generalization
+of Gaussian distribution” (pg. 501)
+
+
+
21.1 Gaussian process
+regression
+
+- realizations from a GP correspond to random functions
+
+- good prior for an unknown regression function \(\mu(x)\)
+
+- \(\mu \sim \text{GP}(m,k)\)
+
+- \(m\): mean function
+- \(k\): covariance function
+
+- \(\mu\) is a random function
+(“stochastic process”) where the values at any \(n\) pooints \(x_1, \dots, x_n\) are drawn from the \(n-dimensional\) normal distribution
+
+- with mean \(m\) and covariance
+\(K\):
+
+
+
\[
+\mu(x_1), \dots, \mu(x_n) \sim \text{N}((m(x_1), \dots, m(x_n)), K(x_1,
+\dots, x_n))
+\]
+
+- the GP \(\mu \sim \text{GP}(m,k)\)
+is nonparametric with infinitely many parameters
+
+- the mean function \(m\) represents
+an inital guess at the regression function
+- the covariance function \(k\)
+represents the covariance between the process at any two points
+
+- controls the smoothness of realizations from the GP and degree of
+shrinkage towards the mean
+
+
+- below is an example of realizations from a GP with mean function 0
+and the squared exponential (a.k.a. exponentiated quadratic,
+Gaussian) covariance function with different parameters
+
+
\[
+k(x, x^\prime) = \tau^2 \exp(-\frac{|x-x^\prime|^2}{2l^2})
+\]
+
+
+
squared_exponential_cov <- function(x, tau, l) {
+ n <- length(x)
+ k <- matrix(0, nrow = n, ncol = n)
+ denom <- 2 * (l^2)
+ for (i in 1:n) {
+ for (j in 1:n) {
+ a <- x[i]
+ b <- x[j]
+ k[i, j] <- tau^2 * exp(-(abs(a - b)^2) / (denom))
+ }
+ }
+ return(k)
+}
+
+my_gaussian_process <- function(x, tau, l, n = 3) {
+ m <- rep(0, length(x))
+ k <- squared_exponential_cov(x = x, tau = tau, l = l)
+ gp_samples <- mvtnorm::rmvnorm(n = n, mean = m, sigma = k)
+ return(gp_samples)
+}
+
+tidy_gp <- function(x, tau, l, n = 3) {
+ my_gaussian_process(x = x, tau = tau, l = l, n = n) %>%
+ as.data.frame() %>%
+ as_tibble() %>%
+ set_names(x) %>%
+ mutate(sample_idx = as.character(1:n())) %>%
+ pivot_longer(-sample_idx, names_to = "x", values_to = "y") %>%
+ mutate(x = as.numeric(x))
+}
+
+set.seed(0)
+x <- seq(0, 10, by = 0.1)
+gp_samples <- tibble(tau = c(0.25, 0.5, 0.25, 0.5), l = c(0.5, 0.5, 2, 2)) %>%
+ mutate(samples = purrr::map2(tau, l, ~ tidy_gp(x = x, tau = .x, l = .y, n = 3))) %>%
+ unnest(samples)
+
+gp_samples %>%
+ mutate(grp = glue("\u03C4 = {tau}, \u2113 = {l}")) %>%
+ ggplot(aes(x = x, y = y)) +
+ facet_wrap(vars(grp), nrow = 2) +
+ geom_line(aes(color = sample_idx)) +
+ scale_x_continuous(expand = expansion(c(0, 0))) +
+ scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
+ scale_color_brewer(type = "qual", palette = "Set1") +
+ theme(legend.position = "none", axis.text.y = element_markdown()) +
+ labs(x = "x", y = "\u03BC(x)")
+
+
+
+
+
Covariance functions
+
+- “Different covariance functions can be used to add structural prior
+assumptions like smoothness, nonstationarity, periodicity, and
+multi-scale or hierarchical structures.” (pg. 502)
+
+- sums and products of GPs are also GPs so can combine them in the
+same model
+
+- can also use “anisotropic” GPs covariance functions for multiple
+predictors
+
+
Inference
+
+- computing the mean and covariance in the \(n\)-variate normal conditional posterior
+for \(\tilde{\mu}\) involves a matrix
+inversion that requires \(O(n^3)\)
+computation
+
+- this needs to be repeated for each MCMC step
+- limits the size of the data set and number of covariates in a
+model
+
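+- below is a small sketch (my own, not from the book) of the exact GP
+regression posterior mean, to show where the \(O(n^3)\) cost comes from:
+the solve() call inverts the \(n \times n\) matrix \(K + \sigma^2 I\);
+the helper name, data, and parameter values are made up
+
+gp_posterior_mean <- function(x_train, y_train, x_test, tau, l, sigma) {
+  # covariance of the training points with themselves (n x n)
+  k_train <- squared_exponential_cov(x_train, tau = tau, l = l)
+  # cross-covariance between the test points and the training points
+  k_cross <- outer(x_test, x_train, function(a, b) {
+    tau^2 * exp(-(abs(a - b)^2) / (2 * l^2))
+  })
+  # the O(n^3) step: inverting the n-by-n matrix (K + sigma^2 I)
+  k_inv <- solve(k_train + diag(sigma^2, length(x_train)))
+  as.vector(k_cross %*% k_inv %*% y_train)
+}
+
+x_train <- seq(0, 10, by = 0.5)
+y_train <- sin(x_train) + rnorm(length(x_train), 0, 0.1)
+gp_posterior_mean(x_train, y_train, x_test = c(2.5, 7.5), tau = 1, l = 1, sigma = 0.1)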
+
+
Covariance function
+approximations
+
+- there are approximations to the GP that can speed up computation
+
+- generally work by reducing the matrix inversion burden
+
+
+
21.3 Latent Gaussian process
+models
+
+- with non-Gaussian likelihoods, the GP prior becomes a latent
+function \(f\) which determines the
+likelihood \(p(y|f,\phi)\) through a
+link function
+
+
21.4 Functional data analysis
+
+- functional data analysis: considers responses and
+predictors not as scalar/vector-valued random variables but as random
+functions with infinitely many points
+
+- GPs fit this need well with little modification
+
+
+
21.5 Density estimation and
+regression
+
+- can get more flexibility by modeling the conditional observation
+model as a nonparametric GP
+
+- so far have used a GP as a prior for a function controlling location
+or shape of a parametric observation model
+- one solution is the logistic Gaussian process (LGP) or a
+Dirichlet process (covered in a later chapter)
+
+
+
Density estimation
+
+- LGP generates a random surface from a GP and then transforms the
+surface to the space of probability densities
+
+- with 1D, the surface is just a curve
+- use the continuous logistic transformation to constrain to
+non-negative and integrate to 1
+
+- there is an illustrative example in the book on page 513
+
+
Density regression
+
+- generalize the LGP to density regression by putting a prior on the
+collection of conditional densities
+
+
Latent-variable regression
+
diff --git a/notes/20_finite-mixture-models_bda3-22.Rmd b/notes/20_finite-mixture-models_bda3-22.Rmd
new file mode 100644
index 0000000..abfae8b
--- /dev/null
+++ b/notes/20_finite-mixture-models_bda3-22.Rmd
@@ -0,0 +1,91 @@
+---
+title: "20. Notes on 'Ch 22. Finite mixture models'"
+date: "2022-01-18"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 22. Finite mixture models
+
+- "when measurements of a random variable are taken under two different conditions" (pg. 519)
+- where the data contains multiple subpopulations, each with a different, relatively simple model
+- basic mixture modeling principle is to introduce *unobserved indicators* $z$ to specify the mixture component for an observation
+ - can think of a mixture indicator as missing data
+
+### 22.1 Setting up and interpreting mixture models
+
+#### Finite mixtures
+
+- want to model the distribution of $y = (y_1, \dots, y_n)$ or $y|x$ as a mixture of $H$ components
+ - for each component $h \in (1, \dots, H)$, the distribution $f_h(y_i | \theta_h)$ depends on a parameter vector $\theta_h$
+ - $\lambda_h$ denotes the proportion of the population in component $h$
+ - $\sum_{h=1}^{H} \lambda_h = 1$
+ - common to assume all mixture components have the same parametric form
+ - thus, the sampling distribution of $y$ is:
+
+$$
+p(y_i | \theta, \lambda) = \lambda_1 f(y_i | \theta_1) + \dots + \lambda_H f(y_i | \theta_H)
+$$
+
+- can think of the mixture distribution probabilities $\lambda$ as priors over the parameters $\theta_h$
+ - or as a description of the variation in $\theta$ in a population
+ - akin to a hierarchical model
+- introduce the indicator variables $z_{ih}$ where $z_{ih} = 1$ if the $i$th data point is drawn from component $h$ and 0 otherwise
+  - the $\lambda$ values are used to determine $z$
+  - can think of $\lambda$ as a hyperprior over $z$
+  - the joint distribution of the observed data $y$ and the unobserved indicators $z$, conditional on the model parameters, is:
+ - only one $z_{ih}$ can be 1 for each $i$
+
+$$
+\begin{aligned}
+p(y, z | \theta, \lambda) &= p(z | \lambda) p(y | z, \theta) \\
+ &= \prod_{i=1}^n \prod_{h=1}^H (\lambda_h f(y_i | \theta_h))^{z_{i,h}}
+\end{aligned}
+$$
+
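+A quick simulation sketch (my own, not from the book) of the two equations above: draw the indicators $z_i$ from the mixture proportions $\lambda$, then draw each $y_i$ from its assigned normal component; the weights, component means, and sample size are arbitrary choices.
+
+```{r}
+set.seed(1)
+
+lambda <- c(0.3, 0.7) # mixture proportions, sum to 1
+mu <- c(-2, 3)        # component means (the theta_h)
+n <- 500
+
+# draw the unobserved indicators z_i, then y_i given z_i from its component.
+z <- sample(1:2, size = n, replace = TRUE, prob = lambda)
+y <- rnorm(n, mean = mu[z], sd = 1)
+
+# the marginal sampling distribution p(y | theta, lambda) on a grid.
+y_grid <- seq(-6, 7, length.out = 200)
+mix_dens <- lambda[1] * dnorm(y_grid, mu[1], 1) + lambda[2] * dnorm(y_grid, mu[2], 1)
+
+hist(y, breaks = 40, freq = FALSE, main = "two-component normal mixture", xlab = "y")
+lines(y_grid, mix_dens, lwd = 2)
+```
+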
+#### Continuous mixtures
+
+- generalize the finite mixture by allowing an observation to belong to a continuum of classes rather than a finite set
+- hierarchical models are a form of continuous mixture model
+ - each observed value $y_i$ is modeled as coming from a mixture of models defined by the probability of values for $\theta$
+- in the book, the focus is on finite mixtures and "minor modifications" are generally required to form a continuous distribution
+
+#### Identifiability of the mixture likelihood
+
+- **all finite mixture models are nonidentifiable: the distribution is unchanged if the group labels are permuted** (checked numerically below)
+- in many cases, purposeful, informative priors can solve the issue
+
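+A tiny numerical check (mine, not from the book) of the label-switching problem: permuting the component labels of a normal mixture leaves the likelihood unchanged; the data and parameter values are made up.
+
+```{r}
+mixture_loglik <- function(y, lambda, mu, sigma = 1) {
+  # log p(y | theta, lambda) for a normal mixture, summed over observations.
+  sum(log(sapply(y, function(yi) sum(lambda * dnorm(yi, mean = mu, sd = sigma)))))
+}
+
+y_obs <- c(-2.1, -1.8, 2.9, 3.2, 0.4)
+mixture_loglik(y_obs, lambda = c(0.3, 0.7), mu = c(-2, 3))
+mixture_loglik(y_obs, lambda = c(0.7, 0.3), mu = c(3, -2)) # labels permuted: same value
+```
+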
+#### Prior distributions
+
+- the priors for a finite mixture model's parameters $\theta$ and $\lambda$ are usually the product of the two independent priors on each variable
+ - because the vector of mixture indicators $z_i = (z_{i,1}, \dots, z_{i,H})$ is multinomial with parameter $\lambda$, a common prior for $\lambda$ is the Dirichlet
+ - $\lambda \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_H)$
+ - $\theta = (\theta_1, \dots, \theta_H)$ is the vector of parameters for each component's sub-model
+    - some can be shared across components (e.g., equal variance for a group of normal distributions)
+
+#### Number of mixture components
+
+- can model $H$ as unknown but is computationally expensive
+- usually can just build models with different $H$ and compare their goodness of fit
+ - compare the posterior predictive distributions with a "suitably chosen" test quantity
+
+#### Mixtures as true models or approximating distributions
+
+- two classes of thought[^1]:
+ 1. *theoretical:* a mixture model is "a realistic characterization of the true data-generating mechanism" (pg. 522)
+ 2. *pragmatic:* "trying to infer latent subpopulations is an intrinsically ill-defined statistical problem, but finite mixture models are nonetheless useful" (pg. 523)
+
+[^1]: I came up with the names of the two schools of thought as descriptive titles; they do not appear in the book.
+
+### 22.4 Unspecified number of mixture components
+
+- can assign a Poisson distribution as a prior on $H$ (the number of groups/components in the mixture model)
+ - computationally intensive
+ - more common to just fit the model with different $H$ and compare with some statistic and a penalty for model complexity
+ - WAIC is theoretically justified, but ignores the uncertainty over $H$
+ - LOO-CV may be even better
diff --git a/notes/21_Dirichlet-process-models_bda3-23.Rmd b/notes/21_Dirichlet-process-models_bda3-23.Rmd
new file mode 100644
index 0000000..95aa2ef
--- /dev/null
+++ b/notes/21_Dirichlet-process-models_bda3-23.Rmd
@@ -0,0 +1,350 @@
+---
+title: "21. Notes on 'Ch 23. Dirichlet process models'"
+date: "2022-01-20"
+output: distill::distill_article
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
+
+library(tidyverse)
+
+theme_set(
+ theme_bw() +
+ theme(
+ strip.background = element_blank(),
+ axis.ticks = element_blank()
+ )
+)
+```
+
+> These are just notes on a single chapter of *BDA3* that were not part of the course.
+
+## Chapter 23. Dirichlet process models
+
+- **Dirichlet process**: an infinite-dimensional generalization of the Dirichlet distribution
+ - used as a prior on unknown distributions
+ - can extend finite component mixture models to infinite mixture models
+
+### 23.1 Bayesian histograms
+
+- the histogram as a simple form of density estimation
+ - demonstrate a flexible parametric version that motivates the non-parametric in the following section
+- prespecified knots: $\xi = (\xi_0, \dots, \xi_k)$ with $\xi_{h-1} < \xi_h$
+- probability model for the density (a histogram):
+ - where $\pi = (\pi_1, \dots, \pi_k)$ is an unknown probability vector
+
+$$
+f(y) = \sum_{h=1}^k 1_{\xi_{h-1} < y \le \xi_h} \frac{\pi_h}{(\xi_h - \xi_{h-1})}
+$$
+
+- prior for the probabilities $\pi$ as a Dirichlet distribution:
+
+$$
+p(\pi|a) = \frac{\Gamma(\sum_{h=1}^k a_h)}{\prod_{h=1}^k \Gamma(a_h)} \prod_{h=1}^k \pi _h^{a_h - 1}
+$$
+
+- replace the hyperparameter vector: $a = \alpha \pi_0$ where $\pi_0$ is:
+
+$$
+\text{E}(\pi|a) = \pi_0 = \left( \frac{a_1}{\sum_h a_h}, \dots, \frac{a_k}{\sum_h a_h} \right)
+$$
+
+- the posterior for $\pi$ becomes:
+  - where $n_h$ is the number of observations $y$ in the $h$th bin
+
+$$
+p(\pi | y) \propto \prod_{h=1}^k \pi_h^{a_h + n_h - 1} = \text{Dirichlet}(a_1 + n_1, \dots, a_k + n_k)
+$$
+
+- this histogram estimator does well but is sensitive to the specification of the knots
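+
+Below is a minimal sketch (not from the book) of this Bayesian histogram: made-up data, evenly spaced knots, a uniform prior guess $a = (1, \dots, 1)$, and a single posterior draw of $\pi$ via normalized gamma variables (equivalent to a Dirichlet draw).
+
+```{r}
+set.seed(23)
+
+y <- rnorm(200, mean = 1, sd = 1.5)
+xi <- seq(-6, 8, by = 1)                     # knots xi_0, ..., xi_k
+n_h <- as.vector(table(cut(y, breaks = xi))) # observations per bin
+a <- rep(1, length(n_h))                     # prior a = alpha * pi_0 (uniform guess)
+
+# one draw from Dirichlet(a + n) using normalized gamma variables.
+rdirichlet1 <- function(a) {
+  g <- rgamma(length(a), shape = a, rate = 1)
+  g / sum(g)
+}
+pi_draw <- rdirichlet1(a + n_h)
+
+# the corresponding histogram density estimate f(y) on each bin.
+round(pi_draw / diff(xi), 3)
+```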
+
+### 23.2 Dirichlet process prior distributions
+
+#### Definition and basic properties
+
+- goal is to not need to prespecify the bins of the histogram
+- let:
+ - $\Omega$: sample space
+  - $B_1, \dots, B_k$: measurable subsets of $\Omega$
+- if $\Omega = \Re$, then $B_1, \dots, B_k$ are non-overlapping intervals that partition the real line into a finite number of bins
+- $P$: unknown probability measure of $(\Omega, \mathcal{B})$
+ - $\mathcal{B}$: "collection of all possible subsets of the sample space $\Omega$"
+ - $P$ assigns probabilities to the subsets $\mathcal{B}$
+ - probability for a set of bins $B_1, \dots, B_k$ partitioning $\Omega$:
+
+$$
+P(B_1), \dots, P(B_k) = \left( \int_{B_1} f(y) dy, \dots, \int_{B_k} f(y) dy \right)
+$$
+
+- $P$ is a *random probability measure* (RPM), so the bin probabilities are random variables
+- a good prior for the bin probabilities is the Dirichlet distribution \@ref(eq:dirichlet-prior)
+ - where $P_0$ is a base probability measure providing the initial guess at $P$
+ - where $\alpha$ is a prior concentration parameter
+ - controls shrinkage of $P$ towards $P_0$
+
+$$
+P(B_1), \dots, P(B_k) \sim \text{Dirichlet}(\alpha P_0(B_1), \dots, \alpha P_0(B_k))
+(\#eq:dirichlet-prior)
+$$
+
+- difference with previous Bayesian histogram: only specifies that bin $B_k$ is assigned probability $P(B_k)$ and not how probability mass is distributed across the bin $B_k$
+  - thus, for a fixed set of bins, this equation does not fully specify the prior for $P$
+ - need to eliminate the sensitivity to the choice of bins by assuming the prior holds for all possible partitions $B_1, \dots, B_k$ for all $k$
+ - then it is a fully specified prior for $P$
+- "must exist a random probability measure $P$ such that the probabilities assigned to any measurable partition $B_1, \dots, B_k$ by $P$ is $\text{Dirichlet}(\alpha P_0(B_1), \dots, \alpha P_0(B_k))$"
+ - the resulting $P$ is a Dirichlet process:
+ - $P \sim \text{DP}(\alpha P_0)$
+ - $\alpha > 0$: a scalar precision parameter
+ - $P_0$: baseline probability measure also on $(\Omega, \mathcal{B})$
+- implications of DP:
+ - the marginal random probability assigned to any subset $B$ is a beta distribution
+  - $P(B) \sim \text{Beta}(\alpha P_0(B), \alpha (1-P_0(B)))$ for all $B \in \mathcal{B}$
+ - the prior for $P$ is centered on $P_0$: $E(P(B)) = P_0(B)$
+ - $\alpha$ controls variance
+    - $\text{var}(P(B)) = \frac{P_0(B)(1 - P_0(B))}{1 + \alpha}$ (a quick simulation check of these two moments follows below)
+
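+A quick Monte Carlo check (my own) of the last two moments, taking $P_0 = \text{N}(0, 1)$, $B = (-\infty, 0]$ (so $P_0(B) = 0.5$), and an arbitrary $\alpha = 4$:
+
+```{r}
+alpha <- 4
+p0_B <- pnorm(0) # P0(B) for B = (-inf, 0]
+draws <- rbeta(1e5, alpha * p0_B, alpha * (1 - p0_B))
+
+c(mean = mean(draws), expected_mean = p0_B)
+c(var = var(draws), expected_var = p0_B * (1 - p0_B) / (1 + alpha))
+```
+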
+- get posterior for $P$:
+  - let $y_i \stackrel{iid}{\sim} P$ for $i = 1, \dots, n$
+ - $P \sim \text{DP}(\alpha P_0)$
+ - $P$ denotes the probability measure *and* its corresponding distribution
+ - from \@ref(eq:dirichlet-prior), for any partition $B_1, \dots, B_k$:
+
+$$
+P(B_1), \dots, P(B_k) | y_1, \dots, y_n \sim
+ \text{Dirichlet} \left(
+ \alpha P_0(B_1) + \sum_{i=1}^n 1_{y_i \in B_1}, \dots, \alpha P_0(B_k) + \sum_{i=1}^n 1_{y_i \in B_k}
+ \right)
+$$
+
+- this can be converted to the following:
+
+$$
+P | y_1, \dots, y_n \sim \text{DP} \left( \alpha P_0 + \sum_{i=1}^n \delta_{y_i} \right)
+$$
+
+- finally, the posterior expectation of $P$:
+
+$$
+\text{E}(P(B) | y^n) =
+ \left(\frac{\alpha}{\alpha + n} \right) P_0(B) +
+ \left(\frac{n}{\alpha + n} \right) \sum_{i=1}^n \frac{1}{n} \delta_{y_i}
+(\#eq:postp)
+$$
+
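+A small sketch (mine, not from the book) of equation \@ref(eq:postp) for sets $B = (-\infty, t]$: the posterior mean is a weighted average of the base measure's CDF and the empirical CDF; the base measure, data, and $\alpha$ are made-up choices.
+
+```{r}
+dp_posterior_cdf <- function(t, y, alpha) {
+  # E(P((-inf, t]) | y) with base measure P0 = N(0, 1).
+  n <- length(y)
+  (alpha / (alpha + n)) * pnorm(t) + (n / (alpha + n)) * mean(y <= t)
+}
+
+y_obs <- rnorm(30, mean = 2, sd = 1)
+sapply(c(-1, 0, 1, 2, 3), dp_posterior_cdf, y = y_obs, alpha = 5)
+```
+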
+- DP is a model similar to a random histogram but without dependence on the bins
+- cons of a DP prior:
+ - lack of smoothness
+ - induces negative correlation between $P(B_1)$ and $P(B_2)$ for any two disjoint bins with no account for the distance between them
+ - realizations from the DP are discrete distributions
+  - with $P \sim \text{DP}(\alpha P_0)$, $P$ is atomic and has nonzero weights only on a set of atoms, not a continuous density on the real line
+
+#### Stick-breaking construction
+
+- more intuitive understanding of DP
+- induce $P \sim \text{DP}(\alpha P_0)$ by letting:
+
+$$
+\begin{aligned}
+P(\cdot) &= \sum_{h=1}^{\infty} \pi_h \delta_{\theta_h}(\cdot) \\
+\pi_h &= V_h \prod_{l<h} (1 - V_l) \\
+V_h &\sim \text{Beta}(1, \alpha) \\
+\theta_h &\sim P_0
+\end{aligned}
+$$
+
+- where:
+  - $P_0$: base distribution
+  - $\delta_\theta$: degenerate distribution with all mass at $\theta$
+  - $(\theta_h)_{h=1}^{\infty}$: the atoms generated independently from $P_0$
+    - the atoms are generated by the stick-breaking process
+    - this ensures the weights sum to 1
+  - $\pi_h$: probability mass at atom $\theta_h$
+- the stick-breaking process:
+  - start with a stick of length 1
+    - represents the total probability allocated to all the atoms
+  - break off a random piece of length $V_1$ determined by a draw from $\text{Beta}(1, \alpha)$
+  - set $\pi_1 = V_1$ as the probability weight of the randomly generated first atom $\theta_1 \sim P_0$
+  - break off another piece of the remaining stick (now of length $1-V_1$): $V_2 \sim \text{Beta}(1, \alpha)$
+  - set $\pi_2 = V_2(1-V_1)$ as the probability weight of the next atom $\theta_2 \sim P_0$
+  - repeat until the stick is fully used
+- implications:
+  - during the process, the stick gets shorter, so the lengths allocated to later-indexed atoms decrease stochastically
+  - the rate of decrease of the stick length depends on $\alpha$
+    - $\alpha$ near 0 leads to high weights early on
+- below are realizations of the stick-breaking process
+  - set $P_0$ as a standard normal distribution and vary $\alpha$
+
+```{r}
+stick_breaking_process <- function(alpha, n = 1000) {
+  theta <- rnorm(n, 0, 1) # P0 = standard normal
+  vs <- rbeta(n, 1, alpha)
+  pi <- rep(0, n)
+  pi[1] <- vs[1]
+  stick <- 1.0 - vs[1]
+  for (h in 2:n) {
+    pi[h] <- vs[h] * stick
+    stick <- stick - pi[h]
+  }
+  return(list(theta = theta, pi = pi))
+}
+
+dp_realization <- stick_breaking_process(10)
+sum(dp_realization$pi)
+```
+
+```{r}
+set.seed(549)
+
+tibble(alpha = c(0.5, 1, 5, 10)) %>%
+ mutate(
+ dp = purrr::map(alpha, stick_breaking_process, n = 1000),
+ theta = purrr::map(dp, ~ .x$theta),
+ pi = purrr::map(dp, ~ .x$pi)
+ ) %>%
+ select(-dp) %>%
+ unnest(c(theta, pi)) %>%
+ ggplot(aes(x = theta, y = pi)) +
+ facet_wrap(vars(alpha), nrow = 2, scales = "fixed") +
+ geom_col(width = 0.05) +
+ scale_x_continuous(expand = expansion(c(0.02, 0.02))) +
+ scale_y_continuous(expand = expansion(c(0, 0.02))) +
+ labs(x = "\u03B8", y = "\u03C0")
+```
+
+### 23.3 Dirichlet process mixtures
+
+#### Specification and Polya urns
+
+- **"the DP is more appropriately used as a prior for an unknown mixture of distributions"** (pg 549)
+- in the case of density estimation, a general kernel mixture model can be specified as \@ref(eq:kernel-density)
+ - $\mathcal{K}(\cdot | \theta)$: a kernel
+ - $\theta$: location and possibly scale parameters
+ - $P$: "mixing measure"
+
+$$
+f(y|P) = \int \mathcal{K}(y|\theta) d P(\theta)
+(\#eq:kernel-density)
+$$
+
+
+- treating $P$ as discrete results in a finite mixture model
+- setting a prior on $P$ creates an infinite mixture model
+ - prior: $P \sim \pi_\mathcal{P}$
+ - $\mathcal{P}$: space of all probability measures on $(\Omega, \mathcal{B})$
+ - $\pi_\mathcal{P}$: the prior over the space defined by $\mathcal{P}$
+- if set $\pi_\mathcal{P}$ as a DP prior, results in a DP mixture model
+ - from \@ref(eq:postp) and \@ref(eq:kernel-density), a DP prior on $P$ results in \@ref(eq:dp-kernel)
+ - where:
+    - $\pi \sim \text{stick}(\alpha)$ denotes that the probability weights are from the stick-breaking process with parameter $\alpha$
+ - $\theta_h \sim P_0$ independently for each $h=1, \dots, \infty$
+
+$$
+f(y) = \sum_{h=1}^\infty \pi_h \mathcal{K}(y | \theta_h^*)
+(\#eq:dp-kernel)
+$$
+
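+As a quick illustration (not from the book), one random density from a DP mixture of normal kernels can be drawn by reusing the `stick_breaking_process()` function defined in the stick-breaking construction section above; the kernel standard deviation and $\alpha$ are arbitrary.
+
+```{r}
+set.seed(4)
+dp_draw <- stick_breaking_process(alpha = 2, n = 500)
+
+dpm_density <- function(y, dp, kernel_sd = 0.3) {
+  # f(y) = sum_h pi_h * Normal(y | theta_h, kernel_sd), truncated at n atoms.
+  sapply(y, function(yi) sum(dp$pi * dnorm(yi, mean = dp$theta, sd = kernel_sd)))
+}
+
+y_grid <- seq(-4, 4, length.out = 200)
+plot(y_grid, dpm_density(y_grid, dp_draw), type = "l", xlab = "y", ylab = "f(y)")
+```
+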
+- equation \@ref(eq:dp-kernel) resembles a finite mixture model except the number of mixture components is set to infinity
+  - does not mean there will be an infinite number of components
+  - instead, the model is flexible enough to add more mixture components if necessary
+- consider the following specification
+ - issue of how to conduct posterior computation with a DP mixture (DPM) because $P$ has infinitely many parameters
+
+$$
+y_i \sim \mathcal{K}(\theta_i) \text{,} \quad
+\theta_i \sim P \text{,} \quad
+P \sim \text{DP}(\alpha P_0)
+$$
+
+- can marginalize out $P$ to get an induced prior on the subject-specific parameters $\theta^n = (\theta_1, \dots, \theta_n)$
+- results in the *Polya urn* predictive rule:
+
+$$
+p(\theta_i | \theta_1, \dots, \theta_{i-1}) \sim \left( \frac{\alpha}{\alpha + i -1} \right) P_0(\theta_i) +
+\sum_{j=1}^{i-1} \left( \frac{1}{\alpha + i - 1} \right) \delta_{\theta_j}
+(\#eq:polya-urn)
+$$
+
+- *Chinese restaurant process* as a metaphor for the Polya urn predictive rule \@ref(eq:polya-urn):
+ - consider a restaurant with infinitely many tables
+  - the first customer sits at a table with dish $\theta_1^*$
+ - the second customer sits at the first table with probability $\frac{1}{1+\alpha}$ or a new table with probability $\frac{\alpha}{1+\alpha}$
+ - repeat this process for the $i$th customer:
+    - sit at occupied table $j$ with probability $\frac{c_j}{i - 1 + \alpha}$ where $c_j$ is the number of customers already at that table
+    - sit at a new table with probability $\frac{\alpha}{i - 1 + \alpha}$ (simulated in the short sketch after this list)
+ - interpretation:
+ - each table represents a cluster of subjects
+ - the number of clusters depends on the number of subjects
+ - makes sense to have the possibility of more clusters with more subjects (instead of a fixed number of clusters as in a finite mixture model)
+- there is a description of the sampling process for the posterior of $\theta_i | \theta_{-i}$
+
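+Below is a short simulation sketch (my own, not from the book) of the Chinese restaurant process; `simulate_crp()` is a made-up helper, and the values of $\alpha$ and $n$ are arbitrary.
+
+```{r}
+simulate_crp <- function(n, alpha) {
+  tables <- c(1) # the first customer sits at table 1
+  for (i in 2:n) {
+    counts <- tabulate(tables) # c_j: customers already at each occupied table
+    probs <- c(counts, alpha) / (i - 1 + alpha)
+    # sample an existing table (1..K) or a new one (K + 1).
+    tables <- c(tables, sample(length(probs), size = 1, prob = probs))
+  }
+  tables
+}
+
+set.seed(23)
+table(simulate_crp(n = 100, alpha = 1)) # cluster sizes for 100 customers
+```
+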
+#### Hyperprior distribution
+
+- $\alpha$: the DP precision parameter
+ - plays a role in controlling the prior on the number of clusters
+ - small, fixed $\alpha$ favors allocation to few clusters (relative to sample size)
+ - if $\alpha=1$, the prior indicates that two randomly selected subjects have a 50/50 chance of belonging to the same cluster
+ - alternatively can set a hyperprior on $\alpha$ and let the data inform the value
+ - common to use a Gamma distribution: $\Gamma(a_\alpha, b_\alpha)$
+ - authors indicate that setting a hyperprior on $\alpha$ tends to work well in practice
+- $P_0$: base probability for the DP
+ - can think of $P_0$ as setting the prior for the cluster locations
+ - can put hyperpriors on $P_0$ parameters
+
+### 23.4 Beyond density estimation
+
+#### Nonparametric residual distributions
+
+- "The real attraction of Dirichlet process mixture (DPM) models is that they can be used much more broadly for relaxing parametric assumptions in hierarchical models." (pg 557)
+- consider the linear regression with a nonparametric error distribution \@ref(eq:linreg-nonparam-error)
+ - $X_i = (X_{i1}, \dots, X_{ip})$: vector of predictors
+ - $\epsilon_i$: error term with distribution $f$
+
+$$
+y_i = X_i \beta + \epsilon_i \text{,} \quad \epsilon_i \sim f
+(\#eq:linreg-nonparam-error)
+$$
+
+- can relax the assumption that $f$ has parametric form
+
+$$
+\epsilon_i \sim N(0, \phi_i^{-1}) \text{,} \quad \phi_i \sim P \text{,} \quad P \sim \text{DP}(\alpha P_0)
+$$
+
+#### Nonparametric models for parameters that vary by group
+
+- consider hierarchical linear models with varying coefficients
+  - can account for uncertainty about the distribution of coefficients by placing DP or DPM priors on them
+ - where:
+ - $y_i = (y_{i1}, \dots, y_{in_i})$: repeated measurements for item $i$
+ - $\mu_i$: subject-specific mean
+ - $\epsilon_{ij}$: observation specific residual
+
+$$
+y_{ij} = \mu_i + \epsilon_{ij} \text{,} \quad \mu_i \sim f \text{,} \quad \epsilon_{ij} \sim g
+(\#eq:hier-linreg-dp)
+$$
+
+- typically for \@ref(eq:hier-linreg-dp):
+ - $f \equiv N(\mu, \phi^{-1})$
+ - $g \equiv N(0, \sigma^2)$
+- can allow more flexibility in characterizing variability among subjects:
+ - $\mu_i \sim P \text{,} \quad P \sim \text{DP}(\alpha P_0)$
+ - the DP prior induces a *latent class* model
+ - the subjects are grouped into an unknown number of clusters \@ref(eq:latent-class-model)
+ - $S_i \in \{1, \dots, \infty\}$: latent class index
+ - $\pi_h$: probability of allocation to latent class $h$
+
+$$
+\mu_i = \mu^*_{S_i} \text{,} \quad \Pr(S_i = h) = \pi_h \text{,} \quad h = 1, 2, \dots
+(\#eq:latent-class-model)
+$$
+
+> There is more to this chapter, but it is currently beyond my understanding.
+> I hope to return to it later with a better understanding of Dirichlet processes.
+
+---
+
+```{r}
+sessionInfo()
+```