---
title: "CHRIS MS-based plasma proteome: general overview and sex, age, BMI and medication associated proteins"
author: "Nikola Dordevic, Clemens Dierks, Essi Hantikainen, Johannes Rainer"
graphics: yes
output:
  BiocStyle::html_document:
    toc_float: true
    code_folding: hide
bibliography: references.bib
---

**Modified**: `r file.info("sex_age_bmi_associated_proteins.Rmd")$mtime`<br />
**Compiled**: `r date()`

```{r biocstyle, echo = FALSE, results = "hide", message = FALSE}
#' rmarkdown format settings
library(BiocStyle)
library(knitr)
BiocStyle::markdown()
```

```{r settings, echo = FALSE}
#' General settings
filename <- "sex_age_bmi_associated_proteins"
#' Path to save images to
IMAGE_PATH <- paste0("images/", filename, "/")
if (file.exists(IMAGE_PATH))
    unlink(IMAGE_PATH, recursive = TRUE)
dir.create(IMAGE_PATH, recursive = TRUE)
#' Path to store RData files
RDATA_PATH <- paste0("data/RData/", filename, "/")
dir.create(RDATA_PATH, recursive = TRUE, showWarnings = FALSE)
#' Whether interactive plots should be used. Should be set to FALSE if the
#' report is to be shared with external researchers.
PLOTLY <- FALSE
ggplotly_or_not <- function(x, interactive = TRUE) {
    if (interactive) ggplotly(x)
    else print(x)
}
opts_chunk$set(message = FALSE, warning = FALSE, fig.width = 8,
               fig.height = 7, fig.path = IMAGE_PATH)
chris_path <- "."
```

# Introduction

This document provides a general description of the ScanningSWATH mass
spectrometry (MS)-based plasma proteome data set of the CHRIS population study
[@Pattaro:2015fu] and describes the analysis to identify plasma proteins that
are associated with age, sex and body mass index (BMI). The analysis follows
the one used to identify age-, sex- and BMI-associated metabolites
[@verri_hernandes_age_2022] but, in addition, evaluates the influence of
(frequent) medication on the quantified plasma proteome.

This document and the repository do not provide any data. Access to any
individual-level data, including the mass spectrometry data, needs to be
requested through the CHRIS Access Committee.

# Data import

Below we load the data and all required libraries. The data is loaded from the
CHRIS *TDFF* modules (i.e. the data file format in which CHRIS study data is
internally stored). Most of the packages used are available either on
[CRAN](https://cran.r-project.org/) or [Bioconductor](https://bioconductor.org)
and can be installed using the `BiocManager::install` function. Exceptions are
the following packages, which can be installed from GitHub:

- [*tidyfr*](https://github.com/EuracBiomedicalResearch/tidyfr): can be
  installed with `BiocManager::install("EuracBiomedicalResearch/tidyfr")`.
- [*atc*](https://github.com/jorainer/atc): can be installed with
  `BiocManager::install("jorainer/atc")`.
- [*CompMetaboTools*](https://github.com/EuracBiomedicalResearch/CompMetaboTools):
  can be installed with
  `BiocManager::install("EuracBiomedicalResearch/CompMetaboTools")`.

The data packages (*chrisData* and *chrisUtils*) and the ScanningSWATH TDFF
data module *chris_scanningswath_proteins* are not publicly available. Access to
these data needs to be requested through the *CHRIS Access Committee*.

```{r data-libraries}
library(chrisUtils)
library(chrisData)
library(RColorBrewer)
library(pander)
library(pheatmap)
library(DT)
library(ggfortify)
library(ggplot2)
library(ggpubr)
library(reshape2)
library(plotly)
options(rgl.useNULL = TRUE)
library(rgl)
library(readxl)
library(writexl)
library(tidyfr)
library(MatchIt)
library(pROC)

#' Load the CHRIS plasma proteomics data module
mdl <- data_module("chris_scanningswath_proteins", "1.0.0.1", chris_path)
prot_data <- data(mdl)
prot_ann <- labels(mdl)
stopifnot(all(colnames(prot_data) == rownames(prot_ann)))
colnames(prot_data) <- prot_ann$description
rownames(prot_ann) <- prot_ann$description
#' define the columns with protein concentrations
proteins <- rownames(prot_ann)[prot_ann$Genes != ""]

```

Before proceeding with the data analysis we remove the POOL samples. In addition
we restrict to samples from the CHRIS study (i.e., remove the samples from the
NAFLD sub-study).

```{r remove-POOLs-NAFLDs, message = FALSE, warning = FALSE}
prot_data <- prot_data[grep("^001", rownames(prot_data)), ]
```

Next we define the variables that will be included in the analysis. The
related trait information is extracted from the `chrisData` R package.
First we add the BMI data, both as a numeric value and as BMI categories. For
the BMI categories we use category *2* (18.5 - <25) as the baseline, since it
is considered the *normal* range.

```{r define-variables-bmi, message = FALSE, warning = FALSE}
clin <- chrisData("clinical")
rownames(clin) <- clin$AID
#' Add this information to the data
prot_data$BMI <- clin[rownames(prot_data), "x0an03q"]
prot_data$BMIcat <- factor(as.integer(clin[rownames(prot_data), "x0an03b"]),
                           levels = c("2", "1", "3", "4"))
```

We also add the age and sex information for each study participant.

```{r define-variables-general-info, message = FALSE}
#' Add sex and age from the general information in chrisData
gen_info <- chrisData("general_information", release = "baseline")
rownames(gen_info) <- gen_info$AID
prot_data$Sex <- gen_info[rownames(prot_data), "x0_sex"]
prot_data$Age <- gen_info[rownames(prot_data), "x0_age"]
```

Import information for pregnancy:

```{r define-variables-pregnant, message = FALSE}
interview_info <- chrisData("interview", release = "baseline")
rownames(interview_info) <- interview_info$AID
#' Add pregnancy information
prot_data$Pregnant <- interview_info[rownames(prot_data), "x0wo05"]
```

A categorical variable for fasting status is defined based on self-reported
data from the participants' interview.

```{r define-variables-fasting, message = FALSE}
#' add number of people reporting fasting status (`x0bc12` from `chrisData`
#' `"labdata"`)
labdata_info <- chrisData("labdata", release = "baseline")
rownames(labdata_info) <- labdata_info$AID
prot_data$Fasting <- labdata_info[rownames(prot_data), "x0bc12"]
```

Finally, we also load the medication data, which is used later to test for
potential confounding.

```{r}
#' Load medication data
medi <- chrisData("drugs_1")
library(atc)
```

We create an additional variable for age. With age in years, coefficients and
effect sizes correspond to a difference of one year. After dividing age by 10,
coefficients and effect sizes correspond to a difference of 10 years, which
makes it easier to define a meaningful cut-off.

```{r define-age10-variable, message = FALSE}
prot_data$Age_10 <- prot_data$Age/10
```
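
As a minimal sketch of the effect of this rescaling (on simulated data; the
variables `age`, `age_10` and `y` below are made up for illustration and are not
part of the CHRIS data), dividing a predictor by 10 multiplies its regression
coefficient by 10 while leaving the p-value unchanged:

```r
## Simulated example: response depends linearly on age (in years)
set.seed(1)
age <- runif(200, 18, 80)
y <- 0.02 * age + rnorm(200, sd = 0.5)

## Model with age in years: coefficient is the change per year
fit_year <- lm(y ~ age)

## Model with age in decades: coefficient is the change per 10 years
age_10 <- age / 10
fit_decade <- lm(y ~ age_10)

## The per-decade coefficient equals 10 times the per-year coefficient
coef(fit_year)["age"] * 10
coef(fit_decade)["age_10"]

## The p-value is not affected by the rescaling
p_year <- summary(fit_year)$coefficients["age", 4]
p_decade <- summary(fit_decade)$coefficients["age_10", 4]
```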


# General data overview

The resulting data set contains `r nrow(prot_data)` samples in total. A
general overview of the traits is given below:

```{r, message=FALSE}
Overview <- summary(
    prot_data[, c("Sex", "Fasting", "Age", "Age_10", "BMIcat",
                  "Pregnant")])
print(Overview)
```

We further remove all individuals with missing values for any of the required
variables (age, sex, fasting and BMI) as well as pregnant women.

```{r}
nas <- is.na(prot_data[, c("Sex", "Fasting", "Age", "BMIcat")])
prot_data <- prot_data[rowSums(nas) == 0, ]
#' Remove pregnant women. Subset only if any were found, because a negative
#' index of length 0 would drop all rows.
rem <- which(prot_data$Pregnant %in% c("Yes", "Don't know, possible"))
if (length(rem))
    prot_data <- prot_data[-rem, ]
```

Summary for the final data set.

```{r}
prot_data[, c("Sex", "Fasting", "Age", "Age_10", "BMIcat",
              "Pregnant")] %>% summary()
```

This reduces the data set to `r nrow(prot_data)` individuals.

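We next evaluate the variability of protein abundances, expressed as the
coefficient of variation (CV), in the present data set (study samples) and
relate it to the `cv_qc_chris` value from the protein annotation.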

```{r}
#' Calculate CV for proteins across study samples
cv_study <- vapply(prot_data[proteins], function(z) {
    z <- 2^z
    (sd(z, na.rm = TRUE) / mean(z, na.rm = TRUE)) * 100
}, numeric(1))
cv_rel <- cv_study / (prot_ann[proteins, "cv_qc_chris"] * 100)
```
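
The calculation above can be sketched on simulated data as follows. The CV is
computed on the linear scale, hence the log2 abundances are back-transformed
first. All names (`log2_abund`, `cv_percent`, `cv_qc_sim`) and values are
assumptions made for this illustration only.

```r
## Simulated log2 abundances of two proteins with low and high variability
set.seed(1)
log2_abund <- data.frame(prot_a = rnorm(100, mean = 10, sd = 0.2),
                         prot_b = rnorm(100, mean = 10, sd = 1))

## CV in percent, calculated after back-transformation to the linear scale
cv_percent <- function(z) {
    z <- 2^z
    (sd(z, na.rm = TRUE) / mean(z, na.rm = TRUE)) * 100
}
cv_sim <- vapply(log2_abund, cv_percent, numeric(1))

## Hypothetical CVs (in percent) of the same proteins in QC samples
cv_qc_sim <- c(prot_a = 10, prot_b = 10)

## Relative CV: values > 1 indicate more variability in study samples than
## in QC samples
cv_rel_sim <- cv_sim / cv_qc_sim
```

A relative CV clearly above 1 suggests that the variability in study samples
exceeds the technical variability seen in QC samples.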

The 30 proteins with the highest absolute CV in study samples are listed in the
table below.

```{r, results = "asis"}
tab <- data.frame(
    Genes = prot_ann[proteins, "Genes"],
    cv_qc_chris = prot_ann[proteins, "cv_qc_chris"] * 100,
    cv_study = cv_study,
    cv_rel = cv_rel)

tab_sub <- tab[order(tab$cv_study, decreasing = TRUE), ][1:30, ]
pandoc.table(tab_sub, caption = "Top 30 proteins with largest CV.",
             style = "rmarkdown", split.table = Inf)
```

In contrast, the proteins with the lowest absolute CV are:

```{r, results = "asis"}
tab_sub <- tab[order(tab$cv_study), ][1:30, ]
pandoc.table(tab_sub, caption = "Top 30 proteins with smallest CV.",
             style = "rmarkdown", split.table = Inf)
```

Proteins with highest relative CV.

```{r, results = "asis"}
tab_sub <- tab[order(tab$cv_rel, decreasing = TRUE), ][1:30, ]
pandoc.table(tab_sub, caption = "Top 30 proteins with largest relative CV.",
             style = "rmarkdown", split.table = Inf)

```

Proteins with the lowest relative CV.

```{r, results = "asis"}
tab_sub <- tab[order(tab$cv_rel, decreasing = FALSE), ][1:30, ]
pandoc.table(tab_sub, caption = "Top 30 proteins with smallest relative CV.",
             style = "rmarkdown", split.table = Inf)

```

We next perform a principal component analysis (PCA) to group participants
based on their plasma proteome. Samples are colored according to their
sex. Prior to the PCA, the data is scaled to a mean of 0 and a standard
deviation of 1 (autoscaling). The plot below shows the scores and loadings of
all subjects in the space of the first two PCs, which explain 15% of the total
variance.

```{r PCA_1_figure, fig.path = IMAGE_PATH, fig.width=12, fig.height=7}
## PCA on the autoscaled protein abundances
scaledData <- scale(prot_data[, proteins], scale = TRUE)
scPCA <- svd(scaledData)
scPCA.scores <- scPCA$u %*% diag(scPCA$d)    ## scores
scPCA.loadings <- scPCA$v                    ## loadings
rownames(scPCA.loadings) <- proteins
scPCA.variances <- round(100*((scPCA$d^2) / sum(scPCA$d^2)), 1)[1:10]
Sex <- prot_data$Sex
gg_1 <- autoplot(prcomp(scaledData), data = prot_data,
                 loadings = TRUE, loadings.colour = 'gray',
                 loadings.label = TRUE, loadings.label.size = 3,
                 loadings.label.colour ="darkblue", colour = NA) +
    geom_point(aes(colour = Sex)) +
    theme_bw()
ggplotly_or_not(gg_1, PLOTLY)
```
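
The relationship between the SVD-based calculation and `prcomp` used above can
be illustrated with a small sketch on simulated data (the matrix `x` is made up
for this example). Scores from both approaches agree up to the sign of each
component, and the percentage of variance explained follows from the squared
singular values.

```r
## Simulated data: 50 samples, 4 variables
set.seed(1)
x <- matrix(rnorm(50 * 4), ncol = 4,
            dimnames = list(NULL, paste0("p", 1:4)))

## Autoscaling: center to mean 0 and scale to standard deviation 1
x_scaled <- scale(x, center = TRUE, scale = TRUE)

## PCA through singular value decomposition
sv <- svd(x_scaled)
scores <- sv$u %*% diag(sv$d)               # scores
loadings <- sv$v                            # loadings
var_expl <- 100 * sv$d^2 / sum(sv$d^2)      # % variance per component

## The same PCA with prcomp; signs of components can differ
pc <- prcomp(x_scaled)
max(abs(abs(scores) - abs(pc$x)))           # numerically zero
```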

```{r FIGURE_1_MANUSCRIPT}
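#' FIGURE 1
#' Transparency for protein names depending on arrow length?
#' Ensure the output folders exist, png() does not create them.
dir.create("images/manuscript/supplement", recursive = TRUE,
           showWarnings = FALSE)
png("images/manuscript/Figure_1.png", width = 15, height = 5, units = "cm",
    res = 600, pointsize = 4, type = "cairo-png")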
pc_all <- prcomp(scaledData)
library(CompMetaboTools)
cols <- rep("#E41A1C", nrow(prot_data))
cols[prot_data$Sex == "Male"] <- "#377EB8"
par(mfrow = c(1, 3), mar = c(4.5, 4.5, 0.5, 0.7), cex.lab = 2, bty = "n",
    cex.axis = 1.5,  las = 1)
plot_pca(pc_all, pch = NA, xlim = c(-16, 9))
mtext(side = 3, outer = FALSE, text = "a", cex = 3, at = -18, line = -4.0)
points(pc_all$x[, 1], pc_all$x[, 2], pch = 21, col = paste0(cols, 80),
       bg = paste0(cols, 60), cex = 1.5)
plot_pca(pc_all, pch = NA, col = NA, xlim = c(-16, 9))
mtext(side = 3, outer = FALSE, text = "b", cex = 3, at = -18, line = -4.0)
unsigned.range <- function(x)
        c(-abs(min(x, na.rm = TRUE)), abs(max(x, na.rm = TRUE)))
y <- pc_all$rotation[, 1:2]
scl <- max(
    unsigned.range(y[, 1]) / unsigned.range(pc_all$x[, 1]),
    unsigned.range(y[, 2]) / unsigned.range(pc_all$x[, 2]))
a_lengths <- sqrt(y[, 1]^2 + y[, 2]^2)
y <- y / scl
arrows(x0 = 0, x1 = y[, 1],
       y0 = 0, y1 = y[, 2],
       lwd = 0.75, angle = 30, length = 0.05,
       col = "#00000050")
text(y[, 1] * 1.1, y[, 2] * 1.1, labels = prot_ann[rownames(y), "Genes"],
     col = "#00000080", cex = 1.4)
## Correlation PC1 and age.
pc_s <- summary(pc_all)
plot(pc_all$x[, 1], prot_data$Age, pch = 21, col = paste0(cols, 80),
     bg = paste0(cols, 60), cex = 1.5, xlim = c(-16, 9),
     xlab = paste0("PC1: ", format(pc_s$importance[2, 1] * 100, digits = 3),
                   " % variance"), ylab = "Age")
mtext(side = 3, outer = FALSE, text = "c", cex = 3, at = -18, line = -4.0)
grid()
dev.off()

## Males only
png("images/manuscript/supplement/Figure_1_male.png", width = 15, height = 5,
    units = "cm", res = 600, pointsize = 4, type = "cairo-png")
idx <- which(prot_data$Sex == "Male")
pc <- prcomp(scale(prot_data[idx, proteins], scale = TRUE))
par(mfrow = c(1, 3), mar = c(4.5, 4.5, 0.5, 0.5), cex.lab = 2, bty = "n",
    cex.axis = 1.5, las = 1)
plot_pca(pc, pch = NA, xlim = c(-16, 12), ylim = c(-11, 12))
mtext(side = 3, outer = FALSE, text = "A", cex = 3, at = -18, line = -4.0)
points(pc$x[, 1], pc$x[, 2], pch = 21, col = paste0(cols[idx], 80),
       bg = paste0(cols[idx], 60), cex = 1.5)
plot_pca(pc, pch = NA, col = NA, xlim = c(-16, 12), ylim = c(-11, 12))
mtext(side = 3, outer = FALSE, text = "B", cex = 3, at = -18, line = -4.0)
unsigned.range <- function(x)
        c(-abs(min(x, na.rm = TRUE)), abs(max(x, na.rm = TRUE)))
y <- pc$rotation[, 1:2]
scl <- max(
    unsigned.range(y[, 1]) / unsigned.range(pc$x[, 1]),
    unsigned.range(y[, 2]) / unsigned.range(pc$x[, 2]))
a_lengths <- sqrt(y[, 1]^2 + y[, 2]^2)
y <- y / scl
arrows(x0 = 0, x1 = y[, 1],
       y0 = 0, y1 = y[, 2],
       lwd = 0.75, angle = 30, length = 0.05,
       col = "#00000050")
text(y[, 1] * 1.1, y[, 2] * 1.1, labels = prot_ann[rownames(y), "Genes"],
     col = "#00000080", cex = 1.4)
## Correlation PC1 and age.
pc_s <- summary(pc)
plot(pc$x[, 1], prot_data$Age[idx], pch = 21, col = paste0(cols[idx], 80),
     bg = paste0(cols[idx], 60), cex = 1.5,
     xlab = paste0("PC1: ", format(pc_s$importance[2, 1] * 100, digits = 3),
                   " % variance"), ylab = "Age", xlim = c(-16, 12))
mtext(side = 3, outer = FALSE, text = "C", cex = 3, at = -18, line = -4.0)
grid()
dev.off()

## Females only
png("images/manuscript/supplement/Figure_1_female.png", width = 15, height = 5,
    units = "cm", res = 600, pointsize = 4, type = "cairo-png")
idx <- which(prot_data$Sex == "Female")
pc <- prcomp(scale(prot_data[idx, proteins], scale = TRUE))
par(mfrow = c(1, 3), mar = c(4.5, 4.5, 0.5, 0.5), cex.lab = 2, bty = "n",
    cex.axis = 1.5, las = 1)
plot_pca(pc, pch = NA, xlim = c(-16, 9))
mtext(side = 3, outer = FALSE, text = "A", cex = 3, at = -18, line = -4.0)
points(pc$x[, 1], pc$x[, 2], pch = 21, col = paste0(cols[idx], 80),
       bg = paste0(cols[idx], 60), cex = 1.5)
plot_pca(pc, pch = NA, col = NA, xlim = c(-16, 9), ylim = c(-11, 12))
mtext(side = 3, outer = FALSE, text = "B", cex = 3, at = -18, line = -4.0)
unsigned.range <- function(x)
        c(-abs(min(x, na.rm = TRUE)), abs(max(x, na.rm = TRUE)))
y <- pc$rotation[, 1:2]
scl <- max(
    unsigned.range(y[, 1]) / unsigned.range(pc$x[, 1]),
    unsigned.range(y[, 2]) / unsigned.range(pc$x[, 2]))
a_lengths <- sqrt(y[, 1]^2 + y[, 2]^2)
y <- y / scl
arrows(x0 = 0, x1 = y[, 1],
       y0 = 0, y1 = y[, 2],
       lwd = 0.75, angle = 30, length = 0.05,
       col = "#00000050")
text(y[, 1] * 1.1, y[, 2] * 1.1, labels = prot_ann[rownames(y), "Genes"],
     col = "#00000080", cex = 1.4)
## Correlation PC1 and age.
pc_s <- summary(pc)
plot(pc$x[, 1], prot_data$Age[idx], pch = 21, col = paste0(cols[idx], 80),
     bg = paste0(cols[idx], 60), cex = 1.5,
     xlab = paste0("PC1: ", format(pc_s$importance[2, 1] * 100, digits = 3),
                   " % variance"), ylab = "Age", xlim = c(-16, 9))
mtext(side = 3, outer = FALSE, text = "C", cex = 3, at = -18, line = -4.0)
grid()
dev.off()

#' And again for the graphical abstract
## png("images/manuscript/Figure_2_GA.png", width = 5, height = 5, units = "cm",
##     res = 600, pointsize = 4)
## par(mfrow = c(1, 1), mar = c(4.5, 4.3, 0.5, 0.5), cex.lab = 1.5, bty = "n")
## plot_pca(pc, pch = NA, col = NA, bg = paste0(cols, 60), cex = 1.5,
##          xlim = c(-16, 9))
## unsigned.range <- function(x)
##         c(-abs(min(x, na.rm = TRUE)), abs(max(x, na.rm = TRUE)))
## y <- pc$rotation[, 1:2]
## scl <- max(
##     unsigned.range(y[, 1]) / unsigned.range(pc$x[, 1]),
##     unsigned.range(y[, 2]) / unsigned.range(pc$x[, 2]))
## y <- y / scl
## points(pc$x[, 1], pc$x[, 2], pch = 21, col = NA, bg = paste0(cols, 60),
##        cex = 1)
## grid()
## dev.off()
```
The scores show a clear effect of sex along PC1. The loadings reveal
relationships between proteins: each arrow corresponds to one variable
(protein) in the data set. The longest arrows in the direction of a particular
principal component have the largest impact on that PC. Arrows pointing in the
same direction indicate positively correlated proteins, arrows pointing in
opposite directions indicate negatively correlated proteins, and arrows at an
angle of around 90° indicate no correlation. Note that with a large number of
variables, some loadings can lie close to each other even without a strong
correlation, because only two PCs are displayed. The plot thus needs to be
interpreted with caution.

There is an interesting separation on PC1, with a subgroup of mostly female
participants, which also seems to be related to age (not shown). We next fit a
multiple linear regression model to identify which of the available variables
are most strongly associated with PC1. We first prepare the medication data and
define a (categorical) variable for each ATC level 3 medication taken on a
regular basis by more than 14 CHRIS participants.

```{r, warning = FALSE}
#' Prepare medication data
med <- chrisData("drugs_1")
med <- med[med$AID %in% rownames(prot_data), ]
med <- med[!is.na(med$x0dd03), ]
med$atc_2 <- toAtcLevel(med$x0dd03, level = 2)
med$atc_3 <- toAtcLevel(med$x0dd03, level = 3)
med$atc_4 <- toAtcLevel(med$x0dd03, level = 4)
med$atc_5 <- toAtcLevel(med$x0dd03, level = 5)

atc_df <- as.data.frame(atc)
med$atc_2_name <- atc_df[match(med$atc_2, atc_df$key), "name"]
med$atc_3_name <- atc_df[match(med$atc_3, atc_df$key), "name"]
med$atc_4_name <- atc_df[match(med$atc_4, atc_df$key), "name"]
med$atc_5_name <- atc_df[match(med$atc_5, atc_df$key), "name"]

#' Restrict to medications taken by more than 14 individuals on a regular
#' basis
tmp <- med[med$x0dd13 %in% c("daily", "every 2. day", "every 3. day",
                             "every 4. day = 2x per week"), ]

level3 <- split(tmp$AID, tmp$atc_3)
level3 <- lapply(level3, unique)
level3 <- level3[lengths(level3) > 14]

#' Reformat into categorical variables
general_data <- prot_data[, c("BMIcat", "Sex", "Age_10", "Fasting")]
tmp <- lapply(names(level3), function(z) {
    cl <- rep("No", nrow(general_data))
    cl[rownames(general_data) %in% level3[[z]]] <- "Yes"
    factor(cl)
})
names(tmp) <- names(level3)
general_data <- cbind(general_data, do.call(cbind.data.frame, tmp))

```
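
The reformatting step above, i.e. converting a list of participant IDs per
medication into one "Yes"/"No" variable per medication, can be sketched as
follows. The participant IDs and the assignment of participants to medication
codes are made up for this illustration.

```r
## Hypothetical participant IDs
ids <- paste0("id", 1:6)

## Hypothetical list of participants taking each medication class
users <- list(G03A = c("id2", "id5"),
              C10A = c("id1", "id2", "id6"))

## One factor per medication: "Yes" if the participant is listed, else "No"
ind <- lapply(users, function(z) {
    factor(ifelse(ids %in% z, "Yes", "No"), levels = c("No", "Yes"))
})
ind_df <- data.frame(ind, row.names = ids)
ind_df
```

Defining the factor levels explicitly keeps "No" as the reference level and
retains both levels even if one of them is not observed.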

Next we fit a multiple linear regression model explaining PC1 by all available
variables, which include age, sex, (categorical) BMI, (binary) self-reported
fasting status and the above medication information.

```{r}
#' association analysis: age, sex, BMI, medication on PC1.
general_data$pc1 <- pc_all$x[, 1L]
lmod <- lm(pc1 ~ ., data = general_data)
```

The significantly associated variables (at a significance level of 0.01) are
listed below.

```{r table-association-pc1, results = "asis"}
lsum <- summary(lmod)
#' drop = FALSE keeps the matrix structure if only a single row is selected
ltab <- lsum$coefficients[lsum$coefficients[, 4L] < 0.01, c(1L, 4L),
                          drop = FALSE]
ltab <- data.frame(variable = sub("Yes", "", rownames(ltab)), ltab,
                   check.names = FALSE)
ltab$name <- atc_df[match(ltab$variable, atc_df$key), c("name")]
ltab <- ltab[rownames(ltab) != "(Intercept)", ]
ltab <- ltab[order(ltab[, 3]), ]
rownames(ltab) <- NULL
pandoc.table(
    ltab, style = "rmarkdown", split.table = Inf,
    caption = "Variables significantly associated with PC1.")
```

The sample sizes for the individual significant variables are:

```{r}
summary(general_data[, c("Sex", "BMIcat", "G03A", "G03H", "G02B", "G03F")])
```

Hormonal contraceptives seem to be the strongest drivers for the observed
separation on PC1, which is also related to age, sex and obesity.

To exclude other potential technical influences, we additionally evaluate
whether there is a relationship between PC1 and the *age* of the samples, which
we estimate from the study participants' participation date.

```{r, fig.width = 5, fig.height = 5, fig.cap = "PC1 against the sample age (in days)."}
#' Calculate the difference (in days) between the participation date and an
#' arbitrary day, in our case 2022-02-14.
part_date <- gen_info[rownames(prot_data), "x0_examd"]
sample_age <- as.integer(difftime("2022-02-14", part_date, units = "days"))
plot(pc_all$x[, 1L], sample_age, xlab = "PC1", ylab = "Sample age [days]")
grid()
cor(pc_all$x[, 1L], sample_age, method = "spearman")
```

There is thus no relationship between the samples' age and PC1.

We next define the categorical variable for *hormonal contraceptives* using
the ATC level 3 medication information from CHRIS. In particular, we define the
trait using the ATC3 categories *HORMONAL CONTRACEPTIVES FOR SYSTEMIC USE* and
*ANTIANDROGENS*. This definition is similar to the one in Ramsey et al.
[@ramsey_variation_2016], which considered women taking oral
contraceptives. Contraceptives for topical use are not included in this
definition because they do not share the route of administration of the
systemic preparations (i.e. taken orally and metabolized in or through the
liver).

```{r, warning = FALSE}
#' Subset to selected sex hormone treatment (WHO definition)
med_sex_horm <- med[med$atc_3_name %in%
                    c("HORMONAL CONTRACEPTIVES FOR SYSTEMIC USE",
                      "ANTIANDROGENS"), ]
prot_data$HCU <- "No"
prot_data$HCU[match(med_sex_horm$AID, rownames(prot_data))] <- "Yes"
prot_data$HCU <- factor(prot_data$HCU,
                        levels = c("No", "Yes"))
summary(prot_data$HCU)
```

```{r PCA_HCU, warning = FALSE, fig.width = 12, fig.height = 7}
gg_2 <- autoplot(prcomp(scaledData), x = 1, y = 2,
                 data = prot_data, loadings = TRUE, loadings.colour = 'gray',
                 loadings.label = TRUE, loadings.label.size = 3,
                 loadings.label.colour = "darkblue", colour = NA) +
    geom_point(aes(colour = HCU), size = 1) +
    theme_bw()
ggplotly_or_not(gg_2, PLOTLY)
```

```{r, echo = FALSE}
#' Supplementary Figure PCA
#' Same plot but with coloring representing HCU.
png("images/manuscript/supplement/Figure_S149.png", width = 8.2, height = 6.4,
    units = "cm", res = 600, pointsize = 4, type = "cairo-png")
par(mar = c(4.5, 4.3, 0.5, 0.5), cex.lab = 1.5, bty = "n", las = 1)
plot_pca(pc_all, pch = NA, col = NA, bg = paste0(cols, 60), cex = 1.5,
         xlim = c(-16, 9))
unsigned.range <- function(x)
        c(-abs(min(x, na.rm = TRUE)), abs(max(x, na.rm = TRUE)))
y <- pc_all$rotation[, 1:2]
scl <- max(
    unsigned.range(y[, 1]) / unsigned.range(pc_all$x[, 1]),
    unsigned.range(y[, 2]) / unsigned.range(pc_all$x[, 2]))
y <- y / scl
arrows(x0 = 0, x1 = y[, 1],
       y0 = 0, y1 = y[, 2],
       lwd = 0.75, angle = 30, length = 0.05,
       col = "#00000050")
text(y[, 1] * 1.1, y[, 2] * 1.1, labels = prot_ann[rownames(y), "Genes"],
     col = "#00000080", cex = 1.4)

cols <- rep("#E41A1C", nrow(prot_data))
cols[prot_data$Sex == "Male"] <- "#377EB8"
cols[prot_data$HCU == "Yes"] <- "#000000"
points(pc_all$x[, 1], pc_all$x[, 2], pch = 21, col = NA, bg = paste0(cols, 60),
       cex = 1.5)
grid()
dev.off()

```

Most of the participants separating from the main group indeed take systemic
contraceptives. Other (female) participants in that group either take other
types of hormonal contraceptives, or did not bring the medication with them to
the study center, so that the medication was not recorded for them (false
negatives).

Below we evaluate the age distribution of participants by sex. Individuals are
colored according to the *HCU* variable.

```{r, BP_age_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6, warning = FALSE}
prot_plot <- function(data, aes, protein) {
    #' theme_bw() is a complete theme and has to be added before theme(),
    #' otherwise it resets the axis settings. The constant alpha is set
    #' outside of aes() so that it is not mapped to a legend.
    ggplot(data, aes) +
        geom_jitter(position = position_jitter(0.24),
                    aes(colour = HCU), alpha = 0.5) +
        geom_violin(trim = FALSE, alpha = 0.5) +
        stat_summary(fun = median, geom = "point", shape = 23, size = 2) +
        labs(title = protein) +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 8),
              axis.title.y = element_blank())
}
age_1 <- prot_plot(prot_data, aes(x = Sex, y = Age),
                   protein = "Sex Vs. Age")
ggplotly_or_not(age_1, PLOTLY)
```

The age distribution by HCU is shown below:

```{r, BP_age_2_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6, warning = FALSE}
age_2 <- prot_plot(prot_data, aes(x = HCU, y = Age),
                   protein = "HCU Vs. Age")
ggplotly_or_not(age_2, PLOTLY)
```

Clearly there is an enrichment of young women among participants using hormonal
contraceptives. We next extract the information on the individual hormonal
contraceptive preparations to then (later) classify and categorize them based on
the type of hormones they contain.

```{r, results = "asis"}
contr <- med[med$atc_3_name %in% c("HORMONAL CONTRACEPTIVES FOR SYSTEMIC USE",
                                   "CONTRACEPTIVES FOR TOPICAL USE",
                                   "ANTIANDROGENS"), ]
tmp <- split(contr, f = contr$atc_5)
contr_atc5 <- lapply(tmp, function(z) {
    a <- unique(z[, c("atc_3", "atc_4", "atc_5", "atc_4_name", "atc_5_name")])
    a$count <- nrow(z)
    a$preparation <- paste0(unique(z$x0dd07), collapse = ";")
    a
})
contr_atc5 <- do.call(rbind, contr_atc5)

#' to manually go through preparations to classify.
write_xlsx(contr_atc5, path = "contraceptives.xlsx")

pandoc.table(
    contr_atc5[, c("atc_4_name", "atc_5_name", "count")], style = "rmarkdown",
    caption = "Types of hormonal contraceptives taken by participants.")
```

The table above was exported and manually evaluated/classified.

Additionally, we evaluate the predictive ability of all plasma proteins for
hormonal contraceptive use. To ensure comparability, we match the users of
hormonal contraceptives to a control group of identical size (n = 275). Only
non-pregnant women below the age of 40 who do not take hormonal contraceptives
are considered for the control group.

```{r}
#' Prepare Data for matching
#' Filter for  HCU users and controls
filter_hcu <- prot_data$Age < 40 & prot_data$HCU == "Yes" &
    prot_data$Pregnant == "No"
filter_control <- prot_data$Age < 40 & prot_data$HCU == "No" &
    prot_data$Pregnant == "No"

women_hcu <- na.omit(prot_data[filter_hcu, ])
women_control <- na.omit(prot_data[filter_control, ]) # n = 454 possible control candidates

#' combine HC users and control group within dataframe
women_hcu_control <- rbind(women_hcu, women_control)

#' drop unnecessary columns
drop_col <- c("File.Name", "Date", "Inst", "Inj", "Plate", "Sample.Name",
              "Sex", "container", "MS_Batch", "SP_Batch", "BMIcat",
              "Pregnant", "Age_10")
women_hcu_control <- women_hcu_control[, !(colnames(women_hcu_control) %in%
                                           drop_col)]

#' Adjust covariates for matching
women_hcu_control$Fasting <- as.factor(women_hcu_control$Fasting)
women_hcu_control$HCU <- as.factor(women_hcu_control$HCU)
```

We use *optimal* matching on propensity scores, which are estimated with a
generalized linear model (GLM) including age, BMI and fasting status as
covariates, as implemented in the *MatchIt* R package.

```{r}
#' Matching
match.out <- matchit(
    HCU ~ Age + Fasting + BMI,
    data = women_hcu_control[,c("HCU", "Fasting", "BMI", "Age")],
    method = "optimal", distance = "glm")

#' Select matched participants
hcu_control_matched <- rbind(women_control[match.out$match.matrix, ],
                             women_hcu)

summary(match.out)
```
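
To illustrate the general idea of propensity score matching, the sketch below
estimates propensity scores with a logistic regression and then performs a
simple greedy 1:1 nearest-neighbour matching in base R. Note that this is a
simplified stand-in and not the *optimal* matching of *MatchIt* used above: the
greedy approach matches each treated individual in turn and does not minimize
the total distance across all pairs. All data and variable names are simulated
assumptions.

```r
## Simulated data: 20 treated and 60 potential control individuals
set.seed(1)
n_t <- 20
n_c <- 60
d <- data.frame(
    treated = rep(c(1, 0), c(n_t, n_c)),
    age = c(rnorm(n_t, 28, 5), rnorm(n_c, 32, 5)),
    bmi = c(rnorm(n_t, 22, 3), rnorm(n_c, 23, 3)))

## Propensity score: probability of being treated given the covariates
ps <- fitted(glm(treated ~ age + bmi, data = d, family = binomial()))

## Greedy 1:1 matching without replacement: for each treated individual pick
## the not yet used control with the closest propensity score
treated_idx <- which(d$treated == 1)
pool <- which(d$treated == 0)
matched <- integer(0)
for (i in treated_idx) {
    j <- pool[which.min(abs(ps[pool] - ps[i]))]
    matched <- c(matched, j)
    pool <- setdiff(pool, j)
}

## Matched data set: treated individuals and their selected controls
d_matched <- d[c(treated_idx, matched), ]
```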

We further visualize propensity scores and covariate distributions to assess
the matching quality. Controls (HCU = No) are shown in gray, the treatment
group (HCU = Yes) in black. The propensity scores and the covariate
distributions indicate that the matching is sufficient for the subsequent
analysis.

```{r}
plot(match.out, type = "jitter", interactive = FALSE)

plot(match.out, type = "density", interactive = FALSE,
     which.xs = ~Age + Fasting + BMI)

plot(summary(match.out))
```

The predictive performance of each protein for HCU is evaluated by calculating
the area under the receiver operating characteristic curve (AUROC).

```{r}
#' Select only plasma protein and HCU status
hcu_control_matched <- hcu_control_matched[, c(proteins, "HCU")]
labels <- hcu_control_matched$HCU

#' calculate ROC
calculate_roc <- function(col_name) {
  if (!(col_name == "HCU")) {
    protein_values <- hcu_control_matched[[col_name]]
    roc_data <- roc(labels, protein_values)$auc
    return(roc_data)
  }
}

roc_list <- lapply(colnames(hcu_control_matched), calculate_roc)

#' Convert List to dataframe
roc_df <- do.call(rbind, lapply(roc_list, as.data.frame))

#' add Genes and UniProtIDs to dataframe
roc_df$Genes <- prot_ann[proteins, "Genes"]
roc_df$UniProtId <- prot_ann[proteins, "description"]

#' Save the ROC data to a CSV file
#write.csv(roc_df, "rocValsHCU.csv", row.names = FALSE)
```
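
For intuition, the AUROC is equivalent to the probability that a randomly
chosen case has a higher value than a randomly chosen control (with ties
counted as 0.5). The sketch below calculates it from ranks in base R on
simulated data. It assumes that higher values indicate cases, whereas `roc`
from *pROC* determines the direction automatically by default. The function
`auc_rank` and all data are made up for this illustration.

```r
## Simulated abundances: cases ("Yes") have on average higher values
set.seed(1)
status <- rep(c("No", "Yes"), each = 50)
abundance <- c(rnorm(50, mean = 0), rnorm(50, mean = 1.5))

## AUROC from ranks (Mann-Whitney U statistic divided by n1 * n0)
auc_rank <- function(x, positive) {
    r <- rank(x)
    n1 <- sum(positive)
    n0 <- sum(!positive)
    (sum(r[positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_sim <- auc_rank(abundance, status == "Yes")

## Same value from all pairwise comparisons of cases and controls
cases <- abundance[status == "Yes"]
controls <- abundance[status == "No"]
auc_pairs <- mean(outer(cases, controls, ">") +
                  0.5 * outer(cases, controls, "=="))
```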

Angiotensinogen (AGT) shows high predictive power for hormonal contraceptive
usage (AUROC = 0.89).

```{r,}
#' Calculate ROC separately for AGT
roc_data_agt <- roc(hcu_control_matched$HCU, hcu_control_matched$P01019)

roc_df <- data.frame(
  OneMinusSpecificity = 1 - roc_data_agt$specificities,
  Sensitivity = roc_data_agt$sensitivities
)

```

```{r}
par(mfrow = c(1, 2), mar = c(4, 4.5, 5, 1), cex.axis = 1.5, cex.lab = 1.5,
    bty = "n", cex.main = 2.5)
#' Supplementary Figure P01019 AGT concentration in different groups
f <- as.character(prot_data$Sex)
f[prot_data$HCU == "Yes"] <- "HCU"
f <- factor(f, levels = c("HCU", "Female", "Male"))
library(vioplot)
vioplot(split(prot_data$P01019, f), ylab = expression(log[2]~abundance),
        main = "")
grid(nx = NA, ny = NULL)
mtext(side = 3, outer = FALSE, text = "P01019 (AGT)", line = 0,
      cex = par("cex.main"))
mtext(side = 3, outer = FALSE, text = "A", cex = 5, at = 0.1, line = 0.5)
#' ROC
plot(roc_df$OneMinusSpecificity, roc_df$Sensitivity, xlab = "1 - Specificity",
     ylab = "Sensitivity", type = "l", lwd = 2)
abline(0, 1, lty = 3, col = "grey")
grid()
legend("bottomright", legend = "AUC AGT = 0.89", cex = 2, bty = "n")
mtext(side = 3, outer = FALSE, text = "B", cex = 5, at = -0.14, line = 0.5)

```

```{r}
#' PAPER, Figure 5B
#' AGT abundance in HCU, women and men.
png("images/manuscript/_Figure_4B.png", width = 5,
    height = 6, units = "cm", res = 600, pointsize = 4, type = "cairo-png")
par(mfrow = c(1, 1), mar = c(5, 6, 5, 1) + 0.1, cex.axis = 2, cex.lab = 2,
    bty = "n", cex.main = 2.5, las = 1)
#' Supplementary Figure P01019 AGT concentration in different groups
f <- as.character(prot_data$Sex)
f[f == "Female"] <- "Women"
f[f == "Male"] <- "Men"
f[which(prot_data$HCU == "Yes")] <- "HCU"
f <- factor(f, levels = c("HCU", "Women", "Men"))
library(vioplot)
vioplot(split(prot_data$P01019, f), ylab = "",
        main = "", yaxt = "n", xaxt = "n")
mtext(text = expression(log[2]~abundance), side = 2, line = 3, cex = 2, las = 0)
grid(nx = NA, ny = NULL)
xat <- c(1,2,3)
labs <- c("HCU", "Women", "Men")
axis(1, at = xat, labels = labs, padj = 0.4)
yat <- axTicks(2, usr = par("usr")[1:2])
labs <- gsub("-", "\U2212", as.character(yat))
axis(2, at = yat, labels = labs)
mtext(side = 3, outer = FALSE, text = "P01019 (AGT)", line = 0,
      cex = par("cex.main"))
## mtext(side = 3, outer = FALSE, text = "A", cex = 5, at = 0.1, line = 0.5)
dev.off()

#' PAPER, Figure 5C
#' ROC for AGT
png("images/manuscript/_Figure_4C.png", width = 5,
    height = 6, units = "cm", res = 600, pointsize = 4, type = "cairo-png")
par(mfrow = c(1, 1), mar = c(5, 6, 5, 1) + 0.1, cex.axis = 2, cex.lab = 2,
    bty = "n", cex.main = 2.5, las = 1)
#' ROC
plot(roc_df$OneMinusSpecificity, roc_df$Sensitivity, xlab = "1 - Specificity",
     ylab = "", type = "l", lwd = 2)
mtext(text = "Sensitivity", side = 2, line = 4, cex = 2, las = 0)
abline(0, 1, lty = 3, col = "grey")
grid()
legend("bottomright", legend = "AUC AGT = 0.89", cex = 2, bty = "n")
## mtext(side = 3, outer = FALSE, text = "B", cex = 5, at = -0.14, line = 0.5)
dev.off()

```


# Sex, Age, BMI and oral hormonal contraceptive-associated proteins

To identify proteins related to sex, age, fasting status and BMI, a linear
regression model is fitted separately for each protein, with the protein
abundance as the response and sex, age, fasting status and (categorical) BMI as
covariates (same as in [@verri_hernandes_age_2022]). In addition, and based on
the observations from the previous section, a categorical variable representing
the use of hormonal contraceptives is included to adjust for the influence of
this medication. No interactions between variables are considered in this
model, although a relationship between BMI and sex seems to be present in the
data. A sensitivity analysis evaluating the impact of an additional interaction
between BMI and sex was performed but did not show large differences (data not
shown).
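
For a single protein, such a sensitivity analysis could be set up as a nested model comparison. The sketch below uses simulated data and hypothetical names (`toy`, `abundance`); it illustrates the general approach and is not the analysis that was performed for this report.

```r
set.seed(123)
n <- 400
toy <- data.frame(
    Sex = factor(sample(c("Male", "Female"), n, replace = TRUE),
                 levels = c("Male", "Female")),
    BMIcat = factor(sample(c("1", "2", "3", "4"), n, replace = TRUE),
                    levels = c("2", "1", "3", "4")))
#' Simulated log2 abundance with main effects only.
toy$abundance <- 0.3 * (toy$Sex == "Female") +
    0.2 * (toy$BMIcat == "4") + rnorm(n, sd = 0.5)

fit_main <- lm(abundance ~ Sex + BMIcat, data = toy)
fit_int <- lm(abundance ~ Sex * BMIcat, data = toy)
#' F-test for the three additional Sex:BMIcat interaction terms.
cmp <- anova(fit_main, fit_int)
p_interaction <- cmp[2, "Pr(>F)"]
#' Change of the sex coefficient when interactions are added. Note that in
#' the interaction model it refers to the reference BMI category only.
delta_sex <- coef(fit_int)["SexFemale"] - coef(fit_main)["SexFemale"]
```

A large `p_interaction` together with a small `delta_sex` would correspond to the "no large differences" situation described above.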

Raw p-values for each coefficient are adjusted for multiple hypothesis testing
using the Bonferroni method.

```{r, message=FALSE}
#' Linear model LM1
FunForLinReg <- function(AnalyteName, Data) {
    Analyte <- Data[, AnalyteName]
    LMReg <- lm(Analyte ~ Sex + Age_10 + Fasting + BMIcat + HCU,
                data = Data)
    res <- coef(summary(LMReg))[-1, ]
    p.Values <- data.frame(c(res[, 1],
                             res[, 4]))
    names(p.Values) <- AnalyteName
    rownames(p.Values) <- c(paste0("coef_", rownames(res)),
                            paste0("p-value_", rownames(res)))
    return(p.Values)
}
LMRegtests <- do.call(
    cbind, lapply(proteins, function(x) FunForLinReg(x, prot_data)))
Test_output <- t(LMRegtests)
#' Adjusting for multiple hypothesis testing
adjp <- apply(Test_output[, grep("p-value", colnames(Test_output))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
Test_output <- cbind(Test_output, adjp)
```
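
In the chunk above the adjustment is applied per coefficient, across all proteins. As a small illustration of what `p.adjust()` does with `method = "bonferroni"`, the following sketch uses three made-up p-values (the names `P1` to `P3` are placeholders, not real proteins).

```r
#' Three hypothetical raw p-values for one coefficient across three proteins.
p_raw <- c(P1 = 0.0001, P2 = 0.02, P3 = 0.6)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
#' Bonferroni multiplies by the number of tests and caps the result at 1.
p_manual <- pmin(p_raw * length(p_raw), 1)
```

Both vectors contain the same values, so with several hundred proteins a raw p-value has to be correspondingly smaller to stay below 0.05 after adjustment.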

To obtain comparable coefficients, the protein abundances are standardized to
zero mean and unit variance prior to linear modelling (note that this does not
influence the significance of the linear models since they are affine
equivariant). In autoscaled data, a difference of 1 corresponds to 1 standard
deviation. The resulting coefficient will be called *effect size* and will be
used to evaluate group differences.

```{r, message = FALSE}
#' perform data normalization (autoscaling)
prot_data_AS <- prot_data
prot_data_AS[, proteins] <- scale(prot_data_AS[, proteins], scale = TRUE)
LMRegtests_AS <- do.call(cbind.data.frame,
                         lapply(proteins,
                                function(x) FunForLinReg(x, prot_data_AS)))
Test_output_AS <- t(LMRegtests_AS)
EffectSize <- Test_output_AS[, grep("coef", colnames(Test_output_AS))]
colnames(EffectSize) <- sub("coef", "effect_size", colnames(EffectSize))
```
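
The statement that autoscaling the response leaves the test results unchanged can be checked on simulated data. The sketch below uses a toy data frame with a single covariate; all names and effect sizes are made up for illustration.

```r
set.seed(42)
n <- 200
toy <- data.frame(
    Sex = factor(rep(c("Male", "Female"), each = n / 2),
                 levels = c("Male", "Female")))
#' Simulated log2 abundance with a sex difference of 0.4.
toy$abundance <- 10 + 0.4 * (toy$Sex == "Female") + rnorm(n, sd = 0.8)
#' Autoscaled version (zero mean, unit variance) as a plain vector.
toy$abundance_AS <- as.numeric(scale(toy$abundance))

res_raw <- coef(summary(lm(abundance ~ Sex, data = toy)))
res_AS <- coef(summary(lm(abundance_AS ~ Sex, data = toy)))

#' The p-value for sex is the same in both models ...
p_equal <- all.equal(res_raw["SexFemale", "Pr(>|t|)"],
                     res_AS["SexFemale", "Pr(>|t|)"])
#' ... and the scaled coefficient is the raw one divided by the SD.
coef_ratio <- res_raw["SexFemale", "Estimate"] /
    res_AS["SexFemale", "Estimate"]
```

Here `coef_ratio` equals `sd(toy$abundance)`, which is why the autoscaled coefficient can be read as a difference in units of standard deviations.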

We next define the criteria to call proteins *significant*. Due to the size of
the data set (and hence the statistical power), proteins would be called
significant even if their difference in concentrations is only very small.
Hence we evaluate whether we could combine the criterion for *statistical
significance* (the p-value) with another criterion that would allow us to
define the proteins that are most *strongly* associated with the contrast of
interest.

```{r significant-effect-size}
comps <- colnames(Test_output)[grep("coef", colnames(Test_output))]
sign_p <- Test_output[, grep("p-adj", colnames(Test_output))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))
#' Define significant analytes based on effect size.
cut_effect_size <- 0.5
sign_es <- sign_p &
    abs(EffectSize) > cut_effect_size
sign_es[, "significant_Age_10"] <- sign_p[, "significant_Age_10"]
```

In addition to *statistical significance*, we require that the difference in
concentrations for a protein is larger than its *technical variance*. The
technical variance of a protein is estimated by the coefficient of variation of
the protein in QC CHRIS Pool samples.

Below we thus evaluate for each protein and comparison whether it fulfills
that criterion.

```{r define-significant}
#' Calculate the (absolute) difference in abundance expressed as a percentage
#' based on a (log2) coefficient.
diff_percentage <- function(x) {
    (2^abs(x) - 1) * 100
}
comps <- colnames(Test_output)[grep("coef", colnames(Test_output))]
diff_perc <- apply(Test_output[, comps],
                   MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
#' define which proteins are considered significant: require adjusted
#' p-value < 0.05 and difference in concentration > CV_QC (expressed in %)
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' Add the significant for numeric variable
sign_cv[, "significant_Age_10"] <- sign_p[, "significant_Age_10"]
```
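
To make the conversion from a log2 coefficient to a percentage difference concrete, the following sketch repeats the function on a few hypothetical values. The coefficients and QC CVs are invented for illustration only.

```r
#' Same conversion as used above: absolute log2 coefficient to percent.
diff_percentage <- function(x) {
    (2^abs(x) - 1) * 100
}
#' Hypothetical log2 coefficients for three proteins.
toy_coef <- c(1, -1, log2(1.25))
toy_diff <- diff_percentage(toy_coef)      # 100, 100 and 25 percent
#' Hypothetical QC coefficients of variation (as fractions, not percent).
toy_cv <- c(0.10, 0.10, 0.30)
#' A difference only counts if it exceeds the technical variation.
toy_pass <- toy_diff > 100 * toy_cv        # TRUE, TRUE, FALSE
```

Because the absolute value is taken, a halving (coefficient of -1) is reported as the same 100% difference as a doubling; the third protein fails the criterion since its 25% difference is below its 30% CV.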

The numbers of proteins with an adjusted p-value smaller than 0.05, and of
proteins called significant by the additional CV criterion, are shown below for
each of the extracted coefficients.

```{r table-sig-mets, results = "asis"}
sig_tab <- data.frame(
    `p_adj < 0.05` = colSums(sign_p),
    cv = colSums(sign_cv),
    check.names = FALSE)
pandoc.table(
    data.frame(no_sign = sig_tab, check.names = FALSE),
    style = "rmarkdown",
    caption = paste0("Number of significant proteins for each ",
                     "comparison and definition of significance. ",
                     "cv: difference in concentration > CV in QC samples."))
```

```{r export-results, warning = FALSE}
#' Loading also the information whether an analyte was already identified
#' before to be significant.
results <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    Test_output,
    EffectSize,
    diff_perc,
    sign_cv)
results$cv_qc_chris <- 100 * results$cv_qc_chris
results$cv_study <- cv_study
results$relative_cv <- cv_rel
#' Add average concentrations for most important groups.
results$Avg_Male <- colMeans(
    prot_data[which(prot_data$Sex == "Male"), proteins])
results$Avg_Female <- colMeans(
    prot_data[which(prot_data$Sex == "Female"), proteins])
results$Avg_BMI1 <- colMeans(
    prot_data[which(prot_data$BMIcat == "1"), proteins])
results$Avg_BMI2 <- colMeans(
    prot_data[which(prot_data$BMIcat == "2"), proteins])
results$Avg_BMI3 <- colMeans(
    prot_data[which(prot_data$BMIcat == "3"), proteins])
results$Avg_BMI4 <- colMeans(
    prot_data[which(prot_data$BMIcat == "4"), proteins])
dr <- paste0("data/xlsx/", filename)
dir.create(dr, showWarnings = FALSE, recursive = TRUE)

colnames(results) <- sub("^description$", "Uniprot", colnames(results))
colnames(results) <- sub("^long_description$", "Description", colnames(results))
write_xlsx(
    results, path = paste0(dr, "/results_sex_age_bmi_hcu_adjusted.xlsx"))
```


## Sex-associated plasma proteins

Proteins significantly associated with sex are extracted from the linear
modeling results.

```{r FIG-volcano-sex, fig.path = IMAGE_PATH, fig.width = 8, fig.height = 8, fig.cap = "Volcano plot for the sex-comparison. Red colored points are those called *significant* by considering the adjusted p-value and CV.", results = "hide", echo = FALSE}
plot_volcano <- function(x, comparison = character(),
                         xlab = "coefficient",
                         ylab = expression(-log[10](p[adj])), ...) {
    X <- x[, paste0("coef_", comparison)]
    Y <- x[, paste0("p.adj_", comparison)]
    #' Replace p-values of exactly 0 to avoid infinite -log10 values.
    Y[Y == 0] <- min(Y[Y > 0]) / 10
    plot(X, -log10(Y), pch = 21,
         col = "#000000ce", bg = "#00000060",
         xlab = "", ylab = "", xaxt = "n",
         ...)
    xat <- axTicks(1, usr = par("usr")[1:2])
    labs <- gsub("-", "\U2212", as.character(xat))
    axis(1, at = xat, labels = labs)
    mtext(side = 1, line = 2.2, text = xlab, cex = par("cex.axis"), las = 0)
    mtext(side = 2, line = 3.5, text = ylab,
          cex = par("cex.axis"), las = 0)
}
significant_points <- function(x, comparison = character(),
                               diff_perc = 2, col = "#E41A1CCE",
                               bg = "#E41A1C40", ...) {
    sig <- x[, paste0("significant_", comparison)]
    X <- x[, paste0("coef_", comparison)]
    Y <- x[, paste0("p.adj_", comparison)]
    Y[Y == 0] <- min(Y[Y > 0]) / 10
    if (any(sig))
        points(X[sig], -log10(Y[sig]), col = col, bg = bg, pch = 21)
}
plot_volcano(results, "SexFemale", main = "Female vs Male")
grid()
abline(h = -log10(0.05), lty = 2, col = "black")
significant_points(results, comparison = "SexFemale")
```
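
Very small adjusted p-values can underflow to exactly zero, which would place the corresponding points at infinity on the volcano plot's y-axis. The sketch below, on made-up values, shows one way of keeping such points plottable by replacing zeros with a tenth of the smallest non-zero p-value.

```r
#' Hypothetical adjusted p-values; one underflowed to exactly 0.
p_adj <- c(0, 1e-300, 1e-20, 0.03, 1)
has_inf <- any(is.infinite(-log10(p_adj)))    # the zero maps to Inf

#' Replace zeros by a tenth of the smallest non-zero p-value. The subset
#' p_plot[p_plot > 0] is needed here: min(p_plot > 0) would instead take
#' the minimum of a logical vector.
p_plot <- p_adj
p_plot[p_plot == 0] <- min(p_plot[p_plot > 0]) / 10
y <- -log10(p_plot)
```

After the replacement all values of `y` are finite and the former zero is the highest point on the plot.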

```{r, echo = FALSE}
#' PAPER Figure 2.
#' Volcano plots for sex and age associations.
dr <- "images/manuscript/"
dir.create(dr, showWarnings = FALSE, recursive = TRUE)
png(paste0(dr, "_Figure_2AB.png"), width = 12, height = 4.1, units = "cm",
    res = 600, pointsize = 3.5)
par(mar = c(4, 7.4, 1, 1.5), mfrow = c(1, 2), cex.axis = 2,
    cex.lab = 2, cex.main = 2.5, las = 1)
plot_volcano(results, "SexFemale", main = "", bty = "n",
             xlab = expression(coef[sex]))
grid()
## abline(h = -log10(0.05), lty = 2)
significant_points(results, sign_es, comparison = "SexFemale", cex = 1.1)
plot_volcano(results, "Age_10", main = "", bty = "n",
             xlab = expression(coef[age]))
grid()
## abline(h = -log10(0.05), lty = 2)
significant_points(results, sign_es, comparison = "Age_10", cex = 1.1)
dev.off()
any_sig <- results$significant_BMIcat1 | results$significant_BMIcat3 |
    results$significant_BMIcat4
coefs <- results[any_sig, c("coef_BMIcat1", "coef_BMIcat3", "coef_BMIcat4")]
colnames(coefs) <- c("1-2", "3-2", "4-2")
rownames(coefs) <- results[rownames(coefs), "Genes"]
## significance cells.
cell_labels <- results[any_sig, c("significant_BMIcat1", "significant_BMIcat3",
                                  "significant_BMIcat4")]
cell_labels <- do.call(
    cbind, lapply(cell_labels, function(z) ifelse(z, "\u2731", "")))
pheatmap(coefs, filename = paste0(dr, "_Figure_2C.png"),
         width = 2.8, height = 3.8,
         breaks = seq(-0.5, 0.5, length.out = 101),
         display_numbers = cell_labels, las = 1)

png(paste0(dr, "_Figure_2D.png"), width = 7, height = 5, units = "cm",
    res = 600, pointsize = 3.5)
## par(mar = c(4, 4.7, 1, 1.5), mfrow = c(1, 2), cex.axis = 2,
##     cex.lab = 2, cex.main = 2.5, las = 1)
par(cex.axis = 1, cex.lab = 1, cex.main = 1.5, las = 1)
udata <- results[, c("significant_SexFemale", "significant_Age_10",
                     "significant_BMIcat4", "significant_HCUYes")]
colnames(udata) <- c("Sex", "Age", "BMI4", "HCU")
library(UpSetR)
udata <- as.data.frame(lapply(udata, as.integer))
upset(udata, order.by = "freq", point.size = 1.2, line.size = 0.5,
      text.scale = c(0.8, 0.8, 0.65, 0.8, 0.75, 0.8))
dev.off()
```

The (static) table with the significant proteins is shown below:

```{r table-sig-sex, results = "asis", echo = FALSE}
significant_table <- function(x, comparison = character(),
                              columns = c("Genes", "Description",
                                          "cv_qc_chris",
                                          paste0(c("coef_", "p.adj_",
                                                   "effect_size_"),
                                                 comparison))) {
    sigs <- x[, paste0("significant_", comparison)]
    tmp <- x[sigs, columns]
    tmp <- tmp[order(tmp[, paste0("p.adj_", comparison)]), ]
    colnames(tmp) <- sub(paste0("_", comparison), "", colnames(tmp))
    tmp
}
sig_tab <- significant_table(results, "SexFemale")
#' Exporting results
write_xlsx(sig_tab, path = paste0(dr, "/significant_SexFemale.xlsx"))
pandoc.table(
    sig_tab,
    style = "rmarkdown", split.table = Inf,
    caption = paste0("Significant proteins for Sex (Female vs Male)."))
```

The top 3 proteins by p-value are visualized as boxplots in the following
figures. To evaluate the influence of hormonal contraceptive use, data points
are colored accordingly.

```{r, BP_Sex_1_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6, warning = FALSE}
s_1 <- prot_plot(prot_data, aes(x = Sex, y = P02753),
                   protein = "P02753")
ggplotly_or_not(s_1, PLOTLY)
```

Male study participants have about 25% higher concentrations of RBP4 in
plasma. Retinol-binding protein 4 belongs to the lipocalin family and is the
specific carrier for retinol (vitamin A alcohol) in the blood. This protein has
a significant role in the visual cycle as it is the primary transporter of
retinol to the retinal pigment epithelium. Serum retinol concentrations as well
as RBP4 were reported to be lower in females compared to males
[PMID:35108705](https://pubmed.ncbi.nlm.nih.gov/35108705/) (small, potentially
biased, cohort). This is thus not really a novel finding.

```{r, BP_Sex_2_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
s_2 <- prot_plot(prot_data, aes(x = Sex, y = P00450),
                 protein = "P00450")
ggplotly_or_not(s_2, PLOTLY)
```

```{r, BP_Sex_3_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
s_3 <- prot_plot(prot_data, aes(x = Sex, y = P80108),
                 protein = "P80108")
ggplotly_or_not(s_3, PLOTLY)
```


## Age-associated plasma proteins


To identify age-related proteins, results from linear models fitted previously
are used.

```{r FIG-volcano-age, fig.path = IMAGE_PATH, fig.width = 8, fig.height = 8, fig.cap = "Volcano plot showing age-related associations. Red colored points are those with an adjusted p-value smaller than 0.05.", echo = FALSE, results = "hide"}
plot_volcano(results, "Age_10", main = "Age")
abline(h = -log10(0.05), lty = 2)
grid()
significant_points(cbind(results, sign_cv), comparison = "Age_10")
```

```{r table-sig-age, echo = FALSE, results = "asis"}
sig_tab <- significant_table(results, "Age_10")
#' Exporting results
write_xlsx(sig_tab, path = paste0(dr, "/significant_Age.xlsx"))
pandoc.table(
    sig_tab,
    style = "rmarkdown", split.table = Inf,
    caption = paste0("Significant proteins for Age. "))
```

The top 3 proteins by p-value are visualized in scatter plots against age, with
points colored according to sex. Instead of the actual concentrations, the
residuals from a linear model fitted to the data including all covariates except
age are used.

```{r, SP_Age_1_figure, fig.path = IMAGE_PATH, fig.width = 6, fig.height = 5}
scatter_plot <- function(data, aes, protein) {
    ggplot(data, aes) +
    geom_jitter(aes(colour = Sex), position=position_jitter(0.05)) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
    panel.background = element_blank()) +
    ggpubr::stat_cor(method = "pearson", label.x.npc = "middle",
    label.y.npc = "top", col="black") +
    labs(title = protein) +  ylab("Adjusted conc.") +
    scale_color_manual(values=c("#0000ff80", "#ff000080")) +
    geom_smooth(method = "lm", se = FALSE, color="black") +
    theme_bw()
    }

LMReg_1 <- lm(`P35858;P35858-2` ~ Sex + Age + Fasting + BMIcat + HCU,
              data = prot_data)
LMReg_2 <- lm(`P35858;P35858-2` ~ Sex + Fasting + BMIcat + HCU,
              data = prot_data)
residualS <- data.frame(LMReg_2$residuals)
ResData <- cbind.data.frame(model.frame(LMReg_1),residualS)

a_1 <- scatter_plot(ResData, aes(x = Age, y = `LMReg_2.residuals`),
                    protein  = "P35858;P35858-2")
ggplotly_or_not(a_1, PLOTLY)

ggsave("images/manuscript/supplement/Figure_age_1.png", plot = a_1,
       dpi = 600, width = 10, height = 5, units = "cm", scale = 2.5)

```
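
The same two-step procedure (fit a model without age, then plot its residuals against age) is repeated for each protein. A minimal sketch of how it could be wrapped into a helper is given here; `adjusted_residuals()` is a hypothetical function, the data are simulated, and the use of `na.exclude` is an assumption made so that residuals stay aligned with the input rows when values are missing.

```r
#' Residuals of `protein` after regressing out `covariates` (all column
#' names of `data`). Hypothetical helper, not used in this report.
adjusted_residuals <- function(data, protein, covariates) {
    fml <- reformulate(covariates, response = as.name(protein))
    fit <- lm(fml, data = data, na.action = na.exclude)
    #' With na.exclude, residuals() pads rows with missing values with NA.
    unname(residuals(fit))
}

set.seed(1)
n <- 300
toy <- data.frame(
    Sex = factor(sample(c("Male", "Female"), n, replace = TRUE)),
    Age = runif(n, 18, 80))
#' Simulated abundance increasing with age and differing by sex.
toy$protein_x <- 0.02 * toy$Age + 0.5 * (toy$Sex == "Female") +
    rnorm(n, sd = 0.3)
toy$protein_x[5] <- NA

toy$adjusted <- adjusted_residuals(toy, "protein_x", covariates = "Sex")
age_cor <- cor(toy$Age, toy$adjusted, use = "complete.obs")
```

The residuals keep the age trend because age was left out of the model, while the sex difference is removed.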

```{r, SP_Age_2_figure, fig.path = IMAGE_PATH, fig.width = 6, fig.height = 5}
LMReg_1 <- lm(`P04004` ~ Sex + Age + Fasting + BMIcat + HCU,
              data = prot_data)
LMReg_2 <- lm(`P04004` ~ Sex + Fasting + BMIcat + HCU,
              data = prot_data)
residualS <- data.frame(LMReg_2$residuals)
ResData <- cbind.data.frame(model.frame(LMReg_1),residualS)
##
a_2 <- scatter_plot(ResData, aes(x = Age, y = LMReg_2.residuals),
                    protein  = "P04004")
ggplotly_or_not(a_2, PLOTLY)

ggsave("images/manuscript/supplement/Figure_age_2.png", plot = a_2,
       dpi = 600, width = 10, height = 5, units = "cm", scale = 2.5)

```

```{r, SP_Age_3_figure, fig.path = IMAGE_PATH, fig.width = 6, fig.height = 5}
LMReg_1 <- lm(`P02787` ~ Sex + Age + Fasting + BMIcat + HCU,
              data = prot_data)
LMReg_2 <- lm(`P02787` ~ Sex + Fasting + BMIcat + HCU,
              data = prot_data)
residualS <- data.frame(LMReg_2$residuals)
ResData <- cbind.data.frame(model.frame(LMReg_1),residualS)
##
a_3 <- scatter_plot(ResData, aes(x = Age, y = LMReg_2.residuals),
                    protein  = "P02787")

ggplotly_or_not(a_3, PLOTLY)

ggsave("images/manuscript/supplement/Figure_age_3.png", plot = a_3,
       dpi = 600, width = 10, height = 5, units = "cm", scale = 2.5)
```

**IGFALS** (insulin like growth factor binding protein acid labile subunit) is a
serum protein that binds insulin-like growth factors, increasing their half-life
and their vascular localization. Downregulation of IGFALS mRNA with age was
described in a two-group liver transcriptome analysis
[PMID:34959291](https://pubmed.ncbi.nlm.nih.gov/34959291/). Other reports
mentioning an age-dependency (or dependency on age-related traits) exist:
[PMID:27329260](https://pubmed.ncbi.nlm.nih.gov/27329260/).

**VTN** (vitronectin) acts in part as an adhesive glycoprotein. This protein
also inhibits the membrane-damaging effect of the terminal cytolytic complement
pathway and binds to several serpin serine protease inhibitors. It is involved in
a variety of other biological processes such as the regulation of the
coagulation pathway, wound healing, and tissue remodeling. It is also a lipid
binding protein that forms a principal component of the high density
lipoprotein.

**APOB** (apolipoprotein B) is the main apolipoprotein of chylomicrons and low
density lipoprotein (LDL).

**APOL1** (apolipoprotein L1) is a secreted, HDL-associated apolipoprotein that
binds apolipoprotein A-I, the major apoprotein of HDL.

Decreasing VTN and APOL1 with age might indicate decreasing concentration of HDL
with age, while LDL increases.


## Identification of BMI related proteins

To identify BMI related proteins, results from linear models fitted previously
are used. The volcano plot below shows the results for the 3 comparisons (i.e.
BMI category 1, 3 or 4 against the baseline BMI category 2).

```{r FIG-volcano-bmi, fig.path = IMAGE_PATH, fig.width = 15, fig.height = 5, fig.cap = "Volcano plot for BMI-related proteins. Red colored points are those with an adjusted p-value smaller than 0.05 and a difference in concentrations which is larger than 2 times the coefficient of variation for that analyte calculated on QC samples.", results = "hide", echo = FALSE}
par(mfrow = c(1, 3), mar = c(4, 4, 1, 0.5))
xlim <- range(results[, c("coef_BMIcat1", "coef_BMIcat3", "coef_BMIcat4")])
ylim <- range(
    -log10(results[, c("p.adj_BMIcat1", "p.adj_BMIcat3", "p.adj_BMIcat4")]))
plot_volcano(results, "BMIcat1", main = "BMI 1 against 2",
             xlim = xlim, ylim = ylim)
abline(h = -log10(0.05), lty = 2)
grid()
significant_points(results, sign_es, comparison = "BMIcat1")
plot_volcano(results, "BMIcat3", main = "BMI 3 against 2",
             xlim = xlim, ylim = ylim)
abline(h = -log10(0.05), lty = 2)
grid()
significant_points(results, sign_es, comparison = "BMIcat3")
plot_volcano(results, "BMIcat4", main = "BMI 4 against 2",
             xlim = xlim, ylim = ylim)
abline(h = -log10(0.05), lty = 2)
grid()
significant_points(results, sign_es, comparison = "BMIcat4")
```
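
The coefficient names `BMIcat1`, `BMIcat3` and `BMIcat4` reflect R's default treatment contrasts with category 2 as the reference level. How `BMIcat` was encoded is not shown in this section; the sketch below assumes a factor re-leveled with `relevel()` and uses a toy vector.

```r
#' Toy BMI categories; category 2 is set as reference level.
BMIcat <- relevel(factor(c("1", "2", "3", "4", "2", "4")), ref = "2")
bmi_levels <- levels(BMIcat)       # "2" "1" "3" "4"
#' Dummy columns created by lm(): one per non-reference category, each
#' coefficient being the difference to category 2.
design <- model.matrix(~ BMIcat)
design_cols <- colnames(design)
```

The design matrix has an intercept plus the columns `BMIcat1`, `BMIcat3` and `BMIcat4`, matching the three comparisons shown in the volcano plots.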

As expected, the highest number of significant proteins was detected for the
comparison of BMI category 4 against baseline 2. Also, the size and
significance of the associations increase with BMI. The tables below show the
significant proteins for each comparison.

```{r table-sig-BMI1, results = "asis", echo = FALSE}
sig_tab <- significant_table(results, "BMIcat1")
## Exporting results
write_xlsx(sig_tab, path = paste0(dr, "/significant_BMIcat1vs2.xlsx"))
pandoc.table(
    sig_tab,
    style = "rmarkdown", split.table = Inf,
    caption = paste0("Significant proteins for BMI cat 1 vs 2."))
```

```{r table-sig-BMI3, results = "asis", echo = FALSE}
sig_tab <- significant_table(results, "BMIcat3")
## Exporting results
write_xlsx(sig_tab, path = paste0(dr, "/significant_BMIcat3vs2.xlsx"))
pandoc.table(
    sig_tab,
    style = "rmarkdown", split.table = Inf,
    caption = paste0("Significant proteins for BMI cat 3 vs 2."))
```

```{r table-sig-BMI4, results = "asis", echo = FALSE}
sig_tab <- significant_table(results, "BMIcat4")
## Exporting results
write_xlsx(sig_tab, path = paste0(dr, "/significant_BMIcat4vs2.xlsx"))
pandoc.table(
    sig_tab,
    style = "rmarkdown", split.table = Inf,
    caption = paste0("Significant proteins for BMI cat 4 vs 2."))
```

An overlap of significant proteins can be seen across the comparisons of all
levels (P01023, alpha-2-macroglobulin). Not unexpectedly, the coefficients for
these proteins are also larger in the comparison for BMI 4 than for BMI 3.

**A2M** (alpha-2-macroglobulin) is a protease inhibitor and cytokine
transporter. It can inhibit inflammatory cytokines, and it thus disrupts
inflammatory cascades. A negative correlation between A2M mRNA expression and
BMI was described previously
[PMID:32927694](https://pubmed.ncbi.nlm.nih.gov/32927694/).

**C3** (complement C3) plays a central role in the activation of the complement
system. Its activation is required for both classical and alternative complement
activation pathways.

**CFH** (complement factor H) has an essential role in the regulation of
complement activation, restricting this innate defense mechanism to microbial
infections.

**APOA4** (apolipoprotein A4) encodes an apolipoprotein whose precise function
is not known. APOA4 is a potent activator of lecithin-cholesterol
acyltransferase in vitro.

All 5 proteins are visualized as boxplots in the following figures. Points are
shaded according to sample age.

```{r, BP_BMI_1_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
bmi_1 <- prot_plot(prot_data, aes(x = BMIcat, y = P01024),
                   protein = "P01024")
ggplotly_or_not(bmi_1, PLOTLY)
```

```{r, BP_BMI_2_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
bmi_2 <- prot_plot(prot_data, aes(x = BMIcat, y = P05090),
                   protein = "P05090")
ggplotly_or_not(bmi_2, PLOTLY)
```

```{r, BP_BMI_3_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
bmi_3 <- prot_plot(prot_data, aes(x = BMIcat, y = P08603),
                   protein = "P08603")
ggplotly_or_not(bmi_3, PLOTLY)
```

```{r, BP_BMI_4_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
bmi_4 <- prot_plot(prot_data, aes(x = BMIcat, y = P01023),
                   protein = "P01023")
ggplotly_or_not(bmi_4, PLOTLY)
```

```{r, BP_BMI_5_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
bmi_5 <- prot_plot(prot_data, aes(x = BMIcat, y = P06727),
                   protein = "P06727")
ggplotly_or_not(bmi_5, PLOTLY)
```


## Identification of hormonal contraceptive use related proteins

To identify proteins associated with hormonal contraceptive use, results from
linear models fitted previously are used.

```{r FIG-volcano-sexHorm, fig.path = IMAGE_PATH, fig.width = 8, fig.height = 8, fig.cap = "Volcano plot showing hormonal contraceptive use-related associations. Red colored points are those with an adjusted p-value smaller than 0.05.", echo = FALSE, results = "hide"}
plot_volcano(results, "HCUYes", main = "HCU")
abline(h = -log10(0.05), lty = 2)
significant_points(cbind(results, sign_cv), comparison = "HCUYes")
```

```{r table-sig-sex-horm, echo = FALSE, results = "asis"}
sig_tab <- significant_table(results, "HCUYes")
## Exporting results
write_xlsx(sig_tab, path = paste0(dr, "/significant_HCU.xlsx"))
pandoc.table(
    sig_tab,
    style = "rmarkdown", split.table = Inf,
    caption = paste0("Significant proteins for hormonal contraceptive use. "))
```

**AGT** (angiotensinogen) is expressed in the liver and is cleaved by the enzyme
renin in response to lowered blood pressure. The protein is involved in
maintaining blood pressure, body fluid and electrolyte homeostasis.

**CP** (ceruloplasmin) is a metalloprotein that binds most of the copper in
plasma and is involved in the peroxidation of Fe(II)transferrin to
Fe(III)transferrin.

**SHBG** (sex hormone binding globulin) encodes a protein which transports
androgens and estrogens in the blood.

**FETUB** (fetuin B) is part of a protein family which has been implicated in
osteogenesis, bone resorption, regulation of the insulin and hepatocyte growth
factor receptors, and response to systemic inflammation.

**SERPINA6** (serpin family A member 6) encodes an alpha-globulin protein with
corticosteroid-binding properties. This is the major transport protein for
glucocorticoids and progestins in the blood of most vertebrates.

**SERPINA7** (serpin family A member 7) encodes the major thyroid hormone
transport protein.

**SERPING1** (serpin family G member 1) is a highly glycosylated plasma protein
involved in the regulation of the complement cascade.

**PGLYRP2** (peptidoglycan recognition protein 2).

The top 5 proteins by p-value are visualized as boxplots in the following
figures.


```{r, BP_HCU_1_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
sh_1 <- prot_plot(prot_data, aes(x = HCU, y = P01019),
                  protein = "P01019")
ggplotly_or_not(sh_1, PLOTLY)
```

```{r, BP_HCU_2_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
sh_2 <- prot_plot(prot_data, aes(x = HCU, y = P00450),
                  protein = "P00450")
ggplotly_or_not(sh_2, PLOTLY)
```

```{r, BP_HCU_3_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
sh_3 <- prot_plot(prot_data, aes(x = HCU, y = Q96PD5),
                   protein = "Q96PD5")
ggplotly_or_not(sh_3, PLOTLY)
```

```{r, BP_HCU_4_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
sh_4 <- prot_plot(prot_data, aes(x = HCU, y = P04278),
                   protein = "P04278")
ggplotly_or_not(sh_4, PLOTLY)
```

```{r, BP_HCU_5_figure, fig.path = IMAGE_PATH, fig.width=10, fig.height=6}
sh_5 <- prot_plot(prot_data, aes(x = HCU, y = P08185),
                   protein = "P08185")
ggplotly_or_not(sh_5, PLOTLY)
```


# Sensitivity analysis: impact of hormonal contraceptives on age, sex, BMI results

We next conduct a sensitivity analysis to evaluate the influence of hormonal
contraceptives on the results of potentially related traits such as sex or
age. To this end we fit linear regression models to the data that do not include
a variable for hormonal contraceptive use.

```{r}
FunForLinReg2 <- function(AnalyteName, Data) {
    Analyte <- Data[, AnalyteName]
    LMReg <- lm(Analyte ~ Sex + Age_10 + Fasting + BMIcat,
                data = Data)
    res <- coef(summary(LMReg))[-1, ]
    p.Values <- data.frame(c(res[, 1],
                             res[, 4]))
    names(p.Values) <- AnalyteName
    rownames(p.Values) <- c(paste0("coef_", rownames(res)),
                            paste0("p-value_", rownames(res)))
    return(p.Values)
}
LMRegtests <- do.call(
    cbind, lapply(proteins, function(x) FunForLinReg2(x, prot_data)))
Test_output_sens <- t(LMRegtests)
adjp <- apply(Test_output_sens[, grep("p-value", colnames(Test_output_sens))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
Test_output_sens  <- cbind(Test_output_sens, adjp)
#' Create result data.frame
comps <- colnames(Test_output_sens)[grep("coef", colnames(Test_output_sens))]
sign_p <- Test_output_sens[, grep("p-adj", colnames(Test_output_sens))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))
diff_perc <- apply(Test_output_sens[, comps],
                   MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' Add the significant for numeric variable
sign_cv[, "significant_Age_10"] <- sign_p[, "significant_Age_10"]
results_sens <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    Test_output_sens,
    diff_perc,
    sign_cv)
```

The table below lists the number of significant proteins for each variable.

```{r, results = "asis"}
sig_tab <- data.frame(
    `p_adj < 0.05` = colSums(sign_p),
    cv = colSums(sign_cv),
    check.names = FALSE)
pandoc.table(
    data.frame(no_sign = sig_tab, check.names = FALSE),
    style = "rmarkdown",
    caption = paste0("Number of significant proteins for each ",
                     "comparison and definition of significance. ",
                     "cv: difference in concentration > CV in QC samples."))
```

The distribution of (absolute) differences in coefficients between the two
models:

```{r}
quantile(abs(Test_output[, "coef_SexFemale"] -
             Test_output_sens[, "coef_SexFemale"]))
quantile(abs(Test_output[, "coef_Age_10"] -
             Test_output_sens[, "coef_Age_10"]))

```


## Influence on sex-associated proteins

We compare the (raw) p-values and coefficients for association with sex from the
models with and without adjustment for use of hormonal contraceptives.

```{r sensitivity-sex, fig.path = IMAGE_PATH, fig.width = 12, fig.height = 6, fig.cap = "Comparison of p-values (left) and coefficients (right) for association with sex from the models with (x-axis) and without adjustment for hormonal contraceptive use."}
par(mfrow = c(1, 2))
x <- Test_output[, "p-value_SexFemale"]
y <- Test_output_sens[, "p-value_SexFemale"]
x[x == 0] <- min(x[x > 0]) / 10
y[y == 0] <- min(y[y > 0]) / 10
x <- -log10(x)
y <- -log10(y)
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(-log[10]~p-value[adj]),
     ylab = expression(-log[10]~p-value),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "Sex association")
grid()
abline(0, 1)
x <- Test_output[, "coef_SexFemale"]
y <- Test_output_sens[, "coef_SexFemale"]
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(coef[adj]),
     ylab = expression(coef),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "Sex association")
grid()
abline(0, 1)
```

Coefficients are larger and p-values smaller in the models without adjustment
for hormonal contraceptive usage.
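
A shift in this direction is what one would expect when a variable that affects abundance and occurs in only one sex is left out of the model. The simulation below is a sketch with made-up numbers (30% hormonal contraceptive use among women, an HCU effect of 1.0 and a sex effect of 0.2 on the log2 scale); it is not an estimate from the CHRIS data.

```r
set.seed(7)
n <- 2000
toy <- data.frame(
    Sex = factor(rep(c("Male", "Female"), each = n / 2),
                 levels = c("Male", "Female")))
#' Assume 30% of women (and no men) use hormonal contraceptives.
toy$HCU <- factor(ifelse(toy$Sex == "Female" & runif(n) < 0.3, "Yes", "No"),
                  levels = c("No", "Yes"))
#' Simulated protein: small sex effect (0.2), large HCU effect (1.0).
toy$abundance <- 0.2 * (toy$Sex == "Female") + 1.0 * (toy$HCU == "Yes") +
    rnorm(n, sd = 0.5)

coef_adj <- coef(lm(abundance ~ Sex + HCU, data = toy))["SexFemale"]
coef_unadj <- coef(lm(abundance ~ Sex, data = toy))["SexFemale"]
```

Without the `HCU` term, the sex coefficient absorbs the contraceptive effect weighted by the fraction of users among women, so `coef_unadj` is larger than `coef_adj`.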


## Influence on age-associated proteins

We compare the (raw) p-values and coefficients for association with age from the
models with and without adjustment for use of hormonal contraceptives.

```{r sensitivity-age, fig.path = IMAGE_PATH, fig.width = 12, fig.height = 6, fig.cap = "Comparison of p-values (left) and coefficients (right) for association with age from the models with (x-axis) and without adjustment for hormonal contraceptive use."}
par(mfrow = c(1, 2))
x <- Test_output[, "p-value_Age_10"]
y <- Test_output_sens[, "p-value_Age_10"]
x[x == 0] <- min(x[x > 0]) / 10
y[y == 0] <- min(y[y > 0]) / 10
x <- -log10(x)
y <- -log10(y)
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(-log[10]~p-value[adj]),
     ylab = expression(-log[10]~p-value),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "Age association")
grid()
abline(0, 1)
x <- Test_output[, "coef_Age_10"]
y <- Test_output_sens[, "coef_Age_10"]
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(coef[adj]),
     ylab = expression(coef),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "Age association")
grid()
abline(0, 1)
```

The effect of the adjustment is even larger for age-associated proteins.


## Influence on BMI-associated proteins

We compare the (raw) p-values and coefficients for association with BMI
(category 4) from the models with and without adjustment for use of hormonal
contraceptives.

```{r sensitivity-bmi4, fig.path = IMAGE_PATH, fig.width = 12, fig.height = 6, fig.cap = "Comparison of p-values (left) and coefficients (right) for association with BMI from the models with (x-axis) and without adjustment for hormonal contraceptive use."}
par(mfrow = c(1, 2))
x <- Test_output[, "p-value_BMIcat4"]
y <- Test_output_sens[, "p-value_BMIcat4"]
x[x == 0] <- min(x[x > 0]) / 10
y[y == 0] <- min(y[y > 0]) / 10
x <- -log10(x)
y <- -log10(y)
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(-log[10]~p-value[adj]),
     ylab = expression(-log[10]~p-value),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "BMI 4 vs 2")
grid()
abline(0, 1)
x <- Test_output[, "coef_BMIcat4"]
y <- Test_output_sens[, "coef_BMIcat4"]
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(coef[adj]),
     ylab = expression(coef),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "BMI 4 vs 2")
grid()
abline(0, 1)
```

Adjustment for hormonal contraceptive use has only a minimal influence on the
results for BMI.

```{r}
#' PAPER Figure 3
#' results from sensitivity analysis
png("images/manuscript/Figure_3.png", width = 12, height = 4, units = "cm",
    res = 600, pointsize = 4)
par(mfrow = c(1, 3), mar = c(4.5, 5.5, 3.5, 0.5), cex.lab = 2, bty = "n",
    cex.main = 2.5, cex.axis = 2, las = 1)
## Sex
x <- Test_output[, "coef_SexFemale"]
y <- Test_output_sens[, "coef_SexFemale"]
bg <- rep("#00000040", length(x))
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(coef[adj]),
     ylab = "",
     pch = 21, col = "#00000080", bg = bg,
     main = "", cex = 1.5)
mtext(text = expression(coef), side = 2, line = 4, las = 0, cex = 1.3)
grid()
abline(0, 1)
mtext(side = 3, outer = FALSE, text = "a", cex = 3, at = -0.6, line = -0.5)
#' Age
x <- Test_output[, "coef_Age_10"]
y <- Test_output_sens[, "coef_Age_10"]
bg <- rep("#00000040", length(x))
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(coef[adj]),
     ylab = "",
     pch = 21, col = "#00000080", bg = bg,
     main = "", cex = 1.5)
mtext(text = expression(coef), side = 2, line = 4, las = 0, cex = 1.3)
grid()
abline(0, 1)
mtext(side = 3, outer = FALSE, text = "b", cex = 3, at = -0.16, line = -0.5)
#' BMI 4 vs 2
x <- Test_output[, "coef_BMIcat4"]
y <- Test_output_sens[, "coef_BMIcat4"]
bg <- rep("#00000040", length(x))
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(coef[adj]),
     ylab = "",
     pch = 21, col = "#00000080", bg = bg,
     main = "", cex = 1.5)
mtext(text = expression(coef), side = 2, line = 4, las = 0, cex = 1.3)
grid()
abline(0, 1)
mtext(side = 3, outer = FALSE, text = "c", cex = 3, at = -0.7, line = -0.5)
dev.off()

```

We next identify proteins that are significantly associated with sex in the
analysis without adjustment for hormonal contraceptive use (HCU), but no longer
in the analysis with HCU adjustment.

```{r, results = "asis"}
sel <- results_sens$significant_SexFemale & !results$significant_SexFemale
tab <- cbind(results_sens[sel, c("Genes", "coef_SexFemale", "p.adj_SexFemale")],
             results[sel, c("coef_SexFemale", "p.adj_SexFemale",
                            "coef_HCUYes", "p.adj_HCUYes")])
tab <- tab[order(tab[, 3]), ]
pandoc.table(
    tab, style = "rmarkdown", split.table = Inf,
    caption = paste0("Proteins significantly associated with sex in the ",
                     "analysis without HCU adjustment that are no longer ",
                     "significant if HCU is accounted for. Last 4 columns ",
                     "contain the results from the analysis with HCU ",
                     "adjustment"))
dr <- paste0("data/xlsx/", filename)
write_xlsx(tab, path = paste0(dr, "/sensitivity_sex.xlsx"))
```
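
The selection logic used above combines the significance calls of the two
models with logical operators. A minimal sketch with made-up calls for five
hypothetical proteins (`P1` to `P5`):

```r
#' Made-up significance calls for five proteins in two models
sig_unadjusted <- c(P1 = TRUE, P2 = TRUE, P3 = FALSE, P4 = TRUE, P5 = FALSE)
sig_adjusted <- c(P1 = TRUE, P2 = FALSE, P3 = FALSE, P4 = FALSE, P5 = TRUE)

#' Significant without adjustment, but no longer significant with adjustment
lost <- sig_unadjusted & !sig_adjusted
names(lost)[lost]

#' Cross-tabulation of the calls from both models
table(unadjusted = sig_unadjusted, adjusted = sig_adjusted)
```

The cross-tabulation also shows the reverse case (significant only with
adjustment), which the selection above does not report.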

We next identify the proteins that were significant for age in the analysis
without HCU adjustment but no longer age associated if HCU is accounted for.

```{r, results = "asis"}
sel <- results_sens$significant_Age_10 & !results$significant_Age_10
tab <- cbind(results_sens[sel, c("Genes", "coef_Age_10", "p.adj_Age_10")],
             results[sel, c("coef_Age_10", "p.adj_Age_10",
                            "coef_HCUYes", "p.adj_HCUYes")])
tab <- tab[order(tab[, 3]), ]
pandoc.table(
    tab, style = "rmarkdown", split.table = Inf,
    caption = paste0("Proteins significantly associated with age in the ",
                     "analysis without HCU adjustment that are no longer ",
                     "significant if HCU is accounted for. Last 4 columns ",
                     "contain the results from the analysis with HCU ",
                     "adjustment"))
dr <- paste0("data/xlsx/", filename)
write_xlsx(tab, path = paste0(dr, "/sensitivity_age.xlsx"))
```

Of the `r sum(results_sens$significant_SexFemale)` proteins significantly
associated with sex in the analysis without HCU adjustment,
`r sum(results_sens$significant_SexFemale & results$significant_HCUYes)`
are significantly associated with hormonal contraceptive use. Similarly, of the
`r sum(results_sens$significant_Age_10)` age-associated proteins,
`r sum(results_sens$significant_Age_10 & results$significant_HCUYes)` are
significantly associated with hormonal contraceptive use.


# Sensitivity analysis: impact of sample's age on the results

We next conduct a sensitivity analysis to evaluate the influence of the sample's
age on the results. To this end we fit linear regression models that
additionally include a (numeric) variable for the sample's age.

```{r}
FunForLinRegSampleAge <- function(AnalyteName, Data) {
    Analyte <- Data[, AnalyteName]
    LMReg <- lm(Analyte ~ Sex + Age_10 + Fasting + BMIcat + HCU + sample_age,
                data = Data)
    res <- coef(summary(LMReg))[-1, ]
    p.Values <- data.frame(c(res[, 1],
                             res[, 4]))
    names(p.Values) <- AnalyteName
    rownames(p.Values) <- c(paste0("coef_", rownames(res)),
                            paste0("p-value_", rownames(res)))
    p.Values
}

tmp <- prot_data
tmp$sample_age <- sample_age

res_sample_age <- do.call(
    cbind, lapply(proteins, function(x) FunForLinRegSampleAge(x, tmp)))
res_sample_age <- t(res_sample_age)
adjp <- apply(res_sample_age[, grep("p-value", colnames(res_sample_age))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
res_sample_age  <- cbind(res_sample_age, adjp)
#' Create result data.frame
comps <- colnames(res_sample_age)[grep("coef", colnames(res_sample_age))]
sign_p <- res_sample_age[, grep("p-adj", colnames(res_sample_age))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))
diff_perc <- apply(res_sample_age[, comps],
                   MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' For numeric variables, significance is based on the adjusted p-value only
sign_cv[, "significant_Age_10"] <- sign_p[, "significant_Age_10"]
sign_cv[, "significant_sample_age"] <- sign_p[, "significant_sample_age"]
results_sample_age <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    res_sample_age,
    diff_perc,
    sign_cv)
```
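
The per-protein model fit and the subsequent Bonferroni adjustment can be
sketched on simulated data. All names below (`toy`, `fit_one`, `prot_a` etc.)
are hypothetical and the data are made up; the sketch only mirrors the
structure of the code above (one model per analyte, coefficients and raw
p-values collected into one row per analyte, adjustment per model term).

```r
set.seed(123)
#' Simulated data: 60 samples, two covariates and three analytes (all made up)
n <- 60
toy <- data.frame(
    Sex = factor(rep(c("Male", "Female"), each = n / 2),
                 levels = c("Male", "Female")),
    Age_10 = runif(n, 2, 8))
toy$prot_a <- 10 + 0.8 * (toy$Sex == "Female") + rnorm(n, sd = 0.3)
toy$prot_b <- 12 + 0.2 * toy$Age_10 + rnorm(n, sd = 0.3)
toy$prot_c <- 9 + rnorm(n, sd = 0.3)
analytes <- c("prot_a", "prot_b", "prot_c")

#' Fit one model per analyte; return coefficients and raw p-values
fit_one <- function(analyte, data) {
    fit <- lm(data[[analyte]] ~ Sex + Age_10, data = data)
    res <- coef(summary(fit))[-1, , drop = FALSE]   # drop the intercept
    c(setNames(res[, "Estimate"], paste0("coef_", rownames(res))),
      setNames(res[, "Pr(>|t|)"], paste0("p-value_", rownames(res))))
}
toy_res <- t(vapply(analytes, fit_one, numeric(4), data = toy))

#' Bonferroni adjustment per model term, i.e. across analytes
p_cols <- grep("p-value", colnames(toy_res))
toy_adj <- apply(toy_res[, p_cols, drop = FALSE], 2, p.adjust,
                 method = "bonferroni")
colnames(toy_adj) <- sub("value", "adj", colnames(toy_adj))
toy_res <- cbind(toy_res, toy_adj)
round(toy_res, 4)
```

With the Bonferroni method each raw p-value is multiplied by the number of
tested analytes and capped at 1.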

The table below lists the significant proteins for the various variables.

```{r, results = "asis"}
sig_tab <- data.frame(
    `p_adj < 0.05` = colSums(sign_p),
    cv = colSums(sign_cv),
    check.names = FALSE)
pandoc.table(
    data.frame(no_sign = sig_tab, check.names = FALSE),
    style = "rmarkdown",
    caption = paste0("Number of significant proteins for each ",
                     "comparison and definition of significance. ",
                     "cv: difference in concentration > CV in QC samples."))
```
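
The two definitions of significance counted in the table can be illustrated
with made-up values. Note that `diff_percentage_sketch` below is a hypothetical
stand-in: it assumes coefficients on the log2 scale and may differ from the
`diff_percentage` function defined earlier in this document.

```r
#' ASSUMPTION: coefficients are on the log2 scale; this function is a
#' hypothetical stand-in for `diff_percentage` and may differ from it.
diff_percentage_sketch <- function(x) (2^abs(x) - 1) * 100

#' Made-up values for four proteins
coefs <- c(0.50, 0.05, 0.40, 0.02)
p_adj <- c(0.001, 0.001, 0.20, 0.90)
cv_qc <- c(0.10, 0.10, 0.10, 0.10)    # CV in QC samples, as a fraction

#' Definition 1: adjusted p-value only
sig_p <- p_adj < 0.05
#' Definition 2: additionally require the difference to exceed the CV
sig_cv <- sig_p & diff_percentage_sketch(coefs) > 100 * cv_qc
data.frame(coefs, p_adj, diff_perc = diff_percentage_sketch(coefs),
           sig_p, sig_cv)
```

In this toy example the second protein passes the p-value cut-off, but its
difference in abundance is smaller than the technical variability and it is
hence not considered significant under the second definition.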

The distribution of (absolute) differences in coefficients between the two
models:

```{r}
quantile(abs(Test_output[, "coef_SexFemale"] -
             results_sample_age[, "coef_SexFemale"]))
quantile(abs(Test_output[, "coef_Age_10"] -
             results_sample_age[, "coef_Age_10"]))

```

```{r}

png("images/manuscript/supplement/Figure_sensitivity_sample_age.png",
    width = 12, height = 12, units = "cm", res = 600, pointsize = 4)
par(mfrow = c(2, 2), cex.lab = 1.5, bty = "n", cex.main = 2.5,
    cex.axis = 1.5, mar = c(5, 5, 2.5, 0.5), las = 1)
plot(results$coef_Age_10, results_sample_age$coef_Age_10, pch = 21,
     col = "#00000080", bg = "#00000040", xlab = expression(coef[main]),
     ylab = expression(coef[sample~age]), main = "Age association")
grid()
abline(0, 1, lty = 2, col = "#00000060")
plot(results$coef_SexFemale, results_sample_age$coef_SexFemale, pch = 21,
     col = "#00000080", bg = "#00000040", xlab = expression(coef[main]),
     ylab = expression(coef[sample~age]), main = "Sex association")
grid()
abline(0, 1, lty = 2, col = "#00000060")
plot(results$coef_BMIcat4, results_sample_age$coef_BMIcat4, pch = 21,
     col = "#00000080", bg = "#00000040", xlab = expression(coef[main]),
     ylab = expression(coef[sample~age]), main = "BMI (obesity)")
grid()
abline(0, 1, lty = 2, col = "#00000060")
plot(results$coef_HCUYes, results_sample_age$coef_HCUYes, pch = 21,
     col = "#00000080", bg = "#00000040", xlab = expression(coef[main]),
     ylab = expression(coef[sample~age]), main = "HCU")
grid()
abline(0, 1, lty = 2, col = "#00000060")
dev.off()

```


# Hormonal contraceptive use-associated proteins

While proteins significantly associated with hormonal contraceptive use were
already identified in the previous section with the *full* linear model, we
repeat the analysis restricted to female participants below the age of 40 to
reduce any potential influence of sex and age on the identified proteins (since
hormonal contraceptive use is clearly related to sex and age). In this subset
we fit linear models explaining protein abundances by the participant's age,
BMI, fasting status and hormonal contraceptive use. Coefficients and p-values
for the categorical HCU variable are used to identify proteins significantly
associated with hormonal contraceptive use.

```{r}
#' Subset to female participants below 40 years of age
prot_data_sub <- prot_data[
    which(prot_data$Age < 40 & prot_data$Sex == "Female"), ]

```

A summary of the data subset is given below.

```{r}
summary(prot_data_sub[, c("Age", "Sex", "BMIcat", "Fasting", "HCU")])
```

```{r}
FunForLinRegHCU <- function(AnalyteName, Data) {
    Analyte <- Data[, AnalyteName]
    LMReg <- lm(Analyte ~ Age_10 + Fasting + BMIcat + HCU,
                data = Data)
    res <- coef(summary(LMReg))[-1, ]
    p.Values <- data.frame(c(res[, 1],
                             res[, 4]))
    names(p.Values) <- AnalyteName
    rownames(p.Values) <- c(paste0("coef_", rownames(res)),
                            paste0("p-value_", rownames(res)))
    p.Values
}

#' Fit model
LMRegtests <- do.call(
    cbind, lapply(proteins, FunForLinRegHCU, Data = prot_data_sub))
Test_output_hcu <- t(LMRegtests)
#' Adjust p-values
adjp <- apply(Test_output_hcu[, grep("p-value", colnames(Test_output_hcu))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
Test_output_hcu  <- cbind(Test_output_hcu, adjp)

#' Calculate effect sizes
tmp <- prot_data_sub
tmp[, proteins] <- scale(tmp[, proteins], scale = TRUE)
LMRegtests <- do.call(
    cbind, lapply(proteins, FunForLinRegHCU, Data = tmp))
Test_output <- t(LMRegtests)
EffectSize <- Test_output[, grep("coef", colnames(Test_output))]
colnames(EffectSize) <- sub("coef", "effect_size", colnames(EffectSize))

#' Define significant proteins
comps <- colnames(Test_output_hcu)[grep("coef", colnames(Test_output_hcu))]
sign_p <- Test_output_hcu[, grep("p-adj", colnames(Test_output_hcu))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))
diff_perc <- apply(Test_output_hcu[, comps],
                   MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' For numeric variables, significance is based on the adjusted p-value only
sign_cv[, "significant_Age_10"] <- sign_p[, "significant_Age_10"]

#' Create result data.frame
results_hcu <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    Test_output_hcu,
    EffectSize,
    diff_perc,
    sign_cv)

colnames(results_hcu) <- sub("^description$", "Uniprot", colnames(results_hcu))
colnames(results_hcu) <- sub("^long_description$", "Description",
                             colnames(results_hcu))
write_xlsx(
    results_hcu, path = paste0(dr, "/results_age_bmi_hcu_adjusted.xlsx"))

```
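
The effect sizes above are obtained by refitting the models on z-scaled
abundances. The sketch below, on simulated data with hypothetical names (`grp`,
`y`), illustrates that the coefficient from the scaled response equals the raw
coefficient divided by the standard deviation of the response.

```r
set.seed(42)
#' Simulated data (made up): one analyte and one binary covariate
n <- 100
grp <- factor(rep(c("No", "Yes"), each = n / 2))
y <- 20 + 1.5 * (grp == "Yes") + rnorm(n, sd = 2)

coef_raw <- coef(lm(y ~ grp))["grpYes"]
#' z-scale the response (mean 0, standard deviation 1) and refit
y_scaled <- as.numeric(scale(y, scale = TRUE))
coef_scaled <- coef(lm(y_scaled ~ grp))["grpYes"]

#' The standardized coefficient equals the raw one divided by sd(y)
c(raw = unname(coef_raw), scaled = unname(coef_scaled),
  raw_over_sd = unname(coef_raw) / sd(y))
```

Expressing coefficients in units of standard deviations makes them comparable
between proteins with different abundance ranges.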

The table below lists the significant proteins for the various variables.

```{r, results = "asis"}
sig_tab <- data.frame(
    `p_adj < 0.05` = colSums(sign_p),
    cv = colSums(sign_cv),
    check.names = FALSE)
pandoc.table(
    data.frame(no_sign = sig_tab, check.names = FALSE),
    style = "rmarkdown",
    caption = paste0("Number of significant proteins for each ",
                     "comparison and definition of significance. ",
                     "cv: difference in concentration > CV in QC samples."))
```

```{r}
#' PAPER Figure 5A
#' Volcano plot for hormonal contraceptive use.
png(paste0("images/manuscript/_Figure_4A.png"), width = 5, height = 6,
    units = "cm", res = 600, pointsize = 4, type = "cairo-png")
par(mar=c(5,6,5,1)+.1, cex.lab = 2, cex.axis = 2, cex.main = 2.5,
    bty = "n", las = 1)
cols_vol <- rep("#00000080", nrow(results_hcu))
cols_vol[results_hcu$significant_HCUYes] <- "#c31c1d"
plot(
    x = results_hcu$coef_HCUYes, y = -log10(results_hcu$p.adj_HCUYes),
    pch = NA, col = "#00000080", bg = "#00000040",
    xlab = expression(coef[HCU]),
    ylab = expression(-log[10](p[adj])),
    xaxt = "n"
)
xat <- axTicks(1, usr = par("usr")[1:2])
#' Use a typographic minus sign for negative tick labels
labs <- gsub("-", "\U2212", as.character(xat))
axis(1, at = xat, labels = labs)
grid()
points(results_hcu$coef_HCUYes, -log10(results_hcu$p.adj_HCUYes),
       pch = 21, col = cols_vol, bg = "#00000040",
       cex = 1.1)
dev.off()
```

```{r comparison-hcu-results, fig.width = 15, fig.height = 5, fig.cap = "Comparison of p-values (left), p-value ranks (middle) and coefficients (right) for association with hormonal contraceptive use from the analysis on the full data and on the female participant subset."}
par(mfrow = c(1, 3))
x <- results$p.value_HCUYes
y <- results_hcu$p.value_HCUYes
x[x == 0] <- min(x[x > 0]) / 10
y[y == 0] <- min(y[y > 0]) / 10
x <- -log10(x)
y <- -log10(y)
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(-log[10]~p-value),
     ylab = expression(-log[10]~p-value[subset]),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "Hormonal contraceptive use")
grid()
abline(0, 1)
## Order?
xl <- c(0, length(x))
plot(x = rank(x), y = rank(y), xlim = xl, ylim = xl,
     xlab = expression(rank(p-value)),
     ylab = expression(rank(p-value[subset])),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "Hormonal contraceptive use")
grid()
abline(0, 1)
x <- results$coef_HCUYes
y <- results_hcu$coef_HCUYes
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(coef),
     ylab = expression(coef[subset]),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "Hormonal contraceptive use")
grid()
abline(0, 1)

## Export as figure for paper.
png("images/manuscript/supplement/Figure_S151.png",
    width = 12, height = 4, units = "cm", res = 600, pointsize = 4)
par(mfrow = c(1, 3), mar = c(4.5, 4.3, 3.5, 0.5), cex.lab = 1.5, bty = "n",
    cex.main = 2.5, las = 1)
x <- results$p.value_HCUYes
y <- results_hcu$p.value_HCUYes
x[x == 0] <- min(x[x > 0]) / 10
y[y == 0] <- min(y[y > 0]) / 10
x <- -log10(x)
y <- -log10(y)
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(-log[10]~p-value),
     ylab = expression(-log[10]~p-value[subset]),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "")
grid()
abline(0, 1)
mtext(side = 3, outer = FALSE, text = "A", cex = 3, at = -10.5, line = -0)

## Order?
xl <- c(0, length(x))
plot(x = rank(x), y = rank(y), xlim = xl, ylim = xl,
     xlab = expression(rank(p-value)),
     ylab = expression(rank(p-value[subset])),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "")
grid()
abline(0, 1)
mtext(side = 3, outer = FALSE, text = "B", cex = 3, at = -10.5, line = -0)
x <- results$coef_HCUYes
y <- results_hcu$coef_HCUYes
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = expression(coef),
     ylab = expression(coef[subset]),
     pch = 21, col = "#00000080", bg = "#00000040",
     main = "")
grid()
abline(0, 1)
mtext(side = 3, outer = FALSE, text = "C", cex = 3, at = -1.1, line = -0)
dev.off()

```

The table below lists the proteins found significantly associated with hormonal
contraceptive use in the analysis on the subset of the data.

```{r table-sig-sex-horm-sub, echo = FALSE, results = "asis"}
sig_tab <- significant_table(results_hcu, "HCUYes")
#' Exporting results
write_xlsx(sig_tab, path = paste0(dr, "/significant_HCU_subset.xlsx"))
pandoc.table(
    sig_tab,
    style = "rmarkdown", split.table = Inf,
    caption = paste0("Significant proteins for hormonal contraceptive use ",
                     "from the analysis on the subset of the data."))
```

Of the `r sum(results_hcu$significant_HCUYes)` proteins found
significantly associated with HCU,
`r sum(results_hcu$significant_HCUYes & results$significant_HCUYes)`
were also significant in the analysis of the full data set.

Below we compare the absolute effect sizes for age, sex, BMI and HCU.

```{r, results = "asis"}
tab <- rbind(
    Sex = quantile(abs(results$effect_size_SexFemale[results$significant_SexFemale])),
    Age = quantile(abs(results$effect_size_Age_10[results$significant_Age_10])),
    BMI = quantile(abs(results$effect_size_BMIcat4[results$significant_BMIcat4])),
    HCU = quantile(abs(results$effect_size_HCUYes[results$significant_HCUYes])))
pandoc.table(
    tab, style = "rmarkdown",
    caption = "Distribution of absolute effect sizes for significant proteins"
)
```

## Comparing HCU effect with data from the BASE-II study

The results for the effect of hormonal contraceptive use were replicated in the
serum proteome data of young women from the BASE-II study. Below we load these
results (the data analysis leading to them is described in a separate
document).

```{r}
#' Load HCU results from BASE-II
results_hcu_baseii <- read_xlsx("data/xlsx/Supplementary_Table_S46.xlsx") |>
    as.data.frame()
#' Subset and match CHRIS HCU results to those of BASE-II
results_hcu_mtch <- results_hcu[match(results_hcu_baseii$gene_symbols,
                                      results_hcu$Genes), ]
```

The two aligned data sets thus provide results for
`r nrow(results_hcu_baseii)` proteins. Below we compare the effect sizes for HCU
from the two studies.

```{r}
png(paste0("images/manuscript/_Figure_4F.png"), width = 5, height = 6,
    units = "cm", res = 600, pointsize = 4, type = "cairo-png")

par(mar=c(5,6,5,1)+.1, cex.lab = 2, cex.axis = 2, cex.main = 2.5,
    bty = "n", las = 1)
x <- results_hcu_mtch[, "effect_size_HCUYes"]
y <- results_hcu_baseii[, "effect_size_HCUYes"]
corr_es <- cor.test(x, y, method = "spearman")
xl <- range(c(x, y))
plot(x = x, y = y, xlim = xl, ylim = xl,
     xlab = "",
     ylab = "",
     pch = 21, col = "#00000080", bg = "#00000040",
     xaxt = "n", yaxt = "n", cex=1.5)
mtext(side = 1, text = expression(ES[CHRIS]), cex = 2, line = 3.7)
mtext(side = 2, text = expression(ES["BASE-II"]), cex = 2, line = 3.7, las = 0)
xat <- axTicks(1, usr = par("usr")[1:2])
labs <- gsub("-", "\U2212", as.character(xat))
axis(1, at = xat, labels = labs, padj = 0.4)
yat <- axTicks(2, usr = par("usr")[3:4])
labs <- gsub("-", "\U2212", as.character(yat))
axis(2, at = yat, labels = labs)
grid()
abline(0, 1)
legend("topleft", legend = paste0("r = ", round(corr_es$estimate, 2)),
       cex = 1.5)
dev.off()

```
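
The alignment by gene symbol and the rank correlation used for the figure can
be sketched as follows. The two data frames and their gene symbols are
hypothetical and the values are made up; only base R functions (`match` and
`cor.test`) are used.

```r
#' Made-up effect sizes from two hypothetical studies
study_a <- data.frame(Genes = c("GENE1", "GENE2", "GENE3", "GENE4", "GENE5"),
                      effect_size = c(1.9, 1.2, 1.5, -0.4, 0.3))
study_b <- data.frame(gene_symbols = c("GENE2", "GENE4", "GENE1", "GENE9"),
                      effect_size = c(1.0, -0.2, 1.6, 0.1))

#' Reorder study A to the rows of study B; unmatched genes yield NA rows
idx <- match(study_b$gene_symbols, study_a$Genes)
study_a_mtch <- study_a[idx, ]

#' Spearman correlation; pairs with missing values are dropped by cor.test
ct <- cor.test(study_a_mtch$effect_size, study_b$effect_size,
               method = "spearman")
n_shared <- sum(!is.na(idx))
n_shared
unname(ct$estimate)
```

Since `match` returns `NA` for symbols present in only one table, the number of
proteins actually shared between two studies can be smaller than the number of
rows of the aligned tables.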


```{r, echo = FALSE}
#' Volcano plot for BASE-II
png(paste0("images/manuscript/_Figure_4D.png"), width = 5, height = 6,
    units = "cm", res = 600, pointsize = 4, type = "cairo-png")
par(mar=c(5,6,5,1)+.1, cex.lab = 2, cex.axis = 2, cex.main = 2.5,
    bty = "n", las = 1)
cols_vol <- rep("#00000080", nrow(results_hcu_baseii))
cols_vol[as.logical(results_hcu_baseii$significant_HCUYes)] <- "#c31c1d"
plot(
    x = results_hcu_baseii$coef_HCUYes,
    y = -log10(results_hcu_baseii$p.adj_HCUYes),
    pch = NA, col = "#00000080", bg = "#00000040",
    xlab = expression(coef[HCU]),
    ylab = expression(-log[10](p[adj])),
    xaxt = "n"
)
xat <- axTicks(1, usr = par("usr")[1:2])
labs <- gsub("-", "\U2212", as.character(xat))
axis(1, at = xat, labels = labs)
grid()
points(results_hcu_baseii$coef_HCUYes,
       -log10(results_hcu_baseii$p.adj_HCUYes),
       pch = 21, col = cols_vol, bg = "#00000040",
       cex = 1.1)
dev.off()
```


## Evaluating effect of different hormone preparations

Ethinylestradiol (EE), used in most hormonal contraceptive preparations, has
been shown to have a broader effect than bioidentical hormones or
progesterone-only preparations [@kangasniemi_ethinylestradiol_2023]. We thus
evaluate here whether protein abundances differ between participants using
different hormonal preparations.

For the analysis, we first restrict to oral contraceptives, excluding
intravaginal and intrauterine contraceptives as well as patches, and categorize
them into preparations that contain ethinylestradiol (EE), preparations with
bioidentical estrogens (BE) and progestogen preparations (P4). In the present
data set, two preparations contain bioidentical hormones: ZOELY (estradiol (E2))
and KLAIRA (estradiol valerate (EV)).

```{r, results = "asis"}
#' Define the table with contraceptives; this is based on the manual evaluation
#' of all preparations exported to an xlsx sheet above.
oral_hcu <- data.frame(
    ATC5 = c("G03AA07",
             "G03AA09",
             "G03AA10",
             "G03AA12",
             "G03AA15",
             "G03AA16",
             "G03AB05",
             "G03AB06",
             "G03HB01",
             "G03AC09",
             "G03AA14",
             "G03AB08"),
    ATC5_name = c("levonorgestrel and ethinylestradiol",
                  "desogestrel and ethinylestradiol",
                  "gestodene and ethinylestradiol",
                  "drospirenone and ethinylestradiol",
                  "chlormadinone and ethinylestradiol",
                  "dienogest and ethinylestradiol",
                  "desogestrel and ethinylestradiol",
                  "gestodene and ethinylestradiol",
                  "cyproterone and estrogen",
                  "desogestrel",
                  "nomegestrol and estradiol",
                  "dienogest and estradiol"),
    hcu_class = c("EE",
                  "EE",
                  "EE",
                  "EE",
                  "EE",
                  "EE",
                  "EE",
                  "EE",
                  "EE",
                  "P4",
                  "BE",
                  "BE"),
    pre_name = c("LOETTE, MICROGYNON, LESTRONETTE, EGOGYN, MIRANOVA, NAOMI",
                 "MERCILON, PRACTIL, PLANUM, DESOREEN",
                 "MINULET, GINODEN, ARIANNA, ESTINETTE, FEDRA, HARMONET, KIPLING, MINESSE, GESTODIOL, GESTODELLE, MELIANE",
                 "YASMINELLE, LUCINELLE, YASMIN, LUTIZ, LUSINE, YAZ, JASMINELLE, RUBIRA, LUSINELLE",
                 "BELARA",
                 "VALETTE, SIBILLA",
                 "LUCILLE, GRACIAL",
                 "MILVANE, TRIMINULET",
                 "DIANE, VISOFID",
                 "NACREZ, CERAZETTE",
                 "ZOELY",
                 "KLAIRA, QLAIRA")
)

#' restrict contraceptives to the ones in the subset
contr_sub <- contr[contr$AID %in% rownames(prot_data_sub), ]

#' Add ATC5 medication info to the data.
prot_data_sub$atc_5 <- NA_character_
prot_data_sub$atc_5[match(contr_sub$AID, rownames(prot_data_sub))] <-
    contr_sub$atc_5

#' add numbers
cnts <- table(prot_data_sub$atc_5)
oral_hcu$count <- as.integer(cnts[oral_hcu$ATC5])

pandoc.table(
    oral_hcu[, c("ATC5", "ATC5_name", "hcu_class", "count", "pre_name")],
    style = "rmarkdown",
    caption = paste0("Oral contraceptive preparations with ethinylestradiol ",
                     "(EE), bioidentical estrogen (BE) or progestogen (P4) ",
                     "in the present data set.")
)
```

To define the data set for the analysis, we restrict the data for female
participants below the age of 40 to those that either take one of the oral
contraceptives listed above or don't take any hormonal contraceptives (i.e.
also excluding all other participants with ATC3 codes G03H (antiandrogens),
G02B (contraceptives for topical use) or G03A (contraceptives for systemic
use)).

```{r}
#' Define participants that take any type of contraceptive
aid_any_contraceptive <- med[med$atc_3 %in% c("G03H", "G02B", "G03A"), "AID"] |>
    unique()
#' Define participants that take any of the selected contraceptives
aid_ohc <- med[med$atc_5 %in% oral_hcu$ATC5, "AID"] |> unique()
#' subset: no contraceptives or selected contraceptives
prot_data_ohc <- prot_data_sub[
    (prot_data_sub$HCU == "No" &
     !(rownames(prot_data_sub) %in% aid_any_contraceptive)) |
    rownames(prot_data_sub) %in% aid_ohc, ]
```

Next we assign the class labels to the participants.

```{r}
prot_data_ohc$hcu_class <- "none"
prot_data_ohc$hcu_class[prot_data_ohc$atc_5 %in%
                        oral_hcu$ATC5[oral_hcu$hcu_class == "EE"]] <- "EE"
prot_data_ohc$hcu_class[prot_data_ohc$atc_5 %in%
                        oral_hcu$ATC5[oral_hcu$hcu_class == "P4"]] <- "P4"
prot_data_ohc$hcu_class[prot_data_ohc$atc_5 %in%
                        oral_hcu$ATC5[oral_hcu$hcu_class == "BE"]] <- "BE"
prot_data_ohc$hcu_class <- factor(prot_data_ohc$hcu_class,
                                  levels = c("none", "BE", "EE", "P4"))
table(prot_data_ohc$hcu_class)
```
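
The class assignment above uses one `%in%` statement per class. The same
mapping can also be expressed as a single table lookup with `match`; the sketch
below uses a made-up lookup table and hypothetical codes (`A01`, `B01`, ...)
rather than the ATC codes of the analysis.

```r
#' Made-up lookup table and per-participant codes (NA: no medication recorded)
lookup <- data.frame(code = c("A01", "A02", "B01", "C01"),
                     class = c("EE", "EE", "P4", "BE"))
codes <- c("A02", NA, "C01", "B01", "A01", NA)

#' Map each code to its class; participants without a code get "none"
cls <- lookup$class[match(codes, lookup$code)]
cls[is.na(cls)] <- "none"
cls <- factor(cls, levels = c("none", "BE", "EE", "P4"))
table(cls)
```

As in the code above, the level `"none"` is listed first so that it serves as
the reference category in subsequent linear models.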

We next evaluate the abundances of HCU-associated proteins for these 3
preparation classes (in comparison to participants without use of hormonal
contraceptives).

```{r}
#' Select the significant proteins.
prot_hcu <- rownames(results_hcu)[results_hcu$significant_HCUYes]

#' Plot per protein.
library(beeswarm)
library(vioplot)

for (prot in prot_hcu) {
    ann <- prot_ann[prot, c("Genes", "description")]
    fl <- paste0(IMAGE_PATH, "HCU_med_beeswarm_",
                 paste0(ann, collapse = "_"), ".png")
    png(fl, width = 8, height = 8, units = "cm", res = 600, pointsize = 6)
    par(mar = c(4, 5, 1.5, 0.5), cex.axis = 1.5, cex.lab = 1.5, bty = "n")
    beeswarm(split(prot_data_ohc[, prot], prot_data_ohc$hcu_class),
             ylab = expression(log[2]~abundance),
             las = 2, pch = 16, cex = 0.9, col = "darkgrey",
             method = "compactswarm", spacing = 0.4,
             main = paste0(ann, collapse = "; "))
    bxplot(split(prot_data_ohc[, prot], prot_data_ohc$hcu_class), add = TRUE,
           probs = 0.5, lwd = 1)
    grid()
    dev.off()
}

#' Adjusting the protein concentrations for age, BMI and fasting.
tmp <- lapply(proteins, function(z) {
    a <- prot_data_ohc
    a$y <- a[, z]
    l <- lm(y ~ Age_10 + Fasting + BMIcat, data = a)
    mean(a$y) + l$residuals
})
names(tmp) <- proteins
prot_data_ohc_adj <- cbind(
    prot_data_ohc[, c("BMIcat", "Sex", "Age", "Fasting", "HCU", "hcu_class")],
    do.call(cbind, tmp))

for (prot in prot_hcu) {
    ann <- prot_ann[prot, c("Genes", "description")]
    fl <- paste0(IMAGE_PATH, "HCU_med_beeswarm_",
                 paste0(ann, collapse = "_"), "_adj.png")
    png(fl, width = 8, height = 8, units = "cm", res = 600, pointsize = 6)
    par(mar = c(4, 5, 1.5, 0.5), cex.axis = 1.5, cex.lab = 1.5, bty = "n")
    beeswarm(split(prot_data_ohc_adj[, prot], prot_data_ohc_adj$hcu_class),
             ylab = expression(log[2]~abundance[adj]),
             las = 2, pch = 16, cex = 0.9, col = "darkgrey",
             method = "compactswarm", spacing = 0.4,
             main = paste0(ann, collapse = "; "))
    bxplot(split(prot_data_ohc_adj[, prot], prot_data_ohc_adj$hcu_class),
           add = TRUE, probs = 0.5, lwd = 1)
    grid()
    dev.off()
}

#' Beeswarm plot of top 4 significant proteins for the Supplement
#'
png("images/manuscript/supplement/Figure_COCs_beeswarm.png", width = 14,
    height = 12, units = "cm", res = 600, pointsize = 5)

par(mfrow = c(2, 2), cex.axis = 1.5, cex.lab = 1.5, bty = "n",
    mar = c(4.5, 4.5, 2, 1), cex.main = 2, las = 1)
beeswarm(split(prot_data_ohc_adj[, "P01019"], prot_data_ohc_adj$hcu_class),
         ylab = expression(log[2]~abundance[adj]),
         las = 2, pch = 16, cex = 1, col = "darkgrey",
         method = "compactswarm", spacing = 0.4,
         main = prot_ann["P01019", "Genes"])
bxplot(split(prot_data_ohc_adj[, "P01019"], prot_data_ohc_adj$hcu_class),
       add = TRUE, probs = 0.5, lwd = 1)
grid()
beeswarm(split(prot_data_ohc_adj[, "P00450"], prot_data_ohc_adj$hcu_class),
         ylab = expression(log[2]~abundance[adj]),
         las = 2, pch = 16, cex = 1, col = "darkgrey",
         method = "compactswarm", spacing = 0.4,
         main = prot_ann["P00450", "Genes"])
bxplot(split(prot_data_ohc_adj[, "P00450"], prot_data_ohc_adj$hcu_class),
       add = TRUE, probs = 0.5, lwd = 1)
grid()
beeswarm(split(prot_data_ohc_adj[, "Q96PD5"], prot_data_ohc_adj$hcu_class),
         ylab = expression(log[2]~abundance[adj]),
         las = 2, pch = 16, cex = 1, col = "darkgrey",
         method = "compactswarm", spacing = 0.4,
         main = prot_ann["Q96PD5", "Genes"])
bxplot(split(prot_data_ohc_adj[, "Q96PD5"], prot_data_ohc_adj$hcu_class),
       add = TRUE, probs = 0.5, lwd = 1)
grid()
beeswarm(split(prot_data_ohc_adj[, "P08185"], prot_data_ohc_adj$hcu_class),
         ylab = expression(log[2]~abundance[adj]),
         las = 2, pch = 16, cex = 1, col = "darkgrey",
         method = "compactswarm", spacing = 0.4,
         main = prot_ann["P08185", "Genes"])
bxplot(split(prot_data_ohc_adj[, "P08185"], prot_data_ohc$hcu_class),
       add = TRUE, probs = 0.5, lwd = 1)
grid()

dev.off()

```

Next we fit protein-wise linear models to the data to identify proteins with
significant differences in abundances between participants taking different
hormonal contraceptive preparations and participants without HCU.

```{r}
#' Linear model
fit_lm <- function(analyte, data) {
    y <- data[, analyte]
    l <- lm(y ~ Age_10 + Fasting + BMIcat + hcu_class, data = data)
    res <- coef(summary(l))[-1, ]
    vals <- data.frame(c(res[, 1L], res[, 4L]))
    names(vals) <- analyte
    rownames(vals) <- c(paste0("coef_", rownames(res)),
                        paste0("p-value_", rownames(res)))
    vals
}
res <- t(do.call(cbind, lapply(proteins, fit_lm, data = prot_data_ohc)))
#' Adjusting for multiple hypothesis testing
adjp <- apply(res[, grep("p-value", colnames(res))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
res <- cbind(res, adjp)

#' Effect sizes
prot_data_ohc_scaled <- prot_data_ohc
prot_data_ohc_scaled[, proteins] <-
    scale(prot_data_ohc_scaled[, proteins], scale = TRUE)
es <- t(do.call(cbind, lapply(proteins, fit_lm, prot_data_ohc_scaled)))
es <- es[, grep("coef", colnames(es))]
colnames(es) <- sub("coef", "effect_size", colnames(es))

#' Determine statistical significance
comps <- colnames(res)[grep("coef", colnames(res))]
sign_p <- res[, grep("p-adj", colnames(res))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))
#' Determine the percent difference in abundance for each coefficient
diff_perc <- apply(res[, comps], MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
#' Define which proteins are considered significant: require adjusted
#' p-value < 0.05 and difference in abundance > CV_QC (expressed in %)
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' For the numeric variable (age) only the p-value criterion is used
sign_cv[, "significant_Age_10"] <- sign_p[, "significant_Age_10"]

#' Compile the results
results_ohc <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    res, es, diff_perc, sign_cv)

#' Export results.
colnames(results_ohc) <- sub("^description$", "Uniprot", colnames(results_ohc))
colnames(results_ohc) <- sub("^long_description$", "Description",
                             colnames(results_ohc))
write_xlsx(
    results_ohc, path = paste0(dr, "/results_sex_age_bmi_ohc.xlsx"))
```
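
The Bonferroni adjustment used in the chunk above is applied per model term, that is, across all proteins for one coefficient. A minimal sketch of what `p.adjust()` does in that case, using made-up p-values (`toy_p` is hypothetical and not taken from the data; the block is shown for illustration only and is not evaluated):

```r
#' Hypothetical raw p-values of one model term across four proteins
toy_p <- c(0.001, 0.02, 0.04, 0.5)
#' Bonferroni: multiply by the number of tests and cap at 1
toy_adj <- p.adjust(toy_p, method = "bonferroni")
#' Equivalent manual calculation
toy_manual <- pmin(1, toy_p * length(toy_p))
all.equal(toy_adj, toy_manual)
```

With several hundred proteins the multiplier is correspondingly larger, which makes this a conservative choice.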

The table below lists the number of significant proteins for each model variable.

```{r, results = "asis"}
sig_tab <- data.frame(
    `p_adj < 0.05` = colSums(sign_p),
    cv = colSums(sign_cv),
    check.names = FALSE)
pandoc.table(
    data.frame(no_sign = sig_tab, check.names = FALSE),
    style = "rmarkdown",
    caption = paste0("Number of significant proteins for each ",
                     "comparison and definition of significance. ",
                     "cv: difference in concentration > CV in QC samples."))
```

Few or no proteins were found significant for BE or P4, while a large number of
significant proteins were identified for EE contraceptives. These results are,
however, also affected by the large differences in sample size between the
preparation groups: with only 17 (BE) and 6 (P4) participants, respectively,
the statistical power is much lower for these groups, which results in fewer
significant proteins.
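
The influence of group size can be illustrated with a simplified two-group comparison: assuming a common residual standard deviation, the standard error of a difference in group means is `sigma * sqrt(1 / n_grp + 1 / n_ref)`. The sketch below ignores the covariates of the actual model; `se_diff`, the reference group size of 500 and the group size of 200 are hypothetical placeholders, only 6 and 17 are taken from the text. The block is illustrative and not evaluated.

```r
#' Standard error of a difference in group means, assuming a common
#' residual standard deviation `sigma` (simplification: no covariates)
se_diff <- function(n_grp, n_ref, sigma = 1) {
    sigma * sqrt(1 / n_grp + 1 / n_ref)
}
#' 6 and 17 are the group sizes given in the text; 200 and n_ref = 500 are
#' hypothetical placeholders
n_grp <- c(P4 = 6, BE = 17, large = 200)
se <- se_diff(n_grp, n_ref = 500)
round(se, 3)
#' The same true difference yields a smaller t-statistic in smaller groups
round(0.5 / se, 2)
```

For the same true difference, a smaller group thus yields a larger standard error and hence a larger p-value.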

```{r fig.cap = "Volcano plots showing the effect of different oral hormonal contraceptive preparations.", fig.width = 15, fig.height = 5}
yl <- range(-log10(results_ohc[, c("p.adj_hcu_classEE", "p.adj_hcu_classBE",
                                   "p.adj_hcu_classP4")]))
xl <- range(results_ohc[, c("coef_hcu_classEE", "coef_hcu_classBE",
                            "coef_hcu_classP4")])

par(mfrow = c(1, 3))
y <- -log10(results_ohc$p.adj_hcu_classBE)
x <- results_ohc$coef_hcu_classBE
plot(x, y, main = "HCU: BE", xlab = "coefficient",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
     col = "#000000ce", bg = "#00000060")
grid()

y <- -log10(results_ohc$p.adj_hcu_classP4)
x <- results_ohc$coef_hcu_classP4
plot(x, y, main = "HCU: P4", xlab = "coefficient",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
     col = "#000000ce", bg = "#00000060")
grid()

y <- -log10(results_ohc$p.adj_hcu_classEE)
x <- results_ohc$coef_hcu_classEE
plot(x, y, main = "HCU: EE", xlab = "coefficient",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
     col = "#000000ce", bg = "#00000060")
grid()

```

The range of observed coefficients seems to be comparable between the different
groups, but p-values are much lower for EE preparations. We thus next directly
compare the coefficients for the different hormonal preparation classes.

```{r, fig.cap = "Comparison of coefficients representing differential abundance between EE, BE or P4 preparations against no HCU.", fig.width = 10, fig.height = 5}
a <- results_ohc$coef_hcu_classEE
b <- results_ohc$coef_hcu_classBE
d <- results_ohc$coef_hcu_classP4

par(mfrow = c(1, 2), bty = "n")
xl <- range(c(a, b, d))
plot(a, b, pch = 21, col = "#00000080", bg = "#00000040",
     xlab = expression(coef[EE]), ylab = expression(coef[BE]),
     xlim = xl, ylim = xl)
l <- lm(b ~ a)
abline(l)
legend("topleft", legend = paste0("slope: ", format(coef(l)[2], digits = 3)))
grid()

plot(a, d, pch = 21, col = "#00000080", bg = "#00000040",
     xlab = expression(coef[EE]), ylab = expression(coef[P4]),
     xlim = xl, ylim = xl)
l2 <- lm(d ~ a)
abline(l2)
legend("topleft", legend = paste0("slope: ", format(coef(l2)[2], digits = 3)))
grid()

```

The result of the linear model regressing the BE coefficients on the EE
coefficients is shown below.

```{r}
summary(l)
```

The correlation between the coefficients is thus high and, with a slope of 0.39,
EE preparations show a stronger effect than BE preparations. The result of the
linear model regressing the P4 coefficients on the EE coefficients is shown
below.

```{r}
summary(l2)
```

Correlation of coefficients between P4 and EE is poor.
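
Slope and correlation carry different information in the comparisons above: the slope describes the relative magnitude of the effects, the correlation describes how consistently the same proteins are affected. A minimal sketch with simulated coefficients (all values and the names `a_toy`, `b_toy` and `d_toy` are hypothetical; the block is not evaluated):

```r
set.seed(123)
#' Hypothetical coefficients for a reference preparation
a_toy <- rnorm(200)
#' Same direction, damped effect, little noise: high correlation, slope < 1
b_toy <- 0.4 * a_toy + rnorm(200, sd = 0.1)
#' Unrelated coefficients: correlation close to zero
d_toy <- rnorm(200, sd = 0.3)

c(slope = unname(coef(lm(b_toy ~ a_toy))[2]), r = cor(a_toy, b_toy))
c(slope = unname(coef(lm(d_toy ~ a_toy))[2]), r = cor(a_toy, d_toy))
```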

```{r}
xl <- range(results_ohc[, c("coef_hcu_classEE", "coef_hcu_classBE")])
gg <- ggplot(results_ohc, aes(x = coef_hcu_classEE, y = coef_hcu_classBE)) +
    geom_point(color = "#00000080", aes(text = Genes)) +
    xlim(xl) +
    ylim(xl) +
    theme_bw()
ggplotly_or_not(gg, interactive = PLOTLY)
```

```{r}
xl <- range(results_ohc[, c("coef_hcu_classEE", "coef_hcu_classP4")])
gg <- ggplot(results_ohc, aes(x = coef_hcu_classEE, y = coef_hcu_classP4)) +
    geom_point(color = "#00000080", aes(text = Genes)) +
    xlim(xl) +
    ylim(xl) +
    theme_bw()
ggplotly_or_not(gg, interactive = PLOTLY)
```

In summary, both ethinylestradiol- and bioidentical estrogen-containing
preparations affect largely the same plasma proteins, but the effect of
ethinylestradiol is stronger. Progestogen-containing preparations, on the other
hand, affect, if at all, different proteins and seem to have a much lower
overall impact on the plasma proteome.

For completeness we also compare the effect sizes against each other.

```{r, fig.cap = "Comparison of effect sizes between EE, BE or P4 preparations against no HCU.", fig.width = 10, fig.height = 5}
a <- results_ohc$effect_size_hcu_classEE
b <- results_ohc$effect_size_hcu_classBE
d <- results_ohc$effect_size_hcu_classP4

par(mfrow = c(1, 2), bty = "n")
xl <- range(c(a, b, d))
plot(a, b, pch = 21, col = "#00000080", bg = "#00000040",
     xlab = expression(ES[EE]), ylab = expression(ES[BE]),
     xlim = xl, ylim = xl)
l <- lm(b ~ a)
abline(l)
legend("topleft", legend = paste0("slope: ", format(coef(l)[2], digits = 3)))
grid()

plot(a, d, pch = 21, col = "#00000080", bg = "#00000040",
     xlab = expression(ES[EE]), ylab = expression(ES[P4]),
     xlim = xl, ylim = xl)
l2 <- lm(d ~ a)
abline(l2)
legend("topleft", legend = paste0("slope: ", format(coef(l2)[2], digits = 3)))
grid()

```

```{r}

#' PAPER Figure 5
#' Effect of hormonal contraceptive preparations.
png("images/manuscript/Figure_5.png", width = 8, height = 4, units = "cm",
    res = 600, pointsize = 4, type = "cairo-png")
par(mfrow = c(1, 2), mar = c(4.5, 4, 1.5, 0.5), cex.lab = 1, bty = "n",
    cex.axis = 1, cex = 1, las = 1)

a <- results_ohc$effect_size_hcu_classEE
b <- results_ohc$effect_size_hcu_classBE
d <- results_ohc$effect_size_hcu_classP4
xl <- range(c(a, b, d))
plot(a, b, pch = 21, col = "#00000080", bg = "#00000040",
     xlab = "", ylab = "",
     xlim = xl, ylim = xl, xaxt = "n", yaxt = "n")
l <- lm(b ~ a)
abline(l)
grid()
mtext(side = 1, text = expression(ES[ethinylestradiol]),
      cex = 1.3, line = 2.7)
mtext(side = 2, text = expression(ES[bioidentical~estrogen]),
      cex = 1.3, line = 2.5, las = 0)
xat <- axTicks(1, usr = par("usr")[1:2])
labs <- gsub("-", "\U2212", as.character(xat))
axis(1, at = xat, labels = labs)
yat <- axTicks(2, usr = par("usr")[3:4])
labs <- gsub("-", "\U2212", as.character(yat))
axis(2, at = yat, labels = labs, padj = 0.4)
legend("topleft",
       legend = c(paste0("slope:\t", format(coef(l)[2], digits = 3)),
                  paste0("p-value:\t", format(summary(l)$coef[2, 4],
                                             digits = 2))))
mtext(side = 3, outer = FALSE, text = "a", cex = 3, at = -2, line = -1)

plot(a, d, pch = 21, col = "#00000080", bg = "#00000040",
     xlab = "", ylab = "",
     xlim = xl, ylim = xl, xaxt = "n", yaxt = "n")
l2 <- lm(d ~ a)
abline(l2)
grid()
mtext(side = 1, text = expression(ES[ethinylestradiol]),
      cex = 1.3, line = 2.7)
mtext(side = 2, text = expression(ES[progesterone]),
      cex = 1.3, line = 2.5, las = 0)
xat <- axTicks(1, usr = par("usr")[1:2])
labs <- gsub("-", "\U2212", as.character(xat))
axis(1, at = xat, labels = labs)
yat <- axTicks(2, usr = par("usr")[3:4])
labs <- gsub("-", "\U2212", as.character(yat))
axis(2, at = yat, labels = labs, padj = 0.4)
legend("topleft",
       legend = c(paste0("slope:\t", format(coef(l2)[2], digits = 3)),
                  paste0("p-value:\t ", format(summary(l2)$coef[2, 4],
                                             digits = 2))))
mtext(side = 3, outer = FALSE, text = "b", cex = 3, at = -2, line = -1)
dev.off()


```

The summary for the linear model comparing effect sizes for EE against BE:

```{r}
summary(l)
```

And the Spearman correlation coefficient:

```{r}
cor(a, b, method = "spearman")
```

The summary for the linear model comparing effect sizes for EE against P4:

```{r}
summary(l2)
```

And the Spearman correlation coefficient:

```{r}
cor(a, d, method = "spearman")
```


# Medication associated proteins: ATC level 3

To investigate the general influence of medication on plasma protein
concentrations, we perform a multiple linear regression analysis including
binary variables for the most common medications in the present data subset. We
focus on ATC level 3 medications and restrict the analysis to medications taken
by at least 15 study participants at a frequency of at least twice per week.

```{r, results = "asis"}
med <- med[med$x0dd13 %in% c("daily", "every 2. day", "every 3. day",
                             "every 4. day = 2x per week"),]

level3 <- split(med$AID, med$atc_3)
level3 <- lapply(level3, unique)
level3_sum <- data.frame(
    atc = names(level3),
    name = med$atc_3_name[match(names(level3), med$atc_3)],
    count = lengths(level3))

tab <- level3_sum[level3_sum$count > 14, ]
tab <- tab[order(tab$count, decreasing = TRUE), ]
rownames(tab) <- NULL
pandoc.table(tab, style = "rmarkdown", split.tables = Inf,
      caption = "Most frequent medication in the present CHRIS subset (ATC 3)")
```

```{r}
level3_part <- unlist(level3[tab$atc])
```

There are `r length(unique(level3_part))` subjects taking at least one of the
above medications, which corresponds to
`r round((length(unique(level3_part))/nrow(prot_data))*100, 2)` % of our
subjects.

A multiple linear regression model is next fitted including a binary variable
for each of the above defined (ATC3) medication groups.

```{r function-medication-classes}
#' make function to define class of affected subjects with medications
ClassFunction <- function(treat)  {
    Class <- rep("No", nrow(prot_data))
    Class[rownames(prot_data) %in% level3[[treat]]] <- "Yes"
    return(Class)
}

MedClasses <- do.call(
    cbind.data.frame,
    lapply(tab$atc,
           function(x) ClassFunction(x)))
colnames(MedClasses) <- tab$name

#' Fitting the linear model
FunForLinRegAtc <- function(AnalyteName, Data) {
    Data2 <- cbind.data.frame(Data[, c("Sex", "Age", "Fasting", "BMIcat")],
                              MedClasses)
    Analyte <- Data[, AnalyteName]
    LMReg <- lm(Analyte ~ .,
                data = Data2)
    res <- coef(summary(LMReg))[-1, ]
    p.Values <- data.frame(c(res[, 1],
                             res[, 4]))
    names(p.Values) <- AnalyteName
    rownames(p.Values) <- c(paste0("coef_", rownames(res)),
                            paste0("p-value_", rownames(res)))
    return(p.Values)
}
#' do it in one run
LMRegtestsAtc3 <- do.call(
    cbind, lapply(proteins, function(x) FunForLinRegAtc(x, prot_data)))
Test_outputAtc3 <- t(LMRegtestsAtc3)
#' Adjusting for multiple hypothesis testing
adjp <- apply(Test_outputAtc3[, grep("p-value", colnames(Test_outputAtc3))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
Test_outputAtc3 <- cbind(Test_outputAtc3, adjp)

```
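
The names of the resulting coefficients follow from how `lm()` handles the medication columns: non-syntactic column names (such as names with spaces) are quoted with backticks and the factor level (`Yes`) is appended. A later `data.frame()` call with its default `check.names = TRUE` converts these characters to dots. A minimal sketch with made-up data (the object `toy`, its values and the column name are hypothetical; the block is not evaluated):

```r
#' Toy data; the column name with a space mimics an ATC group name and is
#' purely hypothetical
toy <- data.frame(
    Age = c(30, 45, 52, 61, 38, 70),
    `THYROID PREPARATIONS` = c("No", "Yes", "No", "Yes", "No", "Yes"),
    check.names = FALSE)
y <- c(1.2, 1.9, 1.1, 2.3, 1.0, 2.2)
fit <- lm(y ~ ., data = toy)
#' Non-syntactic names are quoted with backticks, the level is appended
names(coef(fit))
#' data.frame() with check.names = TRUE later converts such names
make.names(paste0("coef_", names(coef(fit))[3]))
```

This is why backticks, dots and the `Yes` suffix are stripped from the names before tables and figures are created.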

In addition, effect sizes are calculated for each variable, and an additional
criterion that considers the magnitude of the coefficient relative to the
technical variance is applied.

```{r, message = FALSE}
#' Calculate effect sizes
prot_data_AS <- prot_data
prot_data_AS[, proteins] <- scale(prot_data_AS[, proteins], scale = TRUE)
LMRegtests_AS2 <- do.call(cbind.data.frame,
                         lapply(proteins,
                                function(x) FunForLinRegAtc(x, prot_data_AS)))
Test_output_AS2 <- t(LMRegtests_AS2)
EffectSize <- Test_output_AS2[, grep("coef", colnames(Test_output_AS2))]
colnames(EffectSize) <- sub("coef", "effect_size", colnames(EffectSize))

#' Determine statistical significance
comps <- colnames(Test_outputAtc3)[grep("coef", colnames(Test_outputAtc3))]
sign_p <- Test_outputAtc3[, grep("p-adj", colnames(Test_outputAtc3))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))
#' Determine the percent difference in abundance for each coefficient
diff_perc <- apply(Test_outputAtc3[, comps],
                   MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
#' Define which proteins are considered significant: require adjusted
#' p-value < 0.05 and difference in abundance > CV_QC (expressed in %)
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' For the numeric variable (age) only the p-value criterion is used
sign_cv[, "significant_Age"] <- sign_p[, "significant_Age"]

#' Compile the results
results_atc3 <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    Test_outputAtc3,
    EffectSize,
    diff_perc,
    sign_cv)

#' Export results.
colnames(results_atc3) <- sub("^description$", "Uniprot",
                              colnames(results_atc3))
colnames(results_atc3) <- sub("^long_description$", "Description",
                              colnames(results_atc3))
write_xlsx(
    results_atc3, path = paste0(dr, "/results_sex_age_bmi_atc3.xlsx"))

```
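
The combined significance criterion from the chunk above can be sketched on toy values. Note that `toy_diff_percentage()` is a hypothetical stand-in that assumes coefficients on a log2 scale; the document's own `diff_percentage()` is defined elsewhere and may be computed differently. All numbers are made up and the block is not evaluated.

```r
#' Hypothetical stand-in for diff_percentage(): percent difference implied by
#' a coefficient on the log2 scale (assumption, the document's own function
#' may differ)
toy_diff_percentage <- function(x) (2^abs(x) - 1) * 100

toy_coef <- c(protA = 0.10, protB = 0.50, protC = -1.00)
toy_padj <- c(protA = 0.001, protB = 0.001, protC = 0.20)
#' Hypothetical QC coefficients of variation, as fractions
toy_cv_qc <- c(protA = 0.15, protB = 0.15, protC = 0.15)

#' Criterion 1: adjusted p-value only
toy_sign_p <- toy_padj < 0.05
#' Criterion 2: in addition, the difference has to exceed the technical CV
toy_sign_cv <- toy_sign_p & toy_diff_percentage(toy_coef) > 100 * toy_cv_qc
rbind(diff_perc = round(toy_diff_percentage(toy_coef), 1),
      sign_p = toy_sign_p, sign_cv = toy_sign_cv)
```

In this toy example only `protB` passes both criteria: `protA` has a small adjusted p-value but a difference below the technical CV, and `protC` has a large difference but fails the p-value criterion.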

The number of significant proteins per variable is listed below.

```{r table-sig-prot, results = "asis"}
sig_tab <- data.frame(
    `p_adj < 0.05` = colSums(sign_p),
    cv = colSums(sign_cv),
    check.names = FALSE)
rownames(sig_tab) <- sub("significant_", "", rownames(sig_tab))
rownames(sig_tab) <- gsub("`", "", rownames(sig_tab))
rownames(sig_tab) <- sub("Yes", "", rownames(sig_tab))
pandoc.table(
    data.frame(no_sign = sig_tab, check.names = FALSE),
    style = "rmarkdown", split.tables = Inf,
    caption = paste0("Number of significant proteins for each ",
                     "comparison and definition of significance. ",
                     "cv: difference in concentration > CV in QC samples."))
```

```{r atc3-upset-plot, fig.path = IMAGE_PATH, fig.width = 14, fig.height = 14, fig.cap = "Numbers and overlap of significant proteins for each variable; ATC3 medication."}
library(UpSetR)

idx <- grep("significant", colnames(results_atc3))
prts <- lapply(idx, function(z) {
    rownames(results_atc3)[results_atc3[, z]]
})
names(prts) <- colnames(results_atc3)[idx]
names(prts) <- sub("significant_", "", names(prts))
names(prts) <- sub("Yes", "", names(prts))
names(prts) <- sub("^\\.", "", names(prts))
names(prts) <- sub("\\.$", "", names(prts))
names(prts) <- gsub("\\.", " ", names(prts))
names(prts) <- gsub("  ", " ", names(prts), fixed = TRUE)

upset(fromList(prts), nsets = length(prts), order.by = c("freq"))

```

```{r atc3-significant-proteins, fig.path = IMAGE_PATH, fig.width = 14, fig.height = 12, fig.cap = "Significant protein associations for various medications; ATC3 medication."}
#' PAPER Figure 4
#' ATC3 medication
cns <- colnames(results_atc3)
cns <- cns[grep("significant", cns)]
cns <- sub("significant_", "", cns)
## Fix names
names(cns) <- sub("Yes", "", cns)
names(cns) <- sub("^\\.", "", names(cns))
names(cns) <- sub("\\.$", "", names(cns))
names(cns) <- gsub("\\.", " ", names(cns))
names(cns) <- gsub("  ", " ", names(cns), fixed = TRUE)

cns <- cns[!cns %in% c("SexFemale", "Age", "FastingNo",
                       "BMIcat1", "BMIcat3", "BMIcat4")]
FM <- results_atc3[, paste0("effect_size_", cns)] *
    results_atc3[, paste0("significant_", cns)]
colnames(FM) <- names(cns)
FM <- FM[rowSums(FM) != 0, ]
FM <- FM[, colSums(FM) != 0]

clust1 <- hclust(dist((FM)), method = "ward.D")
clust2 <- hclust(dist(t(FM)), method = "ward.D")

FM[FM == 0] <- NA
FM <- round(FM, 2)
rownames(FM) <- paste0(prot_ann[rownames(FM), "Genes"], " [", rownames(FM), "]")
## in long format
melted <- melt(as.matrix(FM))
## add "long_description" & "Genes"
melted <- cbind.data.frame(melted, prot_ann[melted$Var1, c("Genes")])

plt <- melted |>
ggplot(aes(Var2, Var1, fill = value, label = value)) +
    geom_tile() +
    labs(x = NULL, y = NULL, fill = "Effect Size") +
    scale_fill_gradient2(mid = "#FBFEF9", low = "#0C6291", high = "#A63446") +
    geom_text(size = 2.5) +
    theme_classic() +
    scale_x_discrete(expand=c(0,0), limits = colnames(FM)[clust2$order]) +
    scale_y_discrete(expand=c(0,0), limits = rownames(FM)[clust1$order]) +
    theme(axis.text.x = element_text(angle = 20, hjust = 1, size = 10),
          axis.text.y = element_text(hjust = 1, size = 10),
          plot.margin = margin(t = 5, l = 45, r = 5, b = 5),
          legend.position = "none")
    ## theme(text = element_text(family = "Roboto"),
    ##       axis.text.x = element_text(angle = 33, hjust = 1, size = 7),
    ##       axis.text.y = element_text(hjust = 1, size = 7),
    ##       plot.margin = margin(t = 5, l = 40))

ggsave("images/manuscript/supplement/Figure_ATC3.png", plot = plt, dpi = 600,
       width = 12, height = 12, units = "cm", scale = 2.5)
plt
```
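
The preparation of the heatmap matrix above (masking non-significant effect sizes, dropping empty rows and columns, clustering, then setting zeros to `NA`) can be sketched on a small toy matrix. All values and the `toy_*` names are hypothetical; the row and column filter is a slight variant that counts non-zero entries instead of summing them. The block is not evaluated.

```r
#' Hypothetical effect sizes (rows: proteins, columns: medications)
toy_es <- matrix(c(0.8, 0.1, -0.6, 0.0,
                   0.7, 0.2, 0.0, 0.1,
                   0.0, 0.6, 0.5, 0.2),
                 nrow = 4,
                 dimnames = list(paste0("prot", 1:4), paste0("med", 1:3)))
#' Hypothetical significance calls
toy_sig <- abs(toy_es) >= 0.5
#' Non-significant entries become 0 (TRUE/FALSE act as 1/0)
toy_fm <- toy_es * toy_sig
#' Keep rows and columns with at least one significant entry
toy_fm <- toy_fm[rowSums(toy_fm != 0) > 0, colSums(toy_fm != 0) > 0,
                 drop = FALSE]
#' Cluster while the matrix is still complete, then mask zeros as NA
row_order <- hclust(dist(toy_fm), method = "ward.D")$order
toy_fm[toy_fm == 0] <- NA
toy_fm[row_order, ]
```

Clustering before the zeros are replaced keeps the distance calculation on complete rows; the `NA` values only serve to leave the corresponding tiles empty in the plot.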


# Medication associated proteins: ATC level 4

We repeat the analysis for ATC level 4 medication.

```{r, results = "asis"}
level4 <- split(med$AID, med$atc_4_name)
level4 <- lapply(level4, unique)
level4_sum <- data.frame(
    atc = NA_character_,
    name = names(level4),
    count = lengths(level4))
for (i in seq_len(nrow(level4_sum))) {
    level4_sum$atc[i] <- paste0(
        unique(med[med$atc_4_name %in% level4_sum$name[i], "atc_4"]),
        collapse = ", ")
}

tab <- level4_sum[level4_sum$count > 14, ]
tab <- tab[order(tab$count, decreasing = TRUE), ]
rownames(tab) <- NULL
pandoc.table(tab, style = "rmarkdown", split.tables = Inf,
      caption = "Most frequent medication in the present CHRIS subset (ATC 4)")
```

```{r}
level4_part <- unlist(level4[tab$name])
```

There are `r length(unique(level4_part))` subjects taking at least one of the
above medications, which corresponds to
`r round((length(unique(level4_part))/nrow(prot_data))*100, 2)` % of our
subjects.

A multiple linear regression model is next fitted including a binary variable
for each of the above defined (ATC4) medication groups.

```{r}
#' make function to define class of affected subjects with medications
ClassFunction <- function(treat)  {
    Class <- rep("No", nrow(prot_data))
    Class[rownames(prot_data) %in% level4[[treat]]] <- "Yes"
    return(Class)
}

MedClasses <- do.call(
    cbind.data.frame,
    lapply(tab$name,
           function(x) ClassFunction(x)))
colnames(MedClasses) <- tab$name

#' do it in one run
LMRegtestsAtc4 <- do.call(
    cbind, lapply(proteins, function(x) FunForLinRegAtc(x, prot_data)))
Test_outputAtc4 <- t(LMRegtestsAtc4)
#' Adjusting for multiple hypothesis testing
adjp <- apply(Test_outputAtc4[, grep("p-value", colnames(Test_outputAtc4))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
Test_outputAtc4 <- cbind(Test_outputAtc4, adjp)

```

In addition, effect sizes are calculated for each variable, and an additional
criterion that considers the magnitude of the coefficient relative to the
technical variance is applied.

```{r, message = FALSE}
#' Calculate effect sizes
prot_data_AS <- prot_data
prot_data_AS[, proteins] <- scale(prot_data_AS[, proteins], scale = TRUE)
LMRegtests_AS2 <- do.call(cbind.data.frame,
                          lapply(proteins,
                                 function(x) FunForLinRegAtc(x, prot_data_AS)))
Test_output_AS2 <- t(LMRegtests_AS2)
EffectSize <- Test_output_AS2[, grep("coef", colnames(Test_output_AS2))]
colnames(EffectSize) <- sub("coef", "effect_size", colnames(EffectSize))

#' Determine statistical significance
comps <- colnames(Test_outputAtc4)[grep("coef", colnames(Test_outputAtc4))]
sign_p <- Test_outputAtc4[, grep("p-adj", colnames(Test_outputAtc4))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))
#' Determine the percent difference in abundance for each coefficient
diff_perc <- apply(Test_outputAtc4[, comps],
                   MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
#' Define which proteins are considered significant: require adjusted
#' p-value < 0.05 and difference in abundance > CV_QC (expressed in %)
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' For the numeric variable (age) only the p-value criterion is used
sign_cv[, "significant_Age"] <- sign_p[, "significant_Age"]

#' Compile the results
results_atc4 <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    Test_outputAtc4,
    EffectSize,
    diff_perc,
    sign_cv)

#' Export results.
colnames(results_atc4) <- sub("^description$", "Uniprot",
                              colnames(results_atc4))
colnames(results_atc4) <- sub("^long_description$", "Description",
                              colnames(results_atc4))
write_xlsx(
    results_atc4, path = paste0(dr, "/results_sex_age_bmi_atc4.xlsx"))

```
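
The effect sizes in the chunk above are coefficients of models fitted on responses scaled to unit variance. For a single protein this is equivalent to dividing the raw coefficient by the standard deviation of the abundances, which the following sketch checks on simulated data (all values and object names are hypothetical; the block is not evaluated):

```r
set.seed(42)
#' Hypothetical abundances and a binary exposure
grp <- rep(c("No", "Yes"), each = 50)
y <- rnorm(100, mean = 10, sd = 2) + 1.5 * (grp == "Yes")
#' Coefficient on the original scale
coef_raw <- coef(lm(y ~ grp))["grpYes"]
#' Coefficient after centering and scaling the response to unit variance
y_scaled <- as.numeric(scale(y, scale = TRUE))
coef_scaled <- coef(lm(y_scaled ~ grp))["grpYes"]
#' The effect size equals the raw coefficient divided by sd(y)
all.equal(unname(coef_scaled), unname(coef_raw / sd(y)))
```

Effect sizes are thus expressed in units of standard deviations and are comparable between proteins with different abundance ranges.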

The number of significant proteins per variable is listed below.

```{r table-sig-prot-atc4, results = "asis"}
sig_tab <- data.frame(
    `p_adj < 0.05` = colSums(sign_p),
    cv = colSums(sign_cv),
    check.names = FALSE)
rownames(sig_tab) <- sub("significant_", "", rownames(sig_tab))
rownames(sig_tab) <- gsub("`", "", rownames(sig_tab))
rownames(sig_tab) <- sub("Yes", "", rownames(sig_tab))
pandoc.table(
    data.frame(no_sign = sig_tab, check.names = FALSE),
    style = "rmarkdown", split.tables = Inf,
    caption = paste0("Number of significant proteins for each ",
                     "comparison and definition of significance. ",
                     "cv: difference in concentration > CV in QC samples."))
```

```{r atc4-upset-plot, fig.path = IMAGE_PATH, fig.width = 14, fig.height = 14, fig.cap = "Numbers and overlap of significant proteins for each variable; ATC4 medication."}
idx <- grep("significant", colnames(results_atc4))
prts <- lapply(idx, function(z) {
    rownames(results_atc4)[results_atc4[, z]]
})
names(prts) <- colnames(results_atc4)[idx]
names(prts) <- sub("significant_", "", names(prts))
names(prts) <- sub("Yes", "", names(prts))
names(prts) <- sub("^\\.", "", names(prts))
names(prts) <- sub("\\.$", "", names(prts))
names(prts) <- gsub("\\.", " ", names(prts))
names(prts) <- gsub("  ", " ", names(prts), fixed = TRUE)

upset(fromList(prts), nsets = length(prts), order.by = c("freq"))

```

```{r atc4-significant-proteins, fig.path = IMAGE_PATH, fig.width = 14, fig.height = 12, fig.cap = "Significant protein associations for various medications; ATC4 medication."}
cns <- colnames(results_atc4)
cns <- cns[grep("significant", cns)]
cns <- sub("significant_", "", cns)
## Fix names
names(cns) <- sub("Yes", "", cns)
names(cns) <- sub("^\\.", "", names(cns))
names(cns) <- sub("\\.$", "", names(cns))
names(cns) <- gsub("\\.", " ", names(cns))
names(cns) <- gsub("  ", " ", names(cns), fixed = TRUE)

cns <- cns[!cns %in% c("SexFemale", "Age", "FastingNo",
                       "BMIcat1", "BMIcat3", "BMIcat4")]
FM <- results_atc4[, paste0("effect_size_", cns)] *
    results_atc4[, paste0("significant_", cns)]
colnames(FM) <- names(cns)
FM <- FM[rowSums(FM) != 0, ]
FM <- FM[, colSums(FM) != 0]

clust1 <- hclust(dist((FM)), method = "ward.D")
clust2 <- hclust(dist(t(FM)), method = "ward.D")

FM[FM == 0] <- NA
FM <- round(FM, 2)
rownames(FM) <- paste0(prot_ann[rownames(FM), "Genes"], " [", rownames(FM), "]")
## in long format
melted <- melt(as.matrix(FM))
## add "long_description" & "Genes"
melted <- cbind.data.frame(melted, prot_ann[melted$Var1, c("Genes")])

plt <- melted |>
ggplot(aes(Var2, Var1, fill = value, label = value)) +
    geom_tile() +
    labs(x = NULL, y = NULL, fill = "Effect Size") +
    scale_fill_gradient2(mid = "#FBFEF9", low = "#0C6291", high = "#A63446") +
    geom_text(size = 2.5) +
    theme_classic() +
    scale_x_discrete(expand=c(0,0), limits = colnames(FM)[clust2$order]) +
    scale_y_discrete(expand=c(0,0), limits = rownames(FM)[clust1$order]) +
    theme(axis.text.x = element_text(angle = 20, hjust = 1, size = 7),
          axis.text.y = element_text(hjust = 1, size = 7),
          plot.margin = margin(t = 5, l = 5))
    ## theme(text = element_text(family = "Roboto"),
    ##       axis.text.x = element_text(angle = 33, hjust = 1, size = 7),
    ##       axis.text.y = element_text(hjust = 1, size = 7),
    ##       plot.margin = margin(t = 5, l = 40))

ggsave("images/manuscript/supplement/Figure_ATC4.png", plot = plt, dpi = 600,
       width = 12, height = 9, units = "cm", scale = 2.5)
plt
```


# Associations with previous use of hormonal contraceptives

We also evaluate previous use of hormonal contraceptives. All women who state
that they took hormonal contraceptives in the past and who are not taking them
at present are identified. In addition, we define a new variable *HCU_history*
that categorizes women into those who currently take contraceptives, those who
took them in the past and those who never took them.

```{r}
#' Get info on previous contraceptive usage from the interview data.
#' - x0wo03: do you currently take contraceptive pills?
#' - x0wo02: have you ever taken contraceptive pills?
#' - x0wo04b: how many years have you taken contraceptive pills?
#' - x0wo12a: used below to define `ever_pregnant`
wo <- interview_info[rownames(prot_data), c("x0wo03", "x0wo02", "x0wo04b",
                                            "x0wo12a")]

#' Previous hormonal contraceptive use pHCU:
#' - currently not taking contraceptives (x0wo03 == "No" & HCU == "No")
#' - previous HCU x0wo02 == "Yes"
prot_data$pHCU <- "No"
prot_data$pHCU[which(prot_data$HCU == "No" &
                     wo$x0wo03 == "No" &
                     wo$x0wo02 == "Yes")] <- "Yes"
prot_data$HCU_duration <- wo$x0wo04b

#' Defining also a variable that categorizes participants in:
#' - current HCU
#' - past HCU
#' - never HCU
prot_data$HCU_history <- "never"
prot_data$HCU_history[prot_data$HCU == "Yes"] <- "current"
prot_data$HCU_history[prot_data$pHCU == "Yes"] <- "past"
prot_data$HCU_history <- factor(prot_data$HCU_history,
                                levels = c("never", "current", "past"))
```

We next subset the data to female participants only, also excluding participants
(about 160) for whom the information on contraceptive use is ambiguous or
partially missing.

```{r, fig.cap = "Age distribution for women in the various HCU categories."}
women_data <- prot_data[prot_data$Sex == "Female", ]

#' Remove women who state that they take hormonal contraceptives at present
#' but who are not classified as current users by the HCU definition.
aid_rem <- rownames(wo[which(prot_data$HCU == "No" &
                             wo$x0wo03 == "Yes"), ])
#' These cases are ambiguous - answers to x0wo02 are either NA or Yes.
women_data <- women_data[!rownames(women_data) %in% aid_rem, ]

table(women_data$HCU_history)

vioplot(split(women_data$Age, women_data$HCU_history),
        main = "HCU", ylab = "age")
grid()
```

As expected, a clear difference in age distribution can be seen between the
categories. To reduce the influence of age we restrict the analysis to a smaller
age range.

```{r, fig.cap = "Age distribution for women below the age of 40 in the various HCU categories."}
women_data_40 <- women_data[women_data$Age < 40, ]

table(women_data_40$HCU_history)
vioplot(split(women_data_40$Age, women_data_40$HCU_history),
        main = "HCU, age < 40", ylab = "age")
grid()

```

We also evaluate which contraceptives the participants are taking.

```{r}
#' For each participant, check the class of combined contraceptive they
#' are/were taking
women_data_40$hcu_class <- NA_character_
for (i in seq_len(nrow(women_data_40))) {
    tmp <- med$atc_5[med$AID == rownames(women_data_40)[i]]
    if (length(tmp)) {
        tmp_med <- unique(oral_hcu$hcu_class[oral_hcu$ATC5 %in% tmp])
        if (length(tmp_med) > 1)
            stop("Have multiple classes!")
        if (length(tmp_med) == 1L)
            women_data_40$hcu_class[i] <- tmp_med
    }
}
#' Never used contraceptives
table(women_data_40$hcu_class[women_data_40$HCU_history == "never"])
#' Current contraceptives
table(women_data_40$hcu_class[women_data_40$HCU_history == "current"])
#' Previous contraceptives
table(women_data_40$hcu_class[women_data_40$HCU_history == "past"])

```

For women who previously took contraceptives we don't know which preparation
they were taking. Among the women with current HCU, 17 take bioidentical
estrogen and 6 take progestogen preparations.

We also add the information whether the women previously gave birth to at least
one child.

```{r}
women_data_40$ever_pregnant <- wo[rownames(women_data_40), "x0wo12a"] > 0
women_data_40$ever_pregnant[is.na(women_data_40$ever_pregnant)] <- FALSE

split(women_data_40$ever_pregnant, women_data_40$HCU_history) |>
    lapply(summary)
```

We next evaluate the concentration of AGT in the different groups.

```{r, fig.cap = "Abundance of AGT in women (below the age of 40) in the various HCU categories. Left: full data set, middle: women never pregnant, right: previously pregnant women."}
par(mfrow = c(1, 3))
vioplot(split(women_data_40$P01019, women_data_40$HCU_history),
        main = "AGT, HCU, age < 40", ylab = expression(log[2]~abundance))
grid()
#' never pregnant
women_data_40_np <- women_data_40[!women_data_40$ever_pregnant, ]
vioplot(split(women_data_40_np$P01019, women_data_40_np$HCU_history),
        main = "AGT, HCU, age < 40, never pregnant",
        ylab = expression(log[2]~abundance))
grid()

#' previously pregnant
women_data_40_pp <- women_data_40[women_data_40$ever_pregnant, ]
vioplot(split(women_data_40_pp$P01019, women_data_40_pp$HCU_history),
        main = "AGT, HCU, age < 40, previously pregnant",
        ylab = expression(log[2]~abundance))
grid()

```

AGT concentrations are thus only higher in participants **currently** taking
hormonal contraceptives, but do not differ between participants who never took
contraceptives and those who took them previously. Also, no difference can be
seen between women who were previously pregnant and those who were not. This
analysis did not, however, take into account when hormonal contraceptives were
last taken.

We next evaluate differences in protein concentrations between participants
currently taking hormonal contraceptives and those who never used them, as well
as between participants who previously took hormonal contraceptives and those
who never used them. This analysis is performed on female participants below
the age of 40 for whom clear information on current or past hormonal
contraceptive use is available.
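As a sketch, the model fitted below for each protein can be written as follows,
where $y_i$ is the $\log_2$ abundance of the protein in participant $i$. The
notation is ours and rests on assumptions: `Fasting` is treated as a binary
indicator, the sum runs over the non-reference BMI categories, and "never" is
the reference level of `HCU_history` (which is consistent with the coefficient
names `HCU_historycurrent` and `HCU_historypast` used later on).

$$
y_i = \beta_0 + \beta_{\mathrm{age}}\,\mathrm{Age\_10}_i
      + \beta_{\mathrm{fast}}\,\mathrm{Fasting}_i
      + \sum_{k} \beta_{k}\,[\mathrm{BMIcat}_i = k]
      + \beta_{\mathrm{current}}\,[\mathrm{HCU}_i = \mathrm{current}]
      + \beta_{\mathrm{past}}\,[\mathrm{HCU}_i = \mathrm{past}]
      + \varepsilon_i
$$

Under these assumptions $\beta_{\mathrm{current}}$ and $\beta_{\mathrm{past}}$
are the $\log_2$ differences in abundance relative to women who never used
hormonal contraceptives, adjusted for the other covariates.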

```{r}
#' Linear model LM1
FunForLinRegW <- function(AnalyteName, Data) {
    Analyte <- Data[, AnalyteName]
    LMReg <- lm(Analyte ~ Age_10 + Fasting + BMIcat + HCU_history,
                data = Data)
    res <- coef(summary(LMReg))[-1, ]
    p.Values <- data.frame(c(res[, 1],
                             res[, 4]))
    names(p.Values) <- AnalyteName
    rownames(p.Values) <- c(paste0("coef_", rownames(res)),
                            paste0("p-value_", rownames(res)))
    return(p.Values)
}
## do it in one run
LMRegtestsW <- do.call(
    cbind, lapply(proteins, function(x) FunForLinRegW(x, women_data_40)))
Test_output <- t(LMRegtestsW)
## Adjusting for multiple hypothesis testing
adjp <- apply(Test_output[, grep("p-value", colnames(Test_output))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
Test_output <- cbind(Test_output, adjp)

#' Define statistically significant results
comps <- colnames(Test_output)[grep("coef", colnames(Test_output))]
sign_p <- Test_output[, grep("p-adj", colnames(Test_output))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))

#' perform data normalization (autoscaling)
prot_data_AS <- women_data_40
prot_data_AS[, proteins] <- scale(prot_data_AS[, proteins], scale = TRUE)
LMRegtests_AS <- do.call(cbind.data.frame,
                         lapply(proteins,
                                function(x) FunForLinRegW(x, prot_data_AS)))
Test_output_AS <- t(LMRegtests_AS)
EffectSize <- Test_output_AS[, grep("coef", colnames(Test_output_AS))]
colnames(EffectSize) <- sub("coef", "effect_size", colnames(EffectSize))

#' Calculate the (absolute) difference in abundance expressed as a percentage
#' based on a (log2) coefficient.
comps <- colnames(Test_output)[grep("coef", colnames(Test_output))]
diff_perc <- apply(Test_output[, comps],
                   MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
#' define which proteins are considered significant: require an adjusted
#' p-value < 0.05 and a difference in concentration > CV_QC (expressed in %)
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' For the numeric variable (age) significance is based on the p-value only
sign_cv[, "significant_Age_10"] <- sign_p[, "significant_Age_10"]

results_women_40 <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    Test_output,
    EffectSize,
    diff_perc,
    sign_cv)

```
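
The significance rule used above combines two conditions: a Bonferroni-adjusted
p-value below 0.05 and a difference in abundance larger than the technical
variation in QC samples. The chunk below is a minimal sketch of this rule on
made-up numbers. `log2_to_percent` is a hypothetical stand-in for
`diff_percentage` (defined elsewhere in the analysis); we assume here that it
converts a $\log_2$ coefficient into an absolute percentage difference, which
may differ from the actual implementation.

```{r, eval = FALSE}
#' Hypothetical stand-in for diff_percentage (an assumption): absolute
#' difference in abundance (in %) implied by a log2 coefficient.
log2_to_percent <- function(x) abs(2^x - 1) * 100

#' Made-up coefficients, raw p-values and QC CVs (as fractions) for 4 proteins
coefs <- c(P1 = 1, P2 = 0.1, P3 = -1, P4 = 0.5)
pvals <- c(P1 = 1e-6, P2 = 1e-6, P3 = 0.02, P4 = 0.001)
cv_qc <- c(P1 = 0.2, P2 = 0.2, P3 = 0.2, P4 = 0.6)

#' Bonferroni adjustment: p-values are multiplied by the number of tests
padj <- p.adjust(pvals, method = "bonferroni")
perc <- log2_to_percent(coefs)

#' Significant only if both conditions hold
significant <- padj < 0.05 & perc > 100 * cv_qc
data.frame(coefs, padj, perc, cv_qc_perc = 100 * cv_qc, significant)
```

In this toy example only `P1` passes both filters: `P2` and `P4` have a
difference smaller than their QC CV, and `P3` does not survive the adjustment
for multiple testing.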

The volcano plots below show the results from these comparisons.

```{r, fig.width = 14, fig.cap = "Volcano plots showing the results for the comparison of women (aged < 40) currently (left), or previously (right) taking hormonal contraceptives against those that never used hormonal contraceptives."}
par(mfrow = c(1, 2))
y <- -log10(results_women_40$p.adj_HCU_historycurrent)
x <- results_women_40$coef_HCU_historycurrent
xl <- range(x)
yl <- range(y)
plot(x, y, main = "HCU: current vs never", xlab = "coefficient",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
         col = "#000000ce", bg = "#00000060")
grid()
#' Can re-use the x and y limits because there is in fact no difference here.
y <- -log10(results_women_40$p.adj_HCU_historypast)
x <- results_women_40$coef_HCU_historypast
plot(x, y, main = "HCU: past vs never", xlab = "coefficient",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
         col = "#000000ce", bg = "#00000060")
grid()

```

```{r}
#' Supplementary Figure for previous HCU.
png("images/manuscript/supplement/Figure_previous_hcu.png", width = 15,
    height = 5, units = "cm", res = 600, pointsize = 4)
par(mfrow = c(1, 3), mar = c(5, 6, 3.5, 0.5), cex.lab = 2, bty = "n",
    cex.main = 3, cex.axis = 2.5, las = 1)
y <- -log10(results_women_40$p.adj_HCU_historycurrent)
x <- results_women_40$coef_HCU_historycurrent
xl <- range(x)
yl <- range(y)
plot(x, y, main = "", xlab = "coef",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
     col = "#000000ce", bg = "#00000060")
sigs <- results_women_40$significant_HCU_historycurrent
points(x[sigs], y[sigs], pch = 21, col = "#E41A1CCE", bg = "#E41A1C40")
grid()
mtext(side = 3, outer = FALSE, text = "A", cex = 3, at = -1.5, line = -0)
#' Can re-use the x and y limits because there is in fact no difference here.
y <- -log10(results_women_40$p.adj_HCU_historypast)
x <- results_women_40$coef_HCU_historypast
plot(x, y, main = "", xlab = "coef",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
         col = "#000000ce", bg = "#00000060")
grid()
mtext(side = 3, outer = FALSE, text = "B", cex = 3, at = -1.5, line = -0)
vioplot(split(women_data_40$P01019, women_data_40$HCU_history),
        main = "", ylab = "")
grid()
mtext(text = expression(log[2]~abundance), side = 2, line = 3, cex = 1.5, las = 0)
mtext(side = 3, outer = FALSE, text = "C", cex = 3, at = 0.2, line = -0)
dev.off()

```


No difference in protein concentrations can be seen between participants who
used contraceptives in the past and those who never used them. In contrast,
a large number of proteins are strongly affected by current use of hormonal
contraceptives. The full table with the results is shown below.

```{r}
tmp <- results_women_40[,
                        c("Protein.Names", "Genes",
                          "coef_HCU_historycurrent",
                          "p.adj_HCU_historycurrent",
                          "effect_size_HCU_historycurrent",
                          "significant_HCU_historycurrent",
                          "coef_HCU_historypast",
                          "p.adj_HCU_historypast",
                          "effect_size_HCU_historypast",
                          "significant_HCU_historypast")]
datatable(tmp) |>
    formatRound(columns = which(vapply(tmp, is.numeric, logical(1))),
                digits = 3)
```

We next compare the significant proteins from the full analysis on HCU with
those defined in the analysis directly comparing current HCU against never HCU.

```{r, fig.cap = "Comparisons of coefficients for associations with HCU."}
results_hcu <- results_hcu[rownames(results_women_40), ]

plot(results_hcu$coef_HCUYes, results_women_40$coef_HCU_historycurrent,
     pch = 21, col = "#000000ce", bg = "#00000060", xlab = "coef HCU",
     ylab = "coef HCU current vs never")
grid()
abline(0, 1)

sum(results_hcu$significant_HCUYes)
sum(results_hcu$significant_HCUYes & results_women_40$significant_HCU_historycurrent)
```

The HCU association results on the full data set are highly comparable to the
results from the comparison between female participants currently taking
hormonal contraceptives and those that never took them. The coefficients are
slightly larger in the latter case, mostly because of the *cleaner* data
set/comparison.

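For completeness we also create a heatmap of the (mean-centered) abundances of
the proteins significantly associated with current HCU in women below the age
of 40.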

```{r, fig.cap = "Heatmap of HCU associated proteins in women (below 40)."}
tmp <- prot_data[, rownames(results_women_40)[results_women_40$significant_HCU_historycurrent]]
tmp <- t(tmp)
tmp <- tmp - rowMeans(tmp)
tmp <- tmp[, rownames(women_data_40)]

ann <- data.frame(HCU = women_data_40$HCU_history)
rownames(ann) <- rownames(women_data_40)
pheatmap(tmp, annotation_col = ann)

```

Participants get grouped into two main clusters: those currently taking hormonal
contraceptives and all other participants.
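
The heatmap is based on abundances centered per protein, i.e. the mean
abundance of each protein (row) is subtracted from its values. A minimal
sketch of this row-centering on a made-up matrix (protein and sample names are
invented for illustration):

```{r, eval = FALSE}
#' Made-up abundance matrix: proteins in rows, samples in columns
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 2, byrow = TRUE,
            dimnames = list(c("protA", "protB"), c("s1", "s2", "s3")))

#' rowMeans() returns one value per row; because matrices are stored
#' column-wise, the vector is recycled so that each row gets its own mean
centered <- m - rowMeans(m)
centered
rowMeans(centered)
```

After centering, the row means are zero, so the colors of the heatmap represent
deviations from the average abundance of each protein rather than differences
in absolute abundance between proteins.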

We next estimate when the women with past HCU approximately stopped taking
hormonal contraceptives. The estimate is based on an assumed average age at
which women start taking contraceptives, the age at participation and the
self-reported duration of HCU.

```{r}
#' Defining the average age (estimated) when women start taking hormonal
#' contraceptives.
avg_age <- 16

#' Estimate the HCU free period.
women_data_40$eHCU_free_period <- women_data_40$Age -
    women_data_40$HCU_duration - avg_age

```
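
As a worked example of this estimate, the chunk below applies the same
arithmetic to three made-up participants. We assume here that `HCU_duration`
is reported in years; the starting age of 16 mirrors the value used above.
Negative estimates (the reported duration exceeds the time since the assumed
starting age) are implausible and are replaced with `NA`.

```{r, eval = FALSE}
#' Made-up participants (Age and HCU_duration assumed to be in years)
toy <- data.frame(Age = c(30, 25, 20), HCU_duration = c(8, 5, 10))
start_age <- 16

#' Estimated time since stopping: age - duration - assumed starting age
toy$free_period <- toy$Age - toy$HCU_duration - start_age
toy$free_period

#' Implausible (negative) estimates are set to NA
toy$free_period[which(toy$free_period < 0)] <- NA
toy
```

The first participant would thus have an estimated HCU free period of
30 - 8 - 16 = 6 years, while the estimate for the third participant is negative
and hence discarded.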

The distribution of this *HCU free period* for women that took HCU in the past
is shown below. Note that we remove negative values.

```{r, fig.cap = "Distribution of estimated HCU free period for women (below the age of 40) with past HCU."}
women_data_40_past <- women_data_40[women_data_40$HCU_history == "past", ]

#' Remove negative values
sum(women_data_40_past$eHCU_free_period < 0, na.rm = TRUE)
women_data_40_past$eHCU_free_period[which(women_data_40_past$eHCU_free_period < 0)] <- NA

plot(density(women_data_40_past$eHCU_free_period, na.rm = TRUE),
     xlab = "estimated time period without HCU",
     main = "Women below the age of 40 with previous HCU")
```

For these we next evaluate whether there is a dependency between the estimated
HCU free time period and concentrations for selected proteins.

```{r, fig.cap = "Relationship between AGT plasma concentrations and HCU free time for women below the age of 40 with past HCU."}

x <- women_data_40_past$eHCU_free_period
y <- women_data_40_past$P01019
plot(x, y, pch = 21, col = "#000000ce",
     bg = "#00000060", xlab = "estimated HCU free period",
     ylab = expression(log[2]~abundance), main = "AGT (P01019)")
grid()
l <- lm(y ~ x)
abline(l)
summary(l)
```

Finally, we fit a linear model for each protein.

```{r}
#' Linear model LM1
lm_hcu_period <- function(AnalyteName, Data) {
    Analyte <- Data[, AnalyteName]
    LMReg <- lm(Analyte ~ Age_10 + Fasting + BMIcat + eHCU_free_period,
                data = Data)
    res <- coef(summary(LMReg))[-1, ]
    p.Values <- data.frame(c(res[, 1],
                             res[, 4]))
    names(p.Values) <- AnalyteName
    rownames(p.Values) <- c(paste0("coef_", rownames(res)),
                            paste0("p-value_", rownames(res)))
    return(p.Values)
}
## do it in one run
Test_output <- do.call(
    cbind, lapply(proteins, function(x) lm_hcu_period(x, women_data_40_past))) |>
    t()
## Adjusting for multiple hypothesis testing
adjp <- apply(Test_output[, grep("p-value", colnames(Test_output))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
Test_output <- cbind(Test_output, adjp)

min(Test_output[, "p-adj_eHCU_free_period"])
## Test_output[order(Test_output[, "p-value_eHCU_free_period"]), ]
```

No significant association could be found.
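
The per-protein testing pattern used above (fit one model per protein, extract
the coefficient and p-value for the variable of interest, adjust for multiple
testing) can be sketched on simulated data as follows. All variable names and
values are made up, and the model is reduced to a single covariate for brevity.

```{r, eval = FALSE}
set.seed(123)
n <- 40
#' Simulated data without any true association
toy <- data.frame(free_period = runif(n, 0, 15),
                  protA = rnorm(n), protB = rnorm(n), protC = rnorm(n))
prots <- c("protA", "protB", "protC")

#' Fit one linear model per protein and extract estimate and p-value
res <- t(vapply(prots, function(p) {
    fit <- lm(toy[, p] ~ free_period, data = toy)
    coef(summary(fit))["free_period", c("Estimate", "Pr(>|t|)")]
}, numeric(2)))
colnames(res) <- c("coef", "p_value")

#' Adjust the p-values across proteins
res <- cbind(res, p_adj = p.adjust(res[, "p_value"], method = "bonferroni"))
res
```

Reporting the smallest adjusted p-value, as done above with `min()`, is then
sufficient to conclude whether any protein passes the significance threshold.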

## Previous use of contraceptives after excluding previously pregnant women

To exclude any potential confounding by previous or recent pregnancy, we repeat
the analysis excluding all participants who previously gave birth. The data set
is thus reduced to:

```{r}
table(women_data_40_np$HCU_history)
```

We fit the same linear model as above.

```{r}
LMRegtestsW <- do.call(
    cbind, lapply(proteins, function(x) FunForLinRegW(x, women_data_40_np)))
Test_output <- t(LMRegtestsW)
## Adjusting for multiple hypothesis testing
adjp <- apply(Test_output[, grep("p-value", colnames(Test_output))],
              MARGIN = 2, p.adjust, method = "bonferroni")
colnames(adjp) <- sub("value", "adj", colnames(adjp))
Test_output <- cbind(Test_output, adjp)

#' Define statistically significant results
comps <- colnames(Test_output)[grep("coef", colnames(Test_output))]
sign_p <- Test_output[, grep("p-adj", colnames(Test_output))] < 0.05
colnames(sign_p) <- sub("p-adj", "significant", colnames(sign_p))

#' perform data normalization (autoscaling)
prot_data_AS <- women_data_40_np
prot_data_AS[, proteins] <- scale(prot_data_AS[, proteins], scale = TRUE)
LMRegtests_AS <- do.call(cbind.data.frame,
                         lapply(proteins,
                                function(x) FunForLinRegW(x, prot_data_AS)))
Test_output_AS <- t(LMRegtests_AS)
EffectSize <- Test_output_AS[, grep("coef", colnames(Test_output_AS))]
colnames(EffectSize) <- sub("coef", "effect_size", colnames(EffectSize))

#' Calculate the (absolute) difference in abundance expressed as a percentage
#' based on a (log2) coefficient.
comps <- colnames(Test_output)[grep("coef", colnames(Test_output))]
diff_perc <- apply(Test_output[, comps],
                   MARGIN = 2, diff_percentage)
colnames(diff_perc) <- sub("coef", "diff_perc", colnames(diff_perc))
#' define which proteins are considered significant: require an adjusted
#' p-value < 0.05 and a difference in concentration > CV_QC (expressed in %)
sign_cv <- lapply(comps, function(z) {
    z <- sub("coef_", "", z)
    sign_p[, paste0("significant_", z)] &
        diff_perc[, paste0("diff_perc_", z)] > 100 *
        prot_ann[proteins, "cv_qc_chris"]
})
sign_cv <- do.call(cbind, sign_cv)
colnames(sign_cv) <- sub("coef", "significant", comps)
#' For the numeric variable (age) significance is based on the p-value only
sign_cv[, "significant_Age_10"] <- sign_p[, "significant_Age_10"]

results_women_40_np <- data.frame(
    prot_ann[proteins, c("description", "long_description", "PeptideNN",
                         "PeptideNames", "Protein.Ids", "Protein.Names",
                         "Genes", "Modified.Sequence", "Stripped.Sequence",
                         "Precursor.Id", "cv_qc_chris")],
    Test_output,
    EffectSize,
    diff_perc,
    sign_cv)

```

The volcano plots below show the results from these comparisons.

```{r, fig.width = 14, fig.cap = "Volcano plots showing the results for the comparison of women (aged < 40) currently (left), or previously (right) taking hormonal contraceptives against those that never used hormonal contraceptives. Results from the analysis on the subset of women that were never pregnant."}
par(mfrow = c(1, 2))
y <- -log10(results_women_40_np$p.adj_HCU_historycurrent)
x <- results_women_40_np$coef_HCU_historycurrent
xl <- range(x)
yl <- range(y)
plot(x, y, main = "HCU: current vs never", xlab = "coefficient",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
         col = "#000000ce", bg = "#00000060")
grid()
#' Can re-use the x and y limits because there is in fact no difference here.
y <- -log10(results_women_40_np$p.adj_HCU_historypast)
x <- results_women_40_np$coef_HCU_historypast
plot(x, y, main = "HCU: past vs never", xlab = "coefficient",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
         col = "#000000ce", bg = "#00000060")
grid()

```

There is no long-lasting effect, even after removing previously pregnant women.

```{r}
#' Supplementary Figure for previous HCU.
png("images/manuscript/supplement/Figure_previous_hcu_np.png", width = 15,
    height = 5, units = "cm", res = 600, pointsize = 4)

par(mfrow = c(1, 3), mar = c(5, 5.5, 3.5, 0.5), cex.lab = 2, bty = "n",
    cex.main = 3, cex.axis = 2.5, las = 1)
y <- -log10(results_women_40_np$p.adj_HCU_historycurrent)
x <- results_women_40_np$coef_HCU_historycurrent
xl <- range(x)
yl <- range(y)
plot(x, y, main = "", xlab = "coef",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
     col = "#000000ce", bg = "#00000060")
sigs <- results_women_40_np$significant_HCU_historycurrent
points(x[sigs], y[sigs], pch = 21, col = "#E41A1CCE", bg = "#E41A1C40")
mtext(side = 3, outer = FALSE, text = "A", cex = 3, at = -1.5, line = -0)
grid()
#' Can re-use the x and y limits because there is in fact no difference here.
y <- -log10(results_women_40_np$p.adj_HCU_historypast)
x <- results_women_40_np$coef_HCU_historypast
plot(x, y, main = "", xlab = "coef",
     ylab = expression(-log[10](p[adj])), xlim = xl, ylim = yl, pch = 21,
         col = "#000000ce", bg = "#00000060")
grid()
mtext(side = 3, outer = FALSE, text = "B", cex = 3, at = -1.5, line = -0)
vioplot(split(women_data_40_np$P01019, women_data_40_np$HCU_history),
        main = "", ylab = "")
grid()
mtext(text = expression(log[2]~abundance), side = 2, line = 3, cex = 1.5, las = 0)
mtext(side = 3, outer = FALSE, text = "C", cex = 3, at = 0.2, line = -0)
dev.off()

```


# Correlation between clinical/lab parameters and plasma proteomics data

We now consider selected clinical laboratory parameters measured for CHRIS
participants and perform a correlation analysis with related plasma protein
concentrations. The table below lists the laboratory parameters and selected
proteins as well as their correlation coefficient (Spearman rho). Since most
laboratory measurements don't directly quantify proteins, we don't expect
perfect agreement between the mass spectrometry and laboratory assay based
measurements.

```{r, results = "asis"}
pairs <- data.frame(
    label = c("x0lp04q", "x0lp09", "x0lp17", "x0lp18",
              "x0lp19", "x0lp29", "x0lp41"),
    Uniprot = c("P01008", "P02768", "P02647", "P04114",
                "P02656", "P02787", "P69905"))
anns <- chrisTraitAnnotations()
pairs$parameter_name <- anns[pairs$label, "name"]
pairs$hgnc_symbol <- prot_ann[pairs$Uniprot, "Genes"]
pairs$description <- prot_ann[pairs$Uniprot, "long_description"]
pairs$rho <- NA_real_
pairs$p.value <- NA_real_
for (i in seq_len(nrow(pairs))) {
    tmp <- cor.test(prot_data[, pairs[i, "Uniprot"]],
                    labdata_info[rownames(prot_data), pairs[i, "label"]],
                    use = "pairwise.complete.obs", method = "spearman")
    pairs$rho[i] <- tmp$estimate
    pairs$p.value[i] <- tmp$p.value
    ## pairs$rho[i] <- cor(prot_data[, pairs[i, "Uniprot"]],
    ##                     labdata_info[rownames(prot_data), pairs[i, "label"]],
    ##                     use = "pairwise.complete.obs", method = "spearman")
}
pandoc.table(pairs, style = "rmarkdown", split.tables = Inf,
             caption = "Correlation with selected laboratory parameters")
```
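
Spearman's rho is used because it is rank-based: it is not affected by the
$\log_2$ transformation of the protein abundances and only requires a monotonic
(not a linear) relationship. A minimal sketch on made-up values (not CHRIS
data), including one missing laboratory value:

```{r, eval = FALSE}
#' Made-up values: a lab measurement (with one missing value) and a log2
#' abundance that increases monotonically, but not linearly, with it
lab <- c(10, 20, 40, 80, NA, 160)
prot <- 3 * log2(c(10, 20, 40, 80, 100, 160)) + 1

#' cor.test() drops incomplete pairs before computing the statistic
res <- cor.test(prot, lab, method = "spearman")
res$estimate
res$p.value

#' The Pearson correlation on the same data is lower because the
#' relationship is not linear
cor(prot, lab, use = "pairwise.complete.obs")
```

Because the ranks of the two variables agree perfectly in this toy example, rho
is 1 although the relationship between the values is not linear.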

Apolipoprotein A-I is the major structural component of HDL while apolipoprotein
B is the primary apolipoprotein of LDL.

```{r}
for (i in seq_len(nrow(pairs))) {
    plot(labdata_info[rownames(prot_data), pairs[i, "label"]],
         2^prot_data[, pairs[i, "Uniprot"]],
         pch = 21, col = "#00000080", bg = "#00000040",
         xlab = pairs[i, "parameter_name"], ylab = pairs[i, "hgnc_symbol"])
    grid()
}
```

- *x0lp04q* (Antithrombin (%)): measured in plasma citrate using the enzymatic
  Siemens Innovance Antithrombin assay on a ROCHE SYSMEX CA1500 to quantify
  the functionally active antithrombin; after 2014-01 using the enzymatic
  STA-Stachrom AT III - REF 00596 assay on a STAGO STA COMPACT MAX instrument.
- *x0lp09* (Albumin (g/dL)): measured in serum using the colorimetric Cobas ALB
  plus Albumin BCG assay on a ROCHE MODULAR PPE and since 2014-05 using the
  colorimetric ALBUMIN BCG assay on an ABBOTT DIAGNOSTIC ARCHITECT instrument.
- *x0lp17* (HDL (mg/dL)): measured in serum using the enzymatic Cobas HDL-C PLUS
  3rd generation assay on a ROCHE MODULAR PPE instrument. After 2014-05 using
  the ULTRA HDL assay on an ABBOTT DIAGNOSTIC ARCHITECT instrument.
- *x0lp18* (LDL (mg/dL)): measured in serum using the enzymatic Cobas LDL-C plus
  2nd generation assay on a ROCHE MODULAR PPE and since 2014-05 using the DIRECT
  LDL assay on an ABBOTT DIAGNOSTIC ARCHITECT instrument.
- *x0lp19* (Triglycerides (mg/dL)): measured in serum using the enzymatic Cobas
  Triglyceride GPO-PAP assay on a ROCHE MODULAR PPE and since 2014-05 using the
  TRIGLYCERIDE assay on an ABBOTT DIAGNOSTIC ARCHITECT instrument.
- *x0lp29* (Transferrin (mg/dL)): measured in serum using the immunological
  Cobas Tina-quant Transferrin ver.2 assay on a ROCHE MODULAR PPE
  instrument. Since 2014-05 using the immunological TRANSFERRIN assay on an
  ABBOTT DIAGNOSTIC ARCHITECT.
- *x0lp41* (HGB (g/dL)): measured in EDTA plasma using the electronic impedance
  laser light scattering based assay with Cell-Dyn Sapphire reagents on an ABBOTT
  CD SAPPHIRE instrument and since 2014-03 on a SYSMEX XN-1000 (also electronic
  impedance laser light scattering based assay).

```{r}
png("images/manuscript/supplement/Figure_1.png", width = 10, height = 5, units = "cm",
    res = 600, pointsize = 4)
par(mfrow = c(1, 2), mar = c(4.5, 4.3, 0.5, 0.5), cex.lab = 1.5,
    bty = "n", las = 1)
plot(labdata_info[rownames(prot_data), pairs[4, "label"]],
     2^prot_data[, pairs[4, "Uniprot"]],
     pch = 16, col = "#00000040",
     xlab = pairs[4, "parameter_name"], ylab = pairs[4, "hgnc_symbol"])
grid()
plot(labdata_info[rownames(prot_data), pairs[6, "label"]],
     2^prot_data[, pairs[6, "Uniprot"]],
     pch = 16, col = "#00000040",
     xlab = pairs[6, "parameter_name"], ylab = pairs[6, "hgnc_symbol"])
grid()
dev.off()
```

# Comparison of results with Enroth et al

Enroth et al. [@enroth_systemic_2018] evaluated the impact of antihypertensive
and lipid-lowering medication on plasma proteins using Olink-based plasma
proteome data. Below we load their results and evaluate their overlap with
ours. We base the mapping on the UniProt identifiers and gene names available
for both data sets.

```{r}
enroth <- read_xlsx("data/xlsx/41598_2018_23860_MOESM1_ESM.xlsx", sheet = 1) |>
    as.data.frame()

tmp <- strsplit(proteins, split = ";")
enroth_mapping <- data.frame(
    ms_id = rep(proteins, lengths(tmp)),
    ms_uniprot = unlist(tmp, use.names = FALSE)
)
enroth_mapping$ms_genes <- prot_ann[enroth_mapping$ms_id, "Genes"]
enroth_mapping$ol_uniprot <- NA_character_
for (i in seq_len(nrow(enroth_mapping))) {
    idx <- which(enroth[, "Uniprot Id"] == enroth_mapping$ms_uniprot[i])
    if (!length(idx))
        idx <- which(enroth[, "Gene"] == enroth_mapping$ms_genes[i])
    if (length(idx))
        enroth_mapping$ol_uniprot[i] <- paste0(enroth[idx, "Uniprot Id"],
                                               collapse = "; ")
}

```

The two data sets overlap for
`r sum(!is.na(enroth_mapping$ol_uniprot))` proteins.
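
The mapping strategy used above (expand protein groups into individual UniProt
identifiers, match on the identifier first and fall back to the gene name) can
be sketched on made-up identifiers as shown below. All identifiers, gene names
and column names are invented for illustration. In contrast to the loop above,
`match()` reports only the first hit per identifier instead of collapsing
multiple hits into one string.

```{r, eval = FALSE}
#' Made-up MS identifiers; the second entry is a protein group
ms_ids <- c("P00001", "P00002;P00003", "P00004")
ms_genes <- c(P00001 = "GENE1", `P00002;P00003` = "GENE2", P00004 = "GENE4")
#' Made-up external data set
other <- data.frame(uniprot = c("P00003", "Q99999"),
                    gene = c("GENE2", "GENE4"))

#' One row per individual UniProt identifier
ids <- strsplit(ms_ids, split = ";")
map <- data.frame(ms_id = rep(ms_ids, lengths(ids)),
                  ms_uniprot = unlist(ids, use.names = FALSE))
map$ms_gene <- unname(ms_genes[map$ms_id])

#' Match on the UniProt identifier first, then fall back to the gene name
idx <- match(map$ms_uniprot, other$uniprot)
fallback <- is.na(idx)
idx[fallback] <- match(map$ms_gene[fallback], other$gene)
map$other_uniprot <- other$uniprot[idx]
map
```

Note that in this toy example the protein group contributes two rows to the
mapping table, both of which are matched, so the number of matched rows (3) is
larger than the number of matched MS entries (2).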


# Comparison of results with Ramsey et al

Ramsey et al. [@ramsey_variation_2016] evaluated the influence of oral
contraceptives and menstrual cycle on plasma protein levels.

```{r}
ramsey <- read_xlsx("data/xlsx/Ramsey.xlsx") |> as.data.frame()

hcu_sig <- results_hcu[results_hcu$significant_HCUYes, ]
sum(ramsey$UniProt %in% rownames(hcu_sig))

nrow(ramsey)

plot(results_hcu[ramsey$UniProt, "coef_HCUYes"], ramsey$coef)
```


# Data, figures etc for the manuscript's supplement

```{r}
library(MetaboCoreUtils)
dist_plot <- function(x, protein, ...) {
    plot(density(x[, protein]), xlab = expression(log[2]~abundance),
         main = paste0(protein, " (", prot_ann[protein, "Genes"], ")"), ...)
    grid()
    legend("topleft",
           legend = c(
               paste0("IQR(study): ", format(IQR(x[, protein]), digits = 2)),
               paste0("CV (study): ", format(rsd(2^x[, protein]), digits = 2)),
               paste0("CV (QC): ", format(prot_ann[protein, "cv_qc_chris"],
                                          digits = 2))))
}
dr <- paste0("images/manuscript/supplement/")
dir.create(dr, showWarnings = FALSE, recursive = TRUE)
nr_plots <- 10
i <- 1
while (TRUE) {
    idx <- seq(i, length.out = nr_plots)
    idx <- idx[idx <= length(proteins)]
    if (!length(idx))
        break
    fl <- paste0(dr, "density-", i, ".png")
    png(fl, width = 10, height = 18, units = "cm", res = 300, pointsize = 4)
    #' ceiling() ensures a valid layout also for an odd number of plots
    par(mfrow = c(ceiling(length(idx) / 2), 2), mar = c(4.5, 4.2, 1.5, 0.5))
    for (j in idx)
        dist_plot(prot_data, proteins[j], lwd = 2)
    dev.off()
    i <- i + nr_plots
}
## for (protein in proteins) {
##     fl <- paste0(dr, "density-", protein, ".png")
##     png(fl, width = 10, height = 4, units = "cm", res = 300, pointsize = 4)
##     par(mar = c(4.5, 4.2, 1.5, 0.5))
##     dist_plot(prot_data, protein, lwd = 2)
##     dev.off()
## }

```
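
The paging logic used above (a fixed number of density plots per PNG file,
arranged in two columns) can also be expressed without a `while` loop. The
chunk below is a sketch of this alternative on made-up identifiers; it only
computes the page assignment and the number of rows per page and does not
create any plots.

```{r, eval = FALSE}
#' Made-up identifiers standing in for the proteins
items <- paste0("protein_", 1:23)
per_page <- 10

#' Assign each item to a page
pages <- split(seq_along(items), ceiling(seq_along(items) / per_page))
lengths(pages)

#' Rows needed for a two-column layout; ceiling() handles odd counts
n_rows <- ceiling(lengths(pages) / 2)
n_rows
```

With 23 items this results in two full pages with 10 plots (5 rows each) and a
last page with 3 plots, for which 2 rows are needed.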

Note on figures for the manuscript: combining ggplot and base R plots in the
same figure is not (easily) possible, and some of the plots were created in
Berlin, hence the figures need to be merged manually. This is done with
GIMP. Original plots are not modified. Labels (A, B, ...) are added manually
using the *sans serif* font at a size of 120px.

# Session information

R packages used for the analysis:

```{r}
sessionInfo()
```


# References