-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathSeverity_pipeline_simulated_data.Rmd
290 lines (203 loc) · 11.8 KB
/
Severity_pipeline_simulated_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
---
title: "Severity pipeline-simulated data"
output: html_document
date: "2023-05-24"
---
```{r, include=FALSE}
library(tidyverse)
```
<br>
Epiverse severity (and underreporting) pipeline illustrated with data from 100-days workshop simulated scenario data.
```{r setup, include=TRUE}
Linelist_group3 <- read_csv("Data/Linelist_group3.csv")
head(Linelist_group3)
```
### Task 1: Identify and correct inconsistencies and errors in the data
<br>
The package [linelist](https://github.com/epiverse-trace/linelist) was used to create a `linelist` object through the `make_linelist()` function, which adds tags to df columns, that represent relevant variables in epidemiological analysis. Current tags can be checked using the function `tags()`.
Currently (June 2023) it is necessary to install the development version of the package:
```{r}
#pak::pak("epiverse-trace/linelist")
library(linelist)
```
```{r}
linelist_3 <- make_linelist(Linelist_group3,
id = "case_name", date_onset = "onset_date",
date_reporting = "report_date",
date_admission = "hospitalisation_date",
date_death = "death_date", gender = "gender",
age = "age", location = "island_group",
occupation = "occupation")
tags(linelist_3)
```
<br>
The function `validate_linelist()` was used to check that column formats fit the class of the corresponding tags. This is useful to perform checks on the data to ensure that it is in the right format before conducting any further analyses.
_Won't knit when using this function because it generates an error if classes are incorrect_
Multiple errors were detected, and subsequently corrected thanks to this check:
```{r}
linelist_3$onset_date <- as.Date(linelist_3$onset_date, format = "%Y-%m-%d")
linelist_3$report_date <- as.Date(linelist_3$report_date, format = "%Y-%m-%d")
linelist_3$hospitalisation_date <- as.Date(linelist_3$hospitalisation_date, format = "%Y-%m-%d")
linelist_3$death_date <- as.Date(linelist_3$death_date, format = "%Y-%m-%d")
linelist_3$gender <- as.character(linelist_3$gender)
validate_linelist(linelist_3)
```
<br>
The `lost_tags_action()` was used to determine the message that the user receives when tags are lost in the subsequent steps of the analysis, e.g., when subsetting data. Also, the `get_lost_tags_action()` is used to check which action had been set- which can be changed throughout the analysis.
```{r}
lost_tags_action("warning")
get_lost_tags_action()
```
<br>
The package [cleanepi](https://github.com/epiverse-trace/cleanepi) was used to clean linelist data. This package will include multiple functions, to remove duplicate entries, check IDs and verify that they comply with the expected format, automatically standardise dates, or calculate the age of cases included in the linelist.
_At the moment, it is not possible to use `linelist` tags to remove duplicates across columns_
```{r}
library(cleanepi)
linelist3_clean <- cleanepi(linelist_3, remove.duplicates = T, duplicates.from = c("case_name","ct_value","age","hospitalisation_date","date_first_contact","report_date","occupation","case_type","gender","onset_date","death_date","date_last_contact","island_group"), clean.col.names = T)
```
<br>
## Task 2: What is the CFR?
<br>
In this section, the packages [datadelay](https://github.com/epiverse-trace/datadelay), [epiparameter](https://github.com/epiverse-trace/epiparameter), and [quickfit](https://github.com/epiverse-trace/quickfit) are used to estimate the case fatality rate of disease, while correcting the calculation to account for censoring of cases.
<br>
```{r, echo=FALSE, out.width="80%"}
knitr::include_graphics('images/Severity and underreporting pipeline.png')
```
<br>
To be able to use `datadelay`, it is necessary to first format the data as a dataframe with the number of cases, deaths, and dates of death, for which the package `incidence2` was used.
_Issue raised on `datadelay` to facilitate the input of data so that users don't have to change from a long to wide format using several lines of code_
```{r}
library(incidence2)
daily_cases_deaths <- incidence2::incidence(linelist3_clean,c("onset_date","death_date")) |> complete_dates()
daily_cases_deaths <- pivot_wider(daily_cases_deaths, names_from = count_variable, values_from = count)
names(daily_cases_deaths)[2:3]=c("deaths","cases")
daily_cases_deaths <- daily_cases_deaths[c(1,3,2)]
daily_cases_deaths <- as.data.frame(daily_cases_deaths)
#Adding location
daily_cases_deaths$location <- "Philippines"
```
<br>
The function `estimate_static()` was used to estimate the overall value of the CFR in this scenario. The argument `correct = TRUE/FALSE` allows the user to correct the CFR estimation, based on the distribution of the onset-death delay. The function `format_output()` provides a nicely formatted summary of the CFR value and its 95% CI.
On the following example, these functions are first used to estimate the naive CFR:
```{r}
library(datadelay)
# Naive, static CFR:
n_static_CFR <- estimate_static(daily_cases_deaths, correct_for_delays = F, location = "location")
format_output(n_static_CFR, estimate_type = "severity")
```
<br>
The corrected overall CFR was calculated by importing the onset-death delay distribution for Ebola from the `epiparameter` package:
```{r}
library(epiparameter)
# Loading delay distribution:
onset_to_death_ebola <- epiparameter::epidist_db(
disease = "Ebola Virus Disease",
epi_dist = "onset_to_death",
author = "WHO_Ebola_Response_Team")
# Corrected, static CFR:
c_static_CFR <- estimate_static(daily_cases_deaths, epi_dist = onset_to_death_ebola, correct_for_delays = T, location = "location")
format_output(c_static_CFR, estimate_type = "severity")
```
<br>
_The `estimate_static` function can also be used to compare the CFR estimates in different areas when the input dataset has the death counts per day and also per affected region_ <- not true at the moment
```{r}
daily_cases_deaths_location <- incidence2::incidence(linelist3_clean,c("onset_date","death_date"), groups = "island_group") |> complete_dates()
daily_cases_deaths_location <- pivot_wider(daily_cases_deaths_location, names_from = count_variable, values_from = count)
daily_cases_deaths_location <- daily_cases_deaths_location[c(1,4,3,2)]
names(daily_cases_deaths_location)[2:3]=c("cases","deaths")
#daily_cases_deaths_location <- as.data.frame(daily_cases_deaths_location)
```
```{r}
n_static_CFR_region <- estimate_static(daily_cases_deaths_location, correct_for_delays = F, location = "island_group")
format_output(n_static_CFR_region, estimate_type = "severity")
```
<br>
The function `estimate_time_varying()` provides an estimation of the CFR over time. Similarly to the example above, this estimation can be naive or corrected.
_Further, the function `plot_data_and_CFR()` helps to easily visualise the data along with the estimated CFR: <Is this function gone??_
```{r}
n_rolling_CFR <- estimate_time_varying(daily_cases_deaths, epi_dist = onset_to_death_ebola, correct_for_delays = FALSE)
c_rolling_CRF <- estimate_time_varying(daily_cases_deaths, epi_dist = onset_to_death_ebola, correct_for_delays = T)
plot_time_varying(c_rolling_CRF)
```
<br>
Because the disease in this scenario is Reston Ebolavirus, a pathogen that (to our knowledge) does not affect humans, another option is to fit a distribution to the data ourselves, as there are no estimates for this variant in humans. This is applicable to any other emerging disease. In the following example, `fitdistrplus` and `quickfit` were used to fit a probability distribution to the onset-death delay, so that in can be used in the corrected CFR estimation.
First, data was formatted and a histogram was produced to visualise the distribution of deaths:
```{r}
onset_to_death <- linelist3_clean[,c(6,8)]
onset_to_death <- na.omit(onset_to_death)
delay_onset_death <- as.numeric(onset_to_death$death_date - onset_to_death$onset_date)
hist(delay_onset_death)
```
<br>
Then, different distributions were fitted and compared:
```{r}
library(fitdistrplus)
fit_lnorm_od <- fitdist(delay_onset_death, "lnorm")
fit_gamma_od <- fitdist(delay_onset_death, "gamma")
fit_weibull_od <- fitdist(delay_onset_death, "weibull")
#Comparing AIC and BIC
fit_comparison_od <- data.frame(distribution = c("Weibull", "Gamma", "Lognormal"),
AIC = c(fit_weibull_od$aic, fit_gamma_od$aic, fit_lnorm_od$aic),
BIC = c(fit_weibull_od$bic, fit_gamma_od$bic, fit_lnorm_od$bic))
fit_comparison_od
```
To compare the fit of probability distributions, the function `multfitdist()`, from `quickfit` can also be used, which provides the AIC, BIC, and the log-likelihood:
```{r}
library(quickfit)
multi_fitdist(
data = delay_onset_death,
models = c("lnorm", "gamma", "weibull"))
```
<br>
Next, this distribution was converted to an `epidist` object, so that it could be used with `datadelay`'s functions. This involved using the function `extract_param()` to extract the distribution's parameters. These parameters, along with the summary statistics of the scenario data, are used to create an `epidist` object with the function `epidist()`:
```{r}
library(epiparameter)
summary(delay_onset_death)
extract_param(
type = "percentiles",
values = c(12, 21.75),
distribution = "lnorm",
percentiles = c(0.25, 0.75)) #values- meanlog: 2.78, sdlog: 0.44
summary_stats <- create_epidist_summary_stats(mean = 19.43, median = 18.5, quantiles = c(q_25=12,q_75=21.75))
restonebv_o_d <- epidist(disease = "ebola", pathogen = "reston_ebv",
epi_dist = "onset_to_death", prob_distribution = "lnorm",
prob_distribution_params = c(meanlog=2.78, sdlog=0.44),
summary_stats = summary_stats)
```
<br>
Lastly, the CFR was recalculated using `datadelay` with custom distribution parameters:
```{r}
rEbv_rolling_CFR <- estimate_time_varying(daily_cases_deaths, correct_for_delays = T, epi_dist = restonebv_o_d)
plot_time_varying(rEbv_rolling_CFR)
```
<br>
### Task 3: Estimation of proportion of ascertained cases
The function `known_outcomes()` from `datadelay` allows the user to estimate the expected number of individuals with known outcomes, given the case and death data collected during an outbreak.
_What does the outcome mean? what is the number in the new column, expected no. of cases with known outcomes??_
_How does the known_outcomes() function fit with the estimate_underreporting() function?_
```{r}
ebola_known_outcomes <- known_outcomes(daily_cases_deaths, epi_dist = onset_to_death_ebola)
plot_known_outcomes(ebola_known_outcomes)
```
<br>
The function `estimate_underreporting()` estimates the proportion of cases that have been ascertained. This proportion is calculated as the ratio of the "true" baseline severity estimate (e.g., obtained from the literature), and the delay-adjusted severity estimate (obtained using the functions `estimate_static()` and `estimate_time_varying()`)
For this simulated scenario, we will assume a baseline CFR of 4%.
_For the argument "severity baseline" clarify whether this is just the number or divided by 100_
```{r}
ebola_reporting_static <- estimate_reporting(daily_cases_deaths, epi_dist = onset_to_death_ebola, type = "static", severity_baseline = 0.04, correct_for_delays = F)
format_output(ebola_reporting_static, estimate_type = "reporting", type ="Under-reporting")
```
<br>
Then, estimating the proportion of ascertained cases over time:
_In the vignette, the format of the df_in for this function is a linelist, not cases-deaths counts_
```{r}
ebola_reporting_rolling <- estimate_reporting(
daily_cases_deaths,
epi_dist = onset_to_death_ebola,
type = "varying",
severity_baseline = 0.04,
correct_for_delays = T,
smooth_inputs = T,
burn_in_value = 0)
format_output(ebola_reporting_rolling, estimate_type = "reporting", type ="Ascertainment")
```