-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathVignette_workshop.Rmd
358 lines (255 loc) · 20.6 KB
/
Vignette_workshop.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
---
title: "workshop vignette"
output:
pdf_document:
latex_engine: xelatex
date: "2023-01-12"
self_contained: TRUE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = F)
```
<br>
This vignette summarises the findings from the _100 days and 100 lines of code_ workshop, hosted in December 2022 by [Epiverse-TRACE](https://data.org/news/epiverse-trace-a-values-based-approach-to-open-source-ecosystems/).
**This document is a draft, the final version will be published on Epiverse's blog after it has been reviewed by other Epiverse members and workshop participants**
* Participants who have contributed so far: Sara Hollis, Anne Cori, Geraldine Gomez, Juan Daniel Umana, David Santiago Quevedo, Chaoran Chen, and John Lees
## What should the first 100 lines of code written during an epidemic look like?
To answer this question, we invited 40 experts, including academics, field epidemiologists, and software engineers, to take part in a 3-day workshop, where they discussed the current challenges, and potential solutions, in data analytic pipelines used to analyse epidemic data. In addition to highlighting existing technical solutions and their use cases, presentations on best practices in fostering collaboration across institutions and disciplines set the scene for the subsequent workshop scenario exercises.
### What R packages and tools are available to use during an epidemic?
To investigate this in a similar setting to what an outbreak response team would experience, workshop participants were divided into groups, and asked to develop a plausible epidemic scenario, that included:
* A situation report, describing the characteristics of the epidemic
* A linelist of cases and contact tracing data, by modifying provided datasets containing simulated data
* A set of questions to address during the analytic process
Groups then exchanged epidemic scenarios and analysed the provided data to answer the questions indicated the previous group, as if they were a response team working to solve an outbreak.
Details about each of these outbreak scenarios and the analytic pipelines developed by the groups are summarised in this vignette.
### Simulating epidemic data
Before the workshop, a fictitious dataset was created, which consisted of a linelist and contact tracing information.
To generate linelist data, the package [`bpmodels`](https://github.com/epiverse-trace/bpmodels) was used to generate a branching process network. Cases were then transformed from the model output to a linelist format. To add plausible hospitalisations and deaths, delay distributions for SARS-CoV were extracted from [`epiparameter`](https://github.com/epiverse-trace/epiparameter).
To create the contact tracing database, a random number of contacts was generated for each of the cases included in the linelist. These contacts were then assigned a category of _became case_, _under follow up_ or _lost to follow up_, at random.
* Through this workshop, we identified the need for a tool to simulate outbreak data in a linelist format, to test analysis methods and other packages while having control over the characteristics of the test data. For this purpose, an R package is currently in progress, see [simulist](https://github.com/epiverse-trace/pitcher/issues/7).
## Scenario 1: Novel respiratory disease in The Gambia
<br>
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics('images/Scenario1smq.png')
```
<br>
### Analytic pipeline for scenario 1 (analysed by group 2)
* Data cleaning
+ [`linelist`](https://cran.r-project.org/web/packages/linelist/index.html) to standardise date format
+ [`cleanr`](https://github.com/Hackout3/cleanr) from previous Hackathon
* Delay distributions
+ [`fitdisrplus`](https://cran.r-project.org/web/packages/fitdistrplus/index.html) to fit parameteric distributions to scenario data
+ [`epiparameter`](https://github.com/epiverse-trace/epiparameter) to extract delay distributions from respiratory pathogens
+ [`EpiNow2`](https://github.com/epiforecasts/EpiNow2) to fit reporting delays
+ [`EpiEstim`](https://cran.r-project.org/web/packages/EpiEstim/index.html) / [`coarseDataTools`](https://cran.r-project.org/web/packages/coarseDataTools/index.html) to estimate generation time/serial interval of disease
+ [`epicontacts`](https://cran.r-project.org/web/packages/epicontacts/index.html)
+ [`mixdiff`](https://github.com/MJomaba/MixDiff) to estimate delay distributions and correct erroneous dates at the same time (still under development)
* Population demographics
+ Would like to have had access to an R package similar to [`ColOpenData`](https://github.com/biomac-lab/TRACE_ColOpenData)
* Risk factors of infection
+ Used [R4epis](https://r4epis.netlify.app/training/walk-through/univariate/) as a guide on how to create two-way tables and perform Chi-squared tests
* Severity of disease
+ [`datadelay`](https://github.com/epiverse-trace/datadelay/) for CFR calculation
+ Implementation of method developed by [AC Ghani, 2005](https://pubmed.ncbi.nlm.nih.gov/16076827/) to estimate CFR
* Contact matching
+ [`diyar`](https://cran.r-project.org/web/packages/diyar/index.html) to match and link records
+ [`fuzzyjoin`](https://cran.r-project.org/web/packages/fuzzyjoin/index.html) to join contact and case data despite misspellings or missing cell contents
* Epi curve and maps
+ Used [`incidence`](https://cran.r-project.org/web/packages/incidence/) and [`incidence2`](https://cran.r-project.org/web/packages/incidence2/) for incidence calculation and visualisation
+ [`raster`](https://cran.r-project.org/web/packages/raster/index.html) to extract spatial information from library of shapefiles
* Reproduction number
+ [`APEestim`](https://github.com/kpzoo/APEestim)
+ [`bayEStim`](https://github.com/thlytras/bayEStim)
+ [`earlyR`](https://cran.r-project.org/web/packages/earlyR/index.html)
+ [`epicontacts`](https://cran.r-project.org/web/packages/epicontacts/index.html)
+ [`epidemia`](https://github.com/ImperialCollegeLondon/epidemia)
+ [`epiFilter`](https://github.com/kpzoo/EpiFilter)
+ [`EpiNow2`](https://github.com/epiforecasts/EpiNow2)
+ [`EpiEstim`](https://cran.r-project.org/web/packages/EpiEstim/index.html)
+ [`R0`](https://cran.r-project.org/web/packages/R0/index.html)
+ [`outbreaker2`](https://cran.r-project.org/web/packages/outbreaker2/index.html)
+ Used [this comparison table](https://github.com/mrc-ide/EpiEstim/blob/master/vignettes/alternative_software.Rmd) to choose the most appropriate package.
* Superspreading, by using these resources:
+ [`fitdistrplus`](https://cran.r-project.org/web/packages/fitdistrplus/index.html)
+ [`epicontacts`](https://cran.r-project.org/web/packages/epicontacts/index.html)
* Epidemic projections
+ [`incidence`](https://cran.r-project.org/web/packages/incidence/vignettes/incidence_fit_class.html) R estimation using a loglinear model
+ [`projections`](https://cran.r-project.org/web/packages/projections/index.html) using Rt estimates, SI distributions and overdispersion estimates
* Transmission chains and strain characterisation
+ [IQtree](http://www.iqtree.org) and [nextclade](https://clades.nextstrain.org) to build a maximum likelihood tree and mannually inspect it
+ Advanced modelling through phylodynamic methods, using tools like [BEAST](https://beast.community)
<br>
Data analysis step | Challenges
------------- | ---------------------------------------
Data cleaning | Not knowing what packages are available for this purpose
Delay distributions | Dealing with right truncation <br> Accounting for multiple infectors
Population demographics | Lacking tools that provide information about population by age, gender, etc.
Risk factors of infection | Distinguishing between risk factors vs detecting differences in reporting frequencies among groups
Severity of disease | Knowing the prevalence of disease (denominator) <br> Right truncated data <br> Varying severity of different strains
Contact matching | Missing data <br> Misspellings
Epicurve and maps| NA dates entries not included <br> Reporting levels varying over time
Offspring distribution | Right truncation <br> Time varying reporting efforts <br> Assumption of a single homogeneous epidemic <br> Importation of cases
Forecasting | Underlying assumption of a given R distribution, e.g., single trend, homogeneous mixing, no saturation
<br>
## Scenario 2: Outbreak of an unidentified disease in rural Colombia
<br>
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics('images/Scenario2smq.png')
```
<br>
### Analytic pipeline for scenario 2 (analysed by group 3)
* Data cleaning: manually, using R (no packages specified), to
+ Fix data entry issues in columns _onset_date_ and _gender_
+ Check for missing data
+ Check sequence of dates: symptom onset → hospitalisation → death
* Data anonymisation to share with partners
+ [`fastlink`](https://cran.r-project.org/web/packages/fastLink/index.html) for probabilistic matching between cases ↔ contacts, based on names, dates, and ages
* Case demographics
+ [`apyramid`](https://cran.r-project.org/web/packages/apyramid/index.html) to stratify data by age, gender, and health status
* Reproductive number calculation, by using two approaches:
+ Manually, by calculating the number of cases generated by each source case, data management through [`dplyr`](https://dplyr.tidyverse.org) and [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html)
+ Using serial interval of disease, through [`EpiEstim`](https://cran.r-project.org/web/packages/EpiEstim/index.html) or [`EpiNow2`](https://github.com/epiforecasts/EpiNow2)
* Severity of disease
+ Manual calculation of CFR and hospitalisation ratio
* Projection of hospital bed requirements
+ [`EpiNow2`](https://github.com/epiforecasts/EpiNow2) to calculate average hospitalisation duration and forecasting
* Zoonotic transmission of disease
+ Manual inspection of cases' occupation
+ Use of [IQtree](http://www.iqtree.org) and [`ggtree`](https://guangchuangyu.github.io/software/ggtree/) to plot phylogenetic data
* Superspreading
+ [`epicontacts`](https://cran.r-project.org/web/packages/epicontacts/index.html)
* Calculation of attack rate
+ Unable to calculate, given the lack of seroprevalence data
<br>
Data analysis step | Challenges
------------- | ---------------------------------------
Data anonymisation | Dealing with typos and missing data when generating random unique identifiers
Reproduction number | Right truncation <br> Underestimation of cases due to reporting delays
Projection of hospital bed requirements | Incomplete data (missing discharge date) <br> Undocumented functionality in R packages used
Zoonotic transmission | Poor documentation <br> Unavailability of packages in R <br> Differentiation between zoonotic transmission and risk factors- need for population data
Attack rate | Not enough information provided
## Scenario 3: Reston Ebolavirus in the Philippines
<br>
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics('images/Scenario3smq.png')
```
<br>
### Analytic pipeline for scenario 3 (analysed by group 4)
* Data cleaning
+ Importing data with [`rio`](https://cran.r-project.org/web/packages/rio/index.html), [`readxl`](https://readxl.tidyverse.org), [`readr`](https://cran.r-project.org/web/packages/readr/index.html), or [`openxlsx`](https://cran.r-project.org/web/packages/openxlsx/index.html)
+ Rename variables with [`janitor`](https://cran.r-project.org/web/packages/janitor/index.html)
+ Initial data checks with [`pointblank`](https://cran.r-project.org/web/packages/pointblank/index.html), [`assertr`](https://cran.r-project.org/web/packages/assertr/index.html), [`compareDF`](https://cran.r-project.org/web/packages/compareDF/index.html), or [`skimr`](https://cran.r-project.org/web/packages/skimr/index.html)
+ Vertical data checks with [`matchmaker`](https://cran.rstudio.com/web/packages/matchmaker/index.html), [`lubridate`](https://cran.r-project.org/web/packages/lubridate/index.html), or [`parsedate`](https://cran.r-project.org/web/packages/parsedate/index.html)
+ Horizontal data checks with [`hmatch`](https://github.com/epicentre-msf/hmatch), [`assertr`](https://cran.r-project.org/web/packages/assertr/index.html), or [`queryR`](https://github.com/epicentre-msf/queryr)
+ Detect duplicates with [`janitor`](https://cran.r-project.org/web/packages/janitor/index.html) and [`tidyverse`](https://www.tidyverse.org)
+ Checking for consistency with [`dplyr`](https://dplyr.tidyverse.org), or [`powerjoin`](https://cran.r-project.org/web/packages/powerjoin/index.html)
+ Translation with [`matchmaker`](https://cran.rstudio.com/web/packages/matchmaker/index.html)
* Delay distributions
+ [`fitdistrplus`](https://cran.r-project.org/web/packages/fitdistrplus/index.html) to fit parameteric distributions to epidemic data
* Case demographics
+ [`apyramid`](https://cran.r-project.org/web/packages/apyramid/index.html) to stratify data by age, gender, and health status
+ [`ggplot2`](https://ggplot2.tidyverse.org/reference/ggplot.html) to visualise data
* Outbreak description
+ [`sitrep`](https://github.com/R4EPI/sitrep) to generate reports
* Visualisation of geographic data
+ [`sf`](https://cran.r-project.org/web/packages/sf/index.html) for static maps
+ [`leaflet`](https://cran.r-project.org/web/packages/leaflet/index.html) for interactive maps
* Generation of tables
+ [`gtsummary`](https://cran.r-project.org/web/packages/gtsummary/index.html) for static tables
+ [`janitor`](https://cran.r-project.org/web/packages/janitor/index.html) for interactive tables
* Severity of disease
+ [`EpiNow2`](https://epiforecasts.io/EpiNow2/) and [`survival`](https://cran.r-project.org/web/packages/survival/index.html) to calculate CFR
* Attack rate
+ [`gadm`](https://github.com/rspatial/geodata) function to get population data
+ [`epitabulate`](https://github.com/R4EPI/epitabulate/) to describe data
+ [`sf`](https://cran.r-project.org/web/packages/sf/index.html) and [`ggplot2`](https://ggplot2.tidyverse.org/reference/ggplot.html) to plot data
* Forecasting
+ [`EpiEstim`](https://cran.r-project.org/web/packages/EpiEstim/index.html)
+ [`EpiNow2`](https://epiforecasts.io/EpiNow2/)
+ [`bpmodels`](https://github.com/epiverse-trace/bpmodels)
* Spillover events
+ By cross-referencing contact data with occupations
* Effectiveness of contact tracing
+ By calculating the proportion of case follow-ups and comparing the delay of disease exposure to the follow-up delay
* Transmission trees
+ [`epicontacts`](https://cran.r-project.org/web/packages/epicontacts/index.html)
+ [`ggplot2`](https://ggplot2.tidyverse.org/reference/ggplot.html)
<br>
Data analysis step | Challenges
------------- | ---------------------------------------
Detection of outliers | No known tools to use
Severity of disease | Censoring
Spillover events | Missing data
<br>
## Scenario 4: Emerging avian influenza in Cambodia
<br>
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics('images/Scenario4smq.png')
```
<br>
### Analytic pipeline for scenario 4 (analysed by group 5)
* Data cleaning
+ [tidyverse](https://www.tidyverse.org)
+ [`readxl`](https://readxl.tidyverse.org) to import data
+ [`dplyr`](https://dplyr.tidyverse.org) to remove names
+ [`lubridate`](https://cran.r-project.org/web/packages/lubridate/index.html) to standardise date formats
+ Manually scanning through excel to check for errors
* Reproduction number
+ [`EpiEstim`](https://cran.r-project.org/web/packages/EpiEstim/index.html)
* Severity of disease
+ Manually using R to detect missing cases
+ [`epiR`](https://cran.r-project.org/web/packages/epiR/index.html) to check for data censoring
<br>
Data analysis step | Challenges
------------- | ---------------------------------------
Data cleaning | No available R packages specific for epidemic data
Reproduction number | Difficulty finding parameter estimations in the literature
Serial interval | Lack of a tool to check for parameter estimates
Severity | Missing cases <br> Need for an R package for systematic censoring analysis
<br>
## Scenario 5: Outbreak of respiratory disease in Canada
<br>
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics('images/Scenario5smq.png')
```
<br>
### Analytic pipeline for scenario 5 (analysed by group 1)
* Define project structure
+ Defining the script's structure with [`cookiecutter`](https://github.com/jacobcvt12/cookiecutter-R-package), [`reportfactory`](https://cran.r-project.org/web/packages/reportfactory/index.html), and [`orderly`](https://cran.rstudio.com/web/packages/orderly/index.html)
+ Ensuring reproducibility of the analysis with [iRODS](https://irods.org) and [Git](https://git-scm.com)
+ Working in a group with [GitHub](https://github.com)
* Data cleaning
+ Importing data with [`readr`](https://cran.r-project.org/web/packages/readr/index.html) or [`rio`](https://cran.r-project.org/web/packages/rio/index.html)
+ Checking for errors with [`linelist`](https://cran.r-project.org/web/packages/linelist/index.html), [`janitor`](https://cran.r-project.org/web/packages/janitor/index.html), [`parsedate`](https://cran.r-project.org/web/packages/parsedate/index.html), [`matchmaker`](https://cran.rstudio.com/web/packages/matchmaker/index.html), or [`lubridate`](https://cran.r-project.org/web/packages/lubridate/index.html)
+ [`janitor`](https://cran.r-project.org/web/packages/janitor/index.html) to eliminate duplicates
+ [`naniar`](https://cran.r-project.org/web/packages/naniar/index.html) to check for missing data
+ [`epitrix`](https://cran.r-project.org/web/packages/epitrix/index.html) to anonymise data
* Delay distributions
+ [`epitrix`](https://cran.r-project.org/web/packages/epitrix/index.html)
+ [`fitdistrplus`](https://cran.r-project.org/web/packages/fitdistrplus/index.html) to fit parameteric distributions to scenario data
* Case demographics
+ [`apyramid`](https://cran.r-project.org/web/packages/apyramid/index.html) to stratify data by age, gender, and health status
* Nowcasting
+ [`incidence2`](https://cran.r-project.org/web/packages/incidence2/vignettes/Introduction.html) to visualise incidence from linelist data
+ [`epiparameter`](https://github.com/epiverse-trace/epiparameter) to extract infectious disease parameter data
+ [`EpiEstim`](https://cran.r-project.org/web/packages/EpiEstim/index.html) or [`EpiNow2`](https://github.com/epiforecasts/EpiNow2) for Rt calculation
* Severity of disease
+ Calculation of hospitalisation and mortality rates- no R package specified
* Zoonotic transmission
+ [`forecast`](https://cran.r-project.org/web/packages/forecast/index.html)
* Generation of reports
+ [`incidence`](https://cran.r-project.org/web/packages/incidence/vignettes/overview.html) for static reports
+ [Quarto](https://quarto.org) and [R markdown](https://rmarkdown.rstudio.com) for dashboards
<br>
Data analysis step | Challenges
------------- | ---------------------------------------
Project structure | Working simultaneously on the same script and managing parallel tasks <br> Anticipating future incoming data in early pipeline design
Data cleaning | Large amount of code lines used on (reasonably) predictable cleaning (e.g. data sense checks) <br> Omitting too many data entries when simply removing _NA_ rows <br> Non standardised data formats <br> Implementing rapid quality check reports before analysis
Delay distributions | Identifying the best method to calculate, or compare functionality of tools <br> Need to fit multiple parametric distributions and return best, and store as usable objects
Severity of disease | Censoring and truncation <br> Underestimation of mild cases <br> Need database of age/gender pyramids for comparisons
Forecasts | Need option for fitting with range of plausible pathogen serial intervals and comparing results <br> Changing reporting delays over time <br> Matching inputs/outputs between packages
Zoonotic transmisison | Need for specific packages with clear documentation <br> How to compare simple trend-based forecasts
## What next?
Scenarios developed by the _100 days_ workshop participants illustrate that there are many commonalities across proposed analytics pipelines, which could support interoperability across different epidemiological questions.
However, there are also several remaining gaps and challenges, which creates an opportunity to build on existing work to tackle common outbreak scenarios, using the issues here as a starting point. This will also require consideration of wider interactions with existing software ecosystems and users of outbreak analytics insights.
We are therefore planning to follow up this vignette with a more detailed perspective article discussing potential for broader progress in developing a 'first 100 lines of code'.