Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence()
+
Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence().
-
Evaluate and extract by position - str_length(), str_sub(), word()
+
Evaluate and extract by position - str_length(), str_sub(), word().
-
Patterns
+
Patterns.
-
Detect and locate - str_detect(), str_subset(), str_match(), str_extract()
+
Detect and locate - str_detect(), str_subset(), str_match(), str_extract().
-
Modify and replace - str_sub(), str_replace_all()
+
Modify and replace - str_sub(), str_replace_all().
-
Regular expressions (“regex”)
+
Regular expressions (“regex”).
For ease of display, most examples act on a short, predefined character vector; they can easily be adapted to a column within a data frame.
This stringr vignette provided much of the inspiration for this page.
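As a quick orientation, the sketch below applies a few of the functions listed above to a short character vector. The vector `symptoms` and its values are hypothetical and only serve as illustration.

```r
library(stringr)

# a short character vector to experiment with (hypothetical values)
symptoms <- c("Fever", "dry cough", "HEADACHE")

upper     <- str_to_upper(symptoms)        # change case
n_chars   <- str_length(symptoms)          # number of characters per element
first_3   <- str_sub(symptoms, 1, 3)       # extract by position
has_cough <- str_detect(symptoms, "cough") # TRUE/FALSE per element

upper
n_chars
first_3
has_cough
```

The same calls can be used on a data frame column, for example inside mutate().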
@@ -866,7 +864,7 @@
Load packages
Import data
-
In this page we will occassionally reference the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export(importing.qmd) page for details).
+
In this page we will occasionally reference the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).
Warning: The `trust` argument of `import()` should be explicit for serialization formats
@@ -883,8 +881,8 @@
Import data
The first 50 rows of the linelist are displayed below.
-
-
+
+
@@ -894,11 +892,11 @@
Import data
10.2 Unite, split, and arrange
This section covers:
-
Using str_c(), str_glue(), and unite() to combine strings
+
Using str_c(), str_glue(), and unite() to combine strings.
-
Using str_order() to arrange strings
+
Using str_order() to arrange strings.
-
Using str_split() and separate() to split strings
+
Using str_split() and separate() to split strings.
@@ -950,21 +948,21 @@
Combine strings
Dynamic strings
Use str_glue() to insert dynamic R code into a string. This is a very useful function for creating dynamic plot captions, as demonstrated below.
-
All content goes between double quotation marks str_glue("")
+
All content goes between double quotation marks str_glue("").
Any dynamic code or references to pre-defined values are placed within curly brackets {} within the double quotation marks. There can be many curly brackets in the same str_glue() command.
-
To display character quotes ’’, use single quotes within the surrounding double quotes (e.g. when providing date format - see example below)
+
To display character quotes ’’, use single quotes within the surrounding double quotes (e.g. when providing date format - see example below).
-
Tip: You can use \n to force a new line
+
Tip: You can use \n to force a new line.
-
Tip: You use format() to adjust date display, and use Sys.Date() to display the current date
+
Tip: You can use format() to adjust date display, and Sys.Date() to display the current date.
A simple example of a dynamic plot caption:
str_glue("Data include {nrow(linelist)} cases and are current to {format(Sys.Date(), '%d %b %Y')}.")
-
Data include 5888 cases and are current to 24 Jul 2024.
+
Data include 5888 cases and are current to 08 Sep 2024.
An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the text is long.
Linelist as of 08 Sep 2024.
Last case hospitalized on 30 Apr 2015.
256 cases are missing date of onset and not shown
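The placeholder style described above can be sketched as follows. The values are fixed, hypothetical strings (rather than values computed from the linelist) so that the result is reproducible.

```r
library(stringr)

# placeholders in the braces are defined as named arguments at the end
caption <- str_glue(
  "Linelist as of {current_date}.\nLast case hospitalized on {last_hospital}.",
  current_date  = "08 Sep 2024",   # hypothetical fixed value
  last_hospital = "30 Apr 2015"    # hypothetical fixed value
)

caption
```

In practice the named arguments would hold R code, such as format(Sys.Date(), '%d %b %Y').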
@@ -992,8 +990,8 @@
Dynamic strings
-
-
+
+
Use str_glue_data(), which is specially made for taking data from data frame rows:
@@ -1033,9 +1031,9 @@
Unite columns
By default, the separator used in the united column is underscore _, but this can be changed with the sep = argument.
-
remove = removes the input columns from the data frame (TRUE by default)
+
remove = removes the input columns from the data frame (TRUE by default).
-
na.rm = removes missing values while uniting (FALSE by default)
+
na.rm = removes missing values while uniting (FALSE by default).
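A minimal sketch of how these arguments interact, assuming a small hypothetical data frame `df_demo` (not the handbook's own example data):

```r
library(tidyr)

# hypothetical mini data frame with two symptom columns
df_demo <- data.frame(
  case_ID = 1:3,
  sym_1   = c("fever", "cough", NA),
  sym_2   = c("chills", NA, "vomit"),
  stringsAsFactors = FALSE
)

united <- unite(
  df_demo,
  col = "all_symptoms",   # name of the new united column
  sym_1, sym_2,           # columns to unite
  sep = ", ",             # separator used in the united column
  remove = TRUE,          # drop the input columns (the default)
  na.rm = TRUE            # skip missing values while uniting
)

united
```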
Below, we define a mini-data frame to demonstrate with:
@@ -1058,8 +1056,8 @@
Unite columns
Here is the example data frame:
-
-
+
+
Below, we unite the three symptom columns:
@@ -1173,24 +1171,24 @@
Split columns
Let’s say we have a simple data frame df (defined and united in the unite section) containing a case_ID column, one character column with many symptoms, and one outcome column. Our goal is to separate the symptoms column into many columns - each one containing one symptom.
-
-
+
+
Assuming the data are piped into separate(), first provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.
-
sep = the separator, can be a character, or a number (interpreted as the character position to split at)
-
remove = FALSE by default, removes the input column
+
sep = the separator, can be a character, or a number (interpreted as the character position to split at).
+
remove = FALSE by default, removes the input column.
-
convert = FALSE by default, will cause string “NA”s to become NA
+
convert = FALSE by default, will cause string “NA”s to become NA.
extra = this controls what happens if there are more values created by the separation than new columns named.
-
extra = "warn" means you will see a warning but it will drop excess values (the default)
+
extra = "warn" means you will see a warning but it will drop excess values (the default).
-
extra = "drop" means the excess values will be dropped with no warning
+
extra = "drop" means the excess values will be dropped with no warning.
-
extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data
+
extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data.
An example with extra = "merge" is below - no data is lost. Two new columns are defined but any third symptoms are left in the second new column:
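A minimal sketch of this behavior, assuming a hypothetical data frame `df_sym` with a comma-separated symptoms column:

```r
library(tidyr)

# hypothetical data: the first row has three symptoms, the second has two
df_sym <- data.frame(
  case_ID  = 1:2,
  symptoms = c("jaundice, fever, chills", "chills, aches"),
  stringsAsFactors = FALSE
)

split_df <- separate(
  df_sym,
  symptoms,                      # column to split
  into  = c("sym_1", "sym_2"),   # only two new columns are named
  sep   = ", ",
  extra = "merge"                # surplus values stay in the last new column
)

split_df
```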
@@ -1438,11 +1436,10 @@
Extract by character position
Use str_sub() to return only a part of a string. The function takes three main arguments:
-
the character vector(s)
+
the character vector(s).
-
start position
-
-
end position
+
start position.
+
end position.
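The three arguments can be sketched on a single string as follows (the word used is arbitrary):

```r
library(stringr)

first_three <- str_sub("pneumonia", 1, 3)    # characters 1 through 3
last_three  <- str_sub("pneumonia", -3, -1)  # negative positions count from the end
beyond_end  <- str_sub("pneumonia", 4, 15)   # an end position past the string stops at the last character

first_three
last_three
beyond_end
```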
A few notes on position numbers:
@@ -1814,11 +1811,6 @@
Subset and cou
-
-
Regex groups
-
UNDER CONSTRUCTION
-
-
10.6 Special characters
@@ -1863,24 +1855,20 @@
Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).
-
-
10.7 Regular expressions (regex)
-
-
-
-
10.8 Regex and special characters
+
+
10.7 Regular expressions (regex) and special characters
Regular expressions, or “regex”, are a concise language for describing patterns in strings. If you are not familiar with it, a regular expression can look like an alien language. Here we try to de-mystify this language a little bit.
Much of this section is adapted from this tutorial and this cheatsheet. We selectively adapt here knowing that this handbook might be viewed by people without internet access to view the other tutorials.
A regular expression is often applied to extract specific patterns from “unstructured” text - for example medical notes, chief complaints, patient history, or other free text columns in a data frame.
There are four basic tools one can use to create a basic regular expression:
-
Character sets
+
Character sets.
-
Meta characters
+
Meta characters.
-
Quantifiers
+
Quantifiers.
-
Groups
+
Groups.
Character sets
Character sets are a way of listing options for a character match, within brackets. A match is triggered if any of the characters within the brackets is found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:
@@ -1961,7 +1949,7 @@
will return instances of two capital A letters.
@@ -1971,7 +1959,7 @@
will return instances of one or more capital A letters (group extended until a different character is encountered).
-
Precede with an * asterisk to return zero or more matches (useful if you are not sure the pattern is present)
+
Follow the character with an * asterisk to return zero or more matches (useful if you are not sure the pattern is present).
Using the + plus symbol as a quantifier, the match will occur until a different character is encountered. For example, this expression will return all words (alpha characters): "[A-Za-z]+"
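The quantifiers described above can be tried with str_extract_all() from stringr. The test string `x` below is hypothetical.

```r
library(stringr)

x <- "AAA patient no. 42, AA batch"   # hypothetical test string

two_As     <- str_extract_all(x, "A{2}")[[1]]       # exactly two capital A letters
one_plus_A <- str_extract_all(x, "A+")[[1]]         # one or more capital A letters
words      <- str_extract_all(x, "[A-Za-z]+")[[1]]  # runs of alpha characters

two_As
one_plus_A
words
```

Note that "A{2}" matches only the first two letters of "AAA", while "A+" extends the match until a different character is encountered.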
@@ -2080,8 +2068,8 @@
tutorial.
-
-
10.9 Resources
+
+
10.8 Resources
A reference sheet for stringr functions can be found here
You can view the first 50 rows of the data frame below. Note: the base R function head(n) allows you to view just the first n rows in the R console.
-
-
+
+
@@ -1537,7 +1537,7 @@
Automatic cl
Manual name cleaning
-
Re-naming columns manually is often necessary, even after the standardization step above. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style NEW = OLD - the new column name is given before the old column name.
+
Re-naming columns manually is often necessary, even after the standardization step above. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style NEW = OLD: the new column name is given before the old column name.
Below, a re-naming command is added to the cleaning pipeline. Spaces have been added strategically to align code for easier reading.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
@@ -1581,9 +1581,10 @@
As a shortcut, you can also rename columns within the dplyr select() and summarise() functions. select() is used to keep only certain columns (and is covered later in this page). summarise() is covered in the Grouping data and Descriptive tables pages. These functions also use the format new_name = old_name. Here is an example:
linelist_raw %>%
-select(# NEW name # OLD name
-date_infection =`infection date`, # rename and KEEP ONLY these columns
-date_hospitalisation =`hosp date`)
+# rename and KEEP ONLY these columns
+select(# NEW name # OLD name
+date_infection =`infection date`,
+date_hospitalisation =`hosp date`)
@@ -1676,37 +1677,35 @@
“tidyselect
Here are other “tidyselect” helper functions that also work within dplyr functions like select(), across(), and summarise():
-
everything() - all other columns not mentioned
+
everything() - all other columns not mentioned.
-
last_col() - the last column
+
last_col() - the last column.
+
where() - applies a function to all columns and selects those which are TRUE.
-
where() - applies a function to all columns and selects those which are TRUE
-
-
contains() - columns containing a character string
+
contains() - columns containing a character string.
-
example: select(contains("time"))
+
example: select(contains("time")).
-
starts_with() - matches to a specified prefix
+
starts_with() - matches to a specified prefix.
-
example: select(starts_with("date_"))
+
example: select(starts_with("date_")).
-
ends_with() - matches to a specified suffix
+
ends_with() - matches to a specified suffix.
-
example: select(ends_with("_post"))
+
example: select(ends_with("_post")).
-
matches() - to apply a regular expression (regex)
+
matches() - to apply a regular expression (regex).
-
example: select(matches("[pt]al"))
-
+
example: select(matches("[pt]al")).
-
num_range() - a numerical range like x01, x02, x03
+
num_range() - a numerical range like x01, x02, x03.
-
any_of() - matches IF column exists but returns no error if it is not found
+
any_of() - matches IF column exists but returns no error if it is not found.
In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
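A few of these helpers are sketched below on a small hypothetical data frame `toy` (the column names are chosen for illustration and are not the linelist's).

```r
library(dplyr)

# hypothetical data frame with an ID, two date columns, and a numeric column
toy <- data.frame(
  case_id      = c("a1", "b2"),
  date_onset   = as.Date(c("2014-05-01", "2014-05-03")),
  date_outcome = as.Date(c("2014-05-10", "2014-05-12")),
  age          = c(2, 31),
  stringsAsFactors = FALSE
)

date_cols    <- names(select(toy, starts_with("date_")))               # prefix match
numeric_cols <- names(select(toy, where(is.numeric)))                  # columns where the function returns TRUE
safe_cols    <- names(select(toy, any_of(c("age", "not_a_column"))))   # no error for the absent column
not_dates    <- names(select(toy, !contains("date")))                  # ! selects the opposite

date_cols
numeric_cols
safe_cols
not_dates
```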
@@ -1868,7 +1867,7 @@
8.7 Column creation and transformation
We recommend using the dplyr function mutate() to add a new column, or to modify an existing one.
-
Below is an example of creating a new column with mutate(). The syntax is: mutate(new_column_name = value or transformation)
+
Below is an example of creating a new column with mutate(). The syntax is: mutate(new_column_name = value or transformation).
In Stata, this is similar to the command generate, but R’s mutate() can also be used to modify an existing column.
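A minimal sketch of both uses of mutate(), assuming a hypothetical two-row data frame `toy` with weight and height columns:

```r
library(dplyr)

toy <- data.frame(wt_kg = c(60.4, 80), ht_cm = c(150, 200))  # hypothetical values

toy2 <- toy %>%
  mutate(
    bmi   = wt_kg / (ht_cm / 100)^2,  # add a new column
    wt_kg = round(wt_kg)              # modify an existing column in place
  )

toy2
```

Statements inside one mutate() are evaluated in order, so bmi above is computed from the unrounded weight.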
New columns
@@ -1896,8 +1895,8 @@
New columns
Review the new columns. For demonstration purposes, only the new columns and the columns used to create them are shown:
-
-
+
+
TIP: A variation on mutate() is the function transmute(). This function adds a new column just like mutate(), but also drops/removes all other columns that you do not mention within its parentheses.
@@ -1983,16 +1982,16 @@
a
Note that within across() we also use the function where() as is.POSIXct is evaluating to either TRUE or FALSE.
-
Note that is.POSIXct() is from the package lubridate. Other similar “is” functions like is.character(), is.numeric(), and is.logical() are from base R
+
Note that is.POSIXct() is from the package lubridate. Other similar “is” functions like is.character(), is.numeric(), and is.logical() are from base R.
across() functions
You can read the documentation with ?across for details on how to provide functions to across(). A few summary points: there are several ways to specify the function(s) to perform on a column and you can even define your own functions:
-
You can provide the function name alone (e.g. mean or as.character)
+
You can provide the function name alone (e.g. mean or as.character).
-
You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page)
+
You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page).
You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))).
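The list form can be sketched with summarise() on a hypothetical data frame `toy`; each function in the list produces one output column per input column.

```r
library(dplyr)

toy <- data.frame(age = c(10, NA, 30), temp = c(36.5, 38, NA))  # hypothetical values

res <- toy %>%
  summarise(across(
    c(age, temp),
    list(
      mean   = ~ mean(.x, na.rm = TRUE),  # purrr-style function
      n_miss = ~ sum(is.na(.x))           # count of missing values
    )
  ))

res
```

By default the output columns are named from the column and the list name, for example age_mean and age_n_miss.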
@@ -2113,12 +2112,12 @@
Add to pipe c
8.8 Re-code values
Here are a few scenarios where you need to re-code (change) values:
-
to edit one specific value (e.g. one date with an incorrect year or format)
+
to edit one specific value (e.g. one date with an incorrect year or format).
-
to reconcile values not spelled the same
-
to create a new column of categorical values
+
to reconcile values not spelled the same.
+
to create a new column of categorical values.
-
to create a new column of numeric categories (e.g. age categories)
+
to create a new column of numeric categories (e.g. age categories).
Specific values
@@ -2188,8 +2187,8 @@
Specific values
By logic
Below we demonstrate how to re-code values in a column using logic and conditions:
-
Using replace(), ifelse() and if_else() for simple logic
-
Using case_when() for more complex logic
+
Using replace(), ifelse() and if_else() for simple logic.
+
Using case_when() for more complex logic.
@@ -2197,16 +2196,18 @@
Simple logic
replace()
To re-code with simple logical criteria, you can use replace() within mutate(). replace() is a function from base R. Use a logic condition to specify the rows to change. The general syntax is:
-
mutate(col_to_change = replace(col_to_change, criteria for rows, new value)).
+
+
mutate(col_to_change =replace(col_to_change, criteria for rows, new value))
+
One common situation to use replace() is changing just one value in one row, using a unique row identifier. Below, the gender is changed to “Female” in the row where the column case_id is “2195”.
-
# Example: change gender of one specific observation to "Female"
-linelist <- linelist %>%
-mutate(gender =replace(gender, case_id =="2195", "Female"))
+
# Example: change gender of one specific observation to "Female"
+linelist <- linelist %>%
+mutate(gender =replace(gender, case_id =="2195", "Female"))
The equivalent command using base R syntax and indexing brackets [ ] is below. It reads as “Change the value of the data frame linelist’s column gender (for the rows where linelist’s column case_id has the value ‘2195’) to ‘Female’”.
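A minimal sketch of that base R form, run on a hypothetical two-row data frame `toy` rather than the full linelist:

```r
# hypothetical stand-in for the linelist
toy <- data.frame(
  case_id = c("2194", "2195"),
  gender  = c("m", "m"),
  stringsAsFactors = FALSE
)

# rows are chosen by the logical test inside the brackets
toy$gender[toy$case_id == "2195"] <- "Female"

toy
```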
ifelse
ifelse(condition, value to return if condition evaluates to TRUE, value to return if condition evaluates to FALSE)
Below, the column source_known is defined. Its value in a given row is set to “known” if the row’s value in column source is not missing. If the value in source is missing, then the value in source_known is set to “unknown”.
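The logic just described can be sketched as follows, assuming a hypothetical data frame `toy` with a source column:

```r
library(dplyr)

toy <- data.frame(source = c("funeral", NA, "other"),  # hypothetical values
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

toy
```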
if_else() is a special version from dplyr that handles dates. Note that if the ‘true’ value is a date, the ‘false’ value must also qualify as a date, hence using the special value NA_real_ instead of just NA.
-
# Create a date of death column, which is NA if patient has not died.
-linelist <- linelist %>%
-mutate(date_death =if_else(outcome =="Death", date_outcome, NA_real_))
+
# Create a date of death column, which is NA if patient has not died.
+linelist <- linelist %>%
+mutate(date_death =if_else(outcome =="Death", date_outcome, NA_real_))
Avoid stringing together many ifelse commands… use case_when() instead! case_when() is much easier to read and you’ll make fewer errors.
@@ -2243,32 +2244,32 @@
Complex logic
case_when() commands consist of statements that have a Right-Hand Side (RHS) and a Left-Hand Side (LHS) separated by a “tilde” ~. The logic criteria are in the left side and the pursuant values are in the right side of each statement. Statements are separated by commas.
For example, here we utilize the columns age and age_unit to create a column age_years:
-
linelist <- linelist %>%
-mutate(age_years =case_when(
- age_unit =="years"~ age, # if age unit is years
- age_unit =="months"~ age/12, # if age unit is months, divide age by 12
-is.na(age_unit) ~ age)) # if age unit is missing, assume years
-# any other circumstance, assign NA (missing)
+
linelist <- linelist %>%
+mutate(age_years =case_when(
+ age_unit =="years"~ age, # if age unit is years
+ age_unit =="months"~ age/12, # if age unit is months, divide age by 12
+is.na(age_unit) ~ age)) # if age unit is missing, assume years
+# any other circumstance, assign NA (missing)
-
As each row in the data is evaluated, the criteria are applied/evaluated in the order the case_when() statements are written - from top-to-bottom. If the top criteria evaluates to TRUE for a given row, the RHS value is assigned, and the remaining criteria are not even tested for that row in the data. Thus, it is best to write the most specific criteria first, and the most general last. A data row that does not meet any of the RHS criteria will be assigned NA.
+
As each row in the data is evaluated, the criteria are applied/evaluated in the order the case_when() statements are written, from top-to-bottom. If the top criterion evaluates to TRUE for a given row, the RHS value is assigned, and the remaining criteria are not even tested for that row in the data. Thus, it is best to write the most specific criteria first, and the most general last. A data row that does not meet any of the LHS criteria will be assigned NA.
Sometimes, you may wish to write a final statement that assigns a value for all other scenarios not described by one of the previous lines. To do this, place TRUE on the left-side, which will capture any row that did not meet any of the previous criteria. The right-side of this statement could be assigned a value like “check me!” or missing.
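A minimal sketch of such a catch-all statement, assuming a hypothetical data frame `toy` in which one row has an unexpected age unit:

```r
library(dplyr)

toy <- data.frame(
  age      = c(24, 6, 3, 40),
  age_unit = c("years", "months", NA, "days"),  # "days" is not handled explicitly
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(age_years = case_when(
    age_unit == "years"  ~ age,
    age_unit == "months" ~ age / 12,
    is.na(age_unit)      ~ age,       # missing unit: assume years
    TRUE                 ~ NA_real_   # any other row, here the "days" row
  ))

toy
```

NA_real_ is used in the final statement so that every right-side value is numeric.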
Below is another example of case_when() used to create a new column with the patient classification, according to a case definition for confirmed and suspect cases:
-
linelist <- linelist %>%
-mutate(case_status =case_when(
-
-# if patient had lab test and it is positive,
-# then they are marked as a confirmed case
- ct_blood <20~"Confirmed",
-
-# given that a patient does not have a positive lab result,
-# if patient has a "source" (epidemiological link) AND has fever,
-# then they are marked as a suspect case
-!is.na(source) & fever =="yes"~"Suspect",
-
-# any other patient not addressed above
-# is marked for follow up
-TRUE~"To investigate"))
+
linelist <- linelist %>%
+mutate(case_status =case_when(
+
+# if patient had lab test and it is positive,
+# then they are marked as a confirmed case
+ ct_blood <20~"Confirmed",
+
+# given that a patient does not have a positive lab result,
+# if patient has a "source" (epidemiological link) AND has fever,
+# then they are marked as a suspect case
+!is.na(source) & fever =="yes"~"Suspect",
+
+# any other patient not addressed above
+# is marked for follow up
+TRUE~"To investigate"))
DANGER: Values on the right-side must all be the same class - either numeric, character, date, logical, etc. To assign missing (NA), you may need to use special variations of NA such as NA_character_, NA_real_ (for numeric or POSIX), and as.Date(NA). Read more in Working with dates.
@@ -2279,33 +2280,33 @@
Missing values
replace_na()
To change missing values (NA) to a specific value, such as “Missing”, use the tidyr function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().
fct_explicit_na() is a function from the forcats package. The forcats package handles columns of class Factor. Factors are R’s way to handle ordered values such as c("First", "Second", "Third") or to set the order that values (e.g. hospitals) appear in tables and plots. See the page on Factors.
If your data are class Factor and you try to convert NA to “Missing” by using replace_na(), you will get this error: invalid factor level, NA generated. You have tried to add “Missing” as a value, when it was not defined as a possible level of the factor, and it was rejected.
The easiest way to solve this is to use the forcats function fct_explicit_na() which converts a column to class factor, and converts NA values to the character “(Missing)”.
A slower alternative would be to add the factor level using fct_expand() and then convert the missing values.
na_if()
To convert a specific value to NA, use dplyr’s na_if(). The command below performs the opposite operation of replace_na(). In the example below, any values of “Missing” in the column hospital are converted to NA.
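The two opposite operations can be sketched side by side on a hypothetical character column (a character column is assumed here, not a factor):

```r
library(dplyr)
library(tidyr)

toy <- data.frame(hospital = c("Port Hospital", NA, "Missing"),  # hypothetical values
                  stringsAsFactors = FALSE)

# NA -> "Missing"
filled <- toy %>%
  mutate(hospital = replace_na(hospital, "Missing"))

# "Missing" -> NA
blanked <- toy %>%
  mutate(hospital = na_if(hospital, "Missing"))

filled
blanked
```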
Note: na_if() cannot be used for logic criteria (e.g. “all values > 99”) - use replace() or case_when() for this:
-
# Convert temperatures above 40 to NA
-linelist <- linelist %>%
-mutate(temp =replace(temp, temp >40, NA))
-
-# Convert onset dates earlier than 1 Jan 2000 to missing
-linelist <- linelist %>%
-mutate(date_onset =replace(date_onset, date_onset >as.Date("2000-01-01"), NA))
+
# Convert temperatures above 40 to NA
+linelist <- linelist %>%
+mutate(temp =replace(temp, temp >40, NA))
+
+# Convert onset dates earlier than 1 Jan 2000 to missing
+linelist <- linelist %>%
+mutate(date_onset =replace(date_onset, date_onset < as.Date("2000-01-01"), NA))
@@ -2314,11 +2315,11 @@
Cleaning di
Create a cleaning dictionary with 3 columns:
-
A “from” column (the incorrect value)
+
A “from” column (the incorrect value).
-
A “to” column (the correct value)
+
A “to” column (the correct value).
-
A column specifying the column for the changes to be applied (or “.global” to apply to all columns)
+
A column specifying the column for the changes to be applied (or “.global” to apply to all columns).
Note: .global dictionary entries will be overridden by column-specific dictionary entries.
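To make the three-column layout concrete, here is a sketch of a dictionary built directly in R. The object name `cleaning_dict_demo`, the column names, and the entries are all hypothetical; they are not the contents of the handbook's cleaning_dict.csv.

```r
# hypothetical cleaning dictionary: one row per value to change
cleaning_dict_demo <- data.frame(
  from = c("m", "f", "yes", "no"),                 # the incorrect (old) value
  to   = c("M", "F", "1", "0"),                    # the correct (new) value
  col  = c("gender", "gender", ".global", ".global"),  # column to change, or ".global" for all columns
  stringsAsFactors = FALSE
)

cleaning_dict_demo
```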
@@ -2335,26 +2336,26 @@
Cleaning di
Import the dictionary file into R. This example can be downloaded via instructions on the Download handbook and data page.
-
cleaning_dict <-import("cleaning_dict.csv")
+
cleaning_dict <-import("cleaning_dict.csv")
Pipe the raw linelist to match_df(), providing the cleaning dictionary data frame to dictionary =. The from = argument should be the name of the dictionary column which contains the “old” values, the to = argument should be the dictionary column which contains the corresponding “new” values, and the by = argument should be the dictionary column which lists the column in which to make the change. Use .global in the by = column to apply a change across all columns. A fourth dictionary column order can be used to specify factor order of new values.
Read more details in the package documentation by running ?match_df. Note this function can take a long time to run for a large dataset.
-
linelist <- linelist %>%# provide or pipe your dataset
- matchmaker::match_df(
-dictionary = cleaning_dict, # name of your dictionary
-from ="from", # column with values to be replaced (default is col 1)
-to ="to", # column with final values (default is col 2)
-by ="col"# column with column names (default is col 3)
- )
+
linelist <- linelist %>%# provide or pipe your dataset
+ matchmaker::match_df(
+dictionary = cleaning_dict, # name of your dictionary
+from ="from", # column with values to be replaced (default is col 1)
+to ="to", # column with final values (default is col 2)
+by ="col"# column with column names (default is col 3)
+ )
Now scroll to the right to see how values have changed - particularly gender (lowercase to uppercase), and all the symptoms columns have been transformed from yes/no to 1/0.
-
-
+
+
Note that your column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. See this online reference for the linelist package for more details.
@@ -2362,62 +2363,62 @@
Cleaning di
Add to pipe chain
Below, some new columns and column transformations are added to the pipe chain.
-
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
-##################################################################################
-
-# begin cleaning pipe chain
-###########################
-linelist <- linelist_raw %>%
-
-# standardize column name syntax
- janitor::clean_names() %>%
-
-# manually re-name columns
-# NEW name # OLD name
-rename(date_infection = infection_date,
-date_hospitalisation = hosp_date,
-date_outcome = date_of_outcome) %>%
-
-# remove column
-select(-c(row_num, merged_header, x28)) %>%
-
-# de-duplicate
-distinct() %>%
-
-# add column
-mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
-
-# convert class of columns
-mutate(across(contains("date"), as.Date),
-generation =as.numeric(generation),
-age =as.numeric(age)) %>%
-
-# add column: delay to hospitalisation
-mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
-
-# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
-###################################################
-
-# clean values of hospital column
-mutate(hospital =recode(hospital,
-# OLD = NEW
-"Mitylira Hopital"="Military Hospital",
-"Mitylira Hospital"="Military Hospital",
-"Military Hopital"="Military Hospital",
-"Port Hopital"="Port Hospital",
-"Central Hopital"="Central Hospital",
-"other"="Other",
-"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
- )) %>%
-
-mutate(hospital =replace_na(hospital, "Missing")) %>%
-
-# create age_years column (from age and age_unit)
-mutate(age_years =case_when(
- age_unit =="years"~ age,
- age_unit =="months"~ age/12,
-is.na(age_unit) ~ age,
-TRUE~NA_real_))
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# remove column
+select(-c(row_num, merged_header, x28)) %>%
+
+# de-duplicate
+distinct() %>%
+
+# add column
+mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
+
+# convert class of columns
+mutate(across(contains("date"), as.Date),
+generation =as.numeric(generation),
+age =as.numeric(age)) %>%
+
+# add column: delay to hospitalisation
+mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+###################################################
+
+# clean values of hospital column
+mutate(hospital =recode(hospital,
+# OLD = NEW
+"Mitylira Hopital"="Military Hospital",
+"Mitylira Hospital"="Military Hospital",
+"Military Hopital"="Military Hospital",
+"Port Hopital"="Port Hospital",
+"Central Hopital"="Central Hospital",
+"other"="Other",
+"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
+ )) %>%
+
+mutate(hospital =replace_na(hospital, "Missing")) %>%
+
+# create age_years column (from age and age_unit)
+mutate(age_years =case_when(
+ age_unit =="years"~ age,
+ age_unit =="months"~ age/12,
+is.na(age_unit) ~ age,
+TRUE~NA_real_))
@@ -2429,38 +2430,38 @@
Add to pipe
8.9 Numeric categories
Here we describe some special approaches for creating categories from numerical columns. Common examples include age categories, groups of lab values, etc. Here we will discuss:
-
age_categories(), from the epikit package
+
age_categories(), from the epikit package.
-
cut(), from base R
+
cut(), from base R.
-
case_when()
+
case_when().
-
quantile breaks with quantile() and ntile()
+
quantile breaks with quantile() and ntile().
Review distribution
For this example we will create an age_cat column using the age_years column.
-
#check the class of the linelist variable age
-class(linelist$age_years)
+
#check the class of the linelist variable age
+class(linelist$age_years)
[1] "numeric"
First, examine the distribution of your data, to make appropriate cut-points. See the page on ggplot basics.
-
# examine the distribution
-hist(linelist$age_years)
+
# examine the distribution
+hist(linelist$age_years)
-
summary(linelist$age_years, na.rm=T)
+
summary(linelist$age_years, na.rm=T)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 6.00 13.00 16.04 23.00 84.00 107
@@ -2480,19 +2481,19 @@
age_catego
First, the simplest example:
-
# Simple example
-################
-pacman::p_load(epikit) # load package
-
-linelist <- linelist %>%
-mutate(
-age_cat =age_categories( # create new column
- age_years, # numeric column to make groups from
-breakers =c(0, 5, 10, 15, 20, # break points
-30, 40, 50, 60, 70)))
-
-# show table
-table(linelist$age_cat, useNA ="always")
+
# Simple example
+################
+pacman::p_load(epikit) # load package
+
+linelist <- linelist %>%
+mutate(
+age_cat =age_categories( # create new column
+ age_years, # numeric column to make groups from
+breakers =c(0, 5, 10, 15, 20, # break points
+30, 40, 50, 60, 70)))
+
+# show table
+table(linelist$age_cat, useNA ="always")
The break values you specify are by default the lower bounds - that is, they are included in the “higher” group / the groups are “open” on the lower/left side. As shown below, you can add 1 to each break value to achieve groups that are open at the top/right.
-
# Include upper ends for the same categories
-############################################
-linelist <- linelist %>%
-mutate(
-age_cat =age_categories(
- age_years,
-breakers =c(0, 6, 11, 16, 21, 31, 41, 51, 61, 71)))
-
-# show table
-table(linelist$age_cat, useNA ="always")
+
# Include upper ends for the same categories
+############################################
+linelist <- linelist %>%
+mutate(
+age_cat =age_categories(
+ age_years,
+breakers =c(0, 6, 11, 16, 21, 31, 41, 51, 61, 71)))
+
+# show table
+table(linelist$age_cat, useNA ="always")
You can adjust how the labels are displayed with separator =. The default is “-”.
You can adjust how the top numbers are handled, with the ceiling = argument. To set an upper cut-off set ceiling = TRUE. In this use, the highest break value provided is a “ceiling” and a category “XX+” is not created. Any values above the highest break value (or to upper =, if defined) are categorized as NA. Below is an example with ceiling = TRUE, so that there is no category of XX+ and values above 70 (the highest break value) are assigned as NA.
-
# With ceiling set to TRUE
-##########################
-linelist <- linelist %>%
-mutate(
-age_cat =age_categories(
- age_years,
-breakers =c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
-ceiling =TRUE)) # 70 is ceiling, all above become NA
-
-# show table
-table(linelist$age_cat, useNA ="always")
+
# With ceiling set to TRUE
+##########################
+linelist <- linelist %>%
+mutate(
+age_cat =age_categories(
+ age_years,
+breakers =c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
+ceiling =TRUE)) # 70 is ceiling, all above become NA
+
+# show table
+table(linelist$age_cat, useNA ="always")
cut() is a base R alternative to age_categories(), but I think you will see why age_categories() was developed to simplify this process. Some notable differences from age_categories() are:
-
You do not need to install/load another package
+
You do not need to install/load another package.
-
You can specify whether groups are open/closed on the right/left
+
You can specify whether groups are open/closed on the right/left.
-
You must provide accurate labels yourself
+
You must provide accurate labels yourself.
-
If you want 0 included in the lowest group you must specify this
+
If you want 0 included in the lowest group you must specify this.
The basic syntax within cut() is to first provide the numeric column to be cut (age_years), and then the breaks argument, which is a numeric vector c() of break points. Using cut(), the resulting column is a factor whose levels follow the order of the break points (add ordered_result = TRUE if you need the class “ordered”).
-
By default, the categorization occurs so that the right/upper side is “open” and inclusive (and the left/lower side is “closed” or exclusive). This is the opposite behavior from the age_categories() function. The default labels use the notation “(A, B]”, which means A is not included but B is. Reverse this behavior by providing the right = TRUE argument.
+
By default, the categorization occurs so that the right/upper break value is included in each group and the left/lower break value is excluded. This is the opposite behavior from the age_categories() function. The default labels use the notation “(A,B]”, which means A is not included but B is. Reverse this behavior by providing the right = FALSE argument.
Thus, by default, “0” values are excluded from the lowest group and categorized as NA! “0” values could be infants coded as age 0, so be careful! To change this, add the argument include.lowest = TRUE so that any “0” values will be included in the lowest group. The automatically-generated label for the lowest category will then be “[A,B]”. Note that if you combine include.lowest = TRUE with right = FALSE, the extreme inclusion will apply to the highest break point value and category, not the lowest.
You can provide a vector of customized labels using the labels = argument. As these are manually written, be very careful to ensure they are accurate! Check your work using cross-tabulation, as described below.
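The inclusion rules described above can be checked on a small vector before applying them to real data. The following is a minimal sketch in base R; the vectors `ages` and `breaks` are made-up illustration values, not columns from the linelist.

```r
# Hypothetical example values (not from the linelist)
ages   <- c(0, 5, 6, 10, 70)
breaks <- c(0, 5, 10, 70)

# Default (right = TRUE): groups are (A,B], so age 0 falls outside and becomes NA
default_cut <- cut(ages, breaks = breaks)

# include.lowest = TRUE: the lowest group becomes [0,5], so age 0 is kept
lowest_cut <- cut(ages, breaks = breaks, include.lowest = TRUE)

# right = FALSE: groups are [A,B), so now the top value 70 falls outside
left_cut <- cut(ages, breaks = breaks, right = FALSE)

# Compare the three assignments side by side
data.frame(ages, default_cut, lowest_cut, left_cut)
```

Reading across the rows of the resulting data frame shows which boundary values move between groups, or drop to NA, under each setting.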
An example of cut() applied to age_years to make the new variable age_cat is below:
-
# Create new variable, by cutting the numeric age variable
-# lower break is excluded but upper break is included in each category
-linelist <- linelist %>%
-mutate(
-age_cat =cut(
- age_years,
-breaks =c(0, 5, 10, 15, 20,
-30, 50, 70, 100),
-include.lowest =TRUE# include 0 in lowest group
- ))
-
-# tabulate the number of observations per group
-table(linelist$age_cat, useNA ="always")
+
# Create new variable, by cutting the numeric age variable
+# lower break is excluded but upper break is included in each category
+linelist <- linelist %>%
+  mutate(
+    age_cat = cut(
+      age_years,
+      breaks = c(0, 5, 10, 15, 20,
+                 30, 50, 70, 100),
+      include.lowest = TRUE   # include 0 in lowest group
+    ))
+
+# tabulate the number of observations per group
+table(linelist$age_cat, useNA ="always")
Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 16-20).
-
# Cross tabulation of the numeric and category columns.
-table("Numeric Values"= linelist$age_years, # names specified in table for clarity.
-"Categories"= linelist$age_cat,
-useNA ="always") # don't forget to examine NA values
+
# Cross tabulation of the numeric and category columns.
+table("Numeric Values"= linelist$age_years, # names specified in table for clarity.
+"Categories"= linelist$age_cat,
+useNA ="always") # don't forget to examine NA values
You may want to assign NA values a label such as “Missing”. Because the new column is class Factor (restricted values), you cannot simply mutate it with replace_na(), as this value will be rejected. Instead, use fct_explicit_na() from forcats (deprecated since forcats 1.0.0 in favor of fct_na_value_to_level()) as explained in the Factors page.
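To see why a plain replacement is rejected, here is a small base R sketch. It is an alternative illustration, not the forcats approach the handbook recommends, and the factor `age_cat` is a made-up example.

```r
# Hypothetical factor with one missing value
age_cat <- factor(c("0-4", "5-9", NA))

# Assigning a value that is not an existing level fails: R warns and keeps NA
rejected <- age_cat
suppressWarnings(rejected[3] <- "Missing")
is.na(rejected[3])

# Base R alternative: turn NA into a level, then rename that level
explicit <- addNA(age_cat)
levels(explicit)[is.na(levels(explicit))] <- "Missing"
table(explicit)
```

The forcats functions named above do the same job in one step and fit more naturally inside a mutate() call.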
For a fast way to make breaks and label vectors, use something like below. See the R basics page for references on seq() and rep().
-
# Make break points from 0 to 90 by 5
-age_seq =seq(from =0, to =90, by =5)
-age_seq
-
-# Make labels for the above categories, assuming default cut() settings
-age_labels =paste0(age_seq +1, "-", age_seq +5)
-age_labels
-
-# check that both vectors are the same length
-length(age_seq) ==length(age_labels)
+
# Make break points from 0 to 90 by 5
+age_seq =seq(from =0, to =90, by =5)
+age_seq
+
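+# Make labels for the above categories, assuming default cut() settings
+# (cut() needs one label per group, i.e. one fewer label than break points)
+age_labels = paste0(head(age_seq, -1) + 1, "-", age_seq[-1])
+age_labels
+
+# check that there is exactly one fewer label than break points
+length(age_labels) == length(age_seq) - 1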
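As a hedged sketch of how such vectors plug into cut(): the labels = argument must contain one label per group, which is one fewer than the number of break points. The names `age_breaks`, `age_labels5` and `ages` below are illustrative only.

```r
# Break points from 0 to 90 by 5 (19 values, defining 18 groups)
age_breaks <- seq(from = 0, to = 90, by = 5)

# One label per group: "1-5", "6-10", ..., "86-90"
age_labels5 <- paste0(head(age_breaks, -1) + 1, "-", age_breaks[-1])

# With include.lowest = TRUE the first group also holds age 0, so relabel it
age_labels5[1] <- "0-5"

# Hypothetical ages, including boundary values
ages <- c(0, 5, 6, 42, 90)

age_groups <- cut(ages,
                  breaks = age_breaks,
                  labels = age_labels5,
                  include.lowest = TRUE)

data.frame(ages, age_groups)
```

Because the labels are written by hand, a cross-tabulation of the numeric values against the groups remains the safest check.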
Read more about cut() in its Help page by entering ?cut in the R console.
@@ -2853,9 +2854,9 @@
Quantile breaks
Set names = FALSE to get an un-named numeric vector
-
quantile(linelist$age_years, # specify numeric vector to work on
-probs =c(0, .25, .50, .75, .90, .95), # specify the percentiles you want
-na.rm =TRUE) # ignore missing values
+
quantile(linelist$age_years, # specify numeric vector to work on
+probs =c(0, .25, .50, .75, .90, .95), # specify the percentiles you want
+na.rm =TRUE) # ignore missing values
0% 25% 50% 75% 90% 95%
0 6 13 23 33 41
@@ -2863,14 +2864,14 @@
Quantile breaks
You can use the results of quantile() as break points in age_categories() or cut(). Below we create a new column deciles using cut(), where the breaks are defined using quantile() on age_years. We display the results using tabyl() from janitor so you can see the percentages (see the Descriptive tables page). Note how they are not exactly 10% in each group.
-
linelist %>%# begin with linelist
-mutate(deciles =cut(age_years, # create new column decile as cut() on column age_years
-breaks =quantile( # define cut breaks using quantile()
- age_years, # operate on age_years
-probs =seq(0, 1, by =0.1), # 0.0 to 1.0 by 0.1
-na.rm =TRUE), # ignore missing values
-include.lowest =TRUE)) %>%# for cut() include age 0
- janitor::tabyl(deciles) # pipe to table to display
+
linelist %>%                                # begin with linelist
+  mutate(deciles = cut(age_years,           # create new column deciles as cut() on column age_years
+    breaks = quantile(                      # define cut breaks using quantile()
+      age_years,                            # operate on age_years
+      probs = seq(0, 1, by = 0.1),          # 0.0 to 1.0 by 0.1
+      na.rm = TRUE),                        # ignore missing values
+    include.lowest = TRUE)) %>%             # for cut() include age 0
+  janitor::tabyl(deciles)                   # pipe to table to display
Another tool to make numeric groups is the dplyr function ntile(), which attempts to break your data into n evenly-sized groups - but be aware that, unlike with quantile(), the same value could appear in more than one group. Provide the numeric vector and then the number of groups. The values in the new column are just group “numbers” (e.g. 1 to 10), not the range of values themselves as when using cut().
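A tiny example makes the “same value in more than one group” caveat concrete. This sketch assumes dplyr is installed; the vector `ages` is invented for illustration.

```r
library(dplyr)

# Hypothetical ages with many ties
ages <- c(1, 1, 1, 1, 2, 2)

# ntile() splits the ranked values into 3 groups of equal size,
# so the four tied 1s cannot all sit in the same group
groups <- ntile(ages, 3)
groups

# Show which ages ended up in which group
split(ages, groups)
```

Here age 1 lands in both group 1 and group 2, which is why group ranges from ntile() can overlap.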
-
# make groups with ntile()
-ntile_data <- linelist %>%
-mutate(even_groups =ntile(age_years, 10))
-
-# make table of counts and proportions by group
-ntile_table <- ntile_data %>%
- janitor::tabyl(even_groups)
-
-# attach min/max values to demonstrate ranges
-ntile_ranges <- ntile_data %>%
-group_by(even_groups) %>%
-summarise(
-min =min(age_years, na.rm=T),
-max =max(age_years, na.rm=T)
- )
+
# make groups with ntile()
+ntile_data <- linelist %>%
+mutate(even_groups =ntile(age_years, 10))
+
+# make table of counts and proportions by group
+ntile_table <- ntile_data %>%
+ janitor::tabyl(even_groups)
+
+# attach min/max values to demonstrate ranges
+ntile_ranges <- ntile_data %>%
+group_by(even_groups) %>%
+summarise(
+min =min(age_years, na.rm=T),
+max =max(age_years, na.rm=T)
+ )
Warning: There were 2 warnings in `summarise()`.
The first warning was:
@@ -2915,8 +2916,8 @@
Evenly-size
! no non-missing arguments to min; returning Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
-
# combine and print - note that values are present in multiple groups
-left_join(ntile_table, ntile_ranges, by ="even_groups")
+
# combine and print - note that values are present in multiple groups
+left_join(ntile_table, ntile_ranges, by ="even_groups")
even_groups n percent valid_percent min max
1 651 0.09851695 0.10013844 0 2
@@ -2943,67 +2944,67 @@
case_when()Add to pipe chain
Below, code to create two categorical age columns is added to the cleaning pipe chain:
-
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
-##################################################################################
-
-# begin cleaning pipe chain
-###########################
-linelist <- linelist_raw %>%
-
-# standardize column name syntax
- janitor::clean_names() %>%
-
-# manually re-name columns
-# NEW name # OLD name
-rename(date_infection = infection_date,
-date_hospitalisation = hosp_date,
-date_outcome = date_of_outcome) %>%
-
-# remove column
-select(-c(row_num, merged_header, x28)) %>%
-
-# de-duplicate
-distinct() %>%
-
-# add column
-mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
-
-# convert class of columns
-mutate(across(contains("date"), as.Date),
-generation =as.numeric(generation),
-age =as.numeric(age)) %>%
-
-# add column: delay to hospitalisation
-mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
-
-# clean values of hospital column
-mutate(hospital =recode(hospital,
-# OLD = NEW
-"Mitylira Hopital"="Military Hospital",
-"Mitylira Hospital"="Military Hospital",
-"Military Hopital"="Military Hospital",
-"Port Hopital"="Port Hospital",
-"Central Hopital"="Central Hospital",
-"other"="Other",
-"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
- )) %>%
-
-mutate(hospital =replace_na(hospital, "Missing")) %>%
-
-# create age_years column (from age and age_unit)
-mutate(age_years =case_when(
- age_unit =="years"~ age,
- age_unit =="months"~ age/12,
-is.na(age_unit) ~ age)) %>%
-
-# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
-###################################################
-mutate(
-# age categories: custom
-age_cat = epikit::age_categories(age_years, breakers =c(0, 5, 10, 15, 20, 30, 50, 70)),
-
-# age categories: 0 to 85 by 5s
-age_cat5 = epikit::age_categories(age_years, breakers =seq(0, 85, 5)))
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# remove column
+select(-c(row_num, merged_header, x28)) %>%
+
+# de-duplicate
+distinct() %>%
+
+# add column
+mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
+
+# convert class of columns
+mutate(across(contains("date"), as.Date),
+generation =as.numeric(generation),
+age =as.numeric(age)) %>%
+
+# add column: delay to hospitalisation
+mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
+
+# clean values of hospital column
+mutate(hospital =recode(hospital,
+# OLD = NEW
+"Mitylira Hopital"="Military Hospital",
+"Mitylira Hospital"="Military Hospital",
+"Military Hopital"="Military Hospital",
+"Port Hopital"="Port Hospital",
+"Central Hopital"="Central Hospital",
+"other"="Other",
+"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
+ )) %>%
+
+mutate(hospital =replace_na(hospital, "Missing")) %>%
+
+# create age_years column (from age and age_unit)
+mutate(age_years =case_when(
+ age_unit =="years"~ age,
+ age_unit =="months"~ age/12,
+is.na(age_unit) ~ age)) %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+###################################################
+mutate(
+# age categories: custom
+age_cat = epikit::age_categories(age_years, breakers =c(0, 5, 10, 15, 20, 30, 50, 70)),
+
+# age categories: 0 to 85 by 5s
+age_cat5 = epikit::age_categories(age_years, breakers =seq(0, 85, 5)))
@@ -3014,12 +3015,12 @@
One-by-one
Adding rows one-by-one manually is tedious but can be done with add_row() from dplyr. Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.
Use .before and .after to specify the placement of the row you want to add. .before = 3 will put the new row before the current 3rd row. The default behavior is to add the row to the end. Columns not specified will be left empty (NA).
The new row number may look strange (“…23”) but the row numbers in the pre-existing rows have changed. So if using the command twice, examine/test the insertion carefully.
@@ -3046,8 +3047,8 @@
Simple filter
In this example, the logical statement is gender == "f", which is asking whether the value in the column gender is equal to “f” (case sensitive).
Before the filter is applied, the number of rows in linelist is nrow(linelist).
-
linelist <- linelist %>%
-filter(gender =="f") # keep only rows where gender is equal to "f"
+
linelist <- linelist %>%
+filter(gender =="f") # keep only rows where gender is equal to "f"
After the filter is applied, the number of rows in linelist is linelist %>% filter(gender == "f") %>% nrow().
@@ -3055,8 +3056,8 @@
Simple filter
Filter out missing values
It is fairly common to want to filter out rows that have missing values. Resist the urge to write filter(!is.na(column) & !is.na(column)) and instead use the tidyr function that is custom-built for this purpose: drop_na(). If run with empty parentheses, it removes rows with any missing values. Alternatively, you can provide names of specific columns to be evaluated for missingness, or use the “tidyselect” helper functions described above.
-
linelist %>%
-drop_na(case_id, age_years) # drop rows with missing values for case_id or age_years
+
linelist %>%
+drop_na(case_id, age_years) # drop rows with missing values for case_id or age_years
See the page on Missing data for many techniques to analyse and manage missingness in your data.
@@ -3065,14 +3066,14 @@
Filter by
In a data frame or tibble, each row will usually have a “row number” that (when seen in R Viewer) appears to the left of the first column. It is not itself a true column in the data, but it can be used in a filter() statement.
To filter based on “row number”, you can use the dplyr function row_number() with empty parentheses as part of a logical filtering statement. Often you will use the %in% operator and a range of numbers as part of that logical statement, as shown below. To see the first N rows, you can also use the base R function head().
-
# View first 100 rows
-linelist %>%head(100) # or use tail() to see the n last rows
-
-# Show row 5 only
-linelist %>%filter(row_number() ==5)
-
-# View rows 2 through 20, and three specific columns
-linelist %>%filter(row_number() %in%2:20) %>%select(date_onset, outcome, age)
+
# View first 100 rows
+linelist %>%head(100) # or use tail() to see the n last rows
+
+# Show row 5 only
+linelist %>%filter(row_number() ==5)
+
+# View rows 2 through 20, and three specific columns
+linelist %>%filter(row_number() %in%2:20) %>%select(date_onset, outcome, age)
You can also convert the row numbers to a true column by piping your data frame to the tibble function rownames_to_column() (do not put anything in the parentheses).
@@ -3085,11 +3086,11 @@
Complex filter
Examine the data
Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this raw dataset. For our analyses, we want to remove entries from this earlier outbreak.
-
hist(linelist$date_onset, breaks =50)
+
hist(linelist$date_onset, breaks =50)
@@ -3105,9 +3106,9 @@
Design the filter
Examine a cross-tabulation to make sure we exclude only the correct rows:
-
table(Hospital = linelist$hospital, # hospital name
-YearOnset = lubridate::year(linelist$date_onset), # year of date_onset
-useNA ="always") # show missing values
+
table(Hospital = linelist$hospital, # hospital name
+YearOnset = lubridate::year(linelist$date_onset), # year of date_onset
+useNA ="always") # show missing values
The nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))) rows with onset in 2012 and 2013 at either hospital A, B, or Port:
+
The rows with onset in 2012 and 2013 at either hospital A, B, or Port: nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01")))
-
Exclude nrow(linelist %>% filter(date_onset < as.Date("2013-06-01"))) rows with onset in 2012 and 2013
-
Exclude nrow(linelist %>% filter(hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset))) rows from Hospitals A & B with missing onset dates
-
-
Do not exclude nrow(linelist %>% filter(!hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset))) other rows with missing onset dates.
+
Exclude rows with onset in 2012 and 2013 nrow(linelist %>% filter(date_onset < as.Date("2013-06-01")))
+
Exclude rows from Hospitals A & B with missing onset dates
+nrow(linelist %>% filter(hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset)))
+
Do not exclude other rows with missing onset dates.
+nrow(linelist %>% filter(!hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset)))
We start with a linelist of nrow(linelist) rows. Here is our filter statement:
-
linelist <- linelist %>%
-# keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
-filter(date_onset >as.Date("2013-06-01") | (is.na(date_onset) &!hospital %in%c("Hospital A", "Hospital B")))
-
-nrow(linelist)
+
linelist <- linelist %>%
+# keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
+filter(date_onset >as.Date("2013-06-01") | (is.na(date_onset) &!hospital %in%c("Hospital A", "Hospital B")))
+
+nrow(linelist)
[1] 6019
When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, and the 10 Port Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.
-
table(Hospital = linelist$hospital, # hospital name
-YearOnset = lubridate::year(linelist$date_onset), # year of date_onset
-useNA ="always") # show missing values
+
table(Hospital = linelist$hospital, # hospital name
+YearOnset = lubridate::year(linelist$date_onset), # year of date_onset
+useNA ="always") # show missing values
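The filter statement needs its explicit is.na() clause because a comparison with a missing date returns NA rather than TRUE or FALSE, and filter() keeps only rows where the condition is TRUE. A minimal base R sketch of that logic, using four invented rows rather than the linelist:

```r
# Hypothetical rows: early outbreak, later outbreak, and two missing onset dates
date_onset <- as.Date(c("2012-08-01", "2014-05-01", NA, NA))
hospital   <- c("Port Hospital", "Port Hospital", "Hospital A", "Port Hospital")

# Comparing a missing date gives NA, not FALSE
after_cutoff <- date_onset > as.Date("2013-06-01")
after_cutoff

# Same condition as the filter statement: the second clause rescues rows with
# a missing onset date, unless they come from Hospital A or B
keep <- after_cutoff |
  (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))
keep
```

Rows where `keep` is FALSE or NA would be dropped by filter(), so only the second and fourth rows survive.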
Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.
-
# dataframe <- filter(dataframe, condition(s) for rows to keep)
-
-linelist <-filter(linelist, !is.na(case_id))
+
# dataframe <- filter(dataframe, condition(s) for rows to keep)
+
+linelist <-filter(linelist, !is.na(case_id))
You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.
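A short sketch of the square-bracket syntax, using a small made-up data frame `df` (not the linelist):

```r
# Hypothetical data frame
df <- data.frame(
  case_id = c("a1", "b2", "c3", "d4"),
  gender  = c("f", "m", "f", NA),
  age     = c(10, 25, 40, 3)
)

# [rows, columns]: rows 2 to 3, two columns selected by name
subset_cells <- df[2:3, c("case_id", "age")]
subset_cells

# Rows meeting a condition, all columns (nothing after the comma).
# which() drops rows where the condition is NA, similar to filter()
females <- df[which(df$gender == "f"), ]
females
```

Without which(), a row whose gender is NA would come back as a row of NA values, which is one reason filter() is often more convenient.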
Often you want to quickly review a few records, for only a few columns. The base R function View() will print a data frame for viewing in your RStudio.
View the linelist in RStudio:
-
View(linelist)
+
View(linelist)
Here are two examples of viewing specific cells (specific rows, and specific columns):
With dplyr functions filter() and select():
Within View(), pipe the dataset to filter() to keep certain rows, and then to select() to keep certain columns. For example, to review onset and hospitalization dates of 3 specific cases:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
-##################################################################################
-
-# begin cleaning pipe chain
-###########################
-linelist <- linelist_raw %>%
-
-# standardize column name syntax
- janitor::clean_names() %>%
-
-# manually re-name columns
-# NEW name # OLD name
-rename(date_infection = infection_date,
-date_hospitalisation = hosp_date,
-date_outcome = date_of_outcome) %>%
-
-# remove column
-select(-c(row_num, merged_header, x28)) %>%
-
-# de-duplicate
-distinct() %>%
-
-# add column
-mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
-
-# convert class of columns
-mutate(across(contains("date"), as.Date),
-generation =as.numeric(generation),
-age =as.numeric(age)) %>%
-
-# add column: delay to hospitalisation
-mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
-
-# clean values of hospital column
-mutate(hospital =recode(hospital,
-# OLD = NEW
-"Mitylira Hopital"="Military Hospital",
-"Mitylira Hospital"="Military Hospital",
-"Military Hopital"="Military Hospital",
-"Port Hopital"="Port Hospital",
-"Central Hopital"="Central Hospital",
-"other"="Other",
-"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
- )) %>%
-
-mutate(hospital =replace_na(hospital, "Missing")) %>%
-
-# create age_years column (from age and age_unit)
-mutate(age_years =case_when(
- age_unit =="years"~ age,
- age_unit =="months"~ age/12,
-is.na(age_unit) ~ age)) %>%
-
-mutate(
-# age categories: custom
-age_cat = epikit::age_categories(age_years, breakers =c(0, 5, 10, 15, 20, 30, 50, 70)),
-
-# age categories: 0 to 85 by 5s
-age_cat5 = epikit::age_categories(age_years, breakers =seq(0, 85, 5))) %>%
-
-# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
-###################################################
-filter(
-# keep only rows where case_id is not missing
-!is.na(case_id),
-
-# also filter to keep only the second outbreak
- date_onset >as.Date("2013-06-01") | (is.na(date_onset) &!hospital %in%c("Hospital A", "Hospital B")))
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# remove column
+select(-c(row_num, merged_header, x28)) %>%
+
+# de-duplicate
+distinct() %>%
+
+# add column
+mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
+
+# convert class of columns
+mutate(across(contains("date"), as.Date),
+generation =as.numeric(generation),
+age =as.numeric(age)) %>%
+
+# add column: delay to hospitalisation
+mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
+
+# clean values of hospital column
+mutate(hospital =recode(hospital,
+# OLD = NEW
+"Mitylira Hopital"="Military Hospital",
+"Mitylira Hospital"="Military Hospital",
+"Military Hopital"="Military Hospital",
+"Port Hopital"="Port Hospital",
+"Central Hopital"="Central Hospital",
+"other"="Other",
+"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
+ )) %>%
+
+mutate(hospital =replace_na(hospital, "Missing")) %>%
+
+# create age_years column (from age and age_unit)
+mutate(age_years =case_when(
+ age_unit =="years"~ age,
+ age_unit =="months"~ age/12,
+is.na(age_unit) ~ age)) %>%
+
+mutate(
+# age categories: custom
+age_cat = epikit::age_categories(age_years, breakers =c(0, 5, 10, 15, 20, 30, 50, 70)),
+
+# age categories: 0 to 85 by 5s
+age_cat5 = epikit::age_categories(age_years, breakers =seq(0, 85, 5))) %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+###################################################
+filter(
+# keep only rows where case_id is not missing
+!is.na(case_id),
+
+# also filter to keep only the second outbreak
+ date_onset >as.Date("2013-06-01") | (is.na(date_onset) &!hospital %in%c("Hospital A", "Hospital B")))
@@ -3286,11 +3288,11 @@
Add to pipe
8.12 Row-wise calculations
If you want to perform a calculation within a row, you can use rowwise() from dplyr. See this online vignette on row-wise calculations. For example, this code applies rowwise() and then creates a new column that sums the number of the specified symptom columns that have value “yes”, for each row in the linelist. The columns are specified within sum() by name within a vector c(). rowwise() is essentially a special kind of group_by(), so it is best to use ungroup() when you are done (page on Grouping data).
As you specify the column to evaluate, you may want to use the “tidyselect” helper functions described in the select() section of this page. You just have to make one adjustment (because you are not using them within a dplyr function like select() or summarise()).
Put the column-specification criteria within the dplyr function c_across(). This is because c_across (documentation) is designed to work with rowwise() specifically. For example, the following code:
-
Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns)
+
Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns).
Creates new column num_NA_dates, defined for each row as the number of columns (with name containing “date”) for which is.na() evaluated to TRUE (they are missing data).
-
ungroup() to remove the effects of rowwise() for subsequent steps
+
ungroup() to remove the effects of rowwise() for subsequent steps.
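The three steps above can be sketched on a small made-up data frame. This assumes dplyr is installed; the tibble `dates` and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical data with missing dates
dates <- tibble(
  case_id      = c("a1", "b2", "c3"),
  date_onset   = as.Date(c("2014-05-01", NA, "2014-06-10")),
  date_outcome = as.Date(c(NA, NA, "2014-06-20"))
)

dates <- dates %>%
  rowwise() %>%                                             # work within each row
  mutate(num_NA_dates = sum(is.na(c_across(contains("date"))))) %>%
  ungroup()                                                 # remove row-wise grouping

dates
```

Each row gets its own count of missing date columns, rather than one total for the whole data frame.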
Sorting data with arrange() is particularly useful when making Tables for presentation, using slice() to take the “top” rows per group, or setting factor level order by order of appearance.
For example, to sort our linelist rows by hospital, then by date_onset in descending order, we would use:
# assign the current time to a column
-time_now <-Sys.time()
-time_now
+
# assign the current time to a column
+time_now <-Sys.time()
+time_now
-
[1] "2024-07-24 13:57:25 PDT"
-
-
# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
-time_london_real <-with_tz(time_now, "Europe/London")
-
-# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
-time_london_local <-force_tz(time_now, "Europe/London")
-
-
-# note that as long as the computer that was used to run this code is NOT set to London time,
-# there will be a difference in the times
-# (the number of hours difference from the computers time zone to london)
-time_london_real - time_london_local
+
[1] "2024-09-08 11:03:47 BST"
+
+
# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
+time_london_real <-with_tz(time_now, "Europe/London")
+
+# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
+time_london_local <-force_tz(time_now, "Europe/London")
+
+
+# note that as long as the computer that was used to run this code is NOT set to London time,
+# there will be a difference in the times
+# (the number of hours difference from the computers time zone to london)
+time_london_real - time_london_local
-
Time difference of 8 hours
+
Time difference of 0 secs
This may seem largely abstract, and is often not needed if the user isn’t working across time zones.
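Because the output above depends on the time zone of the computer that ran it, a fixed example can be easier to follow. This sketch assumes lubridate is installed and uses an arbitrary time defined in UTC.

```r
library(lubridate)

# A fixed example time in UTC (chosen arbitrarily so the result is reproducible)
time_utc <- ymd_hms("2024-01-15 12:00:00", tz = "UTC")

# with_tz(): the same moment, shown on the New York clock (07:00)
time_ny_real <- with_tz(time_utc, "America/New_York")

# force_tz(): the same clock reading (12:00), relabelled as New York time,
# which is a different moment
time_ny_local <- force_tz(time_utc, "America/New_York")

time_ny_real == time_utc                              # TRUE: the moment is unchanged
difftime(time_ny_local, time_utc, units = "hours")    # the moment has shifted
```

In short, with_tz() changes how a moment is displayed, while force_tz() changes which moment the value refers to.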
@@ -1434,39 +1420,39 @@
-
-
+
+
-
When using lag() or lead() the order of rows in the dataframe is very important! - pay attention to whether your dates/numbers are ascending or descending
+
When using lag() or lead() the order of rows in the dataframe is very important! - pay attention to whether your dates/numbers are ascending or descending.
First, create a new column containing the value of the previous (lagged) week.
-
Control the number of units back/forward with n = (must be a non-negative integer)
+
Control the number of units back/forward with n = (must be a non-negative integer).
Use default = to define the value placed in non-existing rows (e.g. the first row for which there is no lagged value). By default this is NA.
-
Use order_by = TRUE if your the rows are not ordered by your reference column
+
Use order_by = to name the column that defines the row order, if your rows are not already sorted by that reference column.
-
counts <- counts %>%
-mutate(cases_prev_wk =lag(cases_wk, n =1))
+
counts <- counts %>%
+mutate(cases_prev_wk =lag(cases_wk, n =1))
-
-
+
+
Next, create a new column which is the difference between the two cases columns:
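A minimal sketch of both steps on made-up weekly counts, assuming dplyr is installed. The data frame and the column name `case_diff` are illustrative.

```r
library(dplyr)

# Hypothetical weekly case counts
counts <- tibble(
  week     = 1:4,
  cases_wk = c(10, 14, 9, 20)
)

counts <- counts %>%
  arrange(week) %>%                            # make sure rows are in week order first
  mutate(
    cases_prev_wk = lag(cases_wk, n = 1),      # value from the previous row
    case_diff     = cases_wk - cases_prev_wk   # change from the previous week
  )

counts
```

The first week has no previous row, so both new columns are NA there unless default = is set.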
Likewise, if we make a bar plot, the values also appear in this order on the x-axis (see the ggplot basics page for more on ggplot2 - the most common visualization package in R).
The package forcats offers useful functions to easily adjust the order of a factor’s levels (after a column has been defined as class factor):
These functions can be applied to a factor column in two contexts:
-
To the column in the data frame, as usual, so the transformation is available for any subsequent use of the data
+
To the column in the data frame, as usual, so the transformation is available for any subsequent use of the data.
-
Inside of a plot, so that the change is applied only within the plot
+
Inside of a plot, so that the change is applied only within the plot.
Manually
The function fct_relevel() is used to manually order the factor levels. If used on a non-factor column, the column will first be converted to class factor.
Within the parentheses first provide the factor column name, then provide either:
-
All the levels in the desired order (as a character vector c()), or
+
All the levels in the desired order (as a character vector c()), or,
-
One level and it’s corrected placement using the after = argument
+
One level and its correct placement using the after = argument.
Here is an example of redefining the column delay_cat (which is already class Factor) and specifying all the desired order of levels.
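Both options can be sketched on a stand-alone factor. This assumes forcats is installed; the vector `delay_cat` below is a made-up stand-in for the linelist column, with its starting level order set explicitly.

```r
library(forcats)

# Hypothetical factor with an unhelpful starting order
delay_cat <- factor(
  c("2-5 days", "<2 days", ">5 days", "2-5 days"),
  levels = c("2-5 days", "<2 days", ">5 days")
)

# Option 1: provide all the levels in the desired order
delay_all <- fct_relevel(delay_cat, "<2 days", "2-5 days", ">5 days")
levels(delay_all)

# Option 2: move a single level, here to the very end with after = Inf
delay_one <- fct_relevel(delay_cat, "2-5 days", after = Inf)
levels(delay_one)
```

Inside a pipe chain, the same call would sit within mutate() to overwrite or create the column.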
@@ -992,11 +992,11 @@
Within a plot
Below, two plots are created with ggplot() (see the ggplot basics page). In the first, the delay_cat column is mapped to the x-axis of the plot, with its default level order as in the data linelist. In the second example it is wrapped within fct_relevel() and the order is changed in the plot.
# Alpha-numeric default order - no adjustment within ggplot
-ggplot(data = linelist)+
+ggplot(data = linelist) +geom_bar(mapping =aes(x = delay_cat))# Factor level order adjusted within ggplot
-ggplot(data = linelist)+
+ggplot(data = linelist) +geom_bar(mapping =aes(x =fct_relevel(delay_cat, c("<2 days", "2-5 days", ">5 days"))))
@@ -1026,14 +1026,14 @@
By frequency
This function can be used within a ggplot(), as shown below.
# ordered by frequency
-ggplot(data = linelist, aes(x =fct_infreq(delay_cat)))+
-geom_bar()+
+ggplot(data = linelist, aes(x =fct_infreq(delay_cat))) +
+geom_bar() +labs(x ="Delay onset to admission (days)",title ="Ordered by frequency")# reversed frequency
-ggplot(data = linelist, aes(x =fct_rev(fct_infreq(delay_cat))))+
-geom_bar()+
+ggplot(data = linelist, aes(x =fct_rev(fct_infreq(delay_cat)))) +
+geom_bar() +labs(x ="Delay onset to admission (days)",title ="Reverse of order by frequency")
@@ -1063,26 +1063,26 @@
# boxplots ordered by original factor levels
-ggplot(data = linelist)+
+ggplot(data = linelist) +
geom_boxplot(
aes(x = delay_cat,
y = ct_blood,
-fill = delay_cat))+
+fill = delay_cat)) +
labs(x = "Delay onset to admission (days)",
-title = "Ordered by original alpha-numeric levels")+
-theme_classic()+
+title = "Ordered by original alpha-numeric levels") +
+theme_classic() +
theme(legend.position = "none")
# boxplots ordered by median CT value
-ggplot(data = linelist)+
+ggplot(data = linelist) +
geom_boxplot(
aes(x = fct_reorder(delay_cat, ct_blood, "median"),
y = ct_blood,
-fill = delay_cat))+
+fill = delay_cat)) +
labs(x = "Delay onset to admission (days)",
-title = "Ordered by median CT value in group")+
-theme_classic()+
+title = "Ordered by median CT value in group") +
+theme_classic() +
theme(legend.position = "none")
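The reordering performed by fct_reorder() can also be checked outside of a plot. The sketch below uses invented vectors (`grp` and `val` are hypothetical, standing in for delay_cat and ct_blood) and passes the summary function through the `.fun` argument.

```r
library(forcats)

# Hypothetical toy data: group medians are a = 11, b = 2, c = 6
grp <- factor(c("a", "a", "b", "b", "c", "c"))
val <- c(10, 12, 1, 3, 5, 7)

# Levels are sorted by ascending group median of val
by_median <- fct_reorder(grp, val, .fun = median)
levels(by_median)
```

Because the group medians are 2, 6 and 11, the levels come back as "b", "c", "a", which is the left-to-right order the box plots would take.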
@@ -1113,12 +1113,12 @@
By “end” value
hospital )
-ggplot(data = epidemic_data)+ # start plot
+ggplot(data = epidemic_data) + # start plot
geom_line( # make lines
aes(
x = epiweek, # x-axis epiweek
y = n, # height is number of cases per week
-color = fct_reorder2(hospital, epiweek, n)))+ # data grouped and colored by hospital, with factor order by height at end of plot
+color = fct_reorder2(hospital, epiweek, n))) + # data grouped and colored by hospital, with factor order by height at end of plot
labs(title = "Factor levels (and legend display) by line height at end of plot",
color = "Hospital") # change legend title
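As a rough illustration of the "end value" ordering, the sketch below runs fct_reorder2() on a tiny invented data frame (`toy`, with hospitals "A" and "B", is hypothetical and not derived from the linelist). Hospital B has the higher count in the final week, so it should become the first level.

```r
library(forcats)

# Hypothetical weekly counts for two hospitals
toy <- data.frame(
  epiweek  = rep(1:3, times = 2),
  hospital = factor(rep(c("A", "B"), each = 3)),
  n        = c(5, 4, 3,   # hospital A ends at 3
               1, 2, 9)   # hospital B ends at 9
)

# Order levels by the n value found at the largest epiweek, highest first
end_order <- fct_reorder2(toy$hospital, toy$epiweek, toy$n)
levels(end_order)
```

The levels change from "A", "B" to "B", "A", so a legend built from this factor would list B first, matching the vertical order of the lines at the right edge of the plot.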
@@ -1162,7 +1162,7 @@
Manually
You can adjust the level displays manually with fct_recode(). This is like the dplyr function recode() (see Cleaning data and core functions), but it allows the creation of new factor levels. If you use the simple recode() on a factor, new re-coded values will be rejected unless they have already been set as permissible levels.
This tool can also be used to “combine” levels, by assigning multiple levels the same re-coded value. Just be careful to not lose information! Consider doing these combining steps in a new column (not over-writing the existing column).
-
fct_recode() has a different syntax than recode(). recode() uses OLD = NEW, whereas fct_recode() uses NEW = OLD.
+
DANGER: fct_recode() has a different syntax than recode(). recode() uses OLD = NEW, whereas fct_recode() uses NEW = OLD.
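The difference in argument direction is easy to get wrong, so here is a minimal side-by-side sketch. The vector `x` is a hypothetical example; recode() is applied to the character vector and fct_recode() to a factor built from it.

```r
library(dplyr)
library(forcats)

# Hypothetical toy values
x <- c("<2 days", "2-5 days", ">5 days")

# dplyr::recode(): OLD = NEW (existing value on the left)
recoded_chr <- recode(x, "<2 days" = "Less than 2 days")

# forcats::fct_recode(): NEW = OLD (new label on the left)
recoded_fct <- fct_recode(factor(x, levels = x),
                          "Less than 2 days" = "<2 days")

recoded_chr
levels(recoded_fct)
```

In both results "<2 days" is replaced by "Less than 2 days" and the other two values are left untouched; only the side of the `=` on which the old value sits differs.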
The current levels of delay_cat are:
levels(linelist$delay_cat)
@@ -1257,15 +1257,15 @@
In plots
In a ggplot() figure, simply add the argument drop = FALSE in the relevant scale_xxxx() function. All factor levels will be displayed, regardless of whether they are present in the data. If your factor column levels are displayed using fill =, then include drop = FALSE in scale_fill_discrete(), as shown below. If your levels are displayed with x = (on the x-axis), color =, or size =, provide drop = FALSE to scale_x_discrete(), scale_color_discrete(), or scale_size_discrete() accordingly.
This example is a stacked bar plot of age category, by hospital. Adding scale_fill_discrete(drop = FALSE) ensures that all age groups appear in the legend, even if not present in the data.
-
ggplot(data = linelist)+
+
ggplot(data = linelist) +
geom_bar(mapping = aes(x = hospital, fill = age_cat)) +
-scale_fill_discrete(drop = FALSE)+ # show all age groups in the legend, even those not present
+scale_fill_discrete(drop = FALSE) + # show all age groups in the legend, even those not present
labs(title = "All age groups will appear in legend, even if not present in data")
@@ -1289,8 +1289,8 @@
Epiweeks in
linelist %>%
mutate(epiweek_date = floor_date(date_onset, "week")) %>% # create week column
-ggplot()+ # begin ggplot
-geom_histogram(mapping = aes(x = epiweek_date))+ # histogram of date of onset
+ggplot() + # begin ggplot
+geom_histogram(mapping = aes(x = epiweek_date)) + # histogram of date of onset
scale_x_date(date_labels = "%Y-W%W") # adjust display of dates to be YYYY-WWw
diff --git a/html_outputs/new_pages/factors_files/figure-html/unnamed-chunk-29-1.png b/html_outputs/new_pages/factors_files/figure-html/unnamed-chunk-29-1.png
index 54543386..f913527c 100644
Binary files a/html_outputs/new_pages/factors_files/figure-html/unnamed-chunk-29-1.png and b/html_outputs/new_pages/factors_files/figure-html/unnamed-chunk-29-1.png differ
diff --git a/html_outputs/search.json b/html_outputs/search.json
index 7478a4e2..0f41e3a2 100644
--- a/html_outputs/search.json
+++ b/html_outputs/search.json
@@ -1011,7 +1011,7 @@
"href": "new_pages/factors.html#preparation",
"title": "11 Factors",
"section": "",
- "text": "Load packages\nThis code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.\n\npacman::p_load(\n rio, # import/export\n here, # filepaths\n lubridate, # working with dates\n forcats, # factors\n aweek, # create epiweeks with automatic factor levels\n janitor, # tables\n tidyverse # data mgmt and viz\n )\n\n\n\nImport data\nWe import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).\n\n# import your dataset\nlinelist <- import(\"linelist_cleaned.rds\")\n\n\n\nNew categorical variable\nFor demonstration in this page we will use a common scenario - the creation of a new categorical variable.\nNote that if you convert a numeric column to class factor, you will not be able to calculate numeric statistics on it.\n\nCreate column\nWe use the existing column days_onset_hosp (days from symptom onset to hospital admission) and create a new column delay_cat by classifying each row into one of several categories. We do this with the dplyr function case_when(), which sequentially applies logical criteria (right-side) to each row and returns the corresponding left-side value for the new column delay_cat. 
Read more about case_when() in Cleaning data and core functions.\n\nlinelist <- linelist %>% \n mutate(delay_cat = case_when(\n # criteria # new value if TRUE\n days_onset_hosp < 2 ~ \"<2 days\",\n days_onset_hosp >= 2 & days_onset_hosp < 5 ~ \"2-5 days\",\n days_onset_hosp >= 5 ~ \">5 days\",\n is.na(days_onset_hosp) ~ NA_character_,\n TRUE ~ \"Check me\")) \n\n\n\nDefault value order\nAs created with case_when(), the new column delay_cat is a categorical column of class Character - not yet a factor. Thus, in a frequency table, we see that the unique values appear in a default alpha-numeric order - an order that does not make much intuitive sense:\n\ntable(linelist$delay_cat, useNA = \"always\")\n\n\n <2 days >5 days 2-5 days <NA> \n 2990 602 2040 256 \n\n\nLikewise, if we make a bar plot, the values also appear in this order on the x-axis (see the ggplot basics page for more on ggplot2 - the most common visualization package in R).\n\nggplot(data = linelist)+\n geom_bar(mapping = aes(x = delay_cat))",
+ "text": "Load packages\nThis code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.\n\npacman::p_load(\n rio, # import/export\n here, # filepaths\n lubridate, # working with dates\n forcats, # factors\n aweek, # create epiweeks with automatic factor levels\n janitor, # tables\n tidyverse # data mgmt and viz\n )\n\n\n\nImport data\nWe import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).\n\n\nWarning: The `trust` argument of `import()` should be explicit for serialization formats\nas of rio 1.0.3.\nℹ Missing `trust` will be set to FALSE by default for RDS in 2.0.0.\nℹ The deprecated feature was likely used in the rio package.\n Please report the issue at <https://github.com/gesistsa/rio/issues>.\n\n\n\n# import your dataset\nlinelist <- import(\"linelist_cleaned.rds\")\n\n\n\nNew categorical variable\nFor demonstration in this page we will use a common scenario - the creation of a new categorical variable.\nNote that if you convert a numeric column to class factor, you will not be able to calculate numeric statistics on it.\n\nCreate column\nWe use the existing column days_onset_hosp (days from symptom onset to hospital admission) and create a new column delay_cat by classifying each row into one of several categories. We do this with the dplyr function case_when(), which sequentially applies logical criteria (right-side) to each row and returns the corresponding left-side value for the new column delay_cat. 
Read more about case_when() in Cleaning data and core functions.\n\nlinelist <- linelist %>% \n mutate(delay_cat = case_when(\n # criteria # new value if TRUE\n days_onset_hosp < 2 ~ \"<2 days\",\n days_onset_hosp >= 2 & days_onset_hosp < 5 ~ \"2-5 days\",\n days_onset_hosp >= 5 ~ \">5 days\",\n is.na(days_onset_hosp) ~ NA_character_,\n TRUE ~ \"Check me\")) \n\n\n\nDefault value order\nAs created with case_when(), the new column delay_cat is a categorical column of class Character - not yet a factor. Thus, in a frequency table, we see that the unique values appear in a default alpha-numeric order - an order that does not make much intuitive sense:\n\ntable(linelist$delay_cat, useNA = \"always\")\n\n\n <2 days >5 days 2-5 days <NA> \n 2990 602 2040 256 \n\n\nLikewise, if we make a bar plot, the values also appear in this order on the x-axis (see the ggplot basics page for more on ggplot2 - the most common visualization package in R).\n\nggplot(data = linelist) +\n geom_bar(mapping = aes(x = delay_cat))",
"crumbs": [
"Data Management",
"11Factors"
@@ -1022,7 +1022,7 @@
"href": "new_pages/factors.html#convert-to-factor",
"title": "11 Factors",
"section": "11.2 Convert to factor",
- "text": "11.2 Convert to factor\nTo convert a character or numeric column to class factor, you can use any function from the forcats package (many are detailed below). They will convert to class factor and then also perform or allow certain ordering of the levels - for example using fct_relevel() lets you manually specify the level order. The function as_factor() simply converts the class without any further capabilities.\nThe base R function factor() converts a column to factor and allows you to manually specify the order of the levels, as a character vector to its levels = argument.\nBelow we use mutate() and fct_relevel() to convert the column delay_cat from class character to class factor. The column delay_cat is created in the Preparation section above.\n\nlinelist <- linelist %>%\n mutate(delay_cat = fct_relevel(delay_cat))\n\nThe unique “values” in this column are now considered “levels” of the factor. The levels have an order, which can be printed with the base R function levels(), or alternatively viewed in a count table via table() from base R or tabyl() from janitor. By default, the order of the levels will be alpha-numeric, as before. Note that NA is not a factor level.\n\nlevels(linelist$delay_cat)\n\n[1] \"<2 days\" \">5 days\" \"2-5 days\"\n\n\nThe function fct_relevel() has the additional utility of allowing you to manually specify the level order. Simply write the level values in order, in quotation marks, separated by commas, as shown below. Note that the spelling must exactly match the values. 
If you want to create levels that do not exist in the data, use fct_expand() instead).\n\nlinelist <- linelist %>%\n mutate(delay_cat = fct_relevel(delay_cat, \"<2 days\", \"2-5 days\", \">5 days\"))\n\nWe can now see that the levels are ordered, as specified in the previous command, in a sensible order.\n\nlevels(linelist$delay_cat)\n\n[1] \"<2 days\" \"2-5 days\" \">5 days\" \n\n\nNow the plot order makes more intuitive sense as well.\n\nggplot(data = linelist)+\n geom_bar(mapping = aes(x = delay_cat))",
+ "text": "11.2 Convert to factor\nTo convert a character or numeric column to class factor, you can use any function from the forcats package (many are detailed below). They will convert to class factor and then also perform or allow certain ordering of the levels - for example using fct_relevel() lets you manually specify the level order. The function as_factor() simply converts the class without any further capabilities.\nThe base R function factor() converts a column to factor and allows you to manually specify the order of the levels, as a character vector to its levels = argument.\nBelow we use mutate() and fct_relevel() to convert the column delay_cat from class character to class factor. The column delay_cat is created in the Preparation section above.\n\nlinelist <- linelist %>%\n mutate(delay_cat = fct_relevel(delay_cat))\n\nThe unique “values” in this column are now considered “levels” of the factor. The levels have an order, which can be printed with the base R function levels(), or alternatively viewed in a count table via table() from base R or tabyl() from janitor. By default, the order of the levels will be alpha-numeric, as before. Note that NA is not a factor level.\n\nlevels(linelist$delay_cat)\n\n[1] \"<2 days\" \">5 days\" \"2-5 days\"\n\n\nThe function fct_relevel() has the additional utility of allowing you to manually specify the level order. Simply write the level values in order, in quotation marks, separated by commas, as shown below. Note that the spelling must exactly match the values. 
If you want to create levels that do not exist in the data, use fct_expand() instead).\n\nlinelist <- linelist %>%\n mutate(delay_cat = fct_relevel(delay_cat, \"<2 days\", \"2-5 days\", \">5 days\"))\n\nWe can now see that the levels are ordered, as specified in the previous command, in a sensible order.\n\nlevels(linelist$delay_cat)\n\n[1] \"<2 days\" \"2-5 days\" \">5 days\" \n\n\nNow the plot order makes more intuitive sense as well.\n\nggplot(data = linelist) +\n geom_bar(mapping = aes(x = delay_cat))",
"crumbs": [
"Data Management",
"11Factors"
@@ -1044,7 +1044,7 @@
"href": "new_pages/factors.html#fct_adjust",
"title": "11 Factors",
"section": "11.4 Adjust level order",
- "text": "11.4 Adjust level order\nThe package forcats offers useful functions to easily adjust the order of a factor’s levels (after a column been defined as class factor):\nThese functions can be applied to a factor column in two contexts:\n\nTo the column in the data frame, as usual, so the transformation is available for any subsequent use of the data\n\nInside of a plot, so that the change is applied only within the plot\n\n\nManually\nThis function is used to manually order the factor levels. If used on a non-factor column, the column will first be converted to class factor.\nWithin the parentheses first provide the factor column name, then provide either:\n\nAll the levels in the desired order (as a character vector c()), or\n\nOne level and it’s corrected placement using the after = argument\n\nHere is an example of redefining the column delay_cat (which is already class Factor) and specifying all the desired order of levels.\n\n# re-define level order\nlinelist <- linelist %>% \n mutate(delay_cat = fct_relevel(delay_cat, c(\"<2 days\", \"2-5 days\", \">5 days\")))\n\nIf you only want to move one level, you can specify it to fct_relevel() alone and give a number to the after = argument to indicate where in the order it should be. For example, the command below shifts “<2 days” to the second position:\n\n# re-define level order\nlinelist %>% \n mutate(delay_cat = fct_relevel(delay_cat, \"<2 days\", after = 1)) %>% \n tabyl(delay_cat)\n\n\n\nWithin a plot\nThe forcats commands can be used to set the level order in the data frame, or only within a plot. By using the command to “wrap around” the column name within the ggplot() plotting command, you can reverse/relevel/etc. the transformation will only apply within that plot.\nBelow, two plots are created with ggplot() (see the ggplot basics page). In the first, the delay_cat column is mapped to the x-axis of the plot, with it’s default level order as in the data linelist. 
In the second example it is wrapped within fct_relevel() and the order is changed in the plot.\n\n# Alpha-numeric default order - no adjustment within ggplot\nggplot(data = linelist)+\n geom_bar(mapping = aes(x = delay_cat))\n\n# Factor level order adjusted within ggplot\nggplot(data = linelist)+\n geom_bar(mapping = aes(x = fct_relevel(delay_cat, c(\"<2 days\", \"2-5 days\", \">5 days\"))))\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNote that default x-axis title is now quite complicated - you can overwrite this title with the ggplot2 labs() argument.\n\n\nReverse\nIt is rather common that you want to reverse the level order. Simply wrap the factor with fct_rev().\nNote that if you want to reverse only a plot legend but not the actual factor levels, you can do that with guides() (see ggplot tips).\n\n\nBy frequency\nTo order by frequency that the value appears in the data, use fct_infreq(). Any missing values (NA) will automatically be included at the end, unless they are converted to an explicit level (see this section). You can reverse the order by further wrapping with fct_rev().\nThis function can be used within a ggplot(), as shown below.\n\n# ordered by frequency\nggplot(data = linelist, aes(x = fct_infreq(delay_cat)))+\n geom_bar()+\n labs(x = \"Delay onset to admission (days)\",\n title = \"Ordered by frequency\")\n\n# reversed frequency\nggplot(data = linelist, aes(x = fct_rev(fct_infreq(delay_cat))))+\n geom_bar()+\n labs(x = \"Delay onset to admission (days)\",\n title = \"Reverse of order by frequency\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBy appearance\nUse fct_inorder() to set the level order to match the order of appearance in the data, starting from the first row. This can be useful if you first carefully arrange() the data in the data frame, and then use this to set the factor order.\n\n\nBy summary statistic of another column\nYou can use fct_reorder() to order the levels of one column by a summary statistic of another column. 
Visually, this can result in pleasing plots where the bars/points ascend or descend steadily across the plot.\nIn the examples below, the x-axis is delay_cat, and the y-axis is numeric column ct_blood (cycle-threshold value). Box plots show the CT value distribution by delay_cat group. We want to order the box plots in ascending order by the group median CT value.\nIn the first example below, the default order alpha-numeric level order is used. You can see the box plot heights are jumbled and not in any particular order. In the second example, the delay_cat column (mapped to the x-axis) has been wrapped in fct_reorder(), the column ct_blood is given as the second argument, and “median” is given as the third argument (you could also use “max”, “mean”, “min”, etc). Thus, the order of the levels of delay_cat will now reflect ascending median CT values of each delay_cat group’s median CT value. This is reflected in the second plot - the box plots have been re-arranged to ascend. Note how NA (missing) will appear at the end, unless converted to an explicit level.\n\n# boxplots ordered by original factor levels\nggplot(data = linelist)+\n geom_boxplot(\n aes(x = delay_cat,\n y = ct_blood, \n fill = delay_cat))+\n labs(x = \"Delay onset to admission (days)\",\n title = \"Ordered by original alpha-numeric levels\")+\n theme_classic()+\n theme(legend.position = \"none\")\n\n\n# boxplots ordered by median CT value\nggplot(data = linelist)+\n geom_boxplot(\n aes(x = fct_reorder(delay_cat, ct_blood, \"median\"),\n y = ct_blood,\n fill = delay_cat))+\n labs(x = \"Delay onset to admission (days)\",\n title = \"Ordered by median CT value in group\")+\n theme_classic()+\n theme(legend.position = \"none\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNote in this example above there are no steps required prior to the ggplot() call - the grouping and calculations are all done internally to the ggplot command.\n\n\nBy “end” value\nUse fct_reorder2() for grouped line plots. 
It orders the levels (and therefore the legend) to align with the vertical ordering of the lines at the “end” of the plot. Technically speaking, it “orders by the y-values associated with the largest x values.”\nFor example, if you have lines showing case counts by hospital over time, you can apply fct_reorder2() to the color = argument within aes(), such that the vertical order of hospitals appearing in the legend aligns with the order of lines at the terminal end of the plot. Read more in the online documentation.\n\nepidemic_data <- linelist %>% # begin with the linelist \n filter(date_onset < as.Date(\"2014-09-21\")) %>% # cut-off date, for visual clarity\n count( # get case counts per week and by hospital\n epiweek = lubridate::floor_date(date_onset, \"week\"), \n hospital \n ) \n \nggplot(data = epidemic_data)+ # start plot\n geom_line( # make lines\n aes(\n x = epiweek, # x-axis epiweek\n y = n, # height is number of cases per week\n color = fct_reorder2(hospital, epiweek, n)))+ # data grouped and colored by hospital, with factor order by height at end of plot\n labs(title = \"Factor levels (and legend display) by line height at end of plot\",\n color = \"Hospital\") # change legend title",
+ "text": "11.4 Adjust level order\nThe package forcats offers useful functions to easily adjust the order of a factor’s levels (after a column been defined as class factor):\nThese functions can be applied to a factor column in two contexts:\n\nTo the column in the data frame, as usual, so the transformation is available for any subsequent use of the data.\n\nInside of a plot, so that the change is applied only within the plot.\n\n\nManually\nThis function is used to manually order the factor levels. If used on a non-factor column, the column will first be converted to class factor.\nWithin the parentheses first provide the factor column name, then provide either:\n\nAll the levels in the desired order (as a character vector c()), or,\n\nOne level and it’s corrected placement using the after = argument.\n\nHere is an example of redefining the column delay_cat (which is already class Factor) and specifying all the desired order of levels.\n\n# re-define level order\nlinelist <- linelist %>% \n mutate(delay_cat = fct_relevel(delay_cat, c(\"<2 days\", \"2-5 days\", \">5 days\")))\n\nIf you only want to move one level, you can specify it to fct_relevel() alone and give a number to the after = argument to indicate where in the order it should be. For example, the command below shifts “<2 days” to the second position:\n\n# re-define level order\nlinelist %>% \n mutate(delay_cat = fct_relevel(delay_cat, \"<2 days\", after = 1)) %>% \n tabyl(delay_cat)\n\n\n\nWithin a plot\nThe forcats commands can be used to set the level order in the data frame, or only within a plot. By using the command to “wrap around” the column name within the ggplot() plotting command, you can reverse/relevel/etc. the transformation will only apply within that plot.\nBelow, two plots are created with ggplot() (see the ggplot basics page). In the first, the delay_cat column is mapped to the x-axis of the plot, with it’s default level order as in the data linelist. 
In the second example it is wrapped within fct_relevel() and the order is changed in the plot.\n\n# Alpha-numeric default order - no adjustment within ggplot\nggplot(data = linelist) +\n geom_bar(mapping = aes(x = delay_cat))\n\n# Factor level order adjusted within ggplot\nggplot(data = linelist) +\n geom_bar(mapping = aes(x = fct_relevel(delay_cat, c(\"<2 days\", \"2-5 days\", \">5 days\"))))\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNote that default x-axis title is now quite complicated - you can overwrite this title with the ggplot2 labs() argument.\n\n\nReverse\nIt is rather common that you want to reverse the level order. Simply wrap the factor with fct_rev().\nNote that if you want to reverse only a plot legend but not the actual factor levels, you can do that with guides() (see ggplot tips).\n\n\nBy frequency\nTo order by frequency that the value appears in the data, use fct_infreq(). Any missing values (NA) will automatically be included at the end, unless they are converted to an explicit level (see this section). You can reverse the order by further wrapping with fct_rev().\nThis function can be used within a ggplot(), as shown below.\n\n# ordered by frequency\nggplot(data = linelist, aes(x = fct_infreq(delay_cat))) +\n geom_bar() +\n labs(x = \"Delay onset to admission (days)\",\n title = \"Ordered by frequency\")\n\n# reversed frequency\nggplot(data = linelist, aes(x = fct_rev(fct_infreq(delay_cat)))) +\n geom_bar() +\n labs(x = \"Delay onset to admission (days)\",\n title = \"Reverse of order by frequency\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBy appearance\nUse fct_inorder() to set the level order to match the order of appearance in the data, starting from the first row. This can be useful if you first carefully arrange() the data in the data frame, and then use this to set the factor order.\n\n\nBy summary statistic of another column\nYou can use fct_reorder() to order the levels of one column by a summary statistic of another column. 
Visually, this can result in pleasing plots where the bars/points ascend or descend steadily across the plot.\nIn the examples below, the x-axis is delay_cat, and the y-axis is numeric column ct_blood (cycle-threshold value). Box plots show the CT value distribution by delay_cat group. We want to order the box plots in ascending order by the group median CT value.\nIn the first example below, the default order alpha-numeric level order is used. You can see the box plot heights are jumbled and not in any particular order. In the second example, the delay_cat column (mapped to the x-axis) has been wrapped in fct_reorder(), the column ct_blood is given as the second argument, and “median” is given as the third argument (you could also use “max”, “mean”, “min”, etc). Thus, the order of the levels of delay_cat will now reflect ascending median CT values of each delay_cat group’s median CT value. This is reflected in the second plot - the box plots have been re-arranged to ascend. Note how NA (missing) will appear at the end, unless converted to an explicit level.\n\n# boxplots ordered by original factor levels\nggplot(data = linelist) +\n geom_boxplot(\n aes(x = delay_cat,\n y = ct_blood, \n fill = delay_cat)) +\n labs(x = \"Delay onset to admission (days)\",\n title = \"Ordered by original alpha-numeric levels\") +\n theme_classic() +\n theme(legend.position = \"none\")\n\n\n# boxplots ordered by median CT value\nggplot(data = linelist) +\n geom_boxplot(\n aes(x = fct_reorder(delay_cat, ct_blood, \"median\"),\n y = ct_blood,\n fill = delay_cat)) +\n labs(x = \"Delay onset to admission (days)\",\n title = \"Ordered by median CT value in group\") +\n theme_classic() +\n theme(legend.position = \"none\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNote in this example above there are no steps required prior to the ggplot() call - the grouping and calculations are all done internally to the ggplot command.\n\n\nBy “end” value\nUse fct_reorder2() for grouped line plots. 
It orders the levels (and therefore the legend) to align with the vertical ordering of the lines at the “end” of the plot. Technically speaking, it “orders by the y-values associated with the largest x values.”\nFor example, if you have lines showing case counts by hospital over time, you can apply fct_reorder2() to the color = argument within aes(), such that the vertical order of hospitals appearing in the legend aligns with the order of lines at the terminal end of the plot. Read more in the online documentation.\n\nepidemic_data <- linelist %>% # begin with the linelist \n filter(date_onset < as.Date(\"2014-09-21\")) %>% # cut-off date, for visual clarity\n count( # get case counts per week and by hospital\n epiweek = lubridate::floor_date(date_onset, \"week\"), \n hospital \n ) \n \nggplot(data = epidemic_data) + # start plot\n geom_line( # make lines\n aes(\n x = epiweek, # x-axis epiweek\n y = n, # height is number of cases per week\n color = fct_reorder2(hospital, epiweek, n))) + # data grouped and colored by hospital, with factor order by height at end of plot\n labs(title = \"Factor levels (and legend display) by line height at end of plot\",\n color = \"Hospital\") # change legend title",
"crumbs": [
"Data Management",
"11Factors"
@@ -1066,7 +1066,7 @@
"href": "new_pages/factors.html#combine-levels",
"title": "11 Factors",
"section": "11.6 Combine levels",
- "text": "11.6 Combine levels\n\nManually\nYou can adjust the level displays manually manually with fct_recode(). This is like the dplyr function recode() (see Cleaning data and core functions), but it allows the creation of new factor levels. If you use the simple recode() on a factor, new re-coded values will be rejected unless they have already been set as permissible levels.\nThis tool can also be used to “combine” levels, by assigning multiple levels the same re-coded value. Just be careful to not lose information! Consider doing these combining steps in a new column (not over-writing the existing column).\nfct_recode() has a different syntax than recode(). recode() uses OLD = NEW, whereas fct_recode() uses NEW = OLD.\nThe current levels of delay_cat are:\n\nlevels(linelist$delay_cat)\n\n[1] \"<2 days\" \"2-5 days\" \">5 days\" \n\n\nThe new levels are created using syntax fct_recode(column, \"new\" = \"old\", \"new\" = \"old\", \"new\" = \"old\") and printed:\n\nlinelist %>% \n mutate(delay_cat = fct_recode(\n delay_cat,\n \"Less than 2 days\" = \"<2 days\",\n \"2 to 5 days\" = \"2-5 days\",\n \"More than 5 days\" = \">5 days\")) %>% \n tabyl(delay_cat)\n\n delay_cat n percent valid_percent\n Less than 2 days 2990 0.50781250 0.5308949\n 2 to 5 days 2040 0.34646739 0.3622159\n More than 5 days 602 0.10224185 0.1068892\n <NA> 256 0.04347826 NA\n\n\nHere they are manually combined with fct_recode(). Note there is no error raised at the creation of a new level “Less than 5 days”.\n\nlinelist %>% \n mutate(delay_cat = fct_recode(\n delay_cat,\n \"Less than 5 days\" = \"<2 days\",\n \"Less than 5 days\" = \"2-5 days\",\n \"More than 5 days\" = \">5 days\")) %>% \n tabyl(delay_cat)\n\n delay_cat n percent valid_percent\n Less than 5 days 5030 0.85427989 0.8931108\n More than 5 days 602 0.10224185 0.1068892\n <NA> 256 0.04347826 NA\n\n\n\n\nReduce into “Other”\nYou can use fct_other() to manually assign factor levels to an “Other” level. 
Below, all levels in the column hospital, aside from “Port Hospital” and “Central Hospital”, are combined into “Other”. You can provide a vector to either keep =, or drop =. You can change the display of the “Other” level with other_level =.\n\nlinelist %>% \n mutate(hospital = fct_other( # adjust levels\n hospital,\n keep = c(\"Port Hospital\", \"Central Hospital\"), # keep these separate\n other_level = \"Other Hospital\")) %>% # All others as \"Other Hospital\"\n tabyl(hospital) # print table\n\n hospital n percent\n Central Hospital 454 0.07710598\n Port Hospital 1762 0.29925272\n Other Hospital 3672 0.62364130\n\n\n\n\nReduce by frequency\nYou can combine the least-frequent factor levels automatically using fct_lump().\nTo “lump” together many low-frequency levels into an “Other” group, do one of the following:\n\nSet n = as the number of groups you want to keep. The n most-frequent levels will be kept, and all others will combine into “Other”.\n\nSet prop = as the threshold frequency proportion for levels above which you want to keep. All other values will combine into “Other”.\n\nYou can change the display of the “Other” level with other_level =. Below, all but the two most-frequent hospitals are combined into “Other Hospital”.\n\nlinelist %>% \n mutate(hospital = fct_lump( # adjust levels\n hospital,\n n = 2, # keep top 2 levels\n other_level = \"Other Hospital\")) %>% # all others as \"Other Hospital\"\n tabyl(hospital) # print table\n\n hospital n percent\n Missing 1469 0.2494905\n Port Hospital 1762 0.2992527\n Other Hospital 2657 0.4512568",
+ "text": "11.6 Combine levels\n\nManually\nYou can adjust the level displays manually manually with fct_recode(). This is like the dplyr function recode() (see Cleaning data and core functions), but it allows the creation of new factor levels. If you use the simple recode() on a factor, new re-coded values will be rejected unless they have already been set as permissible levels.\nThis tool can also be used to “combine” levels, by assigning multiple levels the same re-coded value. Just be careful to not lose information! Consider doing these combining steps in a new column (not over-writing the existing column).\nDANGER: fct_recode() has a different syntax than recode(). recode() uses OLD = NEW, whereas fct_recode() uses NEW = OLD. \nThe current levels of delay_cat are:\n\nlevels(linelist$delay_cat)\n\n[1] \"<2 days\" \"2-5 days\" \">5 days\" \n\n\nThe new levels are created using syntax fct_recode(column, \"new\" = \"old\", \"new\" = \"old\", \"new\" = \"old\") and printed:\n\nlinelist %>% \n mutate(delay_cat = fct_recode(\n delay_cat,\n \"Less than 2 days\" = \"<2 days\",\n \"2 to 5 days\" = \"2-5 days\",\n \"More than 5 days\" = \">5 days\")) %>% \n tabyl(delay_cat)\n\n delay_cat n percent valid_percent\n Less than 2 days 2990 0.50781250 0.5308949\n 2 to 5 days 2040 0.34646739 0.3622159\n More than 5 days 602 0.10224185 0.1068892\n <NA> 256 0.04347826 NA\n\n\nHere they are manually combined with fct_recode(). Note there is no error raised at the creation of a new level “Less than 5 days”.\n\nlinelist %>% \n mutate(delay_cat = fct_recode(\n delay_cat,\n \"Less than 5 days\" = \"<2 days\",\n \"Less than 5 days\" = \"2-5 days\",\n \"More than 5 days\" = \">5 days\")) %>% \n tabyl(delay_cat)\n\n delay_cat n percent valid_percent\n Less than 5 days 5030 0.85427989 0.8931108\n More than 5 days 602 0.10224185 0.1068892\n <NA> 256 0.04347826 NA\n\n\n\n\nReduce into “Other”\nYou can use fct_other() to manually assign factor levels to an “Other” level. 
Below, all levels in the column hospital, aside from “Port Hospital” and “Central Hospital”, are combined into “Other”. You can provide a vector to either keep =, or drop =. You can change the display of the “Other” level with other_level =.\n\nlinelist %>% \n mutate(hospital = fct_other( # adjust levels\n hospital,\n keep = c(\"Port Hospital\", \"Central Hospital\"), # keep these separate\n other_level = \"Other Hospital\")) %>% # All others as \"Other Hospital\"\n tabyl(hospital) # print table\n\n hospital n percent\n Central Hospital 454 0.07710598\n Port Hospital 1762 0.29925272\n Other Hospital 3672 0.62364130\n\n\n\n\nReduce by frequency\nYou can combine the least-frequent factor levels automatically using fct_lump().\nTo “lump” together many low-frequency levels into an “Other” group, do one of the following:\n\nSet n = as the number of groups you want to keep. The n most-frequent levels will be kept, and all others will combine into “Other”.\n\nSet prop = as the threshold frequency proportion for levels above which you want to keep. All other values will combine into “Other”.\n\nYou can change the display of the “Other” level with other_level =. Below, all but the two most-frequent hospitals are combined into “Other Hospital”.\n\nlinelist %>% \n mutate(hospital = fct_lump( # adjust levels\n hospital,\n n = 2, # keep top 2 levels\n other_level = \"Other Hospital\")) %>% # all others as \"Other Hospital\"\n tabyl(hospital) # print table\n\n hospital n percent\n Missing 1469 0.2494905\n Port Hospital 1762 0.2992527\n Other Hospital 2657 0.4512568",
"crumbs": [
"Data Management",
"11Factors"
@@ -1077,7 +1077,7 @@
"href": "new_pages/factors.html#show-all-levels",
"title": "11 Factors",
"section": "11.7 Show all levels",
- "text": "11.7 Show all levels\nOne benefit of using factors is to standardise the appearance of plot legends and tables, regardless of which values are actually present in a dataset.\nIf you are preparing many figures (e.g. for multiple jurisdictions) you will want the legends and tables to appear identically even with varying levels of data completion or data composition.\n\nIn plots\nIn a ggplot() figure, simply add the argument drop = FALSE in the relevant scale_xxxx() function. All factor levels will be displayed, regardless of whether they are present in the data. If your factor column levels are displayed using fill =, then in scale_fill_discrete() you include drop = FALSE, as shown below. If your levels are displayed with x = (to the x-axis) color = or size = you would provide this to scale_color_discrete() or scale_size_discrete() accordingly.\nThis example is a stacked bar plot of age category, by hospital. Adding scale_fill_discrete(drop = FALSE) ensures that all age groups appear in the legend, even if not present in the data.\n\nggplot(data = linelist)+\n geom_bar(mapping = aes(x = hospital, fill = age_cat)) +\n scale_fill_discrete(drop = FALSE)+ # show all age groups in the legend, even those not present\n labs(\n title = \"All age groups will appear in legend, even if not present in data\")\n\n\n\n\n\n\n\n\n\n\nIn tables\nBoth the base R table() and tabyl() from janitor will show all factor levels (even unused levels).\nIf you use count() or summarise() from dplyr to make a table, add the argument .drop = FALSE to include counts for all factor levels even those unused.\nRead more in the Descriptive tables page, or at the scale_discrete documentation, or the count() documentation. You can see another example in the Contact tracing page.",
+ "text": "11.7 Show all levels\nOne benefit of using factors is to standardise the appearance of plot legends and tables, regardless of which values are actually present in a dataset.\nIf you are preparing many figures (e.g. for multiple jurisdictions) you will want the legends and tables to appear identically even with varying levels of data completion or data composition.\n\nIn plots\nIn a ggplot() figure, simply add the argument drop = FALSE in the relevant scale_xxxx() function. All factor levels will be displayed, regardless of whether they are present in the data. If your factor column levels are displayed using fill =, then in scale_fill_discrete() you include drop = FALSE, as shown below. If your levels are displayed with x = (to the x-axis) color = or size = you would provide this to scale_color_discrete() or scale_size_discrete() accordingly.\nThis example is a stacked bar plot of age category, by hospital. Adding scale_fill_discrete(drop = FALSE) ensures that all age groups appear in the legend, even if not present in the data.\n\nggplot(data = linelist) +\n geom_bar(mapping = aes(x = hospital, fill = age_cat)) +\n scale_fill_discrete(drop = FALSE) + # show all age groups in the legend, even those not present\n labs(\n title = \"All age groups will appear in legend, even if not present in data\")\n\n\n\n\n\n\n\n\n\n\nIn tables\nBoth the base R table() and tabyl() from janitor will show all factor levels (even unused levels).\nIf you use count() or summarise() from dplyr to make a table, add the argument .drop = FALSE to include counts for all factor levels even those unused.\nRead more in the Descriptive tables page, or at the scale_discrete documentation, or the count() documentation. You can see another example in the Contact tracing page.",
"crumbs": [
"Data Management",
"11Factors"
@@ -1088,7 +1088,7 @@
"href": "new_pages/factors.html#epiweeks",
"title": "11 Factors",
"section": "11.8 Epiweeks",
- "text": "11.8 Epiweeks\nPlease see the extensive discussion of how to create epidemiological weeks in the Grouping data page.\nPlease also see the Working with dates page for tips on how to create and format epidemiological weeks.\n\nEpiweeks in a plot\nIf your goal is to create epiweeks to display in a plot, you can do this simply with lubridate’s floor_date(), as explained in the Grouping data page. The values returned will be of class Date with format YYYY-MM-DD. If you use this column in a plot, the dates will naturally order correctly, and you do not need to worry about levels or converting to class Factor. See the ggplot() histogram of onset dates below.\nIn this approach, you can adjust the display of the dates on an axis with scale_x_date(). See the page on Epidemic curves for more information. You can specify a “strptime” display format to the date_labels = argument of scale_x_date(). These formats use “%” placeholders and are covered in the Working with dates page. Use “%Y” to represent a 4-digit year, and either “%W” or “%U” to represent the week number (Monday or Sunday weeks respectively).\n\nlinelist %>% \n mutate(epiweek_date = floor_date(date_onset, \"week\")) %>% # create week column\n ggplot()+ # begin ggplot\n geom_histogram(mapping = aes(x = epiweek_date))+ # histogram of date of onset\n scale_x_date(date_labels = \"%Y-W%W\") # adjust disply of dates to be YYYY-WWw\n\n\n\n\n\n\n\n\n\n\nEpiweeks in the data\nHowever, if your purpose in factoring is not to plot, you can approach this one of two ways:\n\nFor fine control over the display, convert the lubridate epiweek column (YYYY-MM-DD) to the desired display format (YYYY-WWw) within the data frame itself, and then convert it to class Factor.\n\nFirst, use format() from base R to convert the date display from YYYY-MM-DD to YYYY-Www display (see the Working with dates page). In this process the class will be converted to character. 
Then, convert from character to class Factor with factor().\n\nlinelist <- linelist %>% \n mutate(epiweek_date = floor_date(date_onset, \"week\"), # create epiweeks (YYYY-MM-DD)\n epiweek_formatted = format(epiweek_date, \"%Y-W%W\"), # Convert to display (YYYY-WWw)\n epiweek_formatted = factor(epiweek_formatted)) # Convert to factor\n\n# Display levels\nlevels(linelist$epiweek_formatted)\n\n [1] \"2014-W13\" \"2014-W14\" \"2014-W15\" \"2014-W16\" \"2014-W17\" \"2014-W18\"\n [7] \"2014-W19\" \"2014-W20\" \"2014-W21\" \"2014-W22\" \"2014-W23\" \"2014-W24\"\n[13] \"2014-W25\" \"2014-W26\" \"2014-W27\" \"2014-W28\" \"2014-W29\" \"2014-W30\"\n[19] \"2014-W31\" \"2014-W32\" \"2014-W33\" \"2014-W34\" \"2014-W35\" \"2014-W36\"\n[25] \"2014-W37\" \"2014-W38\" \"2014-W39\" \"2014-W40\" \"2014-W41\" \"2014-W42\"\n[31] \"2014-W43\" \"2014-W44\" \"2014-W45\" \"2014-W46\" \"2014-W47\" \"2014-W48\"\n[37] \"2014-W49\" \"2014-W50\" \"2014-W51\" \"2015-W00\" \"2015-W01\" \"2015-W02\"\n[43] \"2015-W03\" \"2015-W04\" \"2015-W05\" \"2015-W06\" \"2015-W07\" \"2015-W08\"\n[49] \"2015-W09\" \"2015-W10\" \"2015-W11\" \"2015-W12\" \"2015-W13\" \"2015-W14\"\n[55] \"2015-W15\" \"2015-W16\"\n\n\nDANGER: If you place the weeks ahead of the years (“Www-YYYY”) (“%W-%Y”), the default alpha-numeric level ordering will be incorrect (e.g. 01-2015 will be before 35-2014). You could need to manually adjust the order, which would be a long painful process.\n\nFor fast default display, use the aweek package and it’s function date2week(). You can set the week_start = day, and if you set factor = TRUE then the output column is an ordered factor. As a bonus, the factor includes levels for all possible weeks in the span - even if there are no cases that week.\n\n\ndf <- linelist %>% \n mutate(epiweek = date2week(date_onset, week_start = \"Monday\", factor = TRUE))\n\nlevels(df$epiweek)\n\nSee the Working with dates page for more information about aweek. It also offers the reverse function week2date().",
+ "text": "11.8 Epiweeks\nPlease see the extensive discussion of how to create epidemiological weeks in the Grouping data page.\nPlease also see the Working with dates page for tips on how to create and format epidemiological weeks.\n\nEpiweeks in a plot\nIf your goal is to create epiweeks to display in a plot, you can do this simply with lubridate’s floor_date(), as explained in the Grouping data page. The values returned will be of class Date with format YYYY-MM-DD. If you use this column in a plot, the dates will naturally order correctly, and you do not need to worry about levels or converting to class Factor. See the ggplot() histogram of onset dates below.\nIn this approach, you can adjust the display of the dates on an axis with scale_x_date(). See the page on Epidemic curves for more information. You can specify a “strptime” display format to the date_labels = argument of scale_x_date(). These formats use “%” placeholders and are covered in the Working with dates page. Use “%Y” to represent a 4-digit year, and either “%W” or “%U” to represent the week number (Monday or Sunday weeks respectively).\n\nlinelist %>% \n mutate(epiweek_date = floor_date(date_onset, \"week\")) %>% # create week column\n ggplot() + # begin ggplot\n geom_histogram(mapping = aes(x = epiweek_date)) + # histogram of date of onset\n scale_x_date(date_labels = \"%Y-W%W\") # adjust disply of dates to be YYYY-WWw\n\n\n\n\n\n\n\n\n\n\nEpiweeks in the data\nHowever, if your purpose in factoring is not to plot, you can approach this one of two ways:\n\nFor fine control over the display, convert the lubridate epiweek column (YYYY-MM-DD) to the desired display format (YYYY-WWw) within the data frame itself, and then convert it to class Factor.\n\nFirst, use format() from base R to convert the date display from YYYY-MM-DD to YYYY-Www display (see the Working with dates page). In this process the class will be converted to character. 
Then, convert from character to class Factor with factor().\n\nlinelist <- linelist %>% \n mutate(epiweek_date = floor_date(date_onset, \"week\"), # create epiweeks (YYYY-MM-DD)\n epiweek_formatted = format(epiweek_date, \"%Y-W%W\"), # Convert to display (YYYY-WWw)\n epiweek_formatted = factor(epiweek_formatted)) # Convert to factor\n\n# Display levels\nlevels(linelist$epiweek_formatted)\n\n [1] \"2014-W13\" \"2014-W14\" \"2014-W15\" \"2014-W16\" \"2014-W17\" \"2014-W18\"\n [7] \"2014-W19\" \"2014-W20\" \"2014-W21\" \"2014-W22\" \"2014-W23\" \"2014-W24\"\n[13] \"2014-W25\" \"2014-W26\" \"2014-W27\" \"2014-W28\" \"2014-W29\" \"2014-W30\"\n[19] \"2014-W31\" \"2014-W32\" \"2014-W33\" \"2014-W34\" \"2014-W35\" \"2014-W36\"\n[25] \"2014-W37\" \"2014-W38\" \"2014-W39\" \"2014-W40\" \"2014-W41\" \"2014-W42\"\n[31] \"2014-W43\" \"2014-W44\" \"2014-W45\" \"2014-W46\" \"2014-W47\" \"2014-W48\"\n[37] \"2014-W49\" \"2014-W50\" \"2014-W51\" \"2015-W00\" \"2015-W01\" \"2015-W02\"\n[43] \"2015-W03\" \"2015-W04\" \"2015-W05\" \"2015-W06\" \"2015-W07\" \"2015-W08\"\n[49] \"2015-W09\" \"2015-W10\" \"2015-W11\" \"2015-W12\" \"2015-W13\" \"2015-W14\"\n[55] \"2015-W15\" \"2015-W16\"\n\n\nDANGER: If you place the weeks ahead of the years (“Www-YYYY”) (“%W-%Y”), the default alpha-numeric level ordering will be incorrect (e.g. 01-2015 will be before 35-2014). You could need to manually adjust the order, which would be a long painful process.\n\nFor fast default display, use the aweek package and it’s function date2week(). You can set the week_start = day, and if you set factor = TRUE then the output column is an ordered factor. As a bonus, the factor includes levels for all possible weeks in the span - even if there are no cases that week.\n\n\ndf <- linelist %>% \n mutate(epiweek = date2week(date_onset, week_start = \"Monday\", factor = TRUE))\n\nlevels(df$epiweek)\n\nSee the Working with dates page for more information about aweek. It also offers the reverse function week2date().",
"crumbs": [
"Data Management",
"11Factors"
diff --git a/new_pages/characters_strings.html b/new_pages/characters_strings.html
new file mode 100644
index 00000000..a1291eb4
--- /dev/null
+++ b/new_pages/characters_strings.html
@@ -0,0 +1,2688 @@
+
+
+
+
+
+
+
+
+
+The Epidemiologist R Handbook - 10 Characters and strings
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence().
+
+
+
Evaluate and extract by position - str_length(), str_sub(), word().
+
+
Patterns.
+
+
Detect and locate - str_detect(), str_subset(), str_match(), str_extract().
+
+
Modify and replace - str_sub(), str_replace_all().
+
+
+
Regular expressions (“regex”).
+
+
For ease of display, most examples act on a short, defined character vector; however, they can easily be adapted to a column within a data frame.
+
This stringr vignette provided much of the inspiration for this page.
+
+
+
10.1 Preparation
+
+
Load packages
+
Install or load the stringr and other tidyverse packages.
+
+
# install/load packages
+pacman::p_load(
+ stringr, # many functions for handling strings
+ tidyverse, # for optional data manipulation
+ tools) # alternative for converting to title case
+
+
+
+
Import data
+
In this page we will occasionally reference the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).
+
+
+
Warning: The `trust` argument of `import()` should be explicit for serialization formats
+as of rio 1.0.3.
+ℹ Missing `trust` will be set to FALSE by default for RDS in 2.0.0.
+ℹ The deprecated feature was likely used in the rio package.
+ Please report the issue at <https://github.com/gesistsa/rio/issues>.
+
+
+
+
# import case linelist
+linelist <- import("linelist_cleaned.rds")
+
+
The first 50 rows of the linelist are displayed below.
+
+
+
+
+
+
+
+
+
+
+
10.2 Unite, split, and arrange
+
This section covers:
+
+
Using str_c(), str_glue(), and unite() to combine strings.
+
+
Using str_order() to arrange strings.
+
+
Using str_split() and separate() to split strings.
+
+
+
+
Combine strings
+
To combine or concatenate multiple strings into one string, we suggest using str_c() from stringr. If you have distinct character values to combine, simply provide them as separate arguments, separated by commas.
+
+
str_c("String1", "String2", "String3")
+
+
[1] "String1String2String3"
+
+
+
The argument sep = inserts a character value between each of the arguments you provided (e.g. inserting a comma, space, or newline "\n").
+
+
str_c("String1", "String2", "String3", sep =", ")
+
+
[1] "String1, String2, String3"
+
+
+
The argument collapse = is relevant if you are inputting multiple vectors as arguments to str_c(). It is used to separate the elements of what would be an output vector, such that the output vector only has one long character element.
+
The example below shows the combination of two vectors into one (first names and last names). Another similar example might be jurisdictions and their case counts. In this example:
+
+
The sep = value appears between each first and last name
+
+
The collapse = value appears between each person
+
+
+
first_names <- c("abdul", "fahruk", "janice")
+last_names <- c("hussein", "akinleye", "okeke")
+
+# sep displays between the respective input strings, while collapse displays between the elements produced
+str_c(first_names, last_names, sep = " ", collapse = "; ")
Note: Depending on your desired display context, when printing such a combined string with newlines, you may need to wrap the whole phrase in cat() for the newlines to print properly:
+
+
# For newlines to print correctly, the phrase may need to be wrapped in cat()
+cat(str_c(first_names, last_names, sep =" ", collapse =";\n"))
+
+
abdul hussein;
+fahruk akinleye;
+janice okeke
+
+
+
+
+
+
Dynamic strings
+
Use str_glue() to insert dynamic R code into a string. This is a very useful function for creating dynamic plot captions, as demonstrated below.
+
+
All content goes between double quotation marks str_glue("").
+
+
Any dynamic code or references to pre-defined values are placed within curly brackets {} within the double quotation marks. There can be many curly brackets in the same str_glue() command.
+
+
To display character quotes ’’, use single quotes within the surrounding double quotes (e.g. when providing date format - see example below).
+
+
Tip: You can use \n to force a new line.
+
+
Tip: You can use format() to adjust date display, and Sys.Date() to display the current date.
+
+
A simple example, of a dynamic plot caption:
+
+
str_glue("Data include {nrow(linelist)} cases and are current to {format(Sys.Date(), '%d %b %Y')}.")
+
+
Data include 5888 cases and are current to 08 Sep 2024.
+
+
+
An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the text is long.
+
+
str_glue("Linelist as of {current_date}.\nLast case hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
+current_date =format(Sys.Date(), '%d %b %Y'),
+last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm = TRUE)), '%d %b %Y'),
+n_missing_onset =nrow(linelist %>%filter(is.na(date_onset)))
+ )
+
+
Linelist as of 08 Sep 2024.
+Last case hospitalized on 30 Apr 2015.
+256 cases are missing date of onset and not shown
+
+
+
Pulling from a data frame
+
Sometimes, it is useful to pull data from a data frame and have it pasted together in sequence. Below is an example data frame. We will use it to make a summary statement about the jurisdictions and the new and total case counts.
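A minimal sketch of such a data frame, assuming columns named zone, new_cases, and total_cases (the names used in the str_glue_data() call below) and values that match the printed output. The exact definition used by the page may differ.

```r
# Hypothetical example data frame; names and values are assumptions
# chosen to match the str_glue_data() output shown on this page
case_table <- data.frame(
  zone        = c("Zone 1", "Zone 2", "Zone 3", "Zone 4", "Zone 5"),
  new_cases   = c(3, 0, 7, 0, 15),
  total_cases = c(40, 4, 25, 10, 103)
)
```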
Use str_glue_data(), which is specially made for taking data from data frame rows:
+
+
case_table %>%
+str_glue_data("{zone}: {new_cases} ({total_cases} total cases)")
+
+
Zone 1: 3 (40 total cases)
+Zone 2: 0 (4 total cases)
+Zone 3: 7 (25 total cases)
+Zone 4: 0 (10 total cases)
+Zone 5: 15 (103 total cases)
+
+
+
Combine strings across rows
+
If you are trying to “roll-up” values in a data frame column, e.g. combine values from multiple rows into just one row by pasting them together with a separator, see the section of the De-duplication page on “rolling-up” values.
+
Data frame to one line
+
You can make the statement appear in one line using str_c() (specifying the data frame and column names), and providing sep = and collapse = arguments.
[1] "Zone 1 = 3; Zone 2 = 0; Zone 3 = 7; Zone 4 = 0; Zone 5 = 15"
+
+
+
You could add the pre-fix text “New Cases:” to the beginning of the statement by wrapping with a separate str_c() (if “New Cases:” was within the original str_c() it would appear multiple times).
[1] "New Cases: Zone 1 = 3; Zone 2 = 0; Zone 3 = 7; Zone 4 = 0; Zone 5 = 15"
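The two statements above could be produced along the following lines. This is a sketch that assumes a data frame case_table with columns zone and new_cases (a hypothetical definition is repeated here so the block runs on its own).

```r
library(stringr)

# Hypothetical data frame (an assumption; values match the output above)
case_table <- data.frame(
  zone      = c("Zone 1", "Zone 2", "Zone 3", "Zone 4", "Zone 5"),
  new_cases = c(3, 0, 7, 0, 15)
)

# sep joins each zone to its count; collapse joins the rows into one string
statement <- str_c(case_table$zone, case_table$new_cases,
                   sep = " = ", collapse = "; ")
statement

# Wrap in a second str_c() so the prefix appears only once
str_c("New Cases: ", statement)
```

Placing the prefix inside the inner str_c() would recycle it against every row, which is why the outer wrapper is used.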
+
+
+
+
+
Unite columns
+
Within a data frame, bringing together character values from multiple columns can be achieved with unite() from tidyr. This is the opposite of separate().
+
Provide the name of the new united column. Then provide the names of the columns you wish to unite.
+
+
By default, the separator used in the united column is underscore _, but this can be changed with the sep = argument.
+
+
remove = removes the input columns from the data frame (TRUE by default).
+
+
na.rm = removes missing values while uniting (FALSE by default).
+
+
Below, we define a mini-data frame to demonstrate with:
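A minimal sketch of such a data frame, assuming the three columns described in the Split columns section (case_ID, symptoms, outcome). The values are illustrative assumptions, not necessarily those used to produce the outputs on this page.

```r
# Hypothetical mini data frame; column names follow the description on this
# page, while the values are illustrative assumptions
df <- data.frame(
  case_ID  = 1:6,
  symptoms = c("jaundice, fever, chills",    # patient 1
               "chills, aches, pains",       # patient 2
               "fever",                      # patient 3
               "vomiting, diarrhoea",        # patient 4
               "bleeding from gums, fever",  # patient 5
               "rapid pulse, headache"),     # patient 6
  outcome  = c("Recover", "Death", "Death", "Recover", "Recover", "Recover")
)
```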
df_split <-separate(df, symptoms, into =c("sym_1", "sym_2", "sym_3"), extra ="merge")
+
+
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3, 4].
+
+
+
Here is the example data frame:
+
+
+
+
+
+
+
Below, we unite the three symptom columns:
+
+
df_split %>%
+unite(
+col ="all_symptoms", # name of the new united column
+c("sym_1", "sym_2", "sym_3"), # columns to unite
+sep =", ", # separator to use in united column
+remove =TRUE, # if TRUE, removes input cols from the data frame
+na.rm = TRUE # if TRUE, missing values are removed before uniting
+ )
To split a string based on a pattern, use str_split(). It evaluates the string(s) and returns a list of character vectors consisting of the newly-split values.
+
The simple example below evaluates one string and splits it into three. By default it returns an object of class list with one element (a character vector) for each string initially provided. If simplify = TRUE it returns a character matrix.
+
In this example, one string is provided, and the function returns a list with one element - a character vector with three values.
If the output is saved, you can then access the nth split value with bracket syntax. To access a specific value you can use syntax like this: the_returned_object[[1]][2], which would access the second value from the first evaluated string (“fever”). See the R basics page for more detail on accessing elements.
+
+
pt1_symptoms <-str_split("jaundice, fever, chills", ",")
+
+pt1_symptoms[[1]][2] # extracts 2nd value from 1st (and only) element of the list
+
+
[1] " fever"
+
+
+
If multiple strings are provided to str_split(), there will be more than one element in the returned list.
You can also adjust the number of splits to create with the n = argument. For example, this restricts the number of splits to 2. Any further commas remain within the second values.
Note - the same outputs can be achieved with str_split_fixed(), in which you do not give the simplify argument, but must instead designate the number of columns (n).
+
+
str_split_fixed(symptoms, ",", n =2)
+
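The variations described above can be compared side by side. This sketch assumes a hypothetical character vector named symptoms; its values are illustrative only.

```r
library(stringr)

# Hypothetical vector of symptom strings (an assumption for illustration)
symptoms <- c("jaundice, fever, chills",
              "chills, aches, pains",
              "fever")

# Default: a list with one character vector per input string
split_list <- str_split(symptoms, ",")

# simplify = TRUE: a character matrix with one row per input string
split_matrix <- str_split(symptoms, ",", simplify = TRUE)

# n = 2: at most two pieces; further commas stay in the second piece
split_two <- str_split(symptoms, ",", n = 2)

# str_split_fixed(): always a matrix with exactly n columns
split_fixed <- str_split_fixed(symptoms, ",", n = 2)
```

Note that splitting on "," alone leaves the space after each comma in place; splitting on ", " or applying str_trim() afterwards removes it.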
+
+
+
Split columns
+
If you are trying to split a data frame column, it is best to use the separate() function from tidyr. It is used to split one character column into other columns.
+
Let’s say we have a simple data frame df (defined and united in the unite section) containing a case_ID column, one character column with many symptoms, and one outcome column. Our goal is to separate the symptoms column into many columns - each one containing one symptom.
+
+
+
+
+
+
+
Assuming the data are piped into separate(), first provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.
+
+
sep = the separator, can be a character, or a number (interpreted as the character position to split at).
+
remove = if TRUE (the default), removes the input column from the output.
+
+
convert = if TRUE, runs type conversion on the new columns, which will cause string “NA”s to become NA (FALSE by default).
+
+
extra = this controls what happens if there are more values created by the separation than new columns named.
+
+
extra = "warn" means you will see a warning but it will drop excess values (the default).
+
+
extra = "drop" means the excess values will be dropped with no warning.
+
+
extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data.
+
+
+
An example with extra = "merge" is below - no data is lost. Two new columns are defined but any third symptoms are left in the second new column:
+
+
# third symptoms combined into second new column
+df %>%
+separate(symptoms, into =c("sym_1", "sym_2"), sep=",", extra ="merge")
+
+
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
CAUTION: If you do not provide enough into values for the new columns, your data may be truncated.
+
+
+
+
Arrange alphabetically
+
Several strings can be sorted by alphabetical order. str_order() returns the order, while str_sort() returns the strings in that order.
+
+
# strings
+health_zones <-c("Alba", "Takota", "Delta")
+
+# return the alphabetical order
+str_order(health_zones)
+
+
[1] 1 3 2
+
+
# return the strings in alphabetical order
+str_sort(health_zones)
+
+
[1] "Alba" "Delta" "Takota"
+
+
+
To use a different alphabet, add the argument locale =. See the full list of locales by entering stringi::stri_locale_list() in the R console.
+
+
+
+
base R functions
+
It is common to see base R functions paste() and paste0(), which concatenate vectors after converting all parts to character. They act similarly to str_c() but the syntax is arguably more complicated - in the parentheses each part is separated by a comma. The parts are either character text (in quotes) or pre-defined code objects (no quotes). For example:
[1] "Regional hospital needs 10 beds and 20 masks."
+
+
+
The collapse = argument can be specified for both functions. paste() also accepts sep =, which defaults to " " (one space); paste0() behaves like paste() with no separator.
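A statement like the one printed above could be built as follows. This is a sketch in which the object names n_beds and n_masks are assumptions.

```r
# Hypothetical pre-defined objects (names are assumptions)
n_beds  <- 10
n_masks <- 20

# paste0() uses no separator, so the spaces are written into the quoted text
with_paste0 <- paste0("Regional hospital needs ", n_beds, " beds and ", n_masks, " masks.")

# paste() inserts one space between parts by default
with_paste <- paste("Regional hospital needs", n_beds, "beds and", n_masks, "masks.")

with_paste0
with_paste
```

Both calls return the same string; the choice mainly affects where the spaces are typed.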
+
+
+
+
10.3 Clean and standardise
+
+
+
Change case
+
Often one must alter the case/capitalization of a string value, for example names of jurisdictions. Use str_to_upper(), str_to_lower(), and str_to_title(), from stringr, as shown below:
+
+
str_to_upper("California")
+
+
[1] "CALIFORNIA"
+
+
str_to_lower("California")
+
+
[1] "california"
+
+
+
Using base R, the above can also be achieved with toupper() and tolower().
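For comparison, a short sketch of the base R equivalents applied to the same value:

```r
# base R equivalents of str_to_upper() and str_to_lower()
upper_value <- toupper("California")
lower_value <- tolower("California")

upper_value
lower_value
```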
+
Title case
+
Transforming the string so each word is capitalized can be achieved with str_to_title():
+
+
str_to_title("go to the US state of california ")
+
+
[1] "Go To The Us State Of California "
+
+
+
Use toTitleCase() from the tools package to achieve more nuanced capitalization (words like “to”, “the”, and “of” are not capitalized).
+
+
tools::toTitleCase("This is the US state of california")
+
+
[1] "This is the US State of California"
+
+
+
You can also use str_to_sentence(), which capitalizes only the first letter of the string.
+
+
str_to_sentence("the patient must be transported")
+
+
[1] "The patient must be transported"
+
+
+
+
+
Pad length
+
Use str_pad() to add characters to a string, to a minimum length. By default spaces are added, but you can also pad with other characters using the pad = argument.
+
+
# ICD codes of differing length
+ICD_codes <-c("R10.13",
+"R10.819",
+"R17")
+
+# ICD codes padded to 7 characters on the right side
+str_pad(ICD_codes, 7, "right")
+
+
[1] "R10.13 " "R10.819" "R17 "
+
+
# Pad with periods instead of spaces
+str_pad(ICD_codes, 7, "right", pad =".")
+
+
[1] "R10.13." "R10.819" "R17...."
+
+
+
For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to a minimum length of 2 with pad = "0".
+
+
# Add leading zeros to two digits (e.g. for times minutes/hours)
+str_pad("4", 2, pad ="0")
+
+
[1] "04"
+
+
# example using a numeric column named "hours"
+# hours <- str_pad(hours, 2, pad = "0")
+
+
+
+
Truncate
+
str_trunc() sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (…) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string (“left”, “right”, or “center”).
+
+
original <-"Symptom onset on 4/3/2020 with vomiting"
+str_trunc(original, 10, "center")
+
+
[1] "Symp...ing"
+
+
+
+
+
Standardize length
+
Use str_trunc() to set a maximum length, and then use str_pad() to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then one very short value is padded to achieve length of 6.
+
+
# ICD codes of differing length
+ICD_codes <-c("R10.13",
+"R10.819",
+"R17")
+
+# truncate to maximum length of 6
+ICD_codes_2 <-str_trunc(ICD_codes, 6)
+ICD_codes_2
+
+
[1] "R10.13" "R10..." "R17"
+
+
# expand to minimum length of 6
+ICD_codes_3 <-str_pad(ICD_codes_2, 6, "right")
+ICD_codes_3
+
+
[1] "R10.13" "R10..." "R17 "
+
+
+
+
+
Remove leading/trailing whitespace
+
Use str_trim() to remove spaces, newlines (\n) or tabs (\t) on the sides of a string input. Add "right", "left", or "both" to the command to specify which side to trim (e.g. str_trim(x, "right")).
+
+
# ID numbers with excess spaces on right
+IDs <-c("provA_1852 ", # two excess spaces
+"provA_2345", # zero excess spaces
+"provA_9460 ") # one excess space
+
+# IDs trimmed to remove excess spaces (both sides are trimmed by default)
+str_trim(IDs)
+
+
[1] "provA_1852" "provA_2345" "provA_9460"
+
+
+
+
+
Remove repeated whitespace within
+
Use str_squish() to remove repeated spaces that appear inside a string. For example, to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string like str_trim().
+
+
# original contains excess spaces within string
+str_squish(" Pt requires IV saline\n")
+
+
[1] "Pt requires IV saline"
+
+
+
Enter ?str_trim, ?str_pad in your R console to see further details.
+
+
+
Wrap into paragraphs
+
Use str_wrap() to wrap a long unstructured text into a structured paragraph with fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n) within the paragraph, as seen in the example below.
+
+
pt_course <-"Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."
+
+str_wrap(pt_course, 40)
+
+
[1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."
+
+
+
The base function cat() can be wrapped around the above command in order to print the output, displaying the new lines added.
+
+
cat(str_wrap(pt_course, 40))
+
+
Symptom onset 1/4/2020 vomiting chills
+fever. Pt saw traditional healer in
+home village on 2/4/2020. On 5/4/2020
+pt symptoms worsened and was admitted
+to Lumta clinic. Sample was taken and pt
+was transported to regional hospital on
+6/4/2020. Pt died at regional hospital
+on 7/4/2020.
+
+
+
+
+
+
+
10.4 Handle by position
+
+
Extract by character position
+
Use str_sub() to return only a part of a string. The function takes three main arguments:
+
+
the character vector(s).
+
+
start position.
+
end position.
+
+
A few notes on position numbers:
+
+
If a position number is positive, the position is counted starting from the left end of the string.
+
+
If a position number is negative, it is counted starting from the right end of the string.
+
+
Position numbers are inclusive.
+
+
Positions extending beyond the string will be truncated (removed).
+
+
Below are some examples applied to the string “pneumonia”:
+
+
# start and end third from left (3rd letter from left)
+str_sub("pneumonia", 3, 3)
+
+
[1] "e"
+
+
# 0 is not present
+str_sub("pneumonia", 0, 0)
+
+
[1] ""
+
+
# 6th from left, to the 1st from right
+str_sub("pneumonia", 6, -1)
+
+
[1] "onia"
+
+
# 5th from right, to the 2nd from right
+str_sub("pneumonia", -5, -2)
+
+
[1] "moni"
+
+
# 4th from left to a position outside the string
+str_sub("pneumonia", 4, 15)
+
+
[1] "umonia"
+
+
+
+
+
Extract by word position
+
To extract the nth ‘word’, use word(), also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.
+
By default, the separator between ‘words’ is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).
+
+
# strings to evaluate
+chief_complaints <-c("I just got out of the hospital 2 days ago, but still can barely breathe.",
+"My stomach hurts",
+"Severe ear pain")
+
+# extract 1st to 3rd words of each string
+word(chief_complaints, start =1, end =3, sep =" ")
+
+
[1] "I just got" "My stomach hurts" "Severe ear pain"
+
+
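The sep = argument and negative positions can be combined to pull pieces out of structured identifiers. A small sketch follows; the sample_ids values are hypothetical and not taken from the linelist.

```r
library(stringr)

# hypothetical IDs whose "words" are separated by underscores
sample_ids <- c("provA_2021_0012", "provB_2020_0457")

# first "word" of each ID
id_province <- word(sample_ids, start = 1, sep = "_")

# last "word" of each ID (negative positions count from the right)
id_number <- word(sample_ids, start = -1, sep = "_")

id_province
id_number
```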
+
+
+
Replace by character position
+
str_sub() paired with the assignment operator (<-) can be used to modify a part of a string:
+
+
word <-"pneumonia"
+
+# convert the third and fourth characters to X
+str_sub(word, 3, 4) <-"XX"
+
+# print
+word
+
+
[1] "pnXXmonia"
+
+
+
An example applied to multiple strings (e.g. a column). Note the expansion in length of “HIV”.
+
+
words <- c("pneumonia", "tuberculosis", "HIV")
+
+# convert the third and fourth characters to X
+str_sub(words, 3, 4) <- "XX"
+
+words
+
+
[1] "pnXXmonia" "tuXXrculosis" "HIXX"
+
+
+
+
+
Evaluate length
+
+
str_length("abc")
+
+
[1] 3
+
+
+
Alternatively, use nchar() from base R.
+
+
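Both functions are vectorised, so they can be applied to a whole column at once. A brief sketch with an illustrative vector (diagnoses is a made-up name):

```r
library(stringr)

# illustrative vector including a missing value
diagnoses <- c("pneumonia", "HIV", NA)

# stringr: the length of each value, NA stays NA
len_stringr <- str_length(diagnoses)
len_stringr

# base R equivalent on the non-missing values
len_base <- nchar(diagnoses[1:2])
len_base
```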
+
+
+
10.5 Patterns
+
Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.
+
+
+
Detect a pattern
+
Use str_detect() as below to detect presence/absence of a pattern within a string. First provide the string or vector to search in (string =), and then the pattern to look for (pattern =). Note that by default the search is case sensitive!
+
+
str_detect(string ="primary school teacher", pattern ="teach")
+
+
[1] TRUE
+
+
+
The argument negate = can be included and set to TRUE if you want to know if the pattern is NOT present.
+
+
str_detect(string ="primary school teacher", pattern ="teach", negate =TRUE)
+
+
[1] FALSE
+
+
+
To ignore case/capitalization, wrap the pattern within regex(), and within regex() add the argument ignore_case = TRUE (or T as shorthand).
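As a short illustration of the difference (the example string is made up for this sketch):

```r
library(stringr)

job <- "Primary School Teacher"

# default search is case sensitive: "teach" does not match "Teach"
case_sensitive <- str_detect(job, "teach")

# wrapping the pattern in regex() with ignore_case = TRUE finds it
case_insensitive <- str_detect(job, regex("teach", ignore_case = TRUE))

case_sensitive
case_insensitive
```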
When str_detect() is applied to a character vector or a data frame column, it will return TRUE or FALSE for each of the values.
+
+
# a vector/column of occupations
+occupations <-c("field laborer",
+"university professor",
+"primary school teacher & tutor",
+"tutor",
+"nurse at regional hospital",
+"lineworker at Amberdeen Fish Factory",
+"physician",
+"cardiologist",
+"office worker",
+"food service")
+
+# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
+str_detect(occupations, "teach")
If you need to build a long list of search terms, you can combine them using str_c() and sep = "|", define the result as a character object, and then reference that object later more succinctly. The example below includes possible occupation search terms for front-line medical providers.
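One possible way to do this is sketched here. It assumes the terms are already stored in a character vector, so it uses collapse = "|" rather than sep; the term list and object names are hypothetical.

```r
library(stringr)

occupations <- c("field laborer", "nurse at regional hospital",
                 "physician", "cardiologist", "office worker")

# hypothetical search terms held in a vector
medical_terms <- c("nurse", "physician", "cardio", "doctor")

# collapse the vector into one pattern, with terms separated by the OR bar
medical_pattern <- str_c(medical_terms, collapse = "|")
medical_pattern

# reference the stored pattern later
is_medical <- str_detect(occupations, regex(medical_pattern, ignore_case = TRUE))
is_medical
```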
The base function grepl() works similarly to str_detect(), in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...). One advantage is that the ignore.case argument is easier to write (there is no need to involve the regex() function).
+
Likewise, the base functions sub() and gsub() act similarly to str_replace(). Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE). sub() will replace the first instance of the pattern, whereas gsub() will replace all instances of the pattern.
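The difference between sub() and gsub() is easiest to see on a string that contains the pattern twice. A minimal sketch with made-up notes:

```r
# illustrative notes; the first contains "dead" twice
notes <- c("dead on arrival, confirmed dead", "not dead")

# sub() replaces only the first instance in each string
first_only <- sub("dead", "deceased", notes)

# gsub() replaces every instance in each string
all_instances <- gsub("dead", "deceased", notes)

first_only
all_instances
```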
+
+
Convert commas to periods
+
Here is an example of using gsub() to convert commas to periods in a vector of numbers. This could be useful if your data come from parts of the world other than the United States or Great Britain.
+
The inner gsub(), which acts first on lengths, converts any periods to nothing (an empty string ""). The period character "." has to be “escaped” with two backslashes to actually signify a period, because "." in regex means “any character”. Then, the result (with only commas) is passed to the outer gsub(), in which commas are replaced by periods.
+
+
lengths <-c("2.454,56", "1,2", "6.096,5")
+
+as.numeric(gsub(pattern =",", # find commas
+replacement =".", # replace with periods
+x =gsub("\\.", "", lengths) # vector with other periods removed (periods escaped)
+ )
+ ) # convert outcome to numeric
+
+
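The same conversion can be sketched with stringr. Here fixed() is used so that "." and "," are treated as literal characters and no escaping is needed. The object names are illustrative.

```r
library(stringr)

lengths_eu <- c("2.454,56", "1,2", "6.096,5")

# remove the thousands separators (literal periods)
no_periods <- str_replace_all(lengths_eu, fixed("."), "")

# convert the decimal commas to periods, then convert to numeric
lengths_num <- as.numeric(str_replace_all(no_periods, fixed(","), "."))

lengths_num
```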
+
+
+
Replace all
+
Use str_replace_all() as a “find and replace” tool. First, provide the strings to be evaluated to string =, then the pattern to be replaced to pattern =, and then the replacement value to replacement =. The example below replaces all instances of “dead” with “deceased”. Note, this IS case sensitive.
[1] "Karl: deceased" "Samantha: deceased" "Marco: not deceased"
+
+
+
Notes:
+
+
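Despite its name, str_replace_na() does not insert NA: it converts missing values (NA) into the character string "NA". To replace a complete string with NA where the pattern matches, use replacement = NA_character_ in str_replace().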
+
+
The function str_replace() replaces only the first instance of the pattern within each evaluated string.
+
+
+
+
+
Detect within logic
+
Within case_when()
+
str_detect() is often used within case_when() (from dplyr). Let’s say occupations is a column in the linelist. The mutate() below creates a new column called is_educator by using conditional logic via case_when(). See the page on data cleaning to learn more about case_when().
+
+
df <- df %>%
+mutate(is_educator =case_when(
+# term search within occupation, not case sensitive
+str_detect(occupations,
+regex("teach|prof|tutor|university",
+ignore_case =TRUE)) ~"Educator",
+# all others
+TRUE~"Not an educator"))
+
+
As a reminder, it may be important to add exclusion criteria to the conditional logic (negate = TRUE):
+
+
df <- df %>%
+# value in new column is_educator is based on conditional logic
+mutate(is_educator =case_when(
+
+# occupation column must meet 2 criteria to be assigned "Educator":
+# it must have a search term AND NOT any exclusion term
+
+# Must have a search term
+str_detect(occupations,
+regex("teach|prof|tutor|university", ignore_case = T)) &
+
+# AND must NOT have an exclusion term
+str_detect(occupations,
+regex("admin", ignore_case = T),
+negate = TRUE) ~ "Educator",
+
+# All rows not meeting above criteria
+TRUE~"Not an educator"))
+
+
+
+
+
Locate pattern position
+
To locate the first position of a pattern, use str_locate(). It outputs a start and end position.
+
+
str_locate("I wish", "sh")
+
+
start end
+[1,] 5 6
+
+
+
Like other str functions, there is an “_all” version (str_locate_all()) which will return the positions of all instances of the pattern within each string. This outputs as a list.
+
+
phrases <-c("I wish", "I hope", "he hopes", "He hopes")
+
+str_locate(phrases, "h" ) # position of *first* instance of the pattern
+
+
start end
+[1,] 6 6
+[2,] 3 3
+[3,] 1 1
+[4,] 4 4
+
+
str_locate_all(phrases, "h" ) # position of *every* instance of the pattern
+
+
[[1]]
+ start end
+[1,] 6 6
+
+[[2]]
+ start end
+[1,] 3 3
+
+[[3]]
+ start end
+[1,] 1 1
+[2,] 4 4
+
+[[4]]
+ start end
+[1,] 4 4
+
+
+
+
+
+
Extract a match
+
str_extract_all() returns the matching patterns themselves, which is most useful when you have offered several patterns via “OR” conditions. For example, looking in the string vector of occupations (see previous tab) for either “teach”, “prof”, or “tutor”.
+
str_extract_all() returns a list which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.
str_extract() extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA where there was no match. The NAs can be removed by wrapping the returned vector with na.exclude(). Note how the second of occupation 3’s matches is not shown.
+
+
str_extract(occupations, "teach|prof|tutor")
+
+
[1] NA "prof" "teach" "tutor" NA NA NA NA NA
+[10] NA
+
+
+
+
+
+
Subset and count
+
Aligned functions include str_subset() and str_count().
+
str_subset() returns the actual values which contained the pattern:
+
+
str_subset(occupations, "teach|prof|tutor")
+
+
[1] "university professor" "primary school teacher & tutor"
+[3] "tutor"
+
+
+
str_count() returns a vector of numbers: the number of times a search term appears in each evaluated value.
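A brief sketch of str_count(), using a shortened version of the occupations vector:

```r
library(stringr)

occupations_short <- c("university professor",
                       "primary school teacher & tutor",
                       "tutor",
                       "office worker")

# number of times any of the search terms appears in each value
n_matches <- str_count(occupations_short, "teach|prof|tutor")
n_matches
```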
The backslash \ is used to “escape” the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\") - the middle quote mark will not “break” the surrounding quote marks.
+
Note - thus, if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\ to display one.
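The effect of escaping is clearest when the string is printed with cat(), which shows the characters actually stored. The strings below are made up for illustration.

```r
# a quote mark inside quote marks, escaped with a backslash
quoted <- "The patient said \"I feel dizzy\""
cat(quoted, "\n")

# two backslashes in the code store a single backslash in the string
file_path <- "C:\\data\\linelist.csv"
cat(file_path, "\n")

# each escaped pair counts as one character
nchar(file_path)
```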
Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).
+
+
+
+
10.7 Regular expressions (regex) and special characters
+
Regular expressions, or “regex”, is a concise language for describing patterns in strings. If you are not familiar with it, a regular expression can look like an alien language. Here we try to de-mystify this language a little bit.
+
Much of this section is adapted from this tutorial and this cheatsheet. We selectively adapt here knowing that this handbook might be viewed by people without internet access to view the other tutorials.
+
A regular expression is often applied to extract specific patterns from “unstructured” text - for example medical notes, chief complaints, patient history, or other free text columns in a data frame.
+
There are four basic tools one can use to create a basic regular expression:
+
+
Character sets.
+
+
Meta characters.
+
+
Quantifiers.
+
+
Groups.
+
+
Character sets
+
Character sets are a way of listing the options for a character match, within brackets. A match will be triggered if any of the characters within the brackets is found in the string. For example, to look for vowels one could use this character set: "[aeiou]". Some other common character sets are:
+
+
+
+
+
+
+
+
Character set    Matches for
"[A-Z]"          any single capital letter
"[a-z]"          any single lowercase letter
"[0-9]"          any digit
[:alnum:]        any alphanumeric character
[:digit:]        any numeric digit
[:alpha:]        any letter (upper or lowercase)
[:upper:]        any uppercase letter
[:lower:]        any lowercase letter
+
+
+
+
Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]" (any upper or lowercase letter), or another example "[t-z0-5]" (lowercase t through z OR number 0 through 5).
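A few character sets applied to a small vector, as a sketch. The ids values are hypothetical. The named classes are written inside an extra pair of brackets ("[[:lower:]]"), a form that base R functions such as grepl() also accept.

```r
library(stringr)

# hypothetical identifiers
ids <- c("case_A12", "case_b7", "CASE_Z99")

# first capital letter in each value (NA if there is none)
first_capital <- str_extract(ids, "[A-Z]")

# first run of digits in each value
first_digits <- str_extract(ids, "[0-9]+")

# does the value contain any lowercase letter?
has_lower <- str_detect(ids, "[[:lower:]]")

first_capital
first_digits
has_lower
```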
+
Meta characters
+
Meta characters are shorthand for character sets. Some of the important ones are listed below:
+
+
+
+
+
+
+
+
Meta character   Represents
"\\s"            a single whitespace character (space, tab, or newline)
"\\w"            any single “word” character (A-Z, a-z, 0-9, or underscore)
"\\d"            any single numeric digit (0-9)
+
+
+
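The meta characters can be tried on a short note, as sketched here (the note text is made up). Remember that each backslash is doubled inside an R string.

```r
library(stringr)

note <- "Temp 38.5 on day 3"

# every single digit
digits <- str_extract_all(note, "\\d")[[1]]

# runs of word characters (letters, digits, underscore)
words_found <- str_extract_all(note, "\\w+")[[1]]

# number of whitespace characters
n_spaces <- str_count(note, "\\s")

digits
words_found
n_spaces
```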
+
Quantifiers
+
Typically you do not want to search for a match on only one character. Quantifiers allow you to designate the length of letters/numbers to allow for the match.
+
Quantifiers are numbers written within curly brackets { } after the character they are quantifying, for example:
+
+
"A{2}" will return instances of two capital A letters.
+
+
"A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
+
+
"A{2,}" will return instances of two or more capital A letters.
+
+
"A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
+
+
Follow a character with an * asterisk to return zero or more matches (useful if you are not sure the pattern is present).
+
+
Using the + plus symbol as a quantifier, the match will occur until a different character is encountered. For example, this expression will return all words (alpha characters): "[A-Za-z]+".
+
+
# test string for quantifiers
+test <-"A-AA-AAA-AAAA"
+
+
When a quantifier of {2} is used, only pairs of consecutive A’s are returned. Two pairs are identified within AAAA.
+
+
str_extract_all(test, "A{2}")
+
+
[[1]]
+[1] "AA" "AA" "AA" "AA"
+
+
+
When a quantifier of {2,4} is used, groups of consecutive A’s that are two to four in length are returned.
+
+
str_extract_all(test, "A{2,4}")
+
+
[[1]]
+[1] "AA" "AAA" "AAAA"
+
+
+
With the quantifier +, groups of one or more are returned:
+
+
str_extract_all(test, "A+")
+
+
[[1]]
+[1] "A" "AA" "AAA" "AAAA"
+
+
+
Relative position
+
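These express requirements for what precedes or follows a pattern. For example, "(?<=\\.)\\s(?=[A-Z])" matches a whitespace character that is preceded by a period and followed by a capital letter, which is roughly where one sentence ends and the next begins. By contrast, an empty pattern ("") carries no requirement at all; stringr treats it as a character boundary, so the command below returns every character of test separately.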
+
+
str_extract_all(test, "")
+
+
[[1]]
+ [1] "A" "-" "A" "A" "-" "A" "A" "A" "-" "A" "A" "A" "A"
+
+
+
+
+
+
+
+
+
+
Position statement   Matches to
"(?<=b)a"            “a” that is preceded by a “b”
"(?<!b)a"            “a” that is NOT preceded by a “b”
"a(?=b)"             “a” that is followed by a “b”
"a(?!b)"             “a” that is NOT followed by a “b”
+
+
+
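Position statements are handy for pulling a value out of the text around it without returning that text. A sketch using a made-up sentence:

```r
library(stringr)

pt_vitals <- "Patient temperature was 99.8 degrees. Pulse rate was 100 bpm."

# digits that are followed by " bpm" (the " bpm" itself is not returned)
pulse <- str_extract(pt_vitals, "\\d+(?= bpm)")

# a decimal number that is preceded by "temperature was "
temperature <- str_extract(pt_vitals, "(?<=temperature was )\\d+\\.\\d+")

# every "a" that is NOT followed by "b"
a_not_b <- str_extract_all("ab ac ad", "a(?!b)")[[1]]

pulse
temperature
a_not_b
```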
+
Groups
+
Capturing groups in your regular expression is a way to have a more organized output upon extraction.
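As one illustration, str_match() returns a matrix with the full match in the first column and one further column per group in parentheses. This sketch uses a made-up sentence and illustrative object names.

```r
library(stringr)

visit <- "Patient arrived at 18:00 on 6/12/2005."

# two groups: the hour and the minutes of a time written as HH:MM
time_parts <- str_match(visit, "(\\d{1,2}):(\\d{2})")

time_parts        # columns: full match, group 1, group 2
time_parts[, 2]   # the hour
time_parts[, 3]   # the minutes
```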
+
Regex examples
+
Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.
+
+
pt_note <-"Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."
+
+
This expression matches to all words (any character until hitting non-character such as a space):
This page demonstrates common steps used in the process of “cleaning” a dataset, and also explains the use of many essential R data management functions.
+
To demonstrate data cleaning, this page begins by importing a raw case linelist dataset, and proceeds step-by-step through the cleaning process. In the R code, this manifests as a “pipe” chain, which references the “pipe” operator %>% that passes a dataset from one operation to the next.
+
+
Core functions
+
This handbook emphasizes use of the functions from the tidyverse family of R packages. The essential R functions demonstrated in this page are listed below.
+
Many of these functions belong to the dplyr R package, which provides “verb” functions to solve data manipulation challenges (the name is a reference to a “data frame-plier”). dplyr is part of the tidyverse family of R packages (which also includes ggplot2, tidyr, stringr, tibble, purrr, magrittr, and forcats among others).
re-code values in a column using more complex logical criteria
+
dplyr
+
+
+
replace_na(), na_if(), coalesce()   special functions for re-coding                   tidyr
age_categories() and cut()          create categorical groups from a numeric column   epikit and base R
match_df()                          re-code/clean values using a data dictionary      matchmaker
which()                             apply logical criteria; return indices            base R
+
+
+
+
If you want to see how these functions compare to Stata or SAS commands, see the page on Transition to R.
+
You may encounter an alternative data management framework from the data.table R package with operators like := and frequent use of brackets [ ]. This approach and syntax is briefly explained in the Data Table page.
+
+
+
Nomenclature
+
In this handbook, we generally reference “columns” and “rows” instead of “variables” and “observations”. As explained in this primer on “tidy data”, most epidemiological statistical datasets consist structurally of rows, columns, and values.
+
Variables contain the values that measure the same underlying attribute (like age group, outcome, or date of onset). Observations contain all values measured on the same unit (e.g. a person, site, or lab sample). So these aspects can be more difficult to tangibly define.
+
In “tidy” datasets, each column is a variable, each row is an observation, and each cell is a single value. However some datasets you encounter will not fit this mold - a “wide” format dataset may have a variable split across several columns (see an example in the Pivoting data page). Likewise, observations could be split across several rows.
+
Most of this handbook is about managing and transforming data, so referring to the concrete data structures of rows and columns is more relevant than the more abstract observations and variables. Exceptions occur primarily in pages on data analysis, where you will see more references to variables and observations.
+
+
+
+
+
+
8.1 Cleaning pipeline
+
This page proceeds through typical cleaning steps, adding them sequentially to a cleaning pipe chain.
+
In epidemiological analysis and data processing, cleaning steps are often performed sequentially, linked together. In R, this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another.
+
Such chains utilize dplyr “verb” functions and the magrittr pipe operator %>%. This pipe begins with the “raw” data (“linelist_raw.xlsx”) and ends with a “clean” R data frame (linelist) that can be used, saved, exported, etc.
+
In a cleaning pipeline the order of the steps is important. Cleaning steps might include:
+
+
Importing of data.
+
+
Column names cleaned or changed.
+
+
De-duplication.
+
+
Column creation and transformation (e.g. re-coding or standardising values).
+
+
Rows filtered or added.
+
+
+
+
+
+
+
8.2 Load packages
+
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.
+
+
pacman::p_load(
+ rio, # importing data
+ here, # relative file pathways
+ janitor, # data cleaning and tables
+ lubridate, # working with dates
+ matchmaker, # dictionary-based cleaning
+ epikit, # age_categories() function
+ tidyverse # data management and visualization
+)
+
+
+
+
+
+
+
8.3 Import data
+
+
Import
+
Here we import the “raw” case linelist Excel file using the import() function from the package rio. The rio package flexibly handles many types of files (e.g. .xlsx, .csv, .tsv, .rds). See the page on Import and export for more information and tips on unusual situations (e.g. skipping rows, setting missing values, importing Google sheets, etc).
If your dataset is large and takes a long time to import, it can be useful to have the import command be separate from the pipe chain and the “raw” saved as a distinct file. This also allows easy comparison between the original and cleaned versions.
+
Below we import the raw Excel file and save it as the data frame linelist_raw. We assume the file is located in your working directory or R project root, and so no sub-folders are specified in the file path.
+
+
linelist_raw <-import("linelist_raw.xlsx")
+
+
You can view the first 50 rows of the data frame below. Note: the base R function head() allows you to view just the first n rows in the R console (e.g. head(linelist_raw, 10)).
+
+
+
+
+
+
+
+
+
Review
+
You can use the function skim() from the package skimr to get an overview of the entire dataframe (see page on Descriptive tables for more info). Columns are summarised by class/type such as character, numeric. Note: “POSIXct” is a type of raw date class (see Working with dates).
+
+
skimr::skim(linelist_raw)
+
+
+
+
+
Data summary

Name                     linelist_raw
Number of rows           6611
Number of columns        28

Column type frequency:
  character              17
  numeric                8
  POSIXct                3

Group variables          None

Variable type: character

skim_variable    n_missing  complete_rate  min  max  empty  n_unique  whitespace
case_id                137           0.98    6    6      0      5888           0
date onset             293           0.96   10   10      0       580           0
outcome               1500           0.77    5    7      0         2           0
gender                 324           0.95    1    1      0         2           0
hospital              1512           0.77    5   36      0        13           0
infector              2323           0.65    6    6      0      2697           0
source                2323           0.65    5    7      0         2           0
age                    107           0.98    1    2      0        75           0
age_unit                 7           1.00    5    6      0         2           0
fever                  258           0.96    2    3      0         2           0
chills                 258           0.96    2    3      0         2           0
cough                  258           0.96    2    3      0         2           0
aches                  258           0.96    2    3      0         2           0
vomit                  258           0.96    2    3      0         2           0
time_admission         844           0.87    5    5      0      1091           0
merged_header            0           1.00    1    1      0         1           0
…28                      0           1.00    1    1      0         1           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd      p0      p25      p50      p75     p100
generation             7           1.00    16.60     5.71    0.00    13.00    16.00    20.00    37.00
lon                    7           1.00   -13.23     0.02  -13.27   -13.25   -13.23   -13.22   -13.21
lat                    7           1.00     8.47     0.01    8.45     8.46     8.47     8.48     8.49
row_num                0           1.00  3240.91  1857.83    1.00  1647.50  3241.00  4836.50  6481.00
wt_kg                  7           1.00    52.69    18.59  -11.00    41.00    54.00    66.00   111.00
ht_cm                  7           1.00   125.25    49.57    4.00    91.00   130.00   159.00   295.00
ct_blood               7           1.00    21.26     1.67   16.00    20.00    22.00    22.00    26.00
temp                 158           0.98    38.60     0.95   35.20    38.30    38.80    39.20    40.80

Variable type: POSIXct

skim_variable     n_missing  complete_rate  min         max         median      n_unique
infection date         2322           0.65  2012-04-09  2015-04-27  2014-10-04       538
hosp date                 7           1.00  2012-04-20  2015-04-30  2014-10-15       570
date_of_outcome        1068           0.84  2012-05-14  2015-06-04  2014-10-26       575
+
+
+
+
+
+
+
+
+
+
+
+
8.4 Column names
+
In R, column names are the “header” or “top” value of a column. They are used to refer to columns in the code, and serve as a default label in figures.
+
Other statistical software such as SAS and STATA use “labels” that co-exist as longer printed versions of the shorter column names. While R does offer the possibility of adding column labels to the data, this is not emphasized in most practice. To make column names “printer-friendly” for figures, one typically adjusts their display within the plotting commands that create the outputs (e.g. axis or legend titles of a plot, or column headers in a printed table - see the scales section of the ggplot tips page and Tables for presentation pages). If you want to assign column labels in the data, read more online here and here.
+
Because R column names are used very often, they must have “clean” syntax. We suggest the following:
+
+
Short names.
+
No spaces (replace with underscores _ ).
+
No unusual characters (&, #, <, >, …).
+
+
Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…).
+
+
The column names of linelist_raw are printed below using names() from base R. We can see that initially:
+
+
Some names contain spaces (e.g. infection date).
+
+
Different naming patterns are used for dates (date onset vs. infection date).
+
+
There must have been a merged header across the two last columns in the .xlsx. We know this because the name of two merged columns (“merged_header”) was assigned by R to the first column, and the second column was assigned a placeholder name “…28” (as it was then empty and is the 28th column).
NOTE: To reference a column name that includes spaces, surround the name with back-ticks, for example: linelist$`infection date`. Note that on your keyboard, the back-tick (`) is different from the single quotation mark (’).
+
+
Automatic cleaning
+
The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:
+
+
Converts all names to consist of only underscores, numbers, and letters.
+
+
Accented characters are transliterated to ASCII (e.g. German o with umlaut becomes “o”, Spanish “enye” becomes “n”).
+
+
Capitalization preference for the new column names can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…).
+
+
You can specify specific name replacements by providing a vector to the replace = argument (e.g. replace = c(onset = "date_of_onset")).
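To see the effect in isolation, clean_names() can be run on a tiny data frame with awkward names. This is only a sketch; the data frame messy and its column names are invented for illustration.

```r
library(janitor)

# a small data frame with spaces, capitals and symbols in the names;
# check.names = FALSE stops data.frame() from altering them first
messy <- data.frame(`infection date` = 1,
                    `Hosp Date`      = 2,
                    `wt (kg)`        = 3,
                    check.names = FALSE)

names(messy)

# standardized names: lowercase, underscores, no symbols
cleaned_names <- names(clean_names(messy))
cleaned_names
```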
+
Below, the cleaning pipeline begins by using clean_names() on the raw linelist.
+
+
# pipe the raw dataset through the function clean_names(), assign result as "linelist"
+linelist <- linelist_raw %>%
+ janitor::clean_names()
+
+# see the new column names
+names(linelist)
NOTE: The last column name “…28” was changed to “x28”.
+
+
+
Manual name cleaning
+
Re-naming columns manually is often necessary, even after the standardization step above. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style NEW = OLD: the new column name is given before the old column name.
+
Below, a re-naming command is added to the cleaning pipeline. Spaces have been added strategically to align code for easier reading.
+
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome)
+
+
Now you can see that the column names have been changed:
As a shortcut, you can also rename columns within the dplyr select() and summarise() functions. select() is used to keep only certain columns (and is covered later in this page). summarise() is covered in the Grouping data and Descriptive tables pages. These functions also use the format new_name = old_name. Here is an example:
+
+
linelist_raw %>%
+# rename and KEEP ONLY these columns
+select(# NEW name # OLD name
+date_infection =`infection date`,
+date_hospitalisation =`hosp date`)
+
+
+
+
+
Other challenges
+
+
Empty Excel column names
+
R cannot have dataset columns that do not have column names (headers). So, if you import an Excel dataset with data but no column headers, R will fill-in the headers with names like “…1” or “…2”. The number represents the column number (e.g. if the 4th column in the dataset has no header, then R will name it “…4”).
+
You can clean these names manually by referencing their position number (see example above), or their assigned name (linelist_raw$...1).
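Both approaches are sketched here on a small invented data frame in which the second column has been given a placeholder name of the kind described above. The new name wt_kg is purely illustrative.

```r
library(dplyr)

# invented data frame; the second column imitates a header-less Excel column
demo <- data.frame(case_id = c("a1", "b2"), x = c(55, 61))
names(demo)[2] <- "...2"

# re-name by position number (2nd column)
by_position <- demo %>% rename(wt_kg = 2)

# re-name by the assigned placeholder name (back-ticks needed)
by_name <- demo %>% rename(wt_kg = `...2`)

names(by_position)
names(by_name)
```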
+
+
+
Merged Excel column names and cells
+
Merged cells in an Excel file are a common occurrence when receiving data. As explained in Transition to R, merged cells can be nice for human reading of data, but are not “tidy data” and cause many problems for machine reading of data. R cannot accommodate merged cells.
+
Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users about the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.
+
+
Each variable must have its own column.
+
+
Each observation must have its own row.
+
+
Each value must have its own cell.
+
+
When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.
+
One solution to deal with merged cells is to import the data with the function readWorkbook() from the package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.
DANGER: If column names are merged with readWorkbook(), you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning.
+
+
+
+
+
+
+
+
8.5 Select or re-order columns
+
Use select() from dplyr to select the columns you want to retain, and to specify their order in the data frame.
+
CAUTION: In the examples below, the linelist data frame is modified with select() and displayed, but not saved. This is for demonstration purposes. The modified column names are printed by piping the data frame to names().
+
Here are ALL the column names in the linelist at this point in the cleaning pipe chain:
Put their names in the select() command, with no quotation marks. They will appear in the data frame in the order you provide. Note that if you include a column that does not exist, R will return an error (see use of any_of() below if you want no error in this situation).
+
+
# linelist dataset is piped through select() command, and names() prints just the column names
+linelist %>%
+select(case_id, date_onset, date_hospitalisation, fever) %>%
+names() # display the column names
These helper functions exist to make it easy to specify columns to keep, discard, or transform. They are from the package tidyselect, which is included in tidyverse and underlies how columns are selected in dplyr functions.
+
For example, if you want to re-order the columns, everything() is a useful function to signify “all other columns not yet mentioned”. The command below moves columns date_onset and date_hospitalisation to the beginning (left) of the dataset, but keeps all the other columns afterward. Note that everything() is written with empty parentheses:
+
+
# move date_onset and date_hospitalisation to beginning
+linelist %>%
+select(date_onset, date_hospitalisation, everything()) %>%
+names()
In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
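As a minimal sketch of these operators, assuming a small made-up data frame (`demo` and its column names are hypothetical stand-ins for the linelist):

```r
library(dplyr)

# hypothetical one-row data frame, for illustration only
demo <- tibble(case_id = "a1",
               date_onset = as.Date("2014-05-01"),
               date_hosp  = as.Date("2014-05-02"),
               age = 4,
               age_unit = "years")

cols_c     <- demo %>% select(c(case_id, age)) %>% names()     # c(): list several columns
cols_range <- demo %>% select(date_onset:age) %>% names()      # : consecutive columns
cols_not   <- demo %>% select(!age_unit) %>% names()           # ! everything except
cols_and   <- demo %>% select(starts_with("date") & ends_with("onset")) %>% names()  # & both conditions
cols_or    <- demo %>% select(starts_with("age") | contains("id")) %>% names()       # | either condition
```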
+
Use where() to specify logical criteria for columns. If providing a function inside where(), do not include the function’s empty parentheses. The command below selects columns that are class Numeric.
+
+
# select columns that are class Numeric
+linelist %>%
+select(where(is.numeric)) %>%
+names()
Use contains() to select only columns in which the column name contains a specified character string. ends_with() and starts_with() provide more nuance.
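A minimal sketch of these name-based helpers, using a small made-up data frame (`demo` and its columns are hypothetical, not the real linelist):

```r
library(dplyr)

# hypothetical toy data standing in for the linelist
demo <- tibble(
  case_id    = c("a1", "b2"),
  date_onset = as.Date(c("2014-05-01", "2014-05-03")),
  date_hosp  = as.Date(c("2014-05-02", "2014-05-05")),
  age        = c(4, 31),
  age_unit   = c("years", "years")
)

cols_contains <- demo %>% select(contains("date")) %>% names()    # "date" anywhere in the name
cols_starts   <- demo %>% select(starts_with("age")) %>% names()  # name begins with "age"
cols_ends     <- demo %>% select(ends_with("unit")) %>% names()   # name ends with "unit"
```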
The function matches() works similarly to contains() but can be provided a regular expression (see page on Characters and strings), such as multiple strings separated by OR bars within the parentheses:
+
+
# searched for multiple character matches
+linelist %>%
+select(matches("onset|hosp|fev")) %>% # note the OR symbol "|"
+names()
CAUTION: If a column name that you specifically provide does not exist in the data, it can return an error and stop your code. Consider using any_of() to cite columns that may or may not exist, especially useful in negative (remove) selections.
+
Only one of these columns exists, but no error is produced and the code continues without stopping your cleaning chain.
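A minimal sketch of this behaviour, assuming a small made-up data frame (`demo` is hypothetical; the columns "vomit" and "cough" are deliberately absent from it):

```r
library(dplyr)

# hypothetical toy data: "fever" exists, "vomit" and "cough" do not
demo <- tibble(case_id = c("a1", "b2"),
               fever   = c("yes", "no"),
               age     = c(4, 31))

# any_of() silently skips the names that are not found
kept <- demo %>%
  select(any_of(c("fever", "vomit", "cough"))) %>%
  names()

# the same helper inside a negative (remove) selection
remaining <- demo %>%
  select(-any_of(c("fever", "vomit"))) %>%
  names()
```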
Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained.
+
+
linelist %>%
+select(-c(date_onset, fever:vomit)) %>% # remove date_onset and all columns from fever to vomit
+names()
You can also remove a column using base R syntax, by defining it as NULL. For example:
+
+
linelist$date_onset <- NULL # deletes column with base R syntax
+
+
+
+
Standalone
+
select() can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.
+
+
# Create a new linelist with id and age-related columns
+linelist_age <- select(linelist, case_id, contains("age"))
+
+# display the column names
+names(linelist_age)
+
+
[1] "case_id" "age" "age_unit"
+
+
+
+
Add to the pipe chain
+
In the linelist_raw, there are a few columns we do not need: row_num, merged_header, and x28. We remove them with a select() command in the cleaning pipe chain:
+
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+#####################################################
+
+# remove column
+select(-c(row_num, merged_header, x28))
+
+
+
+
+
+
+
+
+
8.6 Deduplication
+
See the handbook page on De-duplication for extensive options on how to de-duplicate data. Only a very simple row de-duplication example is presented here.
+
The package dplyr offers the distinct() function. This function examines every row and reduces the data frame to only the unique rows. That is, it removes rows that are 100% duplicates.
+
When evaluating duplicate rows, it takes into account a range of columns - by default it considers all columns. As shown in the de-duplication page, you can adjust this column range so that the uniqueness of rows is only evaluated in regards to certain columns.
+
In this simple example, we just add the empty command distinct() to the pipe chain. This ensures there are no rows that are 100% duplicates of other rows (evaluated across all columns).
+
We begin with nrow(linelist) rows in linelist.
+
+
linelist <- linelist %>%
+distinct()
+
+
After de-duplication there are nrow(linelist) rows. Any removed rows would have been 100% duplicates of other rows.
+
Below, the distinct() command is added to the cleaning pipe chain:
+
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# remove column
+select(-c(row_num, merged_header, x28)) %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+#####################################################
+
+# de-duplicate
+distinct()
+
+
+
+
+
+
+
8.7 Column creation and transformation
+
We recommend using the dplyr function mutate() to add a new column, or to modify an existing one.
+
Below is an example of creating a new column with mutate(). The syntax is: mutate(new_column_name = value or transformation).
+
In Stata, this is similar to the command generate, but R’s mutate() can also be used to modify an existing column.
+
+
New columns
+
The most basic mutate() command to create a new column might look like this. It creates a new column new_col where the value in every row is 10.
+
+
linelist <- linelist %>%
+mutate(new_col = 10)
+
+
You can also reference values in other columns, to perform calculations. Below, a new column bmi is created to hold the Body Mass Index (BMI) for each case - as calculated using the formula BMI = kg/m^2, using column ht_cm and column wt_kg.
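A minimal sketch of the BMI calculation, assuming a made-up data frame with the columns wt_kg and ht_cm (height is divided by 100 to convert centimetres to metres, the same expression used later in the cleaning pipe chain):

```r
library(dplyr)

# hypothetical toy data: weight in kg, height in cm
demo <- tibble(wt_kg = c(70, 16),
               ht_cm = c(175, 100))

demo <- demo %>%
  mutate(bmi = wt_kg / (ht_cm / 100)^2)   # BMI = kg / m^2
```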
If creating multiple new columns, separate each with a comma and new line. Below are examples of new columns, including ones that consist of values from other columns combined using str_glue() from the stringr package (see page on Characters and strings).
+
+
new_col_demo <- linelist %>%
+mutate(
+new_var_dup = case_id, # new column = duplicate/copy another existing column
+new_var_static =7, # new column = all values the same
+new_var_static = new_var_static +5, # you can overwrite a column, and it can be a calculation using other variables
+new_var_paste = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
+ ) %>%
+select(case_id, hospital, date_hospitalisation, contains("new")) # show only new columns, for demonstration purposes
+
+
Review the new columns. For demonstration purposes, only the new columns and the columns used to create them are shown:
+
+
+
+
+
+
+
TIP: A variation on mutate() is the function transmute(). This function adds a new column just like mutate(), but also drops/removes all other columns that you do not mention within its parentheses.
+
+
+
+
+
+
Convert column class
+
Columns containing values that are dates, numbers, or logical values (TRUE/FALSE) will only behave as expected if they are correctly classified. There is a difference between “2” of class character and 2 of class numeric!
+
There are ways to set column class during the import commands, but this is often cumbersome. See the R Basics section on object classes to learn more about converting the class of objects and columns.
+
First, let’s run some checks on important columns to see if they are the correct class. We also saw this in the beginning when we ran skim().
+
Currently, the class of the age column is character. To perform quantitative analyses, we need these numbers to be recognized as numeric!
+
+
class(linelist$age)
+
+
[1] "character"
+
+
+
The class of the date_onset column is also character! To perform analyses, these dates must be recognized as dates!
+
+
class(linelist$date_onset)
+
+
[1] "character"
+
+
+
To resolve this, use the ability of mutate() to re-define a column with a transformation. We define the column as itself, but converted to a different class. Here is a basic example, converting or ensuring that the column age is class Numeric:
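A minimal sketch of this conversion, using a made-up data frame in which age was imported as character (`demo` is hypothetical):

```r
library(dplyr)

# hypothetical toy data: age stored as character
demo <- tibble(age = c("4", "31", "12"))

# re-define the column as itself, converted to class numeric
demo <- demo %>%
  mutate(age = as.numeric(age))

class(demo$age)
```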
In a similar way, you can use as.character() and as.logical(). To convert to class Factor, you can use factor() from base R or as_factor() from forcats. Read more about this in the Factors page.
+
You must be careful when converting to class Date. Several methods are explained on the page Working with dates. Typically, the raw date values must all be in the same format for conversion to work correctly (e.g. “MM/DD/YYYY”, or “DD MM YYYY”). After converting to class Date, check your data to confirm that each value was converted correctly.
+
+
+
Grouped data
+
If your data frame is already grouped (see page on Grouping data), mutate() may behave differently than if the data frame is not grouped. Any summarizing functions, like mean(), median(), max(), etc. will calculate by group, not by all the rows.
+
+
# age normalized to mean of ALL rows
+linelist %>%
+mutate(age_norm = age /mean(age, na.rm=T))
+
+# age normalized to mean of hospital group
+linelist %>%
+group_by(hospital) %>%
+mutate(age_norm = age /mean(age, na.rm=T))
To write concise code, you will often want to apply the same transformation to multiple columns at once. This can be done with the across() function from the package dplyr (also contained within the tidyverse). across() is designed for use inside dplyr verbs that compute on column values, and is most commonly used within mutate() or summarise(). See how it is applied to summarise() in the page on Descriptive tables.
+
Specify the columns to the argument .cols = and the function(s) to apply to .fns =. Any additional arguments to provide to the .fns function can be included after a comma, still within across().
+
+
across() column selection
+
Specify the columns to the argument .cols =. You can name them individually, or use “tidyselect” helper functions. Specify the function to .fns =. Note that using the function mode demonstrated below, the function is written without its parentheses ( ).
+
Here the transformation as.character() is applied to specific columns named within across().
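A minimal sketch of this pattern, assuming a made-up data frame (`demo` and its columns temp and ht_cm are hypothetical choices for illustration):

```r
library(dplyr)

# hypothetical toy data
demo <- tibble(case_id = c("a1", "b2"),
               temp    = c(36.5, 38.1),
               ht_cm   = c(120, 171))

# apply as.character to the two named columns only
# (note: the function is given without parentheses)
demo <- demo %>%
  mutate(across(.cols = c(temp, ht_cm), .fns = as.character))
```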
The “tidyselect” helper functions are available to assist you in specifying columns. They are detailed above in the section on Selecting and re-ordering columns, and they include: everything(), last_col(), where(), starts_with(), ends_with(), contains(), matches(), num_range() and any_of().
+
Here is an example of how one would change all columns to character class:
+
+
#to change all columns to character class
+linelist <- linelist %>%
+mutate(across(.cols =everything(), .fns = as.character))
+
+
Convert to character all columns where the name contains the string “date” (note the placement of commas and parentheses):
+
+
# to change columns whose name contains "date" to character class
+linelist <- linelist %>%
+mutate(across(.cols =contains("date"), .fns = as.character))
+
+
Below, an example of mutating the columns that are currently class POSIXct (a raw datetime class that shows timestamps) - in other words, where the function is.POSIXct() evaluates to TRUE. Then we want to apply the function as.Date() to these columns to convert them to a normal class Date.
Note that within across() we also use the function where() as is.POSIXct is evaluating to either TRUE or FALSE.
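A minimal sketch of this conversion, assuming a made-up data frame with one POSIXct column (`demo` and `time_admission` are hypothetical names):

```r
library(dplyr)
library(lubridate)   # provides is.POSIXct()

# hypothetical toy data with a datetime (POSIXct) column
demo <- tibble(
  case_id        = c("a1", "b2"),
  time_admission = as.POSIXct(c("2014-05-02 09:30:00", "2014-05-05 17:45:00"),
                              tz = "UTC")
)

# convert every POSIXct column to class Date
demo <- demo %>%
  mutate(across(.cols = where(is.POSIXct), .fns = as.Date))
```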
+
+
Note that is.POSIXct() is from the package lubridate. Other similar “is” functions like is.character(), is.numeric(), and is.logical() are from base R.
+
+
+
+
across() functions
+
You can read the documentation with ?across for details on how to provide functions to across(). A few summary points: there are several ways to specify the function(s) to perform on a column and you can even define your own functions:
+
+
You can provide the function name alone (e.g. mean or as.character).
+
+
You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page).
+
+
You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))).
+
+
If you provide multiple functions, multiple transformed columns will be returned per input column, with unique names in the format col_fn. You can adjust how the new columns are named with the .names = argument using glue syntax (see page on Characters and strings) where {.col} and {.fn} are shorthand for the input column and function.
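A minimal sketch of supplying several functions and customising the output names, assuming a made-up data frame (`demo` and the naming pattern are illustrative choices, not from the handbook):

```r
library(dplyr)

# hypothetical toy data with some missing values
demo <- tibble(age  = c(4, NA, 31),
               temp = c(36.5, 38.1, NA))

summary_tbl <- demo %>%
  summarise(across(
    .cols  = c(age, temp),
    .fns   = list(mean   = ~ mean(.x, na.rm = TRUE),   # purrr-style function
                  n_miss = ~ sum(is.na(.x))),          # count of missing values
    .names = "{.fn}_of_{.col}"))                       # glue syntax for new names

names(summary_tbl)
```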
The dplyr function coalesce() finds the first non-missing value at each position. It “fills in” missing values with the first available value in an order you specify.
+
Here is an example outside the context of a data frame: Let us say you have two vectors, one containing the patient’s village of detection and another containing the patient’s village of residence. You can use coalesce to pick the first non-missing value for each index:
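A minimal sketch of this, with two made-up vectors (the village values are hypothetical):

```r
library(dplyr)

# hypothetical vectors: village of detection and village of residence
village_detection <- c("a", "b", NA, NA)
village_residence <- c("a", "c", "a", "d")

# take the detection village where available, otherwise the residence village
village <- coalesce(village_detection, village_residence)
village
```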
This works the same if you provide data frame columns: for each row, the function will assign the new column value with the first non-missing value in the columns you provided (in order provided).
This is an example of a “row-wise” operation. For more complicated row-wise calculations, see the section below on Row-wise calculations.
+
+
+
Cumulative math
+
If you want a column to reflect the cumulative sum/mean/min/max etc as assessed down the rows of a dataframe to that point, use the following functions:
+
cumsum() returns the cumulative sum, as shown below:
+
+
sum(c(2,4,15,10)) # returns only one number
+
+
[1] 31
+
+
cumsum(c(2,4,15,10)) # returns the cumulative sum at each step
+
+
[1] 2 6 21 31
+
+
+
This can be used in a dataframe when making a new column. For example, to calculate the cumulative number of cases per day in an outbreak, consider code like this:
+
+
cumulative_case_counts <- linelist %>% # begin with case linelist
+count(date_onset) %>% # count of rows per day, as column 'n'
+mutate(cumulative_cases = cumsum(n)) # new column, of the cumulative sum at each row
See the page on Epidemic curves for how to plot cumulative incidence with the epicurve.
+
See also:
+cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()
+
+
+
Using base R
+
To define a new column (or re-define a column) using base R, write the name of data frame, connected with $, to the new column (or the column to be modified). Use the assignment operator <- to define the new value(s). Remember that when using base R you must specify the data frame name before the column name every time (e.g. dataframe$column). Here is an example of creating the bmi column using base R:
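A minimal sketch of the base R approach, assuming a made-up data frame with columns wt_kg and ht_cm (`linelist_demo` is a hypothetical stand-in for the linelist):

```r
# hypothetical toy data: weight in kg, height in cm
linelist_demo <- data.frame(wt_kg = c(70, 16),
                            ht_cm = c(175, 100))

# the data frame name must precede every column name
linelist_demo$bmi <- linelist_demo$wt_kg / (linelist_demo$ht_cm / 100)^2
```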
Below, a new column is added to the pipe chain and some classes are converted.
+
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# remove column
+select(-c(row_num, merged_header, x28)) %>%
+
+# de-duplicate
+distinct() %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+###################################################
+# add new column
+mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
+
+# convert class of columns
+mutate(across(contains("date"), as.Date),
+generation =as.numeric(generation),
+age =as.numeric(age))
+
+
+
+
+
8.8 Re-code values
+
Here are a few scenarios where you need to re-code (change) values:
+
+
to edit one specific value (e.g. one date with an incorrect year or format).
+
+
to reconcile values not spelled the same.
+
to create a new column of categorical values.
+
+
to create a new column of numeric categories (e.g. age categories).
+
+
+
Specific values
+
To change values manually you can use the recode() function within the mutate() function.
+
Imagine there is a nonsensical date in the data (e.g. “2014-14-15”): you could fix the date manually in the raw source data, or, you could write the change into the cleaning pipeline via mutate() and recode(). The latter is more transparent and reproducible to anyone else seeking to understand or repeat your analysis.
+
+
# fix incorrect values # old value # new value
+linelist <- linelist %>%
+mutate(date_onset =recode(date_onset, "2014-14-15"="2014-04-15"))
+
+
The mutate() line above can be read as: “mutate the column date_onset to equal the column date_onset re-coded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this.
+
Here is another example re-coding multiple values within one column.
+
In linelist the values in the column “hospital” must be cleaned. There are several different spellings and many missing values.
+
+
table(linelist$hospital, useNA ="always") # print table of all unique values, including missing
+
+
+ Central Hopital Central Hospital
+ 11 457
+ Hospital A Hospital B
+ 290 289
+ Military Hopital Military Hospital
+ 32 798
+ Mitylira Hopital Mitylira Hospital
+ 1 79
+ Other Port Hopital
+ 907 48
+ Port Hospital St. Mark's Maternity Hospital (SMMH)
+ 1756 417
+ St. Marks Maternity Hopital (SMMH) <NA>
+ 11 1512
+
+
+
The recode() command below re-defines the column “hospital” as the current column “hospital”, but with the specified recode changes. Don’t forget commas after each!
+
+
linelist <- linelist %>%
+mutate(hospital =recode(hospital,
+# for reference: OLD = NEW
+"Mitylira Hopital"="Military Hospital",
+"Mitylira Hospital"="Military Hospital",
+"Military Hopital"="Military Hospital",
+"Port Hopital"="Port Hospital",
+"Central Hopital"="Central Hospital",
+"other"="Other",
+"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
+ ))
+
+
Now we see the spellings in the hospital column have been corrected and consolidated:
+
+
table(linelist$hospital, useNA ="always")
+
+
+ Central Hospital Hospital A
+ 468 290
+ Hospital B Military Hospital
+ 289 910
+ Other Port Hospital
+ 907 1804
+St. Mark's Maternity Hospital (SMMH) <NA>
+ 428 1512
+
+
+
TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.
+
TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space in between ("").
+
+
+
By logic
+
Below we demonstrate how to re-code values in a column using logic and conditions:
+
+
Using replace(), ifelse() and if_else() for simple logic.
+
Using case_when() for more complex logic.
+
+
+
+
Simple logic
+
+
replace()
+
To re-code with simple logical criteria, you can use replace() within mutate(). replace() is a function from base R. Use a logic condition to specify the rows to change. The general syntax is:
+
+
mutate(col_to_change =replace(col_to_change, criteria for rows, new value))
+
+
One common situation to use replace() is changing just one value in one row, using a unique row identifier. Below, the gender is changed to “Female” in the row where the column case_id is “2195”.
+
+
# Example: change gender of one specific observation to "Female"
+linelist <- linelist %>%
+mutate(gender =replace(gender, case_id =="2195", "Female"))
+
+
The equivalent command using base R syntax and indexing brackets [ ] is below. It reads as “Change the value of the dataframe linelist’s column gender (for the rows where linelist’s column case_id has the value ‘2195’) to ‘Female’”.
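A minimal sketch of the base R indexing approach, using a made-up data frame (`demo` is a hypothetical stand-in for the linelist):

```r
# hypothetical toy data
demo <- data.frame(case_id = c("2194", "2195", "2196"),
                   gender  = c("m", NA, "f"),
                   stringsAsFactors = FALSE)

# change gender only in the row(s) where case_id is "2195"
demo$gender[demo$case_id == "2195"] <- "Female"
```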
Another tool for simple logic is ifelse() and its partner if_else(). However, in most cases for re-coding it is more clear to use case_when() (detailed below). These “if else” commands are simplified versions of an if and else programming statement. The general syntax is:
+ifelse(condition, value to return if condition evaluates to TRUE, value to return if condition evaluates to FALSE)
+
Below, the column source_known is defined. Its value in a given row is set to “known” if the row’s value in column source is not missing. If the value in source is missing, then the value in source_known is set to “unknown”.
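A minimal sketch of this ifelse() logic, using a made-up data frame (`demo` and its source values are hypothetical):

```r
library(dplyr)

# hypothetical toy data: source of infection, missing for one case
demo <- tibble(source = c("funeral", NA, "other"))

demo <- demo %>%
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))
```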
if_else() is a special version from dplyr that handles dates. Note that if the ‘true’ value is a date, the ‘false’ value must also qualify as a date, hence using the special value as.Date(NA) instead of just NA.
+
+
# Create a date of death column, which is NA if patient has not died.
+linelist <- linelist %>%
+mutate(date_death = if_else(outcome == "Death", date_outcome, as.Date(NA)))
+
+
Avoid stringing together many ifelse commands… use case_when() instead! case_when() is much easier to read and you’ll make fewer errors.
+
+
+
+
+
+
+
+
Outside of the context of a data frame, if you want to have an object used in your code switch its value, consider using switch() from base R.
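A minimal sketch of switch(), assuming a made-up setting object (`summary_type` and `values` are hypothetical names chosen for illustration):

```r
# hypothetical setting that controls which summary is computed
summary_type <- "median"
values <- c(2, 4, 15, 10)

result <- switch(summary_type,
                 "mean"   = mean(values),
                 "median" = median(values),
                 stop("unknown summary_type"))   # unnamed last argument is the fallback
```

Only the branch that matches `summary_type` is evaluated, so the fallback stop() is not triggered here.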
+
+
+
+
Complex logic
+
Use dplyr’s case_when() if you are re-coding into many new groups, or if you need to use complex logic statements to re-code values. This function evaluates every row in the data frame, assesses whether each row meets the specified criteria, and assigns the correct new value.
+
case_when() commands consist of statements that have a Right-Hand Side (RHS) and a Left-Hand Side (LHS) separated by a “tilde” ~. The logic criteria are on the left side and the pursuant values are on the right side of each statement. Statements are separated by commas.
+
For example, here we utilize the columns age and age_unit to create a column age_years:
+
+
linelist <- linelist %>%
+mutate(age_years = case_when(
+ age_unit == "years" ~ age, # if age unit is years
+ age_unit == "months" ~ age / 12, # if age unit is months, divide age by 12
+ is.na(age_unit) ~ age)) # if age unit is missing, assume years
+# any other circumstance, assign NA (missing)
+
+
As each row in the data is evaluated, the criteria are applied/evaluated in the order the case_when() statements are written, from top-to-bottom. If the top criteria evaluates to TRUE for a given row, the RHS value is assigned, and the remaining criteria are not even tested for that row in the data. Thus, it is best to write the most specific criteria first, and the most general last. A data row that does not meet any of the LHS criteria will be assigned NA.
+
Sometimes, you may wish to write a final statement that assigns a value for all other scenarios not described by one of the previous lines. To do this, place TRUE on the left-side, which will capture any row that did not meet any of the previous criteria. The right-side of this statement could be assigned a value like “check me!” or missing.
+
Below is another example of case_when() used to create a new column with the patient classification, according to a case definition for confirmed and suspect cases:
+
+
linelist <- linelist %>%
+mutate(case_status =case_when(
+
+# if patient had lab test and it is positive,
+# then they are marked as a confirmed case
+ ct_blood <20~"Confirmed",
+
+# given that a patient does not have a positive lab result,
+# if patient has a "source" (epidemiological link) AND has fever,
+# then they are marked as a suspect case
+!is.na(source) & fever =="yes"~"Suspect",
+
+# any other patient not addressed above
+# is marked for follow up
+TRUE~"To investigate"))
+
+
DANGER: Values on the right-side must all be the same class - either numeric, character, date, logical, etc. To assign missing (NA), you may need to use special variations of NA such as NA_character_, NA_real_ (for numeric or POSIX), and as.Date(NA). Read more in Working with dates.
+
+
+
Missing values
+
Below are special functions for handling missing values in the context of data cleaning.
+
See the page on Missing data for more detailed tips on identifying and handling missing values, for example the is.na() function, which logically tests for missingness.
+
replace_na()
+
To change missing values (NA) to a specific value, such as “Missing”, use the tidyr function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().
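A minimal sketch of replace_na() inside mutate(), using a made-up data frame (`demo` is hypothetical; replace_na() is exported by the tidyr package, which is part of the tidyverse):

```r
library(dplyr)
library(tidyr)   # provides replace_na()

# hypothetical toy data with one missing hospital
demo <- tibble(hospital = c("Port Hospital", NA, "Other"))

# the column name is repeated inside replace_na()
demo <- demo %>%
  mutate(hospital = replace_na(hospital, "Missing"))
```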
fct_explicit_na() is a function from the forcats package. The forcats package handles columns of class Factor. Factors are R’s way to handle ordered values such as c("First", "Second", "Third") or to set the order that values (e.g. hospitals) appear in tables and plots. See the page on Factors.
+
If your data are class Factor and you try to convert NA to “Missing” by using replace_na(), you will get this error: invalid factor level, NA generated. You have tried to add “Missing” as a value, when it was not defined as a possible level of the factor, and it was rejected.
+
The easiest way to solve this is to use the forcats function fct_explicit_na() which converts a column to class factor, and converts NA values to the character “(Missing)”.
A slower alternative would be to add the factor level using fct_expand() and then convert the missing values.
+
na_if()
+
To convert a specific value to NA, use dplyr’s na_if(). The command below performs the opposite operation of replace_na(). In the example below, any values of “Missing” in the column hospital are converted to NA.
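A minimal sketch of na_if() inside mutate(), using a made-up data frame (`demo` is hypothetical):

```r
library(dplyr)

# hypothetical toy data where missing was recorded as the text "Missing"
demo <- tibble(hospital = c("Port Hospital", "Missing", "Other"))

demo <- demo %>%
  mutate(hospital = na_if(hospital, "Missing"))
```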
Note: na_if() cannot be used for logic criteria (e.g. “all values > 99”) - use replace() or case_when() for this:
+
+
# Convert temperatures above 40 to NA
+linelist <- linelist %>%
+mutate(temp =replace(temp, temp >40, NA))
+
+# Convert onset dates earlier than 1 Jan 2000 to missing
+linelist <- linelist %>%
+mutate(date_onset = replace(date_onset, date_onset < as.Date("2000-01-01"), NA))
+
+
+
+
Cleaning dictionary
+
Use the R package matchmaker and its function match_df() to clean a data frame with a cleaning dictionary.
+
+
Create a cleaning dictionary with 3 columns:
+
+
A “from” column (the incorrect value).
+
+
A “to” column (the correct value).
+
+
A column specifying the column for the changes to be applied (or “.global” to apply to all columns).
+
+
+
Note: .global dictionary entries will be overridden by column-specific dictionary entries.
+
+
+
+
+
+
+
+
+
Import the dictionary file into R. This example can be downloaded via instructions on the Download handbook and data page.
+
+
+
cleaning_dict <-import("cleaning_dict.csv")
+
+
+
Pipe the raw linelist to match_df(), specifying to dictionary = the cleaning dictionary data frame. The from = argument should be the name of the dictionary column which contains the “old” values, the to = argument should be the dictionary column which contains the corresponding “new” values, and the by = argument should be the dictionary column which lists the column in which to make the change. Use .global in the by = column to apply a change across all columns. A fourth dictionary column order can be used to specify factor order of new values.
+
+
Read more details in the package documentation by running ?match_df. Note this function can take a long time to run for a large dataset.
+
+
linelist <- linelist %>%# provide or pipe your dataset
+ matchmaker::match_df(
+dictionary = cleaning_dict, # name of your dictionary
+from ="from", # column with values to be replaced (default is col 1)
+to ="to", # column with final values (default is col 2)
+by = "col" # column with column names (default is col 3)
+ )
+
+
Now scroll to the right to see how values have changed - particularly gender (lowercase to uppercase), and all the symptoms columns have been transformed from yes/no to 1/0.
+
+
+
+
+
+
+
Note that your column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. See this online reference for the linelist package for more details.
+
+
Add to pipe chain
+
Below, some new columns and column transformations are added to the pipe chain.
+
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# remove column
+select(-c(row_num, merged_header, x28)) %>%
+
+# de-duplicate
+distinct() %>%
+
+# add column
+mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
+
+# convert class of columns
+mutate(across(contains("date"), as.Date),
+generation =as.numeric(generation),
+age =as.numeric(age)) %>%
+
+# add column: delay to hospitalisation
+mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+###################################################
+
+# clean values of hospital column
+mutate(hospital =recode(hospital,
+# OLD = NEW
+"Mitylira Hopital"="Military Hospital",
+"Mitylira Hospital"="Military Hospital",
+"Military Hopital"="Military Hospital",
+"Port Hopital"="Port Hospital",
+"Central Hopital"="Central Hospital",
+"other"="Other",
+"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
+ )) %>%
+
+mutate(hospital =replace_na(hospital, "Missing")) %>%
+
+# create age_years column (from age and age_unit)
+mutate(age_years =case_when(
+ age_unit =="years"~ age,
+ age_unit =="months"~ age/12,
+is.na(age_unit) ~ age,
+TRUE~NA_real_))
+
+
+
+
+
+
+
+
+
8.9 Numeric categories
+
Here we describe some special approaches for creating categories from numerical columns. Common examples include age categories, groups of lab values, etc. Here we will discuss:
+
+
age_categories(), from the epikit package.
+
+
cut(), from base R.
+
+
case_when().
+
+
quantile breaks with quantile() and ntile().
+
+
+
Review distribution
+
For this example we will create an age_cat column using the age_years column.
+
+
#check the class of the linelist variable age
+class(linelist$age_years)
+
+
[1] "numeric"
+
+
+
First, examine the distribution of your data, to make appropriate cut-points. See the page on ggplot basics.
+
+
# examine the distribution
+hist(linelist$age_years)
+
+
+
+
+
+
+
+
summary(linelist$age_years, na.rm=T)
+
+
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+ 0.00 6.00 13.00 16.04 23.00 84.00 107
+
+
+
CAUTION: Sometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values, for example an entry of “2 months” for age, or (depending on your R locale settings) if a comma is used in the decimals place (e.g. “4,5” to mean four and one half years).
+
+
+
+
age_categories()
+
With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this function can be applied to non-age numeric variables too). As a bonus, the output column is automatically an ordered factor.
+
Here are the required inputs:
+
+
A numeric vector (column)
+
+
The breakers = argument - provide a numeric vector of break points for the new groups
+
+
First, the simplest example:
+
+
# Simple example
+################
+pacman::p_load(epikit) # load package
+
+linelist <- linelist %>%
+mutate(
+age_cat =age_categories( # create new column
+ age_years, # numeric column to make groups from
+breakers =c(0, 5, 10, 15, 20, # break points
+30, 40, 50, 60, 70)))
+
+# show table
+table(linelist$age_cat, useNA ="always")
The break values you specify are by default the lower bounds - that is, they are included in the “higher” group (the groups are inclusive on the lower/left side). As shown below, you can add 1 to each break value to achieve groups that are inclusive at the top/right.
+
+
# Include upper ends for the same categories
############################################
linelist <- linelist %>%
  mutate(
    age_cat = age_categories(
      age_years,
      breakers = c(0, 6, 11, 16, 21, 31, 41, 51, 61, 71)))

# show table
table(linelist$age_cat, useNA = "always")
You can adjust how the labels are displayed with separator =. The default is “-”.
+
You can adjust how the top numbers are handled with the ceiling = argument. To set an upper cut-off set ceiling = TRUE. In this use, the highest break value provided is a “ceiling” and a category “XX+” is not created. Any values above the highest break value (or above upper =, if defined) are categorized as NA. Below is an example with ceiling = TRUE, so that there is no category of XX+ and values above 70 (the highest break value) are assigned as NA.
+
+
# With ceiling set to TRUE
##########################
linelist <- linelist %>%
  mutate(
    age_cat = age_categories(
      age_years,
      breakers = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
      ceiling = TRUE)) # 70 is ceiling, all above become NA

# show table
table(linelist$age_cat, useNA = "always")
See the function’s Help page for more details (enter ?age_categories in the R console).
+
+
+
+
cut()
+
cut() is a base R alternative to age_categories(), but I think you will see why age_categories() was developed to simplify this process. Some notable differences from age_categories() are:
+
+
You do not need to install/load another package.
+
+
You can specify whether groups are open/closed on the right/left.
+
+
You must provide accurate labels yourself.
+
+
If you want 0 included in the lowest group you must specify this.
+
+
The basic syntax within cut() is to first provide the numeric column to be cut (age_years), and then the breaks argument, which is a numeric vector c() of break points. Using cut(), the resulting column is a factor; it is an ordered factor only if you also set ordered_result = TRUE.
+
By default, each group includes its right/upper break value and excludes its left/lower break value. This is the opposite behavior from the age_categories() function. The default labels use the notation “(A,B]”, which means A is not included but B is. Reverse this behavior by providing the right = FALSE argument.
+
Thus, by default, “0” values are excluded from the lowest group, and categorized as NA! “0” values could be infants coded as age 0 so be careful! To change this, add the argument include.lowest = TRUE so that any “0” values will be included in the lowest group. The automatically-generated label for the lowest category will then be “[A,B]”. Note that if you use include.lowest = TRUE together with right = FALSE, the extreme inclusion will apply to the highest break point value and category, not the lowest.
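A minimal sketch of these interval rules, using a small made-up vector (ages and brks are hypothetical names, not part of the linelist) so that the boundary values 0, 5, and 10 can be followed by eye:

```r
# Toy vector (hypothetical, not from the linelist) containing the boundary values
ages <- c(0, 5, 10)
brks <- c(0, 5, 10)

# Default: upper break included, lower break excluded, so 0 becomes NA
default_cut <- cut(ages, breaks = brks)

# include.lowest = TRUE pulls 0 into the lowest group
lowest_cut <- cut(ages, breaks = brks, include.lowest = TRUE)

# right = FALSE flips the intervals: lower break included, upper excluded, so 10 becomes NA
left_cut <- cut(ages, breaks = brks, right = FALSE)

# right = FALSE plus include.lowest = TRUE pulls the highest break (10) into the top group
left_incl_cut <- cut(ages, breaks = brks, right = FALSE, include.lowest = TRUE)

# Compare the four results side by side
data.frame(ages, default_cut, lowest_cut, left_cut, left_incl_cut)
```

Reading across each row of the printed data frame shows which setting sends a boundary value to NA.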
+
You can provide a vector of customized labels using the labels = argument. As these are manually written, be very careful to ensure they are accurate! Check your work using cross-tabulation, as described below.
+
An example of cut() applied to age_years to make the new variable age_cat is below:
+
+
# Create new variable, by cutting the numeric age variable
# lower break is excluded but upper break is included in each category
linelist <- linelist %>%
  mutate(
    age_cat = cut(
      age_years,
      breaks = c(0, 5, 10, 15, 20,
                 30, 50, 70, 100),
      include.lowest = TRUE         # include 0 in lowest group
      ))

# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 16-20).
+
+
# Cross tabulation of the numeric and category columns.
table("Numeric Values" = linelist$age_years,   # names specified in table for clarity.
      "Categories"     = linelist$age_cat,
      useNA = "always")                        # don't forget to examine NA values
You may want to assign NA values a label such as “Missing”. Because the new column is class Factor (restricted values), you cannot simply mutate it with replace_na(), as this value will be rejected. Instead, use fct_na_value_to_level() from forcats (the replacement for fct_explicit_na(), which was deprecated in forcats 1.0.0) as explained in the Factors page.
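As a hedged sketch of that step on toy data (the age values below are made up, and the level name "Missing age" is only an illustration), the NA values of a factor can be turned into an explicit level like this:

```r
library(forcats)

# Toy age values (hypothetical), including one missing value
age_years <- c(2, 7, NA, 12)
age_cat <- cut(age_years, breaks = c(0, 5, 10, 15))

# Convert NA values into an explicit factor level, appended after the existing levels
age_cat <- fct_na_value_to_level(age_cat, level = "Missing age")

# Inspect the resulting levels
levels(age_cat)
```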
+
+
# table to view counts
+table(linelist$age_cat, useNA ="always")
For a fast way to make breaks and label vectors, use something like below. See the R basics page for references on seq() and rep().
+
+
# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
age_seq

# Make labels for the above categories, assuming default cut() settings
# (each group excludes its lower break and includes its upper break)
age_labels = paste0(head(age_seq, -1) + 1, "-", age_seq[-1])
age_labels

# check that there is exactly one fewer label than break points, as cut() requires
length(age_labels) == length(age_seq) - 1
+
+
Read more about cut() in its Help page by entering ?cut in the R console.
+
+
+
Quantile breaks
+
In common understanding, “quantiles” or “percentiles” typically refer to a value below which a proportion of values fall. For example, the 95th percentile of ages in linelist would be the age below which 95% of the ages fall.
+
However in common speech, “quartiles” and “deciles” can also refer to the groups of data as equally divided into 4, or 10 groups (note there will be one more break point than group).
+
To get quantile break points, you can use quantile() from the stats package from base R. You provide a numeric vector (e.g. a column in a dataset) and vector of numeric probability values ranging from 0 to 1.0. The break points are returned as a numeric vector. Explore the details of the statistical methodologies by entering ?quantile.
+
+
If your input numeric vector has any missing values it is best to set na.rm = TRUE
+
+
Set names = FALSE to get an un-named numeric vector
+
+
+
quantile(linelist$age_years,               # specify numeric vector to work on
  probs = c(0, .25, .50, .75, .90, .95),   # specify the percentiles you want
  na.rm = TRUE)                            # ignore missing values
+
+
0% 25% 50% 75% 90% 95%
+ 0 6 13 23 33 41
+
+
+
You can use the results of quantile() as break points in age_categories() or cut(). Below we create a new column deciles using cut() where the breaks are defined using quantile() on age_years. We display the results using tabyl() from janitor so you can see the percentages (see the Descriptive tables page). Note how they are not exactly 10% in each group.
+
+
linelist %>%                                # begin with linelist
  mutate(deciles = cut(age_years,           # create new column decile as cut() on column age_years
    breaks = quantile(                      # define cut breaks using quantile()
      age_years,                            # operate on age_years
      probs = seq(0, 1, by = 0.1),          # 0.0 to 1.0 by 0.1
      na.rm = TRUE),                        # ignore missing values
    include.lowest = TRUE)) %>%             # for cut() include age 0
  janitor::tabyl(deciles)                   # pipe to table to display
Another tool to make numeric groups is the dplyr function ntile(), which attempts to break your data into n evenly-sized groups - but be aware that unlike with quantile() the same value could appear in more than one group. Provide the numeric vector and then the number of groups. The values in the new column are just group “numbers” (e.g. 1 to 10), not the range of values themselves as when using cut().
+
+
# make groups with ntile()
ntile_data <- linelist %>%
  mutate(even_groups = ntile(age_years, 10))

# make table of counts and proportions by group
ntile_table <- ntile_data %>%
  janitor::tabyl(even_groups)

# attach min/max values to demonstrate ranges
ntile_ranges <- ntile_data %>%
  group_by(even_groups) %>%
  summarise(
    min = min(age_years, na.rm = T),
    max = max(age_years, na.rm = T)
  )
+
+
Warning: There were 2 warnings in `summarise()`.
+The first warning was:
+ℹ In argument: `min = min(age_years, na.rm = T)`.
+ℹ In group 11: `even_groups = NA`.
+Caused by warning in `min()`:
+! no non-missing arguments to min; returning Inf
+ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
+
+
# combine and print - note that values are present in multiple groups
+left_join(ntile_table, ntile_ranges, by ="even_groups")
It is possible to use the dplyr function case_when() to create categories from a numeric column, but it is easier to use age_categories() from epikit or cut() because these will create a factor with the levels in the correct order automatically.
+
If using case_when(), please review the proper use as described earlier in the Re-code values section of this page. Also be aware that all right-hand side values must be of the same class. Thus, if you want NA on the right-side you should either write “Missing” or use the special NA value NA_character_.
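A minimal sketch of this approach, assuming a toy tibble (ages) and illustrative group labels that are not taken from the handbook. Note that case_when() returns a character column here, so the level order is set explicitly afterwards with factor():

```r
library(dplyr)

# Toy data (hypothetical values), including one missing age
ages <- tibble(age_years = c(3, 17, 45, NA))

ages <- ages %>%
  mutate(age_group = case_when(
    age_years < 5   ~ "0-4",
    age_years < 18  ~ "5-17",
    age_years >= 18 ~ "18+",
    TRUE            ~ NA_character_)) %>%   # same class (character) as the other right-hand sides
  # case_when() gives a character column, so define the factor level order by hand
  mutate(age_group = factor(age_group, levels = c("0-4", "5-17", "18+")))

ages
```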
+
+
+
Add to pipe chain
+
Below, code to create two categorical age columns is added to the cleaning pipe chain:
+
+
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# remove column
+select(-c(row_num, merged_header, x28)) %>%
+
+# de-duplicate
+distinct() %>%
+
+# add column
+mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
+
+# convert class of columns
+mutate(across(contains("date"), as.Date),
+generation =as.numeric(generation),
+age =as.numeric(age)) %>%
+
+# add column: delay to hospitalisation
+mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
+
+# clean values of hospital column
+mutate(hospital =recode(hospital,
+# OLD = NEW
+"Mitylira Hopital"="Military Hospital",
+"Mitylira Hospital"="Military Hospital",
+"Military Hopital"="Military Hospital",
+"Port Hopital"="Port Hospital",
+"Central Hopital"="Central Hospital",
+"other"="Other",
+"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
+ )) %>%
+
+mutate(hospital =replace_na(hospital, "Missing")) %>%
+
+# create age_years column (from age and age_unit)
+mutate(age_years =case_when(
+ age_unit =="years"~ age,
+ age_unit =="months"~ age/12,
+is.na(age_unit) ~ age)) %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+###################################################
+mutate(
+# age categories: custom
+age_cat = epikit::age_categories(age_years, breakers =c(0, 5, 10, 15, 20, 30, 50, 70)),
+
+# age categories: 0 to 85 by 5s
+age_cat5 = epikit::age_categories(age_years, breakers =seq(0, 85, 5)))
+
+
+
+
+
+
8.10 Add rows
+
+
One-by-one
+
Adding rows one-by-one manually is tedious but can be done with add_row() from dplyr. Remember that each column must contain values of only one class (either character, numeric, logical, etc.), so the values you supply for a new row must match the class of each column.
Use .before and .after to specify the placement of the row you want to add. .before = 3 will put the new row before the current 3rd row. The default behavior is to add the row to the end. Columns not specified will be left empty (NA).
+
The new row number may look strange (“…23”) but the row numbers in the pre-existing rows have changed. So if using the command twice, examine/test the insertion carefully.
+
If a class you provide is off you will see an error like this:
+
Error: Can't combine ..1$infection date <date> and ..2$infection date <character>.
+
(when inserting a row with a date value, remember to wrap the date in the function as.Date() like as.Date("2020-10-10")).
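The points above can be sketched on a small made-up tibble (small, and all of its values, are hypothetical and stand in for the linelist):

```r
library(dplyr)

# Toy data frame (hypothetical values)
small <- tibble(
  case_id    = c("a1", "b2", "c3"),
  age        = c(10, 25, 40),
  date_onset = as.Date(c("2014-05-01", "2014-05-03", "2014-05-08")))

small_added <- small %>%
  add_row(
    case_id    = "new1",
    date_onset = as.Date("2014-05-02"),  # wrap dates in as.Date() so the class matches
    .before    = 2)                      # place the new row before the current 2nd row

# age was not specified for the new row, so it is NA
small_added
```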
+
+
+
Bind rows
+
To combine datasets together by binding the rows of one dataframe to the bottom of another data frame, you can use bind_rows() from dplyr. This is explained in more detail in the page Joining data.
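As a brief sketch with two made-up data frames (site_a and site_b are hypothetical), bind_rows() matches columns by name and fills columns that are absent from one data frame with NA:

```r
library(dplyr)

# Two toy data frames (hypothetical) that share only the case_id column
site_a <- tibble(case_id = c("a1", "a2"), age = c(10, 25))
site_b <- tibble(case_id = "b1", hospital = "Port Hospital")

# Rows of site_b are added below site_a; unmatched columns are filled with NA
combined <- bind_rows(site_a, site_b)
combined
```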
+
+
+
+
+
+
+
8.11 Filter rows
+
A typical cleaning step after you have cleaned the columns and re-coded values is to filter the data frame for specific rows using the dplyr verb filter().
+
Within filter(), specify the logic that must be TRUE for a row in the dataset to be kept. Below we show how to filter rows based on simple and complex logical conditions.
+
+
+
Simple filter
+
This simple example re-defines the dataframe linelist as itself, having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses evaluates to TRUE are kept.
+
In this example, the logical statement is gender == "f", which is asking whether the value in the column gender is equal to “f” (case sensitive).
+
Before the filter is applied, the number of rows in linelist is nrow(linelist).
+
+
linelist <- linelist %>%
  filter(gender == "f")   # keep only rows where gender is equal to "f"
+
+
After the filter is applied, the number of rows in linelist is linelist %>% filter(gender == "f") %>% nrow().
+
+
+
Filter out missing values
+
It is fairly common to want to filter out rows that have missing values. Resist the urge to write filter(!is.na(column) & !is.na(column)) and instead use the tidyr function that is custom-built for this purpose: drop_na(). If run with empty parentheses, it removes rows with any missing values. Alternatively, you can provide names of specific columns to be evaluated for missingness, or use the “tidyselect” helper functions described above.
+
+
linelist %>%
+drop_na(case_id, age_years) # drop rows with missing values for case_id or age_years
+
+
See the page on Missing data for many techniques to analyse and manage missingness in your data.
+
+
+
Filter by row number
+
In a data frame or tibble, each row will usually have a “row number” that (when seen in R Viewer) appears to the left of the first column. It is not itself a true column in the data, but it can be used in a filter() statement.
+
To filter based on “row number”, you can use the dplyr function row_number() with empty parentheses as part of a logical filtering statement. Often you will use the %in% operator and a range of numbers as part of that logical statement, as shown below. To see the first N rows, you can also use the base R function head().
+
+
# View first 100 rows
linelist %>% head(100)     # or use tail() to see the n last rows

# Show row 5 only
linelist %>% filter(row_number() == 5)

# View rows 2 through 20, and three specific columns
linelist %>% filter(row_number() %in% 2:20) %>% select(date_onset, outcome, age)
+
+
You can also convert the row numbers to a true column by piping your data frame to the tibble function rownames_to_column() (do not put anything in the parentheses).
+
+
+
+
Complex filter
+
More complex logical statements can be constructed using parentheses ( ), OR |, negate !, %in%, and AND & operators. An example is below:
+
Note: You can use the ! operator in front of a logical criterion to negate it. For example, !is.na(column) evaluates to TRUE if the column value is not missing. Likewise !column %in% c("a", "b", "c") evaluates to TRUE if the column value is not in the vector.
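A small sketch of these operators on toy vectors (hospital and age below are hypothetical values, not the linelist columns). It also shows that %in% treats a missing value as "not in the vector", so the negated test returns TRUE for NA:

```r
# Toy vectors (hypothetical values)
hospital <- c("Hospital A", "Port Hospital", NA, "Hospital B")
age      <- c(34, 5, 61, NA)

# Negated %in%: TRUE where hospital is NOT Hospital A or B (NA counts as not in the vector)
not_a_or_b <- !hospital %in% c("Hospital A", "Hospital B")

# Negated is.na(): TRUE where age is present
has_age <- !is.na(age)

# Parentheses group the OR before it is combined with AND
keep <- not_a_or_b & (age < 18 | age > 60)

data.frame(hospital, age, not_a_or_b, has_age, keep)
```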
+
+
Examine the data
+
Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this raw dataset. For our analyses, we want to remove entries from this earlier outbreak.
+
+
hist(linelist$date_onset, breaks =50)
+
+
+
+
+
+
+
+
+
How filters handle missing numeric and date values
+
Can we just filter by date_onset to rows after June 2013? Caution! Applying the code filter(date_onset > as.Date("2013-06-01")) would remove any rows in the later epidemic with a missing date of onset!
+
DANGER: Filtering to greater than (>) or less than (<) a date or number will remove any rows with missing values (NA)! This is because a comparison with NA returns NA rather than TRUE, and filter() keeps only the rows where the condition is TRUE.
+
(See the page on Working with dates for more information on working with dates and the package lubridate)
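To make the danger concrete, here is a minimal sketch on a made-up tibble (dates, with hypothetical values). The row with a missing onset date disappears unless is.na() is added to the condition:

```r
library(dplyr)

# Toy data (hypothetical): row 3 has a missing onset date
dates <- tibble(
  id         = 1:4,
  date_onset = as.Date(c("2012-11-01", "2014-05-01", NA, "2014-07-15")))

# A comparison with NA returns NA, so row 3 is dropped along with row 1
after_only <- dates %>%
  filter(date_onset > as.Date("2013-06-01"))

# Explicitly keep rows with missing onset dates as well
after_or_missing <- dates %>%
  filter(date_onset > as.Date("2013-06-01") | is.na(date_onset))

after_only$id          # rows kept by the simple filter
after_or_missing$id    # rows kept when missing dates are allowed
```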
+
+
+
Design the filter
+
Examine a cross-tabulation to make sure we exclude only the correct rows:
+
+
table(Hospital = linelist$hospital, # hospital name
+YearOnset = lubridate::year(linelist$date_onset), # year of date_onset
+useNA ="always") # show missing values
What other criteria can we filter on to remove the first outbreak (in 2012 & 2013) from the dataset? We see that:
+
+
The first epidemic in 2012 & 2013 occurred at Hospital A, Hospital B, and that there were also 10 cases at Port Hospital.
+
+
Hospitals A & B did not have cases in the second epidemic, but Port Hospital did.
+
+
We want to exclude:
+
+
The rows with onset in 2012 and 2013 at either hospital A, B, or Port: nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01")))
+
+
Exclude rows with onset in 2012 and 2013 nrow(linelist %>% filter(date_onset < as.Date("2013-06-01")))
+
Exclude rows from Hospitals A & B with missing onset dates
+nrow(linelist %>% filter(hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset)))
+
Do not exclude other rows with missing onset dates.
+nrow(linelist %>% filter(!hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset)))
+
+
+
We start with a linelist of nrow(linelist) rows. Here is our filter statement:
+
+
linelist <- linelist %>%
  # keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
+
+
[1] 6019
+
+
+
When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, and the 10 Port Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.
+
+
table(Hospital = linelist$hospital, # hospital name
+YearOnset = lubridate::year(linelist$date_onset), # year of date_onset
+useNA ="always") # show missing values
+
+
YearOnset
+Hospital 2014 2015 <NA>
+ Central Hospital 351 99 18
+ Military Hospital 676 200 34
+ Missing 1117 318 77
+ Other 684 177 46
+ Port Hospital 1372 347 75
+ St. Mark's Maternity Hospital (SMMH) 322 93 13
+ <NA> 0 0 0
+
+
+
Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity.
+
Note: some readers may notice that it would be easier to just filter by date_hospitalisation because it is 100% complete with no missing values. This is true. But date_onset is used for purposes of demonstrating a complex filter.
+
+
+
+
Standalone
+
Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.
+
+
# dataframe <- filter(dataframe, condition(s) for rows to keep)
+
+linelist <-filter(linelist, !is.na(case_id))
+
+
You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.
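A minimal base R sketch of the [rows, columns] syntax, using a made-up data frame (df and its values are hypothetical). Wrapping the condition in which() is one way to avoid the NA rows that a logical test on a missing value would otherwise return:

```r
# Toy data frame (hypothetical values), with one missing gender
df <- data.frame(
  case_id = c("a1", "b2", "c3", "d4"),
  gender  = c("f", "m", NA, "f"),
  age     = c(10, 25, 40, 3))

# [rows, columns]: keep rows where gender is "f", and only two columns
# which() drops the NA that gender == "f" produces for the missing value
females <- df[which(df$gender == "f"), c("case_id", "age")]

# Rows can also be selected by position; an empty column slot keeps all columns
first_two <- df[1:2, ]

females
```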
Often you want to quickly review a few records, for only a few columns. The base R function View() will print a data frame for viewing in your RStudio.
+
View the linelist in RStudio:
+
+
View(linelist)
+
+
Here are two examples of viewing specific cells (specific rows, and specific columns):
+
With dplyr functions filter() and select():
+
Within View(), pipe the dataset to filter() to keep certain rows, and then to select() to keep certain columns. For example, to review onset and hospitalization dates of 3 specific cases:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
+##################################################################################
+
+# begin cleaning pipe chain
+###########################
+linelist <- linelist_raw %>%
+
+# standardize column name syntax
+ janitor::clean_names() %>%
+
+# manually re-name columns
+# NEW name # OLD name
+rename(date_infection = infection_date,
+date_hospitalisation = hosp_date,
+date_outcome = date_of_outcome) %>%
+
+# remove column
+select(-c(row_num, merged_header, x28)) %>%
+
+# de-duplicate
+distinct() %>%
+
+# add column
+mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
+
+# convert class of columns
+mutate(across(contains("date"), as.Date),
+generation =as.numeric(generation),
+age =as.numeric(age)) %>%
+
+# add column: delay to hospitalisation
+mutate(days_onset_hosp =as.numeric(date_hospitalisation - date_onset)) %>%
+
+# clean values of hospital column
+mutate(hospital =recode(hospital,
+# OLD = NEW
+"Mitylira Hopital"="Military Hospital",
+"Mitylira Hospital"="Military Hospital",
+"Military Hopital"="Military Hospital",
+"Port Hopital"="Port Hospital",
+"Central Hopital"="Central Hospital",
+"other"="Other",
+"St. Marks Maternity Hopital (SMMH)"="St. Mark's Maternity Hospital (SMMH)"
+ )) %>%
+
+mutate(hospital =replace_na(hospital, "Missing")) %>%
+
+# create age_years column (from age and age_unit)
+mutate(age_years =case_when(
+ age_unit =="years"~ age,
+ age_unit =="months"~ age/12,
+is.na(age_unit) ~ age)) %>%
+
+mutate(
+# age categories: custom
+age_cat = epikit::age_categories(age_years, breakers =c(0, 5, 10, 15, 20, 30, 50, 70)),
+
+# age categories: 0 to 85 by 5s
+age_cat5 = epikit::age_categories(age_years, breakers =seq(0, 85, 5))) %>%
+
+# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
+###################################################
+filter(
+# keep only rows where case_id is not missing
+!is.na(case_id),
+
+# also filter to keep only the second outbreak
+ date_onset >as.Date("2013-06-01") | (is.na(date_onset) &!hospital %in%c("Hospital A", "Hospital B")))
+
+
+
+
+
+
+
+
+
8.12 Row-wise calculations
+
If you want to perform a calculation within a row, you can use rowwise() from dplyr. See this online vignette on row-wise calculations. For example, this code applies rowwise() and then creates a new column that sums the number of the specified symptom columns that have value “yes”, for each row in the linelist. The columns are specified within sum() by name within a vector c(). rowwise() is essentially a special kind of group_by(), so it is best to use ungroup() when you are done (page on Grouping data).
# A tibble: 5,888 × 6
+ fever chills cough aches vomit num_symptoms
+ <chr> <chr> <chr> <chr> <chr> <int>
+ 1 no no yes no yes 2
+ 2 <NA> <NA> <NA> <NA> <NA> NA
+ 3 <NA> <NA> <NA> <NA> <NA> NA
+ 4 no no no no no 0
+ 5 no no yes no yes 2
+ 6 no no yes no yes 2
+ 7 <NA> <NA> <NA> <NA> <NA> NA
+ 8 no no yes no yes 2
+ 9 no no yes no yes 2
+10 no no yes no no 1
+# ℹ 5,878 more rows
+
+
+
As you specify the column to evaluate, you may want to use the “tidyselect” helper functions described in the select() section of this page. You just have to make one adjustment (because you are not using them within a dplyr function like select() or summarise()).
+
Put the column-specification criteria within the dplyr function c_across(). This is because c_across (documentation) is designed to work with rowwise() specifically. For example, the following code:
+
+
Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns).
+
+
Creates new column num_NA_dates, defined for each row as the number of columns (with name containing “date”) for which is.na() evaluated to TRUE (they are missing data).
+
+
ungroup() to remove the effects of rowwise() for subsequent steps.
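The three steps above can be sketched as follows on a small made-up tibble (dates_df and its values are hypothetical; the column name num_NA_dates follows the text):

```r
library(dplyr)

# Toy data (hypothetical values) with two date columns containing some NA
dates_df <- tibble(
  case_id      = c("a", "b", "c"),
  date_onset   = as.Date(c("2014-05-01", NA, "2014-05-08")),
  date_outcome = as.Date(c(NA, NA, "2014-05-20")))

dates_df <- dates_df %>%
  rowwise() %>%                                                    # operate within each row
  mutate(num_NA_dates = sum(is.na(c_across(contains("date"))))) %>%  # count missing date columns
  ungroup()                                                        # remove the row-wise grouping

dates_df
```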
Use the dplyr function arrange() to sort or order the rows by column values.
+
Simply list the columns in the order they should be sorted on. Specify .by_group = TRUE if you want the sorting to first occur by any groupings applied to the data (see page on Grouping data).
+
By default, column will be sorted in “ascending” order (which applies to numeric and also to character columns). You can sort a variable in “descending” order by wrapping it with desc().
+
Sorting data with arrange() is particularly useful when making Tables for presentation, using slice() to take the “top” rows per group, or setting factor level order by order of appearance.
+
For example, to sort our linelist rows by hospital, then by date_onset in descending order, we would use:
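A minimal sketch of this sort, shown on a made-up tibble (toy, with hypothetical hospital names and dates) rather than the full linelist:

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(
  hospital   = c("Port", "Central", "Port", "Central"),
  date_onset = as.Date(c("2014-05-01", "2014-06-01", "2014-07-01", "2014-04-01")))

# Sort by hospital (ascending), then by date_onset (descending) within each hospital
sorted <- toy %>%
  arrange(hospital, desc(date_onset))

sorted
```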
Working with dates in R requires more attention than working with other object classes. Below, we offer some tools and examples to make this process less painful. Luckily, dates can be wrangled easily with practice, and with a set of helpful packages such as lubridate.
+
Upon import of raw data, R often interprets dates as character objects - this means they cannot be used for general date operations such as making time series and calculating time intervals. To make matters more difficult, there are many ways a date can be formatted and you must help R know which part of a date represents what (month, day, hour, etc.).
+
Dates in R are their own class of object - the Date class. It should be noted that there is also a class that stores objects with date and time. Date time objects are formally referred to as POSIXt, POSIXct, and/or POSIXlt classes (the difference isn’t important). These objects are informally referred to as datetime classes.
+
+
It is important to make R recognize when a column contains dates.
+
+
Dates are an object class and can be tricky to work with.
+
+
Here we present several ways to convert date columns to Date class.
+
+
+
+
9.1 Preparation
+
+
Load packages
+
This code chunk shows the loading of packages required for this page. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.
+
+
# Checks if package is installed, installs if necessary, and loads package for current session
+
+pacman::p_load(
+ lubridate, # general package for handling and converting dates
+ parsedate, # has function to "guess" messy dates
+ aweek, # another option for converting dates to weeks, and weeks to dates
+ zoo, # additional date/time functions
+ here, # file management
+ rio, # data import/export
+ tidyverse) # data management and visualization
+
+
+
+
Import data
+
We import the dataset of cases from a simulated Ebola epidemic. If you want to download the data to follow along step-by-step, see instruction in the Download handbook and data page. We assume the file is in the working directory so no sub-folders are specified in this file path.
+
+
+
Warning: The `trust` argument of `import()` should be explicit for serialization formats
+as of rio 1.0.3.
+ℹ Missing `trust` will be set to FALSE by default for RDS in 2.0.0.
+ℹ The deprecated feature was likely used in the rio package.
+ Please report the issue at <https://github.com/gesistsa/rio/issues>.
+
+
+
+
linelist <-import("linelist_cleaned.xlsx")
+
+
+
+
+
+
9.2 Current date
+
You can get the current “system” date or system datetime of your computer by doing the following with base R.
+
+
# get the system date - this is a DATE class
+Sys.Date()
+
+
[1] "2024-09-08"
+
+
# get the system time - this is a DATETIME class
+Sys.time()
+
+
[1] "2024-09-08 11:03:46 BST"
+
+
+
With the lubridate package these can also be returned with today() and now(), respectively. date() returns the current date and time with weekday and month names.
+
+
+
+
9.3 Convert to Date
+
After importing a dataset into R, date column values may look like “1989/12/30”, “05/06/2014”, or “13 Jan 2020”. In these cases, R is likely still treating these values as Character values. R must be told that these values are dates… and what the format of the date is (which part is Day, which is Month, which is Year, etc).
+
Once told, R converts these values to class Date. In the background, R will store the dates as numbers (the number of days from its “origin” date 1 Jan 1970). You will not interface with the date number often, but this allows for R to treat dates as continuous variables and to allow special operations such as calculating the distance between dates.
+
By default, values of class Date in R are displayed as YYYY-MM-DD. Later in this section we will discuss how to change the display of date values.
+
Below we present two approaches to converting a column from character values to class Date.
+
TIP: You can check the current class of a column with base R function class(), like class(linelist$date_onset).
+
+
base R
+
as.Date() is the standard, base R function to convert an object or column to class Date (note capitalization of “D”).
+
Use of as.Date() requires that:
+
+
You specify the existing format of the raw character date or the origin date if supplying dates as numbers (see section on Excel dates)
+
+
If used on a character column, all date values must have the same exact format (if this is not the case, try parse_date() from the parsedate package)
+
+
First, check the class of your column with class() from base R. If you are unsure or confused about the class of your data (e.g. you see “POSIXct”, etc.) it can be easiest to first convert the column to class Character with as.character(), and then convert it to class Date.
+
Second, within the as.Date() function, use the format = argument to tell R the current format of the character date components - which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R’s standard date formats (“YYYY-MM-DD” or “YYYY/MM/DD”) the format = argument is not necessary.
+
To format =, provide a character string (in quotes) that represents the current date format using the special “strptime” abbreviations below. For example, if your character dates are currently in the format “DD/MM/YYYY”, like “24/04/1968”, then you would use format = "%d/%m/%Y" to convert the values into dates. Putting the format in quotation marks is necessary. And don’t forget any slashes or dashes!
+
+
# Convert to class date
linelist <- linelist %>%
  mutate(date_onset = as.Date(date_of_onset, format = "%d/%m/%Y"))
+
+
Most of the strptime abbreviations are listed below. You can see the complete list by running ?strptime.
+
%d = Day number of month (5, 17, 28, etc.)
%j = Day number of the year (Julian day 001-366)
%a = Abbreviated weekday (Mon, Tue, Wed, etc.)
%A = Full weekday (Monday, Tuesday, etc.)
%w = Weekday number (0-6, Sunday is 0)
%u = Weekday number (1-7, Monday is 1)
%W = Week number (00-53, Monday is week start)
%U = Week number (00-53, Sunday is week start)
%m = Month number (e.g. 01, 02, 03, 04)
%b = Abbreviated month (Jan, Feb, etc.)
%B = Full month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = Hours (24-hr clock)
%M = Minutes
%S = Seconds
%z = Offset from GMT
%Z = Time zone (character)
+
TIP: The format = argument of as.Date() is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.
+
TIP: Be sure that in the format = argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.
+
Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.
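A short base R sketch of the two tips above, using made-up character dates (raw_dates is hypothetical). The format = string describes how the input looks, and a separator that does not match the data produces NA:

```r
# Character dates written as DD/MM/YYYY (hypothetical values)
raw_dates <- c("24/04/1968", "05/06/2014")

# format = describes the CURRENT layout of the text, including the "/" separators
converted <- as.Date(raw_dates, format = "%d/%m/%Y")
class(converted)
converted                # displayed in the standard YYYY-MM-DD format

# Using the wrong separator ("-" instead of "/") fails quietly and returns NA
wrong_sep <- as.Date(raw_dates, format = "%d-%m-%Y")
wrong_sep
```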
+
+
+
lubridate
+
Converting character objects to dates can be made easier by using the lubridate package. This is a tidyverse package designed to make working with dates and times more simple and consistent than in base R. For these reasons, lubridate is often considered the gold-standard package for dates and time, and is recommended whenever working with them.
+
The lubridate package provides several different helper functions designed to convert character objects to dates in an intuitive, and more lenient way than specifying the format in as.Date(). These functions are specific to the rough date format, but allow for a variety of separators, and synonyms for dates (e.g. 01 vs Jan vs January) - they are named after abbreviations of date formats.
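As a sketch of these helpers (ymd(), mdy(), and dmy() are named after the order of year, month, and day in the input), each call below should return the same date. The object names d1 to d4 are only for illustration, and the month-name example assumes an English locale:

```r
library(lubridate)

# Each helper is named after the order of the date components in the input
d1 <- ymd("2020-10-11")        # year, month, day
d2 <- mdy("10/11/2020")        # month, day, year
d3 <- dmy("11 10 2020")        # day, month, year

# Month names are also accepted (assumes the session locale uses English month names)
d4 <- dmy("11 October 2020")

c(d1, d2, d3)
```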
Once complete, you can run class() to verify the class of the column
+
+
# Check the class of the column
+class(linelist$date_onset)
+
+
Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.
+
Note that the above functions work best with 4-digit years. 2-digit years can produce unexpected results, as lubridate attempts to guess the century.
+
To convert a 2-digit year into a 4-digit year (all in the same century) you can convert to class character and then combine the existing digits with a prefix using str_glue() from the stringr package (see Characters and strings). Then convert to date.
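
One way this could look, as a sketch only: the vector year_2digit and the choice of the prefix "19" are assumptions for illustration, and 1 January is used as an arbitrary day and month to build a complete date.

```r
library(stringr)
library(lubridate)

# Hypothetical 2-digit years, all known to fall in the 1900s
year_2digit <- c(85, 89, 91)

# Convert to character and place the century prefix "19" in front
year_4digit <- str_glue("19{as.character(year_2digit)}")

# Build a full date string (1 January is an arbitrary choice) and convert to Date
full_dates <- ymd(as.character(str_glue("{year_4digit}-01-01")))
full_dates
```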
You can use the lubridate functions make_date() and make_datetime() to combine multiple numeric columns into one date column. For example if you have numeric columns onset_day, onset_month, and onset_year in the data frame linelist:
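
A minimal sketch of make_date(), using a small made-up data frame (linelist_demo is hypothetical and stands in for linelist):

```r
library(lubridate)

# Hypothetical data frame with the date parts held in separate numeric columns
linelist_demo <- data.frame(
  onset_day   = c(5, 17),
  onset_month = c(1, 11),
  onset_year  = c(2014, 2014)
)

# Combine the three numeric columns into one column of class Date
linelist_demo$onset_date <- make_date(
  year  = linelist_demo$onset_year,
  month = linelist_demo$onset_month,
  day   = linelist_demo$onset_day
)

linelist_demo
```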
In the background, most software stores dates as numbers. R stores dates from an origin of 1st January, 1970. Thus, if you run as.numeric(as.Date("1970-01-01")) you will get 0.
+
Microsoft Excel stores dates with an origin of either December 30, 1899 (Windows) or January 1, 1904 (Mac), depending on your operating system. See this Microsoft guidance for more information.
+
Excel dates often import into R as these numeric values instead of as characters. If the dataset you imported from Excel shows dates as numbers or characters like “41369”… use as.Date() (or lubridate’s as_date() function) to convert, but instead of supplying a “format” as above, supply the Excel origin date to the argument origin =.
+
This will not work if the Excel date is stored in R as a character type, so make sure the column is class Numeric before converting!
+
NOTE: You should provide the origin date in R’s default date format (“YYYY-MM-DD”).
+
+
# An example of providing the Excel 'origin date' when converting Excel number dates
+data_cleaned <- data %>%
+mutate(date_onset =as.numeric(date_onset)) %>%# ensure class is numeric
+mutate(date_onset =as.Date(date_onset, origin ="1899-12-30")) # convert to date using Excel origin
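
As a small worked example of the same idea on a single value (the serial number is the illustrative one mentioned above, and the Windows Excel origin is assumed):

```r
# R counts days from 1970-01-01, so that date is stored as 0
r_origin_number <- as.numeric(as.Date("1970-01-01"))

# A hypothetical Excel serial number that was imported as a character value
excel_serial <- "41369"

# Convert to numeric first, then to Date using the Windows Excel origin
converted <- as.Date(as.numeric(excel_serial), origin = "1899-12-30")
converted   # 2013-04-05
```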
+
+
+
+
+
9.5 Messy dates
+
The function parse_date() from the parsedate package attempts to read a “messy” date column containing dates in many different formats and convert the dates to a standard format. You can read more online about parse_date().
+
For example parse_date() would see a vector of the following character dates “03 Jan 2018”, “07/03/1982”, and “08/20/85” and convert them to class Date as: 2018-01-03, 1982-03-07, and 1985-08-20.
+
+
parsedate::parse_date(c("03 January 2018",
+"07/03/1982",
+"08/20/85"))
# An example using parse_date() on the column date_onset
+linelist <- linelist %>%
+mutate(date_onset =parse_date(date_onset))
+
+
+
+
+
9.6 Working with date-time class
+
As previously mentioned, R also supports a datetime class - a column that contains date and time information. As with the Date class, these often need to be converted from character objects to datetime objects.
+
+
Convert dates with times
+
A standard datetime object is formatted with the date first, which is followed by a time component - for example 01 Jan 2020, 16:30. As with dates, there are many ways this can be formatted, and there are numerous levels of precision (hours, minutes, seconds) that can be supplied.
+
Luckily, lubridate helper functions also exist to help convert these strings to datetime objects. These functions are extensions of the date helper functions, with _h (only hours supplied), _hm (hours and minutes supplied), or _hms (hours, minutes, and seconds supplied) appended to the end (e.g. dmy_hms()). These can be used as shown:
+
Convert datetime with only hours to datetime object
+
+
ymd_h("2020-01-01 16hrs")
+
+
[1] "2020-01-01 16:00:00 UTC"
+
+
ymd_h("2020-01-01 4PM")
+
+
[1] "2020-01-01 16:00:00 UTC"
+
+
+
Convert datetime with hours and minutes to datetime object
+
+
dmy_hm("01 January 2020 16:20")
+
+
[1] "2020-01-01 16:20:00 UTC"
+
+
+
Convert datetime with hours, minutes, and seconds to datetime object
+
+
dmy_hms("01 January 2020, 16:20:40")
+
+
[1] "2020-01-01 16:20:40 UTC"
+
+
+
You can supply a time zone in the string, but it is ignored. See the section later in this page on time zones.
+
+
dmy_hms("01 January 2020, 16:20:40 PST")
+
+
[1] "2020-01-01 16:20:40 UTC"
+
+
+
When working with a data frame, time and date columns can be combined to create a datetime column using str_glue() from stringr package and an appropriate lubridate function. See the page on Characters and strings for details on stringr.
+
In this example, the linelist data frame has a column in format “hours:minutes”. To convert this to a datetime we follow a few steps:
+
+
Create a “clean” time of admission column with missing values filled-in with the column median. We do this because lubridate won’t operate on missing values. Combine it with the column date_hospitalisation, and then use the function ymd_hm() to convert.
+
+
+
# packages
+pacman::p_load(tidyverse, lubridate, stringr)
+
+# time_admission is a column in hours:minutes
+linelist <- linelist %>%
+
+# when time of admission is not given, assign the median admission time
+mutate(
+time_admission_clean =ifelse(
+is.na(time_admission), # if time is missing
+median(time_admission, na.rm =TRUE), # assign the median of the non-missing times
+ time_admission # if not missing keep as is
+ )) %>%
+
+# use str_glue() to combine date and time columns to create one character column
+# and then use ymd_hm() to convert it to datetime
+mutate(
+date_time_of_admission =str_glue("{date_hospitalisation} {time_admission_clean}") %>%
+ymd_hm()
+ )
+
+
+
+
Convert times alone
+
If your data contain only a character time (hours and minutes), you can convert and manipulate them as times using strptime() from base R. For example, to get the difference between two of these times:
+
+
# raw character times
+time1 <-"13:45"
+time2 <-"15:20"
+
+# Times converted to a datetime class
+time1_clean <-strptime(time1, format ="%H:%M")
+time2_clean <-strptime(time2, format ="%H:%M")
+
+# Difference is of class "difftime" by default, here converted to numeric hours
+as.numeric(time2_clean - time1_clean) # difference in hours
+
+
[1] 1.583333
+
+
+
Note however that without a date value provided, it assumes the date is today. To combine a string date and a string time together see how to use stringr in the section just above. Read more about strptime() here.
You can extract elements of a time with hour(), minute(), or second() from lubridate.
+
Here is an example of extracting the hour, and then classifying by part of the day. We begin with the column time_admission, which is class Character in format “HH:MM”. First, strptime() is used as described above to convert the characters to datetime class. Then, the hour is extracted with hour(), returning a number from 0-23. Finally, a column time_period is created using logic with case_when() to classify rows into Morning/Afternoon/Evening/Night based on their hour of admission.
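
The steps just described could be sketched as follows. The data frame admissions and the hour cut-offs for each part of the day are assumptions chosen for illustration, not fixed definitions.

```r
library(dplyr)
library(lubridate)

# Hypothetical admission times as character "HH:MM", including a missing value
admissions <- data.frame(
  time_admission = c("06:30", "13:45", "19:10", "23:55", NA)
)

admissions <- admissions %>%
  mutate(
    # convert to datetime with strptime(), then extract the hour (0-23)
    hour_admit = hour(strptime(time_admission, format = "%H:%M")),
    # classify by part of the day (cut-offs are illustrative assumptions)
    time_period = case_when(
      hour_admit >= 6  & hour_admit < 12 ~ "Morning",
      hour_admit >= 12 & hour_admit < 17 ~ "Afternoon",
      hour_admit >= 17 & hour_admit < 21 ~ "Evening",
      hour_admit >= 21 | hour_admit < 6  ~ "Night"
    )
  )

admissions
```

Rows with a missing time keep NA in both new columns, because none of the case_when() conditions evaluates to TRUE for them.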
lubridate can also be used for a variety of other functions, such as extracting aspects of a date/datetime, performing date arithmetic, or calculating date intervals.
+
Here we define a date to use for the examples:
+
+
# create object of class Date
+example_date <-ymd("2020-03-01")
+
+
+
Extract date components
+
You can extract common aspects such as month, day, weekday:
+
+
month(example_date) # month number
+
+
[1] 3
+
+
day(example_date) # day (number) of the month
+
+
[1] 1
+
+
wday(example_date) # day number of the week (1-7)
+
+
[1] 1
+
+
+
You can also extract time components from a datetime object or column. This can be useful if you want to view the distribution of admission times.
There are several options to retrieve weeks. See the section on Epidemiological weeks below.
+
Note that if you are seeking to display a date a certain way (e.g. “Jan 2020” or “Thursday 20 March” or “Week 20, 1977”) you can do this more flexibly as described in the section on Date display.
+
+
+
Date math
+
You can add certain numbers of days or weeks using their respective function from lubridate.
+
+
# add 3 days to this date
+example_date +days(3)
+
+
[1] "2020-03-04"
+
+
# add 7 weeks and subtract two days from this date
+example_date +weeks(7) -days(2)
+
+
[1] "2020-04-17"
+
+
+
+
+
Date intervals
+
The difference between dates can be calculated by:
+
+
Ensure both dates are of class date.
+
+
Use subtraction to return the “difftime” difference between the two dates.
+
+
If necessary, convert the result to numeric class to perform subsequent mathematical calculations.
+
+
Below, the interval between two dates is calculated and displayed. You can find intervals by using the subtraction “minus” symbol on values that are class Date. Note, however, that the class of the returned value is “difftime” as displayed below, and must be converted to numeric.
+
+
# find the interval between this date and Feb 20 2020
+output <- example_date -ymd("2020-02-20")
+output # print
+
+
Time difference of 10 days
+
+
class(output)
+
+
[1] "difftime"
+
+
+
To do subsequent operations on a “difftime”, convert it to numeric with as.numeric().
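
A brief sketch of that conversion, re-creating the same 10-day interval so that the example is self-contained (the object names are illustrative):

```r
library(lubridate)

# Same interval as above: 20 February 2020 to 1 March 2020
interval_days <- ymd("2020-03-01") - ymd("2020-02-20")

# Convert the "difftime" to a plain number of days
n_days <- as.numeric(interval_days)
n_days        # 10

# The units = argument expresses the same interval in another unit
n_weeks <- as.numeric(interval_days, units = "weeks")
n_weeks
```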
+
This can all be brought together to work with data - for example:
+
+
pacman::p_load(lubridate, tidyverse) # load packages
+
+linelist <- linelist %>%
+
+# convert date of onset from character to date objects by specifying dmy format
+mutate(date_onset =dmy(date_onset),
+date_hospitalisation =dmy(date_hospitalisation)) %>%
+
+# filter out all cases without onset in march
+filter(month(date_onset) ==3) %>%
+
+# find the difference in days between onset and hospitalisation
+mutate(days_onset_to_hosp = date_hospitalisation - date_onset)
+
+
In a data frame context, if either of the above dates is missing, the operation will fail for that row. This will result in an NA instead of a numeric value. When using this column for calculations, be sure to set the na.rm = argument to TRUE. For example:
+
+
# calculate the median number of days to hospitalisation for all cases where data are available
+median(linelist$days_onset_to_hosp, na.rm =TRUE)
+
+
+
+
+
+
9.8 Date display
+
Once dates are the correct class, you often want them to display differently, for example to display as “Friday 05 January” instead of “2018-01-05”. You may also want to adjust the display in order to then group rows by the date elements displayed - for example to group by month-year.
+
+
format()
+
Adjust date display with the base R function format(). This function accepts a character string (in quotes) specifying the desired output format in the “%” strptime abbreviations (the same syntax as used in as.Date()). Below are most of the common abbreviations.
+
Note: using format() will convert the values to class Character, so this is generally used towards the end of an analysis or for display purposes only! You can see the complete list by running ?strptime.
+
%d = Day number of month (5, 17, 28, etc.)
+%j = Day number of the year (Julian day 001-366)
+%a = Abbreviated weekday (Mon, Tue, Wed, etc.)
+%A = Full weekday (Monday, Tuesday, etc.)
+%w = Weekday number (0-6, Sunday is 0)
+%u = Weekday number (1-7, Monday is 1)
+%W = Week number (00-53, Monday is week start)
+%U = Week number (00-53, Sunday is week start)
+%m = Month number (e.g. 01, 02, 03, 04)
+%b = Abbreviated month (Jan, Feb, etc.)
+%B = Full month (January, February, etc.)
+%y = 2-digit year (e.g. 89)
+%Y = 4-digit year (e.g. 1989)
+%H = hours (24-hr clock)
+%M = minutes
+%S = seconds
+%z = offset from GMT
+%Z = Time zone (character)
+
An example of formatting today’s date:
+
+
# today's date, with formatting
+format(Sys.Date(), format ="%d %B %Y")
+
+
[1] "08 September 2024"
+
+
# easy way to get full date and time (default formatting)
+date()
+
+
[1] "Sun Sep 8 11:03:47 2024"
+
+
# formatted combined date, time, and time zone using str_glue() function
+str_glue("{format(Sys.Date(), format = '%A, %B %d %Y, %z %Z, ')}{format(Sys.time(), format = '%H:%M:%S')}")
+
+
Sunday, September 08 2024, +0000 UTC, 11:03:47
+
+
# Using format to display weeks
+format(Sys.Date(), "%Y Week %W")
+
+
[1] "2024 Week 36"
+
+
+
Note that when using str_glue(), any quotes inside the outer double quotes (") must be single quotes (as above).
+
+
+
Month-Year
+
To convert a Date column to Month-year format, we suggest you use the function as.yearmon() from the zoo package. This converts the date to class “yearmon” and retains the proper ordering. In contrast, using format(column, "%Y %B") will convert to class Character and will order the values alphabetically (incorrectly).
+
Below, a new column yearmonth is created from the column date_onset, using the as.yearmon() function. The default (correct) ordering of the resulting values is shown in the table.
+Apr 2014 Apr 2015 Aug 2014 Dec 2014 Feb 2015 Jan 2015 Jul 2014 Jun 2014
+ 7 186 528 562 306 431 226 100
+Mar 2015 May 2014 Nov 2014 Oct 2014 Sep 2014
+ 277 64 763 1112 1070
+
+
+
Note: if you are working within a ggplot() and want to adjust how dates are displayed only, it may be sufficient to provide a strptime format to the date_labels = argument in scale_x_date() - you can use "%b %Y" or "%Y %b". See the ggplot tips page.
+
zoo also offers the function as.yearqtr(), and you can use scale_x_yearmon() when using ggplot().
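
To illustrate the ordering difference between as.yearmon() and format() described in this section, here is a small sketch on a made-up vector of onset dates (it assumes the zoo package is installed and an English locale for the month names):

```r
library(zoo)

# Hypothetical onset dates, deliberately out of order
onset <- as.Date(c("2014-05-03", "2014-04-20", "2015-01-15", "2014-12-01"))

# Class "yearmon" keeps the chronological order when sorted
ym <- as.yearmon(onset)
ym_sorted <- sort(ym)
ym_sorted

# Character values from format() sort alphabetically, not chronologically
chr_sorted <- sort(format(onset, "%b %Y"))
chr_sorted
```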
+
+
+
+
+
9.9 Epidemiological weeks
+
+
lubridate
+
See the page on Grouping data for more extensive examples of grouping data by date. Below we briefly describe grouping data by weeks.
+
We generally recommend using the floor_date() function from lubridate, with the argument unit = "week". This rounds the date down to the “start” of the week, as defined by the argument week_start =. Unless the lubridate.week.start option has been set, the default week start is 7 (for Sundays), but you can specify any day of the week as the start (e.g. 1 for Mondays). floor_date() is versatile and can be used to round down to other time units by setting unit = to “second”, “minute”, “hour”, “day”, “month”, or “year”.
+
The returned value is the start date of the week, in Date class. Date class is useful when plotting the data, as it will be easily recognized and ordered correctly by ggplot().
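
A minimal sketch of floor_date() on a single made-up date. Here week_start = is given explicitly in each call so that the result does not depend on the default setting.

```r
library(lubridate)

d <- ymd("2020-03-04")   # a Wednesday

# Round down to the start of the week, with the week beginning on Monday
wk_monday <- floor_date(d, unit = "week", week_start = 1)   # 2020-03-02

# Same date, with the week beginning on Sunday
wk_sunday <- floor_date(d, unit = "week", week_start = 7)   # 2020-03-01

# Round down to the first day of the month
month_start <- floor_date(d, unit = "month")                # 2020-03-01

class(wk_monday)   # still class Date
```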
+
If you are only interested in adjusting dates to display by week in a plot, see the section in this page on Date display. For example when plotting an epicurve you can format the date display by providing the desired strptime “%” nomenclature. For example, use “%Y-%W” or “%Y-%U” to return the year and week number (given Monday or Sunday week start, respectively).
+
+
+
Weekly counts
+
See the page on Grouping data for a thorough explanation of grouping data with count(), group_by(), and summarise(). A brief example is below.
+
+
Create a new ‘week’ column with mutate(), using floor_date() with unit = "week"
+
+
Get counts of rows (cases) per week with count(); filter out any cases with missing date
+
+
Finish with complete() from tidyr to ensure that all weeks appear in the data - even those with no rows/cases. By default the count values for any “new” rows are NA, but you can make them 0 with the fill = argument, which expects a named list (below, n is the name of the counts column).
+
+
+
# Make aggregated dataset of weekly case counts
+weekly_counts <- linelist %>%
+drop_na(date_onset) %>%# remove cases missing onset date
+mutate(weekly_cases =floor_date( # make new column, week of onset
+ date_onset,
+unit ="week")) %>%
+count(weekly_cases) %>%# group data by week and count rows per group (creates column 'n')
+ tidyr::complete( # ensure all weeks are present, even those with no cases reported
+weekly_cases =seq.Date( # re-define the "weekly_cases" column as a complete sequence,
+from =min(weekly_cases), # from the minimum date
+to =max(weekly_cases), # to the maximum date
+by ="week"), # by weeks
+fill =list(n =0)) # fill-in NAs in the n counts column with 0
+
+
Here are the first rows of the resulting data frame:
+
+
+
+
+
+
+
+
+
Epiweek alternatives
+
Note that lubridate also has functions week(), epiweek(), and isoweek(), each of which has slightly different start dates and other nuances. Generally speaking though, floor_date() should be all that you need. Read the details for these functions by entering ?week into the console or reading the documentation here.
+
You might consider using the package aweek to set epidemiological weeks. You can read more about it on the RECON website. It has the functions date2week() and week2date() in which you can set the week start day with week_start = "Monday". This package is easiest if you want “week”-style outputs (e.g. “2020-W12”). Another advantage of aweek is that when date2week() is applied to a date column with factor = TRUE (together with floor_day = TRUE), the returned column (week format) is of class Factor and includes levels for all weeks in the time span (this avoids the extra step of complete() described above). However, aweek does not have the functionality to round dates to other time units such as months, years, etc.
+
Another alternative for time series which also works well to show a “week” format (“2020 W12”) is yearweek() from the package tsibble, as demonstrated in the page on Time series and outbreak detection.
+
+
+
+
+
9.10 Converting dates/time zones
+
When data is present in different time zones, it can often be important to standardise this data in a unified time zone. This can present a further challenge, as the time zone component of data must be coded manually in most cases.
+
In R, each datetime object has a time zone component. By default, all datetime objects will carry the local time zone for the computer being used - this is generally specific to a location rather than a named time zone, as time zones will often change in locations due to daylight saving time. It is not possible to accurately compensate for time zones without a time component of a date, as the event a date column represents cannot be attributed to a specific time, and therefore time shifts measured in hours cannot be reasonably accounted for.
+
To deal with time zones, there are a number of helper functions in lubridate that can be used to change the time zone of a datetime object from the local time zone to a different time zone. Time zones are set by attributing a valid tz database time zone to the datetime object. A list of these can be found here - if the location you are using data from is not on this list, nearby large cities in the time zone are available and serve the same purpose.
# assign the current time to a column
+time_now <-Sys.time()
+time_now
+
+
[1] "2024-09-08 11:03:47 BST"
+
+
# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
+time_london_real <-with_tz(time_now, "Europe/London")
+
+# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
+time_london_local <-force_tz(time_now, "Europe/London")
+
+
+# note that as long as the computer that was used to run this code is NOT set to London time,
+# there will be a difference in the times
+# (the number of hours difference from the computer's time zone to London)
+time_london_real - time_london_local
+
+
Time difference of 0 secs
+
+
+
This may seem largely abstract, and is often not needed if the user isn’t working across time zones.
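
Because the output above depends on the time zone of the computer that ran the code, a fixed example may make the difference between with_tz() and force_tz() easier to see. This is a sketch only; the datetime and the choice of "America/New_York" are arbitrary.

```r
library(lubridate)

# A fixed datetime recorded in UTC (arbitrary example)
t_utc <- ymd_hms("2020-01-01 12:00:00", tz = "UTC")

# with_tz(): the same moment in time, shown on a New York clock (07:00)
shifted <- with_tz(t_utc, "America/New_York")

# force_tz(): the clock still reads 12:00, but now as New York time,
# which is a different moment in time
forced <- force_tz(t_utc, "America/New_York")

# The two results are 5 hours apart
hours_apart <- as.numeric(difftime(forced, shifted, units = "hours"))
hours_apart
```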
+
+
+
+
9.11 Lagging and leading calculations
+
lead() and lag() are functions from the dplyr package which help find previous (lagged) or subsequent (leading) values in a vector - typically a numeric or date vector. This is useful when doing calculations of change/difference between time units.
+
Let’s say you want to calculate the difference in cases between a current week and the previous one. The data are initially provided in weekly counts as shown below.
+
+
+
+
+
+
+
When using lag() or lead() the order of rows in the dataframe is very important! - pay attention to whether your dates/numbers are ascending or descending.
+
First, create a new column containing the value of the previous (lagged) week.
+
+
Control the number of units back/forward with n = (must be a non-negative integer).
+
+
Use default = to define the value placed in non-existing rows (e.g. the first row for which there is no lagged value). By default this is NA.
+
+
Use order_by = to supply the column to order by, if the rows are not already ordered by your reference column.
+
+
+
counts <- counts %>%
+mutate(cases_prev_wk =lag(cases_wk, n =1))
+
+
+
+
+
+
+
+
Next, create a new column which is the difference between the two cases columns:
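
Both steps could be sketched together as follows. The data frame counts_demo and its values are made up for illustration, and the column name case_diff is an assumption.

```r
library(dplyr)

# Hypothetical weekly counts
counts_demo <- data.frame(
  week     = as.Date(c("2014-05-05", "2014-05-12", "2014-05-19", "2014-05-26")),
  cases_wk = c(4, 9, 7, 12)
)

counts_demo <- counts_demo %>%
  arrange(week) %>%                                  # make sure rows are in date order
  mutate(
    cases_prev_wk = lag(cases_wk, n = 1),            # value from the previous week
    case_diff     = cases_wk - cases_prev_wk         # change compared to previous week
  )

counts_demo
```

The first row has NA in both new columns because there is no earlier week to compare against.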
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/new_pages/factors.qmd b/new_pages/factors.qmd
index cd88dee5..b740dbbb 100644
--- a/new_pages/factors.qmd
+++ b/new_pages/factors.qmd
@@ -84,7 +84,7 @@ table(linelist$delay_cat, useNA = "always")
Likewise, if we make a bar plot, the values also appear in this order on the x-axis (see the [ggplot basics](ggplot_basics.qmd) page for more on **ggplot2** - the most common visualization package in R).
```{r, warning=F, message=F}
-ggplot(data = linelist)+
+ggplot(data = linelist) +
geom_bar(mapping = aes(x = delay_cat))
```
@@ -125,7 +125,7 @@ levels(linelist$delay_cat)
Now the plot order makes more intuitive sense as well.
```{r, warning=F, message=F}
-ggplot(data = linelist)+
+ggplot(data = linelist) +
geom_bar(mapping = aes(x = delay_cat))
```
@@ -164,8 +164,8 @@ The package **forcats** offers useful functions to easily adjust the order of a
These functions can be applied to a factor column in two contexts:
-1) To the column in the data frame, as usual, so the transformation is available for any subsequent use of the data
-2) *Inside of a plot*, so that the change is applied only within the plot
+1) To the column in the data frame, as usual, so the transformation is available for any subsequent use of the data.
+2) *Inside of a plot*, so that the change is applied only within the plot.
@@ -175,8 +175,8 @@ This function is used to manually order the factor levels. If used on a non-fact
Within the parentheses first provide the factor column name, then provide either:
-* All the levels in the desired order (as a character vector `c()`), or
-* One level and it's corrected placement using the `after = ` argument
+* All the levels in the desired order (as a character vector `c()`), or,
+* One level and it's corrected placement using the `after = ` argument.
Here is an example of redefining the column `delay_cat` (which is already class Factor) and specifying all the desired order of levels.
@@ -214,11 +214,11 @@ linelist <- linelist %>%
```{r, warning=F, message=F, out.width = c('50%', '50%'), fig.show='hold'}
# Alpha-numeric default order - no adjustment within ggplot
-ggplot(data = linelist)+
+ggplot(data = linelist) +
geom_bar(mapping = aes(x = delay_cat))
# Factor level order adjusted within ggplot
-ggplot(data = linelist)+
+ggplot(data = linelist) +
geom_bar(mapping = aes(x = fct_relevel(delay_cat, c("<2 days", "2-5 days", ">5 days"))))
```
@@ -244,14 +244,14 @@ This function can be used within a `ggplot()`, as shown below.
```{r, out.width = c('50%', '50%', '50%'), fig.show='hold', warning=F, message=F}
# ordered by frequency
-ggplot(data = linelist, aes(x = fct_infreq(delay_cat)))+
- geom_bar()+
+ggplot(data = linelist, aes(x = fct_infreq(delay_cat))) +
+ geom_bar() +
labs(x = "Delay onset to admission (days)",
title = "Ordered by frequency")
# reversed frequency
-ggplot(data = linelist, aes(x = fct_rev(fct_infreq(delay_cat))))+
- geom_bar()+
+ggplot(data = linelist, aes(x = fct_rev(fct_infreq(delay_cat)))) +
+ geom_bar() +
labs(x = "Delay onset to admission (days)",
title = "Reverse of order by frequency")
```
@@ -272,26 +272,26 @@ In the first example below, the default order alpha-numeric level order is used.
```{r, fig.show='hold', message=FALSE, warning=FALSE, out.width=c('50%', '50%')}
# boxplots ordered by original factor levels
-ggplot(data = linelist)+
+ggplot(data = linelist) +
geom_boxplot(
aes(x = delay_cat,
y = ct_blood,
- fill = delay_cat))+
+ fill = delay_cat)) +
labs(x = "Delay onset to admission (days)",
- title = "Ordered by original alpha-numeric levels")+
- theme_classic()+
+ title = "Ordered by original alpha-numeric levels") +
+ theme_classic() +
theme(legend.position = "none")
# boxplots ordered by median CT value
-ggplot(data = linelist)+
+ggplot(data = linelist) +
geom_boxplot(
aes(x = fct_reorder(delay_cat, ct_blood, "median"),
y = ct_blood,
- fill = delay_cat))+
+ fill = delay_cat)) +
labs(x = "Delay onset to admission (days)",
- title = "Ordered by median CT value in group")+
- theme_classic()+
+ title = "Ordered by median CT value in group") +
+ theme_classic() +
theme(legend.position = "none")
```
@@ -312,12 +312,12 @@ epidemic_data <- linelist %>% # begin with the linelist
hospital
)
-ggplot(data = epidemic_data)+ # start plot
+ggplot(data = epidemic_data) + # start plot
geom_line( # make lines
aes(
x = epiweek, # x-axis epiweek
y = n, # height is number of cases per week
- color = fct_reorder2(hospital, epiweek, n)))+ # data grouped and colored by hospital, with factor order by height at end of plot
+ color = fct_reorder2(hospital, epiweek, n))) + # data grouped and colored by hospital, with factor order by height at end of plot
labs(title = "Factor levels (and legend display) by line height at end of plot",
color = "Hospital") # change legend title
```
@@ -350,7 +350,7 @@ You can adjust the level displays manually manually with `fct_recode()`. This is
This tool can also be used to "combine" levels, by assigning multiple levels the same re-coded value. Just be careful to not lose information! Consider doing these combining steps in a new column (not over-writing the existing column).
-`fct_recode()` has a different syntax than `recode()`. `recode()` uses `OLD = NEW`, whereas `fct_recode()` uses `NEW = OLD`.
+**_DANGER:_** `fct_recode()` has a different syntax than `recode()`. `recode()` uses `OLD = NEW`, whereas `fct_recode()` uses `NEW = OLD`.
The current levels of `delay_cat` are:
```{r, echo=F}
@@ -444,10 +444,10 @@ In a `ggplot()` figure, simply add the argument `drop = FALSE` in the relevant `
This example is a stacked bar plot of age category, by hospital. Adding `scale_fill_discrete(drop = FALSE)` ensures that all age groups appear in the legend, even if not present in the data.
-```{r}
-ggplot(data = linelist)+
+```{r, fig.width = 10.5}
+ggplot(data = linelist) +
geom_bar(mapping = aes(x = hospital, fill = age_cat)) +
- scale_fill_discrete(drop = FALSE)+ # show all age groups in the legend, even those not present
+ scale_fill_discrete(drop = FALSE) + # show all age groups in the legend, even those not present
labs(
title = "All age groups will appear in legend, even if not present in data")
```
@@ -463,8 +463,7 @@ Read more in the [Descriptive tables](tables_descriptive.qmd) page, or at the [s
## Epiweeks
-Please see the extensive discussion of how to create epidemiological weeks in the [Grouping data](grouping.qmd) page.
-Please also see the [Working with dates](dates.qmd) page for tips on how to create and format epidemiological weeks.
+Please see the extensive discussion of how to create epidemiological weeks in the [Grouping data](grouping.qmd) page. Also see the [Working with dates](dates.qmd) page for tips on how to create and format epidemiological weeks.
### Epiweeks in a plot {.unnumbered}
@@ -476,8 +475,8 @@ In this approach, you can adjust the *display* of the dates on an axis with `sca
```{r, warning=F, message=F}
linelist %>%
mutate(epiweek_date = floor_date(date_onset, "week")) %>% # create week column
- ggplot()+ # begin ggplot
- geom_histogram(mapping = aes(x = epiweek_date))+ # histogram of date of onset
+ ggplot() + # begin ggplot
+ geom_histogram(mapping = aes(x = epiweek_date)) + # histogram of date of onset
scale_x_date(date_labels = "%Y-W%W") # adjust disply of dates to be YYYY-WWw
```
@@ -486,7 +485,7 @@ linelist %>%
However, if your purpose in factoring is *not* to plot, you can approach this one of two ways:
-1) *For fine control over the display*, convert the **lubridate** epiweek column (YYYY-MM-DD) to the desired display format (YYYY-WWw) *within the data frame itself*, and then convert it to class Factor.
+1) *For fine control over the display*, convert the **lubridate** epiweek column (YYYY-MM-DD) to the desired display format (YYYY-Www) *within the data frame itself*, and then convert it to class Factor.
First, use `format()` from **base** R to convert the date display from YYYY-MM-DD to YYYY-Www display (see the [Working with dates](dates.qmd) page). In this process the class will be converted to character. Then, convert from character to class Factor with `factor()`.