---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# excluder <a href="https://docs.ropensci.org/excluder/"><img src="man/figures/logo.png" align="right" height="139" /></a>
<!-- badges: start -->
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![lifecycle](man/figures/lifecycle-stable.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/excluder)](https://cran.r-project.org/package=excluder)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/excluder)](https://CRAN.R-project.org/package=excluder)
[![R-CMD-check](https://github.com/ropensci/excluder/workflows/R-CMD-check/badge.svg)](https://github.com/ropensci/excluder/actions)
[![Codecov test coverage](https://codecov.io/gh/ropensci/excluder/graph/badge.svg)](https://app.codecov.io/gh/ropensci/excluder)
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/455_status.svg)](https://github.com/ropensci/software-review/issues/455)
[![DOI](https://joss.theoj.org/papers/10.21105/joss.03893/status.svg)](https://doi.org/10.21105/joss.03893)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5648202.svg)](https://doi.org/10.5281/zenodo.5648202)
<!-- badges: end -->
The goal of [`{excluder}`](https://docs.ropensci.org/excluder/) is to facilitate checking for, marking, and excluding rows of data frames for common exclusion criteria. This package applies to data collected from [Qualtrics](https://www.qualtrics.com/) surveys, and default column names come from importing data with the [`{qualtRics}`](https://docs.ropensci.org/qualtRics/) package.
This may be most useful for [Mechanical Turk](https://www.mturk.com/) data to screen for duplicate entries from the same location/IP address or entries from locations outside of the United States. But it can be used more generally to exclude based on response durations, preview status, progress, or screen resolution.
More details are available on the package [website](https://docs.ropensci.org/excluder/) and the [getting started vignette](https://docs.ropensci.org/excluder/articles/excluder.html).
## Installation
You can install the stable released version of `{excluder}` from [CRAN](https://cran.r-project.org/package=excluder) with:
```{r eval = FALSE}
install.packages("excluder")
```
You can install the development version from [GitHub](https://github.com/) with:
```{r eval = FALSE}
# install.packages("remotes")
remotes::install_github("ropensci/excluder")
```
## Verbs
This package provides three primary verbs:
* `mark` functions add a new column to the original data frame that labels the rows meeting the exclusion criteria. This is useful to label the potential exclusions for future processing without changing the original data frame.
* `check` functions search for the exclusion criteria and output a message with the number of rows meeting the criteria and a data frame of the rows meeting the criteria. This is useful for viewing the potential exclusions.
* `exclude` functions remove rows meeting the exclusion criteria. This is safest to do after checking the rows to ensure the exclusions are correct.
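As a minimal sketch of how the three verbs differ, the calls below apply the `preview` exclusion type to the `qualtrics_text` data set that ships with the package. The object names `marked`, `checked`, and `excluded` are illustrative only.

```r
library(excluder)

# mark: returns every row plus a new column labeling preview rows
marked <- mark_preview(qualtrics_text)

# check: returns only the rows that meet the criterion
checked <- check_preview(qualtrics_text)

# exclude: returns the data with the rows meeting the criterion removed
excluded <- exclude_preview(qualtrics_text)

# Compare the number of rows returned by each verb
c(
  original = nrow(qualtrics_text),
  marked = nrow(marked),
  checked = nrow(checked),
  excluded = nrow(excluded)
)
```

The marked data keep the original number of rows, whereas the checked and excluded data split the rows between them.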
## Exclusion types
This package provides seven types of exclusions based on Qualtrics metadata. If you have ideas for other metadata exclusions, please submit them as [issues](https://github.com/ropensci/excluder/issues). Note that this package is intended to provide exclusions based on general, frequently used metadata rather than on survey-specific data.
* `duplicates` works with rows that have duplicate IP addresses and/or locations (latitude/longitude).
* `duration` works with rows whose survey completion time is too short and/or too long.
* `ip` works with rows whose IP addresses are not found in the specified country (note: this exclusion type requires an internet connection to download the country's IP ranges).
* `location` works with rows whose latitude and longitude are not found in the United States.
* `preview` works with rows that are survey previews.
* `progress` works with rows in which the survey was not complete.
* `resolution` works with rows whose screen resolution is not acceptable.
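Several exclusion types take thresholds or switches that control what counts as an exclusion. The sketch below illustrates this for three of the types. Apart from `min_duration`, the argument names (`max_duration`, `dupl_location`, `width_min`) do not appear in this README and should be treated as assumptions to verify against the function reference (for example, `?check_duration`).

```r
library(excluder)

# duration: rows completed in under 100 seconds or over 800 seconds
# (max_duration is an assumed argument name)
too_fast_or_slow <- check_duration(
  qualtrics_text,
  min_duration = 100,
  max_duration = 800
)

# duplicates: consider duplicate IP addresses only, not locations
# (dupl_location is an assumed argument name)
duplicate_ips <- check_duplicates(qualtrics_text, dupl_location = FALSE)

# resolution: rows with screens narrower than 1000 pixels
# (width_min is an assumed argument name)
narrow_screens <- check_resolution(qualtrics_text, width_min = 1000)
```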
## Usage
The verbs and exclusion types combine with `_` to create the functions, such as [`check_duplicates()`](https://docs.ropensci.org/excluder/reference/check_duplicates.html), [`exclude_ip()`](https://docs.ropensci.org/excluder/reference/exclude_ip.html), and [`mark_duration()`](https://docs.ropensci.org/excluder/reference/mark_duration.html). Multiple functions can be linked together using the [`{magrittr}`](https://magrittr.tidyverse.org/) pipe `%>%`. For datasets downloaded directly from Qualtrics, use [`remove_label_rows()`](https://docs.ropensci.org/excluder/reference/remove_label_rows.html) to remove the first two rows of labels and convert date and numeric columns in the metadata, and use [`deidentify()`](https://docs.ropensci.org/excluder/reference/deidentify.html) to remove standard Qualtrics columns with identifiable information (e.g., IP addresses, geolocation).
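A minimal sketch of the two helper functions follows. It assumes that the bundled `qualtrics_raw` data set mimics a direct Qualtrics download that still contains the label rows; the object name `clean` is illustrative.

```r
library(excluder)

clean <- qualtrics_raw %>%
  # Drop the label rows and convert date and numeric metadata columns
  remove_label_rows() %>%
  # Drop standard columns that contain identifiable information
  deidentify()

# Columns present in the raw data but removed from the cleaned data
setdiff(names(qualtrics_raw), names(clean))
```

Running `deidentify()` last keeps columns such as IP address and location available for any exclusions that need them earlier in the pipeline.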
### Marking
The `mark_*()` functions output the original data set with a new column specifying rows that meet the exclusion criteria. These can be piped together with `%>%` for multiple exclusion types.
```{r mark1}
library(excluder)
# Mark preview and short duration rows
df <- qualtrics_text %>%
  mark_preview() %>%
  mark_duration(min_duration = 200)
tibble::glimpse(df)
```
Use the [`unite_exclusions()`](https://docs.ropensci.org/excluder/reference/unite_exclusions.html) function to unite all of the marked columns into a single column.
```{r mark2}
# Collapse labels for preview and short duration rows
df <- qualtrics_text %>%
  mark_preview() %>%
  mark_duration(min_duration = 200) %>%
  unite_exclusions()
tibble::glimpse(df)
```
### Checking
The `check_*()` functions output messages about the number of rows that meet the exclusion criteria. Because checks return only the rows meeting the criteria, they **should not be connected via pipes** unless you want to subset the second check criterion within the rows that meet the first criterion. Thus, in general, `check_*()` functions should be used individually. If you want to view the potential exclusions for multiple criteria, use the `mark_*()` functions.
```{r check1}
# Check for preview rows
qualtrics_text %>%
  check_preview()
```
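To illustrate why piping checks is usually not what you want, the sketch below contrasts independent checks with piped checks. The object names are illustrative, and the duration threshold of 100 seconds is an arbitrary choice.

```r
library(excluder)

# Independent checks: each one searches the full data set
preview_rows <- check_preview(qualtrics_text)
short_rows <- check_duration(qualtrics_text, min_duration = 100)

# Piped checks: the duration check only sees rows already flagged as previews
short_previews <- qualtrics_text %>%
  check_preview() %>%
  check_duration(min_duration = 100)

# The piped result can never contain more rows than either independent check
nrow(short_previews) <= min(nrow(preview_rows), nrow(short_rows))
```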
### Excluding
The `exclude_*()` functions remove the rows that meet exclusion criteria. These, too, can be piped together. Since the output of each function is a subset of the original data with the excluded rows removed, the order of the functions will influence the reported number of rows meeting the exclusion criteria.
```{r exclude1}
# Exclude short duration then incomplete progress rows
df <- qualtrics_text %>%
  exclude_duration(min_duration = 100) %>%
  exclude_progress()
dim(df)
```
```{r exclude2}
# Exclude incomplete progress then short duration rows
df <- qualtrics_text %>%
  exclude_progress() %>%
  exclude_duration(min_duration = 100)
dim(df)
```
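The two orderings above report different counts at each step. As a quick sanity check, the sketch below compares the responses that each ordering retains. It assumes that `qualtrics_text` has a `ResponseId` column that uniquely identifies rows; the object names are illustrative.

```r
library(excluder)

duration_first <- qualtrics_text %>%
  exclude_duration(min_duration = 100) %>%
  exclude_progress()

progress_first <- qualtrics_text %>%
  exclude_progress() %>%
  exclude_duration(min_duration = 100)

# Both orderings are expected to retain the same set of responses
setequal(duration_first$ResponseId, progress_first$ResponseId)
```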
Though the order of functions should not influence the final data set, processing large files may be faster if you remove preview and incomplete progress rows first and check IP addresses and locations only after the other exclusions have been performed.
```{r exclude3}
# Exclude rows
df <- qualtrics_text %>%
  exclude_preview() %>%
  exclude_progress() %>%
  exclude_duplicates() %>%
  exclude_duration(min_duration = 100) %>%
  exclude_resolution() %>%
  exclude_ip() %>%
  exclude_location()
```
## Citing this package
To cite `{excluder}`, use:
> Stevens, J. R. (2021). excluder: An R package that checks for exclusion criteria in online data. _Journal of Open Source Software_, 6(67), 3893. https://doi.org/10.21105/joss.03893
## Contributing to this package
[Contributions](https://docs.ropensci.org/excluder/CONTRIBUTING.html) to `{excluder}` are most welcome! Feel free to check out [open issues](https://github.com/ropensci/excluder/issues) for ideas. And [pull requests](https://github.com/ropensci/excluder/pulls) are encouraged, but you may want to [raise an issue](https://github.com/ropensci/excluder/issues/new/choose) or [contact the maintainer](mailto:[email protected]) first.
Please note that the excluder project is released with a [Contributor Code of Conduct](https://ropensci.org/code-of-conduct/). By contributing to this project, you agree to abide by its terms.
## Acknowledgments
I thank [Francine Goh](https://orcid.org/0000-0002-7364-4398) and Billy Lim for comments on an early version of the package, and [rOpenSci](https://ropensci.org/) editor [Mauro Lepore](https://orcid.org/0000-0002-1986-7988) and reviewers [Joseph O'Brien](https://orcid.org/0000-0001-9851-5077) and [Julia Silge](https://orcid.org/0000-0002-3671-836X) for their insightful feedback. This work was funded by US National Science Foundation grant NSF-1658837.