---
title: "Mini-project 1 - Reading and exploring datasets"
author: "Mike K Smith"
date: "02/08/2023"
output: html_document
---
## Aims:
- Read in ADSL xpt dataset
- Subset for Efficacy Population
- Sort data
- View Demographic Listing
## How to use this document:
In this document you'll see code chunks (which typically have a light grey background) and text. This is an example of an "Rmarkdown" document. You can write and run code within the document and the results will be presented underneath each code chunk. You should follow the instructions as written in the text, amending the code chunks, then running them to produce the outputs as instructed.
## Data Source
For these projects we are using anonymized CDISC datasets, which can be found here: <https://github.com/phuse-org/phuse-scripts/tree/master/data/adam/cdisc>
## R objects and functions
Within R we typically use objects of different types e.g. data, vectors, lists etc. and then we apply functions. Functions have the construct `<function_name>(<argument1>= , <argument2> = )`. When you use functions, you don't ***have*** to use the argument name and instead you can implicitly refer to the arguments by position e.g. `myFunction(foo, 1, "bar")` passes in the R object `foo` as the value for argument 1; argument 2 takes the value `1` and argument 3 takes the character value `"bar"`. While you're learning R we recommend that you explicitly name and use the arguments in functions, except where functions only have one argument. You can use the tab-completion in the RStudio IDE to help complete function call arguments. To see the arguments of a function type `?<functionName>` in the Console.
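As a small illustration of named versus positional arguments, the sketch below calls the base R function `seq()` in both styles. The choice of `seq()` and the values used are assumptions for illustration only and are not part of the project data.

```r
# Named arguments: each value is labelled with the argument it belongs to
by_name <- seq(from = 1, to = 10, by = 3)

# Positional arguments: values are matched to from, to, by in that order
by_position <- seq(1, 10, 3)

# Named arguments can be supplied in any order
reordered <- seq(by = 3, to = 10, from = 1)

# All three calls produce the same vector: 1 4 7 10
identical(by_name, by_position)
identical(by_name, reordered)
```

Naming the arguments makes the call longer, but it removes any doubt about which value goes where.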
## Start the mini project
1. The following steps will load the packages `tidyverse`, `rio`, `skimr` and `htmlTable`. You need to load packages before using the functions and content in them, and it's best to do so at the beginning of a program / script. Run the chunk below by clicking on the green arrow to the right of the code chunk.
```{r setup}
library(tidyverse)
library(rio)
library(skimr)
library(htmlTable)
```
2. There are numerous ways and packages to read in the data in R. For instance, the `haven` package has a function called `read_xpt()` that creates an R data object from a \*.xpt file. The `readr` package provides a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). The `readxl` package makes it easy to get data out of Excel and into R. Whatever the package, functions that read in data tend to take the file path as their first argument; however, it is always important to read the documentation before using a new package or function.
In our mini-projects we are going to use the `rio` package to read in the data. `rio` supports a variety of file formats for import and export and provides streamlined data import. Specifically, `rio` uses the file extension to determine what kind of file it is, so the only function you need to remember to read in a data file using `rio` is `import()`.
Read in the adsl data from the GitHub location (`https://github.com/phuse-org/phuse-scripts/raw/master/data/adam/cdisc/adsl.xpt`) and assign this to an object in R using the assignment operator `<-`. Make sure that the URL location is a character string by enclosing it in `" "`. Paste the URL above between the quotes `""` of the `file` argument in the code below, then run the chunk using the green arrow.
```{r read_adsl_data}
adsl <- import(file = "")
```
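To see how `import()` chooses a reader from the file extension, here is a minimal sketch that writes a few rows of a built-in dataset to a temporary CSV file and reads them back. The temporary file and the use of `mtcars` are assumptions for illustration only; the project itself reads `adsl.xpt` from the URL given above.

```r
library(rio)

# Write a few rows of a built-in dataset to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
export(x = head(mtcars), file = csv_path)

# import() looks at the ".csv" extension and picks a CSV reader
cars_back <- import(file = csv_path)

# Check the dimensions and column names of what was read back
dim(cars_back)
names(cars_back)
```

The same `import()` call would work unchanged for a path ending in `.xpt` or `.xlsx`, which is the convenience `rio` offers here.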
3. To view the first 10 rows of the R object `adsl`, simply type the dataset name. Type `adsl` into the chunk below and run it using the green arrow:
```{r view_adsl_object}
```
Note that the `adsl` data object has 254 rows and 48 columns.
4. Next, we'll use the R function `sapply` to view the type of each column in the dataset - this will show us whether the columns are character, numeric, integer, double precision, factor etc. The `class` function tells us the type of an R object.
The `sapply` function applies the same function `class` to each column of the data and returns the data type in a vector.
```{r view_adsl_data_types}
sapply(X = adsl, FUN = class)
```
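The same idea can be tried on a much smaller object. The sketch below builds a made-up three-column data frame (the name `toy` and its columns are hypothetical) and applies `class` to each column.

```r
# A small made-up data frame with three different column types
toy <- data.frame(
  id    = c("A1", "A2", "A3"),
  age   = c(64, 71, 58),
  visit = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Apply class() to every column; the result is a named character vector
col_types <- sapply(X = toy, FUN = class)
col_types
```

Here `id` is reported as `"character"`, `age` as `"numeric"` and `visit` as `"integer"`, with the column names carried along as the names of the result.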
5. If you want to check (or test) the type (or `class`) of an individual variable you can use the functions `is.numeric`, `is.character`, etc. To reference an individual column of a data object, you can use the construct `<object>$<variable>`.
[**Note that R is case sensitive! When typing names of R objects, variables or datasets, ensure that the capitalisation matches what is shown.**]{.smallcaps} So "AGE" is not the same as "age" or "Age". Note that variable names below are capitalised.
Let's check whether the `AGE` variable is numeric and whether `RACE` is character. As these functions each take a single argument, you do not need to name it.
```{r check_variable_types}
is.numeric( )
is.character( )
```
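A minimal sketch of the `<object>$<variable>` construct and the type-checking functions, using a made-up data frame (`toy`, `WEIGHT` and `SITE` are hypothetical names, not columns of `adsl`):

```r
# Made-up data: one numeric column and one character column
toy <- data.frame(
  WEIGHT = c(70.5, 82.1),
  SITE   = c("701", "702"),
  stringsAsFactors = FALSE
)

is.numeric(toy$WEIGHT)    # TRUE
is.character(toy$SITE)    # TRUE: digits inside quotes are still character
is.numeric(toy$SITE)      # FALSE

# Column names are case sensitive: there is no column called "weight"
is.null(toy$weight)       # TRUE
```

Note that asking for a column with the wrong capitalisation returns `NULL` rather than an error, so a typo in a column name can go unnoticed.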
6. There are *many* R packages that will help us do tasks with R. The `skimr` package is one of these that quickly displays helpful summary information about a dataset - the number of rows, columns, missing data, column types, distribution of values in data columns.
```{r skimr}
skim(data = adsl)
```
Note: You may see Warning messages as part of the output.
If some columns have missing values, you can use these warning messages to check that the data has the correct formats and that columns have been parsed correctly e.g. columns with coded values = 0,1 may be read as numeric. Is this how we want them to remain, or would we want to interpret these as factors?
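If you decide that a 0/1 coded column should be treated as a factor, one possible approach is the base R `factor()` function. The sketch below uses a made-up column `FLAG`, and the labels `"No"` and `"Yes"` are an assumption about what the codes mean.

```r
# Made-up data: a flag stored as the numbers 0 and 1
toy <- data.frame(FLAG = c(0, 1, 1, 0))
class(toy$FLAG)

# Convert to a factor, stating the codes and the label for each code
toy$FLAG_F <- factor(x = toy$FLAG, levels = c(0, 1), labels = c("No", "Yes"))

# The new column is a factor with two levels
class(toy$FLAG_F)
levels(toy$FLAG_F)
table(toy$FLAG_F)
```

Keeping the original numeric column alongside the factor makes it easy to check that the conversion did what you intended.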
7. Now create a new data object `adsl_eff` for the Efficacy Analysis population. We need to filter the `adsl` data object to keep the rows where the variable `EFFFL` has the value `Y`. To test for equality in R we need to use double equals `==`. Type in the `adsl` object below as the first argument (`.data=`) of the `filter` function.
```{r create_adsl_eff}
adsl_eff <- filter(.data = , EFFFL == 'Y')
adsl_eff
```
After keeping only the rows where `EFFFL == 'Y'` we now have 234 rows and 48 columns.
Again, recall that R is case-sensitive. Most variable names in SAS datasets are uppercase, so we need to ensure that the variable name is UPPERCASE i.e. `EFFFL` and not `Efffl` etc. Also, since the value is character, we need to ensure that the case of the equivalent value matches the expected result i.e. `Y` and not `y` or `Yes` and that it is enclosed in quotes `" "`.
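The effect of case on a character comparison can be seen in a small sketch. The data frame `toy` and its columns are made up for illustration.

```r
library(dplyr)

# Made-up data: the flag is recorded with inconsistent case
toy <- data.frame(
  ID   = c("01", "02", "03", "04"),
  FLAG = c("Y", "N", "y", "Y"),
  stringsAsFactors = FALSE
)

# Keeps only rows where FLAG is exactly uppercase "Y"
kept <- filter(.data = toy, FLAG == "Y")
kept$ID        # "01" "04"

# Comparing against lowercase "y" keeps a different row
kept_lower <- filter(.data = toy, FLAG == "y")
kept_lower$ID  # "03"
```

The row with `FLAG` equal to `"y"` is dropped by the first filter, which is why it is worth checking the distinct values in a column before filtering on it.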
8. Sort the `adsl_eff` data object by the `USUBJID` variable. Type in the variable you want to sort by after the first argument.
```{r arrange_adsl_eff}
sort_adsl_eff <- arrange(.data = adsl_eff, )
```
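As a sketch of how `arrange` behaves with one or more sort variables, the example below uses a made-up data frame (`toy`, `ARM` and `ID` are hypothetical names). Sort variables are listed after the `.data` argument, separated by commas.

```r
library(dplyr)

# Made-up data in no particular order
toy <- data.frame(
  ARM = c("B", "A", "B", "A"),
  ID  = c("02", "04", "01", "03"),
  stringsAsFactors = FALSE
)

# Sort by a single variable
by_id <- arrange(.data = toy, ID)
by_id$ID           # "01" "02" "03" "04"

# Sort by ARM first, then by ID within each ARM
by_arm_then_id <- arrange(.data = toy, ARM, ID)
by_arm_then_id$ID  # "03" "04" "01" "02"

# desc() reverses the order for that variable
by_id_desc <- arrange(.data = toy, desc(ID))
by_id_desc$ID      # "04" "03" "02" "01"
```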
In the `tidyverse` package in R, there is a pipe operator `%>%` that allows us to combine a number of steps. The pipe passes the output object from the previous step as the value to be used for the first argument of the next function. So we combine the previous steps into the following data pipeline:
```{r using_pipe}
sort_adsl_eff <- adsl %>%
  filter(EFFFL == "Y") %>%
  arrange(USUBJID)
```
When we use the `%>%` operator, we read this as "...then...". So the above pipeline is read as "(Create an object called) **sort_adsl_eff** which is **assigned by** (starting with the object) **adsl** *...then...* **filter** (where EFFFL is "Y") *...then...* **arrange** (by USUBJID)." The pipe operator allows us to combine steps where the intermediate objects aren't really worth keeping as named objects in R. Notice that when we use the pipe operator, we're no longer explicitly using the `.data =` argument of the following function. This is because the `%>%` passes the previous object through as the value of the first argument by default.
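To make the equivalence concrete, the sketch below writes the same filter-then-sort operation three ways on a made-up data frame (`toy`, `ID` and `FLAG` are hypothetical names) and checks that the results match.

```r
library(dplyr)

# Made-up data
toy <- data.frame(
  ID   = c("03", "01", "02"),
  FLAG = c("Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# 1. Step by step, keeping an intermediate object
step1    <- filter(.data = toy, FLAG == "Y")
stepwise <- arrange(.data = step1, ID)

# 2. Nested calls, read from the inside out
nested <- arrange(.data = filter(.data = toy, FLAG == "Y"), ID)

# 3. Piped, read from top to bottom as "...then..."
piped <- toy %>%
  filter(FLAG == "Y") %>%
  arrange(ID)

# All three approaches give the same result
identical(stepwise, piped)
identical(nested, piped)
```

The piped version avoids both the throwaway `step1` object and the inside-out reading order of the nested version.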
9. Try the following to see a `skim` summary of `adsl_eff`, which shows that we now have 234 observations where `EFFFL == 'Y'`.
```{r skim_adsl_eff}
skim( )
```
10. Sort the `adsl_eff` data by the `TRT01A` and `USUBJID` variables and create a new dataset `adsl_eff_srt`. You can use the previous code (HINT: `arrange( )`) to sort the data.
```{r sort_adsl_eff}
```
11. The `select` function keeps only the specified columns of a data object. Select only the following variables from the `adsl_eff_srt` data object: `USUBJID, AGE, AGEU, SEX, RACE, ETHNIC, TRT01A`. Copy and paste this list of variables into the `select` statement.
```{r adsl_saf_srt_select}
adsl_eff_srt <- adsl_eff_srt %>%
select( )
adsl_eff_srt
```
Now the `adsl_eff_srt` data object still has 234 rows, but only 7 columns.
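A minimal sketch of `select` on a made-up data frame (`toy` and its columns are hypothetical). The columns in the result appear in the order they are listed in the call, and the number of rows is unchanged.

```r
library(dplyr)

# Made-up data with four columns
toy <- data.frame(
  ID   = c("01", "02"),
  AGE  = c(64, 71),
  SEX  = c("F", "M"),
  SITE = c("701", "702"),
  stringsAsFactors = FALSE
)

# Keep three of the four columns, in the order given
slim <- toy %>%
  select(ID, SEX, AGE)

names(slim)  # "ID" "SEX" "AGE"
dim(slim)    # 2 rows, 3 columns
```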
12. Generally, demography listings will not directly use the variable names from the dataset. Hence, add another step to change the column names. The `names()` function sets the names of the nominated object when you assign it a character vector of names. We create a vector using the combine function `c( )` and a comma-separated list of names. Use the code below to change the names of the variables in the `adsl_eff_srt` data object to `"Subject ID","Age","AgeUnits","Sex","Race","Ethnicity","Treatment"`.
```{r rename_adsl_eff_srt}
names(x = adsl_eff_srt) <- c( )
```
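The renaming step can be sketched on a made-up two-column data frame (`toy` and its columns are hypothetical). The vector of new names must have one entry per column, in column order.

```r
# Made-up data
toy <- data.frame(
  ID   = c("01", "02"),
  AGEU = c("YEARS", "YEARS"),
  stringsAsFactors = FALSE
)

names(toy)  # "ID" "AGEU"

# Assign one new name per column, in column order
names(toy) <- c("Subject ID", "Age Units")

names(toy)  # "Subject ID" "Age Units"

# Names containing spaces need backticks when used with $
toy$`Subject ID`
```

Display names with spaces are convenient for listings but awkward in code, so renaming is usually best left as the last step before output.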
13. We can now generate a listing of the data. R has several packages that will allow us to display tables / print listings. In Rmarkdown documents like this one, simply printing the output will create a nice HTML representation of the data. Add in the data object name below to display it in the Rmarkdown document.
```{r display_adsl_eff_srt}
print( )
```
But to save this output for later use, you'll want to use one of these packages to create an output file with the data listing. Each table package has pros and cons, but for now let's use the `htmlTable` package to display an HTML listing. Add in the data object name below to show the `htmlTable` output. The additional arguments here ensure that the table formats nicely. These are beyond the scope of this first project. Try changing the argument `align = "l"` and try options `align = "c"` or `align = "r"` to align the column content to left, centre or right respectively.
```{r create_listing_with_htmlTable}
htmlTable(x = adsl_eff_srt,
align = "l",
rnames=FALSE,
css.cell = "padding-left: .5em; padding-right: .2em;")
```
14. Next, save this .Rmd file to your desktop and then click on the "Knit" button at the top of the file to render an HTML version of this document.