This repository has been archived by the owner on Aug 20, 2019. It is now read-only.
forked from rdpeng/RepData_PeerAssessment1
# Reproducible Research: Peer Assessment 1
Stephen Wade
9 August 2015
## Loading and preprocessing the data
The data is contained in an archive which must be extracted and loaded into R.
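```{r loaddata}
library(knitr)
# Default figure size and alignment for every chunk in the report
opts_chunk$set(fig.width=7, fig.height=5.5, fig.align='center')
# Extract the archive into ./data (created if absent) and read the CSV
if(!dir.exists('./data')) dir.create('./data')
unzip('activity.zip', exdir='./data/')
activity_data <- read.csv('./data/activity.csv')
str(activity_data)
```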
At this stage the missing data are left untreated: the analysis simply ignores
missing values.
The interval is stored as an integer of the form `HHMM` (for example, `1355`
is 13:55). It is split into hours and minutes and added to the date, so that
the date variable becomes a full `POSIXct` timestamp. A weekday column is also
added to the data frame.
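```{r preprocess}
library(lubridate)
library(dplyr)
# Intervals are coded as HHMM; add them to the date as seconds past midnight.
# A fixed time zone (UTC) avoids daylight-saving shifts and keeps as.Date()
# consistent with the original calendar date.
activity_data$date <- as.POSIXct(activity_data$date, tz='UTC') +
  (floor(activity_data$interval / 100) * 3600) +
  ((activity_data$interval %% 100) * 60)
activity_data <- select(activity_data, steps, date, interval) %>%
  mutate(weekday = weekdays(date))
```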
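The interval encoding can be checked on a handful of values. The sketch below assumes, as the preprocessing code does, that an interval such as `1355` means 13:55. The helper name `interval_to_minutes` and the example date are hypothetical and are not part of the analysis.

```r
# Hypothetical helper: convert an HHMM-style interval to minutes since midnight.
interval_to_minutes <- function(interval) {
  hours   <- interval %/% 100   # leading digits are the hour
  minutes <- interval %% 100    # last two digits are the minute
  hours * 60 + minutes
}

intervals <- c(0, 5, 55, 100, 1355, 2355)
mins <- interval_to_minutes(intervals)

# Attach the offset to an arbitrary date in UTC so that no daylight-saving
# shift can occur.
stamp <- as.POSIXct('2012-10-01', tz = 'UTC') + mins * 60
data.frame(interval = intervals,
           minutes  = mins,
           time     = format(stamp, '%H:%M'))
```

Note that consecutive interval codes are not evenly spaced (`55` is followed by `100`), which is why the raw integer makes a poor x-axis and is converted first.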
## What is mean total number of steps taken per day?
A histogram of the total number of steps taken per day is given below.
```{r histogram_daysteps}
library(ggplot2)
# Total steps per calendar day; na.rm=TRUE means an all-missing day sums to 0
ad_total <- group_by(activity_data, date=as.Date(date)) %>%
  summarize(steps=sum(steps, na.rm=TRUE))
g <- ggplot(data=ad_total, aes(steps))
# theme_bw() is a complete theme, so it must come before theme() tweaks
g + geom_histogram(binwidth=max(ad_total$steps)/20,
                   color='light grey') +
  xlab('Total steps per day') +
  ggtitle('Histogram of steps/day') +
  theme_bw() + theme(axis.title.y=element_blank())
a <- summary(ad_total$steps)
```
The mean total number of steps per day is `r a['Mean']` and the median is
`r a['Median']`.
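One consequence of `sum(steps, na.rm=TRUE)` is easier to see on toy data: a day whose values are all missing gets a total of zero rather than `NA`. The sketch below uses invented numbers, not the activity data, and the helper `sum_or_na` is a hypothetical alternative that keeps such days as `NA`.

```r
# Toy data (invented for illustration): day 'b' has no recorded steps at all.
toy <- data.frame(day   = c('a', 'a', 'b', 'b'),
                  steps = c(10, NA, NA, NA))

# na.rm = TRUE turns an all-missing day into a total of 0.
totals_zero <- tapply(toy$steps, toy$day, sum, na.rm = TRUE)

# Hypothetical alternative: return NA when every value for the day is missing.
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
totals_na <- tapply(toy$steps, toy$day, sum_or_na)

totals_zero                     # a = 10, b = 0
totals_na                       # a = 10, b = NA
mean(totals_zero)               # 5: pulled down by the artificial zero
mean(totals_na, na.rm = TRUE)   # 10
```

Which behaviour is appropriate depends on the question being asked; the analysis here uses the first form.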
## What is the average daily activity pattern?
A time series is given below showing the average daily activity pattern by time
interval.
```{r timeseries_intervalsteps}
# Mean steps for each five-minute interval, keyed by minutes since midnight
ad_total <- group_by(activity_data, minute=hour(date)*60+minute(date)) %>%
  summarize(steps=mean(steps, na.rm=TRUE))
g <- ggplot(data=ad_total, aes(x=minute, y=steps))
g + geom_line() +
  xlab('Time of day') +
  ylab('Mean steps per interval') +
  theme_bw() +
  scale_x_continuous(labels = function(time) sprintf('%02d:%02d',
                                                     floor(time/60),
                                                     time%%60),
                     breaks=c(0,360,720,1080,1440))
# Start of the interval with the highest mean, formatted as HH:MM
max_interval <- ad_total$minute[which.max(ad_total$steps)]
max_interval <- sprintf('%02d:%02d',
                        floor(max_interval/60),
                        max_interval%%60)
```
The interval with the highest average number of steps is the five-minute
interval starting at `r max_interval`.
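The peak lookup and the `HH:MM` label can be sketched on toy values. The numbers are invented and include a deliberate tie, to show that `which()` with `==` returns every tied maximum while `which.max()` returns only the first. The helper name `to_hhmm` is hypothetical.

```r
# Toy profile (invented values): mean steps at four times of day,
# with minutes counted from midnight.
profile <- data.frame(minute = c(0, 515, 720, 1080),
                      steps  = c(2, 180, 95, 180))

# which() with == returns every tied maximum; which.max() returns the first.
all_peaks  <- profile$minute[which(profile$steps == max(profile$steps))]
first_peak <- profile$minute[which.max(profile$steps)]

# Hypothetical helper: format minutes since midnight as HH:MM.
to_hhmm <- function(m) {
  sprintf('%02d:%02d', as.integer(m %/% 60), as.integer(m %% 60))
}

to_hhmm(all_peaks)    # "08:35" "18:00"
to_hhmm(first_peak)   # "08:35"
```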
## Imputing missing values
```{r}
missing_rate <- apply(is.na(activity_data), 2, mean)
missing_count <- apply(is.na(activity_data), 2, sum)
```
There are `r missing_count['steps']` missing values of the steps variable, a
missing-data rate of `r missing_rate['steps']`. This is addressed by replacing
each missing value with the median of the observed values for the same weekday
and interval.
```{r}
# Median of the observed steps for each weekday/interval pair
ad_medians <- group_by(activity_data,
                       weekday,
                       interval) %>%
  summarise(msteps=median(as.numeric(steps), na.rm=TRUE)) %>%
  ungroup()
# Attach each row's group median, then fill only the missing step counts
activity_clean <- left_join(activity_data,
                            ad_medians,
                            by=c('weekday', 'interval')) %>%
  mutate(steps = ifelse(is.na(steps), msteps, steps)) %>%
  select(steps,
         date,
         interval,
         weekday)
missing_rate_clean <- apply(is.na(activity_clean), 2, mean)
print(missing_rate_clean)
```
No missing data remain; every missing value has been imputed.
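The group-median imputation can be sketched on toy data with base R. The values are invented, and `ave()` is used here in place of a join purely to keep the example self-contained; it is not the method used in the analysis.

```r
# Toy data (invented): two weekday/interval groups, each with one missing value.
toy <- data.frame(
  weekday  = c('Monday', 'Monday', 'Monday', 'Sunday', 'Sunday', 'Sunday'),
  interval = c(800, 800, 800, 800, 800, 800),
  steps    = c(10, 30, NA, 0, NA, 4)
)

# Median of the observed values within each weekday/interval group,
# repeated for every row of that group.
group_median <- ave(toy$steps, toy$weekday, toy$interval,
                    FUN = function(x) median(x, na.rm = TRUE))

# Replace only the missing entries; observed values are left untouched.
toy$steps_imputed <- ifelse(is.na(toy$steps), group_median, toy$steps)
toy$steps_imputed   # 10 30 20 0 2 4
```

If a group had no observed values at all, its median would itself be `NA`, so checking the missing rate after imputation remains worthwhile.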
A histogram of the steps per day activity is produced using the clean data,
shown below.
```{r histogram_stepsday_clean}
# Total steps per calendar day, now computed from the imputed data
ad_total <- group_by(activity_clean, date=as.Date(date)) %>%
  summarize(steps=sum(steps, na.rm=TRUE))
g <- ggplot(data=ad_total, aes(steps))
# theme_bw() is a complete theme, so it must come before theme() tweaks
g + geom_histogram(binwidth=max(ad_total$steps)/20,
                   color='light grey') +
  xlab('Total steps per day') +
  ggtitle('Histogram of steps/day') +
  theme_bw() + theme(axis.title.y=element_blank())
b <- summary(ad_total$steps)
```
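The mean of the clean data is `r b['Mean']` with a median of `r b['Median']`.
With the missing data imputed, both the median and the mean are higher than
the values obtained when the missing data were ignored. This is consistent
with the use of `na.rm=TRUE` in the earlier totals, where any day whose values
were all missing contributed a total of zero.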
## Are there differences in activity patterns between weekdays and weekends?
Time series are given below using the clean data, with days separated into two
categories: weekdays and weekends. On weekdays, activity is somewhat lower
from about 9am onwards, whereas it stays stronger throughout the day on
weekends. Additionally, the peak in activity at `r max_interval` is much more
pronounced on weekdays than on weekends.
```{r timeseries_byweekday}
# weekdays() returns localised names; this comparison assumes an English locale
activity_clean <- mutate(activity_clean,
                         daytype = factor(ifelse(weekday %in% c('Saturday',
                                                                'Sunday'),
                                                 'Weekend',
                                                 'Weekday')))
# Mean steps per interval, computed separately for weekdays and weekends
ad_total <- group_by(activity_clean, daytype,
                     minute=hour(date)*60+minute(date)) %>%
  summarize(steps=mean(steps, na.rm=TRUE))
g <- ggplot(data=ad_total, aes(x=minute, y=steps, group=daytype))
g + geom_line() +
  xlab('Time of day') +
  ylab('Mean steps per interval') +
  facet_wrap(~daytype, nrow=2) +
  theme_bw() +
  scale_x_continuous(labels = function(time) sprintf('%02d:%02d',
                                                     floor(time/60),
                                                     time%%60),
                     breaks=c(0,360,720,1080,1440))
```
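Matching on day names depends on the session locale, since `weekdays()` returns localised names. A minimal locale-independent sketch is given below. It uses the numeric `wday` field of `POSIXlt`, in which 0 is Sunday and 6 is Saturday; the dates are arbitrary examples and this is an alternative, not the method used above.

```r
# Arbitrary example dates: a Friday, a Saturday, a Sunday and a Monday.
dates <- as.Date(c('2012-10-05', '2012-10-06', '2012-10-07', '2012-10-08'))

# POSIXlt stores the day of the week as a number: 0 = Sunday, ..., 6 = Saturday.
day_number <- as.POSIXlt(dates)$wday

# Classify without relying on (locale-dependent) day names.
daytype <- factor(ifelse(day_number %in% c(0, 6), 'Weekend', 'Weekday'))
data.frame(date = dates, day_number = day_number, daytype = daytype)
```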