forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPA1_template.Rmd
121 lines (92 loc) · 3.4 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
# Reproducible Research: Peer Assessment 1
```{r setoptions, echo=FALSE}
opts_chunk$set(echo = TRUE)
```
## Loading and preprocessing the data
Load data file.
```{r}
D <- read.csv("activity.csv",
colClasses = c("integer", "Date", "integer"))
str(D)
head(D)
```
## What is mean total number of steps taken per day?
A histogram of the total number of steps taken each day.
```{r meanstep}
D.step <- tapply(D$steps, D$date, sum, na.rm = TRUE)
hist(D.step,
xlab = "Total number of steps taken each days",
ylab = "Number of days",
breaks = 10)
```
```{r MeanAndMedian}
D.sum <- summary(D.step)
str(D.sum)
```
The mean and median total number of steps taken
per day are respectively `r D.sum[["Mean"]]` and `r as.integer(D.sum[["Median"]])`.
## What is the average daily activity pattern?
Time series plot of the average steps taken in each 5-minute interval across all the days.
```{r dailyactivityplot}
D.interval <- tapply(D$steps, D$interval, mean, na.rm = TRUE)
str(D.interval)
plot(D.interval, type = "l",
xlab = "5-minute interval", ylab = "average of steps taken")
```
Which 5-minute interval on avarage contains the maximum?
```{r maximuminterval}
D.asc <- D.interval[order(-D.interval)]
D.maxintvl <- D.asc[1]
```
The 5-minute interval of `r names(D.maxintvl)` contains `r D.maxintvl` as the maximum value.
## Imputing missing values
```{r totalna}
D.na <- sum(is.na(D$steps))
```
The total number of missing values is `r D.na`.
Assign median of 5-minute interval to missing values of data.
```{r missingvaluereplacment}
D.medianintvl <- tapply(D$steps, D$interval, median, na.rm = TRUE)
D.filter <- D
for (i in names(D.medianintvl)) {
D.filter$steps <- replace(
D.filter$steps,
which(is.na(D.filter$steps)),
D.medianintvl[[i]])
}
```
The histogram of the total number of steps taken each days.
```{r histosumoffiltered}
D.step2 <- tapply(D.filter$steps, D.filter$date, sum)
hist(D.step2,
xlab = "Total number of steps taken each days",
ylab = "Number of days",
breaks = 10)
```
The mean and median total number of steps taken per days.
```{r differencefromfilter}
D.mean <- tapply(D$steps, D$date, mean)
D.median <- tapply(D$steps, D$date, median)
D.mean2 <- tapply(D.filter$steps, D.filter$date, mean)
D.median2 <- tapply(D.filter$steps, D.filter$date, median)
hist(D.mean, breaks = 10)
hist(D.mean2, breaks = 10)
```
When we compare two histogram of median total number of steps taken per days, we found that higher distribution around steps of 0 indicates that most of the missing values are treated as 0 for median value. Imputing missing values changes the distribution to higher density near 0.
## Are there differences in activity patterns between weekdays and weekends?
Add a new column for weekdays with two levels of "weekday" and "weekend".
```{r weekdays}
D.filter$week <- !(weekdays(D.filter$date, abbreviate = TRUE) %in% c("土","日"))
D.filter$week <- factor(D.filter$week,
levels = c(TRUE, FALSE),
labels = c("weekday", "weekend"))
head(D.filter)
```
The time series plot of 5-minute interval and steps taken all the weekdays and weekends.
```{r weekdayplot, fig.height=10}
library(lattice)
D.aggr<- aggregate(steps ~ interval + week, D.filter, mean)
head(D.aggr)
xyplot(steps ~ interval | week, data = D.aggr, type = "l",
xlab = "5-minute interval", ylab = "average steps taken")
```