---
title: "CaseStudy2 Notebook"
author: "Jason Fields & Scott Payne"
date: "December 3, 2016"
output:
  html_document: default
  html_notebook: default
---
```{r setup, include=TRUE}
knitr::opts_chunk$set(echo = TRUE)
```
# Documenting the code and package versions
```{r}
library(tseries)
sessionInfo()
```
## Question 1
Create the X matrix and print it from SAS, R, and Python.
SAS Matrix Input
proc IML;
X = {4 5 1 2, 1 0 3 5, 2 1 8 2};
print X;
R Matrix input
```{r}
X <- matrix(c(4, 5, 1, 2, 1, 0, 3, 5, 2, 1, 8, 2),nrow = 3,ncol = 4,byrow = TRUE)
X
```
Python Matrix input
```{python}
import numpy as np
X = np.matrix('4, 5, 1, 2; 1, 0, 3, 5; 2, 1, 8, 2')
print(X)
```
## Question 2
Please do the following with your assigned stock.
Jason & Scott: GW Pharmaceuticals (GWPH)
* Download the data.
* Calculate log returns.
* Calculate volatility measure.
* Calculate volatility over the entire length of the series for three different decay factors.
* Plot the results, overlaying the volatility curves on the data, just as was done in the S&P example.
Download the data
```{r}
stock_data<-get.hist.quote('GWPH',quote="Close")
length(stock_data)
```
Calculate log returns.
```{r}
stock_log<-log(lag(stock_data))-log(stock_data)
length(stock_log)
```
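As a quick sanity check of the log-return step, here is a minimal sketch on made-up prices (the vector `toy_prices` is hypothetical, not GWPH data). For a plain numeric vector, `diff(log(p))` gives log(p[t+1]) - log(p[t]). That should be the same quantity as the zoo expression above, assuming zoo's convention that a positive lag shifts the series earlier in time.

```{r}
# Hypothetical closing prices, for illustration only
toy_prices <- c(100, 102, 101, 105)
# Log return between consecutive observations: log(p[t+1]) - log(p[t])
toy_log_returns <- diff(log(toy_prices))
toy_log_returns
# Log returns add up over time: their sum equals the log of the total price ratio
all.equal(sum(toy_log_returns), log(105 / 100))
```

A series of n prices yields n - 1 returns, which is why the returns series is one observation shorter than the price series.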
Calculate volatility measure.
```{r}
stock_vol<-sd(stock_log)*sqrt(250)*100
print(stock_vol)
```
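The scaling in the volatility measure can be illustrated on made-up numbers. The sketch below assumes roughly 250 trading days per year and uncorrelated daily returns, so that variance grows linearly with time and the standard deviation grows with its square root. The vector `toy_daily_returns` is hypothetical.

```{r}
# Hypothetical daily log returns, for illustration only
toy_daily_returns <- c(0.010, -0.020, 0.015, -0.005, 0.000)
# Daily standard deviation scaled by sqrt(250) and expressed in percent
toy_annual_vol <- sd(toy_daily_returns) * sqrt(250) * 100
toy_annual_vol
```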
Calculate volatility over the entire length of the series for three different decay factors.
```{r}
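## Exponentially weighted volatility estimate with decay factor d
Vol <- function(d, logrets)
{
  var = 0      # running weighted estimate of the variance
  lam = 0      # running sum of the weights (effective sample size)
  varlist <- c()
  for (r in logrets) {
    lam = lam*(1 - 1/d) + 1
    var = (1 - 1/lam)*var + (1/lam)*r^2
    varlist <- c(varlist, var)
  }
  sqrt(varlist)
}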
```
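To see what the decay factor does in the recursion used by `Vol`, the sketch below repeats the same update with a preallocated result vector and checks two limiting cases on made-up returns. The names `ewma_vol_sketch` and `toy_rets` are hypothetical and exist only for this illustration.

```{r}
# Hypothetical helper: same recursion as Vol(), with a preallocated result
ewma_vol_sketch <- function(d, logrets) {
  logrets <- as.numeric(logrets)
  out <- numeric(length(logrets))
  v <- 0
  lam <- 0
  for (i in seq_along(logrets)) {
    lam <- lam * (1 - 1 / d) + 1
    v <- (1 - 1 / lam) * v + (1 / lam) * logrets[i]^2
    out[i] <- v
  }
  sqrt(out)
}

# Made-up returns, for illustration only
toy_rets <- c(0.01, -0.02, 0.015, -0.005)
# d = 1: lam stays at 1, so only the latest return counts
all.equal(ewma_vol_sketch(1, toy_rets), abs(toy_rets))
# d = Inf: lam counts observations, so the estimate is the running mean of squares
all.equal(ewma_vol_sketch(Inf, toy_rets),
          sqrt(cumsum(toy_rets^2) / seq_along(toy_rets)))
```

Small values of d react quickly to recent returns, while large values of d give smoother curves. This is why the three curves plotted for d = 10, 30, and 100 differ in how jagged they look.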
Plot the results, overlaying the volatility curves on the data, just as was done in the S&P example.
```{r}
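# Recreate Figure 6.12 in the text on page 155
volest <- Vol(10, stock_log)
volest2 <- Vol(30, stock_log)
volest3 <- Vol(100, stock_log)
plot(volest, type = "l")
lines(volest2, type = "l", col = "red")
lines(volest3, type = "l", col = "blue")
# Label the curves by decay factor
legend("topright", legend = c("d = 10", "d = 30", "d = 100"),
       col = c("black", "red", "blue"), lty = 1)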
```
## Question 3
The built-in data set called Orange in R is about the growth of orange trees. The Orange data frame has 3 columns of records of the growth of orange trees.
Variable description
Tree : an ordered factor indicating the tree on which the measurement is made. The ordering is according to increasing maximum diameter.
age : a numeric vector giving the age of the tree (days since 1968/12/31)
circumference : a numeric vector of trunk circumferences (mm). This is probably “circumference at breast height”, a standard measurement in forestry.
Load the data
```{r}
data("Orange")
Orange
```
a) Calculate the mean and the median of the trunk circumferences for each tree size (Tree).
```{r}
tree_mean<-aggregate(Orange$circumference, list(Orange$Tree), FUN=mean)
tree_mean
tree_median<-aggregate(Orange$circumference, list(Orange$Tree), FUN=median)
tree_median
```
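As an alternative sketch, the formula interface of `aggregate()` can return both statistics in one call. The name `tree_stats` is hypothetical, and the result stores the two statistics as a matrix column named `circumference`.

```{r}
# Mean and median circumference per tree in a single aggregate() call
tree_stats <- aggregate(circumference ~ Tree, data = Orange,
                        FUN = function(x) c(mean = mean(x), median = median(x)))
tree_stats
```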
b) Make a scatter plot of the trunk circumferences against the age of the tree. Use different plotting symbols for different tree sizes.
```{r}
library(ggplot2)
ggplot(Orange, aes(y=circumference, x=age, shape=Tree)) +
  geom_point()
```
c) Display the trunk circumferences on a comparative boxplot against tree. Be sure you order the boxplots in the increasing order of maximum diameter.
```{r}
ggplot(Orange, aes(y=circumference, x=Tree)) +
  geom_boxplot()
```
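The boxplot relies on the level order of the `Tree` factor, which the variable description says follows increasing maximum diameter. A minimal check of that assumption, using the hypothetical name `tree_max`:

```{r}
# Maximum circumference per tree, returned in the factor's level order
tree_max <- tapply(Orange$circumference, Orange$Tree, max)
tree_max
# TRUE when the level order already runs from smallest to largest maximum
identical(names(sort(tree_max)), levels(Orange$Tree))
```

If the check holds, no reordering is needed, because ggplot2 draws the boxes of a factor in level order.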
## Question 4
Download the “Temp” data set (check your SMU email).
```{r}
temp<-read.csv('Temp.csv', header = TRUE,
               dec = ".", fill = TRUE )
head(temp)
summary(temp)
```
(i) Find the difference between the maximum and the minimum monthly average temperatures for each country and report/visualize top 20 countries with the maximum differences for the period since 1900.
```{r}
temp$Country <- as.character(temp$Country)
library(lubridate)
# parse_date_time() handles the two date formats present in the file
temp$Date <- parse_date_time(x = temp$Date,
                             orders = c("y-m-d", "m/d/y"))
# Keep dates from 1900 onward (parse_date_time() returns UTC by default)
temp <- temp[temp$Date >= as.POSIXct("1900-01-01", tz = "UTC"),]
# Max Average Monthly Temperature by Country
max.temp <- aggregate(temp$Monthly.AverageTemp, list(temp$Country), FUN=max, na.rm = TRUE)
# Min Average Monthly Temperature by Country
min.temp <- aggregate(temp$Monthly.AverageTemp, list(temp$Country), FUN=min, na.rm = TRUE)
# Difference between Max and Min (a one-column data frame named x)
diff.temp <- (max.temp[2] - min.temp[2])
# Add Countries to diff.temp
diff.temp$Country <- max.temp$Group.1
# Sort by largest difference
diff.temp <- diff.temp[order(-diff.temp$x),]
# Keep the 20 countries with the largest difference
high.temp.variation <- diff.temp[1:20,1:2]
# high.temp.variation$Country <- as.character(high.temp.variation$Country)
names(high.temp.variation) <- c("Temp", "Country")
#class(high.temp.variation$Country)
```
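The max-minus-min step can also be written as a single grouped range. The sketch below uses a made-up data frame (`toy_temps` and its values are hypothetical, not taken from Temp.csv) and relies on the formula interface of `aggregate()`, which drops rows with missing values by default.

```{r}
# Hypothetical toy data, for illustration only
toy_temps <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  Monthly.AverageTemp = c(-5, 10, 25, 18, NA, 24)
)
# Range (max minus min) per country; rows with NA are dropped by the formula interface
toy_range <- aggregate(Monthly.AverageTemp ~ Country, data = toy_temps,
                       FUN = function(x) diff(range(x, na.rm = TRUE)))
# Largest ranges first; head() keeps the top n rows
head(toy_range[order(-toy_range$Monthly.AverageTemp), ], n = 1)
```

With the real data, `n = 20` would give the top 20 countries.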
```{r}
plot_q4_i <- ggplot(high.temp.variation, aes(x = Country, y = Temp)) +
  geom_point() +
  geom_text(aes(label = Country)) +
  xlab("Country") + theme(axis.text.x = element_blank())
plot_q4_i
```
(ii) Select a subset of the Temp data called “UStemp” containing US land temperatures from 01/01/1990 onward. Use the UStemp dataset to answer the following.
```{r}
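# US rows from 1990-01-01 onward (Date was parsed to POSIXct in UTC above)
UStemp <- temp[temp$Country == 'United States' &
               temp$Date >= as.POSIXct("1990-01-01", tz = "UTC"),]
head(UStemp)
summary(UStemp)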
```
a) Create a new column to display the monthly average land temperatures in Fahrenheit (°F).
```{r}
library(weathermetrics)
UStemp$Monthly.AverageTempF <- convert_temperature(UStemp$Monthly.AverageTemp, old_metric = "celsius", new_metric = "fahrenheit")
```
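For reference, the conversion itself is F = C × 9/5 + 32. A minimal base-R sketch, where the helper name `celsius_to_fahrenheit` is hypothetical and not part of weathermetrics:

```{r}
# Hypothetical helper: convert degrees Celsius to degrees Fahrenheit
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}
# Freezing point, boiling point, and the value where both scales coincide
celsius_to_fahrenheit(c(0, 100, -40))
```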
b) Calculate average land temperature by year and plot it. The original file has the average land temperature by month.
```{r}
UStemp$Year <- format(UStemp$Date,format="%Y")
Yearly <- aggregate(UStemp$Monthly.AverageTempF ~ Year , UStemp, mean, na.rm = T)
```
```{r}
ggplot(Yearly, aes(x=Year, y=`UStemp$Monthly.AverageTempF`)) +
  geom_point(shape=1) + xlab("Year") + ylab("Average Temperature in US (°F)")
```
c) Calculate the one-year difference of average land temperature by year and provide the maximum difference (value) with the corresponding two years.
(For example, for year 2000, add all 12 monthly averages and divide by 12 to get the average temperature in 2000. Do the same for every available year, then calculate the one-year differences as 1991-1990, 1992-1991, etc.)
```{r}
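# One-year change in the yearly average; the first year has no predecessor, so it is NA
Yearly$two.year <- c(NA, diff(Yearly$`UStemp$Monthly.AverageTempF`))
# Rows for the two years on either side of the largest absolute one-year change
Yearly[which.max(abs(Yearly$two.year)) + c(-1, 0), ]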
```
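The one-year difference described above can be traced on a small made-up series. All names and values in this sketch (`toy_yearly`, `AvgTempF`, `Change`, `toy_idx`) are hypothetical. It treats the "maximum difference" as the largest absolute change, which is one possible reading of the question.

```{r}
# Hypothetical yearly averages, for illustration only
toy_yearly <- data.frame(Year = 1990:1994,
                         AvgTempF = c(50.0, 50.4, 49.1, 49.9, 50.2))
# Change from the previous year; the first year has no predecessor
toy_yearly$Change <- c(NA, diff(toy_yearly$AvgTempF))
# Index of the largest absolute one-year change (which.max() skips the NA)
toy_idx <- which.max(abs(toy_yearly$Change))
# The two years involved in that change
toy_yearly[c(toy_idx - 1, toy_idx), ]
```

Padding with `NA` rather than 0 keeps the first year from showing up as a spurious large jump.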
(iii) Download the “CityTemp” data set (check your SMU email). Find the difference between the maximum and the minimum temperatures for each major city and report/visualize the top 20 cities with the maximum differences for the period since 1900.
Read in the CityTemp file.
```{r}
CityTemp<-read.csv('CityTemp.csv', header = TRUE,
                   dec = ".", fill = TRUE )
```
Summarize the data
```{r}
head(CityTemp)
summary(CityTemp)
str(CityTemp)
```
Align the date formats and subset to dates from 1900 onward
```{r}
library(lubridate)
# Parse the mixed date formats and align them to a single format
CityTemp$Date <- parse_date_time(x = CityTemp$Date,
                                 orders = c("y-m-d", "m/d/y"))
# Keep dates from 1900 onward (parse_date_time() returns UTC by default)
CityTemp <- CityTemp[CityTemp$Date >= as.POSIXct("1900-01-01", tz = "UTC"),]
```
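To show what aligning two date formats involves, here is a minimal base-R sketch that tries each format per element and keeps whichever one parses. The helper `parse_mixed_dates` is hypothetical, and it assumes four-digit years and exactly the two layouts named above (`y-m-d` and `m/d/y`).

```{r}
# Hypothetical helper: parse strings that are either "YYYY-MM-DD" or "M/D/YYYY"
parse_mixed_dates <- function(x) {
  iso <- as.Date(x, format = "%Y-%m-%d")   # NA where the string is not y-m-d
  us  <- as.Date(x, format = "%m/%d/%Y")   # NA where the string is not m/d/y
  out <- iso
  # Fill the gaps left by the first format with the second format's result
  out[is.na(out)] <- us[is.na(out)]
  out
}

parse_mixed_dates(c("1900-01-01", "2/1/1900"))
```

Strings that match neither layout stay `NA`, so they can be inspected before filtering by date.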
Reshape the data to do analysis
```{r}
library(dplyr)
#drop unneeded columns
CityTemp<-select(CityTemp, -Latitude, -Longitude )
```
```{r}
# Max Average Monthly Temperature by City
max.CityTemp <- aggregate(CityTemp$Monthly.AverageTemp, list(CityTemp$City), FUN=max, na.rm = TRUE)
# Min Average Monthly Temperature by City
min.CityTemp <- aggregate(CityTemp$Monthly.AverageTemp, list(CityTemp$City), FUN=min, na.rm = TRUE)
# Difference between Max and Min (a one-column data frame named x)
diff.CityTemp <- (max.CityTemp[2] - min.CityTemp[2])
# Add Cities to diff.CityTemp
diff.CityTemp$City <- max.CityTemp$Group.1
# Sort by largest difference
diff.CityTemp <- diff.CityTemp[order(-diff.CityTemp$x),]
# Keep the 20 cities with the largest difference
high.CityTemp.variation <- diff.CityTemp[1:20,1:2]
high.CityTemp.variation$City<-as.character(high.CityTemp.variation$City)
```
```{r}
plot_q4_iii <- ggplot(high.CityTemp.variation, aes(x = City, y = x)) +
  geom_point() +
  geom_text(aes(label = City)) +
  xlab("City") + ylab("Temp") + theme(axis.text.x = element_blank())
plot_q4_iii
```
(iv) Compare the two graphs in (i) and (iii) and comment on them.
```{r}
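# Pair the two top-20 tables by rank (row 1 with row 1, and so on)
dat <- data.frame(high.temp.variation, high.CityTemp.variation)
ggplot(dat, aes(x = x, y = Temp)) + geom_point() +
  xlab("City temperature range") + ylab("Country temperature range")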
#Compare the two plots
plot_q4_i
plot_q4_iii
```
The two charts show very similar trends between cities and countries.