-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathreadme.Rmd
266 lines (194 loc) · 8.91 KB
/
readme.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
---
title: "EDA"
author: "Shikhar"
output:
html_document:
keep_md: true
---
## Performing Exploratory Data Analysis
This is an EDA performed on the dataset **Pisa Scores 2013-2015.csv** which if you want you can download from [here](https://www.kaggle.com/zazueta/pisa-scores-2015/download) .
PISA stands for "Program for International Student Assessment" and it is applied to 15 year-old students across the world to assess their performance in Math, Reading and Science. Here, I have tried to analyze and explore the dataset and resultantly infer some meaningful insights from it.
The libraries I used in this process are:
1. **tidyverse** cleaning of the datase and performing some transitions.
2. **ggplot2** for plotting.
3. **corrplot** for the correlation plot.
4. **Hmisc** for cor and rcorr functions to find correlation and finding the P-values.
5. **ggmap** for retrieving map tiles from online servies.
Setting up the Working Dir. and calling the libraries.
```{r,message=FALSE,warning=FALSE}
setwd("D:/R/Visualization TDS")
library(tidyverse)
library(ggmap)
library(ggplot2)
library(Hmisc)
library(corrplot)
```
### Data Loading.
Loading the dataset from the system analyzing it's structure and the number of rows and columns present in the dataset.
```{r}
data <- read.csv(file = "Pisa mean perfromance scores 2013 - 2015 Data.csv",encoding = "UTF-8-BOM",na.string='..')
str(data)
ncol(data)
nrow(data)
```
### Data Pre-processing.
First we will choose those columns which are helpuful for us. For e.g., the 2013[YR2013] and 2014[YR2014] are NAs so we will not choose them and similarly some other columns.
We chose Country Names(column1), Series Code(column4) and 2015[YR2015] marks(column7).
We use the unique row values from Series.Code column of the dataset as the Column Names for the new DataSet(dataf) which we rename for better readability and removed the NA values. This was done using pipes method.
```{r}
dataf <- data[1:1161,c(1,4,7)] %>%
pivot_wider(names_from = Series.Code,values_from = X2015..YR2015.) %>%
rename(CountryName=ï..Country.Name,Maths=LO.PISA.MAT,MathsF=LO.PISA.MAT.FE,MathsM=LO.PISA.MAT.MA,Reading=LO.PISA.REA,ReadingF=LO.PISA.REA.FE,ReadingM=LO.PISA.REA.MA,Science=LO.PISA.SCI,ScienceF=LO.PISA.SCI.FE,ScienceM=LO.PISA.SCI.MA)%>%
drop_na()
head(dataf)
```
_A view of the newly formed dataset._
### Visualization
Since, in the dataset there are several Countries present, plotting all of them Graphically in the World map.
```{r,message=FALSE,warning=FALSE,fig.width=11.916666667,fig.height=5.97916666667}
wrldmap <- map_data("world")
mrgddata <- merge(wrldmap,dataf,by.x="region",by.y="CountryName")
mrgddata <- mrgddata[order(mrgddata$group,mrgddata$order),]
head(mrgddata)
ggplot(mrgddata) +
aes(x=long,y=lat,group=group) + geom_polygon() + aes(fill=region) +
theme_dark()
```
Visualizing the Math Score Data with the Country Names.
```{r,fig.height=8}
ggplot(dataf) +
aes(x=reorder(CountryName,Maths),y=Maths) +
geom_bar(stat = 'identity') +
aes(fill=Maths) +
coord_flip() +
scale_fill_gradient(name="Score Level") +
geom_hline(yintercept = mean(dataf$Maths)) +
theme_grey() +
labs(x="Country Name",y="Math Score",title = "Graph Relation b/w Math & Countries") +
geom_hline(yintercept = mean(dataf$Maths),size=1,col="salmon3")
```
Visualizing the Science Score Data with the Country Names.
```{r,fig.height=8}
ggplot(dataf) +
aes(x=reorder(CountryName,Science),y=Science) +
geom_bar(stat='identity') +
aes(fill=Science) +
coord_flip() +
scale_fill_gradient("Score Level") +
labs(x="Country Name",y="Science Score",title="Graph Relation b/w Science & Countries") +
theme_bw() +
geom_hline(yintercept = mean(dataf$Science),size=1,col="sienna3")
```
Visualizing the Reading Data with the Country Names.
```{r,fig.height=8}
ggplot(dataf) +
aes(x=reorder(CountryName,Reading),Reading) +
geom_bar(stat = 'identity') +
aes(fill=Reading) +
coord_flip() +
scale_fill_gradient(name="Score Level") +
theme_classic() +
labs(x="Country Name",y="Reading Score",title="Graoh Relation b/w Reading & Countries") +
geom_hline(yintercept = mean(dataf$Reading),size=1,col="seashell4")
```
Subsetting data in a new dataframe according to gender and subject with only CountryName, SubjectGender and value as the column names.
```{r}
dataf2 <- dataf[,c(1,3,4,6,7,9,10)] %>%
pivot_longer(c(2,3,4,5,6,7),names_to = 'SubjectGender')
```
Now, Gender wise each subject marks Visualization in Box - plot form.
```{r,fig.height=15}
ggplot(dataf2)+
aes(x=SubjectGender,y=value)+
geom_point(pos="jitter",alpha=0.25) +
geom_boxplot(alpha=0.75) +
aes(fill=SubjectGender) +
scale_fill_manual(values = c("wheat4","tomato","wheat","violet","seashell3","salmon")) +
theme_linedraw() +
labs(x="Subject and Gender",y="Values/Scores")+
facet_wrap(.~SubjectGender,scales = "free_x",nrow=3,ncol=2)
```
The boxplots look similar due to the scales="free_x" argument in facet_wrap() function.
Though it is early to judge but, the boxplots infer that the Gender Male performed better in Maths and Science but the womwn performed better in Reading.
Now we plot the correlation graph.
Subsetting relevant data and finding correlation between the Values by the Pearson Method.
```{r}
dataf3 <- dataf[,c(1,3,4,6,7,9,10)]
res <- cor(dataf3[,-1])
res
```
_Note : Pearson method measures the strength of linear relation ship b/w two variables. Value is always betweeen -1 and 1._
p-Value for the corr. Lower the p-value more significant the correlation.
```{r}
rcorr(as.matrix(dataf3[,-1]))$P
```
Visualizing the correlation. Stronger the color and bigger the size, higher the correlation is. Thus, all the values are correlated.
```{r}
corrplot(res,order="hclust",type = "upper",tl.col = "black",tl.cex =0.5)
```
Creating another new sub-dataset from the present dataset to find the difference in the various subjects for each specific gender and all the countries separately.
```{r}
datafg <- mutate(dataf[,1],MathDiff = ((dataf$MathsF-dataf$MathsM)/dataf$MathsM)*100,
ScienceDiff = (dataf$ScienceF-dataf$ScienceM)/dataf$ScienceM*100,
ReadingDiff = (dataf$ReadingF-dataf$ReadingM)/dataf$ReadingM*100,
Total=dataf$Maths+dataf$Science+dataf$Reading,
AverageDiff=(MathDiff+ScienceDiff+ReadingDiff)/3)
head(datafg)
```
_A view of the new dataset._
Forming graphs with respect to the present differences in the dataset.
Graph Representation of Maths Score Difference between Female and Male students.
```{r,fig.height=10}
ggplot(datafg)+
aes(x=reorder(CountryName,MathDiff),y=MathDiff)+
geom_bar(stat='identity') + aes(fill=MathDiff) +
scale_fill_gradient("Math Difference") +
geom_hline(yintercept = mean(datafg$MathDiff),size=1,col="brown") +
labs(x="Country Names",y="Math Score Diff %",title="Math Diff v/s Country") + coord_flip() +
theme_bw()
```
_From this plot it can be inferred that Men are better in Maths Subject in more countries._
Graph Representation of Reading Score Difference between Female and Male students.
```{r,fig.height=10}
ggplot(datafg) +
aes(x=reorder(CountryName,ReadingDiff),y=ReadingDiff) +
geom_bar(stat='identity') + aes(fill=ReadingDiff) +
scale_fill_gradient("Reading Difference Palette") +
geom_hline(yintercept = mean(datafg$ReadingDiff),size=1,col="tan2") +
labs(x="Country Names",y="Reading Score Diff",title="Reading Diff v/s Country") +
coord_flip() +
theme_bw()
```
_We can infer that females were better at reading in almost every country._
Graph Representation of Science Score Difference between Female and Male students.
```{r,fig.height=10}
ggplot(datafg) +
aes(x=reorder(CountryName,ScienceDiff),y=ScienceDiff) +
geom_bar(stat='identity') + aes(fill=ScienceDiff) +
scale_fill_gradient("Science Difference",) +
geom_hline(yintercept = mean(datafg$ScienceDiff),size=1,col="purple") +
labs(x="Country Names",y="Science Score Diff %",title="Science Diff v/s Country") +
coord_flip() +
theme_bw()
```
_Science also shows trend similar to Maths._
So now we form a graph with the average difference values in all the countries in all three subjects.
```{r,fig.height=10}
ggplot(datafg)+
aes(x=reorder(CountryName,AverageDiff),y=AverageDiff) +
geom_bar(stat='identity') + aes(fill=AverageDiff) +
scale_fill_gradient("Average Diff") +
geom_hline(yintercept = mean(datafg$AverageDiff),size=1,col="tomato3") +
labs(x="Country Name",y="Average Diff",title="Average Diff v/s Country") +
coord_flip() + theme_bw()
```
Final plot for showing the varying of Average Values with the total combined scores(maths + science + reading).
```{r}
ggplot(datafg) +
aes(x=AverageDiff,y=Total) +
geom_jitter() +
geom_smooth(fill="springgreen4",alpha=0.50,col="snow4") + theme_light() +
labs(x="Average Diff",y="Total",title="Average Diff v/s Total")
```
The difference between Male and Female candidates is very low i.e.; almost 0.
**Made with ❤️ by [Shikhar](https://www.linkedin.com/in/shikharkrdixit/) .**