---
title: 'Clustering and classification'
description: 'Datasets in R, Linear Discriminant Analysis (LDA) and K-means clustering'
---
## Datasets inside R
```yaml
type: NormalExercise
key: 3db2d48f25
lang: r
xp: 100
skills: 1
```
Welcome to the *Clustering and classification* chapter.
R comes with many (usually small) datasets already loaded in, and installed packages ship with additional datasets of their own. Some of these datasets are quite famous (like the [Iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) flower data) and they are frequently used for teaching purposes or to demonstrate statistical methods.
This week we will be using the [Boston](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html) dataset from the MASS package. Let's see what it looks like!
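If you are curious what is available out of the box, you can list the built-in datasets directly from R (this is just for exploring, not needed for the exercise):
```{r}
# list the datasets in the currently attached packages
data()
# list the datasets that ship with a specific package, e.g. MASS
data(package = "MASS")
```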
`@instructions`
- Load the `Boston` dataset from MASS
- Explore the `Boston` dataset. Look at the structure with `str()` and use `summary()` to see the details of the variables.
- Draw the plot matrix with `pairs()`
`@hint`
- You can draw the `pairs()` plot by giving the function the name of the object the data is saved in.
`@pre_exercise_code`
```{r}
```
`@sample_code`
```{r}
# access the MASS package
library(MASS)
# load the data
data("Boston")
# explore the dataset
# plot matrix of the variables
```
`@solution`
```{r}
# access the MASS package
library(MASS)
# load the data
data("Boston")
# explore the dataset
str(Boston)
summary(Boston)
# plot matrix of the variables
pairs(Boston)
```
`@sct`
```{r}
test_output_contains("summary(Boston)", incorrect_msg = "Did you use the summary() function on the Boston data?")
test_output_contains("str(Boston)", incorrect_msg = "Did you use the str() function on the Boston data?")
test_function("pairs", incorrect_msg = "Did you draw the plot matrix with pairs?")
test_error()
success_msg("Good work!")
```
---
## Correlations plot
```yaml
type: NormalExercise
key: 603dff42ce
lang: r
xp: 100
skills: 1
```
It is often interesting to look at the correlations between the variables in the data. The function `cor()` can be used to create the correlation matrix. A more visual way to look at the correlations is the `corrplot()` function (from the corrplot package).
Use corrplot to visualize the correlations between the variables of the Boston dataset.
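If you want to try `cor()` on its own first, here is a minimal sketch using the small built-in `cars` dataset (not part of the exercise):
```{r}
# correlation matrix of a small built-in dataset
cor(cars)
# rounding makes the matrix easier to read
round(cor(cars), digits = 2)
```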
`@instructions`
- Calculate the correlation matrix and save it as `cor_matrix`. Print the matrix to see what it looks like.
- Adjust the code: use the pipe (`%>%`) to round the matrix. Rounding can be done with the `round()` function. Round to two digits. Print the matrix again.
- Plot the rounded correlation matrix
- Adjust the code: add the argument `type = "upper"` to the plot. Print the plot again.
- Adjust the code a little more: add the arguments `cl.pos = "b"`, `tl.pos = "d"` and `tl.cex = 0.6` to the plot. Print the plot again.
- See more about corrplot [here](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html)
`@hint`
- For correlation matrices, see `?cor`
- The pipe (`%>%`) takes the object on its left and passes it to the function on its right. For example `data_column %>% summary()` creates a summary of data_column.
- The link on the last line of the instructions has quite good examples about corrplot
`@pre_exercise_code`
```{r}
library(MASS)
library(tidyr)
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrplot.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/colorlegend.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrMatOrder.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrRect.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrRect.hclust.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrplot.mixed.R")
data("Boston")
```
`@sample_code`
```{r}
# MASS, corrplot, tidyr and Boston dataset are available
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# print the correlation matrix
# visualize the correlation matrix
corrplot(cor_matrix, method="circle")
```
`@solution`
```{r}
# MASS, corrplot, tidyr and Boston dataset are available
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
```
`@sct`
```{r}
test_function("round", args = "digits",incorrect_msg = "Did you use the first two digits in rounding?")
test_error()
success_msg("Good work!")
```
---
## Scale the whole dataset
```yaml
type: NormalExercise
key: 3a4eacf66c
lang: r
xp: 100
skills: 1
```
Usually the built-in R datasets do not need much data wrangling, as they are already in good shape. But we will need to make a few small adjustments.
For later use, we will need to scale the data. In scaling, we subtract the column mean from each column and divide the difference by the standard deviation of that column:
$$scaled(x) = \frac{x - mean(x)}{sd(x)}$$
The Boston data contains only numerical values, so we can use the function `scale()` to standardize the whole dataset.
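As a sanity check, here is a minimal sketch of the same standardization done by hand for a single column (assuming MASS and Boston are loaded as in the exercise):
```{r}
library(MASS)
data("Boston")
# standardize one column manually: subtract the mean, divide by the sd
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
# scale() applies the same transformation to every column at once
all.equal(as.numeric(scale(Boston)[, "crim"]), crim_manual)
```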
`@instructions`
- Use the `scale()` function on the `Boston` dataset. Save the scaled data to `boston_scaled` object.
- Use `summary()` to look at the scaled variables. Note the means of the variables.
- Find out the class of the scaled object by executing the `class()` function.
- Later we will want the data to be a data frame. Use `as.data.frame()` to convert the `boston_scaled` to a data frame format. Keep the object name as `boston_scaled`.
`@hint`
- See `?scale`
`@pre_exercise_code`
```{r}
library(MASS)
data("Boston")
```
`@sample_code`
```{r}
# MASS and Boston dataset are available
# center and standardize variables
boston_scaled <- "change me!"
# summaries of the scaled variables
# class of the boston_scaled object
class(boston_scaled)
# change the object to data frame
```
`@solution`
```{r}
# MASS and Boston dataset are available
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
# class of the boston_scaled object
class(boston_scaled)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
```
`@sct`
```{r}
test_function("scale", args = "x",incorrect_msg = "Did you scale the dataset?")
test_function("as.data.frame", args = "x",incorrect_msg = "Please use the function as.data.frame() to convert the data to a data frame.")
test_function("summary", args = "object", incorrect_msg = "Please use the function summary() to look at the summaries of the data variables.")
test_error()
success_msg("Good work!")
```
---
## Creating a factor variable
```yaml
type: NormalExercise
key: 3d45bd1da6
lang: r
xp: 100
skills: 1
```
We can create a categorical variable from a continuous one. There are many ways to do that. Let's choose the variable `crim` (per capita crime rate by town) to be our factor variable. We want to cut the variable by [quantiles](https://en.wikipedia.org/wiki/Quantile) to get the high, low and middle rates of crime into their own categories.
See how it's done below!
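As a warm-up, here is a minimal sketch of the same idea on a made-up toy vector:
```{r}
x <- c(1, 2, 3, 4, 10, 20, 30, 40)
# quantile() returns the minimum, quartiles and maximum by default
breaks <- quantile(x)
# cut() turns the continuous vector into a factor with four categories
cut(x, breaks = breaks, include.lowest = TRUE,
    labels = c("low", "med_low", "med_high", "high"))
```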
`@instructions`
- Look at the summary of the scaled variable `crim`
- Use the function `quantile()` on the scaled crime rate variable and save the results to `bins`. Print the results.
- Create categorical crime vector with the `cut()` function. Set the `breaks` argument to be the quantile vector you just created.
- Use the function `table()` on the `crime` object
- Adjust the code of `cut()` by adding the `labels` argument to the function. Create a string vector with the values `"low"`, `"med_low"`, `"med_high"`, `"high"` (in that order) and use it to set the labels.
- Create the table of the `crime` object again
- Execute the last lines of code to remove the original crime rate variable and add the new categorical one to the scaled Boston dataset.
- NOTE! If you receive an error message regarding factors while submitting and you feel your solution is correct, try pressing the submit-button again without altering the code. This usually works. We are currently working on the problem.
`@hint`
- You can create a vector with `c()`
- Separate the values with commas in a vector
- Remember that strings need the quotes around them
- `table(*object_name*)` creates a cross tabulation of the object
`@pre_exercise_code`
```{r}
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
```
`@sample_code`
```{r}
# MASS, Boston and boston_scaled are available
# summary of the scaled crime rate
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = "change me!", include.lowest = TRUE)
# look at the table of the new factor crime
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
```
`@solution`
```{r}
# MASS, Boston and boston_scaled are available
# summary of the scaled crime rate
summary(boston_scaled$crim)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
```
`@sct`
```{r}
test_function("summary", incorrect_msg = "Did you use the summary() function on the crim variable in boston_scaled data?")
test_function("cut", args = "labels", incorrect_msg = "Did you set the labels to cut()?")
test_function("table", incorrect_msg = "Did you look at the table of the new factor crime?")
test_error()
success_msg("Nicely done! By the way, sometimes there are functions with the same name in different R packages (like `select` in **dplyr** and **MASS**). You can choose which one to use by adding the name of the package with two colons in front of the function call (for example `dplyr::select()`).")
```
---
## Divide and conquer: train and test sets
```yaml
type: NormalExercise
key: bca5932717
lang: r
xp: 100
skills: 1
```
When we want to use a statistical method to predict something, it is important to have data to test how well the predictions work. Splitting the original data into train and test sets allows us to check how well our model performs.
The model is trained with the train set and predictions on new data are made with the test set. This way you have the true classes / labels for the test data, and you can calculate how well the model performed in prediction.
Time to split our data!
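The split is based on `sample()`, which draws row numbers at random without replacement. A minimal sketch of the indexing on a made-up toy data frame:
```{r}
df <- data.frame(id = 1:10)
# draw 80% of the row numbers at random
ind <- sample(nrow(df), size = nrow(df) * 0.8)
df[ind, , drop = FALSE]   # the train rows
df[-ind, , drop = FALSE]  # the remaining 20%: the test rows
```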
`@instructions`
- Use the function `nrow()` on the `boston_scaled` to get the number of rows in the dataset. Save the number of rows in `n`.
- Execute the code to choose randomly 80% of the rows and save the row numbers to `ind`
- Create `train` set by selecting the row numbers that are saved in `ind`.
- Create `test` set by removing the rows that are used in the train set
- Take the crime classes from the `test` and save them as `correct_classes`
- Execute the code to remove `crime` from `test` set
`@hint`
- You can get the number of rows with `nrow(*name_of_the_dataframe*)`
- `train` and `test` are data frames, so you can access their columns with the `$` operator
`@pre_exercise_code`
```{r}
library(MASS)
boston_scaled <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/boston_scaled.txt", sep =",", header = T)
```
`@sample_code`
```{r}
# boston_scaled is available
# number of rows in the boston_scaled dataset
n <- "change me!"
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- "change me!"
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
```
`@solution`
```{r}
# boston_scaled is available
# number of rows in the boston_scaled dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
```
`@sct`
```{r}
test_object("n", incorrect_msg = "Did you create `n` with nrow()?")
test_object("correct_classes", incorrect_msg = "Did you create `correct_classes`?")
test_error()
success_msg("Splitting data is heavy work, well done!")
```
---
## Video: Linear Discriminant Analysis
```yaml
type: VideoExercise
key: 5d4203282b
lang: r
xp: 50
skills: 1
video_link: //player.vimeo.com/video/203184662
```
---
## Linear Discriminant Analysis
```yaml
type: NormalExercise
key: 2de2d66d11
lang: r
xp: 100
skills: 1
```
[Linear discriminant analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis) (LDA) is a classification (and dimension reduction) method. It finds the (linear) combinations of the variables that best separate the target variable's classes. The target can be a binary or a multiclass variable.
Linear discriminant analysis is closely related to many other methods, such as principal component analysis (we will look into that next week) and the already familiar logistic regression.
LDA can be visualized with a biplot. We will talk more about biplots next week. The LDA biplot arrow function used in the exercise is (with slight changes) taken from [this](http://stackoverflow.com/questions/17232251/how-can-i-plot-a-biplot-for-lda-in-r) Stack Overflow thread.
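If you want to see `lda()` in its simplest form first, here is a minimal sketch on the built-in iris data (not part of the exercise):
```{r}
library(MASS)
# Species is the target; the dot means "all other variables"
lda_iris <- lda(Species ~ ., data = iris)
lda_iris
# plot the observations on the two linear discriminants
plot(lda_iris, dimen = 2)
```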
`@instructions`
- Fit a linear discriminant analysis with the function `lda()`. The function takes a formula (like in regression) as its first argument. Use `crime` as the target variable and all the other variables as predictors. Hint! You can type `target ~ .` where the dot means all other variables in the data.
- Print the `lda.fit` object
- Create a numeric vector of the train set's crime classes (for plotting purposes)
- Use the function `plot()` on the `lda.fit` model. The argument `dimen` can be used to choose how many discriminants are used.
- Adjust the code: add the arguments `col = classes` and `pch = classes` to the plot.
- Execute the `lda.arrows()` function definition (if you haven't done that already). Draw the plot with the LDA arrows. Note that in DataCamp you will need to select both lines of code and execute them at the same time for the `lda.arrows()` function to work.
- You can change the `myscale` argument of `lda.arrows()` to see more clearly which way the arrows are pointing.
`@hint`
- The formula looks like `target ~ .`
- The target variable in here is `crime`
- You can change a factor to numeric with `as.numeric()`
- Remember to execute the `lda.arrows()` code together with the plot in DataCamp. Otherwise the arrows won't work.
`@pre_exercise_code`
```{r}
library(MASS)
boston_scaled <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/boston_scaled.txt", sep =",", header = T)
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
```
`@sample_code`
```{r}
# MASS and train are available
# linear discriminant analysis
lda.fit <- lda("change me!", data = train)
# print the lda.fit object
lda.fit
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot("change me!", dimen = 2)
lda.arrows(lda.fit, myscale = 1)
```
`@solution`
```{r}
# MASS and train are available
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
```
`@sct`
```{r}
test_object("lda.fit", incorrect_msg = "Did you add a formula to the LDA model?")
test_function("plot", incorrect_msg = "Did you plot the lda.fit object?")
test_function("plot", args= "col", incorrect_msg = "Did you set the color in the plot?")
test_error()
success_msg("Expanding your statistcal skills! Nice!")
```
---
## Predict LDA
```yaml
type: NormalExercise
key: 0efb23a544
lang: r
xp: 100
skills: 1
```
As in regression, the function `predict()` can be used to predict values based on a model. The function arguments are almost the same. You can see the help page of the LDA prediction method with `?predict.lda`.
We split our data earlier so that we have the test set and the correct class labels. See how the LDA model performs when predicting on new (test) data.
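For reference, here is a minimal sketch of the same predict-and-tabulate workflow on the built-in iris data (the split is made up for illustration):
```{r}
library(MASS)
set.seed(1)
# an 80/20 split of iris, just to illustrate predict()
ind <- sample(nrow(iris), size = nrow(iris) * 0.8)
lda_iris <- lda(Species ~ ., data = iris[ind, ])
# predict() returns the classes, posterior probabilities and LD scores
pred <- predict(lda_iris, newdata = iris[-ind, ])
table(correct = iris$Species[-ind], predicted = pred$class)
```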
`@instructions`
- Predict the crime classes with the `test` data. Like in regression, the `predict()` function takes the model object as a first argument.
- Create a table of the correct classes and the predicted ones. You can get the predicted classes with `lda.pred$class`.
- Look at the table. Did the classifier predict the crime rates correctly?
`@hint`
- `table(*object_name*)` creates a cross tabulation of the object
- You can get the predicted classes with `lda.pred$class`
`@pre_exercise_code`
```{r}
library(MASS)
boston_scaled <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/boston_scaled.txt", sep =",", header = T)
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.fit <- lda(crime ~ ., data = train)
```
`@sample_code`
```{r}
# lda.fit, correct_classes and test are available
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = "change me!")
# cross tabulate the results
table(correct = "change me!", predicted = "change me!")
```
`@solution`
```{r}
# lda.fit, correct_classes and test are available
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
```
`@sct`
```{r}
test_output_contains("table(correct = correct_classes, predicted = lda.pred$class)", incorrect_msg = "Please create a cross tabulation of the correct classes versus the predicted ones")
test_error()
success_msg("Good work!")
```
---
## Video: Distance measures and clustering
```yaml
type: VideoExercise
key: fc6108e633
lang: r
xp: 50
skills: 1
video_link: //player.vimeo.com/video/203198932
```
---
## Towards clustering: distance measures
```yaml
type: NormalExercise
key: 6719c6079e
lang: r
xp: 100
skills: 1
```
Similarity or dissimilarity of objects can be measured with distance measures. There are many different measures for different types of data. The most common or "normal" distance measure is the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).
There are several functions in R that calculate distances. In this exercise, we will be using base R's `dist()` function. It computes the pairwise distances of the observations and returns them as a `dist` object, which represents a square distance matrix. With large datasets the computation of the distance matrix is therefore time consuming, and storing it might take a lot of memory.
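To get a feel for the two measures used in this exercise, here is a minimal sketch on a tiny made-up matrix:
```{r}
# two points: (0, 0) and (3, 4)
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
dist(m)                       # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan") # Manhattan: |3| + |4| = 7
```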
`@instructions`
- Load the MASS package and the `Boston` dataset from it
- Create `dist_eu` by calling the `dist()` function on the Boston dataset. Note that by default, the function uses the Euclidean distance measure.
- Look at the summary of the `dist_eu`
- Next create object `dist_man` that contains the Manhattan distance matrix of the Boston dataset
- Look at the summary of the `dist_man`
`@hint`
- `data(*name_of_the_dataset*)` can be used to load dataset from R package
- See `?dist`. The argument you will need to change the distance measure is called `method`
- Remember that strings need the quotes in R
`@pre_exercise_code`
```{r}
library(MASS)
data('Boston')
```
`@sample_code`
```{r}
# load MASS and Boston
library(MASS)
data('Boston')
# euclidean distance matrix
dist_eu <- "change me!"
# look at the summary of the distances
# manhattan distance matrix
dist_man <- "change me!"
# look at the summary of the distances
```
`@solution`
```{r}
# load MASS and Boston
library(MASS)
data('Boston')
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
# manhattan distance matrix
dist_man <- dist(Boston, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
```
`@sct`
```{r}
test_object("dist_man", incorrect_msg = "Did you create dist_man?")
test_output_contains("summary(dist_eu)", incorrect_msg = "Did you use the summary() function on the dist_eu?")
test_output_contains("summary(dist_man)", incorrect_msg = "Did you use the summary() function on the dist_man?")
test_error()
success_msg("Dissimilar might be similar if measured in a different way. Mind = blown!")
```
---
## K-means clustering
```yaml
type: NormalExercise
key: 63da6251d5
lang: r
xp: 100
skills: 1
```
[K-means](https://en.wikipedia.org/wiki/K-means_clustering) is perhaps the best known and most widely used clustering method. It is an unsupervised method that assigns observations to groups, or **clusters**, based on the similarity of the objects. In the previous exercise we got the hang of **distances**. The `kmeans()` function computes the distances automatically, but it is good to know the basics. Let's cluster a bit!
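To see what `kmeans()` returns, here is a minimal sketch on the built-in iris measurements (not part of the exercise):
```{r}
set.seed(1)  # k-means starts from random centers, so fix the seed
km <- kmeans(iris[, 1:4], centers = 3)
# each observation gets a cluster number
head(km$cluster)
# compare the found clusters to the known species
table(km$cluster, iris$Species)
```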
`@instructions`
- First change the centers in the `kmeans()` function to be `4` and execute the clustering code
- Plot the Boston data with `pairs()`. Adjust the code by adding the `col` argument. Set the color based on the clusters that k-means produced. You can access the cluster numbers with `km$cluster`. Which variables seem to affect the clustering results? Note: with `pairs()` you can reduce the number of variable pairs to see the plots more clearly. On line 7, just replace `Boston` with for example `Boston[6:10]` to pair up five columns (columns 6 to 10).
- Try a different number of clusters: `1`, `2` and `3` (leave it at `3`). Visualize the results.
`@hint`
- You can change the number of the cluster you want to have with the `centers` argument in `kmeans()`
- See `?kmeans`
- You can access the cluster numbers with `km$cluster`
`@pre_exercise_code`
```{r}
library(MASS)
library(ggplot2)
data('Boston')
set.seed(13)
```
`@sample_code`
```{r}
# Boston dataset is available
# k-means clustering
km <- kmeans(Boston, centers = "change me!")
# plot the Boston dataset with clusters
pairs(Boston, col = "change me!")
```
`@solution`
```{r}
# Boston dataset is available
# k-means clustering
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
```
`@sct`
```{r}
test_function("pairs", args = "col",incorrect_msg = "Did you set the color based on the clusters in `pairs()` ? ")
test_error()
success_msg("Clustering like a pro!")
```
---
## K-means: determine the k
```yaml
type: NormalExercise
key: f3d6cd6c00
lang: r
xp: 100
skills: 1
```
K-means needs the number of clusters as an argument. There are many ways to determine the optimal number of clusters, and the best approach may depend on the data you have.
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes (the calculation of the total WCSS was explained in the previous video). When you plot the number of clusters against the total WCSS, the optimal number of clusters is where the total WCSS drops radically.
K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
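Here is a minimal sketch of why the seed matters, again on the built-in iris measurements (not part of the exercise):
```{r}
# with the same seed, repeated runs give identical clusterings
set.seed(42)
km1 <- kmeans(iris[, 1:4], centers = 3)
set.seed(42)
km2 <- kmeans(iris[, 1:4], centers = 3)
identical(km1$cluster, km2$cluster)
# the quantity collected for each k in the exercise code
km1$tot.withinss
```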
`@instructions`
- Set the max number of clusters (`k_max`) to be 10
- Execute the code to calculate total WCSS. This might take a while.
- Visualize the total WCSS when the number of clusters goes from 1 to 10. The optimal number of clusters is where the value of the total WCSS changes radically. In this case, two clusters would seem optimal.
- Run `kmeans()` again with two clusters and visualize the results
`@hint`
- Simply adjust the part of the code that says "change me!"
- You can change the number of the cluster you want to have with the `centers` argument in `kmeans()`
- Numeric values do not need quotes in R
`@pre_exercise_code`
```{r}
library(MASS)
library(ggplot2)
data('Boston')
```
`@sample_code`
```{r}
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- "change me!"
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = "change me!")
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
```
`@solution`
```{r}
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
```
`@sct`
```{r}
test_object("k_max", incorrect_msg = "Did you set the k_max to 10?")
test_object("km", incorrect_msg = "Did you do the clustering with 2 clusters?")
test_error()
success_msg("Well done! ")
```