---
title: 'Clustering and classification'
description: 'Datasets in R, Linear Discriminant Analysis (LDA) and K-means clustering'
---
## Datasets inside R
```yaml
type: NormalExercise
key: 3db2d48f25
lang: r
xp: 100
skills: 1
```
Welcome to the *Clustering and classification* chapter.
R comes with many (usually small) datasets already loaded in, and installed packages ship with additional datasets of their own. Some of these datasets are quite famous (like the [Iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) flower data) and they are frequently used for teaching purposes or to demonstrate statistical methods.
This week we will be using the [Boston](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html) dataset from the MASS package. Let's see what it looks like!
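If you are curious what is available out of the box, you can list the built-in datasets directly from R (this is just for exploring, not needed for the exercise):
```{r}
# list the datasets in the currently attached packages
data()
# list the datasets that ship with a specific package, e.g. MASS
data(package = "MASS")
```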
`@instructions`
- Load the `Boston` dataset from MASS
- Explore the `Boston` dataset. Look at the structure with `str()` and use `summary()` to see the details of the variables.
- Draw the plot matrix with `pairs()`
`@hint`
- You can draw the `pairs()` plot by giving the function the name of the object the data is saved in.
`@pre_exercise_code`
```{r}
```
`@sample_code`
```{r}
# access the MASS package
library(MASS)
# load the data
data("Boston")
# explore the dataset
# plot matrix of the variables
```
`@solution`
```{r}
# access the MASS package
library(MASS)
# load the data
data("Boston")
# explore the dataset
str(Boston)
summary(Boston)
# plot matrix of the variables
pairs(Boston)
```
`@sct`
```{r}
test_output_contains("summary(Boston)", incorrect_msg = "Did you use the summary() function on the Boston data?")
test_output_contains("str(Boston)", incorrect_msg = "Did you use the str() function on the Boston data?")
test_function("pairs", incorrect_msg = "Did you draw the plot matrix with pairs?")
test_error()
success_msg("Good work!")
```
---
## Correlations plot
```yaml
type: NormalExercise
key: 603dff42ce
lang: r
xp: 100
skills: 1
```
It is often interesting to look at the correlations between the variables in the data. The function `cor()` can be used to create the correlation matrix. A more visual way to look at the correlations is the `corrplot()` function (from the corrplot package).
Use corrplot to visualize the correlations between the variables of the Boston dataset.
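If you want to try `cor()` on its own first, here is a minimal sketch using the small built-in `cars` dataset (not part of the exercise):
```{r}
# correlation matrix of a small built-in dataset
cor(cars)
# rounding makes the matrix easier to read
round(cor(cars), digits = 2)
```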
`@instructions`
- Calculate the correlation matrix and save it as `cor_matrix`. Print the matrix to see what it looks like.
- Adjust the code: use the pipe (`%>%`) to round the matrix. Rounding can be done with the `round()` function. Round to two digits. Print the matrix again.
- Plot the rounded correlation matrix
- Adjust the code: add the argument `type = "upper"` to the plot. Print the plot again.
- Adjust the code a little more: add the arguments `cl.pos = "b"`, `tl.pos = "d"` and `tl.cex = 0.6` to the plot. Print the plot again.
- See more about corrplot [here](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html)
`@hint`
- For correlation matrices, see `?cor`
- The pipe (`%>%`) takes the object on its left and passes it to the function on its right. For example `data_column %>% summary()` creates a summary of data_column.
- The link on the last line of the instructions has quite good examples about corrplot
`@pre_exercise_code`
```{r}
library(MASS)
library(tidyr)
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrplot.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/colorlegend.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrMatOrder.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrRect.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrRect.hclust.R")
source("https://raw.githubusercontent.com/taiyun/corrplot/master/R/corrplot.mixed.R")
data("Boston")
```
`@sample_code`
```{r}
# MASS, corrplot, tidyr and Boston dataset are available
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# print the correlation matrix
# visualize the correlation matrix
corrplot(cor_matrix, method="circle")
```
`@solution`
```{r}
# MASS, corrplot, tidyr and Boston dataset are available
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
```
`@sct`
```{r}
test_function("round", args = "digits",incorrect_msg = "Did you use the first two digits in rounding?")
test_error()
success_msg("Good work!")
```
---
## Scale the whole dataset
```yaml
type: NormalExercise
key: 3a4eacf66c
lang: r
xp: 100
skills: 1
```
Usually the built-in R datasets do not need much data wrangling, as they are already in good shape. But we will need to make a few small adjustments.
For later use, we will need to scale the data. In scaling, we subtract the column mean from each column and divide the difference by the standard deviation of that column:
$$scaled(x) = \frac{x - mean(x)}{sd(x)}$$
The Boston data contains only numerical values, so we can use the function `scale()` to standardize the whole dataset.
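As a sanity check, here is a minimal sketch of the same standardization done by hand for a single column (assuming MASS and Boston are loaded as in the exercise):
```{r}
library(MASS)
data("Boston")
# standardize one column manually: subtract the mean, divide by the sd
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
# scale() applies the same transformation to every column at once
all.equal(as.numeric(scale(Boston)[, "crim"]), crim_manual)
```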
`@instructions`
- Use the `scale()` function on the `Boston` dataset. Save the scaled data to `boston_scaled` object.
- Use `summary()` to look at the scaled variables. Note the means of the variables.
- Find out the class of the scaled object by executing the `class()` function.
- Later we will want the data to be a data frame. Use `as.data.frame()` to convert the `boston_scaled` to a data frame format. Keep the object name as `boston_scaled`.
`@hint`
- See `?scale`
`@pre_exercise_code`
```{r}
library(MASS)
data("Boston")
```
`@sample_code`
```{r}
# MASS and Boston dataset are available
# center and standardize variables
boston_scaled <- "change me!"
# summaries of the scaled variables
# class of the boston_scaled object
class(boston_scaled)
# change the object to data frame
```
`@solution`
```{r}
# MASS and Boston dataset are available
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
# class of the boston_scaled object
class(boston_scaled)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
```
`@sct`
```{r}
test_function("scale", args = "x",incorrect_msg = "Did you scale the dataset?")
test_function("as.data.frame", args = "x",incorrect_msg = "Please use the function as.data.frame() to convert the data to a data frame.")
test_function("summary", args = "object", incorrect_msg = "Please use the function summary() to look at the summaries of the data variables.")
test_error()
success_msg("Good work!")
```
---
## Creating a factor variable
```yaml
type: NormalExercise
key: 3d45bd1da6
lang: r
xp: 100
skills: 1
```
We can create a categorical variable from a continuous one. There are many ways to do that. Let's choose the variable `crim` (per capita crime rate by town) to be our factor variable. We want to cut the variable by [quantiles](https://en.wikipedia.org/wiki/Quantile) to get the high, low and middle rates of crime into their own categories.
See how it's done below!
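As a warm-up, here is a minimal sketch of the same idea on a made-up toy vector:
```{r}
x <- c(1, 2, 3, 4, 10, 20, 30, 40)
# quantile() returns the minimum, quartiles and maximum by default
breaks <- quantile(x)
# cut() turns the continuous vector into a factor with four categories
cut(x, breaks = breaks, include.lowest = TRUE,
    labels = c("low", "med_low", "med_high", "high"))
```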
`@instructions`
- Look at the summary of the scaled variable `crim`
- Use the function `quantile()` on the scaled crime rate variable and save the results to `bins`. Print the results.
- Create categorical crime vector with the `cut()` function. Set the `breaks` argument to be the quantile vector you just created.
- Use the function `table()` on the `crime` object
- Adjust the code of `cut()` by adding the `labels` argument to the function. Create a string vector with the values `"low"`, `"med_low"`, `"med_high"`, `"high"` (in that order) and use it to set the labels.
- Create the table of the `crime` object again
- Execute the last lines of code to remove the original crime rate variable and add the new categorical one to the scaled Boston dataset.
- NOTE! If you receive an error message regarding factors while submitting and you feel your solution is correct, try pressing the submit-button again without altering the code. This usually works. We are currently working on the problem.
`@hint`
- You can create a vector with `c()`
- Separate the values with commas in a vector
- Remember that strings need the quotes around them
- `table(*object_name*)` creates a cross tabulation of the object
`@pre_exercise_code`
```{r}
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
```
`@sample_code`
```{r}
# MASS, Boston and boston_scaled are available
# summary of the scaled crime rate
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = "change me!", include.lowest = TRUE)
# look at the table of the new factor crime
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
```
`@solution`
```{r}
# MASS, Boston and boston_scaled are available
# summary of the scaled crime rate
summary(boston_scaled$crim)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
```
`@sct`
```{r}
test_function("summary", incorrect_msg = "Did you use the summary() function on the crim variable in boston_scaled data?")
test_function("cut", args = "labels", incorrect_msg = "Did you set the labels to cut()?")
test_function("table", incorrect_msg = "Did you look at the table of the new factor crime?")
test_error()
success_msg("Nicely done! By the way, sometimes there are functions with the same name in different R packages (like `select` in **dplyr** and **MASS**). You can choose which one to use by adding the name of the package with two colons in front of the function call (for example `dplyr::select()`).")
```
---
## Divide and conquer: train and test sets
```yaml
type: NormalExercise
key: bca5932717
lang: r
xp: 100
skills: 1
```
When we want to use a statistical method to predict something, it is important to have data to test how well the predictions work. Splitting the original data into train and test sets allows us to check how well our model performs.
The model is trained with the train set and predictions on new data are made with the test set. This way you have the true classes / labels for the test data, and you can calculate how well the model performed in prediction.
Time to split our data!
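The split is based on `sample()`, which draws row numbers at random without replacement. A minimal sketch of the indexing on a made-up toy data frame:
```{r}
df <- data.frame(id = 1:10)
# draw 80% of the row numbers at random
ind <- sample(nrow(df), size = nrow(df) * 0.8)
df[ind, , drop = FALSE]   # the train rows
df[-ind, , drop = FALSE]  # the remaining 20%: the test rows
```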
`@instructions`
- Use the function `nrow()` on the `boston_scaled` to get the number of rows in the dataset. Save the number of rows in `n`.
- Execute the code to choose randomly 80% of the rows and save the row numbers to `ind`
- Create `train` set by selecting the row numbers that are saved in `ind`.
- Create `test` set by removing the rows that are used in the train set
- Take the crime classes from the `test` and save them as `correct_classes`
- Execute the code to remove `crime` from `test` set
`@hint`
- You can get the number of rows with `nrow(*name_of_the_dataframe*)`
- `train` and `test` are data frames, so you can access their columns with the `$` operator
`@pre_exercise_code`
```{r}
library(MASS)
boston_scaled <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/boston_scaled.txt", sep =",", header = T)
```
`@sample_code`
```{r}
# boston_scaled is available
# number of rows in the boston_scaled dataset
n <- "change me!"
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- "change me!"
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
```
`@solution`
```{r}
# boston_scaled is available
# number of rows in the boston_scaled dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
```
`@sct`
```{r}
test_object("n", incorrect_msg = "Did you create `n` with nrow()?")
test_object("correct_classes", incorrect_msg = "Did you create `correct_classes`?")
test_error()
success_msg("Splitting data is heavy work, well done!")
```
---
## Video: Linear Discriminant Analysis
```yaml
type: VideoExercise
key: 5d4203282b
lang: r
xp: 50
skills: 1
video_link: //player.vimeo.com/video/203184662
```
---
## Linear Discriminant Analysis
```yaml
type: NormalExercise
key: 2de2d66d11
lang: r
xp: 100
skills: 1
```
[Linear discriminant analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis) (LDA) is a classification (and dimension reduction) method. It finds the (linear) combinations of the variables that best separate the target variable's classes. The target can be a binary or a multiclass variable.
Linear discriminant analysis is closely related to many other methods, such as principal component analysis (we will look into that next week) and the already familiar logistic regression.
LDA can be visualized with a biplot. We will talk more about biplots next week. The LDA biplot arrow function used in the exercise is (with slight changes) taken from [this](http://stackoverflow.com/questions/17232251/how-can-i-plot-a-biplot-for-lda-in-r) Stack Overflow thread.
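If you want to see `lda()` in its simplest form first, here is a minimal sketch on the built-in iris data (not part of the exercise):
```{r}
library(MASS)
# Species is the target; the dot means "all other variables"
lda_iris <- lda(Species ~ ., data = iris)
lda_iris
# plot the observations on the two linear discriminants
plot(lda_iris, dimen = 2)
```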
`@instructions`
- Fit a linear discriminant analysis with the function `lda()`. The function takes a formula (like in regression) as its first argument. Use `crime` as the target variable and all the other variables as predictors. Hint! You can type `target ~ .` where the dot means all other variables in the data.
- Print the `lda.fit` object
- Create a numeric vector of the train set's crime classes (for plotting purposes)
- Use the function `plot()` on the `lda.fit` model. The argument `dimen` can be used to choose how many discriminants are used.
- Adjust the code: add the arguments `col = classes` and `pch = classes` to the plot.
- Execute the `lda.arrows()` function definition (if you haven't done that already). Draw the plot with the LDA arrows. Note that in DataCamp you will need to select both lines of code and execute them at the same time for the `lda.arrows()` function to work.
- You can change the `myscale` argument of `lda.arrows()` to see more clearly which way the arrows are pointing.
`@hint`
- The formula looks like `target ~ .`
- The target variable in here is `crime`
- You can change a factor to numeric with `as.numeric()`
- Remember to execute the `lda.arrows()` code together with the plot in DataCamp. Otherwise the arrows won't work.
`@pre_exercise_code`
```{r}
library(MASS)
boston_scaled <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/boston_scaled.txt", sep =",", header = T)
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
```
`@sample_code`
```{r}
# MASS and train are available
# linear discriminant analysis
lda.fit <- lda("change me!", data = train)
# print the lda.fit object
lda.fit
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot("change me!", dimen = 2)
lda.arrows(lda.fit, myscale = 1)
```
`@solution`
```{r}
# MASS and train are available
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
```
`@sct`
```{r}
test_object("lda.fit", incorrect_msg = "Did you add a formula to the LDA model?")
test_function("plot", incorrect_msg = "Did you plot the lda.fit object?")
test_function("plot", args= "col", incorrect_msg = "Did you set the color in the plot?")
test_error()
success_msg("Expanding your statistcal skills! Nice!")
```
---
## Predict LDA
```yaml
type: NormalExercise
key: 0efb23a544
lang: r
xp: 100
skills: 1
```
As in regression, the function `predict()` can be used to predict values based on a model. The function arguments are almost the same. You can see the help page of the LDA prediction method with `?predict.lda`.
We split our data earlier so that we have the test set and the correct class labels. See how the LDA model performs when predicting on new (test) data.
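For reference, here is a minimal sketch of the same predict-and-tabulate workflow on the built-in iris data (the split is made up for illustration):
```{r}
library(MASS)
set.seed(1)
# an 80/20 split of iris, just to illustrate predict()
ind <- sample(nrow(iris), size = nrow(iris) * 0.8)
lda_iris <- lda(Species ~ ., data = iris[ind, ])
# predict() returns the classes, posterior probabilities and LD scores
pred <- predict(lda_iris, newdata = iris[-ind, ])
table(correct = iris$Species[-ind], predicted = pred$class)
```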
`@instructions`
- Predict the crime classes with the `test` data. Like in regression, the `predict()` function takes the model object as a first argument.
- Create a table of the correct classes and the predicted ones. You can get the predicted classes with `lda.pred$class`.
- Look at the table. Did the classifier predict the crime rates correctly?
`@hint`
- `table(*object_name*)` creates a cross tabulation of the object
- You can get the predicted classes with `lda.pred$class`
`@pre_exercise_code`
```{r}
library(MASS)
boston_scaled <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/boston_scaled.txt", sep =",", header = T)
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.fit <- lda(crime ~ ., data = train)
```
`@sample_code`
```{r}
# lda.fit, correct_classes and test are available
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = "change me!")
# cross tabulate the results
table(correct = "change me!", predicted = "change me!")
```
`@solution`
```{r}
# lda.fit, correct_classes and test are available
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
```
`@sct`
```{r}
test_output_contains("table(correct = correct_classes, predicted = lda.pred$class)", incorrect_msg = "Please create a cross tabulation of the correct classes versus the predicted ones")
test_error()
success_msg("Good work!")
```
---
## Video: Distance measures and clustering
```yaml
type: VideoExercise
key: fc6108e633
lang: r
xp: 50
skills: 1
video_link: //player.vimeo.com/video/203198932
```
---
## Towards clustering: distance measures
```yaml
type: NormalExercise
key: 6719c6079e
lang: r
xp: 100
skills: 1
```
Similarity or dissimilarity of objects can be measured with distance measures. There are many different measures for different types of data. The most common or "normal" distance measure is the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).
There are several functions in R that calculate distances. In this exercise, we will be using base R's `dist()` function. It computes the pairwise distances of the observations and returns them as a `dist` object, which represents a square distance matrix. With large datasets the computation of the distance matrix is therefore time consuming, and storing it might take a lot of memory.
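To get a feel for the two measures used in this exercise, here is a minimal sketch on a tiny made-up matrix:
```{r}
# two points: (0, 0) and (3, 4)
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
dist(m)                       # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan") # Manhattan: |3| + |4| = 7
```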
`@instructions`
- Load the MASS package and the `Boston` dataset from it
- Create `dist_eu` by calling the `dist()` function on the Boston dataset. Note that by default, the function uses the Euclidean distance measure.
- Look at the summary of the `dist_eu`
- Next create object `dist_man` that contains the Manhattan distance matrix of the Boston dataset
- Look at the summary of the `dist_man`
`@hint`
- `data(*name_of_the_dataset*)` can be used to load dataset from R package
- See `?dist`. The argument you will need to change the distance measure is called `method`
- Remember that strings need the quotes in R
`@pre_exercise_code`
```{r}
library(MASS)
data('Boston')
```
`@sample_code`
```{r}
# load MASS and Boston
library(MASS)
data('Boston')
# euclidean distance matrix
dist_eu <- "change me!"
# look at the summary of the distances
# manhattan distance matrix
dist_man <- "change me!"
# look at the summary of the distances
```
`@solution`
```{r}
# load MASS and Boston
library(MASS)
data('Boston')
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
# manhattan distance matrix
dist_man <- dist(Boston, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
```
`@sct`
```{r}
test_object("dist_man", incorrect_msg = "Did you create dist_man?")
test_output_contains("summary(dist_eu)", incorrect_msg = "Did you use the summary() function on the dist_eu?")
test_output_contains("summary(dist_man)", incorrect_msg = "Did you use the summary() function on the dist_man?")
test_error()
success_msg("Dissimilar might be similar if measured in a different way. Mind = blown!")
```
---
## K-means clustering
```yaml
type: NormalExercise
key: 63da6251d5
lang: r
xp: 100
skills: 1
```
[K-means](https://en.wikipedia.org/wiki/K-means_clustering) is perhaps the best known and most widely used clustering method. It is an unsupervised method that assigns observations to groups, or **clusters**, based on the similarity of the objects. In the previous exercise we got the hang of **distances**. The `kmeans()` function computes the distances automatically, but it is good to know the basics. Let's cluster a bit!
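To see what `kmeans()` returns, here is a minimal sketch on the built-in iris measurements (not part of the exercise):
```{r}
set.seed(1)  # k-means starts from random centers, so fix the seed
km <- kmeans(iris[, 1:4], centers = 3)
# each observation gets a cluster number
head(km$cluster)
# compare the found clusters to the known species
table(km$cluster, iris$Species)
```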
`@instructions`
- First change the centers in the `kmeans()` function to be `4` and execute the clustering code
- Plot the Boston data with `pairs()`. Adjust the code by adding the `col` argument. Set the color based on the clusters that k-means produced. You can access the cluster numbers with `km$cluster`. Which variables seem to affect the clustering results? Note: with `pairs()` you can reduce the number of variable pairs to see the plots more clearly. On line 7, just replace `Boston` with for example `Boston[6:10]` to pair up five columns (columns 6 to 10).
- Try a different number of clusters: `1`, `2` and `3` (leave it at `3`). Visualize the results.
`@hint`
- You can change the number of the cluster you want to have with the `centers` argument in `kmeans()`
- See `?kmeans`
- You can access the cluster numbers with `km$cluster`
`@pre_exercise_code`
```{r}
library(MASS)
library(ggplot2)
data('Boston')
set.seed(13)
```
`@sample_code`
```{r}
# Boston dataset is available
# k-means clustering
km <- kmeans(Boston, centers = "change me!")
# plot the Boston dataset with clusters
pairs(Boston, col = "change me!")
```
`@solution`
```{r}
# Boston dataset is available
# k-means clustering
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
```
`@sct`
```{r}
test_function("pairs", args = "col",incorrect_msg = "Did you set the color based on the clusters in `pairs()` ? ")
test_error()
success_msg("Clustering like a pro!")
```
---
## K-means: determine the k
```yaml
type: NormalExercise
key: f3d6cd6c00
lang: r
xp: 100
skills: 1
```
K-means needs the number of clusters as an argument. There are many ways to determine the optimal number of clusters, and the best approach may depend on the data you have.
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes (the calculation of the total WCSS was explained in the previous video). When you plot the number of clusters against the total WCSS, the optimal number of clusters is where the total WCSS drops radically.
K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
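Here is a minimal sketch of why the seed matters, again on the built-in iris measurements (not part of the exercise):
```{r}
# with the same seed, repeated runs give identical clusterings
set.seed(42)
km1 <- kmeans(iris[, 1:4], centers = 3)
set.seed(42)
km2 <- kmeans(iris[, 1:4], centers = 3)
identical(km1$cluster, km2$cluster)
# the quantity collected for each k in the exercise code
km1$tot.withinss
```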
`@instructions`
- Set the max number of clusters (`k_max`) to be 10
- Execute the code to calculate total WCSS. This might take a while.
- Visualize the total WCSS when the number of clusters goes from 1 to 10. The optimal number of clusters is where the value of the total WCSS changes radically. In this case, two clusters would seem optimal.
- Run `kmeans()` again with two clusters and visualize the results
`@hint`
- Simply adjust the part of the code that says "change me!"
- You can change the number of the cluster you want to have with the `centers` argument in `kmeans()`
- Numeric values do not need quotes in R
`@pre_exercise_code`
```{r}
library(MASS)
library(ggplot2)
data('Boston')
```
`@sample_code`
```{r}
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- "change me!"
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = "change me!")
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
```
`@solution`
```{r}
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
```
`@sct`
```{r}
test_object("k_max", incorrect_msg = "Did you set the k_max to 10?")
test_object("km", incorrect_msg = "Did you do the clustering with 2 clusters?")
test_error()
success_msg("Well done! ")
```