Gregory Kanevsky edited this page Dec 16, 2015 · 29 revisions

Prerequisites

Before running the examples, load the baseball dataset into the Aster database and check that your Aster login has the following permissions:

  • read/write access
  • execute on SQL/MR functions (including Aster Analytics Foundation 6.10 or later installed)
  • create and drop analytical and fact tables

k-means Clustering

Running k-means

Given the table batting_enh of batting statistics, cluster batters by their g, r, h, and ab stats:

```r
km.demo = computeKmeans(conn, "batting_enh", centers=3, include=c('g','r','h','ab'), iterMax = 25,
                        aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h",
                                       "AVG(ab) avg_ab", "AVG(ba) ba", "AVG(slg) slg", "AVG(ta) ta"),
                        id="playerid || '-' || teamid || '-' || yearid",
                        scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                        schema='public', where="yearid > 2000", test=FALSE)
```

Note that aggregates is optional and is not part of the k-means model definition (see Visualize Cluster Properties below).

k-means Function Result

The result object km.demo is compatible with the object of class kmeans returned by stats::kmeans:

```
> km.demo
K-means clustering with 3 clusters of sizes 2348, 8052, 2668

Cluster means:
          ab          g          h          r
0  1.8057101  1.7576785  1.8259377  1.8176750
1 -0.6810139 -0.6025208 -0.6645518 -0.6415132
2  0.4344707  0.6312995  0.3669389  0.3050790

Clustering vector: integer(0)

Within cluster sum of squares by cluster:
[1] 2194.507 1858.506 1774.247
 (between_SS / total_SS =  89.4 %)

Available components:
 [1] "cluster"           "centers"           "totss"             "withinss"          "tot.withinss"      "betweenss"         "size"
 [8] "iter"              "ifault"            "scale"             "aggregates"        "tableName"         "columns"           "scaledTableName"
[15] "centroidTableName" "id"                "idAlias"           "whereClause"       "time"
```

The computeKmeans result, being of class toakmeans, actually carries more elements:

```
> str(km.demo)
List of 19
 $ cluster          : int(0)
 $ centers          : num [1:3, 1:4] 1.806 -0.681 0.434 1.758 -0.603 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:4] "ab" "g" "h" "r"
 $ totss            : int 54940
 $ withinss         : num [1:3] 2195 1859 1774
 $ tot.withinss     : num 5827
 $ betweenss        : num 49113
 $ size             : int [1:3] 2348 8052 2668
 $ iter             : int 10
 $ ifault           : num 0
 $ scale            : logi TRUE
 $ aggregates       :'data.frame': 3 obs. of 10 variables:
  ..$ clusterid: int [1:3] 0 1 2
  ..$ cnt      : int [1:3] 2348 8052 2668
  ..$ avg_g    : num [1:3] 141.3 26.1 86.9
  ..$ avg_r    : num [1:3] 76.1 3.3 31.6
  ..$ avg_h    : num [1:3] 144.3 6.9 64.4
  ..$ avg_ab   : num [1:3] 517.5 33.8 252.9
  ..$ ba       : num [1:3] 0.278 0.149 0.254
  ..$ slg      : num [1:3] 0.452 0.21 0.393
  ..$ ta       : num [1:3] 0.77 0.369 0.653
 $ tableName        : chr "batting_enh"
 $ columns          : chr [1:4] "ab" "g" "h" "r"
 $ scaledTableName  : chr "public.kmeans_demo_scaled"
 $ centroidTableName: chr "public.kmeans_demo_centroids"
 $ id               : chr "playerid || '-' || teamid || '-' || yearid"
 $ idAlias          : chr "playerid_teamid_yearid"
 $ whereClause      : chr " WHERE yearid > 2000 "
 $ time             :Class 'proc_time' Named num [1:5] 0.11 0.03 68.98 NA NA
  .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
 - attr(*, "class")= chr [1:2] "toakmeans" "kmeans"
```
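Because a toakmeans object inherits from class kmeans, the standard kmeans component identities hold. A minimal local sketch with stats::kmeans (no Aster connection; the random matrix merely stands in for the scaled g, r, h, ab columns):

```r
# Sketch: kmeans component identities that also hold for toakmeans objects.
set.seed(42)
x  <- scale(matrix(rnorm(200 * 4), ncol = 4))  # stand-in for scaled batting stats
km <- kmeans(x, centers = 3, iter.max = 25)

# total SS decomposes into between- and within-cluster parts
isTRUE(all.equal(km$totss, km$betweenss + km$tot.withinss))  # TRUE

# cluster sizes account for every observation
sum(km$size) == nrow(x)                                      # TRUE
```

This is the decomposition behind the `between_SS / total_SS = 89.4 %` line printed above.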

k-means Visualizations

Visualize k-means cluster centroids

Using line plots:

| Group by clusters | Group by variables |
| --- | --- |
| `createCentroidPlot(km.demo, format="line")` | `createCentroidPlot(km.demo, format="line", groupByCluster=FALSE)` |

*(figures: line plot by clusters, line plot by variables)*

Using bar plots:

| Group by clusters | Group by variables |
| --- | --- |
| `createCentroidPlot(km.demo, format="bar")` | `createCentroidPlot(km.demo, format="bar", groupByCluster=FALSE)` |

*(figures: bar plot by clusters, bar plot by variables)*

Using dodged bar plots:

| Group by clusters | Group by variables |
| --- | --- |
| `createCentroidPlot(km.demo, format="bar_dodge")` | `createCentroidPlot(km.demo, format="bar_dodge", groupByCluster=FALSE)` |

*(figures: dodged bar plot by clusters, dodged bar plot by variables)*

Using heatmaps:

| Heatmap | Heatmap with coordinate flip |
| --- | --- |
| `createCentroidPlot(km.demo, format="heatmap")` | `createCentroidPlot(km.demo, format="heatmap", coordFlip = TRUE)` |

*(figures: centroid heatmap, centroid heatmap with flipped coordinates)*

Visualize Cluster Properties

Clusters have properties associated with them, at the very least element counts. Use the parameter aggregates in the function computeKmeans to define arbitrary aggregates (properties) computed over each cluster (e.g. COUNT(*) cnt defines cluster element counts). Visualize cluster properties with createClusterPlot:

| Color by clusters | Color by properties |
| --- | --- |
| `createClusterPlot(km.demo)` | `createClusterPlot(km.demo, colorByCluster = FALSE)` |

*(figures: cluster properties colored by cluster, cluster properties colored by property)*
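In toaster the aggregates are computed in Aster as SQL aggregates; the same idea can be sketched locally in base R. This hypothetical analogue (synthetic `g` and `h` columns, illustrative names) computes per-cluster summaries over a kmeans assignment:

```r
# Sketch: per-cluster aggregates (counts and means), a local analogue of
# toaster's aggregates parameter computed with base R instead of SQL.
set.seed(1)
df <- data.frame(g = rpois(300, 80), h = rpois(300, 120))  # toy batting-like stats
km <- kmeans(scale(df), centers = 3, nstart = 5)

# one row per cluster: clusterid, mean of each variable, element count
aggr     <- aggregate(df, by = list(clusterid = km$cluster), FUN = mean)
aggr$cnt <- as.vector(table(km$cluster))
aggr
```

The resulting data frame plays the role of the `$aggregates` component shown in the str output above.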

Visualize Clustering Structure

toaster lets you sample Aster table data clustered with k-means to visualize the distributions and structure of clusters for every variable and each pair of variables. To accomplish this, sample the clustered data with the computeClusterSample function:

```r
km.demo = computeClusterSample(conn, km.demo, '0.5')
```

where we sample roughly 50% of the data and save the result into the already existing toakmeans object km.demo. Then we visualize the sampled clustered data with createClusterPairsPlot:

*(figure: cluster pairs plot)*

Determining the Optimal Number of Clusters

Total Within-clusters Sum of Squares

The kmeans object defines the total within-cluster sum of squares as the sum of the per-cluster within-cluster sums of squares. toaster always computes these values on the clusters generated by k-means. Thus, we can determine the optimal number of clusters using the elbow method: run k-means for K from 1 to N and plot the total within-cluster sum of squares as a function of K:

```r
library(ggplot2)
library(ggthemes)  # for theme_tufte()

N = 10
results = vector("list", N)
for(i in 1:N) {
  results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'),
                               iterMax = 1000, threshold=0.1,
                               id="playerid || '-' || teamid || '-' || yearid",
                               scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                               schema='public', where="yearid > 2000")
}

data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
                  totwithinss = unlist(sapply(results, FUN=function(x) x$tot.withinss)))
ggplot(data) +
  geom_line(aes(K, totwithinss)) + geom_point(aes(K, totwithinss)) +
  scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()
```

*(figure: total within-cluster sum of squares vs. K)*
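The same elbow computation can be sketched entirely locally with stats::kmeans, without an Aster connection (the random matrix is an illustrative stand-in for the scaled batting stats):

```r
# Sketch: elbow method on local data with stats::kmeans.
set.seed(7)
x <- scale(matrix(rnorm(500 * 4), ncol = 4))  # stand-in for scaled g, r, h, ab

N <- 10
totwithinss <- sapply(1:N, function(k)
  kmeans(x, centers = k, iter.max = 100, nstart = 5)$tot.withinss)

# the curve decreases with K; look for the "elbow" where gains flatten
plot(1:N, totwithinss, type = "b", xlab = "K",
     ylab = "Total within-cluster SS")
```

Here base plot() keeps the sketch dependency-free; in the toaster examples above the curve is drawn with ggplot2 instead.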

Average Silhouette

Another method of measuring the quality of clustering is the silhouette method. Briefly, it introduces a metric called silhouette, defined for each observation as a function of the clustering. The average silhouette over all observations of the dataset then measures how appropriately the data has been clustered. Silhouette takes values between -1 and 1, with higher values indicating better clustering.

toaster offers the function computeSilhouette that calculates silhouette values for each observation and optionally saves the results in an Aster database table. It also returns the total silhouette value and its distribution across each cluster (using the histogram function in Aster):

```r
library(ggplot2)
library(ggthemes)  # for theme_tufte()

N = 10
results = vector("list", N)
for(i in 1:N) {
  results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'),
                               iterMax = 1000, threshold=0.1,
                               id="playerid || '-' || teamid || '-' || yearid",
                               scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                               schema='public', where="yearid > 2000")

  results[[i]] = computeSilhouette(conn, results[[i]], silhouetteTableName = "public.kmeans_demo_sil", drop=FALSE)
}

# keep K = 2 through 9 (silhouette is undefined for K = 1)
results = results[2:9]
data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
                  silvalue = unlist(sapply(results, FUN=function(x) x$sil$value)))
ggplot(data) +
  geom_line(aes(K, silvalue), color="red") + geom_point(aes(K, silvalue), color="red") +
  labs(title="Optimal Number of Clusters", x="K", y="Total Silhouette Value") +
  scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()
```

*(figure: total silhouette value vs. K)*

Finally, toaster saves the silhouette profiles (distributions) of the clusters with the result and offers a visualization function for them:

```r
km.demo = computeSilhouette(conn, km.demo)
createSilhouetteProfile(km.demo)
```

*(figure: silhouette profiles by cluster)*
