Gregory Kanevsky edited this page Dec 16, 2015 · 29 revisions

Prerequisites

Before running the examples, load the baseball dataset into the Aster database and check that your Aster login has the following permissions:

  • read/write access
  • execute on SQL/MR functions (including Aster Analytics Foundation 6.10 or later installed)
  • create and drop analytical and fact tables

k-means Clustering

Running k-means

Given the table batting_enh of batting statistics, cluster batters by their g, r, h, and ab stats:

```r
km.demo = computeKmeans(conn, "batting_enh", centers=3, include=c('g','r','h','ab'), iterMax = 25,
                        aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h",
                                       "AVG(ab) avg_ab", "AVG(ba) ba", "AVG(slg) slg", "AVG(ta) ta"),
                        id="playerid || '-' || teamid || '-' || yearid",
                        scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                        schema='public', where="yearid > 2000", test=FALSE)
```

Note that aggregates is optional and is not part of the k-means model definition (see Visualize Cluster Properties below).

k-means Function Result

The result object km.demo is compatible with the object of class kmeans returned by stats::kmeans:

```
> km.demo
K-means clustering with 3 clusters of sizes 2348, 8052, 2668

Cluster means:
          ab          g          h          r
0  1.8057101  1.7576785  1.8259377  1.8176750
1 -0.6810139 -0.6025208 -0.6645518 -0.6415132
2  0.4344707  0.6312995  0.3669389  0.3050790

Clustering vector: integer(0)

Within cluster sum of squares by cluster:
[1] 2194.507 1858.506 1774.247
 (between_SS / total_SS =  89.4 %)

Available components:
 [1] "cluster"           "centers"           "totss"             "withinss"          "tot.withinss"      "betweenss"         "size"
 [8] "iter"              "ifault"            "scale"             "aggregates"        "tableName"         "columns"           "scaledTableName"
[15] "centroidTableName" "id"                "idAlias"           "whereClause"       "time"
```

The computeKmeans result, being of class toakmeans, actually carries more elements:

```
> str(km.demo)
List of 19
 $ cluster          : int(0)
 $ centers          : num [1:3, 1:4] 1.806 -0.681 0.434 1.758 -0.603 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:4] "ab" "g" "h" "r"
 $ totss            : int 54940
 $ withinss         : num [1:3] 2195 1859 1774
 $ tot.withinss     : num 5827
 $ betweenss        : num 49113
 $ size             : int [1:3] 2348 8052 2668
 $ iter             : int 10
 $ ifault           : num 0
 $ scale            : logi TRUE
 $ aggregates       :'data.frame': 3 obs. of 10 variables:
  ..$ clusterid: int [1:3] 0 1 2
  ..$ cnt      : int [1:3] 2348 8052 2668
  ..$ avg_g    : num [1:3] 141.3 26.1 86.9
  ..$ avg_r    : num [1:3] 76.1 3.3 31.6
  ..$ avg_h    : num [1:3] 144.3 6.9 64.4
  ..$ avg_ab   : num [1:3] 517.5 33.8 252.9
  ..$ ba       : num [1:3] 0.278 0.149 0.254
  ..$ slg      : num [1:3] 0.452 0.21 0.393
  ..$ ta       : num [1:3] 0.77 0.369 0.653
 $ tableName        : chr "batting_enh"
 $ columns          : chr [1:4] "ab" "g" "h" "r"
 $ scaledTableName  : chr "public.kmeans_demo_scaled"
 $ centroidTableName: chr "public.kmeans_demo_centroids"
 $ id               : chr "playerid || '-' || teamid || '-' || yearid"
 $ idAlias          : chr "playerid_teamid_yearid"
 $ whereClause      : chr " WHERE yearid > 2000 "
 $ time             :Class 'proc_time' Named num [1:5] 0.11 0.03 68.98 NA NA
  .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
 - attr(*, "class")= chr [1:2] "toakmeans" "kmeans"
```
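Because a toakmeans object inherits from class kmeans, the standard kmeans component identities hold. A minimal local sketch with stats::kmeans (no Aster connection; the random matrix merely stands in for the scaled g, r, h, ab columns):

```r
# Sketch: kmeans component identities that also hold for toakmeans objects.
set.seed(42)
x  <- scale(matrix(rnorm(200 * 4), ncol = 4))  # stand-in for scaled batting stats
km <- kmeans(x, centers = 3, iter.max = 25)

# total SS decomposes into between- and within-cluster parts
isTRUE(all.equal(km$totss, km$betweenss + km$tot.withinss))  # TRUE

# cluster sizes account for every observation
sum(km$size) == nrow(x)                                      # TRUE
```

This is the decomposition behind the `between_SS / total_SS = 89.4 %` line printed above.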

k-means Visualizations

Visualize k-means cluster centroids

Using line plots:

| Group by clusters | Group by variables |
| --- | --- |
| `createCentroidPlot(km.demo, format="line")` | `createCentroidPlot(km.demo, format="line", groupByCluster=FALSE)` |

*(figures: line plot by clusters, line plot by variables)*

Using bar plots:

| Group by clusters | Group by variables |
| --- | --- |
| `createCentroidPlot(km.demo, format="bar")` | `createCentroidPlot(km.demo, format="bar", groupByCluster=FALSE)` |

*(figures: bar plot by clusters, bar plot by variables)*

Using dodged bar plots:

| Group by clusters | Group by variables |
| --- | --- |
| `createCentroidPlot(km.demo, format="bar_dodge")` | `createCentroidPlot(km.demo, format="bar_dodge", groupByCluster=FALSE)` |

*(figures: dodged bar plot by clusters, dodged bar plot by variables)*

Using heatmaps:

| Heatmap | Heatmap with coordinate flip |
| --- | --- |
| `createCentroidPlot(km.demo, format="heatmap")` | `createCentroidPlot(km.demo, format="heatmap", coordFlip = TRUE)` |

*(figures: centroid heatmap, centroid heatmap with flipped coordinates)*

Visualize Cluster Properties

Clusters have properties associated with them, at the very least element counts. Use the parameter aggregates in the function computeKmeans to define arbitrary aggregates (properties) computed over each cluster (e.g. COUNT(*) cnt defines cluster element counts). Visualize cluster properties with createClusterPlot:

| Color by clusters | Color by properties |
| --- | --- |
| `createClusterPlot(km.demo)` | `createClusterPlot(km.demo, colorByCluster = FALSE)` |

*(figures: cluster properties colored by cluster, cluster properties colored by property)*
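In toaster the aggregates are computed in Aster as SQL aggregates; the same idea can be sketched locally in base R. This hypothetical analogue (synthetic `g` and `h` columns, illustrative names) computes per-cluster summaries over a kmeans assignment:

```r
# Sketch: per-cluster aggregates (counts and means), a local analogue of
# toaster's aggregates parameter computed with base R instead of SQL.
set.seed(1)
df <- data.frame(g = rpois(300, 80), h = rpois(300, 120))  # toy batting-like stats
km <- kmeans(scale(df), centers = 3, nstart = 5)

# one row per cluster: clusterid, mean of each variable, element count
aggr     <- aggregate(df, by = list(clusterid = km$cluster), FUN = mean)
aggr$cnt <- as.vector(table(km$cluster))
aggr
```

The resulting data frame plays the role of the `$aggregates` component shown in the str output above.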

Visualize Clustering Structure

toaster lets you sample Aster table data clustered with k-means to visualize the distributions and structure of clusters for every variable and each pair of variables. To accomplish this, sample the clustered data with the computeClusterSample function:

```r
km.demo = computeClusterSample(conn, km.demo, '0.5')
```

where we sample roughly 50% of the data and save the result into the already existing toakmeans object km.demo. Then we visualize the sampled clustered data with createClusterPairsPlot:

*(figure: cluster pairs plot)*

Determining the Optimal Number of Clusters

Total Within-clusters Sum of Squares

The kmeans object defines the total within-cluster sum of squares as the sum of the per-cluster within-cluster sums of squares. toaster always computes these values on the clusters generated by k-means. Thus, we can determine the optimal number of clusters using the elbow method: run k-means for K from 1 to N and plot the total within-cluster sum of squares as a function of K:

```r
library(ggplot2)
library(ggthemes)  # for theme_tufte()

N = 10
results = vector("list", N)
for(i in 1:N) {
  results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'),
                               iterMax = 1000, threshold=0.1,
                               id="playerid || '-' || teamid || '-' || yearid",
                               scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                               schema='public', where="yearid > 2000")
}

data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
                  totwithinss = unlist(sapply(results, FUN=function(x) x$tot.withinss)))
ggplot(data) +
  geom_line(aes(K, totwithinss)) + geom_point(aes(K, totwithinss)) +
  scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()
```

*(figure: total within-cluster sum of squares vs. K)*
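The same elbow computation can be sketched entirely locally with stats::kmeans, without an Aster connection (the random matrix is an illustrative stand-in for the scaled batting stats):

```r
# Sketch: elbow method on local data with stats::kmeans.
set.seed(7)
x <- scale(matrix(rnorm(500 * 4), ncol = 4))  # stand-in for scaled g, r, h, ab

N <- 10
totwithinss <- sapply(1:N, function(k)
  kmeans(x, centers = k, iter.max = 100, nstart = 5)$tot.withinss)

# the curve decreases with K; look for the "elbow" where gains flatten
plot(1:N, totwithinss, type = "b", xlab = "K",
     ylab = "Total within-cluster SS")
```

Here base plot() keeps the sketch dependency-free; in the toaster examples above the curve is drawn with ggplot2 instead.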

Average Silhouette

Another method of measuring the quality of clustering is the silhouette method. Briefly, it introduces a metric called silhouette, defined for each observation as a function of the clustering. The average silhouette over all observations of the dataset then measures how appropriately the data has been clustered. Silhouette takes values between -1 and 1, with higher values indicating better clustering.

toaster offers the function computeSilhouette that calculates silhouette values for each observation and optionally saves the results in an Aster database table. It also returns the total silhouette value and its distribution across each cluster (using the histogram function in Aster):

```r
library(ggplot2)
library(ggthemes)  # for theme_tufte()

N = 10
results = vector("list", N)
for(i in 1:N) {
  results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'),
                               iterMax = 1000, threshold=0.1,
                               id="playerid || '-' || teamid || '-' || yearid",
                               scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                               schema='public', where="yearid > 2000")

  results[[i]] = computeSilhouette(conn, results[[i]], silhouetteTableName = "public.kmeans_demo_sil", drop=FALSE)
}

# keep K = 2 through 9 (silhouette is undefined for K = 1)
results = results[2:9]
data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
                  silvalue = unlist(sapply(results, FUN=function(x) x$sil$value)))
ggplot(data) +
  geom_line(aes(K, silvalue), color="red") + geom_point(aes(K, silvalue), color="red") +
  labs(title="Optimal Number of Clusters", x="K", y="Total Silhouette Value") +
  scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()
```

*(figure: total silhouette value vs. K)*

Finally, toaster saves the silhouette profiles (distributions) of the clusters with the result and offers a visualization function for them:

```r
km.demo = computeSilhouette(conn, km.demo)
createSilhouetteProfile(km.demo)
```

*(figure: silhouette profiles by cluster)*
