Skip to content
Gregory Kanevsky edited this page Apr 14, 2016 · 29 revisions

Before Using K-means Functions

Pre-requisites

Before running examples have baseball dataset loaded in Aster database and check that your Aster login has following permissions:

  • read/write access
  • execute on SQL/MR functions (including Aster Analytics Foundation 6.10 or later installed)
  • create and drop analytical and fact tables

Side-effects

k-means functions computeKmeans and computeSilhouette are not like the rest of compute functions in toaster: they may have side-effects. Special care must be taken when two or more people using them especially in case when they use the same script (code). But if you know and understand how functions work then avoiding conflicts is always possible.

Things to remember:

  • computeKmeans always creates new tables scaledTableName and centroidTableName. Their names are either passed via these parameters or generated automatically by the function. In latter case no conflict with other users is possible but the tables continue to live in Aster until taken care of.
  • computeSilhouette always creates new table silhouetteTableName. The same rules as in computeKmeans apply here with additional option of removing the table at the end by setting parameter drop=TRUE (default behavior).
  • Simply don't use the table name parameters to eliminate chance of conflict with other toaster users. But remember to clean up eventually. If calls to computeKmeans fail with SQL errors pointing to missing tables then you experience the problem most likely.

k-means Clustering

Running k-means

Given table batting_enh of batting statistic cluster batters by their g, r, h, and ab stats:

km.demo = computeKmeans(conn, "batting_enh", centers=3, include=c('g','r','h','ab'), iterMax = 25,
                        aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h",
                                        "AVG(ab) avg_ab", "AVG(ba) ba", "AVG(slg) slg", "AVG(ta) ta"),
                        id="playerid || '-' || teamid || '-' || yearid", 
                        scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                        schema='public', where="yearid > 2000", test=FALSE)

Note, that aggregates is optional and is not part of the k-means model definition (see later in Visualizing Cluster Properties).

k-means Function Result

Result object km.demo is compatible with stats::kmeans that returns an object of class kmeans:

km.demo
#> K-means clustering with 3 clusters of sizes 2348, 8052, 2668
#>
#> Cluster means:
#>           ab          g          h          r
#> 0  1.8057101  1.7576785  1.8259377  1.8176750
#> 1 -0.6810139 -0.6025208 -0.6645518 -0.6415132
#> 2  0.4344707  0.6312995  0.3669389  0.3050790
#>
#> Clustering vector:
#> integer(0)
#>
#> Within cluster sum of squares by cluster:
#> [1] 2194.507 1858.506 1774.247
#>  (between_SS / total_SS =  89.4 %)
#>
#> Available components:
#>
#>  [1] "cluster"           "centers"           "totss"             "withinss"          "tot.withinss"      "betweenss"       "size"             
#>  [8] "iter"              "ifault"            "scale"             "aggregates"        "tableName"         "columns"           "scaledTableName"  
#> [15] "centroidTableName" "id"                "idAlias"           "whereClause"       "time"     

computeKmeans result actually carries more elements being of class toakmeans:

> str(km.demo)
#> List of 19
#>
#>  $ cluster          : int(0) 
#>  $ centers          : num [1:3, 1:4] 1.806 -0.681 0.434 1.758 -0.603 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:3] "0" "1" "2"
#>   .. ..$ : chr [1:4] "ab" "g" "h" "r"
#>  $ totss            : int 54940
#>  $ withinss         : num [1:3] 2195 1859 1774
#>  $ tot.withinss     : num 5827
#>  $ betweenss        : num 49113
#>  $ size             : int [1:3] 2348 8052 2668
#>  $ iter             : int 10
#>  $ ifault           : num 0
#>  $ scale            : logi TRUE
#>  $ aggregates       :'data.frame':	3 obs. of  10 variables:
#>   ..$ clusterid: int [1:3] 0 1 2
#>   ..$ cnt      : int [1:3] 2348 8052 2668
#>   ..$ avg_g    : num [1:3] 141.3 26.1 86.9
#>   ..$ avg_r    : num [1:3] 76.1 3.3 31.6
#>   ..$ avg_h    : num [1:3] 144.3 6.9 64.4
#>   ..$ avg_ab   : num [1:3] 517.5 33.8 252.9
#>   ..$ ba       : num [1:3] 0.278 0.149 0.254
#>   ..$ slg      : num [1:3] 0.452 0.21 0.393
#>   ..$ ta       : num [1:3] 0.77 0.369 0.653
#>   ..$ withinss : num [1:3] 2195 1859 1774
#>  $ tableName        : chr "batting_enh"
#>  $ columns          : chr [1:4] "ab" "g" "h" "r"
#>  $ scaledTableName  : chr "public.kmeans_demo_scaled"
#>  $ centroidTableName: chr "public.kmeans_demo_centroids"
#>  $ id               : chr "playerid || '-' || teamid || '-' || yearid"
#>  $ idAlias          : chr "playerid_teamid_yearid"
#>  $ whereClause      : chr " WHERE yearid > 2000  "
#>  $ time             :Class 'proc_time'  Named num [1:5] 0.11 0.03 68.98 NA NA
#>
#>   .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
#>  - attr(*, "class")= chr [1:2] "toakmeans" "kmeans"        

k-means Visualizations

Visualize k-means cluster centroids

Using line plots:

Group by clusters Group by variables
createCentroidPlot(km.demo, format="line") createCentroidPlot(km.demo, format="line", groupByCluster=FALSE)
line plot by clusters line plot by variables

Using bar plots:

Group by clusters Group by variables
createCentroidPlot(km.demo, format="bar") createCentroidPlot(km.demo, format="bar", groupByCluster=FALSE)
bar plot by clusters bar plot by variables

Using dodged bar plots:

Group by clusters Group by variables
createCentroidPlot(km.demo, format="bar_dodge") createCentroidPlot(km.demo, format="bar_dodge", groupByCluster=FALSE)
bar plot by clusters bar plot by variables

Using heatmaps:

Heatmap Heatmap with coordinate flip
createCentroidPlot(km.demo, format="heatmap") createCentroidPlot(km.demo, format="heatmap", coordFlip = TRUE)
bar plot by clusters bar plot by variables

Visualize Cluster Properties

Clusters have properties associated with them, the least, element counts. Use parameter aggregates in function computeKmeans to define arbitrary aggregates (properties) computed over each cluster (e.g. COUNT(*) cnt defines cluster element counts). Visualize cluster properties with createClusterPlot:

Color by clusters Color by properties
createClusterPlot(km.demo) createClusterPlot(km.demo, colorByCluster = FALSE)
bar plot by clusters bar plot by variables

Visualize Clustering Structure

toaster lets sample Aster table data clustered with k-means to visualize distributions and structure of clusters for every variable and each combination of variable pairs. To accomplish this sample clustered data with computeClusterSample function:

km.demo = computeClusterSample(conn, km.demo, '0.5')

where we sample roughly 50% of data and save them into already existing toakmeans object km.demo. Then, we visualize sampled clustered data with createClusterPairsPlot:

cluster pairs plot

Determining the Optimal Number of Clusters

Total Within-clusters Sum of Squares

kmeans object defines total within-cluster sum of squares as sum of within-cluster sum of squares. toaster always computes these values on the resulting clusters generated by k-means. Thus, we can determine optimal number of clusters using elbow method by running k-means for K clusters from 1 to N and plotting total within-clusters sum of squares as function of K:

N = 10
results = vector("list", N)
for(i in 1:N) {
  results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'), 
                              iterMax = 1000, threshold=0.1, 
                              id="playerid || '-' || teamid || '-' || yearid", 
                              scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                              schema='public', where="yearid > 2000")
}

data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
                  totwithinss = unlist(sapply(results, FUN=function(x) x$tot.withinss)))
ggplot(data) +
  geom_line(aes(K, totwithinss)) + geom_point(aes(K, totwithinss)) + 
  scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()

total within cluster sum of squares

Average Silhouette

Another method of measuring the quality of clustering is the silhouette method. Briefly, it introduces a metric called silhouette defined on each observation and is a function of clusters. Then the average silhouette over all observations of the entire dataset is a measure of how appropriately the data has been clustered. Note that silhouette takes values between -1 and 1 proportional to the quality of clustering.

toaster offers function computeSilhouette that calculates silhouette values on each observation and optionally saves results in Aster database table. It also returns values of silhouette and its distribution across each cluster (using histogram function in Aster):

N = 10
results = vector("list", N)
for(i in 1:N) {
  results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'), 
                               iterMax = 1000, threshold=0.1, 
                               id="playerid || '-' || teamid || '-' || yearid", 
                               scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                               schema='public', where="yearid > 2000")

  results[[i]]=computeSilhouette(conn, results[[i]], silhouetteTableName = "public.kmeans_demo_sil", drop=FALSE)
}
 
results = results[2:9]
data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
                  silvalue = unlist(sapply(results, FUN=function(x) x$sil$valie)))
ggplot(data) +
  geom_line(aes(K, silvalue), color="red") + geom_point(aes(K, silvalue), color="red") +
  labs(title="Optimal Number of Clusters", x="K", y="Total Silhouette Value") +
  scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()

silhouette value

Finally, toaster saves silhouette profiles (distributions) of clusters with the result and offers visualization function for silhouette profiles:

km.demo = computeSilhouette(conn, km.demo)

createSilhouetteProfile(km.demo)

silhouette value

Note that commonly silhouette visuals plot actual silhouette values for each cluster. But since toaster works with big data datasets we chose histogram of silhouette values per cluster instead.

References

  1. K-means clustering R Function
  2. Teradata Aster Analytics: kmeans
  3. Silhouette (clustering)
  4. Six methods for determining an optimal k value for k-means analysis - Answer on stackoverflow containing R code for several methods of computing an optimal value of k for k-means cluster analysis
Clone this wiki locally