Gregory Kanevsky edited this page Dec 16, 2015 · 29 revisions

Pre-requisites

Before running the examples, load the baseball database into Aster. Configure your Aster login with the following permissions:

  • read/write access
  • execute on SQL/MR functions (with Aster Analytics Foundation installed)
  • create and drop analytical and fact tables

Run k-means

Given the table batting_enh of batting statistics, cluster batters by their g, r, h, and ab stats:

km.demo = computeKmeans(conn, "batting_enh", centers=3, include=c('g','r','h','ab'), iterMax = 25,
                        aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h",
                                       "AVG(ab) avg_ab", "AVG(ba) ba", "AVG(slg) slg", "AVG(ta) ta"),
                        id="playerid || '-' || teamid || '-' || yearid", 
                        scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                        schema='public', where="yearid > 2000", test=FALSE)

Note that aggregates is optional and is not part of the k-means model. The result object km.demo is compatible with the object of class kmeans returned by stats::kmeans:

> km.demo
K-means clustering with 3 clusters of sizes 2348, 8052, 2668

Cluster means:
          ab          g          h          r
0  1.8057101  1.7576785  1.8259377  1.8176750
1 -0.6810139 -0.6025208 -0.6645518 -0.6415132
2  0.4344707  0.6312995  0.3669389  0.3050790

Clustering vector:
integer(0)

Within cluster sum of squares by cluster:
[1] 2194.507 1858.506 1774.247
 (between_SS / total_SS = 89.4 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
[8] "iter" "ifault" "scale" "aggregates" "tableName" "columns" "scaledTableName"
[15] "centroidTableName" "id" "idAlias" "whereClause" "time"
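
To see which of these components come from the kmeans class itself, a local run of stats::kmeans can be compared against the listing above. This is a minimal sketch that assumes nothing about Aster or toaster: it uses the built-in mtcars dataset as stand-in data, and the names x and km.local are illustrative only.

```r
# Local stand-in data: four numeric columns of the built-in mtcars dataset,
# standardized with scale() (km.demo reports scale = TRUE in its output).
x <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])

set.seed(42)  # k-means starts from random centers; fix the seed for repeatability
km.local <- kmeans(x, centers = 3, iter.max = 25)

class(km.local)   # a plain "kmeans" object
names(km.local)   # the first nine components of the listing above

# Components listed for km.demo that a plain kmeans object does not have
setdiff(c("cluster", "centers", "totss", "withinss", "tot.withinss",
          "betweenss", "size", "iter", "ifault", "scale", "aggregates",
          "tableName", "columns", "scaledTableName", "centroidTableName",
          "id", "idAlias", "whereClause", "time"),
        names(km.local))
```

Everything from "scale" onward is specific to the computeKmeans result.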

The computeKmeans result is also of class toakmeans and carries more elements than a plain kmeans object:

> str(km.demo)

List of 19
 $ cluster          : int(0)
 $ centers          : num [1:3, 1:4] 1.806 -0.681 0.434 1.758 -0.603 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:4] "ab" "g" "h" "r"
 $ totss            : int 54940
 $ withinss         : num [1:3] 2195 1859 1774
 $ tot.withinss     : num 5827
 $ betweenss        : num 49113
 $ size             : int [1:3] 2348 8052 2668
 $ iter             : int 10
 $ ifault           : num 0
 $ scale            : logi TRUE
 $ aggregates       :'data.frame': 3 obs. of 10 variables:
  ..$ clusterid: int [1:3] 0 1 2
  ..$ cnt      : int [1:3] 2348 8052 2668
  ..$ avg_g    : num [1:3] 141.3 26.1 86.9
  ..$ avg_r    : num [1:3] 76.1 3.3 31.6
  ..$ avg_h    : num [1:3] 144.3 6.9 64.4
  ..$ avg_ab   : num [1:3] 517.5 33.8 252.9
  ..$ ba       : num [1:3] 0.278 0.149 0.254
  ..$ slg      : num [1:3] 0.452 0.21 0.393
  ..$ ta       : num [1:3] 0.77 0.369 0.653
  ..$ withinss : num [1:3] 2195 1859 1774
 $ tableName        : chr "batting_enh"
 $ columns          : chr [1:4] "ab" "g" "h" "r"
 $ scaledTableName  : chr "public.kmeans_demo_scaled"
 $ centroidTableName: chr "public.kmeans_demo_centroids"
 $ id               : chr "playerid || '-' || teamid || '-' || yearid"
 $ idAlias          : chr "playerid_teamid_yearid"
 $ whereClause      : chr " WHERE yearid > 2000 "
 $ time             :Class 'proc_time' Named num [1:5] 0.11 0.03 68.98 NA NA
  .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
 - attr(*, "class")= chr [1:2] "toakmeans" "kmeans"
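
The sum-of-squares components are related by simple arithmetic, which can be checked by hand from the values printed for km.demo. The sketch below copies totss and the per-cluster withinss from the output above; the variable names are illustrative and nothing here queries the database.

```r
# Values copied from the km.demo output above
withinss <- c(2194.507, 1858.506, 1774.247)  # within-cluster sum of squares per cluster
totss    <- 54940                            # total sum of squares

tot.withinss <- sum(withinss)          # 5827.26, shown rounded as 5827
betweenss    <- totss - tot.withinss   # 49112.74, shown rounded as 49113

# Share of total variance explained by the clustering, in percent
round(100 * betweenss / totss, 1)      # 89.4
```

This reproduces the "between_SS / total_SS = 89.4 %" line of the printed summary.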

Visualize k-means cluster centroids

Using line plots:

Group by clusters:
createCentroidPlot(km.demo, format="line")
(Figure: line plot by clusters)

Group by variables:
createCentroidPlot(km.demo, format="line", groupByCluster=FALSE)
(Figure: line plot by variables)
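
The two groupings present the same centroid values; they differ only in whether the values are grouped per cluster or per variable. The base R sketch below illustrates that idea with the cluster means printed earlier. It is not toaster's internal code, and the names centers, long, by.cluster and by.variable are illustrative.

```r
# Cluster means copied from the km.demo output (rows: clusters, columns: variables)
centers <- matrix(c( 1.8057101,  1.7576785,  1.8259377,  1.8176750,
                    -0.6810139, -0.6025208, -0.6645518, -0.6415132,
                     0.4344707,  0.6312995,  0.3669389,  0.3050790),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("0", "1", "2"), c("ab", "g", "h", "r")))

# Long format: one row per (cluster, variable) pair
long <- as.data.frame(as.table(centers), responseName = "value",
                      stringsAsFactors = FALSE)
names(long)[1:2] <- c("cluster", "variable")

# Grouping by cluster gives 3 groups of 4 values each;
# grouping by variable gives 4 groups of 3 values each.
by.cluster  <- split(long$value, long$cluster)
by.variable <- split(long$value, long$variable)
```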

Using bar plots:

Group by clusters:
createCentroidPlot(km.demo, format="bar")
(Figure: bar plot by clusters)

Group by variables:
createCentroidPlot(km.demo, format="bar", groupByCluster=FALSE)
(Figure: bar plot by variables)

Using dodged bar plots:

Group by clusters:
createCentroidPlot(km.demo, format="bar_dodge")
(Figure: dodged bar plot by clusters)

Group by variables:
createCentroidPlot(km.demo, format="bar_dodge", groupByCluster=FALSE)
(Figure: dodged bar plot by variables)

Using heatmaps:

Heatmap:
createCentroidPlot(km.demo, format="heatmap")
(Figure: heatmap)

Heatmap with coordinate flip:
createCentroidPlot(km.demo, format="heatmap", coordFlip = TRUE)
(Figure: heatmap with coordinate flip)

Visualize Cluster Properties

Clusters have properties associated with them, at a minimum their element counts. Use the aggregates parameter of computeKmeans to define arbitrary aggregates (properties) computed over each cluster (e.g. COUNT(*) cnt defines the cluster element counts). Visualize cluster properties with createClusterPlot:

Color by clusters:
createClusterPlot(km.demo)
(Figure: cluster properties colored by cluster)

Color by properties:
createClusterPlot(km.demo, colorByCluster = FALSE)
(Figure: cluster properties colored by property)
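
For intuition about what per-cluster aggregates such as COUNT(*) cnt or AVG(g) avg_g produce, the following is a rough local analog in base R. It assumes the built-in mtcars dataset as stand-in data and a local stats::kmeans run; it does not reproduce how computeKmeans evaluates aggregates in the database, and the names d and props are illustrative.

```r
vars <- c("mpg", "hp", "wt", "qsec")

set.seed(42)  # fix the random initial centers for repeatability
km.local <- kmeans(scale(mtcars[, vars]), centers = 3)

# Attach the cluster assignment to the original (unscaled) rows
d <- data.frame(clusterid = km.local$cluster, mtcars[, vars])

# One row per cluster: a count and two averages, analogous to
# "COUNT(*) cnt", "AVG(mpg) avg_mpg", "AVG(hp) avg_hp"
props <- data.frame(
  clusterid = sort(unique(d$clusterid)),
  cnt       = as.vector(table(d$clusterid)),
  avg_mpg   = as.vector(tapply(d$mpg, d$clusterid, mean)),
  avg_hp    = as.vector(tapply(d$hp,  d$clusterid, mean))
)
props
```

As in the aggregates element of km.demo, averages are taken over the original values rather than the scaled ones, so they stay in interpretable units.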