Gregory Kanevsky edited this page Dec 15, 2015 · 29 revisions

Pre-requisites

Before running the examples, load the baseball database into your Aster database. Configure your Aster login with the following permissions:

  • read/write access
  • execute permission on SQL/MR functions (with Aster Analytics Foundation installed)
  • create and drop analytical and fact tables

Run k-means

Given the table batting_enh of batting statistics, cluster batters by their g, r, h, and ab stats:

km.demo = computeKmeans(conn, "batting_enh", centers=3, include=c('g','r','h','ab'), iterMax = 25,
                        aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h",
                                       "AVG(ab) avg_ab", "AVG(ba) ba", "AVG(slg) slg", "AVG(ta) ta"),
                        id="playerid || '-' || teamid || '-' || yearid", 
                        scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                        schema='public', where="yearid > 2000", test=FALSE)

Note that the aggregates argument is optional and is not part of the k-means model. The result object km.demo is compatible with the object of class kmeans returned by stats::kmeans:

> km.demo

K-means clustering with 3 clusters of sizes 2348, 8052, 2668

Cluster means:
          ab          g          h          r
0  1.8057101  1.7576785  1.8259377  1.8176750
1 -0.6810139 -0.6025208 -0.6645518 -0.6415132
2  0.4344707  0.6312995  0.3669389  0.3050790

Clustering vector:
integer(0)

Within cluster sum of squares by cluster:
[1] 2194.507 1858.506 1774.247
 (between_SS / total_SS = 89.4 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
[8] "iter" "ifault" "scale" "aggregates" "tableName" "columns" "scaledTableName"
[15] "centroidTableName" "id" "idAlias" "whereClause" "time"
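The percentage on the last line of the summary can be rechecked from the reported sums of squares. The following is a minimal base R sketch that uses only the numbers shown on this page (the within-cluster values from the printout and the totss component of km.demo); the variable names are illustrative.

```r
# Values copied from the km.demo output on this page
withinss <- c(2194.507, 1858.506, 1774.247)  # within-cluster sums of squares
totss    <- 54940                            # total sum of squares (km.demo$totss)

tot.withinss <- sum(withinss)            # total within-cluster sum of squares
betweenss    <- totss - tot.withinss     # between-cluster sum of squares

# Share of total variation explained by the clustering, in percent
pct <- round(100 * betweenss / totss, 1)
print(pct)
```

The rounded tot.withinss and betweenss values agree with the 5827 and 49113 stored in the result object.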

The computeKmeans result is of class toakmeans as well as kmeans, so it carries more elements than a plain kmeans object:

> str(km.demo)

List of 19
 $ cluster          : int(0)
 $ centers          : num [1:3, 1:4] 1.806 -0.681 0.434 1.758 -0.603 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:4] "ab" "g" "h" "r"
 $ totss            : int 54940
 $ withinss         : num [1:3] 2195 1859 1774
 $ tot.withinss     : num 5827
 $ betweenss        : num 49113
 $ size             : int [1:3] 2348 8052 2668
 $ iter             : int 10
 $ ifault           : num 0
 $ scale            : logi TRUE
 $ aggregates       :'data.frame': 3 obs. of 10 variables:
  ..$ clusterid: int [1:3] 0 1 2
  ..$ cnt      : int [1:3] 2348 8052 2668
  ..$ avg_g    : num [1:3] 141.3 26.1 86.9
  ..$ avg_r    : num [1:3] 76.1 3.3 31.6
  ..$ avg_h    : num [1:3] 144.3 6.9 64.4
  ..$ avg_ab   : num [1:3] 517.5 33.8 252.9
  ..$ ba       : num [1:3] 0.278 0.149 0.254
  ..$ slg      : num [1:3] 0.452 0.21 0.393
  ..$ ta       : num [1:3] 0.77 0.369 0.653
  ..$ withinss : num [1:3] 2195 1859 1774
 $ tableName        : chr "batting_enh"
 $ columns          : chr [1:4] "ab" "g" "h" "r"
 $ scaledTableName  : chr "public.kmeans_demo_scaled"
 $ centroidTableName: chr "public.kmeans_demo_centroids"
 $ id               : chr "playerid || '-' || teamid || '-' || yearid"
 $ idAlias          : chr "playerid_teamid_yearid"
 $ whereClause      : chr " WHERE yearid > 2000 "
 $ time             :Class 'proc_time' Named num [1:5] 0.11 0.03 68.98 NA NA
  .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
 - attr(*, "class")= chr [1:2] "toakmeans" "kmeans"
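Because the class attribute lists "toakmeans" ahead of "kmeans", code written for stats::kmeans results should still accept the object. The sketch below mimics that arrangement in base R on made-up data. The toy matrix and the tableName value are illustrative assumptions, and this is not how toaster builds its result.

```r
# Toy data: two well-separated groups of 20 points each (illustrative only)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

fit <- stats::kmeans(x, centers = 2)

# Attach an extra element and prepend a subclass, mirroring the str() output
fit$tableName <- "toy_table"              # hypothetical extra component
class(fit) <- c("toakmeans", class(fit))

# The object still counts as a kmeans result for S3 dispatch
print(inherits(fit, "kmeans"))
print(names(fit))
```

With no print method defined for the subclass, printing fit falls back to the kmeans method, which is consistent with the km.demo printout listing the extra components under "Available components".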

Visualize k-means cluster centroids

Using line graph:

Group by clusters:

createCentroidPlot(km.demo, format="line")

[Figure: centroids by clusters]

Group by variables:

createCentroidPlot(km.demo, format="line", groupByCluster=FALSE)

[Figure: centroids by clusters]
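Without an Aster connection, the two ways of grouping centroids can be approximated locally from the cluster means printed earlier. This is a rough base R sketch, not the createCentroidPlot implementation: matplot and the layout choices are assumptions made for illustration, and which layout corresponds to each groupByCluster setting is not confirmed here.

```r
# Scaled cluster centers copied from the km.demo printout
centers <- matrix(c( 1.8057101,  1.7576785,  1.8259377,  1.8176750,
                    -0.6810139, -0.6025208, -0.6645518, -0.6415132,
                     0.4344707,  0.6312995,  0.3669389,  0.3050790),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("0", "1", "2"), c("ab", "g", "h", "r")))

op <- par(mfrow = c(1, 2))   # two panels side by side

# One line per cluster, variables along the x axis
matplot(t(centers), type = "b", pch = 1, lty = 1, xaxt = "n",
        xlab = "variable", ylab = "scaled center", main = "One line per cluster")
axis(1, at = seq_len(ncol(centers)), labels = colnames(centers))

# One line per variable, clusters along the x axis
matplot(centers, type = "b", pch = 1, lty = 1, xaxt = "n",
        xlab = "cluster", ylab = "scaled center", main = "One line per variable")
axis(1, at = seq_len(nrow(centers)), labels = rownames(centers))

par(op)                      # restore the previous graphics settings
```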