K-means examples
Before running the examples, load the baseball database into your Aster database. Your Aster login must be configured with the following permissions:
- read/write access
- execute on SQL/MR functions (with Aster Analytics Foundation installed)
- create and drop analytical and fact tables
Given the table batting_enh of batting statistics, cluster batters by their g, r, h, and ab stats:
km.demo = computeKmeans(conn, "batting_enh", centers=3, include=c('g','r','h','ab'), iterMax=25,
                        aggregates=c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h",
                                     "AVG(ab) avg_ab", "AVG(ba) ba", "AVG(slg) slg", "AVG(ta) ta"),
                        id="playerid || '-' || teamid || '-' || yearid",
                        scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                        schema='public', where="yearid > 2000", test=FALSE)
Note that aggregates is optional and not part of the k-means model.
The result object km.demo is compatible with stats::kmeans, which returns an object of class kmeans:
> km.demo
K-means clustering with 3 clusters of sizes 2348, 8052, 2668
Cluster means:
          ab          g          h          r
0  1.8057101  1.7576785  1.8259377  1.8176750
1 -0.6810139 -0.6025208 -0.6645518 -0.6415132
2  0.4344707  0.6312995  0.3669389  0.3050790
Clustering vector:
integer(0)
Within cluster sum of squares by cluster:
[1] 2194.507 1858.506 1774.247
 (between_SS / total_SS = 89.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
[8] "iter" "ifault" "scale" "aggregates" "tableName" "columns" "scaledTableName"
[15] "centroidTableName" "id" "idAlias" "whereClause" "time"
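For comparison, here is a minimal local sketch of what compatibility with stats::kmeans means in practice. It uses only base R and synthetic data; the `batting` data frame and the `km.local` name are made up for illustration, and standardizing with `scale()` is an assumption about what the `scale` component of km.demo refers to, not a statement of how computeKmeans works internally.

```r
# Local analogue with base R only; no Aster connection is involved.
set.seed(42)

# Hypothetical stand-in for the batting columns used above (synthetic values).
batting <- data.frame(
  g  = c(rnorm(50, 140, 10), rnorm(50, 25, 10), rnorm(50, 85, 10)),
  r  = c(rnorm(50, 75, 8),   rnorm(50, 3, 1),   rnorm(50, 30, 8)),
  h  = c(rnorm(50, 145, 12), rnorm(50, 7, 2),   rnorm(50, 65, 12)),
  ab = c(rnorm(50, 515, 30), rnorm(50, 35, 10), rnorm(50, 250, 30))
)

# Assumption: standardize each column before clustering.
scaled <- scale(batting)
km.local <- kmeans(scaled, centers = 3, iter.max = 25, nstart = 5)

class(km.local)   # "kmeans"
names(km.local)   # the nine base components, starting with "cluster" and "centers"
```

The first nine component names printed for km.demo are the same ones a plain kmeans object carries, so code written against those components should read both kinds of object.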
The computeKmeans result actually carries more elements, being of class toakmeans:
> str(km.demo)
List of 19
$ cluster : int(0)
$ centers : num [1:3, 1:4] 1.806 -0.681 0.434 1.758 -0.603 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "0" "1" "2"
.. ..$ : chr [1:4] "ab" "g" "h" "r"
$ totss : int 54940
$ withinss : num [1:3] 2195 1859 1774
$ tot.withinss : num 5827
$ betweenss : num 49113
$ size : int [1:3] 2348 8052 2668
$ iter : int 10
$ ifault : num 0
$ scale : logi TRUE
$ aggregates :'data.frame': 3 obs. of 10 variables:
..$ clusterid: int [1:3] 0 1 2
..$ cnt : int [1:3] 2348 8052 2668
..$ avg_g : num [1:3] 141.3 26.1 86.9
..$ avg_r : num [1:3] 76.1 3.3 31.6
..$ avg_h : num [1:3] 144.3 6.9 64.4
..$ avg_ab : num [1:3] 517.5 33.8 252.9
..$ ba : num [1:3] 0.278 0.149 0.254
..$ slg : num [1:3] 0.452 0.21 0.393
..$ ta : num [1:3] 0.77 0.369 0.653
..$ withinss : num [1:3] 2195 1859 1774
$ tableName : chr "batting_enh"
$ columns : chr [1:4] "ab" "g" "h" "r"
$ scaledTableName : chr "public.kmeans_demo_scaled"
$ centroidTableName: chr "public.kmeans_demo_centroids"
$ id : chr "playerid || '-' || teamid || '-' || yearid"
$ idAlias : chr "playerid_teamid_yearid"
$ whereClause : chr " WHERE yearid > 2000 "
$ time :Class 'proc_time' Named num [1:5] 0.11 0.03 68.98 NA NA
.. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
- attr(*, "class")= chr [1:2] "toakmeans" "kmeans"
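The sums of squares in this output are related by simple arithmetic, which can be checked by hand. The sketch below copies the printed values into plain R vectors (nothing is read from km.demo itself) and recomputes tot.withinss, betweenss, and the between_SS / total_SS ratio.

```r
# Values copied from the printed km.demo output above.
withinss <- c(2194.507, 1858.506, 1774.247)
totss    <- 54940
size     <- c(2348, 8052, 2668)

tot.withinss <- sum(withinss)          # total within-cluster sum of squares
betweenss    <- totss - tot.withinss   # part of totss separated by the clustering
ratio        <- betweenss / totss      # between_SS / total_SS

round(tot.withinss)      # 5827
round(betweenss)         # 49113
round(100 * ratio, 1)    # 89.4
sum(size)                # 13068
```

The cluster sizes also match the cnt column of the aggregates data frame, since both count rows per cluster.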
Using a line graph:
Group by clusters | Group by variables |
---|---|
createCentroidPlot(km.demo, format="line") | createCentroidPlot(km.demo, format="line", groupByCluster=FALSE) |
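
To get a feel for what a line plot of centroids shows without an Aster connection, the centroid matrix can be rebuilt from the printed output and drawn with base graphics. This is only a rough stand-in using `matplot`; it is not the implementation of createCentroidPlot and makes no claim about which of the two layouts above it corresponds to.

```r
# Centroid values copied from the printed km.demo output above.
centers <- matrix(
  c( 1.8057101,  1.7576785,  1.8259377,  1.8176750,
    -0.6810139, -0.6025208, -0.6645518, -0.6415132,
     0.4344707,  0.6312995,  0.3669389,  0.3050790),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("0", "1", "2"), c("ab", "g", "h", "r"))
)

# One line per cluster, with the variables along the x axis.
matplot(t(centers), type = "b", pch = 19, lty = 1, col = 1:3, xaxt = "n",
        xlab = "variable", ylab = "scaled centroid value")
axis(1, at = seq_len(ncol(centers)), labels = colnames(centers))
legend("right", legend = rownames(centers), col = 1:3, lty = 1, title = "cluster")
```

Passing `centers` instead of `t(centers)` (and swapping the axis and legend labels) draws one line per variable across clusters instead.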