# K-means examples
Before running the examples, load the baseball dataset into the Aster database and check that your Aster login has the following permissions:
- read/write access
- execute on SQL/MR functions (with Aster Analytics Foundation 6.10 or later installed)
- create and drop analytic and fact tables

Given the table `batting_enh` of batting statistics, cluster batters by their `g`, `r`, `h`, and `ab` stats:
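The examples below assume an open connection `conn` to the Aster database. Since toaster builds on RODBC, it could be established along these lines (the DSN name and credentials are placeholders, not from the original):

```r
library(RODBC)
library(toaster)

# 'AsterDSN' is a hypothetical ODBC data source name pointing at your Aster cluster;
# replace uid/pwd with your actual Aster login credentials
conn = odbcConnect("AsterDSN", uid="beehive", pwd="beehive")
```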
```r
km.demo = computeKmeans(conn, "batting_enh", centers=3, include=c('g','r','h','ab'), iterMax = 25,
                        aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h",
                                       "AVG(ab) avg_ab", "AVG(ba) ba", "AVG(slg) slg", "AVG(ta) ta"),
                        id="playerid || '-' || teamid || '-' || yearid",
                        scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                        schema='public', where="yearid > 2000", test=FALSE)
```
Note that `aggregates` is optional and is not part of the k-means model definition (see Visualizing Cluster Properties below).
The result object `km.demo` is compatible with `stats::kmeans`, which returns an object of class `kmeans`:
```
> km.demo
K-means clustering with 3 clusters of sizes 2348, 8052, 2668

Cluster means:
          ab          g          h          r
0  1.8057101  1.7576785  1.8259377  1.8176750
1 -0.6810139 -0.6025208 -0.6645518 -0.6415132
2  0.4344707  0.6312995  0.3669389  0.3050790

Clustering vector: integer(0)

Within cluster sum of squares by cluster:
[1] 2194.507 1858.506 1774.247
 (between_SS / total_SS =  89.4 %)

Available components:
 [1] "cluster"           "centers"           "totss"             "withinss"
 [5] "tot.withinss"      "betweenss"         "size"              "iter"
 [9] "ifault"            "scale"             "aggregates"        "tableName"
[13] "columns"           "scaledTableName"   "centroidTableName" "id"
[17] "idAlias"           "whereClause"       "time"
```
Being of class `toakmeans`, the `computeKmeans` result actually carries more elements:
```
> str(km.demo)
List of 19
 $ cluster          : int(0)
 $ centers          : num [1:3, 1:4] 1.806 -0.681 0.434 1.758 -0.603 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:4] "ab" "g" "h" "r"
 $ totss            : int 54940
 $ withinss         : num [1:3] 2195 1859 1774
 $ tot.withinss     : num 5827
 $ betweenss        : num 49113
 $ size             : int [1:3] 2348 8052 2668
 $ iter             : int 10
 $ ifault           : num 0
 $ scale            : logi TRUE
 $ aggregates       :'data.frame': 3 obs. of 10 variables:
  ..$ clusterid: int [1:3] 0 1 2
  ..$ cnt      : int [1:3] 2348 8052 2668
  ..$ avg_g    : num [1:3] 141.3 26.1 86.9
  ..$ avg_r    : num [1:3] 76.1 3.3 31.6
  ..$ avg_h    : num [1:3] 144.3 6.9 64.4
  ..$ avg_ab   : num [1:3] 517.5 33.8 252.9
  ..$ ba       : num [1:3] 0.278 0.149 0.254
  ..$ slg      : num [1:3] 0.452 0.21 0.393
  ..$ ta       : num [1:3] 0.77 0.369 0.653
  ..$ withinss : num [1:3] 2195 1859 1774
 $ tableName        : chr "batting_enh"
 $ columns          : chr [1:4] "ab" "g" "h" "r"
 $ scaledTableName  : chr "public.kmeans_demo_scaled"
 $ centroidTableName: chr "public.kmeans_demo_centroids"
 $ id               : chr "playerid || '-' || teamid || '-' || yearid"
 $ idAlias          : chr "playerid_teamid_yearid"
 $ whereClause      : chr " WHERE yearid > 2000 "
 $ time             :Class 'proc_time' Named num [1:5] 0.11 0.03 68.98 NA NA
  .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
 - attr(*, "class")= chr [1:2] "toakmeans" "kmeans"
```
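Because the result inherits from class `kmeans`, its standard components can be accessed just as with a `stats::kmeans` result (a minimal sketch; assumes the `km.demo` object from above):

```r
# Standard kmeans components are available on the toakmeans result
km.demo$centers        # 3 x 4 matrix of scaled cluster centers
km.demo$size           # cluster sizes: 2348, 8052, 2668
km.demo$tot.withinss   # total within-cluster sum of squares

# Fraction of variance explained by the clustering (between_SS / total_SS)
km.demo$betweenss / km.demo$totss
```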
Visualize cluster centroids with `createCentroidPlot`. Using line plots:
Group by clusters | Group by variables
---|---
`createCentroidPlot(km.demo, format="line")` | `createCentroidPlot(km.demo, format="line", groupByCluster=FALSE)`
Using bar plots:
Group by clusters | Group by variables
---|---
`createCentroidPlot(km.demo, format="bar")` | `createCentroidPlot(km.demo, format="bar", groupByCluster=FALSE)`
Using dodged bar plots:
Group by clusters | Group by variables
---|---
`createCentroidPlot(km.demo, format="bar_dodge")` | `createCentroidPlot(km.demo, format="bar_dodge", groupByCluster=FALSE)`
Using heatmaps:
Heatmap | Heatmap with coordinate flip
---|---
`createCentroidPlot(km.demo, format="heatmap")` | `createCentroidPlot(km.demo, format="heatmap", coordFlip = TRUE)`
Clusters have properties associated with them; at minimum, element counts. Use the parameter `aggregates` in the function `computeKmeans` to define arbitrary aggregates (properties) computed over each cluster (e.g. `COUNT(*) cnt` defines cluster element counts). Visualize cluster properties with `createClusterPlot`:
Color by clusters | Color by properties
---|---
`createClusterPlot(km.demo)` | `createClusterPlot(km.demo, colorByCluster = FALSE)`
toaster lets you sample Aster table data clustered with k-means to visualize the distributions and structure of the clusters for every variable and each pair of variables. To accomplish this, sample the clustered data with the `computeClusterSample` function:

```r
km.demo = computeClusterSample(conn, km.demo, '0.5')
```

where we sample roughly 50% of the data and save the sample into the already existing `toakmeans` object `km.demo`. Then visualize the sampled clustered data with `createClusterPairsPlot`:
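A call along the following lines produces the pairs plot (a sketch; any arguments beyond the `toakmeans` object are omitted since they are not shown in the original):

```r
# Scatterplot matrix of the sampled observations, colored by cluster
createClusterPairsPlot(km.demo)
```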
The `kmeans` object defines the total within-cluster sum of squares as the sum of the per-cluster within-cluster sums of squares. toaster always computes these values on the clusters generated by k-means. Thus, we can determine the optimal number of clusters using the elbow method: run k-means for K clusters from 1 to N and plot the total within-cluster sum of squares as a function of K:
```r
library(ggplot2)
library(ggthemes)  # for theme_tufte()

N = 10
results = vector("list", N)
for(i in 1:N) {
  results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'),
                               iterMax = 1000, threshold=0.1,
                               id="playerid || '-' || teamid || '-' || yearid",
                               scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                               schema='public', where="yearid > 2000")
}

data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
                  totwithinss = unlist(sapply(results, FUN=function(x) x$tot.withinss)))

ggplot(data) +
  geom_line(aes(K, totwithinss)) + geom_point(aes(K, totwithinss)) +
  scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()
```
Another method of measuring the quality of clustering is the silhouette method. Briefly, it introduces a metric called the silhouette, defined for each observation as a function of the clustering. The average silhouette over all observations in the dataset then measures how appropriately the data has been clustered. The silhouette takes values between -1 and 1, with higher values indicating better clustering.
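For reference, the standard per-observation definition (not spelled out in the original): for observation `i`, let `a(i)` be its mean distance to the other members of its own cluster and `b(i)` its mean distance to the members of the nearest other cluster; then

```
s(i) = (b(i) - a(i)) / max(a(i), b(i))
```

so `s(i)` is near 1 when the observation sits well inside its cluster and near -1 when it would fit better in the neighboring one.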
toaster offers the function `computeSilhouette`, which calculates silhouette values for each observation and optionally saves the results in an Aster database table. It also returns the overall silhouette value and its distribution across each cluster (using the histogram function in Aster):
```r
N = 10
results = vector("list", N)
for(i in 1:N) {
  results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'),
                               iterMax = 1000, threshold=0.1,
                               id="playerid || '-' || teamid || '-' || yearid",
                               scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
                               schema='public', where="yearid > 2000")
  results[[i]] = computeSilhouette(conn, results[[i]], silhouetteTableName = "public.kmeans_demo_sil", drop=FALSE)
}

# keep K=2..9: the silhouette is undefined for a single cluster
results = results[2:9]

data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
                  silvalue = unlist(sapply(results, FUN=function(x) x$sil$value)))

ggplot(data) +
  geom_line(aes(K, silvalue), color="red") + geom_point(aes(K, silvalue), color="red") +
  labs(title="Optimal Number of Clusters", x="K", y="Total Silhouette Value") +
  scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()
```
Finally, toaster saves the silhouette profiles (distributions) of the clusters with the result and offers a visualization function for them:
```r
km.demo = computeSilhouette(conn, km.demo)
createSilhouetteProfile(km.demo)
```