K means examples
Before running the examples, load the baseball dataset into the Aster database and check that your Aster login has the following permissions:
- read/write access
- execute on SQL/MR functions (Aster Analytics Foundation 6.10 or later must be installed)
- create and drop analytical and fact tables
The k-means functions `computeKmeans` and `computeSilhouette` differ from the other compute functions in toaster: they may have side effects. Take special care when two or more people use them, especially when they run the same script. Conflicts are always avoidable once you understand how the functions work.
Things to remember:
- `computeKmeans` always creates new tables `scaledTableName` and `centroidTableName`. Their names are either passed via these parameters or generated automatically by the function. In the latter case no conflict with other users is possible, but the tables remain in Aster until you drop them.
- `computeSilhouette` always creates a new table `silhouetteTableName`. The same rules as in `computeKmeans` apply, with the additional option of removing the table at the end by setting the parameter `drop=TRUE` (the default behavior).
- To eliminate the chance of conflict with other toaster users, simply don't use the table name parameters. But remember to clean up eventually. If calls to `computeKmeans` fail with SQL errors pointing to missing tables, a table name conflict is the most likely cause.
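When explicit table names are still wanted (for example, to find the tables again for cleanup), one way to keep them from colliding is to build a name that is unique per user and per call. The sketch below uses only base R; `uniqueTableName` is a hypothetical helper and not part of toaster.

```r
# Hypothetical helper (not part of toaster): build a table name that is
# unique per user and per call, so concurrent runs never share tables.
uniqueTableName <- function(prefix) {
  # keep only characters that are safe in an unquoted SQL identifier
  user   <- tolower(gsub("[^A-Za-z0-9]", "", Sys.info()[["user"]]))
  stamp  <- format(Sys.time(), "%Y%m%d%H%M%S")
  suffix <- paste(sample(letters, 6, replace = TRUE), collapse = "")
  paste(prefix, user, stamp, suffix, sep = "_")
}

scaledName   <- uniqueTableName("kmeans_scaled")
centroidName <- uniqueTableName("kmeans_centroids")
```

The generated names can then be passed as `scaledTableName` and `centroidTableName`. Keeping them in variables makes it easy to drop the tables once the analysis is finished.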
Given the table batting_enh of batting statistics, cluster batters by their g, r, h, and ab stats:
km.demo = computeKmeans(conn, "batting_enh", centers=3, include=c('g','r','h','ab'), iterMax = 25,
aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h",
"AVG(ab) avg_ab", "AVG(ba) ba", "AVG(slg) slg", "AVG(ta) ta"),
id="playerid || '-' || teamid || '-' || yearid",
scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
schema='public', where="yearid > 2000", test=FALSE)
Note that `aggregates` is optional and is not part of the k-means model definition (see Visualizing Cluster Properties later).
The result object `km.demo` is compatible with the object of class `kmeans` returned by `stats::kmeans`:
> km.demo
K-means clustering with 3 clusters of sizes 2348, 8052, 2668
Cluster means:
          ab          g          h          r
0  1.8057101  1.7576785  1.8259377  1.8176750
1 -0.6810139 -0.6025208 -0.6645518 -0.6415132
2  0.4344707  0.6312995  0.3669389  0.3050790
Clustering vector:
integer(0)
Within cluster sum of squares by cluster:
[1] 2194.507 1858.506 1774.247
 (between_SS / total_SS = 89.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
[8] "iter" "ifault" "scale" "aggregates" "tableName" "columns" "scaledTableName"
[15] "centroidTableName" "id" "idAlias" "whereClause" "time"
The `computeKmeans` result actually carries more elements because it is of class `toakmeans`:
> str(km.demo)
List of 19
$ cluster : int(0)
$ centers : num [1:3, 1:4] 1.806 -0.681 0.434 1.758 -0.603 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "0" "1" "2"
.. ..$ : chr [1:4] "ab" "g" "h" "r"
$ totss : int 54940
$ withinss : num [1:3] 2195 1859 1774
$ tot.withinss : num 5827
$ betweenss : num 49113
$ size : int [1:3] 2348 8052 2668
$ iter : int 10
$ ifault : num 0
$ scale : logi TRUE
$ aggregates :'data.frame': 3 obs. of 10 variables:
..$ clusterid: int [1:3] 0 1 2
..$ cnt : int [1:3] 2348 8052 2668
..$ avg_g : num [1:3] 141.3 26.1 86.9
..$ avg_r : num [1:3] 76.1 3.3 31.6
..$ avg_h : num [1:3] 144.3 6.9 64.4
..$ avg_ab : num [1:3] 517.5 33.8 252.9
..$ ba : num [1:3] 0.278 0.149 0.254
..$ slg : num [1:3] 0.452 0.21 0.393
..$ ta : num [1:3] 0.77 0.369 0.653
..$ withinss : num [1:3] 2195 1859 1774
$ tableName : chr "batting_enh"
$ columns : chr [1:4] "ab" "g" "h" "r"
$ scaledTableName : chr "public.kmeans_demo_scaled"
$ centroidTableName: chr "public.kmeans_demo_centroids"
$ id : chr "playerid || '-' || teamid || '-' || yearid"
$ idAlias : chr "playerid_teamid_yearid"
$ whereClause : chr " WHERE yearid > 2000 "
$ time :Class 'proc_time' Named num [1:5] 0.11 0.03 68.98 NA NA
.. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
- attr(*, "class")= chr [1:2] "toakmeans" "kmeans"
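The first nine components are the ones `stats::kmeans` returns, and they obey the same identities: in the output above, `tot.withinss + betweenss` gives 5827 + 49113 = 54940, which is `totss`. The sketch below checks those identities on a local `kmeans` fit; the synthetic matrix is only a stand-in for the four scaled batting columns and is an assumption of this example.

```r
set.seed(42)
# Synthetic stand-in for the four scaled batting columns
x <- scale(matrix(rnorm(400), ncol = 4,
                  dimnames = list(NULL, c("ab", "g", "h", "r"))))
km.local <- kmeans(x, centers = 3, iter.max = 25)

class(km.local)   # plain "kmeans"; toaster results add "toakmeans" in front
names(km.local)   # cluster, centers, totss, withinss, tot.withinss, ...

# Total sum of squares splits into within-cluster and between-cluster parts
ssSplitHolds <- isTRUE(all.equal(km.local$totss,
                                 km.local$tot.withinss + km.local$betweenss))
# Total within-cluster sum of squares is the sum over clusters
withinSumHolds <- isTRUE(all.equal(km.local$tot.withinss,
                                   sum(km.local$withinss)))
```

Because a `toakmeans` object keeps these components, code written against `kmeans` objects that reads only them should also work on `km.demo`. The `cluster` vector is the exception: it is empty (`integer(0)`) in the output above, since the cluster assignments are not part of the returned object.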
Using line plots:
Group by clusters | Group by variables |
---|---|
`createCentroidPlot(km.demo, format="line")` | `createCentroidPlot(km.demo, format="line", groupByCluster=FALSE)` |
Using bar plots:
Group by clusters | Group by variables |
---|---|
`createCentroidPlot(km.demo, format="bar")` | `createCentroidPlot(km.demo, format="bar", groupByCluster=FALSE)` |
Using dodged bar plots:
Group by clusters | Group by variables |
---|---|
`createCentroidPlot(km.demo, format="bar_dodge")` | `createCentroidPlot(km.demo, format="bar_dodge", groupByCluster=FALSE)` |
Using heatmaps:
Heatmap | Heatmap with coordinate flip |
---|---|
`createCentroidPlot(km.demo, format="heatmap")` | `createCentroidPlot(km.demo, format="heatmap", coordFlip = TRUE)` |
Clusters have properties associated with them; the simplest is the element count. Use the parameter `aggregates` in `computeKmeans` to define arbitrary aggregates (properties) computed over each cluster (e.g. `COUNT(*) cnt` defines cluster element counts). Visualize cluster properties with `createClusterPlot`:
Color by clusters | Color by properties |
---|---|
`createClusterPlot(km.demo)` | `createClusterPlot(km.demo, colorByCluster = FALSE)` |
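To make the idea of cluster properties concrete, the following base R sketch computes the local equivalent of `COUNT(*) cnt` and a few `AVG(...)` aggregates per cluster. It runs on synthetic data with `stats::kmeans`, so the data frame and its values are assumptions for illustration only; toaster computes its aggregates inside Aster.

```r
set.seed(1)
# Synthetic batting-like data (illustrative values only)
batting <- data.frame(g = rpois(200, 80), ab = rpois(200, 250))
batting$h <- rbinom(200, batting$ab, 0.26)

km.local <- kmeans(scale(batting), centers = 3, nstart = 5)
# stats::kmeans labels clusters 1..K; shift to 0..K-1 as in the output above
batting$clusterid <- km.local$cluster - 1L

# Equivalent of "COUNT(*) cnt"
cnt <- aggregate(cbind(cnt = g) ~ clusterid, data = batting, FUN = length)
# Equivalent of "AVG(g) avg_g", "AVG(h) avg_h", "AVG(ab) avg_ab"
avg <- aggregate(cbind(avg_g = g, avg_h = h, avg_ab = ab) ~ clusterid,
                 data = batting, FUN = mean)
props <- merge(cnt, avg, by = "clusterid")
props
```

The resulting data frame has one row per cluster, the same shape as the `aggregates` element of `km.demo`.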
toaster lets you sample Aster table data clustered with k-means to visualize the distributions and structure of clusters for every variable and for each pair of variables. To do this, sample the clustered data with the `computeClusterSample` function:
km.demo = computeClusterSample(conn, km.demo, '0.5')
Here we sample roughly 50% of the data and save the sample into the existing `toakmeans` object `km.demo`. Then we visualize the sampled clustered data with `createClusterPairsPlot`:
A `kmeans` object defines the total within-cluster sum of squares as the sum of the within-cluster sums of squares over all clusters. toaster always computes these values on the resulting clusters generated by k-means. Thus, we can determine the optimal number of clusters with the elbow method: run k-means for K from 1 to N clusters and plot the total within-cluster sum of squares as a function of K:
library(ggplot2)
library(ggthemes)  # provides theme_tufte()
N = 10
results = vector("list", N)
for(i in 1:N) {
results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'),
iterMax = 1000, threshold=0.1,
id="playerid || '-' || teamid || '-' || yearid",
scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
schema='public', where="yearid > 2000")
}
data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
totwithinss = unlist(sapply(results, FUN=function(x) x$tot.withinss)))
ggplot(data) +
geom_line(aes(K, totwithinss)) + geom_point(aes(K, totwithinss)) +
scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()
Another way to measure the quality of clustering is the silhouette method. Briefly, it defines a metric called the silhouette for each observation, based on how close the observation is to its own cluster compared with the other clusters. The average silhouette over all observations in the dataset then measures how appropriately the data has been clustered. Silhouette values range from -1 to 1, and higher values indicate better clustering.
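For reference, the standard silhouette of observation i is s(i) = (b - a) / max(a, b), where a is the mean distance from i to the other members of its own cluster and b is the smallest mean distance from i to the members of any other cluster. The base R sketch below implements that textbook definition on a small synthetic dataset. `silhouetteValues` is a hypothetical helper, and this is not necessarily how toaster performs the computation inside Aster.

```r
# Hypothetical helper: silhouette value of every observation, computed from
# the full distance matrix (only practical for small, in-memory data).
silhouetteValues <- function(x, cluster) {
  ids <- sort(unique(cluster))
  stopifnot(length(ids) >= 2)            # silhouette needs at least 2 clusters
  d <- as.matrix(dist(x))
  sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)         # singleton cluster: s(i) set to 0
    # mean distance to the other members of the own cluster (d[i, i] is 0)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # smallest mean distance to the members of any other cluster
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
}

set.seed(7)
# Two well-separated synthetic groups of 30 points each
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))
km.local <- kmeans(x, centers = 2, nstart = 5)
sil <- silhouetteValues(x, km.local$cluster)
mean(sil)   # average silhouette: the single number used to compare values of K
```

The helper needs the full pairwise distance matrix, whose size grows quadratically with the number of observations, so it is only suitable for small local data.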
toaster offers the function `computeSilhouette`, which calculates the silhouette value of each observation and optionally saves the results in an Aster database table. It also returns the overall silhouette value and its distribution within each cluster (using the histogram function in Aster):
N = 10
results = vector("list", N)
for(i in 1:N) {
results[[i]] = computeKmeans(conn, "batting_enh", centers=i, include=c('g','r','h','ab'),
iterMax = 1000, threshold=0.1,
id="playerid || '-' || teamid || '-' || yearid",
scaledTableName='kmeans_demo_scaled', centroidTableName='kmeans_demo_centroids',
schema='public', where="yearid > 2000")
results[[i]]=computeSilhouette(conn, results[[i]], silhouetteTableName = "public.kmeans_demo_sil", drop=FALSE)
}
results = results[2:9]
data = data.frame(K = as.integer(unlist(sapply(results, FUN=function(x) nrow(x$centers)))),
silvalue = unlist(sapply(results, FUN=function(x) x$sil$value)))
ggplot(data) +
geom_line(aes(K, silvalue), color="red") + geom_point(aes(K, silvalue), color="red") +
labs(title="Optimal Number of Clusters", x="K", y="Total Silhouette Value") +
scale_x_continuous(breaks=seq(0,N,2)) + theme_tufte()
Finally, toaster saves the silhouette profiles (distributions) of clusters with the result and offers a visualization function for them:
km.demo = computeSilhouette(conn, km.demo)
createSilhouetteProfile(km.demo)
Note that silhouette visuals commonly plot the actual silhouette value of every observation in each cluster. Because toaster works with large datasets, we chose a histogram of silhouette values per cluster instead.
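The trade-off can be sketched locally: binning silhouette values per cluster yields a table whose size depends only on the number of clusters and bins, not on the number of observations. The silhouette values below are synthetic placeholders and the bin width of 0.1 is an arbitrary choice for this example; toaster builds its profiles with the Aster histogram function.

```r
set.seed(3)
# Synthetic silhouette values for three clusters (placeholders, not real output)
sil <- data.frame(
  clusterid = rep(0:2, times = c(50, 80, 30)),
  value     = c(runif(50, 0.4, 0.9), runif(80, -0.2, 0.7), runif(30, 0.1, 0.6))
)

# Fixed bins over the full silhouette range [-1, 1]
breaks  <- seq(-1, 1, by = 0.1)
profile <- table(clusterid = sil$clusterid,
                 bin = cut(sil$value, breaks = breaks, include.lowest = TRUE))

dim(profile)   # clusters x bins, regardless of how many rows were binned
```

Each row of `profile` is the silhouette histogram of one cluster, which is the kind of summary a per-cluster profile plot needs.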
- K-means clustering R Function
- Teradata Aster Analytics: kmeans
- Silhouette (clustering)
- Six methods for determining an optimal k value for k-means analysis - Answer on stackoverflow containing R code for several methods of computing an optimal value of k for k-means cluster analysis