Kmeans Clustering with Aster and toaster 0.4.2

Last week, version 0.4.2 of the R package toaster became available on CRAN. It addressed a few bugs in kmeans, enhanced the map visualization function, and included several other minor features and fixes.

Kmeans with toaster

In Aster, clustering is represented by several functions, of which kmeans appears to be the most widely used. toaster's family of kmeans functions streamlines the workflow by combining Aster's rich set of SQL/MR and SQL functions with scripting in the R programming language on a desktop. It covers data preparation, the clustering itself, and model evaluation and analysis, and it provides visualization functions for centroids, cluster model quality, cluster metrics and properties, and more.

Kmeans family of functions

Because kmeans is sensitive to differences in scale across its dimensions, normalizing (scaling) the model variables first is highly recommended in most, though not all, cases. The function computeKmeans therefore does this automatically unless instructed otherwise. After the clustering is performed in Aster, the function returns a standard R kmeans object with extra information.
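To see why scaling matters, here is a minimal base R sketch. It does not use toaster or Aster; it runs stats::kmeans in memory on synthetic data, and the names dat, truth, and agreement are made up for this illustration. Two groups differ only along x, while y is unrelated noise measured on a much larger scale.

```r
# Base R illustration only (not toaster): how variable scale affects kmeans.
set.seed(42)
n <- 50

# Two synthetic groups separated along x; y is pure noise on a far larger scale.
dat <- data.frame(
  x = c(rnorm(n, mean = 0, sd = 0.5), rnorm(n, mean = 5, sd = 0.5)),
  y = rnorm(2 * n, mean = 0, sd = 1000)
)
truth <- rep(1:2, each = n)

# Cluster the raw data, then the same data after centering and scaling.
fit_raw    <- kmeans(dat, centers = 2, nstart = 25)
fit_scaled <- kmeans(scale(dat), centers = 2, nstart = 25)

# Share of points grouped consistently with the true groups.
# Cluster labels are arbitrary, so take the better of the two labelings.
agreement <- function(cluster, truth) {
  acc <- mean(cluster == truth)
  max(acc, 1 - acc)
}

acc_raw    <- agreement(fit_raw$cluster, truth)
acc_scaled <- agreement(fit_scaled$cluster, truth)
print(c(raw = acc_raw, scaled = acc_scaled))
```

On the raw data the large-scale noise variable dominates the Euclidean distances, so the real structure along x is mostly ignored. After scaling, both variables contribute comparably and the two groups can be separated.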

With the function createCentroidPlot, one can immediately visualize the cluster centroids in the multi-dimensional space of the kmeans model.

Next, the function createClusterPlot shows standard kmeans metrics (cluster member counts, within-cluster sum of squares) and other cluster properties defined by the model (see aggregates in computeKmeans).
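The standard metrics mentioned above are ordinary fields of a base R kmeans object. The sketch below uses stats::kmeans on the built-in iris data rather than toaster, so it only illustrates what "cluster member counts" and "within-cluster sum of squares" refer to; the metrics data frame is a hypothetical helper, not something toaster produces.

```r
# Base R illustration only (not toaster): standard metrics on a kmeans object.
set.seed(1)
fit <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)

# One row per cluster: member count and within-cluster sum of squares.
metrics <- data.frame(
  cluster  = seq_along(fit$size),
  size     = fit$size,
  withinss = fit$withinss
)
print(metrics)

# The total sum of squares splits into within-cluster and between-cluster parts.
print(c(totss = fit$totss,
        tot.withinss = fit$tot.withinss,
        betweenss = fit$betweenss))
```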

Going deeper, the pair of functions computeClusterSample and createClusterPairsPlot drills into the resulting cluster structure by sampling the data and visualizing pairwise relationships between variables within and across kmeans clusters.
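The idea of sampling per cluster and then plotting variable pairs can be sketched in base R as follows. This is an in-memory analogue on the iris data, not the toaster functions themselves, which do the sampling in Aster; the sampling fraction frac and the variable names are assumptions chosen for the example.

```r
# Base R illustration only (not toaster): per-cluster sample plus a pairs plot.
set.seed(7)
x   <- iris[, 1:4]
fit <- kmeans(scale(x), centers = 3, nstart = 20)

# Draw the same fraction of rows from every cluster (stratified sample).
frac <- 0.5
rows_by_cluster <- split(seq_len(nrow(x)), fit$cluster)
idx <- unlist(lapply(rows_by_cluster, function(rows) {
  rows[sample.int(length(rows), ceiling(frac * length(rows)))]
}), use.names = FALSE)

sampled <- x[idx, ]
sampled$cluster <- factor(fit$cluster[idx])

# Scatterplot matrix of the sampled rows, colored by cluster membership.
pairs(sampled[, 1:4], col = as.integer(sampled$cluster), pch = 19)
```

Sampling within each cluster, rather than from the whole table, keeps small clusters visible in the plot.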

Finally, the functions computeSilhouette and createSilhouetteProfile compute and visualize a silhouette analysis of the resulting cluster model in Aster.
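For readers unfamiliar with the silhouette, the following base R sketch computes silhouette widths by hand on a small in-memory data set. It is not how toaster does it (toaster runs the analysis in Aster); silhouette_widths is a hypothetical helper written for this illustration. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the width is (b - a) / max(a, b).

```r
# Base R illustration only (not toaster): silhouette widths from a distance matrix.
silhouette_widths <- function(x, cluster) {
  d   <- as.matrix(dist(x))        # pairwise Euclidean distances
  ids <- sort(unique(cluster))
  s   <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                # exclude the point itself
    if (!any(own)) {               # singleton cluster: width is 0 by convention
      s[i] <- 0
      next
    }
    a <- mean(d[i, own])
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(k) mean(d[i, cluster == k])))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

set.seed(3)
x   <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 20)
sil <- silhouette_widths(x, fit$cluster)

# Average width per cluster and overall; values near 1 indicate well-separated points.
print(tapply(sil, fit$cluster, mean))
print(mean(sil))
```

Widths near 0 mark points that sit between clusters, and negative widths suggest a point may belong to a neighboring cluster.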

For a detailed description of the functions with examples, please see Kmeans with toaster on RPubs or the toaster manual.