Unsupervised Learning: Clustering
K-Means
1. Introduction
Labels are an essential ingredient to supervised algorithms like Support Vector Machines, which learn a hypothesis function to predict labels from features. K-Means clustering is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined categories, groups, or response variables).
2. Key Terms

A Cluster is a collection of data points aggregated together because of certain similarities.

A Cluster Centroid (or simply centroid) is the mean of a cluster; its values are the coordinate-wise averages of the data points in that cluster.

Within-Cluster Variance is the sum of the coordinate-wise squared deviations from the cluster centroid over all the observations belonging to that cluster:

    W(C_k) = \sum_{i \in C_k} \sum_{j=1}^{P} (x_{ij} - \bar{x}_{kj})^2

In the expression above, x_{ij} denotes the jth predictor of observation i, C_k denotes the set of points belonging to cluster k, and \bar{x}_k denotes the centroid of cluster k.

Total Within-Cluster Variance is the within-cluster variance summed across all clusters:

    W(C) = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2

Note that the notation \| x_i - \bar{x}_k \| means the Euclidean distance between the vectors x_i and \bar{x}_k.
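As a concrete check of the definitions above, the centroid and within-cluster variance of a toy cluster can be computed directly. This is a minimal sketch using NumPy; the data values are made up for illustration:

```python
import numpy as np

# A toy cluster of three 2-D points (hypothetical values).
cluster = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])

# The centroid is the coordinate-wise mean of the points.
centroid = cluster.mean(axis=0)  # -> [3.0, 4.0]

# Within-cluster variance W(C_k): squared deviations from the
# centroid, summed over all points and all coordinates.
within_var = ((cluster - centroid) ** 2).sum()  # -> 16.0
```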
3. Data Representation and Preparation
In the formulas above, each data point represents a vector in a P-dimensional space, where P is the number of predictors in the data set. As you can see from the formulas, the K-Means algorithm relies on the notion of distance between data points, and each data point weighs equally. To compute distances, the data must be numerical, so categorical values should be handled first (either excluded from the list of predictors or encoded as numerical values). The data should also be normalized to avoid the effects of incomparable units and different scales.
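The normalization step above can be sketched as z-score scaling. This is a minimal sketch with NumPy; the feature matrix and its units are hypothetical:

```python
import numpy as np

# Hypothetical feature matrix: column 0 is height in metres,
# column 1 is weight in grams -- wildly different scales.
X = np.array([[1.70, 65000.0],
              [1.80, 80000.0],
              [1.60, 50000.0]])

# Z-score normalization: subtract each column's mean and divide by its
# standard deviation, so every predictor contributes comparably to
# Euclidean distances.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, every column has mean 0 and standard deviation 1, so no single predictor dominates the distance computation.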
4. Algorithm
The K-Means algorithm finds cluster centroids that minimize the total within-cluster variance W(C). This is achieved in several steps:

Step 1: Randomly generate K centroids

Step 2: Assign each data point to the cluster of the closest centroid

Step 3: Compute the mean of each cluster

Step 4: Reassign centroids to respective clusters’ means computed in Step 3

Step 5: If the stop criterion is not satisfied: Go to Step 2
The stop criterion can be one of the following:

Cluster reassignment results in the same clusters

A specified number of iterations is reached

Reassigned centroids are located close (within a specified distance) to the previous centroids
Because a single run may converge to a local optimum, the algorithm should be run multiple times with different random initializations; the clustering realization that is observed most often is taken as the global optimum.
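Steps 1-5 above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name `kmeans`, the choice of data points as initial centroids, and the stop criterion of unchanged centroids are assumptions made here:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-Means sketch; assumes numerical, normalized data."""
    rng = np.random.default_rng(seed)
    # Step 1: use K randomly chosen data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to the cluster of the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 3-4: recompute each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 5: stop when reassignment no longer moves the centroids.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Running the sketch several times with different `seed` values corresponds to the multiple random initializations discussed above.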
Example: In Figure 1, you can see the K-Means algorithm in action. Training examples are shown as dots, and the K cluster centroids are shown as crosses. (a) is the original dataset; (b) shows random initial cluster centroids; (c)-(f) illustrate two iterations of K-Means. In each iteration, we assign each training example to the closest cluster centroid (shown by "painting" the training examples the same color as the cluster centroid to which each is assigned); then we move each cluster centroid to the mean of the points assigned to it.
Figure 1: K-Means algorithm
5. Choosing K
There are three common ways of selecting the number of clusters K.

Utilize domain knowledge or any other insight about the data. For instance, if we want to cluster flowers and we know that our data contains exactly 3 types of flowers, then K = 3. Another example is clustering cars sold last year; in this case, K can be the number of car manufacturers available on the market.

Run the algorithm several times for different values of K and examine the total within-cluster variance. Note that W(C) always decreases as K grows, so rather than simply picking the smallest value, select the K at which the decrease levels off (the "elbow" of the curve).

Perform cross-validation and select the K that performs best on a hold-out dataset.
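The second approach can be illustrated by tracing W(C) over a range of K values. This is a self-contained sketch; the blob data, the seeds, and the helper name `lloyd` are all invented for illustration:

```python
import numpy as np

def lloyd(X, K, n_iter=50, seed=0):
    """Tiny fixed-iteration K-Means helper used only to trace W(C)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
    return labels, centroids

# Two well-separated blobs, so the "elbow" should appear at K = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(8, 0.5, (20, 2))])

W = {}
for K in (1, 2, 3, 4):
    labels, centroids = lloyd(X, K)
    # Total within-cluster variance: squared distances of points
    # to their assigned centroid, summed over all clusters.
    W[K] = sum(((X[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))
```

Here W(C) drops sharply from K = 1 to K = 2 and then flattens out, which is exactly the elbow pattern used to choose K.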