Clustering
- Find subgroups among observations.
- Get insights or run predictions.
Clustering is used to find groups to target adds. Also used in genomic research.
K means clustering
- Centroid: average position of a cluster
- Distortion: the mean of square distance
- Inertia: the sum of the square distance
K is a hyper-parameter - the number of clusters. Plot elbow plot to find what value of k reduces distortion the most (but use lowest possible).
Vulnerable to curse of dimensionality. PCA preprocessing helps.
Given enough time, K-means will always converge (centroids stop moving per iteration)
Hierarchical clustering
- Does not need to know K in advance
- Use (di)similarity or distance metric
- Use Dendrogram
- Deterministic
- Greedy
Choice of linkage type influences how clusters are formed.
- complete
- single
- average
- centroid
- ward
Measure distance between clusters and merge if shorter than a threshold.