1. Some materials are taken from machine learning course of Victor Kitov
Clustering is partitioning of objects into groups so that:
We can compare clustering algorithms in terms of:
different distance functions lead to different algorithms:
$\mu_{k}$ may be arbitrary or constrained to be existing objects
$K$ - is set by user
Shape of clusters is defined by $\rho(\cdot,\cdot)$
plot = interact(elbow_demo, k=IntSlider(min=2,max=8,step=1,value=2))
interact(demo_kmpp, iters=IntSlider(min=1,max=6,step=1,value=1))
<function __main__.demo_kmpp>
Hierarchical clustering may be:
If dendrogram is good, correlation should be high
interact(coph_demo, k=IntSlider(min=2, max=10, step=1, value=2), link=['complete', 'single', 'average', 'centroid'], metric=['euclidean', 'cityblock'])
<function __main__.coph_demo>
min_pts
- minimal number of points inside neighbourhoodmin_pts
points in $N_\epsilon${C}
1.function dbscan(X, eps, min_pts):
2. initialize NV = X # not visited objects
3. for x in NV:
4. remove(NV, x) # mark as visited
5. nbr = neighbours(x, eps) # set of neighbours
6. if nbr.size < min_pts:
7. mark_as_noise(x)
8. else:
9. C = new_cluster()
10. expand_cluster(x, nbr, C, eps, min_pts, NV)
11. yield C
{C}
1. function expand_cluster(x, nbr, C, eps, min_pts, NV):
2. add(x, C)
3. for x1 in nbr:
4. if x1 in NV: # object not visited
5. remove(NV, x1) # mark as visited
6. nbr1 = neighbours(x1, eps)
7. if nbr1.size >= min_pts:
8. # join sets of neighbours
9. merge(nbr, nbr_1)
10. if x1 not in any cluster:
11. add(x1, C)
interact(dbscan_demo, eps=FloatSlider(min=0.1, max=10, step=0.05, value=1), min_pts=IntSlider(min=2, max=15, step=1, value=5))
<function __main__.dbscan_demo>
min_pts
: cluster C is missedmin_pts
: clusters A and B get mergedwhere
where
Adjusted Rand Index
$$\text{ARI}(\hat{\pi},\pi^*) = \frac{\text{Rand}(\hat{\pi},\pi^*) - \text{Expected}}{\text{Max} - \text{Expected}}$$For each object $x_{i}$ define:
Silhouette coefficient for $x_{1},...x_{N}$: $$ Silhouette=\frac{1}{N}\sum_{i=1}^{N}\frac{d_{i}-s_{i}}{\max\{d_{i},s_{i}\}} $$
Advantages
Disadvantages