Quick Start

Binder Open In Colab

This page is generated by a Jupyter notebook which can be opened and run in Binder or Google Colab by clicking on the above badges. To run it in Google Colab, you need to install PyMinimax in Colab first.

Installation

The recommended way to install PyMinimax is using pip:

[ ]:
pip install pyminimax

Or if you have installed PyMinimax before, please update to the latest version:

[ ]:
pip install --upgrade pyminimax

PyMinimax runs on any platform with Python 3 and SciPy installed.

Usage

The most important function in PyMinimax is pyminimax.minimax, and by default its usage is the same as the hierarchical clustering methods in SciPy, say scipy.cluster.hierarchy.complete. Here we demonstrate with an example from the SciPy documentation. First consider a dataset of \(n=12\) points:

[1]:
X = [[0, 0], [0, 1], [1, 0],
     [0, 4], [0, 3], [1, 4],
     [4, 0], [3, 0], [4, 1],
     [4, 4], [3, 4], [4, 3]]
x x    x x
x        x

x        x
x x    x x

The minimax function takes a flattened distance matrix of the data as an argument, which can be computed by scipy.spatial.distance.pdist. By default, the return value of minimax has the same format as that of scipy.cluster.hierarchy.linkage. This is an \((n-1)\) by 4 matrix keeping the clustering result, called the linkage matrix. A detailed explanation of its format can be found in the SciPy documentation.

[2]:
from pyminimax import minimax
from scipy.spatial.distance import pdist

Z = minimax(pdist(X))
Z
[2]:
array([[ 0.        ,  1.        ,  1.        ,  2.        ],
       [ 2.        , 12.        ,  1.        ,  3.        ],
       [ 3.        ,  4.        ,  1.        ,  2.        ],
       [ 6.        ,  7.        ,  1.        ,  2.        ],
       [ 5.        , 14.        ,  1.        ,  3.        ],
       [ 8.        , 15.        ,  1.        ,  3.        ],
       [ 9.        , 10.        ,  1.        ,  2.        ],
       [11.        , 18.        ,  1.        ,  3.        ],
       [13.        , 16.        ,  3.16227766,  6.        ],
       [17.        , 19.        ,  3.16227766,  6.        ],
       [20.        , 21.        ,  5.        , 12.        ]])

Given the linkage matrix, one can then utilize the methods in SciPy to present the clustering result in a more readable manner. Below are examples applying dendrogram and fcluster.

[3]:
from scipy.cluster.hierarchy import dendrogram
from matplotlib import pyplot as plt

fig = plt.figure(figsize=(10, 4))
dendrogram(Z)
plt.show()
_images/quick_start_11_0.png
[4]:
from scipy.cluster.hierarchy import fcluster

fcluster(Z, t=1.8, criterion='distance')
[4]:
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=int32)

The above result says that, cutting the dendrogram at 1.8 threshold, the data has 4 clusters, with the first 3 points being in the first cluster, and the following 3 in the second cluster, and so on.

Getting Prototypes

A unique advantage of minimax linkage hierarchical clustering is that in each cluster one prototype is selected from the original data. Thus the resulting clusters are better interpretable. Starting 0.1.2, PyMinimax can also compute those prototypes. Below we demonstrate how this can be done.

To obtain the prototypes, first we need an extended linkage matrix that contains prototypes information. This is an \((n-1)\) by \(5\) matrix, where the first 4 columns are in the same format as the standard linkage matrix and the 5th column keeps the indices of the prototypes of each merged cluster. The extended linkage matrix can be computed by pyminimax.minimax with return_prototype=True.

[5]:
Z_ext = minimax(pdist(X), return_prototype=True)
Z_ext
[5]:
array([[ 0.        ,  1.        ,  1.        ,  2.        ,  0.        ],
       [ 2.        , 12.        ,  1.        ,  3.        ,  0.        ],
       [ 3.        ,  4.        ,  1.        ,  2.        ,  3.        ],
       [ 6.        ,  7.        ,  1.        ,  2.        ,  6.        ],
       [ 5.        , 14.        ,  1.        ,  3.        ,  3.        ],
       [ 8.        , 15.        ,  1.        ,  3.        ,  6.        ],
       [ 9.        , 10.        ,  1.        ,  2.        ,  9.        ],
       [11.        , 18.        ,  1.        ,  3.        ,  9.        ],
       [13.        , 16.        ,  3.16227766,  6.        ,  1.        ],
       [17.        , 19.        ,  3.16227766,  6.        ,  8.        ],
       [20.        , 21.        ,  5.        , 12.        ,  1.        ]])

The 5-column extended linkage matrix is no longer SciPy-compatible. Any SciPy function that takes a linkage matrix will only work if we slice the extended linkage matrix to drop the 5th column, hence the Z_ext[:, :4] below.

[6]:
fig = plt.figure(figsize=(10, 4))
dendrogram(Z_ext[:, :4])
plt.show()
_images/quick_start_17_0.png

In PyMinimax, the prototypes are computed by pyminimax.fcluster_prototype. Its usage is exactly the same as scipy.cluster.hierarchy.fcluster, except

  1. it takes the 5-column extended linkage matrix instead of a standard 4-column one, and

  2. it returns information regarding clusters and prototypes.

[7]:
from pyminimax import fcluster_prototype

fcluster_prototype(Z_ext, t=1.8, criterion='distance')
[7]:
array([[1, 0],
       [1, 0],
       [1, 0],
       [2, 3],
       [2, 3],
       [2, 3],
       [3, 6],
       [3, 6],
       [3, 6],
       [4, 9],
       [4, 9],
       [4, 9]], dtype=int32)

The above result says that, cutting the dendrogram at 1.8 threshold, the data has 4 clusters. The first 3 points are in the first cluster and the prototype of this cluster is the 0th data point, the following 3 points are in the second cluster and the prototype of this cluster is the 3rd data point, and so on.

See Also