User Guide¶
Discriminative clustering¶
Definition¶
Clustering is the art of separating data samples \(x\) into \(K\) groups called clusters.
We take the word discriminative in the sense of Minka [2]. In the context of clustering, this means that we do not set any hypothesis on the data distribution and only seek to cluster the data by directly designing a clustering distribution:

\[p_\theta(y \mid x),\]

where \(y\) is the cluster to identify and \(\theta\) are the parameters of the discriminative distribution.
Choosing a model¶
Any function that takes the data as input and returns a vector in the \((K-1)\)-simplex can be used as a discriminative clustering distribution. For example, if we believe that the data is linearly separable, we can use a logistic regression:

\[p_\theta(y=k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^K \exp(w_j^\top x + b_j)}.\]

Then, the parameters to learn are \(\theta = \{w_k, b_k\}_{k=1}^K\).
If we want more complex boundaries, any neural network that finishes with a softmax activation can be used. In other words, the choice of the type of decision boundary should guide the choice of clustering distribution.
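As an illustration, here is a minimal NumPy sketch of such a softmax logistic regression used as a clustering distribution. The function and variable names are ours for illustration and are not part of the package API.

```python
import numpy as np

def softmax_clustering_distribution(X, W, b):
    """Return p_theta(y | x) for each row of X.

    X: (n_samples, n_features) data, W: (n_features, n_clusters) weights, b: (n_clusters,) biases.
    """
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)

# Example: 100 samples, 5 features, K = 3 clusters
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W, b = rng.normal(size=(5, 3)), np.zeros(3)
p_y_given_x = softmax_clustering_distribution(X, W, b)  # each row lies in the 2-simplex
```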
The GEMINI approach¶
Learning the parameters \(\theta\) in discriminative clustering is challenging because the absence of hypotheses on the data distribution prevents us from computing any likelihood.
In 1991, Bridle, Heading and MacKay [1] proposed to optimise the parameters so that they maximise the mutual information between the data and the cluster assignments:

\[\mathcal{I}(x;y) = \mathbb{E}_{y \sim p_\theta(y)}\left[D_\text{KL}\left(p_\theta(x \mid y) \,\|\, p(x)\right)\right].\]

Mutual information has since remained an essential component of discriminative clustering models. In contrast with classification objectives, it is an objective function that can be used independently of the form taken by the model \(p_\theta(y\mid x)\).
The generalised mutual information (GEMINI) is an extension of mutual information that replaces the Kullback-Leibler divergence \(D_\text{KL}\) by any other statistical distance \(D\). It comes in two different versions. The first, named one-vs-all (OvA), seeks to discriminate the individual cluster distributions from the data distribution:

\[\mathcal{I}_D^\text{OvA}(x;y) = \mathbb{E}_{y \sim p_\theta(y)}\left[D\left(p_\theta(x \mid y) \,\|\, p(x)\right)\right].\]

The second is the one-vs-one (OvO) version, which compares two independently drawn cluster distributions:

\[\mathcal{I}_D^\text{OvO}(x;y) = \mathbb{E}_{y_1, y_2 \sim p_\theta(y)}\left[D\left(p_\theta(x \mid y_1) \,\|\, p_\theta(x \mid y_2)\right)\right].\]
The specificity of GEMINI is that it involves distances in which Bayes' theorem can easily be applied to obtain a tractable objective that we can evaluate using only clustering probabilities. Hence, models trained with GEMINI are discriminative models \(p_\theta(y\mid x)\) without any parametric assumption.
Note that mutual information and the K-Means loss are special cases of GEMINI.
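To make this tractability concrete, here is a small NumPy sketch (ours, not part of the package API) that estimates the mutual information, i.e. the KL OvA GEMINI, from a batch of clustering probabilities alone. It uses the equivalent form \(\mathbb{E}_x\left[D_\text{KL}\left(p_\theta(y \mid x) \,\|\, p_\theta(y)\right)\right]\), where the cluster proportions \(p_\theta(y)\) are obtained through Bayes' theorem as the batch average of \(p_\theta(y\mid x)\).

```python
import numpy as np

def mutual_information_from_predictions(p_y_given_x, eps=1e-12):
    """Monte Carlo estimate of I(x; y) using only clustering probabilities.

    p_y_given_x: (n_samples, n_clusters) array of p_theta(y | x).
    """
    # Marginal cluster proportions p_theta(y) = E_x[p_theta(y | x)]
    p_y = p_y_given_x.mean(axis=0, keepdims=True)
    # I(x; y) = E_x[ KL( p_theta(y | x) || p_theta(y) ) ]
    kl_per_sample = (p_y_given_x * (np.log(p_y_given_x + eps) - np.log(p_y + eps))).sum(axis=1)
    return kl_per_sample.mean()

# Example with arbitrary clustering probabilities
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 3))
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(mutual_information_from_predictions(p))
```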
Extending / Regularising models¶
Owing to the decoupling between the choice of the clustering model and the objective function used to learn it, it is possible to add regularisation that constrains the model \(p_\theta(y\mid x)\). For example, we propose to add an \(\ell_2\) penalty on logistic regressions in gemclus.linear.RIM, taken from Krause, Perona and Gomes [3].
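As an indicative usage sketch, assuming that RIM follows the usual scikit-learn estimator interface and that the n_clusters parameter name is correct (check the API reference):

```python
from sklearn.datasets import make_blobs
from gemclus.linear import RIM  # regularised logistic regression for clustering

# Toy data; the constructor arguments below are assumptions to verify in the API reference.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = RIM(n_clusters=3)
labels = model.fit_predict(X)
```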
We also propose models that incorporate feature selection through a group-lasso penalty, inspired by LassoNet. Selecting features can be valuable in clustering to help interpret the resulting clusters.
Content of the package¶
Which GEMINIs are implemented¶
All GEMINIs from our initial work [9] are available: the MMD and Wasserstein distances are present for geometrical considerations, as well as the Kullback-Leibler divergence, the Total Variation distance and the squared Hellinger distance. Both OvA and OvO implementations are present for all models. The OvO mode can be set in most clustering models by passing ovo=True to the constructor of the model.
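For instance, assuming one of the MLP-based models from gemclus.mlp (the class name below is an assumption to check against the API reference), the OvO mode could be requested as follows:

```python
from gemclus.mlp import MLPModel  # hypothetical class name, see the gemclus.mlp reference

# Request the one-vs-one formulation of the chosen GEMINI
model = MLPModel(n_clusters=3, ovo=True)
```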
Some models come with readily integrated GEMINIs, but it is also possible to set a custom GEMINI for others.
The Wasserstein distance requires a distance function in the data space to be computed. We directly propose all distances available from sklearn.metrics.pairwise_distances, with the Euclidean distance by default. In the same manner, we provide all kernels available from sklearn.metrics.pairwise_kernels for the MMD, with the linear kernel by default. For both GEMINIs, it is also possible to use a precomputed distance or kernel of your own, which must then be passed to the GEMINI. All the loss functions we propose are located in the module gemclus.gemini.
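For example, a precomputed distance or kernel matrix can be built with scikit-learn before being handed to the chosen GEMINI; how exactly it is passed depends on the class, so refer to the gemclus.gemini documentation:

```python
import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_kernels

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Euclidean distances, e.g. for the Wasserstein GEMINI (default metric)
D = pairwise_distances(X, metric="euclidean")
# Linear kernel, e.g. for the MMD GEMINI (default kernel)
K = pairwise_kernels(X, metric="linear")
```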
What discriminative distributions are available¶
We propose several clustering distributions depending on the purpose:
- Logistic regressions and kernel regressions in gemclus.linear
- Two-layer multi-layer perceptrons (MLPs) with ReLU activations in gemclus.mlp
- Decision trees (only compatible with the MMD GEMINI) in gemclus.tree
- Differentiable trees in gemclus.tree
- Sparse models in gemclus.sparse
- Nonparametric models in gemclus.nonparametric
The sparse models are logistic regressions and MLPs. Sparse means that we perform feature selection alongside clustering. The sparse architecture of the MLP is inspired by LassoNet [8], which adds a linear skip connection between the inputs and the clustering output. These sparse versions are located in gemclus.sparse.
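As an indicative sketch of using such a sparse model, where the class name and constructor arguments are assumptions to verify against the gemclus.sparse reference:

```python
from sklearn.datasets import make_blobs
from gemclus.sparse import SparseLinearMMD  # hypothetical class name, see the gemclus.sparse reference

# Toy data with more features than necessary for the cluster structure
X, _ = make_blobs(n_samples=200, n_features=10, centers=3, random_state=0)
model = SparseLinearMMD(n_clusters=3)  # the group-lasso penalty drives some feature weights to zero
labels = model.fit_predict(X)
```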
If you want to use another model, you can derive the gemclus.DiscriminativeModel class and rewrite its hidden methods _infer, _get_weights, _init_params and _compute_grads. An example of extension is given here.
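A bare skeleton of such an extension could look as follows; the method signatures are assumptions and should be matched against those of gemclus.DiscriminativeModel:

```python
from gemclus import DiscriminativeModel

class MyClusteringModel(DiscriminativeModel):
    """Sketch of a custom discriminative model; method signatures are assumptions."""

    def _init_params(self, random_state, X=None):
        # Create and store the parameters of the model (e.g. weight matrices)
        ...

    def _get_weights(self):
        # Return the parameters that receive gradient updates
        ...

    def _infer(self, X):
        # Return the clustering probabilities p_theta(y | x) for X
        ...

    def _compute_grads(self, X, y_pred, gradient):
        # Backpropagate the GEMINI gradient w.r.t. y_pred onto the parameters
        ...
```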
A summary of what is implemented¶
We hope that GemClus will keep growing. We seek to implement methods and datasets that are relevant to discriminative clustering.
Class/Function | Source paper
---|---
 | Krause, Perona and Gomes [3]
 | França, Rizzo and Vogelstein [5]
 | Ohl et al. [11]
 | Ohl et al. [10]

Class/Function | Source paper
---|---
 | Ohl et al. [9]
 | Bridle, Heading and MacKay [1]
 | Sugiyama et al. [4]

Class/Function | Source paper
---|---
 | Ohl et al. [9]
 | Celeux et al. [6]
Basic examples¶
We provide some basic examples in the Example gallery, including clustering of simple distributions and how to perform feature selection using sparse models from gemclus.sparse.