User Guide

Discriminative clustering

Definition

Clustering is the art of separating data samples \(x\) into \(K\) groups called clusters.

We take the word discriminative in the sense of Minka [2]. In the context of clustering, this means that we do not set any hypothesis on the data distribution and only seek to cluster data by directly designing a clustering distribution:

\[p_\theta(y \mid x),\]

where \(y\) is the cluster to identify and \(\theta\) are the parameters of the discriminative distribution.

Choosing a model

Any function that takes the data as input and returns a vector in the \((K-1)\)-simplex can be used as a discriminative clustering distribution. For example, if we believe that the data is linearly separable, we can use a logistic regression:

\[p_\theta (y = k \mid x) \propto \exp\left(\langle w_k, x \rangle + b_k\right).\]

Then, the parameters to learn are \(\theta = \{w_k, b_k\}_{k=1}^K\).

If we want more complex boundaries, any neural network that ends with a softmax activation can be used. In other words, the desired type of decision boundary should guide the choice of clustering distribution.
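For illustration purposes only, here is a minimal NumPy sketch of such a softmax parametrisation; the function and variable names are our own and are not part of the package:

    import numpy as np

    def softmax_clustering(X, W, b):
        # X: (n_samples, n_features) data, W: (n_features, K) weights, b: (K,) biases
        logits = X @ W + b
        # Subtract the row-wise maximum for numerical stability
        logits = logits - logits.max(axis=1, keepdims=True)
        exp_logits = np.exp(logits)
        # Each row sums to one, i.e. lies in the (K-1)-simplex
        return exp_logits / exp_logits.sum(axis=1, keepdims=True)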

The GEMINI approach

Learning the parameters \(\theta\) in discriminative clustering is challenging because the absence of hypotheses on the data distribution prevents us from computing any likelihood.

In 1991, Bridle, Heading and MacKay [1] proposed to optimise the parameters so that they maximise the mutual information between the data and the cluster assignments:

\[\mathcal{I} = \mathbb{E}_{y\sim p_\theta(y)} \left[D_\text{KL} (p_\theta(x\mid y) \| p_\text{data}(x))\right].\]

Mutual information has since remained an essential component of discriminative clustering models. Much like objective functions in classification contexts, it can be used independently of the form taken by the model \(p_\theta(y\mid x)\).

The generalised mutual information (GEMINI) is an extension of mutual information that replaces the Kullback-Leibler divergence \(D_\text{KL}\) with any other statistical distance \(D\). It comes in two different versions. The first approach, named one-vs-all (OvA), seeks to discriminate the individual cluster distributions from the data distribution:

\[\mathbb{E}_{y \sim p_\theta(y)} \left[ D(p_\theta(x \mid y) \| p_\text{data}(x))\right].\]

The second approach is the one-vs-one (OvO) version that compares two independently drawn cluster distributions:

\[\mathbb{E}_{y_1, y_2 \sim p_\theta(y)} \left[ D(p_\theta(x|y_1) \| p_\theta(x | y_2))\right].\]

The specificity of GEMINI is that it involves distances in which Bayes' theorem can easily be applied, yielding a tractable objective that we can evaluate using only the clustering probabilities. Hence, models trained with GEMINI are discriminative models \(p_\theta(y\mid x)\) that make no parametric assumption on the data.

Note that mutual information and the K-Means loss are special cases of GEMINI.
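In the mutual information case, for instance, applying Bayes' theorem to the one-vs-all objective gives \(\mathbb{E}_{x \sim p_\text{data}}\left[D_\text{KL}(p_\theta(y \mid x) \| p_\theta(y))\right]\), which involves only clustering probabilities. Below is a minimal NumPy sketch of this estimate over a batch of samples; the function name and the epsilon smoothing are our own choices, not the package's implementation:

    import numpy as np

    def mutual_information_estimate(probs, eps=1e-12):
        # probs: (n_samples, K) matrix of clustering probabilities p_theta(y | x_i)
        # The cluster marginal p_theta(y) is estimated by averaging over the batch
        p_y = probs.mean(axis=0)
        # Monte Carlo estimate of E_x[ KL( p_theta(y|x) || p_theta(y) ) ]
        kl_per_sample = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
        return kl_per_sample.mean()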

Extending / Regularising models

Owing to the decoupling between the choice of the clustering model and the objective function used to learn it, it is possible to add regularisation that constrains the model \(p_\theta(y\mid x)\). For example, we propose an \(\ell_2\) penalty on logistic regressions in gemclus.linear.RIM, taken from Krause, Perona and Gomes [3].

We also propose models that incorporate feature selection using a group-lasso penalty, inspired by LassoNet [8]. Feature selection can be valuable in clustering because it helps with the interpretation of the clusters.
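To give a flavour of this penalty, the group-lasso gathers, for each input feature, all weights attached to it and sums the Euclidean norms of these groups, so that entire features are driven to zero. The sketch below only illustrates the penalty term and is not the package's implementation:

    import numpy as np

    def group_lasso_penalty(W):
        # W: (n_features, K) weight matrix; one group per input feature (one row)
        # Summing the Euclidean norms of the rows drives whole rows to zero,
        # which removes the corresponding features from the model
        return np.sum(np.linalg.norm(W, axis=1))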

Content of the package

Which GEMINIs are implemented

All GEMINIs from our initial work [9] are available: the MMD and Wasserstein distances are present for geometrical considerations, as well as the Kullback-Leibler divergence, the Total Variation distance and the squared Hellinger distance. Both OvA and OvO implementations are present in all models. The OvO mode can be set in most clustering models by passing ovo=True to the constructor of a model.

Some models ship with a readily integrated GEMINI, but it is also possible to set a custom GEMINI in some of them.

Computing the Wasserstein distance requires a distance function in the data space. We directly propose all distances available from sklearn.metrics.pairwise_distances, with the Euclidean distance by default. In the same manner, we provide all kernels available from sklearn.metrics.pairwise_kernels for the MMD, with the linear kernel by default. For both GEMINIs, it is also possible to use a precomputed distance or kernel of your own, which must then be passed to the GEMINI.

All loss functions we propose are located in the module gemclus.gemini.
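As a sketch of how these loss objects might be instantiated, assuming constructor parameters named ovo, kernel and metric matching the defaults described above (the class names come from the summary table below; the exact signatures should be checked against the API reference):

    from gemclus.gemini import MMDGEMINI, WassersteinGEMINI

    # One-vs-one MMD GEMINI with the default linear kernel
    # (parameter names are assumptions to verify in the API reference)
    mmd_ovo = MMDGEMINI(ovo=True, kernel="linear")

    # One-vs-all Wasserstein GEMINI with the default Euclidean distance in data space
    wasserstein_ova = WassersteinGEMINI(ovo=False, metric="euclidean")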

What discriminative distributions are available

We propose several clustering distributions depending on the purpose:

  • Logistic regressions, kernel regressions in gemclus.linear

  • Two-layer multi-layer perceptrons (MLPs) with ReLU activations in gemclus.mlp

  • Decision trees (only compatible with the MMD GEMINI) in gemclus.tree

  • Differentiable trees in gemclus.tree

  • Sparse models in gemclus.sparse

  • Nonparametric models in gemclus.nonparametric

The sparse models are logistic regressions and MLPs. Sparse means that feature selection is achieved along with clustering. The sparse architecture of the MLP is inspired by LassoNet [8], which adds a linear skip connection between the inputs and the clustering output. These sparse versions are located in gemclus.sparse.

If you want to use another model, you can derive the gemclus.DiscriminativeModel class and override its hidden methods _infer, _get_weights, _init_params and _compute_grads. An example of such an extension is given here.
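Below is a structural sketch of such an extension. Only the four method names come from the paragraph above; their signatures are assumptions, so refer to the linked example and the API reference for the real contracts:

    from gemclus import DiscriminativeModel

    class MyClusteringModel(DiscriminativeModel):
        # Skeleton only: method names are from the guide, signatures are assumptions

        def _init_params(self, random_state, X):
            # Create and store the trainable parameters of the model
            raise NotImplementedError

        def _get_weights(self):
            # Return the current trainable parameters
            raise NotImplementedError

        def _infer(self, X, retain=True):
            # Return the clustering probabilities p_theta(y | x) for the samples X
            raise NotImplementedError

        def _compute_grads(self, X, y_pred, gradient):
            # Turn the GEMINI gradient w.r.t. the predictions into parameter gradients
            raise NotImplementedError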

A summary of what is implemented

We hope that GemClus will keep on growing. We seek to implement methods and datasets that can be relevant in discriminative clustering.

Models

  • gemclus.linear.RIM, gemclus.linear.KernelRIM (Krause, Perona and Gomes [3])

  • gemclus.sparse.SparseLinearMI (França, Rizzo and Vogelstein [5])

  • gemclus.tree.Kauri (Ohl et al. [11])

  • gemclus.sparse.SparseMLPModel (Ohl et al. [10], Lemhadri et al. [8])

  • gemclus.sparse.SparseLinearModel (Ohl et al. [10])

Objective functions

  • gemclus.gemini.WassersteinGEMINI, gemclus.gemini.TVGEMINI, gemclus.gemini.HellingerGEMINI (Ohl et al. [9])

  • gemclus.gemini.MMDGEMINI (Ohl et al. [9], [7])

  • gemclus.gemini.MI (Bridle, Heading and MacKay [1])

  • gemclus.gemini.ChiSquareGEMINI (Sugiyama et al. [4])

Datasets

  • gemclus.data.gstm (Ohl et al. [9])

  • gemclus.data.celeux_one, gemclus.data.celeux_two (Celeux et al. [6])

Basic examples

We provide some basic examples in the Example gallery, including the clustering of simple distributions and how to perform feature selection using the sparse models from gemclus.sparse.
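As a taste of the gallery, a minimal workflow might look as follows, assuming the estimators follow scikit-learn fit/predict conventions and that gemclus.linear.RIM accepts n_clusters and random_state parameters (both are assumptions to verify against the examples and the API reference):

    import numpy as np
    from sklearn.datasets import make_blobs
    from gemclus.linear import RIM

    # Three well-separated Gaussian blobs as a toy dataset
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Regularised logistic regression clustering; n_clusters and fit_predict
    # follow scikit-learn conventions and are assumptions to verify
    model = RIM(n_clusters=3, random_state=0)
    labels = model.fit_predict(X)
    print(np.bincount(labels))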

References