User Guide

Discriminative clustering

Definition

Clustering is the art of separating data samples \(x\) into \(K\) groups called clusters.

We take the word discriminative in the sense of Minka [2]. In the context of clustering, this means that we do not set any hypothesis on the data distribution and only seek to cluster data by directly designing a clustering distribution:

\[p_\theta(y \mid x),\]

where \(y\) is the cluster to identify and \(\theta\) are the parameters of the discriminative distribution.

Choosing a model

Any function that takes the data as input and returns a vector in the \((K-1)\)-simplex can be used as a discriminative clustering distribution. For example, if we believe that the data is linearly separable, we can use a logistic regression:

\[p_\theta (y = k \mid x) \propto \exp\left(\langle w_k, x \rangle + b_k\right).\]

Then, the parameters to learn are \(\theta = \{w_k, b_k\}_{k=1}^K\).

If we want more complex boundaries, any neural network that ends with a softmax activation can be used. In other words, the desired type of decision boundary should guide the choice of clustering distribution.
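For illustration purposes only, here is a minimal NumPy sketch of such a softmax parametrisation; the function and variable names are our own and are not part of the package:

    import numpy as np

    def softmax_clustering(X, W, b):
        # X: (n_samples, n_features) data, W: (n_features, K) weights, b: (K,) biases
        logits = X @ W + b
        # Subtract the row-wise maximum for numerical stability
        logits = logits - logits.max(axis=1, keepdims=True)
        exp_logits = np.exp(logits)
        # Each row sums to one, i.e. lies in the (K-1)-simplex
        return exp_logits / exp_logits.sum(axis=1, keepdims=True)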

The GEMINI approach

Learning the parameters \(\theta\) in discriminative clustering is challenging because the absence of hypotheses on the data distribution prevents us from computing any likelihood.

In 1991, Bridle, Heading and MacKay [1] proposed to optimise the parameters so that they maximise the mutual information between the data and the cluster assignments:

\[\mathcal{I} = \mathbb{E}_{y\sim p_\theta(y)} \left[D_\text{KL} (p_\theta(x\mid y) \| p_\text{data}(x))\right].\]

Mutual information has since remained an essential component of discriminative clustering models. Much like objective functions in classification contexts, it can be used independently of the form taken by the model \(p_\theta(y\mid x)\).

The generalised mutual information (GEMINI) is an extension of mutual information that replaces the Kullback-Leibler divergence \(D_\text{KL}\) with any other statistical distance \(D\). It comes in two different versions. The first approach, named one-vs-all (OvA), seeks to discriminate the individual cluster distributions from the data distribution:

\[\mathbb{E}_{y \sim p_\theta(y)} \left[ D(p_\theta(x \mid y) \| p_\text{data}(x))\right].\]

The second approach is the one-vs-one (OvO) version that compares two independently drawn cluster distributions:

\[\mathbb{E}_{y_1, y_2 \sim p_\theta(y)} \left[ D(p_\theta(x|y_1) \| p_\theta(x | y_2))\right].\]

The specificity of GEMINI is that it involves distances in which Bayes' theorem can easily be applied, yielding a tractable objective that we can evaluate using only the clustering probabilities. Hence, models trained with GEMINI are discriminative models \(p_\theta(y\mid x)\) that make no parametric assumption on the data.

Note that mutual information and the K-Means loss are special cases of GEMINI.
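In the mutual information case, for instance, applying Bayes' theorem to the one-vs-all objective gives \(\mathbb{E}_{x \sim p_\text{data}}\left[D_\text{KL}(p_\theta(y \mid x) \| p_\theta(y))\right]\), which involves only clustering probabilities. Below is a minimal NumPy sketch of this estimate over a batch of samples; the function name and the epsilon smoothing are our own choices, not the package's implementation:

    import numpy as np

    def mutual_information_estimate(probs, eps=1e-12):
        # probs: (n_samples, K) matrix of clustering probabilities p_theta(y | x_i)
        # The cluster marginal p_theta(y) is estimated by averaging over the batch
        p_y = probs.mean(axis=0)
        # Monte Carlo estimate of E_x[ KL( p_theta(y|x) || p_theta(y) ) ]
        kl_per_sample = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
        return kl_per_sample.mean()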

Extending / Regularising models

Owing to the decoupling between the choice of the clustering model and the objective function used to learn it, it is possible to add regularisation that constrains the model \(p_\theta(y\mid x)\). For example, we propose an \(\ell_2\) penalty on logistic regressions in gemclus.linear.RIM, taken from Krause, Perona and Gomes [3].

We also propose models that incorporate feature selection using a group-lasso penalty, inspired by LassoNet [8]. Feature selection can be valuable in clustering because it helps with the interpretation of the clusters.
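To give a flavour of this penalty, the group-lasso gathers, for each input feature, all weights attached to it and sums the Euclidean norms of these groups, so that entire features are driven to zero. The sketch below only illustrates the penalty term and is not the package's implementation:

    import numpy as np

    def group_lasso_penalty(W):
        # W: (n_features, K) weight matrix; one group per input feature (one row)
        # Summing the Euclidean norms of the rows drives whole rows to zero,
        # which removes the corresponding features from the model
        return np.sum(np.linalg.norm(W, axis=1))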

Content of the package

Which GEMINIs are implemented

All GEMINIs from our initial work [9] are available: the MMD and Wasserstein distances are present for geometrical considerations, as well as the Kullback-Leibler divergence, the Total Variation distance and the squared Hellinger distance. Both OvA and OvO implementations are present in all models. The OvO mode can be set in most clustering models by passing ovo=True to the constructor of a model.

Some models ship with a readily integrated GEMINI, but it is also possible to set a custom GEMINI in some of them.

Computing the Wasserstein distance requires a distance function in the data space. We directly propose all distances available from sklearn.metrics.pairwise_distances, with the Euclidean distance by default. In the same manner, we provide all kernels available from sklearn.metrics.pairwise_kernels for the MMD, with the linear kernel by default. For both GEMINIs, it is also possible to use a precomputed distance or kernel of your own, which must then be passed to the GEMINI.

All loss functions we propose are located in the module gemclus.gemini.
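As a sketch of how these loss objects might be instantiated, assuming constructor parameters named ovo, kernel and metric matching the defaults described above (the class names come from the summary table below; the exact signatures should be checked against the API reference):

    from gemclus.gemini import MMDGEMINI, WassersteinGEMINI

    # One-vs-one MMD GEMINI with the default linear kernel
    # (parameter names are assumptions to verify in the API reference)
    mmd_ovo = MMDGEMINI(ovo=True, kernel="linear")

    # One-vs-all Wasserstein GEMINI with the default Euclidean distance in data space
    wasserstein_ova = WassersteinGEMINI(ovo=False, metric="euclidean")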

What discriminative distributions are available

We propose several clustering distributions depending on the purpose:

  • Logistic regressions, kernel regressions in gemclus.linear

  • Two-layer multi-layer perceptrons (MLPs) with ReLU activations in gemclus.mlp

  • Decision trees (only compatible with the MMD GEMINI) in gemclus.tree

  • Differentiable trees in gemclus.tree

  • Sparse models in gemclus.sparse

  • Nonparametric models in gemclus.nonparametric

The sparse models are logistic regressions and MLPs. Sparse means that feature selection is achieved along with clustering. The sparse architecture of the MLP is inspired by LassoNet [8], which adds a linear skip connection between the inputs and the clustering output. These sparse versions are located in gemclus.sparse.

If you want to use another model, you can derive the gemclus.DiscriminativeModel class and override its hidden methods _infer, _get_weights, _init_params and _compute_grads. An example of such an extension is given here.
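Below is a structural sketch of such an extension. Only the four method names come from the paragraph above; their signatures are assumptions, so refer to the linked example and the API reference for the real contracts:

    from gemclus import DiscriminativeModel

    class MyClusteringModel(DiscriminativeModel):
        # Skeleton only: method names are from the guide, signatures are assumptions

        def _init_params(self, random_state, X):
            # Create and store the trainable parameters of the model
            raise NotImplementedError

        def _get_weights(self):
            # Return the current trainable parameters
            raise NotImplementedError

        def _infer(self, X, retain=True):
            # Return the clustering probabilities p_theta(y | x) for the samples X
            raise NotImplementedError

        def _compute_grads(self, X, y_pred, gradient):
            # Turn the GEMINI gradient w.r.t. the predictions into parameter gradients
            raise NotImplementedError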

A summary of what is implemented

We hope that GemClus will keep on growing. We seek to implement methods and datasets that can be relevant in discriminative clustering.

Models

  • gemclus.linear.RIM, gemclus.linear.KernelRIM (Krause, Perona and Gomes [3])

  • gemclus.sparse.SparseLinearMI (França, Rizzo and Vogelstein [5])

  • gemclus.tree.Kauri (Ohl et al. [11])

  • gemclus.sparse.SparseMLPModel (Ohl et al. [10], Lemhadri et al. [8])

  • gemclus.sparse.SparseLinearModel (Ohl et al. [10])

Objective functions

  • gemclus.gemini.WassersteinGEMINI, gemclus.gemini.TVGEMINI, gemclus.gemini.HellingerGEMINI (Ohl et al. [9])

  • gemclus.gemini.MMDGEMINI (Ohl et al. [9], [7])

  • gemclus.gemini.MI (Bridle, Heading and MacKay [1])

  • gemclus.gemini.ChiSquareGEMINI (Sugiyama et al. [4])

Datasets

  • gemclus.data.gstm (Ohl et al. [9])

  • gemclus.data.celeux_one, gemclus.data.celeux_two (Celeux et al. [6])

Basic examples

We provide some basic examples in the Example gallery, including the clustering of simple distributions and how to perform feature selection using the sparse models from gemclus.sparse.
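As a taste of the gallery, a minimal workflow might look as follows, assuming the estimators follow scikit-learn fit/predict conventions and that gemclus.linear.RIM accepts n_clusters and random_state parameters (both are assumptions to verify against the examples and the API reference):

    import numpy as np
    from sklearn.datasets import make_blobs
    from gemclus.linear import RIM

    # Three well-separated Gaussian blobs as a toy dataset
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Regularised logistic regression clustering; n_clusters and fit_predict
    # follow scikit-learn conventions and are assumptions to verify
    model = RIM(n_clusters=3, random_state=0)
    labels = model.fit_predict(X)
    print(np.bincount(labels))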

References