GemClus API

The GEMINI-clustering package (GemClus) currently contains simple MLP and logistic regression models for clustering on all features, as well as sparsity-constrained variants of these models.

Scoring with GEMINI

The following classes implement the basic GEMINIs for scoring and evaluating any conditional clustering distribution; a short usage sketch follows the list below.

gemini.MMDGEMINI([ovo, kernel, ...])

Implements the one-vs-all and one-vs-one MMD GEMINI.

gemini.WassersteinGEMINI([ovo, metric, ...])

Implements the one-vs-all and one-vs-one Wasserstein GEMINI.

gemini.MI([epsilon])

Implements the classical mutual information between the cluster conditional probabilities and the complete data probabilities.

gemini.KLGEMINI([ovo, epsilon])

Implements the one-vs-all and one-vs-one KL GEMINI.

gemini.TVGEMINI([ovo, epsilon])

Implements the one-vs-all and one-vs-one Total Variation distance GEMINI.

gemini.HellingerGEMINI([ovo, epsilon])

Implements the one-vs-all and one-vs-one Squared Hellinger distance GEMINI.
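
Below is a minimal sketch of how a GEMINI objective can be combined with one of the clustering models from the sections that follow. The constructor arguments come from the summaries above; the scikit-learn-style `fit_predict` call and the `"linear"` kernel value are assumptions about the interface rather than facts documented in this listing.

```python
import numpy as np
from gemclus.gemini import MMDGEMINI
from gemclus.mlp import MLPModel

# Toy data for illustration only.
X = np.random.default_rng(0).normal(size=(200, 4))

# One-vs-one MMD GEMINI; `ovo` and `kernel` appear in the constructor summary above,
# but the accepted kernel names (here "linear") are an assumption.
objective = MMDGEMINI(ovo=True, kernel="linear")

# Generic MLP clustering model trained by maximising the chosen GEMINI.
model = MLPModel(n_clusters=3, gemini=objective)
labels = model.fit_predict(X)  # assumed scikit-learn-style API
```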

Clustering models

Dense models

These models use standard architectures, such as a logistic regression or a one-hidden-layer neural network, as the clustering distribution; a usage sketch follows the list below.

linear.LinearModel([n_clusters, gemini, ...])

Implementation of a logistic regression as a clustering distribution \(p(y|x)\).

linear.LinearMMD([n_clusters, max_iter, ...])

Implementation of the maximisation of the MMD GEMINI using a logistic regression as a clustering distribution \(p(y|x)\).

linear.LinearWasserstein([n_clusters, ...])

Implementation of the maximisation of the Wasserstein GEMINI using a logistic regression as a clustering distribution \(p(y|x)\).

linear.RIM([n_clusters, max_iter, ...])

Implementation of the maximisation of the classical mutual information using a logistic regression with an \(\ell_2\) penalty on the weights.

linear.KernelRIM([n_clusters, max_iter, ...])

Implementation of the maximisation of the classical mutual information using a kernelised version of the logistic regression with an \(\ell_2\) penalty on the weights.

mlp.MLPModel([n_clusters, gemini, max_iter, ...])

Implementation of a two-layer neural network as a clustering distribution \(p(y|x)\).

mlp.MLPMMD([n_clusters, max_iter, ...])

Implementation of the maximisation of the MMD GEMINI using a two-layer neural network as a clustering distribution \(p(y|x)\).

mlp.MLPWasserstein([n_clusters, max_iter, ...])

Implementation of the maximisation of the Wasserstein GEMINI using a two-layer neural network as a clustering distribution \(p(y|x)\).
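
As a rough usage sketch, the dense models can be fitted like scikit-learn estimators. The constructor arguments are taken from the summaries above; the `fit_predict` and `predict_proba` calls assume the usual scikit-learn conventions and may differ from the actual API.

```python
import numpy as np
from gemclus.linear import LinearMMD
from gemclus.mlp import MLPMMD

X = np.random.default_rng(1).normal(size=(300, 5))

linear_model = LinearMMD(n_clusters=3, max_iter=100)  # logistic regression + MMD GEMINI
mlp_model = MLPMMD(n_clusters=3, max_iter=100)        # one-hidden-layer MLP + MMD GEMINI

linear_labels = linear_model.fit_predict(X)
mlp_labels = mlp_model.fit_predict(X)

# Both models define p(y|x), so soft assignments should be reachable through predict_proba.
soft_assignments = mlp_model.predict_proba(X)
```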

Nonparametric models

These models have parameters that are assigned to the data samples according to their indices, so the parameters do not depend on the location of the samples. As a result, these models can represent any decision boundary and have no hyperparameters. However, the underlying distribution cannot be used to predict on unseen samples. A usage sketch follows the list below.

nonparametric.CategoricalModel([n_clusters, ...])

The CategoricalModel is a nonparametric model where each sample is directly assigned a probability vector as its conditional clustering distribution.

nonparametric.CategoricalMMD([n_clusters, ...])

The CategoricalMMD is a nonparametric model where each sample is directly assigned a probability vector as its conditional clustering distribution, trained by maximising the MMD GEMINI.

nonparametric.CategoricalWasserstein([...])

The CategoricalWasserstein is a nonparametric model where each sample is directly assigned a probability vector as its conditional clustering distribution, trained by maximising the Wasserstein GEMINI.
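
A short hedged sketch of the nonparametric models: since each parameter vector is tied to a training index, clustering is only meaningful on the data used for fitting. The `fit_predict` call below assumes a scikit-learn-style interface.

```python
import numpy as np
from gemclus.nonparametric import CategoricalMMD

X = np.random.default_rng(2).normal(size=(150, 2))

model = CategoricalMMD(n_clusters=2)
labels = model.fit_predict(X)  # assignments for the fitted samples only
# Predicting on unseen samples is not supported, as explained above: the parameters
# are indexed by sample position, not by sample location.
```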

Sparse models

These models can be trained to progressively remove features from the conditional cluster distribution. They are useful for selecting a subset of features, which may enhance the interpretability of the clustering; a usage sketch follows the list below.

sparse.SparseLinearModel([n_clusters, ...])

Sparse version of the linear clustering model, i.e. a logistic regression clustering distribution \(p(y|x)\) with feature selection.

sparse.SparseLinearMI([n_clusters, groups, ...])

This is the Sparse version of the logistic regression trained with mutual information for clustering.

sparse.SparseLinearMMD([n_clusters, groups, ...])

Trains a logistic regression with sparse parameters using the MMD GEMINI.

sparse.SparseMLPModel([n_clusters, gemini, ...])

Implementation of a neural network as a clustering distribution \(p(y|x)\) with variable selection.

sparse.SparseMLPMMD([n_clusters, groups, ...])

This is the sparse version of the MLP MMD model, combining feature selection with a two-layer neural network trained using the MMD GEMINI.
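
A hedged sketch of the sparse models: they are fitted like the dense ones, with the `groups` argument (shown in the summaries above) allowing features to be removed jointly. The `fit_predict` call assumes a scikit-learn-style interface and is not guaranteed by this listing.

```python
import numpy as np
from gemclus.sparse import SparseLinearMMD

X = np.random.default_rng(3).normal(size=(200, 10))

# Sparse logistic regression trained with the MMD GEMINI; as training progresses,
# uninformative features are progressively removed from the clustering distribution.
model = SparseLinearMMD(n_clusters=3)
labels = model.fit_predict(X)  # assumed scikit-learn-style API
```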

Tree models

We propose clustering methods based on tree architectures, so that decision rules are constructed at the same time as the clustering is learnt; a usage sketch follows the list below.

tree.Kauri([max_clusters, max_depth, ...])

Implementation of the KMeans As Unsupervised Reward Ideal (KAURI) tree algorithm.

tree.Douglas([n_clusters, gemini, n_cuts, ...])

Implementation of the DNDT Optimised Using GEMINI Leveraging Apprised Splits (DOUGLAS) tree algorithm.

The following functions are intended to help understand the structure of the above models by printing their inner rules.

tree.print_kauri_tree(kauri_tree[, ...])

Prints the binary tree structure of a trained KAURI tree.
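
A minimal sketch combining the KAURI tree with the printing helper listed above; the constructor arguments come from the summary, while the `fit_predict` call assumes a scikit-learn-style interface.

```python
import numpy as np
from gemclus.tree import Kauri, print_kauri_tree

X = np.random.default_rng(4).normal(size=(100, 3))

tree = Kauri(max_clusters=3, max_depth=4)
labels = tree.fit_predict(X)  # assumed scikit-learn-style API

# Display the binary decision rules learnt by the fitted tree.
print_kauri_tree(tree)
```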

Generic models

This class provides the skeleton for creating any model that must be trained with a GEMINI.

DiscriminativeModel([n_clusters, gemini, ...])

This is the base class to derive from in order to create a GEMINI MLP or linear clustering model.

Constraints

This function decorates GEMINI models to give further guidance on the desired clustering; a usage sketch follows below.

add_mlcl_constraint(gemini_model[, ...])

Adds must-link and/or cannot-link constraints to a discriminative clustering model.
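
A heavily hedged sketch of constrained clustering: only the `add_mlcl_constraint(gemini_model, ...)` call itself appears in the listing above. The import path, the `must_link` / `cannot_link` keyword names and the pair format are assumptions made for illustration.

```python
import numpy as np
from gemclus import add_mlcl_constraint  # assumed import path
from gemclus.linear import LinearMMD

X = np.random.default_rng(5).normal(size=(100, 4))

model = LinearMMD(n_clusters=2)

# Hypothetical keyword names: pairs of sample indices that must (resp. must not)
# end up in the same cluster.
constrained_model = add_mlcl_constraint(model, must_link=[(0, 1)], cannot_link=[(2, 3)])
labels = constrained_model.fit_predict(X)  # assumed scikit-learn-style API
```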

Dataset generation

This module contains simple functions for generating synthetic datasets; a usage sketch follows the list below.

data.draw_gmm(n, loc, scale, pvals[, ...])

Returns \(n\) samples drawn from a mixture of Gaussian distributions.

data.multivariate_student_t(n, loc, scale[, ...])

Draws \(n\) samples from a multivariate Student-t distribution.

data.gstm([n, alpha, df, random_state])

Reproduces the Gaussian-Student Mixture dataset from the GEMINI article.

data.celeux_one([n, p, mu, random_state])

Draws \(n\) samples from a Gaussian mixture with 3 isotropic components of respective means -1, 0 and 1 over 5 dimensions, scaled by \(\mu\).

data.celeux_two([n, random_state])

Draws samples from a mixture of 4 Gaussian distributions in 2D, with additional variables linearly dependent on the informative variables and non-informative noisy variables.
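
To close, a sketch of a small end-to-end pipeline built on the data helpers. The positional arguments of `draw_gmm` follow the signature above; the assumption that it returns both the samples and their true component labels, the `random_state` keyword and the covariance format are not guaranteed by this listing.

```python
import numpy as np
from gemclus.data import draw_gmm
from gemclus.linear import LinearMMD

means = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])  # loc: one mean per component
covariances = np.array([np.eye(2)] * 3)                   # scale: assumed one covariance per component
proportions = np.array([0.4, 0.3, 0.3])                   # pvals: mixing proportions

# Assumed to return the samples and their true component labels.
X, y = draw_gmm(500, means, covariances, proportions, random_state=0)

labels = LinearMMD(n_clusters=3).fit_predict(X)  # cluster the synthetic data
```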