gemclus.tree.Douglas

class gemclus.tree.Douglas(n_clusters=3, gemini='wasserstein_ova', n_cuts=1, feature_mask=None, temperature=0.1, max_iter=100, batch_size=None, solver='adam', learning_rate=0.01, verbose=False, random_state=None)

Implementation of the DOUGLAS tree algorithm (DNDTs Optimised Using GEMINI Leveraging Apprised Splits). This model learns clusters by optimising learnable parameters that perform feature-wise soft binning and recombine the resulting bins into a single cluster prediction. The parameters are optimised to maximise a generalised mutual information (GEMINI) score.

Parameters:
n_clusters: int, default=3

The number of clusters to form as well as the number of output neurons in the neural network.

gemini: str, GEMINI instance or None, default=“wasserstein_ova”

GEMINI objective used to train this discriminative model. Can be “mmd_ova”, “mmd_ovo”, “wasserstein_ova”, “wasserstein_ovo”, “mi” or other GEMINI available in gemclus.gemini.AVAILABLE_GEMINI. Default GEMINIs involve the Euclidean metric or linear kernel. To incorporate custom metrics, a GEMINI can also be passed as an instance. If None, the GEMINI will be the MMD OvA.

n_cuts: int, default=1

The number of cuts to consider per feature in the soft binning function of the DNDT.

feature_mask: array of bool of shape (n_features,), default=None

A boolean vector indicating whether a feature should be considered or not among splits. If None, all features are considered during training.

temperature: float, default=0.1

The temperature controls the relative importance of the logits in each leaf's soft binning. A high temperature smooths out the differences in probability, whereas a low temperature produces distributions closer to Dirac delta distributions.
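To make the roles of n_cuts and temperature concrete, here is a minimal pure-Python sketch of the soft binning function as described in the DNDT literature: a scalar feature is mapped to n_cuts + 1 bin probabilities through a temperature-scaled softmax. The helper names (softmax, soft_binning) are hypothetical, and whether gemclus computes its bins in exactly this way is an assumption.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_binning(x, cut_points, temperature):
    """Soft-assign a scalar x to len(cut_points) + 1 bins (hypothetical helper)."""
    cuts = sorted(cut_points)
    # Weights 1, 2, ..., n + 1 and biases 0, -c1, -(c1 + c2), ...
    weights = [k + 1 for k in range(len(cuts) + 1)]
    biases = [0.0]
    for c in cuts:
        biases.append(biases[-1] - c)
    logits = [(w * x + b) / temperature for w, b in zip(weights, biases)]
    return softmax(logits)

# One cut at 0.5 (n_cuts=1): x = 0.8 lies in the second bin.
sharp = soft_binning(0.8, [0.5], temperature=0.1)    # close to one-hot
smooth = soft_binning(0.8, [0.5], temperature=10.0)  # close to uniform
```

With the low temperature almost all the probability mass lands in the bin that contains x, while the high temperature spreads it nearly evenly across both bins.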

max_iter: int, default=100

The number of epochs for training the model parameters.

batch_size: int, default=None

The number of samples per batch during an epoch. If set to None, all samples will be considered in a single batch.
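As an illustration of the batch_size semantics described above, the following sketch splits sample indices into consecutive batches and falls back to one full batch when batch_size is None. The function iter_batches is hypothetical; any shuffling the library may perform between epochs is not covered here.

```python
def iter_batches(n_samples, batch_size=None):
    """Yield index ranges covering all samples (hypothetical helper)."""
    if batch_size is None:
        # A single batch containing every sample.
        batch_size = n_samples
    for start in range(0, n_samples, batch_size):
        yield range(start, min(start + batch_size, n_samples))

batch_lengths = [len(batch) for batch in iter_batches(10, batch_size=4)]
full_batch = list(iter_batches(10))
```

The last batch is simply shorter when batch_size does not divide the number of samples.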

solver: {‘sgd’, ‘adam’}, default=‘adam’

The solver for weight optimisation.

  • ‘sgd’ refers to stochastic gradient descent.

  • ‘adam’ refers to a stochastic gradient-based optimiser proposed by Diederik Kingma and Jimmy Ba.

learning_rate: float, default=1e-2

The learning rate hyperparameter for the optimiser’s update rule.
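The difference between the two solvers, and the place where learning_rate enters, can be sketched on a single scalar parameter. This is a generic illustration of the textbook update rules, not the library's optimiser classes; the function names and the Adam constants (beta1=0.9, beta2=0.999, eps=1e-8, the defaults suggested in the Adam paper) are assumptions as far as gemclus is concerned.

```python
import math

def sgd_step(param, grad, learning_rate):
    # Plain gradient descent: step proportional to the gradient.
    return param - learning_rate * grad

def adam_step(param, grad, state, learning_rate,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps running averages of the gradient and its square.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialisation.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - learning_rate * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
after_sgd = sgd_step(1.0, grad=4.0, learning_rate=0.01)
after_adam = adam_step(1.0, grad=4.0, state=state, learning_rate=0.01)
```

On the first step Adam moves by roughly learning_rate regardless of the gradient's magnitude, whereas the SGD step scales with the gradient.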

verbose: bool, default=False

Whether to print progress messages to stdout.

random_state: int, RandomState instance, default=None

Determines random number generation for feature exploration. Pass an int for reproducible results across multiple function calls.

Attributes:
optimiser_: `AdamOptimizer` or `SGDOptimizer`

The optimisation algorithm used for training depending on the chosen solver parameter.

labels_: ndarray of shape (n_samples,)

The labels that were assigned to the samples passed to the fit() method.

n_iter_: int

The number of iterations that the model took to converge.

__init__(n_clusters=3, gemini='wasserstein_ova', n_cuts=1, feature_mask=None, temperature=0.1, max_iter=100, batch_size=None, solver='adam', learning_rate=0.01, verbose=False, random_state=None)

find_active_points(X)

Calculates the list of cut points that are considered active for a Douglas tree and some data X. A cut point is active if its value falls within the bounds of its matching feature in X.

Active points can be used for finding features that actively contributed to the clustering.

Parameters:
X: {array-like, sparse matrix} of shape (n_samples, n_features)

Test samples.

Returns:
active_points: list

A list containing the integer indices of features for which the Douglas model has active cut points.
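The definition of an active cut point can be sketched without the library as follows. The function find_active_features and its cut_points argument (one list of cut values per feature) are hypothetical, and treating the feature bounds as inclusive is an assumption.

```python
def find_active_features(X, cut_points):
    """Return indices of features with at least one cut inside the data range.

    X is a list of rows; cut_points[j] lists the cut values of feature j.
    """
    active = []
    for j, cuts in enumerate(cut_points):
        column = [row[j] for row in X]
        low, high = min(column), max(column)
        # A cut is active if it falls within the observed bounds of its feature.
        if any(low <= c <= high for c in cuts):
            active.append(j)
    return active

X = [[0.0, 10.0], [1.0, 12.0], [2.0, 11.0]]
cut_points = [[1.5], [20.0]]  # feature 1's cut lies outside [10, 12]
active = find_active_features(X, cut_points)
```

Here only feature 0 is reported, since a cut outside the range of its feature sends every sample to the same bin and cannot separate them.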

fit(X, y=None)

Compute GEMINI clustering.

Parameters:
X: {array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster.

y: ndarray of shape (n_samples, n_samples), default=None

Use this parameter to give a precomputed affinity metric if the option “precomputed” was passed during construction. Otherwise, it is ignored and is present only for API consistency by convention.

Returns:
self: object

Fitted estimator.

fit_predict(X, y=None)

Compute GEMINI clustering and return the predicted clusters.

Parameters:
X: {array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster.

y: ndarray of shape (n_samples, n_samples), default=None

Use this parameter to give a precomputed affinity metric if the option “precomputed” was passed during construction. Otherwise, it is ignored and is present only for API consistency by convention.

Returns:
y_pred: ndarray of shape (n_samples,)

Vector containing the cluster label for each sample.

get_gemini()

Initialise a gemclus.GEMINI instance that will be used to train the model.

Returns:
gemini: a GEMINI instance

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing: MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params: dict

Parameter names mapped to their values.

predict(X)

Return the cluster membership of samples. This can only be called after the model has been fit to some data.

Parameters:
X: {array-like, sparse matrix} of shape (n_samples, n_features)

The input samples.

Returns:
y: ndarray of shape (n_samples,)

Vector containing the cluster label for each sample.

predict_proba(X)

Probability estimates that are the output of the neural network p(y|x). The returned estimates for all classes are ordered by the label of classes.

Parameters:
X: {array-like, sparse matrix} of shape (n_samples, n_features)

Vector to be scored, where n_samples is the number of samples and n_features is the number of features.

Returns:
T: array-like of shape (n_samples, n_clusters)

Returns the probability of the sample for each cluster in the model.
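A common way to turn such a probability matrix into hard cluster labels is to take the most probable cluster per row. The sketch below shows this on a hand-written matrix; the helper hard_labels is hypothetical, and the assumption that predict agrees with the row-wise argmax of predict_proba is not stated in this reference.

```python
def hard_labels(proba):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in proba]

# Two samples, three clusters (illustrative values, not model output).
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
labels = hard_labels(proba)
```

Each row sums to one, and the resulting labels are integers in the range 0 to n_clusters - 1.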

score(X, y=None)

Return the value of the GEMINI evaluated on the given test data.

Parameters:
X: {array-like, sparse matrix} of shape (n_samples, n_features)

Test samples.

y: ndarray of shape (n_samples, n_samples), default=None

Use this parameter to give a precomputed affinity metric if the option “precomputed” was passed during construction. Otherwise, it is ignored and is present only for API consistency by convention.

Returns:
score: float

GEMINI evaluated on the output of self.predict(X).

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params: dict

Estimator parameters.

Returns:
self: estimator instance

Estimator instance.
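The <component>__<parameter> naming used for nested objects can be illustrated with a small stand-alone sketch that splits a parameter name at the first double underscore. The function split_param and the step name "clustering" are hypothetical; the sketch mimics only the naming convention, not the estimator's set_params itself.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter).

    A name without a double underscore targets the estimator itself,
    which is signalled here by a component of None.
    """
    component, delimiter, parameter = name.partition("__")
    if not delimiter:
        return None, name
    return component, parameter

nested = split_param("clustering__n_clusters")
direct = split_param("temperature")
```

Only the first double underscore is split, so deeper nesting such as a__b__c leaves b__c to be resolved by component a.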

Examples using gemclus.tree.Douglas

Building a differentiable unsupervised tree: DOUGLAS