Scoring any model with GEMINI
This example shows how to score the predictions of another model using GEMINI. The goal here is not to perform clustering but to evaluate models that have already been trained.
import numpy as np
from sklearn import datasets, preprocessing, linear_model, naive_bayes
from gemclus import gemini
Load a simple real dataset
X, y = datasets.load_breast_cancer(return_X_y=True)
# Preprocess this dataset
X = preprocessing.RobustScaler().fit_transform(X)
Train two supervised models
We train two different models on the breast cancer dataset.
# The first model is a simple logistic regression with l2 penalty
clf1 = linear_model.LogisticRegression(random_state=0).fit(X, y)
p_y_given_x_1 = clf1.predict_proba(X)
# The second model is naive Bayes using Gaussian hypotheses on the data
clf2 = naive_bayes.GaussianNB().fit(X, y)
p_y_given_x_2 = clf2.predict_proba(X)
Scoring with GEMINI
We can now score the clustering performance of both models with GEMINI.
# Let's start with the WassersteinGEMINI (one-vs-all) and the Euclidean distance
wasserstein_scoring = gemini.WassersteinGEMINI(metric="euclidean")
# We need to precompute the affinity matching this Wasserstein (will be the Euclidean metric here)
affinity = wasserstein_scoring.compute_affinity(X)
clf1_score = wasserstein_scoring.evaluate(p_y_given_x_1, affinity)
clf2_score = wasserstein_scoring.evaluate(p_y_given_x_2, affinity)
print("Wasserstein OvA (Euclidean):")
print(f"\t=>Logistic regression: {clf1_score:.3f}")
print(f"\t=>Naive Bayes: {clf2_score:.3f}")
Wasserstein OvA (Euclidean):
=>Logistic regression: 2.878
=>Naive Bayes: 3.005
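For intuition about the affinity used above, here is a minimal sketch that builds a pairwise Euclidean distance matrix by hand with NumPy on a toy dataset. It assumes that an affinity of this kind is an n-by-n matrix of distances between samples; whether `compute_affinity` returns exactly this matrix is an assumption, and the helper `pairwise_euclidean` and the array `X_toy` are hypothetical names introduced only for illustration.

```python
import numpy as np


def pairwise_euclidean(X):
    """Hypothetical helper: n x n matrix of Euclidean distances between rows of X."""
    # Broadcast to shape (n, n, d): the difference between every pair of rows
    diff = X[:, None, :] - X[None, :, :]
    # Sum squared differences over the feature axis, then take the square root
    return np.sqrt((diff ** 2).sum(axis=-1))


# Toy data: three points on a line, spaced 5 units apart
X_toy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(X_toy)
print(D)
```

The matrix is symmetric with a zero diagonal, so each entry only depends on how far apart two samples are in feature space and not on their labels.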
Supervised Scoring with GEMINI
By replacing the Euclidean distance with a label-informed distance, we obtain a supervised metric.
We now specify that the metric is precomputed instead.
wasserstein_scoring = gemini.WassersteinGEMINI(metric="precomputed")
# So, we precompute a distance where samples have distance 0 if they share the same label, 1 otherwise
y_one_hot = np.eye(2)[y]
precomputed_distance = 1 - np.matmul(y_one_hot, y_one_hot.T)
clf1_score = wasserstein_scoring.evaluate(p_y_given_x_1, precomputed_distance)
clf2_score = wasserstein_scoring.evaluate(p_y_given_x_2, precomputed_distance)
print("Wasserstein OvA (Supervised):")
print(f"\t=>Logistic regression: {clf1_score:.3f}")
print(f"\t=>Naive Bayes: {clf2_score:.3f}")
Wasserstein OvA (Supervised):
=>Logistic regression: 0.431
=>Naive Bayes: 0.403
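To make the label-informed distance concrete, the following small sketch applies the same one-hot construction to four toy labels. The array `y_toy` is a hypothetical stand-in for the dataset labels; only NumPy is used.

```python
import numpy as np

# Hypothetical toy labels for four samples, two classes
y_toy = np.array([0, 1, 1, 0])

# Indexing the identity matrix by label gives one one-hot row per sample
y_one_hot = np.eye(2)[y_toy]

# The inner product of two one-hot rows is 1 when the labels match and 0 otherwise,
# so subtracting it from 1 gives distance 0 for same-label pairs and 1 for the rest
label_distance = 1 - y_one_hot @ y_one_hot.T
print(label_distance)
```

Samples 0 and 3 share label 0 and samples 1 and 2 share label 1, so those pairs are at distance 0 while every cross-label pair is at distance 1.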
Total running time of the script: (0 minutes 0.322 seconds)