.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/feature_selection/plot_feature_selection_logreg_mi.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_feature_selection_plot_feature_selection_logreg_mi.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_feature_selection_plot_feature_selection_logreg_mi.py:

===================================================================
Feature selection using the Sparse Linear MI (Logistic regression)
===================================================================

In this example, we ask the :class:`gemclus.sparse.SparseLinearMI` model to perform a path
where the regularisation penalty is progressively increased until all but 2 features are
discarded. The model then keeps the best weights with the minimum number of features that
maintains a GEMINI score of at least 50% of the maximum GEMINI value encountered during
the path.

Unlike the sparse MMD model, this one is not guided by a specific kernel in the data space.
That is why the acceptance threshold for the best score is lowered to 50%, instead of the
90% used by the other models.

A very similar model can be found in `Discriminative Clustering and Feature Selection for
Brain MRI Segmentation` proposed by Kong et al. (2014).

The dataset consists of 3 isotropic Gaussian distributions (so 3 clusters to find) in 5
informative dimensions, with 20 additional noisy variables. The optimal solution should
therefore find that only the 5 informative features are relevant and sufficient to recover
the correct clustering.

.. GENERATED FROM PYTHON SOURCE LINES 18-26

.. code-block:: Python

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn import metrics, decomposition

    from gemclus.data import celeux_one
    from gemclus.sparse import SparseLinearMI

.. GENERATED FROM PYTHON SOURCE LINES 27-29

Load a simple synthetic dataset
--------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 31-35

.. code-block:: Python

    # Generate samples that are simple to separate, with p additional independent noisy variables
    X, y = celeux_one(n=300, p=20, mu=1.7, random_state=0)

.. GENERATED FROM PYTHON SOURCE LINES 36-40

Train the model
--------------------------------------------------------------

Create the GEMINI clustering model (just a logistic regression) and call the `.path` method
to iteratively select features through gradient descent.

.. GENERATED FROM PYTHON SOURCE LINES 42-51

.. code-block:: Python

    clf = SparseLinearMI(random_state=0, alpha=1)

    # Perform a path search that progressively eliminates features
    best_weights, geminis, penalties, alphas, n_features = clf.path(X, keep_threshold=0.5)

    # We expect the first 5 features to be selected
    print(f"Selected features: {np.where(np.linalg.norm(best_weights[0], axis=1, ord=2) != 0)}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Selected features: (array([ 0,  1,  2,  3,  4,  6, 11]),)
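Beyond the selected support, the values returned by the path search can be used to inspect
the trade-off between sparsity and clustering quality. The following is a minimal diagnostic
sketch, assuming the ``geminis`` and ``n_features`` sequences returned by `.path` align step
by step along the regularisation path.

.. code-block:: Python

    # Plot the GEMINI score against the number of surviving features at each
    # step of the path (assumes step-wise alignment of the `geminis` and
    # `n_features` sequences returned by `.path`).
    plt.plot(n_features, geminis, marker="o")
    plt.xlabel("Number of selected features")
    plt.ylabel("GEMINI score")
    plt.title("GEMINI score along the regularisation path")
    plt.show()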
.. GENERATED FROM PYTHON SOURCE LINES 52-54

Final Clustering
-----------------

.. GENERATED FROM PYTHON SOURCE LINES 56-70

.. code-block:: Python

    # Now, evaluate the cluster predictions
    y_pred = clf.predict(X)
    print(f"ARI score is {metrics.adjusted_rand_score(y_pred, y)}")

    # Let's run a small PCA for visualisation purposes, distinguishing true labels
    # (marker shapes) from cluster labels (colours)
    X_pca = decomposition.PCA(n_components=2).fit_transform(X)

    for k in range(3):
        class_indices, = np.where(y == k)
        plt.scatter(X_pca[class_indices, 0], X_pca[class_indices, 1], c=y_pred[class_indices],
                    marker=["+", "x", "o"][k])
    plt.axis("off")
    plt.title("PCA of celeux 1 dataset clustered with an MI-trained LASSO")
    plt.show()

.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_logreg_mi_001.png
   :alt: PCA of celeux 1 dataset clustered with an MI-trained LASSO
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_logreg_mi_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ARI score is 0.458336976163333

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 2.134 seconds)

.. _sphx_glr_download_auto_examples_feature_selection_plot_feature_selection_logreg_mi.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_feature_selection_logreg_mi.ipynb <plot_feature_selection_logreg_mi.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_feature_selection_logreg_mi.py <plot_feature_selection_logreg_mi.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_feature_selection_logreg_mi.zip <plot_feature_selection_logreg_mi.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
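Since only the first five variables of :func:`gemclus.data.celeux_one` are informative by
construction, the recovered support can also be checked against this ground truth. A minimal
sketch, reusing ``best_weights`` from the path search above:

.. code-block:: Python

    # Compare the recovered support with the ground truth: by construction of
    # `celeux_one`, only the first 5 variables are informative; the remaining
    # p=20 variables are independent noise.
    selected = set(np.where(np.linalg.norm(best_weights[0], axis=1, ord=2) != 0)[0])
    informative = set(range(5))
    print(f"Informative features kept: {sorted(informative & selected)}")
    print(f"Noisy features kept: {sorted(selected - informative)}")

With the selection reported above, this would confirm that all five informative features
survived the path, along with the two noisy variables 6 and 11.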