.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/feature_selection/plot_feature_selection_linear.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_feature_selection_plot_feature_selection_linear.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_feature_selection_plot_feature_selection_linear.py:


=================================================================
Feature selection using the Sparse MMD OvO (Logistic regression)
=================================================================

In this example, we ask the :class:`gemclus.sparse.SparseLinearMMD` to perform a path where the regularisation
penalty is progressively increased until all features but 2 are discarded. The model then keeps the best weights
with the minimum number of features that still maintain a GEMINI score close to 90% of the maximum GEMINI value
encountered during the path.

The dataset consists of 3 isotropic Gaussian distributions (so 3 clusters to find) in 5d with 20 noisy variables.
Thus, the optimal solution should find that only 5 features are relevant and sufficient to get the correct clustering.

.. GENERATED FROM PYTHON SOURCE LINES 14-22

.. code-block:: Python

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn import metrics

    from gemclus.data import celeux_one
    from gemclus.sparse import SparseLinearMMD

.. GENERATED FROM PYTHON SOURCE LINES 23-25

Load a simple synthetic dataset
--------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 27-31

.. code-block:: Python

    # Generate samples that are simple to separate, plus p additional independent noisy variables
    X, y = celeux_one(n=300, p=20, mu=1.7, random_state=0)

.. GENERATED FROM PYTHON SOURCE LINES 32-36

Train the model
--------------------------------------------------------------

Create the GEMINI clustering model (just a logistic regression) and call the .path method to iteratively select
features through gradient descent.

.. GENERATED FROM PYTHON SOURCE LINES 38-44

.. code-block:: Python

    clf = SparseLinearMMD(random_state=0, alpha=1, ovo=True)

    # Perform a path search to progressively eliminate features
    best_weights, geminis, penalties, alphas, n_features = clf.path(X)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/circleci/.local/lib/python3.10/site-packages/sklearn/base.py:474: FutureWarning: `BaseEstimator._validate_data` is deprecated in 1.6 and will be removed in 1.7. Use `sklearn.utils.validation.validate_data` instead. This function becomes public and is part of the scikit-learn developer API.
      warnings.warn(
    /home/circleci/.local/lib/python3.10/site-packages/sklearn/base.py:474: FutureWarning: `BaseEstimator._validate_data` is deprecated in 1.6 and will be removed in 1.7. Use `sklearn.utils.validation.validate_data` instead. This function becomes public and is part of the scikit-learn developer API.
      warnings.warn(
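Before inspecting the path diagnostics, the selection rule can be made concrete: among all steps of the path whose
GEMINI score stays above 90% of the best score encountered, keep the step with the fewest remaining features. The
snippet below is only an illustrative sketch of that rule written against the arrays returned by the .path call; it
assumes that ``geminis`` and ``n_features`` are aligned step by step, and it is not the library's internal
implementation.

.. code-block:: Python

    # Illustrative sketch of the 90%-of-maximum selection rule (assumes the
    # ``geminis`` and ``n_features`` arrays returned by the path are aligned
    # step by step; this is not the library's internal code).
    gemini_path = np.asarray(geminis)
    feature_counts = np.asarray(n_features)

    threshold = 0.9 * gemini_path.max()
    admissible = np.flatnonzero(gemini_path >= threshold)
    chosen = admissible[np.argmin(feature_counts[admissible])]

    print(f"Step {chosen}: {feature_counts[chosen]} features kept, "
          f"GEMINI {gemini_path[chosen]:.3f} (threshold {threshold:.3f})")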
.. GENERATED FROM PYTHON SOURCE LINES 45-49

Path results
------------

Take a look at how the GEMINI score decreased as the penalty increased.

.. GENERATED FROM PYTHON SOURCE LINES 49-63

.. code-block:: Python

    print(f"The model score is {clf.score(X)}")
    print(f"Top gemini score was {max(geminis)}, which sets the selection threshold at {0.9 * max(geminis)}")

    # Highlight the number of selected features and the GEMINI score along increasing alphas
    plt.title("GEMINI score depending on $\\alpha$")
    plt.plot(alphas, geminis)
    plt.xlabel("$\\alpha$")
    plt.ylabel("GEMINI score")
    plt.ylim(0, max(geminis) + 0.5)
    plt.show()

    # We expect the 5 first features
    print(f"Selected features: {np.where(np.linalg.norm(best_weights[0], axis=1, ord=2) != 0)}")

.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_linear_001.png
   :alt: GEMINI score depending on $\alpha$
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_linear_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    The model score is 2.825972824049516
    Top gemini score was 3.024991249149671, which sets the selection threshold at 2.722492124234704
    Selected features: (array([0, 1, 2, 3, 4]),)

.. GENERATED FROM PYTHON SOURCE LINES 64-66

Final Clustering
-----------------

.. GENERATED FROM PYTHON SOURCE LINES 68-72

.. code-block:: Python

    # Now, evaluate the cluster predictions
    y_pred = clf.predict(X)
    print(f"ARI score is {metrics.adjusted_rand_score(y_pred, y)}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ARI score is 0.83290627605772

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 5.635 seconds)


.. _sphx_glr_download_auto_examples_feature_selection_plot_feature_selection_linear.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_feature_selection_linear.ipynb <plot_feature_selection_linear.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_feature_selection_linear.py <plot_feature_selection_linear.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_feature_selection_linear.zip <plot_feature_selection_linear.zip>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_