.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/feature_selection/plot_feature_selection_linear.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_feature_selection_plot_feature_selection_linear.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_feature_selection_plot_feature_selection_linear.py:


=================================================================
Feature selection using the Sparse MMD OvO (Logistic regression)
=================================================================

In this example, we ask the :class:`gemclus.sparse.SparseLinearMMD` to perform a path where the regularisation
penalty is progressively increased until all features but 2 are discarded. The model then keeps the best weights
with the minimum number of features that still maintain a GEMINI score close to 90% of the maximum GEMINI value
encountered during the path.

The dataset consists of 3 isotropic Gaussian distributions (so 3 clusters to find) in 5d with 20 noisy variables.
Thus, the optimal solution should find that only 5 features are relevant and sufficient to get the correct clustering.

.. GENERATED FROM PYTHON SOURCE LINES 14-22

.. code-block:: Python

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn import metrics

    from gemclus.data import celeux_one
    from gemclus.sparse import SparseLinearMMD

.. GENERATED FROM PYTHON SOURCE LINES 23-25

Load a simple synthetic dataset
--------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 27-31

.. code-block:: Python

    # Generate samples that are simple to separate, plus p additional independent noisy variables
    X, y = celeux_one(n=300, p=20, mu=1.7, random_state=0)

.. GENERATED FROM PYTHON SOURCE LINES 32-36

Train the model
--------------------------------------------------------------

Create the GEMINI clustering model (just a logistic regression) and call the .path method to iteratively select
features through gradient descent.

.. GENERATED FROM PYTHON SOURCE LINES 38-44

.. code-block:: Python

    clf = SparseLinearMMD(random_state=0, alpha=1, ovo=True)

    # Perform a path search to progressively eliminate features
    best_weights, geminis, penalties, alphas, n_features = clf.path(X)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/circleci/.local/lib/python3.10/site-packages/sklearn/base.py:474: FutureWarning: `BaseEstimator._validate_data` is deprecated in 1.6 and will be removed in 1.7. Use `sklearn.utils.validation.validate_data` instead. This function becomes public and is part of the scikit-learn developer API.
      warnings.warn(
    /home/circleci/.local/lib/python3.10/site-packages/sklearn/base.py:474: FutureWarning: `BaseEstimator._validate_data` is deprecated in 1.6 and will be removed in 1.7. Use `sklearn.utils.validation.validate_data` instead. This function becomes public and is part of the scikit-learn developer API.
      warnings.warn(
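Before inspecting the path diagnostics, the selection rule can be made concrete: among all steps of the path whose
GEMINI score stays above 90% of the best score encountered, keep the step with the fewest remaining features. The
snippet below is only an illustrative sketch of that rule written against the arrays returned by the .path call; it
assumes that ``geminis`` and ``n_features`` are aligned step by step, and it is not the library's internal
implementation.

.. code-block:: Python

    # Illustrative sketch of the 90%-of-maximum selection rule (assumes the
    # ``geminis`` and ``n_features`` arrays returned by the path are aligned
    # step by step; this is not the library's internal code).
    gemini_path = np.asarray(geminis)
    feature_counts = np.asarray(n_features)

    threshold = 0.9 * gemini_path.max()
    admissible = np.flatnonzero(gemini_path >= threshold)
    chosen = admissible[np.argmin(feature_counts[admissible])]

    print(f"Step {chosen}: {feature_counts[chosen]} features kept, "
          f"GEMINI {gemini_path[chosen]:.3f} (threshold {threshold:.3f})")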
.. GENERATED FROM PYTHON SOURCE LINES 45-49

Path results
------------

Take a look at how the GEMINI score decreased as the penalty increased.

.. GENERATED FROM PYTHON SOURCE LINES 49-63

.. code-block:: Python

    print(f"The model score is {clf.score(X)}")
    print(f"Top gemini score was {max(geminis)}, which sets the selection threshold at {0.9 * max(geminis)}")

    # Highlight the number of selected features and the GEMINI score along increasing alphas
    plt.title("GEMINI score depending on $\\alpha$")
    plt.plot(alphas, geminis)
    plt.xlabel("$\\alpha$")
    plt.ylabel("GEMINI score")
    plt.ylim(0, max(geminis) + 0.5)
    plt.show()

    # We expect the 5 first features
    print(f"Selected features: {np.where(np.linalg.norm(best_weights[0], axis=1, ord=2) != 0)}")

.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_linear_001.png
   :alt: GEMINI score depending on $\alpha$
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_linear_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    The model score is 2.825972824049516
    Top gemini score was 3.024991249149671, which sets the selection threshold at 2.722492124234704
    Selected features: (array([0, 1, 2, 3, 4]),)

.. GENERATED FROM PYTHON SOURCE LINES 64-66

Final Clustering
-----------------

.. GENERATED FROM PYTHON SOURCE LINES 68-72

.. code-block:: Python

    # Now, evaluate the cluster predictions
    y_pred = clf.predict(X)
    print(f"ARI score is {metrics.adjusted_rand_score(y_pred, y)}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ARI score is 0.83290627605772

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 5.635 seconds)


.. _sphx_glr_download_auto_examples_feature_selection_plot_feature_selection_linear.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_feature_selection_linear.ipynb <plot_feature_selection_linear.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_feature_selection_linear.py <plot_feature_selection_linear.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_feature_selection_linear.zip <plot_feature_selection_linear.zip>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_