#!/usr/bin/env python
# coding: utf-8

# # Principal Component Analysis
#
# We return to some of the previously considered data sets to explore them via principal component analysis (PCA). You can find information about how to use it in scikit-learn [here](https://scikit-learn.org/stable/modules/decomposition.html#pca), and it can be called via [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA). For larger datasets, especially if sparse matrix formats are used, it might be more feasible to use [sklearn.decomposition.TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD), which is computationally more efficient.

# **Note:** For the 3D plots below, we will use an _interactive mode_ of matplotlib. You should be able to run everything smoothly after installing the package _ipympl_. If you are using conda/Anaconda, you can install this package with
#
# `conda install -c conda-forge ipympl`
#
# or by searching for the package `ipympl` in the Anaconda Navigator GUI for your environment.

# ## Iris dataset
#
# First we load the [Iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html?highlight=iris#sklearn.datasets.load_iris), which we are already familiar with.

# In[1]:


from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target


# In[2]:


print(iris.DESCR)


# We recall that the samples in this dataset contain **four** features:

# In[3]:


print(iris.feature_names)


# In[4]:


import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# a small loop for creating a list with the category names
iris.target_class = []
for k in range(len(iris.target)):
    iris.target_class.append(iris.target_names[iris.target[k]])

# for the pairplot, first create a DataFrame
df_iris = pd.DataFrame(iris.data, columns=iris['feature_names'])
df_iris = df_iris.assign(iris_class=iris.target_class)

sns.pairplot(df_iris, hue="iris_class", palette="tab10")
fig = plt.gcf()
fig.set_size_inches(20, 20)


# These are the pairwise scatter plots we created previously.
#
# Is there a better way to visualize the data, without the need for all these pairwise plots?
# Our brain is typically able to visualize at most 3 dimensions well...
#
# Let's use PCA with two principal components.

# In[5]:


from sklearn.decomposition import PCA

n_components = 2
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(12, 12))
for i, target_name in zip([0, 1, 2], iris.target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], cmap='tab10',
                lw=2, label=target_name)
plt.title("PCA of iris dataset")
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.axis([-4, 4, -1.5, 1.5])
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()


# Visually, this does not look very different from some of the scatter plots above, but it **is** conceptually different: in a single 2D plot, we are able to "summarize" all four features.
#
# Let's repeat PCA with **three** principal components.
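# Before doing so, a quick sanity check on the two-component fit above. This is a minimal sketch; it assumes the `pca` object and `X_pca` from the previous cell are still in scope. The projection returned by `fit_transform` should coincide with the centered data multiplied by the transposed principal axes stored in `pca.components_`, and `explained_variance_ratio_` tells us how much of the total variance the two components capture.

# In[ ]:


# sanity check (sketch): the PCA projection is the centered data times the principal axes
X_centered = X - X.mean(axis=0)
X_proj = X_centered @ pca.components_.T      # should reproduce X_pca up to floating-point error
print("Maximum deviation from X_pca:", np.max(np.abs(X_proj - X_pca)))
print("Variance captured by the first two PCs:", np.sum(pca.explained_variance_ratio_))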
# In[6]:


from mpl_toolkits.mplot3d import Axes3D

n_components = 3
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)

# a first, simple (non-interactive) 3D scatter plot of a single class
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[y == 1, 0], X_pca[y == 1, 1], X_pca[y == 1, 2])
plt.show()


# In[7]:


get_ipython().run_line_magic('matplotlib', 'widget')

from mpl_toolkits.mplot3d import Axes3D

n_components = 3
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)

fig = plt.figure()  # (figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')
for i, target_name in zip([0, 1, 2], iris.target_names):
    ax.scatter3D(X_pca[y == i, 0], X_pca[y == i, 1], X_pca[y == i, 2],
                 cmap='tab10', lw=4, label=target_name, marker="o", edgecolors='face')
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
ax.set_zlabel('Third principal component')
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.show()


# This now gives a more flexible visualization that summarizes basically all the information we have available in one plot.
#
# Remember that the coordinates of the data points in this plot correspond to what we called the _principal components_ -- they are the rows of the matrix $\mathbf{X} \mathbf{V}$.
#
# Note also the different scaling of the axes, which is proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ is the $i$-th eigenvalue of the empirical covariance matrix $\mathbf{M} = \frac{1}{n}\mathbf{X}^T \mathbf{X}$. In other words, the scaling is proportional to the square root of the _variance explained by the $i$-th principal component_.

# So, one may ask: how much of the total variance of the data set is explained by the first three principal components?

# In[8]:


vars(pca)


# In[9]:


print("Variance explained by each of the first three PCs:")
pca.explained_variance_


# The ratio of the variance explained by the first three PCs can be calculated as
# $$
# \frac{\sum_{i=1}^3 \lambda_i}{\sum_{i=1}^4 \lambda_i},
# $$
# and amounts to:

# In[10]:


np.sum(pca.explained_variance_ratio_)


# This is more than 99%, which means that the information lost by the projection from four to three dimensions is probably negligible.
#
# Let's squeeze some more information out of what we have done. In order to better understand how the four different features behave, we create a so-called [biplot](https://en.wikipedia.org/wiki/Biplot).
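# Before building the biplot, a minimal sketch that connects the scikit-learn output back to the formula above: we recompute the eigenvalues $\lambda_i$ of the empirical covariance matrix of the centered data and compare them to `pca.explained_variance_`. One caveat: scikit-learn normalizes the covariance by $n-1$ rather than $n$; the explained-variance ratio is unaffected by this choice.

# In[ ]:


# sanity check (sketch): eigenvalues of the empirical covariance matrix vs. pca.explained_variance_
Xc = X - X.mean(axis=0)                       # center the data
cov = Xc.T @ Xc / (X.shape[0] - 1)            # scikit-learn uses the 1/(n-1) normalization
eigvals = np.linalg.eigvalsh(cov)[::-1]       # eigenvalues, sorted largest first
print("Largest three eigenvalues:  ", eigvals[:3])
print("pca.explained_variance_:    ", pca.explained_variance_)
print("Ratio of first three eigenvalues:", eigvals[:3].sum() / eigvals.sum())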
# In[11]:


# Create biplot
V = pca.components_

fig = plt.figure()  # (figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')

mx = np.max(X_pca)
mn = np.min(X_pca)

for i, target_name in zip([0, 1, 2], iris.target_names):
    ax.scatter3D(X_pca[y == i, 0], X_pca[y == i, 1], X_pca[y == i, 2],
                 cmap='tab10', lw=4, label=target_name, marker="o", edgecolors='face')

# draw the loading vector of each feature, scaled by the standard deviation along each PC
scaling_fac = 3
for i in range(0, 4):
    ax.quiver(0, 0, 0,
              scaling_fac * np.sqrt(pca.explained_variance_[0]) * V[0, i],
              scaling_fac * np.sqrt(pca.explained_variance_[1]) * V[1, i],
              scaling_fac * np.sqrt(pca.explained_variance_[2]) * V[2, i],
              color='red')
    ax.text(scaling_fac * np.sqrt(pca.explained_variance_[0]) * V[0, i],
            scaling_fac * np.sqrt(pca.explained_variance_[1]) * V[1, i],
            scaling_fac * np.sqrt(pca.explained_variance_[2]) * V[2, i],
            iris.feature_names[i])

ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
ax.set_zlabel('Third principal component')
ax.set_xlim(mn, mx)
ax.set_ylim(mn, mx)
ax.set_zlim(mn, mx)
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.show()


# Note that we rescaled the axes compared to the 3D scatter plot above in order to better visualize the different magnitudes of the principal components.
#
# From the quiver plot of the loading vectors, we can conclude that the feature 'sepal width' is rather uncorrelated with the other three features. This follows from the approximate orthogonality that we observe between its loading vector and the others, while the remaining three loading vectors have fairly large scalar products / small angles with each other.

# ## Exercises
#
# ### Boston housing dataset
#
# We now want to see how unsupervised learning via PCA can be useful in a regression setting, using the [Boston housing dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-dataset).
#
# In particular, we want to _reduce the dimensionality_ of the 13-dimensional feature space before regressing on the housing prices (variable 'MEDV').
#
# 1.
# a) Load the dataset (sklearn.datasets.load_boston) that we already used previously. Use PCA to reduce the dimensionality of the original feature space to _5_ features, and run ridge regression on this feature space of reduced dimensionality. How does the prediction performance compare to what you saw previously? (Note: Unifying the scaling of the different features is important here, for example via [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standardscaler#sklearn.preprocessing.StandardScaler).) A starting-point sketch is given after the exercises.
#
# b) Create a principal component plot of the features (after scaling), using a dimensionality of 2 or 3. Can you infer something about the correlations between different features?
#
# ### NLP data: Newsgroups dataset
#
# 2. Load the [Newsgroups dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html?highlight=newsgroups#sklearn.datasets.fetch_20newsgroups). Previously, we extracted features from the text documents using a count vectorizer and a TF-IDF vectorizer.
# a) Repeat the classification experiments we ran before, but reduce the dimensionality via principal component analysis to $r \in \{10, 100, 1000\}$ features. How does this influence the classification accuracy?
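# As a starting point for exercise 1 a), here is a minimal sketch of the scaling + PCA + ridge pipeline. It assumes a scikit-learn version in which `load_boston` is still available (the loader was deprecated and removed in scikit-learn 1.2); the regularization strength `alpha=1.0` and the 5-fold cross-validation are arbitrary choices that you should adapt to match the earlier experiments.

# In[ ]:


from sklearn.datasets import load_boston              # only available in older scikit-learn versions
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X_boston, y_boston = load_boston(return_X_y=True)

# scale the 13 features, project onto 5 principal components, then fit ridge regression
model = make_pipeline(StandardScaler(),
                      PCA(n_components=5),
                      Ridge(alpha=1.0))

scores = cross_val_score(model, X_boston, y_boston, cv=5, scoring='r2')
print("Mean cross-validated R^2:", scores.mean())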