{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# Inmas Machine Learning Workshop January 2023\n", "Instructor: Christian Kuemmerle - kuemmerle@uncc.edu
\n", "Teaching Assistants: Emily Shinkle, Yuxuan Li, Derek Kielty, Yashil Sukurdeep, Tim Wang, Ben Brindle.\n", "\n", "\n", "## Session III - Clustering & K-means\n", "\n", "**This version of the notebook is more suitable for students with less experience in machine learning / who are less familiar with using Python for machine learning and/or the underlying concepts. It contains additional hints compared to the version ``session3a_K-means_advanced.ipynb``, but is similar in nature.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, we will implement k-means clustering on a simple 2D dataset to gain some intuition about how it works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import sklearn.cluster as cluster \n", "%matplotlib inline\n", "sns.set_context('poster')\n", "sns.set_color_codes()\n", "plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating the dataset\n", "\n", "We start with a toy dataset that consists of 3 clusters that are sampled from Gaussian distributions with different means and with the standard deviation 0.25." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.datasets as data\n", "blobs, _ = data.make_blobs(n_samples=200, centers=[(-0.75,2.25), (1.0, 2.0), (0,1)], cluster_std=0.25)\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(blobs.T[0], blobs.T[1], c='b', **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "K-Means is the 'go-to' clustering algorithm for many simply because it is fast, easy to understand, and available everywhere (there's an implementation in almost any statistical or machine learning tool you care to use). This is a brief summary of how the algorithm works:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optimization\n", "\n", "- Formally, it's an optimization over the possible groupings of objects\n", "\n", "> For a set of $\\{ x_l \\}$ where $x_l\\in \\mathbb{R}^d$ for all $l \\in \\{1,2,\\ldots,n\\}$\n", ">
\n", ">
\n", ">$\\displaystyle \\hat{{C}} = \\textrm{arg}\\min_{{C}} \\sum_{i=1}^k \\left[\\ \\sum_{x\\in{}C_i}\\ \\lvert\\!\\lvert x-\\mu_i\\rvert\\!\\rvert^2 \\right] $ (distortion measure)\n", ">
\n", ">
\n", "> where \n", ">
\n", ">
\n", ">$\\displaystyle \\mu_i = \\frac{1}{\\lvert{C_i}\\rvert}\\sum_{x\\in{}C_i} x $\n", "\n", "Here, the $C = \\{C_1,\\ldots,C_k\\}$ corresponds to a partition of the index set $\\{1,2,\\ldots,n\\}$ of size $n$, where $n$ corresponds to the number of data points.\n", "\n", "That means that $\\{1,2,\\ldots,n\\} = C_1 \\cup C_2 \\cup \\ldots C_k$ with $C_i \\cap C_j = \\emptyset$ for $i \\neq j \\in \\{1,2,\\ldots,n\\}$. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Algorithm\n", "\n", "- Iteratively improving the $\\mu_i$ **prototypes** of $k$ clusters\n", "\n", ">1. Pick $k$ random objects as the initial $\\mu_i$ prototypes\n", ">0. Find for each object the closest prototype and assign to that cluster\n", ">0. Calculate the averages for each cluster to get new $\\mu_i$\n", ">0. Repeat until convergence\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use can use an implementation of k-means [provided by scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). We visualize the resulting clustering (and cluster centers) below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "kmeans = KMeans(n_clusters=3, max_iter=200).fit(blobs)\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(blobs.T[0], blobs.T[1], c=kmeans.labels_ , **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "C = kmeans.cluster_centers_\n", "plt.scatter(C[:,0],C[:,1],c='r',marker='o',s=100,edgecolor='none');\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will use the clustering_map function to show the separating hyperplanes that were learned by K-means. This shows how k-means would predict the clusters that new data points belong to." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def clustering_map(X,cluster,i=0,j=1,h=0.005):\n", " '''\n", " h: step size in the mesh\n", " i: first feature number to be plotted\n", " j: second feature number to be plotted\n", " '''\n", " import matplotlib.pyplot as plt\n", " from matplotlib.colors import ListedColormap\n", " cmap_light = ListedColormap(['#FFBBBB', '#BBFFBB', '#BBBBFF'])\n", " cmap_bold = ListedColormap(['#CC0000', '#00AA00', '#0000CC'])\n", " # Points in a mesh of [x_min, m_max] x [y_min, y_max]\n", " x_min, x_max = X[:,i].min()-1, X[:,i].max()+1\n", " y_min, y_max = X[:,j].min()-1, X[:,j].max()+1\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", " np.arange(y_min, y_max, h))\n", " grid = np.c_[xx.ravel(), yy.ravel()]\n", " \n", " # Obtain labels for each point in mesh. 
"\n", "    # Fit the clustering model and obtain a label for each point in the mesh\n", "    cluster.fit(X)\n", "    Z = cluster.predict(grid)\n", "\n", "    # Put the result into a color plot\n", "    Z = Z.reshape(xx.shape)\n", "    fig = plt.figure()\n", "\n", "    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')\n", "\n", "    # Plot the training points\n", "    plt.scatter(X[:,i], X[:,j], **plot_kwds)\n", "    plt.xlim(xx.min(), xx.max())\n", "    plt.ylim(yy.min(), yy.max())\n", "    plt.title(\"Clustering with \" + str(cluster))\n", "\n", "    ax = plt.gca()\n", "    #ax.legend([\"training data\"],loc=0,fontsize=8)\n", "\n", "    # Plot the centroids as a white X\n", "    centroids = cluster.cluster_centers_\n", "    plt.scatter(centroids[:, 0], centroids[:, 1],\n", "                marker='x', s=169, linewidths=3,\n", "                color='w', zorder=10, alpha=0.8)\n", "\n", "    return fig" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clustering_map(blobs, kmeans)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Behavior of K-means for different data sets & weaknesses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The k-means algorithm is usually very fast, but it has some weaknesses: it can be sensitive to the initialization as well as to the scale and size of the different clusters, and it fails when the data cannot be separated by hyperplanes. In what follows, we will see some examples where k-means struggles to identify the correct clusters.\n", "\n", "First, we will see how k-means is affected by the scale, i.e., the standard deviation, of the distributions from which the samples in the two clusters are drawn.\n", "\n", "**After executing the following cell, change the parameter `cluster_std` of one of the blobs, and execute the cell again to see how this affects the geometry of the dataset.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs  # avoids rebinding a module alias to the array 'data' below\n", "\n", "blobs_0, _ = make_blobs(n_samples=300, centers=[(-0.75,2.25)], cluster_std=0.5)\n", "blobs_1, _ = make_blobs(n_samples=300, centers=[(0.5, 2.0)], cluster_std=0.1)\n", "data = np.vstack([blobs_0, blobs_1])\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(data.T[0], data.T[1], c='b', **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Use K-means to fit this dataset. Visualize the result (dataset & cluster centers jointly) in a scatter plot.**\n", "\n", "After you have done that, you can change the centers and standard deviations in \"make_blobs\" above and see how it affects the clustering." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
\n", "Hint: Feel free to use and revise the previous KMeans code (fourth cells above). Note that previously the data is called 'blobs', and here the data is called 'data'.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Behavior for imbalanced datasets\n", "\n", "Now we see how k-means is affected by the size or number or samples in each cluster." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.datasets as data\n", "nr_datapoints_blob0 = 20\n", "nr_datapoints_blob1 = 500\n", "\n", "blobs_0, _ = data.make_blobs(n_samples=nr_datapoints_blob0, centers=[(-0.5,2.25)], cluster_std=0.2)\n", "blobs_1, _ = data.make_blobs(n_samples=nr_datapoints_blob1, centers=[(1, 2.0)], cluster_std=0.3)\n", "data = np.vstack([blobs_0, blobs_1])\n", " \n", "fig, ax = plt.subplots()\n", "ax.scatter(data.T[0], data.T[1], c='b', **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we created the data set \"blobs\" such that one contains much fewer points than the other one.\n", "\n", "**Use now K-means to fit this data set, and visualize the outcome coloring the datapoints with different colors based on their cluster affiliation.
You can change the centers and standard deviations and see how it affects the clustering.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Discuss with your group: Does K-means successfully \"find\" the two clusters? Discuss how the outcome might be related to the optimization objective and to the way K-means works.**\n", "\n", "Answer:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Behavior for non-convex cluster geometries\n", "\n", "Another interesting example is that of non-convex clusters. Consider the following example of a circle inside a ring." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets\n", "\n", "x, y = datasets.make_circles(n_samples=1000, factor=0.3, noise=0.1, random_state=2018)\n", "plt.subplot(111, aspect='equal')\n", "plt.scatter(x[:,0], x[:,1], c=y, **plot_kwds);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Use K-means to find a clustering of this toy dataset and visualize the result.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
\n", "Hint: Again, feel free to use and revise our previous K-means code. Note that here, the data is 'x'.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that k-means is unable to find the correct clustering. In general, k-means struggles when the the individual clusters cannot be separated by hyperplanes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering Beyond K-Means\n", "\n", "There is a variety of clustering algorithms that are better suited for such cases.\n", "\n", "## Spectral Clustering\n", "One example for such an algorithm is the Spectral Clustering algorithm. \n", "\n", "**Try to find a clustering using this scikit-learn implementation of the Spectral Clustering algorithm: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html. Visualize the result.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Hint: the function that you need for Spectral Clustering algorithm is:\n", "
\n", "
\n", "cluster.SpectralClustering(n_clusters=, affinity=\"nearest_neighbors\").fit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Color compression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One interesting application of clustering is in color compression within images. \n", "For example, imagine you have an image with millions of colors.\n", "In most images, a large number of the colors will be unused, and many of the pixels in the image will have similar or even identical colors.\n", "\n", "First, we plot the example image whose colors we would like to compress." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_sample_image\n", "china = load_sample_image(\"china.jpg\")\n", "\n", "fig = plt.figure()\n", "plt.imshow(china);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Obtain a simplified 10-colored version of the image by above by applying k-means. Plot the resulting image and the original image next to each other.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Hint 1: to feed the image data to our KMeans function, you may need to reshape the image data first. You may use the code below.\n", "
\n", "
\n", "im_rgb = np.array(china)\n", "
\n", "shp = im_rgb.shape\n", "
\n", "im_rgb = np.reshape(im_rgb, (shp[0]*shp[1], 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Hint 2: After using KMeans to get the label of each pixel (i.e., kmeans.labels_), you need to change the color of each pixel to the corresponding color of its label, then we get the simplified version of this image. You can use the function written below.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def average_colors(a, labels):\n", " #takes a flattened image a and labels, list of each pixel's cluster and return an image a_prime.\n", " #the color of a pixel of a_prime is the average color of all pixels sharing the same label.\n", " a_compressed = np.empty_like(a)\n", " for l in np.unique(labels):\n", " a_compressed[labels ==l, :] = np.mean(a[labels==l, :], 0)\n", " return a_compressed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }