{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# Inmas Machine Learning Workshop January 2023\n", "Instructor: Christian Kuemmerle - kuemmerle@uncc.edu
\n", "Teaching Assistants: Emily Shinkle, Yuxuan Li, Derek Kielty, Yashil Sukurdeep, Tim Wang, Ben Brindle.\n", "\n", "\n", "## Session III - Clustering & K-means\n", "\n", "**This version of the notebook is more suitable for students with more experience in machine learning / who are more familiar with coding/ some of the covered context. It contains fewer hints and an additional topic at the end (initialization dependence of K-means) compared to the version ``session3a_K-means.ipynb``.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, we will implement k-means clustering on a simple 2D dataset to gain some intuition about how it works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import sklearn.cluster as cluster \n", "%matplotlib inline\n", "sns.set_context('poster')\n", "sns.set_color_codes()\n", "plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating the dataset\n", "\n", "We start with a toy dataset that consists of 3 clusters that are sampled from Gaussian distributions with different means and with the standard deviation 0.25." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.datasets as data\n", "blobs, _ = data.make_blobs(n_samples=200, centers=[(-0.75,2.25), (1.0, 2.0), (0,1)], cluster_std=0.25)\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(blobs.T[0], blobs.T[1], c='b', **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "K-Means is the 'go-to' clustering algorithm for many simply because it is fast, easy to understand, and available everywhere (there's an implementation in almost any statistical or machine learning tool you care to use). This is a brief summary of how the algorithm works:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optimization\n", "\n", "- Formally, it's an optimization over the possible groupings of objects\n", "\n", "> For a set of $\\{ x_l \\}$ where $x_l\\in \\mathbb{R}^d$ for all $l$\n", ">
\n", ">
\n", ">$\\displaystyle \\hat{{C}} = \\textrm{arg}\\min_{{C}} \\sum_{i=1}^k \\left[\\ \\sum_{x\\in{}C_i}\\ \\lvert\\!\\lvert x-\\mu_i\\rvert\\!\\rvert^2 \\right] $ (distortion measure)\n", ">
\n", ">
\n", "> where \n", ">
\n", ">
\n", ">$\\displaystyle \\mu_i = \\frac{1}{\\lvert{C_i}\\rvert}\\sum_{x\\in{}C_i} x $\n", "\n", "Here, the $C = \\{C_1,\\ldots,C_k\\}$ corresponds to a partition of the index set $\\{1,2,\\ldots,n\\}$ of size $n$, where $n$ corresponds to the number of data points.\n", "\n", "That means that $\\{1,2,\\ldots,n\\} = C_1 \\cup C_2 \\cup \\ldots C_k$ with $C_i \\cap C_j = \\emptyset$ for $i \\neq j \\in \\{1,2,\\ldots,n\\}$. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Algorithm\n", "\n", "- Iteratively improving the $\\mu_i$ **prototypes** of $k$ clusters\n", "\n", ">1. Pick $k$ random objects as the initial $\\mu_i$ prototypes\n", ">0. Find for each object the closest prototype and assign to that cluster\n", ">0. Calculate the averages for each cluster to get new $\\mu_i$\n", ">0. Repeat until convergence\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use can use an implementation of k-means [provided by scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). We visualize the resulting clustering (and cluster centers) below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "kmeans = KMeans(n_clusters=3, max_iter=200).fit(blobs)\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(blobs.T[0], blobs.T[1], c=kmeans.labels_ , **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "C = kmeans.cluster_centers_\n", "plt.scatter(C[:,0],C[:,1],c='r',marker='o',s=100,edgecolor='none');\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will use the clustering_map function to show the separating hyperplanes that were learned by K-means. This shows how k-means would predict the clusters that new data points belong to." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def clustering_map(X,cluster,i=0,j=1,h=0.005):\n", " '''\n", " h: step size in the mesh\n", " i: first feature number to be plotted\n", " j: second feature number to be plotted\n", " '''\n", " import matplotlib.pyplot as plt\n", " from matplotlib.colors import ListedColormap\n", " cmap_light = ListedColormap(['#FFBBBB', '#BBFFBB', '#BBBBFF'])\n", " cmap_bold = ListedColormap(['#CC0000', '#00AA00', '#0000CC'])\n", " # Points in a mesh of [x_min, m_max] x [y_min, y_max]\n", " x_min, x_max = X[:,i].min()-1, X[:,i].max()+1\n", " y_min, y_max = X[:,j].min()-1, X[:,j].max()+1\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", " np.arange(y_min, y_max, h))\n", " grid = np.c_[xx.ravel(), yy.ravel()]\n", " \n", " # Obtain labels for each point in mesh. 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use an implementation of k-means [provided by scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). We visualize the resulting clustering (and cluster centers) below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "kmeans = KMeans(n_clusters=3, max_iter=200).fit(blobs)\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(blobs.T[0], blobs.T[1], c=kmeans.labels_ , **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "C = kmeans.cluster_centers_\n", "plt.scatter(C[:,0],C[:,1],c='r',marker='o',s=100,edgecolor='none');\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will use the `clustering_map` function to show the separating hyperplanes that were learned by K-means. This shows how k-means would assign new data points to clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def clustering_map(X,cluster,i=0,j=1,h=0.005):\n", " '''\n", " h: step size in the mesh\n", " i: first feature number to be plotted\n", " j: second feature number to be plotted\n", " '''\n", " import matplotlib.pyplot as plt\n", " from matplotlib.colors import ListedColormap\n", " cmap_light = ListedColormap(['#FFBBBB', '#BBFFBB', '#BBBBFF'])\n", " cmap_bold = ListedColormap(['#CC0000', '#00AA00', '#0000CC'])\n", " # Points in a mesh of [x_min, x_max] x [y_min, y_max]\n", " x_min, x_max = X[:,i].min()-1, X[:,i].max()+1\n", " y_min, y_max = X[:,j].min()-1, X[:,j].max()+1\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", " np.arange(y_min, y_max, h))\n", " grid = np.c_[xx.ravel(), yy.ravel()]\n", " \n", " # Obtain labels for each point in the mesh: refit the model and predict on the grid\n", " cluster.fit(X)\n", " Z = cluster.predict(grid)\n", " \n", " # Put the result into a color plot\n", " Z = Z.reshape(xx.shape)\n", " fig = plt.figure()\n", "\n", " plt.pcolormesh(xx, yy, Z, cmap=cmap_light,shading='auto')\n", " \n", " # Plot the training points\n", " plt.scatter(X[:,i], X[:,j], **plot_kwds)\n", " plt.xlim(xx.min(), xx.max())\n", " plt.ylim(yy.min(), yy.max())\n", " plt.title(\"Clustering with \"+str(cluster))\n", "\n", " ax=plt.gca()\n", " #ax.legend([\"training data\"],loc=0,fontsize=8)\n", " \n", " # Plot the centroids as a white X\n", " centroids = cluster.cluster_centers_\n", " plt.scatter(centroids[:, 0], centroids[:, 1],\n", " marker='x', s=169, linewidths=3,\n", " color='w', zorder=10, alpha=0.8)\n", " \n", " return fig" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clustering_map(blobs,kmeans)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Behavior of K-means for different data sets & weaknesses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The k-means algorithm is usually very fast, but it has some weaknesses. It can be sensitive to the initialization and to the scale and size of the different clusters. It also fails when the clusters cannot be separated by hyperplanes. In what follows, we will see some examples where k-means struggles to identify the correct clusters.\n", "\n", "## Dependence on data distribution\n", "\n", "First, we will see how k-means is affected by the scale or standard deviation of the distributions from which the samples in the two clusters are drawn.\n", "\n", "**After executing the following cell, change the parameter `cluster_std` of one of the blobs, and execute the cell again to see how this affects the geometry of the dataset.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.datasets as data\n", "blobs_0, _ = data.make_blobs(n_samples=300, centers=[(-0.75,2.25)], cluster_std=0.5) # creates the first of the two blobs\n", "blobs_1, _ = data.make_blobs(n_samples=300, centers=[(0.5, 2.0)], cluster_std=0.1)\n", "data = np.vstack([blobs_0, blobs_1])\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(data.T[0], data.T[1], c='b', **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Use K-means to fit this dataset. Visualize the result (dataset & cluster centers jointly) in a scatter plot.**\n", "\n", "After you have done that, you can change the centers and standard deviations in `make_blobs` above and see how it affects the clustering." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here\n", "\n", "\n"
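] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would like to compare with one possible approach: the next cell is a minimal sketch, not the only solution. It reuses `data` and `plot_kwds` from the cells above, and the variable name `kmeans_scale` is our own choice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A possible sketch: fit two clusters and color the points by their labels\n", "from sklearn.cluster import KMeans\n", "\n", "kmeans_scale = KMeans(n_clusters=2).fit(data)\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(data.T[0], data.T[1], c=kmeans_scale.labels_, **plot_kwds)\n", "ax.scatter(kmeans_scale.cluster_centers_[:,0], kmeans_scale.cluster_centers_[:,1], c='r', marker='o', s=100)\n", "ax.set_title('K-means on unequal scales', size=16)\n", "plt.show();"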
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.datasets as data\n", "nr_datapoints_blob0 = 20\n", "nr_datapoints_blob1 = 500\n", "\n", "blobs_0, _ = data.make_blobs(n_samples=nr_datapoints_blob0, centers=[(-0.5,2.25)], cluster_std=0.2)\n", "blobs_1, _ = data.make_blobs(n_samples=nr_datapoints_blob1, centers=[(1, 2.0)], cluster_std=0.3)\n", "data = np.vstack([blobs_0, blobs_1])\n", " \n", "fig, ax = plt.subplots()\n", "ax.scatter(data.T[0], data.T[1], c='b', **plot_kwds)\n", "ax.set_title('Toy Dataset', size=16)\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we created the data set \"blobs\" such that one contains much fewer points than the other one.\n", "\n", "**Use now K-means to fit this data set, and visualize the outcome coloring the datapoints with different colors based on their cluster affiliation.
"**You can change the centers and standard deviations and see how it affects the clustering.**\n", "\n", "Answer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Discuss with your group: Does K-means successfully \"find\" the two clusters? Discuss how the outcome might be related to the optimization objective and to how K-means works.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Behavior for non-convex cluster geometries\n", "\n", "Another interesting example is that of non-convex clusters. Consider the following example of a circle inside a ring." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets\n", "\n", "x, y = datasets.make_circles(n_samples=1000, factor=0.3, noise=0.1, random_state=2018)\n", "plt.subplot(111, aspect='equal'); \n", "plt.scatter(x[:,0], x[:,1], c=y, **plot_kwds);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Use K-means to find a clustering of this toy dataset and visualize the result.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that k-means is unable to find the correct clustering. In general, k-means struggles when the individual clusters cannot be separated by hyperplanes. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering Beyond K-Means\n", "\n", "There is a variety of clustering algorithms that are better suited for such cases.\n", "\n", "## Spectral Clustering\n", "One example of such an algorithm is the Spectral Clustering algorithm. \n", "\n", "**Try to find a clustering using this scikit-learn implementation of the Spectral Clustering algorithm: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html. Visualize the result.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Color compression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One interesting application of clustering is in color compression within images. \n", "For example, imagine you have an image with millions of colors.\n", "In most images, a large number of the colors will be unused, and many of the pixels in the image will have similar or even identical colors.\n", "\n", "First, we plot the example image whose colors we would like to compress." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_sample_image\n", "china = load_sample_image(\"china.jpg\")\n", "\n", "fig = plt.figure()\n", "plt.imshow(china);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Obtain a simplified 10-color version of the image above by applying k-means. Plot the resulting image and the original image next to each other.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### You can write your code here\n", "\n", "\n", "\n"
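] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to compare with one possible approach: the sketch below is not the only solution, and the variable names `pixels`, `kmeans_colors` and `china_compressed` are our own. It clusters the RGB values of all pixels into 10 groups with `KMeans` and replaces every pixel by the center of its cluster." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A possible sketch: cluster the pixel colors, then replace each pixel by its cluster center\n", "from sklearn.cluster import KMeans\n", "\n", "pixels = china.reshape(-1, 3) / 255.0  # flatten to (n_pixels, 3) and scale to [0, 1]\n", "kmeans_colors = KMeans(n_clusters=10).fit(pixels)  # for speed, one could fit on a random subsample of pixels\n", "new_colors = kmeans_colors.cluster_centers_[kmeans_colors.labels_]\n", "china_compressed = new_colors.reshape(china.shape)\n", "\n", "fig, ax = plt.subplots(1, 2, figsize=(16, 6))\n", "ax[0].imshow(china)\n", "ax[0].set_title('Original image')\n", "ax[1].imshow(china_compressed)\n", "ax[1].set_title('10-color image')\n", "for a in ax:\n", "    a.axis('off')\n", "plt.show();"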
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Variants of K-Means: Initialization\n", "\n", "Since K-means is an iterative algorithm, the question of how the cluster centers are initialized in its first iteration is relevant: different initializations might result in different outcomes.\n", "\n", "The `scikit-learn` implementation of K-means does _not_ use a completely random initialization, but a scheme called `k-means++` to choose the initial cluster centers. See the\n", "[documentation of `sklearn.cluster.KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) for details.\n", "\n", "## K-Means++\n", "\n", "The procedure for finding the initialization in `k-means++` is the following:\n", "\n", "1. Choose one center $\mu_1$ uniformly at random among the data points.\n", "2. For each data point $x$ not chosen yet, compute $D(x)$, the distance between $x$ and the nearest center $\mu_i$ that has already been chosen.\n", "3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$.\n", "4. Repeat steps 2 and 3 until $k$ centers have been chosen.\n", "5. Now that the initial centers have been chosen, proceed using standard k-means clustering.\n", "\n", "This initialization is used by default, corresponding to the option `init='k-means++'`.\n", "\n", "## Random Initialization\n", "\n", "In contrast, a \"naive\" random initialization would look as follows:\n", "\n", "- Choose the $k$ initial centers uniformly at random without replacement among all data points.\n", "\n", "This initialization corresponds to `init='random'`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We revisit the setting from above with three \"blobs\" sampled with the same standard deviation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.datasets as data\n", "blobs, _ = data.make_blobs(n_samples=200, centers=[(-0.75,2.25), (1.0, 2.0), (0,1)], cluster_std=0.25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Run K-means for only _one_ iteration**\n", " - **using the k-means++ initialization, and also**\n", " - **using the random initialization.**\n", " \n", "**Visualize the resulting clusterings next to each other, and furthermore, report the value of the k-means objective (distortion measure, also called \"inertia\").**\n", "\n", "You can run the code multiple times to see how \"random\" the outcomes are. What do you observe?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Add your code here below ### \n", "\n", "\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }