{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "| - | - | - |\n", "|---------------------------------------------------------------------------|---------------------------------------------------------------------------|---------------------------------------------------------------------------|\n", "| [Exercise 5 (plant clustering)](<#Exercise-5-(plant-clustering)>) | [Exercise 6 (nonconvex clusters)](<#Exercise-6-(nonconvex-clusters)>) | [Exercise 7 (binding sites)](<#Exercise-7-(binding-sites)>) |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ML: Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clustering is a type of unsupervised learning. It is similar to classification in that the aim is to give a label to each data point. However, unlike in classification, we are not given any examples of labels associated with the data points. We must infer from the data which data points belong to the same cluster. This can be achieved using some notion of distance between the data points: data points in the same cluster are, in some sense, close to each other.\n", "\n", "One of the simplest clustering methods is *k-means clustering*. It aims to produce a clustering that is optimal in the following sense:\n", "\n", "* the *centre of each cluster* is the average of all points in the cluster\n", "* any point in a cluster is closer to its own centre than to the centre of any other cluster\n", "\n", "The k-means algorithm is first given the desired number of clusters, say k, as a *hyperparameter*. Next, to start the algorithm, k points from the data set are chosen randomly as cluster centres. 
Then the following two phases are repeated iteratively:\n", "\n", "* each data point is assigned to the cluster whose centre is closest to it\n", "* for each cluster, a new centre is computed as the average of the data points in the cluster\n", "\n", "This procedure is repeated until the clusters no longer change. This is an example of an Expectation-Maximization (EM) algorithm, and such algorithms are known to converge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simple example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The scikit-learn library has an implementation of the k-means algorithm. Let's apply it to a set of randomly generated blobs, whose labels we throw away." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "X, y = make_blobs(centers=4, n_samples=200, random_state=0, cluster_std=0.7)\n", "print(X[:10], y[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we plot these points, but without coloring them by their labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:,0], X[:,1]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can still discern four clusters in the data set. Let's see if the k-means algorithm can recover these clusters. First we create an instance of the k-means model, giving it the number of clusters, 4, as a hyperparameter."
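As an aside, before handing the work to scikit-learn, the two iterative phases described earlier can be sketched directly in NumPy. This is only a toy illustration under our own naming (`kmeans_once` is a hypothetical helper; it runs a fixed number of iterations and does not handle empty clusters or random restarts):

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans_once(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as cluster centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Phase 1: assign each point to the cluster whose centre is closest
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Phase 2: recompute each centre as the average of its points
        centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centres

X_demo, _ = make_blobs(centers=4, n_samples=200, random_state=0, cluster_std=0.7)
demo_labels, demo_centres = kmeans_once(X_demo, 4)
print(demo_centres.round(2))
```

A production implementation would iterate until the assignments stop changing and rerun from several random initializations, which is what scikit-learn's `KMeans` does for us below.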
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "model = KMeans(n_clusters=4)\n", "model.fit(X)\n", "print(model.cluster_centers_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:,0], X[:,1], c=model.labels_);\n", "plt.scatter(model.cluster_centers_[:,0], model.cluster_centers_[:,1], s=100, color=\"red\"); # Show the centres" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The clustering looks more or less correct. To get a more quantitative measure of success, we can compute the accuracy score." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "acc = accuracy_score(y, model.labels_)\n", "print(\"Accuracy score is\", acc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oops! Even though the clusters may match the original ones almost perfectly, their labels can be permuted. To fix this, for each cluster we find the most common label among its data points in the original data, and then use that label for the whole cluster. In essence, we are renaming the clusters, not re-clustering the data."
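As an aside, scikit-learn also provides clustering metrics that are invariant to label permutations, such as the adjusted Rand index, with which no renaming is needed at all. A small self-contained sketch (the `_ari`-suffixed names are ours):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X_ari, y_ari = make_blobs(centers=4, n_samples=200, random_state=0, cluster_std=0.7)
labels_ari = KMeans(n_clusters=4, random_state=0).fit_predict(X_ari)
# The adjusted Rand index compares two partitions irrespective of how the
# clusters are named: close to 1 for a near-perfect clustering, near 0 for a random one.
print(adjusted_rand_score(y_ari, labels_ari))
```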
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import scipy.stats\n", "def find_permutation(n_clusters, real_labels, labels):\n", "    permutation = []\n", "    for i in range(n_clusters):\n", "        idx = labels == i\n", "        new_label = scipy.stats.mode(real_labels[idx], keepdims=True)[0][0]  # Choose the most common label among data points in the cluster\n", "        permutation.append(new_label)\n", "    return permutation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "permutation = find_permutation(4, y, model.labels_)\n", "print(permutation)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_labels = [ permutation[label] for label in model.labels_ ]  # Permute the labels\n", "print(\"Accuracy score is\", accuracy_score(y, new_labels))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, the k-means algorithm seems to work well in this case, but there can be several problems. Firstly, even though an EM algorithm always converges, it might converge only to a local optimum. To avoid this, EM-type algorithms are usually run several times, each time starting from different random initial values. For instance, in the scikit-learn implementation, the algorithm is restarted 10 times by default. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More complicated example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The k-means algorithm can have difficulties when the clusters are not convex in shape:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_moons\n", "X, y = make_moons(200, noise=0.05, random_state=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:,0], X[:,1]);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = KMeans(n_clusters=2)\n", "model.fit(X)\n", "plt.scatter(X[:,0], X[:,1], c=model.labels_);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The clustering does not work well here, since the two clusters cannot be separated with a line. We could embed this data set into a higher-dimensional space where the separation is possible, and then apply the k-means clustering there." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can use a different type of clustering algorithm for this case. The *DBSCAN algorithm* is based on point densities and works well when the density within each cluster is roughly uniform." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import DBSCAN\n", "model = DBSCAN(eps=0.3)\n", "model.fit(X)\n", "plt.scatter(X[:,0], X[:,1], c=model.labels_);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The good news is that DBSCAN does not require the user to specify the number of clusters. But the algorithm now depends on another hyperparameter: the neighbourhood radius `eps` (here 0.3)."
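To get a feel for that sensitivity, we can rerun DBSCAN with a few values of `eps` and count the clusters it finds; points that DBSCAN labels with -1 are treated as noise. A small self-contained sketch:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_eps, _ = make_moons(200, noise=0.05, random_state=0)
for eps in [0.1, 0.3, 0.6]:
    labels_eps = DBSCAN(eps=eps).fit_predict(X_eps)
    n_clusters = len(set(labels_eps) - {-1})   # label -1 marks noise points
    n_noise = (labels_eps == -1).sum()
    print(eps, n_clusters, n_noise)
```

Too small an `eps` fragments the data and declares many points noise; too large an `eps` merges the moons into one cluster.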
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clustering digits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using scikit-learn we can load a set of 1797 images of handwritten digits with the correct labels 0, 1, ..., 9. The images have quite a low resolution: 8*8=64 pixels. Let's see how our machine learning method works with this kind of data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "digits.data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get an idea of what these data points look like, let's plot the first ten of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2,5, subplot_kw=dict(xticks=[], yticks=[]))\n", "for ax, digit in zip(axes.flat, digits.data[:10]):\n", "    ax.imshow(digit.reshape(8,8), cmap=\"gray\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's cluster these data points into ten clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = KMeans(n_clusters=10, random_state=0)\n", "model.fit(digits.data)\n", "model.cluster_centers_.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, we have ten cluster centres, which are images with 8*8=64 pixels. We can have a look at their appearance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2,5, subplot_kw=dict(xticks=[], yticks=[]))\n", "for ax, digit in zip(axes.flat, model.cluster_centers_):\n", "    ax.imshow(digit.reshape(8,8), cmap=\"gray\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can recognize most of these numbers, with the possible exception of the number eight. What is the accuracy score of this clustering?"
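As an aside, instead of picking the most common label per cluster, the renaming can be done optimally with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) applied to the confusion matrix. A self-contained sketch (the `_h`-suffixed names are ours):

```python
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix

digits_h = load_digits()
labels_h = KMeans(n_clusters=10, random_state=0).fit_predict(digits_h.data)
# C[i, j] counts points whose true digit is i and that landed in cluster j;
# the assignment picks the cluster-to-digit matching maximizing the matched counts.
C = confusion_matrix(digits_h.target, labels_h)
digit_idx, cluster_idx = linear_sum_assignment(C, maximize=True)
mapping = {c: d for d, c in zip(digit_idx, cluster_idx)}
acc_h = accuracy_score(digits_h.target, [mapping[l] for l in labels_h])
print(acc_h)
```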
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "permutation3 = find_permutation(10, digits.target, model.labels_)\n", "print(permutation3)\n", "acc = accuracy_score(digits.target, [ permutation3[label] for label in model.labels_])\n", "print(\"Accuracy score is\", acc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is quite a good result for such a simple algorithm!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####