{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "| - | - | - |\n", "|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|\n", "| [Exercise 1 (blob classification)](<#Exercise-1-(blob-classification)>) | [Exercise 2 (plant classification)](<#Exercise-2-(plant-classification)>) | [Exercise 3 (word classification)](<#Exercise-3-(word-classification)>) |\n", "| [Exercise 4 (spam detection)](<#Exercise-4-(spam-detection)>) | | |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ML: Naive Bayes classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Classification* is one form of supervised learning. The aim is to annotate all data points with a label. Those points that have the same label belong to the same class. There can be two or more labels. For example, a lifeform can be classified (coarsely) with labels animal, plant, fungi, archaea, bacteria, protozoa, and chromista. The data points are observed to have certain features that can be used to predict their labels. For example, if it is has feathers, then it is most likely an animal.\n", "\n", "In supervised learning an algorithm is first given a training set of data points with their features and labels. Then the algorithm learns from these features and labels a (probabilistic) model, which can afterwards be used to predict the labels of previously unseen data.\n", "\n", "*Naive Bayes classification* is a fast and simple to understand classification method. Its speed is due to some simplifications we make about the underlying probability distributions, namely, the assumption about the independence of features. Yet, it can be quite powerful, especially when there are enough features in the data.\n", "\n", "Suppose we have for each label L a probability distribution. This distribution gives probability for each possible combination of features (a feature vector):\n", "\n", "$$P(features | L).$$\n", "\n", "The main idea in Bayesian classification is to reverse the direction of dependence: we want to predict the label based on the features:\n", "\n", "$$P(L | features)$$\n", "\n", "This is possible by [the Bayes theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem):\n", "\n", "$$P(L | features) = \\frac{P(features | L)P(L)}{P(features)}.$$\n", "\n", "Let's assume we have to labels L1 and L2, and their associated distributions: $P(features | L1)$ and $P(features | L2)$. If we have a data point with \"features\", whose label we don't know, we can try to predict it using the ratio of posterior probabilities:\n", "\n", "$$\\frac{P(L1 | features)}{P(L2 | features)} = \\frac{P(features | L1)P(L1)}{P(features | L2)P(L2)}.$$\n", "\n", "If the ratio is greater than one, we label our data point with label L1, and if not, we give it label L2.\n", "The prior probabilities P(L1) and P(L2) of labels can be easily found out from the input data, as for each data point we also have its label. Same goes for the probabilities of features conditioned on the label." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first demonstrate naive Bayes classification using Gaussian distributions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "X,y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)\n", "colors=np.array([\"red\", \"blue\"])\n", "plt.scatter(X[:, 0], X[:, 1], c=colors[y], s=50)\n", "for label, c in enumerate(colors):\n", " plt.scatter([], [], c=c, label=str(label))\n", "plt.legend();\n", "#plt.colorbar();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.naive_bayes import MultinomialNB\n", "model = GaussianNB()\n", "#model = MultinomialNB()\n", "model.fit(X, y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Naive Bayes algorithm fitted two 2-dimensional Gaussian distribution to the data. The means and the variances define these distributions completely." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Means:\", model.theta_)\n", "print(\"Standard deviations:\", model.sigma_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot these distributions. First we define a helper function to draw an ellipse that gives the standard deviation in each direction from the origo." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_ellipse(ax, mu, sigma, color=\"k\", label=None):\n", " \"\"\"\n", " Based on\n", " http://stackoverflow.com/questions/17952171/not-sure-how-to-fit-data-with-a-gaussian-python.\n", " \"\"\"\n", " from matplotlib.patches import Ellipse\n", " # Compute eigenvalues and associated eigenvectors\n", " vals, vecs = np.linalg.eigh(sigma)\n", "\n", " # Compute \"tilt\" of ellipse using first eigenvector\n", " x, y = vecs[:, 0]\n", " theta = np.degrees(np.arctan2(y, x))\n", "\n", " # Eigenvalues give length of ellipse along each eigenvector\n", " w, h = 2 * np.sqrt(vals)\n", "\n", " ax.tick_params(axis='both', which='major', labelsize=20)\n", " ellipse = Ellipse(mu, w, h, theta, color=color, label=label) # color=\"k\")\n", " ellipse.set_clip_box(ax.bbox)\n", " ellipse.set_alpha(0.2)\n", " ax.add_artist(ellipse)\n", " return ellipse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we do the actual plotting:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "plt.xlim(-5, 5)\n", "plt.ylim(-15, 5)\n", "plot_ellipse(plt.gca(), model.theta_[0], np.identity(2)*model.sigma_[0], color=\"red\")\n", "plot_ellipse(plt.gca(), model.theta_[1], np.identity(2)*model.sigma_[1], color=\"blue\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Accuracy score* gives a measure about how well we managed to predict the labels. The maximum value is 1.0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "y_fitted = model.predict(X)\n", "acc=accuracy_score(y,y_fitted)\n", "print(\"Accuracy score is\", acc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The score was the best possible, which is not a surprise, since we tried to predict the data we had already seen! 
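, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check, we can redo the prediction of a single point by hand. The sketch below uses the fitted attributes `theta_` (means), `sigma_` (variances), and `class_prior_` (label priors). Thanks to the naive independence assumption, the log-likelihood of a point is just a sum over the features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reproduce the model's decision for the first data point by hand\n", "point = X[0]\n", "log_posteriors = []\n", "for k in range(2):\n", "    lp = np.log(model.class_prior_[k])   # log prior of label k\n", "    # Independent features: the Gaussian log-densities simply add up\n", "    lp += -0.5 * np.sum(np.log(2 * np.pi * model.sigma_[k])\n", "                        + (point - model.theta_[k])**2 / model.sigma_[k])\n", "    log_posteriors.append(lp)\n", "print(\"Manual prediction:\", np.argmax(log_posteriors))\n", "print(\"Model prediction: \", model.predict([point])[0])" ] }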
, { "cell_type": "markdown", "metadata": {}, "source": [ "Later we will split our data into two parts: one for learning the model and the other for testing its predictive skills." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Another example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's generate some more data using multivariate normal distributions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cov = np.array([[ 4.68, -4.32],\n", "                [-4.32,  4.68]])\n", "mean1 = [0, 0]\n", "mean2 = [0, 4]\n", "n = 500\n", "x1 = np.random.multivariate_normal(mean1, cov, n)\n", "x2 = np.random.multivariate_normal(mean2, cov, n)\n", "X = np.vstack([x1, x2])\n", "y = np.hstack([[0]*n, [1]*n])\n", "plt.scatter(X[:n, 0], X[:n, 1], color=\"red\", label=0)\n", "plt.scatter(X[n:, 0], X[n:, 1], color=\"blue\", label=1)\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two clusters seem to be quite separate. Let's try naive Bayesian classification on this data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = GaussianNB()\n", "#model = MultinomialNB()\n", "model.fit(X, y);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Means:\", model.theta_)\n", "print(\"Variances:\", model.sigma_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_fitted = model.predict(X)\n", "colors = np.array([\"red\", \"blue\"])\n", "plt.scatter(X[:, 0], X[:, 1], color=colors[y_fitted])\n", "plt.scatter([], [], color=\"red\", label=\"0\")\n", "plt.scatter([], [], color=\"blue\", label=\"1\")\n", "acc = accuracy_score(y, y_fitted)\n", "plt.legend()\n", "print(\"Accuracy score is\", acc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the score is quite good, we can see from the plot that the algorithm didn't have good models for the data. We can plot the models the algorithm used:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "plt.xlim(-10, 10)\n", "plt.ylim(-15, 10)\n", "e1 = plot_ellipse(plt.gca(), model.theta_[0], np.identity(2)*model.sigma_[0], color=\"red\", label=\"0\")\n", "e2 = plot_ellipse(plt.gca(), model.theta_[1], np.identity(2)*model.sigma_[1], color=\"blue\", label=\"1\")\n", "plt.legend([e1, e2], [\"0\", \"1\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem is the naive independence assumption: naive Bayes can only model each class with a Gaussian distribution whose axes are aligned with the x and y axes, that is, with a diagonal covariance matrix. This example data would have required \"tilted\" Gaussian distributions, in which the features are correlated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We next try to classify a set of messages that were posted on a public forum. The messages were divided into groups by topic, so we have a data set ready for classification testing. Let's first load this data using scikit-learn and print the message categories." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "data = fetch_20newsgroups()\n", "data.target_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We concentrate on four message categories only." ] }
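, { "cell_type": "markdown", "metadata": {}, "source": [ "Before restricting ourselves to these categories, we can check how the messages are distributed over all twenty groups. The sketch below counts the occurrences of each target value with `np.bincount` and attaches the category names using a pandas Series." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of messages in each of the 20 categories\n", "counts = pd.Series(np.bincount(data.target), index=data.target_names)\n", "print(counts)" ] }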
, { "cell_type": "markdown", "metadata": {}, "source": [ "The tool `fetch_20newsgroups` allows us to easily split the data into training and testing data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "categories = ['comp.graphics', 'rec.autos', 'sci.electronics', 'sci.crypt']\n", "train = fetch_20newsgroups(subset='train', categories=categories)\n", "test = fetch_20newsgroups(subset='test', categories=categories)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what we got:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Training data:\", \"Data:\", str(type(train.data)), len(train.data), \"Target:\", str(type(train.target)), len(train.target))\n", "print(\"Test data:\", \"Data:\", str(type(test.data)), len(test.data), \"Target:\", str(type(test.target)), len(test.target))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use as features the counts of each word in the dataset. That is, there are as many features as there are distinct words in the dataset. We denote the number of features by $f$. As the features are now counts, it is sensible to use the multinomial distribution instead of the Gaussian.\n", "\n", "Let's try to model these messages using multinomial distributions. Each message category has its own distribution. A multinomial distribution has $f$ non-negative parameters $\\theta_1,\\ldots , \\theta_f$, which sum up to one. For example, the parameter $\\theta_3$ might tell the probability of the word \"board\" appearing in a message of the category this distribution is describing.\n", "\n", "In scikit-learn there is a class `CountVectorizer` that converts messages in the form of text strings to feature vectors. We can integrate this conversion with the model we are using (multinomial naive Bayes), so that the conversion happens automatically as part of the `fit` method. We achieve this integration using the `make_pipeline` tool." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#from sklearn.feature_extraction.text import TfidfVectorizer  # an alternative feature extractor\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.pipeline import make_pipeline\n", "\n", "#model = make_pipeline(TfidfVectorizer(), MultinomialNB())\n", "model = make_pipeline(CountVectorizer(), MultinomialNB())\n", "model.fit(train.data, train.target)\n", "labels_fitted = model.predict(test.data)\n", "print(\"Accuracy score is\", accuracy_score(test.target, labels_fitted))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The classifier seems to work quite well! Notice that this time we used separate data for testing the model.\n", "\n", "Let's have a closer look at the resulting feature vectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vec = CountVectorizer()\n", "features = vec.fit_transform(train.data)\n", "print(\"Type of feature matrix:\", type(features))\n", "print(features[0,:])  # print the features of the first sample point" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The feature matrix is stored in sparse format, that is, only the nonzero counts are stored." ] }
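, { "cell_type": "markdown", "metadata": {}, "source": [ "The pipeline also lets us peek at what the classifier learned. The following sketch assumes the default step names that `make_pipeline` generates (`countvectorizer` and `multinomialnb`) and prints the five most probable words of each category; in scikit-learn 1.0 and later, use `get_feature_names_out()` instead of `get_feature_names()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract the fitted steps from the pipeline\n", "vectorizer = model.named_steps[\"countvectorizer\"]\n", "nb = model.named_steps[\"multinomialnb\"]\n", "words = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn\n", "# feature_log_prob_[i] holds the log probabilities of all words in category i\n", "for i, category in enumerate(train.target_names):\n", "    top = np.argsort(nb.feature_log_prob_[i])[-5:]\n", "    print(category, words[top])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most probable words tend to be common English words regardless of the category; the `TfidfVectorizer` alternative mentioned above is one way to downweight such words. Back to the feature matrix: how many words were in the first message?" ] }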
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Number of words:\", features[0,:].sum())\n", "col = vec.vocabulary_[\"it\"] # Get the column of 'it' word in the feature matrix\n", "print(f\"Word 'it' appears in the first message {features[0, col]} times.\")\n", "print()\n", "print(train.data[0]) # Let's print the corresponding message as well\n", "#print(vec.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####