{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "| - | - | - |\n", "|-------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|\n", "| [Exercise 10 (linear regression)](<#Exercise-10-(linear-regression)>) | [Exercise 11 (mystery data)](<#Exercise-11-(mystery-data)>) | [Exercise 12 (coefficient of determination)](<#Exercise-12-(coefficient-of-determination)>) |\n", "| [Exercise 13 (cycling weather continues)](<#Exercise-13-(cycling-weather-continues)>) | | |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine learning: linear regression\n", "\n", "## Linear regression\n", "Regression analysis tries to explain relationships between variables. One of these variables, called the *dependent variable*, is what we want to \"explain\" using one or more *explanatory variables*. In linear regression we assume that the dependent variable can, approximately, be expressed as a linear combination of the explanatory variables. As a simple example, we might have the dependent variable height and the explanatory variable age. A person's age explains their height quite well, and this relationship is approximately linear for kids (ages between 1 and 16). Another way of thinking about regression is fitting a curve to the observed data points. If we have only one explanatory variable, this is easy to visualize, as we shall see below.\n", "\n", "We can apply linear regression easily with the [scikit-learn](https://scikit-learn.org/stable/) package. Let's go through some examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we make the usual standard imports."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import sklearn # This imports the scikit-learn library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we create some data with approximately the relationship $y=2x+1$, with normally distributed errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(0)\n", "n = 20 # Number of data points\n", "x = np.linspace(0, 10, n)\n", "y = 2*x + 1 + np.random.randn(n) # Noise with standard deviation 1\n", "print(x)\n", "print(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we import the `LinearRegression` class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can fit a line through the data points (x, y):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = LinearRegression(fit_intercept=True)\n", "model.fit(x[:, np.newaxis], y) # scikit-learn expects a 2D feature array\n", "xfit = np.linspace(0, 10, 100)\n", "yfit = model.predict(xfit[:, np.newaxis])\n", "plt.plot(xfit, yfit, color=\"black\")\n", "plt.plot(x, y, 'o')\n", "# The following draws as many line segments as there are columns in the stacked matrices\n", "plt.plot(np.vstack([x, x]), np.vstack([y, model.predict(x[:, np.newaxis])]), color=\"red\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear regression minimizes the sum of squared errors $\\sum_i (y[i] - \\hat{y}[i])^2$; this is the sum of the squared lengths of the red line segments in the above plot. The estimated values $\\hat{y}[i]$ at the data points are given by `model.predict(x[:, np.newaxis])` in the above code; the array `yfit` contains predictions on the denser grid `xfit`."
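] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check of the least-squares claim, we can compare the sum of squared errors of the fitted line with that of the true line $y=2x+1$ on this noisy sample. Since the fit minimizes this sum over all possible lines, the fitted line's error can never be larger:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sum of squared errors of the fitted line at the data points\n", "sse_fit = np.sum((y - model.predict(x[:, np.newaxis]))**2)\n", "# Sum of squared errors of the true line used to generate the data\n", "sse_true = np.sum((y - (2*x + 1))**2)\n", "print(sse_fit, sse_true) # sse_fit <= sse_true always holds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fitted line minimizes the sum of squares over all lines, so `sse_fit` cannot exceed `sse_true`."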
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Parameters:\", model.coef_, model.intercept_)\n", "print(\"Coefficient:\", model.coef_[0])\n", "print(\"Intercept:\", model.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the coefficient is the slope of the fitted line, and the intercept is the point where the fitted line intersects the y-axis. Note that both estimates are close to the true values (2 and 1) used to generate the data.\n", "\n", "