{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\"Open\n", "\n", "| - | - | - |\n", "|-------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|\n", "| [Exercise 10 (linear regression)](<#Exercise-10-(linear-regression)>) | [Exercise 11 (mystery data)](<#Exercise-11-(mystery-data)>) | [Exercise 12 (coefficient of determination)](<#Exercise-12-(coefficient-of-determination)>) |\n", "| [Exercise 13 (cycling weather continues)](<#Exercise-13-(cycling-weather-continues)>) | | |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine learning: linear regression\n", "\n", "## Linear regression\n", "Regression analysis tries to explain relationships between variables. One of these variables, called dependend variable, is what we want to \"explain\" using one or more *explanatory variables*. In linear regression we assume that the dependent variable can be, approximately, expressed as a linear combination of the explanatory variables. As a simple example, we might have dependent variable height and an explanatory variable age. The age of a person can quite well explain the height of a person, and this relationship is approximately linear for kids (ages between 1 and 16). Another way of thinking about regression is fitting a curve to the observed data points. If we have only one explanatory variable, then this is easy to visualize, as we shall see below.\n", "\n", "We can apply the linear regression easily with the [scikit-learn](https://scikit-learn.org/stable/) package. Let's go through some examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we make the usual standard imports." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import sklearn # This imports the scikit-learn library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we create some data with approximately the relationship $y=2x+1$, with normally distributed errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(0)\n", "n=20 # Number of data points\n", "x=np.linspace(0, 10, n)\n", "y=x*2 + 1 + 1*np.random.randn(n) # Standard deviation 1\n", "print(x)\n", "print(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we import the `LinearRegression` class." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can fit a line through the data points (x, y):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model=LinearRegression(fit_intercept=True)\n", "model.fit(x[:,np.newaxis], y)\n", "xfit=np.linspace(0,10,100)\n", "yfit=model.predict(xfit[:, np.newaxis])\n", "plt.plot(xfit, yfit, color=\"black\")\n", "plt.plot(x, y, 'o')\n", "# Draws one vertical line segment per data point: each column of the stacked matrices defines one segment\n", "plt.plot(np.vstack([x,x]), np.vstack([y, model.predict(x[:, np.newaxis])]), color=\"red\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The linear regression tries to minimize the sum of squared errors $\sum_i (y[i] - \hat{y}[i])^2$; this is the sum of the squared lengths of the red line segments in the above plot. The estimated values $\hat{y}[i]$ are given by `model.predict(x[:, np.newaxis])` in the above code; note that `yfit` contains predictions on the denser grid `xfit`, which is used only to draw the fitted line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Parameters:\", model.coef_, model.intercept_)\n", "print(\"Coefficient:\", model.coef_[0])\n", "print(\"Intercept:\", model.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the coefficient is the slope of the fitted line, and the intercept is the point where the fitted line intersects the y-axis.\n", "\n", "
\n", "Note that in scikit-learn the attributes of the model that store the learned parameters have always an underscore at the end of the name. This applies to all algorithms in sklearn, not only the linear regression. This naming style allows one to easily spot the learned model parameters from other attributes.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameters estimated by the regression algorithm were quite close to the parameters that generated the data: coefficient 2 and intercept 1. Try experimenting with the number of data points and/or the standard deviation, to see if you can improve the estimated parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous example had only one explanatory variable. Sometimes this is called a *simple linear regression*. The next example illustrates a more complex regression with multiple explanatory variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample1=np.array([1,2,3]) # The three explanatory variables have values 1, 2, and 3, respectively\n", "sample2=np.array([4,5,6]) # Another example of values of explanatory variables\n", "sample3=np.array([7,8,10]) # ...\n", "y=np.array([15,39,66]) + np.random.randn(3) # For values 1,2, and 3 of explanatory variables, the value y=15 was observed, and so on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to fit a linear model to these points:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model2=LinearRegression(fit_intercept=False)\n", "x=np.vstack([sample1,sample2,sample3])\n", "model2.fit(x, y)\n", "model2.coef_, model2.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's print the various components involved." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b=model2.coef_[:, np.newaxis]\n", "print(\"x:\\n\", x)\n", "print(\"b:\\n\", b)\n", "print(\"y:\\n\", y[:, np.newaxis])\n", "print(\"product:\\n\", np.matmul(x, b))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Polynomial regression\n", "It may perhaps come as a surprise that one can fit a polynomial curve to data points\n", "using linear regression. The trick is to add new explanatory variables to the model.\n", "Below we have a single feature x with associated y values given by third degree polynomial,\n", "with some (gaussian) noise added. It is clear from the below plot that we cannot explain the data well with a linear function. We add two new features: $x^2$ and $x^3$. Now the model has three explanatory variables, $x, x^2$ and $x^3$. The linear regression will find the coefficients for these variables." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x=np.linspace(-50,150,50)\n", "y=0.15*x**3 - 20*x**2 + 5*x - 4 + 5000*np.random.randn(50)\n", "plt.scatter(x, y, color=\"black\")\n", "model_linear=LinearRegression(fit_intercept=True)\n", "model_squared=LinearRegression(fit_intercept=True)\n", "model_cubic=LinearRegression(fit_intercept=True)\n", "x2=x**2\n", "x3=x**3\n", "model_linear.fit(np.vstack([x]).T, y)\n", "model_squared.fit(np.vstack([x,x2]).T, y)\n", "model_cubic.fit(np.vstack([x,x2,x3]).T, y)\n", "xf=np.linspace(-50,150, 50) # xf coincides with x here, so we can reuse the feature matrices built from x\n", "yf_linear=model_linear.predict(np.vstack([x]).T)\n", "yf_squared=model_squared.predict(np.vstack([x,x2]).T)\n", "yf_cubic=model_cubic.predict(np.vstack([x,x2,x3]).T)\n", "plt.plot(xf,yf_linear, label=\"linear\")\n", "plt.plot(xf,yf_squared, label=\"squared\")\n", "plt.plot(xf,yf_cubic, label=\"cubic\")\n", "plt.legend()\n", "print(\"Coefficients:\", model_cubic.coef_)\n", "print(\"Intercept:\", model_cubic.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The linear and quadratic models are not enough to explain the data, but linear regression manages to fit a polynomial curve to the data points quite well once the cubic term is included!" ] }
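, { "cell_type": "markdown", "metadata": {}, "source": [ "To compare the three fits numerically, we can use the `score` method of `LinearRegression`, which returns the coefficient of determination $R^2$ (the closer to 1, the better the fit). A minimal check, reusing the models and feature matrices defined above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# score returns the R^2 value of the fit on the given data\n", "print(\"R2 linear: \", model_linear.score(np.vstack([x]).T, y))\n", "print(\"R2 squared:\", model_squared.score(np.vstack([x,x2]).T, y))\n", "print(\"R2 cubic:  \", model_cubic.score(np.vstack([x,x2,x3]).T, y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####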
Exercise 10 (linear regression)
\n", "\n", "This exercise can give two points at maximum!\n", "\n", "Part 1.\n", "\n", "Write a function `fit_line` that gets one dimensional arrays `x` and `y` as parameters. The function should return the tuple `(slope, intercept)` of the fitted line. Write a main program that tests the `fit_line` function with some example arrays. The main function should produce output in the following form:\n", "\n", "```\n", "Slope: 1.0\n", "Intercept: 1.16666666667\n", "```\n", "\n", "Part 2.\n", "\n", "Modify your `main` function to plot the fitted line using matplotlib, in addition to the textual output. Plot also the original data points.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####
Exercise 11 (mystery data)
\n", "Read the tab separated file `mystery_data.tsv`. Its first five columns define the features, and the last column is the response. Use scikit-learn's `LinearRegression` to fit this data. Implement function `mystery_data` that reads this file and learns and returns the regression coefficients for the five features. You don't have to fit the intercept. The `main` method should print output in the following form:\n", "\n", "```\n", "Coefficient of X1 is ...\n", "Coefficient of X2 is ...\n", "Coefficient of X3 is ...\n", "Coefficient of X4 is ...\n", "Coefficient of X5 is ...\n", "```\n", "\n", "Which features you think are needed to explain the response Y?\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####
Exercise 12 (coefficient of determination)
\n", "\n", "This exercise can give two points at maximum!\n", "\n", "Using the same data as in the previous exercise, instead of printing the regression coefficients, print the *coefficient of determination*. The coefficient of determination, denoted by R2, tells how well the linear regression fits the data. The maximum value of the coefficient of determination is 1. That means the best possible fit.\n", "\n", "Part 1.\n", "\n", "Using all the features (X1 to X5), fit the data using a linear regression (include the intercept). Get the coefficient of determination using the `score` method of the `LinearRegression` class. Write a function `coefficient_of_determination` to do all this. It should return a list containing the R2-score as the only value.\n", "\n", "Part 2.\n", "\n", "Extend your function so that it also returns R2-scores related to linear regression with each single feature in turn. The `coefficient_of_determination` (https://en.wikipedia.org/wiki/Coefficient_of_determination) function should therefore return a list with six R2-scores (the first score is for five features, like in Part 1). To achieve this, your function should call both the `fit` method and the `score` method six times. \n", "\n", "The output from the main method should look like this:\n", "```\n", "R2-score with feature(s) X: ...\n", "R2-score with feature(s) X1: ...\n", "R2-score with feature(s) X2: ...\n", "R2-score with feature(s) X3: ...\n", "R2-score with feature(s) X4: ...\n", "R2-score with feature(s) X5: ...\n", "```\n", "How small can the R2-score be? Experiment both with fitting the intercept and without fitting the intercept.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####
Exercise 13 (cycling weather continues)
\n", "\n", "Write function `cycling_weather_continues` that tries to explain with linear regression the variable of a cycling measuring station's counts using the weather data from corresponding day. The function should take the name of a (cycling) measuring station as a parameter and return the regression coefficients and the score. In more detail:\n", "\n", "Read the weather data set from the `src` folder. Read the cycling data set from folder `src` and restrict it to year 2017. Further, get the sums of cycling counts for each day. Merge the two datasets by the year, month, and day. \n", "Note that for the above you need only small additions to the solution of exercise `cycling_weather`.\n", "After this, use forward fill to fill the missing values.\n", "\n", "In the linear regression use as explanatory variables the following columns `'Precipitation amount (mm)'`, `'Snow depth (cm)'`, and `'Air temperature (degC)'`. Explain the variable (measuring station), whose name is given as a parameter to the function `cycling_weather_continues`. Fit also the intercept. The function should return a pair, whose first element is the regression coefficients and the second element is the score. Above, you may need to use the method `reset_index` (its counterpart is the method `set_index`).\n", "\n", "The output from the `main` function should be in the following form:\n", "\n", "```\n", "Measuring station: x\n", "Regression coefficient for variable 'precipitation': x.x\n", "Regression coefficient for variable 'snow depth': x.x\n", "Regression coefficient for variable 'temperature': x.x\n", "Score: x.xx\n", "```\n", "\n", "Use precision of one decimal for regression coefficients, and precision of two decimals for the score.\n", "In the `main` function test you solution using some measuring station, for example `Baana`.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional information\n", "\n", "* The [scikit-learn](https://scikit-learn.org/stable/) library concentrates on machine learning. Check out library [statsmodels](http://www.statsmodels.org/stable/index.html) for a more statistical viewpoint to regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary (week 5)\n", "\n", "* `pd.concat` and `pd.merge` can both combine two DataFrames, but the way the combining is done differs. The function `pd.concat` concatenates based on *indices* of DataFrames, whereas `pd.merge` combines based on the *content* of common variable(s).\n", "* The option `join=\"outer` to `pd.concat` can create missing values, but `join=inner` cannot. The former gives the union of indices and the latter gives the intersection of indices.\n", "* With `pd.concat` overlapping indices can:\n", "\n", " * cause an error\n", " * cause renumbering of indices\n", " * create hierarchical indices\n", "* Merging can join elements\n", "\n", " * one-to-one\n", " * one-to-many\n", " * many-to-many\n", "* In grouping a DataFrame can be thought to be split into smaller DataFrames. The major classes of operations on these groups are:\n", "\n", " * aggregate\n", " * filter\n", " * transform (retains shape)\n", " * apply\n", "* Series which are indexed by time are called time series\n", "* Linear regression can be used to find out linear relationships between variables\n", " * can have more than one feature (explanatory variable)\n", " * fitting polynomials is still linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\"Open\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }