{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\"Open\n", "\n", "| - | - | - |\n", "|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------|\n", "| [Exercise 13 (read series)](<#Exercise-13-(read-series)>) | [Exercise 14 (operations on series)](<#Exercise-14-(operations-on-series)>) | [Exercise 15 (inverse series)](<#Exercise-15-(inverse-series)>) |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas\n", "\n", "In the NumPy section we dealt with some arrays, whose columns had each a special meaning. For example, the column number 0 could contain values interpreted as years, and column 1 could contain a month, and so on. It is possible to handle the data this way, but in can be hard to remember, which column number corresponds to which variable. Especially, if you later remove some column from the array, then the numbering of the remaining columns changes. One solution to this is to give a descriptive name to each column. These column names stay fixed and attached to their corresponding columns, even if we remove some of the columns. In addition, the rows can be given names as well, these are called *indices* in Pandas.\n", "\n", "The [Pandas](http://pandas.pydata.org/) library is built on top of the NumPy library, and it provides a special kind of two dimensional data structure called `DataFrame`. The `DataFrame` allows to give names to the columns, so that one can access a column using its name in place of the index of the column.\n", "\n", "First we will quickly go through a few examples to see what is possible with Pandas. You may need to check some details from the Pandas [documentation](http://pandas.pydata.org/pandas-docs/stable/) in order to complete the exercises. We start by doing some standard imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # This is the standard way of importing the Pandas library\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import some weather data that is in text form in a csv (Commma Separated Values) file. The following call will fetch the data from the internet and convert it to a DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh = pd.read_csv(\"https://www.cs.helsinki.fi/u/jttoivon/dap/data/fmi/kumpula-weather-2017.csv\")\n", "wh.head() # The head method prints the first 5 rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the DataFrame contains eight columns, three of which are actual measured variables. Now we can refer to a column by its name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh[\"Snow depth (cm)\"].head() # Using the tab key can help enter long column names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several summary statistic methods that operate on a column or on all the columns. 
The next example computes the mean of the temperatures over all rows of the DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh[\"Air temperature (degC)\"].mean() # Mean temperature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can drop some columns from the DataFrame with the `drop` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh.drop(\"Time zone\", axis=1).head() # Returns a copy with one column removed; the original DataFrame stays intact" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh.head() # Original DataFrame is unchanged" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to modify the original DataFrame, you can either assign the result back to the original variable or use the `inplace` parameter of the `drop` method. Many of the modifying methods of a DataFrame have the `inplace` parameter.\n", "\n", "Adding a new column works like adding a new key-value pair to a dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh[\"Rainy\"] = wh[\"Precipitation amount (mm)\"] > 5\n", "wh.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next sections we will systematically go through the DataFrame and its one-dimensional counterpart: the *Series*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creation and indexing of series\n", "\n", "One can turn any one-dimensional iterable into a Series, Pandas' one-dimensional data structure:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s=pd.Series([1, 4, 5, 2, 5, 2])\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data type of the elements in this Series is `int64`, that is, integers representable in 64 bits. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also attach a name to this Series:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.name = \"Grades\"\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The common attributes of a Series are its `name`, `dtype`, and `size`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Name: {s.name}, dtype: {s.dtype}, size: {s.size}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to the values of the Series, the row indices were also printed. All the accessing methods of NumPy arrays also work for a Series: indexing, slicing, masking, and fancy indexing. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2=s[[0,5]] # Fancy indexing\n", "print(s2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t=s[-2:] # Slicing\n", "t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the indices stick to their corresponding values; they are not renumbered!" 
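] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, masking with a boolean condition selects a subset of the elements, and each selected element keeps its original index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s[s > 3]   # Boolean masking: keep only the elements greater than 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The slice `t` created above behaves the same way:"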
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t[4] # t[0] would give an error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values as a NumPy array are accessible via the `values` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the indices are available through the `index` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index is not simply a NumPy array, but a data structure that allows fast access to the elements. The indices need not be integers, as the next example shows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3=pd.Series([1, 4, 5, 2, 5, 2], index=list(\"abcdef\"))\n", "s3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3[\"b\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Note a special case here: if the indices are not integers, then the last index of the slice is included in the result. This is contrary to slicing with integers!\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3[\"b\":\"e\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is still possible to access the series using NumPy style *implicit integer indices*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be confusing though. Consider the following series:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s4 = pd.Series([\"Jack\", \"Jones\", \"James\"], index=[1,2,3])\n", "s4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think `s4[1]` will print? For this ambiguity Pandas offers attributes `loc` and `iloc`. The attributes `loc` always uses the explicit index, while the attribute `iloc` always uses the implicit integer index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(s4.loc[1])\n", "print(s4.iloc[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####
Exercise 13 (read series)
\n", "\n", "Write function `read_series` that reads input lines from the user and return a Series. Each line should contain first the index and then the corresponding value, separated by whitespace. The index and values are strings (in this case `dtype` is `object`). An empty line signals the end of Series. Malformed input should cause an exception. An input line is malformed, if it is non-empty and, when split at whitespace, does not result in two parts.\n", "\n", "Test your function from the `main` function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####
Exercise 14 (operations on series)
\n", "\n", "Write function `create_series` that gets two lists of numbers as parameters. Both lists should have length 3.\n", "The function should first create two Series, `s1` and `s2`. The first series should have values from the first parameter list and have corresponding indices `a`, `b`, and `c`. The second series should get its values from the second parameter list and have again the corresponding indices `a`, `b`, and `c`. The function should return the pair of these Series.\n", "\n", "Then, write a function `modify_series` that gets two Series as parameters. It should add to the first Series `s1` a new value with index `d`. The new value should be the same as the value in Series `s2` with index `b`.\n", "Then delete the element from `s2` that has index `b`. Now the first Series should have four values, while the second list has only two values. Adding a new element to a Series can be achieved by assignment, like with dictionaries. Deletion of an element from a Series can be done with the `del` statement.\n", "\n", "Test these functions from the main function. Try adding together the Series returned by the `modify_series` function. The operations on Series use the indices to keep the element-wise operations *aligned*. If for some index the operation could not be performed, the resulting value will be `NaN` (Not A Number).\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####
Exercise 15 (inverse series)
\n", "\n", "Write function `inverse_series` that get a Series as a parameter and returns a new series, whose indices and values have swapped roles. Test your function from the `main` function.\n", "\n", "What happens if some value appears multiple times in the original Series? What happens if you use this value to index the resulting Series?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One may notice that there are similarities between Python's dictionaries and Pandas' Series, both can be thought to access values using keys. The difference is that Series requires that the indices have all the same type, and similarly, all the values have the same type. This restriction allows creation of fast data structures.\n", "\n", "As a mark of the similaries between these two data structures, Pandas allows creation of a `Series` object from a dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = { 2001 : \"Bush\", 2005: \"Bush\", 2009: \"Obama\", 2013: \"Obama\", 2017 : \"Trump\"}\n", "s4 = pd.Series(d, name=\"Presidents\")\n", "s4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary (week 3)\n", "\n", "* You found that comparisons are also vectorized operations, and that the result of a comparison can be used to mask (i.e. restrict) further operations on arrays\n", "* You can select a list of columns using fancy indexing\n", "* An application of NumPy arrays: basic linear algebra operations and solving systems of linear equations\n", "* You know the building blocks of matplotlib's figures. You can create figures based on NumPy arrays and you can adjust the attributes of figures\n", "* You understand how (raster) images are organized as NumPy arrays. You can manipulate images using Numpy's array operations.\n", "* In Pandas it is standard to use rows of DataFrames as samples and columns as variables\n", "* Both rows and columns of Pandas DataFrames can have names, i.e. indices\n", "\n", " * Operations maintain these indices even when adding or removing rows or columns\n", " * Indices also allow several operations to be combined meaningfully and easily" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\"Open\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }