{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "| - | - | - |\n", "|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------|\n", "| [Exercise 13 (read series)](<#Exercise-13-(read-series)>) | [Exercise 14 (operations on series)](<#Exercise-14-(operations-on-series)>) | [Exercise 15 (inverse series)](<#Exercise-15-(inverse-series)>) |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas\n", "\n", "In the NumPy section we dealt with some arrays, whose columns had each a special meaning. For example, the column number 0 could contain values interpreted as years, and column 1 could contain a month, and so on. It is possible to handle the data this way, but in can be hard to remember, which column number corresponds to which variable. Especially, if you later remove some column from the array, then the numbering of the remaining columns changes. One solution to this is to give a descriptive name to each column. These column names stay fixed and attached to their corresponding columns, even if we remove some of the columns. In addition, the rows can be given names as well, these are called *indices* in Pandas.\n", "\n", "The [Pandas](http://pandas.pydata.org/) library is built on top of the NumPy library, and it provides a special kind of two dimensional data structure called `DataFrame`. The `DataFrame` allows to give names to the columns, so that one can access a column using its name in place of the index of the column.\n", "\n", "First we will quickly go through a few examples to see what is possible with Pandas. You may need to check some details from the Pandas [documentation](http://pandas.pydata.org/pandas-docs/stable/) in order to complete the exercises. 
We start by doing some standard imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # This is the standard way of importing the Pandas library\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import some weather data that is in text form in a csv (Comma Separated Values) file. The following call will fetch the data from the internet and convert it to a DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh = pd.read_csv(\"https://www.cs.helsinki.fi/u/jttoivon/dap/data/fmi/kumpula-weather-2017.csv\")\n", "wh.head() # The head method returns the first 5 rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the DataFrame contains eight columns, three of which are actual measured variables. Now we can refer to a column by its name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh[\"Snow depth (cm)\"].head() # Using the tab key can help enter long column names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several summary statistic methods that operate on a column or on all the columns. 
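For example, on a small toy Series (made-up numbers, separate from the weather data) these methods work like this:

```python
import pandas as pd

temps = pd.Series([2.0, 4.0, 6.0])  # toy data
print(temps.mean())  # 4.0
print(temps.min())   # 2.0
print(temps.max())   # 6.0
```
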
The next example computes the mean of the temperatures over all rows of the DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh[\"Air temperature (degC)\"].mean() # Mean temperature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can drop some columns from the DataFrame with the `drop` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh.drop(\"Time zone\", axis=1).head() # Return a copy with one column removed; the original DataFrame stays intact" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh.head() # Original DataFrame is unchanged" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case you want to modify the original DataFrame, you can either assign the result to the original DataFrame or use the `inplace` parameter of the `drop` method. Many of the modifying methods of the DataFrame have the `inplace` parameter.\n", "\n", "Addition of a new column works like adding a new key-value pair to a dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wh[\"Rainy\"] = wh[\"Precipitation amount (mm)\"] > 5\n", "wh.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next sections we will systematically go through the DataFrame and its one-dimensional version: *Series*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creation and indexing of series\n", "\n", "One can turn any one-dimensional iterable into a Series, which is a one-dimensional data structure:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = pd.Series([1, 4, 5, 2, 5, 2])\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data type of the elements in this Series is `int64`, integers representable in 64 bits. 
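If needed, the element type can also be chosen explicitly with the `dtype` parameter; a small sketch:

```python
import pandas as pd

# Same values as above, but stored as 64-bit floats instead of integers
s_float = pd.Series([1, 4, 5, 2, 5, 2], dtype=float)
print(s_float.dtype)  # float64
```
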
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also attach a name to this series:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.name = \"Grades\"\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The common attributes of the series are the `name`, `dtype`, and `size`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Name: {s.name}, dtype: {s.dtype}, size: {s.size}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to the values of the series, also the row indices were printed. All the accessing methods from NumPy arrays also work for the Series: indexing, slicing, masking and fancy indexing. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2=s[[0,5]] # Fancy indexing\n", "print(s2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t=s[-2:] # Slicing\n", "t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the indices stick to the corresponding values, they are not renumbered!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t[4] # t[0] would give an error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values as a NumPy array are accessible via the `values` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the indices are available through the `index` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index is not simply a NumPy array, but a data structure that allows fast access to the elements. The indices need not be integers, as the next example shows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3=pd.Series([1, 4, 5, 2, 5, 2], index=list(\"abcdef\"))\n", "s3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3[\"b\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "