{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "| - | - | - |\n", "|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|\n", "| [Exercise 1 (split date continues)](<#Exercise-1-(split-date-continues)>) | [Exercise 2 (cycling weather)](<#Exercise-2-(cycling-weather)>) | [Exercise 3 (top bands)](<#Exercise-3-(top-bands)>) |\n", "| [Exercise 4 (cyclists per day)](<#Exercise-4-(cyclists-per-day)>) | [Exercise 5 (best record company)](<#Exercise-5-(best-record-company)>) | [Exercise 6 (suicide fractions)](<#Exercise-6-(suicide-fractions)>) |\n", "| [Exercise 7 (suicide weather)](<#Exercise-7-(suicide-weather)>) | [Exercise 8 (bicycle timeseries)](<#Exercise-8-(bicycle-timeseries)>) | [Exercise 9 (commute)](<#Exercise-9-(commute)>) |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas (continues)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Catenating datasets\n", "\n", "We already saw in the NumPy section how we can catenate arrays along an axis: `axis=0` catenates vertically and `axis=1` catenates horizontally, and so on. With the DataFrames of Pandas it works similarly except that the row indices and the column names require extra attention. Also note a slight difference in the name: `np.concatenate` but `pd.concat`.\n", "\n", "Let's start by considering catenation along the axis 0, that is, vertical catenation. We will first make a helper function to easily create DataFrames for testing." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def makedf(cols, ind):\n", " data = {c : [str(c) + str(i) for i in ind] for c in cols}\n", " return pd.DataFrame(data, ind)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we will create some example DataFrames:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a=makedf(\"AB\", [0,1])\n", "a" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b=makedf(\"AB\", [2,3])\n", "b" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c=makedf(\"CD\", [0,1])\n", "c" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d=makedf(\"BC\", [2,3])\n", "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following simple case, the `concat` function works exactly as we expect it would:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([a,b]) # The default axis is 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next, however, will create duplicate indices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r=pd.concat([a,a])\n", "r" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r.loc[0,\"A\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is not usually what we want! There are three solutions to this. 
Firstly, we can deny the creation of duplicate indices by passing the `verify_integrity` parameter to the `concat` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", "    pd.concat([a, a], verify_integrity=True)\n", "except ValueError as e:\n", "    import sys\n", "    print(e, file=sys.stderr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Secondly, we can ask for automatic renumbering of the rows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([a, a], ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thirdly, we can ask for *hierarchical indexing*. The indices can then contain multiple levels, but in this course we don't consider hierarchical indices in detail. Hierarchical indices can make a two-dimensional array work like a higher-dimensional array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r2 = pd.concat([a, a], keys=['first', 'second'])\n", "r2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r2[\"A\"][\"first\"][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Everything works similarly when we want to catenate horizontally:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([a, c], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have so far assumed that when catenating vertically the columns of both DataFrames are the same, and when catenating horizontally the indices are the same. This is, however, not required:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([a, d], sort=False) # The sort parameter is used to silence a deprecation warning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cells that are missing from one of the DataFrames were filled with `NaN` values. 
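What happened here can be sketched with explicit stand-ins for the frames `a` and `d` (illustrative literals, not the course data):

```python
import pandas as pd

a = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
d = pd.DataFrame({"B": ["B2", "B3"], "C": ["C2", "C3"]}, index=[2, 3])

r = pd.concat([a, d], sort=False)
# The result has the union of the columns {A, B, C};
# every cell with no source value becomes NaN.
```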
This method is called an *outer join*, which forms the union of the columns of the two DataFrames. The alternative is an *inner join*, which forms the intersection of the columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([a, d], join=\"inner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####
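A minimal sketch of the intersection behaviour, again with explicit stand-ins for `a` and `d` (illustrative):

```python
import pandas as pd

a = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
d = pd.DataFrame({"B": ["B2", "B3"], "C": ["C2", "C3"]}, index=[2, 3])

# Only the column B, common to both frames, survives; no NaNs are created
inner = pd.concat([a, d], join="inner")
```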