{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "| - | - | - |\n", "|-----------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|\n", "| [Exercise 1 (cities)](<#Exercise-1-(cities)>) | [Exercise 2 (powers of series)](<#Exercise-2-(powers-of-series)>) | [Exercise 3 (municipal information)](<#Exercise-3-(municipal-information)>) |\n", "| [Exercise 4 (municipalities of finland)](<#Exercise-4-(municipalities-of-finland)>) | [Exercise 5 (swedish and foreigners)](<#Exercise-5-(swedish-and-foreigners)>) | [Exercise 6 (growing municipalities)](<#Exercise-6-(growing-municipalities)>) |\n", "| [Exercise 7 (subsetting with loc)](<#Exercise-7-(subsetting-with-loc)>) | [Exercise 8 (subsetting by positions)](<#Exercise-8-(subsetting-by-positions)>) | [Exercise 9 (snow depth)](<#Exercise-9-(snow-depth)>) |\n", "| [Exercise 10 (average temperature)](<#Exercise-10-(average-temperature)>) | [Exercise 11 (below zero)](<#Exercise-11-(below-zero)>) | [Exercise 12 (cyclists)](<#Exercise-12-(cyclists)>) |\n", "| [Exercise 13 (missing value types)](<#Exercise-13-(missing-value-types)>) | [Exercise 14 (special missing values)](<#Exercise-14-(special-missing-values)>) | [Exercise 15 (last week)](<#Exercise-15-(last-week)>) |\n", "| [Exercise 16 (split date)](<#Exercise-16-(split-date)>) | [Exercise 17 (cleaning data)](<#Exercise-17-(cleaning-data)>) | |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas (continues)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creation of dataframes\n", "\n", "The DataFrame is essentially a two dimensional object, 
and it can be created in three different ways:\n", "\n", "* out of a two-dimensional NumPy array\n", "* out of given columns\n", "* out of given rows\n", "\n", "### Creating DataFrames from a NumPy array\n", "\n", "In the following example, a DataFrame with 2 rows and 3 columns is created. The row and column indices are given explicitly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=pd.DataFrame(np.random.randn(2,3), columns=[\"First\", \"Second\", \"Third\"], index=[\"a\", \"b\"])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that both the row labels and the column labels are now stored in special `Index` objects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index # These are the \"row names\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns # These are the \"column names\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If either the `columns` or the `index` argument is left out, then an implicit integer index will be used:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2=pd.DataFrame(np.random.randn(2,3), index=[\"a\", \"b\"])\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the column index is an object similar to Python's built-in `range` type:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating DataFrames from columns\n", "\n", "A column can be specified as a list, a NumPy array, or a Pandas Series. The column names can be given either with the `columns` parameter or, if Series objects are used, taken from the `name` attribute of each Series." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s1 = pd.Series([1,2,3])\n", "s1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2 = pd.Series([4,5,6], name=\"b\")\n", "s2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Give the column name explicitly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(s1, columns=[\"a\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `name` attribute of the Series `s2` as the column name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(s2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If using multiple columns, then they must be given as a dictionary whose keys are the column names and whose values are the actual column contents." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({\"a\": s1, \"b\": s2})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating DataFrames from rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can give a list of rows as a parameter to the DataFrame constructor. Each row is given as a dict, list, Series, or NumPy array. If we want to give names for the columns, then either the rows must be dictionaries, where each key is a column name and each value is the element of the DataFrame in that row and column, or else the column names must be given explicitly. 
An example of this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=pd.DataFrame([{\"Wage\" : 1000, \"Name\" : \"Jack\", \"Age\" : 21}, {\"Wage\" : 1500, \"Name\" : \"John\", \"Age\" : 29}])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame([[1000, \"Jack\", 21], [1500, \"John\", 29]], columns=[\"Wage\", \"Name\", \"Age\"])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "