{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "| - | - | - |\n", "|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|\n", "| [Exercise 1 (integers in brackets)](<#Exercise-1-(integers-in-brackets)>) | [Exercise 2 (file listing)](<#Exercise-2-(file-listing)>) | [Exercise 3 (red green blue)](<#Exercise-3-(red-green-blue)>) |\n", "| [Exercise 4 (word frequencies)](<#Exercise-4-(word-frequencies)>) | [Exercise 5 (summary)](<#Exercise-5-(summary)>) | [Exercise 6 (file count)](<#Exercise-6-(file-count)>) |\n", "| [Exercise 7 (file extensions)](<#Exercise-7-(file-extensions)>) | [Exercise 8 (prepend)](<#Exercise-8-(prepend)>) | [Exercise 9 (rational)](<#Exercise-9-(rational)>) |\n", "| [Exercise 10 (extract numbers)](<#Exercise-10-(extract-numbers)>) | | |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Python (continues)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular expressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples\n", "\n", "We have already seen that we can ask from a string `str`\n", "whether it begins with some substring as follows:\n", "`str.startswith('Apple')`.\n", "If we would like to know whether it starts with `\"Apple\"` or\n", "`\"apple\"`, we would have to call `startswith` method twice.\n", "Regular expressions offer a simpler solution:\n", "`re.match(r\"[Aa]pple\", str)`.\n", "The bracket notation is one example of the special syntax of\n", "*regular expressions*. In this case it says that any of the\n", "characters inside brackets will do: either `\"A\"` or `\"a\"`. The other\n", "letters in `\"pple\"` will act normally. 
The string `r\"[Aa]pple\"` is\n", "called a *pattern*.\n", "\n", "A more complicated example asks whether the string `str`\n", "starts with either `apple` or `banana` (no matter if the first letter\n", "is capital or not):\n", "`re.match(r\"[Aa]pple|[Bb]anana\", str)`.\n", "In this example we saw a new special character `|` that denotes\n", "an alternative. On either side of the bar character we have a\n", "*subpattern*.\n", "\n", "A legal variable name in Python starts with a letter or an\n", "underscore, and the following characters can also be\n", "digits.\n", "So legal names are, for instance: `_hidden`, `L_value`, `A123_`.\n", "But the name `2abc` is not a valid variable name.\n", "Let’s see what the regular expression pattern for\n", "recognising valid variable names would be:\n", "`r\"[A-Za-z_][A-Za-z_0-9]*\\Z\"`.\n", "Here we have used a shorthand for character ranges: `A-Z`.\n", "This means all the characters from `A` to `Z`.\n", "\n", "The first character of the variable name is defined in the first\n", "brackets. The subsequent characters are defined in the second\n", "brackets.\n", "The special character `*` means that we allow any number\n", "(0, 1, 2, ...) of repetitions of the previous subpattern. For example, the\n", "pattern `r\"ba*\"` allows strings `\"b\"`, `\"ba\"`, `\"baa\"`, `\"baaa\"`, and\n", "so on.\n", "The special syntax `\\Z` denotes the end of the string.\n", "Without it we would also accept `abc-` as a valid name, since\n", "the `match` function only checks that a string starts with a pattern.\n", "\n", "Special notations like `\\Z` also cause problems with string\n", "handling.\n", "Remember that normally in string literals we have some\n", "special notation: `\\n` stands for newline, `\\t` stands for tab, and\n", "so on.\n", "So, both string literals and regular expressions use similar\n", "looking notations, which can create serious confusion.\n", "This can be solved by using so-called *raw strings*. 
We\n", "denote a raw string by putting the letter `r` before the first\n", "quotation mark, for example `r\"ab*\\Z\"`.\n", "When using raw strings, the newline (`\\n`), tab (`\\t`), and other\n", "special string literal notations aren’t interpreted. One should\n", "always use raw strings when defining regular expression\n", "patterns!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Patterns\n", "\n", "A pattern represents a set of strings. This set can even be\n", "infinite.\n", "Patterns can be used to describe a set of strings that have some\n", "commonality, some regular structure.\n", "Regular expressions (RE) are a classical computer science topic.\n", "They are very common in programming tasks. Scripting\n", "languages, like Python, have good support for regular expressions.\n", "Very complex text processing can be achieved using regular\n", "expressions.\n", "\n", "In patterns, normal characters (letters, numbers) just represent\n", "themselves, unless preceded by a backslash, which may trigger\n", "some special meaning.\n", "Many punctuation characters have a special meaning, unless preceded\n", "by a backslash (`\\`), which removes their special meaning.\n", "Use `\\\\` to represent a backslash character without any special\n", "meaning.\n", "In the following we will go through some of the more\n", "common RE notations.\n", "\n", "```\n", ". Matches any character (except a newline, by default)\n", "[...] Matches any character contained within the brackets\n", "[^...] Matches any character not appearing after the hat (^)\n", "^ Matches the start of the string\n", "$ Matches the end of the string\n", "* Matches zero or more repetitions of the previous RE\n", "+ Matches one or more repetitions of the previous RE\n", "{m,n} Matches m to n occurrences of the previous RE\n", "? 
Matches zero or one occurrence of the previous RE\n", "```\n", "\n", "We have already seen that a `|` character denotes alternatives.\n", "For example, the pattern `r\"Get (on|off|ready)\"` matches\n", "the following strings: `\"Get on\"`, `\"Get off\"`, `\"Get ready\"`.\n", "We can use parentheses to create groupings inside a pattern:\n", "`r\"(ab)+\"` will match the strings `\"ab\"`, `\"abab\"`, `\"ababab\"`,\n", "and so on.\n", "These groups are also given a reference number starting from 1.\n", "We can refer to groups using backreferences: `\\number`.\n", "For example, we can find a substring that gets repeated,\n", "separated by spaces: `r\"([a-z]{3,}) \\1 \\1\"`.\n", "This will recognise, for example, the following strings:\n", "`\"aca aca aca\"`, `\"turn turn turn\"`. But not the strings\n", "`\"aca aba aca\"` or `\"ac ac ac\"`.\n", "\n", "\n", "In the following, note that a hat (^) as the first character\n", "inside brackets will create a complement set of characters:\n", "\n", "```\n", "\\d same as [0-9], matches a digit\n", "\\D same as [^0-9], matches anything but a digit\n", "\\s matches a whitespace character (space, newline, tab, ...)\n", "\\S matches a non-whitespace character\n", "\\w same as [a-zA-Z0-9_], matches one alphanumeric character or underscore\n", "\\W matches one character that is not matched by \\w\n", "```\n", "\n", "In Python 3 these shorthands also match non-ASCII Unicode digits,\n", "whitespace, and letters when the pattern is a `str`, unless the `re.ASCII` flag is used.\n", "\n", "Using the above notation we can now shorten our previous\n", "variable name example to `r'[a-zA-Z_]\\w*\\Z'`.\n", "\n", "The patterns `\\A`, `\\b`, `\\B`, and `\\Z` will all match an empty\n", "string, but only in specific places.\n", "The patterns `\\A` and `\\Z` will recognise the beginning and end\n", "of the string, respectively.\n", "Note that the patterns `^` and `$` can in some cases also match\n", "after a newline and before a newline, respectively.\n", "So, `\\A` is distinct from `^`, and `\\Z` is distinct from `$`.\n", "The pattern `\\b` matches at the start or end of a word. The\n", "pattern `\\B` does the reverse: it matches only where `\\b` does not." 
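, "\n", "\n", "As a minimal sketch of the notations above (the helper name `is_valid_name` is an assumption made for this illustration, not something defined earlier), the variable name pattern and the backreference pattern can be tried out as follows:\n", "\n", "```python\n", "import re\n", "\n", "# Hypothetical helper: True if name matches the variable name pattern above.\n", "def is_valid_name(name):\n", "    return re.match(r'[a-zA-Z_]\\w*\\Z', name) is not None\n", "\n", "print(is_valid_name('_hidden'))  # True\n", "print(is_valid_name('2abc'))     # False: starts with a digit\n", "print(is_valid_name('abc-'))     # False: \\Z rejects the trailing '-'\n", "\n", "# Backreference: a word of three or more letters repeated three times.\n", "repeated = r'([a-z]{3,}) \\1 \\1'\n", "print(bool(re.match(repeated, 'turn turn turn')))  # True\n", "print(bool(re.match(repeated, 'ac ac ac')))        # False: the word is too short\n", "```"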
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Match and search functions\n", "\n", "We have so far only used the `re.match` function, which tries\n", "to find a match at the beginning of a string.\n", "The function `re.search` looks for a match anywhere in a\n", "string.\n", "Example: `re.search(r'\\bback\\b', s)` will match\n", "strings `\"back\"`, `\"a back, is a body part\"`, `\"get back\"`. But it\n", "will not match the strings `\"backspace\"` or `\"comeback\"`.\n", "\n", "The function `re.search` finds only the first occurrence.\n", "We can use the `re.findall` function to find all occurrences.\n", "Let’s say we want to find all present participle words in a\n", "string `s`. The present participle words have ending `'ing'`.\n", "The function call would look like this:\n", "`re.findall(r'\\w+ing\\b', s)`.\n", "Let’s try running this:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Doing', 'going', 'staying', 'sleeping']" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "s = \"Doing things, going home, staying awake, sleeping later\"\n", "re.findall(r'\\w+ing\\b', s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s say we want to pick up all the integers from a string.\n", "We can try that with the following function call:\n", "`re.findall(r'[+-]?\\d+', s)`.\n", "An example run:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['23', '-24', '-1']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r'[+-]?\\d+', \"23 + -24 = -1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we are given a string of if/then sentences, and we\n", "would like to extract the conditions from these sentences.\n", "Let’s try the following function call:\n", "`re.findall(r'[Ii]f (.*), then', 
s)`.\n", "An example run:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['I’m not in a hurry, then I should stay. On the other hand, if I leave']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = (\"If I’m not in a hurry, then I should stay. \" +\n", " \"On the other hand, if I leave, then I can sleep.\")\n", "re.findall(r'[Ii]f (.*), then', s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But I wanted a result: `[\"I'm not in a hurry\", 'I leave']`. That\n", "is, the condition from both sentences. How can this be fixed?\n", "\n", "The problem is that the pattern `.*` tries to match as many\n", "characters as possible.\n", "This is called *greedy matching*.\n", "One way of solving this problem is to notice that the two\n", "sentences are separated by a full-stop (.).\n", "So, instead of matching all the characters, we need to match\n", "everything but the dot character.\n", "This can be achieved by using the complement character\n", "class: `[^.]`. 
The hat character (`^`) at the beginning of a\n", "character class means the complement character class.\n", "\n", "After the modification the function call looks like this:\n", "`re.findall(r'[Ii]f ([^.]*), then', s)`.\n", "Another way of solving this problem is to use\n", "non-greedy matching.\n", "The repetition specifiers `+`, `*`, `?`, and `{m,n}` have\n", "corresponding non-greedy versions: `+?`, `*?`, `??`, and `{m,n}?`.\n", "These expressions use as few characters as possible to make\n", "the whole pattern match some substring.\n", "By using the non-greedy version, the function call looks like this:\n", "`re.findall(r'[Ii]f (.*?), then', s)`.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Functions in the `re` module\n", "\n", "Below is a list of the most common functions in the `re` module:\n", "\n", "* `re.match(pattern, str)`\n", "* `re.search(pattern, str)`\n", "* `re.findall(pattern, str)`\n", "* `re.finditer(pattern, str)`\n", "* `re.sub(pattern, replacement, str, count=0)`\n", "\n", "Functions `match` and `search` return a *match object*, or `None` if no match is found.\n", "A match object describes the found occurrence.\n", "The function `findall` returns a list of all the occurrences of\n", "the pattern. The elements in the list are strings\n", "(or, if the pattern contains groups, the strings matched by the groups).\n", "The function `finditer` works like the `findall` function except\n", "that instead of returning a list, it returns an iterator whose\n", "items are match objects.\n", "The function `sub` replaces all the occurrences of the pattern in\n", "`str` with the string `replacement` and returns the new string.\n", "\n", "An example: The following program will replace all \"she\"\n", "words with \"he\":\n", "\n", "```\n", "import re\n", "str = \"She goes where she wants to, she's a sheriff.\"\n", "newstr = re.sub(r'\\b[Ss]he\\b', 'he', str)\n", "print(newstr)\n", "```\n", "\n", "This will print `he goes where he wants to, he's a sheriff.`\n", "\n", "The `sub` function can also use backreferences to refer to the\n", "matched string. 
The backreferences `\\1`, `\\2`, and so on, refer\n", "to the groups of the pattern, in order.\n", "An example:\n", "```\n", "import re\n", "str = \"\"\"He is the president of Russia.\n", "He’s a powerful man.\"\"\"\n", "newstr = re.sub(r'(\\b[Hh]e\\b)', r'\\1 (Putin)', str, count=1)\n", "print(newstr)\n", "```\n", "\n", "This will print\n", "```\n", "He (Putin) is the president of Russia.\n", "He’s a powerful man.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Match object\n", "\n", "Functions `match`, `search`, and `finditer` use match objects\n", "to describe the found occurrence.\n", "The method `groups()` of the match object returns the tuple\n", "of all the substrings matched by the groups of the pattern.\n", "Each pair of parentheses in the pattern creates a new group.\n", "These groups are referred to by indices 1, 2, ...\n", "The group 0 is a special one: it refers to the match created by\n", "the whole pattern.\n", "\n", "Let’s look at the match object returned by the call\n", "\n", "```\n", "mo = re.search(r'\\d+ (\\d+) \\d+ (\\d+)',\n", "               'first 123 45 67 890 last')\n", "```\n", "\n", "The call `mo.groups()` returns a tuple `('45', '890')`.\n", "We can access individual groups by using the\n", "method `group(gid, ...)`.\n", "For example, the call `mo.group(1)` will return `'45'`.\n", "The zeroth group will represent the whole match:\n", "`'123 45 67 890'`.\n", "\n", "In addition to accessing the strings matched by the pattern\n", "and its groups, the corresponding indices of the original string\n", "can be accessed:\n", "\n", "* The `start(gid=0)` and `end(gid=0)` methods return the start\n", "and end indices of the matched group `gid`, respectively\n", "* The method `span(gid=0)` just returns the pair of these start\n", "and end indices\n", "\n", "The match object `mo` can also be used like a boolean value:\n", "\n", "```python\n", "mo = re.search(...)\n", "if mo:\n", "    # do something\n", "```\n", "\n", "will do something if 
a match was found.\n", "Alternatively, the match object can be converted to a boolean\n", "value by the call `found = bool(mo)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Miscellaneous stuff\n", "\n", "If the same pattern is used in many function calls, it may be\n", "wise to precompile the pattern, mainly for efficiency reasons.\n", "This can be done using the `compile(pattern, flags=0)` function\n", "in the `re` module. The function returns a so-called RE object.\n", "The RE object has method versions of the functions found in\n", "module `re`.\n", "The only difference is that the first parameter is not the\n", "pattern, since the precompiled pattern is stored in the RE\n", "object.\n", "\n", "The details of the matching operation can be specified using\n", "optional flags.\n", "These flags can be given either inside the pattern or as a\n", "parameter to the `compile` function.\n", "Some of the more common flags are given in the following\n", "table.\n", "\n", "| Inline flag | Flag |\n", "|-----|--------------|\n", "|`(?i)` | re.IGNORECASE|\n", "|`(?m)` | re.MULTILINE|\n", "|`(?s)` | re.DOTALL|\n", "\n", "The elements on the left are written inside the pattern and should be placed\n", "at its beginning (since Python 3.11, placing them elsewhere is an error).\n", "On the right there are attributes of the `re` module that can be\n", "given to the `compile` function as the second parameter.\n", "\n", "The `IGNORECASE` flag makes lower- and uppercase\n", "characters appear as equal.\n", "The `MULTILINE` flag makes the special characters `^` and `$`\n", "match the beginning and end of each line in addition to the\n", "beginning and end of the whole string. 
This flag makes `\\A`\n", "differ from `^`, and `\\Z` differ from `$`.\n", "The `DOTALL` flag makes the character class `.` (dot) also\n", "accept the newline character, in addition to all the other\n", "characters.\n", "\n", "When giving multiple flags to the `compile` function, the flags\n", "can be combined with the `|` operator.\n", "For example, `re.compile(pattern, re.MULTILINE | re.DOTALL)`.\n", "This is equivalent to `re.compile('(?m)(?s)' + pattern)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####