Why do we need them?

Parsimony: We can express sets of strings more compactly.

For example: (J.|Julia) Fukuyama represents the set containing the two strings "J. Fukuyama" and "Julia Fukuyama".
Allows us to specify infinite-size sets in finite space.

For example: our , * expression from before.

Regular expressions abstractly

The formal definition of a regular expression is inductive. Suppose that we have a finite alphabet $\Sigma$.

We start with specifying the following as regular expressions:

$\emptyset$: The empty set
$\varepsilon$: The set containing the empty string ""
Literal character $a$: The one-element set $\{a\}$, for $a \in \Sigma$

The following operations, performed on regular expressions, yield regular expressions:

Concatenation: If $R$ and $S$ are regular expressions, $RS$ denotes the set of strings that can be formed by concatenating a string in $R$ followed by a string in $S$.

If $R = \{good, bad\}$ and $S = \{boy, girl\}$, then $RS = \{goodboy, goodgirl, badboy, badgirl\}$.

Alternation (boolean or): If $R$ and $S$ are regular expressions, $R|S$ denotes the set of strings formed by taking either an element of $R$ or an element of $S$. This is the same as the set union.

Same example: If $R = \{good, bad\}$ and $S = \{boy, girl\}$, then $R|S = \{good, bad, boy, girl\}$.

Kleene star: If $R$ is a regular expression, $R^\star$ denotes the set of strings created by concatenating any finite number (including zero) of the strings in $R$.

If $R = \{good, bad\}$, $R^\star$ contains the empty string "", good, goodgood, goodbad, bad, badgood, badbadgood, and so on.
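The three operations can be mimicked on small finite sets in R. What follows is only a sketch: the helper names concat and star_upto are made up for illustration, and because $R^\star$ is infinite, star_upto enumerates only the concatenations of at most k strings.

```r
R = c("good", "bad")
S = c("boy", "girl")

# Concatenation: every string in A followed by every string in B.
concat = function(A, B) as.vector(t(outer(A, B, paste0)))
RS = concat(R, S)

# Alternation is just set union.
R_or_S = union(R, S)

# Truncated Kleene star: concatenations of 0, 1, ..., k strings from A.
star_upto = function(A, k) {
  out = ""     # zero strings concatenated gives the empty string
  layer = ""
  for (i in seq_len(k)) {
    layer = concat(layer, A)  # all concatenations of exactly i strings
    out = c(out, layer)
  }
  unique(out)
}
R_star_2 = star_upto(R, 2)
```

Here RS reproduces the four strings from the concatenation example, and R_star_2 holds the seven strings built from at most two copies of good and bad, including the empty string.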

Order of operations: Kleene star has highest priority, followed by concatenation, followed by alternation.

If a different grouping is desired, use parentheses ().

Examples:

a|b*: {ε, "a", "b", "bb", "bbb", ...}

(a|b)*: The set of all strings containing only a and b, {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", ...}

ab*(c|ε): The set of strings starting with a single a, followed by zero or more b's, optionally ending with a c, {"a", "ac", "ab", "abc", "abb", "abbc", ...}
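One way to check these examples in R is to test whether a whole string belongs to the language of an expression. The helper in_language below is a hypothetical name, not a base R function; it wraps the expression in ^( and )$ so that grepl has to match the entire string. R has no ε symbol, so (c|ε) is written as c? here.

```r
# TRUE if the whole string is in the language described by `regex`.
in_language = function(regex, strings) {
  grepl(paste0("^(", regex, ")$"), strings)
}

# a|b*: either a single a, or any number of b's (including none)
ex1 = in_language("a|b*", c("", "a", "bbb", "ab"))

# (a|b)*: any string made up only of a's and b's
ex2 = in_language("(a|b)*", c("", "abba", "abc"))

# ab*(c|ε), written as ab*c?
ex3 = in_language("ab*c?", c("a", "ac", "abbc", "bc", "abcc"))
```

The string "ab" is rejected by a|b* but accepted by (a|b)*, which is the difference the parentheses make.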

How they are actually implemented

Actual implementations of regular expressions have many more symbols and operators, but they are mostly just shorthand for some common operations that would take longer to express using only the three operations in the formal definition.

Quantification operations

*: Same as in the formal definition: zero or more times.
?: Zero or one occurrence of the preceding element. colou?r matches color and colour.
+: One or more occurrences of the preceding element.
{m}: Exactly m occurrences of the preceding element.
{m,}: At least m occurrences of the preceding element.
{m,n}: Between m and n occurrences of the preceding element, inclusive.
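A quick way to get a feel for the quantifiers is to try them with grepl. The test strings below are made up for illustration, and the patterns are anchored with ^ and $ so that the whole string has to match.

```r
# ?: the u is optional
q_optional = grepl("^colou?r$", c("color", "colour", "colouur"))

# +: at least one b after the a
q_plus = grepl("^ab+$", c("a", "ab", "abbb"))

# {m}: exactly two a's
q_exact = grepl("^a{2}$", c("a", "aa", "aaa"))

# {m,}: two or more a's
q_atleast = grepl("^a{2,}$", c("a", "aa", "aaa"))

# {m,n}: between two and three a's
q_range = grepl("^a{2,3}$", c("a", "aa", "aaa", "aaaa"))
```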

Alternatives to |

[]: Matches any single character inside the brackets.
[^ ]: Negation, matches any single character that is not inside the brackets.
.: Wildcard, matches any character.
Character classes, defined differently for the different implementations. See https://en.wikipedia.org/wiki/Regular_expression#Character_classes, the POSIX column.
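A small sketch of bracket expressions, the wildcard, and a POSIX character class, using made-up test strings. Each pattern is anchored so that it has to match exactly the whole string.

```r
x = c("a", "d", "7", "ab", "")

# [abc]: exactly one character, which must be a, b, or c
in_set = grepl("^[abc]$", x)

# [^abc]: exactly one character, which must not be a, b, or c
not_in_set = grepl("^[^abc]$", x)

# .: exactly one character of any kind
any_char = grepl("^.$", x)

# [[:digit:]]: POSIX class for digits, used inside a bracket expression
digits = grepl("^[[:digit:]]+$", c("123", "12a"))
```

Note that [^abc] still needs a character to match, so the empty string is rejected by all three single-character patterns.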

Anchoring

^ (not inside square brackets) means that what comes after must be at the start of a line.
$ means that what comes before must be at the end of a line.
\< anchors to the beginning of a word.
\> anchors to the end of a word. Note that when you write these anchors in an R string, you have to escape the backslash: "\\<" and "\\>".

grepl("\\<a", "hat at")

## [1] TRUE

grepl("\\<a", "hat cat")

## [1] FALSE
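The other anchors can be tried the same way. The test strings here are invented for illustration.

```r
# ^: the match must begin at the start of the string
starts = grepl("^at", c("at home", "hat"))

# $: the match must finish at the end of the string
ends = grepl("at$", c("hat", "at home"))

# \>: "at" must be the end of a word
word_end = grepl("at\\>", c("hat", "attic"))

# \< and \> together: "at" must be a whole word
whole_word = grepl("\\<at\\>", c("look at me", "hat"))
```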

Greedy quantification

By default, quantifiers are greedy, meaning they match the longest substring possible.

We can make them have the opposite behavior by modifying them with the ? character: in that case, they match the shortest substring possible.

regmatches("<p></p>", regexpr("<p.*>", "<p></p>"))

## [1] "<p></p>"

regmatches("<p></p>", regexpr("<p.*?>", "<p></p>"))

## [1] "<p>"
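The difference is easier to see when a string contains several tags. The sketch below uses a made-up string and gregexpr, which finds all matches rather than just the first. The third pattern, a negated bracket expression, is a common alternative to the lazy quantifier; it is shown here for comparison.

```r
x = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last > in the string, giving one long match.
greedy = regmatches(x, gregexpr("<.+>", x))[[1]]

# Lazy: .+? stops at the first > it can, giving one match per tag.
lazy = regmatches(x, gregexpr("<.+?>", x))[[1]]

# Alternative: forbid > inside the tag instead of using a lazy quantifier.
negated = regmatches(x, gregexpr("<[^>]+>", x))[[1]]
```

With the greedy pattern the single match is the entire string; the lazy and negated patterns both pick out the four tags.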

Splitting on regular expressions

strsplit, from last time, also works with regular expressions.

Syntax: strsplit(s, split)

s is a character vector (can have length greater than 1), and the function vectorizes over s.
split gives the regular expression we want to split on: every element of s will be split into pieces separated by the regex split.
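Because strsplit vectorizes over s, the result is a list with one character vector per element of s. A minimal sketch with made-up input:

```r
s = c("a,b, c", "d,   e")

# Split on a comma followed by any amount of whitespace.
pieces = strsplit(s, ",[[:space:]]*")

n_elements = length(pieces)  # one list element per input string
first = pieces[[1]]
second = pieces[[2]]
```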

Two other arguments you should know about, fixed and perl:

fixed = TRUE means to treat split as a literal string instead of a regular expression; fixed = FALSE is the default and means that split is treated as a regular expression.
perl selects the flavor of regular expressions: perl = TRUE uses Perl-compatible regular expressions, and the default perl = FALSE uses POSIX extended regular expressions.
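A short sketch of what the two arguments change, using made-up strings. The last line uses grepl rather than strsplit, since the perl argument works the same way there; negative lookahead, (?!u), is an example of syntax that needs perl = TRUE.

```r
# As a regular expression, . matches every character, so every character
# is treated as a separator and only empty strings are left.
as_regex = strsplit("a.b.c", ".")[[1]]

# With fixed = TRUE, . is just a literal period.
as_literal = strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# Escaping the period in a regular expression has the same effect.
escaped = strsplit("a.b.c", "\\.")[[1]]

# Perl-only syntax: a q that is not followed by a u.
lookahead = grepl("q(?!u)", c("Iraq", "quit"), perl = TRUE)
```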


strsplit("my parents, Ayn Rand, and God", ",[[:space:]]*(and)?[[:space:]]+")

## [[1]]
## [1] "my parents" "Ayn Rand"   "God"

strsplit("my parents, Ayn Rand and God", ",[[:space:]]*(and)?[[:space:]]+")

## [[1]]
## [1] "my parents"       "Ayn Rand and God"

strsplit("my parents, Ayn Rand and God", "(,[[:space:]]*)|([[:space:]]+and[[:space:]]+)")

## [[1]]
## [1] "my parents" "Ayn Rand"   "God"

strsplit("Beyonce, Taylor Swift, and Kanye", ",[[:space:]]*(and)?[[:space:]]+")

## [[1]]
## [1] "Beyonce"      "Taylor Swift" "Kanye"

strsplit("Beyonce,Taylor Swift,  and Kanye", ",[[:space:]]*(and[[:space:]]+)?")

## [[1]]
## [1] "Beyonce"      "Taylor Swift" "Kanye"

strsplit("Beyonce,Taylor Swift,  and  Kanye", ",[[:space:]]*(and[[:space:]]+)?")

## [[1]]
## [1] "Beyonce"      "Taylor Swift" "Kanye"

Finding regular expressions

grep

syntax: grep(regex, string).
regex is a regular expression
string is a character vector
The function will return the indices of the elements of string that match regex.

grepl (for grep logical): Same as grep, but returns a logical vector with one entry per element of string, TRUE where the element matches.

grep("(K|k)ansas", state.name)

## [1]  4 16

grepl("(K|k)ansas", state.name)

##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
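Either output can be used to pull out the matching elements themselves. The sketch below also uses two grep arguments not covered above, value and ignore.case, both of which are standard arguments of base R's grep.

```r
idx = grep("(K|k)ansas", state.name)

# Subset with the integer indices from grep ...
by_index = state.name[idx]

# ... or with the logical vector from grepl.
by_logical = state.name[grepl("(K|k)ansas", state.name)]

# value = TRUE returns the matching elements instead of their indices.
by_value = grep("(K|k)ansas", state.name, value = TRUE)

# ignore.case = TRUE is an alternative to writing (K|k).
no_case = grep("kansas", state.name, ignore.case = TRUE, value = TRUE)
```

All four give the same two states, Arkansas and Kansas.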

A bigger example: let's try extracting the text from an html document.

nyt.html comes from an article in the New York Times.

I want to just get the body of the article out, and I know that the line containing the article will have the string <section name="articleBody".

I can use grep to find the line corresponding to the article.

nyt = readLines("nyt.html")

## Warning in readLines("nyt.html"): incomplete final line found on 'nyt.html'

article_text_index = grep("<section name=\"articleBody\"", nyt)

grep and grepl just tell us if the regular expression is present. What if you want to know where it is within the string?

regexpr

Syntax: regexpr(regex, string)
regex is a regular expression.
string is a character vector.
Vectorizes over string
Gives the location of the first match of regex within each element of string, or -1 if there is no match.

gregexpr: Same syntax as regexpr, but gives the locations of all the occurrences of regex instead of just the first.

What do you think the output types will be for these functions?

Example:

fruits = "apple|banana|fruit"
regexpr(fruits, "fruit flies like a banana")

## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

typeof(regexpr(fruits, "fruit flies like a banana"))

## [1] "integer"

gregexpr(fruits, "fruit flies like a banana")

## [[1]]
## [1]  1 20
## attr(,"match.length")
## [1] 5 6
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

typeof(gregexpr(fruits, "fruit flies like a banana"))

## [1] "list"
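The start positions and the match.length attribute are enough to cut the matches out by hand with substring. This is only a sketch to show what the output means; the variable names are made up.

```r
fruits = "apple|banana|fruit"
x = "fruit flies like a banana"

# First match: a start position plus a match.length attribute.
m = regexpr(fruits, x)
start = as.integer(m)
len = attr(m, "match.length")
first_match = substring(x, start, start + len - 1)

# All matches: gregexpr returns a list, one element per input string.
all_m = gregexpr(fruits, x)[[1]]
all_starts = as.integer(all_m)
all_lens = attr(all_m, "match.length")
all_matches = substring(x, all_starts, all_starts + all_lens - 1)

# When nothing matches, the position is -1.
no_match = as.integer(regexpr("kiwi", x))
```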

If you want the text that matches the regular expression, you need regmatches.

Syntax: regmatches(text, match)

match is the output from regexpr or gregexpr, describing the locations of the regular expressions.
text is the character vector you passed to regexpr or gregexpr.

Example:

text = c("fruit flies like a banana", "banana flies like a fruit")
regmatches(text, regexpr(fruits, text))

## [1] "fruit"  "banana"

regmatches(text, gregexpr(fruits, text))

## [[1]]
## [1] "fruit"  "banana"
## 
## [[2]]
## [1] "banana" "fruit"
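One thing to watch for is what happens when an element has no match. In the sketch below the second string is made up so that it contains none of the fruits: with regexpr, regmatches drops that element, so the result can be shorter than the input; with gregexpr, the list keeps one entry per input string and the entry is empty.

```r
fruits = "apple|banana|fruit"
text = c("fruit flies like a banana", "time flies like an arrow")

# regexpr: elements with no match are dropped from the result.
first_only = regmatches(text, regexpr(fruits, text))

# gregexpr: one list entry per element, character(0) where nothing matched.
all_found = regmatches(text, gregexpr(fruits, text))
```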

Back to our newspaper article example: we want the text of the article, not just its position.

We know it's going to be between something that looks like <section name="articleBody" and </section>, so we create a regular expression for that sort of string and use regmatches and gregexpr.

article_marked_up = regmatches(nyt[article_text_index],
    gregexpr("<section name=\"articleBody\".+?</section>", nyt[article_text_index]))

Then let's try to split the article on paragraphs:

paragraphs = strsplit(article_marked_up[[1]], "(<p[^>]*>)|(</p>)")
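Since nyt.html is not included here, the same steps can be tried on a small stand-in. Everything in toy_html is invented for illustration and is much simpler than the real page; the point is only to show how grep, regmatches with gregexpr, and strsplit fit together.

```r
# Hypothetical stand-in for readLines("nyt.html").
toy_html = c(
  "<html><body>",
  "<section name=\"articleBody\"><p class=\"x\">First.</p><p>Second.</p></section><footer></footer>",
  "</body></html>"
)

# Step 1: find the line that contains the article body.
toy_index = grep("<section name=\"articleBody\"", toy_html)

# Step 2: extract everything from the opening tag to the first </section>.
toy_body = regmatches(toy_html[toy_index],
    gregexpr("<section name=\"articleBody\".+?</section>", toy_html[toy_index]))

# Step 3: split on opening and closing paragraph tags.
toy_paragraphs = strsplit(toy_body[[1]], "(<p[^>]*>)|(</p>)")
```

The pieces in toy_paragraphs[[1]] include the two paragraph texts, along with leftovers such as the section tags, which still need to be cleaned up.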

Replacing regular expressions

Several options:

Not recommended: you can assign to regmatches, in the same way that you can assign to substr (which we also don't recommend).
sub and gsub parallel regexpr and gregexpr: sub replaces the first match in each string, and gsub replaces every match.

Syntax for sub and gsub: sub(regex, replacement, string)

regex is the regular expression to replace.
replacement is what you want to replace it with.
string is a character vector containing the text that needs to be modified.
The function vectorizes over string.
The functions return a new character vector; string itself is not modified.
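A minimal sketch of the difference between the two, on made-up strings. The last line uses a backreference in the replacement (\\1 and \\2 refer to the parenthesized groups in the regex), which is a standard feature of sub and gsub.

```r
# sub replaces only the first match ...
first_only = sub("a", "A", "banana")

# ... while gsub replaces every match.
all_of_them = gsub("a", "A", "banana")

# Both vectorize: collapse runs of whitespace in each element.
squeezed = gsub("[[:space:]]+", " ", c("a   b", "c \t d"))

# Backreferences: swap two words captured by the groups.
swapped = sub("^(\\w+) (\\w+)$", "\\2, \\1", "Julia Fukuyama")
```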

Back to our article example, we still have the problem of a bunch of extra html tags, which we would like to get rid of.

paragraphs_no_tags = gsub(pattern = "<.+?>", replacement = "", x = paragraphs[[1]])

for(i in 1:4) cat(paragraphs_no_tags[i])

## WASHINGTON — Speaker Nancy Pelosi asked President Trump on Wednesday to scrap or delay his Jan. 29 State of the Union address amid the partial government shutdown, an extraordinary request that escalated the partisan battle over his border wall even as bipartisan groups of lawmakers pressed him to reopen the government and make room for compromise.In a letter to Mr. Trump that underscored how the shutdown fight has poisoned hopes of bipartisan comity at the start of divided government, Ms. Pelosi cited security concerns as her reason for proposing that the president postpone the annual presidential ritual of addressing a joint session of Congress in a televised speech during prime time — or perhaps submit a written message instead.

Remember greedy quantification? We really need the ? modifier here:

wrong = gsub(pattern = "<.+>", replacement = "", x = paragraphs[[1]])
which(nchar(wrong) != nchar(paragraphs_no_tags))

##  [1] 12 18 19 21 23 25 43 47 67 69 82

cat(paragraphs[[1]][82])

## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a <a class="css-1g7m0tk" href="https://twitter.com/SteveScalise/status/1085553606638665733" title="" rel="noopener noreferrer" target="_blank">Twitter post</a>, branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”

cat(wrong[82])

## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a , branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”

cat(paragraphs_no_tags[82])

## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a Twitter post, branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”

Regular expressions