Regular expressions

Reading: Matloff Chapter 11.2.

There are also many tutorials about regular expressions online, many of them not very good.

Wikipedia actually has a good treatment if you go through it slowly.

You can also find formal treatments of regular expressions in books on the foundations of computing, e.g. Hopcroft, Motwani, and Ullman (2000). Introduction to Automata Theory, Languages, and Computation.

What is a regular expression?

For example, the regular expression

, *

describes the set of strings that consist of a comma followed by any number (including zero) of spaces.
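A quick way to see this pattern at work in R. This is only a sketch: the string s is invented for illustration.

```r
# The pattern ", *": a comma followed by zero or more spaces.
pattern = ", *"

# Hypothetical example string with varying amounts of space after each comma.
s = "a,b, c,   d"

# Extract every substring of s that matches the pattern.
matches = regmatches(s, gregexpr(pattern, s))[[1]]
print(matches)

# Splitting on the pattern removes each comma together with its trailing spaces.
pieces = strsplit(s, pattern)[[1]]
print(pieces)
```

Each match contains exactly one comma and however many spaces follow it, so the pieces come back without leading whitespace.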

Why do we need them?

Regular expressions abstractly

The formal definition of a regular expression is inductive. Suppose that we have a finite alphabet \(\Sigma\).

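We start by specifying the following as regular expressions: the empty set \(\emptyset\), the empty string \(\varepsilon\), and each single character \(a \in \Sigma\).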

The following operations, performed on regular expressions \(R\) and \(S\), yield regular expressions: concatenation \(RS\), alternation \(R | S\), and the Kleene star \(R^*\).

Order of operations: Kleene star has highest priority, followed by concatenation, followed by alternation.

If a different grouping is desired, use parentheses ().

Examples:
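The precedence rules can be checked in R's regular expression syntax. In this rough sketch the test strings are made up, and the anchors ^ and $ are used only to force each pattern to describe the whole string.

```r
# Hypothetical test strings.
x = c("a", "abbb", "abab")

# Kleene star binds tighter than concatenation: "ab*" means a(b*), not (ab)*.
star_first   = grepl("^ab*$", x)
grouped_star = grepl("^(ab)*$", x)
print(star_first)
print(grouped_star)

y = c("ab", "cd", "abd", "acd")

# Concatenation binds tighter than alternation: ab|cd means (ab)|(cd).
# The outer parentheses only keep the anchors outside the alternation.
concat_first = grepl("^(ab|cd)$", y)
# Parentheses change the grouping to a, then (b or c), then d.
grouped_alt  = grepl("^a(b|c)d$", y)
print(concat_first)
print(grouped_alt)
```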

How they are actually implemented

Actual implementations of regular expressions have many more symbols and operators, but they are mostly just shorthand for some common operations that would take longer to express using only the three operations in the formal definition.

Quantification operations
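The common quantifiers in R's regular expression syntax are * (zero or more), + (one or more), ? (zero or one), and {n,m} (between n and m repetitions). The sketch below uses made-up test strings to show how the last three can be rewritten with only concatenation, alternation, and the Kleene star.

```r
# Hypothetical test strings; ^ and $ make each pattern describe the whole string.
x = c("", "a", "aa", "aaa", "aaaa", "b")

# a+ is shorthand for aa*.
plus_short = grepl("^a+$", x)
plus_long  = grepl("^aa*$", x)

# a{2,3} is shorthand for aa|aaa.
range_short = grepl("^a{2,3}$", x)
range_long  = grepl("^(aa|aaa)$", x)

# u? makes the u optional, so colou?r is shorthand for color|colour.
y = c("color", "colour", "colouur")
opt_short = grepl("^colou?r$", y)
opt_long  = grepl("^(color|colour)$", y)

print(identical(plus_short, plus_long))
print(identical(range_short, range_long))
print(identical(opt_short, opt_long))
```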

Alternatives to |
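Bracket expressions such as [abc] are one shorthand for alternation between single characters. The following sketch, on invented test strings, compares a bracket expression with the equivalent alternation and shows a range, a named class, and a negated class.

```r
# Hypothetical single-character test strings.
x = c("a", "b", "c", "d", "7", " ")

# [abc] matches any one of a, b, c: the same as (a|b|c).
class_short = grepl("^[abc]$", x)
class_long  = grepl("^(a|b|c)$", x)
print(identical(class_short, class_long))

# A range inside brackets: any one character from a through d.
range_class = grepl("^[a-d]$", x)

# A named POSIX class, as used in the strsplit examples: any digit.
digit_class = grepl("^[[:digit:]]$", x)

# A leading ^ inside the brackets negates the class: anything except a, b, c.
negated = grepl("^[^abc]$", x)

print(range_class)
print(digit_class)
print(negated)
```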

Anchoring

grepl("\\<a", "hat at")
## [1] TRUE
grepl("\\<a", "hat cat")
## [1] FALSE
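Besides the word-boundary anchor \\< used above, R's regular expressions also support ^ (start of string), $ (end of string), and \\> (end of word). A small sketch on made-up strings:

```r
# Hypothetical test strings.
x = c("cat", "concat", "catalog", "a cat sat")

# ^ anchors the match to the start of the string.
starts_with = grepl("^cat", x)

# $ anchors the match to the end of the string.
ends_with = grepl("cat$", x)

# Using both requires the whole string to match.
whole = grepl("^cat$", x)

# \\< and \\> anchor to the start and end of a word.
whole_word = grepl("\\<cat\\>", x)

print(starts_with)
print(ends_with)
print(whole)
print(whole_word)
```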

Greedy quantification

By default, quantifiers are greedy, meaning they match the longest substring possible.

We can make them have the opposite behavior by modifying them with the ? character: in that case, they match the shortest substring possible.

regmatches("<p></p>", regexpr("<p.*>", "<p></p>"))
## [1] "<p></p>"
regmatches("<p></p>", regexpr("<p.*?>", "<p></p>"))
## [1] "<p>"

Splitting on regular expressions

strsplit, from last time, also works with regular expressions.

Syntax: strsplit(s, split)

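Two other arguments you should know about, fixed and perl: fixed = TRUE treats split as a literal string rather than a regular expression, and perl = TRUE uses Perl-compatible regular expressions.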
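To see why fixed matters, consider splitting on a period, which is a special character in a regular expression. This is only a sketch, and the input strings are invented.

```r
s = "a.b.c"

# As a regular expression, "." matches any character, so every character
# is treated as a separator and only empty strings remain.
as_regex = strsplit(s, ".")[[1]]

# fixed = TRUE treats "." as a literal period.
as_fixed = strsplit(s, ".", fixed = TRUE)[[1]]

# Escaping the period in the regular expression has the same effect.
escaped = strsplit(s, "\\.")[[1]]

# perl = TRUE switches to the Perl-compatible engine; here \\d+ means a run of digits.
perl_split = strsplit("a1b22c", "\\d+", perl = TRUE)[[1]]

print(as_regex)
print(as_fixed)
print(escaped)
print(perl_split)
```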


strsplit("my parents, Ayn Rand, and God", ",[[:space:]]*(and)?[[:space:]]+")
## [[1]]
## [1] "my parents" "Ayn Rand"   "God"
strsplit("my parents, Ayn Rand and God", ",[[:space:]]*(and)?[[:space:]]+")
## [[1]]
## [1] "my parents"       "Ayn Rand and God"
strsplit("my parents, Ayn Rand and God", "(,[[:space:]]*)|([[:space:]]+and[[:space:]]+)")
## [[1]]
## [1] "my parents" "Ayn Rand"   "God"
strsplit("Beyonce, Taylor Swift, and Kanye", ",[[:space:]]*(and)?[[:space:]]+")
## [[1]]
## [1] "Beyonce"      "Taylor Swift" "Kanye"
strsplit("Beyonce,Taylor Swift,  and Kanye", ",[[:space:]]*(and[[:space:]]+)?")
## [[1]]
## [1] "Beyonce"      "Taylor Swift" "Kanye"
strsplit("Beyonce,Taylor Swift,  and  Kanye", ",[[:space:]]*(and[[:space:]]+)?")
## [[1]]
## [1] "Beyonce"      "Taylor Swift" "Kanye"

Finding regular expressions

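grep: Syntax is grep(regex, x). Returns the indices of the elements of x that contain a match for regex.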

grepl (for grep logical): Same as grep, but returns a logical vector with one element per input string, TRUE where that string contains a match.

grep("(K|k)ansas", state.name)
## [1]  4 16
grepl("(K|k)ansas", state.name)
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
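If you want the matching strings themselves rather than their positions, two common idioms are the value argument of grep and logical indexing with grepl. A brief sketch using the built-in state.name vector:

```r
# value = TRUE returns the matching elements instead of their indices.
matched_names = grep("(K|k)ansas", state.name, value = TRUE)
print(matched_names)

# The same result via logical indexing with grepl.
matched_names_alt = state.name[grepl("(K|k)ansas", state.name)]
print(identical(matched_names, matched_names_alt))

# ignore.case = TRUE is an alternative to writing (K|k) by hand.
idx = grep("kansas", state.name, ignore.case = TRUE)
print(idx)
```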

A bigger example: let's try extracting the text from an html document.

nyt.html comes from an article in the New York Times.

I want to just get the body of the article out, and I know that the line containing the article will have the string <section name="articleBody".

I can use grep to find the line corresponding to the article.

nyt = readLines("nyt.html")
## Warning in readLines("nyt.html"): incomplete final line found on 'nyt.html'
article_text_index = grep("<section name=\"articleBody\"", nyt)

grep and grepl just tell us if the regular expression is present. What if you want to know where it is within the string?

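regexpr: Syntax is regexpr(regex, text). Gives the location of the first occurrence of regex in each element of text, or -1 if there is none.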

gregexpr: Same syntax as regexpr, but gives the locations of all the occurrences of regex instead of just the first.

What do you think the output types will be for these functions?

Example:

fruits = "apple|banana|fruit"
regexpr(fruits, "fruit flies like a banana")
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
typeof(regexpr(fruits, "fruit flies like a banana"))
## [1] "integer"
gregexpr(fruits, "fruit flies like a banana")
## [[1]]
## [1]  1 20
## attr(,"match.length")
## [1] 5 6
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
typeof(gregexpr(fruits, "fruit flies like a banana"))
## [1] "list"
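The start positions and match lengths can be pulled out of the regexpr result and used directly. The sketch below is illustrative only: the second input string is invented to show what a non-match looks like.

```r
pattern = "apple|banana|fruit"
txt = c("fruit flies like a banana", "time flies like an arrow")

m = regexpr(pattern, txt)

# Start position of the first match in each string; -1 means no match.
starts = as.integer(m)

# Length of each match, stored as an attribute; also -1 when there is no match.
lens = attr(m, "match.length")

# Recover the matched text by hand with substring, skipping non-matches.
found = starts > 0
matched = substring(txt[found], starts[found], starts[found] + lens[found] - 1)

print(starts)
print(lens)
print(matched)
```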

If you want the text that matches the regular expression, you need regmatches.

Syntax: regmatches(text, match), where match is the output of regexpr or gregexpr applied to text.

Example:

text = c("fruit flies like a banana", "banana flies like a fruit")
regmatches(text, regexpr(fruits, text))
## [1] "fruit"  "banana"
regmatches(text, gregexpr(fruits, text))
## [[1]]
## [1] "fruit"  "banana"
## 
## [[2]]
## [1] "banana" "fruit"

Back to our newspaper article example: we want the text of the article, not just its position.

We know it's going to be between something that looks like <section name="articleBody" and </section>, so we create a regular expression for that sort of string and use regmatches and gregexpr.

article_marked_up = regmatches(nyt[article_text_index],
    gregexpr("<section name=\"articleBody\".+?</section>", nyt[article_text_index]))

Then let's try to split the article on paragraphs:

paragraphs = strsplit(article_marked_up[[1]], "(<p[^>]*>)|(</p>)")
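Since nyt.html is not reproduced here, the same find, extract, and split steps can be tried on a small stand-in. Everything in toy_page below is invented for illustration and is not the real article.

```r
# Hypothetical stand-in for readLines("nyt.html").
toy_page = c(
  "<html><body>",
  "<section name=\"articleBody\"><p class=\"x\">First paragraph.</p><p>Second <a href=\"#\">linked</a> paragraph.</p></section><footer>ignored</footer>",
  "</body></html>"
)

# Step 1: find the line that contains the article body.
toy_index = grep("<section name=\"articleBody\"", toy_page)

# Step 2: extract everything from the opening tag to the first </section>.
toy_body = regmatches(toy_page[toy_index],
    gregexpr("<section name=\"articleBody\".+?</section>", toy_page[toy_index]))

# Step 3: split on opening and closing paragraph tags.
toy_paragraphs = strsplit(toy_body[[1]], "(<p[^>]*>)|(</p>)")[[1]]

print(toy_index)
print(toy_paragraphs)
```

The split leaves the surrounding section tags and an empty string between adjacent paragraph tags in the result, so some cleanup is still needed afterwards.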

Replacing regular expressions

Several options:

Syntax for sub and gsub: sub(regex, replacement, string). sub replaces only the first match in each string; gsub replaces every match.
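A minimal sketch of the two functions side by side, using an invented string and the ", *" pattern from earlier:

```r
# Hypothetical input with inconsistent spacing after commas.
s = "one, two,  three"

# sub replaces only the first match.
first_only = sub(", *", "; ", s)

# gsub replaces every match.
all_matches = gsub(", *", "; ", s)

print(first_only)
print(all_matches)

# Both functions are vectorized over the string argument.
cleaned = gsub(", *", "; ", c("a,b", "c,   d"))
print(cleaned)
```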

Back to our article example: we still have a bunch of extra HTML tags, which we would like to get rid of.

paragraphs_no_tags = gsub(pattern = "<.+?>", replacement = "", x = paragraphs[[1]])

for(i in 1:4) cat(paragraphs_no_tags[i])
## WASHINGTON — Speaker Nancy Pelosi asked President Trump on Wednesday to scrap or delay his Jan. 29 State of the Union address amid the partial government shutdown, an extraordinary request that escalated the partisan battle over his border wall even as bipartisan groups of lawmakers pressed him to reopen the government and make room for compromise.In a letter to Mr. Trump that underscored how the shutdown fight has poisoned hopes of bipartisan comity at the start of divided government, Ms. Pelosi cited security concerns as her reason for proposing that the president postpone the annual presidential ritual of addressing a joint session of Congress in a televised speech during prime time — or perhaps submit a written message instead.

Remember greedy quantification? We really need the ? modifier here:

wrong = gsub(pattern = "<.+>", replacement = "", x = paragraphs[[1]])
which(nchar(wrong) != nchar(paragraphs_no_tags))
##  [1] 12 18 19 21 23 25 43 47 67 69 82
cat(paragraphs[[1]][82])
## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a <a class="css-1g7m0tk" href="https://twitter.com/SteveScalise/status/1085553606638665733" title="" rel="noopener noreferrer" target="_blank">Twitter post</a>, branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”
cat(wrong[82])
## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a , branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”
cat(paragraphs_no_tags[82])
## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a Twitter post, branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”

Once the article is formatted a little better, we can do things like count the number of ads placed in the body of the article.

grep("Advertisement", paragraphs_no_tags)
## [1] 10 22 30 50 64
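grep returns the positions of the matching elements, so the count is the length of that vector (or, equivalently, the sum of the grepl result). A small sketch on an invented stand-in for paragraphs_no_tags:

```r
# Hypothetical stand-in for paragraphs_no_tags.
toy = c("First paragraph.", "Advertisement", "Second paragraph.",
        "Advertisement", "Third paragraph.")

# Positions of the elements that contain the word.
ad_positions = grep("Advertisement", toy)

# Two equivalent ways to count them.
n_ads = length(ad_positions)
n_ads_alt = sum(grepl("Advertisement", toy))

print(ad_positions)
print(n_ads)
```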

A final note

They're not for everything