Reading: Matloff Chapter 11.2.
There are also many tutorials about regular expressions online, many of them not very good.
Wikipedia actually has a good treatment if you go through it slowly.
You can also find formal treatments of regular expressions in books on the foundations of computing, e.g. Hopcroft, Motwani, and Ullman (2000). Introduction to Automata Theory, Languages, and Computation.
What is a regular expression?
A regular expression is a compact notation that describes a set of strings.
For example, the regular expression
, *
describes the set of strings that consist of a comma followed by any number (including zero) of spaces.
Why do we need them?
Parsimony: We can express sets of strings more compactly.
For example: (J.|Julia) Fukuyama represents the set containing J. Fukuyama and Julia Fukuyama.
Allows us to specify infinite-size sets in finite space.
For example: our , * expression from before.
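Both examples can be checked in R with grepl, which is covered later in these notes. This is only a sketch: the test strings are made up, the ^ and $ anchors are added so that the whole string has to belong to the set, and the period is escaped because an unescaped . in R's regex syntax is a wildcard rather than a literal period.

```r
# Made-up candidate strings
people = c("J. Fukuyama", "Julia Fukuyama", "Jane Fukuyama")

# (J.|Julia) Fukuyama, with the period escaped and the match anchored
in_name_set = grepl("^(J\\.|Julia) Fukuyama$", people)
print(in_name_set)

# , * : a comma followed by zero or more spaces
separators = c(",", ",   ", " ,", ",x")
in_sep_set = grepl("^, *$", separators)
print(in_sep_set)
```

Only the first two names and the first two separators are in their respective sets.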
The formal definition of a regular expression is inductive. Suppose that we have a finite alphabet \(\Sigma\).
We start with specifying the following as regular expressions:
\(\emptyset\): The empty set
\(\varepsilon\): The set containing the empty string ""
Literal character \(a\): The one-element set \(\{a\}\), for \(a \in \Sigma\)
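The following operations, performed on regular expressions \(R\) and \(S\), yield regular expressions:
Concatenation \(RS\): The set of strings formed by a string in \(R\) followed by a string in \(S\)
Alternation \(R|S\): The union of \(R\) and \(S\)
Kleene star \(R^*\): The set of strings formed by concatenating zero or more strings from \(R\) (concatenating zero strings gives the empty string)
Order of operations: Kleene star has highest priority, followed by concatenation, followed by alternation.
If a different grouping is desired, use parentheses ().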
Examples:
a|b*: {ε, "a", "b", "bb", "bbb", ...}
(a|b)*: The set of all strings containing only a and b, {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", ...}
ab*(c|ε): The set of strings starting with a single a, followed by zero or more b's, optionally ending with a c, {"a", "ac", "ab", "abc", "abb", "abbc", ...}
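These three sets can be spot-checked in R. The sketch below assumes a small helper, in_set (a hypothetical name, not part of R), that wraps an expression in ^( ... )$ so that grepl tests whether the whole string belongs to the set. Since ε has no literal symbol in R's syntax, (c|ε) is written as c?.

```r
# Hypothetical helper: TRUE for each string that belongs to the set described by regex
in_set = function(regex, strings) grepl(paste0("^(", regex, ")$"), strings)

candidates = c("a", "bb", "ab", "abbc", "ba", "c")

set1 = in_set("a|b*", candidates)     # a alone, or any number of b's
set2 = in_set("(a|b)*", candidates)   # any string of a's and b's
set3 = in_set("ab*c?", candidates)    # a, then b's, then an optional c

print(rbind(set1, set2, set3))
```

Note how the star binds more tightly than alternation: "ab" is in the second set but not the first.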
Actual implementations of regular expressions have many more symbols and operators, but they are mostly just shorthand for some common operations that would take longer to express using only the three operations in the formal definition.
*: Same as in the formal definition: zero or more occurrences of the preceding element.
?: Zero or one occurrence of the preceding element. colou?r matches color and colour.
+: One or more occurrences of the preceding element.
{m}: Exactly m occurrences of the preceding element.
{m,}: At least m occurrences of the preceding element.
{m,n}: Between m and n occurrences of the preceding element, inclusive.
[]: Matches any single character inside the brackets.
[^ ]: Negation, matches any single character except those inside the brackets.
.: Wildcard, matches any single character.
Character classes, defined differently for the different implementations. See https://en.wikipedia.org/wiki/Regular_expression#Character_classes, the POSIX column.
^ (not inside square brackets) means that what comes after must be at the start of a line.
$ means that what comes before must be at the end of a line.
\< anchors to the beginning of a word.
\> anchors to the end of a word. Note that in an R string literal the backslash itself has to be escaped, so \< is written as "\\<".
grepl("\\<a", "hat at")
## [1] TRUE
grepl("\\<a", "hat cat")
## [1] FALSE
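To see a few more of the operators above in action, here is a small sketch using grepl on made-up strings. The ^ and $ anchors force the whole string to match, and [[:digit:]] is one of the POSIX character classes.

```r
words = c("color", "colour", "colouur", "clr")

optional_u = grepl("^colou?r$", words)    # zero or one u
some_u     = grepl("^colou+r$", words)    # one or more u's
two_u      = grepl("^colou{2}r$", words)  # exactly two u's

codes = c("a1", "b22", "7z", "ab")

letter_then_digits = grepl("^[ab][[:digit:]]+$", codes)  # a or b, then digits
no_digits          = grepl("^[^0-9]+$", codes)           # negated bracket set

print(rbind(optional_u, some_u, two_u))
print(rbind(letter_then_digits, no_digits))
```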
By default, quantifiers are greedy, meaning they match the longest substring possible.
We can make them have the opposite behavior by modifying them with the ? character: in that case, they match the shortest substring possible.
regmatches("<p></p>", regexpr("<p.*>", "<p></p>"))
## [1] "<p></p>"
regmatches("<p></p>", regexpr("<p.*?>", "<p></p>"))
## [1] "<p>"
strsplit, from last time, also works with regular expressions.
Syntax: strsplit(s, split)
s is a character vector (can have length greater than 1), and the function vectorizes over it.
split gives the regular expression we want to split on: every element of s will be split into pieces separated by matches of the regex split.
Two other arguments you should know about, fixed and perl:
fixed = TRUE means to treat split as a literal string instead of a regular expression. fixed = FALSE is the default and means that we treat split as a regular expression.
perl selects the flavor of regular expressions used: perl = TRUE uses Perl-compatible regular expressions, while the default, perl = FALSE, uses POSIX extended regular expressions.
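A minimal sketch of why fixed matters, using a made-up string. When split is "." and is treated as a regular expression, it is the wildcard, so every character counts as a separator.

```r
s = "a.b.c"

# As a regular expression, "." matches every character, leaving only empty pieces
regex_split = strsplit(s, ".")[[1]]

# As a literal string, "." only matches an actual period
fixed_split = strsplit(s, ".", fixed = TRUE)[[1]]

# Escaping the period gives the same result without fixed = TRUE
escaped_split = strsplit(s, "\\.")[[1]]

# The same kind of call with the Perl-compatible engine; a simple pattern
# like this one behaves the same way in both flavors
perl_split = strsplit("x, y,z", ", *", perl = TRUE)[[1]]

print(regex_split)
print(fixed_split)
```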
strsplit("my parents, Ayn Rand, and God", ",[[:space:]]*(and)?[[:space:]]+")
## [[1]]
## [1] "my parents" "Ayn Rand" "God"
strsplit("my parents, Ayn Rand and God", ",[[:space:]]*(and)?[[:space:]]+")
## [[1]]
## [1] "my parents" "Ayn Rand and God"
strsplit("my parents, Ayn Rand and God", "(,[[:space:]]*)|([[:space:]]+and[[:space:]]+)")
## [[1]]
## [1] "my parents" "Ayn Rand" "God"
strsplit("Beyonce, Taylor Swift, and Kanye", ",[[:space:]]*(and)?[[:space:]]+")
## [[1]]
## [1] "Beyonce" "Taylor Swift" "Kanye"
strsplit("Beyonce,Taylor Swift, and Kanye", ",[[:space:]]*(and[[:space:]]+)?")
## [[1]]
## [1] "Beyonce" "Taylor Swift" "Kanye"
grep syntax: grep(regex, string).
regex is a regular expression.
string is a character vector.
The function returns the indices of the elements of string that match regex.
grepl (for grep logical): Same as grep, but returns a logical vector with one entry per element of string, TRUE where the element matches.
grep("(K|k)ansas", state.name)
## [1] 4 16
grepl("(K|k)ansas", state.name)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
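The two return types are interchangeable for subsetting. A short sketch using the same built-in state.name vector:

```r
kansas_idx = grep("(K|k)ansas", state.name)    # integer indices of the matches
kansas_lgl = grepl("(K|k)ansas", state.name)   # one TRUE/FALSE per element

# Either result can be used to subset the original vector
print(state.name[kansas_idx])
print(state.name[kansas_lgl])

# which() converts the logical form into the index form
print(which(kansas_lgl))
```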
A bigger example: let's try extracting the text from an html document.
nyt.html comes from an article in the New York Times.
I want to just get the body of the article out, and I know that the line containing the article will have the string <section name="articleBody".
I can use grep to find the line corresponding to the article.
nyt = readLines("nyt.html")
## Warning in readLines("nyt.html"): incomplete final line found on 'nyt.html'
article_text_index = grep("<section name=\"articleBody\"", nyt)
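If you do not have nyt.html at hand, the same step can be tried on a toy stand-in. Everything in toy_html below is made up for illustration; it only mimics the shape of what readLines returns (one string per line of the file).

```r
# Hypothetical stand-in for readLines("nyt.html")
toy_html = c(
  "<html><head><title>Toy article</title></head>",
  "<body><nav>Menu</nav>",
  "<section name=\"articleBody\"><p>First paragraph.</p><p>Second paragraph.</p></section>",
  "</body></html>"
)

# Index of the line that contains the opening tag of the article body
toy_index = grep("<section name=\"articleBody\"", toy_html)
print(toy_index)
```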
grep and grepl just tell us whether the regular expression matches. What if you want to know where the match is within the string?
regexpr
Syntax: regexpr(regex, string)
regex is a regular expression.
string is a character vector.
Vectorizes over string.
Gives the location of the first occurrence of regex within each element of string.
gregexpr: Same syntax as regexpr, but gives the locations of all the occurrences of regex instead of just the first.
What do you think the output types will be for these functions?
Example:
fruits = "apple|banana|fruit"
regexpr(fruits, "fruit flies like a banana")
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
typeof(regexpr(fruits, "fruit flies like a banana"))
## [1] "integer"
gregexpr(fruits, "fruit flies like a banana")
## [[1]]
## [1] 1 20
## attr(,"match.length")
## [1] 5 6
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
typeof(gregexpr(fruits, "fruit flies like a banana"))
## [1] "list"
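The start positions and the match.length attribute are enough to pull the matched text out by hand. A sketch using substring on the same sentence:

```r
fruits = "apple|banana|fruit"
sentence = "fruit flies like a banana"

# First match only: an integer start position plus a match.length attribute
m = regexpr(fruits, sentence)
start = as.integer(m)
len = attr(m, "match.length")
first_match = substring(sentence, start, start + len - 1)

# All matches: gregexpr returns a list with one element per input string
all_m = gregexpr(fruits, sentence)[[1]]
starts = as.integer(all_m)
lens = attr(all_m, "match.length")
all_matches = substring(sentence, starts, starts + lens - 1)

print(first_match)
print(all_matches)
```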
If you want the text that matches the regular expression, you need regmatches.
Syntax: regmatches(text, match)
match is the output from regexpr or gregexpr, describing the locations of the matches.
text is the character vector you passed to regexpr or gregexpr.
Example:
text = c("fruit flies like a banana", "banana flies like a fruit")
regmatches(text, regexpr(fruits, text))
## [1] "fruit" "banana"
regmatches(text, gregexpr(fruits, text))
## [[1]]
## [1] "fruit" "banana"
##
## [[2]]
## [1] "banana" "fruit"
Back to our newspaper article example: we want the text of the article, not just its position.
We know it's going to be between something that looks like <section name="articleBody" and </section>, so we create a regular expression for that sort of string and use regmatches and gregexpr.
article_marked_up = regmatches(nyt[article_text_index],
gregexpr("<section name=\"articleBody\".+?</section>", nyt[article_text_index]))
Then let's try to split the article on paragraphs:
paragraphs = strsplit(article_marked_up[[1]], "(<p[^>]*>)|(</p>)")
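The extract-then-split steps can be tried on a toy line without the real file. In this sketch toy_line is a made-up stand-in for the article line, and regexpr is used instead of gregexpr because the toy line contains a single section.

```r
# Hypothetical one-line article
toy_line = "<div><section name=\"articleBody\"><p class=\"x\">First.</p><p>Second.</p></section></div>"

# Everything from the opening section tag to the first closing section tag
toy_body = regmatches(toy_line,
                      regexpr("<section name=\"articleBody\".+?</section>", toy_line))

# Split on opening or closing paragraph tags
toy_pieces = strsplit(toy_body, "(<p[^>]*>)|(</p>)")[[1]]
print(toy_pieces)
```

The pieces still include the section tags themselves, plus an empty string where a closing paragraph tag is immediately followed by an opening one, so some cleanup remains after the split.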
Several options for replacing the text matched by a regular expression:
Not recommended: you can assign to regmatches, which works like substr (also don't recommend using substr in this way).
sub and gsub are the replacement counterparts of regexpr and gregexpr: sub replaces the first match in each string, and gsub replaces every match.
Syntax for sub and gsub: sub(regex, replacement, string)
regex is the regular expression to replace.
replacement is what you want to replace it with.
string is a character vector containing the text that needs to be modified.
The function vectorizes over string.
The functions return a new character vector; string itself is not modified.
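A small sketch of the difference between the two, on made-up strings:

```r
s = c("a-b-c", "no dashes", "x--y")

first_only = sub("-", "+", s)    # replaces the first match in each element
every_one  = gsub("-", "+", s)   # replaces all matches in each element

print(first_only)
print(every_one)

# s is unchanged: both calls returned new character vectors
print(s)
```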
Back to our article example, we still have the problem of a bunch of extra html tags, which we would like to get rid of.
paragraphs_no_tags = gsub(pattern = "<.+?>", replacement = "", x = paragraphs[[1]])
for(i in 1:4) cat(paragraphs_no_tags[i])
## WASHINGTON — Speaker Nancy Pelosi asked President Trump on Wednesday to scrap or delay his Jan. 29 State of the Union address amid the partial government shutdown, an extraordinary request that escalated the partisan battle over his border wall even as bipartisan groups of lawmakers pressed him to reopen the government and make room for compromise.In a letter to Mr. Trump that underscored how the shutdown fight has poisoned hopes of bipartisan comity at the start of divided government, Ms. Pelosi cited security concerns as her reason for proposing that the president postpone the annual presidential ritual of addressing a joint session of Congress in a televised speech during prime time — or perhaps submit a written message instead.
Remember greedy quantification? We really need the ? modifier here:
wrong = gsub(pattern = "<.+>", replacement = "", x = paragraphs[[1]])
which(nchar(wrong) != nchar(paragraphs_no_tags))
## [1] 12 18 19 21 23 25 43 47 67 69 82
cat(paragraphs[[1]][82])
## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a <a class="css-1g7m0tk" href="https://twitter.com/SteveScalise/status/1085553606638665733" title="" rel="noopener noreferrer" target="_blank">Twitter post</a>, branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”
cat(wrong[82])
## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a , branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”
cat(paragraphs_no_tags[82])
## “What are Democrats afraid of Americans hearing?” Representative Steve Scalise of Louisiana, the No. 2 Republican, said in a Twitter post, branding Ms. Pelosi #ShutdownNancy. “That 17,000+ criminals were caught last year at the border? 90% of heroin in the US comes across the southern border? Illegal border crossings dropped 90%+ in areas w/ a wall?”
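The same effect can be reproduced on a short made-up string, which makes it easier to see what the greedy version swallows:

```r
tagged = "a <b>bold</b> word"

# Greedy: .+ runs from the first < to the last >, taking the text in between
greedy = gsub("<.+>", "", tagged)

# Lazy: .+? stops at the nearest >, so only the tags are removed
lazy = gsub("<.+?>", "", tagged)

print(greedy)
print(lazy)
```

The greedy result loses the word "bold" along with the tags; the lazy result keeps it.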
Once the article is formatted a little better, we can do things like count the number of ads placed in the body of the article.
grep("Advertisement", paragraphs_no_tags)
## [1] 10 22 30 50 64
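grep gives the positions of the ads; to get the count itself, take the length of that result or sum the output of grepl. A sketch on a made-up vector of paragraphs:

```r
# Hypothetical tag-free paragraphs
toy_paragraphs = c("First paragraph.", "Advertisement", "Second paragraph.",
                   "Advertisement", "Third paragraph.")

ad_positions = grep("Advertisement", toy_paragraphs)     # where the ads are
n_ads = length(ad_positions)                             # how many there are
n_ads_alt = sum(grepl("Advertisement", toy_paragraphs))  # same count via grepl

print(ad_positions)
print(n_ads)
```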