Reading: Matloff Chapter 11.2.
There are also many tutorials about regular expressions online, many of them not very good.
Wikipedia actually has a good treatment if you go through it slowly.
You can also find formal treatments of regular expressions in books on the foundations of computing, e.g. Hopcroft, Motwani, and Ullman (2000). Introduction to Automata Theory, Languages, and Computation.
A regular expression is a way of specifying a set of strings. For example,
, *
is the set of strings consisting of a comma followed by any number (including zero) of spaces.
Why do we need them?
Parsimony: We can express sets of strings more compactly.
For example: (J.|Julia) Fukuyama represents the set containing J. Fukuyama and Julia Fukuyama.
Allows us to specify infinite-size sets in finite space.
For example: our , * expression from before.
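Both examples can be tried directly in R. The following is a minimal sketch; the test strings are made up for illustration, and the functions used (gregexpr, regmatches, grepl) are the ones described later in these notes.

```r
# ", *" means: a comma followed by zero or more spaces.
s <- "a,   b,c"
comma_runs <- regmatches(s, gregexpr(", *", s))[[1]]
print(comma_runs)

# Alternation picks out a finite set of names. The "." is escaped here so that
# it matches a literal period (in R's syntax an unescaped "." is a wildcard).
# ^ and $ force the whole string to match.
names_to_check <- c("J. Fukuyama", "Julia Fukuyama", "John Fukuyama")
is_match <- grepl("^(J\\.|Julia) Fukuyama$", names_to_check)
print(is_match)
```

The first call should pick out one comma with its trailing spaces and one bare comma; the second should accept the first two names and reject the third.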
The formal definition of a regular expression is inductive. Suppose that we have a finite alphabet \(\Sigma\).
We start with specifying the following as regular expressions:
\(\emptyset\): The empty set
\(\varepsilon\): The set containing the empty string ""
Literal character \(a\): The one-element set \(\{a\}\), for \(a \in \Sigma\)
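The following operations, performed on regular expressions \(R\) and \(S\), yield regular expressions:
Concatenation \(RS\): The set of strings formed by a string from \(R\) followed by a string from \(S\)
Alternation \(R|S\): The union of the sets described by \(R\) and \(S\)
Kleene star \(R^*\): The set of strings formed by concatenating zero or more strings from \(R\)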
Order of operations: Kleene star has highest priority, followed by concatenation, followed by alternation.
If a different grouping is desired, use parentheses ().
Examples:
a|b: {"a", "b"}
a|b*: {ε, "a", "b", "bb", "bbb", ...}
(a|b)*: The set of all strings containing only a and b, {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", ...}
ab*(c|ε): The set of strings starting with a single a, followed by zero or more b’s, optionally ending with a c, {"a", "ac", "ab", "abc", "abb", "abbc", ...}
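One way to check these sets is to test candidate strings for membership. The sketch below assumes a hypothetical helper, in_set, which wraps the expression in ^( ... )$ so that the whole string has to match rather than just a substring. R has no ε symbol, so c? stands in for (c|ε).

```r
# in_set is a hypothetical helper, not part of base R.
# Anchoring with ^( ... )$ requires the entire string to be in the set.
in_set <- function(re, strings) {
  grepl(paste0("^(", re, ")$"), strings)
}

candidates <- c("", "a", "b", "bb", "ab", "abbc", "ac")

alt_star   <- in_set("a|b*", candidates)    # star binds tighter than |
group_star <- in_set("(a|b)*", candidates)  # parentheses change the grouping
a_bs_c     <- in_set("ab*c?", candidates)   # c? plays the role of (c|ε)

# One row per candidate, one column per expression.
print(data.frame(candidates, alt_star, group_star, a_bs_c))
```

Comparing the alt_star and group_star columns shows the effect of the order of operations: "ab" is in the set for (a|b)* but not for a|b*.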
Actual implementations of regular expressions have many more symbols and operators, but they are mostly just shorthand for some common operations that would take longer to express using only the three operations in the formal definition.
*: Same as in the formal definition: zero or more occurrences of the preceding element.
?: Zero or one occurrence of the preceding element. colou?r matches color and colour.
+: One or more occurrences of the preceding element.
{m}: Exactly m occurrences of the preceding element.
{m,}: At least m occurrences of the preceding element.
{m,n}: Between m and n occurrences of the preceding element, inclusive.
[]: Matches any single character inside the brackets.
[^ ]: Negation, matches any single character except those inside the brackets.
.: Wildcard, matches any single character.
Character classes, defined differently for the different implementations. See https://en.wikipedia.org/wiki/Regular_expression#Character_classes, the POSIX column.
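A short sketch of these operators in R, using grepl (described later in these notes) on a made-up vector of words:

```r
words <- c("color", "colour", "colouur", "clr")

opt_u     <- grepl("colou?r", words)     # zero or one u
plus_u    <- grepl("colou+r", words)     # one or more u
two_u     <- grepl("colou{2}r", words)   # exactly two u
upto_one  <- grepl("colou{0,1}r", words) # same set as colou?r
any_mid   <- grepl("c.l", words)         # c, any one character, l
non_vowel <- grepl("c[^aeiou]r", words)  # c, one non-vowel, r

print(rbind(opt_u, plus_u, two_u, upto_one, any_mid, non_vowel))

# POSIX character class: [[:digit:]] matches any single digit.
has_digit <- grepl("[[:digit:]]", c("abc", "a1"))
print(has_digit)
```

Note that colou{0,1}r and colou?r give the same answers, which illustrates the point that the extra operators are shorthand.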
Check whether a string contains a regular expression
Extract the portion of the string that matches a regular expression
Split a string into pieces delimited by a regular expression
grep syntax: grep(regex, string).
regex is a regular expression.
string is a character vector.
The function returns the indices of the elements of string that contain a match for regex.
grepl (for grep logical): Same as grep, but returns a logical vector with one entry per element of string, TRUE where that element contains a match.
Pedantic note: A regular expression describes a set of strings. If I say a string s “contains” or “matches” a regular expression re, I mean that a substring of s is present in the set described by re.
## [1] 4 16
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
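A small sketch of the difference between grep and grepl. The five-element vector below is made up for illustration and is not the data behind the output above.

```r
states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "Kansas")

# grep returns the indices of the elements that contain a match.
idx <- grep("ansas", states)
print(idx)

# grepl returns one TRUE/FALSE per element.
flags <- grepl("ansas", states)
print(flags)

# value = TRUE returns the matching elements themselves instead of indices.
hits <- grep("ansas", states, value = TRUE)
print(hits)
```

The two results carry the same information: which(flags) recovers idx.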
grep and grepl just tell us if the regular expression is present. What if you want to know where it is within the string?
regexpr syntax: regexpr(regex, string).
regex is a regular expression.
string is a character vector.
Gives the location of the first occurrence of regex in each element of string, with the length of the match stored in the match.length attribute.
gregexpr: Same syntax as regexpr, but gives the locations of all the occurrences of regex instead of just the first.
Example:
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[1]]
## [1] 1 20
## attr(,"match.length")
## [1] 5 6
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
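A minimal sketch of how to read this kind of output, using a made-up string rather than the one behind the output above. The start positions are the values of the returned object, and the match lengths live in the match.length attribute.

```r
s <- "banana bandana"

# regexpr: first match only.
first <- regexpr("an", s)
first_start  <- as.integer(first)            # position of the first match
first_length <- attr(first, "match.length")  # number of characters matched

# gregexpr: all matches; returns a list with one entry per element of s.
all_matches <- gregexpr("an", s)[[1]]
all_starts  <- as.integer(all_matches)
all_lengths <- attr(all_matches, "match.length")

# When there is no match, the reported position is -1.
no_match <- as.integer(regexpr("xyz", s))

print(first_start)
print(all_starts)
print(no_match)
```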
If you want the text that matches the regular expression, you need regmatches.
Syntax: regmatches(text, match)
match is the output from regexpr or gregexpr, describing the locations of the matches.
text is the character vector you passed to regexpr or gregexpr.
The pattern you use to extract the text is
regmatches(text, regexpr(re, text))
Example: We have some sentence fragments and we would like to extract all the fruits from the text.
fruit_fly_text = "drosophila like bananas"
wasp_text = "wasps like bananas and baby fruit flies"
fruits
## [1] "apple|banana|fruit"
## [1] "banana"
## [1] "banana"
## [[1]]
## [1] "banana"
## [[1]]
## [1] "banana" "fruit"
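The calls that produced this output are not shown. A sketch that reproduces it, assuming fruits was defined as the pattern printed above and that the extraction pattern regmatches(text, regexpr(re, text)) was applied to each sentence:

```r
fruit_fly_text <- "drosophila like bananas"
wasp_text <- "wasps like bananas and baby fruit flies"
fruits <- "apple|banana|fruit"  # assumed definition, taken from the printed value

# First match in each sentence (regexpr).
fly_first  <- regmatches(fruit_fly_text, regexpr(fruits, fruit_fly_text))
wasp_first <- regmatches(wasp_text, regexpr(fruits, wasp_text))

# All matches in each sentence (gregexpr); the result is a list.
fly_all  <- regmatches(fruit_fly_text, gregexpr(fruits, fruit_fly_text))
wasp_all <- regmatches(wasp_text, gregexpr(fruits, wasp_text))

print(fly_first)
print(wasp_first)
print(fly_all)
print(wasp_all)
```

With regexpr only the first fruit in the wasp sentence is returned. gregexpr is needed to get both.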
Syntax: strsplit(s, split)
s is a character vector (can have length greater than 1). The function is vectorized over s and returns a list with one entry per element.
split gives the regular expression we want to split on: every element of s will be split into pieces separated by matches of the regex split.
Two other arguments you should know about, fixed and perl:
fixed = TRUE means to treat split as a literal string instead of a regular expression. fixed = FALSE is the default and means that we treat the splitting expression as a regular expression.
perl relates to the flavor of regular expressions used: perl = TRUE switches to Perl-compatible regular expressions.
## [[1]]
## [1] "my parents" "Ayn Rand" "God"
## [[1]]
## [1] "my parents" "Ayn Rand and God"
## [[1]]
## [1] "my parents" "Ayn Rand" "God"
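The code behind this output is not shown. The sketch below uses a made-up sentence in the same spirit to show how the choice of split and the fixed argument change the result; it is not a reconstruction of the original calls.

```r
s <- "my parents, Ayn Rand, and God"  # hypothetical input

# Regex split: a comma and a space, optionally followed by "and ".
by_regex <- strsplit(s, ", (and )?")[[1]]

# Literal split on ", ": the "and" stays attached to the last piece.
by_literal <- strsplit(s, ", ", fixed = TRUE)[[1]]

# A period is special in a regex, so either escape it or use fixed = TRUE.
dots_escaped <- strsplit("a.b.c", "\\.")[[1]]
dots_fixed   <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# strsplit is vectorized: one list entry per input string.
pieces <- strsplit(c("a1b22c", "x-y"), "[0-9]+")

print(by_regex)
print(by_literal)
print(pieces)
```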
Greedy quantification: There can be more than one match of a string to a regular expression that begins at the same place. Which one do you want?
Anchoring: Can constrain the expression to match at certain places in words or lines.
By default, quantifiers are greedy, meaning they match the longest substring possible.
We can make them have the opposite behavior by modifying them with the ? character: in that case, they match the shortest substring possible.
## [1] "[i][j]"
## [1] "[i]"
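A sketch that produces output of this form, assuming an input string such as "x[i][j]" (the original input is not shown). The square brackets are escaped because they are special characters in a regular expression.

```r
s <- "x[i][j]"  # hypothetical input

# Greedy: .* takes as much as it can, so the match runs to the last "]".
greedy <- regmatches(s, regexpr("\\[.*\\]", s))

# Non-greedy: .*? takes as little as it can, so the match stops at the first "]".
lazy <- regmatches(s, regexpr("\\[.*?\\]", s))

print(greedy)
print(lazy)
```

Both matches begin at the same place, the first "[", and differ only in where they end.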
Note: Escaping
To match one of the special regular expression characters literally, escape it by writing \\ in front of it in the R string.
A single \ is the escape character for strings: it tells the string processing tools that the next character should be read in a special way.
The regular expression itself needs a literal \, and so inside a string we need to escape the escape character.
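A short sketch of the two layers of escaping, with made-up test strings:

```r
# The R string "\\." holds two characters: a backslash and a period.
escaped_dot <- "\\."
n_chars <- nchar(escaped_dot)

x <- c("a.b", "axb")

# Unescaped "." is the wildcard, so both strings match.
unescaped <- grepl("a.b", x)

# Escaped "." matches only a literal period.
escaped <- grepl("a\\.b", x)

# fixed = TRUE treats the whole pattern as a literal string, no escaping needed.
literal <- grepl("a.b", x, fixed = TRUE)

# "$" is also special (end of line), so a literal dollar sign needs escaping.
dollar <- grepl("\\$", c("cost: $5", "no price"))

print(n_chars)
print(rbind(unescaped, escaped, literal))
print(dollar)
```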
^ (not inside square brackets) means that what comes after must be at the start of a line.
$ means that what comes before must be at the end of a line.
\< anchors to the beginning of a word.
\> anchors to the end of a word.
\b anchors to the boundary of words (beginning or ending).
\B anchors to anywhere aside from a word boundary.
Note that when you write any of the backslash anchors in a string, you will have to escape the \ (for example, "\\<").
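A sketch of the anchors on a made-up vector, using R's default regular expression engine. Each call asks where "cat" is allowed to sit within the string.

```r
x <- c("cat", "concat", "cats", "a cat sat")

at_start   <- grepl("^cat", x)        # "cat" at the start of the string
at_end     <- grepl("cat$", x)        # "cat" at the end of the string
word_start <- grepl("\\<cat", x)      # "cat" at the beginning of a word
word_end   <- grepl("cat\\>", x)      # "cat" at the end of a word
whole_word <- grepl("\\bcat\\b", x)   # "cat" as a complete word
inside     <- grepl("\\Bcat", x)      # "cat" not starting at a word boundary

print(rbind(at_start, at_end, word_start, word_end, whole_word, inside))
```

For example, "concat" ends with "cat" but does not contain it as a whole word, so it matches cat$ and cat\> but not \bcat\b.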