Reading: Matloff Chapter 11.2.
Wikipedia actually has a good treatment if you go through it slowly.
You can also find formal treatments of regular expressions in books on the foundations of computing, e.g. Hopcroft, Motwani, and Ullman (2000). Introduction to Automata Theory, Languages, and Computation.
A way of specifying a set of strings.
, *
specifies the set of strings that start with a comma and are followed by any number (including zero) of spaces.
Why do we need them?
Parsimony: We can express sets of strings more compactly.
For example: (J.|Julia) Fukuyama
represents the
set containing J. Fukuyama
and
Julia Fukuyama
Allows us to specify infinite-size sets in finite space.
For example: our , *
expression from before.
The formal definition of a regular expression is inductive. Suppose that we have a finite alphabet \(\Sigma\).
We start with specifying the following as regular expressions:
\(\emptyset\) is a regular expression specifying the empty set
\(\varepsilon\) is a regular
expression specifying the set containing the empty string
""
Literal character \(a\) is a regular expression specifying the one-element set \(\{a\}\), for \(a \in \Sigma\)
The following operations, performed on regular expressions, yield regular expressions:
Order of operations: Kleene star has highest priority, followed by concatenation, followed by alternation.
If a different grouping is desired, use parentheses
()
.
Examples:
a|b
: {"a", "b"}
a|b*
:
{ε, "a", "b", "bb", "bbb", ...}
(a|b)*
: The set of all string containing only
a
and b
,
{ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", ...}
ab*(c|ε)
: The set of strings starting with a single
a
, followed by zero or more b
’s, optionally
ending with a c
,
{"a", "ac", "ab", "abc", "abb", "abbc", ...}
Actual implementations of regular expressions have many more symbols and operators, but they are mostly just shorthand for some common operations that would take longer to express using only the three operations in the formal definition.
*
: Same as in the formal definition: zero or more
times.
?
: Zero or one occurrence of the preceding element.
colou?r
matches color
and
colour
.
+
: One or more occurrences of the preceding
element.
{m}
: Exactly m occurrences of the preceding
element.
{m,}
: At least m occurrences of the preceding
element.
{m,n}
: Between m and n occurrences of the preceding
element, inclusive.
[]
: Matches any single character inside the
brackets.
[^ ]
: Negation, matches anything except the
set of characters inside the brackets.
.
: Wildcard, matches any character.
Character classes, defined differently for the different implementations. See https://en.wikipedia.org/wiki/Regular_expression#Character_classes, the POSIX column.
Check whether a string contains a regular expression
Extract the portion of the string that matches a regular expression
Split a string into pieces delimited by a regular expression
grep
syntax: grep(re, string)
.
re
is a regular expression
string
is a character vector
The function will return the indices in string that contain a
match to re
grepl
(for grep logical): Same as grep
, but
returns a Boolean vector describing the indices of the strings that
contain a match to re
.
Note:
A regular expression describes a set of strings.
The functions we will be using don’t check whether a string is an element of the set of strings that the regular expression corresponds to. They check whether the target string has a substring that is an element of that set.
## [1] 4 16
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE
grep
and grepl
just tell us if the regular
expression is present. What if you want to know where it is within the
string?
regexpr
Syntax: regexpr(re, string)
re
is a regular expression.
string
is a character vector.
Gives the location of the first occurrence of the re
expression in string
.
gregexpr
: Same syntax as regexpr
, but gives
the locations of all the occurrences of re
instead
of just the first.
Example:
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[1]]
## [1] 1 20
## attr(,"match.length")
## [1] 5 6
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
If you want the text that matches the regular expression, you need
regmatches
Syntax: regmatches(text, match)
match
is the output from regexpr
or
gregexpr
, describing the locations of the regular
expressions.
text
is the character vector you passed to
regexpr
or gregexpr
.
The pattern you use to extract the text is
regmatches(text, regexpr(re, text))
Example: We have some sentence fragments and we would like to extract all the fruits from the text.
fruit_fly_text <-"drosophila like bananas"
wasp_text <- "wasps like bananas and baby fruit flies"
fruits
## [1] "apple|banana|fruit"
## [1] "banana"
## [1] "banana"
## [[1]]
## [1] "banana"
## [[1]]
## [1] "banana" "fruit"
Syntax: strsplit(s, split)
s
is a character vector (can have length greater
than 1), and the function vectorizes.
split
gives the regular expression we want to split
on: every element of s
will be split into pieces separated
by the regex split
.
Two other arguments you should know about, fixed
and
perl
:
fixed = TRUE
means to treat split
as
the literal string instead of a regular expression,
fixed = FALSE
is the default and means that we treat the
splitting expression as a regular expression.
perl
relates to the flavor of regular expressions
used.
## [[1]]
## [1] "my parents" "Ayn Rand" "God"
## [[1]]
## [1] "my parents" "Ayn Rand and God"
## [[1]]
## [1] "my parents" "Ayn Rand" "God"
Greedy quantification: There can be more than one match of a string to a regular expression that begins at the same place. Which one do you want?
Anchoring: Can constrain the expression to match at certain places in words or lines.
By default, quantifiers are greedy, meaning they match the longest substring possible.
We can make them have the opposite behavior by modifying them with
the ?
character: in that case, they match the shortest
substring possible.
## [1] "[i][j]"
## [1] "[i]"
Note: Escaping
The special characters used for regular expressions need to be
escaped using \\
.
One \
is the normal escape character, its function
is to tell the string processing tools that the next character should be
read in a special way.
When we create a regular expression, we need a literal
\
, and so we need to escape the escape character.
^
(not inside square brackets) means that what comes
after must be at the start of a line.
$
means that what comes before must be at the end of
a line.
\<
anchors to the beginning of a word.
\>
anchors to the end of a word. Note that when
you create a string using this operator, you will have to escape the
\
\b
anchors to the boundary of words (beginning or
ending)
\B
anchors to anywhere aside from the
boundary