Stat 610 Lecture 3: Regular Expressions

Regular expressions

Reading: Matloff Chapter 11.2.

There are also many tutorials about regular expressions online, many of them not very good.

Wikipedia actually has a good treatment if you go through it slowly.

You can also find formal treatments of regular expressions in books on the foundations of computing, e.g. Hopcroft, Motwani, and Ullman (2000). Introduction to Automata Theory, Languages, and Computation.

What is a regular expression?

A way of specifying a set of strings.

, *

specifies the set of strings consisting of a comma followed by any number (including zero) of spaces.
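As a quick check of this reading in R, here is a minimal sketch. The functions it uses are introduced later in these notes, and the test strings are made up for illustration.

```r
# The pattern ", *": a comma followed by zero or more spaces.
pattern <- ", *"

# Illustrative inputs (not from the lecture): two contain a comma, one does not.
examples <- c("a,b", "a,   b", "a b")

# Which strings contain a substring from the set described by the pattern?
has_match <- grepl(pattern, examples)

# The matched substrings themselves; strings with no match are dropped.
matched <- regmatches(examples, regexpr(pattern, examples))

print(has_match)
print(matched)
```

The second input shows the "any number of spaces" part: the match there is the comma together with all three spaces that follow it.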

Why do we need them?

Regular expressions abstractly

The formal definition of a regular expression is inductive. Suppose that we have a finite alphabet \(\Sigma\).

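We start by declaring the following to be regular expressions: the empty set \(\emptyset\), the empty string \(\epsilon\), and each single symbol \(a \in \Sigma\).

The following operations, performed on regular expressions \(R\) and \(S\), yield regular expressions: concatenation \(RS\) (a string from \(R\) followed by a string from \(S\)), alternation \(R|S\) (a string from either \(R\) or \(S\)), and Kleene star \(R^*\) (zero or more strings from \(R\), concatenated).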

Order of operations: Kleene star has highest priority, followed by concatenation, followed by alternation.

If a different grouping is desired, use parentheses ().

Examples:
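The precedence rules can be checked directly in R. This is only a sketch: `matches_whole` is a hypothetical helper (not part of the lecture or of base R) that wraps a pattern in the anchors `^` and `$`, so that the question becomes "is this whole string in the set?" rather than "does some substring match?".

```r
# Hypothetical helper: TRUE if the entire string s belongs to the set described by re.
matches_whole <- function(re, s) grepl(paste0("^(", re, ")$"), s)

# Star binds tighter than concatenation: ab* means a(b*), not (ab)*.
star_1 <- matches_whole("ab*", "abbb")     # TRUE
star_2 <- matches_whole("ab*", "abab")     # FALSE
star_3 <- matches_whole("(ab)*", "abab")   # TRUE

# Concatenation binds tighter than alternation: a|bc means a|(bc), not (a|b)c.
alt_1 <- matches_whole("a|bc", "bc")       # TRUE
alt_2 <- matches_whole("a|bc", "ac")       # FALSE
alt_3 <- matches_whole("(a|b)c", "ac")     # TRUE

print(c(star_1, star_2, star_3, alt_1, alt_2, alt_3))
```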

How they are actually implemented

Actual implementations of regular expressions have many more symbols and operators, but they are mostly just shorthand for some common operations that would take longer to express using only the three operations in the formal definition.

Quantification operations
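As an illustration of quantifiers as shorthand, the sketch below compares a few common ones (`+`, `{m,n}`, `?`) with equivalent expressions built only from concatenation, alternation, and star. The test strings are made up, and the anchors `^` and `$` are used so that each pattern must describe the whole string.

```r
strings <- c("", "a", "aa", "aaa", "aaaa")

# One or more a's, written with + and then with concatenation and star.
plus_short <- grepl("^a+$", strings)
plus_long  <- grepl("^aa*$", strings)

# Two or three a's, written with {2,3} and then with alternation.
range_short <- grepl("^a{2,3}$", strings)
range_long  <- grepl("^(aa|aaa)$", strings)

# Zero or one a.
optional_a <- grepl("^a?$", strings)

print(identical(plus_short, plus_long))
print(identical(range_short, range_long))
print(optional_a)
```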

Alternatives to |
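Bracket expressions are one common shorthand for alternation between single characters. The sketch below, with made-up inputs, compares an explicit alternation with a bracket expression, a range, and a named POSIX character class.

```r
chars <- c("a", "b", "c", "d", "7")

# Explicit alternation between three single characters.
by_alternation <- grepl("^(a|b|c)$", chars)

# The same set written as a bracket expression and as a range.
by_class <- grepl("^[abc]$", chars)
by_range <- grepl("^[a-c]$", chars)

# A named class: any single digit.
is_digit <- grepl("^[[:digit:]]$", chars)

print(by_alternation)
print(is_digit)
```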

What do we do with regular expressions?

Finding regular expressions

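grep: Syntax grep(pattern, x). Returns the indices of the elements of x that contain a match for pattern.

grepl (for grep logical): Same as grep, but returns a logical vector with one element per element of x, TRUE where there is a match.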

Pedantic note:

grep("(K|k)ansas", state.name)
## [1]  4 16
grepl("(K|k)ansas", state.name)
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
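The two return types carry the same information. A short sketch of how they relate, and of the `value = TRUE` argument to grep, which returns the matching strings instead of their indices:

```r
pattern <- "(K|k)ansas"

# Indices of the matching state names.
idx <- grep(pattern, state.name)

# The logical vector from grepl converts to the same indices.
same_idx <- which(grepl(pattern, state.name))

# Two equivalent ways to get the matching strings themselves.
by_index <- state.name[idx]
by_value <- grep(pattern, state.name, value = TRUE)

print(by_value)
```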

grep and grepl only tell us whether a match for the regular expression is present. What if we want to know where it is within the string?

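regexpr: Syntax regexpr(pattern, text). Returns the starting position of the first match in each element of text (-1 if there is none), with the length of the match stored in the match.length attribute.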

gregexpr: Same syntax as regexpr, but gives the locations of all the occurrences of regex instead of just the first.

Example:

fruits = "apple|banana|fruit"
regexpr(fruits, "fruit flies like a banana")
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
gregexpr(fruits, "fruit flies like a banana")
## [[1]]
## [1]  1 20
## attr(,"match.length")
## [1] 5 6
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
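The position and length returned by regexpr are enough to pull out the matched text by hand. A minimal sketch, where the variable names and the second sentence are illustrative assumptions:

```r
fruits <- "apple|banana|fruit"
sentence <- "fruit flies like a banana"

m <- regexpr(fruits, sentence)
start <- as.integer(m)            # starting position of the first match
len <- attr(m, "match.length")    # number of characters matched

# Extract the match from its position and length.
first_match <- substr(sentence, start, start + len - 1)

# When nothing matches, regexpr returns -1.
no_match <- regexpr(fruits, "time flies like an arrow")

print(first_match)
print(as.integer(no_match))
```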

If you want the text that matches the regular expression, you need regmatches.

Syntax: regmatches(text, match), where match is the output of regexpr or gregexpr applied to text.

The idiom for extracting the matched text is:

regmatches(text, regexpr(re, text))

Example: We have some sentence fragments and we would like to extract all the fruits from the text.

fruit_fly_text = "drosophila like bananas"
wasp_text = "wasps like bananas and baby fruit flies"
fruits
## [1] "apple|banana|fruit"
regmatches(fruit_fly_text, regexpr(fruits, fruit_fly_text))
## [1] "banana"
regmatches(wasp_text, regexpr(fruits, wasp_text))
## [1] "banana"
regmatches(fruit_fly_text, gregexpr(fruits, fruit_fly_text))
## [[1]]
## [1] "banana"
regmatches(wasp_text, gregexpr(fruits, wasp_text))
## [[1]]
## [1] "banana" "fruit"
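One detail worth knowing when text is a vector is how strings with no match are handled. The sketch below uses a made-up second sentence that contains no fruit.

```r
fruits <- "apple|banana|fruit"
texts <- c("drosophila like bananas", "wasps like nectar")

# With regexpr, elements with no match are dropped from the result,
# so the output can be shorter than the input.
first_only <- regmatches(texts, regexpr(fruits, texts))

# With gregexpr, the result is a list with one entry per input string;
# strings with no match get an empty character vector.
all_matches <- regmatches(texts, gregexpr(fruits, texts))

print(first_only)
print(lengths(all_matches))
```

If the results need to stay aligned with the input, the gregexpr form is the safer of the two.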

Splitting on regular expressions

Syntax: strsplit(s, split)

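Two other arguments you should know about, fixed and perl: fixed = TRUE treats split as a literal string rather than a regular expression, and perl = TRUE interprets split as a Perl-compatible regular expression.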
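A sketch of what these two arguments change, using made-up inputs. The lookahead `(?=...)` in the last call is Perl syntax, which is why perl = TRUE is set there.

```r
# By default split is a regular expression, so "." means "any character"
# and every character is treated as a separator.
regex_split <- strsplit("a.b.c", ".")[[1]]

# fixed = TRUE: "." is a literal period.
fixed_split <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# Escaping the period in the regular expression has the same effect.
escaped_split <- strsplit("a.b.c", "\\.")[[1]]

# perl = TRUE: split only at the last "=", using a lookahead that requires
# no further "=" before the end of the string.
perl_split <- strsplit("key=value=more", "=(?=[^=]*$)", perl = TRUE)[[1]]

print(fixed_split)
print(perl_split)
```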


strsplit("my parents, Ayn Rand, and God", ",[[:space:]]*(and)?[[:space:]]+")
## [[1]]
## [1] "my parents" "Ayn Rand"   "God"
strsplit("my parents, Ayn Rand and God", ",[[:space:]]*(and)?[[:space:]]+")
## [[1]]
## [1] "my parents"       "Ayn Rand and God"
strsplit("my parents, Ayn Rand and God", "(,[[:space:]]*)|([[:space:]]+and[[:space:]]+)")
## [[1]]
## [1] "my parents" "Ayn Rand"   "God"

Two more ideas

Greedy quantification

By default, quantifiers are greedy, meaning they match the longest substring possible.

Appending the ? character to a quantifier makes it lazy (non-greedy): it then matches the shortest substring possible.

regmatches("[i][j]", regexpr("\\[.*\\]", "[i][j]"))
## [1] "[i][j]"
regmatches("[i][j]", regexpr("\\[.*?\\]", "[i][j]"))
## [1] "[i]"
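The same effect shows up when extracting every match with gregexpr. The sketch below uses a made-up HTML fragment and also shows a common alternative to lazy quantifiers, a negated bracket expression that cannot run past the closing delimiter.

```r
html <- "<b>bold</b>"

# Greedy: .* runs to the last ">" in the string, giving one long match.
greedy <- regmatches(html, gregexpr("<.*>", html))[[1]]

# Lazy: .*? stops at the first ">" it can, giving one match per tag.
lazy <- regmatches(html, gregexpr("<.*?>", html))[[1]]

# Alternative: match anything except ">" between the angle brackets.
negated <- regmatches(html, gregexpr("<[^>]*>", html))[[1]]

print(greedy)
print(lazy)
```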

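Note: Escaping. Characters with a special meaning in regular expressions, such as [ and ], must be preceded by a backslash to be matched literally. Inside an R string the backslash itself must be doubled, which is why the patterns above are written with \\[ and \\].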
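A short sketch of the two layers of escaping at work, with made-up inputs: the R string "\\[" holds two characters, a backslash and a bracket, and it is that two-character sequence that the regular expression engine sees.

```r
pattern <- "\\["

# The string contains two characters: a backslash and "[".
n_chars <- nchar(pattern)

# As a regular expression, it matches a literal "[".
has_bracket <- grepl(pattern, c("x[1]", "x1"))

# The same idea with ".": unescaped it matches any character,
# escaped (or with fixed = TRUE) it matches only a literal period.
dot_any     <- grepl(".", "abc")
dot_literal <- grepl("\\.", "abc")
dot_fixed   <- grepl(".", "abc", fixed = TRUE)

print(n_chars)
print(has_bracket)
```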

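Anchoring

Anchors match a position in the string rather than a character. In the examples below, \\< matches at the beginning of a word, \\b matches at either edge of a word, and \\B matches anywhere that is not an edge of a word.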

grepl("\\<a", "hat at")
## [1] TRUE
grepl("\\<a", "hat cat")
## [1] FALSE
grepl("\\bnana", "bananas")
## [1] FALSE
grepl("\\Bnana", "bananas")
## [1] TRUE
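Besides the word anchors used above, the standard anchors `^` (start of string), `$` (end of string), and `\\>` (end of a word) work the same way. They do not appear in the examples above, so the following is a supplementary sketch with made-up inputs.

```r
words <- c("banana", "bandana", "cabana")

# ^ anchors the match to the start of the string.
starts_with_ban <- grepl("^ban", words)

# $ anchors the match to the end of the string.
ends_with_nana <- grepl("nana$", words)

# \\> anchors the match to the end of a word.
at_word_end     <- grepl("at\\>", "hat cat")
not_at_word_end <- grepl("at\\>", "attic")

print(starts_with_ban)
print(ends_with_nana)
```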

A final note

They’re not for everything