Stat 610 Lecture 4: Text representations and data frames

Reading: Matloff Chapter 11.1 (for strings), 6.1 (for factors), 5.1 (for data frames)

The character type

typeof('a')
## [1] "character"
typeof("ABC")
## [1] "character"
length('a')
## [1] 1
length("ABC")
## [1] 1
nchar('a')
## [1] 1
nchar("ABC")
## [1] 3
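
The difference between length and nchar is easier to see on a vector that holds more than one string. A minimal sketch, where the vector films is made up for illustration and is not part of the lecture's data:

```r
## `films` is an illustrative vector (an assumption, not from the lecture)
films <- c("The Leopard", "Senso", "")
length(films)   # number of strings in the vector
## [1] 3
nchar(films)    # number of characters in each string
## [1] 11  5  0
```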

Creating strings

You can use either double or single quotes.

"The Leopard"
## [1] "The Leopard"
'Burt Lancaster'
## [1] "Burt Lancaster"

Displaying strings

print("The Leopard")
## [1] "The Leopard"
"The Leopard"
## [1] "The Leopard"
cat("The Leopard")
## The Leopard

Escape characters

The \ is an escape character.

It tells R to interpret whatever comes after in a different way from normal.

There are two cases:

  1. If the escape character comes before a character that would normally have a special meaning (such as a quote, which would end the string), the escape character indicates that the next character should be used literally, and not for its special meaning.

  2. If the escape character comes before certain ordinary characters, it gives them a special meaning: for example, \n is a newline and \t is a tab.

As an example of case 1, a " or ' inside a string definition would usually mark the end of the string.

If you want a literal quote of the same kind that delimits the string, you need to escape it.

cat("Giuseppe Tomasi di Lampedusa's \"The Leopard\"")
## Giuseppe Tomasi di Lampedusa's "The Leopard"
cat('Giuseppe Tomasi di Lampedusa\'s "The Leopard"')
## Giuseppe Tomasi di Lampedusa's "The Leopard"

Another example of case 1: normally the \ means to escape the next character. If you need a \ in your string, you need to escape it:

cat("We use the '\\' character to escape")
## We use the '\' character to escape
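
Case 2 can be sketched the same way. In the example here, \n is typed as two characters but stored as one, and print and cat treat it differently. The variable name two.lines is just for illustration:

```r
## `two.lines` is an illustrative name (an assumption, not from the lecture)
two.lines <- "first\nsecond"
nchar("\n")                   # the escape sequence is a single character
## [1] 1
print(two.lines)              # print shows the escaped form
## [1] "first\nsecond"
cat(two.lines, fill = TRUE)   # cat interprets the escape as a line break
## first
## second
```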

From last time: Greedy quantification

By default, quantifiers are greedy, meaning they match the longest substring possible.

We can make them have the opposite behavior by modifying them with the ? character: in that case, they match the shortest substring possible.

regmatches("[i][j]", regexpr("\\[.*\\]", "[i][j]"))
## [1] "[i][j]"
regmatches("[i][j]", regexpr("\\[.*?\\]", "[i][j]"))
## [1] "[i]"
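
Note that regexpr only reports the first match. As a small sketch, swapping in gregexpr (which returns every match) makes the difference between the greedy and non-greedy quantifier more visible. The names greedy and lazy are illustrative:

```r
x <- "[i][j]"
## gregexpr finds all matches; regmatches returns a list with one entry per input string
greedy <- regmatches(x, gregexpr("\\[.*\\]", x))[[1]]
lazy   <- regmatches(x, gregexpr("\\[.*?\\]", x))[[1]]
greedy   # one long match
## [1] "[i][j]"
lazy     # two short matches
## [1] "[i]" "[j]"
```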

Anchoring

How do we represent the anchors in R?

Look at what happens if we try to create a string with the anchors:

cat("\B", fill = TRUE)
## Error: '\B' is an unrecognized escape in character string (<text>:1:7)
cat("\>", fill = TRUE)
## Error: '\>' is an unrecognized escape in character string (<text>:1:7)
cat("\b", fill = TRUE)
## 

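The first two fail because \B and \> are not valid escapes in an R string. The third succeeds only because \b is a valid escape for the backspace character, which is not the anchor we want.

So we need to escape the \: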

cat("\\B", fill = TRUE)
## \B
cat("\\>", fill = TRUE)
## \>
cat("\\b", fill = TRUE)
## \b
cat("\\<a", fill = TRUE)
## \<a
grepl("\\<a", "hat at")
## [1] TRUE
grepl("\\<a", "hat cat")
## [1] FALSE
cat("\\bnana", fill = TRUE)
## \bnana
grepl("\\bnana", "bananas")
## [1] FALSE
cat("\\Bnana", fill = TRUE)
## \Bnana
grepl("\\Bnana", "bananas")
## [1] TRUE
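
Since grepl is vectorized over the strings being searched, the three anchors used above can be compared side by side. A minimal sketch, with words as an illustrative vector:

```r
## `words` is an illustrative vector (an assumption, not from the lecture)
words <- c("hat at", "hat cat", "bananas", "nana")
grepl("\\<a", words)      # an "a" at the start of a word
## [1]  TRUE FALSE FALSE FALSE
grepl("\\bnana", words)   # "nana" starting at a word boundary
## [1] FALSE FALSE FALSE  TRUE
grepl("\\Bnana", words)   # "nana" starting in the interior of a word
## [1] FALSE FALSE  TRUE FALSE
```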

String Manipulation

We can’t use [] or [[]] to get at the internal parts of a string: in R a string is a single element of a character vector, so indexing selects whole strings, not the characters inside them.

We need to use substr.

Syntax is substr(string, start, stop):

lancaster <- "Burt Lancaster"
substr(lancaster, 1, 4)
## [1] "Burt"
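
To see why [] does not help, here is a short sketch comparing indexing with substr. The variable actor is an illustrative copy of the string used above:

```r
## `actor` is an illustrative name (an assumption, not from the lecture)
actor <- "Burt Lancaster"
actor[1]                        # element 1 of a length-1 vector: the whole string
## [1] "Burt Lancaster"
actor[2]                        # there is no second element
## [1] NA
substr(actor, 6, 6)             # a single character
## [1] "L"
strsplit(actor, "")[[1]][1:4]   # splitting on "" gives one string per character
## [1] "B" "u" "r" "t"
```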

substr is vectorized, and the length of the result is determined by the first argument: the start and stop arguments are recycled to match the length of the first argument, and any extra start or stop values are ignored:

substr("Burt Lancaster", 1, 4)
## [1] "Burt"
substr("Burt Lancaster", c(1, 6), c(4, 14))
## [1] "Burt"
substr(rep("Burt Lancaster", 2), c(1, 6), c(4, 14))
## [1] "Burt"      "Lancaster"
substr(rep("Burt Lancaster", 2), 1, c(4, 14))
## [1] "Burt"           "Burt Lancaster"

We can use substr for replacement as well:

lancaster
## [1] "Burt Lancaster"
substr(lancaster, 1, 4) <- "Bill"
lancaster
## [1] "Bill Lancaster"
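
One thing to watch for: replacement with substr never changes the length of the string. A small sketch of what happens when the replacement is shorter or longer than the range, with sub shown as one possible alternative when the length needs to change. The names actor, short and long are illustrative:

```r
actor <- "Burt Lancaster"

short <- actor
substr(short, 1, 4) <- "Bo"        # only the first 2 characters are replaced
short
## [1] "Bort Lancaster"

long <- actor
substr(long, 1, 4) <- "William"    # the value is cut off after 4 characters
long
## [1] "Will Lancaster"

sub("Burt", "William", actor)      # sub can change the length of the string
## [1] "William Lancaster"
```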

Combining strings

paste is the workhorse function for string combination.

Simplest way to use it:

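paste(s1, s2, ..., sn, sep = " ")

This will paste together s1, s2, up to sn, with sep in between each one. The sep argument must be given by name (an unnamed argument would be treated as one more string to paste), and it defaults to a single space.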

For example:

paste("Chico", "Harpo", "Groucho", sep = ", ")
## [1] "Chico, Harpo, Groucho"

The arguments to paste can be vectors, and the function is vectorized so we get recycling:

paste(c("Chico", "Harpo", "Groucho"), "Marx", sep = " ")
## [1] "Chico Marx"   "Harpo Marx"   "Groucho Marx"
paste("Marx", c("Chico", "Harpo", "Groucho"), sep = ", ")
## [1] "Marx, Chico"   "Marx, Harpo"   "Marx, Groucho"

Final argument: collapse.

Syntax: paste(vector, collapse = string)

This creates one string (a vector of length 1, not the length of the input vector) by pasting the elements of vector together, with the value of collapse in between them. Like sep, collapse must be given by name.

paste(c("Chico", "Harpo", "Groucho"), collapse = ", ")
## [1] "Chico, Harpo, Groucho"

Finally, we can specify both sep and collapse together.

Think of this as first calling paste with collapse = NULL, then calling paste with non-null collapse on the result:

paste(c("Chico", "Harpo", "Groucho"), "Marx", sep = " ", collapse = " and ")
## [1] "Chico Marx and Harpo Marx and Groucho Marx"
## note the equivalence:
marx.brothers = paste(c("Chico", "Harpo", "Groucho"), "Marx", sep = " ", collapse = NULL)
paste(marx.brothers, collapse = " and ")
## [1] "Chico Marx and Harpo Marx and Groucho Marx"
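
A few related calls that come up often, sketched here for reference: the default separator, the paste0 shorthand for sep = "", and pasting strings with numbers. The vector brothers is an illustrative name:

```r
brothers <- c("Chico", "Harpo", "Groucho")
paste("Marx", "Brothers")          # the default sep is a single space
## [1] "Marx Brothers"
paste0(brothers, ".txt")           # paste0(...) is paste(..., sep = "")
## [1] "Chico.txt"   "Harpo.txt"   "Groucho.txt"
paste(brothers, 1:3, sep = "-")    # numbers are converted to strings
## [1] "Chico-1"   "Harpo-2"   "Groucho-3"
```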

Splitting strings

The primary function is strsplit.

Syntax: strsplit(s, split)

Given these parameters, what do you expect the output from strsplit to look like? Which of the data structures we’ve seen so far can accommodate everything we need?

split.brothers <- strsplit("Groucho and Harpo and Chico", "and")
typeof(split.brothers)
## [1] "list"
split.brothers
## [[1]]
## [1] "Groucho " " Harpo "  " Chico"
split.two.groups <- strsplit(c("Groucho and Harpo and Chico", "David and John"), "and")
typeof(split.two.groups)
## [1] "list"
split.two.groups
## [[1]]
## [1] "Groucho " " Harpo "  " Chico"  
## 
## [[2]]
## [1] "David " " John"
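
The pieces above still carry the spaces that surrounded "and". Two possible ways to deal with that are sketched here, along with a reminder that split is treated as a regular expression unless fixed = TRUE is given:

```r
s <- "Groucho and Harpo and Chico"
strsplit(s, " and ")[[1]]                  # include the spaces in the separator
## [1] "Groucho" "Harpo"   "Chico"
trimws(strsplit(s, "and")[[1]])            # or trim the whitespace afterwards
## [1] "Groucho" "Harpo"   "Chico"

## split is a regular expression by default, so "." matches every character
strsplit("a.b.c", ".")[[1]]
## [1] "" "" "" "" ""
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # fixed = TRUE treats "." literally
## [1] "a" "b" "c"
```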

A slightly more realistic example:

I have some fasta files that I’m going to use as input, perform some manipulations on, and then write some output based on each one. I want the output files to have the same prefix but have the extension .txt instead of .fasta:

file.names <- c("ighv_human.fasta", "ighd_mouse.fasta", "ighj_human.fasta", "ighv_mouse.fasta", "ighd_human.fasta", "ighj_mouse.fasta")
split.files <- strsplit(file.names, ".", fixed = TRUE)
output.names <- character(length = length(file.names))
for(i in seq_along(split.files)) {
    prefix <- split.files[[i]][[1]]
    output.names[i] <- paste(prefix, ".txt", sep = "")
}
output.names
## [1] "ighv_human.txt" "ighd_mouse.txt" "ighj_human.txt" "ighv_mouse.txt"
## [5] "ighd_human.txt" "ighj_mouse.txt"
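
The loop above is one way to do it. As a sketch of two loop-free alternatives (not the approach taken in the lecture), we could replace the extension with a regular expression, or pull the prefixes out of the strsplit result with sapply. The names fasta.files, txt.files and prefixes are illustrative:

```r
fasta.files <- c("ighv_human.fasta", "ighd_mouse.fasta", "ighj_human.fasta")

## option 1: sub is vectorized; "\\.fasta$" matches a literal ".fasta" at the end
txt.files <- sub("\\.fasta$", ".txt", fasta.files)
txt.files
## [1] "ighv_human.txt" "ighd_mouse.txt" "ighj_human.txt"

## option 2: take the first piece of each split, then paste on the new extension
prefixes <- sapply(strsplit(fasta.files, ".", fixed = TRUE), `[[`, 1)
paste0(prefixes, ".txt")
## [1] "ighv_human.txt" "ighd_mouse.txt" "ighj_human.txt"
```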

Factors

Factor creation:

factor(c("a", "b", "b", "z"))
## [1] a b b z
## Levels: a b z
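
By default the levels are the sorted unique values. A short sketch of how to inspect them and how to set them explicitly with the levels argument. The names f, g and h are illustrative:

```r
f <- factor(c("a", "b", "b", "z"))
levels(f)     # the distinct values, sorted
## [1] "a" "b" "z"
nlevels(f)
## [1] 3

## levels can be given explicitly, in any order
g <- factor(c("a", "b", "b", "z"), levels = c("z", "b", "a"))
levels(g)
## [1] "z" "b" "a"

## values that are not among the levels become NA
h <- factor(c("a", "b", "b", "z"), levels = c("a", "b"))
is.na(h)
## [1] FALSE FALSE FALSE  TRUE
```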

Factors vs. strings

state.name
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
typeof(state.name) ## typeof tells us about R's internal representation of the variable
## [1] "character"
state.name.fac <- as.factor(state.name)
typeof(state.name.fac) ## typeof tells us that from R's point of view, state.name.fac is a vector of integers 
## [1] "integer"
class(state.name.fac) ## class tells us that the object is a factor
## [1] "factor"
attributes(state.name.fac)
## $levels
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"       
## 
## $class
## [1] "factor"
unclass(state.name.fac)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## attr(,"levels")
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
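
The same structure is easier to read on a small factor. A minimal sketch showing the integer codes and how they index into the levels to recover the original strings (f is an illustrative name):

```r
f <- factor(c("b", "a", "b", "z"))
as.integer(f)              # the integer codes stored in the factor
## [1] 2 1 2 3
levels(f)[as.integer(f)]   # each code indexes into the levels
## [1] "b" "a" "b" "z"
as.character(f)            # the same conversion, done for us
## [1] "b" "a" "b" "z"
```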

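This representation is usually more parsimonious when values repeat, because each distinct string is stored once in the levels and the data themselves are stored as integers (but not always: with no repeated values, the factor is the larger object):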

object.size(state.name)
## 3496 bytes
object.size(state.name.fac)
## 4080 bytes
object.size(rep(state.name, 1000))
## 403096 bytes
object.size(rep(state.name.fac, 1000))
## 203880 bytes

Factor manipulation

Problems we might need to deal with:

Suppose California falls off into the ocean and Puerto Rico becomes a state.

We need to change our state names to reflect the new state of affairs, and we want to replace California with Puerto Rico.

cal.index <- which(state.name.fac == "California")
state.name.fac[cal.index] = "Puerto Rico"
## Warning in `[<-.factor`(`*tmp*`, cal.index, value = "Puerto Rico"): invalid
## factor level, NA generated
state.name.fac
##  [1] Alabama        Alaska         Arizona        Arkansas       <NA>          
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      Texas          Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 50 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming

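Why does this happen? A factor can only take values from its set of levels, and "Puerto Rico" is not one of them, so R stores NA instead.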

Let’s try again:

state.name.fac <- factor(state.name)
levels(state.name.fac) <- c(levels(state.name.fac), "Puerto Rico")
state.name.fac[cal.index] <- "Puerto Rico"
state.name.fac
##  [1] Alabama        Alaska         Arizona        Arkansas       Puerto Rico   
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      Texas          Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 51 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Puerto Rico
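
The same pattern on a small factor, as a sketch: check whether the new value is a level, add it if not, and only then assign. The factor f and the value "q" are illustrative:

```r
f <- factor(c("a", "b", "b", "z"))
"q" %in% levels(f)               # assigning "q" now would give NA
## [1] FALSE
levels(f) <- c(levels(f), "q")   # add the new level at the end
f[1] <- "q"                      # the assignment is now valid
as.character(f)
## [1] "q" "b" "b" "z"
levels(f)                        # "a" is still a level, although no longer used
## [1] "a" "b" "z" "q"
```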

Then Puerto Rico decides to change its name to The People’s Republic of Puerto Rico, or The PR of PR, for its pleasing symmetry.

We need to modify our state names again. How can we do it?

pr.index <- which(levels(state.name.fac) == "Puerto Rico")
levels(state.name.fac)[pr.index] <- "The PR of PR"
state.name.fac
##  [1] Alabama        Alaska         Arizona        Arkansas       The PR of PR  
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      Texas          Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 51 Levels: Alabama Alaska Arizona Arkansas California Colorado ... The PR of PR

We now have an unused level, California, hanging around, and we would like to get rid of it.

What can we do?

cal.index <- which(levels(state.name.fac) == "California")
levels(state.name.fac) <- levels(state.name.fac)[-cal.index]
## Error in `levels<-.factor`(`*tmp*`, value = c("Alabama", "Alaska", "Arizona", : number of levels differs

This is good behavior! It prevents us from messing up our factor variables by mistake.

How do we actually do this?

droplevels(state.name.fac) ## droplevels removes the levels that no longer appear in the data
##  [1] Alabama        Alaska         Arizona        Arkansas       The PR of PR  
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      Texas          Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 50 Levels: Alabama Alaska Arizona Arkansas Colorado Connecticut ... The PR of PR
factor(state.name.fac, levels = levels(state.name.fac)[-cal.index])
##  [1] Alabama        Alaska         Arizona        Arkansas       The PR of PR  
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      Texas          Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 50 Levels: Alabama Alaska Arizona Arkansas Colorado Connecticut ... The PR of PR
factor(state.name.fac)
##  [1] Alabama        Alaska         Arizona        Arkansas       The PR of PR  
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      Texas          Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 50 Levels: Alabama Alaska Arizona Arkansas Colorado Connecticut ... The PR of PR
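
Note that all three calls return a new factor and leave state.name.fac itself unchanged, so the result has to be assigned to be kept. A small sketch on an illustrative factor f:

```r
f <- factor(c("a", "b", "z"))
f <- f[f != "b"]      # subsetting keeps all of the original levels
levels(f)
## [1] "a" "b" "z"
f <- droplevels(f)    # assign the result; f is not modified in place
levels(f)
## [1] "a" "z"
```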

Ordered factors

No detail here, but for reference: ordered factors exist and are created by passing ordered = TRUE to factor. Some functions will have nicer default behavior if you tell them that ordinal variables are ordinal.
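
For reference, a minimal sketch of an ordered factor. The variable sizes and its levels are made up for illustration. The order of the levels argument defines the ordering, which makes comparisons, max and sort meaningful:

```r
## `sizes` is an illustrative example (an assumption, not from the lecture)
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
is.ordered(sizes)
## [1] TRUE
sizes < "large"              # comparisons use the order of the levels
## [1]  TRUE FALSE  TRUE  TRUE
as.character(max(sizes))
## [1] "large"
as.character(sort(sizes))    # sorted by level order, not alphabetically
## [1] "small"  "small"  "medium" "large"
```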

Data frames

Why do we need data frames?

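Data frames in R are lists with some extra rules: each element of the list is a column, and all of the columns must have the same length.

This gives essentially a heterogeneous analog of a matrix: rectangular like a matrix, but with columns that can hold different types.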
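
A small sketch of what "heterogeneous" buys us, comparing a matrix with a data frame built from the same values. The names m, d and bad are illustrative, and the last step relies on the standard data.frame behavior of rejecting columns of different lengths:

```r
## a matrix has a single type, so the numbers are converted to strings
m <- cbind(c("Jack", "Jill"), c(12, 10))
typeof(m)
## [1] "character"

## a data frame keeps one type per column
d <- data.frame(name = c("Jack", "Jill"), age = c(12, 10),
                stringsAsFactors = FALSE)
typeof(d$name)
## [1] "character"
typeof(d$age)
## [1] "double"

## columns of different lengths are rejected with an error
bad <- try(data.frame(a = 1:2, b = 1:3), silent = TRUE)
class(bad)
## [1] "try-error"
```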

Data frame creation

kids <- c("Jack", "Jill")
ages <- c(12, 10)
df <- data.frame(kids, ages) ## column names are taken from the variable names
df <- data.frame(Kids = kids, Ages = ages) ## or supply the column names explicitly
typeof(df)
## [1] "list"
class(df)
## [1] "data.frame"
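
Because a data frame is a list of columns, the usual list functions apply alongside the matrix-style ones. A brief sketch, rebuilding the same df so that the block stands alone:

```r
df <- data.frame(Kids = c("Jack", "Jill"), Ages = c(12, 10))
names(df)    # the column names
## [1] "Kids" "Ages"
length(df)   # number of list elements, i.e. the number of columns
## [1] 2
dim(df)      # rows and columns, as for a matrix
## [1] 2 2
```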

Referring to elements of a data frame

Remember that data frames are lists, with each element of the list corresponding to one column.

This means we can use the [[]] notation to get a column:

df[[1]]
## [1] "Jack" "Jill"

We can also refer to portions of a data frame the same way we do for matrices, with the [] notation:

df[,1]
## [1] "Jack" "Jill"
df[,1,drop = FALSE]
##   Kids
## 1 Jack
## 2 Jill
df[1,"Ages"]
## [1] 12

Finally, we have a special operator, $, to get just one of the columns of the data frame:

df$Ages
## [1] 12 10

Pay attention to the return type: some subsetting operations give you a data frame back, but some give you a vector.
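
One way to check the return type is to call class on the result of each subsetting operation. A short sketch, rebuilding the same df so that the block stands alone:

```r
df <- data.frame(Kids = c("Jack", "Jill"), Ages = c(12, 10))

## these return the column itself, as a vector
class(df[[2]])
## [1] "numeric"
class(df[, 2])
## [1] "numeric"
class(df$Ages)
## [1] "numeric"

## these return a data frame
class(df[2])
## [1] "data.frame"
class(df[, 2, drop = FALSE])
## [1] "data.frame"
class(df[1, ])
## [1] "data.frame"
```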

That’s all, folks