Reading: Matloff Chapter 11.1 (for strings), 6.1 (for factors), 5.1 (for data frames)
Strings have type character in R.
No special type for single characters vs. strings:
## [1] "character"
## [1] "character"
## [1] 1
## [1] 1
To get the number of characters in a string, use the nchar function:
## [1] 1
## [1] 3
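The code that produced the output above is not shown. A minimal sketch that would give the same results, assuming the example strings were "a" and "abc" (the variable names and values are assumptions, not taken from the original):

```r
# Assumed example values; the original inputs are not shown.
one.char = "a"
three.chars = "abc"

typeof(one.char)      # "character"
typeof(three.chars)   # "character"

# length() counts vector elements, not characters:
# each of these is a character vector of length 1.
length(one.char)      # 1
length(three.chars)   # 1

# nchar() counts the characters inside each string.
nchar(one.char)       # 1
nchar(three.chars)    # 3
```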
You can use either double or single quotes.
## [1] "The Leopard"
## [1] "Burt Lancaster"
print works.
Sometimes it is slightly nicer to use cat, which prints the string itself to the console instead of the representation with an index and quotes that print gives.
## [1] "The Leopard"
## [1] "The Leopard"
## The Leopard
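A sketch of calls that would produce output like the above. The variable names film and actor are assumptions, since the original code is not shown:

```r
# Either quote style creates the same kind of string.
film = "The Leopard"
actor = 'Burt Lancaster'

print(film)       # [1] "The Leopard"  (index and quotes shown)
film              # auto-printing at the console gives the same display
cat(film, "\n")   # The Leopard      (no index, no quotes)
```

Note that cat does not add a newline by itself, which is why "\n" is passed as a second argument here.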
The \ is an escape character.
It tells R to treat the character that comes after it specially: an escaped quote, for example, is read as a literal quote character instead of as the end of the string.
For example, if you need quotes inside your string, you need to escape the quotation character:
## Giuseppe Tomasi di Lampedusa's "The Leopard"
## Giuseppe Tomasi di Lampedusa's "The Leopard"
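Two ways of writing this string, as a sketch (the variable names are assumptions). Both give the same result when printed with cat:

```r
# Double quotes inside a double-quoted string must be escaped with a backslash.
escaped = "Giuseppe Tomasi di Lampedusa's \"The Leopard\""
cat(escaped, "\n")

# Alternatively, single-quote the string and escape the apostrophe instead.
alternative = 'Giuseppe Tomasi di Lampedusa\'s "The Leopard"'
cat(alternative, "\n")

# The backslashes are not part of the stored string, so both are equal.
identical(escaped, alternative)   # TRUE
```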
We can’t use [] or [[]] to get at the internal parts of a string because strings are atomic in R.
We need to use substr.
Syntax is substr(string, start, stop):
## [1] "Burt"
substr vectorizes over the first argument (i.e. the start and stop arguments will be recycled to match the length of the first argument):
## [1] "Burt"
## [1] "Burt"
## [1] "Burt" "Lancaster"
## [1] "Burt" "Burt Lancaster"
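Calls along these lines would reproduce the four results above. The exact calls used in the original are not shown, so treat this as one possible reconstruction:

```r
actor = "Burt Lancaster"

substr(actor, 1, 4)                        # "Burt"

# The result has the length of the first argument, so with a single
# string only the first start/stop pair is used.
substr(actor, c(1, 6), c(4, 14))           # "Burt"

# Repeating the string gives one result per start/stop pair.
substr(rep(actor, 2), c(1, 6), c(4, 14))   # "Burt" "Lancaster"

# A single start value is recycled across both strings.
substr(rep(actor, 2), 1, c(4, 14))         # "Burt" "Burt Lancaster"
```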
We can use substr for replacement as well:
## [1] "Burt Lancaster"
## [1] "Bill Lancaster"
What happens if the replacement string isn’t the same length as the piece of the string you want to replace?
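One way to explore this question is to try both cases. In this sketch (variable names are assumptions), the total length of the string never changes:

```r
actor = "Burt Lancaster"
substr(actor, 1, 4) = "Bill"
actor                         # "Bill Lancaster"

# Replacement longer than the target span: the extra characters are dropped.
too.long = "Burt Lancaster"
substr(too.long, 1, 4) = "William"
too.long                      # "Will Lancaster"

# Replacement shorter than the target span: only that many characters change.
too.short = "Burt Lancaster"
substr(too.short, 1, 4) = "Al"
too.short                     # "Alrt Lancaster"
```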
paste is the workhorse function for string combination.
Simplest way to use it:
paste(s1, s2, ..., sn, sep = sep)
This will paste together s1, s2, up to sn, with sep in between each one. The sep argument has to be named, otherwise it is treated as one more string to paste.
For example:
## [1] "Chico, Harpo, Groucho"
The arguments to paste can be vectors, and the function is vectorized so we get recycling:
## [1] "Chico Marx" "Harpo Marx" "Groucho Marx"
## [1] "Marx, Chico" "Marx, Harpo" "Marx, Groucho"
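A sketch of paste calls consistent with the output above. The variable name first.names is an assumption:

```r
first.names = c("Chico", "Harpo", "Groucho")

# Several single strings joined with sep in between.
paste("Chico", "Harpo", "Groucho", sep = ", ")   # "Chico, Harpo, Groucho"

# "Marx" has length 1 and is recycled against the three first names.
paste(first.names, "Marx")                # "Chico Marx" "Harpo Marx" "Groucho Marx"
paste("Marx", first.names, sep = ", ")    # "Marx, Chico" "Marx, Harpo" "Marx, Groucho"
```

The default value of sep is a single space, which is why the second call needs no sep argument.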
Final argument: collapse.
Syntax: paste(vector, collapse = collapse)
This will create one string (a vector of length 1, not the length of the input vector) by pasting the elements of vector together, with the value of collapse in between them.
## [1] "Chico, Harpo, Groucho"
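The difference between sep and collapse is easy to miss, so here is a small sketch contrasting them on one vector (first.names is an assumed variable name):

```r
first.names = c("Chico", "Harpo", "Groucho")

# collapse joins the elements of the vector into a single string.
collapsed = paste(first.names, collapse = ", ")
collapsed           # "Chico, Harpo, Groucho"
length(collapsed)   # 1

# sep goes between *arguments*; with only one argument it has no effect,
# and the result still has one element per input element.
separated = paste(first.names, sep = ", ")
length(separated)   # 3
```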
Finally, we can specify both sep and collapse together.
Think of this as first calling paste with collapse = NULL, then calling paste with non-null collapse on the result:
## [1] "Chico Marx and Harpo Marx and Groucho Marx"
## note the equivalence:
marx.brothers = paste(c("Chico", "Harpo", "Groucho"), "Marx", sep = " ", collapse = NULL)
paste(marx.brothers, collapse = " and ")
## [1] "Chico Marx and Harpo Marx and Groucho Marx"
The primary function is strsplit.
Syntax: strsplit(s, split)
s is a character vector (can have length greater than 1), and the function vectorizes.
split gives the string we want to split on: every element of s will be split into pieces separated by split.
Given these parameters, what do you expect the output from strsplit to look like? Which of the data structures we’ve seen so far can accommodate everything we need?
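To check your answer, a small experiment helps. This sketch reuses strings from earlier in these notes. The variable names are assumptions:

```r
titles = c("The Leopard", "Burt Lancaster", "Giuseppe Tomasi di Lampedusa")
pieces = strsplit(titles, " ", fixed = TRUE)

# One list element per input string, each a character vector of pieces.
class(pieces)     # "list"
length(pieces)    # 3

# The pieces can have different lengths, which a matrix could not hold.
lengths(pieces)   # 2 2 4
pieces[[3]]       # "Giuseppe" "Tomasi" "di" "Lampedusa"
```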
A slightly more realistic example:
I have some fasta files that I’m going to use as input, perform some manipulations on, and then write some output based on each one. I want the output files to have the same prefix but have the extension .txt instead of .fasta:
file.names = c("ighv_human.fasta", "ighd_mouse.fasta", "ighj_human.fasta", "ighv_mouse.fasta", "ighd_human.fasta", "ighj_mouse.fasta")
split.files = strsplit(file.names, ".", fixed = TRUE) ## fixed = TRUE has to do with regular expressions, which we'll talk about on Thursday
output.names = character(length = length(file.names))
for(i in seq_along(split.files)) {
  prefix = split.files[[i]][[1]]
  output.names[i] = paste(prefix, ".txt", sep = "")
}
output.names
## [1] "ighv_human.txt" "ighd_mouse.txt" "ighj_human.txt" "ighv_mouse.txt"
## [5] "ighd_human.txt" "ighj_mouse.txt"
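The same result can be had without an explicit loop. This is an alternative sketch, not the approach used above. It relies on vapply to pull the first piece out of each list element and on paste being vectorized (the name prefixes is an assumption):

```r
file.names = c("ighv_human.fasta", "ighd_mouse.fasta", "ighj_human.fasta")
split.files = strsplit(file.names, ".", fixed = TRUE)

# Take the first piece of every split result; character(1) makes vapply
# check that each result is a single string.
prefixes = vapply(split.files, function(parts) parts[1], character(1))

# paste is vectorized, so ".txt" is recycled across all prefixes.
output.names = paste(prefixes, ".txt", sep = "")
output.names   # "ighv_human.txt" "ighd_mouse.txt" "ighj_human.txt"
```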
A way of representing qualitative variables.
Variable that takes string values, but we know in advance which strings are valid.
Factors are represented differently from strings internally.
Factors are an integer vector plus a set of names corresponding to each of the levels.
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
## [1] "character"
state.name.fac = as.factor(state.name)
typeof(state.name.fac) ## typeof tells us that from R's point of view, state.name.fac is a vector of integers
## [1] "integer"
## [1] "factor"
## $levels
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
##
## $class
## [1] "factor"
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50
## attr(,"levels")
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
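The output above looks like the result of inspecting the factor with typeof, class, attributes and unclass, though not all of those calls are shown. A compact sketch of the same inspection, using the built-in state.name vector:

```r
state.name.fac = as.factor(state.name)

typeof(state.name)                  # "character"
typeof(state.name.fac)              # "integer": the stored data are integer codes
class(state.name.fac)               # "factor"
names(attributes(state.name.fac))   # "levels" "class"

# The integer codes index into the levels.
as.integer(state.name.fac)[1:5]     # 1 2 3 4 5
levels(state.name.fac)[1:3]         # "Alabama" "Alaska" "Arizona"

# Looking up each code in the levels recovers the original strings.
identical(levels(state.name.fac)[as.integer(state.name.fac)], state.name)
```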
This representation is a bit more parsimonious in general (but not always…):
## 3496 bytes
## 4080 bytes
## 403096 bytes
## 203880 bytes
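The calls behind these four sizes are not shown. A plausible reconstruction, assuming the larger objects were made by repeating state.name 1000 times (that count is a guess based on the sizes, and the names many.states and many.states.fac are assumptions):

```r
state.name.fac = as.factor(state.name)

# With only 50 elements, the factor's extra attributes can outweigh the savings.
object.size(state.name)
object.size(state.name.fac)

# With many repeated values, the factor stores one integer code per element
# plus a single copy of the levels, which is typically smaller.
many.states = rep(state.name, 1000)
many.states.fac = as.factor(many.states)
object.size(many.states)
object.size(many.states.fac)
```

Exact byte counts depend on the R version and platform, so the numbers you see may differ from those above.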
Problems we might need to deal with:
Too many levels
Too few levels
Levels with the wrong names
Suppose California falls off into the ocean and Puerto Rico becomes a state.
We need to change our state names to reflect the new state of affairs, and we want to replace California with Puerto Rico.
## Warning in `[<-.factor`(`*tmp*`, cal.index, value = "Puerto Rico"): invalid
## factor level, NA generated
## [1] Alabama Alaska Arizona Arkansas
## [5] <NA> Colorado Connecticut Delaware
## [9] Florida Georgia Hawaii Idaho
## [13] Illinois Indiana Iowa Kansas
## [17] Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi
## [25] Missouri Montana Nebraska Nevada
## [29] New Hampshire New Jersey New Mexico New York
## [33] North Carolina North Dakota Ohio Oklahoma
## [37] Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee Texas Utah
## [45] Vermont Virginia Washington West Virginia
## [49] Wisconsin Wyoming
## 50 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
Why does this happen?
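A sketch that reproduces the problem and points at the cause. It assumes cal.index was computed as the position of "California" in state.name (the original definition is not shown):

```r
state.name.fac = factor(state.name)
cal.index = which(state.name == "California")   # assumed definition

# A factor can only hold values that are already among its levels.
"Puerto Rico" %in% levels(state.name.fac)       # FALSE

# So this assignment cannot be stored: R warns and puts NA there instead.
state.name.fac[cal.index] = "Puerto Rico"
is.na(state.name.fac[cal.index])                # TRUE
nlevels(state.name.fac)                         # still 50
```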
Let’s try again:
state.name.fac = factor(state.name)
levels(state.name.fac) = c(levels(state.name.fac), "Puerto Rico")
state.name.fac[cal.index] = "Puerto Rico"
state.name.fac
## [1] Alabama Alaska Arizona Arkansas
## [5] Puerto Rico Colorado Connecticut Delaware
## [9] Florida Georgia Hawaii Idaho
## [13] Illinois Indiana Iowa Kansas
## [17] Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi
## [25] Missouri Montana Nebraska Nevada
## [29] New Hampshire New Jersey New Mexico New York
## [33] North Carolina North Dakota Ohio Oklahoma
## [37] Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee Texas Utah
## [45] Vermont Virginia Washington West Virginia
## [49] Wisconsin Wyoming
## 51 Levels: Alabama Alaska Arizona Arkansas California ... Puerto Rico
Then Puerto Rico decides to change its name to The People’s Republic of Puerto Rico, or The PR of PR, for its pleasing symmetry.
We need to modify our state names again. How can we do it?
pr.index = which(levels(state.name.fac) == "Puerto Rico")
levels(state.name.fac)[pr.index] = "The PR of PR"
state.name.fac
## [1] Alabama Alaska Arizona Arkansas
## [5] The PR of PR Colorado Connecticut Delaware
## [9] Florida Georgia Hawaii Idaho
## [13] Illinois Indiana Iowa Kansas
## [17] Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi
## [25] Missouri Montana Nebraska Nevada
## [29] New Hampshire New Jersey New Mexico New York
## [33] North Carolina North Dakota Ohio Oklahoma
## [37] Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee Texas Utah
## [45] Vermont Virginia Washington West Virginia
## [49] Wisconsin Wyoming
## 51 Levels: Alabama Alaska Arizona Arkansas California ... The PR of PR
We now have this extra level hanging around, and we would like to get rid of it.
What can we do?
cal.index = which(levels(state.name.fac) == "California")
levels(state.name.fac) = levels(state.name.fac)[-cal.index]
## Error in `levels<-.factor`(`*tmp*`, value = c("Alabama", "Alaska", "Arizona", : number of levels differs
This is good behavior! It prevents us from messing up our factor variables by mistake.
How do we actually do this?
## [1] Alabama Alaska Arizona Arkansas
## [5] The PR of PR Colorado Connecticut Delaware
## [9] Florida Georgia Hawaii Idaho
## [13] Illinois Indiana Iowa Kansas
## [17] Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi
## [25] Missouri Montana Nebraska Nevada
## [29] New Hampshire New Jersey New Mexico New York
## [33] North Carolina North Dakota Ohio Oklahoma
## [37] Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee Texas Utah
## [45] Vermont Virginia Washington West Virginia
## [49] Wisconsin Wyoming
## 50 Levels: Alabama Alaska Arizona Arkansas Colorado ... The PR of PR
## [1] Alabama Alaska Arizona Arkansas
## [5] The PR of PR Colorado Connecticut Delaware
## [9] Florida Georgia Hawaii Idaho
## [13] Illinois Indiana Iowa Kansas
## [17] Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi
## [25] Missouri Montana Nebraska Nevada
## [29] New Hampshire New Jersey New Mexico New York
## [33] North Carolina North Dakota Ohio Oklahoma
## [37] Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee Texas Utah
## [45] Vermont Virginia Washington West Virginia
## [49] Wisconsin Wyoming
## 50 Levels: Alabama Alaska Arizona Arkansas Colorado ... The PR of PR
## [1] Alabama Alaska Arizona Arkansas
## [5] The PR of PR Colorado Connecticut Delaware
## [9] Florida Georgia Hawaii Idaho
## [13] Illinois Indiana Iowa Kansas
## [17] Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi
## [25] Missouri Montana Nebraska Nevada
## [29] New Hampshire New Jersey New Mexico New York
## [33] North Carolina North Dakota Ohio Oklahoma
## [37] Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee Texas Utah
## [45] Vermont Virginia Washington West Virginia
## [49] Wisconsin Wyoming
## 50 Levels: Alabama Alaska Arizona Arkansas Colorado ... The PR of PR
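The code for the three printouts above is not shown. Three standard ways of dropping an unused level are sketched here. Whether these are the ones the original used is an assumption, and the names dropped.1 to dropped.3 are made up for illustration:

```r
# Rebuild the factor as it stands at this point in the notes.
state.name.fac = factor(state.name)
levels(state.name.fac) = c(levels(state.name.fac), "Puerto Rico")
state.name.fac[state.name == "California"] = "Puerto Rico"
levels(state.name.fac)[levels(state.name.fac) == "Puerto Rico"] = "The PR of PR"
nlevels(state.name.fac)    # 51: "California" is still a level but never used

dropped.1 = droplevels(state.name.fac)      # drop all unused levels
dropped.2 = factor(state.name.fac)          # re-create the factor from itself
dropped.3 = state.name.fac[, drop = TRUE]   # subset with drop = TRUE

nlevels(dropped.1)                          # 50
"California" %in% levels(dropped.1)         # FALSE
```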
No detail here, but for reference: ordered factors exist and are created with the ordered = TRUE flag. Some functions will have nicer default behavior if you tell them that ordinal variables are ordinal.
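A minimal illustration, using made-up size data (the values and the name sizes are not from the original):

```r
# Hypothetical example data; levels are listed from smallest to largest.
sizes = factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

is.ordered(sizes)   # TRUE

# Comparisons follow the level order, not alphabetical order.
sizes < "large"     # TRUE FALSE TRUE TRUE
max(sizes)          # large
```

For an unordered factor, comparisons like < are not meaningful and max gives an error.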
Why do we need data frames?
A common setup is that we have some samples, and some observations on those samples. The observations are of different types: some are numbers, and some are qualitative variables.
We can’t use a matrix because matrices require all of the elements to be of the same type. Note that this isn’t just an arbitrary rule about matrices: if we think about it from the computer’s point of view, the amount of space required for a matrix containing all the same type of observations is fixed, but if the type of the observation can change, we don’t know how much space to allocate.
We need a data structure that mostly acts like a matrix (rectangular, we can extract subsets easily, and so on), but that allows for different data types in different columns.
Data frames in R are lists with some extra rules:
Each element of the list needs to be a vector
Each of the vectors needs to be of the same length
This gives essentially a heterogeneous analog of a matrix.
kids = c("Jack", "Jill")
ages = c(12, 10)
df = data.frame(kids, ages)
df = data.frame(Kids = kids, Ages = ages)
typeof(df)
## [1] "list"
## [1] "data.frame"
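The two rules can be checked directly. In this sketch the tryCatch wrapper and the name result are additions for illustration:

```r
kids = c("Jack", "Jill")
ages = c(12, 10)
df = data.frame(Kids = kids, Ages = ages)

is.list(df)    # TRUE: a data frame is a list underneath
class(df)      # "data.frame"
length(df)     # 2: one list element per column
names(df)      # "Kids" "Ages"

# Columns of incompatible lengths violate the second rule and give an error.
result = tryCatch(data.frame(Kids = kids, Ages = c(12, 10, 8)),
                  error = function(e) "error")
result         # "error"
```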
Remember that data frames are lists, with each element of the list corresponding to one column.
This means we can use the [[]] notation to get a column:
## [1] Jack Jill
## Levels: Jack Jill
We can also refer to portions of a data frame the same way we do for matrices, with the [] notation:
## [1] Jack Jill
## Levels: Jack Jill
## Kids
## 1 Jack
## 2 Jill
## [1] 12
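The indexing calls behind the output above are not shown. One set of calls that would give the same results is sketched here. Note that stringsAsFactors = TRUE is passed explicitly: since R 4.0.0 the default is FALSE, in which case Kids would print as a character vector without the Levels line.

```r
kids = c("Jack", "Jill")
ages = c(12, 10)
df = data.frame(Kids = kids, Ages = ages, stringsAsFactors = TRUE)

# List-style access: one column, returned as a vector (here a factor).
df[[1]]
df$Kids

# Matrix-style access with [row, column].
df[, 1]    # first column, again as a factor
df[1, 2]   # row 1, column 2: 12

# Single brackets with one index keep the data frame structure.
df[1]      # a one-column data frame
```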