Course goals: Be able to write clean, accurate, maintainable code. Learn how the algorithms you will use as a statistician work so that you can diagnose problems and extend or tweak them to fit your needs.
First half of the course focuses on software engineering, some specifics of R, and how to write good code.
Second half of the course focuses on algorithms.
Course will be primarily in R. One homework will be in Python, and you'll have the option of doing the final homework in Python as well if you would like more practice.
Course website: jfukuyama.github.io/teaching/stat710 will have slides, assignments, and any additional readings.
Homework is submitted through Canvas.
Assessment:
40% homework. Homeworks will generally be posted on a Sunday and due on Tuesday of the following week, 9 days later.
30% midterm. In class, written.
30% final exam. Like the midterm, at the scheduled final exam time.
Main text is Matloff, The Art of R Programming. The R Cookbook, by Paul Teetor, will also be useful.
How to read the book:
Matloff has example code. You should have an R session open, actually type in the commands, and make sure that you get the same results.
Don't worry about the analogies to C (unless you have a lot of experience in C and you find that helpful!)
You can generally skip the extended examples.
Reading: Matloff Chapters 1, 2, 3.1, 3.2, 3.4, 3.5, 3.6, 3.7, 4.1, 4.2, 4.3, 5.1, 5.2, 6.1
Chapter 1: Don't worry about understanding everything in detail, but it gives a good overview and will help you have a slot for everything later.
Also, you should probably not use his advice on getting help; the real way is to google your problem + R and click on the first Stack Overflow link.
It's almost never a good idea to ask for help on the mailing list.
Goal today: Learn about the ways R stores data and how to access and manipulate data in those structures.
Why is this important?
R is object-oriented and functional (we'll discuss this more later in the course).
This sometimes makes it seem like magic: the plot function does completely different things in different contexts. How does R know?
Data types for today:
Vectors
Matrices
Lists
Vectors are the fundamental data type in R; scalars are actually just vectors of length 1.
The easiest way to make a vector is with the c function (for "concatenate"):
x = c(1, 168, .3)
x
## [1] 1.0 168.0 0.3
The : operator will also make a vector of numbers separated by 1:
y = 5:10
y
## [1] 5 6 7 8 9 10
What do you think y = 5.1:10 will do?
seq makes a vector of evenly spaced numbers with whatever spacing you specify:
seq(from = 12, to = 30, by = 3)
## [1] 12 15 18 21 24 27 30
You can specify the length you want instead of the spacing:
seq(from = 1.1, to = 2, length.out = 10)
## [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
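Two details of seq are easy to miss, so here is a small sketch (the variable names s1 and s2 are just for illustration). When by does not divide the range evenly, the sequence stops at the last value that does not go past to. Also, length.out is the full name of the argument that controls the length.

```r
# by = 4 does not land exactly on 10, so the sequence stops at 9
s1 = seq(from = 1, to = 10, by = 4)
s1
## [1] 1 5 9

# length.out is the full name of seq's length argument
s2 = seq(from = 0, to = 1, length.out = 5)
s2
## [1] 0.00 0.25 0.50 0.75 1.00
```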
rep will give a vector of repeated elements:
rep(1, times = 5)
## [1] 1 1 1 1 1
You can repeat vectors instead of single numbers:
rep(c(1, 5, 7), times = 3)
## [1] 1 5 7 1 5 7 1 5 7
Specifying each instead of times changes how the repeats work: the first element in the vector is repeated each times, then the second, then the third:
rep(c(1, 5, 7), each = 3)
## [1] 1 1 1 5 5 5 7 7 7
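As a further sketch of how the two arguments behave (r1 and r2 are illustrative names), times can itself be a vector with one repeat count per element, and each and times can be combined:

```r
# times can be a vector: one repeat count per element
r1 = rep(c(1, 5, 7), times = c(3, 1, 2))
r1
## [1] 1 1 1 5 7 7

# when both are given, each is applied first, then times
r2 = rep(c(1, 5), each = 2, times = 2)
r2
## [1] 1 1 5 5 1 1 5 5
```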
Vectors can be added, subtracted, multiplied, etc.
If the vectors are the same length, this will be done element-wise.
x = c(1, 2, 4)
x + c(5, 0, -1)
## [1] 6 2 3
x * c(5, 0, -1)
## [1] 5 0 -4
x / c(5, 4, -1)
## [1] 0.2 0.5 -4.0
What if the vectors are not the same length?
Remember that scalars in R are just vectors, so in the following code we are actually adding two vectors:
x = c(1, 2, 4)
x + 2
## [1] 3 4 6
This is weird and important!
If a vector operation requires the vectors to be the same length, R automatically repeats the shorter one until it is long enough to match the longer one.
That is what happened in the example above: the longer vector, x, had length 3, and the shorter vector, 2, had length 1. What really happened was more like
x = c(1, 2, 4)
x + rep(2, times = 3)
## [1] 3 4 6
x + 2 # notice that these give the same results
## [1] 3 4 6
This makes sense for adding a vector and a scalar (= vector of length 1).
It can give results you're not expecting if you have vectors of different lengths:
c(1, 2, 4) + c(6, 0, 9, 20, 22)
## Warning in c(1, 2, 4) + c(6, 0, 9, 20, 22): longer object length is not a
## multiple of shorter object length
## [1] 7 2 13 21 24
## same as
c(1, 2, 4, 1, 2) + c(6, 0, 9, 20, 22)
## [1] 7 2 13 21 24
If the longer vector's length is not a multiple of the shorter vector's length, R will warn you about the recycling. Otherwise, it recycles without complaint:
c(2, 3, 7, 8) + c(0, 1)
## [1] 2 4 7 9
## same as
c(2, 3, 7, 8) + rep(c(0, 1), times = 2)
## [1] 2 4 7 9
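If silent recycling worries you, one option is to check the lengths yourself. The following is a minimal sketch; add_strict is a hypothetical helper written for this example, not something in base R. It allows equal lengths or a length-1 operand and refuses everything else.

```r
# add_strict is a hypothetical helper, not part of base R
add_strict = function(a, b) {
    # allow equal lengths, or a length-1 operand (the scalar case)
    ok = length(a) == length(b) || length(a) == 1 || length(b) == 1
    if (!ok) {
        stop("lengths differ: ", length(a), " vs ", length(b))
    }
    a + b
}

add_strict(c(2, 3, 7, 8), 1)
## [1] 3 4 8 9

# + would recycle c(0, 1) silently here, but add_strict refuses
result = tryCatch(add_strict(c(2, 3, 7, 8), c(0, 1)),
                  error = function(e) conditionMessage(e))
result
## [1] "lengths differ: 4 vs 2"
```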
The comparison operators >, <, <=, >=, and == are all vector operations and work like the arithmetic operators:
c(2, 5, 7) > c(1, 3, 8)
## [1] TRUE TRUE FALSE
c(2, 5, 7) < 8
## [1] TRUE TRUE TRUE
all and any work on boolean vectors (vectors of TRUE and FALSE), and do what is implied by their names:
any(c(TRUE, FALSE))
## [1] TRUE
all(c(TRUE, FALSE))
## [1] FALSE
x = 1:10
any(x > 8)
## [1] TRUE
any(x > 88)
## [1] FALSE
all(x > 88)
## [1] FALSE
all(x > 0)
## [1] TRUE
We can't use == to test whether vectors are the same because == will give us a boolean vector.
x = 1:3
y = c(1, 3, 4)
x == y
## [1] TRUE FALSE FALSE
Two ways around this are all and identical:
all(x == y)
## [1] FALSE
identical(x, y)
## [1] FALSE
You need to be a bit careful with identical though:
identical(1:3, c(1, 2, 3))
## [1] FALSE
## why?
typeof(1:3)
## [1] "integer"
typeof(c(1, 2, 3))
## [1] "double"
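If the integer/double distinction is not what you care about, two possible workarounds are sketched below: convert to a common type first, or use all.equal, which compares numeric values up to a small tolerance. The same tolerance also helps with floating-point round-off.

```r
# convert to a common type before comparing
identical(as.numeric(1:3), c(1, 2, 3))
## [1] TRUE

# all.equal ignores the integer/double distinction; wrap it in isTRUE,
# because on a mismatch it returns a description of the difference
# rather than FALSE
isTRUE(all.equal(1:3, c(1, 2, 3)))
## [1] TRUE

# the tolerance matters for floating-point arithmetic
0.1 + 0.2 == 0.3
## [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```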
To get a subvector, we can use [].
The format is vector1[vector2], where vector2 gives the indices we want to pick out of vector1.
y = c(1.2, 3.9, 0.4, 0.12)
y[c(1,3)] # extract elements 1 and 3 of y
## [1] 1.2 0.4
y[2:3]
## [1] 3.9 0.4
v = 3:4
y[v]
## [1] 0.40 0.12
We are allowed to repeat indices:
y
## [1] 1.20 3.90 0.40 0.12
y[rep(c(1, 3), each = 3)]
## [1] 1.2 1.2 1.2 0.4 0.4 0.4
To exclude indices instead of including them, we can use -:
z = c(5, 12, 13)
z[-1] # exclude element 1
## [1] 12 13
z[-1:-2] # exclude elements 1 through 2
## [1] 13
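A few edge cases of indexing are worth trying for yourself. The sketch below (the name mixed is illustrative) shows that an index past the end gives NA, that positive and negative indices cannot be mixed, and a common slip with the : operator.

```r
z = c(5, 12, 13)

# an index past the end gives NA rather than an error
z[5]
## [1] NA

# positive and negative indices cannot be mixed in one call
mixed = tryCatch(z[c(-1, 2)], error = function(e) "error")
mixed
## [1] "error"

# -1:2 means (-1):2, i.e. c(-1, 0, 1, 2), so it has the same problem;
# write -(1:2) to exclude elements 1 through 2
z[-(1:2)]
## [1] 13
```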
We will often want a sub-vector, but we won't know in advance which elements we want.
As a contrived example, suppose we want to extract just the elements of the vector whose square is greater than 8.
We could use the following code to do so:
z = c(5, 2, -3, 8)
w = z[z * z > 8]
w
## [1] 5 -3 8
What's happening here?
z
## [1] 5 2 -3 8
z * z
## [1] 25 4 9 64
z * z > 8
## [1] TRUE FALSE TRUE TRUE
z[c(TRUE, FALSE, TRUE, TRUE)]
## [1] 5 -3 8
z[z * z > 8]
## [1] 5 -3 8
Another simple way to extract only some of the values of a vector is with subset:
z = c(5, 2, -3, 8)
z[z * z > 8]
## [1] 5 -3 8
subset(z, z * z > 8) ## gives the same result
## [1] 5 -3 8
We can also use which. which just gives us the positions at which the condition is TRUE, and we can use those positions to get the relevant subvector:
z = c(5, 2, -3, 8)
which(z * z > 8)
## [1] 1 3 4
z[which(z * z > 8)]
## [1] 5 -3 8
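The three approaches agree on the example above, but they differ when the vector contains missing values. A small sketch with an NA added to the data:

```r
z = c(5, NA, -3, 8)

z > 0
## [1] TRUE NA FALSE TRUE

# an NA in a logical index produces an NA in the result
z[z > 0]
## [1] 5 NA 8

# which and subset both drop positions where the condition is NA
z[which(z > 0)]
## [1] 5 8
subset(z, z > 0)
## [1] 5 8
```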
Matrices in R turn out to just be vectors with an additional attribute giving the dimensions.
The most basic way of creating a matrix is with the matrix function.
It takes a vector giving the values that should go in the matrix, plus the number of rows and the number of columns.
By default, the values are filled in column-major order, but you can specify that the data should be read in row-major order instead:
y = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
y
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
y = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
y
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
Note that byrow = TRUE doesn't change the way the matrix is stored behind the scenes: it only changes the order in which the input values are filled in.
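One way to check this for yourself, as a small sketch (m is an illustrative name): as.vector strips the dimensions and returns the values in storage order, which is column by column even when the matrix was filled by row.

```r
m = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

# storage order is column 1 (1, 3) followed by column 2 (2, 4)
as.vector(m)
## [1] 1 3 2 4

# a single index into a matrix follows that same column-major order
m[2]
## [1] 3
```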
You can specify just a number of rows or just a number of columns, and R will figure out what the dimensions of the matrix should be for you:
y = matrix(c(1, 2, 3, 4), nrow = 2)
y
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
What we learned about vector recycling applies to matrices too:
y = matrix(1:2, nrow = 2, ncol = 5)
y
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 2 2 2 2 2
y = matrix(1:3, nrow = 2, ncol = 5)
## Warning in matrix(1:3, nrow = 2, ncol = 5): data length [3] is not a sub-
## multiple or multiple of the number of rows [2]
y
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 2 1 3
## [2,] 2 1 3 2 1
What happens with matrix(1:3, nrow = 2)?
Matrix operations work pretty much the same way vector operations do.
With matrices we get matrix multiplication, %*%, in addition to the operations available for vectors.
y = matrix(c(1, 2, 3, 4), nrow = 2)
y
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
y %*% y
## [,1] [,2]
## [1,] 7 15
## [2,] 10 22
y * 3
## [,1] [,2]
## [1,] 3 9
## [2,] 6 12
y + y
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
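Two things are easy to mix up here, so a short sketch (the names elementwise and shifted are illustrative): * is element-wise and is not matrix multiplication, and a shorter vector is recycled in storage order, which means down the columns.

```r
y = matrix(c(1, 2, 3, 4), nrow = 2)

# * is element-wise, unlike %*%
elementwise = y * y
elementwise
## [,1] [,2]
## [1,] 1 9
## [2,] 4 16

# c(10, 20) is recycled down the columns,
# so 10 is added to row 1 and 20 to row 2
shifted = y + c(10, 20)
shifted
## [,1] [,2]
## [1,] 11 13
## [2,] 22 24
```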
Getting sub-matrices is analogous to getting sub-vectors.
We still use [], but now we have two indices.
The syntax is matrix[rowIndices, colIndices], where rowIndices is a vector describing the rows you want to take and colIndices is a vector describing the columns you want to take.
Leaving one empty means that you want to take all the available indices.
Let's see how this works:
z = matrix(sample(1:12), nrow = 4) ## a random permutation of 1:12, so your values will differ
z
## [,1] [,2] [,3]
## [1,] 8 3 2
## [2,] 5 10 7
## [3,] 12 1 11
## [4,] 6 9 4
z[,2:3] ## extract the 2nd and 3rd columns
## [,1] [,2]
## [1,] 3 2
## [2,] 10 7
## [3,] 1 11
## [4,] 9 4
z[2:3, 2] ## extract the 2nd and 3rd rows of the 2nd column
## [1] 10 1
z[,-1] ## negative subscripts work the same way as with vectors
## [,1] [,2]
## [1,] 3 2
## [2,] 10 7
## [3,] 1 11
## [4,] 9 4
We can assign values to submatrices using this indexing as well:
y = matrix(1:6, nrow = 3)
y
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
y[c(1,3),] = matrix(c(1, 1, 8, 12), nrow = 2)
y
## [,1] [,2]
## [1,] 1 8
## [2,] 2 5
## [3,] 1 12
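Assignment also works with a logical index, and the value on the right-hand side is recycled as needed. A minimal sketch that zeroes out every entry greater than 4:

```r
y = matrix(1:6, nrow = 3)

# y > 4 is a logical matrix of the same shape as y;
# the single value 0 is recycled to fill every selected position
y[y > 4] = 0
y
## [,1] [,2]
## [1,] 1 4
## [2,] 2 0
## [3,] 3 0
```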
As with vectors, we can filter to submatrices by generating indices that we want to keep.
As a simple example, suppose we want just the rows of the matrix for which the second column is at least 3.
We could do that as follows:
x = matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
x
## [,1] [,2]
## [1,] 1 2
## [2,] 2 3
## [3,] 3 4
x[x[,2] >= 3, ]
## [,1] [,2]
## [1,] 2 3
## [2,] 3 4
What is happening?
x
## [,1] [,2]
## [1,] 1 2
## [2,] 2 3
## [3,] 3 4
x[,2]
## [1] 2 3 4
x[,2] >= 3
## [1] FALSE TRUE TRUE
x[c(FALSE, TRUE, TRUE), ]
## [,1] [,2]
## [1,] 2 3
## [2,] 3 4
x[x[,2] >= 3, ]
## [,1] [,2]
## [1,] 2 3
## [2,] 3 4
Notice what happens if we try to make a sub-matrix corresponding to just one row of x:
x
## [,1] [,2]
## [1,] 1 2
## [2,] 2 3
## [3,] 3 4
r = x[2,]
r
## [1] 2 3
What happened?
attributes(x)
## $dim
## [1] 3 2
attributes(r)
## NULL
str(x)
## num [1:3, 1:2] 1 2 3 2 3 4
str(r)
## num [1:2] 2 3
r is a vector!
drop = FALSE lets us avoid this behavior.
r = x[2,,drop = FALSE]
r
## [,1] [,2]
## [1,] 2 3
attributes(r)
## $dim
## [1] 1 2
str(r)
## num [1, 1:2] 2 3
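This matters most when filtering, because you usually don't know in advance how many rows will match. A sketch of the problem, using illustrative names one_row and one_row_matrix:

```r
x = matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)

# only row 3 has a second column of at least 4,
# so the result is dropped to a vector and has no rows
one_row = x[x[, 2] >= 4, ]
nrow(one_row)
## NULL

# with drop = FALSE the result is still a 1 x 2 matrix
one_row_matrix = x[x[, 2] >= 4, , drop = FALSE]
nrow(one_row_matrix)
## [1] 1
```

Code that later calls nrow or indexes with [i, j] will only work on the second version.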
Lists are technically vectors.
The vectors from before were called atomic vectors, which means that their components couldn't be broken down into smaller components.
In R, the main purpose of lists is to lump together data of different types. Atomic vectors require all their elements to be of the same type, but lists can have any sort of elements (including lists!).
You can create a list with the list function:
j = list(name = "Joe", salary = 55000, union = TRUE)
j
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
The component names (tags) are not necessary, but it is good practice to use them.
list("Joe", 55000, TRUE)
## [[1]]
## [1] "Joe"
##
## [[2]]
## [1] 55000
##
## [[3]]
## [1] TRUE
Suppose we want to get Joe's salary. There are at least three different ways to do so:
j$salary
## [1] 55000
j[["salary"]]
## [1] 55000
j[[2]]
## [1] 55000
The double brackets [[]] allow us access to an element of the list.
Note that if we didn't use the tags, we would only be able to access the salary using j[[2]].
Very important: [[]] and [] are different operations:
[[]] gives an element of the list
[] gives a sublist
j[[2]]
## [1] 55000
class(j[[2]])
## [1] "numeric"
j[2]
## $salary
## [1] 55000
class(j[2])
## [1] "list"
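A sketch of why the difference matters in practice (doubled and pair are illustrative names): arithmetic works on the element but not on the sublist, and only [] can pick out several components at once.

```r
j = list(name = "Joe", salary = 55000, union = TRUE)

# [[ ]] returns the number itself, so arithmetic works
j[["salary"]] * 2
## [1] 110000

# [ ] returns a list, and arithmetic on a list is an error
doubled = tryCatch(j["salary"] * 2, error = function(e) "error")
doubled
## [1] "error"

# [ ] can pick out several components at once; the result is a list
pair = j[c("name", "union")]
class(pair)
## [1] "list"
length(pair)
## [1] 2
```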
We can add to a list by tag:
j$hobby = "sailing"
j
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
We can add to a list by index:
j[[5]] = 1:5
j
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
##
## [[5]]
## [1] 1 2 3 4 5
We can remove an element from a list by setting it to NULL:
j[[5]] = NULL
j
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
j$hobby = NULL
j
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
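Two consequences of this are sketched below on a separate list k (an illustrative name) so that j is left alone. Removing a component shifts the positions of the ones after it, and because assigning NULL means "remove", storing an actual NULL as a component requires assigning a list containing NULL with single brackets.

```r
k = list(a = 1, b = 2, c = 3)

# removing b shifts c from position 3 to position 2
k[[2]] = NULL
k[[2]]
## [1] 3

# to keep a component whose value is NULL, assign list(NULL) with [ ]
k["d"] = list(NULL)
length(k)
## [1] 3
is.null(k$d)
## [1] TRUE
```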
Getting the names of a list's components is easy: just use the names function:
names(j)
## [1] "name" "salary" "union"
To get values, we use unlist.
This is another function you have to be careful with, because it doesn't necessarily do what you think it will.
It returns an atomic vector.
That means all the elements have to be of the same type, and so data often gets coerced to a different type.
If we unlist j, we get back a character vector. This is essentially because R knows how to convert numbers and booleans to characters but doesn't, in general, know how to convert characters to numbers or booleans.
unlist(j)
## name salary union
## "Joe" "55000" "TRUE"
You can see the text for the coercion hierarchy.
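For the most common types, the order is logical, then integer, then double, then character: combining two types gives the one further along. A quick sketch you can run to see this with c, and to see that unlist follows the same rule:

```r
# logical < integer < double < character
typeof(c(TRUE, 2L))
## [1] "integer"
typeof(c(2L, 2.5))
## [1] "double"
typeof(c(2.5, "a"))
## [1] "character"

# unlist coerces the same way: TRUE becomes the integer 1
unlist(list(TRUE, 2L))
## [1] 1 2
```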
Note that unlist gives an atomic vector even for recursive lists (lists of lists), for example:
complex_list = list(a = list(1), b = 1:5, c = list(a = 1, b = 2))
unlist(complex_list)
## a b1 b2 b3 b4 b5 c.a c.b
## 1 1 2 3 4 5 1 2
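If you only want the values and not the constructed names, unlist takes a use.names argument. A short sketch (flat is an illustrative name), which also shows the coercion at work:

```r
complex_list = list(a = list(1), b = 1:5, c = list(a = 1, b = 2))

# use.names = FALSE skips building the combined names
flat = unlist(complex_list, use.names = FALSE)
flat
## [1] 1 1 2 3 4 5 1 2

# the integers in b were coerced to double to match the other components
typeof(flat)
## [1] "double"
```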