Julia Fukuyama
Course goals: Be able to write clean, accurate, maintainable code. Learn how the algorithms you will use as a statistician work so that you can diagnose problems and extend or tweak them to fit your needs.
First half of the course focuses on software engineering, some specifics of R, and how to write good code.
Second half of the course focuses on algorithms.
The course website, jfukuyama.github.io/teaching/stat610, will have slides, assignments, and any additional readings.
Homework is submitted through Canvas.
Lab:
A couple of times over the course of the semester there will be a lab.
Each lab will be a script that you can work through, with longer or more complicated examples of things we covered in class. The scripts will be posted so you can go through them at home if you prefer.
Tentative dates on the website.
We have space reserved Thursdays 4-5:30 and Fridays 11:15-12:45, and we will use whichever one is better for your collective schedules.
Assessment:
40% homework. Homework will generally be posted on a Sunday and due on the Tuesday of the following week, 9 days later.
30% midterm. In class Tuesday October 15, written.
30% final project. Due on the last day of class.
Main text for the first half of the class is Matloff, The Art of R Programming. The R Cookbook, by Paul Teetor, will also be useful.
Main text for the second half of the class is Lange, Numerical Analysis for Statisticians.
How to read the Matloff:
Matloff has example code. You should have an R session open, actually type in the commands, and make sure that you get the same results.
Don’t worry about the analogies to C (unless you have a lot of experience in C and you find that helpful!)
You can generally skip the extended examples.
Reading: Matloff Chapters 1, 2, 3.1, 3.2, 3.4, 3.5, 3.6, 3.7, 4.1, 4.2, 4.3, 5.1, 5.2, 6.1
Chapter 1: Don’t worry about understanding everything in detail, but it gives a good overview and will help you have a slot for everything later.
Also, you should probably not use his advice on getting help. The real way is to google your problem + R and click on the first Stack Overflow link.
It’s almost never a good idea to ask for help on the mailing list.
Goal today: Learn about the ways R stores data and how to access and manipulate data in those structures.
Why is this important?
R is object-oriented and functional (we’ll discuss this more later in the course).
This sometimes makes it seem like magic: the plot function does completely different things in different contexts. How does R know?
Data types for today:
Vectors
Matrices
Lists
Vectors are the fundamental data type in R; scalars are actually just vectors of length 1.
The : operator will also make a vector of numbers separated by 1:
## [1] 5 6 7 8 9 10
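For instance, the output above is what R prints for the following. This is a minimal sketch; the variable name x is an assumption.

```r
# Build the integers from 5 to 10 with the : operator
x = 5:10
x
```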
What do you think y = 5.1:10 will do?
seq makes a vector of numbers with arbitrary spacing:
## [1] 12 15 18 21 24 27 30
You can specify the length you want instead of the spacing:
## [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
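The two seq outputs above can be reproduced as follows. This is a sketch: the argument values are inferred from the printed results, and the names s1 and s2 are assumptions.

```r
# Spacing given explicitly with by
s1 = seq(from = 12, to = 30, by = 3)
s1
# Number of elements given with length.out; seq works out the spacing
s2 = seq(from = 1.1, to = 2, length.out = 10)
s2
```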
You can repeat vectors instead of single numbers:
## [1] 1 5 7 1 5 7 1 5 7
Specifying each instead of times changes how the repeats work: the first element in the vector is repeated each times, then the second, then the third:
## [1] 1 1 1 5 5 5 7 7 7
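A short sketch contrasting the two arguments; the values are inferred from the outputs above and the name v is an assumption.

```r
v = c(1, 5, 7)
r_times = rep(v, times = 3)   # whole vector repeated three times
r_times
r_each = rep(v, each = 3)     # each element repeated three times in turn
r_each
```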
Vectors can be added, subtracted, multiplied, etc.
If the vectors are the same size, this will be done element-wise.
## [1] 6 2 3
## [1] 5 0 -4
## [1] 0.2 0.5 -4.0
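As a hedged illustration of element-wise arithmetic, the vectors below are hypothetical. They are chosen so that the sum and product agree with the first two outputs above, but they are not necessarily the original commands.

```r
x = c(1, 2, 4)
y = c(5, 0, -1)
x + y   # element-wise sum:        6 2 3
x * y   # element-wise product:    5 0 -4
x - y   # element-wise difference: -4 2 5
```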
What if the vectors are not the same size?
Remember that scalars in R are just vectors, so in the following code we are actually adding two vectors:
## [1] 3 4 6
This is weird and important!
If a vector operation requires the vectors to be the same length, R automatically repeats the shorter one until it is long enough to match the longer one.
That is what happened in the example above: the longer vector, x, had length 3, and the shorter vector, 2, had length 1. What really happened was more like
## [1] 3 4 6
## [1] 3 4 6
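A minimal sketch of the equivalence, assuming x = c(1, 2, 4), which is consistent with the printed result:

```r
x = c(1, 2, 4)
short_form = x + 2            # 2 is recycled to length 3
long_form = x + c(2, 2, 2)    # the same sum with the recycling written out
short_form
long_form
```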
This makes sense for adding a vector and a scalar (= vector of length 1).
It can give results you’re not expecting if you have vectors of different lengths:
## Warning in c(1, 2, 4) + c(6, 0, 9, 20, 22): longer object length is not a
## multiple of shorter object length
## [1] 7 2 13 21 24
## [1] 7 2 13 21 24
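The first command below is the one quoted in the warning; the second is a sketch of the recycled vector written out by hand, which gives the same sum without a warning. The names a and b are assumptions.

```r
# Lengths 3 and 5: R recycles the shorter vector and warns
a = c(1, 2, 4) + c(6, 0, 9, 20, 22)
a
# The same sum with the recycling made explicit
b = c(1, 2, 4, 1, 2) + c(6, 0, 9, 20, 22)
b
```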
If the vector lengths are not multiples of one another R will warn you about the recycling. Otherwise, it does it without complaint:
## [1] 2 4 7 9
## [1] 2 4 7 9
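One pair of vectors that gives this result silently is sketched below. The operands are hypothetical, chosen only to match the printed output.

```r
# Length 2 recycles evenly into length 4, so there is no warning
silent = c(1, 2) + c(1, 2, 6, 7)
silent
# Equivalent sum with the recycling written out
explicit = c(1, 2, 1, 2) + c(1, 2, 6, 7)
explicit
```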
>, <, <=, >=, and == are all vector operations and work like the arithmetic operators:
## [1] TRUE TRUE FALSE
## [1] TRUE TRUE TRUE
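A small illustration with a hypothetical vector; the comparisons are chosen to agree with the outputs above but may not be the original commands.

```r
x = c(1, 2, 4)
lt = x < 4     # the scalar 4 is recycled: TRUE TRUE FALSE
lt
ge = x >= 1    # TRUE TRUE TRUE
ge
```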
all and any work on boolean vectors (vectors of TRUE and FALSE) and do what their names imply:
## [1] TRUE
## [1] FALSE
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] TRUE
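The calls below are an illustrative sketch with a hypothetical vector, not the commands behind the outputs above.

```r
x = c(1, 2, 4)
any(x > 3)   # TRUE: at least one element exceeds 3
all(x > 3)   # FALSE: not every element exceeds 3
all(x > 0)   # TRUE: every element is positive
any(x > 5)   # FALSE: no element exceeds 5
```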
We can’t use == to test whether vectors are the same because == will give us a boolean vector.
## [1] TRUE FALSE FALSE
There are two ways around this, the functions all and identical:
## [1] FALSE
## [1] FALSE
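A sketch of the comparison, using hypothetical vectors x and y chosen to agree with the outputs above:

```r
x = c(1, 2, 4)
y = c(1, 3, 5)
x == y            # element-wise: TRUE FALSE FALSE
all(x == y)       # FALSE: collapses the element-wise result to one value
identical(x, y)   # FALSE: compares the two objects as a whole
```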
To get a subvector, we can use [].
The format is vector1[vector2], where vector2 gives the indices we want to pick out of vector1.
## [1] 1.2 0.4
## [1] 3.9 0.4
## [1] 0.40 0.12
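The three outputs above are consistent with the following sketch. The vector values are inferred from the printed results, and the names y and v are assumptions.

```r
y = c(1.2, 3.9, 0.4, 0.12)
y[c(1, 3)]   # first and third elements
y[2:3]       # second and third elements
v = 3:4
y[v]         # the index vector can be stored in a variable
```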
We are allowed to repeat indices:
## [1] 1.20 3.90 0.40 0.12
## [1] 1.2 1.2 1.2 0.4 0.4 0.4
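A sketch that would reproduce the two outputs above, assuming the same vector y as in the earlier indexing outputs:

```r
y = c(1.2, 3.9, 0.4, 0.12)
y
# Indices 1 and 3 each repeated three times
rep_idx = y[c(1, 1, 1, 3, 3, 3)]
rep_idx
```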
To exclude indices instead of including them, we can use -:
## [1] 12 13
## [1] 13
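For example, with a hypothetical vector z whose values are chosen to match the outputs above:

```r
z = c(5, 12, 13)
z[-1]      # everything except the first element
z[-1:-2]   # everything except the first two elements
```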
We will often want a sub-vector, but we won’t know in advance which elements we want.
As a contrived example, suppose we want to extract just the elements of the vector whose square is greater than 8.
We could use the following code to do so:
## [1] 5 -3 8
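A minimal sketch of this kind of filtering. The vector z is an assumption, chosen to be consistent with the output above.

```r
z = c(5, 2, -3, 8)
keep = z^2 > 8   # logical vector: TRUE FALSE TRUE TRUE
big = z[keep]    # keep only the elements where the condition is TRUE
big
```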
Another simple way to extract only some of the values of a vector is with subset:
## [1] 5 -3 8
## [1] 5 -3 8
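A sketch comparing bracket filtering with subset, again assuming a hypothetical z = c(5, 2, -3, 8):

```r
z = c(5, 2, -3, 8)
by_bracket = z[z^2 > 8]
by_subset = subset(z, z^2 > 8)   # same condition passed to subset
by_bracket
by_subset
```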
We can also use which. which just gives us the positions at which the condition occurs, and we can use those positions to get the relevant subvector:
## [1] 1 3 4
## [1] 5 -3 8
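The two outputs above correspond to something like the following sketch (z and pos are assumed names):

```r
z = c(5, 2, -3, 8)
pos = which(z^2 > 8)   # positions where the condition holds
pos
z[pos]                 # use the positions as indices
```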
Matrices in R turn out to just be vectors with an additional attribute giving the dimensions.
The most basic way of creating a matrix is with the matrix function.
It takes a vector giving the values that should go in the matrix, plus the number of rows and the number of columns.
By default, the values are arranged in column-major order, but you can specify that the data is supplied in row-major order instead:
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
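A sketch that would produce the two matrices above; the names m_col and m_row are assumptions.

```r
# Default: fill down the columns (column-major)
m_col = matrix(1:4, nrow = 2, ncol = 2)
m_col
# byrow = TRUE: fill across the rows instead
m_row = matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
m_row
```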
Note that byrow = TRUE doesn’t change the way the matrix is stored behind the scenes.
You can specify just a number of rows or just a number of columns, and R will figure out what the dimensions of the matrix should be for you:
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
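For example (a sketch; the names m1 and m2 are assumptions), either dimension alone is enough when the data length divides evenly:

```r
m1 = matrix(1:4, nrow = 2)   # R infers ncol = 2
m1
m2 = matrix(1:4, ncol = 2)   # R infers nrow = 2
```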
What we learned about vector recycling applies to matrices too:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 2 2 2 2 2
## Warning in matrix(1:3, nrow = 2, ncol = 5): data length [3] is not a sub-
## multiple or multiple of the number of rows [2]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 2 1 3
## [2,] 2 1 3 2 1
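The second call below is the one quoted in the warning; the first is inferred from the printed matrix. The names even and uneven are assumptions.

```r
# 2 values recycled to fill 10 cells: no warning
even = matrix(1:2, nrow = 2, ncol = 5)
even
# 3 values do not divide evenly into 2 rows: R recycles and warns
uneven = matrix(1:3, nrow = 2, ncol = 5)
uneven
```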
What happens with matrix(1:3, nrow = 2)?
Matrix operations work pretty much the same way vector operations do.
With matrices we get matrix multiplication, %*%, in addition to the operations available for vectors.
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [,1] [,2]
## [1,] 7 15
## [2,] 10 22
## [,1] [,2]
## [1,] 3 9
## [2,] 6 12
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
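The four outputs above are consistent with the sketch below. The operations are inferred from the printed values and the name x is an assumption.

```r
x = matrix(1:4, nrow = 2)
x
x %*% x   # matrix product
3 * x     # scalar multiplication, element-wise
x + x     # element-wise addition
```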
Getting sub-matrices is analogous to getting sub-vectors.
We still use [], but now we have two indices.
The syntax is matrix[rowIndices, colIndices], where rowIndices is a vector describing the rows you want to take and colIndices is a vector describing the columns you want to take.
Leaving one empty means that you want to take all the available indices.
Let’s see how this works:
## [,1] [,2] [,3]
## [1,] 3 4 5
## [2,] 12 2 11
## [3,] 10 6 9
## [4,] 8 1 7
## [,1] [,2]
## [1,] 4 5
## [2,] 2 11
## [3,] 6 9
## [4,] 1 7
## [1] 2 6
## [,1] [,2]
## [1,] 4 5
## [2,] 2 11
## [3,] 6 9
## [4,] 1 7
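A sketch that would reproduce the four outputs above. The matrix values are read off the first printout; the indexing expressions are inferred, not taken from the original commands.

```r
# Values listed column by column
x = matrix(c(3, 12, 10, 8, 4, 2, 6, 1, 5, 11, 9, 7), nrow = 4)
x
x[, 2:3]    # all rows, columns 2 and 3
x[2:3, 2]   # rows 2 and 3 of column 2
x[, -1]     # all rows, every column except the first
```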
We can assign values to submatrices using this indexing as well:
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2]
## [1,] 1 8
## [2,] 2 5
## [3,] 1 12
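One assignment consistent with the before and after matrices shown above is sketched here; the name y and the replacement values are inferred from the output.

```r
y = matrix(1:6, nrow = 3)
y
# Overwrite rows 1 and 3 with the rows of a 2 x 2 matrix
y[c(1, 3), ] = matrix(c(1, 1, 8, 12), nrow = 2)
y
```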
As with vectors, we can filter to submatrices by generating indices that we want to keep.
As a simple example, suppose we want just the rows of the matrix for which the second column is at least 3.
We could do that as follows:
## [,1] [,2]
## [1,] 1 2
## [2,] 2 3
## [3,] 3 4
## [,1] [,2]
## [1,] 2 3
## [2,] 3 4
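A minimal sketch of the row filter, with the matrix values read off the output above (the names x and rows are assumptions):

```r
x = matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
x
# Logical vector marking the rows whose second column is at least 3
rows = x[, 2] >= 3
x[rows, ]
```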
Notice what happens if we try to make a sub-matrix corresponding to just one row of x:
## [,1] [,2]
## [1,] 1 2
## [2,] 2 3
## [3,] 3 4
## [1] 2 3
What happened?
## $dim
## [1] 3 2
## NULL
## num [1:3, 1:2] 1 2 3 2 3 4
## num [1:2] 2 3
r is a vector!
Passing drop = FALSE lets us avoid this behavior.
## [,1] [,2]
## [1,] 2 3
## $dim
## [1] 1 2
## num [1, 1:2] 2 3
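The whole sequence, from extracting a single row through the drop = FALSE case, can be sketched as follows. The matrix values are taken from the printed output; the name r_mat is an assumption.

```r
x = matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
r = x[2, ]                     # one row: the dim attribute is dropped
attributes(x)                  # $dim 3 2
attributes(r)                  # NULL, so r is a plain vector
str(x)
str(r)
r_mat = x[2, , drop = FALSE]   # keep the result as a 1 x 2 matrix
attributes(r_mat)              # $dim 1 2
str(r_mat)
```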
Lists are technically vectors.
The vectors from before were called atomic vectors, which means that their components couldn’t be broken down into smaller components.
In R, the main purpose of lists is to lump together data of different types. Atomic vectors require all their elements to be of the same type, but lists can have any sort of elements (including lists!).
You can create a list with the list function…
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
The component names (tags) are not necessary, but it is good practice to use them.
## [[1]]
## [1] "Joe"
##
## [[2]]
## [1] 55000
##
## [[3]]
## [1] TRUE
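The two printouts above correspond to a tagged and an untagged list. A sketch, with j matching the name used later in the text and jalt as an assumed name:

```r
# With component names (tags)
j = list(name = "Joe", salary = 55000, union = TRUE)
j
# Without tags: components are only reachable by position
jalt = list("Joe", 55000, TRUE)
jalt
```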
Suppose we want to get Joe’s salary. There are at least three different ways to do so:
## [1] 55000
## [1] 55000
## [1] 55000
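Three ways that each give this result, sketched under the assumption that j is the tagged list printed earlier:

```r
j = list(name = "Joe", salary = 55000, union = TRUE)
j$salary        # by tag with $
j[["salary"]]   # by tag with double brackets
j[[2]]          # by position with double brackets
```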
The double brackets [[]] give us access to an element of the list.
Note that if we didn’t use the tags, we would only be able to access the salary using j[[2]].
We can add to a list by tag:
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
We can add to a list by index:
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
##
## [[5]]
## [1] 1 2 3 4 5
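Both additions can be sketched as follows; the assigned values are read off the outputs above.

```r
j = list(name = "Joe", salary = 55000, union = TRUE)
j$hobby = "sailing"   # add a component by tag
j[[5]] = 1:5          # add a component by index; it has no tag
j
```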
We can remove an element from a list by setting it to NULL:
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
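A sketch of the two removals that would give the printouts above, starting from the five-component list built earlier:

```r
j = list(name = "Joe", salary = 55000, union = TRUE,
         hobby = "sailing", 1:5)
j[[5]] = NULL    # remove the untagged fifth component
j
j$hobby = NULL   # remove a component by tag
j
```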
To get the names of a list’s components, just use the names function:
## [1] "name" "salary" "union"
To get values, we use unlist.
This is another function you have to be careful with, because it doesn’t necessarily do what you think it will.
It returns an atomic vector.
That means all the elements have to be of the same class, and so data often gets coerced to a different type.
If we unlist j, we get back a character vector. This is essentially because R knows how to convert numbers and booleans to characters but doesn’t know how to convert characters to numbers or booleans.
## name salary union
## "Joe" "55000" "TRUE"
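A sketch showing both functions on the list j from the earlier outputs (the name u is an assumption):

```r
j = list(name = "Joe", salary = 55000, union = TRUE)
names(j)        # component tags
u = unlist(j)   # everything is coerced to character
u
class(u)        # "character"
```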
You can see the text for the coercion hierarchy.