Course Mechanics

Course goals: Be able to write clean, accurate, maintainable code. Learn how the algorithms you will use as a statistician work so that you can diagnose problems and extend or tweak them to fit your needs.

First half of the course focuses on software engineering, some specifics of R, and how to write good code.
Second half of the course focuses on algorithms.
Course will be primarily in R. One homework will be in Python, and you'll have the option of doing the final homework in Python as well if you would like more practice.
Course website: jfukuyama.github.io/teaching/stat710 will have slides, assignments, and any additional readings.
Homework is submitted through Canvas.

Assessment:

40% homework. Homework will generally be posted on Sunday and due on Tuesday of the following week, 9 days later.
30% midterm. In class, written.
30% final exam. Same format as the midterm, held at the scheduled final exam time.

Textbooks

Main text is Matloff, The Art of R Programming. The R Cookbook, by Paul Teetor, will also be useful.

How to read the book:

Matloff has example code. You should have an R session open, actually type in the commands, and make sure that you get the same results.
Don't worry about the analogies to C (unless you have a lot of experience in C and you find that helpful!)
You can generally skip the extended examples.

Reading for today

Reading: Matloff Chapters 1, 2, 3.1, 3.2, 3.4, 3.5, 3.6, 3.7, 4.1, 4.2, 4.3, 5.1, 5.2, 6.1

Chapter 1: Don't worry about understanding everything in detail. It gives a good overview and will give you a mental slot for everything that comes later.

Also, you should probably not follow his advice on getting help. The real way is to google your problem + R and click on the first Stack Overflow link.

It's almost never a good idea to ask for help on the mailing list.

Data types and data structures

Goal today: Learn about the ways R stores data and how to access and manipulate data in those structures.

Why is this important?

R is object-oriented and functional (we'll discuss this more later in the course).
This sometimes makes it seem like magic: the plot function does completely different things in different contexts. How does R know?

Data types for today:

Vectors
Matrices
Lists

Vectors

Vectors are the fundamental data type in R. Scalars are actually just vectors of length 1.

Creating a vector

Easiest way to make a vector is with the c function (for "concatenate"):

x = c(1, 168, .3)
x

## [1]   1.0 168.0   0.3

The : operator will also make a vector of numbers separated by 1:

y = 5:10
y

## [1]  5  6  7  8  9 10

What do you think y = 5.1:10 will do?

seq makes a vector of numbers with arbitrary spacing:

seq(from = 12, to = 30, by = 3)

## [1] 12 15 18 21 24 27 30

You can specify the length you want instead of the spacing:

seq(from = 1.1, to = 2, length.out = 10)

##  [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

rep will give a vector of repeated elements:

rep(1, times = 5)

## [1] 1 1 1 1 1

You can repeat vectors instead of single numbers:

rep(c(1, 5, 7), times = 3)

## [1] 1 5 7 1 5 7 1 5 7

Specifying each instead of times changes how the repeats work: every element of the vector is repeated that many times in a row before moving on to the next element:

rep(c(1, 5, 7), each = 3)

## [1] 1 1 1 5 5 5 7 7 7
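One pitfall with the : operator may be worth a sketch here. When the upper end is computed and turns out to be 0, 1:n counts down instead of giving an empty vector. The example below uses a hypothetical variable n and the base R functions seq_len and seq_along, which are not covered above:

```r
n = 0          # hypothetical length, e.g. the size of an empty data set
1:n            # counts down: 1 0
seq_len(n)     # empty vector: integer(0)

x = c(10, 20, 30)
seq_along(x)   # one index per element of x: 1 2 3
```

This matters mostly in loops, where 1:n with n equal to 0 would run twice instead of not at all.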

Some vector operations

Vectors can be added, subtracted, multiplied, etc.

If the vectors are the same size, this will be done element-wise.

x = c(1, 2, 4)
x + c(5, 0, -1)

## [1] 6 2 3

x * c(5, 0, -1)

## [1]  5  0 -4

x / c(5, 4, -1)

## [1]  0.2  0.5 -4.0

What if the vectors are not the same size?

Remember that scalars in R are just vectors, so in the following code we are actually adding two vectors:

x = c(1, 2, 4)
x + 2

## [1] 3 4 6

Vector recycling

This is weird and important!

If a vector operation requires the vectors to be the same length, R automatically repeats the shorter one until it is long enough to match the longer one.

That is what happened in the example above: the longer vector, x, had length 3, and the shorter vector, 2, had length 1. What really happened was more like

x = c(1, 2, 4)
x + rep(2, times = 3)

## [1] 3 4 6

x + 2 # notice that these give the same results

## [1] 3 4 6

This makes sense for adding a vector and a scalar (= vector of length 1).

It can give results you're not expecting if you have vectors of different lengths:

c(1, 2, 4) + c(6, 0, 9, 20, 22)

## Warning in c(1, 2, 4) + c(6, 0, 9, 20, 22): longer object length is not a
## multiple of shorter object length

## [1]  7  2 13 21 24

## same as
c(1, 2, 4, 1, 2) + c(6, 0, 9, 20, 22)

## [1]  7  2 13 21 24

If the length of the longer vector is not a multiple of the length of the shorter one, R will warn you about the recycling. Otherwise, it recycles without complaint:

c(2, 3, 7, 8) + c(0, 1)

## [1] 2 4 7 9

## same as
c(2, 3, 7, 8) + rep(c(0, 1), times = 2)

## [1] 2 4 7 9

Logical operations

>, <, <=, >=, == are all vector operations and work like the arithmetic operators:

c(2, 5, 7) > c(1, 3, 8)

## [1]  TRUE  TRUE FALSE

c(2, 5, 7) < 8

## [1] TRUE TRUE TRUE
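Because comparisons recycle just like arithmetic, comparing against a shorter vector is an element-by-element test, not a membership test. A minimal sketch of the difference, using the base R operator %in% (not covered above) for membership:

```r
x = c(2, 1, 3, 4)
x == c(1, 2)     # compares against c(1, 2, 1, 2): FALSE FALSE FALSE FALSE
x %in% c(1, 2)   # asks "is each element one of 1, 2?": TRUE TRUE FALSE FALSE
```

Since 4 is a multiple of 2, R gives no warning for the first comparison.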

all and any work on boolean vectors (vectors of TRUE and FALSE), and do what is implied by their name:

any(c(TRUE, FALSE))

## [1] TRUE

all(c(TRUE, FALSE))

## [1] FALSE

x = 1:10
any(x > 8)

## [1] TRUE

any(x > 88)

## [1] FALSE

all(x > 88)

## [1] FALSE

all(x > 0)

## [1] TRUE

Testing vector equality

We can't use == to test whether vectors are the same because == will give us a boolean vector.

x = 1:3
y = c(1, 3, 4)
x == y

## [1]  TRUE FALSE FALSE

There are two ways around this, all and identical:

all(x == y)

## [1] FALSE

identical(x, y)

## [1] FALSE

You need to be a bit careful with identical though:

identical(1:3, c(1, 2, 3))

## [1] FALSE

## why?
typeof(1:3)

## [1] "integer"

typeof(c(1, 2, 3))

## [1] "double"
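If you want to compare values rather than exact representations, one option is the base R function all.equal, which is not covered above. The sketch below also shows that all(x == y) has its own trap, because == recycles:

```r
# identical is strict about type; all.equal compares numeric values
# up to a small tolerance and ignores integer vs. double storage.
identical(1:3, c(1, 2, 3))           # FALSE
isTRUE(all.equal(1:3, c(1, 2, 3)))   # TRUE

# Floating point arithmetic makes exact comparison fragile.
0.1 + 0.2 == 0.3                     # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE

# all(x == y) recycles, so it can be TRUE for vectors of different lengths.
all(c(1, 1, 1) == 1)                 # TRUE
identical(c(1, 1, 1), 1)             # FALSE
```

all.equal returns a character description of the difference when the objects differ, so it is wrapped in isTRUE to get a single TRUE or FALSE.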

Vector access

To get a subvector, we can use [].

The format is vector1[vector2], where vector2 gives the indices we want to pick out of vector1.

y = c(1.2, 3.9, 0.4, 0.12)
y[c(1,3)]  # extract elements 1 and 3 of y

## [1] 1.2 0.4

y[2:3]

## [1] 3.9 0.4

v = 3:4
y[v]

## [1] 0.40 0.12

We are allowed to repeat indices:

y

## [1] 1.20 3.90 0.40 0.12

y[rep(c(1, 3), each = 3)]

## [1] 1.2 1.2 1.2 0.4 0.4 0.4

To exclude indices instead of include them, we can use -:

z = c(5, 12, 13)
z[-1]  # exclude element 1

## [1] 12 13

z[-1:-2]  # exclude elements 1 through 2

## [1] 13
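A few edge cases of [] indexing, sketched on the same vector. These are illustrations of standard behavior rather than anything specific to this example:

```r
z = c(5, 12, 13)
z[length(z)]    # last element: 13
z[-length(z)]   # everything but the last element: 5 12
z[5]            # an out-of-range index gives NA rather than an error
z[-(1:2)]       # same as z[-1:-2], with the intent spelled out: 13
```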

Filtering

We will often want a sub-vector, but we won't know in advance which elements we want.

As a contrived example, suppose we want to extract just the elements of the vector whose square is greater than 8.

We could use the following code to do so:

z = c(5, 2, -3, 8)
w = z[z * z > 8]
w

## [1]  5 -3  8

What's happening here?

z

## [1]  5  2 -3  8

z * z

## [1] 25  4  9 64

z * z > 8

## [1]  TRUE FALSE  TRUE  TRUE

z[c(TRUE, FALSE, TRUE, TRUE)]

## [1]  5 -3  8

z[z * z > 8]

## [1]  5 -3  8

Another simple way to extract only some of the values of a vector is with subset:

z = c(5, 2, -3, 8)
z[z * z > 8]

## [1]  5 -3  8

subset(z, z * z > 8) ## gives the same result

## [1]  5 -3  8

We can also use which. which just gives us the positions at which the condition occurs, and we can use those positions to get the relevant subvector:

z = c(5, 2, -3, 8)
which(z * z > 8)

## [1] 1 3 4

z[which(z * z > 8)]

## [1]  5 -3  8
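The three approaches agree on the example above, but they treat missing values differently. A small sketch with a hypothetical vector containing one NA:

```r
z = c(5, 2, NA, 8)     # hypothetical data with one missing value
z * z > 8              # TRUE FALSE NA TRUE

z[z * z > 8]           # the NA is kept: 5 NA 8
subset(z, z * z > 8)   # the NA is dropped: 5 8
z[which(z * z > 8)]    # the NA is dropped: 5 8
```

Which behavior you want depends on the analysis. The point is only that the choice of filtering method is not neutral once NAs are present.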

Matrices

Matrices in R turn out to just be vectors with an additional attribute giving the dimensions.
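To make this concrete, the sketch below turns an ordinary vector into a matrix just by attaching a dim attribute (the variable name v is only for illustration):

```r
v = 1:6
attributes(v)      # NULL: a plain vector has no dimensions

dim(v) = c(2, 3)   # attach a dim attribute
is.matrix(v)       # TRUE
v
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

v[4]               # vector-style indexing still works: 4
```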

Matrix creation

Most basic way of creating a matrix is with the matrix function.

It takes a vector giving the values that should go in the matrix, plus the number of rows and the number of columns.

By default, the values are arranged in column-major order, but you can specify that the data is supplied in row-major order instead:

y = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
y

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

y = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
y

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

Note that byrow = TRUE doesn't change the way the matrix is stored behind the scenes.
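One way to check this is to flatten the matrix back into a vector. In the sketch below the values come out column by column even though they were supplied row by row:

```r
y = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
as.vector(y)   # 1 3 2 4: column 1, then column 2
y[2]           # single-index access follows storage order: 3
```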

You can specify just a number of rows or just a number of columns, and R will figure out what the dimensions of the matrix should be for you:

y = matrix(c(1, 2, 3, 4), nrow = 2)
y

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

What we learned about vector recycling applies to matrices too:

y = matrix(1:2, nrow = 2, ncol = 5)
y

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2

y = matrix(1:3, nrow = 2, ncol = 5)

## Warning in matrix(1:3, nrow = 2, ncol = 5): data length [3] is not a sub-
## multiple or multiple of the number of rows [2]

y

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    2    1    3
## [2,]    2    1    3    2    1

What happens with matrix(1:3, nrow = 2)?

Matrix operations

These work pretty much the same way vector operations do.

With matrices we get matrix multiplication, %*%, in addition to the operations available for vectors.

y = matrix(c(1, 2, 3, 4), nrow = 2)
y

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

y %*% y

##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22

y * 3

##      [,1] [,2]
## [1,]    3    9
## [2,]    6   12

y + y

##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
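A common slip is to write * when matrix multiplication was intended. The sketch below puts the two side by side on the same matrix, and also shows the transpose function t and recycling against a shorter vector:

```r
y = matrix(c(1, 2, 3, 4), nrow = 2)
y * y     # element-wise product: rows (1, 9) and (4, 16)
y %*% y   # matrix product: rows (7, 15) and (10, 22)
t(y)      # transpose: rows (1, 2) and (3, 4)

# Recycling applies too: a length-2 vector is recycled down the columns.
y * c(10, 100)   # rows (10, 30) and (200, 400)
```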

Matrix indexing

Getting sub-matrices is analogous to getting sub-vectors.

We still use [], but now we have two indices.

The syntax is matrix[rowIndices, colIndices], where rowIndices is a vector describing the rows you want to take and colIndices is a vector describing the columns you want to take.

Leaving one empty means that you want to take all the available indices.

Let's see how this works:

z = matrix(sample(1:12), nrow = 4)  ## random permutation of 1:12, so your values will differ
z

##      [,1] [,2] [,3]
## [1,]    8    3    2
## [2,]    5   10    7
## [3,]   12    1   11
## [4,]    6    9    4

z[,2:3] ## extract the 2nd and 3rd columns

##      [,1] [,2]
## [1,]    3    2
## [2,]   10    7
## [3,]    1   11
## [4,]    9    4

z[2:3, 2] ## extract the 2nd and 3rd rows of the 2nd column

## [1] 10  1

z[,-1] ## negative subscripts work the same way as with vectors

##      [,1] [,2]
## [1,]    3    2
## [2,]   10    7
## [3,]    1   11
## [4,]    9    4

We can assign values to submatrices using this indexing as well:

y = matrix(1:6, nrow = 3)
y

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

y[c(1,3),] = matrix(c(1, 1, 8, 12), nrow = 2)
y

##      [,1] [,2]
## [1,]    1    8
## [2,]    2    5
## [3,]    1   12

Filtering

As with vectors, we can filter to submatrices by generating indices that we want to keep.

As a simple example, suppose we want just the rows of the matrix for which the second column is at least 3.

We could do that as follows:

x = matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
x

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4

x[x[,2] >= 3, ]

##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4

What is happening?

x

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4

x[,2]

## [1] 2 3 4

x[,2] >= 3

## [1] FALSE  TRUE  TRUE

x[c(FALSE, TRUE, TRUE), ]

##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4

x[x[,2] >= 3, ]

##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4

Avoiding unintended dimension reduction

Notice what happens if we try to make a sub-matrix corresponding to just one row of x:

x

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4

r = x[2,]
r

## [1] 2 3

What happened?

attributes(x)

## $dim
## [1] 3 2

attributes(r)

## NULL

str(x)

##  num [1:3, 1:2] 1 2 3 2 3 4

str(r)

##  num [1:2] 2 3

r is a vector!

drop = FALSE lets us avoid this behavior.

r = x[2,,drop = FALSE]
r

##      [,1] [,2]
## [1,]    2    3

attributes(r)

## $dim
## [1] 1 2

str(r)

##  num [1, 1:2] 2 3
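The same thing can happen silently when filtering, since you usually do not know in advance how many rows will match. A sketch using the matrix x from above, with hypothetical result names kept_vec and kept_mat:

```r
x = matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)

# Only one row has a second column >= 4, so the result is silently a vector.
kept_vec = x[x[, 2] >= 4, ]
is.matrix(kept_vec)   # FALSE

# With drop = FALSE the result stays a matrix however many rows match.
kept_mat = x[x[, 2] >= 4, , drop = FALSE]
dim(kept_mat)         # 1 2
```

Code that later calls nrow or indexes with two subscripts will break on kept_vec, so drop = FALSE is a reasonable default inside functions.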

Lists

Lists are technically vectors.
The vectors from before were called atomic vectors, which means that their components couldn't be broken down into smaller components.
In R, the main purpose of lists is to lump together data of different types. Atomic vectors require all their elements to be of the same type, but lists can have any sort of elements (including lists!).
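A short sketch of the difference, with hypothetical variable names a and l. Putting mixed values into c coerces them to a single type, while list keeps each one as it is:

```r
# c() builds an atomic vector, so mixed inputs are coerced to one type.
a = c("Joe", 55000, TRUE)
typeof(a)           # "character"

# list() keeps each element's own type.
l = list("Joe", 55000, TRUE)
sapply(l, typeof)   # "character" "double" "logical"

is.vector(l)        # TRUE: a list is a vector...
is.atomic(l)        # FALSE: ...but not an atomic one
```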

Creating lists

You can create a list with the list function:

j = list(name = "Joe", salary = 55000, union = TRUE)
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE

The component names (tags) are not necessary, but it is good practice to use them.

list("Joe", 55000, TRUE)

## [[1]]
## [1] "Joe"
## 
## [[2]]
## [1] 55000
## 
## [[3]]
## [1] TRUE

List access

Suppose we want to get Joe's salary. There are at least three different ways to do so:

j$salary

## [1] 55000

j[["salary"]]

## [1] 55000

j[[2]]

## [1] 55000

The double brackets [[]] allow us access to an element of the list.

Note that if we didn't use the tags, we would only be able to access the salary using j[[2]].
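The $ and [[]] forms are not quite interchangeable. As a sketch of one difference: $ accepts an unambiguous prefix of a tag, while [[]] requires the exact tag unless told otherwise.

```r
j = list(name = "Joe", salary = 55000, union = TRUE)
j$sal                       # partial match on "salary": 55000
j[["sal"]]                  # exact matching by default: NULL
j[["sal", exact = FALSE]]   # partial matching on request: 55000
```

Partial matching is convenient at the console but can hide typos in scripts, so writing out the full tag is the safer habit.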

Sublist vs. Element of list

Very important: [[]] and [] are different operations:

[[]] gives an element of the list
[] gives a sublist

j[[2]]

## [1] 55000

class(j[[2]])

## [1] "numeric"

j[2]

## $salary
## [1] 55000

class(j[2])

## [1] "list"
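The distinction matters as soon as you compute with the result. A minimal sketch, in which sapply is a base R function not covered above:

```r
j = list(name = "Joe", salary = 55000, union = TRUE)

j[["salary"]] * 1.1     # works: the element is a number
# j["salary"] * 1.1     # error: a list is a non-numeric argument

# [] can pick several components at once; the result is still a list.
j[c("name", "salary")]

# Keep only the components that are numeric.
j[sapply(j, is.numeric)]
```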

List manipulation

We can add to a list by tag:

j$hobby = "sailing"
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"

We can add to a list by index:

j[[5]] = 1:5
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"
## 
## [[5]]
## [1] 1 2 3 4 5

We can remove an element from a list by setting it to NULL:

j[[5]] = NULL
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"

j$hobby = NULL
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
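Because assigning NULL removes a component, storing an actual NULL takes a different form. One way, sketched below on a smaller version of the list, is to assign list(NULL) with single brackets:

```r
j = list(name = "Joe", salary = 55000)

j$salary = NULL            # removes the component
names(j)                   # "name"

j["salary"] = list(NULL)   # single brackets: stores NULL as the value
names(j)                   # "name" "salary"
is.null(j$salary)          # TRUE
```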

Extracting names and values

Names are easy: just use the names function.

names(j)

## [1] "name"   "salary" "union"

To get values, we use unlist.

This is another function you have to be careful with, because it doesn't necessarily do what you think it will.

It returns an atomic vector.
That means all the elements have to be of the same type, and so data often gets coerced to a different type.
If we unlist j, we get back a character vector. This is essentially because R knows how to convert numbers and booleans to characters but doesn't know how to convert characters to numbers or booleans.

unlist(j)

##    name  salary   union 
##   "Joe" "55000"  "TRUE"

See the text for the coercion hierarchy.
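For the types used in this lecture, coercion goes from logical to integer to double to character, and the result takes the most general type present. A quick sketch of how to check this yourself:

```r
typeof(c(TRUE, 2L))    # "integer"
typeof(c(2L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"

# unlist follows the same rule: logical and integer are promoted to double.
unlist(list(TRUE, 2L, 2.5))   # 1.0 2.0 2.5
```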

Note that unlist gives an atomic vector even for recursive lists (lists of lists), for example:

complex_list = list(a = list(1), b = 1:5, c = list(a = 1, b = 2))
unlist(complex_list)

##   a  b1  b2  b3  b4  b5 c.a c.b 
##   1   1   2   3   4   5   1   2

Stat 710 Lecture 1: Basic Data Types
