Stat 710 Lecture 1: Basic Data Types

Course Mechanics

Course goals: Be able to write clean, accurate, maintainable code. Learn how the algorithms you will use as a statistician work so that you can diagnose problems and extend or tweak them to fit your needs.

Assessment:

Textbooks

Main text is Matloff, The Art of R Programming. The R Cookbook, by Paul Teetor, will also be useful.

How to read the book:

Reading for today

Reading: Matloff Chapters 1, 2, 3.1, 3.2, 3.4, 3.5, 3.6, 3.7, 4.1, 4.2, 4.3, 5.1, 5.2, 6.1

Chapter 1: Don't worry about understanding everything in detail. It gives a good overview, so that you will have a mental slot for each topic when we come back to it later.

Also, you should probably not follow his advice on getting help. The real way is to google your problem + R and click on the first Stack Overflow link.

It's almost never a good idea to ask for help on the mailing list.

Data types and data structures

Goal today: Learn about the ways R stores data and how to access and manipulate data in those structures.

Why is this important?

Data types for today:

Vectors

Vectors are the fundamental data type in R. Scalars are actually just vectors of length 1.

Creating a vector

Easiest way to make a vector is with the c function (for "concatenate"):

x = c(1, 168, .3)
x
## [1]   1.0 168.0   0.3

The : operator will also make a vector of numbers separated by 1:

y = 5:10
y
## [1]  5  6  7  8  9 10

What do you think y = 5.1:10 will do?

seq makes a vector of evenly spaced numbers, with whatever spacing you choose:

seq(from = 12, to = 30, by = 3)
## [1] 12 15 18 21 24 27 30

You can specify the length you want instead of the spacing:

seq(from = 1.1, to = 2, length.out = 10)
##  [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

rep will give a vector of repeated elements:

rep(1, times = 5)
## [1] 1 1 1 1 1

You can repeat vectors instead of single numbers:

rep(c(1, 5, 7), times = 3)
## [1] 1 5 7 1 5 7 1 5 7

Specifying each instead of times changes how the repeats work: the first element of the vector is repeated the number of times given by each, then the second, then the third:

rep(c(1, 5, 7), each = 3)
## [1] 1 1 1 5 5 5 7 7 7
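The two arguments can also be combined. The following is a small sketch (the variable names are only for illustration) showing that each is applied first and the result is then repeated times times:

```r
v = c(1, 5, 7)
by_times = rep(v, times = 2)         # whole vector repeated
by_each = rep(v, each = 2)           # each element repeated in place
both = rep(v, each = 2, times = 2)   # each is applied first, then times
by_times
## [1] 1 5 7 1 5 7
by_each
## [1] 1 1 5 5 7 7
both
##  [1] 1 1 5 5 7 7 1 1 5 5 7 7
```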

Some vector operations

Vectors can be added, subtracted, multiplied, etc.

If the vectors are the same size, this will be done element-wise.

x = c(1, 2, 4)
x + c(5, 0, -1)
## [1] 6 2 3
x * c(5, 0, -1)
## [1]  5  0 -4
x / c(5, 4, -1)
## [1]  0.2  0.5 -4.0

What if the vectors are not the same size?

Remember that scalars in R are just vectors, so in the following code we are actually adding two vectors:

x = c(1, 2, 4)
x + 2
## [1] 3 4 6

Vector recycling

This is weird and important!

If a vector operation requires the vectors to be the same length, R automatically repeats the shorter one until it is long enough to match the longer one.

That is what happened in the example above: the longer vector, x, had length 3, and the shorter vector, 2, had length 1. What really happened was more like

x = c(1, 2, 4)
x + rep(2, times = 3)
## [1] 3 4 6
x + 2 # notice that these give the same results
## [1] 3 4 6

This makes sense for adding a vector and a scalar (= vector of length 1).

It can give results you're not expecting if you have vectors of different lengths:

c(1, 2, 4) + c(6, 0, 9, 20, 22)
## Warning in c(1, 2, 4) + c(6, 0, 9, 20, 22): longer object length is not a
## multiple of shorter object length
## [1]  7  2 13 21 24
## same as
c(1, 2, 4, 1, 2) + c(6, 0, 9, 20, 22)
## [1]  7  2 13 21 24

If the longer length is not a multiple of the shorter one, R will warn you about the recycling. Otherwise, it recycles without complaint:

c(2, 3, 7, 8) + c(0, 1)
## [1] 2 4 7 9
## same as
c(2, 3, 7, 8) + rep(c(0, 1), times = 2)
## [1] 2 4 7 9
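Because silent recycling can hide bugs, one option is to check lengths yourself before doing arithmetic. The following is only a sketch: add_strict is a hypothetical helper written for these notes, not a base R function, and it assumes you want to allow recycling for scalars only.

```r
# add_strict is a hypothetical helper, not part of base R.
# It allows recycling only when one argument has length 1.
add_strict = function(a, b) {
  if (length(a) != length(b) && length(a) != 1 && length(b) != 1) {
    stop("lengths differ: ", length(a), " vs ", length(b))
  }
  a + b
}

add_strict(c(2, 3, 7, 8), 1)   # scalar recycling is still allowed
## [1] 3 4 8 9
# Unequal lengths now raise an error instead of recycling silently
res = tryCatch(add_strict(c(2, 3, 7, 8), c(0, 1)),
               error = function(e) conditionMessage(e))
res
## [1] "lengths differ: 4 vs 2"
```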

Logical operations

>, <, <=, >=, == are all vector operations and work element-wise (with recycling), like the arithmetic operators:

c(2, 5, 7) > c(1, 3, 8)
## [1]  TRUE  TRUE FALSE
c(2, 5, 7) < 8
## [1] TRUE TRUE TRUE

all and any work on boolean vectors (vectors of TRUE and FALSE), and do what is implied by their name:

any(c(TRUE, FALSE))
## [1] TRUE
all(c(TRUE, FALSE))
## [1] FALSE
x = 1:10
any(x > 8)
## [1] TRUE
any(x > 88)
## [1] FALSE
all(x > 88)
## [1] FALSE
all(x > 0)
## [1] TRUE

Testing vector equality

We can't use == to test whether two vectors are the same as a whole, because == works element-wise and gives us a boolean vector.

x = 1:3
y = c(1, 3, 4)
x == y
## [1]  TRUE FALSE FALSE

Two ways around this are all and identical:

all(x == y)
## [1] FALSE
identical(x, y)
## [1] FALSE

You need to be a bit careful with identical though:

identical(1:3, c(1, 2, 3))
## [1] FALSE
## why?
typeof(1:3)
## [1] "integer"
typeof(c(1, 2, 3))
## [1] "double"
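To make the difference concrete, here is a short sketch comparing the approaches side by side. It also uses all.equal, a base R function not covered above that compares numbers up to a small tolerance; wrapping it in isTRUE is the usual way to get a single TRUE or FALSE. The variable names are only for illustration.

```r
a = 1:3
b = c(1, 2, 3)
same_values = all(a == b)                      # compares values only
same_object = identical(a, b)                  # also compares storage type
same_after_cast = identical(as.numeric(a), b)  # convert first, then compare
near_equal = isTRUE(all.equal(a, b))           # numeric equality up to a tolerance
c(same_values, same_object, same_after_cast, near_equal)
## [1]  TRUE FALSE  TRUE  TRUE

# all(x == y) recycles silently, so vectors of different lengths can "match"
all(c(1, 2, 1, 2) == c(1, 2))
## [1] TRUE
```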

Vector access

To get a subvector, we can use [].

The format is vector1[vector2], where vector2 gives the indices we want to pick out of vector1.

y = c(1.2, 3.9, 0.4, 0.12)
y[c(1,3)]  # extract elements 1 and 3 of y
## [1] 1.2 0.4
y[2:3]
## [1] 3.9 0.4
v = 3:4
y[v]
## [1] 0.40 0.12

We are allowed to repeat indices:

y
## [1] 1.20 3.90 0.40 0.12
y[rep(c(1, 3), each = 3)]
## [1] 1.2 1.2 1.2 0.4 0.4 0.4

To exclude indices instead of include them, we can use -:

z = c(5, 12, 13)
z[-1]  # exclude element 1
## [1] 12 13
z[-1:-2]  # exclude elements 1 through 2
## [1] 13
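One thing to watch with negative indices is operator precedence: the unary minus is applied before the : operator. A brief sketch of what that means in practice (the tryCatch wrapper is only there so the example runs to the end):

```r
z = c(5, 12, 13)
drop_first_two = z[-(1:2)]   # same as z[-1:-2]
drop_first_two
## [1] 13
-1:2                         # this is (-1):2, not -(1:2)
## [1] -1  0  1  2
# Mixing negative and positive indices is an error
mixed = tryCatch(z[-1:2], error = function(e) "error")
mixed
## [1] "error"
```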

Filtering

We will often want a sub-vector, but we won't know in advance which elements we want.

As a contrived example, suppose we want to extract just the elements of the vector whose square is greater than 8.

We could use the following code to do so:

z = c(5, 2, -3, 8)
w = z[z * z > 8]
w
## [1]  5 -3  8

What's happening here?

z
## [1]  5  2 -3  8
z * z
## [1] 25  4  9 64
z * z > 8
## [1]  TRUE FALSE  TRUE  TRUE
z[c(TRUE, FALSE, TRUE, TRUE)]
## [1]  5 -3  8
z[z * z > 8]
## [1]  5 -3  8

Another simple way to extract only some of the values of a vector is with subset:

z = c(5, 2, -3, 8)
z[z * z > 8]
## [1]  5 -3  8
subset(z, z * z > 8) ## gives the same result
## [1]  5 -3  8

We can also use which. which just gives us the positions at which the condition occurs, and we can use those positions to get the relevant subvector:

z = c(5, 2, -3, 8)
which(z * z > 8)
## [1] 1 3 4
z[which(z * z > 8)]
## [1]  5 -3  8
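The three approaches give the same answer above, but they can differ when the vector contains missing values. The sketch below assumes a vector with one NA (not part of the original example) to show the difference:

```r
z = c(5, NA, -3, 2)
keep = z * z > 8
keep
## [1]  TRUE    NA  TRUE FALSE
by_logical = z[keep]          # an NA in the index gives an NA in the result
by_logical
## [1]  5 NA -3
by_which = z[which(keep)]     # which() skips NA
by_which
## [1]  5 -3
by_subset = subset(z, keep)   # subset() also skips NA
by_subset
## [1]  5 -3
```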

Matrices

Matrices in R turn out to just be vectors with an additional attribute giving the dimensions.
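As a quick illustration of that claim, we can turn an ordinary vector into a matrix just by giving it a dim attribute (a minimal sketch; v is an arbitrary example vector):

```r
v = 1:6
is.matrix(v)
## [1] FALSE
dim(v) = c(2, 3)   # attach a dim attribute to the vector
is.matrix(v)
## [1] TRUE
v
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
v[4]               # single-index (vector-style) access still works
## [1] 4
```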

Matrix creation

Most basic way of creating a matrix is with the matrix function.

It takes a vector giving the values that should go in the matrix, plus the number of rows and the number of columns.

By default, the values are arranged in column-major order, but you can specify that the data is coming in row-major order instead:

y = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
y
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
y = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
y
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

Note that byrow = TRUE only changes how the input values are placed into the matrix. It doesn't change the way the matrix is stored behind the scenes, which is always column-major.
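One way to see the storage order is to strip the matrix back down to its underlying vector with as.vector. This is just a sketch, with by_col and by_row as illustrative names:

```r
by_col = matrix(c(1, 2, 3, 4), nrow = 2)
by_row = matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
as.vector(by_col)              # underlying vector, read column by column
## [1] 1 2 3 4
as.vector(by_row)              # still read column by column
## [1] 1 3 2 4
identical(by_row, t(by_col))   # here, filling by row gives the transpose
## [1] TRUE
```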

You can specify just a number of rows or just a number of columns, and R will figure out what the dimensions of the matrix should be for you:

y = matrix(c(1, 2, 3, 4), nrow = 2)
y
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

What we learned about vector recycling applies to matrices too:

y = matrix(1:2, nrow = 2, ncol = 5)
y
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2
y = matrix(1:3, nrow = 2, ncol = 5)
## Warning in matrix(1:3, nrow = 2, ncol = 5): data length [3] is not a sub-
## multiple or multiple of the number of rows [2]
y
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    2    1    3
## [2,]    2    1    3    2    1

What happens with matrix(1:3, nrow = 2)?

Matrix operations

These work pretty much the same way vector operations do.

With matrices we get matrix multiplication, %*%, in addition to the operations available for vectors.

y = matrix(c(1, 2, 3, 4), nrow = 2)
y
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
y %*% y
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22
y * 3
##      [,1] [,2]
## [1,]    3    9
## [2,]    6   12
y + y
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
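It is easy to mix up * and %*%, so here is a short sketch contrasting them, plus an example of a vector being recycled against a matrix (the variable names are only for illustration):

```r
y = matrix(c(1, 2, 3, 4), nrow = 2)
elementwise = y * y        # element-by-element product
elementwise
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16
matprod = y %*% y          # matrix product
matprod
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22
recycled = y + c(10, 20)   # the vector is recycled down the columns
recycled
##      [,1] [,2]
## [1,]   11   13
## [2,]   22   24
```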

Matrix indexing

Getting sub-matrices is analogous to getting sub-vectors.

We still use [], but now we have two indices.

The syntax is matrix[rowIndices, colIndices], where rowIndices is a vector describing the rows you want to take and colIndices is a vector describing the columns you want to take.

Leaving one empty means that you want to take all the available indices.

Let's see how this works:

z = matrix(sample(1:12), nrow = 4)
z
##      [,1] [,2] [,3]
## [1,]    8    3    2
## [2,]    5   10    7
## [3,]   12    1   11
## [4,]    6    9    4
z[,2:3] ## extract the 2nd and 3rd columns
##      [,1] [,2]
## [1,]    3    2
## [2,]   10    7
## [3,]    1   11
## [4,]    9    4
z[2:3, 2] ## extract the 2nd and 3rd rows of the 2nd column
## [1] 10  1
z[,-1] ## negative subscripts work the same way as with vectors
##      [,1] [,2]
## [1,]    3    2
## [2,]   10    7
## [3,]    1   11
## [4,]    9    4

We can assign values to submatrices using this indexing as well:

y = matrix(1:6, nrow = 3)
y
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
y[c(1,3),] = matrix(c(1, 1, 8, 12), nrow = 2)
y
##      [,1] [,2]
## [1,]    1    8
## [2,]    2    5
## [3,]    1   12
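Assignment also works with a single row and with logical conditions. As a small sketch (the values are arbitrary), the following replaces one row and then sets every element greater than 5 to 0:

```r
y = matrix(1:6, nrow = 3)
y[2, ] = c(10, 20)   # replace a whole row
y[y > 5] = 0         # replace every element that matches a condition
y
##      [,1] [,2]
## [1,]    1    4
## [2,]    0    0
## [3,]    3    0
```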

Filtering

As with vectors, we can filter to submatrices by generating indices that we want to keep.

As a simple example, suppose we want just the rows of the matrix for which the second column is at least 3.

We could do that as follows:

x = matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
x
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4
x[x[,2] >= 3, ]
##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4

What is happening?

x
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4
x[,2]
## [1] 2 3 4
x[,2] >= 3
## [1] FALSE  TRUE  TRUE
x[c(FALSE, TRUE, TRUE), ]
##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4
x[x[,2] >= 3, ]
##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4

Avoiding unintended dimension reduction

Notice what happens if we try to make a sub-matrix corresponding to just one row of x:

x
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4
r = x[2,]
r
## [1] 2 3

What happened?

attributes(x)
## $dim
## [1] 3 2
attributes(r)
## NULL
str(x)
##  num [1:3, 1:2] 1 2 3 2 3 4
str(r)
##  num [1:2] 2 3

r is a vector!

drop = FALSE lets us avoid this behavior.

r = x[2,,drop = FALSE]
r
##      [,1] [,2]
## [1,]    2    3
attributes(r)
## $dim
## [1] 1 2
str(r)
##  num [1, 1:2] 2 3
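This matters most when filtering, because you usually don't know in advance how many rows will match. A sketch, using a condition that happens to match a single row of x:

```r
x = matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
one_row = x[x[, 2] > 3, ]                     # only row 3 matches
nrow(one_row)                                 # a plain vector has no rows
## NULL
one_row_mat = x[x[, 2] > 3, , drop = FALSE]   # keep the matrix structure
nrow(one_row_mat)
## [1] 1
dim(one_row_mat)
## [1] 1 2
```

Code that later calls nrow or indexes with two subscripts keeps working in the second case.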

Lists

Creating lists

You can create a list with the list function:

j = list(name = "Joe", salary = 55000, union = TRUE)
j
## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE

The component names (tags) are not necessary, but it is good practice to use them.

list("Joe", 55000, TRUE)
## [[1]]
## [1] "Joe"
## 
## [[2]]
## [1] 55000
## 
## [[3]]
## [1] TRUE

List access

Suppose we want to get Joe's salary. There are at least three different ways to do so:

j$salary
## [1] 55000
j[["salary"]]
## [1] 55000
j[[2]]
## [1] 55000

The double brackets [[]] give us access to a single element of the list.

Note that if we didn't use the tags, we would only be able to access the salary using j[[2]].

Sublist vs. Element of list

Very important: [[]] and [] are different operations:

j[[2]]
## [1] 55000
class(j[[2]])
## [1] "numeric"
j[2]
## $salary
## [1] 55000
class(j[2])
## [1] "list"
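The practical consequence is that [] can pick out several components at once but always hands back a list, while [[]] hands back the element itself. A minimal sketch (the tryCatch wrapper is only there so the example runs to the end):

```r
j = list(name = "Joe", salary = 55000, union = TRUE)
sub = j[c("name", "salary")]   # [] can select several components
length(sub)
## [1] 2
doubled = j[[2]] * 2           # [[]] returns the number itself
doubled
## [1] 110000
# Arithmetic on a sublist fails because it is still a list
res = tryCatch(j[2] * 2, error = function(e) "error")
res
## [1] "error"
```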

List manipulation

We can add to a list by tag:

j$hobby = "sailing"
j
## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"

We can add to a list by index:

j[[5]] = 1:5
j
## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"
## 
## [[5]]
## [1] 1 2 3 4 5

We can remove an element from a list by setting it to NULL:

j[[5]] = NULL
j
## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"
j$hobby = NULL
j
## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
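Two details are worth knowing about removal by NULL. The sketch below uses a fresh example list k (not part of the notes above) to show that later elements shift position after a removal, and that storing an actual NULL value requires single brackets with list(NULL):

```r
k = list(a = 1, b = 2, c = 3)
k[["b"]] = NULL       # removes b; later elements shift up
names(k)
## [1] "a" "c"
k[[2]]                # c is now in position 2
## [1] 3
k["a"] = list(NULL)   # single brackets keep a, with value NULL
length(k)
## [1] 2
is.null(k$a)
## [1] TRUE
```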

Extracting names and values

Names are easy. Just use the names function:

names(j)
## [1] "name"   "salary" "union"

To get values, we use unlist.

This is another function you have to be careful with, because it doesn't necessarily do what you think it will.

unlist(j)
##    name  salary   union 
##   "Joe" "55000"  "TRUE"

Because a vector can hold only one type, unlist coerced every value to character here. See the text for the coercion hierarchy.
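A few quick checks show the direction of coercion when types are mixed: logical gives way to integer, integer to double, and everything to character. This is only a sketch; c follows the same coercion rules and is used here to keep the examples short.

```r
typeof(c(TRUE, 2L))    # logical + integer
## [1] "integer"
typeof(c(2L, 2.5))     # integer + double
## [1] "double"
typeof(c(2.5, "a"))    # double + character
## [1] "character"
j = list(name = "Joe", salary = 55000, union = TRUE)
typeof(j$salary)       # a number while it is inside the list
## [1] "double"
flat = unlist(j)
typeof(flat)           # everything is character after unlist
## [1] "character"
```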

Note that unlist gives an atomic vector even for recursive lists (lists of lists), for example:

complex_list = list(a = list(1), b = 1:5, c = list(a = 1, b = 2))
unlist(complex_list)
##   a  b1  b2  b3  b4  b5 c.a c.b 
##   1   1   2   3   4   5   1   2
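We can confirm that the nesting is gone, and drop the generated names if they are not wanted, with the use.names argument of unlist. A short sketch using the same list:

```r
complex_list = list(a = list(1), b = 1:5, c = list(a = 1, b = 2))
flat = unlist(complex_list)
is.list(flat)      # the nesting is gone
## [1] FALSE
length(flat)
## [1] 8
bare = unlist(complex_list, use.names = FALSE)   # drop the generated names
bare
## [1] 1 1 2 3 4 5 1 2
```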