Reading for today

Reading: Matloff, Chapters 1 and 2 and Sections 3.1, 3.2, 3.4, 3.5, 3.6, 3.7, 4.1, 4.2, 4.3

Chapter 1: Don’t worry about understanding everything in detail. It gives a good overview and will give you a place to slot in everything that comes later.

Also, you should probably ignore his advice on getting help. In practice, the quickest route is to google your problem + R and click on the first Stack Overflow link, or else ask ChatGPT.

It’s almost never a good idea to ask for help on the mailing list.

Data types and data structures

Goal today: Learn about the ways R stores data and how to access and manipulate data in those structures.

Why is this important?

R is object-oriented and functional (we’ll discuss this more later in the course).
This sometimes makes it seem like magic: the plot function does completely different things in different contexts. How does R know?

Data types for today:

Vectors
Matrices
Lists

Vectors

Vectors are the fundamental data type in R. Scalars are actually just vectors of length 1.

Creating a vector of length 1:

x <- 1

Creating a vector of length > 1: easiest way to make a vector is with the c function (for “concatenate”):

x <- c(1, 168, .3)
x

## [1]   1.0 168.0   0.3

The : operator will also make a vector of numbers separated by 1:

y <- 5:10
y

## [1]  5  6  7  8  9 10

seq makes a vector of evenly spaced values with whatever spacing you choose:

seq(from = 12, to = 30, by = 3)

## [1] 12 15 18 21 24 27 30

You can specify the length you want instead of the spacing:

seq(from = 1.1, to = 2, length.out = 10)

##  [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

rep will give a vector of repeated elements:

rep(1, times = 5)

## [1] 1 1 1 1 1

You can repeat vectors instead of single numbers:

rep(c(1, 5, 7), times = 3)

## [1] 1 5 7 1 5 7 1 5 7

Specifying each instead of times changes how the repeats work: the first element of the vector is repeated each times in a row, then the second, then the third:

rep(c(1, 5, 7), each = 3)

## [1] 1 1 1 5 5 5 7 7 7
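
times and each can also be combined, and times can be a vector. A small sketch (the variable names r and r2 are just for illustration):

```r
# each is applied first, then the whole result is repeated `times` times
r <- rep(c(1, 5, 7), times = 2, each = 2)
r
## [1] 1 1 5 5 7 7 1 1 5 5 7 7

# times can also be a vector giving a separate count for each element
r2 <- rep(c(1, 5, 7), times = c(3, 1, 2))
r2
## [1] 1 1 1 5 7 7
```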

Some vector operations

Vectors can be added, subtracted, multiplied, etc.

If the vectors are the same size, this will be done element-wise.

x <- c(1, 2, 4)
x + c(5, 0, -1)

## [1] 6 2 3

x * c(5, 0, -1)

## [1]  5  0 -4

x / c(5, 4, -1)

## [1]  0.2  0.5 -4.0

What if the vectors are not the same size?

Remember that scalars in R are just vectors, so in the following code we are actually adding two vectors:

x <- c(1, 2, 4)
x + 2

## [1] 3 4 6

Vector recycling

This is weird and important!

If a vector operation requires the vectors to be the same length, R automatically repeats the shorter one until it is long enough to match the longer one.

That is what happened in the example above: the longer vector, x, had length 3, and the shorter vector, 2, had length 1. What really happened was more like this:

x <- c(1, 2, 4)
x + rep(2, times = 3)

## [1] 3 4 6

x + 2 # notice that these give the same results

## [1] 3 4 6

This makes sense for adding a vector and a scalar (= vector of length 1).

It can give results you’re not expecting if you have vectors of different lengths:

c(1, 2, 4) + c(6, 0, 9, 20, 22)

## Warning in c(1, 2, 4) + c(6, 0, 9, 20, 22): longer object length is not a
## multiple of shorter object length

## [1]  7  2 13 21 24

## same as
c(1, 2, 4, 1, 2) + c(6, 0, 9, 20, 22)

## [1]  7  2 13 21 24

If the longer vector’s length is not a multiple of the shorter vector’s length, R will warn you about the recycling. Otherwise, it recycles without complaint:

c(2, 3, 7, 8) + c(0, 1)

## [1] 2 4 7 9

## same as
c(2, 3, 7, 8) + rep(c(0, 1), times = 2)

## [1] 2 4 7 9
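
If silent recycling worries you, one option is to check lengths yourself before doing arithmetic. The sketch below uses a made-up helper, add_strict (not part of base R), which still allows scalars but turns any other length mismatch into an error:

```r
# add_strict is a hypothetical helper, not part of base R:
# it allows scalar recycling but refuses any other length mismatch.
add_strict <- function(a, b) {
  if (length(a) != length(b) && length(a) != 1 && length(b) != 1) {
    stop("lengths differ: ", length(a), " vs ", length(b))
  }
  a + b
}

add_strict(c(2, 3, 7, 8), 1)   # scalar recycling is still allowed
## [1] 3 4 8 9

# the silent recycling from above now becomes an error
res <- tryCatch(add_strict(c(2, 3, 7, 8), c(0, 1)),
                error = function(e) conditionMessage(e))
res
## [1] "lengths differ: 4 vs 2"
```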

Logical operations

>, <, <=, >=, == are all vector operations and work like the arithmetic operators:

c(2, 5, 7) > c(1, 3, 8)

## [1]  TRUE  TRUE FALSE

c(2, 5, 7) < 8

## [1] TRUE TRUE TRUE

all and any work on boolean vectors (vectors of TRUE and FALSE), and do what their names imply:

any(c(TRUE, FALSE))

## [1] TRUE

all(c(TRUE, FALSE))

## [1] FALSE

x <- 1:10
any(x > 8)

## [1] TRUE

any(x > 88)

## [1] FALSE

all(x > 88)

## [1] FALSE

all(x > 0)

## [1] TRUE

Testing vector equality

We can’t use == to test whether vectors are the same because == will give us a boolean vector.

x <- 1:3
y <- c(1, 3, 4)
x == y

## [1]  TRUE FALSE FALSE

Two ways around this: all and identical.

all(x == y)

## [1] FALSE

identical(x, y)

## [1] FALSE

You need to be a bit careful with identical though:

identical(1:3, c(1, 2, 3))

## [1] FALSE

## why?
typeof(1:3)

## [1] "integer"

typeof(c(1, 2, 3))

## [1] "double"
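
A few related checks are worth knowing. This is only a sketch of some common options in base R: all.equal compares values up to a small tolerance and ignores the integer/double distinction, and all(x == y) has its own pitfall because == recycles.

```r
x <- 1:3
y <- c(1, 2, 3)

identical(x, y)                      # FALSE: integer vs. double
isTRUE(all.equal(x, y))              # TRUE: compares values up to a tolerance
identical(as.numeric(x), y)          # TRUE once the types match

# all.equal also helps with floating-point round-off
0.1 + 0.2 == 0.3                     # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE

# all(x == y) recycles silently, so it can be TRUE for different lengths
all(c(1, 1, 1, 1) == c(1, 1))        # TRUE
identical(c(1, 1, 1, 1), c(1, 1))    # FALSE
```

all.equal returns a description of the differences rather than FALSE when the vectors differ, which is why it is wrapped in isTRUE here.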

Vector access

To get a subvector, we can use [].

The format is vector1[vector2]

The rules:

If vector2 is a numeric vector, it is interpreted as the indices we want to pick out of vector1. This is scalar indexing.
If vector2 is a boolean (TRUE/FALSE) vector, the locations of TRUE values in vector2 are interpreted as the indices we want to retain from vector1. This is logical indexing.
Logical indexing is a vectorized operation, and so if vector2 doesn’t have the same length as vector1, vector2 will be recycled to match the length of vector1.

For example: scalar indexing

y <- c(1.2, 3.9, 0.4, 0.12)
y[c(1,3)]  # extract elements 1 and 3 of y

## [1] 1.2 0.4

y[2:3]

## [1] 3.9 0.4

v <- 3:4
y[v]

## [1] 0.40 0.12

We are allowed to repeat indices:

y

## [1] 1.20 3.90 0.40 0.12

y[rep(c(1, 3), each = 3)]

## [1] 1.2 1.2 1.2 0.4 0.4 0.4

To exclude indices instead of include them, we can use -:

z <- c(5, 12, 13)
z[-1]  # exclude element 1

## [1] 12 13

z[-1:-2]  # exclude elements 1 through 2

## [1] 13
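
Be careful with operator precedence when excluding a range. The sketch below (the name mixed is just for illustration) shows that -1:2 is not the same as -(1:2), and that mixing positive and negative indices is an error:

```r
z <- c(5, 12, 13)

z[-(1:2)]          # exclude elements 1 through 2
## [1] 13

# -1:2 is the sequence -1, 0, 1, 2, which mixes negative and positive indices
-1:2
## [1] -1  0  1  2

# indexing with it fails, so we catch the error here
mixed <- tryCatch(z[-1:2], error = function(e) "error")
mixed
## [1] "error"
```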

Logical indexing

A simple example:

y

## [1] 1.20 3.90 0.40 0.12

y[c(TRUE, TRUE, FALSE, FALSE)]

## [1] 1.2 3.9

y[c(TRUE, FALSE)]

## [1] 1.2 0.4

Logical indexing is useful when we want a sub-vector but don’t know in advance which elements we want.

As a contrived example, suppose we want to extract just the elements of the vector whose square is greater than 8.

We could use the following code to do so:

z <- c(5, 2, -3, 8)
w <- z[z * z > 8]
w

## [1]  5 -3  8

What’s happening here?

z

## [1]  5  2 -3  8

z * z

## [1] 25  4  9 64

z * z > 8

## [1]  TRUE FALSE  TRUE  TRUE

z[c(TRUE, FALSE, TRUE, TRUE)]

## [1]  5 -3  8

z[z * z > 8]

## [1]  5 -3  8

Some other examples of filtering

Another simple way to extract only some of the values of a vector is with subset:

z <- c(5, 2, -3, 8)
z[z * z > 8]

## [1]  5 -3  8

subset(z, z * z > 8) ## gives the same result

## [1]  5 -3  8

We can also use which. which just gives us the positions at which the condition occurs, and we can use those positions to get the relevant subvector:

z <- c(5, 2, -3, 8)
which(z * z > 8)

## [1] 1 3 4

z[which(z * z > 8)]

## [1]  5 -3  8
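
The three approaches differ when the vector contains missing values. A small sketch, using a made-up vector with an NA in it:

```r
z <- c(5, NA, -3, 8)

z[z * z > 8]             # the NA in the condition produces an NA in the result
## [1]  5 NA -3  8

z[which(z * z > 8)]      # which() keeps only the positions that are TRUE
## [1]  5 -3  8

subset(z, z * z > 8)     # subset() also drops the NA
## [1]  5 -3  8
```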

Matrices

Matrices in R turn out to just be vectors with an additional attribute giving the dimensions.

Matrix creation

Most basic way of creating a matrix is with the matrix function.

It takes a vector giving the values that should go in the matrix plus number of rows and number of columns.

By default, the values are arranged in column-major order, but you can specify that the data should be filled in row-major order instead:

y <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
y

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

y <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
y

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

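Note that byrow = TRUE doesn’t change the way the matrix is stored behind the scenes: it only controls the order in which the values are filled in, and the result is still stored in column-major order.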
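
One way to see the storage order is to strip the matrix back down to a plain vector. A minimal sketch (by_col, by_row, and v are names made up for this example):

```r
by_col <- matrix(c(1, 2, 3, 4), nrow = 2)
by_row <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)

as.vector(by_col)   # underlying data, column by column
## [1] 1 2 3 4

as.vector(by_row)   # still column by column: byrow only changed the fill order
## [1] 1 3 2 4

# a matrix is a vector plus a dim attribute
v <- c(1, 2, 3, 4)
dim(v) <- c(2, 2)
identical(v, by_col)
## [1] TRUE
```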

You can specify just a number of rows or just a number of columns, and R will figure out what the dimensions of the matrix should be for you:

y <- matrix(c(1, 2, 3, 4), nrow = 2)
y

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

What we learned about vector recycling applies to matrices too:

y <- matrix(1:2, nrow = 2, ncol = 5)
y

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2

y <- matrix(1:3, nrow = 2, ncol = 5)

## Warning in matrix(1:3, nrow = 2, ncol = 5): data length [3] is not a
## sub-multiple or multiple of the number of rows [2]

y

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    2    1    3
## [2,]    2    1    3    2    1

What happens with matrix(1:3, nrow = 2)?

Matrix operations

These work pretty much the same way vector operations do.

With matrices we also get matrix multiplication, %*%, in addition to the operations available for vectors.

y <- matrix(c(1, 2, 3, 4), nrow = 2)
y

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

y %*% y

##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22

y * 3

##      [,1] [,2]
## [1,]    3    9
## [2,]    6   12

y + y

##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
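
Two things to watch for, sketched with the same y: * is element-wise rather than matrix multiplication, and a vector combined with a matrix is recycled down the columns.

```r
y <- matrix(c(1, 2, 3, 4), nrow = 2)

y * y              # element-wise product, not matrix multiplication
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16

y * c(10, 100)     # the vector is recycled down the columns
##      [,1] [,2]
## [1,]   10   30
## [2,]  200  400
```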

Scalar indexing for matrices

Getting sub-matrices is analogous to getting sub-vectors.

We still use [], but now we have two indices.

The syntax is matrix[rowIndices, colIndices].

rowIndices is a vector describing the rows you want to keep.
colIndices is a vector describing the columns you want to keep.
Leaving one empty means that you want to keep all the available indices.

Let’s see how this works:

z <- matrix(sample(1:12), nrow = 4)
z

##      [,1] [,2] [,3]
## [1,]   10    1    8
## [2,]    6    5   11
## [3,]    9    7    4
## [4,]    2   12    3

z[,2:3] ## extract the 2nd and 3rd columns

##      [,1] [,2]
## [1,]    1    8
## [2,]    5   11
## [3,]    7    4
## [4,]   12    3

z[2:3, 2] ## extract the 2nd and 3rd rows of the 2nd column

## [1] 5 7

z[,-1] ## negative subscripts work the same way as with vectors

##      [,1] [,2]
## [1,]    1    8
## [2,]    5   11
## [3,]    7    4
## [4,]   12    3

We can assign values to submatrices using this indexing as well:

y <- matrix(1:6, nrow = 3)
y

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

y[c(1,3),] <- matrix(c(1, 1, 8, 12), nrow = 2)
y

##      [,1] [,2]
## [1,]    1    8
## [2,]    2    5
## [3,]    1   12

Logical indexing for matrices

As with vectors, we can use logical indexing for matrices.

As a simple example, suppose we want just the rows of the matrix for which the second column is at least 3.

We could do that as follows:

x <- matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
x

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4

x[x[,2] >= 3, ]

##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4

What is happening?

x

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4

x[,2]

## [1] 2 3 4

x[,2] >= 3

## [1] FALSE  TRUE  TRUE

x[c(FALSE, TRUE, TRUE), ]

##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4

x[x[,2] >= 3, ]

##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4
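
We can also index with a logical matrix of the same shape, which selects individual entries instead of whole rows. A short sketch (big is a name made up for this example):

```r
x <- matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)

big <- x[x >= 3]   # the selected entries, returned as a plain vector
big
## [1] 3 3 4

x[x >= 3] <- 0     # the same index can be used for assignment
x
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    0
## [3,]    0    0
```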

Avoiding unintended dimension reduction

Notice what happens if we try to make a sub-matrix corresponding to just one row of x:

x

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4

r <- x[2,]
r

## [1] 2 3

What happened?

attributes(x)

## $dim
## [1] 3 2

attributes(r)

## NULL

str(x)

##  num [1:3, 1:2] 1 2 3 2 3 4

str(r)

##  num [1:2] 2 3

r is a vector!

drop = FALSE lets us avoid this behavior.

r <- x[2,,drop = FALSE]
r

##      [,1] [,2]
## [1,]    2    3

attributes(r)

## $dim
## [1] 1 2

str(r)

##  num [1, 1:2] 2 3
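
This matters most in code that expects a matrix back. The sketch below shows a few cases where the dimensions quietly disappear, including a logical index that happens to match only one row:

```r
x <- matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)

nrow(x[2, ])                          # NULL: a vector has no rows
nrow(x[2, , drop = FALSE])            # 1
dim(x[, 2, drop = FALSE])             # 3 1: a single column stays a matrix

# a logical index that matches one row drops dimensions too
dim(x[x[, 2] >= 4, ])                 # NULL
dim(x[x[, 2] >= 4, , drop = FALSE])   # 1 2
```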

Lists

Lists are technically vectors.
The vectors from before were called atomic vectors, which means that their components couldn’t be broken down into smaller components.
In R, the main purpose of lists is to lump together data of different types. Atomic vectors require all their elements to be of the same type, but lists can have any sort of elements (including lists!).

Creating lists

You can create a list with the list function…

j <- list(name = "Joe", salary = 55000, union = TRUE)
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE

The component names (tags) are not necessary, but it is good practice to use them.

list("Joe", 55000, TRUE)

## [[1]]
## [1] "Joe"
## 
## [[2]]
## [1] 55000
## 
## [[3]]
## [1] TRUE

List access

Suppose we want to get Joe’s salary. There are at least three different ways to do so:

j$salary

## [1] 55000

j[["salary"]]

## [1] 55000

j[[2]]

## [1] 55000

The double brackets [[]] allow us access to an element of the list.

Note that if we didn’t use the tags, we would only be able to access the salary using j[[2]].
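
One difference between the access methods is worth knowing: $ matches tags partially, while [[]] requires an exact match by default. A small sketch using a shortened tag, sal, chosen only for illustration:

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)

j$sal                        # $ allows partial matching of tags
## [1] 55000

j[["sal"]]                   # [[ ]] requires an exact match by default
## NULL

j[["sal", exact = FALSE]]    # partial matching can be switched on
## [1] 55000
```

Partial matching is convenient at the console but can hide typos, so spelling tags out in full is the safer habit in scripts.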

Sublist vs. Element of list

Very important: [[]] and [] are different operations:

[[]] gives an element of the list
[] gives a sublist

j[[2]]

## [1] 55000

class(j[[2]])

## [1] "numeric"

j[2]

## $salary
## [1] 55000

class(j[2])

## [1] "list"
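
The distinction shows up as soon as we try to compute with the result. A minimal sketch (res is a name made up for this example):

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)

j[[2]] * 2                       # works: j[[2]] is a number
## [1] 110000

# j[2] is a list, so arithmetic on it fails; we catch the error here
res <- tryCatch(j[2] * 2, error = function(e) "error")
res
## [1] "error"

# [] can pick out several components at once
names(j[c("name", "union")])
## [1] "name"  "union"
```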

List manipulation

We can add to a list by tag:

j$hobby <- "sailing"
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"

We can add to a list by index:

j[[5]] <- 1:5
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"
## 
## [[5]]
## [1] 1 2 3 4 5

We can remove an element from a list by setting it to NULL:

j[[5]] <- NULL
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
## 
## $hobby
## [1] "sailing"

j$hobby <- NULL
j

## $name
## [1] "Joe"
## 
## $salary
## [1] 55000
## 
## $union
## [1] TRUE
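
Because assigning NULL removes a component, storing an actual NULL in a list needs a different form. A short sketch of one way to do it, using single brackets and list(NULL):

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)

j$union <- NULL              # removes the component
length(j)
## [1] 2

j["union"] <- list(NULL)     # stores an actual NULL under the tag
length(j)
## [1] 3

is.null(j$union)
## [1] TRUE
```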

Extracting names and values

Names are easy; just use the names function:

names(j)

## [1] "name"   "salary" "union"

To get values, we use unlist.

This is another function you have to be careful with, because it doesn’t necessarily do what you think it will.

It returns an atomic vector.
That means all the elements have to be of the same type, and so data often gets coerced to a different type.
If we unlist j, we get back a character vector. This is essentially because R knows how to convert numbers and booleans to characters but doesn’t know how to convert characters to numbers or booleans.

unlist(j)

##    name  salary   union 
##   "Joe" "55000"  "TRUE"

See the text for the full coercion hierarchy.
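
A quick way to explore coercion yourself is to combine values of different types and ask for the type of the result. A small sketch with made-up values:

```r
typeof(c(TRUE, 2L))        # logical and integer
## [1] "integer"

typeof(c(2L, 2.5))         # integer and double
## [1] "double"

typeof(c(2.5, "a"))        # double and character
## [1] "character"

# without the character component, unlist gives a numeric vector
# and TRUE is converted to 1
typeof(unlist(list(salary = 55000, union = TRUE)))
## [1] "double"
```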

Note that unlist gives an atomic vector even for recursive lists (lists of lists), for example:

complex_list <- list(a = list(1), b = 1:5, c = list(a = 1, b = 2))
unlist(complex_list)

##   a  b1  b2  b3  b4  b5 c.a c.b 
##   1   1   2   3   4   5   1   2

Stat 610 Lecture 1: Basic Data Types