Julia Fukuyama
Reading: Matloff Chapter 1, 2, 3.1, 3.2, 3.4, 3.5, 3.6, 3.7, 4.1, 4.2, 4.3
Chapter 1: Don’t worry about understanding everything in detail, but it gives a good overview and will help you have a slot for everything later.
Also, you should probably not follow his advice on getting help. The real way is to google your problem + R and click on the first Stack Overflow link, or else ask ChatGPT.
It’s almost never a good idea to ask for help on the mailing list.
Goal today: Learn about the ways R stores data and how to access and manipulate data in those structures.
Why is this important?
R is object-oriented and functional (we’ll discuss this more later in the course).
This sometimes makes it seem like magic: the plot function does completely different things in different contexts. How does R know?
Data types for today:
Vectors
Matrices
Lists
Vectors are the fundamental data type in R. Scalars are actually just vectors of length 1.
seq makes a vector of values with arbitrary spacing:
## [1] 12 15 18 21 24 27 30
You can specify the length you want instead of the spacing:
## [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
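The calls behind the two outputs above are not shown. A minimal sketch that reproduces them, assuming the by and length.out arguments of seq were used (the variable names are made up for illustration):

```r
# Spacing of 3 between consecutive values, from 12 up to 30
by_three <- seq(12, 30, by = 3)
by_three

# Ten evenly spaced values from 1.1 to 2; R works out the spacing
ten_values <- seq(1.1, 2, length.out = 10)
ten_values
```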
You can repeat vectors instead of single numbers:
## [1] 1 5 7 1 5 7 1 5 7
Specifying each instead of times changes how the repeats work: the first element in the vector is repeated each times, then the second, then the third:
## [1] 1 1 1 5 5 5 7 7 7
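A sketch of rep calls that would give the two outputs above, assuming the repeated vector was c(1, 5, 7) (the names v, rep_times and rep_each are illustrative):

```r
v <- c(1, 5, 7)

# times: the whole vector is repeated three times
rep_times <- rep(v, times = 3)
rep_times   # 1 5 7 1 5 7 1 5 7

# each: every element is repeated three times before moving on
rep_each <- rep(v, each = 3)
rep_each    # 1 1 1 5 5 5 7 7 7
```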
Vectors can be added, subtracted, multiplied, etc.
If the vectors are the same size, this will be done element-wise.
## [1] 6 2 3
## [1] 5 0 -4
## [1] 0.2 0.5 -4.0
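The three outputs above are consistent with the following element-wise operations. This is a sketch: the operands are an assumption, chosen because they reproduce the printed values.

```r
x <- c(1, 2, 4)

sum_xy  <- x + c(5, 0, -1)   # 6 2 3
prod_xy <- x * c(5, 0, -1)   # 5 0 -4
quot_xy <- x / c(5, 4, -1)   # 0.2 0.5 -4.0

sum_xy
prod_xy
quot_xy
```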
What if the vectors are not the same size?
Remember that scalars in R are just vectors, so in the following code we are actually adding two vectors:
## [1] 3 4 6
This is weird and important!
If a vector operation requires the vectors to be the same length, R automatically repeats the shorter one until it is long enough to match the longer one.
That is what happened in the example above: the longer vector, x, had length 3, and the shorter vector, 2, had length 1. What really happened was more like
## [1] 3 4 6
## [1] 3 4 6
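A sketch of the scalar example, assuming x is c(1, 2, 4) (which matches the printed result). The second line spells out the recycling by hand:

```r
x <- c(1, 2, 4)

implicit <- x + 2            # 2 is recycled to length 3
explicit <- x + c(2, 2, 2)   # what R effectively computes

implicit   # 3 4 6
explicit   # 3 4 6
```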
This makes sense for adding a vector and a scalar (= vector of length 1).
It can give results you’re not expecting if you have vectors of different lengths:
## Warning in c(1, 2, 4) + c(6, 0, 9, 20, 22): longer object length is not a
## multiple of shorter object length
## [1] 7 2 13 21 24
## [1] 7 2 13 21 24
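The operands of the first sum appear in the warning message above. The second line is a sketch of the hand-recycled equivalent, which is an assumption about what the second output shows:

```r
short <- c(1, 2, 4)
long  <- c(6, 0, 9, 20, 22)

# Lengths 3 and 5: R recycles short to 1 2 4 1 2 and warns
implicit <- short + long

# The same sum with the recycling written out, so there is no warning
explicit <- c(1, 2, 4, 1, 2) + long

implicit   # 7 2 13 21 24
explicit   # 7 2 13 21 24
```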
If the longer vector’s length is not a multiple of the shorter vector’s length, R will warn you about the recycling. Otherwise, it recycles without complaint:
## [1] 2 4 7 9
## [1] 2 4 7 9
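The code for this output is not shown. One pair of vectors that would produce it (purely an assumption) is a length-2 vector added to a length-4 vector:

```r
# Length 2 goes evenly into length 4, so R recycles silently
implicit <- c(1, 2) + c(1, 2, 6, 7)

# The recycling written out by hand
explicit <- c(1, 2, 1, 2) + c(1, 2, 6, 7)

implicit   # 2 4 7 9
explicit   # 2 4 7 9
```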
>, <, <=, >=, == are all vector operations and work like the arithmetic operators:
## [1] TRUE TRUE FALSE
## [1] TRUE TRUE TRUE
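The comparisons that produced these outputs are not shown. A sketch with made-up operands that gives the same two results:

```r
x <- c(1, 2, 4)

# The scalar on the right is recycled, just as with + or *
below_three  <- x < 3    # TRUE TRUE FALSE
at_least_one <- x >= 1   # TRUE TRUE TRUE

below_three
at_least_one
```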
all and any work on boolean vectors (vectors of TRUE and FALSE) and do what their names imply: all checks whether every element is TRUE, and any checks whether at least one element is TRUE.
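A small illustration of any and all; the vector x here is made up for the example:

```r
x <- 1:10

any_big   <- any(x > 8)    # TRUE: 9 and 10 are greater than 8
any_huge  <- any(x > 88)   # FALSE: no element is greater than 88
all_huge  <- all(x > 88)   # FALSE
all_plus  <- all(x > 0)    # TRUE: every element is positive

c(any_big, any_huge, all_huge, all_plus)
```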
We can’t use == to test whether two vectors are the same, because == will give us a boolean vector.
## [1] TRUE FALSE FALSE
There are two ways around this, all and identical:
## [1] FALSE
## [1] FALSE
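A sketch of the three comparisons above. The vectors x and y are assumptions chosen to match the printed results:

```r
x <- c(1, 2, 4)
y <- c(1, 3, 5)

elementwise <- x == y            # TRUE FALSE FALSE: one answer per element
same_all    <- all(x == y)       # FALSE: a single answer
same_ident  <- identical(x, y)   # FALSE: a single answer

# identical also compares types: 1:3 is integer, c(1, 2, 3) is double
ident_types <- identical(1:3, c(1, 2, 3))   # FALSE
all_types   <- all(1:3 == c(1, 2, 3))       # TRUE
```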
To get a subvector, we can use [].
The format is vector1[vector2].
The rules:
If vector2 is a numeric vector, it is interpreted as the indices we want to pick out of vector1. This is scalar indexing.
If vector2 is a boolean (TRUE/FALSE) vector, the locations of TRUE values in vector2 are interpreted as the indices we want to retain from vector1. This is logical indexing.
Logical indexing is a vectorized operation, and so if vector2 doesn’t have the same length as vector1, vector2 will be recycled to match the length of vector1.
For example: scalar indexing
## [1] 1.2 0.4
## [1] 3.9 0.4
## [1] 0.40 0.12
We are allowed to repeat indices:
## [1] 1.20 3.90 0.40 0.12
## [1] 1.2 1.2 1.2 0.4 0.4 0.4
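The indexing calls for the outputs above are not shown. A sketch that reproduces them, assuming the vector is the one printed above (the names y and v are illustrative):

```r
y <- c(1.2, 3.9, 0.4, 0.12)

first_third <- y[c(1, 3)]   # elements 1 and 3: 1.2 0.4
second_third <- y[2:3]      # elements 2 and 3: 3.9 0.4

v <- 3:4
by_variable <- y[v]         # the index can be stored in a variable: 0.40 0.12

# Indices may be repeated
repeated <- y[c(1, 1, 1, 3, 3, 3)]   # 1.2 1.2 1.2 0.4 0.4 0.4
```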
To exclude indices instead of including them, we can use -:
## [1] 12 13
## [1] 13
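A sketch of negative indexing that matches the two outputs above, assuming the vector was c(5, 12, 13):

```r
z <- c(5, 12, 13)

drop_first <- z[-1]       # everything except element 1: 12 13
drop_two   <- z[-1:-2]    # everything except elements 1 and 2: 13

drop_first
drop_two
```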
A simple example:
## [1] 1.20 3.90 0.40 0.12
## [1] 1.2 3.9
## [1] 1.2 0.4
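A sketch of logical indexing that reproduces the outputs above. The exact boolean vectors are assumptions; the second one is shorter than y, so it gets recycled:

```r
y <- c(1.2, 3.9, 0.4, 0.12)

# Same length as y: keep the positions where the index is TRUE
first_two <- y[c(TRUE, TRUE, FALSE, FALSE)]   # 1.2 3.9

# Length 2: recycled to TRUE FALSE TRUE FALSE
every_other <- y[c(TRUE, FALSE)]              # 1.2 0.4
```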
Logical indexing is useful when we want a sub-vector but don’t know in advance which elements we want.
As a contrived example, suppose we want to extract just the elements of the vector whose square is greater than 8.
We could use the following code to do so:
## [1] 5 -3 8
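The code is not shown above. A sketch that gives this output, assuming the vector was c(5, 2, -3, 8):

```r
z <- c(5, 2, -3, 8)

# z^2 > 8 is itself a boolean vector: TRUE FALSE TRUE TRUE
keep <- z^2 > 8

big_squares <- z[keep]
big_squares   # 5 -3 8
```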
Another simple way to extract only some of the values of a vector is with subset:
## [1] 5 -3 8
## [1] 5 -3 8
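A sketch comparing subset with plain logical indexing, again assuming z is c(5, 2, -3, 8). The NA example at the end is an addition for illustration: the two approaches differ when the condition evaluates to NA.

```r
z <- c(5, 2, -3, 8)

with_subset  <- subset(z, z^2 > 8)   # 5 -3 8
with_bracket <- z[z^2 > 8]           # 5 -3 8

# With missing values, [] keeps an NA but subset drops it
x_na <- c(6, 1, 2, 3, NA, 12)
bracket_na <- x_na[x_na > 5]          # 6 NA 12
subset_na  <- subset(x_na, x_na > 5)  # 6 12
```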
We can also use which. which just gives us the positions at which the condition holds, and we can use those positions to get the relevant subvector:
## [1] 1 3 4
## [1] 5 -3 8
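A sketch of the which approach that matches the two outputs above, assuming the same z as before:

```r
z <- c(5, 2, -3, 8)

# Positions where the condition is TRUE
positions <- which(z^2 > 8)
positions      # 1 3 4

# Those positions are ordinary numeric indices
z[positions]   # 5 -3 8
```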
Matrices in R turn out to just be vectors with an additional attribute giving the dimensions.
The most basic way of creating a matrix is with the matrix function.
It takes a vector giving the values that should go in the matrix, plus the number of rows and the number of columns.
By default, the values are arranged in column-major order, but you can specify that the data is coming in in row-major order instead:
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
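A sketch of the two matrix calls that would produce the outputs above, assuming the data vector was 1:4 (the variable names are illustrative):

```r
# Default: fill column by column (column-major order)
by_col <- matrix(1:4, nrow = 2, ncol = 2)
by_col

# byrow = TRUE: fill row by row instead
by_row <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
by_row
```

The two results are transposes of each other, so t(by_col) gives the same matrix as by_row.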
Note that byrow = TRUE doesn’t change the way the matrix is stored behind the scenes: it is still stored in column-major order.
You can specify just a number of rows or just a number of columns, and R will figure out what the dimensions of the matrix should be for you:
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
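A sketch of giving only one dimension, assuming the data vector 1:4. Whether the original call used nrow or ncol is not shown; both give the same 2 by 2 matrix here:

```r
rows_only <- matrix(1:4, nrow = 2)   # R infers ncol = 2
cols_only <- matrix(1:4, ncol = 2)   # R infers nrow = 2

dim(rows_only)   # 2 2
dim(cols_only)   # 2 2
```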
What we learned about vector recycling applies to matrices too:
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2
## Warning in matrix(1:3, nrow = 2, ncol = 5): data length [3] is not a
## sub-multiple or multiple of the number of rows [2]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    2    1    3
## [2,]    2    1    3    2    1
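The second call appears in the warning message above. The first call is an assumption: a length-2 data vector is the simplest input that gives the first output.

```r
# 2 values recycled to fill 10 cells, column by column, with no warning
m_even <- matrix(1:2, nrow = 2, ncol = 5)
m_even

# 3 values do not fit evenly, so R recycles and warns
m_odd <- matrix(1:3, nrow = 2, ncol = 5)
m_odd
```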
What happens with matrix(1:3, nrow = 2)?
Matrix operations work pretty much the same way vector operations do.
With matrices we get matrix multiplication, %*%, in addition to the operations available for vectors.
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22
##      [,1] [,2]
## [1,]    3    9
## [2,]    6   12
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
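The four outputs above are consistent with the operations below. This is a sketch: the variable name y and the exact expressions are assumptions that reproduce the printed values.

```r
y <- matrix(1:4, nrow = 2)
y

mat_prod <- y %*% y   # matrix multiplication: rows times columns
scaled   <- 3 * y     # the scalar is recycled over every entry
doubled  <- y + y     # element-wise addition

mat_prod
scaled
doubled
```

Note that y * y (without the percent signs) would be element-wise multiplication, not the matrix product.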
Getting sub-matrices is analogous to getting sub-vectors.
We still use [], but now we have two indices.
The syntax is matrix[rowIndices, colIndices].
rowIndices is a vector describing the rows you want to keep.
colIndices is a vector describing the columns you want to keep.
Leaving one empty means that you want to keep all the available indices.
Let’s see how this works:
##      [,1] [,2] [,3]
## [1,]   10    1    8
## [2,]    6    5   11
## [3,]    9    7    4
## [4,]    2   12    3
##      [,1] [,2]
## [1,]    1    8
## [2,]    5   11
## [3,]    7    4
## [4,]   12    3
## [1] 5 7
##      [,1] [,2]
## [1,]    1    8
## [2,]    5   11
## [3,]    7    4
## [4,]   12    3
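A sketch of indexing calls that reproduce the outputs above. The matrix values are taken from the first output; the index expressions, in particular the negative index for the last one, are assumptions.

```r
# Values listed column by column
x <- matrix(c(10, 6, 9, 2,  1, 5, 7, 12,  8, 11, 4, 3), nrow = 4)
x

cols_23  <- x[, 2:3]    # all rows, columns 2 and 3
mid_rows <- x[2:3, 2]   # rows 2 and 3 of column 2: 5 7
no_col1  <- x[, -1]     # all rows, every column except the first

cols_23
mid_rows
no_col1
```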
We can assign values to submatrices using this indexing as well:
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
##      [,1] [,2]
## [1,]    1    8
## [2,]    2    5
## [3,]    1   12
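A sketch of an assignment that turns the first matrix above into the second. The replacement values are an assumption that matches the printed result:

```r
y <- matrix(1:6, nrow = 3)
y

# Overwrite rows 1 and 3 with the rows of a 2 x 2 matrix
y[c(1, 3), ] <- matrix(c(1, 1, 8, 12), nrow = 2)
y
```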
As with vectors, we can use logical indexing for matrices.
As a simple example, suppose we want just the rows of the matrix for which the second column is at least 3.
We could do that as follows:
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4
##      [,1] [,2]
## [1,]    2    3
## [2,]    3    4
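A sketch of the row filter described above, with x built from the values in the first output:

```r
x <- matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
x

# x[, 2] >= 3 is FALSE TRUE TRUE, one value per row
keep_rows <- x[, 2] >= 3

filtered <- x[keep_rows, ]
filtered
```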
Notice what happens if we try to make a sub-matrix corresponding to just one row of x:
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4
## [1] 2 3
What happened?
## $dim
## [1] 3 2
## NULL
## num [1:3, 1:2] 1 2 3 2 3 4
## num [1:2] 2 3
r is a vector!
drop = FALSE lets us avoid this behavior.
##      [,1] [,2]
## [1,]    2    3
## $dim
## [1] 1 2
## num [1, 1:2] 2 3
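A sketch of the whole sequence above, assuming the inspection was done with attributes and str (which is what the printed output looks like). The name r_kept is made up for illustration.

```r
x <- matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)

# Extracting a single row silently drops the dim attribute
r <- x[2, ]
attributes(x)   # has $dim
attributes(r)   # NULL: r is a plain vector
str(x)
str(r)

# drop = FALSE keeps the result as a 1 x 2 matrix
r_kept <- x[2, , drop = FALSE]
attributes(r_kept)
str(r_kept)
```

This matters in code that later calls nrow, ncol or %*% on the result and expects a matrix.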
Lists are technically vectors.
The vectors from before were called atomic vectors, which means that their components couldn’t be broken down into smaller components.
In R, the main purpose of lists is to lump together data of different types. Atomic vectors require all their elements to be of the same type, but lists can have any sort of elements (including lists!).
You can create a list with the list function…
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
The component names (tags) are not necessary, but it is good practice to use them.
## [[1]]
## [1] "Joe"
##
## [[2]]
## [1] 55000
##
## [[3]]
## [1] TRUE
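A sketch of the two list calls that would print the outputs above. The name j is used later in the text; j_untagged is made up for illustration.

```r
# With tags: components can be referred to by name
j <- list(name = "Joe", salary = 55000, union = TRUE)
j

# Without tags: components are only reachable by position
j_untagged <- list("Joe", 55000, TRUE)
j_untagged
```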
Suppose we want to get Joe’s salary. There are at least three different ways to do so:
## [1] 55000
## [1] 55000
## [1] 55000
The double brackets [[]] allow us access to an element of the list.
Note that if we didn’t use the tags, we would only be able to access the salary using j[[2]].
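A sketch of three ways to get the salary, assuming j is the tagged list from before. The last line is an addition for contrast: single brackets return a sub-list rather than the element itself.

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)

by_dollar <- j$salary        # by tag, with $
by_name   <- j[["salary"]]   # by tag, with double brackets
by_index  <- j[[2]]          # by position

# Single brackets give back a list of length 1, not the number
as_sublist <- j["salary"]
```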
We can add to a list by tag:
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
We can add to a list by index:
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
##
## [[5]]
## [1] 1 2 3 4 5
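A sketch of the two additions shown above, assuming j starts as the three-component list:

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)

# Add by tag
j$hobby <- "sailing"

# Add by index: the new component has no tag
j[[5]] <- 1:5

j
```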
We can remove an element from a list by setting it to NULL:
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
##
## $hobby
## [1] "sailing"
## $name
## [1] "Joe"
##
## $salary
## [1] 55000
##
## $union
## [1] TRUE
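A sketch of the two removals shown above, assuming j is the five-component list built earlier:

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)
j$hobby <- "sailing"
j[[5]] <- 1:5

# Remove the untagged fifth component
j[[5]] <- NULL
j

# Remove a component by tag
j$hobby <- NULL
j
```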
Names are easy; just use the names function:
## [1] "name" "salary" "union"
To get values, we use unlist.
This is another function you have to be careful with, because it doesn’t necessarily do what you think it will.
It returns an atomic vector.
That means all the elements have to be of the same class, and so data often gets coerced to a different type.
If we unlist j, we get back a character vector. This is essentially because R knows how to convert numbers and booleans to characters but doesn’t know how to convert characters to numbers or booleans.
##    name  salary   union
##   "Joe" "55000"  "TRUE"
You can see the text for the coercion hierarchy.
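A sketch of getting the names and the values out of j, assuming the three-component list from before (the names tags and values are illustrative):

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)

tags <- names(j)
tags            # "name" "salary" "union"

values <- unlist(j)
values          # everything has been coerced to character
class(values)   # "character"
```

If the list holds only numbers and booleans, unlist gives a numeric vector instead, so it is worth checking class on the result before using it.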