Reading: Hadley Wickham, “Tidy Data” and “Reshaping data with the reshape package”
Agenda for today:
Reshaping/tidy data/wide vs. long format
Merging
We usually want our data in the following form:
In a rectangular data frame
One row per sample or observation, one column per variable
Data don’t always come this way!
Even if the data do satisfy the “one row per observation” rule for one analysis, they don’t necessarily do so for another, and we often need to change the “shape” of the data.
Three concepts:
Values
Variables
Observations
Datasets contain values, each of which belongs to a variable and an observation.
Datasets can encode these in a lot of different ways. The “tidy” way is to have
Each variable a column
Each observation a row
Each cell a value
This is usually the way that other functions want the data to come in.
For example:
##                         <$10k $10k-$20k $20k-$30k $30k-$40k $40k-$50k $50k-75k
## Agnostic                   27        34        60        81        76      137
## Atheist                    12        27        37        52        35       70
## Buddhist                   27        21        30        34        33       58
## Catholic                  418       617       732       670       638     1116
## Don't know/refused         15        14        15        11        10       35
## Evangelical Prot          575       869      1064       982       881     1486
## Hindu                       1         9         7         9        11       34
## Historically Black Prot   228       244       236       238       197      223
## Jehovah's Witness          20        27        24        24        21       30
## Jewish                     19        19        25        25        30       95
This table is easy to read, but the variables are encoded in the row and column names, which makes it hard to use with modeling functions in R.
This data table contains all the same information:
## religion income count
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don't know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
## 7 Hindu <$10k 1
## 8 Historically Black Prot <$10k 228
## 9 Jehovah's Witness <$10k 20
## 10 Jewish <$10k 19
## 11 Agnostic $10k-$20k 34
## 12 Atheist $10k-$20k 27
## 13 Buddhist $10k-$20k 21
## 14 Catholic $10k-$20k 617
## 15 Don't know/refused $10k-$20k 14
## 16 Evangelical Prot $10k-$20k 869
## 17 Hindu $10k-$20k 9
## 18 Historically Black Prot $10k-$20k 244
## 19 Jehovah's Witness $10k-$20k 27
## 20 Jewish $10k-$20k 19
## 21 Agnostic $20k-$30k 60
## 22 Atheist $20k-$30k 37
## 23 Buddhist $20k-$30k 30
## 24 Catholic $20k-$30k 732
## 25 Don't know/refused $20k-$30k 15
## 26 Evangelical Prot $20k-$30k 1064
## 27 Hindu $20k-$30k 7
## 28 Historically Black Prot $20k-$30k 236
## 29 Jehovah's Witness $20k-$30k 24
## 30 Jewish $20k-$30k 25
## 31 Agnostic $30k-$40k 81
## 32 Atheist $30k-$40k 52
## 33 Buddhist $30k-$40k 34
## 34 Catholic $30k-$40k 670
## 35 Don't know/refused $30k-$40k 11
## 36 Evangelical Prot $30k-$40k 982
## 37 Hindu $30k-$40k 9
## 38 Historically Black Prot $30k-$40k 238
## 39 Jehovah's Witness $30k-$40k 24
## 40 Jewish $30k-$40k 25
## 41 Agnostic $40k-$50k 76
## 42 Atheist $40k-$50k 35
## 43 Buddhist $40k-$50k 33
## 44 Catholic $40k-$50k 638
## 45 Don't know/refused $40k-$50k 10
## 46 Evangelical Prot $40k-$50k 881
## 47 Hindu $40k-$50k 11
## 48 Historically Black Prot $40k-$50k 197
## 49 Jehovah's Witness $40k-$50k 21
## 50 Jewish $40k-$50k 30
## 51 Agnostic $50k-75k 137
## 52 Atheist $50k-75k 70
## 53 Buddhist $50k-75k 58
## 54 Catholic $50k-75k 1116
## 55 Don't know/refused $50k-75k 35
## 56 Evangelical Prot $50k-75k 1486
## 57 Hindu $50k-75k 34
## 58 Historically Black Prot $50k-75k 223
## 59 Jehovah's Witness $50k-75k 30
## 60 Jewish $50k-75k 95
It’s harder to look at this table, but easier to manipulate programmatically.
We’ll see melt in a couple of slides.
Transforming these datasets into one another is called “reshaping”, and pretty much all reshaping can be done with a combination of two operations: melting and casting.
Melting: Taking a wide dataset and making it long.
Casting: Taking a melted dataset and making it wide.
The mnemonic is that you first “melt” the dataset and then you “cast” it. The “molten” data isn’t usually the final form you want, but it allows you to change the data into other shapes easily.
The melt function takes data from wide form to long form.
Conceptually, we have two sets of variables:
Identification variables, which describe the observations, and
Measurement variables, which are the quantities measured on those observations.
To melt a dataset, you need to decide what these are.
What does a molten data table look like?
Columns (potentially more than one) for id variables.
If there is more than one measured variable, one column describing the variable measured.
One column for the value of the measured variable on the corresponding observation.
Here id variables are income and religion, and the measurement is the count of each.
Since we only have one measurement variable, we don’t need an extra column describing what the measured variable is.
## Var1 Var2 value
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don't know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
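The generic column names Var1 and Var2 above are what melt produces when it is applied to a matrix or table whose dimensions are unnamed. A minimal sketch of such a call, assuming the wide table above is stored in a matrix called pew_wide (a name used here only for illustration):
library(reshape2)
# melting a matrix/table: the two dimensions become Var1 and Var2 (rows and
# columns of the original table), and the cell contents become the value column
head(melt(pew_wide))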
Based on what we said before, we will need to provide data, id variables, and measurement variables, but in practice we have two distinct situations:
Matrix or array input: here we assume that the id variables are the dimensions of the matrix or array, and the measurement variables are the elements of the matrix or the array. Therefore, we don’t actually specify id variables or measurement variables.
Data frame input: id variables and measurement variables have to be specified.
Let’s see an example.
## time treatment subject rep potato buttery grassy rancid painty
## 61 1 1 3 1 2.9 0.0 0.0 0.0 5.5
## 25 1 1 3 2 14.0 0.0 0.0 1.1 0.0
## 62 1 1 10 1 11.0 6.4 0.0 0.0 0.0
## 26 1 1 10 2 9.9 5.9 2.9 2.2 0.0
## 63 1 1 15 1 1.2 0.1 0.0 1.1 5.1
## 27 1 1 15 2 8.8 3.0 3.6 1.5 2.3
We want to melt this data frame so that
id.vars are time, treatment, subject, rep
measure.vars are the remainder: potato, buttery, grassy, rancid, painty
head(melt(french_fries, id.vars = c("time", "treatment", "subject", "rep"),
measure.vars = c("potato", "buttery", "grassy", "rancid", "painty")))
## time treatment subject rep variable value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
You can specify just one of id.vars and measure.vars, in which case the function will assume that all the other variables belong to the other class.
You can also specify the variables by number instead of by name.
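A guess at the calls that could produce the two (identical) outputs below, assuming reshape2 is loaded and the french_fries data is available:
# specify only id.vars; measure.vars defaults to all remaining variables
head(melt(french_fries, id.vars = c("time", "treatment", "subject", "rep")))
# specify the id variables by position instead of by name
head(melt(french_fries, id.vars = 1:4))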
## time treatment subject rep variable value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
## time treatment subject rep variable value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
If you don’t specify either, the function will assume that all the factor variables should be id variables and all of the numeric variables should be measured variables.
## .id V1
## 1 time factor
## 2 treatment factor
## 3 subject factor
## 4 rep numeric
## 5 potato numeric
## 6 buttery numeric
## 7 grassy numeric
## 8 rancid numeric
## 9 painty numeric
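The message and output that follow are what melt prints when neither id.vars nor measure.vars is supplied; a call like this (a guess at the original) would produce them:
# no id.vars or measure.vars: factors become id variables, numerics become measured
head(melt(french_fries))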
## Using time, treatment, subject as id variables
## time treatment subject variable value
## 1 1 1 3 rep 1
## 2 1 1 3 rep 2
## 3 1 1 10 rep 1
## 4 1 1 10 rep 2
## 5 1 1 15 rep 1
## 6 1 1 15 rep 2
You can make the output slightly nicer by specifying variable names and value names:
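A sketch of a call that would give the output below, using melt's variable.name and value.name arguments (the id variables are assumed to be the same as before):
head(melt(french_fries, id.vars = 1:4,
          variable.name = "flavor", value.name = "flavor_intensity"))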
## time treatment subject rep flavor flavor_intensity
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
Assume that the id variables are the dimensions of the array
Assume the measured variable is stored as elements in the array.
We don’t specify id variables or measurement variables.
Therefore, the syntax is simply melt(array)
For example:
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
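The array above appears to be R's built-in HairEyeColor dataset; melting it, as sketched below, would give the long table that follows:
# melt an array/table: the dimension names become columns, the cells become "value"
melt(HairEyeColor)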
## Hair Eye Sex value
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
## 7 Red Blue Male 10
## 8 Blond Blue Male 30
## 9 Black Hazel Male 10
## 10 Brown Hazel Male 25
## 11 Red Hazel Male 7
## 12 Blond Hazel Male 5
## 13 Black Green Male 3
## 14 Brown Green Male 15
## 15 Red Green Male 7
## 16 Blond Green Male 8
## 17 Black Brown Female 36
## 18 Brown Brown Female 66
## 19 Red Brown Female 16
## 20 Blond Brown Female 4
## 21 Black Blue Female 9
## 22 Brown Blue Female 34
## 23 Red Blue Female 7
## 24 Blond Blue Female 64
## 25 Black Hazel Female 5
## 26 Brown Hazel Female 29
## 27 Red Hazel Female 7
## 28 Blond Hazel Female 5
## 29 Black Green Female 2
## 30 Brown Green Female 14
## 31 Red Green Female 7
## 32 Blond Green Female 8
There are a couple of optional arguments for making the output a little nicer, like giving a different name to the column in the output describing the measured value:
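For example, value.name renames the column that holds the measured values. A sketch, assuming the same HairEyeColor array as above:
melt(HairEyeColor, value.name = "number")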
## Hair Eye Sex number
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
## 7 Red Blue Male 10
## 8 Blond Blue Male 30
## 9 Black Hazel Male 10
## 10 Brown Hazel Male 25
## 11 Red Hazel Male 7
## 12 Blond Hazel Male 5
## 13 Black Green Male 3
## 14 Brown Green Male 15
## 15 Red Green Male 7
## 16 Blond Green Male 8
## 17 Black Brown Female 36
## 18 Brown Brown Female 66
## 19 Red Brown Female 16
## 20 Blond Brown Female 4
## 21 Black Blue Female 9
## 22 Brown Blue Female 34
## 23 Red Blue Female 7
## 24 Blond Blue Female 64
## 25 Black Hazel Female 5
## 26 Brown Hazel Female 29
## 27 Red Hazel Female 7
## 28 Blond Hazel Female 5
## 29 Black Green Female 2
## 30 Brown Green Female 14
## 31 Red Green Female 7
## 32 Blond Green Female 8
reshape2 uses *cast: either dcast for data frame output or acast for array output.
Syntax: *cast(mdata, row_var_1 + ... + row_var_n ~ col_var_1 + ... + col_var_m, value.var)
mdata should be a molten dataset.
The rows of the output will be all the combinations of row_var_1, …, row_var_n, and these variables also appear as identifier columns in the output.
The columns of the output will be all the combinations of col_var_1, …, col_var_m.
value.var is the value variable, the values that should go in the cells of the output. If you don’t specify it, dcast will guess.
Example
## time treatment subject rep flavor value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
cast_ex_1 = dcast(ff_m, time + treatment ~ subject + rep + flavor, value.var = "value")
cast_ex_1[1:5,1:5]
## time treatment 3_1_potato 3_1_buttery 3_1_grassy
## 1 1 1 2.9 0.0 0.0
## 2 1 2 13.9 0.0 0.0
## 3 1 3 14.1 0.0 0.0
## 4 2 1 9.0 0.3 0.1
## 5 2 2 14.1 0.9 0.3
## check to see if the numbers match
subset(ff_m, time == 1 & subject == "3" & rep == 1 & flavor == "potato")
## time treatment subject rep flavor value
## 1 1 1 3 1 potato 2.9
## 25 1 2 3 1 potato 13.9
## 49 1 3 3 1 potato 14.1
When you cast data, you often don’t use all of the variables.
This means that each element of the cast table will correspond to more than one measurement, and so they need to be aggregated in some way.
## time treatment subject rep flavor value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
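A call along these lines (a guess at the original, casting the molten ff_m without supplying fun.aggregate) would produce the message and table below:
dcast(ff_m, subject ~ flavor, value.var = "value")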
## Aggregation function missing: defaulting to length
## subject potato buttery grassy rancid painty
## 1 3 54 54 54 54 54
## 2 10 60 60 60 60 60
## 3 15 59 59 59 59 59
## 4 16 60 59 60 60 60
## 5 19 60 60 60 60 60
## 6 31 54 54 54 54 54
## 7 51 60 60 60 60 60
## 8 52 60 60 60 60 60
## 9 63 60 60 60 60 60
## 10 78 60 60 60 60 60
## 11 79 54 52 54 54 53
## 12 86 54 54 54 54 54
The aggregation function is specified with fun.aggregate:
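For example, the table of per-subject means below could be produced with something like the following (this assumes ff_m contains no missing values, since no na.rm argument is passed):
dcast(ff_m, subject ~ flavor, value.var = "value", fun.aggregate = mean)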
## subject potato buttery grassy rancid painty
## 1 3 6.083333 0.5759259 0.1296296 2.7000000 2.818518519
## 2 10 9.993333 6.7266667 0.4016667 3.0933333 0.961666667
## 3 15 3.908475 1.0084746 0.4000000 2.9389831 2.566101695
## 4 16 6.601667 3.1067797 0.9783333 3.5733333 0.746666667
## 5 19 8.921667 2.4100000 1.7416667 6.0016667 3.611666667
## 6 31 8.635185 0.5703704 0.1388889 6.1907407 4.466666667
## 7 51 10.293333 3.1883333 1.3233333 4.9116667 2.251666667
## 8 52 5.350000 0.8983333 0.9400000 3.8900000 2.500000000
## 9 63 7.750000 0.0650000 0.0450000 5.7766667 3.770000000
## 10 78 3.800000 0.5783333 0.6533333 1.4133333 3.245000000
## 11 79 7.911111 0.5173077 0.2388889 0.9259259 0.009433962
## 12 86 4.014815 1.8222222 0.8444444 4.7074074 3.325925926
A couple of other notes:
You can use ... to represent all the other variables in the formula.
The order in which you specify the variables in the formula matters for the way the output is ordered: the first variables change most slowly, and the later ones change more quickly.
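For example, the two previews below (same content, different row order) could come from calls like these, which differ only in the order of the variables on the left-hand side of the formula; only the first rows of each result are shown:
# rows ordered by subject, then treatment
dcast(ff_m, subject + treatment ~ flavor, value.var = "value", fun.aggregate = mean)
# rows ordered by treatment, then subject
dcast(ff_m, treatment + subject ~ flavor, value.var = "value", fun.aggregate = mean)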
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## 7 15 1 3.360000 0.7200000 0.42000000 3.965000 3.260000
## 8 15 2 4.405000 1.3150000 0.34000000 2.285000 2.060000
## 9 15 3 3.963158 0.9894737 0.44210526 2.547368 2.368421
## 10 16 1 6.495000 3.2600000 0.75500000 4.120000 1.230000
## treatment subject potato buttery grassy rancid painty
## 1 1 3 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 1 10 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 3 1 15 3.360000 0.7200000 0.42000000 3.965000 3.260000
## 4 1 16 6.495000 3.2600000 0.75500000 4.120000 1.230000
## 5 1 19 9.385000 3.0550000 2.02000000 5.360000 2.775000
## 6 1 31 8.844444 0.4444444 0.08888889 5.944444 3.211111
## 7 1 51 10.675000 2.6400000 1.05000000 5.150000 1.955000
## 8 1 52 5.060000 0.8050000 0.87500000 4.285000 2.645000
## 9 1 63 6.775000 0.0250000 0.00000000 6.055000 3.855000
## 10 1 78 3.620000 0.7350000 0.54000000 1.505000 3.490000
The pipe operator, %>%, allows you to change the way in which you call functions.
x %>% fun1() means “pipe x to fun1”.
x %>% fun1 is the same as fun1(x).
Most useful when piping several functions together.
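A sketch of how the pipe might be used with the functions from this lecture, assuming the magrittr/dplyr pipe is available and ff_m is the molten french fries data from above; the piped and nested versions are equivalent:
library(magrittr)  # provides %>% (also re-exported by dplyr)

# piped version
ff_m %>%
  dcast(subject + treatment ~ flavor, value.var = "value", fun.aggregate = mean) %>%
  head()

# equivalent nested version
head(dcast(ff_m, subject + treatment ~ flavor,
           value.var = "value", fun.aggregate = mean))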
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## 7 15 1 3.360000 0.7200000 0.42000000 3.965000 3.260000
## 8 15 2 4.405000 1.3150000 0.34000000 2.285000 2.060000
## 9 15 3 3.963158 0.9894737 0.44210526 2.547368 2.368421
## 10 16 1 6.495000 3.2600000 0.75500000 4.120000 1.230000
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
Final topic: What if you have data from two different places and you need to put them together?
Basic syntax: merge(x, y, by.x, by.y)
x and y are the two datasets you want to merge.
by.x is the column of x to merge on.
by.y is the column of y to merge on.
Example:
cities = data.frame(
  city = c('New York', 'Boston', 'Juneau',
           'Anchorage', 'San Diego',
           'Philadelphia', 'Los Angeles',
           'Fairbanks', 'Ann Arbor', 'Seattle'),
  state.abb = c('NY', 'MA', 'AK', 'AK', 'CA',
                'PA', 'CA', 'AK', 'MI', 'WA'))
# state.name and state.abb are R's built-in vectors of US state names and abbreviations
states = data.frame(state.name, state.abb)
cities
## city state.abb
## 1 New York NY
## 2 Boston MA
## 3 Juneau AK
## 4 Anchorage AK
## 5 San Diego CA
## 6 Philadelphia PA
## 7 Los Angeles CA
## 8 Fairbanks AK
## 9 Ann Arbor MI
## 10 Seattle WA
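The first rows of states, presumably printed with something like:
head(states)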
## state.name state.abb
## 1 Alabama AL
## 2 Alaska AK
## 3 Arizona AZ
## 4 Arkansas AR
## 5 California CA
## 6 Colorado CO
We want to add the state name to the cities data frame, and we can use merge.
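Judging from the column order of the output below, a call like the following would produce it (a guess at the original; listing states first puts its columns before city):
merge(states, cities, by = "state.abb")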
## state.abb state.name city
## 1 AK Alaska Juneau
## 2 AK Alaska Anchorage
## 3 AK Alaska Fairbanks
## 4 CA California San Diego
## 5 CA California Los Angeles
## 6 MA Massachusetts Boston
## 7 MI Michigan Ann Arbor
## 8 NY New York New York
## 9 PA Pennsylvania Philadelphia
## 10 WA Washington Seattle
Notice in the last example that there was a choice about how the merge should handle unmatched values, since the two datasets have different sets of values for state.abb; by default, only rows whose state.abb appears in both data frames are kept.
You can modify this with all, all.x, or all.y:
all = TRUE means that you get one row for each value of the merging variable that was seen in either x or y.
all.x = TRUE means you get one row for each value of the merging variable that was seen in x.
all.y = TRUE means you get one row for each value of the merging variable that was seen in y.
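The full table below, with NA for states that have no city in cities, could be produced with something like:
# keep every row of states, whether or not it matches a city
merge(states, cities, by = "state.abb", all.x = TRUE)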
## state.abb state.name city
## 1 AK Alaska Juneau
## 2 AK Alaska Anchorage
## 3 AK Alaska Fairbanks
## 4 AL Alabama <NA>
## 5 AR Arkansas <NA>
## 6 AZ Arizona <NA>
## 7 CA California San Diego
## 8 CA California Los Angeles
## 9 CO Colorado <NA>
## 10 CT Connecticut <NA>
## 11 DE Delaware <NA>
## 12 FL Florida <NA>
## 13 GA Georgia <NA>
## 14 HI Hawaii <NA>
## 15 IA Iowa <NA>
## 16 ID Idaho <NA>
## 17 IL Illinois <NA>
## 18 IN Indiana <NA>
## 19 KS Kansas <NA>
## 20 KY Kentucky <NA>
## 21 LA Louisiana <NA>
## 22 MA Massachusetts Boston
## 23 MD Maryland <NA>
## 24 ME Maine <NA>
## 25 MI Michigan Ann Arbor
## 26 MN Minnesota <NA>
## 27 MO Missouri <NA>
## 28 MS Mississippi <NA>
## 29 MT Montana <NA>
## 30 NC North Carolina <NA>
## 31 ND North Dakota <NA>
## 32 NE Nebraska <NA>
## 33 NH New Hampshire <NA>
## 34 NJ New Jersey <NA>
## 35 NM New Mexico <NA>
## 36 NV Nevada <NA>
## 37 NY New York New York
## 38 OH Ohio <NA>
## 39 OK Oklahoma <NA>
## 40 OR Oregon <NA>
## 41 PA Pennsylvania Philadelphia
## 42 RI Rhode Island <NA>
## 43 SC South Carolina <NA>
## 44 SD South Dakota <NA>
## 45 TN Tennessee <NA>
## 46 TX Texas <NA>
## 47 UT Utah <NA>
## 48 VA Virginia <NA>
## 49 VT Vermont <NA>
## 50 WA Washington Seattle
## 51 WI Wisconsin <NA>
## 52 WV West Virginia <NA>
## 53 WY Wyoming <NA>
Some additional notes:
If you don’t specify by.x and by.y, the default is to merge on the columns that are common to the two data frames.
by.x/by.y can have length more than 1, in which case we match on the entire set of specified variables (see the sketch below).
You can use by instead of by.x and by.y, in which case the name of the column to merge on has to be the same in both x and y.
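A small sketch of merging on more than one column; the data frames and column names here (subject, time) are made up for illustration:
df1 = data.frame(subject = c(1, 1, 2), time = c(1, 2, 1), score = c(10, 12, 9))
df2 = data.frame(subject = c(1, 2), time = c(1, 1), group = c("a", "b"))

# match on the combination of subject and time
merge(df1, df2, by = c("subject", "time"))
# keep rows of df1 even when there is no matching row in df2
merge(df1, df2, by = c("subject", "time"), all.x = TRUE)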
Merging, melting/casting, and split/apply/combine from last week are useful enough that there are multiple implementations.
You can use other implementations, but you shouldn’t try to re-make them from scratch.
Some other options are:
join in plyr does the same thing as merge.
gather and spread, and more recently pivot_longer and pivot_wider, in the tidyr package do roughly the same thing as melt and cast, respectively.
There are multiple iterations of reshape: reshape2 was a redesign of the original reshape package.
plyr and dplyr are two packages with basically the same goal, but dplyr is focused more on data frames.
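For reference, a sketch of the tidyr equivalents applied to french_fries (pivot_longer plays the role of melt and pivot_wider the role of dcast; french_fries itself comes from reshape2):
library(tidyr)
# wide to long: one row per (id variables, flavor) combination
ff_long = pivot_longer(french_fries, cols = potato:painty,
                       names_to = "flavor", values_to = "value")
# long to wide, aggregating duplicate measurements with the mean
ff_wide = pivot_wider(ff_long, id_cols = c(subject, treatment),
                      names_from = flavor, values_from = value,
                      values_fn = mean)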