More on data manipulation, shape changing, merging, transformations

Reading: Hadley Wickham, “Tidy Data” and “Reshaping data with the reshape package”

Agenda for today:

Tidy data

We usually want our data in the folowing form:

Data don’t always come this way!

Even if the data do satisfy the “one row per observation” rule for one analysis, they don’t necessarily do so for another, and we often need to change the “shape” of the data.

Data semantics

Three concepts:

Datasets contain values, each of which belongs to a variable and an observation.

Datasets can encode these in a lot of different ways. The “tidy” way is to have

This is usually the way that other functions want the data to come in.

For example:

income_and_religion
##                         <$10k $10k-$20k $20k-$30k $30k-$40k $40k-$50k
## Agnostic                   27        34        60        81        76
## Atheist                    12        27        37        52        35
## Buddhist                   27        21        30        34        33
## Catholic                  418       617       732       670       638
## Don't know/refused         15        14        15        11        10
## Evangelical Prot          575       869      1064       982       881
## Hindu                       1         9         7         9        11
## Historically Black Prot   228       244       236       238       197
## Jehovah's Witness          20        27        24        24        21
## Jewish                     19        19        25        25        30
##                         $50k-75k
## Agnostic                     137
## Atheist                       70
## Buddhist                      58
## Catholic                    1116
## Don't know/refused            35
## Evangelical Prot            1486
## Hindu                         34
## Historically Black Prot      223
## Jehovah's Witness             30
## Jewish                        95

This table is easy to read, but the variables are encoded in the row and column names, which makes it hard to use with modeling functions in R.

This data table contains all the same information:

melt(income_and_religion, varnames = c("income", "religion"), value.name = "count")
##                     income  religion count
## 1                 Agnostic     <$10k    27
## 2                  Atheist     <$10k    12
## 3                 Buddhist     <$10k    27
## 4                 Catholic     <$10k   418
## 5       Don't know/refused     <$10k    15
## 6         Evangelical Prot     <$10k   575
## 7                    Hindu     <$10k     1
## 8  Historically Black Prot     <$10k   228
## 9        Jehovah's Witness     <$10k    20
## 10                  Jewish     <$10k    19
## 11                Agnostic $10k-$20k    34
## 12                 Atheist $10k-$20k    27
## 13                Buddhist $10k-$20k    21
## 14                Catholic $10k-$20k   617
## 15      Don't know/refused $10k-$20k    14
## 16        Evangelical Prot $10k-$20k   869
## 17                   Hindu $10k-$20k     9
## 18 Historically Black Prot $10k-$20k   244
## 19       Jehovah's Witness $10k-$20k    27
## 20                  Jewish $10k-$20k    19
## 21                Agnostic $20k-$30k    60
## 22                 Atheist $20k-$30k    37
## 23                Buddhist $20k-$30k    30
## 24                Catholic $20k-$30k   732
## 25      Don't know/refused $20k-$30k    15
## 26        Evangelical Prot $20k-$30k  1064
## 27                   Hindu $20k-$30k     7
## 28 Historically Black Prot $20k-$30k   236
## 29       Jehovah's Witness $20k-$30k    24
## 30                  Jewish $20k-$30k    25
## 31                Agnostic $30k-$40k    81
## 32                 Atheist $30k-$40k    52
## 33                Buddhist $30k-$40k    34
## 34                Catholic $30k-$40k   670
## 35      Don't know/refused $30k-$40k    11
## 36        Evangelical Prot $30k-$40k   982
## 37                   Hindu $30k-$40k     9
## 38 Historically Black Prot $30k-$40k   238
## 39       Jehovah's Witness $30k-$40k    24
## 40                  Jewish $30k-$40k    25
## 41                Agnostic $40k-$50k    76
## 42                 Atheist $40k-$50k    35
## 43                Buddhist $40k-$50k    33
## 44                Catholic $40k-$50k   638
## 45      Don't know/refused $40k-$50k    10
## 46        Evangelical Prot $40k-$50k   881
## 47                   Hindu $40k-$50k    11
## 48 Historically Black Prot $40k-$50k   197
## 49       Jehovah's Witness $40k-$50k    21
## 50                  Jewish $40k-$50k    30
## 51                Agnostic  $50k-75k   137
## 52                 Atheist  $50k-75k    70
## 53                Buddhist  $50k-75k    58
## 54                Catholic  $50k-75k  1116
## 55      Don't know/refused  $50k-75k    35
## 56        Evangelical Prot  $50k-75k  1486
## 57                   Hindu  $50k-75k    34
## 58 Historically Black Prot  $50k-75k   223
## 59       Jehovah's Witness  $50k-75k    30
## 60                  Jewish  $50k-75k    95

It’s harder to look at this table, but easier to manipulate programmatically.

We’ll see melt in a couple slides.

Reshaping

The term for transforming these datasets into each other is called “reshaping”, and pretty much all reshaping can be done with a combination of two operations: melting and casting.

The mneumonic is that you first “melt” the data set and then you “cast” it. The “molten” data isn’t usually the final form you want it to be in, but it allows you to change it into other shapes easily.

Melting data

The melt function takes data from wide form to long form.

Conceptually, we have two sets of variables:

To melt a dateset, you need to decide what these are.

What does a molten data table look like?

Here id variables are income and religion, and the measurement is the count of each.

Since we only have one measurement variable, we don’t need an extra column describing what the measured variable is.

head(melt(income_and_religion))
##                 Var1  Var2 value
## 1           Agnostic <$10k    27
## 2            Atheist <$10k    12
## 3           Buddhist <$10k    27
## 4           Catholic <$10k   418
## 5 Don't know/refused <$10k    15
## 6   Evangelical Prot <$10k   575

Melting

Based on what we said before, we will need to provide data, id variables, and measurement variables, but in practice we have two distinct situations:

Melting data frames

Let’s see an example.

head(french_fries)
##    time treatment subject rep potato buttery grassy rancid painty
## 61    1         1       3   1    2.9     0.0    0.0    0.0    5.5
## 25    1         1       3   2   14.0     0.0    0.0    1.1    0.0
## 62    1         1      10   1   11.0     6.4    0.0    0.0    0.0
## 26    1         1      10   2    9.9     5.9    2.9    2.2    0.0
## 63    1         1      15   1    1.2     0.1    0.0    1.1    5.1
## 27    1         1      15   2    8.8     3.0    3.6    1.5    2.3

We want to melt this data frame so that

head(melt(french_fries, id.vars = c("time", "treatment", "subject", "rep"),
     measure.vars = c("potato", "buttery", "grassy", "rancid", "painty")))
##   time treatment subject rep variable value
## 1    1         1       3   1   potato   2.9
## 2    1         1       3   2   potato  14.0
## 3    1         1      10   1   potato  11.0
## 4    1         1      10   2   potato   9.9
## 5    1         1      15   1   potato   1.2
## 6    1         1      15   2   potato   8.8

You can specify just one of id.vars and measure.vars, in which case the function will assume that all the other variables should be in the other class.

You can also specify the variables by number instead of by name.

head(melt(french_fries, id.vars = 1:4))
##   time treatment subject rep variable value
## 1    1         1       3   1   potato   2.9
## 2    1         1       3   2   potato  14.0
## 3    1         1      10   1   potato  11.0
## 4    1         1      10   2   potato   9.9
## 5    1         1      15   1   potato   1.2
## 6    1         1      15   2   potato   8.8
head(melt(french_fries, measure.vars = 5:9))
##   time treatment subject rep variable value
## 1    1         1       3   1   potato   2.9
## 2    1         1       3   2   potato  14.0
## 3    1         1      10   1   potato  11.0
## 4    1         1      10   2   potato   9.9
## 5    1         1      15   1   potato   1.2
## 6    1         1      15   2   potato   8.8

If you don’t specify either, the function will assume that all the factor variables should be id variables and all of the numeric variables should be measured variables.

plyr::ldply(french_fries, class)
##         .id      V1
## 1      time  factor
## 2 treatment  factor
## 3   subject  factor
## 4       rep numeric
## 5    potato numeric
## 6   buttery numeric
## 7    grassy numeric
## 8    rancid numeric
## 9    painty numeric
head(melt(french_fries))
## Using time, treatment, subject as id variables
##   time treatment subject variable value
## 1    1         1       3      rep     1
## 2    1         1       3      rep     2
## 3    1         1      10      rep     1
## 4    1         1      10      rep     2
## 5    1         1      15      rep     1
## 6    1         1      15      rep     2

You can make the output slightly nicer by specifying variable names and value names:

head(melt(french_fries, id.vars = 1:4, variable.name = "flavor", value.name = "flavor_intensity"))
##   time treatment subject rep flavor flavor_intensity
## 1    1         1       3   1 potato              2.9
## 2    1         1       3   2 potato             14.0
## 3    1         1      10   1 potato             11.0
## 4    1         1      10   2 potato              9.9
## 5    1         1      15   1 potato              1.2
## 6    1         1      15   2 potato              8.8

Melting arrays

Therefore, the syntax is simply melt(array)

For example:

HairEyeColor
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
melt(HairEyeColor)
##     Hair   Eye    Sex value
## 1  Black Brown   Male    32
## 2  Brown Brown   Male    53
## 3    Red Brown   Male    10
## 4  Blond Brown   Male     3
## 5  Black  Blue   Male    11
## 6  Brown  Blue   Male    50
## 7    Red  Blue   Male    10
## 8  Blond  Blue   Male    30
## 9  Black Hazel   Male    10
## 10 Brown Hazel   Male    25
## 11   Red Hazel   Male     7
## 12 Blond Hazel   Male     5
## 13 Black Green   Male     3
## 14 Brown Green   Male    15
## 15   Red Green   Male     7
## 16 Blond Green   Male     8
## 17 Black Brown Female    36
## 18 Brown Brown Female    66
## 19   Red Brown Female    16
## 20 Blond Brown Female     4
## 21 Black  Blue Female     9
## 22 Brown  Blue Female    34
## 23   Red  Blue Female     7
## 24 Blond  Blue Female    64
## 25 Black Hazel Female     5
## 26 Brown Hazel Female    29
## 27   Red Hazel Female     7
## 28 Blond Hazel Female     5
## 29 Black Green Female     2
## 30 Brown Green Female    14
## 31   Red Green Female     7
## 32 Blond Green Female     8

There are a couple of optional arguments for making the output a little nicer, like giving a different name to the column in the output describing the measured value:

melt(HairEyeColor, value.name = "number")
##     Hair   Eye    Sex number
## 1  Black Brown   Male     32
## 2  Brown Brown   Male     53
## 3    Red Brown   Male     10
## 4  Blond Brown   Male      3
## 5  Black  Blue   Male     11
## 6  Brown  Blue   Male     50
## 7    Red  Blue   Male     10
## 8  Blond  Blue   Male     30
## 9  Black Hazel   Male     10
## 10 Brown Hazel   Male     25
## 11   Red Hazel   Male      7
## 12 Blond Hazel   Male      5
## 13 Black Green   Male      3
## 14 Brown Green   Male     15
## 15   Red Green   Male      7
## 16 Blond Green   Male      8
## 17 Black Brown Female     36
## 18 Brown Brown Female     66
## 19   Red Brown Female     16
## 20 Blond Brown Female      4
## 21 Black  Blue Female      9
## 22 Brown  Blue Female     34
## 23   Red  Blue Female      7
## 24 Blond  Blue Female     64
## 25 Black Hazel Female      5
## 26 Brown Hazel Female     29
## 27   Red Hazel Female      7
## 28 Blond Hazel Female      5
## 29 Black Green Female      2
## 30 Brown Green Female     14
## 31   Red Green Female      7
## 32 Blond Green Female      8

Casting

reshape2 uses *cast: either dcast or acast for data frame or array output, respectively.

Syntax: *cast(mdata, col_var_1 + ... + col_var_n ~ row_var_1 + ... + row_var_m, value.var)

Example

ff_m = melt(french_fries, id.vars = 1:4, variable.name = "flavor", na.rm = TRUE)
head(ff_m)
##   time treatment subject rep flavor value
## 1    1         1       3   1 potato   2.9
## 2    1         1       3   2 potato  14.0
## 3    1         1      10   1 potato  11.0
## 4    1         1      10   2 potato   9.9
## 5    1         1      15   1 potato   1.2
## 6    1         1      15   2 potato   8.8
cast_ex_1 = dcast(ff_m, time + treatment ~ subject + rep + flavor, value.var = "value")
cast_ex_1[1:5,1:5]
##   time treatment 3_1_potato 3_1_buttery 3_1_grassy
## 1    1         1        2.9         0.0        0.0
## 2    1         2       13.9         0.0        0.0
## 3    1         3       14.1         0.0        0.0
## 4    2         1        9.0         0.3        0.1
## 5    2         2       14.1         0.9        0.3
## check to see if the numbers match
subset(ff_m, time == 1 & subject == "3" &  rep == 1 & flavor == "potato")
##    time treatment subject rep flavor value
## 1     1         1       3   1 potato   2.9
## 25    1         2       3   1 potato  13.9
## 49    1         3       3   1 potato  14.1

Aggregation and casting

When you cast data, you often don’t use all of the variables.

This means that each element of the cast table will correspond to more than one measurement, and so they need to be aggregated in some way.

head(ff_m)
##   time treatment subject rep flavor value
## 1    1         1       3   1 potato   2.9
## 2    1         1       3   2 potato  14.0
## 3    1         1      10   1 potato  11.0
## 4    1         1      10   2 potato   9.9
## 5    1         1      15   1 potato   1.2
## 6    1         1      15   2 potato   8.8
dcast(ff_m, subject ~ flavor, value.var = "value")
## Aggregation function missing: defaulting to length
##    subject potato buttery grassy rancid painty
## 1        3     54      54     54     54     54
## 2       10     60      60     60     60     60
## 3       15     59      59     59     59     59
## 4       16     60      59     60     60     60
## 5       19     60      60     60     60     60
## 6       31     54      54     54     54     54
## 7       51     60      60     60     60     60
## 8       52     60      60     60     60     60
## 9       63     60      60     60     60     60
## 10      78     60      60     60     60     60
## 11      79     54      52     54     54     53
## 12      86     54      54     54     54     54

The aggregation function is specified with fun.aggregate:

dcast(ff_m, subject ~ flavor, value.var = "value", fun.aggregate = mean)
##    subject    potato   buttery    grassy    rancid      painty
## 1        3  6.083333 0.5759259 0.1296296 2.7000000 2.818518519
## 2       10  9.993333 6.7266667 0.4016667 3.0933333 0.961666667
## 3       15  3.908475 1.0084746 0.4000000 2.9389831 2.566101695
## 4       16  6.601667 3.1067797 0.9783333 3.5733333 0.746666667
## 5       19  8.921667 2.4100000 1.7416667 6.0016667 3.611666667
## 6       31  8.635185 0.5703704 0.1388889 6.1907407 4.466666667
## 7       51 10.293333 3.1883333 1.3233333 4.9116667 2.251666667
## 8       52  5.350000 0.8983333 0.9400000 3.8900000 2.500000000
## 9       63  7.750000 0.0650000 0.0450000 5.7766667 3.770000000
## 10      78  3.800000 0.5783333 0.6533333 1.4133333 3.245000000
## 11      79  7.911111 0.5173077 0.2388889 0.9259259 0.009433962
## 12      86  4.014815 1.8222222 0.8444444 4.7074074 3.325925926

A couple of other notes:

head(dcast(ff_m, subject + treatment ~ flavor, fun.aggregate = mean), n = 10)
##    subject treatment    potato   buttery     grassy   rancid   painty
## 1        3         1  6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2        3         2  6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3        3         3  5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4       10         1  9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5       10         2  9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6       10         3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## 7       15         1  3.360000 0.7200000 0.42000000 3.965000 3.260000
## 8       15         2  4.405000 1.3150000 0.34000000 2.285000 2.060000
## 9       15         3  3.963158 0.9894737 0.44210526 2.547368 2.368421
## 10      16         1  6.495000 3.2600000 0.75500000 4.120000 1.230000
head(dcast(ff_m, treatment + subject ~ flavor, fun.aggregate = mean), n = 10)
##    treatment subject    potato   buttery     grassy   rancid   painty
## 1          1       3  6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2          1      10  9.955000 6.7500000 0.58500000 4.020000 1.375000
## 3          1      15  3.360000 0.7200000 0.42000000 3.965000 3.260000
## 4          1      16  6.495000 3.2600000 0.75500000 4.120000 1.230000
## 5          1      19  9.385000 3.0550000 2.02000000 5.360000 2.775000
## 6          1      31  8.844444 0.4444444 0.08888889 5.944444 3.211111
## 7          1      51 10.675000 2.6400000 1.05000000 5.150000 1.955000
## 8          1      52  5.060000 0.8050000 0.87500000 4.285000 2.645000
## 9          1      63  6.775000 0.0250000 0.00000000 6.055000 3.855000
## 10         1      78  3.620000 0.7350000 0.54000000 1.505000 3.490000

Aside: Pipes

The pipe operator, %>%, allows you to change the way in which you call functions.

library(magrittr)
head(dcast(ff_m, subject + treatment ~ flavor, fun.aggregate = mean))
##   subject treatment    potato   buttery     grassy   rancid   painty
## 1       3         1  6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2       3         2  6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3       3         3  5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4      10         1  9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5      10         2  9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6      10         3 10.030000 6.4500000 0.14500000 3.110000 0.690000
dcast(ff_m, subject + treatment ~ flavor, fun.aggregate = mean) %>% head()
##   subject treatment    potato   buttery     grassy   rancid   painty
## 1       3         1  6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2       3         2  6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3       3         3  5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4      10         1  9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5      10         2  9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6      10         3 10.030000 6.4500000 0.14500000 3.110000 0.690000
dcast(ff_m, subject + treatment ~ flavor, fun.aggregate = mean) %>% head
##   subject treatment    potato   buttery     grassy   rancid   painty
## 1       3         1  6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2       3         2  6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3       3         3  5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4      10         1  9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5      10         2  9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6      10         3 10.030000 6.4500000 0.14500000 3.110000 0.690000
dcast(ff_m, subject + treatment ~ flavor, fun.aggregate = mean) %>% head(n = 10)
##    subject treatment    potato   buttery     grassy   rancid   painty
## 1        3         1  6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2        3         2  6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3        3         3  5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4       10         1  9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5       10         2  9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6       10         3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## 7       15         1  3.360000 0.7200000 0.42000000 3.965000 3.260000
## 8       15         2  4.405000 1.3150000 0.34000000 2.285000 2.060000
## 9       15         3  3.963158 0.9894737 0.44210526 2.547368 2.368421
## 10      16         1  6.495000 3.2600000 0.75500000 4.120000 1.230000
ff_m %>% dcast(subject + treatment ~ flavor, fun.aggregate = mean) %>% head()
##   subject treatment    potato   buttery     grassy   rancid   painty
## 1       3         1  6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2       3         2  6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3       3         3  5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4      10         1  9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5      10         2  9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6      10         3 10.030000 6.4500000 0.14500000 3.110000 0.690000

Merging

Final topic: What if you have data from two different places and you need to put them together?

Basic syntax: merge(x, y, by.x, by.y)

Example:

cities = data.frame(
    city=c('New York','Boston','Juneau',
        'Anchorage','San Diego',
        'Philadelphia','Los Angeles',
        'Fairbanks','Ann Arbor','Seattle'),
    state.abb=c('NY','MA','AK','AK','CA',
        'PA','CA','AK','MI','WA'))

states = data.frame(state.name, state.abb)
cities
##            city state.abb
## 1      New York        NY
## 2        Boston        MA
## 3        Juneau        AK
## 4     Anchorage        AK
## 5     San Diego        CA
## 6  Philadelphia        PA
## 7   Los Angeles        CA
## 8     Fairbanks        AK
## 9     Ann Arbor        MI
## 10      Seattle        WA
head(states)
##   state.name state.abb
## 1    Alabama        AL
## 2     Alaska        AK
## 3    Arizona        AZ
## 4   Arkansas        AR
## 5 California        CA
## 6   Colorado        CO

We want to add the state name to the cities data frame, and we can use merge.

merge(states, cities, by.x = "state.abb", by.y = "state.abb")
##    state.abb    state.name         city
## 1         AK        Alaska       Juneau
## 2         AK        Alaska    Anchorage
## 3         AK        Alaska    Fairbanks
## 4         CA    California    San Diego
## 5         CA    California  Los Angeles
## 6         MA Massachusetts       Boston
## 7         MI      Michigan    Ann Arbor
## 8         NY      New York     New York
## 9         PA  Pennsylvania Philadelphia
## 10        WA    Washington      Seattle

Notice in the last example that there was some ambiguity in how the merge took place because the two datasets have different sets of values for state.abb.

Can modify with all, all.x, or all.y

merge(states, cities, all.x = TRUE)
##    state.abb     state.name         city
## 1         AK         Alaska       Juneau
## 2         AK         Alaska    Anchorage
## 3         AK         Alaska    Fairbanks
## 4         AL        Alabama         <NA>
## 5         AR       Arkansas         <NA>
## 6         AZ        Arizona         <NA>
## 7         CA     California    San Diego
## 8         CA     California  Los Angeles
## 9         CO       Colorado         <NA>
## 10        CT    Connecticut         <NA>
## 11        DE       Delaware         <NA>
## 12        FL        Florida         <NA>
## 13        GA        Georgia         <NA>
## 14        HI         Hawaii         <NA>
## 15        IA           Iowa         <NA>
## 16        ID          Idaho         <NA>
## 17        IL       Illinois         <NA>
## 18        IN        Indiana         <NA>
## 19        KS         Kansas         <NA>
## 20        KY       Kentucky         <NA>
## 21        LA      Louisiana         <NA>
## 22        MA  Massachusetts       Boston
## 23        MD       Maryland         <NA>
## 24        ME          Maine         <NA>
## 25        MI       Michigan    Ann Arbor
## 26        MN      Minnesota         <NA>
## 27        MO       Missouri         <NA>
## 28        MS    Mississippi         <NA>
## 29        MT        Montana         <NA>
## 30        NC North Carolina         <NA>
## 31        ND   North Dakota         <NA>
## 32        NE       Nebraska         <NA>
## 33        NH  New Hampshire         <NA>
## 34        NJ     New Jersey         <NA>
## 35        NM     New Mexico         <NA>
## 36        NV         Nevada         <NA>
## 37        NY       New York     New York
## 38        OH           Ohio         <NA>
## 39        OK       Oklahoma         <NA>
## 40        OR         Oregon         <NA>
## 41        PA   Pennsylvania Philadelphia
## 42        RI   Rhode Island         <NA>
## 43        SC South Carolina         <NA>
## 44        SD   South Dakota         <NA>
## 45        TN      Tennessee         <NA>
## 46        TX          Texas         <NA>
## 47        UT           Utah         <NA>
## 48        VA       Virginia         <NA>
## 49        VT        Vermont         <NA>
## 50        WA     Washington      Seattle
## 51        WI      Wisconsin         <NA>
## 52        WV  West Virginia         <NA>
## 53        WY        Wyoming         <NA>

Some additional notes:

Some final notes

Merging, melting/casting, and split/apply/combine from last week are useful enough that there are multiple implementations.

You can use other implementations, but you shouldn’t try to re-make them from scratch.

Some other options are: