Reading: Hadley Wickham, “Tidy Data” and “Reshaping data with the reshape package”
Agenda for today:
Reshaping/tidy data/wide vs. long format
Merging
We usually want our data in the following form:
In a rectangular data frame
One row per sample or observation, one column per variable
Data don’t always come this way!
Even if the data do satisfy the “one row per observation” rule for one analysis, they don’t necessarily do so for another, and we often need to change the “shape” of the data.
Three concepts:
Values
Variables
Observations
Datasets contain values, each of which belongs to a variable and an observation.
Datasets can encode these in a lot of different ways. The “tidy” way is to have
Each variable a column
Each observation a row
Each cell a value
This is usually the way that other functions want the data to come in.
For example:
##                         <$10k $10k-$20k $20k-$30k $30k-$40k $40k-$50k $50k-75k
## Agnostic                   27        34        60        81        76      137
## Atheist                    12        27        37        52        35       70
## Buddhist                   27        21        30        34        33       58
## Catholic                  418       617       732       670       638     1116
## Don't know/refused         15        14        15        11        10       35
## Evangelical Prot          575       869      1064       982       881     1486
## Hindu                       1         9         7         9        11       34
## Historically Black Prot   228       244       236       238       197      223
## Jehovah's Witness          20        27        24        24        21       30
## Jewish                     19        19        25        25        30       95
This table is easy to read, but the variables are encoded in the row and column names, which makes it hard to use with modeling functions in R.
This data table contains all the same information:
## religion income count
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don't know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
## 7 Hindu <$10k 1
## 8 Historically Black Prot <$10k 228
## 9 Jehovah's Witness <$10k 20
## 10 Jewish <$10k 19
## 11 Agnostic $10k-$20k 34
## 12 Atheist $10k-$20k 27
## 13 Buddhist $10k-$20k 21
## 14 Catholic $10k-$20k 617
## 15 Don't know/refused $10k-$20k 14
## 16 Evangelical Prot $10k-$20k 869
## 17 Hindu $10k-$20k 9
## 18 Historically Black Prot $10k-$20k 244
## 19 Jehovah's Witness $10k-$20k 27
## 20 Jewish $10k-$20k 19
## 21 Agnostic $20k-$30k 60
## 22 Atheist $20k-$30k 37
## 23 Buddhist $20k-$30k 30
## 24 Catholic $20k-$30k 732
## 25 Don't know/refused $20k-$30k 15
## 26 Evangelical Prot $20k-$30k 1064
## 27 Hindu $20k-$30k 7
## 28 Historically Black Prot $20k-$30k 236
## 29 Jehovah's Witness $20k-$30k 24
## 30 Jewish $20k-$30k 25
## 31 Agnostic $30k-$40k 81
## 32 Atheist $30k-$40k 52
## 33 Buddhist $30k-$40k 34
## 34 Catholic $30k-$40k 670
## 35 Don't know/refused $30k-$40k 11
## 36 Evangelical Prot $30k-$40k 982
## 37 Hindu $30k-$40k 9
## 38 Historically Black Prot $30k-$40k 238
## 39 Jehovah's Witness $30k-$40k 24
## 40 Jewish $30k-$40k 25
## 41 Agnostic $40k-$50k 76
## 42 Atheist $40k-$50k 35
## 43 Buddhist $40k-$50k 33
## 44 Catholic $40k-$50k 638
## 45 Don't know/refused $40k-$50k 10
## 46 Evangelical Prot $40k-$50k 881
## 47 Hindu $40k-$50k 11
## 48 Historically Black Prot $40k-$50k 197
## 49 Jehovah's Witness $40k-$50k 21
## 50 Jewish $40k-$50k 30
## 51 Agnostic $50k-75k 137
## 52 Atheist $50k-75k 70
## 53 Buddhist $50k-75k 58
## 54 Catholic $50k-75k 1116
## 55 Don't know/refused $50k-75k 35
## 56 Evangelical Prot $50k-75k 1486
## 57 Hindu $50k-75k 34
## 58 Historically Black Prot $50k-75k 223
## 59 Jehovah's Witness $50k-75k 30
## 60 Jewish $50k-75k 95
It’s harder to look at this table, but easier to manipulate programmatically.
We’ll see melt in a couple of slides.
Transforming these datasets into one another is called “reshaping”, and pretty much all reshaping can be done with a combination of two operations: melting and casting.
Melting: Taking a wide dataset and making it long.
Casting: Taking a melted dataset and making it wide.
The mnemonic is that you first “melt” the dataset and then you “cast” it. The “molten” data isn’t usually the final form you want, but it allows you to change the data into other shapes easily.
The melt function takes data from wide form to long form.
Conceptually, we have two sets of variables:
Identification variables, which describe the observations, and
Measurement variables, which are the quantities measured on those observations.
To melt a dataset, you need to decide what these are.
What does a molten data table look like?
Columns (potentially more than one) for id variables.
If there is more than one measured variable, one column describing the variable measured.
One column for the value of the measured variable on the corresponding observation.
Here id variables are income and religion, and the measurement is the count of each.
Since we only have one measurement variable, we don’t need an extra column describing what the measured variable is.
## Var1 Var2 value
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don't know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
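The generic column names Var1 and Var2 above are what melt produces when it is applied to a matrix or table whose dimensions are unnamed. A minimal sketch of such a call, assuming the wide table above is stored in a matrix called pew_wide (a name used here only for illustration):
library(reshape2)
# melting a matrix/table: the two dimensions become Var1 and Var2 (rows and
# columns of the original table), and the cell contents become the value column
head(melt(pew_wide))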
Based on what we said before, we will need to provide data, id variables, and measurement variables, but in practice we have two distinct situations:
Matrix or array input: here we assume that the id variables are the dimensions of the matrix or array, and the measurement variables are the elements of the matrix or the array. Therefore, we don’t actually specify id variables or measurement variables.
Data frame input: id variables and measurement variables have to be specified.
Let’s see an example.
## time treatment subject rep potato buttery grassy rancid painty
## 61 1 1 3 1 2.9 0.0 0.0 0.0 5.5
## 25 1 1 3 2 14.0 0.0 0.0 1.1 0.0
## 62 1 1 10 1 11.0 6.4 0.0 0.0 0.0
## 26 1 1 10 2 9.9 5.9 2.9 2.2 0.0
## 63 1 1 15 1 1.2 0.1 0.0 1.1 5.1
## 27 1 1 15 2 8.8 3.0 3.6 1.5 2.3
We want to melt this data frame so that
id.vars are time, treatment, subject, rep
measure.vars are the remainder: potato, buttery, grassy, rancid, painty
head(melt(french_fries, id.vars = c("time", "treatment", "subject", "rep"),
measure.vars = c("potato", "buttery", "grassy", "rancid", "painty")))
## time treatment subject rep variable value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
You can specify just one of id.vars and measure.vars, in which case the function will assume that all the other variables belong to the other class.
You can also specify the variables by number instead of by name.
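A guess at the calls that could produce the two (identical) outputs below, assuming reshape2 is loaded and the french_fries data is available:
# specify only id.vars; measure.vars defaults to all remaining variables
head(melt(french_fries, id.vars = c("time", "treatment", "subject", "rep")))
# specify the id variables by position instead of by name
head(melt(french_fries, id.vars = 1:4))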
## time treatment subject rep variable value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
## time treatment subject rep variable value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
If you don’t specify either, the function will assume that all the factor variables should be id variables and all of the numeric variables should be measured variables.
## .id V1
## 1 time factor
## 2 treatment factor
## 3 subject factor
## 4 rep numeric
## 5 potato numeric
## 6 buttery numeric
## 7 grassy numeric
## 8 rancid numeric
## 9 painty numeric
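The message and output that follow are what melt prints when neither id.vars nor measure.vars is supplied; a call like this (a guess at the original) would produce them:
# no id.vars or measure.vars: factors become id variables, numerics become measured
head(melt(french_fries))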
## Using time, treatment, subject as id variables
## time treatment subject variable value
## 1 1 1 3 rep 1
## 2 1 1 3 rep 2
## 3 1 1 10 rep 1
## 4 1 1 10 rep 2
## 5 1 1 15 rep 1
## 6 1 1 15 rep 2
You can make the output slightly nicer by specifying variable names and value names:
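A sketch of a call that would give the output below, using melt's variable.name and value.name arguments (the id variables are assumed to be the same as before):
head(melt(french_fries, id.vars = 1:4,
          variable.name = "flavor", value.name = "flavor_intensity"))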
## time treatment subject rep flavor flavor_intensity
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
Assume that the id variables are the dimensions of the array
Assume the measured variable is stored as elements in the array.
We don’t specify id variables or measurement variables.
Therefore, the syntax is simply melt(array)
For example:
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
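The array above appears to be R's built-in HairEyeColor dataset; melting it, as sketched below, would give the long table that follows:
# melt an array/table: the dimension names become columns, the cells become "value"
melt(HairEyeColor)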
## Hair Eye Sex value
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
## 7 Red Blue Male 10
## 8 Blond Blue Male 30
## 9 Black Hazel Male 10
## 10 Brown Hazel Male 25
## 11 Red Hazel Male 7
## 12 Blond Hazel Male 5
## 13 Black Green Male 3
## 14 Brown Green Male 15
## 15 Red Green Male 7
## 16 Blond Green Male 8
## 17 Black Brown Female 36
## 18 Brown Brown Female 66
## 19 Red Brown Female 16
## 20 Blond Brown Female 4
## 21 Black Blue Female 9
## 22 Brown Blue Female 34
## 23 Red Blue Female 7
## 24 Blond Blue Female 64
## 25 Black Hazel Female 5
## 26 Brown Hazel Female 29
## 27 Red Hazel Female 7
## 28 Blond Hazel Female 5
## 29 Black Green Female 2
## 30 Brown Green Female 14
## 31 Red Green Female 7
## 32 Blond Green Female 8
There are a couple of optional arguments for making the output a little nicer, like giving a different name to the column in the output describing the measured value:
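For example, value.name renames the column that holds the measured values. A sketch, assuming the same HairEyeColor array as above:
melt(HairEyeColor, value.name = "number")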
## Hair Eye Sex number
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
## 7 Red Blue Male 10
## 8 Blond Blue Male 30
## 9 Black Hazel Male 10
## 10 Brown Hazel Male 25
## 11 Red Hazel Male 7
## 12 Blond Hazel Male 5
## 13 Black Green Male 3
## 14 Brown Green Male 15
## 15 Red Green Male 7
## 16 Blond Green Male 8
## 17 Black Brown Female 36
## 18 Brown Brown Female 66
## 19 Red Brown Female 16
## 20 Blond Brown Female 4
## 21 Black Blue Female 9
## 22 Brown Blue Female 34
## 23 Red Blue Female 7
## 24 Blond Blue Female 64
## 25 Black Hazel Female 5
## 26 Brown Hazel Female 29
## 27 Red Hazel Female 7
## 28 Blond Hazel Female 5
## 29 Black Green Female 2
## 30 Brown Green Female 14
## 31 Red Green Female 7
## 32 Blond Green Female 8
reshape2 uses *cast: either dcast for data frame output or acast for array output.
Syntax: *cast(mdata, row_var_1 + ... + row_var_n ~ col_var_1 + ... + col_var_m, value.var)
mdata should be a molten dataset.
The rows of the output will be all the combinations of row_var_1, …, row_var_n, and these variables also appear as identifier columns in the output.
The columns of the output will be all the combinations of col_var_1, …, col_var_m.
value.var is the value variable, the values that should go in the cells of the output. If you don’t specify it, dcast will guess.
Example
## time treatment subject rep flavor value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
cast_ex_1 = dcast(ff_m, time + treatment ~ subject + rep + flavor, value.var = "value")
cast_ex_1[1:5,1:5]
## time treatment 3_1_potato 3_1_buttery 3_1_grassy
## 1 1 1 2.9 0.0 0.0
## 2 1 2 13.9 0.0 0.0
## 3 1 3 14.1 0.0 0.0
## 4 2 1 9.0 0.3 0.1
## 5 2 2 14.1 0.9 0.3
## check to see if the numbers match
subset(ff_m, time == 1 & subject == "3" & rep == 1 & flavor == "potato")
## time treatment subject rep flavor value
## 1 1 1 3 1 potato 2.9
## 25 1 2 3 1 potato 13.9
## 49 1 3 3 1 potato 14.1
When you cast data, you often don’t use all of the variables.
This means that each element of the cast table will correspond to more than one measurement, and so they need to be aggregated in some way.
## time treatment subject rep flavor value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
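A call along these lines (a guess at the original, casting the molten ff_m without supplying fun.aggregate) would produce the message and table below:
dcast(ff_m, subject ~ flavor, value.var = "value")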
## Aggregation function missing: defaulting to length
## subject potato buttery grassy rancid painty
## 1 3 54 54 54 54 54
## 2 10 60 60 60 60 60
## 3 15 59 59 59 59 59
## 4 16 60 59 60 60 60
## 5 19 60 60 60 60 60
## 6 31 54 54 54 54 54
## 7 51 60 60 60 60 60
## 8 52 60 60 60 60 60
## 9 63 60 60 60 60 60
## 10 78 60 60 60 60 60
## 11 79 54 52 54 54 53
## 12 86 54 54 54 54 54
The aggregation function is specified with fun.aggregate:
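For example, the table of per-subject means below could be produced with something like the following (this assumes ff_m contains no missing values, since no na.rm argument is passed):
dcast(ff_m, subject ~ flavor, value.var = "value", fun.aggregate = mean)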
## subject potato buttery grassy rancid painty
## 1 3 6.083333 0.5759259 0.1296296 2.7000000 2.818518519
## 2 10 9.993333 6.7266667 0.4016667 3.0933333 0.961666667
## 3 15 3.908475 1.0084746 0.4000000 2.9389831 2.566101695
## 4 16 6.601667 3.1067797 0.9783333 3.5733333 0.746666667
## 5 19 8.921667 2.4100000 1.7416667 6.0016667 3.611666667
## 6 31 8.635185 0.5703704 0.1388889 6.1907407 4.466666667
## 7 51 10.293333 3.1883333 1.3233333 4.9116667 2.251666667
## 8 52 5.350000 0.8983333 0.9400000 3.8900000 2.500000000
## 9 63 7.750000 0.0650000 0.0450000 5.7766667 3.770000000
## 10 78 3.800000 0.5783333 0.6533333 1.4133333 3.245000000
## 11 79 7.911111 0.5173077 0.2388889 0.9259259 0.009433962
## 12 86 4.014815 1.8222222 0.8444444 4.7074074 3.325925926
A couple of other notes:
You can use ... to represent all the other variables in the formula.
The order in which you specify the variables in the formula matters for the way the output is ordered: the first variables change most slowly, and the later ones change more quickly.
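For example, the two previews below (same content, different row order) could come from calls like these, which differ only in the order of the variables on the left-hand side of the formula; only the first rows of each result are shown:
# rows ordered by subject, then treatment
dcast(ff_m, subject + treatment ~ flavor, value.var = "value", fun.aggregate = mean)
# rows ordered by treatment, then subject
dcast(ff_m, treatment + subject ~ flavor, value.var = "value", fun.aggregate = mean)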
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## 7 15 1 3.360000 0.7200000 0.42000000 3.965000 3.260000
## 8 15 2 4.405000 1.3150000 0.34000000 2.285000 2.060000
## 9 15 3 3.963158 0.9894737 0.44210526 2.547368 2.368421
## 10 16 1 6.495000 3.2600000 0.75500000 4.120000 1.230000
## treatment subject potato buttery grassy rancid painty
## 1 1 3 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 1 10 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 3 1 15 3.360000 0.7200000 0.42000000 3.965000 3.260000
## 4 1 16 6.495000 3.2600000 0.75500000 4.120000 1.230000
## 5 1 19 9.385000 3.0550000 2.02000000 5.360000 2.775000
## 6 1 31 8.844444 0.4444444 0.08888889 5.944444 3.211111
## 7 1 51 10.675000 2.6400000 1.05000000 5.150000 1.955000
## 8 1 52 5.060000 0.8050000 0.87500000 4.285000 2.645000
## 9 1 63 6.775000 0.0250000 0.00000000 6.055000 3.855000
## 10 1 78 3.620000 0.7350000 0.54000000 1.505000 3.490000
The pipe operator, %>%, allows you to change the way in which you call functions.
x %>% fun1() means “pipe x to fun1”.
x %>% fun1 is the same as fun1(x).
Most useful when piping several functions together.
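A sketch of how the pipe might be used with the functions from this lecture, assuming the magrittr/dplyr pipe is available and ff_m is the molten french fries data from above; the piped and nested versions are equivalent:
library(magrittr)  # provides %>% (also re-exported by dplyr)

# piped version
ff_m %>%
  dcast(subject + treatment ~ flavor, value.var = "value", fun.aggregate = mean) %>%
  head()

# equivalent nested version
head(dcast(ff_m, subject + treatment ~ flavor,
           value.var = "value", fun.aggregate = mean))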
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
## 7 15 1 3.360000 0.7200000 0.42000000 3.965000 3.260000
## 8 15 2 4.405000 1.3150000 0.34000000 2.285000 2.060000
## 9 15 3 3.963158 0.9894737 0.44210526 2.547368 2.368421
## 10 16 1 6.495000 3.2600000 0.75500000 4.120000 1.230000
## subject treatment potato buttery grassy rancid painty
## 1 3 1 6.216667 0.3722222 0.18888889 2.105556 3.111111
## 2 3 2 6.738889 0.5888889 0.10555556 3.138889 2.477778
## 3 3 3 5.294444 0.7666667 0.09444444 2.855556 2.866667
## 4 10 1 9.955000 6.7500000 0.58500000 4.020000 1.375000
## 5 10 2 9.995000 6.9800000 0.47500000 2.150000 0.820000
## 6 10 3 10.030000 6.4500000 0.14500000 3.110000 0.690000
Final topic: What if you have data from two different places and you need to put them together?
Basic syntax: merge(x, y, by.x, by.y)
x and y are the two datasets you want to merge.
by.x is the column of x to merge on.
by.y is the column of y to merge on.
Example:
cities = data.frame(
  city = c('New York', 'Boston', 'Juneau',
           'Anchorage', 'San Diego',
           'Philadelphia', 'Los Angeles',
           'Fairbanks', 'Ann Arbor', 'Seattle'),
  state.abb = c('NY', 'MA', 'AK', 'AK', 'CA',
                'PA', 'CA', 'AK', 'MI', 'WA'))
# state.name and state.abb are R's built-in vectors of US state names and abbreviations
states = data.frame(state.name, state.abb)
cities
## city state.abb
## 1 New York NY
## 2 Boston MA
## 3 Juneau AK
## 4 Anchorage AK
## 5 San Diego CA
## 6 Philadelphia PA
## 7 Los Angeles CA
## 8 Fairbanks AK
## 9 Ann Arbor MI
## 10 Seattle WA
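The first rows of states, presumably printed with something like:
head(states)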
## state.name state.abb
## 1 Alabama AL
## 2 Alaska AK
## 3 Arizona AZ
## 4 Arkansas AR
## 5 California CA
## 6 Colorado CO
We want to add the state name to the cities data frame, and we can use merge.
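Judging from the column order of the output below, a call like the following would produce it (a guess at the original; listing states first puts its columns before city):
merge(states, cities, by = "state.abb")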
## state.abb state.name city
## 1 AK Alaska Juneau
## 2 AK Alaska Anchorage
## 3 AK Alaska Fairbanks
## 4 CA California San Diego
## 5 CA California Los Angeles
## 6 MA Massachusetts Boston
## 7 MI Michigan Ann Arbor
## 8 NY New York New York
## 9 PA Pennsylvania Philadelphia
## 10 WA Washington Seattle
Notice in the last example that there was a choice about how the merge should handle unmatched values, since the two datasets have different sets of values for state.abb; by default, only rows whose state.abb appears in both data frames are kept.
You can modify this with all, all.x, or all.y:
all = TRUE means that you get one row for each value of the merging variable that was seen in either x or y.
all.x = TRUE means you get one row for each value of the merging variable that was seen in x.
all.y = TRUE means you get one row for each value of the merging variable that was seen in y.
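The full table below, with NA for states that have no city in cities, could be produced with something like:
# keep every row of states, whether or not it matches a city
merge(states, cities, by = "state.abb", all.x = TRUE)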
## state.abb state.name city
## 1 AK Alaska Juneau
## 2 AK Alaska Anchorage
## 3 AK Alaska Fairbanks
## 4 AL Alabama <NA>
## 5 AR Arkansas <NA>
## 6 AZ Arizona <NA>
## 7 CA California San Diego
## 8 CA California Los Angeles
## 9 CO Colorado <NA>
## 10 CT Connecticut <NA>
## 11 DE Delaware <NA>
## 12 FL Florida <NA>
## 13 GA Georgia <NA>
## 14 HI Hawaii <NA>
## 15 IA Iowa <NA>
## 16 ID Idaho <NA>
## 17 IL Illinois <NA>
## 18 IN Indiana <NA>
## 19 KS Kansas <NA>
## 20 KY Kentucky <NA>
## 21 LA Louisiana <NA>
## 22 MA Massachusetts Boston
## 23 MD Maryland <NA>
## 24 ME Maine <NA>
## 25 MI Michigan Ann Arbor
## 26 MN Minnesota <NA>
## 27 MO Missouri <NA>
## 28 MS Mississippi <NA>
## 29 MT Montana <NA>
## 30 NC North Carolina <NA>
## 31 ND North Dakota <NA>
## 32 NE Nebraska <NA>
## 33 NH New Hampshire <NA>
## 34 NJ New Jersey <NA>
## 35 NM New Mexico <NA>
## 36 NV Nevada <NA>
## 37 NY New York New York
## 38 OH Ohio <NA>
## 39 OK Oklahoma <NA>
## 40 OR Oregon <NA>
## 41 PA Pennsylvania Philadelphia
## 42 RI Rhode Island <NA>
## 43 SC South Carolina <NA>
## 44 SD South Dakota <NA>
## 45 TN Tennessee <NA>
## 46 TX Texas <NA>
## 47 UT Utah <NA>
## 48 VA Virginia <NA>
## 49 VT Vermont <NA>
## 50 WA Washington Seattle
## 51 WI Wisconsin <NA>
## 52 WV West Virginia <NA>
## 53 WY Wyoming <NA>
Some additional notes:
If you don’t specify by.x and by.y, the default is to merge on the columns that are common to the two data frames.
by.x/by.y can have length more than 1, in which case we match on the entire set of specified variables (see the sketch below).
You can use by instead of by.x and by.y, in which case the name of the column to merge on has to be the same in both x and y.
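A small sketch of merging on more than one column; the data frames and column names here (subject, time) are made up for illustration:
df1 = data.frame(subject = c(1, 1, 2), time = c(1, 2, 1), score = c(10, 12, 9))
df2 = data.frame(subject = c(1, 2), time = c(1, 1), group = c("a", "b"))

# match on the combination of subject and time
merge(df1, df2, by = c("subject", "time"))
# keep rows of df1 even when there is no matching row in df2
merge(df1, df2, by = c("subject", "time"), all.x = TRUE)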
Merging, melting/casting, and split/apply/combine from last week are useful enough that there are multiple implementations.
You can use other implementations, but you shouldn’t try to re-make them from scratch.
Some other options are:
join in plyr does the same thing as merge.
gather and spread, and more recently pivot_longer and pivot_wider, in the tidyr package do roughly the same thing as melt and cast, respectively.
There are multiple iterations of reshape: reshape2 was a redesign of the original reshape package.
plyr and dplyr are two packages with basically the same goal, but dplyr is focused more on data frames.
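For reference, a sketch of the tidyr equivalents applied to french_fries (pivot_longer plays the role of melt and pivot_wider the role of dcast; french_fries itself comes from reshape2):
library(tidyr)
# wide to long: one row per (id variables, flavor) combination
ff_long = pivot_longer(french_fries, cols = potato:painty,
                       names_to = "flavor", values_to = "value")
# long to wide, aggregating duplicate measurements with the mean
ff_wide = pivot_wider(ff_long, id_cols = c(subject, treatment),
                      names_from = flavor, values_from = value,
                      values_fn = mean)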