In R - generate pairwise data.frame from all rows in data.frame - r

I have a data.frame called df with 8 million observations in 4 columns:
name <- c("Pablo", "Christina", "Steve", "Diego", "Ali", "Brit", "Ruth", "Mia", "David", "Dylan")
year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
v2 <- sample(1:10, 10, replace=T)
df <- data.frame(name, year, v1, v2)
> df
name year v1 v2
1 Pablo 2000 2 9
2 Christina 2001 5 3
3 Steve 2002 8 9
4 Diego 2003 7 6
5 Ali 2004 2 4
6 Brit 2005 1 1
7 Ruth 2006 10 9
8 Mia 2007 6 7
9 David 2008 10 9
10 Dylan 2009 3 2
I want to generate a data.frame output with all pairwise combinations of the rows in df, which looks like this:
>output
name year v1 v2 name_2 year_2 v1_2 v2_2
1 Pablo 2000 2 9 Christina 2001 5 3
2 Pablo 2000 2 9 Steve 2002 8 9
3 Pablo 2000 2 9 Diego 2003 7 6
etc.
What are the fastest ways to do this?

tidyr::crossing will return all combinations of observations, but you'll need to rename the columns of one copy with setNames or the like so the two sets of names don't clash. If you don't want self-matches, you can remove them by calling dplyr::filter on a unique ID column.
library(tidyverse)
df_crossed <- df %>%
setNames(paste0(names(.), '_2')) %>%
crossing(df) %>%
filter(name != name_2)
head(df_crossed)
## name_2 year_2 v1_2 v2_2 name year v1 v2
## 1 Pablo 2000 5 5 Christina 2001 7 3
## 2 Pablo 2000 5 5 Steve 2002 1 9
## 3 Pablo 2000 5 5 Diego 2003 2 8
## 4 Pablo 2000 5 5 Ali 2004 9 5
## 5 Pablo 2000 5 5 Brit 2005 8 5
## 6 Pablo 2000 5 5 Ruth 2006 8 1
Another way to fix names would be to use janitor::clean_names after crossing, though it's an extra package.

Hopefully this will give the result the post owner was looking for.
name <- c("Pablo", "Christina", "Steve", "Diego", "Ali", "Brit", "Ruth", "Mia", "David", "Dylan")
year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
v2 <- sample(1:10, 10, replace=T)
df <- data.frame(name, year, v1, v2, stringsAsFactors=FALSE)
print(df)
rows = nrow(df)
n <- rows * (rows - 1) / 2
ndf <- data.frame(
name1=character(n),year1=numeric(n), v1_1=numeric(n),v2_1=numeric(n),
name2=character(n),year2=numeric(n), v1_2=numeric(n),v2_2=numeric(n),
stringsAsFactors=FALSE
)
k <- 1
for (i in 1:(rows-1))
{
for (j in (i+1):rows)
{
ndf[k,] <- c(df[i,], df[j,])
k <- k + 1
}
}
print(ndf)
# name year v1 v2
#1 Pablo 2000 4 9
#2 Christina 2001 2 1
#3 Steve 2002 2 9
#4 Diego 2003 5 5
#5 Ali 2004 10 4
#6 Brit 2005 5 2
#7 Ruth 2006 7 10
#8 Mia 2007 6 7
#9 David 2008 4 10
#10 Dylan 2009 7 3
# name1 year1 v1_1 v2_1 name2 year2 v1_2 v2_2
#1 Pablo 2000 4 9 Christina 2001 2 1
#2 Pablo 2000 4 9 Steve 2002 2 9
#3 Pablo 2000 4 9 Diego 2003 5 5
#4 Pablo 2000 4 9 Ali 2004 10 4
#5 Pablo 2000 4 9 Brit 2005 5 2
#6 Pablo 2000 4 9 Ruth 2006 7 10
#7 Pablo 2000 4 9 Mia 2007 6 7
#8 Pablo 2000 4 9 David 2008 4 10
#9 Pablo 2000 4 9 Dylan 2009 7 3
#10 Christina 2001 2 1 Steve 2002 2 9
#...
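The nested loop above can also be written without explicit loops. What follows is a minimal base R sketch, not the answerer's code: it assumes a small illustrative df (four rows copied from the question's printout) and uses combn() to enumerate each unordered pair of row indices once. Like any approach that materializes every pair, it will not scale to 8 million rows.

```r
# Illustrative data: the first four rows shown in the question
df <- data.frame(
  name = c("Pablo", "Christina", "Steve", "Diego"),
  year = 2000:2003,
  v1   = c(2, 5, 8, 7),
  v2   = c(9, 3, 9, 6),
  stringsAsFactors = FALSE
)

# 2 x choose(n, 2) matrix: each column holds one pair of row indices (i < j)
idx <- combn(nrow(df), 2)

# Left half: rows i; right half: rows j with "_2" appended to the column names
first  <- df[idx[1, ], ]
second <- setNames(df[idx[2, ], ], paste0(names(df), "_2"))

pairs <- cbind(first, second)
rownames(pairs) <- NULL
print(pairs)
```

Because combn() only emits index pairs with i < j, no self-matches or reverse duplicates need to be filtered out afterwards.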

Not to add to the noise, but consider a base R cross join using merge on the same data frame, followed by a filter that drops self-pairs and reverse duplicates. Do note that the cross join before the filter returns an 8 million x 8 million row dataset, so hopefully your RAM is sufficient for such an operation.
df <- data.frame(name = c("Pablo", "Christina", "Steve", "Diego", "Ali",
"Brit", "Ruth", "Mia", "David", "Dylan"),
year = seq(2000, 2009, 1),
v1 =sample(1:10, 10, replace=T),
v2 =sample(1:10, 10, replace=T),
stringsAsFactors = FALSE)
# MERGE ON KEY, THEN REMOVE KEY COL
df$key <- 1
dfm <- merge(df, df, by="key")[,-1]
# FILTER OUT SAME NAME AND REVERSE DUPS, THEN RENAME COLUMNS
dfm <- setNames(dfm[(dfm$name.x < dfm$name.y),],
c("name_p1", "year_p1", "V1_p1", "V2_p1",
"name_p2", "year_p2", "V1_p2", "V2_p2"))
# ALL PABLO PAIRINGS
dfm[dfm$name_p1=='Pablo' | dfm$name_p2=='Pablo',]
# name_p1 year_p1 V1_p1 V2_p1 name_p2 year_p2 V1_p2 V2_p2
# 3 Pablo 2000 7 8 Steve 2002 3 1
# 7 Pablo 2000 7 8 Ruth 2006 8 4
# 11 Christina 2001 10 10 Pablo 2000 7 8
# 31 Diego 2003 4 9 Pablo 2000 7 8
# 41 Ali 2004 5 3 Pablo 2000 7 8
# 51 Brit 2005 2 4 Pablo 2000 7 8
# 71 Mia 2007 7 7 Pablo 2000 7 8
# 81 David 2008 1 7 Pablo 2000 7 8
# 91 Dylan 2009 9 2 Pablo 2000 7 8
If this large data set comes from an SQL-compliant database, I can provide the SQL counterpart, which may be more efficient because the filter runs as part of the join rather than separately afterwards.

This extension of @alistaire's solution uses a crossed matrix of row indices. The question as stated wants the full crossed output, which will be very large (about 6.4 x 10^13 rows for 8 million items), so there is really no way around the memory requirement. However, if the real-world use is to deal with subsets, the indexing technique shown here may be a way to reduce memory use. It's possible that crossing only the integers uses less memory during the crossing operation.
library(dplyr)
library(tidyr)
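# name the two index columns explicitly so the crossed columns have unique names
crossed <- as.matrix(crossing(i = seq_len(nrow(df)), j = seq_len(nrow(df))))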
# bind and name in one step (may be inefficient) so that filter can be applied in one step
output <- as.data.frame(cbind(df[crossed[, 1],],
data.frame(name_2 = df[crossed[, 2], 1],
year_2 = df[crossed[, 2], 2],
v1_2 = df[crossed[, 2], 3],
v2_2 = df[crossed[, 2], 4]) )) %>%
filter(!(name == name_2 & year == year_2))
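# rough size estimate for 8 million input rows, scaled linearly from this 10-row sample
# (the number of pairs actually grows quadratically, so this is a lower bound)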
format(object.size(output) / (10 / 8e6), units="MB")
#[1] "5304 Mb"
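For scale, the number of pairs can be worked out directly. This is a small arithmetic sketch, assuming n = 8 million input rows as stated in the question; the variable names are illustrative.

```r
n <- 8e6  # number of input rows stated in the question

# Ordered pairs without self-matches: every row paired with every other row
ordered_pairs <- n * (n - 1)

# Unordered pairs (A-B and B-A counted once): choose(n, 2)
unordered_pairs <- ordered_pairs / 2

print(format(ordered_pairs, big.mark = ",", scientific = FALSE))
print(format(unordered_pairs, big.mark = ",", scientific = FALSE))

# Lower bound on memory: a single numeric column (8 bytes per value), in terabytes
one_numeric_col_tb <- unordered_pairs * 8 / 1e12
print(one_numeric_col_tb)
```

Even one numeric column of the unordered result is far beyond typical RAM, which is why restricting the pairing to subsets matters more than the choice of function.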

You could cross join the name column to itself using data.table and remove the repeated cases. This gives a smaller structure to merge data into, rather than doing the full merge and then filtering. You can add the rest of the data with two merges: one for the data associated with the first name column and another for the data associated with the second name column.
name <- c("Pablo", "Christina", "Steve", "Diego", "Ali", "Brit", "Ruth", "Mia", "David", "Dylan")
year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
v2 <- sample(1:10, 10, replace=T)
# stringsAsFactors = FALSE so that names are character and can be compared with < below
df <- data.frame(name, year, v1, v2, stringsAsFactors = FALSE)
library(data.table)
setDT(df)
setkey(df)
# cross-join name column to itself while removing duplicates and redundancies
name_cj <- setnames(
CJ(df[, name], df[, name])[V1 < V2], # taking a hint from Parfait's clever solution
c("name1", "name2"))
# perform 2 merges, once for the 1st name column and
# again for the 2nd name column
name_cj <- merge(
merge(name_cj, df, by.x = "name1", by.y = "name"),
df,
by.x = "name2", by.y = "name", suffixes = c("_1", "_2"))
# reorder columns as desired with setcolorder()
head(name_cj)
# name2 name1 year_1 v1_1 v2_1 year_2 v1_2 v2_2
#1: Brit Ali 2004 3 8 2005 4 5
#2: Christina Ali 2004 3 8 2001 9 8
#3: Christina Brit 2005 4 5 2001 9 8
#4: David Ali 2004 3 8 2008 5 2
#5: David Brit 2005 4 5 2008 5 2
#6: David Christina 2001 9 8 2008 5 2

Related

`str_replace_all` numeric values in column according to named vector

I want to use a named vector to map numeric values of a data frame column.
consider the following example:
library(tidyverse)
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, replace = TRUE)) %>%
add_row(year=2005, value=1)
df
# year value
# 1 2000 12
# 2 2001 15
# 3 2002 11
# 4 2003 12
# 5 2004 14
# 6 2005 1
I now want to replace according to a vector, like this one
repl_vec <- c("1"="apple", "11"="radish", "12"="tomato", "13"="cucumber", "14"="eggplant", "15"="carrot")
which I do with this
df %>% mutate(val_alph = str_replace_all(value, repl_vec))
However, this gives:
# year value val_alph
# 1 2000 11 appleapple
# 2 2001 13 apple3
# 3 2002 15 apple5
# 4 2003 12 apple2
# 5 2004 14 apple4
# 6 2005 1 apple
because str_replace_all replaces every partial match of a pattern (here the "1" inside "11") instead of requiring the whole value to match. In the real data, the names of the named vector are also numbers (one and two digits).
I expect the output to be like this:
# year value val_alph
# 1 2000 11 radish
# 2 2001 13 cucumber
# 3 2002 15 carrot
# 4 2003 12 tomato
# 5 2004 14 eggplant
# 6 2005 1 apple
Does someone have a clever way of achieving this?
I would use base R's match instead of string matching here, since you are looking for exact whole string matches.
df %>%
mutate(value = repl_vec[match(value, names(repl_vec))])
#> year value
#> 1 2000 radish
#> 2 2001 carrot
#> 3 2002 carrot
#> 4 2003 cucumber
#> 5 2004 eggplant
#> 6 2005 apple
Created on 2022-04-20 by the reprex package (v2.0.1)
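The same exact-match lookup can be written as plain named-vector indexing. This is a minimal base R sketch, assuming the repl_vec from the question and an illustrative value vector; it is a variation on the match() idea above, not the answerer's code.

```r
repl_vec <- c("1" = "apple", "11" = "radish", "12" = "tomato",
              "13" = "cucumber", "14" = "eggplant", "15" = "carrot")

# Illustrative values (the question's data are random draws from 11:15 plus a 1)
value <- c(12, 15, 11, 12, 14, 1)

# Index the named vector by the value converted to character.
# Only whole names match, so "11" can never be confused with "1".
val_alph <- unname(repl_vec[as.character(value)])
print(val_alph)

# A value with no entry in repl_vec yields NA rather than a partial replacement
print(unname(repl_vec[as.character(99)]))
```

Converting with as.character() is the important step: indexing with the raw numbers would select by position instead of by name.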
Is this what you want to do?
set.seed(1234)
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, r = T)) %>%
add_row(year=2005, value=1)
repl_vec <- c("1"="one", "11"="eleven", "12"="twelve", "13"="thirteen", "14"="fourteen", "15"="fifteen")
names(repl_vec) <- paste0("\\b", names(repl_vec), "\\b")
df %>%
mutate(val_alph = str_replace_all(value, repl_vec)) # a named vector supplies both patterns (names) and replacements (values)
which gives:
year value val_alph
1 2000 14 fourteen
2 2001 12 twelve
3 2002 15 fifteen
4 2003 14 fourteen
5 2004 11 eleven
6 2005 1 one
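The effect of the word-boundary anchors can be seen in isolation. This is a small base R sketch using gsub() with perl = TRUE instead of stringr; the input strings are illustrative.

```r
x <- c("1", "11", "12", "21")

# Without anchors, every "1" is replaced, including those inside longer numbers
unanchored <- gsub("1", "apple", x)
print(unanchored)

# With \\b on both sides, only a standalone "1" is replaced
anchored <- gsub("\\b1\\b", "apple", x, perl = TRUE)
print(anchored)
```

Wrapping each name of the replacement vector in "\\b" applies the same restriction to every pattern at once.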

subtract specific row and rename it

Is it possible to subtract certain rows and rename them?
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","b","c","a","b","c", "a", "b", "c")
value <- c(2,2,10,3,3,12,4,4,16)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
And this is how the result should look:
year category value
2005 a 2
2005 b 2
2005 c 4
2006 a 3
2006 b 3
2006 c 12
2007 a 4
2007 b 4
2007 c 16
2005 c-b 2
2006 c-b 9
2007 c-b 12
You can use group_modify:
library(tidyverse)
df %>%
group_by(year) %>%
group_modify(~ add_row(.x, category = "c-b", value = .x$value[.x$category == "c"] - .x$value[.x$category == "b"]))
# A tibble: 12 x 3
# Groups: year [3]
year category value
<dbl> <chr> <dbl>
1 2005 a 2
2 2005 b 2
3 2005 c 10
4 2005 c-b 8
5 2006 a 3
6 2006 b 3
7 2006 c 12
8 2006 c-b 9
9 2007 a 4
10 2007 b 4
11 2007 c 16
12 2007 c-b 12
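The same per-year difference can be computed in base R. This is a minimal sketch, assuming the df from the question and exactly one "b" and one "c" row per year; the intermediate names (c_rows, b_rows, both, diff_rows) are illustrative.

```r
year <- c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007)
category <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
value <- c(2, 2, 10, 3, 3, 12, 4, 4, 16)
df <- data.frame(year, category, value, stringsAsFactors = FALSE)

# Pull out the two categories and line them up by year
c_rows <- df[df$category == "c", ]
b_rows <- df[df$category == "b", ]
both <- merge(c_rows, b_rows, by = "year", suffixes = c("_c", "_b"))

# Build the new rows, one per year, labelled "c-b"
diff_rows <- data.frame(
  year = both$year,
  category = "c-b",
  value = both$value_c - both$value_b,
  stringsAsFactors = FALSE
)

# Append them to the original data
result <- rbind(df, diff_rows)
print(result)
```

Unlike the group_modify version, this appends the new rows at the end instead of interleaving them within each year; sort with order(result$year) if the grouped layout is preferred.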
See the subset() function.
Example:
subset_df <- subset(df, category == "c")
If you want to know which rows you are dealing with, use which():
rows <- which(df$category == "c")
subset_df <- df[rows, ]
You can then rename the selected rows, supplying one name per row:
row.names(subset_df) <- c("Your desired row names")

Assign unique ID based on two columns [duplicate]

I have a dataframe (df) that looks like this:
School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000
And I would like to create a person ID column so that df looks like this:
ID School Student Year
1 A 10 1999
1 A 10 2000
2 A 20 1999
2 A 20 2000
2 A 20 2001
3 B 10 1999
3 B 10 2000
In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students total).
I did df$ID <- df$Student and tried to request the value +1 if c("School", "Student") was unique. It isn't working. Help appreciated.
We can do this in base R without doing any group by operation
df$ID <- cumsum(!duplicated(df[1:2]))
df
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
NOTE: This assumes the rows are already ordered by 'School' and 'Student', so that each group is contiguous.
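When the rows are not sorted, the cumsum approach no longer works, because it only increments on first occurrences. A minimal base R sketch that does not depend on row order is shown below; it assumes an illustrative unsorted df and that the "_" separator does not make two different School/Student combinations collide.

```r
# Illustrative data, deliberately NOT sorted by School and Student
df <- data.frame(
  School  = c("A", "B", "A", "A", "B", "A"),
  Student = c(10, 10, 20, 10, 10, 20),
  Year    = c(1999, 1999, 1999, 2000, 2000, 2000),
  stringsAsFactors = FALSE
)

# Combine the two columns into one key per row
key <- paste(df$School, df$Student, sep = "_")

# Position of each key among the distinct keys, in order of first appearance
df$ID <- match(key, unique(key))
print(df)

# For comparison: the cumsum approach on the same unsorted data
print(cumsum(!duplicated(df[1:2])))
```

Here IDs are numbered by first appearance, so the result matches the cumsum version whenever the data are sorted.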
Or using tidyverse
library(dplyr)
df %>%
mutate(ID = group_indices_(df, .dots=c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
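As @radek mentioned, in a more recent version (dplyr 0.8.0) we get a notification that group_indices_ is deprecated; use group_indices instead. (From dplyr 1.0.0 on, passing grouping columns to group_indices is itself deprecated in favour of group_by followed by cur_group_id.)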
df %>%
mutate(ID = group_indices(., School, Student))
Group by School and Student, then assign group id to ID variable.
library('data.table')
df[, ID := .GRP, by = .(School, Student)]
# School Student Year ID
# 1: A 10 1999 1
# 2: A 10 2000 1
# 3: A 20 1999 2
# 4: A 20 2000 2
# 5: A 20 2001 2
# 6: B 10 1999 3
# 7: B 10 2000 3
Data:
df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')

R converting to long format, pattern

I would like to convert a data.table like this one from wide format to long.
set.seed(1)
DT <- data.table(
ID = c(1:5, NA),
Name = c("Bob","Ana","Smith","Sam","Big","Lulu"),
Kind_2001 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Kind_2002 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Kind_2003 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Conc_2001 = sample(99,6),
Conc_2002 = sample(79,6),
Conc_2003 = sample(49,6)
)
ID Name Kind_2001 Kind_2002 Kind_2003 Conc_2001 Conc_2002 Conc_2003
1 Bob b NA c 38 22 24
2 Ana b c b 77 31 29
3 Smith c c NA 91 2 49
4 Sam NA a b 21 30 9
5 Big a a c 62 66 38
NA Lulu NA a NA 12 26 30
And I would like to get something like this:
ID Name Year Kind Conc
1 Bob 2001 b 38
1 Bob 2002 NA 22
1 Bob 2003 c 24
2 Ana 2001 b 77
2 Ana 2002 c 31
2 Ana 2003 b 29
...
The real table has many more variables, I'm looking for a solution without explicitly saying every column name or number, detecting automatically the pattern.
I have two kinds of columns: some end with an underscore and a four-digit year, such as _2001, and the others do not.
Some can have an underscore in the middle of the name (this will be kept untransformed).
I would like to transform the columns ending with a year to long format.
I've tried with
melt(DT, id=1:2, variable.name = "year")
or with
melt(DT, id=1:2, measure=patterns("_2[0-9][0-9][0-9]$"))
but I'm not getting what I want.
Maybe I first need to filter the names with gsub.
PS: I've found this solution.
posi <- grep("_[0-9][0-9][0-9][0-9]$",colnames(DT))
work <- unique(gsub("_[0-9][0-9][0-9][0-9]$","",colnames(DT)[posi]))
melt(DT, measure=patterns(paste0("^",work)), variable="year", value.name=work)
It almost works but the year column is not populated properly. I'm missing something or it's a bug.
And I'm sure it could be written simpler.
ID Name year Kind Conc
1 Bob 1 b 38
2 Ana 1 b 77
3 Smith 1 c 91
4 Sam 1 NA 21
5 Big 1 a 62
NA Lulu 1 NA 12
1 Bob 2 NA 22
2 Ana 2 c 31
3 Smith 2 c 2
4 Sam 2 a 30
5 Big 2 a 66
NA Lulu 2 a 26
1 Bob 3 c 24
2 Ana 3 b 29
3 Smith 3 NA 49
4 Sam 3 b 9
5 Big 3 c 38
NA Lulu 3 NA 30
Regards
I've tried eddi solution with my database and I get the error:
"Error: cannot allocate vector of size 756.5 Mb"
even though I have 16GB of memory.
We can solve this at scale using reshape() from base R, without having to name variables explicitly.
# First we get indices of colnames that have format "_1234" at the end
tomelt <- grep("_([0-9]{4})$",names(DT))
# Now we use these indices to reshape data
reshape(DT, varying = tomelt, sep = "_",
direction = 'long', idvar = "ID", timevar = "Year")
# ID Name Year Kind Conc
# 1: 1 Bob 2001 b 38
# 2: 2 Ana 2001 b 77
# 3: 3 Smith 2001 c 91
# 4: 4 Sam 2001 NA 21
# 5: 5 Big 2001 a 62
# 6: NA Lulu 2001 NA 12
...
If we are looking for a data.table solution, extract the prefix part from the names of "DT" and use the unique elements as patterns in the measure argument of melt. Similarly, extract the year suffix from the names and use it to replace the numeric index in "Year".
nm <- unique(sub("_\\d+", "", names(DT)[-(1:2)]))
yr <- unique(sub("\\D+_", "", names(DT)[-(1:2)]))
melt(DT, measure = patterns(paste0("^", nm)), value.name = nm,
variable.name = "Year")[, Year := yr[Year]][]
# ID Name Year Kind Conc
# 1: 1 Bob 2001 b 38
# 2: 2 Ana 2001 b 77
# 3: 3 Smith 2001 c 91
# 4: 4 Sam 2001 NA 21
# 5: 5 Big 2001 a 62
# 6: NA Lulu 2001 NA 12
# 7: 1 Bob 2002 NA 22
# 8: 2 Ana 2002 c 31
# 9: 3 Smith 2002 c 2
#10: 4 Sam 2002 a 30
#11: 5 Big 2002 a 66
#12: NA Lulu 2002 a 26
#13: 1 Bob 2003 c 24
#14: 2 Ana 2003 b 29
#15: 3 Smith 2003 NA 49
#16: 4 Sam 2003 b 9
#17: 5 Big 2003 c 38
#18: NA Lulu 2003 NA 30
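The name-splitting step can be checked on its own before melting. This is a minimal base R sketch, assuming an illustrative vector of column names that includes one with an underscore in the middle; anchoring the pattern at the end of the name keeps that inner underscore in the prefix.

```r
# Illustrative column names, including one with an underscore inside the prefix
cols <- c("ID", "Name", "Kind_2001", "Kind_2002", "Conc_2004", "aa_Conc_2001")

# A trailing underscore followed by exactly four digits marks a year column
year_pattern <- "_([0-9]{4})$"
is_year_col <- grepl(year_pattern, cols)

# Columns without the suffix are treated as id columns
id_cols <- cols[!is_year_col]

# Strip the suffix to get the variable name, and capture the digits to get the year
prefix <- sub(year_pattern, "", cols[is_year_col])
year <- as.integer(sub(paste0("^.*", year_pattern), "\\1", cols[is_year_col]))

print(id_cols)
print(data.frame(column = cols[is_year_col], prefix = prefix, year = year))
```

The unique values of prefix can then serve as the measure patterns and the year values as the labels for the long format.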
Here's an option that's more robust with respect to the order of your columns, as well as missing/extra years:
dcast(melt(DT, id.vars = c("ID", "Name"))
[, .(ID, Name, sub('_.*', '', variable), sub('.*_', '', variable), value)],
ID + Name + V4 ~ V3)
# ID Name V4 Conc Kind
# 1: 1 Bob 2001 38 b
# 2: 1 Bob 2002 22 NA
# 3: 1 Bob 2003 24 c
# 4: 2 Ana 2001 77 b
# 5: 2 Ana 2002 31 c
# 6: 2 Ana 2003 29 b
# 7: 3 Smith 2001 91 c
# 8: 3 Smith 2002 2 c
# 9: 3 Smith 2003 49 NA
#10: 4 Sam 2001 21 NA
#11: 4 Sam 2002 30 a
#12: 4 Sam 2003 9 b
#13: 5 Big 2001 62 a
#14: 5 Big 2002 66 a
#15: 5 Big 2003 38 c
#16: NA Lulu 2001 12 NA
#17: NA Lulu 2002 26 a
#18: NA Lulu 2003 30 NA
Edit for many id columns:
idvars = grep("_", names(DT), invert = TRUE)
dcast(melt(DT, id.vars = idvars)
[, `:=`(var = sub('_.*', '', variable),
year = sub('.*_', '', variable),
variable = NULL)],
... ~ var, value.var='value')
In case anybody is interested, I post my full solution here;
it can work with datasets bigger than memory. It uses some of your ideas and some of mine.
My data is in the file file.csv (you can even do it with a compressed file using fread("unzip -c name.zip")).
## Initialization
nline <- 1500000 # total number of lines or use wc -l to do it automatically.
chunk <- 5000 # change it according to your memory and number of columns.
times <- ceiling(nline/chunk)
name <- names(fread("file.csv", stringsAsFactors=F, integer64 = "character", nrows=0, na.strings=c("", "NA")) )
idvars = grep("_20[0-9][0-9]$",name , invert = TRUE)
# Now we loop every chunk
for(iter in 0:(times-1)) {
my <- fread("file.csv", stringsAsFactors=F, integer64 = "character", skip=1+(iter*chunk), nrows=chunk, na.strings=c("", "NA"))
colnames(my) <- name
temp <- melt(my, id.vars = idvars)
newfile <- dcast(
temp[, `:=`(var = sub('_20[0-9][0-9]$', '', variable), year = sub('.*_', '', variable), variable = NULL)],
... ~ var, value.var='value')
fwrite(newfile, "long.csv", quote=FALSE, sep=",", append=T)
rm(temp); rm(newfile); rm(my); gc()
}
#
As said before, the problem with this method is that it converts all the values to character, but if you save them to a file and read the file again (as here) you get the proper classes.
For very large files this method is very slow.
I encourage you to improve this solution or suggest any generic solution with tidyr, splitstackshape or other packages.
Or even better it would be great to do it with a database such as sqlite.
The solution should work on datasets with unordered columns or even with "_" in the middle of the name, such as:
set.seed(1)
DT <- data.table(
ID = c(1:15),
Name = c("Bob","Ana","Smith","Sam","Big","Lulu", "Loli", "Chochi", "Tom", "Dick", "Pet", "Shin", "Rock", "Pep", "XXX"),
Kind_2001 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
Kind_2002 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
Kind_2003 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
Conc_2004 = sample(49,15),
aa_Conc_2001 = c(sample(99,14), NA),
Conc_2002 = sample(79,15)
)

Merge 2 data frame based on 2 columns with different column names

I have 2 very large data sets that look like the ones below:
merge_data <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
position=c("yes","no","yes","no","yes",
"no","yes","no","yes","yes"),
school = c("a","b","a","a","c","b","c","d","d","e"),
year1 = c(2000,2000,2000,2001,2001,2000,
2003,2005,2008,2009))
# year2 is added afterwards: year1 is not available as a variable inside data.frame()
merge_data$year2 <- merge_data$year1 - 1
merge_data
ID position school year1 year2
1 1 support a 2000 1999
2 2 oppose b 2000 1999
3 3 support a 2000 1999
4 4 oppose a 2001 2000
5 5 support c 2001 2000
6 6 oppose b 2000 1999
7 7 support c 2003 2002
8 8 oppose d 2005 2004
9 9 support d 2008 2007
10 10 support e 2009 2008
merge_data_2 <- data.frame(year=c(1999,1999,2000,2000,2000,2001,2003
,2012,2009,2009,2008,2002,2009,2005,
2001,2000,2002,2000,2008,2005),
amount=c(100,200,300,400,500,600,700,800,900,
1000,1100,1200,1300,1400,1500,1600,
1700,1800,1900,2000),
ID=c(1,1,2,2,2,3,3,3,5,6,8,9,10,13,15,17,19,20,21,7))
merge_data_2
year amount ID
1 1999 100 1
2 1999 200 1
3 2000 300 2
4 2000 400 2
5 2000 500 2
6 2001 600 3
7 2003 700 3
8 2012 800 3
9 2009 900 5
10 2009 1000 6
11 2008 1100 8
12 2002 1200 9
13 2009 1300 10
14 2005 1400 13
15 2001 1500 15
16 2000 1600 17
17 2002 1700 19
18 2000 1800 20
19 2008 1900 21
20 2005 2000 7
And what I want is:
ID position school year1 year2 amount
1 yes a 2000 1999 300
2 no b 2000 1999 1200
10 yes e 2009 2008 1300
For ID=1 the amount is 300, because merge_data_2 has 2 rows with ID=1 whose year equals year1 or year2 of ID=1 in merge_data (100 + 200).
So basically I want to perform a merge based on ID and year, under 2 conditions:
the ID from merge_data matches the ID from merge_data_2;
either year1 or year2 from merge_data matches the year from merge_data_2.
Then the merge should carry the sum of amount for each ID.
I think the code will look something like:
merge_data_final <- merge(merge_data, merge_data_2,
merge_data$ID == merge_data_2$ID && (merge_data$year1 ||
merge_data$year2 == merge_data_2$year))
Then somehow to aggregate the amount by ID.
Obviously I know the code is wrong, and I have been thinking about plyr or reshape library, but was having difficulties of getting my hands on them.
Any helps would be great! thanks guys!
As noted above, I think you have some discrepancies between your example input and output data. Here's the basic approach - you were on the right track with reshape2. You can simply melt() your data into long format so you are joining on a single column instead of the either/or bit you had going on before.
library(reshape2)
#melt into long format
merge_data_m <- melt(merge_data, measure.vars = c("year1", "year2"))
#merge together, specifying the joining columns
merge(merge_data_m, merge_data_2, by.x = c("ID", "value"), by.y = c("ID", "year"))
#-----
ID value position school variable amount
1 1 1999 yes a year2 100
2 1 1999 yes a year2 200
3 2 2000 no b year1 500
4 2 2000 no b year1 300
5 2 2000 no b year1 400
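The aggregation step the question asks for (summing amount per ID after the match) can be sketched in base R. This is a hedged sketch under stated assumptions: it uses a three-row excerpt of the question's data, stacks year1 and year2 by hand instead of calling melt(), and assumes year1 and year2 always differ so no row is counted twice.

```r
# Excerpt of the question's data (first three IDs only)
merge_data <- data.frame(
  ID = c(1, 2, 3),
  position = c("yes", "no", "yes"),
  school = c("a", "b", "a"),
  year1 = c(2000, 2000, 2000),
  stringsAsFactors = FALSE
)
merge_data$year2 <- merge_data$year1 - 1

merge_data_2 <- data.frame(
  year = c(1999, 1999, 2000, 2000, 2000, 2001),
  amount = c(100, 200, 300, 400, 500, 600),
  ID = c(1, 1, 2, 2, 2, 3)
)

# Long format: one (ID, year) row for year1 and one for year2
long <- rbind(
  data.frame(ID = merge_data$ID, year = merge_data$year1),
  data.frame(ID = merge_data$ID, year = merge_data$year2)
)

# Keep only rows of merge_data_2 whose ID and year both match
matched <- merge(long, merge_data_2, by = c("ID", "year"))

# Sum the matched amounts per ID, then attach the totals to the original rows
totals <- aggregate(amount ~ ID, data = matched, FUN = sum)
result <- merge(merge_data, totals, by = "ID")
print(result)
```

IDs with no matching year (ID 3 here) drop out of the final inner merge; pass all.x = TRUE to the last merge() to keep them with an NA amount.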
