merging on multiple columns R - r

I'm surprised if this isn't a duplicate, but I couldn't find the answer anywhere else.
I have two data frames, data1 and data2, that differ in one column, but the rest of the columns are the same. I would like to merge them on a unique identifying column, id. However, in the event an ID from data2 does not have a match in data1, I want the entry in data2 to be appended at the bottom, similar to plyr::rbind.fill() rather than renaming all the corresponding columns in data2 as column1.x and column1.y. I realize this isn't the clearest explanation, maybe I shouldn't be working on a Saturday. Here is code to create the two dataframes, and the desired output:
spp1 <- c('A','B','C')
spp2 <- c('B','C','D')
trait.1 <- rep(1.1,length(spp1))
trait.2 <- rep(2.0,length(spp2))
id_1 <- c(1,2,3)
id_2 <- c(2,9,7)
data1 <- data.frame(spp1,trait.1,id_1)
data2 <- data.frame(spp2,trait.2,id_2)
colnames(data1) <- c('spp','trait.1','id')
colnames(data2) <- c('spp','trait.2','id')
Desired output:
spp trait.1 trait.2 id
1 A 1.1 NA 1
2 B 1.1 2 2
3 C 1.1 NA 3
4 C NA 2 9
5 D NA 2 7

Try this:
library(dplyr)
full_join(data1, data2, by = c("id", "spp"))
Output:
spp trait.1 id trait.2
1 A 1.1 1 NA
2 B 1.1 2 2
3 C 1.1 3 NA
4 C NA 9 2
5 D NA 7 2
Alternatively, also merge would work:
merge(data1, data2, by = c("id", "spp"), all = TRUE)

Related

Conditional merging based on full join

I would like to conditionally merge two datasets such that the values in dataframe2 replace the values in dataframe1, unless dataframe2 contains missing values. This should be performed in the case of a full join such that rows from both dataframe are preserved.
This question is inspired from Conditional merge/replacement in R (which seems to work only for inner join).
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:5,x2=c("zz","qq", NA, "qy"),stringsAsFactors=FALSE)
I would like the following result:
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
5 5 qy
I tried the following code though it returns NA for the 4th column but I would like the original value to be preserved since in this case df2 contains missing value for 4.
df3 <- anti_join(df1, df2, by = "x1")
rbind(df3, df2)
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 <NA>
5 5 qy
It can be done with dplyr.
library(dplyr)
full_join(df1,df2,by = c("x1" = "x1")) %>%
transmute(x1 = x1,x2 = coalesce(x2.y,x2.x))
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
5 5 qy

Combine/match/merge vectors by row names

I have a vector of variable names and several matrices with single rows.
I want to create a new matrix. The new matrix is created by match/merge the row names of the matrices with single rows.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several matrices with single rows
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_1) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but just to exactly obtain the desired output I want:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5

merging more than 2 data frames whilst assigning an identifier factor in R

Take this very simple RWE, I want to know what package can be used to automatically assign a factor (preferable the data frame name) when we merge two or more data.frames
I have manually defined the factor in the example below and shown the desired output. But i want to automate it as I have over 100 tables to merge. Note that the headers within each df are constant, only the name itself changes
A <- 1:5
B <- 5:1
df1 <- data.frame(A,B)
A <- 2:6
B <- 6:2
df2 <- data.frame(A,B)
df1$ID <- rep("df1", 5)
df2$ID <- rep("df2", 5)
big_df <- rbind(df1,df2)
Assuming that your data.frame names follow a certain pattern like beginning with "df" followed by numbers and they are not inside a list but simply in your global environment, you can use the following:
library(data.table)
bigdf <- rbindlist(Filter(is.data.frame, mget(ls(pattern = "^df\\d+"))), id = "ID")
Without data.table, you could do it as follows:
lst <- Filter(is.data.frame, mget(ls(pattern = "^df\\d+")))
bigdf <- do.call(rbind, Map(function(df, id) transform(df, ID=id), lst, names(lst)))
Consider the following:
library(dplyr)
cof_df <- bind_rows(df1, df2, .id="ID")
cof_df
ID A B
1 1 1 5
2 1 2 4
3 1 3 3
4 1 4 2
5 1 5 1
6 2 2 6
7 2 3 5
8 2 4 4
9 2 5 3
10 2 6 2
And then:
cof_df$ID <- factor(cof_df$ID,
levels = c(1,2),
labels = paste0("df", unique(cof_df$ID)))
does the recoding.
A similar result can be obtained by naming the arguments in bind_rows, as in
cof_df <- bind_rows(df1=df1, df2=df2, .id="ID")
Another solution will be to use merge:
merged <- merge(df1, df2, all=TRUE, sort =FALSE)
> merged
A B ID
1 1 5 df1
2 2 4 df1
3 3 3 df1
4 4 2 df1
5 5 1 df1
6 2 6 df2
7 3 5 df2
8 4 4 df2
9 5 3 df2
10 6 2 df2

Match a Variable to Other Dataset Based using Multiple Overlapping Variables

I have two data sets with some overlapping variables. One dataset is basically a subset of the other but needs an additional variable added based on some of the overlapping variables. For example
varA <- c(rep(c("a","b"), each=5))
blah <- c(11:20)
varB <- c(1:10)
speed <- rnorm(10)
dataset1 <- data.frame(varA,blah,varB,speed)
varA.2 <- c("a","a","b","b")
varB.2 <- c(2,10,11,7)
speed.2 <- rep(NA, 4)
dataset2 <- data.frame(varA.2, varB.2, speed.2)
dataset2
I would like the "speed.2" variable to contain the speed values for the lines where varA and varB are matching between the two sets.
I've tried something with "merge" but am having issues.
Thank you!
May be:
colnames(dataset2) <- gsub("\\..*","", colnames(dataset2))
library(dplyr)
left_join(dataset2[,-3],dataset1[,-2])
# Joining by: c("varA", "varB")
# varA varB speed
#1 a 2 -1.3243815
#2 a 10 NA
#3 b 11 NA
#4 b 7 -0.6026936
Or without changing the column names.
merge(dataset1[,-2],dataset2[,-3], by.x=c("varA","varB"), by.y=c("varA.2", "varB.2"), all.y=TRUE)
# varA varB speed
# 1 a 2 -0.6797753
# 2 a 10 NA
# 3 b 7 -2.1838454
# 4 b 11 NA
Values in speed differ as the example was without using set.seed()
You can use 'match' function for "where varA and varB are matching"
dataset2$speed.2 = dataset1[match(paste(dataset2$varA.2,dataset2$varB.2),
paste(dataset1$varA, dataset1$varB)),]$speed
dataset2
varA.2 varB.2 speed.2
1 a 2 0.3917783
2 a 10 NA
3 b 11 NA
4 b 7 1.3265439
>

Changing the values of a column for the values from another column

I have two datasets that look like this:
What I want is to change the values from the second column in the first dataset to the values from the second column from the second dataset. All the names in the first dataset are in the second one, and obviously my dataset is much bigger than that.
I was trying to use R to do that but I am very new at it. I was looking at the intersect command but I am not sure if it's going to work. I don't put any codes because I'm real lost here.
I also need that the order of the first columns (which are names) in the first dataset stays the same, but with the new values from the second column of the second dataset.
Agree with #agstudy, a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name=c("ab23242", "ab35366", "ab47490", "ab59614"),
X=c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name=c("ab35366", "ab47490", "ab59614", "ab23242" ),
X=c(12345, 23456, 34567, 456789))
df.merge <- merge(df1, df2, by="name", all.x=T)
df.merge <- df.merge[, -2]
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
I think merge will keep order of first frame but you can also keep the order strictly by simply adding a column with order df1$order <- 1:nrow(df1) and later on sorting based on that column.
df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ df1$name1 %in% df2$name2 , "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again.
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value=rnorm(1000), key="id")
dt.2 <- data.table(id = 2*(500:1), value=as.numeric(1:500), key="id")
# objective is to replace value in df.1 with value from df.2 where id's match.
# data table joins - very efficient
# dt.1 now has 3 columns: id, value, and value.1 from dt.2$value
dt.1 <-dt.2[dt.1,nomatch=NA]
dt.1[is.na(value),]$value=dt.1[is.na(value),]$value.1
dt.1$value.1=NULL # get rid of extra column
NB: This sorts dt.1 by id which should be OK since it's sorted that way already.
Also: In future, please include data that can be imported into R. Images are not useful!

Resources