Finding out what values didn't merge in R

If I had two simple dataframes:
a <- 1:10
b <- c("a","b","c","d","e","f","g","h","i","j")
df1 <-data.frame(a,b)
c <- 1:7
d <- c("k","l","m","n","o","p","q")
df2 <-data.frame(c,d)
... and I wanted to merge them by "a" and "c" for df1 and df2 respectively using:
df3= merge(df1, df2, by.x = "a", by.y = "c")
How would I go about producing a dataframe of rows in df1 which didn't merge? For example:
a b
8 8 h
9 9 i
10 10 j
Any help would be gratefully received.
EDIT
Using the suggestion in the comment, I can do:
check = setdiff(df1$a, df2$c)
This is great, as I get 8:10, which is correct, but I also need the other column of df1 listed. Can this be done with setdiff too?

Look up the all argument.
df3= merge(df1, df2, by.x = "a", by.y = "c", all.x = TRUE)
returns the following. Rows of df1 without a match have NA in d, so you can filter on d to get the entries you're looking for.
a b d
1 1 a k
2 2 b l
3 3 c m
4 4 d n
5 5 e o
6 6 f p
7 7 g q
8 8 h <NA>
9 9 i <NA>
10 10 j <NA>
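The filtering step can be sketched as follows. The name unmatched is illustrative, and the sketch assumes that d has no genuine NA values in df2 (otherwise those rows would be picked up as well).

```r
# Rebuild the example data from the question
df1 <- data.frame(a = 1:10, b = letters[1:10])
df2 <- data.frame(c = 1:7, d = c("k", "l", "m", "n", "o", "p", "q"))

# Left join: keep every row of df1, with NA in d where df2 had no match
df3 <- merge(df1, df2, by.x = "a", by.y = "c", all.x = TRUE)

# Rows whose d is NA are the rows of df1 that did not merge
unmatched <- df3[is.na(df3$d), c("a", "b")]
unmatched
```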

Per comments:
check = setdiff(df1$a, df2$c)
alldiff <- df1[df1$a %in% check,]
(note that the subset is taken on the key values in df1$a, not on row positions such as 1:dim(df1)[1]; the two only coincide here because a happens to be 1:10)
With credit to Codoremifa for the second line.
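The same result can probably be reached in one step without setdiff, by negating %in% on the key column. A minimal sketch (not_merged is an illustrative name):

```r
df1 <- data.frame(a = 1:10, b = letters[1:10])
df2 <- data.frame(c = 1:7, d = c("k", "l", "m", "n", "o", "p", "q"))

# Keep the rows of df1 whose key does not occur among df2's keys
not_merged <- df1[!(df1$a %in% df2$c), ]
not_merged
```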

Related

How to compare two variable and different length data frames to add values from one data frame to the other, repeating values where necessary

I apologize as I'm not sure how to word this title exactly.
I have two data frames. df1 is a series of paths with columns "source" and "destination". df2 stores values associated with the destinations. Below is some sample data:
df1
row  source  destination
1    A       B
2    C       B
3    H       F
4    G       B
df2
row  destination  n
1    B            26
2    F            44
3    L            12
I would like to compare the two data frames and add the n column to df1 so that df1 has the correct n value for each destination. df1 should look like:
row  source  destination  n
1    A       B            26
2    C       B            26
3    H       F            44
4    G       B            26
The data that I'm actually working with is much larger, and the number of rows differs every time I run the program. The furthest I've gotten is using which to get the right values, but each value only once:
df2[ which(df2$destination %in% df1$destination), ]$n
[1] 26 44
What I need instead is the vector (26, 26, 44, 26), so that I can save it to df1$n.
We can use a merge or left_join. With data.table, an update join adds the column by reference:
library(data.table)
setDT(df1)[df2, n := i.n, on = .(destination)]
A base R option using match
transform(
  df1,
  n = df2$n[match(destination, df2$destination)]
)
which gives
row source destination n
1 1 A B 26
2 2 C B 26
3 3 H F 44
4 4 G B 26
Data
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"), destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))
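The attempt with which and %in% subsets df2, so each matching row of df2 comes back once, whereas match returns one position per row of df1. The merge route mentioned above can be sketched in base R as follows; the name merged is illustrative, and the reordering step is only needed because merge sorts its result by the key.

```r
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"),
                  destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))

# Left join on destination; drop df2's own row column first so it does not clash
merged <- merge(df1, df2[, c("destination", "n")], by = "destination", all.x = TRUE)

# merge() sorts by the key, so restore df1's original order and column layout
merged <- merged[order(merged$row), c("row", "source", "destination", "n")]
merged
```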

Merge on x1, or if no match x2, or if no match x3

I'm trying to merge 2 datasets on a key, but if there is no match then I want to try another key, and so on.
df1 <- data.frame(a=c(5,1,7,3),
                  b=c("T","T","T","F"),
                  c=c("F","T","F","F"))
df2 <- data.frame(x1=c(4,5,3,9),
                  x2=c(7,8,1,2),
                  x3=c("g","w","t","o"))
df1
a b c
1 5 T F
2 1 T T
3 7 T F
4 3 F F
df2
x1 x2 x3 ..
1 4 7 g ..
2 5 8 w ..
3 3 1 t ..
4 9 2 o ..
The desired output is something like
a b c x3 ..
1 5 T F w ..
2 1 T T t ..
3 7 T F g ..
4 3 F F t ..
I tried something along the lines of
dfm <- merge(df1,df2, by.x = "a", by.y = "x1", all.x = TRUE)
dfm <- merge(dfm,df2, by.x = "a", by.y = "x2", all.x = TRUE)
but that isn't quite right.
This really isn't a standard sort of merge. You can make it more standard by reshaping df2 so you have just one field to merge on
df2long <- rbind(
  data.frame(a = df2$x1, df2[,-(1:2), drop=FALSE]),
  data.frame(a = df2$x2, df2[,-(1:2), drop=FALSE])
)
dfm <- merge(df1, df2long, by = "a", all.x = TRUE)
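Run end to end on the question's sample data, the reshaping approach looks like the sketch below. Two caveats apply: merge returns the rows sorted by a, and a value of a that occurred in both x1 and x2 would match twice and yield two rows, so this version does not enforce a precedence between the keys.

```r
df1 <- data.frame(a = c(5, 1, 7, 3),
                  b = c("T", "T", "T", "F"),
                  c = c("F", "T", "F", "F"))
df2 <- data.frame(x1 = c(4, 5, 3, 9),
                  x2 = c(7, 8, 1, 2),
                  x3 = c("g", "w", "t", "o"))

# Stack the two key columns into a single key column a
df2long <- rbind(
  data.frame(a = df2$x1, df2[, -(1:2), drop = FALSE]),
  data.frame(a = df2$x2, df2[, -(1:2), drop = FALSE])
)

dfm <- merge(df1, df2long, by = "a", all.x = TRUE)
dfm  # rows come back sorted by a: 1, 3, 5, 7
```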
You could do something like this:
matches <- lapply(df2[, c("x1", "x2")], function(x) match(df1$a, x))
# finding matches in df2$x1 and df2$x2
# notice that the code below should work with any number of columns to be matched:
# you just need to add the names here eg. df2[, paste0("x", 1:100)]
matches
$x1
[1] 2 NA NA 3
$x2
[1] NA 3 1 NA
combo <- Reduce(function(a,b) "[<-"(a, is.na(a), b[is.na(a)]), matches)
# combining the matches on "first come first served" basis
combo
[1] 2 3 1 3
cbind(df1, df2[combo,])
a b c x1 x2 x3
2 5 T F 5 8 w
3 1 T T 3 1 t
1 7 T F 4 7 g
3.1 3 F F 3 1 t
If I understand correctly, the OP has requested to try a match of a with x1 first, then - if failed - to try to match a with x2. So any match of a with x1 should take precedence over a match of a with x2.
Unfortunately, the sample data set provided by the OP does not include a use case to prove this. Therefore, I have modified the sample dataset accordingly (see Data section).
The approach suggested here is to reshape df2 from wide to long format (as in MrFlick's answer) but to use a data.table join with parameter mult = "first".
The columns of df2 to be considered as key columns and the precedence can be controlled by the measure.vars parameter to melt(). After reshaping, melt() arranges the rows in the column order given in measure.vars:
library(data.table)
# define cols of df2 to use as key, in order of precedence
key_cols <- c("x1", "x2")
# reshape df2 from wide to long format
long <- melt(setDT(df2), measure.vars = key_cols, value.name = "a")
# join long with df1, pick first matches
result <- long[setDT(df1), on = "a", mult = "first"]
# clean up
setcolorder(result, names(df1))
result[, variable := NULL]
result
a b c x3
1: 5 T F w
2: 1 T T t
3: 7 T F g
4: 3 F F t
5: 0 F F <NA>
Please, note that the original row order of df1 has been preserved.
Also, note that the code works for an arbitrary number of key columns. The precedence of key columns can be easily changed. E.g., if the order is reversed, i.e., key_cols <- c("x2", "x1") matches of a with x2 will be picked first.
Data
Enhanced sample datasets:
df1 has an additional row with no match in df2.
df1 <- data.frame(a=c(5,1,7,3,0),
b=c("T","T","T","F","F"),
c=c("F","T","F","F","F"))
df1
a b c
1: 5 T F
2: 1 T T
3: 7 T F
4: 3 F F
5: 0 F F
df2 has an additional row to prove that a match in x1 takes precedence over a match in x2. The value 5 appears twice: In row 2 of column x1 and in row 5 of column x2.
df2 <- data.frame(x1=c(4,5,3,9,6),
x2=c(7,8,1,2,5),
x3=c("g","w","t","o","n"))
df2
x1 x2 x3
1: 4 7 g
2: 5 8 w
3: 3 1 t
4: 9 2 o
5: 6 5 n
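For comparison, the same precedence rule can be sketched in base R without reshaping. The helper first_match is hypothetical (it is not part of any package): it walks through the key columns in the given order and only fills positions that are still unmatched.

```r
df1 <- data.frame(a = c(5, 1, 7, 3, 0),
                  b = c("T", "T", "T", "F", "F"),
                  c = c("F", "T", "F", "F", "F"))
df2 <- data.frame(x1 = c(4, 5, 3, 9, 6),
                  x2 = c(7, 8, 1, 2, 5),
                  x3 = c("g", "w", "t", "o", "n"))

# Hypothetical helper: for each value of `keys`, return the row of `lookup`
# matched by the first column in `key_cols` that contains it (NA if none does)
first_match <- function(keys, lookup, key_cols) {
  idx <- rep(NA_integer_, length(keys))
  for (col in key_cols) {
    todo <- is.na(idx)                              # rows still unmatched
    idx[todo] <- match(keys[todo], lookup[[col]])   # try the next key column
  }
  idx
}

idx <- first_match(df1$a, df2, c("x1", "x2"))
result <- cbind(df1, x3 = df2$x3[idx])
result

# Reversing the precedence makes the match of 5 in x2 (row 5) win instead
idx_rev <- first_match(df1$a, df2, c("x2", "x1"))
```

Because the lookup is positional, the row order of df1 is kept and unmatched rows simply get NA.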
I'm not sure I understood your question, but rather than merging repeatedly I'd compare the keys of each potential merge: if the number of shared values is greater than 0, then you have a match. If you want to take the first column with a match, you can try this:
library(tidyr)
library(purrr)
(df1 <- data.frame(a=c(5,1,7,3),
b=c("T","T","T","F"),
c=c("F","T","F","F")) )
(df2 <- data.frame(x1=c(4,5,3,9),
x2=c(7,8,1,2),
x3=c("g","w","t","o")) )
FirstColMatch<-1:ncol(df2) %>%
map(~intersect(df1$a, df2[[.x]])) %>%
map(length) %>%
detect_index(function(x)x>0)
NewDF<-merge(df1,df2,by.x="a", by.y =names(df2)[FirstColMatch])
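One caveat worth illustrating: this approach picks a single key column for the whole merge rather than falling back row by row. A minimal base R sketch, using Position as a stand-in for the purrr pipeline (first_col and new_df are illustrative names):

```r
df1 <- data.frame(a = c(5, 1, 7, 3),
                  b = c("T", "T", "T", "F"),
                  c = c("F", "T", "F", "F"))
df2 <- data.frame(x1 = c(4, 5, 3, 9),
                  x2 = c(7, 8, 1, 2),
                  x3 = c("g", "w", "t", "o"))

# Index of the first column of df2 that shares at least one value with df1$a
first_col <- Position(function(col) any(df1$a %in% col), df2)
names(df2)[first_col]

# Merging on that single column keeps only the rows matched through it
new_df <- merge(df1, df2, by.x = "a", by.y = names(df2)[first_col])
new_df
```

With the sample data x1 is chosen, so only the rows with a equal to 3 and 5 survive the merge; the rows that would have matched through x2 are dropped.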

Sum of elements based on unique names across a list(unknown length) of data frames

I am trying to get the sum of elements based on unique names across a list containing unknown number of dataframes.
## Test Data
Name1 <- c("A","B","C","D")
Name2 <- c("A","D")
Name3 <- c("B","C","F")
Values1 <- c(1,2,3,4)
Values2 <- c(5,7)
Values3 <- c(6,8,9)
DF1 <- data.frame(Name1,Values1,stringsAsFactors = FALSE)
DF2 <- data.frame(Name2,Values2,stringsAsFactors = FALSE)
DF3 <- data.frame(Name3,Values3,stringsAsFactors = FALSE)
DFList <- list(DF1,DF2,DF3)
My desired output is:
A B C D F
6 8 11 11 9
I am not sure whether using a loop is efficient, since there can be any number of dataframes in the list and the number of unique rows in a dataframe can range anywhere between 100,000 and 1 million.
Solution using data.table::rbindlist:
data.table::rbindlist(DFList)[, sum(Values1), Name1]
Name1 V1
1: A 6
2: B 8
3: C 11
4: D 11
5: F 9
rbindlist binds the columns by position here, ignoring their differing names, and the result takes the column names of the first data frame, so you can then sum(Values1) by Name1.
sapply(split(unlist(lapply(DFList, "[[", 2)), unlist(lapply(DFList, "[[", 1))), sum)
# A B C D F
# 6 8 11 11 9
OR
aggregate(formula = Value~Name,
          data = do.call(rbind, lapply(DFList, function(x) setNames(x, c("Name", "Value")))),
          FUN = sum)
# Name Value
#1 A 6
#2 B 8
#3 C 11
#4 D 11
#5 F 9
Similar to the answer of @d.b.
lst <- unlist(lapply(DFList, function(DF) setNames(DF[[2]], DF[[1]])))
tapply(lst, names(lst), sum)
#A B C D F
#6 8 11 11 9
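Another base R variant on the same idea is rowsum, which sums values within groups. This is only a sketch under the question's assumption that the first column of every data frame holds the names and the second the values; nms, vals and totals are illustrative names.

```r
DF1 <- data.frame(Name1 = c("A", "B", "C", "D"), Values1 = c(1, 2, 3, 4),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(Name2 = c("A", "D"), Values2 = c(5, 7),
                  stringsAsFactors = FALSE)
DF3 <- data.frame(Name3 = c("B", "C", "F"), Values3 = c(6, 8, 9),
                  stringsAsFactors = FALSE)
DFList <- list(DF1, DF2, DF3)

# Pull the name and value columns out by position, since the names differ
nms  <- unlist(lapply(DFList, `[[`, 1))
vals <- unlist(lapply(DFList, `[[`, 2))

# rowsum() adds up vals within each group and returns a one-column matrix;
# taking column 1 leaves a vector named by group
totals <- rowsum(vals, nms)[, 1]
totals
```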

Merging data frames with a non-unique column

I would like to create a new data frame that borrows an ID variable from another data frame. The data frame I would like to merge has repeated observations in the ID column which is causing me some problems.
DF1<-data.frame(ID1=rep(c("A","B", "C", "D", "E") , 2), X1=rnorm(10))
DF2<-data.frame(ID1=c("A", "B", "C", "D", "E"), ID2=c("V","W","X","Y" ,"Z"), X2=rnorm(5), X3=rnorm(5))
I would like to append DF2$ID2 onto DF1 by the ID1 column. My goal is something that looks like this (I do not want DF2$X2 and DF2$X3 in the 'Goal' data frame):
Goal<-data.frame(ID2=DF2$ID2, DF1)
I have tried merge, but it complains because DF1$ID1 is not unique. I know R can gobble this up in 1 line of code, but I can't seem to make the functions I know work. Any help would be greatly appreciated!
There should be no problem with a simple merge. Using your sample data
merge(DF1, DF2[,c("ID1","ID2")], by="ID1")
produces
ID1 X1 ID2
1 A 0.03594331 V
2 A 0.42814900 V
3 B -2.17161263 W
4 B -0.33403550 W
5 C 0.95407844 X
6 C -0.23186723 X
7 D 0.46395514 Y
8 D -1.49919961 Y
9 E -0.20342430 Z
10 E -0.49847569 Z
You could also use left_join from dplyr, which keeps the row order of DF1:
library(dplyr)
left_join(DF1, DF2[,c("ID1", "ID2")], by = "ID1")
# ID1 X1 ID2
#1 A -1.20927237 V
#2 B -0.03003128 W
#3 C -0.75799708 X
#4 D 0.53946986 Y
#5 E -0.52009921 Z
#6 A 1.15822659 V
#7 B -0.91976194 W
#8 C 0.74620142 X
#9 D -2.46452560 Y
#10 E 0.80015219 Z
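If no join is wanted at all, a lookup with match gives the layout of the 'Goal' data frame directly and keeps DF1's row order. A minimal sketch; the seed is arbitrary and only there to make the random columns reproducible.

```r
set.seed(1)  # arbitrary seed, not part of the original question
DF1 <- data.frame(ID1 = rep(c("A", "B", "C", "D", "E"), 2), X1 = rnorm(10))
DF2 <- data.frame(ID1 = c("A", "B", "C", "D", "E"),
                  ID2 = c("V", "W", "X", "Y", "Z"),
                  X2 = rnorm(5), X3 = rnorm(5))

# For each row of DF1, find the position of its ID1 in DF2 and take ID2 from there
Goal <- data.frame(ID2 = DF2$ID2[match(DF1$ID1, DF2$ID1)], DF1)
Goal
```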

Changing the values of a column for the values from another column

I have two datasets that look like this:
What I want is to change the values from the second column in the first dataset to the values from the second column from the second dataset. All the names in the first dataset are in the second one, and obviously my dataset is much bigger than that.
I was trying to use R to do that, but I am very new to it. I was looking at the intersect command, but I am not sure it is going to work. I haven't included any code because I'm really lost here.
I also need the order of the first column (the names) in the first dataset to stay the same, but with the new values from the second column of the second dataset.
Agree with @agstudy, a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name=c("ab23242", "ab35366", "ab47490", "ab59614"),
X=c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name=c("ab35366", "ab47490", "ab59614", "ab23242" ),
X=c(12345, 23456, 34567, 456789))
df.merge <- merge(df1, df2, by="name", all.x=TRUE)
df.merge <- df.merge[, -2] # drop the old values (X.x), keep the new ones (X.y)
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
By default merge sorts the result on the by column, so the order of the first frame is only kept if it was already sorted that way. To keep the order strictly, add a column holding it, df1$order <- 1:nrow(df1), before merging, and sort on that column afterwards.
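A sketch of the order-column idea follows. The data are illustrative: the names in df1 are deliberately shuffled so that the effect of merge sorting by the key becomes visible.

```r
# Illustrative data: df1's names are deliberately not in alphabetical order
df1 <- data.frame(name = c("ab59614", "ab23242", "ab47490", "ab35366"),
                  X = c(114278.333, 72722, 99999, 88283))
df2 <- data.frame(name = c("ab35366", "ab47490", "ab59614", "ab23242"),
                  X = c(12345, 23456, 34567, 456789))

df1$order <- seq_len(nrow(df1))  # remember the original position

# Bring in the new X from df2; the old X of df1 is left out of the merge
df.merge <- merge(df1[, c("name", "order")], df2, by = "name", all.x = TRUE)

# Restore df1's order and drop the helper column
df.merge <- df.merge[order(df.merge$order), c("name", "X")]
df.merge
```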
df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ df1$name1 %in% df2$name2 , "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again.
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
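The NA values seem to come from the right-hand side: match(df1$name1, df2$name2) returns positions in df2 (here 6 to 10), and using those to index the five-row df1 runs past its end. The first test only worked because the positions happened to be 1 to 5. A minimal sketch of a version that indexes each data frame with its own positions (idx and ok are illustrative names):

```r
df2 <- data.frame(name2 = letters[1:10], valuecol2 = 10:1)
df1 <- data.frame(name1 = letters[6:10], valuecol1 = seq(2, 10, by = 2))

idx <- match(df1$name1, df2$name2)  # position in df2 of each df1 name (NA if absent)
ok  <- !is.na(idx)                  # df1 rows that actually have a partner in df2

# idx is aligned with the rows of df1, so the right-hand side is taken from df1
# in its own order rather than being re-indexed by idx
df2$valuecol2[idx[ok]] <- df1$valuecol1[ok]
df2
```

The ok filter also covers the case where some names in df1 are missing from df2, which would otherwise put NA into the index.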
How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value=rnorm(1000), key="id")
dt.2 <- data.table(id = 2*(500:1), value=as.numeric(1:500), key="id")
# objective is to replace value in dt.1 with value from dt.2 where ids match.
# data table joins - very efficient
# the join has 3 columns: id, value (from dt.2) and i.value (the original dt.1$value)
dt.1 <- dt.2[dt.1, nomatch=NA]
# where dt.2 had no matching id, fall back to the original value
dt.1[is.na(value), value := i.value]
dt.1[, i.value := NULL] # get rid of extra column
NB: This sorts dt.1 by id, which should be OK since it's sorted that way already.
Also: In future, please include data that can be imported into R. Images are not useful!

Resources