dplyr inner_join with NAs on character columns


I have two identical data frames:
a <- c(1,2,3)
b <- c(3,2,1)
c <- c('a','b',NA)
df1 <- data.frame(a=a, b=b, c=c, stringsAsFactors=FALSE)
df2 <- data.frame(a=a, b=b, c=c, stringsAsFactors=FALSE)
I would like to use dplyr::inner_join to
"return all rows from x where there are matching values in y, and all columns from x and y" (dplyr documentation),
which should be every row since the data frames are identical, but it doesn't seem to work with an NA in column c (type chr). Is it standard behaviour not to join on NAs?
For example
library(dplyr)
> inner_join(df1, df2)
Joining by: c("a", "b", "c")
a b c
1 1 3 a
2 2 2 b
This doesn't join on the NA. However, I would like it to return the same result as merge:
> merge(df1, df2)
a b c
1 1 3 a
2 2 2 b
3 3 1 <NA>
Have I misunderstood how inner_join works in this instance, or is it behaving as documented?
Further Detail
inner_join matches NA on a numeric column
a <- c(1,2,3)
b <- c(3,2,NA)
c <- c('a','b','c')
df1 <- data.frame(a=a, b=b, c=c, stringsAsFactors=FALSE)
df2 <- data.frame(a=a, b=b, c=c, stringsAsFactors=FALSE)
> inner_join(df1, df2)
Joining by: c("a", "b", "c")
a b c
1 1 3 a
2 2 2 b
3 3 NA c
Edit
As @thelatemail points out, inner_join also behaves like merge when the NA is in a factor column
df1 <- data.frame(a=a, b=b, c=c, stringsAsFactors=TRUE)
df2 <- data.frame(a=a, b=b, c=c, stringsAsFactors=TRUE)
inner_join(df1, df2)
Joining by: c("a", "b", "c")
a b c
1 1 3 a
2 2 2 b
3 3 1 <NA>
Edit 2
Thanks to @shadow for pointing out that this is a known issue.

This issue was occurring in version 0.4.1. This is now fixed in version 0.4.2:
sessionInfo()
...
other attached packages:
[1] dplyr_0.4.2
...
> inner_join(df1, df2)
Joining by: c("a", "b", "c")
a b c
1 1 3 a
2 2 2 b
3 3 1 <NA>
Check with merge:
> merge(df1, df2)
a b c
1 1 3 a
2 2 2 b
3 3 1 <NA>
> all.equal(inner_join(df1, df2), merge(df1, df2))
Joining by: c("a", "b", "c")
[1] TRUE
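
For readers on a recent dplyr release, here is a minimal sketch of how NA matching can be controlled explicitly. It assumes a dplyr version whose join functions accept the na_matches argument, which is not mentioned in the original post; the data frames df_x and df_y and their columns are illustrative names only.

```r
library(dplyr)

# Illustrative data: one character key column containing an NA
df_x <- data.frame(key = c("a", "b", NA), x_val = 1:3,
                   stringsAsFactors = FALSE)
df_y <- data.frame(key = c("a", "b", NA), y_val = 4:6,
                   stringsAsFactors = FALSE)

# na_matches = "na": an NA key in df_x matches an NA key in df_y, as merge() does
matched_na <- inner_join(df_x, df_y, by = "key", na_matches = "na")

# na_matches = "never": NA keys are never treated as equal
matched_never <- inner_join(df_x, df_y, by = "key", na_matches = "never")

nrow(matched_na)     # 3
nrow(matched_never)  # 2
```

Spelling out na_matches makes the intended treatment of missing keys visible in the code instead of relying on the installed version's default.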

Related

r - append one table to another if they share the same value in a column

df1 <- data.frame(a = c(1:5), b = c(6:10), c=c("df1","df1","df1","df1","df1"))
df2 <- data.frame(a = c(1,3,5,7,9), b = c(16:20), c=c("df2","df2","df2","df2","df2"), d= LETTERS[1:5], e= LETTERS[6:10])
I would like to create a new table that does following:
stack one table on top of the other only if the value in column a matches (i.e. 1,3,5 only)
show only columns a, b, and c (ignore columns d and e)
in total there should be 6 rows and 3 columns, with rows 1-3 from df1 (a=1,3,5), and rows 4-6 from df2 (a=1,3,5)

base R
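common <- intersect(df1$a, df2$a)
# keep the rows of each frame whose `a` occurs in both frames, then stack them;
# make.row.names = FALSE numbers the stacked rows 1..6
rbind(
  subset(df1, a %in% common, select = a:c),
  subset(df2, a %in% common, select = a:c),
  make.row.names = FALSE
)
#   a  b   c
# 1 1  6 df1
# 2 3  8 df1
# 3 5 10 df1
# 4 1 16 df2
# 5 3 17 df2
# 6 5 18 df2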
dplyr
library(dplyr)
bind_rows(
semi_join(df1, df2, by = "a"),
semi_join(df2, df1, by = "a")
) %>%
select(a, b, c)
# a b c
# 1 1 6 df1
# 2 3 8 df1
# 3 5 10 df1
# 4 1 16 df2
# 5 3 17 df2
# 6 5 18 df2

Use semi_join() from the dplyr package.
df1 <- data.frame(a = c(1:5), b = c(6:10), c=c("df1","df1","df1","df1","df1"))
df2 <- data.frame(a = c(1,3,5,7,9), b = c(16:20), c=c("df2","df2","df2","df2","df2"), d= LETTERS[1:5], e= LETTERS[6:10])
library(dplyr)
new_df <- rbind(semi_join(df1,df2,by="a")[,c(1:3)],semi_join(df2,df1,by="a")[,c(1:3)])
new_df
Semi-join: returns all rows from df1 where there are matching
values in df2, keeping just the columns from df1. It's a filtering join.
Output:
> new_df
a b c
1 1 6 df1
2 3 8 df1
3 5 10 df1
4 1 16 df2
5 3 17 df2
6 5 18 df2
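
To make the "filtering join" point concrete, here is a small sketch contrasting semi_join with inner_join. The data frames left_df and right_df are made-up examples, not the ones from the question.

```r
library(dplyr)

# Illustrative data (hypothetical names)
left_df  <- data.frame(a = c(1, 2, 3, 4, 5), b = 6:10)
right_df <- data.frame(a = c(1, 3, 5, 7, 9), d = LETTERS[1:5])

# Filtering join: keeps matching rows of left_df and only left_df's columns
filtered <- semi_join(left_df, right_df, by = "a")

# Mutating join: keeps matching rows and also adds right_df's columns
joined <- inner_join(left_df, right_df, by = "a")

names(filtered)  # "a" "b"
names(joined)    # "a" "b" "d"
```

Because semi_join never brings in columns from the second table, no column subsetting is needed when the two tables already share the columns you want to keep.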

Here is a base R option using merge and split.default:
df3 <- merge(df1, df2, by = 'a')
result <- subset(do.call(cbind.data.frame,
sapply(split.default(df3, sub('\\..*', '', names(df3))),
unlist, use.names = FALSE)), select = a:c)
result
# a b c
#1 1 6 df1
#2 3 8 df1
#3 5 10 df1
#4 1 16 df2
#5 3 17 df2
#6 5 18 df2

R - subset rows by rows in another data frame

Let's say I have a data frame df containing only factors/categorical variables. I have another data frame conditions where each row contains a different combination of the different factor levels of some subset of variables in df (made using expand.grid and levels etc.). I'm trying to figure out a way of subsetting df based on each row of conditions. So for example, if the column names of conditions are c("A", "B", "C") and the first row is c('a1', 'b1', 'c1'), then I want df[df$A == 'a1' & df$B == 'b1' & df$C == 'c1',], and so on.

I'd think this is a great time to use merge (or dplyr::*_join or ...):
df1 <- expand.grid(A = letters[1:4], B = LETTERS[1:4], stringsAsFactors = FALSE)
df1$rn <- seq_len(nrow(df1))
# 'df2' contains the conditions we want to filter (retain)
df2 <- data.frame(
a1 = c('a', 'a', 'c'),
b1 = c('B', 'C', 'C'),
stringsAsFactors = FALSE
)
df1
# A B rn
# 1 a A 1
# 2 b A 2
# 3 c A 3
# 4 d A 4
# 5 a B 5
# 6 b B 6
# 7 c B 7
# 8 d B 8
# 9 a C 9
# 10 b C 10
# 11 c C 11
# 12 d C 12
# 13 a D 13
# 14 b D 14
# 15 c D 15
# 16 d D 16
df2
# a1 b1
# 1 a B
# 2 a C
# 3 c C
Using df2 to define which combinations we need to keep,
merge(df1, df2, by.x=c('A','B'), by.y=c('a1','b1'))
# A B rn
# 1 a B 5
# 2 a C 9
# 3 c C 11
# or
dplyr::inner_join(df1, df2, by=c(A='a1', B='b1'))
(I defined df2 with different column names just to show how it works, but in reality since its purpose is "solely" to be declarative on which combinations to filter, it would make sense to me to have the same column names, in which case the by= argument just gets simpler.)
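
As a sketch of the simpler case mentioned above, suppose the conditions table uses the same column names as the data. The names grid and keep are hypothetical; merge() then joins on the shared columns without any by.x/by.y arguments.

```r
# `grid` is the data to filter; `keep` lists the combinations to retain
grid <- expand.grid(A = letters[1:4], B = LETTERS[1:4], stringsAsFactors = FALSE)
grid$rn <- seq_len(nrow(grid))

keep <- data.frame(
  A = c("a", "a", "c"),
  B = c("B", "C", "C"),
  stringsAsFactors = FALSE
)

# With shared column names, merge() joins on the common columns (A and B) by default
kept <- merge(grid, keep)
sort(kept$rn)  # 5 9 11
```

The same idea applies to dplyr::inner_join(grid, keep, by = c("A", "B")).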

One option is to create the condition with Reduce
df[Reduce(`&`, Map(`==`, df[c("A", "B", "C")], conditions[1, c("A", "B", "C")])),]
Or another option is rowSums
df[rowSums(df[c("A", "B", "C")] ==
conditions[1, c("A", "B", "C")][col(df[c("A", "B", "C")])]) == 3,]
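
The Reduce/Map condition handles one row of conditions at a time. A minimal sketch of applying it to every row follows; the example data is invented, and character columns are assumed so that `==` does not depend on factor level sets.

```r
# Illustrative data (hypothetical values)
df <- data.frame(
  A = c("a1", "a2", "a1", "a1"),
  B = c("b1", "b1", "b2", "b1"),
  C = c("c1", "c1", "c1", "c1"),
  stringsAsFactors = FALSE
)
conditions <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"),
                          stringsAsFactors = FALSE)
cols <- names(conditions)

# One subset of df per row of `conditions`
subsets <- lapply(seq_len(nrow(conditions)), function(i) {
  # compare each column with the i-th condition value, then AND the results
  keep <- Reduce(`&`, Map(`==`, df[cols], conditions[i, cols]))
  df[keep, , drop = FALSE]
})

sapply(subsets, nrow)  # 2 1 1 0
```

The result is a list with one data frame per condition row, in the same order as the rows of conditions.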

Creating new variable in dataframe based on matching values from other dataframe

I have two dataframes, df1 and df2, of which two columns have partly matching values, however in completely different order; also, the values are unique in df2 but may be repeated in df1.
What I'd like to do is transfer into df1, not the matching values, but values associated with them in another variable in df2; for one value in df1, "G", I do not want the associated value to be transferred but rather just NA.
Consider df1 and df2:
df1 <- data.frame(
x = c("A", NA, "L", "G", "C", "F", NA, "J", "G", "K")
)
df2 <- data.frame(
a = LETTERS[1:10],
b = 1:10 # these are the values to be transferred into df1$z
)
df1$z <- ifelse(df1$x=="G", NA, ifelse(df1$x %in% df2$a, df2$b[df2$a %in% df1$x], NA))
The values to be transferred from df2 into df1 are in df2$b. I've tried the above ifelse() string but the resulting values in df1$z are only partly correct. Where's the mistake?

I think this does what you want:
df1$z <- df2$b[match(df1$x, df2$a)]
df1$z[df1$x == 'G'] <- NA
Output:
> df1
      x  z
1     A  1
2  <NA> NA
3     L NA
4     G NA
5     C  3
6     F  6
7  <NA> NA
8     J 10
9     G NA
10    K NA
Hope this helps!
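
To see why the ifelse() attempt in the question is only partly correct while match() works, here is a short sketch using plain vectors that stand in for the columns (lookup_keys for df2$a, lookup_vals for df2$b, x for df1$x; these names are illustrative).

```r
lookup_keys <- LETTERS[1:10]
lookup_vals <- 1:10
x <- c("A", NA, "L", "G", "C", "F", NA, "J", "G", "K")

# %in% only marks which lookup keys occur somewhere in x, so the selected
# values come back in lookup order and are shorter than x
in_lookup_order <- lookup_vals[lookup_keys %in% x]
in_lookup_order  # 1 3 6 7 10

# match() gives, for each element of x, its position in lookup_keys (NA if absent),
# so the looked-up values line up with x
z <- lookup_vals[match(x, lookup_keys)]
z[x %in% "G"] <- NA  # x %in% "G" is FALSE, not NA, where x is missing
z  # 1 NA NA NA 3 6 NA 10 NA NA
```

Inside ifelse(), the five-element vector is recycled against the ten rows of df1, which is why only some of the values end up in the right place.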

dplyr::left_join(df1,df2,by=c("x"="a")) %>% mutate(b = ifelse(x=="G",NA,b))
# x b
# 1 A 1
# 2 <NA> NA
# 3 L NA
# 4 G NA
# 5 C 3
# 6 F 6
# 7 <NA> NA
# 8 J 10
# 9 G NA
# 10 K NA

Matching columns with other columns in data frames and adding certain columns of matching values

I have searched for a solution but cannot find one. I have found similar threads, but they don't give me what I want. I know there should be an easy way to do this without writing a loop. Here it goes:
I have two data frames, df1 and df2:
df1 <- data.frame(ID = c("a", "b", "c", "d", "e", "f"), y = 1:6 )
df2 <- data.frame(x = c("a", "c", "g", "f"), f=c("M","T","T","M"), obj=c("F70", "F60", "F71", "F82"))
df2$f <- as.factor(df2$f)
Now I want to match the "ID" column of df1 against the "x" column of df2, and add the corresponding values from df2 to df1 as new columns. The final output of df1 should look like this:
ID y obj f1 f2
a 1 F70 M NA
b 2 NA NA NA
c 3 F60 NA T
d 4 NA NA NA
e 5 NA NA NA
f 6 F82 M NA

We can do this with tidyverse by joining the two datasets and then spreading the 'f' column
library(tidyverse)
left_join(df1, df2, by = c(ID = "x")) %>%
group_by(f) %>%
spread(f, f) %>%
select(-6) %>%
rename(f1 = M, f2 = T)
# A tibble: 6 × 5
# ID y obj f1 f2
#* <chr> <int> <fctr> <fctr> <fctr>
#1 a 1 F70 M NA
#2 b 2 NA NA NA
#3 c 3 F60 NA T
#4 d 4 NA NA NA
#5 e 5 NA NA NA
#6 f 6 F82 M NA
Or a similar approach with data.table
library(data.table)
dcast(setDT(df2)[df1, on = .(x = ID)], x+obj + y ~ f, value.var = 'f')[, -6, with = FALSE]

Here is a base R process.
# combine the data.frames
dfNew <- merge(df1, df2, by.x="ID", by.y="x", all.x=TRUE)
# add f1 and f2 variables
dfNew[c("f1", "f2")] <- lapply(c("M", "T"),
function(i) factor(ifelse(as.character(dfNew$f) == i, i, NA)))
# remove original factor variable
dfNew <- dfNew[-3]
ID y obj f1 f2
1 a 1 F70 M <NA>
2 b 2 <NA> <NA> <NA>
3 c 3 F60 <NA> T
4 d 4 <NA> <NA> <NA>
5 e 5 <NA> <NA> <NA>
6 f 6 F82 M <NA>

how to subset in r for this particular condition?

df1 and df2 have columns a,b. I want to subset data from df1 such that each entry in df1$a along with df1$b is in df2$a along with df2$b.
df1
a b c
1 m df1
2 f df1
3 f df1
4 m df1
5 f df1
6 m df1
df2
a b c
1 m df2
3 f df2
4 f df2
5 m df2
6 f df2
7 m df2
desired output
df
a b c
1 m df1
3 f df1
I am using:
df <- subset(df1,(df1$a%in%df2$a & df1$b%in%df2$b))
but this is giving results similar to
df <-subset(df1,df1$a%in%df2$a)

You can use package dplyr:
library(dplyr)
intersect(df1,df2)
# a b
#1 1 m
#2 3 f
Edit for the new data.frames with c column:
you can use function semi_join (also from dplyr):
semi_join(df1,df2,by=c("a","b"))
# a b c
#1 1 m df1
#2 3 f df1
Other option, in base R:
you can paste your a and b variables to subset your data.frame:
df1[paste(df1$a,df1$b) %in% paste(df2$a,df2$b), ]
# a b
#1 1 m
#3 3 f
and with the new data.frames:
# a b c
# 1 1 m df1
# 3 3 f df1

Or you could do
Res <- rbind(df1, df2)
Res[duplicated(Res), ]
# a b
# 7 1 m
# 8 3 f
Edit1: Per the edit, here's a similar data.table solution
library(data.table)
Res <- rbind(df1, df2)
setDT(Res)[duplicated(Res, by = c("a", "b"), fromLast = TRUE)]
# a b c
# 1: 1 m df1
# 2: 3 f df1
Edit2: I see that @CathG opened a join battlefront, so here's how we do it with data.table
setkey(setDT(df1), a, b) ; setkey(setDT(df2), a, b)
df1[df2, nomatch = 0]
# a b c i.c
# 1: 1 m df1 df2
# 2: 3 f df1 df2
