Joining / merging two data frames by symmetric differences in rows and columns


I would like to join / merge two data frames, but ignoring similarities in rows and columns in the resulting data frame. Consider the following example:
df1 <- data.frame(
id = c("a","b","c"),
a = runif(3,1,9),
b = runif(3,1,9)
)
df2 <- data.frame(
df1[1:2,],
c = runif(2,1,9)
)
This results in two data frames that have exactly four cells in common (not counting id), so df1[1:2,2:3] == df2[1:2,2:3]. However, they differ in that df1 has an additional row and df2 has an additional column:
> print(df1)
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469
> print(df2)
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
I want a new data frame to consist of the symmetric differences between these two, so no duplicates in rows or columns. The closest result I have achieved is by using dplyr::full_join(df1, df2, by = "id"), but this results in duplicated columns.
The result should look like this:
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
3 c 5.608775 4.219469 NA
What's the best way of achieving this dynamically? Thanks

With data.table, we can join on 'id' and assign 'c' from the second dataset by reference, creating the 'c' column in the first. Rows of df1 without a match in df2 are filled with NA by default:
library(data.table)
setDT(df1)[df2, c := c, on = .(id)]
df1
# id a b c
#1: a 4.601639 1.065642 7.476494
#2: b 6.065758 6.234421 8.929932
#3: c 4.000351 7.365717 NA
NOTE: The values differ from those in the question because no seed was set.
In base R, an option would be match:
df1$c <- df2$c[match(df1$id, df2$id)]
Regarding the OP's use of full_join (left_join would be fine based on the example), the trick is to drop from the second dataset every column that already exists in the first, except the key:
library(dplyr)
nm1 <- c("id", setdiff(names(df2), names(df1)))
left_join(df1, select(df2, all_of(nm1)), by = 'id')
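The same idea (keep the key plus only the columns df1 lacks, then left-join) can also be sketched with base R's merge. This is a minimal illustration, not part of the original answer; the set.seed call and the names new_cols and res are assumptions added for reproducibility and readability.

```r
set.seed(42)  # assumption: seed added so the example is reproducible
df1 <- data.frame(id = c("a", "b", "c"),
                  a = runif(3, 1, 9),
                  b = runif(3, 1, 9))
df2 <- data.frame(df1[1:2, ], c = runif(2, 1, 9))

# Columns that exist only in df2 (here just "c")
new_cols <- setdiff(names(df2), names(df1))

# Left join on id, bringing over only the new columns;
# ids missing from df2 get NA in those columns
res <- merge(df1, df2[c("id", new_cols)], by = "id", all.x = TRUE)
res
```

Note that merge sorts the result by the key, which happens to match the original row order here.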

Another approach, if one of the data frames has all the columns you want (df2 here):
library(dplyr)
bind_rows(df2, anti_join(df1, df2))
#Joining, by = c("id", "a", "b")
# id a b c
#1 a 1.912298 5.792475 6.899253
#2 b 2.537666 1.495075 1.186120
#3 c 5.947766 6.594028 NA

In this particular case, this would be sufficient:
library(sqldf)
sqldf("select * from df1 left natural join df2")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr:
library(dplyr)
left_join(df1, df2)
but in general you might need the following. Note that this is fully general: neither the code above nor the code below names any rows or columns, and the code below is symmetric in df1 and df2, so it does not rely on knowing the structure of either.
sqldf("select * from df1 left natural join df2
union
select * from df2 left natural join df1")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr. This gives a warning but still works. You can avoid the warning if id is character rather than factor, or if you convert it to character first.
library(dplyr)
rbind(left_join(df1, df2), left_join(df2, df1)) %>% distinct
Note
Because the question did not use set.seed, the code to generate the input is not reproducible, but we can copy the particular df1 and df2 so that we have the same data as in the question.
Lines1 <- "
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469"
df1 <- read.table(text = Lines1)
Lines2 <- "
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280"
df2 <- read.table(text = Lines2)

Related

r - Efficient conditional join on multiple columns

I have two tables that I would like to join using multiple columns, which is perfectly feasible using the dplyr join functions. The complication is that the join should succeed if at least one of the column comparisons matches. To demonstrate my case, here is a reproducible example:
df1 <- data.frame(
A1 = c(1,2,3,4),
B1 = c(4,5,6,7),
C1 = c("a", "b", "c", "d")
)
df2 <- data.frame(
A2 = c(8,"",3,4),
B2 = c(9,5,"",7),
C2 = c("aa", "bb", "cc", "dd")
)
I would like to join df1 and df2 on columns A or B, meaning to keep all rows where at least df1$A1 == df2$A2 or df1$B1 == df2$B2 (please note my real dataset has 6 columns that I would like to use for the joining). The end result for the simplified example should be:
data.frame(
A1 = c(2,3,4),
A2 = c("",3,4),
B1 = c(5,6,7),
B2 = c(5,"", 7),
C1 = c("b", "c", "d"),
C2 = c("bb", "cc", "dd")
)
Many thanks in advance for any recommendations on how this can be done efficiently. If a fast solution is not possible, a slow one is acceptable as well.

Not quite sure how to do this using dplyr, but sqldf could help you out:
library(sqldf)
sqldf("SELECT *
FROM df1
JOIN df2
ON df1.A1 = df2.A2
OR df1.B1 = df2.B2")
You can add additional OR conditions after this for more columns.
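For readers without sqldf, the OR join can be approximated in base R by building a cross join and keeping the pairs where any key matches. This is only a sketch: the names key1, key2, cross, hits and res are illustrative, values are compared as character to sidestep the numeric/character mismatch in the example data, and the cross join needs memory proportional to nrow(df1) * nrow(df2), so it suits small inputs only.

```r
df1 <- data.frame(A1 = c(1, 2, 3, 4),
                  B1 = c(4, 5, 6, 7),
                  C1 = c("a", "b", "c", "d"))
df2 <- data.frame(A2 = c(8, "", 3, 4),
                  B2 = c(9, 5, "", 7),
                  C2 = c("aa", "bb", "cc", "dd"))

# Key columns to compare pairwise; extend both vectors for more columns
key1 <- c("A1", "B1")
key2 <- c("A2", "B2")

# Cartesian product: every row of df1 paired with every row of df2
cross <- merge(df1, df2, by = NULL)

# One logical vector per key pair, compared as character
hits <- Map(function(k1, k2) as.character(cross[[k1]]) == as.character(cross[[k2]]),
            key1, key2)

# Keep the pairs where at least one key matches
res <- cross[Reduce(`|`, hits), ]
res <- res[order(res$A1), ]
res
```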

A simple way, assuming the rows of df1 and df2 are already aligned by position (as in the example), can be:
library(dplyr)
df1 <- df1 %>%
mutate(A1 = as.character(A1), B1 = as.character(B1))
df1 %>%
bind_cols(df2) %>%
filter(A1 == A2 | B1 == B2) %>%
relocate(sort(names(.)))
#>   A1 A2 B1 B2 C1 C2
#> 1  2     5  5  b bb
#> 2  3  3  6     c cc
#> 3  4  4  7  7  d dd

It seems like this isn't possible with a single call to a dplyr join function.
If you would like to use a dplyr join, here is a hacky workaround I created using a purrr map function to do a separate inner join for each of the conditions in the conditional join. Then bind them together and remove duplicate rows. It can be generalized to more columns by appending to the key1 and key2 vectors.
Note: first we need to modify the example data so that the columns to be joined have the same type. dplyr throws an error if you try to join incompatible column types, in this case numeric (double) and character.
library(dplyr)
library(purrr)
df1 <- df1 %>%
mutate(A1 = as.character(A1), B1 = as.character(B1))
key1 <- c('A1', 'B1')
key2 <- c('A2', 'B2')
map2_dfr(key1, key2, ~ inner_join(df1, df2, by = setNames(.y, .x), keep = TRUE)) %>%
distinct()
Result:
  A1 B1 C1 A2 B2 C2
1  3  6  c  3    cc
2  4  7  d  4  7 dd
3  2  5  b     5 bb

Comparing two columns of two data frames and finding mismatches

I have two different dataframes as below
df1 <- data.frame(state=letters[1:3],district=letters[4:6])
state district
1 a d
2 b e
3 c f
and df2
df2 <- data.frame(state=letters[1:3], district= c("e","d","f"))
state district
1 a e
2 b d
3 c f
I want to check whether the districts of df1 exist in df2. If not, select the state and district.
And if a district in df1 does exist in df2, does it belong to the same state as in df1 or not?
Suppose district "d" belongs to state "a" in df1 but to state "b" in df2, which is wrong.
What I am trying is:
library(dplyr)
'%noin%' <- Negate('%in%')
#creating unique id for df1
df1$uuid <- tolower(paste0(df1$state,"_",df1$district))
#creating unique id for df2
df2$uuid <- tolower(paste0(df2$state,"_",df2$district))
df_result <- df1 %>% filter(df1$uuid %noin% df2$uuid) %>%
select(state,district)
state district
1 a d
2 b e
How can I select the state in df2 that these districts belong to?
My expected output looks like this:
expected_output <- data.frame(state=c("a","b"), district=c("d","e"),state_in_df_2=c("b","a"))
state district state_in_df_2
1 a d b
2 b e a
Thank you in advance

Using an anti_join and a left_join you could do:
library(dplyr)
df1 <- data.frame(state=letters[1:3],district=letters[4:6])
df2 <- data.frame(state=letters[1:3], district= c("e","d","f"))
df1 %>%
anti_join(df2, by = c("state", "district")) %>%
left_join(df2, by = c("district"), suffix = c("", "_in_df2"))
#> state district state_in_df2
#> 1 a d b
#> 2 b e a

Not sure if this will generalise to your case, but you can try the following (filter is from dplyr):
filter(merge(df1, df2, by = 'district'), state.x != state.y)
# district state.x state.y
#1 d a b
#2 e b a
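A base R variant using match is sketched below. It rests on an assumption not stated in the question, namely that each district appears at most once in df2 (match returns the first hit only). Unlike a plain merge on district, it also flags districts of df1 that are missing from df2 altogether, with NA in the new column. The names idx, state_in_df2, mismatch and res are illustrative.

```r
df1 <- data.frame(state = letters[1:3], district = letters[4:6])
df2 <- data.frame(state = letters[1:3], district = c("e", "d", "f"))

# Position of each df1 district in df2 (NA when the district is absent)
idx <- match(df1$district, df2$district)
state_in_df2 <- as.character(df2$state)[idx]

# A row is a mismatch if the district is missing from df2
# or is listed there under a different state
mismatch <- is.na(idx) | as.character(df1$state) != state_in_df2

res <- data.frame(df1[mismatch, ], state_in_df_2 = state_in_df2[mismatch])
res
```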

Subsetting data, if the column entry contains letters

I have data as follows:
DT <- as.data.frame(c("1","2", "3", "A", "B"))
names(DT)[1] <- "charnum"
What I want is quite simple, but I could not find an example on it on stackoverflow.
I want to split the dataset into two. DT1 with all the rows for which DT$charnum has numbers and DT2 with all the rows for which DT$charnum has letters. I tried something like:
DT1 <- DT[is.numeric(as.numeric(DT$charnum)),]
But that gives:
[1] 1 2 3 A B
Levels: 1 2 3 A B
Desired result:
> DT1
charnum
1 1
2 2
3 3
> DT2
charnum
1 A
2 B

You can use a regular expression to tell the two kinds of values apart and then split the data frame accordingly.
result <- split(DT, grepl('^\\d+$', DT$charnum))
# result[["TRUE"]] holds the numeric rows, result[["FALSE"]] the letters
DT1 <- type.convert(result[["TRUE"]], as.is = TRUE)
DT1
# charnum
#1 1
#2 2
#3 3
DT2 <- type.convert(result[["FALSE"]], as.is = TRUE)
DT2
# charnum
#4 A
#5 B

Using tidyverse
library(dplyr)
library(purrr)
library(stringr)
DT %>%
group_split(grp = str_detect(charnum, "\\d+"), .keep = FALSE) %>%
map(type.convert, as.is = TRUE)
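The attempt in the question fails because is.numeric(as.numeric(x)) returns a single TRUE for the whole vector, so every row is selected. A minimal base R sketch that tests each element instead is shown below; the name is_num is illustrative, and the pattern assumes "numbers" means non-negative integers written with digits only.

```r
DT <- data.frame(charnum = c("1", "2", "3", "A", "B"))

# TRUE for entries made up solely of digits, one value per row
is_num <- grepl("^[0-9]+$", DT$charnum)

# drop = FALSE keeps the result a data frame even with a single column
DT1 <- DT[is_num, , drop = FALSE]   # rows holding numbers
DT2 <- DT[!is_num, , drop = FALSE]  # rows holding letters
rownames(DT2) <- NULL               # renumber the rows from 1
```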

selecting values of one dataframe based on partial string in another dataframe

I have two dataframes (DF1 and DF2)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
DF1
parties
A, B
C
A
C, D
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF2
party party.number
A 1
B 2
C 3
D 4
E 5
F 6
G 7
H 8
I 9
J 10
The desired result should be an additional column in DF1 which contains the party numbers taken from DF2 for each row in DF1.
Desired result (based on DF1):
parties party.numbers
A, B 1, 2
C 3
A 1
C, D 3, 4
I strongly suspect that the answer involves something like str_match(DF1$parties, DF2$party.number) or a similar regular expression, but I can't figure out how to put two (or more) party numbers into the same row (DF1$party.numbers).

One option is gsubfn: match each upper-case letter and replace it by looking it up in a key/value list built from DF2
library(gsubfn)
DF1$party.numbers <- gsubfn("[A-Z]", setNames(as.list(DF2$party.number),
DF2$party), as.character(DF1$parties))
DF1
# parties party.numbers
#1 A, B 1, 2
#2 C 3
#3 A 1
#4 C, D 3, 4

An alternative solution using tidyverse. You can reshape DF1 to have one string per row, then join DF2 and then reshape back to your initial form:
library(tidyverse)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF1 %>%
group_by(id = row_number()) %>%
separate_rows(parties) %>%
left_join(DF2, by=c("parties"="party")) %>%
summarise(parties = paste(parties, collapse = ", "),
party.numbers = paste(party.number, collapse = ", ")) %>%
select(-id)
# # A tibble: 4 x 2
# parties party.numbers
# <chr> <chr>
# 1 A, B 1, 2
# 2 C 3
# 3 A 1
# 4 C, D 3, 4
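The split, look up, paste back idea can also be sketched in base R with strsplit and match. This is an illustration rather than any answer's original code; party_list is an illustrative name, and the separator is assumed to be a comma followed by optional whitespace.

```r
DF1 <- data.frame(parties = c("A, B", "C", "A", "C, D"))
DF2 <- data.frame(party = LETTERS[1:10], party.number = 1:10)

# One character vector of party names per row of DF1
party_list <- strsplit(as.character(DF1$parties), ",\\s*")

# Look up each party's number in DF2 and paste the numbers back together
DF1$party.numbers <- vapply(
  party_list,
  function(p) paste(DF2$party.number[match(p, DF2$party)], collapse = ", "),
  character(1)
)
DF1
```

A party that does not occur in DF2 would show up as "NA" in the pasted string.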

Double merge two data frames in r

I have two dataframes
df1 = data.frame(Sites=c("A","B","C"),total=c(12,6,35))
df2 = data.frame(Site.1=c("A","A","B"),Site.2=c("B","C","C"), Score=c(60,70,80))
I need to merge them to produce the dataframe
df3=data.frame(Site.1=c("A","A","B"),Site.2=c("B","C","C"),
Score=c(60,70,80),Site.1.total=c(12,12,6),Site.2.total=c(6,35,35))
Any advice on the simplest way to do such a double merge? Thanks

Simply merge twice (in the output, total.x is the Site.2 total and total.y is the Site.1 total):
x <- merge(df2, df1, all.x=TRUE, by.x="Site.2", by.y="Sites", sort=FALSE)
merge(x, df1, all.x=TRUE, by.x="Site.1", by.y="Sites", sort=FALSE)
Site.1 Site.2 Score total.x total.y
1 A B 60 6 12
2 A C 70 35 12
3 B C 80 35 6
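If the exact column names and order from the question matter, two lookups with match give df3 directly, without renaming afterwards. This is a minimal base R sketch and assumes every site in df2 occurs exactly once in df1.

```r
df1 <- data.frame(Sites = c("A", "B", "C"), total = c(12, 6, 35))
df2 <- data.frame(Site.1 = c("A", "A", "B"),
                  Site.2 = c("B", "C", "C"),
                  Score = c(60, 70, 80))

df3 <- df2
# Look up the total for each site column by its position in df1$Sites
df3$Site.1.total <- df1$total[match(df3$Site.1, df1$Sites)]
df3$Site.2.total <- df1$total[match(df3$Site.2, df1$Sites)]
df3
```

Because no merge is involved, the row order of df2 is preserved.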

Here are a couple of sqldf solutions.
First let's rename the columns containing a dot in their names to remove the dot, since dot is an SQL operator. (Had we not wished to do that, we could have referred to those columns in the SQL statement as Site_1 and Site_2, and it would have understood that we were referring to Site.1 and Site.2.)
library(sqldf)
df1 = data.frame(Sites = c("A","B","C"), total = c(12,6,35))
df2 = data.frame(Site1 = c("A","A","B"), Site2 = c("B","C","C"),
Score = c(60,70,80))
Now that we have our inputs, let's try a couple of approaches with sqldf:
sqldf with three sql statements
temp1 <- sqldf("SELECT * FROM df1 as a, df2 as b WHERE a.Sites = b.Site1 ")
temp2 <- sqldf("SELECT * FROM df1 as a, df2 as b WHERE a.Sites = b.Site2 ")
sqldf("SELECT
Site1,
b.Site2,
a.Score,
a.Total as Site1Total,
b.Total as Site2Total
FROM temp1 as a, temp2 as b
USING (Site1)
GROUP BY a.Total, b.Total")
sqldf reduced to a triple join
We can further reduce the above to a triple join which perhaps clarifies the essence of the computation. That is, the three SQL statements above can be reduced to this single statement:
> sqldf("SELECT Site1, Site2, Score, a1.total AS total1, a2.total AS total2
+ FROM df1 AS a1, df1 a2, df2 AS b
+ WHERE a1.Sites = Site1 AND a2.Sites = Site2")
Site1 Site2 Score total1 total2
1 A B 60 12 6
2 A C 70 12 35
3 B C 80 6 35
