Left_join under condition - r

I have 2 dataframes and I need to merge them based on a condition:
```
# Dataframe 1
plant1 <- c("FF", "DO")
loc1 <- c("MM", "KB")
df1 <- data.frame(plant1, loc1)
df1
plant1 loc1
1 FF MM
2 DO KB
# Dataframe 2
plant2 <- c("FF", "DO","DO")
loc2 <- c("MM", "KB","KB")
name <- c("name_1", "name_2","name_3")
frequency <- c(1, 2, 2)
df2 <- data.frame(plant2, loc2, name, frequency)
df2
plant2 loc2 name frequency
1 FF MM name_1 1
2 DO KB name_2 2
3 DO KB name_3 2
```
I need to bring the value of name from df2 into df1 ONLY for those cases WHERE frequency == 1;
for the rest of the cases I need to set a specific text.
This is the result I need to get:
plant3 loc3 name3
1 FF MM name_1
2 DO KB multiple
I am starting with the simplest code, where I need to add that condition:
df1 %>% left_join(df2, by=c("plant1" = "plant2", "loc1" = "loc2" ))
Of course I can do it in a "dirty" way with a simple left_join, then replacing values in the name column where frequency != 1 and calling unique().
Is there a more elegant way?
I was checking this discussion on the topic, but could not apply it to my case:
https://community.rstudio.com/t/how-can-i-join-two-tables-with-an-or-statement-in-r-using-dplyrs-join-functions/37633

Here is a data.table possibility:
library(data.table)
# Make them data.tables
setDT(df1);setDT(df2)
# Set key for join
setkey(df1, plant1, loc1)
setkey(df2, plant2, loc2)
# Join
df2[df1, .(name3 = if (.N > 1) "multiple" else x.name), by = .EACHI][]
# plant2 loc2 name3
# 1: DO KB multiple
# 2: FF MM name_1
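For comparison, here is a minimal dplyr sketch of the same idea: collapse df2 to one row per key first, then do an ordinary left_join. The names lookup and result are illustrative, and the rule "every frequency in the group equals 1" is an assumption about how the condition from the question should be read.

```r
library(dplyr)

df1 <- data.frame(plant1 = c("FF", "DO"), loc1 = c("MM", "KB"))
df2 <- data.frame(
  plant2 = c("FF", "DO", "DO"),
  loc2 = c("MM", "KB", "KB"),
  name = c("name_1", "name_2", "name_3"),
  frequency = c(1, 2, 2)
)

# Collapse df2 to one row per (plant2, loc2):
# keep the name when frequency is 1, otherwise use a fixed label
lookup <- df2 %>%
  group_by(plant2, loc2) %>%
  summarise(
    name3 = if (all(frequency == 1)) first(name) else "multiple",
    .groups = "drop"
  )

# The join itself is now a plain one-to-one left join
result <- left_join(df1, lookup, by = c("plant1" = "plant2", "loc1" = "loc2"))
result
```

Because the lookup has at most one row per key, no unique() step is needed after the join.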

Related

Using a list of dataframes as lookups for new columns in other dataframe

I have a list of some 500.000 trees in different sites, and am trying to find out which sites are more endangered by specific pests (n ~500). Each pest has a different host range. I have these host ranges in dataframes in a list.
I am trying to use these dfs in the list as lookup tables, and will calculate fraction of suitable trees.
Example code:
#pests with their host ranges
pest1 <- as.data.frame(c("Abies", "Quercus"))
pest2 <- as.data.frame(c("Abies"))
pest3 <- as.data.frame (c("Abies", "Picea"))
pestlist <- as.list(c(pest1, pest2, pest3))
#changing this to any other kind of list would be fine too
df1 <- NULL#this will be the dataframe that would get new columns
df1$genus <- c("Abies", "Picea", "Abies", "Quercus", "Abies")
df1$site <- c("A" , "A" , "B" , "B", "B")
df1 <- as.data.frame(df1)
I tried following code, but it seems to go wrong because I don't know how to loop through lists:
library(tidyverse)
df2 <- map2(pestlist, names(df1), mutate(ifelse(df1$genus %in% pestlist , 1,0)))
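For clarity, I want to go from df1 as defined above to the same data frame with one extra 0/1 column per pest, indicating whether the genus is in that pest's host range.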
Thanks for your time!
I would recommend using a named list of vectors instead of a list of data.frames. Then we can use map_dfc() inside mutate():
# use vectors instead of data.frames
pest1 <- c("Abies", "Quercus")
pest2 <- c("Abies")
pest3 <- c("Abies", "Picea")
df1 <- NULL#this will be the dataframe that would get new columns
df1$genus <- c("Abies", "Picea", "Abies", "Quercus", "Abies")
df1$site <- c("A" , "A" , "B" , "B", "B")
df1 <- as.data.frame(df1)
library(tidyverse)
# create named list of vectors
pestlist <- tibble::lst(pest1, pest2, pest3)
# use map_dfc
df1 %>%
mutate(map_dfc(pestlist, ~ as.integer(genus %in% .x)))
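The question also mentions calculating the fraction of suitable trees per site. A rough sketch of that follow-up step, assuming the named list of host-range vectors from above; flags, df_flags and fractions are illustrative names, not part of the original answer.

```r
library(dplyr)

pestlist <- list(
  pest1 = c("Abies", "Quercus"),
  pest2 = "Abies",
  pest3 = c("Abies", "Picea")
)
df1 <- data.frame(
  genus = c("Abies", "Picea", "Abies", "Quercus", "Abies"),
  site = c("A", "A", "B", "B", "B")
)

# One 0/1 column per pest: is the tree's genus in that pest's host range?
flags <- sapply(pestlist, function(hosts) as.integer(df1$genus %in% hosts))
df_flags <- cbind(df1, flags)

# Fraction of suitable trees per site and pest (mean of a 0/1 column)
fractions <- df_flags %>%
  group_by(site) %>%
  summarise(across(starts_with("pest"), mean))
fractions
```

The mean of a 0/1 indicator is the share of rows flagged 1, so no separate counting step is required.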
If you fix up your pestlist like this:
pestlist =do.call(
rbind,lapply(seq_along(pestlist),\(i) data.frame(pest=i,genus=pestlist[[i]]))
)
then you can join df1 and a version of pestlist that has been pivoted to wide format
inner_join(
df1, pivot_wider(
mutate(pestlist,v=1),
names_from=pest,names_prefix = "pest",values_from = v,values_fill = 0
)
)
Output:
genus site pest1 pest2 pest3
1 Abies A 1 1 1
2 Picea A 0 0 1
3 Abies B 1 1 1
4 Quercus B 1 0 0
5 Abies B 1 1 1
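A data.table alternative (pest1, pest2 and pest3 here are the original one-column data frames from the question, not the vectors used in the first answer):
library(data.table)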
setDT(df1, key = 'genus')
pestdt = rbindlist(list(pest1, pest2, pest3), use.names = FALSE, idcol = TRUE)
setnames(pestdt, c('id', 'genus'))
pestdt = dcast(pestdt, genus ~ paste0('pest', id), value.var = 'genus', fun.aggregate = \(x) as.integer(nzchar(x)), fill = 0)
cols = paste0("pest", 1:3)
df1[, (cols) := pestdt[.SD, mget(cols)]]
#
# genus site pest1 pest2 pest3
# <char> <char> <int> <int> <int>
# 1: Abies A 1 1 1
# 2: Abies B 1 1 1
# 3: Abies B 1 1 1
# 4: Picea A 0 0 1
# 5: Quercus B 1 0 0

I want to create another column in the first dataframe based on values in the second dataframe on or prior to the date of the first dataframe for each group

I have two dataframes, one with unique ids and the other with repeated ids. I want to create another column in the first dataframe that is 1 if the second dataframe has a value of 1 on or prior to the first dataframe's date; if the id is missing from the second dataframe, I want to assign NA. What would be the most efficient way to do it in R?
# first dataframe with each unique id
set.seed(123)
df <- data.frame(id = c(1:7),
date = seq(as.Date("2020/12/26"),
as.Date("2021/1/1"), by = "day"))
# second dataframe with repeated id
df1 <- data.frame(id = rep(1:5, each = 5),
date = sample(seq(as.Date('2020/12/20'),
as.Date('2021/1/15'), by="day"), 25),
assign = sample(c(0,1), replace=TRUE, size=25))
library(dplyr) # arrange() comes from dplyr
df1 <- arrange(df1, id, date)
# the output that I want
df$response <- c(1,0,0,1,0,NA,NA)
Maybe we can use a rolling join:
library(data.table)
df2 <- setDT(df1)[as.logical(assign)]
setDT(df)[df2, response := assign, on = .(id, date), roll = -Inf]
df[is.na(response) & id %in% df2$id, response := 0]
Output:
df
# id date response
#1: 1 2020-12-26 1
#2: 2 2020-12-27 0
#3: 3 2020-12-28 0
#4: 4 2020-12-29 1
#5: 5 2020-12-30 0
#6: 6 2020-12-31 NA
#7: 7 2021-01-01 NA
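To make the rule easier to check by hand, here is a small dplyr sketch on deterministic toy data. The data frames events and response and their values are made up for illustration. It flags an id with 1 if any row with assign == 1 falls on or before that id's reference date, 0 if the id has rows but none qualify, and NA if the id has no rows at all.

```r
library(dplyr)

# Toy data (hypothetical values, chosen so the result can be verified by eye)
df <- data.frame(
  id = 1:3,
  date = as.Date(c("2021-01-05", "2021-01-05", "2021-01-05"))
)
events <- data.frame(
  id = c(1L, 1L, 2L, 2L),
  date = as.Date(c("2021-01-01", "2021-01-10", "2021-01-03", "2021-01-08")),
  assign = c(1, 0, 0, 1)
)

# For each id present in events: was there an assign == 1 on or before the reference date?
response <- events %>%
  inner_join(df, by = "id", suffix = c("_event", "_ref")) %>%
  group_by(id) %>%
  summarise(response = as.numeric(any(assign == 1 & date_event <= date_ref)))

# ids missing from events get NA through the left join
result <- left_join(df, response, by = "id")
result
```

This is less efficient than a rolling join on large data, since it materialises every event row before filtering, but the logic is explicit.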

Joining / merging two data frames by symmetric differences in rows and columns

I would like to join / merge two data frames, but ignoring similarities in rows and columns in the resulting data frame. Consider the following example:
df1 <- data.frame(
id = c("a","b","c"),
a = runif(3,1,9),
b = runif(3,1,9)
)
df2 <- data.frame(
df1[1:2,],
c = runif(2,1,9)
)
This results in two data frames that have exactly four cells in common (not counting id), so df1[1:2,2:3] == df2[1:2,2:3]. However, they differ in that df1 has an additional row and df2 has an additional column:
> print(df1)
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469
> print(df2)
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
I want a new data frame to consist of the symmetric differences between these two, so no duplicates in rows or columns. The closest result I have achieved is by using dplyr::full_join(df1, df2, by = "id"), but this results in duplicated columns.
The result should look like this:
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
3 c 5.608775 4.219469 NA
What's the best way of achieving this dynamically? Thanks
With data.table we can join on 'id' and assign 'c' from the second dataset to create the 'c' column in the first dataset. By default, the non-matching rows are assigned NA.
library(data.table)
setDT(df1)[df2, c := c, on = .(id)]
df1
# id a b c
#1: a 4.601639 1.065642 7.476494
#2: b 6.065758 6.234421 8.929932
#3: c 4.000351 7.365717 NA
NOTE: The values are different because no seed was set.
In base R, an option would be match
df1$c <- df2$c[match(df1$id, df2$id)]
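A tiny illustration of why match() works as a lookup here; the values are made up. match() returns, for each id in df1, the row position of that id in df2, or NA when it is absent, and indexing with NA yields NA.

```r
# Toy data (hypothetical values)
df1 <- data.frame(id = c("a", "b", "c"), a = 1:3)
df2 <- data.frame(id = c("b", "a"), c = c(20, 10))

# Position of each df1 id within df2; "c" is not in df2, so it maps to NA
idx <- match(df1$id, df2$id)

# Indexing by those positions pulls the matching values in df1's row order
df1$c <- df2$c[idx]
df1
```

Note that match() only uses the first occurrence of each id in df2, so this assumes ids are unique there.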
Regarding the OP's use of full_join (left_join would be fine based on the example), the trick is to remove the columns that are not needed in the second dataset
library(dplyr)
nm1 <- c("id", setdiff(names(df2), names(df1)))
left_join(df1, select(df2, all_of(nm1)), by = 'id')
Another approach if one of the data frames has all the rows you want (df2 here):
library(dplyr)
bind_rows(df2, anti_join(df1, df2))
#Joining, by = c("id", "a", "b")
# id a b c
#1 a 1.912298 5.792475 6.899253
#2 b 2.537666 1.495075 1.186120
#3 c 5.947766 6.594028 NA
In this particular case this would be sufficient
library(sqldf)
sqldf("select * from df1 left natural join df2")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr:
library(dplyr)
left_join(df1, df2)
but in general you might need the following. Note this is perfectly general. We did not need to specify the column or row names in either the above or following code and in the following code it is symmetric in df1 and df2 so it does not rely on knowing the structure of either.
sqldf("select * from df1 left natural join df2
union
select * from df2 left natural join df1")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr. This will give a warning but still works. You can avoid the warning if id were character rather than factor or if you convert it to character first.
library(dplyr)
rbind(left_join(df1, df2), left_join(df2, df1)) %>% distinct
Note
Because the question did not use set.seed the code to generate the input is
not reproducible but we can copy the particular df1 and df2 so that we have the same data as in the question.
Lines1 <- "
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469"
df1 <- read.table(text = Lines1)
Lines2 <- "
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280"
df2 <- read.table(text = Lines2)
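A base R sketch of the same symmetric idea, assuming the copied data above: merge() with all = TRUE performs a full outer join on every column the two data frames share (here id, a and b), so neither rows nor columns are duplicated. The name merged is illustrative.

```r
# Same values as printed in the question
df1 <- data.frame(
  id = c("a", "b", "c"),
  a = c(6.396168, 4.119025, 5.608775),
  b = c(4.037320, 8.181253, 4.219469)
)
df2 <- data.frame(df1[1:2, ], c = c(2.444122, 6.444280))

# Full outer join on all common columns; by defaults to intersect(names(df1), names(df2))
merged <- merge(df1, df2, all = TRUE)
merged
```

Joining on numeric columns relies on the shared values being exactly identical, which holds here because df2 was built from df1.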

Update existing data.frame with values from another one if missing

I'm looking for the (1) name and (2) a (cleaner) method in R (base and data.table preferred) of the following.
Input
> d1
id x y
1 1 1 NA
2 2 NA 3
3 3 4 NA
> d2
id x y z
1 4 NA 30 a
2 3 20 2 b
3 2 14 NA c
4 1 15 97 d
(note that the actual data.frames have hundreds of columns)
Expected output:
> d1
id x y z
1 1 1 97 d
2 2 14 3 c
3 3 4 2 b
Data and current solution:
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])
for (col in setdiff(names(d1), "id")) {
# If missing look in d2
missing <- is.na(d1[[col]])
d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
}
for (col in setdiff(names(d2), names(d1))) {
# If column missing then add
d1[[col]] <- d2[match(d1$id, d2$id), col]
}
PS:
Likely this question has been asked before, but I lack the vocabulary to search for it.
Assuming you are working with 2 data.frames, here is a base solution
#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#make sure that d2 also has the same columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]
#replace NAs with the corresponding values from mask
replace(d, is.na(d), mask[is.na(d)])
If you don't mind, we can rewrite your question into a general matrix-coalesce question (i.e. any number of matrices, columns, rows), which seems not to have been asked before.
edit:
Another base R solution is a hack of coalesce1a from "How to implement coalesce efficiently in R":
coalesce.mat <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
rn <- match(ans$id, elt$id)
ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
}
ans
}
allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat,
lapply(list(d1, d2), function(x) {
x[, setdiff(allcols, names(x))] <- NA
x
}))
edit:
A possible data.table solution using coalesce1a from "How to implement coalesce efficiently in R" by Martin Morgan:
coalesce1a <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
i <- which(is.na(ans))
ans[i] <- elt[i]
}
ans
}
setDT(d1)
setDT(d2)
#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)
#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]
#pivot into original format and subset to those in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[
d1, .SD, on=.(id)]
Here is a possibility using dplyr::left_join:
left_join(d1, d2, by = "id") %>%
mutate(
x = ifelse(!is.na(x.x), x.x, x.y),
y = ifelse(!is.na(y.x), y.x, y.y)) %>%
select(id, x, y, z)
# id x y z
#1 1 1 97 d
#2 2 14 3 c
#3 3 4 2 b
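Since the real data has hundreds of columns, naming x and y by hand does not scale. A hedged sketch of the same left_join idea that loops over whatever columns the two data frames share; the names common, joined and result and the ".d2" suffix are illustrative choices.

```r
library(dplyr)

d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])

# Columns present in both data frames (apart from the key)
common <- setdiff(intersect(names(d1), names(d2)), "id")

# Keep d1's column names as they are; tag d2's clashing columns with ".d2"
joined <- left_join(d1, d2, by = "id", suffix = c("", ".d2"))

# Fill each shared column from d2 only where d1 is NA
for (col in common) {
  joined[[col]] <- coalesce(joined[[col]], joined[[paste0(col, ".d2")]])
}

# Drop the helper columns and keep d1's columns followed by d2's new ones
result <- joined[, union(names(d1), names(d2))]
result
```

coalesce() requires the paired columns to have compatible types, so this assumes shared columns are typed consistently in both data frames.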
We can use data.table with coalesce from dplyr. Create vectors of the column names that are common to both datasets ('nm1') and that exist only in the second ('nm2'). Convert the first dataset to a data.table (setDT(d1)), join on the 'id' column, and assign (:=) the coalesced values of the common columns (those from the second dataset are referenced with the i. prefix) together with the new columns, which updates the first dataset by reference.
library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1),
mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
# id x y z
#1: 1 1 97 d
#2: 2 14 3 c
#3: 3 4 2 b

Assign random but unique value between two dataframes

I have two objects:
Dataframe 1:
Address City
xyz City1
xyy City1
xxx City2
... ...
Dataframe 2
Column 1 Column 2 City
.... ... City1
.... ... City2
I want to join the two data-frames, so that I assign a random, but unique address from dataframe one to dataframe two, given that there is a match between the cities.
Essentially, the idea is to assign a random address for a given city.
I don't believe a join would work here, as the size of the dataframes varies and I need to assign a unique address value. Perhaps I'm mistaken though.
Any ideas how I can pull this off?
The idea is to pick a random row for each City in your first dataset and then join that info back to your second dataset.
# example datasets
df1 = read.table(text = "Address City
xyz City1
xyy City1
xxx City2
zzz City2", header=T, stringsAsFactors=F)
df2 = read.table(text = "Column1 Column2 City
1 3 City1
2 4 City2", header=T, stringsAsFactors=F)
library(dplyr)
set.seed(1) # for reproducible results
df1 %>%
group_by(City) %>% # for each city
sample_n(1) %>% # pick a random row
right_join(df2, by="City") %>% # right join df2
ungroup() # forget the grouping
# # A tibble: 2 x 4
# Address City Column1 Column2
# <chr> <chr> <int> <int>
# 1 xyz City1 1 3
# 2 xxx City2 2 4
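The approach above picks one address per city, so two rows of df2 in the same city would receive the same address. If each row needs its own address, one possible sketch is to shuffle the addresses once, number them within each city, number the df2 rows within each city, and join on both. The helper column slot and the three-row df2 are assumptions made for illustration.

```r
library(dplyr)
set.seed(1) # for reproducible results

df1 <- data.frame(
  Address = c("xyz", "xyy", "xxx", "zzz"),
  City = c("City1", "City1", "City2", "City2")
)
# Hypothetical df2 with two rows in the same city
df2 <- data.frame(Column1 = 1:3, City = c("City1", "City1", "City2"))

# Shuffle all addresses once, then number them within each city
shuffled <- df1[sample(nrow(df1)), ] %>%
  group_by(City) %>%
  mutate(slot = row_number()) %>%
  ungroup()

# Number the df2 rows within each city and pair row k with shuffled address k
result <- df2 %>%
  group_by(City) %>%
  mutate(slot = row_number()) %>%
  ungroup() %>%
  left_join(shuffled, by = c("City", "slot")) %>%
  select(-slot)
result
```

If a city has more rows in df2 than addresses in df1, the surplus rows get NA rather than a repeated address.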
A data.table alternative:
Scramble the entire address data once (sample(.I)), join on 'City', and select the first of the matches (mult = "first"). I don't know if speed is an issue, but this should be faster on larger data.
# example data
d1 <- data.frame(City = rep(paste0("c", 1:4), each = 4),
Address = paste0("a", 1:4))
d2 <- data.frame(City = paste0("c", 1:4))
library(data.table)
setDT(d1)
setDT(d2)
d1[d1[ , sample(.I)]][d2, on = "City", mult = "first"]
# City Address
# 1: c1 a2
# 2: c2 a3
# 3: c3 a1
# 4: c4 a2
