Assign random but unique value between two dataframes - r

I have two objects:
Dataframe 1:
Address City
xyz City1
xyy City1
xxx City2
... ...
Dataframe 2
Column 1 Column 2 City
.... ... City1
.... ... City2
I want to join the two data-frames, so that I assign a random, but unique address from dataframe one to dataframe two, given that there is a match between the cities.
Essentially, the idea is to assign a random address for a given city.
I don't believe a join would work here, as the size of the dataframes varies and I need to assign a unique address value. Perhaps I'm mistaken though.
Any ideas how I can pull this off?

The idea is to pick a random row for each City in your first dataset and then join that info back to your second dataset.
# example datasets
df1 = read.table(text = "Address City
xyz City1
xyy City1
xxx City2
zzz City2", header=T, stringsAsFactors=F)
df2 = read.table(text = "Column1 Column2 City
1 3 City1
2 4 City2", header=T, stringsAsFactors=F)
library(dplyr)
set.seed(1) # for reproducible results
df1 %>%
group_by(City) %>% # for each city
sample_n(1) %>% # pick a random row
right_join(df2, by="City") %>% # right join df2
ungroup() # forget the grouping
# # A tibble: 2 x 4
# Address City Column1 Column2
# <chr> <chr> <int> <int>
# 1 xyz City1 1 3
# 2 xxx City2 2 4
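The approach above picks one address per city. If df2 can list the same city more than once and each of those rows needs a different address, one possible sketch in base R is to shuffle the addresses, number them within each city, number the df2 rows the same way, and join on both. The repeated City1 row and the helper column k are assumptions added for illustration; they are not part of the original data.

```r
# Hypothetical data: df2 repeats City1, so uniqueness matters
df1 <- data.frame(Address = c("xyz", "xyy", "xxx", "zzz"),
                  City = c("City1", "City1", "City2", "City2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Column1 = 1:3,
                  City = c("City1", "City1", "City2"),
                  stringsAsFactors = FALSE)

set.seed(1) # for reproducible results
# shuffle the addresses, then number them within each city
shuffled <- df1[sample(nrow(df1)), ]
shuffled$k <- ave(seq_len(nrow(shuffled)), shuffled$City, FUN = seq_along)
# number the df2 rows within each city the same way
df2$k <- ave(seq_len(nrow(df2)), df2$City, FUN = seq_along)
# the k-th df2 row of a city gets the k-th shuffled address of that city
res <- merge(df2, shuffled, by = c("City", "k"), all.x = TRUE)
res$k <- NULL # drop the helper column
res
```

If a city has more rows in df2 than addresses in df1, the surplus rows get NA in Address rather than a reused address.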

A data.table alternative:
Scramble the entire address data once (sample(.I)), join on 'City', and select the first of the matches (mult = "first").
library(data.table)
# example data
d1 <- data.frame(City = rep(paste0("c", 1:4), each = 4),
Address = paste0("a", 1:4))
d2 <- data.frame(City = paste0("c", 1:4))
setDT(d1)
setDT(d2)
d1[d1[ , sample(.I)]][d2, on = "City", mult = "first"]
# City Address
# 1: c1 a2
# 2: c2 a3
# 3: c3 a1
# 4: c4 a2
I don't know if speed is an issue, but this should be faster on larger data.

Related

Comparing two columns of two dataframes and finding mismatches

I have two different dataframes as below
df1 <- data.frame(state=letters[1:3],district=letters[4:6])
state district
1 a d
2 b e
3 c f
and df2
df2 <- data.frame(state=letters[1:3], district= c("e","d","f"))
state district
1 a e
2 b d
3 c f
I want to check whether the districts in df1 exist in df2. If not, select the state and district.
And if a district in df1 exists in df2, does it belong to the same state as in df1 or not?
For example, district "d" belongs to state "a" in df1 but to state "b" in df2, which is wrong.
What I am trying is:
'%noin%' <- Negate('%in%')
#creating unique id for df1
df1$uuid <- tolower(paste0(df1$state,"_",df1$district))
#creating unique id for df2
df2$uuid <- tolower(paste0(df2$state,"_",df2$district))
df_result <- df1 %>% filter(df1$uuid %noin% df2$uuid) %>%
select(state,district)
state district
1 a d
2 b e
How can I select the state in df2 that these districts belong to?
My expected output looks like this:
expected_output <- data.frame(state=c("a","b"), district=c("d","e"),state_in_df_2=c("b","a"))
state district state_in_df_2
1 a d b
2 b e a
Thank you in advance
Using an anti_join and a left_join you could do:
library(dplyr)
df1 <- data.frame(state=letters[1:3],district=letters[4:6])
df2 <- data.frame(state=letters[1:3], district= c("e","d","f"))
df1 %>%
anti_join(df2, by = c("state", "district")) %>%
left_join(df2, by = c("district"), suffix = c("", "_in_df2"))
#> state district state_in_df2
#> 1 a d b
#> 2 b e a
Not sure if this will generalise to your case, but you can try:
filter(merge(df1, df2, by = 'district'), state.x != state.y)
# district state.x state.y
#1 d a b
#2 e b a
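A base R variant of the same lookup, offered as a rough sketch: match() finds, for each district in df1, the row of df2 holding that district, which also covers the case where a district is missing from df2 altogether (it shows up as NA). This assumes each district appears at most once in df2; the names state_in_df2 and mismatch are made up for the example.

```r
df1 <- data.frame(state = letters[1:3], district = letters[4:6],
                  stringsAsFactors = FALSE)
df2 <- data.frame(state = letters[1:3], district = c("e", "d", "f"),
                  stringsAsFactors = FALSE)

# state that df2 records for each df1 district (NA if the district is absent)
state_in_df2 <- df2$state[match(df1$district, df2$district)]
# keep rows where the district is missing from df2 or sits in another state
mismatch <- is.na(state_in_df2) | state_in_df2 != df1$state
out <- cbind(df1[mismatch, ], state_in_df2 = state_in_df2[mismatch])
out
```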

Left_join under condition

I have 2 dataframes and I need to merge them based on condition:
```
# Dataframe 1
plant1 <- c("FF", "DO")
loc1 <- c("MM", "KB")
df1 <- data.frame(plant1, loc1)
df1
plant1 loc1
1 FF MM
2 DO KB
# Dataframe 2
plant2 <- c("FF", "DO","DO")
loc2 <- c("MM", "KB","KB")
name <- c("name_1", "name_2","name_3")
frequency <- c(1, 2, 2)
df2 <- data.frame(plant2, loc2, name, frequency)
df2
plant2 loc2 name frequency
1 FF MM name_1 1
2 DO KB name_2 2
3 DO KB name_3 2
```
I need to bring to df1 value of name from df2 ONLY for those cases WHERE frequency == 1,
for the rest of the cases I need to set specific text.
This is the result I need to get:
plant3 loc3 name3
1 FF MM name_1
2 DO KB multiple
I am starting with the simplest code, where I need to add that condition:
df1 %>% left_join(df2, by=c("plant1" = "plant2", "loc1" = "loc2" ))
Of course I can do it the "dirty" way, with a simple left_join followed by replacing values in the name column where frequency != 1 and calling unique().
Is there a more elegant way?
I was checking this discussion for the topic, but could not apply it for my case:
https://community.rstudio.com/t/how-can-i-join-two-tables-with-an-or-statement-in-r-using-dplyrs-join-functions/37633
Here is a data.table possibility:
library(data.table)
# Make them data.tables
setDT(df1);setDT(df2)
# Set key for join
setkey(df1, plant1, loc1)
setkey(df2, plant2, loc2)
# Join
df2[df1, .(name3 = if (.N > 1) "multiple" else x.name), by = .EACHI][]
# plant2 loc2 name3
# 1: DO KB multiple
# 2: FF MM name_1
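For comparison, a base R sketch that uses the frequency column directly: recode the name before joining, collapse df2 to one row per key, then do the left join with merge(). This is only one way to arrange the steps, and it assumes frequency is constant within each plant/location pair, as in the example data. The names name3 and lookup are illustrative.

```r
df1 <- data.frame(plant1 = c("FF", "DO"), loc1 = c("MM", "KB"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(plant2 = c("FF", "DO", "DO"), loc2 = c("MM", "KB", "KB"),
                  name = c("name_1", "name_2", "name_3"),
                  frequency = c(1, 2, 2), stringsAsFactors = FALSE)

# keep the name only where frequency == 1, otherwise use the fixed text
df2$name3 <- ifelse(df2$frequency == 1, df2$name, "multiple")
# one row per (plant2, loc2) key
lookup <- unique(df2[, c("plant2", "loc2", "name3")])
# left join: all.x = TRUE keeps every row of df1
res <- merge(df1, lookup,
             by.x = c("plant1", "loc1"), by.y = c("plant2", "loc2"),
             all.x = TRUE)
res
```

Because the recoding happens before the join, no duplicate rows are created that would need cleaning up afterwards.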

Merge two R dataframes by at least one merge ID across columns

I have a joining problem that I'm struggling with in that the join IDs I want to use for separate dataframes are spread out across three possible ID columns. I'd like to be able to join if at least one join ID matches. I know the _join and merge functions accept a vector of column names but is it possible to make this work conditionally?
For example, if I have the following two data frames:
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
id1 = c("abc", "", "bcd"),
id2 = c("", "", "xyz"),
id3 = c("def", "fgh", ""), stringsAsFactors = F)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
id1 = c("abc", "", ""),
id2 = c("", "xyz", "zzz"),
id3 = c("", "", ""), stringsAsFactors = F)
> df_A
  dta id1 id2 id3
1 FOO abc     def
2 BAR         fgh
3 GOO bcd xyz
> df_B
  dta id1 id2 id3
1 FUU abc
2 PAR     xyz
3 KOO     zzz
I hope to end up with something like this:
dta.x dta.y id1 id2 id3
1 FOO FUU abc "" def [matched on id1]
2 BAR "" "" "" fgh [unmatched]
3 GOO PAR bcd xyz "" [matched on id2]
4 KOO "" "" zzz "" [unmatched]
So that unmatched dta1 and dta2 values are retained, but where there is a match (rows 1 and 3 above) both dta1 and dta2 are joined in the new table. I have a sense that none of _join, merge, or match will work as is and that I'd need to write a function, but I'm not sure where to start. Any help or ideas appreciated. Thank you
Basically, you want to join by corresponding IDs. One way is to convert the original id columns into two columns, id_column and id_value. Because you don't want to join on "", I dropped those rows.
library(tidyverse)
df_A_long <- df_A %>%
pivot_longer(
cols = -dta,
names_to = "id_column",
values_to = "id_value"
) %>%
dplyr::filter(id_value != "")
df_B_long <- df_B %>%
pivot_longer(
cols = -dta,
names_to = "id_column",
values_to = "id_value"
) %>%
dplyr::filter(id_value != "")
We always use id_column and id_value to join A & B.
> df_B_long
# A tibble: 3 x 3
dta id_column id_value
<chr> <chr> <chr>
1 FUU id1 abc
2 PAR id2 xyz
3 KOO id2 zzz
The joining part is clear, but to create your desired output, we need to do some data wrangling to make it look identical.
df_joined <- df_A_long %>%
# join using id_column and id_value
full_join(df_B_long, by = c("id_column","id_value"),suffix = c("1","2")) %>%
# pivot back to wide format
pivot_wider(
id_cols = c(dta1,dta2),
names_from = id_column,
values_from = id_value
) %>%
# if dta1 is missing, then in the same row, move value from dta2 to dta1
mutate(
dta1_has_value = !is.na(dta1), # helper column
dta1 = ifelse(dta1_has_value,dta1,dta2),
dta2 = ifelse(!dta1_has_value & !is.na(dta2),NA,dta2)
) %>%
select(-dta1_has_value) %>%
group_by(dta1) %>%
# condense multiple rows into one row
summarise_all(
~ifelse(all(is.na(.x)),"",.x[!is.na(.x)])
) %>%
# reorder columns
{
.[sort(colnames(.))]
}
Result:
> df_joined
# A tibble: 4 x 5
dta1 dta2 id1 id2 id3
<chr> <chr> <chr> <chr> <chr>
1 BAR "" "" "" fgh
2 FOO FUU abc "" def
3 GOO PAR bcd xyz ""
4 KOO "" "" zzz ""
library(sqldf)
one <-
sqldf('
select a.*
, b.dta as dta_b
from df_A a
left join df_B b
on a.id1 <> ""
and (
a.id1 = b.id1
or a.id2 = b.id2)
')
two <-
sqldf('
select b.*
from df_B b
left join one
on b.dta = one.dta
or b.dta = one.dta_b
where one.dta is null
')
dplyr::bind_rows(one, two)
# dta id1 id2 id3 dta_b
# 1 FOO abc def FUU
# 2 BAR fgh <NA>
# 3 GOO bcd xyz PAR
# 4 KOO zzz <NA>
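A package-free sketch of the "match on at least one ID" idea, under two simplifying assumptions: IDs are only compared within the same column (id1 with id1, and so on), and each row of df_A takes at most the first matching row of df_B. The names id_cols, match_idx, matched and unmatched_B are made up for the example.

```r
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)
id_cols <- c("id1", "id2", "id3")

# for each row of df_A: index of the first df_B row sharing a non-empty id
match_idx <- apply(df_A[id_cols], 1, function(a) {
  hit <- which(apply(df_B[id_cols], 1, function(b) any(a == b & a != "")))
  if (length(hit)) hit[1] else NA_integer_
})

# rows of df_A, with the matching dta from df_B where there is one
matched <- data.frame(dta.x = df_A$dta, dta.y = df_B$dta[match_idx],
                      df_A[id_cols], stringsAsFactors = FALSE)
# rows of df_B that no df_A row matched
unmatched_B <- df_B[!seq_len(nrow(df_B)) %in% match_idx, ]
res <- rbind(matched,
             data.frame(dta.x = unmatched_B$dta, dta.y = NA_character_,
                        unmatched_B[id_cols], stringsAsFactors = FALSE))
res
```

The nested apply() compares every pair of rows, so this is only practical for small tables; the long-format join shown earlier should scale better.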

Joining / merging two data frames by symmetric differences in rows and columns

I would like to join / merge two data frames, but ignoring similarities in rows and columns in the resulting data frame. Consider the following example:
df1 <- data.frame(
id = c("a","b","c"),
a = runif(3,1,9),
b = runif(3,1,9)
)
df2 <- data.frame(
df1[1:2,],
c = runif(2,1,9)
)
This results in two data frames that have exactly four cells in common (not counting id), so df1[1:2,2:3] == df2[1:2,2:3]. However, they differ in that df1 has an additional row and df2 has an additional column:
> print(df1)
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469
> print(df2)
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
I want a new data frame to consist of the symmetric differences between these two, so no duplicates in rows or columns. The closest result I have achieved is by using dplyr::full_join(df1, df2, by = "id"), but this results in duplicated columns.
The result should look like this:
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
3 c 5.608775 4.219469 NA
What's the best way of achieving this dynamically? Thanks
With data.table we can join on the 'id' and assign the 'c' from the second dataset to create the 'c' column in the first data. By default, the non-matching elements will be assigned as NA
library(data.table)
setDT(df1)[df2, c := c, on = .(id)]
df1
# id a b c
#1: a 4.601639 1.065642 7.476494
#2: b 6.065758 6.234421 8.929932
#3: c 4.000351 7.365717 NA
NOTE: The values are different because no seed was set
In base R, an option would be match
df1$c <- df2$c[match(df1$id, df2$id)]
Regarding the OP's use of full_join (left_join would be fine based on the example), the trick is to remove the columns that are not needed in the second dataset
library(dplyr)
nm1 <- c("id", setdiff(names(df2), names(df1)))
left_join(df1, select(df2, all_of(nm1)), by = 'id')
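For completeness, base R's merge() joins on all shared column names by default, so a full outer merge (all = TRUE) gives the requested shape without naming any columns. A minimal sketch, assuming the shared columns hold exactly identical values in both data frames; the numbers below are copied from the printed output in the question.

```r
df1 <- data.frame(id = c("a", "b", "c"),
                  a = c(6.396168, 4.119025, 5.608775),
                  b = c(4.037320, 8.181253, 4.219469),
                  stringsAsFactors = FALSE)
df2 <- data.frame(df1[1:2, ], c = c(2.444122, 6.444280))

# by defaults to intersect(names(df1), names(df2)), here id, a and b
res <- merge(df1, df2, all = TRUE)
res
```

Since a and b are part of the join key here, a row whose values differ even slightly between the two data frames would appear twice in the result.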
Another approach if one of the data frames has all the rows you want (df2 here):
library(dplyr)
bind_rows(df2, anti_join(df1, df2))
#Joining, by = c("id", "a", "b")
# id a b c
#1 a 1.912298 5.792475 6.899253
#2 b 2.537666 1.495075 1.186120
#3 c 5.947766 6.594028 NA
In this particular case this would be sufficient
library(sqldf)
sqldf("select * from df1 left natural join df2")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr:
library(dplyr)
left_join(df1, df2)
but in general you might need the following. Note this is perfectly general. We did not need to specify the column or row names in either the above or following code and in the following code it is symmetric in df1 and df2 so it does not rely on knowing the structure of either.
sqldf("select * from df1 left natural join df2
union
select * from df2 left natural join df1")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr. This will give a warning but still works. You can avoid the warning if id were character rather than factor or if you convert it to character first.
library(dplyr)
rbind(left_join(df1, df2), left_join(df2, df1)) %>% distinct
Note
Because the question did not use set.seed the code to generate the input is
not reproducible but we can copy the particular df1 and df2 so that we have the same data as in the question.
Lines1 <- "
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469"
df1 <- read.table(text = Lines1)
Lines2 <- "
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280"
df2 <- read.table(text = Lines2)

R function that merge rows and introduce a new merge variable

I have a data set like this....
ID Brand
--- --------
1 Cokacola
2 Pepsi
3 merge with 1
4 merge with 2
5 merge with 1
6 Fanta
And I want to write an R function that merges the rows and introduces a new variable according to ID, like the following:
ID Brand merge
---- -------- --------
1 Cokacola 1,3,5
2 Pepsi 2,4
6 Fanta 6
Your data:
dat <- data.frame(
id = 1:6,
brand = c('Cokacola', 'Pepsi', 'merge with 1', 'merge with 2', 'merge with 1', 'Fanta'))
Inelegant-but-functional code:
repeats <- grepl('^merge with', dat$brand)
groups <- ifelse(repeats, gsub('merge with ', '', dat$brand), dat$id)
merge <- sapply(unique(groups), function(x) paste(dat$id[groups==x], collapse=','))
dat <- dat[!repeats,]
dat$merge <- merge
dat
## id brand merge
## 1 1 Cokacola 1,3,5
## 2 2 Pepsi 2,4
## 6 6 Fanta 6
There are most certainly ways to make this more elegant, depending on the consistency and makeup of the data.
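Since the question asks for a function, the same steps could be wrapped up roughly as follows. This is a sketch that assumes every reference has the exact form "merge with <id>" and points at an existing id; the function name merge_brands is made up.

```r
merge_brands <- function(dat) {
  # rows that only point at another row
  is_ref <- grepl("^merge with ", dat$brand)
  # id each row belongs to: the referenced id, or its own id
  target <- as.integer(ifelse(is_ref,
                              sub("^merge with ", "", dat$brand),
                              dat$id))
  # collapse the ids of each group into one comma-separated string
  merged <- vapply(split(dat$id, target), paste, character(1), collapse = ",")
  out <- dat[!is_ref, ]
  out$merge <- unname(merged[as.character(out$id)])
  out
}

dat <- data.frame(
  id = 1:6,
  brand = c('Cokacola', 'Pepsi', 'merge with 1', 'merge with 2',
            'merge with 1', 'Fanta'),
  stringsAsFactors = FALSE)
res <- merge_brands(dat)
res
```

Looking the merged strings up by id, rather than relying on their order, keeps the result correct even if the brand rows are not sorted by id.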
You could try
library(reshape2)
indx <- !grepl('merge', df$Brand)
df1 <- df[indx,]
val <- as.numeric(sub('[^0-9]+', '', df[!indx, 'Brand']))
ml <- melt(tapply(which(!indx), val, FUN=toString))
df2 <- merge(df1, ml, by.x='ID', by.y='Var1', all=TRUE)
df2$merge <- with(df2, ifelse(!is.na(value),
paste(ID, value, sep=', '), ID))
df2[-3]
# ID Brand merge
#1 1 Cokacola 1, 3, 5
#2 2 Pepsi 2, 4
#3 6 Fanta 6
