I'm struggling with a join in which the IDs I want to join on are spread across three possible ID columns in each data frame. I'd like to join rows if at least one ID matches. I know the _join and merge functions accept a vector of column names, but is it possible to make this work conditionally?
For example, if I have the following two data frames:
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
id1 = c("abc", "", "bcd"),
id2 = c("", "", "xyz"),
id3 = c("def", "fgh", ""), stringsAsFactors = F)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
id1 = c("abc", "", ""),
id2 = c("", "xyz", "zzz"),
id3 = c("", "", ""), stringsAsFactors = F)
> df_A
  dta id1 id2 id3
1 FOO abc     def
2 BAR         fgh
3 GOO bcd xyz
> df_B
  dta id1 id2 id3
1 FUU abc
2 PAR     xyz
3 KOO     zzz
I hope to end up with something like this:
dta.x dta.y id1 id2 id3
1 FOO FUU abc "" def [matched on id1]
2 BAR "" "" "" fgh [unmatched]
3 GOO PAR bcd xyz "" [matched on id2]
4 KOO "" "" zzz "" [unmatched]
So that unmatched dta1 and dta2 values are retained, but where there is a match (rows 1 and 3 above) both dta1 and dta2 are joined in the new table. I have a sense that none of _join, merge, or match will work as is and that I'd need to write a function, but I'm not sure where to start. Any help or ideas appreciated. Thank you.
Basically, you want to join on corresponding IDs. To do that, you can convert the original id columns into a long format with id_column and id_value columns. Because you don't want to join on "", I dropped those rows.
library(tidyverse)
df_A_long <- df_A %>%
pivot_longer(
cols = -dta,
names_to = "id_column",
values_to = "id_value"
) %>%
dplyr::filter(id_value != "")
df_B_long <- df_B %>%
pivot_longer(
cols = -dta,
names_to = "id_column",
values_to = "id_value"
) %>%
dplyr::filter(id_value != "")
We always use id_column and id_value to join A & B.
> df_B_long
# A tibble: 3 x 3
dta id_column id_value
<chr> <chr> <chr>
1 FUU id1 abc
2 PAR id2 xyz
3 KOO id2 zzz
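To see which rows the long-format keys actually pair up, here is a minimal base R sketch. It assumes the df_A and df_B from the question; to_long() is a hypothetical helper written only for this illustration and is not part of any package.

```r
# Self-contained copy of the example data from the question
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)

# Hypothetical helper: one row per non-empty (id_column, id_value) pair
to_long <- function(df) {
  id_cols <- setdiff(names(df), "dta")
  long <- data.frame(
    dta       = rep(df$dta, times = length(id_cols)),
    id_column = rep(id_cols, each = nrow(df)),
    id_value  = unlist(df[id_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
  long[long$id_value != "", ]
}

# Inner join on both keys: only rows that share an ID in the same column survive
matches <- merge(to_long(df_A), to_long(df_B),
                 by = c("id_column", "id_value"), suffixes = c("1", "2"))
matches
```

Only the pairs FOO/FUU (via id1) and GOO/PAR (via id2) should remain, which is why a full join is needed afterwards to keep the unmatched rows.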
The joining part is clear, but to create your desired output, we need to do some data wrangling to make it look identical.
df_joined <- df_A_long %>%
# join using id_column and id_value
full_join(df_B_long, by = c("id_column","id_value"),suffix = c("1","2")) %>%
# pivot back to wide format
pivot_wider(
id_cols = c(dta1,dta2),
names_from = id_column,
values_from = id_value
) %>%
# if dta1 is missing, then in the same row, move value from dta2 to dta1
mutate(
dta1_has_value = !is.na(dta1), # helper column
dta1 = ifelse(dta1_has_value,dta1,dta2),
dta2 = ifelse(!dta1_has_value & !is.na(dta2),NA,dta2)
) %>%
select(-dta1_has_value) %>%
group_by(dta1) %>%
# condense multiple rows into one row
summarise_all(
~ifelse(all(is.na(.x)),"",.x[!is.na(.x)])
) %>%
# reorder columns
{
# use the piped data (.) itself; df_joined does not exist yet at this point
.[sort(colnames(.))]
}
Result:
> df_joined
# A tibble: 4 x 5
dta1 dta2 id1 id2 id3
<chr> <chr> <chr> <chr> <chr>
1 BAR "" "" "" fgh
2 FOO FUU abc "" def
3 GOO PAR bcd xyz ""
4 KOO "" "" zzz ""
library(sqldf)
one <-
sqldf('
select a.*
, b.dta as dta_b
from df_A a
left join df_B b
on a.id1 <> ""
and (
a.id1 = b.id1
or a.id2 = b.id2)
')
two <-
sqldf('
select b.*
from df_B b
left join one
on b.dta = one.dta
or b.dta = one.dta_b
where one.dta is null
')
dplyr::bind_rows(one, two)
# dta id1 id2 id3 dta_b
# 1 FOO abc def FUU
# 2 BAR fgh <NA>
# 3 GOO bcd xyz PAR
# 4 KOO zzz <NA>
I want to see whether the text column has elements outside the specified values of "a" and "b"
specified_value=c("a","b")
df=data.frame(key=c(1,2,3,4),text=c("a,b,c","a,d","1,2","a,b"))
df_out=data.frame(key=c(1,2,3,4),text=c("c","d","1,2",NA))
This is what I have tried:
df=df%>%mutate(text_vector=strsplit(text, split=","),
extra=text_vector[which(!text_vector %in% specified_value)])
But this doesn't work, any suggestions?
We can split 'text' on the delimiter with separate_rows, group by 'key', keep the elements that are not in 'specified_value' with setdiff and paste them together (toString), then do a join to get the other columns of the original dataset.
library(dplyr) # >= 1.0.0
library(tidyr)
df %>%
separate_rows(text) %>%
group_by(key) %>%
summarise(extra = toString(setdiff(text, specified_value))) %>%
left_join(df) %>%
mutate(extra = na_if(extra, ""))
# A tibble: 4 x 3
# key extra text
# <dbl> <chr> <chr>
#1 1 c a,b,c
#2 2 d a,d
#3 3 1, 2 1,2
#4 4 <NA> a,b
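The per-key step can be checked in isolation. The following is a small base R sketch of what setdiff and toString do for a single value of 'text'; the variable names parts, extra and none_left are made up for the illustration.

```r
specified_value <- c("a", "b")

# One row's text, split on commas
parts <- strsplit("a,b,c", ",", fixed = TRUE)[[1]]

# Elements that are not among the specified values
extra <- toString(setdiff(parts, specified_value))

# When everything is allowed, setdiff() is empty and toString() gives ""
none_left <- toString(setdiff(c("a", "b"), specified_value))
```

The empty string in the second case is why the pipeline above finishes with na_if(extra, "").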
Using setdiff.
x <- lapply(strsplit(df$text, ","), setdiff, specified_value)
# return a real NA (not the string "NA") when nothing is left over
df$outside <- sapply(x, function(v) if (length(v)) paste(v, collapse = ",") else NA_character_)
df
# key text outside
# 1 1 a,b,c c
# 2 2 a,d d
# 3 3 1,2 1,2
# 4 4 a,b <NA>
Data:
df <- structure(list(key = c(1, 2, 3, 4), text = c("a,b,c", "a,d",
"1,2", "a,b")), class = "data.frame", row.names = c(NA, -4L))
specified_value <- c("a", "b")
Use stringi::stri_split_fixed:
library(stringi)
!all(stri_split_fixed("a,b", ",", simplify=T) %in% specified_value) #FALSE
!all(stri_split_fixed("a,b,c", ",", simplify=T) %in% specified_value) #TRUE
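The two calls above test one string at a time. A possible way to apply the same check to the whole column, sketched here in base R with strsplit instead of stringi (has_extra is a name chosen for this example):

```r
df <- data.frame(key = c(1, 2, 3, 4),
                 text = c("a,b,c", "a,d", "1,2", "a,b"),
                 stringsAsFactors = FALSE)
specified_value <- c("a", "b")

# TRUE where at least one element of 'text' is outside specified_value
has_extra <- vapply(
  strsplit(df$text, ",", fixed = TRUE),
  function(v) !all(v %in% specified_value),
  logical(1)
)
has_extra
```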
An option using regex without splitting the data on comma :
#Collapse the specified_value in one string and remove from text
df$text1 <- gsub(paste0(specified_value, collapse = "|"), '', df$text)
#Remove extra commas
df$text1 <- gsub('(?<![a-z0-9]),', '', df$text1, perl = TRUE)
df
# key text text1
#1 1 a,b,c c
#2 2 a,d d
#3 3 1,2 1,2
#4 4 a,b
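One caveat with the regex approach, shown as a small sketch: the alternation "a|b" also matches inside longer tokens. If the real data can contain such tokens, word boundaries are one possible guard (this is an assumption about the data, not something the question states).

```r
# The plain alternation strips "a" and "b" from inside "abc" as well
plain <- gsub("a|b", "", "abc,d")

# With word boundaries only whole tokens "a" or "b" are removed
bounded <- gsub("\\b(a|b)\\b", "", "abc,a,d", perl = TRUE)
```

The leftover commas would still need the second gsub from the answer.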
I have some difficulties with my code, and I hope some of you could help.
The dataset looks something like this:
df <- data.frame("group" = c("A", "A", "A","A_1", "A_1", "B","B","B_1"),
"id" = c("id1", "id2", "id3", "id2", "id3", "id5","id1","id1"),
"time" = c(1,1,1,3,3,2,2,5),
"Val" = c(10,10,10,10,10,12,12,12))
"group" indicate the group the individual "id" is in. "A_1" indicate that a subject has left the group.
For instance, one subject "id1" leaves the "group A" that becomes group "A_1", where only "id2" and "id3" are members. Similarly "id5" leaves group B that becomes "B_1" with only id1 as a member.
What I would like to have in the final dataset is an opposite type of groups identification, that should look something like this:
final <- data.frame("group" = c("A", "A", "A","A_1", "B","B","B_1"),
"id" = c("id1", "id2", "id3", "id1", "id5","id1","id5"),
"time" = c(1,1,1,3,2,2,5),
"Val" = c(10,10,10,10,12,12,12),
"groupid" = c("A", "A", "A","A", "B","B","B"))
Whereby "A_1" and "B_1" only indicate the subjects, "id1" and "id5" respectively, that have left the original group, rather than identifying remaining subjects.
Does anyone have suggestions on how I could systematically do this?
I thank you in advance for your help.
Follow up:
My data is a little more complex than in the above example, as there are multiple "exits" from treatments; moreover, group identifiers can be of different character lengths (here, for instance, AAA and B). The data looks more like the following:
df2 <- data.frame("group" = c("AAA", "AAA", "AAA","AAA","AAA_1","AAA_1", "AAA_1","AAA_2","AAA_2","B","B","B_1"),
"id" = c("id1", "id2", "id3","id4", "id2", "id3","id4", "id2","id3", "id5","id1","id1"),
"time" = c(1,1,1,1,3,3,3,6,6,2,2,5),
"Val" = c(10,10,10,10,10,10,10,10,10,12,12,12))
Here, at time 3, id1 leaves group AAA, which becomes group AAA_1, while at time 6 id4 also leaves group AAA, which becomes group AAA_2. As discussed previously, I would like groups with "_" to identify the ids that left the group rather than the ones remaining. Hence the final dataset should look something like this:
final2 <- data.frame("group" = c("A", "A", "A","A","A_1","A_2",
"B","B","B_1"),
"id" = c("id1", "id2", "id3","id4", "id1", "id4", "id5","id1","id5"),
"time" = c(1,1,1,1,3,6,2,2,5),
"Val" = c(10,10,10,10,10,10,12,12,12))
thanks for helping me with this
OK, you can try it with dplyr this way; maybe it's not elegant, but you get the result. The idea is to first fetch the ids that are in group ... but not in the corresponding ..._1 group and change their group, then fetch the others, and rbind them together:
library(dplyr)
# first you could find the one that are missing in the ..._1 groups
# and change their group to ..._1
dups <-
df %>%
group_by(id, groupid = substr(group,1,1)) %>%
filter(n() == 1)%>%
mutate(group = paste0(group,'_1')) %>%
left_join(df %>%
select(group, time, Val) %>%
distinct(), by ='group') %>%
select(group, id, time = time.y, Val = Val.y) %>%
ungroup()
dups
# A tibble: 2 x 5
groupid group id time Val
<chr> <chr> <fct> <dbl> <dbl>
1 A A_1 id1 3 10
2 B B_1 id5 5 12
# now you can select the rows that belong to the original groups (A, B):
dups2 <-
df %>%
filter(nchar(as.character(group)) == 1) %>%
mutate(groupid = substr(group,1,1))
dups2
group id time Val groupid
1 A id1 1 10 A
2 A id2 1 10 A
3 A id3 1 10 A
4 B id5 2 12 B
5 B id1 2 12 B
Last, rbind() them, arrange() the rows and select() the columns in the desired order:
rbind(dups, dups2) %>%
arrange(group) %>%
select(group, id, time, Val, groupid)
# A tibble: 7 x 5
group id time Val groupid
<chr> <fct> <dbl> <dbl> <chr>
1 A id1 1 10 A
2 A id2 1 10 A
3 A id3 1 10 A
4 A_1 id1 3 10 A
5 B id5 2 12 B
6 B id1 2 12 B
7 B_1 id5 5 12 B
Hope it helps!
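The core step, finding who is in a group but no longer in its successor, can also be checked on its own with setdiff(). A minimal base R sketch on the question's df (left_A and left_B are names made up for the illustration):

```r
df <- data.frame(group = c("A", "A", "A", "A_1", "A_1", "B", "B", "B_1"),
                 id = c("id1", "id2", "id3", "id2", "id3", "id5", "id1", "id1"),
                 time = c(1, 1, 1, 3, 3, 2, 2, 5),
                 Val = c(10, 10, 10, 10, 10, 12, 12, 12),
                 stringsAsFactors = FALSE)

# ids present in the original group but missing from the "_1" group
left_A <- setdiff(df$id[df$group == "A"], df$id[df$group == "A_1"])
left_B <- setdiff(df$id[df$group == "B"], df$id[df$group == "B_1"])
```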
EDIT:
You can generalize it with some work. Here is my attempt, hope it helps:
library(dplyr)
df3 <- df2
# you have to set a couple of fields you need:
df3$group <-ifelse(
substr(df2$group,(nchar(as.character(df2$group))+1)-1,nchar(as.character(df2$group))) %in% c(0:9),
paste0(substr(df2$group,1,1),"_",substr(df2$group,(nchar(as.character(df2$group))+1)-1,nchar(as.character(df2$group)))),
paste0(substr(df2$group,1,1),"_0")
)
df3$util <- as.numeric(substr(df3$group,3,3))+1
# two empty lists to populate with a nested loop:
changed <- list()
final_changed <- list()
Now we first find who changes, then the others; the idea is the same as in the previous part:
for (j in c("A","B")) {
df3_ <- df3[substr(df3$group,1,1)==j,]
# loop over every stage except the last one
for (i in head(unique(df3_$util), -1)) {
temp1 <- df3_[df3_$util == i,]
temp2 <- df3_[df3_$util == i+1,]
changes <- temp1[!temp1$id %in% temp2$id,]
changes$group <- paste0(j,'_',i )
changes <- changes %>% left_join(temp2, by = 'group') %>%
select(group , id = id.x, time = time.y, Val = Val.y)
changed[[i]] <- changes
}
final_changed[[j]] <- changed
}
change <- do.call(rbind,(do.call(Map, c(f = rbind, final_changed)))) %>% distinct()
change
group id time Val
1 A_1 id1 3 10
2 B_1 id5 5 12
3 A_2 id4 6 10
Then get the remaining rows and put everything together:
remain <-
df3 %>% mutate(group = gsub("_0", "", .$group)) %>%
filter(nchar(as.character(group)) == 1) %>% select(-util)
rbind(change, remain) %>%
mutate(groupid = substr(group,1,1)) %>% arrange(group) %>%
select(group, id, time, Val, groupid)
group id time Val groupid
1 A id1 1 10 A
2 A id2 1 10 A
3 A id3 1 10 A
4 A id4 1 10 A
5 A_1 id1 3 10 A
6 A_2 id4 6 10 A
7 B id5 2 12 B
8 B id1 2 12 B
9 B_1 id5 5 12 B
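Note that the code above keeps only the first character of the group name, so "AAA" ends up as "A". If the full prefix should be preserved, one possible sketch is to split each name into a base part and a numeric stage with regular expressions. The names base and stage are chosen for this example, and the sketch assumes the suffix is always an underscore followed by digits.

```r
group <- c("AAA", "AAA_1", "AAA_2", "B", "B_1")

# Base name: everything before a trailing "_<digits>" suffix
base <- sub("_[0-9]+$", "", group)

# Stage: the numeric suffix, or 0 for the original group
has_suffix <- grepl("_[0-9]+$", group)
stage <- rep(0L, length(group))
stage[has_suffix] <- as.integer(sub("^.*_", "", group[has_suffix]))

data.frame(group, base, stage)
```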
I'm trying to match data across two tables through two columns in R: ID number & address. I'm primarily matching through ID number, but there is missing data so address is the back-up column for matching. Any ideas on how to do it? Does merge() allow an "or" in the "by" argument?
This doesn't work, but for instance:
merge(table1, table2, by = 'ID number' or 'address')
Using left_join to get the rows that match, then filtering out the missing data and repeating, is too long.
One way is to merge twice, first with id and then with address, and then clean up the final values:
table1 <- data.frame(
id = c(1, 2, 3),
address = letters[1:3],
stringsAsFactors = F
)
table2 <- data.frame(
id = c(1, NA_integer_, 3),
address = c(letters[1:2], NA_character_),
value = 10:12,
stringsAsFactors = F
)
d <- merge(table1, table2[c("id", "value")], by = "id", all.x = T)
result <- merge(d, table2[c("address", "value")], by = "address", all.x = T)
result$final_value <- with(result, ifelse(is.na(value.x), value.y, value.x))
address id value.x value.y final_value
1 a 1 10 10 10
2 b 2 NA 11 11
3 c 3 12 NA 12
With dplyr:
library(dplyr)
table1 %>%
left_join(select(table2, id, value), by = "id") %>%
left_join(select(table2, address, value), by = "address") %>%
mutate(
final_value = coalesce(value.x, value.y)
)
id address value.x value.y final_value
1 1 a 10 10 10
2 2 b NA 11 11
3 3 c 12 NA 12
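The same "id first, address as fallback" idea can also be written as a lookup with match() instead of two merges. This is only a sketch under the assumption that id and address are each unique in table2; idx and fallback are names made up for the illustration.

```r
table1 <- data.frame(id = c(1, 2, 3),
                     address = letters[1:3],
                     stringsAsFactors = FALSE)
table2 <- data.frame(id = c(1, NA_integer_, 3),
                     address = c(letters[1:2], NA_character_),
                     value = 10:12,
                     stringsAsFactors = FALSE)

# Row of table2 matched by id; incomparables = NA stops NA from matching NA
idx <- match(table1$id, table2$id, incomparables = NA)

# Row of table2 matched by address, used only where the id lookup failed
fallback <- match(table1$address, table2$address, incomparables = NA)
idx[is.na(idx)] <- fallback[is.na(idx)]

table1$value <- table2$value[idx]
table1
```

Rows that match on neither column would end up with NA in value.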
I have the list below (with sublists as well). The columns are unequal: list "a" has 2 columns and list "b" has 3 columns.
f <- list(a=list(1,2.5,9.5),b=list("2","-true","3",4))
I need to combine these lists into one table while keeping a reference to the source list, like below. For example:
COl1 COl2 COl3 Col4
a 1 false NA
b 2 true 3
As you can see above, col 1 holds a reference to the list each row was taken from. Please guide.
1) data.table Set names on the list giving the new list fnam and then use rbindlist from data.table:
library(data.table)
fnam <- lapply(f, function(x) setNames(x, paste0("COL", seq(2, length = length(x)))))
cbind(COL1 = names(f), rbindlist(fnam , fill = TRUE))
giving:
COL1 COL2 COL3 COL4
1: a 1 false <NA>
2: b 2 true 3
2) base R This alternative uses no packages. We create a character vector out of f and then read it in using read.table.
Lines <- paste(names(f), sapply(f, paste, collapse = " "))
nc <- max(lengths(f)) + 1
col.names <- paste0("COL", seq_len(nc))
read.table(text = Lines, header = FALSE, fill = TRUE, col.names = col.names)
giving:
COL1 COL2 COL3 COL4
1 a 1 false NA
2 b 2 true 3
Use some separator not appearing in the data if the data can contain spaces.
One option would be to set the names of the list elements using map and specify the .id as 'COL1' to create a new column based on the names of 'f'. Note that map returns a list, while map_df returns a tbl_df/data.frame.
1)
library(tidyverse)
f %>%
map_df(~ set_names(., paste0("COL", seq_along(.)+1)), .id = 'COL1')
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <dbl> <chr> <chr>
#1 a 1 false <NA>
#2 b 2 true 3
2) If the types are different, retype (from hablar) and then do
library(hablar)
f1 %>%
map_df(~ set_names(.x, paste0("COL", seq_along(.)+1)) %>%
map(retype), .id = 'COL1')
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <int> <chr> <int>
#1 a 1 false NA
#2 b 2 true 3
3) Or with type.convert
f1 %>%
map_df(~ map(.x, type.convert, as.is = TRUE) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1")
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <int> <chr> <int>
#1 a 1 false NA
#2 b 2 true 3
4) If the integer/numeric mix is giving an issue, then convert it to a common type, i.e. to numeric
f1 %>%
map_df(~ map(.x, type.convert, as.is = TRUE) %>%
map_if(is.integer, as.numeric) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1")
5) As the types are mixed up, it may be better to do the retype after converting to a single data.frame
f %>%
map_df(~ map(.x, as.character) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1") %>%
retype
data
f <- list(a = list(1, "false"), b = list(2, "true", "3"))
f1 <- list(a=list(1,"false"),b=list("2","true","3"))
How about another simple base R solution?
f <- list(a=list(1,2.5,9.5),b=list("2","-true","3",4))
m = matrix(NA,ncol=max(sapply(f,length)),nrow=length(f))
for(i in 1:nrow(m)) {
u = unlist(f[[i]])
m[i,1:length(u)] = u
}
your_data_frame = as.data.frame(m)
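A variant of the same padding idea that also keeps the list names as a first column might look like the sketch below. It assumes the smaller f from the data section of the earlier answer and converts every element to character, so types would still need converting afterwards; rows and out are names chosen for the illustration.

```r
f <- list(a = list(1, "false"), b = list(2, "true", "3"))

n <- max(lengths(f))

# Turn each sub-list into a character vector padded with NA to length n
rows <- lapply(f, function(x) {
  v <- vapply(x, as.character, character(1))
  length(v) <- n   # extending a vector pads it with NA
  v
})

out <- data.frame(COL1 = names(f), do.call(rbind, rows),
                  stringsAsFactors = FALSE)
names(out)[-1] <- paste0("COL", seq_len(n) + 1)
out
```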
I have two objects:
Dataframe 1:
Address City
xyz City1
xyy City1
xxx City2
... ...
Dataframe 2
Column 1 Column 2 City
.... ... City1
.... ... City2
I want to join the two data-frames, so that I assign a random, but unique address from dataframe one to dataframe two, given that there is a match between the cities.
Essentially, the idea is to assign a random address for a given city.
I don't believe a join would work here, as the size of the dataframes varies and I need to assign a unique address value. Perhaps I'm mistaken though.
Any ideas how I can pull this off?
The idea is to pick a random row for each City in your first dataset and then join that info back to your second dataset.
# example datasets
df1 = read.table(text = "Address City
xyz City1
xyy City1
xxx City2
zzz City2", header=T, stringsAsFactors=F)
df2 = read.table(text = "Column1 Column2 City
1 3 City1
2 4 City2", header=T, stringsAsFactors=F)
library(dplyr)
set.seed(1) # for reproducible results
df1 %>%
group_by(City) %>% # for each city
sample_n(1) %>% # pick a random row
right_join(df2, by="City") %>% # right join df2
ungroup() # forget the grouping
# # A tibble: 2 x 4
# Address City Column1 Column2
# <chr> <chr> <int> <int>
# 1 xyz City1 1 3
# 2 xxx City2 2 4
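If df2 can contain several rows for the same city and each row needs a different address, one possible sketch is to sample without replacement within each city. This assumes every city has at least as many addresses in df1 as it has rows in df2; the example data below is made up to show a repeated city.

```r
df1 <- data.frame(Address = c("xyz", "xyy", "xxx", "zzz"),
                  City = c("City1", "City1", "City2", "City2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Column1 = c(1, 2, 3),
                  City = c("City1", "City1", "City2"),
                  stringsAsFactors = FALSE)

set.seed(1)  # for reproducible results
df2$Address <- NA_character_
for (city in unique(df2$City)) {
  rows <- which(df2$City == city)
  pool <- df1$Address[df1$City == city]
  # draw length(rows) distinct positions from the pool for this city
  df2$Address[rows] <- pool[sample.int(length(pool), length(rows))]
}
df2
```

sample.int() errors if a city has fewer addresses than rows, which makes a violated assumption visible instead of silently reusing an address.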
A data.table alternative:
Scramble the entire address data once (sample(.I)), join on 'City', and select the first of the matches (mult = "first")
library(data.table)
setDT(d1)
setDT(d2)
d1[d1[ , sample(.I)]][d2, on = "City", mult = "first"]
# City Address
# 1: c1 a2
# 2: c2 a3
# 3: c3 a1
# 4: c4 a2
Data:
d1 <- data.frame(City = rep(paste0("c", 1:4), each = 4),
Address = paste0("a", 1:4))
d2 <- data.frame(City = paste0("c", 1:4))
I don't know if speed is an issue, but this should be faster on larger data.
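For readers without data.table, the same "shuffle once, keep the first match per city" idea can be sketched in base R. The names shuffled, first_per_city and res are made up for the illustration.

```r
d1 <- data.frame(City = rep(paste0("c", 1:4), each = 4),
                 Address = paste0("a", 1:4),
                 stringsAsFactors = FALSE)
d2 <- data.frame(City = paste0("c", 1:4), stringsAsFactors = FALSE)

set.seed(1)  # for reproducible results

# Scramble the address rows once
shuffled <- d1[sample(nrow(d1)), ]

# Keep the first row seen for each city after shuffling
first_per_city <- shuffled[!duplicated(shuffled$City), ]

# Attach the chosen address to each row of d2
res <- merge(d2, first_per_city, by = "City")
res
```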