I have two data.frames, one with only characters and the other one with characters and values.
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
merge(df1, df2)
x y
1 a 0
2 b 1
3 c 0
I want to merge df1 and df2. The characters a, b and c merge fine and get 0, 1, 0, but d and e are dropped. I want d and e in the merged table as well, with y set to 0. That is, for every row of df1 that has no match in df2, a 0 must be filled in, like:
x y
1 a 0
2 b 1
3 c 0
4 d 0
5 e 0
Take a look at the help page for merge. The all parameter lets you specify different types of merges. Here we want to set all = TRUE. This will make merge return NA for the values that don't match, which we can update to 0 with is.na():
zz <- merge(df1, df2, all = TRUE)
zz[is.na(zz)] <- 0
> zz
x y
1 a 0
2 b 1
3 c 0
4 d 0
5 e 0
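If only the rows of df1 should survive, a left join may be the safer choice. The following is a minimal sketch, assuming df2 could contain keys that df1 lacks (the extra key 'z' is made up for illustration): all.x = TRUE drops such rows, whereas all = TRUE would keep them.

```r
df1 <- data.frame(x = c("a", "b", "c", "d", "e"))
# 'z' is a hypothetical extra key, added only to show the difference
df2 <- data.frame(x = c("a", "b", "c", "z"), y = c(0, 1, 0, 1))

# all.x = TRUE keeps every row of df1 and drops unmatched rows of df2
left <- merge(df1, df2, by = "x", all.x = TRUE)

# replace NA only in the value column, leaving the key column alone
left$y[is.na(left$y)] <- 0
left
```

Indexing left$y directly, rather than the whole frame, also avoids touching NA values in any other column.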
Updated many years later to address follow up question
You need to identify the variable names in the second data table that you aren't merging on - I use setdiff() for this. Check out the following:
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e', NA))
df2 = data.frame(x=c('a', 'b', 'c'),y1 = c(0,1,0), y2 = c(0,1,0))
#merge as before
df3 <- merge(df1, df2, all = TRUE)
#columns in df2 not in df1
unique_df2_names <- setdiff(names(df2), names(df1))
df3[unique_df2_names][is.na(df3[, unique_df2_names])] <- 0
Created on 2019-01-03 by the reprex package (v0.2.1)
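The setdiff() idea above can be wrapped in a small function. This is only a sketch: merge_fill() is a hypothetical helper name, and it assumes the two frames share no column names other than the join key.

```r
# merge_fill() is a hypothetical helper, not part of base R
merge_fill <- function(x, y, by, fill = 0) {
  out <- merge(x, y, by = by, all.x = TRUE)
  # columns contributed by y (assumes no name clashes other than `by`)
  new_cols <- setdiff(names(y), names(x))
  for (col in new_cols) {
    out[[col]][is.na(out[[col]])] <- fill
  }
  out
}

df1 <- data.frame(x = c("a", "b", "c", "d", "e", NA))
df2 <- data.frame(x = c("a", "b", "c"), y1 = c(0, 1, 0), y2 = c(0, 1, 0))

res <- merge_fill(df1, df2, by = "x")
res
```

The NA in df1$x is left untouched, because only the columns that came from df2 are filled.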
Or, as an alternative to @Chase's code, being a recent plyr fan with a background in databases:
require(plyr)
zz<-join(df1, df2, type="left")
zz[is.na(zz)] <- 0
Another alternative with data.table.
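EXAMPLE DATA
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1,x)
setkey(dt2,x)
CODE
# list x explicitly: current data.table versions return only the columns named in j
dt2[dt1,list(x, y=ifelse(is.na(y),0,y))]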
Assuming df1 has all the values of x of interest, you could use dplyr::left_join() to merge and then either base::replace() or tidyr::replace_na() to replace the NAs with 0s:
library(tidyverse)
# dplyr only:
df_new <-
left_join(df1, df2, by = 'x') %>%
mutate(y = replace(y, is.na(y), 0))
# dplyr and tidyr:
df_new <-
left_join(df1, df2, by = 'x') %>%
mutate(y = replace_na(y, 0))
# In R versions before 4.0.0 (where stringsAsFactors defaulted to TRUE), column `x` in the sample data is a factor, which gives a warning with the join. This can be prevented by converting to character before the join:
df_new <-
left_join(df1 %>% mutate(x = as.character(x)),
df2 %>% mutate(x = as.character(x)),
by = 'x') %>%
mutate(y = replace(y, is.na(y), 0))
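Another way to write the replacement step is dplyr::coalesce(), which returns the first non-missing value at each position. A minimal sketch, assuming dplyr is installed and using the question's sample data with character keys:

```r
library(dplyr)

df1 <- data.frame(x = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c("a", "b", "c"), y = c(0, 1, 0), stringsAsFactors = FALSE)

# coalesce(y, 0) keeps y where it is present and falls back to 0 otherwise
df_new <- left_join(df1, df2, by = "x") %>%
  mutate(y = coalesce(y, 0))
df_new
```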
I used the answer given by Chase (answered May 11 '11 at 14:21), but I added a bit of code to apply that solution to my particular problem.
I had a frame of rates (user, download) and a frame of totals (user, download) to be merged by user, and I wanted to include every rate, even if there were no corresponding total. However, there could be no missing totals, in which case the selection of rows for replacement of NA by zero would fail.
The first line of code does the merge. The next two lines change the column names in the merged frame. The if statement replaces NA by zero, but only if there are rows with NA.
# merge rates and totals, replacing absent totals by zero
graphdata <- merge(rates, totals, by=c("user"),all.x=TRUE)
colnames(graphdata)[colnames(graphdata)=="download.x"] = "download.rate"
colnames(graphdata)[colnames(graphdata)=="download.y"] = "download.total"
if(any(is.na(graphdata$download.total))) {
graphdata[is.na(graphdata$download.total),]$download.total <- 0
}
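The guard can probably be avoided by indexing the column vector instead of the rows of the frame, since assigning into an empty selection of a vector does nothing. A sketch with made-up data (the original rates and totals frames are not shown, so these values are assumptions), here with no missing totals at all:

```r
# hypothetical example data; the original frames are not shown
rates  <- data.frame(user = c("u1", "u2", "u3"), download = c(1.5, 2.0, 0.5))
totals <- data.frame(user = c("u1", "u2", "u3"), download = c(10, 20, 5))

# suffixes names the clashing columns directly, so no renaming step is needed
graphdata <- merge(rates, totals, by = "user", all.x = TRUE,
                   suffixes = c(".rate", ".total"))

# indexing the column vector is a no-op when nothing is NA
graphdata$download.total[is.na(graphdata$download.total)] <- 0
graphdata
```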
Here is a data.table answer. It can be limited to selected columns by changing the definition of cols_added_df2.
library(data.table)
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
setDT(df1)
setDT(df2)
df3 <- merge(df1, df2, by = "x", all.x = TRUE)
cols_added_df2 <- setdiff(names(df2), names(df1))
# replace NA with 0 in the columns that came from df2
df3[,
(cols_added_df2) := lapply(.SD, function(col){
fifelse(is.na(col), 0, col)
}),
.SDcols = cols_added_df2
]
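The same fill can also be written as a loop over data.table::set(), which updates one column at a time by reference. This is a sketch under the same assumptions as the answer above (data.table installed, numeric value columns):

```r
library(data.table)

df1 <- data.table(x = c("a", "b", "c", "d", "e"))
df2 <- data.table(x = c("a", "b", "c"), y = c(0, 1, 0))

df3 <- merge(df1, df2, by = "x", all.x = TRUE)
cols_added_df2 <- setdiff(names(df2), names(df1))

# set() modifies df3 in place; only the NA positions of each column are touched
for (col in cols_added_df2) {
  set(df3, i = which(is.na(df3[[col]])), j = col, value = 0)
}
df3
```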
With {powerjoin} we can do:
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
powerjoin::power_full_join(df1, df2, fill = 0)
#> Joining, by = "x"
#> x y
#> 1 a 0
#> 2 b 1
#> 3 c 0
#> 4 d 0
#> 5 e 0
Created on 2022-04-28 by the reprex package (v2.0.1)
Related
I have two datasets I want to merge on the variable id, one of which has two possible ids, for example:
df1 <- data.frame(id = c('a', 'b', 'c', 'q', 'z'),
id2 = c(NA, 'g', NA, 'd', 'e'),
var1 = 1:5,
var3 = c('hi', 'hello', 'bonjour', 'howdy', 'hi'))
df2 <- data.frame(id = c('a', 'b', 'c', 'd', 'e'),
var2 = 6:10,
var4 = 20:24)
I currently merge these datasets on the primary linking variable:
merge1 <- merge(x = df1,
y = df2,
by = 'id',
all = TRUE)
I need to re-merge those rows from the first dataframe that have the second id but did not match in the initial merge, so to do that I put them in a separate data frame, take them out of the fully matched dataset, and then merge the two:
df1.remerge <- merge1[which(!is.na(merge1$id2) &
is.na(merge1$var2)),]
df1.remerge$id <- df1.remerge$id2
merged <- merge1[which(is.na(merge1$id2) |
!is.na(merge1$var2)),]
merge2 <- merge(x = df1.remerge,
y = merged,
by = 'id',
all = TRUE,
suffixes = c('.m1', '.m2'))
# where .m1 = the remerged obs from df1 & .m2 = the original merged obs
This, though, creates two sets of the same variables (i.e. I end up with two var1s and two var2s). I can of course manually combine the variables, but I'd prefer not to, since my actual data is quite large (think millions of observations and 30-40 variables) and that seems rather inefficient.
Ultimately I want a dataset that looks roughly like this:
want.final <- data.frame(id = c('a', 'b', 'c', 'd', 'e'),
var1 = 1:5,
var2 = 6:10,
var3 = c('hi', 'hello', 'bonjour', 'howdy', 'hi'),
var4 = 20:24)
But what I get with this method is this:
get.final <- data.frame(id = c('a', 'b', 'c', 'd', 'e'),
var1.m1 = c('NA', 'NA', 'NA', 4, 5),
var1.m2 = c(1, 2, 3, 'NA', 'NA'),
var2.m1 = c('NA', 'NA', 'NA', 'NA', 'NA'),
var2.m2 = c(6, 7, 8, 9, 10),
var3.m1 = c('NA', 'NA', 'NA', 'howdy', 'hi'),
var3.m2 = c('hi', 'hello', 'bonjour', 'NA', 'NA'),
var4.m1 = c('NA', 'NA', 'NA', 'NA', 'NA'),
var4.m2 = c(20, 21, 22, 23, 24))
Does anyone know of a way to re-merge these observations and update the existing variables where they're missing in the master/x dataset and not missing in the using/y? In an ideal world I'd like something like the update option for Stata's merge that does just this.
If I understand correctly, the OP wants to find matching rows between df1$id and df2$id. For those rows in df1 where no matches are found, a second attempt should find matching rows between the alternative id df1$id2 and df2$id. Furthermore, the datasets are quite large (containing millions of rows) and the OP is restricted more or less to base R.
Base R
So, instead of doing multiple merges with datasets of millions of rows, we can resolve the duplicate id columns in df1 first before doing a single merge:
id1 <- df2$id[match(df1$id, df2$id)]
id2 <- df2$id[match(df1$id2, df2$id)]
df1$id <- ifelse(is.na(id1), id2, id1)
df1$id2 <- NULL
merge(df1, df2)
id var1 var3 var2 var4
1 a 1 hi 6 20
2 b 2 hello 7 21
3 c 3 bonjour 8 22
4 d 4 howdy 9 23
5 e 5 hi 10 24
Explanation
First, we check if df1$id is included in df2$id which returns id1 as
[1] "a" "b" "c" NA NA
Then, we check if df1$id2 is included in df2$id which returns id2 as
[1] NA NA NA "d" "e"
Now, we can coalesce id1 and id2, i.e., we pick pair-wise the first non-NA value and replace the id column in df1 which becomes
[1] "a" "b" "c" "d" "e"
The id2 column in df1 is removed as it is no longer needed.
Finally, the modified df1 and df2 are merged on the id column.
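The steps above can be traced on a reduced copy of the data. This is a sketch only: it assumes character (not factor) id columns, uses real NA values in id2, and adds a check for rows that match on neither id, which the original answer does not cover.

```r
df1 <- data.frame(id   = c("a", "b", "c", "q", "z"),
                  id2  = c(NA, "g", NA, "d", "e"),
                  var1 = 1:5,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id   = c("a", "b", "c", "d", "e"),
                  var2 = 6:10,
                  stringsAsFactors = FALSE)

id1 <- df2$id[match(df1$id,  df2$id)]    # match on the primary id
id2 <- df2$id[match(df1$id2, df2$id)]    # match on the alternative id
resolved <- ifelse(is.na(id1), id2, id1) # pairwise first non-NA value

# rows of df1 that match df2 on neither id (none in this example)
unmatched <- df1[is.na(resolved), ]

df1$id  <- resolved
df1$id2 <- NULL
merged  <- merge(df1, df2, by = "id")
merged
```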
Edit: data.table approach
As the OP has pointed out that the production dataset consists of millions of observations and 30-40 variables, it might be worthwhile to consider a data.table approach. data.table has the := assignment operator, which allows fast updating of columns by reference.
Using data.table, the approach above can be implemented by
library(data.table)
setDT(df1)
setDT(df2)
df2[df1[, `:=`(id = fcoalesce(df2[df1, on = "id", x.id], df2[df1, on = "id==id2", x.id]),
id2 = NULL)], on = "id"]
In general, merge and dplyr::*_join will always give you the *.x/*.y variants of a shared-column; data.table is often the same, but its merge-assignment operation can help side-step it.
base R
out <- merge(merge(df1, df2, by="id", all.x=TRUE), df2,
by.x="id2", by.y="id", all.x = TRUE, suffixes = c("", ".y"))
out$id[is.na(out$var2)] <- out$id2[is.na(out$var2)]
out$var2[is.na(out$var2)] <- out$var2.y[is.na(out$var2)]
out[,c("id2","var2.y")] <- NULL
out
# id var1 var2
# 1 d 4 9
# 2 e 5 10
# 3 b 2 7
# 4 a 1 6
# 5 c 3 8
data.table
Renaming df2$var2 can be useful here for clarity and conditional reassignment.
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
setnames(DT2, "var2", "var2new")
DT1[DT2, var2 := var2new, on = .(id)
][DT2, c("id", "var2") := .(id2, fifelse(is.na(var2), var2new, var2)), on = .(id2 == id)
][, id2 := NULL]
# id var1 var2
# <char> <int> <int>
# 1: a 1 6
# 2: b 2 7
# 3: c 3 8
# 4: d 4 9
# 5: e 5 10
I have a dataframe which contains three columns, and a second which contains two columns.
df1 <- data.frame(X1 = c('A', 'A', 'A', 'A', 'A', 'A', 'B'),
X2 = c('B', 'B', 'B', 'C', 'C', 'D', 'C'),
X3 = c('C', 'D', 'E', 'D', 'E', 'E', 'D'))
df2 <- data.frame(X1 = c('A', 'A'),
X2 = c('B', 'D'))
Questions:
How do I find the rows in df1 which contain all the elements of a row of df2? i.e. rows 1:3 of df1 contain both A and B (first row of df2). I am looking to remove any rows of df1 which contain both elements of the rows of df2. So in the example, I would like to remove rows 1, 2, 3, 4 and 6 of df1 as these include A and B OR A and D.
Is there a quick way to count the number of rows for each row of df2 without looping? i.e. df2 row 1 would have a count of 3 and row 2 a count of 3.
Here is a base R option using outer + intersect:
mat <- lengths(
outer(
asplit(df1, 1),
asplit(df2, 1),
Vectorize(intersect)
)
) >= ncol(df2)
and you will obtain
> subset(df1, !rowSums(mat))
X1 X2 X3
5 A C E
7 B C D
> within(df2, cnt <- colSums(mat))
X1 X2 cnt
1 A B 3
2 A D 3
asplit splits the data frames by rows
outer produces all combinations of rows from df1 and df2
intersect gives the elements common to a row of df1 and a row of df2
subset keeps the rows of df1 that match no row of df2
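The logical matrix at the heart of this approach can also be built with plain sapply/apply, which may be easier to read. A minimal sketch, assuming character columns (stringsAsFactors = FALSE is set explicitly):

```r
df1 <- data.frame(X1 = c("A", "A", "A", "A", "A", "A", "B"),
                  X2 = c("B", "B", "B", "C", "C", "D", "C"),
                  X3 = c("C", "D", "E", "D", "E", "E", "D"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X1 = c("A", "A"),
                  X2 = c("B", "D"),
                  stringsAsFactors = FALSE)

m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# mat[i, j] is TRUE when row i of df1 contains every element of row j of df2
mat <- sapply(seq_len(nrow(m2)), function(j) {
  apply(m1, 1, function(r) all(m2[j, ] %in% r))
})

counts <- colSums(mat)              # number of matching df1 rows per df2 row
kept   <- df1[rowSums(mat) == 0, ]  # df1 rows that match no row of df2
```

This still loops internally, so it is a readability choice rather than a speed-up.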
Using apply:
df1[ !apply(df1, 1, function(i) any(apply(df2, 1, function(j) all(j %in% i)))), ]
# X1 X2 X3
# 5 A C E
# 7 B C D
Use similar loops for the df2 match counts:
cbind(df2,
cnt = apply(df2, 1, function(i) sum(apply(df1, 1, function(j) all(i %in% j)))))
# X1 X2 cnt
# 1 A B 3
# 2 A D 3
You need to loop somehow. Here is one way to do it using dplyr and purrr:
1.
for(iRow in seq_len(nrow(df2))){
df1 <- df1 %>%
rowwise() %>%
filter(!all(as.character(df2[iRow,]) %in% c_across(everything())))
}
2.
df2 %>%
rowwise() %>%
mutate(n = sum(map_int(transpose(df1), ~all(c_across(everything()) %in% .x))))
Just be sure to run part 2 before part 1, because part 1 removes rows from df1. Alternatively, you can first detect which rows to remove for each row of df2; that way you can count them and remove them afterwards.
df2 <- df2 %>%
rowwise() %>%
mutate(
indices = list(which(map_lgl(transpose(df1), ~all(c_across(everything()) %in% .x))))
) %>%
ungroup() %>%
mutate(n = map_int(indices, length))
df1 <- df2[["indices"]] %>%
unlist() %>%
unique() %>%
"*"(-1) %>%
df1[.,]
df2 <- df2 %>% select(-indices)
I would like to "copy paste" one column's values from df A under df B's column values.
Below I've visualized what I'm trying to achieve
An option is to use bind_rows on the selected column after making the column types the same
library(dplyr)
bind_rows(df2, df1[1] %>%
transmute(ColumnC = as.character(ColumnA)))
# ColumnC ColumnD
#1 a b
#2 1 <NA>
#3 2 <NA>
#4 3 <NA>
data
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = 'a', ColumnD = 'b',
stringsAsFactors = FALSE)
You can also use base R for this. What you actually want is a full outer join (all = TRUE) of the first column of df1 with df2:
df1 <- data.frame(1:3, 4:6)
names(df1) <- paste0("c", 1:2)
df2 <- data.frame("a", "b")
names(df2) <- paste0("c", 3:4)
# renaming column to join on
names(df2)[1] <- "c1"
merge(x = df1[,1,drop=FALSE], y = df2, by.y = c("c1"), all = TRUE)
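If the goal is simply to append the values rather than join on them, rbind() on a frame with matching column names may be enough. A sketch using the column names from the bind_rows answer; the helper frame `extra` is an illustrative name:

```r
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = "a", ColumnD = "b", stringsAsFactors = FALSE)

# build a frame with df2's column names; ColumnD has no source, so it is NA
extra <- data.frame(ColumnC = as.character(df1$ColumnA),
                    ColumnD = NA_character_,
                    stringsAsFactors = FALSE)

res <- rbind(df2, extra)
res
```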
What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?
A common answer, such as in this thread is to do the left outer join with either dplyr::left_join or data.table::merge or data.table's dt2[dt1] keyed column bracket syntax, followed by a second step simply replacing all NA values by 0 in the joined data table. For example:
library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
merged_tables <- dt2[dt1];
merged_tables[is.na(merged_tables)] <- 0;
This approach necessarily assumes that there are no valid NA values in dt1 that need to be preserved. Yet, as you can see in the above example, the results are:
x new_col y
1: a 1 0
2: b 2 w
3: c 3 0
4: d 0 y
5: e 0 z
but the desired results are:
x new_col y
1: a 1 NA
2: b 2 w
3: c 3 NA
4: d 0 y
5: e 0 z
In such a trivial case, instead of using the data.table all elements replace syntax as above, just the NA values in new_col could be replaced:
library(dplyr);
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col));
However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.
There must be a better way? The issue would be simply resolved if the syntax of any of dplyr::left_join, data.table::merge, or data.table's bracket easily allowed the user to specify a fill value other than NA. Something like:
merged_tables <- data.table::merge(dt1, dt2, by="x", all.x=TRUE, fill=0);
data.table's dcast function allows the user to specify fill value, so I figure there must be an easier way to do this that I'm just not thinking of.
Suggestions?
EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table GitHub page to do exactly what I just mentioned, by updating the nomatch=0 syntax. It should be in the next release of data.table.
I stumbled on the same problem with dplyr and wrote a small function that solved my problem. (the solution requires tidyr and dplyr)
left_join0 <- function(x, y, fill = 0L, ...){
z <- left_join(x, y, ...)
new_cols <- setdiff(names(z), names(x))
z <- replace_na(z, setNames(as.list(rep(fill, length(new_cols))), new_cols))
z
}
Could you use column indices to refer only to the new columns, as with left_join they'll all be on the right of the resulting data.frame? Here it would be in dplyr:
dt1 <- data.frame(x = c('a', 'b', 'c', 'd', 'e'),
y = c(NA, 'w', NA, 'y', 'z'),
stringsAsFactors = FALSE)
dt2 <- data.frame(x = c('a', 'b', 'c'),
new_col = c(1,2,3),
stringsAsFactors = FALSE)
merged <- left_join(dt1, dt2)
index_new_col <- (ncol(dt1) + 1):ncol(merged)
merged[, index_new_col][is.na(merged[, index_new_col])] <- 0
> merged
x y new_col
1 a <NA> 1
2 b w 2
3 c <NA> 3
4 d y 0
5 e z 0
The cleanest way at present may simply be to seed an intermediary table with the join keys from the left table (dt1), join dt2 onto it, set the NA values to 0, and then join the intermediary table back to dt1. This can be done entirely with data.table, without relying on data.frame syntax, and the intermediary step ensures that the second join produces no unmatched (NA) rows:
library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
inter_table <- dt2[dt1[, list(x)]];
inter_table[is.na(inter_table)] <- 0;
setkey(inter_table, x);
merged <- inter_table[dt1];
> merged;
x new_col y
1: a 1 NA
2: b 2 w
3: c 3 NA
4: d 0 y
5: e 0 z
The benefit of this approach is that it doesn't depend on the new columns being added on the right, and it stays within data.table's keyed speed optimizations. Crediting the answer to @SamFirke because his solution also works and may be more useful in other contexts.
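A further option that stays inside data.table is an update join followed by setnafill() restricted to the new columns. This is a sketch, assuming a data.table version that provides setnafill() and that the joined columns are numeric; new_cols and the copy() step are illustrative choices, not part of the answers above.

```r
library(data.table)

dt1 <- data.table(x = c("a", "b", "c", "d", "e"), y = c(NA, "w", NA, "y", "z"))
dt2 <- data.table(x = c("a", "b", "c"), new_col = c(1, 2, 3))

new_cols <- setdiff(names(dt2), "x")

# copy() so that dt1 itself is not modified by reference
merged <- copy(dt1)

# update join: pull the new columns from dt2 into merged, matched on x
merged[dt2, on = "x", (new_cols) := mget(paste0("i.", new_cols))]

# fill NA only in the joined columns; valid NA values in y are preserved
setnafill(merged, fill = 0, cols = new_cols)
merged
```

Because the fill is limited to new_cols, it does not matter where the joined columns end up in the result.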