How to join NA values in R

I have two tables that are joined. After the join, some of the values come out as NA.
I am trying to join again with a third data set, but only on those NA values. How do I do it?
The joined results
library(plyr)
## first table
original_value <- c('old_a', 'old_b', 'old_c', 'old_d')
key <- c('a', 'b', 'c', 'd')
data <- data.frame(key, original_value, stringsAsFactors = FALSE)
## lookup table
new_value <- c('new_a', 'new_b')
key <- c('a', 'b')
lookup <- data.frame(key, new_value, stringsAsFactors = FALSE)
## the joined data
data_lookup_joined <- join(data, lookup, by = "key")
> data_lookup_joined
key original_value new_value
1 a old_a new_a
2 b old_b new_b
3 c old_c <NA>
4 d old_d <NA>
This is the output I am trying to get:
## a third data set to join the NA values
unmatched_value <- c('unmatched_c', 'unmatched_d')
key <- c('c', 'd')
unmatched_lookup <- data.frame(key, unmatched_value, stringsAsFactors = FALSE)
key original_value new_value
1 a old_a new_a
2 b old_b new_b
3 c old_c unmatched_c
4 d old_d unmatched_d
This is what I have tried that did not work.
data_lookup_joined$new_value [is.na(data_lookup_joined$new_value)] <- join(data_lookup_joined, unmatched_lookup, by = "key")
What do I need to do?

# join only the rows whose new_value is still missing
has_na <- is.na(data_lookup_joined$new_value)
na_join <- join(data_lookup_joined[has_na, c("key", "original_value")], unmatched_lookup, by = "key")
# rename unmatched_value so the column names match
names(na_join)[3] <- "new_value"
# put it back together
final_result <- rbind(data_lookup_joined[!has_na, ], na_join)
Of course, the simpler way would be to rbind lookup and unmatched_lookup first, then you just need one join.
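As a minimal sketch of that single-join idea: assuming both lookup tables are keyed on key, the second lookup's value column can be renamed so the two tables stack cleanly. The name all_lookup is only illustrative, and base merge() stands in here for plyr's join().

```r
# Rebuild the example data from the question
data <- data.frame(key = c("a", "b", "c", "d"),
                   original_value = c("old_a", "old_b", "old_c", "old_d"),
                   stringsAsFactors = FALSE)
lookup <- data.frame(key = c("a", "b"),
                     new_value = c("new_a", "new_b"),
                     stringsAsFactors = FALSE)
unmatched_lookup <- data.frame(key = c("c", "d"),
                               unmatched_value = c("unmatched_c", "unmatched_d"),
                               stringsAsFactors = FALSE)

# Give both lookups the same column names so rbind() can stack them
names(unmatched_lookup)[names(unmatched_lookup) == "unmatched_value"] <- "new_value"
all_lookup <- rbind(lookup, unmatched_lookup)  # hypothetical name

# A single left join now fills every key
final_result <- merge(data, all_lookup, by = "key", all.x = TRUE)
final_result
```

If a key could appear in both lookups, the join would return one row per match; dropping later duplicates first, for example with all_lookup[!duplicated(all_lookup$key), ], would keep the value from the first lookup.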

Related

Is there a way to update existing variables when merging in R?

I have two datasets I want to merge on the variable id, one of which has two possible ids, for example:
df1 <- data.frame(id = c('a', 'b', 'c', 'q', 'z'),
id2 = c(NA, 'g', NA, 'd', 'e'),
var1 = 1:5,
var3 = c('hi', 'hello', 'bonjour', 'howdy', 'hi'))
df2 <- data.frame(id = c('a', 'b', 'c', 'd', 'e'),
var2 = 6:10,
var4 = 20:24)
I currently merge these datasets on the primary linking variable:
merge1 <- merge(x = df1,
y = df2,
by = 'id',
all = TRUE)
I need to re-merge those rows from the first dataframe that have the second id but did not match in the initial merge, so to do that I put them in a separate data frame, take them out of the fully matched dataset, and then merge the two:
df1.remerge <- merge1[which(!is.na(merge1$id2) &
is.na(merge1$var2)),]
df1.remerge$id <- df1.remerge$id2
merged <- merge1[which(is.na(merge1$id2) |
!is.na(merge1$var2)),]
merge2 <- merge(x = df1.remerge,
y = merged,
by = 'id',
all = TRUE,
suffixes = c('.m1', '.m2'))
# where .m1 = the remerged obs from df1 & .m2 = the original merged obs
This, though, creates two sets of the same variables (i.e. I end up with two var1s and two var2s). I can of course manually combine the variables, but I'd prefer not to, since my actual data is quite large (think millions of observations and 30-40 variables) and that seems rather inefficient.
Ultimately I want a dataset that looks roughly like this:
want.final <- data.frame(id = c('a', 'b', 'c', 'd', 'e'),
var1 = 1:5,
var2 = 6:10,
var3 = c('hi', 'hello', 'bonjour', 'howdy', 'hi'),
var4 = 20:24)
But what I get with this method is this:
get.final <- data.frame(id = c('a', 'b', 'c', 'd', 'e'),
var1.m1 = c('NA', 'NA', 'NA', 4, 5),
var1.m2 = c(1, 2, 3, 'NA', 'NA'),
var2.m1 = c('NA', 'NA', 'NA', 'NA', 'NA'),
var2.m2 = c(6, 7, 8, 9, 10),
var3.m1 = c('NA', 'NA', 'NA', 'howdy', 'hi'),
var3.m2 = c('hi', 'hello', 'bonjour', 'NA', 'NA'),
var4.m1 = c('NA', 'NA', 'NA', 'NA', 'NA'),
var4.m2 = c(20, 21, 22, 23, 24))
Does anyone know of a way to re-merge these observations and update the existing variables where they're missing in the master/x dataset and not missing in the using/y? In an ideal world I'd like something like the update option for Stata's merge that does just this.
If I understand correctly, the OP wants to find matching rows between df1$id and df2$id. For those rows in df1 where no matches are found, a second attempt should find matching rows between the alternative id df1$id2 and df2$id. Furthermore, the datasets are quite large (containing millions of rows) and the OP is restricted more or less to base R.
Base R
So, instead of doing multiple merges with datasets of millions of rows, we can resolve the duplicate id columns in df1 first before doing a single merge:
id1 <- df2$id[match(df1$id, df2$id)]
id2 <- df2$id[match(df1$id2, df2$id)]
df1$id <- ifelse(is.na(id1), id2, id1)
df1$id2 <- NULL
merge(df1, df2)
id var1 var3 var2 var4
1 a 1 hi 6 20
2 b 2 hello 7 21
3 c 3 bonjour 8 22
4 d 4 howdy 9 23
5 e 5 hi 10 24
Explanation
First, we check if df1$id is included in df2$id which returns id1 as
[1] "a" "b" "c" NA NA
Then, we check if df1$id2 is included in df2$id which returns id2 as
[1] NA NA NA "d" "e"
Now, we can coalesce id1 and id2, i.e., we pick pair-wise the first non-NA value and replace the id column in df1 which becomes
[1] "a" "b" "c" "d" "e"
The id2 column in df1 is removed as it is no longer needed.
Finally, the modified df1 and df2 are merged on the id column.
Edit: data.table approach
As the OP has pointed out that his production dataset consists of millions of observations and 30-40 variables it might be worthwhile to consider a data.table approach. data.table has the := assignment operator which allows for fast update of columns by reference.
Using data.table, the approach above can be implemented by
library(data.table)
setDT(df1)
setDT(df2)
df2[df1[, `:=`(id = fcoalesce(df2[df1, on = "id", x.id], df2[df1, on = "id==id2", x.id]),
id2 = NULL)], on = "id"]
In general, merge and dplyr::*_join will always give you the *.x/*.y variants of a shared-column; data.table is often the same, but its merge-assignment operation can help side-step it.
base R
out <- merge(merge(df1, df2, by="id", all.x=TRUE), df2,
by.x="id2", by.y="id", all.x = TRUE, suffixes = c("", ".y"))
out$id[is.na(out$var2)] <- out$id2[is.na(out$var2)]
out$var2[is.na(out$var2)] <- out$var2.y[is.na(out$var2)]
out[,c("id2","var2.y")] <- NULL
out
# id var1 var2
# 1 d 4 9
# 2 e 5 10
# 3 b 2 7
# 4 a 1 6
# 5 c 3 8
data.table
Renaming df2$var2 can be useful here for clarity and conditional reassignment.
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
setnames(DT2, "var2", "var2new")
DT1[DT2, var2 := var2new, on = .(id)
][DT2, c("id", "var2") := .(id2, fifelse(is.na(var2), var2new, var2)), on = .(id2 == id)
][, id2 := NULL]
# id var1 var2
# <char> <int> <int>
# 1: a 1 6
# 2: b 2 7
# 3: c 3 8
# 4: d 4 9
# 5: e 5 10
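Another way to get an "update"-style result without any suffixed columns is to skip merge() and look up row positions with match(). The following is only a sketch in base R, assuming df2$id is unique and that missing id2 values are real NA; the names idx, miss, used_id2 and value_cols are illustrative.

```r
df1 <- data.frame(id = c("a", "b", "c", "q", "z"),
                  id2 = c(NA, "g", NA, "d", "e"),
                  var1 = 1:5,
                  var3 = c("hi", "hello", "bonjour", "howdy", "hi"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("a", "b", "c", "d", "e"),
                  var2 = 6:10,
                  var4 = 20:24,
                  stringsAsFactors = FALSE)

# First pass: position of each df1 row in df2 via the primary id
idx <- match(df1$id, df2$id)

# Second pass: retry only the unmatched rows via the alternative id
miss <- is.na(idx)
idx[miss] <- match(df1$id2[miss], df2$id)

# Rows rescued by id2 take that id as their final id
used_id2 <- miss & !is.na(idx)
df1$id[used_id2] <- df1$id2[used_id2]
df1$id2 <- NULL

# Copy every non-key column of df2 across by position (NA where no match)
value_cols <- setdiff(names(df2), "id")
df1[value_cols] <- df2[idx, value_cols, drop = FALSE]
df1
```

Because each df2 column is copied once, there is nothing to coalesce afterwards; rows that match on neither id simply keep NA in the copied columns.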

R match date in one table if between two date columns in second table over category

Using base R only, I'm trying to test whether table1$DATE is >= table2$START and <= table2$STOP where table1$EVENT == table2$EVENT. My first idea was to take the unique categories from table1$EVENT, subset in a loop, outer-join each subset to table2, and then use a for loop to check each table1 row against the table2 rows to return the value. But for loops are slow, and my real data set has over 3 million rows and grows daily. In Python I'd probably try something with pd.interval_range or a similar approach.
This is the table of events I want to return a value to based on if DATE is between table2's START and STOP columns where EVENT matches in both table1 and table2.
[Tables shown as images in the original post: table1, table2 (lookup table), desired outcome]
Base R won't give you the most efficient approach, especially on large data, but here is one attempt:
df1$Date <- as.Date(df1$Date, '%m/%d/%Y')
df2$Start <- as.Date(df2$Start, '%m/%d/%Y')
df2$Stop <- as.Date(df2$Stop, '%m/%d/%Y')
df1$result <- sapply(seq_len(nrow(df1)), function(x) {
inds <- df2$Event == df1$Event[x] &
df1$Date[x] >= df2$Start & df1$Date[x] <= df2$Stop
if (any(inds)) df2$Return[which.max(inds)] else NA
})
df1
# Event Date result
#1 A 2000-01-01 <NA>
#2 A 2019-02-15 abc
#3 B 2000-01-01 <NA>
#4 B 2019-02-15 bar
#5 B 2019-12-12 <NA>
#6 C 2017-07-07 <NA>
data
df1 <- data.frame(Event = c('A', 'A', 'B', 'B', 'B','C'), Date = c('1/1/2000',
'2/15/2019', '1/1/2000', '2/15/2019', '12/12/2019','7/7/2017'),
stringsAsFactors = FALSE)
df2 <- data.frame(Event = c('A', 'B', 'B', 'A', 'A'),
Start = c('1/1/2019','2/1/2019', '1/1/2019','2/1/2019', '3/1/2019'),
Stop = c('1/31/2019','2/28/2019', '1/31/2019', '2/28/2019', '3/30/2019'),
Return = c('foo', 'bar', 'baz', 'abc', 'xyz'), stringsAsFactors = FALSE)
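The row-by-row sapply() can also be replaced by one join on Event followed by a vectorised range filter. This is only a sketch, assuming the dates are already Date objects and that the intervals for a given event do not overlap; row_id, cand and hits are illustrative names.

```r
df1 <- data.frame(Event = c("A", "A", "B", "B", "B", "C"),
                  Date = as.Date(c("2000-01-01", "2019-02-15", "2000-01-01",
                                   "2019-02-15", "2019-12-12", "2017-07-07")),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Event = c("A", "B", "B", "A", "A"),
                  Start = as.Date(c("2019-01-01", "2019-02-01", "2019-01-01",
                                    "2019-02-01", "2019-03-01")),
                  Stop = as.Date(c("2019-01-31", "2019-02-28", "2019-01-31",
                                   "2019-02-28", "2019-03-30")),
                  Return = c("foo", "bar", "baz", "abc", "xyz"),
                  stringsAsFactors = FALSE)

# Tag each row so results can be mapped back after the join reorders them
df1$row_id <- seq_len(nrow(df1))

# Pair every df1 row with every df2 interval of the same Event
cand <- merge(df1, df2, by = "Event")

# Keep only the pairs whose Date falls inside [Start, Stop]
hits <- cand[cand$Date >= cand$Start & cand$Date <= cand$Stop, ]

# Map the first hit per row back to df1; rows without a hit get NA
df1$result <- hits$Return[match(df1$row_id, hits$row_id)]
df1$row_id <- NULL
df1
```

The intermediate cand table has one row per (df1 row, same-event interval) pair, so memory use grows with the number of intervals per event; with millions of rows it may be worth processing one event at a time.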

In R how can you copy one column's value to another chosen column, without replacing the other columns values?

I would like to "copy and paste" the values of one column from data frame A underneath the values of a column in data frame B.
Below I've visualized what I'm trying to achieve.
One option is to use bind_rows on the selected column after converting it to the same type as the target column:
library(dplyr)
bind_rows(df2, df1[1] %>%
transmute(ColumnC = as.character(ColumnA)))
# ColumnC ColumnD
#1 a b
#2 1 <NA>
#3 2 <NA>
#4 3 <NA>
data
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = 'a', ColumnD = 'b',
stringsAsFactors = FALSE)
You can also use base R for this. In effect you want an outer join (all = TRUE) of the first column of df1 with df2:
df1 <- data.frame(1:3, 4:6)
names(df1) <- paste0("c", 1:2)
df2 <- data.frame("a", "b")
names(df2) <- paste0("c", 3:4)
# renaming column to join on
names(df2)[1] <- "c1"
merge(x = df1[,1,drop=FALSE], y = df2, by.y = c("c1"), all = TRUE)
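For comparison, the bind_rows() result can be reproduced in base R by building the extra rows explicitly and stacking them with rbind(). This is a sketch under the same sample data; extra and out are illustrative names.

```r
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = "a", ColumnD = "b", stringsAsFactors = FALSE)

# Rows to append: ColumnA's values go under ColumnC, ColumnD is left missing
extra <- data.frame(ColumnC = as.character(df1$ColumnA),
                    ColumnD = NA_character_,
                    stringsAsFactors = FALSE)

# rbind() needs identical column names, which extra now has
out <- rbind(df2, extra)
out
```

Unlike the merge() version, this keeps the rows in their original order, with the existing df2 row first.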

Combining tables in R with some value replacement [duplicate]

I have two data.frames, one with only characters and the other one with characters and values.
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
merge(df1, df2)
x y
1 a 0
2 b 1
3 c 0
I want to merge df1 and df2. The characters a, b and c merge fine and get the values 0, 1, 0, but d and e are dropped. I want d and e in the merged table as well, with y set to 0. That is, for every row of df1 that has no match in df2, a 0 should be filled in, like:
x y
1 a 0
2 b 1
3 c 0
4 d 0
5 e 0
Take a look at the help page for merge. The all parameter lets you specify different types of merges. Here we want to set all = TRUE. This will make merge return NA for the values that don't match, which we can update to 0 with is.na():
zz <- merge(df1, df2, all = TRUE)
zz[is.na(zz)] <- 0
> zz
x y
1 a 0
2 b 1
3 c 0
4 d 0
5 e 0
Updated many years later to address follow up question
You need to identify the variable names in the second data table that you aren't merging on - I use setdiff() for this. Check out the following:
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e', NA))
df2 = data.frame(x=c('a', 'b', 'c'),y1 = c(0,1,0), y2 = c(0,1,0))
#merge as before
df3 <- merge(df1, df2, all = TRUE)
#columns in df2 not in df1
unique_df2_names <- setdiff(names(df2), names(df1))
df3[unique_df2_names][is.na(df3[, unique_df2_names])] <- 0
Created on 2019-01-03 by the reprex package (v0.2.1)
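The same idea can be wrapped in a small helper. The sketch below assumes a left join (all.x = TRUE) is what is wanted, so only the rows of the first table are kept; merge_fill is a hypothetical name, not a function from any package.

```r
df1 <- data.frame(x = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c("a", "b", "c"), y1 = c(0, 1, 0), y2 = c(0, 1, 0),
                  stringsAsFactors = FALSE)

# Left-join `right` onto `left`, then fill NAs in the columns that came from `right`
merge_fill <- function(left, right, by, fill = 0) {
  out <- merge(left, right, by = by, all.x = TRUE)
  new_cols <- setdiff(names(right), names(left))
  for (col in new_cols) {
    out[[col]][is.na(out[[col]])] <- fill
  }
  out
}

res <- merge_fill(df1, df2, by = "x")
res
```

Note that this cannot tell an NA created by the join from an NA that was already present in the second table; both are replaced by fill.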
Or, as an alternative to @Chase's code, being a recent plyr fan with a background in databases:
require(plyr)
zz<-join(df1, df2, type="left")
zz[is.na(zz)] <- 0
Another alternative with data.table.
EXAMPLE DATA
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1,x)
setkey(dt2,x)
CODE
dt2[dt1,list(y=ifelse(is.na(y),0,y))]
Assuming df1 has all the values of x of interest, you could use dplyr::left_join() to merge and then either base::replace() or tidyr::replace_na() to replace the NAs with 0s:
library(tidyverse)
# dplyr only:
df_new <-
left_join(df1, df2, by = 'x') %>%
mutate(y = replace(y, is.na(y), 0))
# dplyr and tidyr:
df_new <-
left_join(df1, df2, by = 'x') %>%
mutate(y = replace_na(y, 0))
# In the sample data column `x` is a factor, which will give a warning with the join. This can be prevented by converting to a character before the join:
df_new <-
left_join(df1 %>% mutate(x = as.character(x)),
df2 %>% mutate(x = as.character(x)),
by = 'x') %>%
mutate(y = replace(y, is.na(y), 0))
I used the answer given by Chase (answered May 11 '11 at 14:21), but I added a bit of code to apply that solution to my particular problem.
I had a frame of rates (user, download) and a frame of totals (user, download) to be merged by user, and I wanted to include every rate, even if there were no corresponding total. However, there could be no missing totals, in which case the selection of rows for replacement of NA by zero would fail.
The first line of code does the merge. The next two lines change the column names in the merged frame. The if statement replaces NA by zero, but only if there are rows with NA.
# merge rates and totals, replacing absent totals by zero
graphdata <- merge(rates, totals, by = "user", all.x = TRUE)
colnames(graphdata)[colnames(graphdata)=="download.x"] = "download.rate"
colnames(graphdata)[colnames(graphdata)=="download.y"] = "download.total"
if(any(is.na(graphdata$download.total))) {
graphdata[is.na(graphdata$download.total),]$download.total <- 0
}
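The guard is only needed because the replacement selects rows of the data frame first. Indexing the column vector directly is a no-op when nothing is NA, so the if can be dropped. A sketch with made-up rates and totals data (the values are purely illustrative), using merge()'s suffixes argument instead of renaming afterwards:

```r
# Hypothetical sample data; only the column layout follows the description above
rates <- data.frame(user = c("u1", "u2", "u3"), download = c(0.5, 0.2, 0.9))
totals <- data.frame(user = c("u1", "u2"), download = c(10, 4))

# suffixes names the clashing columns download.rate and download.total
graphdata <- merge(rates, totals, by = "user", all.x = TRUE,
                   suffixes = c(".rate", ".total"))

# Vector-level replacement: works whether or not any NA is present
graphdata$download.total[is.na(graphdata$download.total)] <- 0

# Same code on a case with no missing totals: nothing to replace, no error
complete <- merge(rates[1:2, ], totals, by = "user", all.x = TRUE,
                  suffixes = c(".rate", ".total"))
complete$download.total[is.na(complete$download.total)] <- 0
```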
Here is a data.table answer. It can be restricted to selected columns by changing the definition of cols_added_df2:
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
library(data.table)
setDT(df1)
setDT(df2)
df3 <- merge(df1, df2, by = "x", all.x = TRUE)
cols_added_df2 <- setdiff(names(df2), names(df1))
df3[,
paste0(cols_added_df2) := lapply(.SD, function(col){
fifelse(is.na(col), 0, col)
}),
.SDcols = cols_added_df2
]
With {powerjoin} we can do:
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
powerjoin::power_full_join(df1, df2, fill = 0)
#> Joining, by = "x"
#> x y
#> 1 a 0
#> 2 b 1
#> 3 c 0
#> 4 d 0
#> 5 e 0
Created on 2022-04-28 by the reprex package (v2.0.1)

