I have two databases: an old one and an updated one.
Both have the same structure, with a unique ID.
If a record changes, the update contains a new record with the same ID and new data.
So after rbind(m1, m2) I have duplicated records.
I can't just remove duplicated IDs, since the data could be updated.
There's no way to tell which record is newer, other than it coming from the old file or the update file.
How can I merge the two tables so that, when a row has a duplicated ID, the one from the newer file is kept?
I know I could add a column to both and just ifelse() this, but I'm looking for something more elegant, preferably a one-liner.
It's hard to give the exact answer without sample data, but here is an approach that you can adjust to your data.
#sample data
library( data.table )
dt1 <- data.table( id = 2:3, value = c(2,4))
dt2 <- data.table( id = 1:2, value = c(2,6))
#dt1
# id value
# 1: 2 2
# 2: 3 4
#dt2
# id value
# 1: 1 2
# 2: 2 6
#rowbind...
DT <- rbindlist( list(dt1,dt2), use.names = TRUE )
# id value
# 1: 2 2
# 2: 3 4
# 3: 1 2
# 4: 2 6
#deselect duplicated ids from the bottom up
# assuming the last file in the list contains the updated values
DT[ !duplicated(id, fromLast = TRUE), ]
# id value
# 1: 3 4
# 2: 1 2
# 3: 2 6
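data.table's unique() also has a fromLast argument, so if you really want a one-liner the same idea compresses to this (a sketch, under the same assumption that the update file comes last in the list):
unique( rbindlist( list(dt1,dt2), use.names = TRUE ), by = "id", fromLast = TRUE )
#this should keep the same three rows as the duplicated() approach above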
Say you have:
old <- data.frame(id = c(1,2,3,4,5), val = c(21,22,23,24,25))
new <- data.frame(id = c(1,4), val = c(21,27))
So the record with id 4 has changed in the new dataset, and id 1 is a pure duplicate.
You can use dplyr::anti_join to find old records not in the new dataset and then just use rbind to add the new ones on.
joined <- rbind(anti_join(old,new, by = "id"),new)
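If you want to stay within dplyr and keep it to one line, bind_rows() plus distinct() gives the same result, since distinct() keeps the first occurrence of each id (a sketch, assuming the newer table is listed first):
joined <- distinct(bind_rows(new, old), id, .keep_all = TRUE)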
You could use dplyr:
df_new %>%
full_join(df_old, by="id") %>%
transmute(id = id, value = coalesce(value.x, value.y))
returns
id value
1 1 0.03432355
2 2 0.28396359
3 3 0.01121692
4 4 0.57214035
5 5 0.67337745
6 6 0.67637187
7 7 0.69178855
8 8 0.83953140
9 9 0.55350251
10 10 0.27050363
11 11 0.28181032
12 12 0.84292569
given
df_new <- structure(list(id = 1:10, value = c(0.0343235526233912, 0.283963593421504,
0.011216921498999, 0.572140350239351, 0.673377452883869, 0.676371874753386,
0.691788548836485, 0.839531400706619, 0.553502510068938, 0.270503633422777
)), class = "data.frame", row.names = c(NA, -10L))
df_old <- structure(list(id = c(1, 4, 5, 3, 7, 9, 11, 12), value = c(0.111697669373825,
0.389851713553071, 0.252179590053856, 0.91874519130215, 0.504653975600377,
0.616259852424264, 0.281810319051147, 0.842925694771111)), class = "data.frame", row.names = c(NA,
-8L))
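As an aside: if your dplyr is >= 1.0.0, rows_upsert() expresses this intent directly, updating matching ids in the old table with values from the new one and appending ids that only exist in the new one (a sketch, untested on your real data; row order may differ from the join above):
rows_upsert(df_old, df_new, by = "id")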
I have a big data set with 32 variables, and I need to work with relative values of these variables using all possible pairwise subtractions among them, e.g. var1-var2 ... var1-var32; var3-var4 ... var3-var32, and so on. I'm new to R, so I would like to do this without going fully manual on the process. I'm out of ideas other than doing it all by hand. Any help appreciated! Thanks!
Ex:
df_original
  id Var1 Var2 Var3
1  x    1    3    2
2  y    2    5    7
df_wanted
  id Var1 Var2 Var3 Var1-Var2 Var1-Var3 Var2-Var3
1  x    1    3    2        -2        -1         1
2  y    2    5    7        -3        -5        -2
You can do this with combn, which creates combinations of the columns taken 2 at a time. In combn you can apply a function to every combination, where we subtract the two columns of the dataframe and add the result as new columns.
cols <- grep('Var', names(df), value = TRUE)
new_df <- cbind(df, do.call(cbind, combn(cols, 2, function(x) {
setNames(data.frame(df[x[1]] - df[x[2]]), paste0(x, collapse = '-'))
}, simplify = FALSE)))
new_df
# id Var1 Var2 Var3 Var1-Var2 Var1-Var3 Var2-Var3
#1 x 1 3 2 -2 -1 1
#2 y 2 5 7 -3 -5 -2
data
df <- structure(list(id = c("x", "y"), Var1 = 1:2, Var2 = c(3L, 5L),
Var3 = c(2L, 7L)), class = "data.frame", row.names = c(NA, -2L))
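If the combn() call feels opaque, the same idea written as an explicit loop over the column pairs (a sketch, using the df above) may be easier to follow:
cols <- grep('Var', names(df), value = TRUE)
for (p in combn(cols, 2, simplify = FALSE)) {
  #subtract the second column of the pair from the first, name the result "VarX-VarY"
  df[[paste(p, collapse = '-')]] <- df[[p[1]]] - df[[p[2]]]
}
df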
Follow-up question to Dynamically create value labels with haven::labelled, where akrun provided a good answer using deframe.
I am using haven::labelled to set value labels of a variable. The goal is to create a fully documented dataset I can export to SPSS.
Now, say I have a df value_labels of values and their value labels. I also have a df df_data with variables to which I want to allocate value labels.
value_labels <- tibble(
value = c(seq(1:6), seq(1:3), NA),
labels = c(paste0("value", 1:6),paste0("value", 1:3), NA),
name = c(rep("var1", 6), rep("var2", 3), "var3")
)
df_data <- tibble(
id = 1:10,
var1 = floor(runif(10, 1, 7)),
var2 = floor(runif(10, 1, 4)),
var3 = rep("string", 10)
)
Manually, I would create value labels for df_data$var1 and df_data$var2 like so:
df_data$var1 <- haven::labelled(df_data$var1, labels = c(value1 = 1, value2 = 2, value3 = 3, value4 = 4, value5 = 5, value6 = 6))
df_data$var2 <- haven::labelled(df_data$var2, labels = c(value1 = 1, value2 = 2, value3 = 3))
I need a more dynamic way of assigning the correct value labels to the correct variable in a large dataset. The solution also needs to ignore character vectors, since I don't want these to have value labels. For that reason, var3 in value_labels is listed as NA.
The solution does not need to work with multiple datasets in a list.
Here is one option: we split the named 'value'/'labels' pairs by 'name' after removing the NA rows, use the names of the list to subset the columns of 'df_data', apply labelled(), and assign the result back to the same columns.
lbls2 <- na.omit(value_labels)
lstLbls <- with(lbls2, split(setNames(value, labels), name))
df_data[names(lstLbls)] <- Map(haven::labelled,
df_data[names(lstLbls)], labels = lstLbls)
df_data
# A tibble: 10 x 4
# id var1 var2 var3
# <int> <dbl+lbl> <dbl+lbl> <chr>
# 1 1 2 [value2] 2 [value2] string
# 2 2 5 [value5] 2 [value2] string
# 3 3 4 [value4] 1 [value1] string
# 4 4 1 [value1] 2 [value2] string
# 5 5 1 [value1] 1 [value1] string
# 6 6 6 [value6] 2 [value2] string
# 7 7 1 [value1] 3 [value3] string
# 8 8 1 [value1] 1 [value1] string
# 9 9 3 [value3] 3 [value3] string
#10 10 6 [value6] 1 [value1] string
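Since the stated goal is a fully documented dataset for SPSS, the labelled data can then be written out with haven, e.g.:
haven::write_sav(df_data, "df_data.sav")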
How do you merge two data tables (or data frames) in R keeping the non-NA values from each matching column? The question Merge data frames and overwrite values provides a solution if each individual column is specified explicitly (as far as I can tell, at least). But, I have over 40 common columns between the two data tables, and it is somewhat random which of the two has an NA versus a valid value. So, writing ifelse statements for 40 columns seems inefficient.
Below is a simple example, where I'd like to join (merge) the two data.tables by the id and date columns:
dt_1 <- data.table::data.table(id = "abc",
date = "2018-01-01",
a = 3,
b = NA_real_,
c = 4,
d = 6,
e = NA_real_)
setkey(dt_1, id, date)
> dt_1
id date a b c d e
1: abc 2018-01-01 3 NA 4 6 NA
dt_2 <- data.table::data.table(id = "abc",
date = "2018-01-01",
a = 3,
b = 5,
c = NA_real_,
d = 6,
e = NA_real_)
setkey(dt_2, id, date)
> dt_2
id date a b c d e
1: abc 2018-01-01 3 5 NA 6 NA
Here is my desired output:
> dt_out
id date a b c d e
1: abc 2018-01-01 3 5 4 6 NA
I've also tried the dplyr::anti_join solution from left_join two data frames and overwrite without success.
I'd probably put the data in long form and drop dupes:
k = key(dt_1)
DTList = list(dt_1, dt_2)
DTLong = rbindlist(lapply(DTList, function(x) melt(x, id=k)))
setorder(DTLong, na.last = TRUE)
unique(DTLong, by=c(k, "variable"))
id date variable value
1: abc 2018-01-01 a 3
2: abc 2018-01-01 b 5
3: abc 2018-01-01 c 4
4: abc 2018-01-01 d 6
5: abc 2018-01-01 e NA
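If you need the result back in wide form, one more dcast() step should do it (a sketch, reusing k from above):
dcast(unique(DTLong, by = c(k, "variable")), id + date ~ variable, value.var = "value")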
You can do this by using dplyr::coalesce, which will return the first non-missing value from vectors.
(EDIT: you can use dplyr::coalesce directly on the data frames also, no need to create the function below. Left it there just for completeness, as a record of the original answer.)
Credit where it's due: this code is mostly from this blog post; it builds a function that takes two data frames and does what you need (taking values from the x data frame when they are present).
coalesce_join <- function(x,
y,
by,
suffix = c(".x", ".y"),
join = dplyr::full_join, ...) {
joined <- join(x, y, by = by, suffix = suffix, ...)
# names of desired output
cols <- union(names(x), names(y))
to_coalesce <- names(joined)[!names(joined) %in% cols]
suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
# remove suffixes and deduplicate
to_coalesce <- unique(substr(
to_coalesce,
1,
nchar(to_coalesce) - nchar(suffix_used)
))
coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
joined[[paste0(.x, suffix[1])]],
joined[[paste0(.x, suffix[2])]]
))
names(coalesced) <- to_coalesce
dplyr::bind_cols(joined, coalesced)[cols]
}
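Usage on the example data would then be (a sketch; this should reproduce the dt_out shown in the question):
coalesce_join(dt_1, dt_2, by = c("id", "date"))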
We can use {powerjoin}, do a left join and deal with the conflicts using coalesce_xy() (which is pretty much dplyr::coalesce()).
library(powerjoin)
power_left_join(dt_1, dt_2, by = "id", conflict = coalesce_xy)
# id date a b c d e
# 1 abc 2018-01-01 3 5 4 6 NA
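If you wanted the older file's values to win instead, powerjoin also provides coalesce_yx(), which prefers the right-hand side on conflicts.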
I have many data frames that come in such a format:
df1 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, Group = 1:2,
FORMULA_RULE = 1:2, FORMULA_TRANSFORM = 1:2, FORMULA_UNITE = 1:2,
FORMULA_CALCULATE = 1:2, FORMULA_JOIN = 1:2), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, FORMULA_RULE = 1:2,
FORMULA_META = c(NA, NA), FORMULA_DATA = 1:2, FORMULA_JOIN = 1:2,
FORMULA_TRANSFORM = 1:2, Group = 1:2), class = "data.frame", row.names = c(NA,
-2L))
View:
df1
ID Name Gender Group FORMULA_RULE FORMULA_TRANSFORM FORMULA_UNITE FORMULA_CALCULATE FORMULA_JOIN
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
df2
ID Name Gender FORMULA_RULE FORMULA_META FORMULA_DATA FORMULA_JOIN FORMULA_TRANSFORM Group
1 1 1 1 1 NA 1 1 1 1
2 2 2 2 2 NA 2 2 2 2
I want to write code that works on all such dataframes so that all columns are kept, but among the columns starting with FORMULA_, only FORMULA_TRANSFORM is selected. Please note that the columns that do NOT start with FORMULA_ are not always the same; that is, I cannot simply write code that always selects ID, Name, Gender, Group, and FORMULA_TRANSFORM, because some data frames contain many other columns that do not start with FORMULA_ which I want to keep.
My attempt to solve this problem is this ugly code which works as expected:
library(tidyverse)
for(i in 1:length(ls(pattern = "df"))){
  get(paste0("df", i)) %>%
    select(-starts_with("FORMULA"),
           (names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T))[!names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T) %in% "FORMULA_TRANSFORM"]) %>%
    print
}
Is there a more straightforward way to do this?
With dplyr we can use select, and it's pretty straightforward using starts_with and contains.
library(dplyr)
df1 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
Let's try it with a dataframe without a "FORMULA_TRANSFORM" column:
df3 <- df1
df3$FORMULA_TRANSFORM <- NULL
df3 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
With the minus sign we remove the columns that start with "FORMULA_" while selecting the one containing "FORMULA_TRANSFORM". Instead of contains we could also use one_of() or matches() and it would still work.
Using base R, we can use grep with invert and value set to TRUE:
df1[c(grep("^FORMULA_", names(df1), invert = TRUE, value = TRUE),
"FORMULA_TRANSFORM")]
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
This creates a vector of column names where column name doesn't start with "FORMULA_" and we add "FORMULA_TRANSFORM" manually later.
The above method assumes that you always have a "FORMULA_TRANSFORM" column in your dataframe, and it will fail if there isn't one. A safer option would be:
get_selected_cols <- function(df1) {
cbind(df1[grep("^FORMULA_", names(df1), invert = TRUE)],
df1[names(df1) == "FORMULA_TRANSFORM"])
}
get_selected_cols(df1)
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
get_selected_cols(df3)
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
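With a recent tidyselect you can also cover the missing-column case without a helper function: any_of() silently ignores names that aren't present (a sketch, assuming dplyr >= 1.0):
df3 %>%
  select(-starts_with("FORMULA_"), any_of("FORMULA_TRANSFORM"))
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2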
I have two data frames:
require(tidyverse)
set.seed(42)
df1 = data_frame(x = c(4,3), y = c(0, 0), z = c(NA, 3))
df2 = data_frame(x = sample(1:4, 100, replace = T), y = sample(c(-3, 0, 3), 100, replace = T), z = c(NA, NA, rep(3, 98))) %>% mutate(Tracking = row_number())
I would like, separately for each row of df1 AND for each column of df1, to find the indices of df2 where df2 equals df1. If I tried to loop, each iteration would look like this:
L <- vector("list", nrow(df1))
for (i in 1:nrow(df1)){
  for (j in 1:ncol(df1)) {
    L[[i]][[j]] = inner_join(df1[i, j], df2)
  }
}
For example, the first element of the list is:
inner_join(df1[1,1], df2)
Joining, by = "x"
# A tibble: 26 x 4
x y z Tracking
<dbl> <dbl> <dbl> <int>
1 4. 0. NA 1
2 4. -3. NA 2
3 4. 0. 3. 4
4 4. 3. 3. 13
5 4. 0. 3. 16
6 4. -3. 3. 17
7 4. 0. 3. 21
8 4. 0. 3. 23
9 4. 0. 3. 24
10 4. 3. 3. 28
# ... with 16 more rows
However I am sure there's a more efficient way to do this. Possibly dplyr + purrr? I don't have much experience with purrr, but I have a feeling the map function can come in handy. I just don't know how to call the columns separately.
You could do something like
L <- map(names(df1),
function(.) {
out <- inner_join(x = df1[, ., drop = FALSE],
y = df2,
by = .)
split(out, out[[.]])
})
but I'm not sure if this is better or more efficient than the for loop you started with.
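One small convenience: naming the outer list by df1's columns makes lookups easier to read (the inner lists are already named by the matched values):
names(L) <- names(df1)
L[["x"]][["4"]] # df2 rows where x equals 4, df1's first x value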