Assign a value based on closest neighbour from other data frame - r

With generic data:
set.seed(456)
a <- sample(0:1,50,replace = T)
b <- rnorm(50,15,5)
df1 <- data.frame(a,b)
c <- seq(0.01,0.99,0.01)
d <- rep(NA, 99)
for (i in 1:99) {
d[i] <- 0.5*(10*c[i])^2+5
}
df2 <- data.frame(c,d)
For each df1$b we want to find the nearest df2$d, and then create a new variable df1$XYZ that takes the df2$c value associated with that nearest df2$d.
This question has guided me towards the data.table library, but I am not sure whether dplyr and group_by could also be used:
Here was my data.table attempt:
library(data.table)
dt1 <- data.table( df1 , key = "b" )
dt2 <- data.table( df2 , key = "d" )
dt1[ dt2 , list( d ) , roll = "nearest" ]

Here's one way with data.table:
require(data.table)
setDT(df1)[, XYZ := setDT(df2)[df1, c, on=c(d="b"), roll="nearest"]]
You need to get df2$c corresponding to the nearest value in df2$d for every df1$b. So we need to join as df2[df1], which results in nrow(df1) rows. That can be done with setDT(df2)[df1, c, on=c(d="b"), roll="nearest"].
It returns the result you require. All we need to do is to add this back to df1 with the name XYZ. We do that using :=.
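If you'd rather not update df1 by reference, a sketch of the same join that returns a new table instead (the i. and x. prefixes pick columns from i = df1 and x = df2; column names are from the example above):
# same rolling join, but returning a new data.table rather than updating df1
res <- setDT(df2)[df1, .(a = i.a, b = i.b, XYZ = x.c), on = c(d = "b"), roll = "nearest"]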
The thought process in constructing the rolling join is something like this (assuming df1 and df2 are both data tables):
We need to get some value(s) for each row of df1. That means i = df1 in the x[i] syntax.
df2[df1]
We need to join df2$d with df1$b. Using on= that'd be:
df2[df1, on=c(d="b")]
We need just the c column. Use j to select just that column.
df2[df1, c, on=c(d="b")]
We don't want an equi-join but a roll-to-nearest join.
df2[df1, c, on=c(d="b"), roll="nearest"]
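For comparison, a base R sketch that computes the same thing by brute force (fine at this size; it scans all of df2$d for every df1$b):
# index of the nearest d for each b, then take the matching c
idx <- sapply(df1$b, function(x) which.min(abs(df2$d - x)))
df1[, XYZ := df2$c[idx]]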
Hope this helps.

Related

data.table join based on switched string combinations

I have df1, which I would like to merge with df2 based on a common field id.
id is always of the form 21_2342_A_C (i.e. num_num_char_char). I want df2 rows to match df1 rows even when the last two fields (sep="_") of id are switched.
So, if ID in df1 is 21_2342_A_C, then I want it to match if the entry in df2 is either 21_2342_A_C or 21_2342_C_A.
Is this possible using data.table? I've developed a cumbersome way involving creating two different columns and doing two different joins, but I was hoping there'd be a more elegant solution. I'll also happily take a non data.table solution.
This also creates two additional columns, but needs only one merge:
library(data.table)
library(stringr)  # str_extract_all, str_split, str_c
library(purrr)    # map_chr
dt <- data.table(
id = c("21_2342_A_C", "21_2342_C_A", "21_2342_A_B")
)
The steps:
1. extract the number part and the character part of id
2. sort the character part
3. merge rows whose number and character parts are the same
4. remove self-merges and duplicated merges (if row i is merged to row j, then row j is also merged to row i)
dt[, row_id := seq_len(.N)]
# id1 = numeric prefix, id2 = character suffix
# (data.table::transpose called explicitly, since purrr also exports transpose)
dt[, (c("id1", "id2")) := data.table::transpose(str_extract_all(dt$id, "([0-9]{2}_[0-9]{4})|([A-Z]_[A-Z])"))]
# canonicalize id2 by sorting its letters
dt[, id2 := map_chr(str_split(id2, "_"), ~str_c(sort(.x), collapse = ""))]
# self-join on the canonical parts; keep each pair only once
res <- dt[dt, on = .(id1, id2)][row_id < i.row_id]
res[, c("row_id", "id1", "id2", "i.row_id") := NULL]
I also could not figure out how to do it without an intermediate id.
Here is my take:
library(data.table)
df1 <- data.table(V1 = "hello", id = "21_2342_A_C")
df2 <- data.table(V1 = c("world1", "world2"), id = c("21_2342_A_C", "21_2342_C_A"))
# rebuild the id with the two character fields sorted
sort_id <- function(x) {
  x <- unlist(tstrsplit(x, "_"))
  paste0(c(x[1:2], sort(x[3:4])), collapse = "_")
}
# grouped by id because sort_id is not vectorized
df1[, id2 := sort_id(id), id]
df2[, id2 := sort_id(id), id]
merge(df1, df2, "id2")

R - Most efficient way to remove all non-matched rows in a data.table rolling join (instead of 2-step procedure with semi join)

Currently I solve this with a workaround, but I would like to know whether there is a more efficient way.
See below for exemplary data:
library(data.table)
library(anytime)
library(tidyverse)
library(dplyr)
library(batchtools)
# Lookup table
Date <- c("1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31",
"1991-03-31", "1991-06-30", "1991-09-30", "1991-12-31")
period <- c(1:8)
metric_1 <- rep(c(2000, 3500, 4000, 100000), 2)
metric_2 <- rep(c(200, 350, 400, 10000), 2)
id <- 22
dt <- setDT(data.frame(Date, period, id, metric_1, metric_2))
# Fill and match table 2
Date_2 <- c("1990-08-30", "1990-02-28", "1991-07-31", "1991-09-30", "1991-10-31")
random <- c(10:14)
id_2 <- c(22,33,57,73,999)
dt_fill <- setDT(data.frame(Date_2, random, id_2))
# Convert date columns to type date
dt[ , Date := anydate(Date)]
dt_fill[ , Date_2 := anydate(Date_2)]
Now for the data wrangling. I want to get the most recent preceding data from dt (the lookup table) into dt_fill. I do this with a simple one-line rolling join:
# Rolling join
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# if not all id_2 present in id column in table 1, we get rows with NA
# I want to only retain the rows with id's that were originally in the lookup table
Then I end up with a bunch of rows filled with NAs for the newly added columns, which I would like to get rid of. I do this with a semi-join. I found older solutions quite hard to understand and settled on the batchtools::sjoin() function, which is essentially also a one-liner.
dt_final <- sjoin(dt_res, dt, by = "id")
Is there a more efficient way of getting a clean output from a rolling join than doing the rolling join first and then a semi-join with the original dataset? The two-step approach is also not very fast for very long data sets. Thanks!
Essentially, I found two approaches that are both viable solutions.
Solution 1
The first, proposed by lil_barnacle, is an elegant one-liner that reads as follows:
# Rolling join with nomatch argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]
Original approach
Setting the nomatch argument to 0 (nomatch = 0; in current data.table versions, nomatch = NULL is the equivalent spelling) is equivalent to doing the rolling join first and the semi-join afterwards.
# Rolling join without specified nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# Semi-join required
dt_final <- sjoin(dt_res, dt, by = "id")
Solution 2
Second, the solution I came up with was to 'align' both data sets before the rolling join by filtering on the join variable, like so:
# Aligning data sets by filtering accd. to joined 'variable'
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]
# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
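If you want to drop the batchtools dependency entirely, a plain data.table semi-join sketch does the same cleanup as sjoin():
# keep only rows of dt_res whose id exists in the lookup table
dt_final <- dt_res[id %in% unique(dt$id)]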

How to insert sequential rows in data.table in R (example given)?

df is a data.table and df_expected is the desired data.table. I want to add an hour column running from 0 to 23, with visits filled as 0 for the newly added hours.
df<-data.table(customer=c("x","x","x","y","y"),location_id=c(1,1,1,2,3),hour=c(2,5,7,0,4),visits=c(40,50,60,70,80))
df_expected<-data.table(customer=c("x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y"),
location_id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
hour=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
visits=c(0,0,40,0,0,50,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
This is what I tried to obtain my result, but it did not work:
df1<-df[,':='(hour=seq(0:23)),by=(customer)]
Error in `[.data.table`(df, , `:=`(hour = seq(0L:23L)), by = (customer)) :
Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact
performance too much for the fastest cases. Either change the type of the target column, or
coerce the RHS of := yourself (e.g. by using 1L instead of 1)
Here's an approach that creates the target table and then uses a join to add in the visits information. (Note that := adds or updates columns by reference and cannot add rows, which is why the attempt above cannot work.) The ifelse statement just helps us clean up the NAs from the merge. You could also leave them in and replace them with := in the new data.table.
target <- data.table(
customer = rep(unique(df$customer), each = 24),
hour = 0:23)
df_join <- df[target, on = c("customer", "hour"),
.(customer, hour, visits = ifelse(is.na(visits), 0, visits))
]
all.equal(df_expected, df_join)
Edit:
This addresses the request to include the location_id column. One way to do this is with by = location_id in the creation of the target. I've also added in some of the code from chinsoon12's answer.
target <- df[ , .("customer" = rep(unique(customer), each = 24L),
"hour" = rep(0L:23L, times = uniqueN(customer))),
by = location_id]
df_join <- df[target, on = .NATURAL,
.(customer, location_id, hour, visits = fcoalesce(visits, 0))]
all.equal(df_expected, df_join)
Another option: use CJ to generate your universe, on=.NATURAL to join on identically named columns, and fcoalesce to handle the NAs:
df[CJ(customer, hour=0L:23L, unique=TRUE), on=.NATURAL, allow.cartesian=TRUE,
.(customer=i.customer, hour=i.hour, visits=fcoalesce(visits, 0))]
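For comparison, a tidyr sketch of the same completion; nesting() keeps only the customer/location_id pairs that actually occur, and the row order may differ from df_expected:
library(tidyr)
complete(df, nesting(customer, location_id), hour = 0:23,
         fill = list(visits = 0))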
Here's a for-loop answer (hours run from 0 to 23, so loop over 0:23; fill = TRUE lets rbind cope with the location_id column missing from the zero rows):
df_final <- data.table()
for (i in 0:23) {
  if (i %in% df[, hour]) {
    a <- df[hour == i]
  } else {
    a <- data.table(customer = "x", hour = i, visits = 0)
  }
  df_final <- rbind(df_final, a, fill = TRUE)
}
df_final
You can wrap this in another for-loop to handle your multiple customers x, y, etc. (the following loop isn't very clean but gets the job done):
df_final <- data.table()
for (j in unique(df[, customer])) {
  for (i in 0:23) {
    if (i %in% df[customer == j, hour]) {
      a <- df[customer == j & hour == i]
    } else {
      a <- data.table(customer = j, hour = i, visits = 0)
    }
    df_final <- rbind(df_final, a, fill = TRUE)
  }
}
df_final
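A side note on this pattern: rbind inside a loop copies the accumulated table on every iteration, so it scales quadratically. Collecting the pieces in a list and calling rbindlist() once at the end avoids that; a sketch:
pieces <- list()
for (j in unique(df[, customer])) {
  for (i in 0:23) {
    a <- df[customer == j & hour == i]
    # no data for this customer/hour: create a zero row (location_id left NA, as above)
    if (nrow(a) == 0L) a <- data.table(customer = j, hour = i, visits = 0)
    pieces[[length(pieces) + 1L]] <- a
  }
}
df_final <- rbindlist(pieces, fill = TRUE)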

R - Append rows from a dataframe to another one without duplicates on "primary key" columns

I have two dataframes (A and B). B contains new values and A contains outdated values.
Each of these dataframes has one column representing the key and another representing the value.
I want to add the rows from B to A and then remove the rows with duplicated keys from A (i.e. update A with the new values from B). Order doesn't really matter; I think it is easier the other way round: cleaning duplicates first and then appending.
At the moment, I have done this script :
A <- bind_rows(B, A)
A <- A[!duplicated(A),]
The issue is that it doesn't remove those rows, because they are not exact duplicates (the values differ).
How could I handle this?
This is just a hunch because there's no example data provided, but I suspect a merge is a much safer approach than a row-bind:
Solution with data.table
library(data.table)
1 - Rename variables to prepare for a merge
setnames(A, old="value", new="value_A")
setnames(B, old="value", new="value_B")
2 - Merge, be sure to use the all arg
dt <- merge(A, B, by="key", all=TRUE)
3 - Use some rule for the update - for example: use value_B unless it's missing, in which case use value_A
dt[ , value := value_B]
dt[is.na(value), value := value_A]
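This rule is exactly a coalesce, so the two update lines can also be collapsed into one with data.table's fcoalesce():
dt[, value := fcoalesce(value_B, value_A)]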
Solution with Base R
names(A) <- c("key", "value_A")
names(B) <- c("key", "value_B")
df <- merge(A, B, by="key", all=TRUE)
df$value <- df$value_B
df[is.na(df$value), "value"] <- df[is.na(df$value), "value_A"]
Solution with dplyr/tidyverse
library(dplyr)
df <- full_join(A, B, by="key") %>%
mutate(value = ifelse(is.na(value_B), value_A, value_B))
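dplyr also ships coalesce(), which expresses the same rule without the ifelse():
df <- full_join(A, B, by="key") %>%
  mutate(value = coalesce(value_B, value_A))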
Example Data
set.seed(1234)
A <- data.frame(
key = sample(1:50, size=20),
value = runif(20, 1, 10))
B <- data.frame(
key = sample(1:50, size=20),
value = runif(20, 1, 10))

Keep columns in join data.table

I do not get why, in this join, I cannot retrieve the sum_item column of my DT2:
DT <- data.table(ID=c(1:4),OBS_VALUE=10:13)
DT2 <- data.table(ID=c(1:4),sum_item=c(10,11.5,12.5,18))
setkey(DT,ID)
setkey(DT2,ID)
S_toset_sum <- DT[DT2,diff := abs(OBS_VALUE-sum_item)][diff<3]
In the output I would still like to have sum_item, as I want to keep this column instead of the OBS_VALUE column.
You have to specify the columns you wish to keep, as well as the key you wish to join on. In your version, diff := ... is an update join: it adds diff to DT by reference and returns only DT's columns, so sum_item from DT2 is never carried over.
S_toset_sum <- DT[DT2, on = 'ID', .(ID, OBS_VALUE, sum_item, diff = abs(OBS_VALUE-sum_item))][diff<3]
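Alternatively, a sketch that keeps the update-join style but also copies sum_item into DT via the i. prefix before filtering:
DT[DT2, `:=`(sum_item = i.sum_item, diff = abs(OBS_VALUE - i.sum_item)), on = "ID"]
DT[diff < 3]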
