Merge dataframes on matching A, B and *closest* C? - r

I have two dataframes like so:
set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=F)
I want to merge to.merge into df (with all.x=T) such that:
df$x == to.merge$x AND
df$y == to.merge$y AND
abs(df$time - to.merge$time) <= 1; when multiple rows of to.merge satisfy this, we pick the one that minimises this distance.
How can I do this?
So my desired result is (this is just df with the corresponding value column of to.merge added for matching rows):
x y time val
1 1 1 8 NA
2 2 1 11 c
3 3 1 17 NA
4 1 2 27 NA
5 2 2 6 NA
6 3 2 27 NA
7 1 3 28 NA
8 2 3 20 NA
9 3 3 19 NA
10 1 4 2 NA
11 2 4 6 NA
12 3 4 5 NA
13 1 5 21 NA
14 2 5 12 NA
15 3 5 23 d
where to.merge was:
x y time val
1 2 1 17.0 a
2 2 1 12.0 b
3 2 1 11.6 c
4 3 5 22.5 d
5 2 4 2.0 e
Note - (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from df$time 11 for (X, Y) = (2, 1).
Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.
Finally, there may be rows in to.merge that do not match anything in df.
One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows):
df$val <- NA
for (i in 1:nrow(df)) {
  row <- df[i, ]
  idx <- which(row$x == to.merge$x &
               row$y == to.merge$y &
               abs(row$time - to.merge$time) <= 1)
  if (length(idx)) {
    j <- idx[which.min(abs(row$time - to.merge$time[idx]))]
    df$val[i] <- to.merge$val[j]
  }
}
I feel that I can somehow do a merge, like:
to.merge$closest_time_in_df <- sapply(to.merge$time,
  function (tm) {
    dts <- abs(tm - df$time)
    # difference must be at most 1
    if (min(dts) <= 1) {
      df$time[which.min(dts)]
    } else {
      NA
    }
  })
merge(df, to.merge,
      by.x=c('x', 'y', 'time'),
      by.y=c('x', 'y', 'closest_time_in_df'),
      all.x=T)
But this doesn't merge the (2, 1, 11) row, because to.merge$closest_time_in_df for (2, 1, 11.6, c) is 12, and a time of 12 in df corresponds to (x, y) = (2, 5), not (2, 1), so the merge fails.
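For reference, one hedged base-R sketch that repairs that idea (it is not the data.table approach given in the answers below, and it will likely still be slow on ~250k rows) is to search for the closest time only among df rows with the same (x, y), and then keep a single nearest match per df row:
# restrict the "closest time" search to matching (x, y) groups
to.merge$closest_time_in_df <- mapply(function(xx, yy, tm) {
  cand <- df$time[df$x == xx & df$y == yy]
  if (length(cand) == 0) return(NA)
  d <- abs(tm - cand)
  if (min(d) <= 1) cand[which.min(d)] else NA
}, to.merge$x, to.merge$y, to.merge$time)
# keep only the nearest to.merge row per (x, y, closest_time_in_df),
# so ties such as the 'b' and 'c' rows cannot both match (2, 1, 11)
to.merge$dif <- abs(to.merge$time - to.merge$closest_time_in_df)
tm2 <- to.merge[order(to.merge$dif), ]
tm2 <- tm2[!duplicated(tm2[, c("x", "y", "closest_time_in_df")]), ]
merge(df, tm2[, c("x", "y", "closest_time_in_df", "val")],
      by.x = c("x", "y", "time"),
      by.y = c("x", "y", "closest_time_in_df"),
      all.x = TRUE)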

Use data.table and roll = 'nearest', or, to limit the roll distance to 1, roll = -1 (or roll = 1) with rollends = c(TRUE, TRUE). For example:
library(data.table)
# create data.tables with the same key columns (x, y, time)
DT <- data.table(df, key = names(df))
tm <- data.table(to.merge, key = key(DT))
# use join syntax with roll = 'nearest'
tm[DT, roll='nearest']
# x y time val
# 1: 1 1 8 NA
# 2: 1 2 27 NA
# 3: 1 3 28 NA
# 4: 1 4 2 NA
# 5: 1 5 21 NA
# 6: 2 1 11 c
# 7: 2 2 6 NA
# 8: 2 3 20 NA
# 9: 2 4 6 e
# 10: 2 5 12 NA
# 11: 3 1 17 NA
# 12: 3 2 27 NA
# 13: 3 3 19 NA
# 14: 3 4 5 NA
# 15: 3 5 23 d
You can limit yourself to looking forward and back by 1 by setting roll = -1 and rollends = c(TRUE, TRUE):
new <- tm[DT, roll = -1, rollends = c(TRUE, TRUE)]
new
x y time val
1: 1 1 8 NA
2: 1 2 27 NA
3: 1 3 28 NA
4: 1 4 2 NA
5: 1 5 21 NA
6: 2 1 11 c
7: 2 2 6 NA
8: 2 3 20 NA
9: 2 4 6 NA
10: 2 5 12 NA
11: 3 1 17 NA
12: 3 2 27 NA
13: 3 3 19 NA
14: 3 4 5 NA
15: 3 5 23 d
Or you can join with roll = 1 first, then roll = -1, and combine the results (tidying up the val.1 column from the second rolling join):
new <- tm[DT, roll = 1][tm[DT, roll = -1]][
  is.na(val), val := ifelse(is.na(val.1), val, val.1)][
  , val.1 := NULL]
new
x y time val
1: 1 1 8 NA
2: 1 2 27 NA
3: 1 3 28 NA
4: 1 4 2 NA
5: 1 5 21 NA
6: 2 1 11 c
7: 2 2 6 NA
8: 2 3 20 NA
9: 2 4 6 NA
10: 2 5 12 NA
11: 3 1 17 NA
12: 3 2 27 NA
13: 3 3 19 NA
14: 3 4 5 NA
15: 3 5 23 d
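An optional sanity check (added here as an illustration, not part of the original answer): the val column of new should match the desired result from the question, in the keyed (x, y, time) order shown above.
identical(new$val, c(rep(NA_character_, 5), "c", rep(NA_character_, 8), "d"))
# should be TRUE if the join behaved as shown above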

Using merge a couple of times and aggregate once, here is how to do it.
set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = F)
#Find rows that match by x and y
res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
res$dif <- abs(res$time.x - res$time.y)
res
## x y time.x val time.y dif
## 1 2 1 17.0 a 11 6.0
## 2 2 1 12.0 b 11 1.0
## 3 2 1 11.6 c 11 0.6
## 4 2 4 2.0 e 6 4.0
## 5 3 5 22.5 d 23 0.5
#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
res1
## x y dif time.x val time.y
## 1 2 1 0.6 11.6 c 11
## 2 2 4 4.0 2.0 e 6
## 3 3 5 0.5 22.5 d 23
#Finally merge the result back into df
final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
final
## x y time val
## 1 1 1 8 <NA>
## 2 1 2 27 <NA>
## 3 1 3 28 <NA>
## 4 1 4 2 <NA>
## 5 1 5 21 <NA>
## 6 2 1 11 c
## 7 2 2 6 <NA>
## 8 2 3 20 <NA>
## 9 2 4 6 <NA>
## 10 2 5 12 <NA>
## 11 3 1 17 <NA>
## 12 3 2 27 <NA>
## 13 3 3 19 <NA>
## 14 3 4 5 <NA>
## 15 3 5 23 d

mnel's answer uses roll = "nearest" in a data.table join but does not limit to +/- 1 as requested by the OP. In addition, MichaelChirico has suggested using the on parameter.
This approach uses
roll = "nearest",
an update by reference, i.e., without copying,
setDT() to coerce a data.frame to data.table without copying (introduced 2014-02-27 with v.1.9.2 of data.table),
the on parameter, which spares us from setting a key explicitly (introduced 2015-09-19 with v1.9.6).
So, the code below
library(data.table) # version 1.11.4 used
setDT(df)[setDT(to.merge), on = .(x, y, time), roll = "nearest",
          val := replace(val, abs(x.time - i.time) > 1, NA)]
df
has updated df:
x y time val
1: 1 1 8 <NA>
2: 2 1 11 c
3: 3 1 17 <NA>
4: 1 2 27 <NA>
5: 2 2 6 <NA>
6: 3 2 27 <NA>
7: 1 3 28 <NA>
8: 2 3 20 <NA>
9: 3 3 19 <NA>
10: 1 4 2 <NA>
11: 2 4 6 <NA>
12: 3 4 5 <NA>
13: 1 5 21 <NA>
14: 2 5 12 <NA>
15: 3 5 23 d
Note that the order of rows has not been changed (in contrast to Chinmay Patil's answer).
In case df must not be changed, a new data.table can be created by
result <- setDT(to.merge)[setDT(df), on = .(x, y, time), roll = "nearest",
                          .(x, y, time, val = replace(val, abs(x.time - i.time) > 1, NA))]
result
which returns the same result as above.
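One small hedged follow-up: setDT() coerced df to a data.table by reference, so if a plain data.frame is needed afterwards, setDF() converts it back in place.
setDF(df)
class(df)
# "data.frame"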

Related

How can I identify the two smallest values (and indexes) rolling over a column in R?

I have been looking for an answer to this question for days. I think I am very close, but I cannot figure out how to do the rolling piece of it.
I have a data.table with two columns like this:
df = data.table(a = c(8,3,6,12,15,21,4,5,1,32,13), b = c(12,3,1,66,4,7,32,6,76,2,11))
I would like to obtain the smallest two values and their indexes on a rolling four day window. Rfast::nth seems to give me everything I need except I cannot vectorize it. Obviously I am doing something wrong.
I need the output to look as follows:
a b low lowIdx low2 low2Idx
1: 8 12 NA NA NA NA
2: 3 3 NA NA NA NA
3: 6 1 NA NA NA NA
4: 12 66 3 2 6 3
5: 15 4 3 1 6 2
6: 21 7 6 1 12 2
7: 4 32 4 4 12 1
8: 5 6 4 3 4 3
9: 1 76 1 4 4 2
10: 32 2 1 3 4 1
11: 13 11 1 2 5 1
I have attempted this with different forms of the following:
n <- nrow(df)
df$low[4:n] <- Rfast::nth(df[(n-3):n]$a, 1)
df$lowIdx[4:n] <- Rfast::nth(df$a, 1, index.return = TRUE)
df$low2[4:n] <- Rfast::nth(df[(n-3):n]$a, 2)
df$low2Idx[4:n] <- Rfast::nth(df$a, 2, index.return = TRUE)
I have also been trying to work with the frollapply, but to no avail.
Thank you, Pete
The runner library also helps.
library(dplyr)
library(runner)
df %>% mutate(low = runner(x = a, k = 4, f = function(x) ifelse(length(x) == 4, min(x), NA)),
              lowIdx = runner(x = a, k = 4, f = function(x) ifelse(length(x) == 4, which.min(x), NA)),
              Low2 = runner(x = a, k = 4, f = function(x) ifelse(length(x) == 4, sort(x)[2], NA)),
              Low2Idx = runner(x = a, k = 4, f = function(x) ifelse(length(x) == 4, order(x)[2], NA)))
a b low lowIdx Low2 Low2Idx
1: 8 12 NA NA NA NA
2: 3 3 NA NA NA NA
3: 6 1 NA NA NA NA
4: 12 66 3 2 6 3
5: 15 4 3 1 6 2
6: 21 7 6 1 12 2
7: 4 32 4 4 12 1
8: 5 6 4 3 5 4
9: 1 76 1 4 4 2
10: 32 2 1 3 4 1
11: 13 11 1 2 5 1
With frollapply in data.table you can do:
library(data.table)
cols <- c('low', 'lowIdx', 'low2', 'low2Idx')
n <- 4
df[, (cols) := .(frollapply(a, n, min),
                 frollapply(a, n, which.min),
                 frollapply(a, n, function(x) sort(x)[2]),
                 frollapply(a, n, function(x) order(x)[2]))]
df
# a b low lowIdx low2 low2Idx
# 1: 8 12 NA NA NA NA
# 2: 3 3 NA NA NA NA
# 3: 6 1 NA NA NA NA
# 4: 12 66 3 2 6 3
# 5: 15 4 3 1 6 2
# 6: 21 7 6 1 12 2
# 7: 4 32 4 4 12 1
# 8: 5 6 4 3 5 4
# 9: 1 76 1 4 4 2
#10: 32 2 1 3 4 1
#11: 13 11 1 2 5 1
For a simple example like yours I suggest using frollapply. For better control of the window, especially if you have gaps in a dates column, I recommend runner (see the documentation for more details, and the date-aware sketch after the output below).
df[, c('low', 'lowIdx', 'low2', 'low2Idx') := .(
  min_run(a, k = 4, na_pad = TRUE),
  runner(a, k = 4, function(x) ifelse(length(x) > 0, which.min(x), NA), na_pad = TRUE),
  runner(a, k = 4, function(x) sort(x)[2], na_pad = TRUE),
  runner(a, k = 4, function(x) order(x)[2], na_pad = TRUE)
)]
df
# a b low lowIdx low2 low2Idx
# 1: 8 12 NA NA NA NA
# 2: 3 3 NA NA NA NA
# 3: 6 1 NA NA NA NA
# 4: 12 66 3 2 6 3
# 5: 15 4 3 1 6 2
# 6: 21 7 6 1 12 2
# 7: 4 32 4 4 12 1
# 8: 5 6 4 3 5 4
# 9: 1 76 1 4 4 2
# 10: 32 2 1 3 4 1
# 11: 13 11 1 2 5 1
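For illustration, a hedged date-aware sketch of what a window on gaps looks like (the dates column and the gap on 2021-01-05 are made up for this example; runner's k and idx arguments are used as its documentation describes):
library(data.table)
library(runner)
dt <- data.table(
  dates = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03",
                    "2021-01-04", "2021-01-06", "2021-01-07")),
  a = c(8, 3, 6, 12, 15, 21)
)
# a 4-day window measured on the dates column rather than on row counts;
# na_pad = TRUE should mark windows that reach before the data as NA
dt[, low := runner(a, f = min, k = "4 days", idx = dates, na_pad = TRUE)]
dt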

Impute missing variables but not at the beginning and the end?

Consider the following working example:
library(data.table)
library(imputeTS)
DT <- data.table(
  time = c(1:10),
  var1 = c(1:5, NA, NA, 8:10),
  var2 = c(NA, NA, 1:4, NA, 6, 7, 8),
  var3 = c(1:6, rep(NA, 4))
)
time var1 var2 var3
1: 1 1 NA 1
2: 2 2 NA 2
3: 3 3 1 3
4: 4 4 2 4
5: 5 5 3 5
6: 6 NA 4 6
7: 7 NA NA NA
8: 8 8 6 NA
9: 9 9 7 NA
10: 10 10 8 NA
I want to impute the missing values at different points within the time series using the na_interpolation from the imputeTS package. However, I do not want to impute missing values at the beginning or the end of the series which can be of various length (In my application replacing those values would not make sense).
When I run the following code to impute the series, however, all the NAs get replaced:
cols_to_impute_example <- c("var1", "var2", "var3")
DT[, (cols_to_impute_example) := lapply(.SD, na_interpolation), .SDcols = cols_to_impute_example]
> DT
time var1 var2 var3
1: 1 1 1 1
2: 2 2 1 2
3: 3 3 1 3
4: 4 4 2 4
5: 5 5 3 5
6: 6 6 4 6
7: 7 7 5 6
8: 8 8 6 6
9: 9 9 7 6
10: 10 10 8 6
What I want to achieve is:
time var1 var2 var3
1: 1 1 NA 1
2: 2 2 NA 2
3: 3 3 1 3
4: 4 4 2 4
5: 5 5 3 5
6: 6 6 4 6
7: 7 7 5 NA
8: 8 8 6 NA
9: 9 9 7 NA
10: 10 10 8 NA
A dplyr implementation:
We select the middle part of the df, do the NA interpolation there, and then bind it back together.
library(imputeTS)
library(dplyr)
DT <- data_frame(
  time = c(1:10),
  var1 = c(1:5, NA, NA, 8:10),
  var2 = c(NA, NA, 1:4, NA, 6, 7, 8),
  var3 = c(1:6, rep(NA, 4))
)
na_inter_middle <- function(row_start, row_end){
  # extracts the first part of the df where no NA need to be replaced
  DT[1:row_start, ] -> start
  # middle part, interpolating NA values
  DT[(row_start + 1):(nrow(DT) - row_end), ] -> middle
  # end part
  DT[(nrow(DT) - (row_end - 1)):nrow(DT), ] -> end
  start %>%
    bind_rows(
      middle %>%
        mutate_all(na.interpolation)
    ) %>%
    bind_rows(end)
}
na_inter_middle(2,3)
# A tibble: 10 x 4
time var1 var2 var3
<int> <dbl> <dbl> <dbl>
1 1 1 NA 1
2 2 2 NA 2
3 3 3 1 3
4 4 4 2 4
5 5 5 3 5
6 6 5 4 6
7 7 5 4 6
8 8 8 6 NA
9 9 9 7 NA
10 10 10 8 NA
Maybe not so well known: you can also pass additional parameters from approx to the na_interpolation function of imputeTS.
This one could be solved with:
library(imputeTS)
DT[, (2:4) := lapply(.SD, na_interpolation, yleft = NA, yright = NA), .SDcols = 2:4]
Here with yleft and yright you specify what to do with the trailing / leading NAs.
Which leads to the desired output:
time var1 var2 var3
1: 1 1 NA 1
2: 2 2 NA 2
3: 3 3 1 3
4: 4 4 2 4
5: 5 5 3 5
6: 6 6 4 6
7: 7 7 5 NA
8: 8 8 6 NA
9: 9 9 7 NA
10: 10 10 8 NA
Basically, nearly all parameters that you find in the approx documentation can also be given to na_interpolation as additional parameters for fine-tuning.
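For example, a hedged sketch that relies on the ... pass-through described above (method is an approx argument, not an imputeTS one):
library(imputeTS)
# step-wise (last-observation) interior interpolation instead of linear,
# while yleft/yright still keep the leading/trailing NAs untouched
na_interpolation(c(NA, 1, NA, NA, 4, NA, 6), method = "constant", yleft = NA, yright = NA)
# expected: NA 1 1 1 4 4 6, assuming the pass-through behaves as described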
The zoo library offers an interpolation function that allows more customization:
library(zoo)
DT[,(2:4) := lapply(.SD, na.approx, x = time, na.rm = FALSE), .SDcols = 2:4]
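zoo's na.approx also takes a maxgap argument, so a hedged variant (starting again from the original DT) that only fills interior runs of up to 2 consecutive NAs, and still leaves the ends alone, would be:
library(zoo)
# runs of more than 2 consecutive NAs are left as NA; ends are never extrapolated
DT[, (2:4) := lapply(.SD, na.approx, x = time, maxgap = 2, na.rm = FALSE), .SDcols = 2:4]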

Rolling Join multiple columns independently to eliminate NAs

I am trying to do a rolling join in data.table that brings in multiple columns, but rolls over both entire missing rows, and individual NAs in particular columns, even when the row is present. By way of example, I have two tables, A, and B:
library(data.table)
A <- data.table(v1 = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(6,6,6,4,4,6,4,4,4,6,4,4,4),
                t = c(10,20,30,60,60,10,40,50,60,20,40,50,60),
                key = c("v1", "v2", "t"))
B <- data.table(v1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(4,4,6,6,4,4,6,6,4,4,6,6),
                t = c(10,70,20,70,10,70,20,70,10,70,20,70),
                valA = c('a','a',NA,'a',NA,'a','b','a','b','b',NA,'b'),
                valB = c(NA,'q','q','q','p','p',NA,'p',NA,'q',NA,'q'),
                key = c("v1", "v2", "t"))
B
## v1 v2 t valA valB
## 1: 1 4 10 a NA
## 2: 1 4 70 a q
## 3: 1 6 20 NA q
## 4: 1 6 70 a q
## 5: 2 4 10 NA p
## 6: 2 4 70 a p
## 7: 2 6 20 b NA
## 8: 2 6 70 a p
## 9: 3 4 10 b NA
## 10: 3 4 70 b q
## 11: 3 6 20 NA NA
## 12: 3 6 70 b q
If I do a rolling join (in this case a backwards join), it rolls over all the points when a row cannot be found in B, but still includes points when the row exists but the data to be merged are NA:
B[A, , roll=-Inf]
## v1 v2 t valA valB
## 1: 1 4 60 a q
## 2: 1 4 60 a q
## 3: 1 6 10 NA q
## 4: 1 6 20 NA q
## 5: 1 6 30 a q
## 6: 2 4 40 a p
## 7: 2 4 50 a p
## 8: 2 4 60 a p
## 9: 2 6 10 b NA
## 10: 3 4 40 b q
## 11: 3 4 50 b q
## 12: 3 4 60 b q
## 13: 3 6 20 NA NA
I would like to do the rolling join in such a way that it rolls over these NAs as well. For a single column, I can subset B to remove the NAs, then roll with A:
C <- B[!is.na(valA), .(v1, v2, t, valA)][A, roll=-Inf]
C
## v1 v2 t valA
## 1: 1 4 60 a
## 2: 1 4 60 a
## 3: 1 6 10 a
## 4: 1 6 20 a
## 5: 1 6 30 a
## 6: 2 4 40 a
## 7: 2 4 50 a
## 8: 2 4 60 a
## 9: 2 6 10 b
## 10: 3 4 40 b
## 11: 3 4 50 b
## 12: 3 4 60 b
## 13: 3 6 20 b
But for multiple columns, I have to do this sequentially, storing the value for each added column and then repeating:
B[!is.na(valB), .(v1, v2, t, valB)][C, roll=-Inf]
## v1 v2 t valB valA
## 1: 1 4 60 q a
## 2: 1 4 60 q a
## 3: 1 6 10 q a
## 4: 1 6 20 q a
## 5: 1 6 30 q a
## 6: 2 4 40 p a
## 7: 2 4 50 p a
## 8: 2 4 60 p a
## 9: 2 6 10 p b
## 10: 3 4 40 q b
## 11: 3 4 50 q b
## 12: 3 4 60 q b
## 13: 3 6 20 q b
The end result above is the desired output, but for multiple columns it quickly becomes unwieldy. Is there a better solution?
Joins are about matching up rows. If you want to match rows multiple ways, you'll need multiple joins.
I'd use a loop, but add columns to A (rather than creating new tables C, D, ... following each join):
k = key(A)
bcols = setdiff(names(B), k)
for (col in bcols) A[, (col) :=
  B[!.(as(NA, typeof(B[[col]]))), on = col][.SD, roll = -Inf, ..col]
][]
A
v1 v2 t valA valB
1: 1 4 60 a q
2: 1 4 60 a q
3: 1 6 10 a q
4: 1 6 20 a q
5: 1 6 30 a q
6: 2 4 40 a p
7: 2 4 50 a p
8: 2 4 60 a p
9: 2 6 10 b p
10: 3 4 40 b q
11: 3 4 50 b q
12: 3 4 60 b q
13: 3 6 20 b q
B[!.(NA_character_), on="valA"] is an anti-join that drops rows with NAs in valA. The code above attempts to generalize this (since the NA needs to match the type of the column).
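For intuition, a small illustration of that anti-join building block (using B from above; this is only a demonstration, not part of the answer's loop):
B[!.(NA_character_), on = "valA"]   # rows of B where valA is not NA
B[is.na(valA)]                      # the rows the anti-join drops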

Fill missing values with new data R-Python

I have two dataset x and y
> x
a index b
1 1 1 5
2 NA 2 6
3 2 3 NA
4 NA 4 9
> y
index a
1 2 100
2 4 101
>
I would like to fill the missing values of x with the values contained in y.
I have tried to use the merge function but the result is not what I want.
> merge(x,y, by = 'index', all=T)
index a.x b a.y
1 1 1 5 NA
2 2 NA 6 100
3 3 2 7 NA
4 4 NA 9 101
In the real problem there are additional limitations:
1 - y does not fill all the missing values
2 - x and y have more variables in common (not only a and index)
EDIT : More realistic example
> x
a index b c
1 1 1 5 NA
2 NA 2 6 NA
3 2 3 NA 5
4 NA 4 9 NA
5 NA 5 10 6
> y
index a c
1 2 100 4
2 4 101 NA
>
The solution would be accepted both in python or R
I used your merge idea and did the following using dplyr. I am sure there will be better ways of doing this task.
index <- 1:5
a <- c(1, NA, 2, NA, NA)
b <- c(5,6,NA,9,10)
c <- c(NA,NA,5,NA,6)
ana <- data.frame(index, a,b,c, stringsAsFactors=F)
index <- c(2,4)
a <- c(100, 101)
c <- c(4, NA)
bob <- data.frame(index, a,c, stringsAsFactors=F)
> ana
index a b c
1 1 1 5 NA
2 2 NA 6 NA
3 3 2 NA 5
4 4 NA 9 NA
5 5 NA 10 6
> bob
index a c
1 2 100 4
2 4 101 NA
library(dplyr)
ana %>%
  merge(., bob, by = "index", all = TRUE) %>%
  mutate(a.x = ifelse(a.x %in% NA, a.y, a.x)) %>%
  mutate(c.x = ifelse(c.x %in% NA, c.y, c.x))
index a.x b c.x a.y c.y
1 1 1 5 NA NA NA
2 2 100 6 4 100 4
3 3 2 NA 5 NA NA
4 4 101 9 NA 101 NA
5 5 NA 10 6 NA NA
I overwrote a.x (ana$a) using a.y (bob$a) with mutate. I did a similar thing for c.x (ana$c). If you remove a.y and c.y in the end, that will be the outcome you expect, I think (see the snippet below).
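A hedged way to do that cleanup in the same pipeline (select and rename are standard dplyr; the steps above are unchanged):
ana %>%
  merge(., bob, by = "index", all = TRUE) %>%
  mutate(a.x = ifelse(a.x %in% NA, a.y, a.x)) %>%
  mutate(c.x = ifelse(c.x %in% NA, c.y, c.x)) %>%
  select(-a.y, -c.y) %>%            # drop the helper columns from bob
  rename(a = a.x, c = c.x)          # restore the original column names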
Try:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
m1$b = x$b[match(m1$index, x$index)]
m1$c = x$c[match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
or, if there are many other columns like b and c:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
for(nn in names(x)[3:4]) m1[,nn] = x[,nn][match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
If there are multiple columns to replace, you could try converting from wide to long form as shown in the first two methods and replace them in one step:
m1 <- merge(x,y, by="index", all=TRUE)
m1L <- reshape(m1, idvar="index", varying=grep("\\.", colnames(m1)), direction="long", sep=".")
row.names(m1L) <- 1:nrow(m1L)
lst1 <- split(m1L, m1L$time)
indx <- is.na(lst1[[1]][,4:5])
lst1[[1]][,4:5][indx] <- lst1[[2]][,4:5][indx]
res <- lst1[[1]][,c(4,1,2,5)]
res
# a index b c
#1 1 1 5 NA
#2 100 2 6 4
#3 2 3 NA 5
#4 101 4 9 NA
#5 NA 5 10 6
Or you could use dplyr with tidyr
library(dplyr)
library(tidyr)
z <- left_join(x, y, by = "index") %>%
  gather(Var, Val, matches("\\.")) %>%
  separate(Var, c("Var1", "Var2"))
indx1 <- which(is.na(z$Val) & z$Var2=="x")
z$Val[indx1] <- z$Val[indx1+nrow(z)/2]
z %>%
  spread(Var1, Val) %>%
  filter(Var2 == "x") %>%
  select(-Var2)
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
Or split the columns by matching names before the . and use sapply to replace the NAs:
indx <- grep("\\.", colnames(m1), value = TRUE)
res <- cbind(m1[!names(m1) %in% indx],
             sapply(split(indx, gsub("\\..*", "", indx)), function(x) {
               x1 <- m1[x]
               indx1 <- is.na(x1[, 1])
               x1[, 1][indx1] <- x1[, 2][indx1]
               x1[, 1]
             }))
res
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6

Use data.table in R to add multiple columns to a data.table with := with only one function call

This is a direct expansion of this Question.
I have a dataset and I want to find all pairwise combinations of Variable v depending on Variables x and y:
library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,1,6), v=1:18)
x y v
1: a 1 1
2: a 1 2
3: a 6 3
4: a 1 4
5: a 1 5
6: a 6 6
7: b 1 7
8: b 1 8
9: b 6 9
10: b 1 10
11: b 1 11
12: b 6 12
13: c 1 13
14: c 1 14
15: c 6 15
16: c 1 16
17: c 1 17
18: c 6 18
DT[, list(new1 = t(combn(sort(v), m = 2))[, 1],
          new2 = t(combn(sort(v), m = 2))[, 2]),
   by = list(x, y)]
x y new1 new2
1: a 1 1 2
2: a 1 1 4
3: a 1 1 5
4: a 1 2 4
5: a 1 2 5
6: a 1 4 5
7: a 6 3 6
8: b 1 7 8
9: b 1 7 10
10: b 1 7 11
11: b 1 8 10
12: b 1 8 11
13: b 1 10 11
14: b 6 9 12
15: c 1 13 14
16: c 1 13 16
17: c 1 13 17
18: c 1 14 16
19: c 1 14 17
20: c 1 16 17
21: c 6 15 18
The code does what I want, but calling combn twice makes it slow for larger datasets. My dataset has more than 3 million rows and more than 1.3 million combinations of x and y.
Any suggestions on how to do this faster?
I would prefer something like:
DT[, list(c("new1", "new2") = t(combn(sort(v), m = 2))), by = list(x, y)]
This should work:
DT[, {
  tmp <- combn(sort(v), m = 2)
  list(new1 = tmp[1, ], new2 = tmp[2, ])
}, by = list(x, y)]
The following also works. The trick is to convert the matrix into a data.table.
DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
If necessary, just rename the columns afterwards:
r2 <- DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
setnames(r2, c("V1", "V2"), c("new1", "new2"))
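If you want to compare the two approaches on your own data, a rough hedged timing sketch (assuming the microbenchmark package is available):
library(microbenchmark)
microbenchmark(
  braces = DT[, {
    tmp <- combn(sort(v), m = 2)
    list(new1 = tmp[1, ], new2 = tmp[2, ])
  }, by = list(x, y)],
  as_dt = DT[, data.table(t(combn(sort(v), m = 2))), by = list(x, y)],
  times = 100
)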
