How to update both data.tables in a join

How to update both data.tables in a join - r

Suppose I would like to track which rows from one data.table were merged to another data.table. is there a way to do this at once/while merging? Please see my example below and the way I usually do it. However, this seems rather inefficient.
Example
library(data.table)
# initial data
DT = data.table(x = c(1,1,1,2,2,1,1,2,2),
y = c(1,3,6))
# data to merge
DTx <- data.table(x = 1:3,
y = 1,
k = "X")
# regular update join
copy(DT)[DTx,
on = .(x, y),
k := i.k][]
#> x y k
#> 1: 1 1 X
#> 2: 1 3 <NA>
#> 3: 1 6 <NA>
#> 4: 2 1 X
#> 5: 2 3 <NA>
#> 6: 1 6 <NA>
#> 7: 1 1 X
#> 8: 2 3 <NA>
#> 9: 2 6 <NA>
# DTx remains the same
DTx
#> x y k
#> 1: 1 1 X
#> 2: 2 1 X
#> 3: 3 1 X
What I usually do:
# set an Id variable
DTx[, Id := .I]
# assign the Id in merge
DT[DTx,
on = .(x, y),
`:=`(k = i.k,
matched_id = i.Id)][]
#> x y k matched_id
#> 1: 1 1 X 1
#> 2: 1 3 <NA> NA
#> 3: 1 6 <NA> NA
#> 4: 2 1 X 2
#> 5: 2 3 <NA> NA
#> 6: 1 6 <NA> NA
#> 7: 1 1 X 1
#> 8: 2 3 <NA> NA
#> 9: 2 6 <NA> NA
# use matched_id to find merged rows
DTx[, matched := fifelse(Id %in% DT$matched_id, TRUE, FALSE)]
DTx
#> x y k Id matched
#> 1: 1 1 X 1 TRUE
#> 2: 2 1 X 2 TRUE
#> 3: 3 1 X 3 FALSE

Following Jan's comment:
This will provide you indices of matching rows but you will have to call merge again to perform actual merging, unless you manually use provided indices to match/update those tables.
You can pull the indices:
merge_metaDT = DT[DTx, on=.(x, y), .(irow = .GRP, xrow = .I), by=.EACHI]
x y irow xrow
1: 1 1 1 1
2: 1 1 1 7
3: 2 1 2 4
4: 3 1 3 0
Then apply edits to each table using indices rather than merging or matching a second time:
rowDT = merge_metaDT[xrow != 0L]
DT[rowDT$xrow, k := DTx[rowDT$irow, k]]
DTx[, matched := FALSE][rowDT$irow, matched := TRUE]
How it works:
When joining, x[i], the symbol .I indexes rows of x
When grouping in a join with by=.EACHI, .GRP indexes each group, which means each row of i here
We drop the non-matching values of .I which are coded as zeros
On this last point, we might expect NAs instead of zeros, as returned by DT[DTx, on=.(x, y), which=TRUE]. I'm not sure why these differ.
Suppose I would like to track which rows from one data.table were merged to another data.table. is there a way to do this at once/while merging? [...] seems rather inefficient.
I expect this is more efficient than multiple merges or %in% when the merge is costly enough.
It still requires multiple steps. I doubt there's any way around that, since it would be hard to come up with logic and syntax for the update that is easy to follow.
Update logic is already complex in base R, with multiple edits on a single index allowed:
> x = c(1, 2, 3)
> x[c(1, 1)] = c(4, 5)
> x
[1] 5 2 3
And there is the question of how to match and edit multiple indices at once:
> x = c(1, 1, 3)
> x[match(c(1, 3), x)] = c(4, 5)
> x
[1] 4 1 5
In data.table updates, the latter issue is handled with mult=. In the update-two-tables use case, these questions would get much more complicated.

Related

Add new blank rows into dataset by group (in R)

I use R. I have dataframe like this:
dat <- data.frame(
group = c(1,1,1,1,1,1,2,2,2,2,2),
horizon = c(1,3,5,6,7,10,1,3,5,9,10),
value = c(1.0,0.9,0.8,0.6,0.3,0.0,0.5,0.6,0.8,0.9,0.8)
other = c(a,a,a,a,a,a,b,b,b,b,b)
)
And i would like to add row for every horizon that is missing (2,4,8 and 9 for the first group and 2,4,6,7,8 for the second group). Values (value) for the missing horizons would be blank.
I would like to get something like this:
datx <- data.frame(
group = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
horizon = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10),
value = c(1.0,"na",0.9,"na",0.8,0.6,0.3,"na","na",0.0,0.5,"na",0.6,"na",0.8,"na","na","na",0.9,0.8)
other = c(a,a,a,a,a,a,a,a,a,a,b,b,b,b,b,b,b,b,b,b)
)
i.e. englarged dataset with new horizons, blank or "na" spaces in "value" variable and retained "other" variable.
This is just an example. I am actually working with a much larger dataset.
Without the groups, the problem would be much easier to solve, i would use something like this:
newdat <- merge(data.frame(horizon=seq(1,10,1)),dat,all=TRUE)
newdat <- newdat[order(newdat$horizon),]
Thanks for help!

I'll assume that the values in the variable other are the characters, a or b, and that this is completely redundant with your variable group. If this is the case, you could accomplish this with full_join in the dplyrpackage.
a="a"
b="b"
dat <- data.frame(
group = c(1,1,1,1,1,1,2,2,2,2,2),
horizon = c(1,3,5,6,7,10,1,3,5,9,10),
value = c(1.0,0.9,0.8,0.6,0.3,0.0,0.5,0.6,0.8,0.9,0.8),
other = c(a,a,a,a,a,a,b,b,b,b,b)
)
groups <- expand.grid(group=c(1,2),horizon=1:10)
groups <- groups %>% dplyr::mutate(other=ifelse(group==1,"a","b"))
dat %>%
dplyr::full_join(groups,by=c('group','horizon','other')) %>%
dplyr::arrange(group,horizon)

Using data.table:
library(data.table)
setDT(dat)
fill = c("other")
RES =
dat[CJ(group = group, horizon = min(horizon):max(horizon), unique = TRUE),
on = .(group, horizon)
][, (fill) := lapply(.SD, \(x) x[which.min(is.na(x))]), by = group, .SDcols = fill]
RES[]
# group horizon value other
# <num> <int> <num> <char>
# 1: 1 1 1.0 a
# 2: 1 2 NA a
# 3: 1 3 0.9 a
# 4: 1 4 NA a
# 5: 1 5 0.8 a
# 6: 1 6 0.6 a
# 7: 1 7 0.3 a
# 8: 1 8 NA a
# 9: 1 9 NA a
# 10: 1 10 0.0 a
# 11: 2 1 0.5 b
# 12: 2 2 NA b
# 13: 2 3 0.6 b
# 14: 2 4 NA b
# 15: 2 5 0.8 b
# 16: 2 6 NA b
# 17: 2 7 NA b
# 18: 2 8 NA b
# 19: 2 9 0.9 b
# 20: 2 10 0.8 b
# group horizon value other

Determine dynamic window width: efficient rolling count of values which satisfy a condition

I have a data.frame with two columns a and b, where a is sorted. I want to get the rolling average of b, where the window is the range a - 5 to a (i.e. from the current value of a to wherever a - 5 is).
Performing the rolling average with different window widths is trivial using data.table::frollmean() (adaptive = TRUE; "each single observation has own corresponding rolling window width"), so the only problem is calculating those window widths.
So, given the following data.frame, how can I determine the window size for each mean?
set.seed(42)
x <- data.frame(
a = sort(runif(10, 0, 10)),
b = 1:10
)
x
#> a b
#> 1 1.346666 1
#> 2 2.861395 2
#> 3 5.190959 3
#> 4 6.417455 4
#> 5 6.569923 5
#> 6 7.050648 6
#> 7 7.365883 7
#> 8 8.304476 8
#> 9 9.148060 9
#> 10 9.370754 10
Created on 2020-07-03 by the reprex package (v0.3.0)
If I were to put the window size as a new column n, I'd expect the result to be
#> a b n
#> 1 1.346666 1 1
#> 2 2.861395 2 2
#> 3 5.190959 3 3
#> 4 6.417455 4 3
#> 5 6.569923 5 4
#> 6 7.050648 6 5
#> 7 7.365883 7 6
#> 8 8.304476 8 6
#> 9 9.148060 9 7
#> 10 9.370754 10 8
So, for example, there are two values between a[2] = 2.86 and 2.86 - 5 (including itself), and there are six values between a[8] = 8.30 and 8.30 - 5.
I've managed to do this using outer:
suppressPackageStartupMessages({
library(magrittr)
library(data.table)
})
f <- function(x, y) {
return(y %between% list(x - 5, x))
}
outer(x$a, x$a, f) %>% rowSums()
#> [1] 1 2 3 3 4 5 6 6 7 8
However, my real case has easily 5000 rows, and this method gets quite slow (takes around 10 seconds). One problem I see is that it compares every value of a to every other value of a, so has to perform some 25,000,000 comparisons. However, I know a is sorted, so if we find a stretch of TRUE results in the comparison and then a FALSE, we know all subsequent results for the current value of a will also be FALSE (that would mean we were in the allowable range and then moved past the highest allowable value of a, so everything else will also be rejected).
So, is there a better, faster way of doing this?

Because it seems that you will load data.table anyway (for frollmean), you may coerce your data.frame to data.table, and add the new column by reference.
findInterval is used to find the index of each subtracted value among the original values. This index is then subtracted from the original index, obtained by .I or seq_along, to get the window size.
setDT(x)
x[ , n := .I - findInterval(a - 5, a)]
# x
# a b n
# 1: 1.346666 1 1
# 2: 2.861395 2 2
# 3: 5.190959 3 3
# 4: 6.417455 4 3
# 5: 6.569923 5 4
# 6: 7.050648 6 5
# 7: 7.365883 7 6
# 8: 8.304476 8 6
# 9: 9.148060 9 7
# 10: 9.370754 10 8
Similar with base:
x$n = seq_along(x$a) - findInterval(x$a - 5, x$a)

Here is an alternative approach which aggregates in a non-equi self join:
library(data.table)
setDT(x)[, low := a - 5][
, n := x[x, on = .(a >= low , a <= a), by = .EACHI, .N]$N][
, low := NULL][]
a b n
1: 1.346666 1 1
2: 2.861395 2 2
3: 5.190959 3 3
4: 6.417455 4 3
5: 6.569923 5 4
6: 7.050648 6 5
7: 7.365883 7 6
8: 8.304476 8 6
9: 9.148060 9 7
10: 9.370754 10 8
But the OP is aiming at computing a rolling mean with a variable window size.
So, why stop here and call frollmean() when we can have it in one go?:
library(data.table)
setDT(x)[, low := a - 5][
, roll.mean := x[x, on = .(a >= low , a <= a), by = .EACHI, mean(b)]$V1][
, low := NULL][]
a b roll.mean
1: 1.346666 1 1.0
2: 2.861395 2 1.5
3: 5.190959 3 2.0
4: 6.417455 4 3.0
5: 6.569923 5 3.5
6: 7.050648 6 4.0
7: 7.365883 7 4.5
8: 8.304476 8 5.5
9: 9.148060 9 6.0
10: 9.370754 10 6.5
Benchmark
As the OP is concerned about the performance for his production use case here is a benchmark which varies the number of rows as well as the size of the window:
library(bench)
library(ggplot2)
bm <- press(
n = 10^(c(2, 3, 4)),
window_size = c(5, 15, 50),
{
set.seed(42)
x0 <- data.table(
a = sort(runif(n, 0, n)),
b = seq(n)
)
mark(
findInterval = {
x <- copy(x0)
x[, roll.mean := frollmean(b, .I - findInterval(a - window_size, a), adaptive = TRUE)]
},
non_equi_join = {
x <- copy(x0)
x[, low := a - window_size][
, roll.mean := x[x, on = .(a >= low , a <= a), by = .EACHI, mean(b)]$V1][
, low := NULL]
}
)
}
)
autoplot(bm)
Apparently,
the combination of Henrik's findInterval() approach with the adaptive frollmean() always is more than a magnitude faster than the non-equi join approach
window size seems to have no impact on performance.

data.table frollmean very slow

I'm trying to calculate the rolling mean of a column in a large data.table (~30M rows) aggregated by two other columns.
The rolling mean should include only the preceding N row values, not the row value itself.
For this purpose, i had to define my own rolling mean function based on frollmean function. (N=3)
Applying the function to the column is really really slow, rendering it rather useless.
Here is sample data:
require(data.table)
DT <- data.table(ID=c('A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C')
, value_type =c('type 1', 'type 1','type 2','type 1','type 2','type 2','type 1','type 1','type 2','type 1','type 1','type 1')
, value=c(1,4,7,2,3,5,1,6,8,2,2,3))
DT
ID value_type value
1: A type 1 1
2: A type 1 4
3: A type 2 7
4: A type 1 2
5: A type 2 3
6: A type 2 5
7: B type 1 1
8: B type 1 6
9: B type 2 8
10: C type 1 2
11: C type 1 2
12: C type 1 3
#this is the customised rolling function
lrollmean<-function(x){
head(frollmean(c(NA,NA,NA,x), n = 3, fill = NA, algo ="exact", align="right", na.rm = TRUE)[-(1:2)], -1)
}
> DT[, roll_mean := lrollmean(value), by=.(ID, value_type)]
> DT
ID value_type value roll_mean
1: A type 1 1 NaN
2: A type 1 4 1.0
3: A type 2 7 NaN
4: A type 1 2 2.5
5: A type 2 3 7.0
6: A type 2 5 5.0
7: B type 1 1 NaN
8: B type 1 6 1.0
9: B type 2 8 NaN
10: C type 1 2 NaN
11: C type 1 2 2.0
12: C type 1 3 2.0
This operation takes more than 30 minutes! I've got a reasonable machine which ample RAM, and I feel the long time of the operation has something to do with my code rather than the machine.

Can you try and see if its fast enough:
n <- 3L
DT[, roll_mean := {
v <- if (.N - n >= 1L) c(seq.int(n), rep(n, .N-n)) else seq.int(min(n, .N))
shift(frollmean(value, v, adaptive=TRUE))
}, .(ID, value_type)]
But if you have a large number of small groups, you can try:
setorder(DT[, rn := .I], ID, value_type)
rid <- DT[, rowid(ID, value_type)]
DT[, roll_mean := shift(frollmean(value, n))]
ix <- DT[rid==3L, which=TRUE]
set(DT, ix, "roll_mean", DT[, shift(frollmean(value, n - 1L))][ix])
ix <- DT[rid==2L, which=TRUE]
set(DT, ix, "roll_mean", DT[, shift(value)][ix])
DT[rid==1L, roll_mean := NA_real_]
setorder(DT, rn)[]

You can try frollapply and since frollmean doesn't completely suit your needs. You can also optimize the function you apply to the window, since you don't need a very complicated operation. I've tried a few modifications to your function that should cut your time down by around 50%.
library(data.table)
library(stringi)
N=1e6
set.seed(123)
DT <- data.table(ID=stri_rand_strings(N,3),
value=rnorm(N,5,5))
head(DT)
#> ID value
#> 1: HmP 12.2667538
#> 2: sw2 -2.2397053
#> 3: WtY 7.0911933
#> 4: SxS 0.4029431
#> 5: gZ6 8.6800795
#> 6: tF2 0.8228594
DT[,.(.N),by=ID][order(N)]
#> ID N
#> 1: HoR 1
#> 2: eNM 1
#> 3: I9h 1
#> 4: xjb 1
#> 5: eFH 1
#> ---
#> 234823: 34Y 15
#> 234824: Xcm 15
#> 234825: IOu 15
#> 234826: tob 16
#> 234827: f70 16
# Your function
lrollmean<-function(x){
head(frollmean(c(NA,NA,NA,x), n = 3, fill = NA, algo ="exact", align="right", na.rm = TRUE)[-(1:2)], -1)
}
#Possible modifications:
lrollmean1<-function(x,n){
frollapply(c(rep(NA,n),x),n+1,weighted.mean,c(rep(1,n),0),na.rm=T)[-(1:3)]
}
lrollmean2<-function(x,n){
frollapply(c(rep(NA,n),x),n+1,function(x) sum(x*c(rep(1,n),0)/n,na.rm = T))[-(1:3)]
}
lrollmean3<-function(x){ # More optimized assuming n=3
frollapply(c(NA,NA,NA,x),4,function(x) sum(x[1:3]/3,na.rm = T))[-(1:3)]
}
library(rbenchmark)
benchmark(original={DT[, roll_mean := lrollmean1(value,3), by=.(ID)]},
a={DT[, roll_mean := lrollmean1(value,3), by=.(ID)]},
b={DT[, roll_mean := lrollmean2(value,3), by=.(ID)]},
c={DT[, roll_mean := lrollmean3(value), by=.(ID)]}
,replications = 1,order = 'relative')
#> test replications elapsed relative user.self sys.self user.child
#> 4 c 1 6.740 1.000 6.829 0.000 0
#> 3 b 1 8.038 1.193 8.085 0.012 0
#> 1 original 1 13.599 2.018 13.692 0.000 0
#> 2 a 1 14.180 2.104 14.233 0.008 0
#> sys.child
#> 4 0
#> 3 0
#> 1 0
#> 2 0
Created on 2020-02-17 by the reprex package (v0.3.0)

How do I convert my list of lists to a usable data.frame (for printing out a table)?

I have a list of unnamed lists that I need to convert into a usable data.frame. For the most part, each list inside the list have the same element names but some will have some elements that others will not. So each list should be a Row in my data.frame, each variable name should be a column and in cases where a list doesn't have a particular variable the data.frame should contain an NA element.
In my example this_list is what I'm working with and this_df is what I would like to have. I've tried various ways to unlist and convert to data.frame, but my column names just become repeated and I get only 1 observation. Thank you.
this_list <- list(list(
Name = "One",
A = 2,
B = 3,
C = 4,
D = 5
),
list(
Name = "Two",
A = 5,
B = 2,
C = 1
))
this_df <- data.frame(Name=c("One","Two"),
A=c(2,5),
B=c(3,2),
C=c(4,1),
D=c(5,NA))

This is a task for which people frequently reach for dplyr::bind_rows or data.table::rbindlist. However, in base R, if the list elements are consistent, a quick base R solution is do.call(rbind, ...):
do.call(rbind, list(this_list[[1]][1:4], this_list[[2]]))
#> Name A B C
#> [1,] "One" 2 3 4
#> [2,] "Two" 5 2 1
It returns a matrix, but can be cleaned up fairly easily.
However, if the list elements are not consistent, it recycles in an annoying way (with a warning, thankfully):
do.call(rbind, this_list)
#> Warning in (function (..., deparse.level = 1) : number of columns of result
#> is not a multiple of vector length (arg 2)
#> Name A B C D
#> [1,] "One" 2 3 4 5
#> [2,] "Two" 5 2 1 "Two"
Thus the need for a more robust solution, e.g.
rbind_list <- function(list, ...){
# generate a vector of all variable names
vars <- Reduce(function(x, y){union(x, names(y))}, list, init = c());
filled_list <- lapply(list, function(x){
x <- x[vars] # add missing elements, reordering if necessary
names(x) <- vars # fix missing names
x <- lapply(x, function(y){
if (is.null(y)) { # replace NULL with NA
NA
} else if (is.list(y)) {
if (length(y) != 1) y <- list(y) # handle non-length-1 list columns
I(y) # add as-is class to list columns so they don't fail
} else {
y
}
})
as.data.frame(x, ...) # coerce to data frame
})
do.call(rbind, filled_list) # rbind resulting list of data frames
}
It does decidedly better than do.call(rbind, ...):
rbind_list(this_list, stringsAsFactors = FALSE)
#> Name A B C D
#> 1 One 2 3 4 5
#> 2 Two 5 2 1 NA
rbind_list(c(this_list, this_list))
#> Name A B C D
#> 1 One 2 3 4 5
#> 2 Two 5 2 1 NA
#> 3 One 2 3 4 5
#> 4 Two 5 2 1 NA
rbind_list(list(list(a = 1), list(b = 2)))
#> a b
#> 1 1 NA
#> 2 NA 2
rbind_list(list(list(a = 1), list(a = 1, b = 2)))
#> a b
#> 1 1 NA
#> 2 1 2
rbind_list(list(list(a = 1, b = 2), list(b = 2, a = 1)))
#> a b
#> 1 1 2
#> 2 1 2
...though list column handling is still inconsistent:
# correct; is a list column
rbind_list(list(list(a = 1, c = list('foo')), list(a = 1, c = list('baz'))))
#> a c
#> 1 1 foo
#> 2 1 baz
# also correct
rbind_list(list(list(a = 1, c = list(c('foo', 'bar'))), list(a = 1, c = list('baz'))))
#> a c
#> 1 1 foo, bar
#> 2 1 baz
# can handle non-encapsulated nested lists
rbind_list(list(list(a = 1, c = list('foo', 'bar')), list(a = 1, c = list('baz'))))
#> a c
#> 1 1 foo, bar
#> 2 1 baz
# ...which confuses dplyr
dplyr::bind_rows(list(list(a = 1, c = list('foo', 'bar')), list(a = 1, c = list('baz'))))
#> Error in bind_rows_(x, .id): Argument 2 must be length 1, not 2
# ...but fills missing list elements with NA because it doesn't track classes across observations
rbind_list(list(list(a = 1), list(c = list('baz'))))
#> a c
#> 1 1 NA
#> 2 NA baz
# ...which dplyr handles better
dplyr::bind_rows(list(list(a = 1), list(c = list('baz'))))
#> # A tibble: 2 x 2
#> a c
#> <dbl> <list>
#> 1 1.00 <NULL>
#> 2 NA <chr [1]>
While certainly more robust than do.call(rbind, ...), at scale this approach is likely to be considerably slower than package implementations written in C or C++.

You can use rbindlist from data.table:
library(data.table)
that_df <- as.data.frame(rbindlist(this_list, fill = TRUE))
# the result
Name A B C D
1: One 2 3 4 5
2: Two 5 2 1 NA

Solution using base R only. Sequentially does a full join on each list element. (edited based on comment from #RichScriven)
this_df <- Reduce(function(x, y) merge(x, y, all = TRUE), this_list)

Just another alternative using the dplyr package:
bind_rows(this_list)
# A tibble: 2 x 5
Name A B C D
<chr> <dbl> <dbl> <dbl> <dbl>
1 One 2 3 4 5
2 Two 5 2 1 NA
EDIT:
While we are at it. Here is another fast alternative from rlist:
list.stack(this_list, fill = TRUE)
Name A B C D
1 One 2 3 4 5
2 Two 5 2 1 NA

Calculate a mean, by a condition, within a factor [r]

I'm looking to calculate the simple mean of an outcome variable, but only for the outcome associated with the maximal instance of another running variable, grouped by factors.
Of course, the calculated statistic could be substituted for any other function, and the evaluation within the group could be any other function.
library(data.table) #1.9.5
dt <- data.table(name = rep(LETTERS[1:7], each = 3),
target = rep(c(0,1,2), 7),
filter = 1:21)
dt
## name target filter
## 1: A 0 1
## 2: A 1 2
## 3: A 2 3
## 4: B 0 4
## 5: B 1 5
## 6: B 2 6
## 7: C 0 7
With this frame, the desired output should return a mean value for target that meets the criteria of exactly 2.
Something like:
dt[ , .(mFilter = which.max(filter),
target = target), by = name][ ,
mean(target), by = c("name", "mFilter")]
... seems close, but isn't hitting it quite right.
The solution should return:
## name V1
## 1: A 2
## 2: B 2
## 3: ...

You could do this with:
dt[, .(meantarget = mean(target[filter == max(filter)])), by = name]
# name meantarget
# 1: A 2
# 2: B 2
# 3: C 2
# 4: D 2
# 5: E 2
# 6: F 2
# 7: G 2

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to update both data.tables in a join - r

Related

Add new blank rows into dataset by group (in R)

Determine dynamic window width: efficient rolling count of values which satisfy a condition

data.table frollmean very slow

How do I convert my list of lists to a usable data.frame (for printing out a table)?

Calculate a mean, by a condition, within a factor [r]

Categories

Resources