I have the below df:
df <- data.table(user = c('a', 'a', 'a', 'b', 'b')
, spend = 1:5
, shift_by = c(1,1,2,1,1)
); df
user spend shift_by
1: a 1 1
2: a 2 1
3: a 3 2
4: b 4 1
5: b 5 1
I am looking to create a lead/lag column, only this time the n parameter in data.table's shift function is dynamic and takes df$shift_by as input. My expected result is:
df[, spend_shifted := c(NA, 1, 1, NA, 4)]; df
user spend shift_by spend_shifted
1: a 1 1 NA
2: a 2 1 1
3: a 3 2 1
4: b 4 1 NA
5: b 5 1 4
However, with the below attempt it gives:
df[, spend_shifted := shift(x=spend, n=shift_by, type="lag"), user]; df
user spend shift_by spend_shifted
1: a 1 1 NA
2: a 2 1 NA
3: a 3 2 NA
4: b 4 1 NA
5: b 5 1 NA
This is the closest example I could find. However, I need a group-by, and I am after a data.table solution because of speed. I truly look forward to any ideas.
I believe this will work. You can drop the newindex column afterwards.
df[, newindex := rowid(user) - shift_by]
df[newindex < 0, newindex := 0]
df[newindex > 0, spend_shifted := df[, spend[newindex], by = .(user)]$V1]
# user spend shift_by newindex spend_shifted
# 1: a 1 1 0 NA
# 2: a 2 1 1 1
# 3: a 3 2 1 1
# 4: b 4 1 0 NA
# 5: b 5 1 1 4
Here's another approach, using a data.table join. I use two helper-columns to join on:
df[, row := .I, by = .(user)]
df[, match_row := row - shift_by]
df[df, on = .(user, match_row = row), x := i.spend]
df[, c('row', 'match_row') := NULL]
# user spend shift_by spend_shifted x
# 1: a 1 1 NA NA
# 2: a 2 1 1 1
# 3: a 3 2 1 1
# 4: b 4 1 NA NA
# 5: b 5 1 4 4
Using matrix subsetting of data.frames:
df[,
   spend_shifted :=
     data.frame(shift(spend, n = unique(sort(shift_by))))[cbind(1:.N, shift_by)],
   by = user]
Another solution (in addition to Wimpel's) without shift:
df[, {rows <- 1:nrow(.SD) - shift_by; .SD[replace(rows, rows <= 0, NA), spend]},
by = user]
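To store the result as a new column rather than just returning it by group, here is a small sketch of the same idea written with := (indexing spend directly instead of going through .SD):
df[, spend_shifted := {
  rows <- seq_len(.N) - shift_by        # position of the row to pull from, per row
  spend[replace(rows, rows <= 0, NA)]   # out-of-range positions become NA
}, by = user]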
Maybe this could help
> df[, spend_shifted := spend[replace(seq(.N) - shift_by, seq(.N) <= shift_by, NA)], user][]
user spend shift_by spend_shifted
1: a 1 1 NA
2: a 2 1 1
3: a 3 2 1
4: b 4 1 NA
5: b 5 1 4
I have carried out a benchmark test, as scalability is very important for me.
df is the same as the original, only repeated 10,000,000 times, giving 50,000,000 rows.
x <- 1e7
df <- data.table(user = rep(c('a', 'a', 'a', 'b', 'b'), x)
, spend = rep(1:5, x)
, shift_by = rep(c(1,1,2,1,1), x)
); df
user spend shift_by
1: a 1 1
2: a 2 1
3: a 3 2
4: b 4 1
5: b 5 1
benchmark:
a <-
microbenchmark(wimpel = {df[, newindex := rowid(user) - shift_by]
df[newindex < 0, newindex := 0]
df[newindex > 0, spend_shifted := df[, spend[newindex], by = .(user)]$V1]
}
, r2evans = {df[, spend_shifted := spend[{o <- seq_len(.N) - shift_by; o[o<1] <- NA; o; }], by = user]}
, sindri_1 = {df[, spend_shifted := data.frame(shift(spend, n = unique(sort(shift_by))))[cbind(1:.N, shift_by)], by = user]}
, sindri_2 = {df[, {rows <- 1:nrow(.SD) - shift_by; .SD[replace(rows, rows == 0, NA), spend]}, by = user]}
, talat = {df[, row := .I, by = .(user)]
df[, match_row := row - shift_by]
df[df, on = .(user, match_row = row), x := i.spend]
df[, c('row', 'match_row') := NULL]
}
, thomas = {df[, spend_shifted := spend[replace(seq(.N) - shift_by, seq(.N) <= shift_by, NA)], user]}
, times = 20
)
autoplot(a)
@ThomasIsCoding's and @r2evans' methods are almost identical.
a[, .(mean = mean(time)), expr][order(mean)]
expr mean
1: thomas 1974759530
2: r2evans 2121604845
3: sindri_2 2530492745
4: wimpel 4337907900
5: sindri_1 4585692780
6: talat 7252938170
I am still in the process of working through the logic of all the methods provided. I cannot thank you all enough for the many contributions. I shall be voting for an answer in due course.
Assuming I have a data.table as below
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), v = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6), a = 1:9, b = 9:1)
> DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
I have a variable sl <- c("a","b") that selects columns to compute rowSums. If I try the code below
DT[, ab := rowSums(.SD[, ..sl])]
I am still able to get the desired output, but I get a warning message:
DT[, ab := rowSums(.SD[, ..sl])]
Warning message:
In `[.data.table`(.SD, , ..sl) :
Both 'sl' and '..sl' exist in calling scope. Please remove the '..sl' variable in calling scope for clarity.
However, no warnings occur when running
DT[, ab := rowSums(.SD[, sl, with = FALSE])]
I am wondering how to fix the warning issue when using .SD[, ..sl]. Thanks in advance!
It may be that the syntax to use is either to specify .SDcols and operate on .SD, or to call ..cols directly on the original object. According to ?data.table:
x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols]
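For the rowSums case in the question, a minimal sketch using the .SDcols form, which avoids referencing ..sl inside j altogether:
# Supply the summed columns via .SDcols, so no '..' prefix (and no warning) is needed
DT[, ab := rowSums(.SD), .SDcols = sl]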
If we check the source code of data.table, line 248 seems to be the one triggering the warning, as
DT[, exists(..sl, where = DT)]
#[1] TRUE
and
DT[, .SD[, exists(..sl)]]
#[1] TRUE
DT[, .SD[, exists(..sl, where = .SD)]]
#[1] TRUE
Given the data.table dt <- data.table(a=c(1,NA,3), b = c(4:6))
a b
1: 1 4
2: NA 5
3: 3 6
... , the result for dt[is.na(a), a := sum(a, na.rm = T)] is:
a b
1: 1 4
2: 0 5
3: 3 6
... , instead of the expected:
a b
1: 1 4
2: 4 5
3: 3 6
What is going on? I am using data.table 1.12.8
In dt[is.na(a), a := sum(a, na.rm = TRUE)], the i expression subsets first, so sum() in j only sees the rows where a is NA, and sum(NA, na.rm = TRUE) is 0. We could instead use fcoalesce on the full column:
library(data.table)
dt[, a := fcoalesce(a, sum(a, na.rm = TRUE))]
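Alternatively, a small sketch of the original subset-assign approach, computing the sum over the full column first so that j is not restricted to the NA rows:
library(data.table)
dt <- data.table(a = c(1, NA, 3), b = 4:6)
# Pre-compute the total over the whole column (1 + 3 = 4), then assign it
# only to the rows where a is NA.
total <- dt[, sum(a, na.rm = TRUE)]
dt[is.na(a), a := total]
dt
#    a b
# 1: 1 4
# 2: 4 5
# 3: 3 6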
I'm trying to calculate the rolling mean of a column in a large data.table (~30M rows) aggregated by two other columns.
The rolling mean should include only the preceding N row values, not the row value itself.
For this purpose, I had to define my own rolling mean function based on the frollmean function (N = 3).
Applying the function to the column is really really slow, rendering it rather useless.
Here is sample data:
require(data.table)
DT <- data.table(ID=c('A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C')
, value_type =c('type 1', 'type 1','type 2','type 1','type 2','type 2','type 1','type 1','type 2','type 1','type 1','type 1')
, value=c(1,4,7,2,3,5,1,6,8,2,2,3))
DT
ID value_type value
1: A type 1 1
2: A type 1 4
3: A type 2 7
4: A type 1 2
5: A type 2 3
6: A type 2 5
7: B type 1 1
8: B type 1 6
9: B type 2 8
10: C type 1 2
11: C type 1 2
12: C type 1 3
#this is the customised rolling function
lrollmean<-function(x){
head(frollmean(c(NA,NA,NA,x), n = 3, fill = NA, algo ="exact", align="right", na.rm = TRUE)[-(1:2)], -1)
}
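For example, on the three ID 'A' / 'type 1' values (1, 4, 2), it returns the mean of the preceding (up to three) values only, never the current row itself:
lrollmean(c(1, 4, 2))
# [1] NaN 1.0 2.5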
> DT[, roll_mean := lrollmean(value), by=.(ID, value_type)]
> DT
ID value_type value roll_mean
1: A type 1 1 NaN
2: A type 1 4 1.0
3: A type 2 7 NaN
4: A type 1 2 2.5
5: A type 2 3 7.0
6: A type 2 5 5.0
7: B type 1 1 NaN
8: B type 1 6 1.0
9: B type 2 8 NaN
10: C type 1 2 NaN
11: C type 1 2 2.0
12: C type 1 3 2.0
This operation takes more than 30 minutes! I've got a reasonable machine with ample RAM, and I suspect the long run time has more to do with my code than with the machine.
Can you try this and see if it's fast enough:
n <- 3L
DT[, roll_mean := {
  # adaptive window widths: grow from 1 up to n, then stay at n
  v <- if (.N - n >= 1L) c(seq.int(n), rep(n, .N - n)) else seq.int(min(n, .N))
  # shift by one row so the current value is excluded from its own window
  shift(frollmean(value, v, adaptive = TRUE))
}, .(ID, value_type)]
But if you have a large number of small groups, you can try:
setorder(DT[, rn := .I], ID, value_type)
rid <- DT[, rowid(ID, value_type)]
DT[, roll_mean := shift(frollmean(value, n))]
ix <- DT[rid==3L, which=TRUE]
set(DT, ix, "roll_mean", DT[, shift(frollmean(value, n - 1L))][ix])
ix <- DT[rid==2L, which=TRUE]
set(DT, ix, "roll_mean", DT[, shift(value)][ix])
DT[rid==1L, roll_mean := NA_real_]
setorder(DT, rn)[]
You can try frollapply, since frollmean doesn't completely suit your needs. You can also optimize the function you apply to the window, since you don't need a very complicated operation. I've tried a few modifications to your function that should cut your time down by around 50%.
library(data.table)
library(stringi)
N=1e6
set.seed(123)
DT <- data.table(ID=stri_rand_strings(N,3),
value=rnorm(N,5,5))
head(DT)
#> ID value
#> 1: HmP 12.2667538
#> 2: sw2 -2.2397053
#> 3: WtY 7.0911933
#> 4: SxS 0.4029431
#> 5: gZ6 8.6800795
#> 6: tF2 0.8228594
DT[,.(.N),by=ID][order(N)]
#> ID N
#> 1: HoR 1
#> 2: eNM 1
#> 3: I9h 1
#> 4: xjb 1
#> 5: eFH 1
#> ---
#> 234823: 34Y 15
#> 234824: Xcm 15
#> 234825: IOu 15
#> 234826: tob 16
#> 234827: f70 16
# Your function
lrollmean<-function(x){
head(frollmean(c(NA,NA,NA,x), n = 3, fill = NA, algo ="exact", align="right", na.rm = TRUE)[-(1:2)], -1)
}
#Possible modifications:
lrollmean1<-function(x,n){
frollapply(c(rep(NA,n),x),n+1,weighted.mean,c(rep(1,n),0),na.rm=T)[-(1:3)]
}
lrollmean2<-function(x,n){
frollapply(c(rep(NA,n),x),n+1,function(x) sum(x*c(rep(1,n),0)/n,na.rm = T))[-(1:3)]
}
lrollmean3<-function(x){ # More optimized assuming n=3
frollapply(c(NA,NA,NA,x),4,function(x) sum(x[1:3]/3,na.rm = T))[-(1:3)]
}
library(rbenchmark)
benchmark(original={DT[, roll_mean := lrollmean(value), by=.(ID)]},
a={DT[, roll_mean := lrollmean1(value,3), by=.(ID)]},
b={DT[, roll_mean := lrollmean2(value,3), by=.(ID)]},
c={DT[, roll_mean := lrollmean3(value), by=.(ID)]}
,replications = 1,order = 'relative')
#> test replications elapsed relative user.self sys.self user.child
#> 4 c 1 6.740 1.000 6.829 0.000 0
#> 3 b 1 8.038 1.193 8.085 0.012 0
#> 1 original 1 13.599 2.018 13.692 0.000 0
#> 2 a 1 14.180 2.104 14.233 0.008 0
#> sys.child
#> 4 0
#> 3 0
#> 1 0
#> 2 0
Created on 2020-02-17 by the reprex package (v0.3.0)
This question is an addition to this post
> tempDT <- data.table(colA = c("E","E","A","A","E","A","E","A","E","A")
+ , lags = c(NA,1,1,2,3,1,2,NA,NA,1)
+ , group = c(1,1,1,1,1,1,1,2,2,2))
> tempDT
colA lags group
1: E NA 1
2: E 1 1
3: A 1 1
4: A 2 1
5: E 3 1
6: A 1 1
7: E 2 1
8: A NA 2
9: E NA 2
10: A 1 2
I have column colA, and need to find lags between current row and the previous row where colA == "E".
@Frank has proposed two approaches:
w = tempDT[colA == "E", which=TRUE]; tempDT[, v := shift(rowid(findInterval(.I, w))), by = "group"]
tempDT[, v:= shift(rowid(cumsum(colA=="E"))), by = "group"]
Since I have more than 72 million records, I'm wondering if there is any other way that computes even faster.
I'm using dcast.data.table to convert a long data.table to a wide data.table
library(data.table)
library(reshape2)
set.seed(1234)
dt.base <- data.table(A = rep(c(1:3),2), B = rep(c(1:2),3), C=c(1:4,1,2),thevalue=rnorm(6))
#from long to wide using dcast.data.table()
dt.cast <- dcast.data.table(dt.base, A ~ B + C, value.var = "thevalue", fun = sum)
#now some stuff happens e.g., please do not bother what happens between dcast and melt
setkey(dt.cast, A)
dt.cast[2, c(2,3,4):=1,with = FALSE]
Now I want to melt the data.table back to the original column layout, and here I'm stuck: how do I separate the concatenated column names from the casted data.table?
dt.melt <- melt(dt.cast,id.vars = c("A"), value.name = "thevalue")
I need two columns instead of one. The result that I'm looking for can be produced with this code:
#update
dt.base[A==2 & B == 1 & C == 1, thevalue :=1]
dt.base[A==2 & B == 2 & C == 2, thevalue :=1]
#insert (2,1,3 was not there in the base data.table)
dt.newrow <- data.table(A=2, B=1, C=3, thevalue = 1)
dt.base <-rbindlist(list(dt.base, dt.newrow))
dt.base
As always, any help is appreciated.
Would that work for you?
colnames <- c("B", "C")
dt.melt[, (colnames) := (colsplit(variable, "_", colnames))][, variable := NULL]
subset(dt.melt, thevalue != 0)
# or dt.melt[thevalue != 0, ]
# A thevalue B C
#1: 1 -1.2070657 1 1
#2: 2 1.0000000 1 1
#3: 2 1.0000000 1 3
#4: 3 1.0844412 1 3
#5: 2 1.0000000 2 2
#6: 3 0.5060559 2 2
#7: 1 -2.3456977 2 4
If your data set isn't representative and there could be zeros in valid rows, here's an alternative approach:
colnames <- c("B", "C")
setkey(dt.melt[, (colnames) := (colsplit(variable, "_",colnames))][, variable := NULL], A, B, C)
setkey(dt.base, A, B, C)
dt.base <- dt.melt[rbind(dt.base, data.table(A = 2, B = 1, C = 3), fill = T)]
dt.base[, thevalue.1 := NULL]
## A B C thevalue
## 1: 1 1 1 -1.2070657
## 2: 1 2 4 -2.3456977
## 3: 2 1 1 1.0000000
## 4: 2 2 2 1.0000000
## 5: 3 1 3 1.0844412
## 6: 3 2 2 0.5060559
## 7: 2 1 3 1.0000000
Edit
As suggested by @Arun, the most efficient way would be to use @AnandaMahto's cSplit function, as it uses data.table too, i.e.,
cSplit(dt.melt, "variable", "_")
Second Edit
In order to save the manual merges, you can set fill = NA (for example) while dcasting and then do everything in one go with cSplit, e.g.
dt.cast <- dcast.data.table(dt.base, A ~ B + C, value.var = "thevalue", fun = sum, fill = NA)
setkey(dt.cast, A)
dt.cast[2, c(2,3,4):=1,with = FALSE]
dt.melt <- melt(dt.cast,id.vars = c("A"), value.name = "thevalue")
dt.cast <- cSplit(dt.melt, "variable", "_")[!is.na(thevalue)]
setnames(dt.cast, 3:4, c("B","C"))
# A thevalue B C
# 1: 1 -1.2070657 1 1
# 2: 2 1.0000000 1 1
# 3: 2 1.0000000 1 3
# 4: 3 1.0844412 1 3
# 5: 2 1.0000000 2 2
# 6: 3 0.5060559 2 2
# 7: 1 -2.3456977 2 4