Given a data.table, I would like to subset the items in there quickly. For example:
dt = data.table(a=1:10, key="a")
dt[a > 3 & a <= 7]
This is still pretty slow. I know I can do joins to get individual rows, but is there a way to use the fact that the data.table is sorted to get quick subsets of this kind?
This is what I'm doing:
dt1 = data.table(id = 1, ym = c(199001, 199006, 199009, 199012), last_ym = c(NA, 199001, 199006, 199009), v = 1:4, key=c("id", "ym"))
dt2 = data.table(id = 1, ym = c(199001, 199002, 199003, 199004, 199005, 199006, 199007, 199008, 199009, 199010, 199011, 199012), v2 = 1:12, key=c("id","ym"))
For each id (here there is only one) and ym in dt1, I would like to sum the values of v2 in dt2 between the current ym and the previous ym in dt1. That is, for ym == 199006 in dt1 I would like to return list(v2 = 2 + 3 + 4 + 5 + 6): the values of v2 in dt2 whose ym is less than or equal to the current ym and greater than the previous ym (last_ym). In code:
expr = expression({ # browser()
  cur_id = id
  cur_ym = ym
  cur_dtb = dt2[J(cur_id)][ym <= cur_ym & ym > last_ym]
  setkey(cur_dtb, ym)
  list(r = sum(cur_dtb$v2))
})
dt1[, eval(expr), by = list(id, ym)]
To avoid the logical condition, perform a rolling join of dt1 and dt2. Then shift ym forward by one position within id. Finally, sum over v2 by id and ym:
setkey(dt1, id, last_ym)
setkey(dt2, id, ym)
dt1[dt2, , roll = TRUE][
  , list(v2 = v2, ym = c(last_ym[1], head(ym, -1))), by = id][
  , list(v2 = sum(v2)), by = list(id, ym)]
Note that we want to sum everything since last_ym, so the key on dt1 must be last_ym rather than ym.
The result is:
id ym v2
1: 1 199001 1
2: 1 199006 20
3: 1 199009 24
4: 1 199012 33
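For completeness, on data.table 1.9.8 or later the same result can be obtained with a non-equi join, which avoids both the roll and the shift of ym. This is only a sketch; the first row of dt1 has last_ym = NA, so it is filled with -Inf here to mean "sum from the beginning":
# Sketch, assuming data.table >= 1.9.8 (non-equi joins).
dt1b <- copy(dt1)[is.na(last_ym), last_ym := -Inf]  # NA last_ym -> "from the beginning"
# For each dt1 row, sum v2 over the dt2 rows with last_ym < ym <= ym.
dt2[dt1b, on = .(id, ym > last_ym, ym <= ym), .(v2 = sum(v2)), by = .EACHI]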
UPDATE: correction
Regardless of the fact that the data.table is sorted, you will be limited by the amount of time it takes to evaluate a > 3 & a <= 7 in the first place:
> dt = data.table(a=1:10000000, key="a")
> system.time(dt$a > 3 & dt$a <= 7)
user system elapsed
0.18 0.01 0.20
> system.time(dt[,a > 3 & a <= 7])
user system elapsed
0.18 0.05 0.24
> system.time(dt[a > 3 & a <= 7])
user system elapsed
0.25 0.07 0.31
Alternative approach:
> system.time({Indices = dt$a > 3 & dt$a <= 7 ; dt[Indices]})
user system elapsed
0.28 0.03 0.31
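That said, because a is an integer key here, the condition a > 3 & a <= 7 is just the values 4:7, so you can exploit the sort with a keyed join (binary search) instead of a vector scan. A sketch; this only helps when the range is cheap to enumerate:
# Sketch: enumerate the (integer) range and join on the key,
# which is a binary search rather than a scan of all of a.
system.time(dt[J(4:7)])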
Multiple Subsets
There can also be a speed issue if you subset by each factor level on an ad hoc basis rather than handling all the groups at once:
> dt <- data.table(A=sample(letters, 10000, replace=T))
> system.time(for(i in unique(dt$A)) dt[A==i])
user system elapsed
5.16 0.42 5.59
> system.time(dt[,.SD,by=A])
user system elapsed
0.32 0.03 0.36
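If what you actually need is a named list of per-level subsets rather than a grouped computation, data.table's split method also does it in one pass. A sketch, assuming data.table >= 1.9.8, where split.data.table gained the by argument:
# Sketch: build all subsets in one pass instead of one dt[A == i] scan per level.
subsets <- split(dt, by = "A")
subsets[["a"]]  # rows where A == "a"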
Related
I am writing a function where multiple ifelse calls are used in a data.table operation. Although I am using data.table for speed, the multiple ifelse calls are making my code slow, and this function is for a large data set. Hence I was wondering if there is an alternative to ifelse.
One example ifelse from the function (there are close to 15 of them); in this example, flag is set to 1 if x is blank, else 0.
dt<-dt[,flag:=ifelse(is.na(x)|!nzchar(x),1,0)]
The fastest approach will probably depend on what your data looks like. Those mentioned in the comments are all comparable for this example:
(twice was mentioned by #DavidArenburg; and onceadd by #akrun. I'm not really sure how to benchmark these with replications > 1, since the objects are actually modified during the benchmark.)
DT <- data.table(x=sample(c(NA,"",letters),1e8,replace=TRUE))
DT0 <- copy(DT)
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)
DT5 <- copy(DT)
DT6 <- copy(DT)
DT7 <- copy(DT)
library(rbenchmark)
benchmark(
  ifelse  = DT0[, flag := ifelse(is.na(x) | !nzchar(x), 1L, 0L)],
  keyit   = {
    setkey(DT1, x)
    DT1[, flag := 0L]
    DT1[J(NA_character_, ""), flag := 1L]
  },
  twiceby = DT2[, flag := 0L][is.na(x) | !nzchar(x), flag := 1L, by = x],
  twice   = DT3[, flag := 0L][is.na(x) | !nzchar(x), flag := 1L],
  onceby  = DT4[, flag := +(is.na(x) | !nzchar(x)), by = x],
  once    = DT5[, flag := +(is.na(x) | !nzchar(x))],
  onceadd = DT6[, flag := (is.na(x) | !nzchar(x)) + 0L],
  oncebyk = {setkey(DT7, x); DT7[, flag := +(is.na(x) | !nzchar(x)), by = x]},
  replications = 1
)[1:5]
# test replications elapsed relative user.self
# 1 ifelse 1 19.61 31.127 17.32
# 2 keyit 1 0.63 1.000 0.47
# 6 once 1 3.26 5.175 2.68
# 7 onceadd 1 3.24 5.143 2.88
# 5 onceby 1 1.81 2.873 1.75
# 8 oncebyk 1 0.91 1.444 0.82
# 4 twice 1 3.17 5.032 2.79
# 3 twiceby 1 3.45 5.476 3.16
Discussion. In this example, keyit is the fastest. However, it's also the most verbose and it changes the sorting of your table. Also, keyit is very specific to the OP's question (taking advantage of the fact that exactly two character values fit the condition is.na(x)|!nzchar(x)), and so might not be as great for other applications, where it would need to be written something like
keyit = {
  setkey(DT1, x)
  flagem = DT1[, some_other_condition(x), by = x][(V1)]$x
  DT1[, flag := 0L]
  DT1[J(flagem), flag := 1L]
}
How can I use data.table syntax to produce a data.table where each column contains the differences between the column of the original data.table and the next column?
Example: I have a data.table where each row is a group, and each column is surviving population after year 0, after year 1, 2, etc. Such as:
pop <- data.table(group_id = c(1, 2, 3),
                  N        = c(4588L, 4589L, 4589L),
                  N_surv_1 = c(4213, 4243, 4264),
                  N_surv_2 = c(3703, 3766, 3820),
                  N_surv_3 = c(2953, 3054, 3159))
# group_id N N_surv_1 N_surv_2 N_surv_3
# 1 4588 4213 3703 2953
# 2 4589 4243 3766 3054
# 3 4589 4264 3820 3159
(Data types differ because N is a true integer count and N_surv_1, etc. are projections that could be fractional.)
What I have done: using base diff and matrix transposition, we can do the following:
diff <- data.table(t(diff(t(as.matrix(pop[,-1,with=FALSE])))))
setnames(diff, paste0("deaths_",1:ncol(diff)))
cbind(group_id = pop[,group_id],diff)
# produces desired output:
# group_id deaths_1 deaths_2 deaths_3
# 1 -375 -510 -750
# 2 -346 -477 -712
# 3 -325 -444 -661
I know that I can use base diff by group on a single column produced by melt.data.table, so this works but ain't pretty:
melt(pop, id.vars = "group_id")[
  order(group_id)][
  , setNames(as.list(diff(value)),
             paste0("deaths_", 1:(ncol(pop) - 2))),
  keyby = group_id]
Is that the most data.table-riffic way to do this, or is there a way to do it as a multi-column operation in data.table?
Well, you could subtract the subsets:
ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
  `-`,
  utils:::tail.default(.SD, -1),
  utils:::head.default(.SD, -1)
), .SDcols = ncols]
# N_surv_1 N_surv_2 N_surv_3
# 1: -375 -510 -750
# 2: -346 -477 -712
# 3: -325 -444 -661
You could assign these values to new columns with :=. I have no idea why tail and head are not made more easily available... As pointed out by #akrun, you could use with=FALSE instead, like pop[, .SD[, -1, with=FALSE] - .SD[, -ncol(.SD), with=FALSE], .SDcols=ncols].
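For example, a sketch of that := assignment, reusing the Map expression above (the deaths_1, deaths_2, deaths_3 names are just for illustration):
# Sketch: write the pairwise differences into new columns in place.
newcols <- paste0("deaths_", seq_len(length(ncols) - 1))
pop[, (newcols) := Map(`-`,
                       utils:::tail.default(.SD, -1),
                       utils:::head.default(.SD, -1)),
    .SDcols = ncols]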
Anyway, this is pretty convoluted compared to simply reshaping:
melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]
# group_id V1
# 1: 1 -375
# 2: 1 -510
# 3: 1 -750
# 4: 2 -346
# 5: 2 -477
# 6: 2 -712
# 7: 3 -325
# 8: 3 -444
# 9: 3 -661
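If you want the wide layout from the question back, the long result can be reshaped again. A sketch, assuming data.table >= 1.9.8 for rowid():
# Sketch: label the differences within each group, then spread them to columns.
long <- melt(pop, id = "group_id")[
  , .(deaths = tail(value, -1) - head(value, -1)), by = group_id][
  , name := rowid(group_id, prefix = "deaths_")]
dcast(long, group_id ~ name, value.var = "deaths")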
Without reshaping the data, since each row has a unique id, you can group by the id column and then calculate the differences with diff on each row, i.e. on unlist(.SD):
pop[, setNames(as.list(diff(unlist(.SD))), paste0("deaths_", 1:(ncol(pop)-2))), group_id]
# group_id deaths_1 deaths_2 deaths_3
# 1: 1 -375 -510 -750
# 2: 2 -346 -477 -712
# 3: 3 -325 -444 -661
Essentially, something like this if you ignore setting up the column names:
pop[, as.list(diff(unlist(.SD))), group_id]
Here's another way to do it without reshaping or grouping, which might make it faster. If there is only a small number of rows, then there probably won't be a noticeable difference.
cols <- names(pop)[-1]
combs <- list()
for (i in 2:length(cols)) {
  combs[[length(combs) + 1]] <- c(cols[i - 1], cols[i])
}
newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
deathpop <- copy(pop)
deathpop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
deathpop[, (cols) := NULL]
I did some benchmarking
rows <- 10000000
pop <- data.table(group_id = 1:rows,
                  N        = runif(rows, 3000, 4000),
                  N_surv_1 = runif(rows, 3000, 4000),
                  N_surv_2 = runif(rows, 3000, 4000),
                  N_surv_3 = runif(rows, 3000, 4000))
system.time({
  cols <- names(pop)[-1]
  combs <- list()
  for (i in 2:length(cols)) {
    combs[[length(combs) + 1]] <- c(cols[i - 1], cols[i])
  }
  newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
  deathpop <- copy(pop)
  deathpop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
  deathpop[, (cols) := NULL]
})
and it returned
user system elapsed
0.192 0.808 1.003
In contrast I did
system.time(pop[, as.list(diff(unlist(.SD))), group_id])
and it returned
user system elapsed
169.836 0.428 170.469
I also did
system.time({
  ncols = grep("^N(_surv_[0-9]+)?", names(pop), value = TRUE)
  pop[, Map(
    `-`,
    utils:::tail.default(.SD, -1),
    utils:::head.default(.SD, -1)
  ), .SDcols = ncols]
})
which returned
user system elapsed
0.044 0.044 0.089
Finally, doing
system.time(melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id])
returns
user system elapsed
223.360 1.736 225.315
Frank's Map solution is fastest. If you take the copying out of mine then it gets a lot closer to Frank's time but his still wins for this test case.
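For reference, here is a sketch of my approach written directly on pop with :=, i.e. without the copy() step that is included in the timing above (this assumes it is fine to add the death columns to pop itself):
# Sketch: same pairwise differences, assigned in place so no copy is made.
cols <- names(pop)[-1]
combs <- Map(c, head(cols, -1), tail(cols, -1))
newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
pop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]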
I have a 1.5M x 7 data.table that I need to process through. The code I have written is running very slowly (0.18 s per row, an estimated 75 hours to complete), and I'm hoping I can optimize it.
I'll put the pseudo-example code at the end, because it's long.
str(review)
Classes ‘data.table’ and 'data.frame': 1500000 obs. of 7 variables:
$ user_id : Factor w/ 375000 levels "aA1aJ9lJ1lB5yH5uR6jR7",..: 275929 313114 99332 277686 57473 31780 236964 44371 210127 217770 ...
$ stars : int 2 1 3 3 1 1 2 1 2 2 ...
$ business_id : Factor w/ 60000 levels "aA1kR2bK6nH8yQ9gU2uI9",..: 40806 29885 43018 58297 58444 31626 26018 2493 37883 34204 ...
$ votes.funny : int 3 0 0 7 2 9 6 8 2 7 ...
$ votes.useful: int 4 1 0 5 9 2 4 7 4 9 ...
$ votes.cool : int 5 3 6 8 3 2 0 8 10 9 ...
$ IDate : IDate, format: "2012-01-01" "2012-01-01" "2012-01-01" ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "IDate"
I need to subset the dataset by date, and then compute several columns by business_id.
setkey(review, IDate)
system.time(
  review[
    # (IDate >= window.start) & (IDate <= window.end),
    1:10,
    .SD,
    keyby = business_id
  ][
    ,
    list(
      review.num          = .N,
      review.users        = length(unique(user_id)),
      review.stars        = mean(stars),
      review.votes.funny  = sum(votes.funny),
      review.votes.useful = sum(votes.useful),
      review.votes.cool   = sum(votes.cool)
    ),
    by = business_id
  ]
)
user system elapsed
1.534 0.000 1.534
Timing for smaller versions of the example dataset is
# 1% of original size - 15000 rows
user system elapsed
0.02 0.00 0.02
# 10% of original size - 150000 rows
user system elapsed
0.25 0.00 0.25
So, even though I'm only processing 10 rows, the time increases with the size of the original dataset.
I tried commenting out the review.users variable above, and the computation time on the original dataset fell tremendously:
user system elapsed
0 0 0
So, my challenge is making unique() work more quickly.
I need to count the unique values in user_id for each grouping of business_id.
Not sure what else to specify, but I'm happy to answer questions.
Here is some code to create a pseudo-example dataset. I'm not sure exactly what the cause of the slowdown is, so I've tried to recreate the data as faithfully as possible, but because generating the random variables takes so long, I've reduced the size by ~90%.
z <- c()
x <- c()
for (i in 1:6000) {
  z <- c(z, paste0(
    letters[floor(runif(7, min = 1, max = 26))],
    LETTERS[floor(runif(7, min = 1, max = 26))],
    floor(runif(7, min = 1, max = 10)),
    collapse = ""
  ))
}
z <- rep(z, 25)
for (i in 1:37500) {
  x <- c(x, paste0(
    letters[floor(runif(7, min = 1, max = 26))],
    LETTERS[floor(runif(7, min = 1, max = 26))],
    floor(runif(7, min = 1, max = 10)),
    collapse = ""
  ))
}
x <- rep(x, 4)
review2 <- data.table(
  user_id      = factor(x),
  stars        = as.integer(round(runif(150000) * 5, digits = 0)),
  business_id  = factor(z),
  votes.funny  = as.integer(round(runif(150000) * 10, digits = 0)),
  votes.useful = as.integer(round(runif(150000) * 10, digits = 0)),
  votes.cool   = as.integer(round(runif(150000) * 10, digits = 0)),
  IDate        = rep(as.IDate("2012-01-01"), 150000)
)
setkey(review2, IDate)
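As an aside, here is a vectorized sketch for generating the same style of 21-character ids; it avoids growing z and x one element at a time, so the example can be scaled back up if needed:
# Sketch: build all letter/LETTER/digit triples at once, then collapse 7 per id.
make_ids <- function(n) {
  triples <- paste0(sample(letters, n * 7, replace = TRUE),
                    sample(LETTERS, n * 7, replace = TRUE),
                    sample(1:9,     n * 7, replace = TRUE))
  apply(matrix(triples, nrow = n), 1, paste0, collapse = "")
}
z <- rep(make_ids(6000), 25)
x <- rep(make_ids(37500), 4)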
How about this - an alternative to unique using an extra data.table within an anonymous function:
review2[, {
  uid <- data.table(user_id)
  rev_user <- uid[, .N, by = user_id][, .N]
  # browser()
  list(
    review.num          = .N,
    review.users        = rev_user,
    review.stars        = mean(stars),
    review.votes.funny  = sum(votes.funny),
    review.votes.useful = sum(votes.useful),
    review.votes.cool   = sum(votes.cool)
  )
}, by = business_id]
It seems that length(unique()) is inefficient at counting the distinct values of a factor variable when the number of levels becomes very large.
Using uniqueN() instead (thanks #Frank):
user system elapsed
0.12 0.00 0.12
Using set(review, NULL, "user_id", as.character(review$user_id)) and length(unique()):
user system elapsed
0.11 0.00 0.11
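For reference, a sketch of the full aggregation with uniqueN() (available in data.table >= 1.9.6), which is the simplest change to the original code:
review2[, list(
  review.num          = .N,
  review.users        = uniqueN(user_id),  # distinct users per business
  review.stars        = mean(stars),
  review.votes.funny  = sum(votes.funny),
  review.votes.useful = sum(votes.useful),
  review.votes.cool   = sum(votes.cool)
), by = business_id]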
I have a data.table with returns on n dates for m securities. I would like to do a multiple linear regression of the form lm(ReturnSec1 ~ ReturnSec2 + ReturnSec3 + ... + ReturnSecM). The problem I am having is that dates might be missing for some of the securities, and the regression should be run on aligned dates only. Here is what I have come up with so far:
#The data set
set.seed(1)
dtData <- data.table(SecId = rep(c(1,2,3), each= 4), Date = as.Date(c(1,2,3,5,1,2,4,5,1,2,4,5)), Return = round(rnorm(12),2))
#My solution so far
dtDataAligned <- merge(dtData[SecId == 1,list(Date, Return)], dtData[SecId == 2, list(Date, Return)], by='Date', all=TRUE)
dtDataAligned <- merge(dtDataAligned, dtData[SecId == 3,list(Date, Return)], by='Date', all=TRUE)
setnames(dtDataAligned, c('Date', 'Sec1', 'Sec2', 'Sec3'))
dtDataAligned[is.na(dtDataAligned)] <- 0
#This is what I want to do
fit <- lm(dtDataAligned[, Sec1] ~ dtDataAligned[, Sec2] + dtDataAligned[, Sec3])
Is there a better (more elegant, possibly faster) way of doing this without having to loop and merge the data.table to perform a regression on the values with aligned dates?
Here is a data.table solution using dcast.data.table, which takes data in the long format (your input) and converts it to the wide format required for the lm call.
lm(`1` ~ ., dcast.data.table(dtData, Date ~ SecId, fill=0))
Here is the output of the dcast call:
Date 1 2 3
1: 2014-01-02 -0.63 0.33 0.58
2: 2014-01-03 0.18 -0.82 -0.31
3: 2014-01-04 -0.84 0.00 0.00
4: 2014-01-05 0.00 0.49 1.51
5: 2014-01-06 1.60 0.74 0.39
I stole the lm piece from #G.Grothendieck. Note that if you have more than three columns in your real data you will need to specify the value.var parameter for dcast.data.table.
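For example, a sketch of the explicit call; value.var is optional here because Return is the only remaining column, but it becomes required once dtData has additional columns (the - Date term keeps the date itself out of the regressors):
# Sketch: spell out which column to spread and drop Date from the RHS.
wide <- dcast.data.table(dtData, Date ~ SecId, value.var = "Return", fill = 0)
lm(`1` ~ . - Date, data = wide)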
If the question is how to reproduce the output of the code shown in the question in a more compact fashion, then try this:
library(zoo)
z <- read.zoo(dtData, split = 1, index = 2)
z0 <- na.fill(z, fill = 0)
lm(`1` ~., z0)
ADDED
Regarding the comment about elegance, we could create a magrittr pipeline out of the above like this:
library(magrittr)
dtData %>%
read.zoo(split = 1, index = 2) %>%
na.fill(fill = 0) %>%
lm(formula = `1` ~.)