R data.table sliding window

What is the best (fastest) way to implement a sliding window function with the data.table package?
I'm trying to calculate a rolling median but have multiple rows per date (due to 2 additional factors), which I think means that the zoo rollapply function wouldn't work. Here is an example using a naive for loop:
library(data.table)
df <- data.frame(
  id=30000,
  date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)
dt = data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))
get_window <- function(date, factor1, factor2) {
  criteria <- data.table(
    date=as.IDate((date - 7):(date - 1), origin="1970-01-01"),
    factor1=as.integer(factor1),
    factor2=as.integer(factor2)
  )
  return(dt[criteria][, value])
}
output <- data.table(unique(dt[, list(date, factor1, factor2)]))[, window_median:=as.numeric(NA)]
for(i in nrow(output):1) {
  print(i)
  output[i, window_median:=median(get_window(date, factor1, factor2))]
}

data.table doesn't have any special features for rolling windows, currently. There is further detail in my answer to another, similar question:
Is there a fast way to run a rolling regression inside data.table?
Rolling median is interesting. It would need a specialized function to do efficiently (same link as in the earlier comment):
Rolling median algorithm in C
The data.table solutions in the question and answers here are all very inefficient, relative to a proper, specialized rolling-median function (which isn't available for R, afaik).
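Since this answer was written, data.table has gained the froll* functions (frollmean, frollsum and, in later releases, frollapply). They work on one value per position, so they fit the simple one-row-per-date case directly; the multiple-rows-per-date case in the question still needs a join like the ones in the answers below. A minimal sketch, assuming a data.table version that ships frollapply and one row per (factor1, date):
# minimal sketch: per-group rolling median over the previous 7 days,
# assuming one row per (factor1, date); shift() moves the window so it
# ends at date - 1, as in the question
library(data.table)
simple <- data.table(
  factor1 = rep(1:2, each = 30),
  date    = rep(as.IDate("2012-01-01") + 0:29, times = 2),
  value   = rnorm(60, 100, 10)
)
setkey(simple, factor1, date)
simple[, roll_med := shift(frollapply(value, 7, median)), by = factor1]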

I managed to get the example down to 1.4s by creating a lagged dataset and doing a huge join.
df <- data.frame(
  id=30000,
  date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)
dt2 <- data.table(df)
setkeyv(dt2, c("date", "factor1", "factor2"))
unique_set <- data.table(unique(dt2[, list(original_date=date, factor1, factor2)]))
output2 <- data.table()
for(i in 1:7) {
  output2 <- rbind(output2, unique_set[, date:=original_date-i])
}
setkeyv(output2, c("date", "factor1", "factor2"))
output2 <- output2[dt2]
output2 <- output2[, median(value), by=c("original_date", "factor1", "factor2")]
That works pretty well on this test dataset, but on my real one it fails with 8 GB of RAM. I'm going to try moving up to one of the High Memory EC2 instances (with 17, 34 or 68 GB of RAM) to get it working. Any ideas on how to do this in a less memory-intensive way would be appreciated.
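One possibility worth trying before renting more RAM: recent data.table versions (1.9.8+) support non-equi (range) joins, and with by = .EACHI the median is evaluated one window at a time instead of materializing seven lagged copies of the table at once. A rough sketch, reusing dt2 from the code above (untested on the full-size data):
# hedged sketch: non-equi join, j evaluated per window via by = .EACHI
windows <- unique(dt2[, list(factor1, factor2, original_date = date)])
windows[, `:=`(win_start = original_date - 7L, win_end = original_date - 1L)]
windows[, window_median := dt2[windows,
                               on = .(factor1, factor2, date >= win_start, date <= win_end),
                               median(value), by = .EACHI]$V1]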

This solution works but it takes a while.
df <- data.frame(
  id=30000,
  date=rep(seq.Date(from=as.Date("2012-01-01"), to=as.Date("2012-01-30"), by="d"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)
myFun <- function(dff, df){
  # dff is one row of df, coerced to character by apply(); positions:
  # dff[2] = date, dff[3] = factor1, dff[4] = factor2
  median(df$value[df$date >= as.Date(dff[2]) - 7 & df$date <= as.Date(dff[2]) - 1 &
                    df$factor1 == dff[3] & df$factor2 == dff[4]])
}
week_Med <- apply(df, 1, myFun, df=df)
week_Med_df <- cbind(df, week_Med)
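The window median only depends on (date, factor1, factor2), so one possible speed-up is to evaluate myFun once per unique combination and merge the result back, instead of once per row. A rough, untested variant (id is kept in the lookup table so the positional indices inside myFun stay the same):
# one median per unique window instead of one per row
wins <- unique(df[, c("id", "date", "factor1", "factor2")])
wins$week_Med <- apply(wins, 1, myFun, df = df)
week_Med_df <- merge(df, wins[, c("date", "factor1", "factor2", "week_Med")],
                     by = c("date", "factor1", "factor2"))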

I address this in a related thread: https://stackoverflow.com/a/62399700/7115566
I suggest looking into the frollapply function. For instance, see below:
library(data.table)
set.seed(17)
dt <- data.table(i = 1:100,
                 x = sample(1:10, 100, replace = TRUE),
                 y = sample(1:10, 100, replace = TRUE))
dt[, index := x == y]
dt[, MA := frollapply(index, 10, mean)]
head(dt,12)
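If the window should be centred instead of right-aligned, or the leading values filled with something other than NA, frollapply also takes align and fill arguments, for example:
# centred 10-observation window instead of the default right-aligned one
dt[, MA_centered := frollapply(index, 10, mean, align = "center", fill = NA)]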

Related

Optimizing processing time in the nested for loops - R

I have two datasets with 24k and 15k rows. I used nested for loops to rewrite some data; however, it takes forever to compute.
Does anyone have a suggestion for how to optimize the code to speed up the process?
My code:
for(i in 1:length(data$kolicina)){
  for(j in 1:length(df$kolicina)){
    if(data$LIXcode[i] == df$LIXcode[j]){
      data$kolicina[i] <- df$kolicina[j]
    }
  }
}
The full code with the input looks like this:
df <- data[grepl("Trennscheiben", data$a_naziv) & data$SestavKolicina > 1,]
for(i in 1:length(df$kolicina)){
  df$kolicina[i] <- df$kolicina[i] / 10
}
for(i in 1:length(data$kolicina)){
  for(j in 1:length(df$kolicina)){
    if(data$LIXcode[i] == df$LIXcode[j]){
      data$kolicina[i] <- df$kolicina[j]
    }
  }
}
The data:
LIXcode        a_naziv                  RacunCenaNaEM  kolicina
LIX2017396957  MINI HVLP Spritzpistole  20,16          1
LIX2017396957  MINI HVLP Spritzpistole  20,16          1
LIX2017396963  Trennscheiben Ø115 Ø12   12,53          30
LIX2017396963  Trennscheiben Ø115 Ø12   12,53          1
I haven't tried this on my own machine, but this should work:
fun <- function(x, y){
  x[which(x$LIXcode %in% y$LIXcode)]$kolicina =
    y[which(x$LIXcode %in% y$LIXcode)]$kolicina
}
fun(data, df)
R has the capability to do them all in parallel
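Note, though, that as written the function only modifies its local copy of x, so the call leaves data unchanged. A hedged sketch of what is presumably intended, aligning the rows with match() and returning the updated table:
# match() lines each row of x up with the corresponding row of y;
# the modified copy is returned so the caller can reassign it
fun <- function(x, y){
  hit <- x$LIXcode %in% y$LIXcode
  x$kolicina[hit] <- y$kolicina[match(x$LIXcode[hit], y$LIXcode)]
  x
}
data <- fun(data, df)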
As far as I understand, the question concerns table "dt1" with key column "a" and any number of value columns and any number of observations. And then we have a "dt2" that has some sort of mapping - which means that column "a" has unique values and some column "b" has values that need to be written into "dt1" where columns "a" match.
I would suggest joining tables:
require(data.table)
dt1 <- data.table(a = sample(1:10, 1000, replace = T),
                  b = sample(letters, 1000, replace = T))
dt2 <- data.table(a = 1:10,
                  b = letters[1:10])
output <- merge(dt1, dt2, by = "a", all.x = T)
Also you can try:
dt1[, new_value := dt2$b[match(a, dt2$a)]]
Both of these solutions are vectorized, therefore almost instant.
Base solution (no data.table syntax, although I'd highly recommend learning it):
dt1$new_value <- dt2$b[match(dt1$a, dt2$a)]
And that's if I understood the question correctly...
Here's a working solution to accommodate for expected output:
dt1[a %in% dt2$a, b:=dt2$b[match(a, dt2$a)]]

plyr outperforms dplyr and data.table - What's wrong?

I have to apply a function to every row of a large table (~2M rows). I used to use plyr for that, but the table is growing continuously and the current solution is starting to approach unacceptable runtimes. I thought I could just switch to data.table or dplyr and all would be fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)
dt = data.table("ID_1" = c(1:1000),                    # unique ID
                "ID_2" = ceiling(runif(1000, 0, 100)), # other ID, duplicates possible
                "group" = sample(LETTERS[1:10], 1000, replace = T),
                "value" = runif(1000),
                "ballast1" = "X",                      # keeps unchanged in derive_dt
                "ballast2" = "Y",                      # keeps unchanged in derive_dt
                "ballast3" = "Z",                      # keeps unchanged in derive_dt
                "value_derived" = 0)
setkey(dt, ID_1)
extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
  if(tmprow$group %in% extra_arg){return(NULL)} # exclude entries occurring in extra_arg
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if(new_group %in% unique(dt$group)){return(NULL)}
  new_value = runif(1)
  row_derived = tmprow
  row_derived$group = new_group
  row_derived$value = runif(1)
  row_derived$value_derived = 1
  return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])
set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))
set.seed(42)
system.time(results_plyr <- x <- ddply(dt, .variable = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer I could reduce runtimes for data.table and dplyr to ~0.6 and ~0.8 of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's cool, but I still expected a bigger performance increase from these packages... any further suggestions?
The issue is that the assignment you use has a very high overhead in data.table; plyr converts the row to a data.frame before passing it to your derive function, and thus avoids that overhead:
library(microbenchmark)
df = as.data.frame(dt)
microbenchmark({dt$group = dt$group}, {df$group = df$group})
#Unit: microseconds
# expr min lq mean median uq max neval
# { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406 100
# { df$group = df$group } 26.045 45.244 64.13909 61.6045 79.635 157.266 100
I can't suggest a good fix, since you say your example is not the real problem, so there is no point in solving it better. Some basic suggestions to look at: vectorize the code, and use := or set instead (depending on what exactly you end up doing); a rough sketch of the first idea follows.
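Purely to illustrate the "vectorize" suggestion on the toy derive() above (not the real function), the whole table can be processed in one pass with := instead of one data.table assignment per ID_1 group. A hedged sketch:
# vectorised version of the dummy derive(): one pass over the whole table
derive_all = function(DT, extra_arg){
  out = copy(DT)[!group %in% extra_arg]            # exclude entries occurring in extra_arg
  idx = match(out$group, LETTERS)
  idx = ((idx + sample(1:26, nrow(out), replace = TRUE)) %% 25) + 1
  out[, group := LETTERS[idx]]
  out = out[!group %in% unique(DT$group)]          # rows derive() would have returned NULL for
  out[, `:=`(value = runif(.N), value_derived = 1)]
  out[]
}
set.seed(42)
result_vec = derive_all(dt, extra_arg)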

Speed up quantile calculation

I am using the Hmisc package to calculate the quantiles of two continuous variables and compare the results in a crosstable. You can find my code below.
My problem is that the calculation of the quantiles takes a considerable amount of time if the number of observations increases.
Is there any possibility to speed up this procedure by using the data.table, ddply or any other package?
Thanks.
library(Hmisc)
# Set seed
set.seed(123)
# Generate some data
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)
c <- data.frame(a,b)
# Calculate quantiles
c$a.quantile <- cut2(a, g=5)
c$b.quantile <- cut2(b, g=5)
# Output some descriptives
summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)
# Time spent for calculation:
# User System verstrichen
# 25.13 3.47 28.73
As stated by jlhoward and Ricardo Saporta, data.table doesn't seem to speed things up too much in this case. The cut2 function is clearly the bottleneck here. I used another function to calculate the quantiles (see Is there a better way to create quantile "dummies" / factors in R?) and was able to cut the calculation time in half:
qcut <- function(x, n) {
  if(n <= 2){
    stop("The sample must be split in at least 3 parts.")
  } else {
    break.values <- quantile(x, seq(0, 1, length = n + 1), type = 7)
    break.labels <- c(
      paste0(">=", break.values[1], " & <=", break.values[2]),
      sapply(break.values[3:(n)], function(x){
        paste0(">", break.values[which(break.values == x) - 1], " & <=", x)
      }),
      paste0(">", break.values[(n)], " & <=", break.values[(n + 1)]))
    cut(x, break.values, labels = break.labels, include.lowest = TRUE)
  }
}
c$a.quantile.2 <- qcut(c$a, 5)
c$b.quantile.2 <- qcut(c$b, 5)
summaryM(a.quantile.2 ~ b.quantile.2, data=c, overall=TRUE)
# Time spent for calculation:
# User System verstrichen
# 10.22 1.47 11.70
Using data.table would reduce the calculation time by another second, but I like the summary by the Hmisc package better.
You can use data.table's built-in .N variable to tabulate quickly.
library(data.table)
library(Hmisc)
DT <- data.table(a,b)
DT[, paste0(c("a", "b"), ".quantile") := lapply(.SD, cut2, g=5), .SDcols=c("a", "b")]
DT[, .N, keyby=list(b.quantile, a.quantile)][, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
You can break that last line down into two steps to see what is going on. The second "[" simply reshapes the data into a clean format.
DT.tabulated <- DT[, .N, keyby=list(b.quantile, a.quantile)]
DT.tabulated
DT.tabulated[, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
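The reshaping step can also be written with dcast, which some may find easier to read (this assumes a data.table version whose dcast handles data.tables directly):
# equivalent reshape: one row per a.quantile, one column per b.quantile
dcast(DT.tabulated, a.quantile ~ b.quantile, value.var = "N")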
Data tables don't seem to improve things here:
library(Hmisc)
set.seed(123)
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)
library(data.table)
# original approach
system.time({
  c <- data.frame(a,b)
  c$a.quantile <- cut2(a, g=5)
  c$b.quantile <- cut2(b, g=5)
  smry.1 <- summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)
})
#   user  system elapsed
#  72.79    6.22   79.02

# original data.table approach
system.time({
  DT <- data.table(a,b)
  DT[, paste0(c("a", "b"), ".quantile") := lapply(.SD, cut2, g=5), .SDcols=c("a", "b")]
  smry.2 <- DT[, .N, keyby=list(b.quantile, a.quantile)][, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
})
#   user  system elapsed
#  66.86    5.11   71.98

# different data.table approach (simpler, and uses table(...))
system.time({
  dt <- data.table(a,b)
  smry.3 <- table(dt[, lapply(dt, cut2, g=5)])
})
#   user  system elapsed
#  67.24    5.02   72.26

Find max per group and return another column

Given the following test matrix:
testMatrix <- matrix( c(1,1,2,10,20,30,300,100,200,"A","B","C"), 3, 4)
colnames(testMatrix) <- c("GroupID", "ElementID", "Value", "Name")
Here I want to find the max Value per group and then return the corresponding Name.
E.g. I would expect 1, A and 2, C. If there is a tie with max, the first match would be fine.
After that I would have to attach this to the matrix with a new Column "GroupName"
How can I do this?
I already have the Group, Max Value combination:
groupMax <- aggregate (as.numeric(testMatrix[,3]), by=list( testMatrix[,1] ), max )
The way I used to add columns to my matrix works like this (let's assume there is also already a matrix groupNames with GroupID, Name combinations):
testMatrix <- cbind ( testMatrix, groupNames[match( testMatrix[,1], groupNames[,1] ), 2] )
A data.table solution, for time and memory efficiency and syntactic elegance:
library(data.table)
DT <- as.data.table(testMatrix)
DT[, Value := as.numeric(Value)]  # matrix columns come across as character
DT[, list(Name = Name[which.max(Value)]), by = GroupID]
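If all the columns of the maximising row are wanted, not just Name, a common data.table idiom is .SD with which.max:
# returns the entire row with the largest Value within each GroupID
DT[, .SD[which.max(Value)], by = GroupID]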
Base solution, not as simple as Dan M's:
testMatrix <- data.frame(GroupID = c(1,1,2), ElementID = c(10,20,30),
                         Value = c(300,100,200), Name = c("A","B","C"))
A <- lapply(split(testMatrix, testMatrix$GroupID), function(x) {
  x[which.max(x$Value), c(1, 4)]
})
do.call(rbind, A)
As #Tyler said, a data.frame is easier to work with. Here's an option:
library(plyr)  # for ddply
testMatrix <- data.frame(GroupID = c(1,1,2), ElementID = c(10,20,30), Value=c(300,100,200), Name=c("A","B","C"))
ddply(testMatrix, .(GroupID), summarize, Name=Name[which.max(Value)])
I figured out a nice way to do this via dplyr:
library(dplyr)
filter(group_by(testMatrix, GroupID), min_rank(desc(Value)) == 1)

Is it possible to reuse generated columns in ddply?

I have a script where I'm using ddply, as in the following example:
ddply(df, .(col),
  function(x) data.frame(
    col1=some_function(x$y),
    col2=some_other_function(x$y)
  )
)
Within ddply, is it possible to reuse col1 without calling the entire function again?
For example:
ddply(df, .(col),
  function(x) data.frame(
    col1=some_function(x$y),
    col2=some_other_function(x$y),
    col3=col1*col2
  )
)
You've got a whole function to play with! Doesn't have to be a one-liner! This should work:
ddply(df, .(col), function(x) {
  tmp <- some_other_function(x$y)
  data.frame(
    col1=some_function(x$y),
    col2=tmp,
    col3=tmp
  )
})
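To mirror the question's col3 = col1 * col2 exactly, both intermediate results can be stored before building the data.frame, for instance:
# both results computed once, then reused for col3
ddply(df, .(col), function(x) {
  c1 <- some_function(x$y)
  c2 <- some_other_function(x$y)
  data.frame(col1 = c1, col2 = c2, col3 = c1 * c2)
})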
This appears to be a good candidate for data.table using the scoping rules of the j component. See FAQ 2.8 for details.
From the FAQ
No anonymous function is passed to
the j. Instead, an anonymous body is passed to the j.
So, for your case
library(data.table)
DT <- as.data.table(df)
DT[, {
  col1 = some_function(y)
  col2 = some_other_function(y)
  col3 = col1 * col2
  list(col1 = col1, col2 = col2, col3 = col3)
}, by = col]
or a slightly more direct way:
DT[, list(
  col1 = col1 <- some_function(y),
  col2 = col2 <- some_other_function(y),
  col3 = col1 * col2
), by = col]
This avoids one repetition each of col1 and col2, and avoids two repeats of col3; repetition is something we strive to reduce in data.table. The = followed by <- might initially look cumbersome. It allows the following syntactic sugar, though:
DT[, list(
  "Projected return (%)" = col1 <- some_function(y),
  "Investment ($m)" = col2 <- some_other_function(y),
  "Return on Investment ($m)" = col1 * col2
), by = col]
where the output can be sent directly to latex or html, for example.
I don't think that's possible, but it shouldn't matter too much, because at that point it's not an aggregation function anymore. For example:
# use summarize() in ddply()
data.means <- ddply(data, .(groups), summarize, mean = mean(x), sd = sd(x), n = length(x))
data.means$se <- data.means$sd / sqrt(data.means$n)
data.means$Upper <- data.means$mean + (data.means$se * 1.96)
data.means$Lower <- data.means$mean - (data.means$se * 1.96)
So I didn't calculate the SEs directly, but it wasn't so bad calculating them outside of ddply(). If you really wanted to, you could also do:
ddply(data, .(groups), summarize, se = sd(x) / sqrt(length(x)))
Or, to put it in terms of your example:
ddply(df, .(col), summarize,
  col1=some_function(y),
  col2=some_other_function(y),
  col3=some_function(y)*some_other_function(y)
)
