R data.table performance - proceed line by line

I have a performance issue in R using a data.table.
I have a data.table with mixed data and I need to do some simple computations, proceeding row by row. That means I need a for-loop over all rows. I usually avoid for-loops, but since the computation for one row depends on the previous one, I can't avoid it here.
I've built a basic example which reproduces the issue:
# create some sample data
dt <- data.table(x=1:300000, y=rnorm(100), z=c("a","b","c","d"))
dt$new <- NA_integer_
# init some demo variables
xxx <- 612341
yyy <- 1
for (i in 1:nrow(dt)) {
  dt[i, new := xxx]                # write something
  yyy <- dt[i, y]                  # read something
  if ((i %% 20000) == 0) print(i)  # see progress
}
Using system.time, this takes about 230 seconds, which seems pretty long for such simple computations on 300,000 rows.
For performance I already allocate the new column before the loop, and I use the := operator to avoid copying.
Is there any way to speed this up?
Here is the operation I am actually trying to compute: I want to write the previous row's y value into the new column whenever the current row belongs to the same class as the previous one, and NA otherwise.
dt <- data.table(x=1:30000, y=rnorm(100), z=c("a","a","a","a","b","b","c","c","c","d"))
dt$new <- NA_real_
remember <- NA_real_
currentclass <- ""
for (i in 1:nrow(dt)) {
  if (dt[i, z] == currentclass) {
    dt[i, new := remember]
    remember <- dt[i, y]
  } else {
    currentclass <- dt[i, z]
    remember <- dt[i, y]
    dt[i, new := NA]
  }
}
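For reference, this particular row-wise dependency does not actually need a loop: within each contiguous run of the same class, it simply copies the previous row's y. A minimal data.table sketch of that idea (assuming, as the loop above does, that "same class" means a contiguous run of identical z values):
library(data.table)
# same sample data as above
dt <- data.table(x=1:30000, y=rnorm(100), z=c("a","a","a","a","b","b","c","c","c","d"))
# shift(y) is the previous y within each group; rleid(z) groups contiguous
# runs of z, so the first row of every run gets NA
dt[, new := shift(y), by = rleid(z)]
This avoids calling [.data.table once per row, which is where most of the 230 seconds is likely going.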

How would you write this using the apply family of functions in R? Should you?

Here is my R Script that works just fine:
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0
library(dplyr)
setwd("~/R/xyz")
datFm <- read.csv("yellow_point_02.csv")
datFm <- filter(datFm, HRA_ClassHRA_Final != -9999)
quant_cols <- c("CL_GammaRay_Despiked_Spline_MLR", "CT_Density_Despiked_Spline_FinalMerged",
                "HRA_PC_1HRA_Final", "HRA_PC_2HRA_Final", "HRA_PC_3HRA_Final",
                "SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT", "Ultrasonic_DT_Despiked_Spline_MLR")
# add an extra column to datFm to store the quantile value
for (column_name in quant_cols) {
  datFm[paste(column_name, "quantile", sep = "_")] <- NA
}
# initialize an empty dataframe with the new column names appended
newDatFm <- datFm[0, ]
# get the unique values for the hra classes
hraClassNumV <- sort(unique(datFm$HRA_ClassHRA_Final))
# loop through the vector and create currDatFm and append it to newDatFm
for (i in hraClassNumV) {
  currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
  for (column_name in quant_cols) {
    currDatFm <- within(currDatFm, {
      CL_GammaRay_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$CL_GammaRay_Despiked_Spline_MLR)
      CT_Density_Despiked_Spline_FinalMerged_quantile <- perc.rank(currDatFm$CT_Density_Despiked_Spline_FinalMerged)
      HRA_PC_1HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_1HRA_Final)
      HRA_PC_2HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_2HRA_Final)
      HRA_PC_3HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_3HRA_Final)
      SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT_quantile <- perc.rank(currDatFm$SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT)
      Ultrasonic_DT_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$Ultrasonic_DT_Despiked_Spline_MLR)
    })
  }
  newDatFm <- rbind(newDatFm, currDatFm)
}
newDatFm <- newDatFm[order(newDatFm$Core_Depth), ]
# head(newDatFm, 10)
write.csv(newDatFm, file = "Ricardo_quantiles.csv")
I have a few questions though. Every R book or video that I have read or watched recommends the 'apply' family of functions over the classic 'for' loop, stating that apply is much faster.
So the first question is: how would you write it using apply (or tapply or some other apply)?
Second, is it really true that apply is much faster than for? The csv file 'yellow_point_02.csv' has approx. 2,500 rows, and this script runs almost instantly on my MacBook Pro, which has 16 GB of memory.
Third, see the 'quant_cols' vector? I created it so that I could write a generic loop (for (column_name in quant_cols)), but I could not get it to work, so I hard-coded the column names suffixed with '_quantile' and called perc.rank many times. Is there a way this could be made dynamic? I tried the 'paste' approach that is in my script, but that did not work.
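For what it's worth, here is one sketch of keeping the column handling generic, assuming the perc.rank function, the quant_cols vector and the filtered datFm from the script above. Base R's ave() applies a function within groups and returns a vector in the original row order; under the hood it is a thin wrapper around split() and lapply(), so it is in the spirit of the apply family:
# one generic loop: compute perc.rank within each HRA class for every
# column in quant_cols, writing to a dynamically named "_quantile" column
for (column_name in quant_cols) {
  datFm[[paste(column_name, "quantile", sep = "_")]] <-
    ave(datFm[[column_name]], datFm$HRA_ClassHRA_Final, FUN = perc.rank)
}
Because ave() preserves the row order, the per-class filter/rbind step is not needed for the quantile computation (the final sort by Core_Depth can stay if the input is not already in depth order).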
On the positive side though, R seems awesome in its ability to cut through the 'Data Wrangling' tasks with very few statements.
Thanks for your time.

R Speed up string decomposition

I am relatively new to R, so my repertoire of commands is limited.
I am trying to write a script that will decompose a series of Markovian sequences, contained in a text string and delimited with a '>' sign, into a contingency "from - to" table.
The code below, with dummy data, is as far as I have been able to get. On the small seven-case example included, it runs relatively quickly. In reality, however, I have millions of cases to parse, and my code just isn't efficient enough to process them in a timely fashion (it took well over an hour, which isn't feasible).
I'm convinced there is a more efficient way of structuring this code so that it executes quickly, as I have seen this operation performed by other Markov packages within a few minutes. I need my own scripted version, though, to allow flexibility in processing, hence I have not turned to those.
What I would like to request are improvements to the script to increase processing efficiency.
Seq <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D')  # 7 cases
Lives <- c(0, 0, 0, 0, 1, 1, 0)
Seqdata <- data.frame(Seq, Lives)
Seqdata$Seq <- gsub("\\s", "", Seqdata$Seq)
fromstep <- list()
tostep <- list()
## ORDER 1 ##
for (x in 1:nrow(Seqdata)) {
  steps <- unlist(strsplit(Seqdata$Seq[x], ">"))
  for (i in 1:length(steps)) {
    if (i == 1) {
      fromstep <- c(fromstep, "Start")
      tostep <- c(tostep, steps[i])
    }
    fromstep <- c(fromstep, steps[i])
    if (i < length(steps)) {
      tostep <- c(tostep, steps[i+1])
    } else if (Seqdata$Lives[x] == 1) {
      tostep <- c(tostep, 'Lives')
    } else {
      tostep <- c(tostep, 'Dies')
    }
  }
}
transition.freq <- table(unlist(fromstep), unlist(tostep))
transition.freq
I'm not familiar with Markovian sequences, but this produces the same output:
xx <- strsplit(Seqdata$Seq, '>', fixed=TRUE)
table(From=unlist(lapply(xx, append, 'Start', 0L)),
To=unlist(mapply(c, xx, ifelse(Seqdata$Lives == 0L, 'Dies', 'Lives'))))
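As a quick sanity check, both tables sort their labels the same way, so an element-wise comparison of the counts is enough to confirm the two approaches agree:
vec.freq <- table(From = unlist(lapply(xx, append, 'Start', 0L)),
                  To   = unlist(mapply(c, xx, ifelse(Seqdata$Lives == 0L, 'Dies', 'Lives'))))
all(vec.freq == transition.freq)   # should be TRUE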

Adding a new method to data.table

I work a lot with time series. Most of my manipulations are done via data.table, but I often have to look up data by a specific time range, and for that I use the xts method:
> timedata['2014-01-02/2014-01-03']
My data.table is basically an exact copy of the xts object, with the first column, index, containing the time:
> dt_timedata <- data.table(index=index(timedata), coredata(timedata))
At some point the data became way too large, so keeping both objects or converting them all the time is not really a good option (which it never was, really), so I am thinking about implementing the same method for data.table. However, I couldn't find any reasonably easy examples of modifying a generic method.
Is what I want even possible, and if so, where could I read about it?
PS: Even though I can obviously use something like
> from <- '2014-01-02'
> to <- '2014-01-03'
> period <- as.POSIXct(c(from, to))
> dt_timedata[index %between% period]
it is far less intuitive and beautiful, so I would rather write a new method.
Edit1 (example by request)
require(xts)
require(data.table)
days <- as.POSIXct(c('2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04'), origin='1970-01-01')
timedata <- xts(1:4, days)
dt_timedata <- data.table(index=index(timedata), coredata(timedata))
What I can do in xts:
> timedata['2014-01-01/2014-01-02']
           [,1]
2014-01-01    1
2014-01-02    2
I want the exact same for [.data.table.
Edit2 (to illustrate what I do)
'[.data.table' = function(x, i, ...) {
  if (!missing('i')) {
    if (all(class(i) == "character")) {
      # do some weird stuff
      return(x[ *some indices subset I just created* ])
    }
  }
  data.table:::'[.data.table'(x, i, ...)
}
So basically, if i is a character and matches my format (the checks happen in the 'weird stuff' section), I return something and the function never reaches the last command. Otherwise nothing happens and I just call
data.table:::'[.data.table'(x, i, ...)
And the thing is, this breaks expressions like
ind <- as.POSIXct('2014-01-01', origin='1970-01-01')
dt_timedata[index==ind]
Here's a trivial example for you to try:
require(data.table)
days <- as.POSIXct(c('2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04'), origin='1970-01-01')
dt_timedata <- data.table(index=days, value=1:4)
ind <- as.POSIXct('2014-01-01', origin='1970-01-01')
# now it works
dt_timedata[index==ind]
# new trivial [.data.table
'[.data.table' = function(x, I, ...) {
  data.table:::`[.data.table`(x, I, ...)
}
# and now it doesn't work
dt_timedata[index==ind]
Modifying the method to add your own custom behaviour is very simple:
`[.data.table` = function(...) {
  print("I'm doing smth custom")
  data.table:::`[.data.table`(...)
}
dt = data.table(a = 1:5)
dt[, sum(a)]
#[1] "I'm doing smth custom"
#[1] 15
So just process the first argument however you like and feed it to the standard function.
Here's an example to handle your edit:
`[.data.table` = function(...) {
  if (try(class(..2), silent = TRUE) == 'character')
    print("boo")
  else
    data.table:::`[.data.table`(...)
}
dt = data.table(a = 1:10)
dt[a == 4]
# a
#1: 4
dt['sdf']
#[1] "boo"
#[1] "boo"
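Building on that pattern, a rough sketch of the xts-style range subsetting asked about above might look like this (treating the index column name and the 'from/to' string format as given in the question; this is a starting point, not a robust implementation):
`[.data.table` = function(...) {
  # try to evaluate the i argument; column expressions such as index == ind
  # fail here and simply fall through to the standard data.table method
  i_val <- tryCatch(..2, error = function(e) NULL)
  if (is.character(i_val) && length(i_val) == 1L && grepl("/", i_val, fixed = TRUE)) {
    x <- ..1
    rng <- as.POSIXct(strsplit(i_val, "/", fixed = TRUE)[[1]], origin = "1970-01-01")
    keep <- x[["index"]] >= rng[1] & x[["index"]] <= rng[2]
    return(data.table:::`[.data.table`(x, keep))
  }
  data.table:::`[.data.table`(...)
}
days <- as.POSIXct(c('2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04'), origin='1970-01-01')
dt_timedata <- data.table(index=days, value=1:4)
dt_timedata['2014-01-01/2014-01-02']   # xts-style range
ind <- as.POSIXct('2014-01-01', origin='1970-01-01')
dt_timedata[index == ind]              # normal data.table syntax still works
Because the wrapper only inspects the i argument inside tryCatch, non-standard expressions like index == ind still fall through to the original method, as in the try(class(..2)) example above.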

Direct update (replace) of sparse data frame is slow and inefficient

I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. But the JSON files are not a simple key-value parse and they require a lot of pre-processing. The pre-processing is coded and performs fairly well. But the challenge I am having is loading each record into a single object (data.table or dplyr object) efficiently.
This is very sparse data: I'll have over 2,000 variables that will mostly be missing, and each record will have maybe a hundred variables set. The variables will be a mix of character, logical and numeric, and I do know the mode of each variable.
I thought the best way to avoid R copying the object for every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this in a data frame is extremely slow; moving to a data.table or dplyr object is much better, but I'm still hoping to reduce the run time to minutes instead of hours. See my example below:
timeMe <- function() {
  set.seed(1)
  names = paste0("A", seq(1:1200))
  # try with a data frame
  # outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  # try with data table
  outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  for (i in seq(100)) {
    # generate 100 columns (real data is in json)
    sparse.cols <- sample(1200, 100)
    # Each record is coming in as a list
    # Each column is either a character, logical, or numeric
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401) {         # logical
        sample(c(TRUE, FALSE), 1)
      } else if (i < 801) {  # numeric
        sample(seq(10), 1)
      } else {               # character
        sample(LETTERS, 1)
      }
    })  # now we have a list with values to populate
    names(sparse.val) <- paste0("A", sparse.cols)
    # and here is the challenge and what takes a long time.
    # want to assign the ith row and the named column with each value
    for (x in names(sparse.val)) {
      val = sparse.val[[x]]
      # this is where the bottleneck is.
      # for data frame
      # outdf[i, x] <- val
      # for data table
      outdf[i, x := val]
    }
  }
  outdf
}
I thought the mode of each column might be getting set and reset with each update, but I have also tried pre-setting each column type and this didn't help.
For me, running this example with a data.frame (commented out above) takes around 22 seconds; converting to a data.table brings it down to 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data.table here.
I follow your code except for the part where you construct sparse.val. There are minor errors in the way you assign columns. Don't forget to check that the answer is right while trying to optimise :).
First, the creation of data.table:
Since you say that you already know the type of each column, it's important to generate the correct types up front. Otherwise, when you do DT[, LHS := RHS] and the RHS type is not equal to the LHS, the RHS will be coerced to the type of the LHS. In your case, all your numeric and character values will be converted to logical, as all columns are of logical type. This is not what you want.
Creating a matrix therefore won't help (all columns would be of the same type), and it's also slow. Instead, I'd do it like this:
rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {
  if (i < 401L) rep(NA, rows)
  else if (i >= 402L & i < 801L) rep(NA_real_, rows)
  else rep(NA_character_, rows)
}))
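As a quick sanity check, you can tabulate the column classes that were just allocated:
# count how many columns of each type were created
table(vapply(outdf, class, character(1L)))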
Now we've got the right types set. Next, I think the condition should be i >= 402L & i < 801L. Otherwise, you're assigning the first 401 columns as logical and then the first 801 columns as numeric, which, given that you know the types of the columns upfront, doesn't make much sense, right?
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time-consuming for-loop:
for (x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, x := val]
}
is not actually doing what you think it's doing. It's not assigning the values from val to the column whose name is stored in x. Instead it's (over)writing, each time, a column literally named x. Check your output.
This is not part of the optimisation; it's just to show what you actually want to do here:
for (x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, (x) := val]
}
Note the ( around x. Now it will be evaluated, and the value contained in x will be the column to which val is assigned. It's a bit subtle, I understand. But this is necessary because it keeps DT[, x := val] available for when you actually want to create a column literally named x.
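A tiny illustration of the difference, with throwaway names just to show the two behaviours:
DT <- data.table(a = 1:3)
x <- "b"
DT[, x := 10]    # creates a column literally named "x"
DT[, (x) := 20]  # evaluates x, so this creates/updates column "b"
names(DT)
## [1] "a" "x" "b"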
Coming back to the optimisation, the good news is that your time-consuming for-loop is simply:
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
Putting it all together:
Your final function looks like this:
timeMe2 <- function() {
  set.seed(1L)
  rows = 100L
  cols = 1200L
  outdf <- as.data.table(lapply(seq_len(cols), function(i) {
    if (i < 401L) rep(NA, rows)
    else if (i >= 402L & i < 801L) rep(NA_real_, rows)
    else sample(rep(NA_character_, rows))
  }))
  setnames(outdf, paste0("A", seq(1:1200)))
  for (i in seq(100)) {
    sparse.cols <- sample(1200L, 100L)
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401L) sample(c(TRUE, FALSE), 1)
      else if (i >= 402 & i < 801L) sample(seq(10), 1)
      else sample(LETTERS, 1)
    })
    set(outdf, i = i, j = paste0("A", sparse.cols), value = sparse.val)
  }
  outdf
}
By doing this, your solution takes 9.84 seconds on my system whereas the function above takes 0.34 seconds, which is a ~29x improvement. I think this is the result you're looking for. Please verify it.
HTH

Speeding up "for-loop" for deleting rows matching criteria

I am backtesting an investment strategy using R, and I have the piece of script below:
set.seed(1)
output.df <- data.frame(action=sample(c("initial_buy","sell","buy"),
                                      10000, replace=TRUE),
                        stringsAsFactors=FALSE)
output.df[,"uid"] <- 1:nrow(output.df)
cutrow.fx <- function(output.df) {
  loop.del <- 2
  while (loop.del <= nrow(output.df)) {
    if ((output.df[loop.del,"action"]=="initial_buy" &
         output.df[loop.del-1,"action"]=="initial_buy")|
        (output.df[loop.del,"action"]=="sell" &
         output.df[loop.del-1,"action"]=="sell")|
        (output.df[loop.del,"action"]=="buy" &
         output.df[loop.del-1,"action"]=="sell")|
        (output.df[loop.del,"action"]=="initial_buy" &
         output.df[loop.del-1,"action"]=="buy")) {
      output.df <- output.df[-loop.del,]
    } else {
      loop.del <- loop.del + 1
    }
  }
  output.df <<- output.df
}
print(system.time(cutrow.fx(output.df=output.df)))
The strategy determines: 1) when to start buying a stock; 2) when to add an additional contribution to the stock; and 3) when to sell all of the stock. I have a data frame with the price of a stock for the past 10 years. I wrote three scripts to indicate on which dates I should buy/sell the stock, then combined the three results and ordered them.
I need to remove some of the "impossible actions", e.g. I cannot sell the same stock twice without buying new units beforehand, so I use the script above to delete those impossible actions. But the loop is rather slow.
Any suggestion for speeding it up?
Update 01
I have updated cutrow.fx to the following, but it fails:
cutrow.fx <- function(output.df) {
  output.df[,"action_pre"] <- "NIL"
  output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]
  while (any(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy")|
         any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell")|
         any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy")|
         any(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy")) {
    output.df <- output.df[!(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy"),]
    output.df[,"action_pre"] <- "NIL"
    output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]
  }
  output.df[,"action_pre"] <- NULL
  output.df <<- output.df
}
I used the vector comparison, loosely inspired by John's answer (loosely, because I'm not sure I understand exactly what he means), together with a while-loop to repeat it. But the output is not the same.
Is the for-loop here inevitable?
It looks like all you're doing is checking the last action. This doesn't require a loop at all. All you have to do is shift the vector and do straight vector comparisons. Here's an artificial example.
x <- sample(1:11)
buysell <- sample(c('buy', 'sell'), 11, replace = TRUE)
So, I have 11 samples, x, and whether I've bought or sold them. I want to make a boolean that shows whether I bought or sold the last sample.
bought <- c(NA, buysell[1:10])
which( bought == 'buy' )
Examine the x and buysell variables and you'll see the results here are the index of the x items where a buy was made on the prior item.
Also, you might want to check out the function %in%.
I tried to do something clever with vectorization, but failed because previous iterations of the loop can change the data relationships for later iterations. So I couldn't lag the data by a set amount and compare lagged to real results.
What I can do is minimize the copying operation involved. R is assign-by-copy, so when you write a statement like output.df <- output.df[-loop.del,], you are copying the entire data structure for each row that is deleted. Instead of changing (and copying) the data frame, I made changes to a logical vector. Some other attempts at speed-up include using the scalar logical and (&&) instead of the vectorised and (&), using %in% to make fewer comparisons, and minimizing accesses on output.df.
To compare the two functions I slightly modified the OP's solution so that the original data frame is not overwritten. It looks like this can improve speeds by a factor of 10, but it still takes a noticeable amount of time (>0.5 sec). I'd love to see any faster solutions.
OP's solution (slightly modified in return value and without global assign)
cutrow.fx <- function(output.df) {
  loop.del <- 2
  while (loop.del <= nrow(output.df)) {
    if ((output.df[loop.del,"action"]=="initial_buy" &
         output.df[loop.del-1,"action"]=="initial_buy")|
        (output.df[loop.del,"action"]=="sell" &
         output.df[loop.del-1,"action"]=="sell")|
        (output.df[loop.del,"action"]=="buy" &
         output.df[loop.del-1,"action"]=="sell")|
        (output.df[loop.del,"action"]=="initial_buy" &
         output.df[loop.del-1,"action"]=="buy")) {
      output.df <- output.df[-loop.del,]
    } else {
      loop.del <- loop.del + 1
    }
  }
  return(output.df)
}
ans1 <- cutrow.fx(output.df)
my solution
cutrow.fx2 <- function(output.df) {
  ## edge case if output.df has too few rows
  if (nrow(output.df) < 2) return(output.df)
  ## logical vector of indices of rows to keep
  idx <- c(TRUE, logical(nrow(output.df)-1))
  ## keeps track of the previous row
  prev.row <- 1
  prev.act <- output.df[prev.row,"action"]
  for (current.row in seq_len(nrow(output.df))[-1]) {
    ## access output.df only once per iteration
    current.act <- output.df[current.row,"action"]
    ## checks to see if current row is bad
    ## if so, continue to next row and leave previous row as is
    if ( (prev.act %in% c("initial_buy","buy")) &&
         (current.act == "initial_buy") ) {
      next
    } else if ( (prev.act == "sell") &&
                (current.act %in% c("buy","sell")) ) {
      next
    }
    ## if current row is good, mark it in idx and update previous row
    idx[current.row] <- TRUE
    prev.row <- current.row
    prev.act <- current.act
  }
  return(output.df[idx,])
}
ans2 <- cutrow.fx2(output.df)
checks that answers are the same
identical(ans1,ans2)
## [1] TRUE
#benchmarking
require(microbenchmark)
mb <- microbenchmark(
  ans1 = cutrow.fx(output.df),
  ans2 = cutrow.fx2(output.df),
  times = 50)
print(mb)
# Unit: milliseconds
#   expr       min        lq    median         uq        max
# 1 ans1 9630.1671 9743.1102 9967.6442 10264.7000 12396.5822
# 2 ans2  481.8821  491.6699  500.6126   544.4222   645.9658
plot(mb)
require(ggplot2)
ggplot2::qplot(y = time, data = mb, colour = expr) + ggplot2::scale_y_log10()
Here is some code that is a bit simpler and much faster. It does not loop over all elements, but only loops between matches. It matches forward rather than backward.
First, modify your cutrow.fx function: remove the output.df <<- output.df global assignment on the last line and simply return the result. Then you can run the two functions and compare the results.
cutrow.fx1 <- function(d) {
  len <- length(d[,1])
  o <- logical(len)
  f <- function(a) {
    switch(a,
           initial_buy = c('buy', 'sell'),
           buy         = c('buy', 'sell'),
           sell        = 'initial_buy'
    )
  }
  cur <- 1
  o[cur] <- TRUE
  while (cur < len) {
    nxt <- match(f(d[cur,1]), d[(cur+1):len,1])
    if (all(is.na(nxt))) {
      break
    } else {
      cur <- cur + min(nxt, na.rm=TRUE)
      o[cur] <- TRUE
    }
  }
  d[o,]
}
Show that the results are correct:
identical(cutrow.fx1(output.df), cutrow.fx(output.df))
## [1] TRUE
And it is quite a bit faster. This is due to the partial vectorization of the problem, using match to find the next row to keep, rather than iterating to discard rows.
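To see the match() step in isolation: f() returns the set of legal next actions for the current row, match() finds where each of them first occurs in the remaining rows, and min() picks the nearest one. A small made-up vector for illustration:
remaining <- c("initial_buy", "buy", "buy", "sell")
match(c("buy", "sell"), remaining)   # first position of each legal action
## [1] 2 4
min(match(c("buy", "sell"), remaining), na.rm = TRUE)
## [1] 2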
print(system.time(cutrow.fx(output.df)))
##    user  system elapsed
##   5.688   0.000   5.720
print(system.time(cutrow.fx1(output.df)))
##    user  system elapsed
##   1.050   0.000   1.056
