speeding up "for-loop" for deleting rows matching criteria - r

I am backtesting some investment strategy using R, I have a piece of script below:
set.seed(1)
output.df <- data.frame(action=sample(c("initial_buy","sell","buy"),
10000,replace=TRUE),stringsAsFactors=FALSE)
output.df[,"uid"] <- 1:nrow(output.df)
cutrow.fx <- function(output.df) {
loop.del <- 2
while (loop.del <= nrow(output.df)) {
if ((output.df[loop.del,"action"]=="initial_buy" &
output.df[loop.del-1,"action"]=="initial_buy")|
(output.df[loop.del,"action"]=="sell" &
output.df[loop.del-1,"action"]=="sell")|
(output.df[loop.del,"action"]=="buy" &
output.df[loop.del-1,"action"]=="sell")|
(output.df[loop.del,"action"]=="initial_buy" &
output.df[loop.del-1,"action"]=="buy")){
output.df <- output.df[-loop.del,]
} else {
loop.del <- loop.del + 1
}
}
output.df<<-output.df
}
print(system.time(cutrow.fx(output.df=output.df)))
The strategy will determine: 1) when to start buying a stock; 2) when to add additional contribution to the stock; and 3) when to sell all the stock. I have a dataframe with price of a stock for the past 10 years. I wrote 3 scripts to indicate which date should I buy/sell the stock, combine the 3 results and order them.
I need to remove some of the "impossible action", e.g. I cannot sell the same stock twice without buying new units beforehand, so I used the script above to delete those impossible action. But the for loop is kind of slow.
Any suggestion for speeding it up?
Update 01
I have updated the cutrow.fx into the following but fail:
cutrow.fx <- function(output.df) {
output.df[,"action_pre"] <- "NIL"
output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]
while (any(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy")|
any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell")|
any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy")|
any(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy")) {
output.df <- output.df[!(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy"),]
output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell"),]
output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy"),]
output.df <- output.df[!(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy"),]
output.df[,"action_pre"] <- "NIL"
output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]
}
output.df[,"action_pre"] <- NULL
output.df<<-output.df
}
I used the vector comparison as somehow inspired (I used somehow as I'm not sure if I get exact what he means in the answer) by John, use a while-loop to repeat. But the output is not the same.
Is the for-loop here inevitable?

It looks like all you're doing is checking the last action. This doesn't require a loop at all. All you have to do is shift the vector and do straight vector comparisons. Here's an artificial example.
x <- sample(1:11)
buysell <- sample(c('buy', 'sell'), 11, replace = TRUE)
So, I have 11 samples, x, and whether I've bought or sold them. I want to make a boolean that shows whether I bought or sold the last sample.
bought <- c(NA, buysell[1:10])
which( bought == 'buy' )
Examine the x and buysell variables and you'll see the results here are the index of the x items where a buy was made on the prior item.
Also, you might want to check out he function %in%.

I tried to do something clever with vectorization, but failed because previous iterations of the loop can change the data relationships for later iterations through. So I couldn't lag the data by a set amount and compare lagged to real results.
What I can do is minimize the copying operation involved. R is assign-by-copy, so when you write a statement like output.df <- output.df[-loop.del,], you are copying the entire data structure for each row that is deleted. Instead of changing (and copying) the data frame, I made changes to a logical vector. Some other attempts at speed-up include using logical and (&&) instead of bitwise and (&), using %in% to make fewer comparisons, and minimizing accesses on output.df.
To compare the two functions I slightly modified OP solution such that the original data frame was not overwritten. It looks like this can improve speeds by a factor of 10, but it still takes a noticeable about of time (>0.5 sec). I'd love to see any faster solutions.
OP's solution (slightly modified in return value and without global assign)
cutrow.fx <- function(output.df) {
loop.del <- 2
while (loop.del <= nrow(output.df)) {
if ((output.df[loop.del,"action"]=="initial_buy" &
output.df[loop.del-1,"action"]=="initial_buy")|
(output.df[loop.del,"action"]=="sell" &
output.df[loop.del-1,"action"]=="sell")|
(output.df[loop.del,"action"]=="buy" &
output.df[loop.del-1,"action"]=="sell")|
(output.df[loop.del,"action"]=="initial_buy" &
output.df[loop.del-1,"action"]=="buy")){
output.df <- output.df[-loop.del,]
} else {
loop.del <- loop.del + 1
}
}
return(output.df)
}
ans1 <- cutrow.fx(output.df)
my solution
cutrow.fx2 <- function(output.df) {
##edge case if output.df has too few rows
if (nrow(output.df) < 2) return(output.df)
##logical vector of indices of rows to keep
idx <- c(TRUE,logical(nrow(output.df)-1))
##keeps track of the previous row
prev.row <- 1
prev.act <- output.df[prev.row,"action"]
for (current.row in seq_len(nrow(output.df))[-1]) {
##access output.df only once per iteration
current.act <- output.df[current.row,"action"]
##checks to see if current row is bad
##if so, continue to next row and leave previous row as is
if ( (prev.act %in% c("initial_buy","buy")) &&
(current.act == "initial_buy") ) {
next
} else if ( (prev.act == "sell") &&
(current.act %in% c("buy","sell")) ) {
next
}
##if current row is good, mark it in idx and update previous row
idx[current.row] <- TRUE
prev.row <- current.row
prev.act <- current.act
}
return(output.df[idx,])
}
ans2 <- cutrow.fx2(output.df)
checks that answers are the same
identical(ans1,ans2)
## [1] TRUE
#benchmarking
require(microbenchmark)
mb <- microbenchmark(
ans1=cutrow.fx(output.df)
,ans2=cutrow.fx2(output.df),times=50)
print(mb)
# Unit: milliseconds
# expr min lq median uq max
# 1 ans1 9630.1671 9743.1102 9967.6442 10264.7000 12396.5822
# 2 ans2 481.8821 491.6699 500.6126 544.4222 645.9658
plot(mb)
require(ggplot2)
ggplot2::qplot(y=time, data=mb, colour=expr) + ggplot2::scale_y_log10()

Here is some code that is a bit simpler and much faster. It does not loop over all elements, but only loops between matches. It matches forward rather than backward.
First, modify your cutrow.fx function. Remove the <<-output.df on the last line, and simply return the result. Then you can run two functions and compare the results.
cutrow.fx1 <- function(d) {
len <- length(d[,1])
o <- logical(len)
f <- function(a) {
switch(a,
initial_buy=c('buy', 'sell'),
buy=c('buy', 'sell'),
sell='initial_buy'
)
}
cur <- 1
o[cur] <- TRUE
while (cur < len) {
nxt <- match(f(d[cur,1]), d[(cur+1):len,1])
if (all(is.na(nxt))) {
break
} else {
cur <- cur + min(nxt, na.rm=TRUE);
o[cur] <- TRUE
}
}
d[o,]
}
Show that the results are correct:
identical(cutrow.fx1(output.df), cutrow.fx(output.df))
## [1] TRUE
And it is quite a bit faster. This is due to the partial vectorization of the problem, using match to find the next row to keep, rather than iterating to discard rows.
print(system.time(cutrow.fx(output.df)))
## user system elapsed
## 5.688 0.000 5.720
print(system.time(cutrow.fx1(output.df)))
## user system elapsed
## 1.050 0.000 1.056

Related

find_element in dataframe in R

I am new to R. I wanted to define a R function, find_element, that takes as its inputs a list and a value of any type, and returns the value of the matched element in the input list that matches the value. thanks for your help
find_element <- function(arr, val){
count = 0
for(i in arr){
if (i == val){
print(count)
} else
count = count + 1
print ("No Match")
}
}
e.g.
arr <- 1:10
find_element(arr, 10)
# 10
find_element(arr, 12)
# NULL
Just for educational purposes, please, try (although this is not recommended practice in R!):
find_element <- function(arr, val) {
count = 1
for (i in arr) {
if (i == val) {
return(count)
} else
count = count + 1
}
return("No Match")
}
This will yield
arr <- 1:10
find_element(arr, 10)
#[1] 10
find_element(arr, 12)
#[1] "No Match"
Please, note
In R, elements of vectors, etc are numbered starting with 1
You have to use return instead of print to indicate the return value of a function (well, I know there's a short cut - but it's for the purpose of education, here)
The final return must come after the for loop.
Built-in function
Also for educational purposes, please, note that Sotos already has shown the R way in his comment:
which(arr == 10)
#[1] 10
which(arr == 12)
#integer(0)
In R, it's almost always better to use the well-documented built-in functions or those from packages. And, yes, try to avoid for loops in R.
Learnig R online
As pointed out in the (now deleted) answer of engAnt there are several ressources to learn R. https://www.rstudio.com/online-learning/#R lists a number of resources.

R for-loop iterating from central value out to extremes

I'm trying to improve the speed of my code, which is trying to optimise a value using 3 variables which have large ranges. The most likely output uses values in the middle of the ranges, so it is wasting time starting from the lowest possible value of each variable. I want to start from the middle value and iterate out! The actual problem has thousands of lines with numbers from 150-650. C,H and O limits will be defined somewhat based on the starting number, but will always be more likely at a central value in the defined range. Is there a way to define the for loop to work outwards like I want? The only, quite shabby, way I can think of is to simply redefine the value within the loop from a vector (e.g. 1=20, 2=21, 3=19, etc). See current code below:
set_error<-2.5
ct<-c(325.00214,325.00952,325.02004,325.02762,325.03535,325.03831,325.04588, 325.05641,325.06402,325.06766,325.07167,325.07454,325.10396)
FormFun<-function(x){
for(C in 1:40){
for(H in 1:80){
for(O in 1:40){
test_mass=C*12+H*1.007825+O*15.9949146-1.0072765
error<-1000000*abs(test_mass-x)/x
if(error<set_error){
result<-paste("C",C,"H",H,"O",O,sep ="")
return(result)
break;break;break;break
}
}
}
}
}
old_t <- Sys.time()
ct2<-lapply(ct,FormFun)
new_t <- Sys.time() - old_t # calculate difference
print(new_t)
Use vectorization and create a closure:
FormFun1_fac <- function(gr) {
gr <<- gr
function(x, set_error){
test_mass <- with(gr, C*12+H*1.007825+O*15.9949146-1.0072765)
error <- 1000000 * abs(test_mass - x) / x
ind <- which(error < set_error)[1]
if (is.na(ind)) return(NULL)
paste0("C", gr[ind, "C"],"H", gr[ind, "H"],"O", gr[ind, "O"])
}
}
FormFun1 <- FormFun1_fac(expand.grid(C = 1:40, H = 1:80, O = 1:40))
ct21 <- lapply(ct, FormFun1, set_error = set_error)
all.equal(ct2, ct21)
#[1] TRUE
This saves a grid of all combinations of C, H, O in the function environment and calculates the error for all combinations (which is fast in vectorized code). The first combination that passes the test is returned.

If else statement to delete repeated values

I'm a novice R user and have created a small script that is doing some trigonometry with movement data. I need to add a final column that deletes repeated values from the column before it.
I've tried adding an if else statement that seems to work when isolated, but keep having errors when it is put into the for loop. I'd appreciate any advice.
# trig loop
list.df <- vector("list", max(Sp_test$ID))
names1 <- c(1:max(Sp_test$ID))
for(i in 1:max(Sp_test$ID)) {
if(i %in% unique(Sp_test$ID)) {
idata <- subset(Sp_test, ID == i)
idata$originx <- idata[1,3]
idata$originy <- idata[1,4]
idata$deltax <- idata[,"UTME"]-idata[,"originx"]
idata$deltay <- idata[,"UTMN"]-idata[,"originy"]
idata$length <- sqrt((idata[,"deltax"])^2+(idata[,"deltay"]^2))
idata$arad <- atan2(idata[,"deltay"],idata[,"deltax"])
idata$xnorm <- idata[,"deltax"]/idata[,"length"]
idata$ynorm <- idata[,"deltay"]/idata[,"length"]
sumy <- sum(idata$ynorm, na.rm=TRUE)
sumx <- sum(idata$xnorm, na.rm=TRUE)
idata$vecsum <- atan2(sumy,sumx)
idata$width <- idata$length*sin(idata$arad-idata$vecsum)
# need if else statement excluding a repeat from the position just before it
list.df[[i]] <- idata
names1[i] <- i
} }
# this works alone, I think the problem is when it gets to the first of the dataset and there is not one before it
if (idata$width[j]==idata$width[j-1]) {
print("NA")
} else {
print(idata$width[j])
}
I think you want to use the function diff for this. diff(idata$width) will give the differences between successive values of idata$width. Then
idata$width[c(FALSE, diff(idata$width) == 0)] <- NA
I think does what you want. The initial FALSE is since there is no value corresponding to the first element (since as you rightly noted, the first element doesn't have an element before it).

Direct update (replace) of sparse data frame is slow and inefficient

I'm attempting to read in a few hundred-thousand JSON files and eventually get them into a dplyr object. But the JSON files are not simple key-value parse and they require a lot of pre-processing. The preprocessing is coded and does fairly good for efficiency. But the challenge I am having is loading each record into a single object (data.table or dplyr object) efficiently.
This is very sparse data, I'll have over 2000 variables that will mostly be missing. Each record will have maybe a hundred variables set. The variables will be a mix of character, logical and numeric, I do know the mode of each variable.
I thought the best way to avoid R copying the object for every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this in a data frame is extremely slow, moving to data table or dplyr object is much better but still hoping to reduce it to minutes instead of hours. See my example below:
timeMe <- function() {
set.seed(1)
names = paste0("A", seq(1:1200))
# try with a data frame
# outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
# try with data table
outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
for(i in seq(100)) {
# generate 100 columns (real data is in json)
sparse.cols <- sample(1200, 100)
# Each record is coming in as a list
# Each column is either a character, logical, or numeric
sparse.val <- lapply(sparse.cols, function(i) {
if(i < 401) { # logical
sample(c(TRUE, FALSE), 1)
} else if (i < 801) { # numeric
sample(seq(10), 1)
} else { # character
sample(LETTERS, 1)
}
}) # now we have a list with values to populate
names(sparse.val) <- paste0("A", sparse.cols)
# and here is the challenge and what takes a long time.
# want to assign the ith row and the named column with each value
for(x in names(sparse.val)) {
val=sparse.val[[x]]
# this is where the bottleneck is.
# for data frame
# outdf[i, x] <- val
# for data table
outdf[i, x:=val]
}
}
outdf
}
I thought the mode of each column might have been set and reset with each update, but I have also tried this by pre-setting each column type and this didn't help.
For me, running this example with a data.frame (commented out above) takes around 22 seconds, converting to a data.table is 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data table here.
I follow your code except the part where you construct sparse.val. There are minor errors in the way you assign columns. Don't forget to check that the answer is right in trying to optimise :).
First, the creation of data.table:
Since you say that you already know the type of the columns, it's important to generate the correct type up front. Else, when you do: DT[, LHS := RHS] and RHS type is not equal to LHS, RHS will be coerced to the type of LHS. In your case, all your numeric and character values will be converted to logical, as all columns are logical type. This is not what you want.
Creating a matrix won't help therefore (all columns will be of the same type) + it's also slow. Instead, I'd do it like this:
rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_along(cols), function(i) {
if (i < 401L) rep(NA, rows)
else if (i >= 402L & i < 801L) rep(NA_real_, rows)
else rep(NA_character_, rows)
}))
Now we've the right type set. Next, I think it should be i >= 402L & i < 801L. Otherwise, you're assigning the first 401 columns as logical and then the first 801 columns as numeric, which, given that you know the type of the columns upfront, doesn't make much sense, right?
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time consuming for-loop:
for(x in names(sparse.val)) {
val=sparse.val[[x]]
outdf[i, x:=val]
}
is not actually doing what you think it's doing. It's not assigning the values from val to the name assigned to x. Instead it's (over)writing (each time) to a column named x. Check your output.
This is not a part of optimisation. This is just to let you know what you're actually wanting to do here.
for(x in names(sparse.val)) {
val=sparse.val[[x]]
outdf[i, (x) := val]
}
Note the ( around x. Now, it'll be evaluated and the value contained in x will be the column to which val's value will be assigned to. It's a bit subtle, I understand. But, this is necessary because it allows for the possibility to create column x as DT[, x := val] where you actually want val to be assigned to x.
Coming back to the optimisation, the good news is, your time consuming for-loop is simply:
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
Putting it all together:
Your final function looks like this:
timeMe2 <- function() {
set.seed(1L)
rows = 100L
cols = 1200L
outdf <- as.data.table(lapply(seq_len(cols), function(i) {
if (i < 401L) rep(NA, rows)
else if (i >= 402L & i < 801L) rep(NA_real_, rows)
else sample(rep(NA_character_, rows))
}))
setnames(outdf, paste0("A", seq(1:1200)))
for(i in seq(100)) {
sparse.cols <- sample(1200L, 100L)
sparse.val <- lapply(sparse.cols, function(i) {
if(i < 401L) sample(c(TRUE, FALSE), 1)
else if (i >= 402 & i < 801L) sample(seq(10), 1)
else sample(LETTERS, 1)
})
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
}
outdf
}
By doing this, your solution takes 9.84 seconds on my system whereas the function above takes 0.34 seconds, which is ~29x improvement. I think this is the result you're looking for. Please verify it.
HTH

R: Backtesting a trading strategy. Beginners to quantmod and R

I'm very new to R and trying to backtest a strategy I've programmed already in WealthLab.
Several stuff I don't understand (and it doesn't work obviously:)
I don't get the Close Prices nicely into a vector...or some kind of vector but it starts with structure and I don't really understand what this function does. Thats why my series[,1] call probably doesn't work.
n <- nrow(series) doesn't work either, but I need that for the Loop
So I guess if I get These 2 questions answered my strategy should work...I'm very thankful for any help..R seems quite complicated even with programming experience in other languages
#rm(list = ls(all = TRUE))
#import data, default is yahoo
require(quantmod)
series <- getSymbols('AAPL',from='2013-01-01')
#generate HLOC series
close <- Cl(AAPL)
open <- Op(AAPL)
low <-Lo(AAPL)
high <- Hi(AAPL)
#setting parameters
lookback <- 24 #24 days ago
startMoney <- 10000
#Empty our time series for position and returns
f <- function(x) 0 * x
position <- apply(series[,1],FUN=f)
colnames(position)="long_short"
returns <- apply(series[,1],FUN=f)
colnames(returns)="Returns"
trades = returns
colnames(trades)="Trades"
amount = returns
colnames(amount) = "DollarAmount"
amt[seq(1,lookback)] = startMoney
#Calculate all the necessary values in a loop with our trading strategy
n <- nrow(series)
for(i in seq(lookback+1,n)){
#get the return
if(position[i-1] == 1){
#we were long
returns[i] = close[i]/close[i-1] - 1
} else if(position[i-1] == -1){
#we were short
returns[i] = close[i-1]/close[i] - 1
}
#long/short position
if(open[i-lookback]<open[i] && low[i-1] < open[i]){
#go long
position[i] = 1
} else if(open[i-lookback]>open[i] && high[i-1] > open[i]){
# go short
position[i] = -1
} else {
position[i] = position[i-1]
}
#mark a trade if we did one
if(position[i] != position[i-1]) trades[i] = 1
#Calculate the dollar amount
amount[i] = amount[i-1]*exp(returns[i])
if(trades[i]) amount[i] = amount[i] - 2
}
Starting with the second question
> s <- getSymbols('SPY')
> nrow(s)
NULL
> class(s)
[1] "character"
> s.data <- get(s)
> class(s.data)
[1] "xts" "zoo"
> nrow(s.data)
[1] 1635
So if you want to work on the actual xts object you need to use get.
About your first question - i don't think you really need to pull the data as a vector - the xts object is an array indexed by date and it's easy to work with.
If you still want to get the data you can use
closing.prices <- coredata(Cl(s))
Now, to get you started with simple back testing of strategies i will suggest working in the following steps
define your strategy.
2. create an array or add a column to your xts object that will represent your position for each day. 1 for long, 0 for no position and -1 for short (later on you can play with the number for leverage).
3. multiply each days return with the position and you'll get your strategy return vector.
4. examine the results - my recommendation is PerformanceAnalytics.
simple strategy - buy when close over SMA20 , sell under
library(quantmod)
library(PerformanceAnalytics)
s <- get(getSymbols('SPY'))["2012::"]
s$sma20 <- SMA(Cl(s) , 20)
s$position <- ifelse(Cl(s) > s$sma20 , 1 , -1)
myReturn <- lag(s$position) * dailyReturn(s)
charts.PerformanceSummary(cbind(dailyReturn(s),myReturn))
and this is what you'll get

Resources