I have a time series of prices and I am trying to create a trading strategy. I need to stop the trade as soon as the price_cap is met. The data are stored in a database, and I need to create a list that holds all the long positions (a list called long). I have tried the following, but even when the condition is met, the loop keeps storing values in the list:
long = list()
for (i in 5:length(index)) {
  long[i] = data$target_price[i]
  if (data$target_price > price_cap){
    break
  }
  print(long)
}
Any idea?
Thank you so much.
`if` is not vectorized; it expects a logical condition of length 1. So, inside the loop, we need data$target_price[i] instead of the whole column:
for(i in seq_len(nrow(data))) {
  long[[i]] <- data$target_price[i]
  if(data$target_price[i] > price_cap) {
    break
  }
}
long
data used:
long <- list()
data <- data.frame(target_price = c(5, 10, 5, 20))
price_cap <- 6
I think @akrun has already pointed out the issue and shown a clear way to fix it.
Below is another implementation for the same objective:
long <- with(data, as.list(target_price[seq(min(which(target_price > price_cap)))]))
which gives
> long
[[1]]
[1] 5
[[2]]
[1] 10
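If a plain vector is enough, here is a minimal sketch of the same idea using which() directly (my own addition, assuming the same data and price_cap shown above; the name long_vec is mine):
# index of the first price above the cap (NA if the cap is never breached)
first_breach <- which(data$target_price > price_cap)[1]
# keep everything up to and including that first breach
long_vec <- head(data$target_price, ifelse(is.na(first_breach), nrow(data), first_breach))
long_vec
## [1]  5 10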
I'm trying to improve the speed of my code, which optimises a value using 3 variables that have large ranges. The most likely output uses values in the middle of the ranges, so it is wasting time starting from the lowest possible value of each variable. I want to start from the middle value and iterate outwards! The actual problem has thousands of lines with numbers from 150-650. The C, H and O limits will be defined somewhat based on the starting number, but the answer will always be more likely at a central value in the defined range. Is there a way to define the for loop to work outwards like I want? The only, quite shabby, way I can think of is to simply redefine the value within the loop from a vector (e.g. 1=20, 2=21, 3=19, etc.). See my current code below:
set_error <- 2.5
ct <- c(325.00214,325.00952,325.02004,325.02762,325.03535,325.03831,325.04588, 325.05641,325.06402,325.06766,325.07167,325.07454,325.10396)
FormFun <- function(x){
  for(C in 1:40){
    for(H in 1:80){
      for(O in 1:40){
        test_mass = C*12 + H*1.007825 + O*15.9949146 - 1.0072765
        error <- 1000000*abs(test_mass - x)/x
        if(error < set_error){
          result <- paste("C", C, "H", H, "O", O, sep = "")
          return(result)
          break;break;break;break
        }
      }
    }
  }
}
old_t <- Sys.time()
ct2<-lapply(ct,FormFun)
new_t <- Sys.time() - old_t # calculate difference
print(new_t)
Use vectorization and create a closure:
FormFun1_fac <- function(gr) {
  force(gr)  # keep the grid of combinations in the closure's environment
  function(x, set_error){
    test_mass <- with(gr, C*12 + H*1.007825 + O*15.9949146 - 1.0072765)
    error <- 1000000 * abs(test_mass - x) / x
    ind <- which(error < set_error)[1]
    if (is.na(ind)) return(NULL)
    paste0("C", gr[ind, "C"], "H", gr[ind, "H"], "O", gr[ind, "O"])
  }
}
FormFun1 <- FormFun1_fac(expand.grid(C = 1:40, H = 1:80, O = 1:40))
ct21 <- lapply(ct, FormFun1, set_error = set_error)
all.equal(ct2, ct21)
#[1] TRUE
This saves a grid of all combinations of C, H, O in the function environment and calculates the error for all combinations (which is fast in vectorized code). The first combination that passes the test is returned.
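If you also want the "start from the middle and work outwards" behaviour asked for in the question, a sketch (my own addition, not tested on the real data) is to reorder the grid by distance from the midpoint of each range before building the closure, so the first combination passing the error test is the most central one; the midpoints (20.5, 40.5) and the scaling by range width are my own choices:
gr <- expand.grid(C = 1:40, H = 1:80, O = 1:40)
# distance of each row from the centre of its range, scaled by range width
centre_dist <- abs(gr$C - 20.5) / 40 + abs(gr$H - 40.5) / 80 + abs(gr$O - 20.5) / 40
FormFun2 <- FormFun1_fac(gr[order(centre_dist), ])
ct22 <- lapply(ct, FormFun2, set_error = set_error)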
I have been trying to find the percentage price jumps (+/-15% change) in the numbers; when there is a jump, it should give me the corresponding date. When the first jump occurs, the inner for-loop should break out, give me the corresponding data, and store it in the temp.cyc data frame.
The program does not show any error, but it repeats the same number throughout, and when I check the value of test.df it shows NA.
Can anyone help me understand what's going on? I am new to R, so it would be helpful if you can give your answer in detail. Thank you :)
# Defining variables
row = nrow(price.close)
col = ncol(price.close)
# Defining matrices
m <- matrix(0, ncol = 1, nrow = row)
p <- matrix(0, ncol = 5, nrow = row)
# Vector to temporarily store the percentage change
test.df <- vector(mode = "numeric", length = nrow(price.close))
# Data frame to extract the required values
temp.cyc <- as.data.frame(p)
colnames(temp.cyc) <- c("cyc.duration", "Start.date", "End.date", "Start.date.value", "End.date.value")
for (j in 1:row)
{
  for (i in j:row)
  {
    test.df <- (price.close[(i+1), 2] - price.close[j, 2]) / price.close[j, 2]
    if (test.df >= 0.15 | test.df <= -0.15 | is.na(test.df) == TRUE)
    {
      temp.cyc$Start.date.value = price.close[j, 2]
      temp.cyc$End.date.value <- price.close[i, 2]
      temp.cyc$Start.date <- price.close[j, 1]
      temp.cyc$End.date <- price.close[i, 1]
    }
    break
  }
}
It seems to me that you are using this for financial data, e.g. stock prices. If this assumption is right, then I suggest you use packages that already have this functionality. I would suggest the quantmod package.
Here's a short example of how to get the dates when the price moves more than 15% up or down.
library(quantmod)
# create some dummy stock data over 10 days period
# next time I hope you will attach some of your data
stockClose <- c(100,50,75,70,68,100,115,120,130,100)
stockDates <- seq(as.Date("2014-01-01"),length=10,by="+1 days")
stock.xts <- as.xts(stockClose,stockDates)
# calculate the change; check '?Delt' for more info
change <- Delt(stock.xts)
# get only those rows where the price change in either direction is greater than 15%
specialDays <- change[coredata(change) < -0.15 | coredata(change) > 0.15,]
#get dates
justDates <- index(specialDays)
which gives us "2014-01-02", "2014-01-03", "2014-01-06" and "2014-01-10"
In case you want to actually compare all possible combinations of entering and exiting a position, you can use something like this:
library(quantmod)
calculatePeriods <- function(){
  stockClose <- c(100,50,75,70,68,100,115,120,130,100)
  stockDates <- seq(as.Date("2014-01-01"), length = 10, by = "+1 days")
  stock.xts <- as.xts(stockClose, stockDates)
  # you will be adding rows to this data frame
  result <- data.frame()
  for(i in 1:(length(stock.xts)-1)){
    for(j in (i+1):length(stock.xts)){  # the exit (j) must come after the entry (i)
      change <- (coredata(stock.xts[j]) - coredata(stock.xts[i])) / coredata(stock.xts[i])
      if(change < (-0.15) | change > (0.15)){
        row <- data.frame("cyc.duration" = as.numeric(index(stock.xts[j]) - index(stock.xts[i]), units = "days"),
                          "Start.date" = index(stock.xts[i]),
                          "End.date" = index(stock.xts[j]),
                          "Start.date.value" = coredata(stock.xts[i]),
                          "End.date.value" = coredata(stock.xts[j]))
        result <- rbind(result, row)
      }
    }
  }
  return(result)
}
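A quick usage sketch (my own addition) to run the function and inspect the result:
periods <- calculatePeriods()
head(periods)
nrow(periods)  # number of entry/exit pairs with a move larger than 15%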
I am backtesting an investment strategy using R. I have a piece of the script below:
set.seed(1)
output.df <- data.frame(action = sample(c("initial_buy", "sell", "buy"), 10000, replace = TRUE),
                        stringsAsFactors = FALSE)
output.df[,"uid"] <- 1:nrow(output.df)
cutrow.fx <- function(output.df) {
  loop.del <- 2
  while (loop.del <= nrow(output.df)) {
    if ((output.df[loop.del, "action"] == "initial_buy" &
         output.df[loop.del - 1, "action"] == "initial_buy") |
        (output.df[loop.del, "action"] == "sell" &
         output.df[loop.del - 1, "action"] == "sell") |
        (output.df[loop.del, "action"] == "buy" &
         output.df[loop.del - 1, "action"] == "sell") |
        (output.df[loop.del, "action"] == "initial_buy" &
         output.df[loop.del - 1, "action"] == "buy")) {
      output.df <- output.df[-loop.del, ]
    } else {
      loop.del <- loop.del + 1
    }
  }
  output.df <<- output.df
}
print(system.time(cutrow.fx(output.df=output.df)))
The strategy determines: 1) when to start buying a stock; 2) when to add an additional contribution to the stock; and 3) when to sell all the stock. I have a data frame with the price of a stock for the past 10 years. I wrote 3 scripts to indicate on which dates I should buy/sell the stock, combined the 3 results and ordered them.
I need to remove some of the "impossible actions", e.g. I cannot sell the same stock twice without buying new units beforehand, so I used the script above to delete those impossible actions. But the loop is kind of slow.
Any suggestions for speeding it up?
Update 01
I have updated cutrow.fx to the following, but it fails:
cutrow.fx <- function(output.df) {
  output.df[, "action_pre"] <- "NIL"
  output.df[2:nrow(output.df), "action_pre"] <- output.df[1:(nrow(output.df) - 1), "action"]
  while (any(output.df[, "action_pre"] == "initial_buy" & output.df[, "action"] == "initial_buy") |
         any(output.df[, "action_pre"] == "sell" & output.df[, "action"] == "sell") |
         any(output.df[, "action_pre"] == "sell" & output.df[, "action"] == "buy") |
         any(output.df[, "action_pre"] == "buy" & output.df[, "action"] == "initial_buy")) {
    output.df <- output.df[!(output.df[, "action_pre"] == "initial_buy" & output.df[, "action"] == "initial_buy"), ]
    output.df <- output.df[!(output.df[, "action_pre"] == "sell" & output.df[, "action"] == "sell"), ]
    output.df <- output.df[!(output.df[, "action_pre"] == "sell" & output.df[, "action"] == "buy"), ]
    output.df <- output.df[!(output.df[, "action_pre"] == "buy" & output.df[, "action"] == "initial_buy"), ]
    output.df[, "action_pre"] <- "NIL"
    output.df[2:nrow(output.df), "action_pre"] <- output.df[1:(nrow(output.df) - 1), "action"]
  }
  output.df[, "action_pre"] <- NULL
  output.df <<- output.df
}
I used vector comparison, somewhat inspired by John's answer (I say "somewhat" because I'm not sure I understood exactly what he means), with a while-loop to repeat it. But the output is not the same.
Is the for-loop here inevitable?
It looks like all you're doing is checking the last action. This doesn't require a loop at all. All you have to do is shift the vector and do straight vector comparisons. Here's an artificial example.
x <- sample(1:11)
buysell <- sample(c('buy', 'sell'), 11, replace = TRUE)
So, I have 11 samples, x, and whether I've bought or sold them. I want to make a boolean that shows whether I bought or sold the last sample.
bought <- c(NA, buysell[1:10])
which( bought == 'buy' )
Examine the x and buysell variables and you'll see that the results here are the indices of the x items where a buy was made on the prior item.
Also, you might want to check out the function %in%.
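For reference (my own addition), %in% returns a logical vector saying which elements of the left-hand side occur anywhere in the right-hand side, which is handy for collapsing several == comparisons into one:
buysell %in% c('buy')                           # same as buysell == 'buy'
c('buy', 'sell', 'hold') %in% c('buy', 'sell')
## [1]  TRUE  TRUE FALSE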
I tried to do something clever with vectorization, but failed because previous iterations of the loop can change the data relationships for later iterations. So I couldn't lag the data by a set amount and compare lagged to real results.
What I can do is minimize the copying involved. R semantics are copy-on-modify, so when you write a statement like output.df <- output.df[-loop.del,], you copy the entire data structure for each row that is deleted. Instead of changing (and copying) the data frame, I made changes to a logical vector. Some other attempts at speed-up include using scalar and (&&) instead of vectorized and (&), using %in% to make fewer comparisons, and minimizing accesses on output.df.
To compare the two functions I slightly modified the OP's solution so that the original data frame is not overwritten. It looks like this can improve speed by a factor of 10, but it still takes a noticeable amount of time (>0.5 sec). I'd love to see any faster solutions.
OP's solution (slightly modified in return value and without global assign)
cutrow.fx <- function(output.df) {
  loop.del <- 2
  while (loop.del <= nrow(output.df)) {
    if ((output.df[loop.del, "action"] == "initial_buy" &
         output.df[loop.del - 1, "action"] == "initial_buy") |
        (output.df[loop.del, "action"] == "sell" &
         output.df[loop.del - 1, "action"] == "sell") |
        (output.df[loop.del, "action"] == "buy" &
         output.df[loop.del - 1, "action"] == "sell") |
        (output.df[loop.del, "action"] == "initial_buy" &
         output.df[loop.del - 1, "action"] == "buy")) {
      output.df <- output.df[-loop.del, ]
    } else {
      loop.del <- loop.del + 1
    }
  }
  return(output.df)
}
ans1 <- cutrow.fx(output.df)
my solution
cutrow.fx2 <- function(output.df) {
  ## edge case if output.df has too few rows
  if (nrow(output.df) < 2) return(output.df)
  ## logical vector of indices of rows to keep
  idx <- c(TRUE, logical(nrow(output.df) - 1))
  ## keeps track of the previous row
  prev.row <- 1
  prev.act <- output.df[prev.row, "action"]
  for (current.row in seq_len(nrow(output.df))[-1]) {
    ## access output.df only once per iteration
    current.act <- output.df[current.row, "action"]
    ## checks to see if current row is bad
    ## if so, continue to next row and leave previous row as is
    if ( (prev.act %in% c("initial_buy", "buy")) &&
         (current.act == "initial_buy") ) {
      next
    } else if ( (prev.act == "sell") &&
                (current.act %in% c("buy", "sell")) ) {
      next
    }
    ## if current row is good, mark it in idx and update previous row
    idx[current.row] <- TRUE
    prev.row <- current.row
    prev.act <- current.act
  }
  return(output.df[idx, ])
}
ans2 <- cutrow.fx2(output.df)
checks that answers are the same
identical(ans1,ans2)
## [1] TRUE
#benchmarking
require(microbenchmark)
mb <- microbenchmark(
  ans1 = cutrow.fx(output.df),
  ans2 = cutrow.fx2(output.df),
  times = 50)
print(mb)
# Unit: milliseconds
# expr min lq median uq max
# 1 ans1 9630.1671 9743.1102 9967.6442 10264.7000 12396.5822
# 2 ans2 481.8821 491.6699 500.6126 544.4222 645.9658
plot(mb)
require(ggplot2)
ggplot2::qplot(y=time, data=mb, colour=expr) + ggplot2::scale_y_log10()
Here is some code that is a bit simpler and much faster. It does not loop over all elements, but only loops between matches. It matches forward rather than backward.
First, modify your cutrow.fx function: remove the output.df <<- output.df global assignment on the last line and simply return the result. Then you can run the two functions and compare the results.
cutrow.fx1 <- function(d) {
  len <- length(d[, 1])
  o <- logical(len)
  f <- function(a) {
    switch(a,
           initial_buy = c('buy', 'sell'),
           buy = c('buy', 'sell'),
           sell = 'initial_buy'
    )
  }
  cur <- 1
  o[cur] <- TRUE
  while (cur < len) {
    nxt <- match(f(d[cur, 1]), d[(cur + 1):len, 1])
    if (all(is.na(nxt))) {
      break
    } else {
      cur <- cur + min(nxt, na.rm = TRUE)
      o[cur] <- TRUE
    }
  }
  d[o, ]
}
Show that the results are correct:
identical(cutrow.fx1(output.df), cutrow.fx(output.df))
## [1] TRUE
And it is quite a bit faster. This is due to the partial vectorization of the problem, using match to find the next row to keep, rather than iterating to discard rows.
print(system.time(cutrow.fx(output.df)))
## user system elapsed
## 5.688 0.000 5.720
print(system.time(cutrow.fx1(output.df)))
## user system elapsed
## 1.050 0.000 1.056
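To see why match does the work here, a toy illustration (my own example, not from the answer): f() returns the actions allowed after the current row, match() finds the first position of each allowed action in the remaining rows, and the minimum of those positions is the next row to keep.
# after a "sell", the only allowed next action is "initial_buy"
actions <- c("sell", "buy", "buy", "initial_buy", "sell")
match("initial_buy", actions[-1])
## [1] 3    # so the next row to keep is 1 + 3 = 4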
(EDIT: one of the issues here is scale, namely that what works for one row will blow up/crash R on a 200,000 * 50 data frame. For example, strptime must be applied column-wise, not row-wise, to avoid hanging.
I'm looking for working code solutions that you actually ran on 200,000 * 50, including your measured runtime, not just casual "this is easy" remarks. It's easy to get runtimes > 12 hrs if you pick the wrong function. Next, I also asked you to make my zero-time adjustment code faster; the job's not finished till that's done. No one has attempted that so far.)
I want to vectorize and accelerate the following multistep log-time conversion, with millisecond accuracy, involving converting the strsplit() output to a single numeric, followed by subtraction and then log() on a large data frame (200,000 rows * 300 cols; the other (non-time) columns are omitted here).
Code below.
As well as making it vectorized and fast, an extra problem is that I'm not sure how best to represent the (higher-dimensional) intermediate values at each step (e.g. as a list from strsplit, a matrix, a vector). I already tried apply, sapply, lapply, vapply, plyr::maply(), ... but the incompatibility of the intermediate format(s) keeps messing me up...
Each row has 50 columns time1..time50 (chr, format="HH:MM:SS.sss") representing time as string in millisecond resolution. I need millisecond accuracy.
Within each row, columns time1..time50 are in non-decreasing order, and I want to convert them into the log of the time before time50. The conversion function parse_hhmmsecms() is at the bottom and needs serious vectorization and speeding up; you can see alternative versions commented out. What I have figured out so far: strsplit() is faster than (multiple) substr() calls; I then convert somehow to a list of three numerics (hh, mm, sec.ms), then convert to a vector, assuming the next step should be to vector-multiply with %*% c(3600, 60, 1) to convert to numeric seconds.
Here is pseudocode of what I do for each row, and each time-string; full code is at bottom:
for each row in dataframe { # vectorize this, loop_apply(), or whatever...
#for each time-column index i ('time1'..'time50') { # vectorize this...
hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
# Main computation
xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))
# Minor task: fix up all the 'zero-time' events to be evenly spaced between -3..0
#}
}
So there are five subproblems involved:
How to vectorize handling the list returned by strsplit()? Since it returns a list of 3 items per string, when passed a 2D data frame or a 1D row of time-strings we get a 3D or 2D intermediate object (do we internally use a list of lists? a matrix of lists? an array of lists?).
How to vectorize the entire function parse_hhmmsecms()?
Then do the subtraction and log
Vectorize the zero-time fixup code as well (this is now the slowest part by far)
How to accelerate steps 1-4?
Code snippet below using ten example columns time41..50 (use random_hhmmsecms() if you want a bigger sample)
I did my best to follow these recommendations; this is as reproducible as I can get it in six hours' work:
# Each of 200,000 rows has 50 time strings (chr) like this...
xx <- structure(list(time41 = c("08:00:41.465", "08:00:50.573", "08:00:50.684"
), time42 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"),
time43 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"
), time44 = c("08:00:41.465", "08:00:50.664", "08:00:50.759"
), time45 = c("08:00:41.465", "08:00:50.684", "08:00:50.759"
), time46 = c("08:00:42.496", "08:00:50.684", "08:00:50.759"
), time47 = c("08:00:42.564", "08:00:50.759", "08:00:51.373"
), time48 = c("08:00:48.370", "08:00:50.759", "08:00:51.373"
), time49 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
), time50 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
)), .Names = c("time41", "time42", "time43", "time44", "time45",
"time46", "time47", "time48", "time49", "time50"), row.names = 3:5, class = "data.frame")
# Handle millisecond timing and time conversion
options('digits.secs'=3)
# Parse "HH:MM:SS.sss" timestring into (numeric) number of seconds (Very slow)
parse_hhmmsecms <- function(t) {
  as.numeric(substr(t, 1, 2))*3600 + as.numeric(substr(t, 4, 5))*60 + as.numeric(substr(t, 7, 12)) # WORKS, V SLOW
  #c(3600,60,1) %*% sapply((strsplit(t[1,]$time1, ':')), as.numeric) # SLOW, NOT VECTOR
  #as.vector(as.numeric(unlist(strsplit(t,':',fixed=TRUE)))) %*% c(3600,60,1) # WANT TO VECTORIZE THIS
}
random_hhmmsecms <- function(n = 1, min = 8*3600, max = 16*3600) {
  # Generate n random "HH:MM:SS.sss" strings between min and max (8am-4pm)
  xx <- runif(n, min, max)
  ss <- xx %% 60
  mm <- (xx %/% 60) %% 60
  hh <- xx %/% 3600
  sprintf("%02d:%02d:%06.3f", hh, mm, ss)
}
xx$logtime45 <- xx$logtime44 <- xx$logtime43 <- xx$logtime42 <- xx$logtime41 <- NA
xx$logtime50 <- xx$logtime49 <- xx$logtime48 <- xx$logtime47 <- xx$logtime46 <- NA
# (we pass index vectors as the dataframe column ordering may change)
Ctime <- which(colnames(xx)=='time41') : which(colnames(xx)=='time50')
Clogtime <- which(colnames(xx)=='logtime41') : which(colnames(xx)=='logtime50')
for (i in seq_len(nrow(xx))) {
  #if (i %% 100 == 0) { print(paste('... row', i)) }
  hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
  xx[i, Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i, Ctime])))
  # Now fix up all the 'zero-time' events to be evenly spaced between -3..0
  Czerotime.p <- which(xx[i, Clogtime] == Inf | xx[i, Clogtime] > -1e-9)
  xx[i, Clogtime[Czerotime.p]] <- seq(-3, 0, length.out = length(Czerotime.p))
}
You may be overcomplicating things.
Start with base classes which do milliseconds very well (and on appropriate operating systems even microseconds) but note that
you need to set options("digits.secs"=6) (that's the max that can be displayed) to see them displayed
you need an additional parsing character for strptime et al
all of which is in the docs, and countless examples here on SO.
Quick examples:
R> someTime <- ISOdatetime(2011, 12, 27, 2, 3, 4.567)
R> someTime
[1] "2011-12-27 02:03:04.567 CST"
R> now <- Sys.time()
R> now
[1] "2011-12-27 16:48:20.247298 CST" # microsecond display on Linux
R>
R> txt <- "2001-02-03 04:05:06.789123"
R> strptime(txt, "%Y-%m-%d %H:%M:%OS") # note the %OS for sub-seconds
[1] "2001-02-03 04:05:06.789123"
R>
And key functions such as strptime or as.POSIXct are all vectorised and you can throw entire columns at them.
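Applied to the question's example, a minimal column-wise sketch (my own addition, assuming the xx data frame and the Ctime/Clogtime index vectors defined above; not benchmarked at 200,000 * 50, and the zero-time fix-up is left out):
# as.POSIXct is vectorised, so a whole column is parsed in one call
to_secs <- function(v) as.numeric(as.POSIXct(v, format = "%H:%M:%OS", tz = "UTC"))

secs50 <- to_secs(xx$time50)
for (j in seq_along(Ctime)) {   # loop over the time columns, not the rows
  xx[[Clogtime[j]]] <- -10 * log10(1000 * (secs50 - to_secs(xx[[Ctime[j]]])))
}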