Scenario (using quantstrat, blotter and PortfolioAnalytics)
I have 10k initial equity.
I have a strategy that I want to backtest over a 3,000-symbol universe (stocks).
Say the strategy is a simple MA crossover: every time I get a buy crossover I buy 10k worth of stock, and I close the position on the sell crossover.
For backtesting purposes the strategy can trade without any portfolio restriction, so I may be holding 100+ positions at any point in time; the initial equity shouldn't be considered.
I want to know the AVERAGE return of this strategy over all trades.
In reality, if I only had 10k I would only be able to be in one trade at a time, but I would like to know statistically what the average return would be.
I then want to compare this with the stock index benchmark.
Do I SUM or MEAN the return stream of each symbol?
Is it the return of the portfolio, and does that take the initial equity into account? I don't want the return to be expressed as a percentage of the initial equity, or to depend on how many symbols are trading.
I'll add an example strategy when I get time, but the solution to the problem is:
# Get the per-instrument portfolio returns
instRets <- PortfReturns(account.st)

# For each column, replace zero returns with NA: when the values are averaged,
# you don't want the 0's included in the calculation. If there are no signals,
# you would invest the money elsewhere rather than leave it lying around, so
# only calculate the returns when the strategy is ACTIVE.
for (i in 1:ncol(instRets)) {
  instRets[, i][instRets[, i] == 0] <- NA
}

# This gives the average return while the strategy is active: if there are 100
# trades on, you want the average return during that period.
portfRets <- xts(rowMeans(instRets, na.rm = TRUE), order.by = index(instRets))
portfRets <- portfRets[!is.na(portfRets)]
Now you can compare the strategy with a benchmark, SPY for example. If the strategy has alpha, you can use a balancing rule to apply funds to the strategy when signals arise, or stay invested in the index when there are no signals.
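For example, a minimal comparison sketch (assuming quantmod and PerformanceAnalytics are available; the SPY data source and the use of adjusted daily returns are my own choices, not part of the original answer):

# Pull SPY over the same window as the strategy returns computed above
library(quantmod)
library(PerformanceAnalytics)
getSymbols("SPY", from = start(portfRets), to = end(portfRets))
spyRets <- dailyReturn(Ad(SPY))

# Cumulative performance and drawdowns of strategy vs. benchmark
charts.PerformanceSummary(merge(portfRets, spyRets),
                          main = "Strategy (active periods) vs SPY")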
As far as I know, the returns analysis built into blotter uses the initial equity to work out returns, so invest the same amount in each trade as you have in initial equity: 10k initial equity, 10k per trade.
Related
I have two vectors, one with different interest rates on a monthly basis, and one with cash flows:
interest<-c(0,.0448,.0452,.0428,.0428,.0452,.051,.0475,.04997)
cash_flows <- c(44273.81, 44176.68, 66849.14, 123792.30, 101141.25, 190894.12, 82724.14, 257075.63, 176920.29, 482068.00, 429030.01, 348291.50)
The first element of the interest vector is 0 because that is my present-value basis: I want to discount all values to that point. All the other values are the interest rates of the other months, which I intend to discount back to that first period. The cash flows are also on a monthly basis. The R procedure I am writing is the following:
discount_vector <- (1 + interest)^-(1:length(interest) - 1)
discount_vector * cash_flows
And to get the total investment amount I just call the following:
sum(discount_vector * cash_flows)
Are my conceptual reasoning and code correct?
Thanks for any attention and support.
Present value of formula using discount factor
i <- c(0, .0448, .0448, .0452, .0428, .0428, .0428, .0452, .0451, .0475, .0475, .04997)
cash_flows <- c(44273.81, 44176.68, 66849.14, 123792.30, 101141.25, 190894.12, 82724.14, 257075.63, 176920.29, 482068.00, 429030.01, 348291.50)
discount_vector <- (1 + i)^-(1:length(i) - 1)
discount_vector * cash_flows
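The exponent here relies on R's operator precedence (: binds tighter than binary minus, so 1:length(i) - 1 gives 0, 1, ..., 11 and the first cash flow is left undiscounted). Written with explicit parentheses, a sketch of the same calculation and its total is:

# Exponents run 0, 1, ..., length(i) - 1, so the first cash flow is not discounted
discount_vector <- (1 + i)^-(0:(length(i) - 1))
present_value <- sum(discount_vector * cash_flows)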
My goal is to understand whether clicks (say, on a website) or calls (say, over the phone to new contacts) have a greater impact on new sign-ups over a 90-week period in the USA. I thought this problem lent itself well to PLM.
My data consist of 210 DMA regions (DMA), 90 weeks (WEEK), new sign-ups (NEW) (outcome) and two predictors (CALLS and CLICKS), so for example:
plm_model <- plm(NEW ~ CALLS + CLICKS, data=df, index=c("DMA", "WEEK"), model="within")
First, does one standardize the predictors and outcome across all panels (so in total) or within each panel (as in, standardizing separately for each DMA)? The models differ when I do both, which they should, but I can't quite find any documentation on which is "more correct" or why one would do one over the other.
Second, when I look at the data file across all DMA regions but by time, there is a lag of 4 periods on CLICKS x NEW but of 20 periods on CALLS x NEW. I arrived at those using:
which.max(sapply(1:50, function(i) cor(df$NEW, lag(df$CLICKS, i), use = "complete")))
which.max(sapply(1:50, function(i) cor(df$NEW, lag(df$CALLS, i), use = "complete")))
This does make sense for the data: there is more "NEW" sooner after "CLICKS", but "CALLS" take more time to pay off.
I thought the logical next step would be to lag according to the correlation and then standardize based on the data available now (as in, fewer observations now), then I was stuck on whether to do that within or across panels (as mentioned above), and then I was going to do something like:
plm_model <- plm(NEW ~ CALLS_20_standardized + CLICKS_4_standardized, data=df, index=c("DMA", "WEEK"), model="within")
However, I got a little wrapped up in whether that is the correct step to finish this exercise off. Any insight here would be appreciated.
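For the mechanics (setting aside which standardization is "more correct"), a sketch of how the lagged, standardized predictors could be built is below; the dplyr approach is my assumption, and the new column names simply follow the model call above:

library(dplyr)

# "Within each panel" version: lag and standardize separately per DMA
df <- df %>%
  group_by(DMA) %>%
  arrange(WEEK, .by_group = TRUE) %>%
  mutate(CALLS_20_standardized = as.numeric(scale(lag(CALLS, 20))),
         CLICKS_4_standardized = as.numeric(scale(lag(CLICKS, 4)))) %>%
  ungroup()

# For the "across all panels" version, drop group_by(DMA) so scale() uses the
# grand mean and standard deviation instead of the per-DMA ones.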
I am new to R, and trying to get a handle on things for a school project.
The idea is to model a simple and hypothetical electricity generation/storage system which uses only solar panels and battery storage to meet energy demand. The aim is, given a predetermined storage capacity, to select the least amount of solar paneling that ensures that demand will be satisfied on every day of the year. I have a full-year of daily data - solar insolation figures that determine how productive panels will be, day-time electricity demand, and night-time electricity demand. Surplus generation during the day is stored in batteries, up to the predetermined limit, and then discharged at night to meet demand.
I have written a simple R program, f(x), where x is the amount of solar paneling that is installed. Like the battery-storage parameter, it is invariant over the entire year.
After creating new variables for the total power output per day and total excess power produced per day and adding these as columns 4 and 5 to the original data frame, the program creates two new vectors "batterystartvector" and "batterymidvector," which respectively indicate the battery level at the start of each day and at the midpoint, between day and night.
The program loops over each day (row) in the data frame, and:
(1) Credits the excess power produced that day (column 5) to the storage system, up to the predetermined limit (7500 megawatt-hours in my example); this level is stored in "batterymidvector."
(2) Subtracts the night demand (column 3) from the total just registered in "batterymidvector" to determine how much energy there will be in storage at the start of the next day, and stores this figure in "batterystartvector."
(3) Repeats for all 365 days.
Ultimately, my aim is to use an optimization package, such as DEoptimr, to determine the lowest value for x that ensures that demand is satisfied on all days - that is that no values in either "batterymidvector" or "batterystartvector" are ever negative.
Since every entry in the two battery vectors is dependent on prior entries, I cannot figure out how to write a program that does not use a 'for' loop. But surely there must be a simpler and less clunky way.
Here is the code:
library(DEoptimR)

setwd("C:/Users/User/Desktop/Thesis Stuffs/R Programs")
data <- read.csv("optdata1.csv", header = TRUE)

# x is PV installed and y is pumped-storage capacity
# default is that the system starts with a completely full reservoir (7500 MWh)
f <- function(x) {
  data$output     <<- (data$insolation * x) / 1000
  data$daybalance <<- data$output - data$day

  # battery level at the start of each day (366 entries so day 365 can write day 366)
  batterystartvector    <<- vector(mode = "numeric", length = 366)
  batterystartvector[1] <<- 7500
  # battery level at the day/night midpoint
  batterymidvector      <<- vector(mode = "numeric", length = 365)

  for (i in 1:nrow(data)) {
    # charging up, capped at the storage limit
    batterymidvector[i] <<- min(batterystartvector[i] + data[i, 5], 7500)
    # depleting overnight
    batterystartvector[i + 1] <<- batterymidvector[i] - data[i, 3]
  }
}
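As an aside, one loop-free sketch of the same recursion uses Reduce() with accumulate = TRUE. This assumes, as in the code above, that column 3 holds the night demand and column 5 the daytime surplus:

# Advance one day: charge (capped at 7500 MWh), then discharge to meet night demand
advance_day <- function(start_level, i) {
  mid <- min(start_level + data[i, 5], 7500)
  mid - data[i, 3]
}

# Battery level at the start of days 1..366
batterystartvector <- Reduce(advance_day, seq_len(nrow(data)),
                             init = 7500, accumulate = TRUE)
# Battery level at each day's midpoint
batterymidvector <- pmin(head(batterystartvector, -1) + data[, 5], 7500)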
I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date, and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building an ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want the last 3 days but only 2 records exist, this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id = c(rep(1, 5), rep(2, 10)),
                  day = c(1:5, 1:10),
                  device_repaired = sample(0:1, 15, replace = TRUE),
                  device_replaced = sample(0:1, 15, replace = TRUE))

# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3, 1, data, "device_repaired", 2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  # Subset the dataset: the specified device, days strictly before fday,
  # within the last fpreviousdays days
  df = subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make the calculation
  if (nrow(df) < fpreviousdays) {
    calculation = NA
  } else {
    calculation = sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of attributes available (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown exponentially, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize the execution, but I don't know if/how it's possible to pass these extra arguments to an lapply call.
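For what it's worth, a minimal sketch of passing those extra arguments: lapply()/sapply() forward additional named arguments to the function through '...', while mapply() (or its parallel counterpart parallel::mcmapply()) vectorizes over several arguments at once, with the fixed ones going in MoreArgs. The column and feature names below are the ones from the example data:

# One feature column computed for every row of 'data'
data$device_reparations_on_last_2days <- mapply(
  getCalculation,
  fday = data$day,
  fdeviceid = data$device_id,
  MoreArgs = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2)
)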
I am working on the EBS Forex market limit order book (LOB); here is an example of the LOB in a 100-millisecond time slice:
datetime | side (0=Bid, 1=Ask) | distance (1: best price, 2: 2nd best, etc.) | price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, they have two rules (I have changed them a bit for simplicity):
If there is no change in the LOB on the bid or ask side, they do not record that side. Look at the last line of the data: the millisecond field was 000 and is now 500, which means there was no change in the LOB on either side at 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).
When the last price (and only the last) is removed from a given side of the order book, they write a single record with nothing in the price field; again, there is no record for the rest of the LOB at that time.
Example: 2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk - maxBid (1.6067 - 1.6066) or a weighted average price (using the sizes at all distances as weights; there is a size column in my real data). I want to do this for my whole data set, but as you can see the data has been summarized, and this is not routine. I have written code to reconstruct the whole data (not just the summary), which is fine for a small data set, but for a large one I end up creating a huge file. I was wondering if you have any tips on how to handle the data: how can I fill the gaps efficiently?
You did not give a great reproducible example so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(For this to work you might have to transform date/time into a linear measure, e.g. time in seconds since market opening.)
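For example, a possible (untested) version of that conversion, assuming the raw file keeps the date and the time of day in two character columns named date and time, formatted as in the sample above:

# Combine date and time, parse with fractional seconds, then convert to numeric seconds
data$time <- as.numeric(as.POSIXct(paste(data$date, data$time),
                                   format = "%Y/%m/%d %H:%M:%OS"))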
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
best.ask$bid - best.ask$price))
I'm not sure I understand the end-of-day particularity, but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.
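For instance, a rough sketch of building weighted.avg.bid, assuming time has already been made numeric as above and that the real data has the size column mentioned in the question:

# Weighted average bid price per timestamp, using size as the weight
bids <- subset(data, side == 0)
wap <- sapply(split(bids, bids$time),
              function(d) weighted.mean(d$price, w = d$size))
weighted.avg.bid <- data.frame(time = as.numeric(names(wap)), price = wap)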