I am trying to identify, for each location, the pair of activity thresholds that produces the greatest success rate.
Listed below is an example of what I am trying to accomplish. For each location I am trying to identify the thresholds for activities 1 & 2 such that, if either criterion is met, we guess 'yes' (1). I then need to make sure we are guessing 'yes' for only a certain percentage of the total volume at each location, and that we are maximizing our accuracy (a guess of 'yes' matching an 'outcome' of 1).
set.seed(145)
location <- c(1,2,3)
# note: the 20 simulated values recycle across the 60 rows,
# so all three locations end up with the same data
testFile <- data.frame(location = rep.int(location, 20),
                       activity1 = round(rnorm(20, mean = 10, sd = 3)),
                       activity2 = round(rnorm(20, mean = 20, sd = 3)),
                       outcome = rbinom(20, 1, 0.5)
)
act_1_thresholds <- seq(7, 12, 1)
act_2_thresholds <- seq(19, 24, 1)
I was able to accomplish this by creating a table of all possible unique combinations of thresholds for activities 1 & 2 and then merging it with each observation in the sample data set. However, with ~200 locations in the actual data set, each with thousands of observations, I quickly ran out of space.
I would like to create a function that takes the location id and the sets of possible thresholds for activities 1 and 2, and then calculates how often we would have guessed 'yes' (i.e., the value in 'activity1' or 'activity2' exceeds its respective threshold) to ensure our application rate stays within our desired range (50%–75%). Then, for each set of thresholds producing an application rate within that range, we would store only the set that maximizes accuracy, along with its location id, application rate, and accuracy rate. The desired output is listed below.
location act_1_thresh act_2_thresh application_rate accuracy_rate
1 1 13 19 0.52 0.45
2 2 11 24 0.57 0.53
3 3 14 21 0.67 0.42
I had tried writing this as a for loop, but was not able to navigate the number of nested conditions I would have to write to account for all of these requirements. I would appreciate assistance from anyone who has attempted a similar problem. Thank you!
An example of how to calculate the application and accuracy rate for a single set of thresholds is listed below.
### Create location IDs
location <- c(1,2,3)
### Create a single set of thresholds
single_act_1_threshold <- 12
single_act_2_threshold <- 20
### Calculate the simulated application, and success rate of thresholds mentioned above using historical data
library(data.table)

as.data.table(testFile)[,
list(
application_rate = round(sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0))/
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2) & (outcome == 1), 1, 0))/
sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0)),2)
),
by = location]
Consider expand.grid, which builds a data frame of all combinations between both threshold vectors. Then use Map to iterate elementwise over both columns of that data frame to build a list of data tables (each of which now includes columns for the threshold pair).
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
# ALL COMBINATIONS
thresholds_df <- expand.grid(th1=act_1_thresholds, th2=act_2_thresholds)
# USER-DEFINED FUNCTION
calc <- function(th1, th2)
as.data.table(testFile)[, list(
act_1_thresholds = th1, # NEW COLUMN
act_2_thresholds = th2, # NEW COLUMN
application_rate = round(sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)) /
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((th1 <= activity1 | th2 <= activity2) & (outcome == 1), 1, 0)) /
sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)),2)
), by = location]
# LIST OF DATA TABLES
dt_list <- Map(calc, thresholds_df$th1, thresholds_df$th2)
# NAME ELEMENTS OF LIST
names(dt_list) <- paste(thresholds_df$th1, thresholds_df$th2, sep="_")
# SAME RESULT AS POSTED EXAMPLE
dt_list$`12_20`
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 12 20 0.23 0.5
# 2: 2 12 20 0.23 0.5
# 3: 3 12 20 0.23 0.5
And if you need to append all elements use data.table's rbindlist:
final_dt <- rbindlist(dt_list)
final_dt
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 7 19 0.32 0.47
# 2: 2 7 19 0.32 0.47
# 3: 3 7 19 0.32 0.47
# 4: 1 8 19 0.32 0.47
# 5: 2 8 19 0.32 0.47
# ---
# 104: 2 11 24 0.20 0.42
# 105: 3 11 24 0.20 0.42
# 106: 1 12 24 0.15 0.56
# 107: 2 12 24 0.15 0.56
# 108: 3 12 24 0.15 0.56
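To finish the job described in the question (keep only threshold pairs whose application rate falls in the 50%–75% window, then the maximum-accuracy pair per location), one possible last step over `final_dt` is sketched below. The small `final_dt` built here is a hypothetical stand-in so the snippet runs on its own; in practice you would use the `rbindlist` result above.

```r
library(data.table)

# Hypothetical stand-in for the rbindlist() result above
final_dt <- data.table(
  location         = c(1, 1, 2, 2),
  act_1_thresholds = c(7, 8, 7, 8),
  act_2_thresholds = c(19, 20, 19, 20),
  application_rate = c(0.60, 0.40, 0.55, 0.70),
  accuracy_rate    = c(0.45, 0.50, 0.52, 0.61)
)

# Keep rows inside the desired application-rate window ...
in_range <- final_dt[application_rate >= 0.50 & application_rate <= 0.75]

# ... then, per location, keep the threshold pair with maximal accuracy
best <- in_range[, .SD[which.max(accuracy_rate)], by = location]
best
```

`.SD[which.max(accuracy_rate)]` selects, within each `by = location` group, the single row whose accuracy is highest, which matches the one-row-per-location output in the question.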
I am trying to backtest stock returns using a 10-month moving-average rule: if the price is above the 10-month average, buy; if it is below, hold the value constant.
I know how to do this in Excel very easily, but I am having trouble in R.
Below is my approach in R:
#Download financial data
library(Quandl)
SPY <- Quandl("YAHOO/INDEX_GSPC", type = "xts", collapse = "monthly")
head(SPY)
#Calculate log returns
SPY$log_ret <- diff(log(SPY$Close))
#Calculate trailing moving average of the Closing price (align = "right" so each
#window covers the current month and the 9 before it)
SPY$MA.10 <- rollapply(SPY$Close, width = 10, FUN = mean, align = "right", fill = NA)
#Create binary rule to determine when to buy and when to hold
#1 = Buy
SPY$Action <- ifelse(SPY$MA.10 < SPY$Close, 1, 0)
#Create default value in a new column to backtest returns
SPY$Hit <- 100
#Calculate cumulative returns
SPY$Hit <-ifelse(SPY$Action == 1, SPY[2:n, "Hit"] *
(1 + SPY$log_ret), lag.xts(SPY$Hit, k=1))
Returns are calculated correctly when Action is 1, but when Action is not 1, SPY$Hit only lags once and then falls back to the default value of 100, whereas I would like it to hold the value from the last time Action == 1.
This formula works very well in MS Excel and is very easy to implement, but in R I cannot seem to keep the value constant from the last Action == 1. How can I do this so that I can see how well this simple trading strategy would work?
Please let me know if I can clarify this further, thank you.
Sample of the desired output:
Action Return Answer
[1,] 0 0.00 100.00000
[2,] 1 0.09 109.00000
[3,] 1 0.08 117.72000
[4,] 1 -0.05 111.83400
[5,] 1 -0.03 108.47898
[6,] 0 -0.02 108.47898
[7,] 0 0.01 108.47898
[8,] 0 0.06 108.47898
[9,] 1 -0.03 105.22461
[10,] 0 0.10 105.22461
[11,] 1 -0.05 99.96338
Here's my guess, let me know what you think.
# Looping
Hit <- matrix(100, nrow = nrow(SPY))
for(row in 11:nrow(SPY)){ # 11 since you have NA's from your moving average
if(SPY$Action[row] == 1){
Hit[row] = Hit[row-1] * (1 + SPY$log_ret[row]) # here we needed row-1
} else {
Hit[row] = Hit[row-1]
}
}
SPY$Hit <- Hit
cbind(SPY$Action, SPY$Hit)
For your sample:
x <- data.frame(Action = c(0,1,1,1,1,0,0,0,1,0,1))
x$Return <- c(0,0.09,0.08,-0.05,-0.03,-0.02,0.01,0.06,-0.03,0.10,-0.05)
x$Answer <- matrix(100, nrow = nrow(x))
for(row in 2:nrow(x)){ # start at 2; row 1 holds the initial value of 100
if(x$Action[row] == 1){
x$Answer[row] = x$Answer[row-1] * (1 + x$Return[row])
} else {
x$Answer[row] = x$Answer[row-1]
}
}
x
Action Return Answer
1 0 0.00 100.00000
2 1 0.09 109.00000
3 1 0.08 117.72000
4 1 -0.05 111.83400
5 1 -0.03 108.47898
6 0 -0.02 108.47898
7 0 0.01 108.47898
8 0 0.06 108.47898
9 1 -0.03 105.22461
10 0 0.10 105.22461
11 1 -0.05 99.96338
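As an aside, the loop above can also be expressed without explicit iteration: each row either multiplies the running value by (1 + Return) or leaves it unchanged, so a cumulative product reproduces the same Answer column. A minimal sketch using the sample data above:

```r
Action <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1)
Return <- c(0, 0.09, 0.08, -0.05, -0.03, -0.02, 0.01, 0.06, -0.03, 0.10, -0.05)

# Rows with Action == 1 compound by (1 + Return); others multiply by 1 (hold)
Answer <- 100 * cumprod(ifelse(Action == 1, 1 + Return, 1))
round(Answer, 5)
```

This gives the same values as the loop, and on thousands of rows the vectorized form will be considerably faster.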
In Excel there are two ways to attain this:
1. Go to the Data tab, open Data Analysis, and choose Moving Average. In the dialog box supply the input data range, the interval (in your case 10), and the output cell address. After the result is computed, write this formula:
=IF(A2>B2, "Buy", "Hold")
where A2 holds the price and B2 holds the 10-month moving-average value.
2. Anywhere in the sheet, number cells horizontally 1 to 10 (month number). In the row below, put each month's value. In the next row, calculate the 10-month average. Finally, use the formula above to decide Buy or Hold.
I plotted a graph in R:
OBD=read.csv("OBD.CSV",header = TRUE,stringsAsFactors=FALSE)
x1 <- OBD$Time1
x2 <- OBD$Time2
y1<-OBD$Vehicle_speed
y2 <-OBD$Engine_speed
par(mar=c(5,4,4,5)+.1)
plot(x1,y1,type="l",col="yellow",ylab = "Vehicle speed")
par(new=TRUE)
plot(x2,y2,type="l",col="blue4",xaxt="n",yaxt="n",xlab="Time",ylab="")
axis(4)
mtext("Engine speed",side=4,line=3)
legend("topleft",col=c("blue4","yellow"),lty=1,legend=c("Engine speed","Vehicle speed"))
Sample data, CSV format:
Vehicle_speed,Time1,Engine_speed,Time2,Engine_torq,Time3,Acc_pedal,Time4,Eng_fuel_rate,Time5
4.98,0,650,0,11,0,0,0,1.15,0
4.98,0,650,0,11,0,0,0,1.2,0.002
4.96,0,650,0.001,11,0.001,0,0.001,1.2,0.003
4.96,0,651,0.001,11,0.001,0,0.001,1.2,0.005
4.94,0.001,651,0.001,11,0.001,0,0.001,1.2,0.007
4.94,0.001,651,0.001,11,0.001,0,0.002,1.2,0.008
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.01
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.012
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.013
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.015
4.87,0.002,649.5,0.002,11,0.002,0,0.003,1.15,0.017
4.87,0.002,649.5,0.002,11,0.002,0,0.004,1.15,0.018
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.02
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.022
4.82,0.002,650,0.003,11,0.003,0,0.005,1.2,0.023
From this table, I just want to find the most frequently occurring engine speed and vehicle speed, or the most frequently occurring range.
To find the most common (mode) vehicle speed, you can tabulate the column with table (here df is the data frame read from your CSV):
mySpeeds <- table(df$Vehicle_speed)
modeSpeed <- as.numeric(names(mySpeeds)[which.max(mySpeeds)])
modeSpeed
[1] 4.85
To get such a value for a range of speeds, you should use cut:
# get range categories
df$speedRange <- cut(df$Vehicle_speed, breaks=c(-Inf, 4.85, 4.90, 4.95, Inf))
mySpeedsRange <- table(df$speedRange)
modeSpeedRange <- names(mySpeedsRange)[which.max(mySpeedsRange)]
modeSpeedRange
[1] "(4.85,4.9]"
cut takes a numeric variable and returns a factor variable based on the second (breaks) argument. You can supply breaks with a single number indicating the number of breaks, or a vector, indicating the unique cut points. I included -Inf and Inf to ensure full coverage.
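A small illustration of those two forms of the breaks argument, independent of the OBD data:

```r
x <- c(1, 2, 3, 4, 5)

# A single number: cut splits the range of x into 2 equal-width bins
two_bins <- cut(x, breaks = 2)
nlevels(two_bins)  # 2 levels

# A vector: explicit cut points; -Inf/Inf guarantee full coverage,
# and labels= names the resulting intervals
custom <- cut(x, breaks = c(-Inf, 3, Inf), labels = c("low", "high"))
table(custom)
```

Note that by default intervals are right-closed (`right = TRUE`), so the value 3 lands in the "low" bin above.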
OBD <- read.csv(text = "Vehicle_speed,Time1,Engine_speed,Time2,Engine_torq,Time3,Acc_pedal,Time4,Eng_fuel_rate,Time5
4.98,0,650,0,11,0,0,0,1.15,0
4.98,0,650,0,11,0,0,0,1.2,0.002
4.96,0,650,0.001,11,0.001,0,0.001,1.2,0.003
4.96,0,651,0.001,11,0.001,0,0.001,1.2,0.005
4.94,0.001,651,0.001,11,0.001,0,0.001,1.2,0.007
4.94,0.001,651,0.001,11,0.001,0,0.002,1.2,0.008
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.01
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.012
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.013
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.015
4.87,0.002,649.5,0.002,11,0.002,0,0.003,1.15,0.017
4.87,0.002,649.5,0.002,11,0.002,0,0.004,1.15,0.018
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.02
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.022
4.82,0.002,650,0.003,11,0.003,0,0.005,1.2,0.023")
> table(OBD$Engine_speed)
649.5 650 650.5 651
2 6 4 3
Or for a couple of columns:
tables <- apply(OBD[ ,c(1,3,5)], 2, table)
> tables
$Vehicle_speed
4.82 4.85 4.87 4.89 4.91 4.94 4.96 4.98
1 2 2 2 2 2 2 2
$Engine_speed
649.5 650 650.5 651
2 6 4 3
$Engine_torq
11
15
To get only the most occurring:
> lapply(tables, which.max)
$Vehicle_speed
4.85
2
$Engine_speed
650
2
$Engine_torq
11
1
Does this solve the problem?