I am trying to identify, for two activities, the thresholds that generate the greatest success rate.
Below is an example of what I am trying to accomplish. For each location I am trying to identify the thresholds to use for activities 1 & 2 so that, if either criterion is met, we would guess 'yes' (1). I then need to make sure that we are guessing 'yes' for only a certain percentage of the total volume for each location, and that we are maximizing our accuracy (our guess of 'yes' matching an 'outcome' of 1).
set.seed(145)
location <- c(1, 2, 3)
testFile <- data.frame(location = rep.int(location, 20),
                       # note: the 20 draws are recycled across the 60 rows
                       # (3 locations x 20), so all locations share the same data
                       activity1 = round(rnorm(20, mean = 10, sd = 3)),
                       activity2 = round(rnorm(20, mean = 20, sd = 3)),
                       outcome = rbinom(20, 1, 0.5))
act_1_thresholds <- seq(7, 12, 1)
act_2_thresholds <- seq(19, 24, 1)
I was able to accomplish this by creating a table of all possible unique combinations of thresholds for activities 1 & 2, and then merging it with each observation in the sample data set. However, with ~200 locations in the actual data set, each with thousands of observations, I quickly ran out of memory.
I would like to create a function that takes a location id and the sets of possible thresholds for activities 1 and 2, and calculates how often we would have guessed yes (i.e. how often the values in 'activity1' or 'activity2' exceed their respective thresholds), to ensure our application rate stays within our desired range (50% - 75%). Then, among the threshold pairs that produce an application rate within that range, I want to store only the pair that maximizes accuracy, along with its location id, application rate, and accuracy rate. The desired output is listed below.
location act_1_thresh act_2_thresh application_rate accuracy_rate
1 1 13 19 0.52 0.45
2 2 11 24 0.57 0.53
3 3 14 21 0.67 0.42
I had tried writing this as a for loop, but was not able to navigate my way through the number of nested conditions I would have to account for. I would appreciate assistance from anyone who has attempted a similar problem. Thank you!
An example of how to calculate the application and accuracy rate for a single set of thresholds is listed below.
### Create yard IDs
location <- c(1,2,3)
### Create a single set of thresholds
single_act_1_threshold <- 12
single_act_2_threshold <- 20
### Calculate the simulated application and accuracy rates of the thresholds above using historical data
as.data.table(testFile)[,
  list(
    # share of all observations where either threshold is met
    application_rate = round(sum(single_act_1_threshold <= activity1 |
                                 single_act_2_threshold <= activity2) /
                             nrow(testFile), 2),
    # of the observations flagged 'yes', the share with outcome == 1
    accuracy_rate = round(sum((single_act_1_threshold <= activity1 |
                               single_act_2_threshold <= activity2) & outcome == 1) /
                          sum(single_act_1_threshold <= activity1 |
                              single_act_2_threshold <= activity2), 2)
  ),
  by = location]
Consider expand.grid, which builds a data frame of all combinations between both threshold vectors. Then use Map to iterate elementwise over both columns of that data frame to build a list of data tables (each of which now includes columns for the threshold pair).
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
# ALL COMBINATIONS
thresholds_df <- expand.grid(th1=act_1_thresholds, th2=act_2_thresholds)
# USER-DEFINED FUNCTION
calc <- function(th1, th2)
  as.data.table(testFile)[, list(
    act_1_thresholds = th1,    # NEW COLUMN
    act_2_thresholds = th2,    # NEW COLUMN
    application_rate = round(sum(th1 <= activity1 | th2 <= activity2) /
                             nrow(testFile), 2),
    accuracy_rate = round(sum((th1 <= activity1 | th2 <= activity2) & outcome == 1) /
                          sum(th1 <= activity1 | th2 <= activity2), 2)
  ), by = location]
# LIST OF DATA TABLES
dt_list <- Map(calc, thresholds_df$th1, thresholds_df$th2)
# NAME ELEMENTS OF LIST
names(dt_list) <- paste(thresholds_df$th1, thresholds_df$th2, sep="_")
# SAME RESULT AS POSTED EXAMPLE
dt_list$`12_20`
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 12 20 0.23 0.5
# 2: 2 12 20 0.23 0.5
# 3: 3 12 20 0.23 0.5
And if you need to stack all elements into a single table, use data.table's rbindlist:
final_dt <- rbindlist(dt_list)
final_dt
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 7 19 0.32 0.47
# 2: 2 7 19 0.32 0.47
# 3: 3 7 19 0.32 0.47
# 4: 1 8 19 0.32 0.47
# 5: 2 8 19 0.32 0.47
# ---
# 104: 2 11 24 0.20 0.42
# 105: 3 11 24 0.20 0.42
# 106: 1 12 24 0.15 0.56
# 107: 2 12 24 0.15 0.56
# 108: 3 12 24 0.15 0.56
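To go from final_dt to the per-location output the question asks for, one more pass filters on the application-rate window and keeps the single most accurate pair per location. A minimal sketch, assuming the 50%-75% range from the question:
# keep pairs inside the desired application-rate window
candidates <- final_dt[application_rate >= 0.50 & application_rate <= 0.75]
# per location, keep the row with the highest accuracy (first row on ties)
best <- candidates[candidates[, .I[which.max(accuracy_rate)], by = location]$V1]
best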
I'm interested in filtering out data based on a set of rules.
I have a dataset that contains play data for all games in which a team had a .8 win probability at some point. What I'd like to do is find the point at which the win probability reached .8 and remove every play thereafter, until the next game's data begins. The dataset contains numerous games, so once a game ends, data from a new one begins and the win probability goes back to around .5.
Here are the relevant columns and each row is a play in the game:
game_id = unique num for each game
team = team that will eventually get an .8 win prob
play_id = num that increases after each play (but not necessarily in sequential order for some reason)
win_per = num showing what the teams win percentage chance at the start of that recorded play was
Example df
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
So in this small example, I have recorded 5 plays each for two teams (a and b), both of which obtained a win_per of at least .8 at some point in the game. In both cases, I would want all the plays removed AFTER they attained this .8 mark, regardless of whether the win probability kept rising or fell back below .8.
So team a would have its final two rows of data removed (win_per == .81 and .85) and team b would have its final row removed (win_per == .77).
I'm imagining running a for loop that checks whether the team in any row is the same as in the prior row; if so, find the win_per >= .8 with the lowest play_id (as this would be the first time the team reached .8) and then somehow remove the rest of the rows following that match UNTIL the team differs from the prior row's team.
Of course, you might know a better way as well. Thank you so much for helping me out!
No need to use a loop, that whole selection can be performed in 1 line using the dplyr package:
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
library(dplyr)
#group by team
#find the first row that exceeds .80 and add temp column
#save the row from 1 to the row that exceeds 0.80
#remove temp column
df %>% group_by(team, game_id) %>%
mutate(g80= min(which(win_per>=0.80))) %>%
slice(1:g80) %>%
select(-g80)
# A tibble: 7 x 4
# Groups: team [2]
game_id team play_id win_per
<dbl> <fct> <dbl> <dbl>
1 122 a 1 0.5
2 122 a 5 0.6
3 122 a 22 0.86
4 144 b 45 0.54
5 144 b 47 0.43
6 144 b 55 0.47
7 144 b 58 0.81
Here is a base R way using cumsum inside ave:
subset(df, ave(win_per > 0.8, game_id,
               # shift the running count down one row so the first play
               # at or above 0.8 is itself kept
               FUN = function(x) c(0, cumsum(x)[-length(x)])) == 0)
# game_id team play_id win_per
#1 122 a 1 0.50
#2 122 a 5 0.60
#3 122 a 22 0.86
#6 144 b 45 0.54
#7 144 b 47 0.43
#8 144 b 55 0.47
#9 144 b 58 0.81
And using the same idea in dplyr:
library(dplyr)
df %>% group_by(game_id) %>% filter(lag(cumsum(win_per > 0.8) == 0, default = TRUE))
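To see what this one-liner does, it helps to look at the intermediate columns: cumsum(win_per > 0.8) first becomes 1 on the first play at or above .8, so filtering on the lagged flag (with default = TRUE for the first row) keeps that play and drops everything after it. A sketch of the intermediates:
df %>%
  group_by(game_id) %>%
  mutate(past_08 = cumsum(win_per > 0.8),             # 0 until the first play >= .8
         keep    = lag(past_08 == 0, default = TRUE)) # TRUE up to and including it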
I have a data frame called "Navi", with 72 rows that describe all possible combinations of three variables f,g and h.
head(Navi)
f g h
1 40.00000 80 0.05
2 57.14286 80 0.05
3 74.28571 80 0.05
4 91.42857 80 0.05
5 108.57143 80 0.05
6 125.71429 80 0.05
I have a dataset that also contains these 3 variables f,g and h along with several others.
head(dataset1[,7:14])
# A tibble: 6 x 8
h f g L1 L2 Ref1 Ref2 FR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.02 20 100 53 53 0.501 2.00 2
2 0.02 20 260 67 67 0.200 5.01 5.2
3 0.02 20 420 72 71 0.128 7.83 8.4
4 0.02 20 580 72 72 0.0956 10.5 11.6
5 0.02 20 740 73 73 0.0773 12.9 14.8
6 0.02 20 900 72 71 0.0655 15.3 18
What I'm trying to do is:
for each row in the combinations data frame, filter the dataset by the three variables f,g and h.
Then, if there are exact matches, take the matching rows of this dataset, extract the values in the columns "L1" and "FR", and calculate their averages. Save the average values in the vectors "L_M2" and "FR_M2".
If there aren't exact matches, take the rows where f, g and h in the dataset are closest to f, g and h from the data frame. Then extract all values for "L1" and "FR" in those rows and calculate the averages. Save the average values in the vectors "L_M2" and "FR_M2".
What I've already tried:
I created two empty vectors where the extracted values shall be saved later on.
Then I am looping over every row of the combinations data frame, filtering the dataset by f,g and h.
The result would be multiple rows, where the values for f,g and h are the same in the dataset as in the row of the combinations data frame.
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:nrow(Navi)) {
  matchingRows <- dataset1[dataset1$P == input$varP
                           & dataset1$Las == input$varLas
                           & dataset1$Opt == input$varO
                           & dataset1$f == Navi[i, 1]
                           & dataset1$g == Navi[i, 2]
                           & dataset1$h == Navi[i, 3], ]
}
The thing is, I don't know what to do from here on. I don't know how to check for rows with closest values by multiple variables, if there are no exact matches...
I only did something more or less similar in the past, where I checked for the closest "g" value like this:
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:nrow(Navi)) {
  # exact match on f and h only; g is matched by nearest value below
  matchingRows <- dataset1[dataset1$P == input$varP
                           & dataset1$Las == input$varLas
                           & dataset1$Opt == input$varO
                           & dataset1$f == Navi[i, 1]
                           & dataset1$h == Navi[i, 3], ]
  Differences <- abs(Navi[i, 2] - matchingRows$g)
  indexofMin <- which.min(Differences)
  L_M2 <- append(L_M2, matchingRows$L1[[indexofMin]], after = length(L_M2))
  FR_M2 <- append(FR_M2, matchingRows$FR[[indexofMin]], after = length(FR_M2))
}
So can anybody tell me how to achieve this extraction process? I am still pretty new to R, so please tell me if I made a rookie mistake or forgot to include some crucial information. Thank you!
First, convert your data into data frames (if not done already).
Navi <- data.frame(Navi)
Savi <- data.frame(dataset1[,7:14])
Then use merge to keep the exactly matching rows:
df1 <- merge(Navi, Savi, by = c("f","g","h"))
Save "L1" and "FR" average from df1:
Average1 <- ((df1$L1+df1$FR)/2)
Get your new Navi data frame, holding the combinations with no exact match on the f, g, h columns (stack the key columns of df1 on top of Navi, then drop the Navi rows flagged as duplicates):
Navi_new <- Navi[!duplicated(rbind(df1[names(Navi)], Navi))[-seq_len(nrow(df1))], ]
To find the nearest match when there is no exact one, compare each unmatched combination against every row of Savi:
# index of the Savi row with the smallest total absolute distance
# across f, g and h (consider scaling the columns first, since
# f, g and h have very different ranges)
nearest_idx <- apply(Navi_new[, c("f", "g", "h")], 1, function(r)
  which.min(abs(Savi$f - r["f"]) + abs(Savi$g - r["g"]) + abs(Savi$h - r["h"])))
You can then use nearest_idx to pull the nearest rows from Savi and average "L1" and "FR" just as with Average1 above.
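A minimal sketch of that last step, assuming nearest_idx from above; df2 and Average2 are just illustrative names:
df2 <- Savi[nearest_idx, ]
Average2 <- (df2$L1 + df2$FR) / 2   # same averaging as Average1 above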
I apologize if this is a duplicate or a bit confusing - I've searched all around SO but can't seem to find what I'm trying to accomplish. I haven't used functions/loops extensively, especially written from scratch, so I'm not sure if the error is from the function (likely) or from the construct of the data. The basic flow is as follows:
Dummy data set - grouping, type, rate, years, months
I'm running lm formula on the data set by grouping with this bit:
coef_models <- test_coef %>% group_by(Grouping) %>% do(model = lm(rate ~ years + months, data = .))
The result of the above gives me intercepts and coefficients for the variables.
What I'm trying to accomplish next (and failing at) is: for every coefficient estimate that is negative, drop that component out of the equation and rerun the lm with just the positive coefficient. So for a grouping of states, if the years coefficient is negative, I would want to run lm(rate ~ months, data = .) instead.
To get there, with plyr/broom, I'm taking the results and putting them into a data frame:
# tidy the model output into a data frame
library(dplyr)
library(broom)
coef_output_test <- as.data.frame(coef_models %>% tidy(model))
coef_output_test$Grouping <- as.character(coef_output_test$Grouping)
# keep the negative coefficients - these are the groups to rerun
coef_output_test_rerun <- coef_output_test[!(coef_output_test$estimate >= 0),]
From here, I'm trying to rerun the groupings with issues, without the negative variable from the initial run. Because the variables will vary (some instances will drop years, some will drop months), I need to pass through the correct column to use. I think this is where I'm getting hung up:
lm_test_rerun_out <- data.frame(grouping=character()
, '(intercept)'=double()
, term=character()
, estimate=double()
, stringsAsFactors=FALSE)
lm_test_rerun <- function(r) {
y = coef_output_test_rerun$Grouping
x = coef_output_test_rerun$term
for (i in 2:nrow(coef_output_test_rerun)){
lm_test_rerun_out <- test_coef %>% group_by(Grouping["y"]) %>% do(model = lm(rate ~ x, data = .))
}
}
lm_test_rerun(coef_output_test_rerun)
I get this error:
variable lengths differ (found for 'x')
The output of the function should be something like this dummy output:
Grouping, Term, (intercept), Estimate
Sports, Years, 0.56, 0.0430
States, Months, 0.67, 0.340
I'm surely not fluent in R, and I'm sure the parts above that do work could be done more efficiently, but the output of the function should be the grouping and x variable used, along with the intercept and estimate for each. Ultimately I'll be taking that output and appending it back to the original 'coef_models', but I can't get past this part for now.
EDIT: sample test_coef set
Grouping Drilldown Years Months Rate
Sports Basketball 10 23 0.42
Sports Soccer 13 18 0.75
Sports Football 9 5 0.83
Sports Golf 13 17 0.59
States CA 13 20 0.85
States TX 14 9 0.43
States AK 14 10 0.63
States AR 10 5 0.60
States ID 18 2 0.22
Countries US 8 19 0.89
Countries CA 9 19 0.86
Countries UK 2 15 0.64
Countries MX 21 15 0.19
Countries AR 8 11 0.62
Consider a base R solution with by, which slices a dataframe by one or more factors so that any method can run on each grouped subset. Specifically, the code below conditionally re-runs the lm model by checking the coefficient matrix, and ultimately returns a dataframe with the needed values:
Data
txt <- ' Grouping Drilldown Years Months Rate
Sports Basketball 10 23 0.42
Sports Soccer 13 18 0.75
Sports Football 9 5 0.83
Sports Golf 13 17 0.59
States CA 13 20 0.85
States TX 14 9 0.43
States AK 14 10 0.63
States AR 10 5 0.60
States ID 18 2 0.22
Countries US 8 19 0.89
Countries CA 9 19 0.86
Countries UK 2 15 0.64
Countries MX 21 15 0.19
Countries AR 8 11 0.62'
test_coef <- read.table(text=txt, header=TRUE)
Code
df_list <- by(test_coef, test_coef$Grouping, function(df){
# FIRST MODEL
res <- summary(lm(Rate ~ Years + Months, data = df))$coefficients
# CONDITIONALLY DEFINE FORMULA
f <- NULL
if ((res["Years",1]) < 0 & (res["Months",1]) > 0) f <- Rate ~ Months
if ((res["Years",1]) > 0 & (res["Months",1]) < 0) f <- Rate ~ Years
# CONDITIONALLY RERUN MODEL
if (!is.null(f)) res <- summary(lm(f, data = df))$coefficients
# ITERATE THROUGH LENGTH OF res MATRIX SKIPPING FIRST ROW
tmp_list <- lapply(seq(length(res[-1,1])), function(i)
data.frame(Group = as.character(df$Grouping[[1]]),
Term = row.names(res)[i+1],
Intercept = res[1,1],
Estimate = res[i+1,1])
)
# RETURN DATAFRAME OF 1 OR MORE ROWS
return(do.call(rbind, tmp_list))
})
final_df <- do.call(rbind, unname(df_list))
final_df
# Group Term Intercept Estimate
# 1 Countries Months -0.0512500 0.04375000
# 2 Sports Years 0.6894118 -0.00372549
# 3 States Months 0.2754176 0.02941113
Do note: removing the negative coefficient from the first model and re-running the new model can render the other component negative when it was previously positive.
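If you prefer to stay inside a dplyr pipeline, the same conditional refit can be expressed with reformulate(), which builds the one-term formula from the surviving term name. A sketch on the same data, with refit() a hypothetical helper mirroring the by() logic above (both slopes negative keeps the full fit):
library(dplyr)

# hypothetical helper: fit the full model; if exactly one slope is
# positive, refit on that term alone
refit <- function(df) {
  full <- lm(Rate ~ Years + Months, data = df)
  keep <- names(which(coef(full)[-1] > 0))   # positively-signed terms
  if (length(keep) == 1) full <- lm(reformulate(keep, response = "Rate"), data = df)
  data.frame(Term      = names(coef(full))[-1],
             Intercept = unname(coef(full)[1]),
             Estimate  = unname(coef(full)[-1]))
}

test_coef %>% group_by(Grouping) %>% do(refit(.))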
I am trying to backtest stock returns given a 10-month moving average rule. The rule being: if the price is above the 10-month average, buy; if it is below the 10-month average, hold the value constant.
I know how to do this in excel very easily, but I am having trouble in R.
Below is my approach in R:
#Download financial data
library(Quandl)
SPY <- Quandl("YAHOO/INDEX_GSPC", type = "xts", collapse = "monthly")
head(SPY)
#Calculate log returns
SPY$log_ret <- diff(log(SPY$Close))
#Calculate moving average for Closing price
SPY$MA.10 <- rollapply(SPY$Close, width = 10, FUN = mean)
#Create binary rule to determine when to buy and when to hold
#1 = Buy
SPY$Action <- ifelse(SPY$MA.10 < SPY$Close, 1, 0)
#Create default value in a new column to backtest returns
SPY$Hit <- 100
#Calculate cumulative returns
SPY$Hit <-ifelse(SPY$Action == 1, SPY[2:n, "Hit"] *
(1 + SPY$log_ret), lag.xts(SPY$Hit, k=1))
Returns do get calculated correctly for an Action of 1, but when the Action is not 1, SPY$Hit only lags once and then defaults back to the value 100, while I would like it to hold the value from the last time Action == 1.
This formula works very well in MS Excel and is very easy to implement there, but the issue in R seems to be that I cannot keep the value constant from the last Action == 1. How can I do this so that I can see how well this simple trading strategy would work?
Please let me know if I can clarify this further, thank you.
Sample of the desired output:
Action Return Answer
[1,] 0 0.00 100.00000
[2,] 1 0.09 109.00000
[3,] 1 0.08 117.72000
[4,] 1 -0.05 111.83400
[5,] 1 -0.03 108.47898
[6,] 0 -0.02 108.47898
[7,] 0 0.01 108.47898
[8,] 0 0.06 108.47898
[9,] 1 -0.03 105.22461
[10,] 0 0.10 105.22461
[11,] 1 -0.05 99.96338
Here's my guess, let me know what you think.
# Looping
Hit <- matrix(100, nrow = nrow(SPY))
for(row in 11:nrow(SPY)){ # 11 since you have NA's from your moving average
if(SPY$Action[row] == 1){
Hit[row] = Hit[row-1] * (1 + SPY$log_ret[row]) # here we needed row-1
} else {
Hit[row] = Hit[row-1]
}
}
SPY$Hit <- Hit
cbind(SPY$Action, SPY$Hit)
For your sample:
x <- data.frame(Action = c(0,1,1,1,1,0,0,0,1,0,1))
x$Return <- c(0,0.09,0.08,-0.05,-0.03,-0.02,0.01,0.06,-0.03,0.10,-0.05)
x$Answer <- matrix(100, nrow = nrow(x))
for(row in 2:nrow(x)){ # start at 2; row 1 holds the starting value of 100
if(x$Action[row] == 1){
x$Answer[row] = x$Answer[row-1] * (1 + x$Return[row])
} else {
x$Answer[row] = x$Answer[row-1]
}
}
x
Action Return Answer
1 0 0.00 100.00000
2 1 0.09 109.00000
3 1 0.08 117.72000
4 1 -0.05 111.83400
5 1 -0.03 108.47898
6 0 -0.02 108.47898
7 0 0.01 108.47898
8 0 0.06 108.47898
9 1 -0.03 105.22461
10 0 0.10 105.22461
11 1 -0.05 99.96338
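Incidentally, the loop can also be vectorized: a hold multiplies the running value by 1 and a buy multiplies it by 1 + return, so the whole path is a cumulative product. A sketch using the sample data x above:
x$Answer2 <- 100 * cumprod(ifelse(x$Action == 1, 1 + x$Return, 1))
all.equal(as.numeric(x$Answer), x$Answer2)  # TRUE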
In Excel there are two ways to attain this.
1. Go to the Data tab, find Data Analysis, then Moving Average. In the dialog box enter the input data range, the interval (in your case 10), and the output cell address. After the result is in place, write this formula:
=IF(A2>B2, "Buy", "Hold")
where A2 holds the price and B2 holds the 10-month moving average value.
2. Alternatively, anywhere in the sheet, number cells horizontally 1 to 10 (the month number), put each month's value in the row below, calculate the 10-month average in the row below that, and finally use the formula above to decide Buy or Hold.
I have a data frame in which I would like to group rows based on whether the integer values are consecutive or not, and then find the difference between the maximum and minimum value of each group.
Example of data:
x Integers
0.1 14
0.05 15
2.7 17
0.07 19
3.4 20
0.05 21
So Group 1 would consist of 14 and 15 and Group 2 would consist of 19,20 and 21.
The difference of each group then being 1 and 2, respectively.
I have tried the following to first group the consecutive values, with no luck.
Breaks <- c(0, which(diff(Data$Integers) != 1), length(Data$Integers))
sapply(seq(length(Breaks) - 1),
       function(i) Data$Integers[(Breaks[i] + 1):Breaks[i + 1]])
Here's a solution using by():
df <- data.frame(x = c(0.1, 0.05, 2.7, 0.07, 3.4, 0.05),
                 Integers = c(14, 15, 17, 19, 20, 21))
do.call(rbind, by(df, cumsum(c(0, diff(df$Integers) != 1)), function(g)
  data.frame(imin = min(g$Integers), imax = max(g$Integers),
             irange = diff(range(g$Integers)),
             xmin = min(g$x), xmax = max(g$x),
             xrange = diff(range(g$x)))))
## imin imax irange xmin xmax xrange
## 0 14 15 1 0.05 0.1 0.05
## 1 17 17 0 2.70 2.7 0.00
## 2 19 21 2 0.05 3.4 3.35
I wasn't sure what data you wanted in the output, so I just included everything you might want.
You can filter out the middle group with subset(...,irange!=0).
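For what it's worth, the same run-grouping idea ports directly to dplyr, using the cumsum trick as the grouping key. A sketch on the df defined above:
library(dplyr)
df %>%
  group_by(grp = cumsum(c(0, diff(Integers) != 1))) %>%  # same run id as the by() split
  summarise(irange = max(Integers) - min(Integers),
            xrange = diff(range(x))) %>%
  filter(irange != 0)   # drop the singleton group, as noted above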