I created a data set like this, the reason is I wanted to use a time series based data set
getSymbols("^GSPC")
DF=data.frame(GSPC,DATE=time(GSPC))
PriceChange=(DF$GSPC.Close-DF$GSPC.Open)
DF$Class<-as.factor(ifelse(PriceChange>0,"UP","DOWN"))
DF$year = as.numeric(format(DF$DATE, format = "%Y"))
DF$MONTH = as.numeric(format(DF$DATE, format = "%m"))
GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume Class Year Month
1418.03 1429.42 1407.86 1416.60 3429160000 Down 2007 1
1416.60 1421.84 1408.43 1418.34 3004460000 Up 2007 1
Then I replaced the year by a number which can be added on to keep the month index (I guess there is a smarter way of doing this)
DF=data.table(DF)
DF[year==2007,year:=0]
DF[year==2008,year:=12]
DF[year==2009,year:=24]
DF[year==2010,year:=36]
DF[year==2011,year:=48]
DF[year==2012,year:=60]
DF[year==2013,year:=72]
DF[year==2014,year:=84]
DF[year==2015,year:=96]
DF[year==2016,year:=108]
DF$Month_Index=(DF$year+DF$MONTH)
so the data set has now the additional column
Month_Index
01
01
Month_Index=115
Then I used the createTimeSlices from caret to make a growing window prediction.
TimeSlices=createTimeSlices(1:Month_Index, 5, horizon = 2,
fixedWindow = FALSE, skip = 0)
for(i in 1:nrow(DF))
{
plsFitTime <- train(Class~.,
data = DF[TimeSlices$train[[i]],],
method = "pls")
Prediction=predict(plsFitTime,DF[TimeSlices$test[[i]],])
}
Now I want to save predictions for each step and its proper index along with the accuracy. My question is how can I do this.
One of the ways to accomplish this:
library(quantmod)
library(data.table)
library(caret)
getSymbols("^GSPC")
DF <- data.frame(GSPC,DATE=time(GSPC))
PriceChange <- (DF$GSPC.Close-DF$GSPC.Open)
DF$Class <- as.factor(ifelse(PriceChange>0,"UP","DOWN"))
You can create Month indices in the following two ways:
# 1
DF$yearMon <- zoo::as.yearmon(DF$DATE)
DF <- data.table(DF)
DF[, Month_Index:= .GRP, by = yearMon]
# 2
DF$year <- as.numeric(format(DF$DATE, format = "%Y"))
DF$MONTH <- as.numeric(format(DF$DATE, format = "%m"))
DF[, Month_Index2 := .GRP, by = .(year, MONTH)]
identical(DF$Month_Index, DF$Month_Index2)
[1] TRUE
Month_Index <- length(unique(DF$Month_Index))
TimeSlices <- createTimeSlices(1:Month_Index, 5, horizon = 2,
fixedWindow = FALSE, skip = 0)
Create three empty lists to save your results:
totalSlices <- length(TimeSlices$train)
plsFitTime <- vector("list", totalSlices)
Prediction <- vector("list", totalSlices)
Accuracy <- vector("list", totalSlices)
Save all the results to these lists:
k <- 1:totalSlices
for(i in seq_along(k))
{
plsFitTime[[i]] <- train(Class~.,
data = DF[TimeSlices$train[[i]],],
method = "pls")
Prediction[[i]] <- predict(plsFitTime[[i]],
DF[TimeSlices$test[[i]],])
Accuracy[[i]] <- confusionMatrix(Prediction[[i]],
DF[TimeSlices$test[[i]],]$Class)$overall[1]
}
All the models are saved in plsFitTime, predictions in Prediction, and accuracies in Accuracy.
Update:
A more tidier approach would be to use the purrr package.
After creating time slices, you can use:
library(purrr)
customFunction <- function(x, y) {
model <- train(Class~.,
data = DF[x],
method = "pls")
prediction <- predict(model, DF[y])
accuracy <- confusionMatrix(prediction,
DF[y]$Class)$overall[1]
return(list(prediction, accuracy))
}
results <- map2_df(TimeSlices$train, TimeSlices$test, customFunction)
map2_df is a function that takes 2 lists .x and .y as the arguments, applies the function .f on all elements of those lists and returns the results as a dataframe.
You can create the function on the fly (just like lapply), but I created customFunction in the global environment just to keep the code clean.
DF[x] in the function is equivalent to DF[TimeSlices$train[[n]]] and DF[y] is DF[TimeSlices$test[[n]]]
map2_df now does everything that the for loop above did, and returns only the predictions and accuracies for all the models in the form of a dataframe.
class(results)
[1] "tbl_df" "tbl" "data.frame"
dim(results)
[1] 2 109
Each column in results is a list. 109 columns are the results from 109 models.
To access the result of each model (in this case prediction and accuracy) use results$columnName or results[[columnNumber]] .
If you want to store the models as well, just change the return statement in customFunction to include the model: return(list(model, prediction, accuracy))
You can use plyr to gather your results using a list:
results <- plyr::llply(1:length(TimeSlices$train), function(i){
plsFitTime <- train(Class~.,
data = DF[TimeSlices$train[[i]],],
method = "pls")
testData <- DF[TimeSlices$test[[i]],]
Prediction <- predict(plsFitTime, testData)
list(index = i, model = plsFitTime, prediction = Prediction)
})
# The model created for slice no. 3
results[[3]]$model
# ... and it's predictions
results[[3]]$prediction
You can add accuracy inside the function passed to llply if you need it.
Related
I've recently been interested in trying to develop a for-loop that would be able to run multiple generalized additive models and then produce results in a table that ranks them based on AIC, p-value of each smooth in the model, deviance explained of the overall model, etc.
I found this related question in stack overflow which is basically what I want and was able to run this well for gam() instead of gamm(), however I want to expand this to include multiple independent variables in the model, not just 1.
Ideally, the models would run all possible combinations of independent variables against the dependent variable, and it would test combinations anywhere from 1 independent variable in the model, up to all of the possible covariates in "d_pred" in the model.
I have attempted to do this so far by starting out small and finding all possible combinations of 2 independent variables (df_combinations2), which results in a list of data frames. Then I adjusted the rest of the code to run the for loop such that each iteration will run a different combination of the two variables:
library(mgcv)
## Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
df_combinations2 <- lapply(1:(ncol(combn(1:ncol(d_pred), m = 2))),
function(y) d_pred[, combn(1:ncol(d_pred), m = 2)[,y]])
## create a "matrix" list of dimensions i x j
results_m2 <-lapply(1:length(df_combinations2), matrix, data= NA, nrow=ncol(d_resp), ncol=2)
## for-loop
for(k in 1:length(df_combinations2)){
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(df_combinations2[[k]])){
results_m2[i, j][[1]] <- gam(d_resp[, i] ~ s(df_combinations2[[k]][,1])+s(df_combinations2[[k]][,2]))
}
}}
However, after running the for-loop I get the error "Error in all.vars1(gp$fake.formula[-2]) : can't handle [[ in formula".
Anyone know why I am getting this error/ how to fix it?
Any insight is much appreciated. Thanks!
Personally, I would create a data.table() containing all combinations of target variables and combinations of predictors and loop through all rows. See below.
library(data.table)
library(dplyr)
# Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
#select names of targets and predictors
targets <- c("y", "y1")
predictors <- colnames(d)[!colnames(d) %in% targets]
#create all combinations of predictors
predictor_combinations <- lapply(1:length(predictors), FUN = function(x){
#create combination
combination <- combn(predictors, m = x) |> as.data.table()
#add s() to all for gam
combination <- sapply(combination, FUN = function(y) paste0("s(", y, ")")) |> as.data.table()
#collapse
combination <- summarize_all(combination, .funs = paste0, collapse = "+")
#unlist
combination <- unlist(combination)
#remove names
names(combination) <- NULL
#return
return(combination)
})
#merge combinations of predictors as vector
predictor_combinations <- do.call(c, predictor_combinations)
#create folder to save results to
if(!dir.exists("dev")){
dir.create("dev")
}
if(!dir.exists("dev/models")){
dir.create("dev/models")
}
#create and save hypergrid (all combinations of targets and predictors combinations)
if(!file.exists("dev/hypergrid.csv")){
#create hypergrid and save to dev
hypergrid <- expand.grid(target = targets, predictors = predictor_combinations) |> as.data.table()
#add identifier
hypergrid[, model := paste0("model", 1:nrow(hypergrid))]
#save to dev
fwrite(hypergrid, file = "dev/hypergrid.csv")
} else{
#if file exists read
hypergrid <- fread("dev/hypergrid.csv")
}
#loop through hypergrid, create GAM models
#progressbar
pb <- txtProgressBar(min = 1, max = nrow(hypergrid), style = 3)
for(i in 1:nrow(hypergrid)){
#update progressbar
setTxtProgressBar(pb, i)
#select target
target <- hypergrid[i,]$target
#select predictors
predictors <- hypergrid[i,]$predictors
#create formula
gam.formula <- as.formula(paste0(target, "~", predictors))
#run gam
gam.model <- gam(gam.formula, data = d)
#save gam model do dev/model
saveRDS(gam.model, file = paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
}
#example where you extract model performances
for(i in 1:nrow(hypergrid)){
#read the right model
rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
#extract model performance, add to hypergrid
hypergrid[i, R2 := summary(rel.model)[["r.sq"]]]
}
#arrange hypergrid on target and r2
hypergrid <- dplyr::arrange(hypergrid, hypergrid$target, desc(hypergrid$R2))
Which would give
head(hypergrid)
target predictors model R2
1: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5) model319 0.6957242
2: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5) model423 0.6953753
3: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x7) model437 0.6942054
4: y s(x0)+s(x1)+s(x2)+s(x5) model175 0.6941025
5: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x6) model435 0.6940569
6: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5)+s(x7) model481 0.6939756
All models are saved to a folder with an identifier (for if you want to use the model or extract more information from the model).
Notably, p-hacking comes to mind using this appraoch and I would be careful by conducting your analysis like this.
I am trying make a binary prediction (predicting QQQ states) using 16 input variables. My data set is 2001-2022. Here is what my data set looks like (predicting X0, which is 5 days ahead)
First I use cross validation with an 80-20 train test split on data from 2001-2017 in order
to test the accuracy of a potential model.
However, since I want our model doing forward predictions, I train the model using the 2001-2017 data set and make a chronological prediction for the 2018-2022 data set.
Understandably, the accuracy drops off
In order to improve the accuracy, I run an expanding window prediction model, where I keep
retraining the model using all prior available observations in order to predict the next state in the data set. For each model I increment the training set by one date. The output is a 2018-2022 prediction of states where the state for each date was predicted using a different training set. This ideally should also help the model to train on new market conditions/phases. The accuracy improves.
However, when I change the lags, I begin to notice that the accuracy does not begin to drop off with increased lags…
The code has been checked extensively and it seems like the lags for each dataset are legitimate. This leads to the question…what is wrong with mu model? Might there be a model better suited for our purposes? It also makes me wonder, why is there such a variability in the Sharpe for each model, is the 15th lag having the highest Sharpe purely coincidental? One theory was that the training set is quite comprehensive, therefore the model is great at making prediction regardless of lag in the near term. However, when I took the lags to an extreme, the accuracy still did not drop off:
Should I try using a different model? Any advice or guidance would be greatly appreciated. Please see my code below (the loop commented out is the expanding window RandomForest application).
library(ggplot2)
library(BatchGetSymbols)
library(data.table)
library(plyr)
library(quantmod)
library(PerformanceAnalytics)
defaultW <- getOption("warn")
options(warn = -1)
library(dplyr)
library(caret)
library(ranger)
### Data Import ######
states_full <- read.csv(file = "rolling_qqq_hidden_states_full_five_back.csv")
states_full$formatted_date <- as.Date(states_full$formatted_date)
states_full <- states_full[!duplicated(states_full$formatted_date),]
tickers <- c("QQQ", "^VXN")
l.out <- BatchGetSymbols(tickers = tickers,
first.date = states_full$formatted_date[1],
last.date = states_full$formatted_date[nrow(states_full)]+1, do.cache=FALSE, be.quiet = TRUE)
price_data <- data.frame(l.out$df.tickers$price.adjusted,l.out$df.tickers$ret.adjusted.prices, l.out$df.tickers$ref.date, l.out$df.tickers$ticker)
colnames(price_data) <- c("Value", "Daily Return", "Date", "Label")
QQQ_full <- price_data[which(price_data$Label == "QQQ"),]
# Make sure dates match
mylist <- c()
for (i in i:nrow(QQQ_full)){
if (sum(QQQ_full$Date[i] == states_full$formatted_date) != 1){
mylist <- c(mylist, i)
}
}
if(length(mylist) > 0){
QQQ_full <- QQQ_full[-mylist,]
}
mylist <- c()
for (i in 1:nrow(QQQ_01_17)){
if (sum(states_full$formatted_date[i] == QQQ_full$Date) != 1){
mylist <- c(mylist, i)
}
}
if(length(mylist) > 0){
states_full <- states_full[-mylist,]
}
# split the data into 2001-2017, 2018-2022
states_01_17 <- states_full[1:which(states_full$formatted_date == "2017-12-29"),]
states_17_22 <- states_full[(nrow(states_01_17)+1):nrow(states_full),]
QQQ_01_17<- QQQ_full[1:which(QQQ_full$Date == "2017-12-29"),]
QQQ_17_22 <- QQQ_full[(which(QQQ_full$Date == "2017-12-29")+1):nrow(QQQ_full),]
# build QQQ portfolio
QQQ_portfolio <- as.data.frame(matrix(nrow = nrow(QQQ_17_22) , ncol = 3))
colnames(QQQ_portfolio) <- c("Value", "Date", "Label")
QQQ_portfolio$Value <- 100
QQQ_portfolio$Label <- "QQQ Portfolio"
QQQ_portfolio$Date <- QQQ_17_22$Date
for(m in 2:nrow(QQQ_portfolio)){
QQQ_portfolio$Value[m] <- QQQ_portfolio$Value[m-1] * (1+QQQ_17_22$`Daily Return`[m])
}
# build non-lagged states portfolio
states_portfolio <- as.data.frame(matrix(nrow = nrow(QQQ_17_22) , ncol = 3))
colnames(states_portfolio) <- c("Value", "Date", "Label")
states_portfolio$Value <- 100
states_portfolio$Label <- "0 Lag RandomForest Prediction of MSDR"
states_portfolio$Date <- QQQ_17_22$Date
for(i in 2:nrow(states_portfolio)){
if (states_17_22$X0[i-1] == 1){
states_portfolio$Value[i] <- states_portfolio$Value[i-1] * (1+QQQ_17_22$`Daily Return`[i])
} else {
states_portfolio$Value[i] <- states_portfolio$Value[i-1]
}
}
# Calculate non-lagged sharpe as benchmark
#states_portfolio_returns <- data.frame(Delt(states_portfolio$Value)[-1])
#states_portfolio_returns_xts <- xts(states_portfolio_returns,states_portfolio$Date[-1])
#as.numeric(SharpeRatio.annualized(states_portfolio_returns_xts))
# bind portfolios together for plotting
port_comp <- rbind(QQQ_portfolio,states_portfolio)
# data set that will hold performance metrics
loop_output <- as.data.frame(matrix(0, nrow = 22, ncol = 8))
colnames(loop_output) <- c("Lag", "Cross Validation Accuracy 01-17","Forward Accuracy 18-22","Sharpe", "Average 1YR Rolling Sharpe",
"Median 1YR Rolling Sharpe","Min 1YR Rolling Sharpe","Max 1YR Rolling Sharpe")
# read macro data (do it each time because)
macro_full <- data.frame(read.csv("macroindicators3.csv"))
for (j in 2:ncol(macro_full)){
macro_full[j] <- as.numeric(nafill(macro_full[,j], type = "locf"))
}
macro_full$Date <- as.Date(macro_full[,1], "%m/%d/%Y")
macro_full <- macro_full[,-1]
macro_full <- macro_full[-1,]
# Remove NA columns, can try with more columns values later...
macro_no_na_full <- macro_full[,colSums(is.na(macro_full))==0]
# make sure dates match
mylist <- c()
for (k in 1:nrow(states_full)){
if (sum(states_full$formatted_date[k] == macro_full$Date) != 1){
mylist <- c(mylist, k)
}
}
if(length(mylist) > 0){
states_full <- states_full[-mylist,]
}
mylist <- c()
for (l in 1:nrow(macro_full)){
if (sum(macro_full$Date[l] == states_full$formatted_date) != 1){
mylist <- c(mylist, l)
}
}
if(length(mylist) > 0){
macro_full <- macro_full[-mylist,]
}
# states are a factor
states_full$X0 <- as.factor(states_full$X0)
set.seed(42)
for (i in 1:50){
if (i <= 8){
lag = i*5 # increment lag by 5 until 40
} else if (i <= 14){
lag = 40 + (i-8)*10 # increment lag by 10 until 100
} else {
lag = 100+(i-14)*100 # increment lag by 100 until 900
}
print(lag)
#Save lag
loop_output$Lag[i] <- lag
#Create a lagged data frame
full <- cbind(macro_no_na_full[1:(nrow(macro_no_na_full)-lag),], states_full[(lag+1):nrow(states_full),])
full_01_17 <- full[1:(which(full$Date == "2017-12-29")-lag),]
full_17_22 <- full[-(1:nrow(full_01_17)),]
# save version with dates to verify lags later
full_w_dates <- full
full_01_17_w_dates <- full_01_17
full_17_22_w_dates <- full_17_22
# remove dates for ml
full <- full[,-c(17,18)]
full_01_17 <- full_01_17[,-c(17,18)]
full_17_22 <- full_17_22[,-c(17,18)]
# this is just for cross validation model
x_01_17 <- data.frame(full_01_17[,-ncol(full_01_17)])
y_01_17 <- full_01_17$X0
# run cross validation model
train=sample(nrow(full_01_17),nrow(full_01_17)*.8,replace=FALSE) #Train/Test
rf.reg = ranger(y = y_01_17[train], x= x_01_17[train,] ,mtry=round(sqrt(ncol(x_01_17))),num.trees=200,importance = "impurity")
y.rf.pred = predict(rf.reg, x_01_17[-train,])$predictions # Predict with bagging
# cross validation model accuracy
rf.acc = mean(y.rf.pred==y_01_17[-train]) # Directly compute the accuracy
#rf.acc
#table(y.rf.pred,y_01_17[-train])
loop_output$`Cross Validation Accuracy 01-17`[i] <- rf.acc
# Expanding window models - takes a while
# prediction <- as.data.frame(matrix(0,nrow = nrow(full_17_22), ncol= 2)) # data set to store predictions
# prediction$V1 <- as.factor(c(0,1))[1] # store predictions as a factor
# previous = 0 # progress bar
# for(a in nrow(full_01_17):(nrow(full)-1)){ #expanding window starts with 2001-2017, next iteration is 2001-2017+1day
# progress = (a-nrow(full_01_17))/(nrow(full)-1-nrow(full_01_17)) # progress bar
# progress = round_any(progress, 0.01) # progress bar
# if (progress != previous){ # progress bar
# print(progress) # progress bar
# }
# previous = progress # progress bar
# rf.reg = ranger(full$X0[1:a]~.,data=full[1:a,],mtry=round(sqrt(ncol(x_01_17))),num.tree=800,importance = 'impurity') # ranger model
# y.rf.pred = predict(rf.reg, full[a+1,])$prediction # make the prediction on the a+1 observation
# prediction$V1[a-nrow(full_01_17)+1] <- y.rf.pred #save the prediction
# prediction$V2<-as.Date(prediction$V2) # save the date so we can verify lags
# prediction$V2[a-nrow(full_01_17)+1] <- as.Date(full_w_dates$formatted_date[a+1])
# if (a == nrow(full)-1) message("Done!") # gives a status update
# }
#
# write.csv(prediction, paste(lag,"lagprediction.csv", sep = "")) # save the prediction so we don't have to rerun
####
### to read-in results from already completed backtets
prediction <- read.csv(paste(lag,"lagprediction.csv", sep = ""))[2]
###
full_17_22_w_pred <- full_17_22_w_dates
full_17_22_w_pred$prediction <- prediction$V1
# Evaluate the accuracy
rf.acc = mean(full_17_22_w_pred$prediction==full_17_22_w_pred$X0)
loop_output$`Forward Accuracy 18-22`[i] <- rf.acc
# build a portfolio out of the predicted states
portfolio <- as.data.frame(matrix(0,nrow = nrow(full_17_22), ncol= 3))
colnames(portfolio) <- c("Value", "Date", "Label")
portfolio$Date <- full_17_22_w_pred$formatted_date
portfolio$Value <- 100
portfolio$Label <- paste(lag,"Lag RandomForest Prediction of MSDR", sep = " ")
for(b in 2:nrow(portfolio)){
if (full_17_22_w_pred$prediction[b-1] == 1){
portfolio$Value[b] <- portfolio$Value[b-1] * (1+QQQ_17_22$`Daily Return`[b])
} else {
portfolio$Value[b] <- portfolio$Value[b-1]
}
}
# save it to dataset containing port
port_comp <- rbind(port_comp, portfolio)
# calculate Sharpe
portfolio_returns <- data.frame(Delt(portfolio$Value)[-1])
portfolio_returns_xts <- xts(portfolio_returns, portfolio$Date[-1])
loop_output$Sharpe[i] <- as.numeric(SharpeRatio.annualized(portfolio_returns_xts))
# rolling sharpe
mylist <- c()
for (z in 1:(nrow(portfolio_returns)-252)){
portfolio_xts_rolling <- portfolio_returns_xts[z:(z+252)]
mylist <- c(mylist, as.numeric(SharpeRatio.annualized(portfolio_xts_rolling)))
}
loop_output$`Average 1YR Rolling Sharpe`[i]<- mean
loop_output$`Median 1YR Rolling Sharpe`[i]<- median(mylist)
loop_output$`Min 1YR Rolling Sharpe`[i]<- min(mylist)
loop_output$`Max 1YR Rolling Sharpe`[i]<- max(mylist)
}
options(warn = defaultW)
# plot output
ggplot(port_comp, aes(x = port_comp$Date, y = port_comp$Value, color = port_comp$Label, group = port_comp$Label))+geom_line()
#loop_output_v1 <- rbind(loop_output_v1, loop_output)
loop_output_v1
I want to extract, .metrics (RMSE) from a Rolling origin forecast resampling
(tibble: 52 x 5) by "id" columns which consist of slices.
The replicating codes are given below. Here is my attempt.
metric <- resamples_fitted$.resample_results
metric
all <- metric[[1]][[".metrics"]]
res <- unlist(all)
estimate <- res[ grepl(".estimate", names(res))]
I want to get the ".estimate" by "slices" in the data frame. For each slice, there will be one RMSE. These are full codes
library(tidymodels)
library(modeltime)
library(modeltime.resample)
library(tidyverse)
library(timetk)
library(resample)
resample_spec <- rolling_origin(
data = m750,
initial = 200,
assess = 3,
cumulative = TRUE,
skip = 1,
overlap = 0 )
resamples_fitted <- m750_models %>%
modeltime_fit_resamples(
resamples = resample_spec,
control = control_resamples(verbose = FALSE)
)
resamples_fitted
metric <- resamples_fitted$.resample_results
metric
all <- metric[[1]][[".metrics"]]
res <- unlist(all)
estimate <- res[ grepl(".estimate", names(res))]
After spending some time, I found a solution, which may not be very elegant though, it solves the problem.
metric <- resamples_fitted$.resample_results
metric
all <- metric[[1]][[ ".metrics"]]
res <- unlist(all)
estimate <- res[ grepl(".estimate", names(res))]
typeof(estimate)
dat <- as.data.frame(sapply(estimate, as.numeric))
data <- dat[complete.cases(dat), ]
all_id <- data.frame(metric[[1]]$id)
al <- cbind(all_id,data)
I have a data frame as "df" and 41 variables var1 to var41. If I write this command
pcdtest(plm(var1~ 1 , data = df, model = "pooling"))[[1]]
I can see the test value. But I need to apply this test 41 times. I want to access variable by column number which is "df[1]" for "var1" and "df[41]" for "var41"
pcdtest(plm(df[1]~ 1 , data = dfp, model = "pooling"))[[1]]
But it fails. Could you please help me to do this? I will have result in for loop. And I will calculate the descriptive statistics for all the results. But it is very difficult to do test for each variable.
I think you can easily adapt the following code to your data. Since you didn't provide any of your data, I used data that comes with the plm package.
library(plm) # for pcdtest
# example data from plm package
data("Cigar" , package = "plm")
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)
# example for one column
p_model <- plm(formula = pop~1, data = Cigar.p, model = "pooling")
pcdtest(p_model)[[1]]
# run through multiple models
l_plm_models <- list() # store plm models in this list
l_tests <- list() # store testresults in this list
for(i in 3:ncol(Cigar.p)){ # start in the third column, since the first two are state and year
fmla <- as.formula(paste(names(Cigar.p)[i], '~ 1', sep = ""))
l_plm_models[[i]] <- plm(formula = as.formula(paste0(colnames(Cigar.p)[i], "~ 1", sep = "")),
data = Cigar.p,
model = "pooling")
l_tests[[i]] <- pcdtest(l_plm_models[[i]])[[1]]
}
testresult <- data.frame("z" = unlist(l_tests), row.names = (colnames(Cigar.p[3:11])))
> testresult
z
price 175.36476
pop 130.45774
pop16 155.29092
cpi 176.21010
ndi 175.51938
sales 99.02973
pimin 175.74600
fact1 176.21010
fact2 176.21010
# example for cipstest
matrix_results <- matrix(NA, nrow = 11, ncol = 2) # use 41 here for your df
l_ctest <- list()
for(i in 3:ncol(Cigar.p)){
l_ctest[[i]] <- cipstest(Cigar.p[, i], lags = 4, type = 'none', model = 'cmg', truncated = F)
matrix_results[i, 1] <- as.numeric(l_ctest[[i]][1])
matrix_results[i, 2] <- as.numeric(l_ctest[[i]][7])
}
res <- data.frame(matrix_results)
names(res) <- c('cips-statistic', 'p-value')
print(res)
Try using as.formula(), for example:
results <- list()
for (i in 1:41){
varName <- paste0('var',i)
frml <- paste0(varName, ' ~ 1')
results[[i]] <-
pcdtest(plm(as.formula(frml) , data = dfp, model = "pooling"))[[1]]
}
You can use reformulate to create the formula and apply the code for 41 times using lapply :
var <- paste0('var', 1:41)
result <- lapply(var, function(x) pcdtest(plm(reformulate('1', x),
data = df, model = "pooling"))[[1]])
I've got a big database which I've split up by year and created at train and test for each respective year
#split the dataset into a list of datasets
Y <- split(dat_all, dat_all$year)
#create a train and test dataset for all years
#takes Y is inp
create_sets <- function(x){
train_set <- sample(2, nrow(x), replace = TRUE, prob = c(0.7, 0.3))
train <- x[train_set == 1, ]
test <- x[train_set == 2, ]
assign('x', list(train = train, test = test))
}
Ylist <- lapply(Y, create_sets)
To call each item out you use Ylist$'2016'$train
I've made an accuracy ratio function which I can run each list through individually but I am looking for a way to do it all in one to save massive amounts of code (theres 16 years of data)
Below is how I currently create an accuracy ratio for one year
val_train<-Ylist$'2016'$train
val_train$pred<-predict(modf,newdata=Ylist$'2016'$train)
val_train$probs<-exp(val_train$pred)/(1+exp(val_train$pred))
x<-data.frame(rcorr.cens(val_train$probs, val_train$default_flag))
train_AR<-x[2,1]
train_AR
modfull <-ModFit(test)
val_test<-test
val_test$pred<-predict(modf,newdata=test)
val_test$probs<-exp(val_test$pred)/(1+exp(val_test$pred))
x<-data.frame(rcorr.cens(val_test$probs, val_test$default_flag))
test_AR<-x[2,1]
test_AR
AR_Logistic1<-c(train_AR,test_AR,)
AR_Logistic2<-c(train_AR,test_AR) #just in to see if table works
AccuracyRatio<-rbind(AR_Logistic1,AR_Logistic2)
colnames(AccuracyRatio)<-c("Train","Test","All")
AccuracyRatio
Just to clarify I'm trying to run through my whole list through my accuracy ratio and then output the AR for each year for its train and test.
Any help is greatly appreciated
With lapply and wrapping the AR calculations in a function you can summarise the output as below.
Without sample data, I could not test it but let us know if you face any errors.
fn_Calc_AR <- function(yearDat = listInput) {
#yearDat <== Ylist$'2016'
trainDat <- yearDat$train
testDat <- yearDat$test
val_train<- trainDat
val_train$pred<-predict(modf,newdata= trainDat)
val_train$probs<-exp(val_train$pred)/(1+exp(val_train$pred))
x<-data.frame(rcorr.cens(val_train$probs, val_train$default_flag))
train_AR<-x[2,1]
#train_AR
modfull <-ModFit(testDat)
val_test<-testDat
val_test$pred<-predict(modf,newdata=testDat)
val_test$probs<-exp(val_test$pred)/(1+exp(val_test$pred))
x<-data.frame(rcorr.cens(val_test$probs, val_test$default_flag))
test_AR<-x[2,1]
#test_AR
AR_Logistic1<-c(train_AR,test_AR) # removed extraneous comma, previous input c(train_AR,test_AR,)
AR_Logistic2<-c(train_AR,test_AR) #just in to see if table works
AccuracyRatio<-rbind(AR_Logistic1,AR_Logistic2)
colnames(AccuracyRatio)<-c("Train","Test","All")
#confirm yearName is being created
try(yearName <- head(names(x),1)) #retain only year
if(length(yearName) > 0L) {
AR_DF <- data.frame(yearName = yearName , AccuracyRatio,stringsAsFactors=FALSE)
}else{
AR_DF <- AccuracyRatio
}
return(AR_DF)
}
Summarise Output:
AR_Summary = do.call(rbind,lapply(Ylist,fn_Calc_AR))
Aggregate Dataset:
aggregateTrain = do.call(rbind,lapply(Ylist,function(x) x$train))
aggregateTest = do.call(rbind,lapply(Ylist,function(x) x$test))
aggregateList = list(train = aggregateTrain,test = aggregateTest)
AR_AggregateSummary = do.call(rbind,lapply(aggregateList,function(x) fn_Calc_AR(x) ))