The strategy is carried out on a "rolling" basis:
For each day n, the previous k days of the differenced logarithmic returns of a stock market index are used as a window for fitting an optimal ARIMA model.
#Install relevant packages
install.packages("quantmod")
install.packages("forecast")
#Import the necessary libraries
library(quantmod)
library(forecast)
#Get S&P 500
getSymbols("^GSPC", from = "2000-01-01")
#Compute the log of the daily closing prices (differenced into returns below)
gspcRet<-(log(Cl(GSPC)))
#Use only the last two years of returns
gspc500<-tail(gspcRet,500)
spReturns<-diff(gspc500)
spReturns[as.character(head(index(Cl(GSPC)),1))] = 0
# Create the forecasts vector to store the predictions
windowLength<- 500
foreLength<-length(spReturns) - windowLength
forecasts <- vector(mode="list", length=foreLength)
fit1 <- vector(mode="list", length=foreLength)
for (d in 0:foreLength) {
# Obtain the S&P500 rolling window for this day
spReturnsOffset<- spReturns[(1+d):(windowLength+d)]
#Searching for the best models
order.matrix<-matrix(0,nrow = 3, ncol = 6 * 2 * 6)
aic.vec<- numeric(6 * 2 * 6)
k<-1
for(p in 0:5) for(d in 0:1) for(q in 0:5){
order.matrix[,k]<-c(p,d,q)
aic.vec[k]<- AIC(Arima( spReturnsOffset, order=c(p,d,q)))
k<-k+1
}
ind<- order(aic.vec,decreasing=F)
aic.vec<- aic.vec[ind]
order.matrix<- order.matrix[,ind]
order.matrix<- t(order.matrix)
result<- cbind(order.matrix,aic.vec)
#colnames(result)<- c("p","d","q","AIC")
p1<- result[1,1]
p2<- result[2,1]
p3<- result[3,1]
p4<- result[4,1]
d1<- result[1,2]
d2<- result[2,2]
d3<- result[3,2]
d4<- result[4,2]
q1<- result[1,3]
q2<- result[2,3]
q3<- result[3,3]
q4<- result[4,3]
# I think the code is correct up to here; the problem seems to be in the lines below
fit1[d+1]<- Arima(spReturnsOffset, order=c(p1,d1,q1))
forecasts[d+1]<- forecast(fit1,h=1)
#forecasts[d+1]<- unlist(fcast$mean[1])
}
I get the following error:
Error in x - fits : non-numeric argument to binary operator
In addition: Warning messages:
1: In fit1[d + 1] <- Arima(spReturnsOffset, order = c(p1, d1, q1)) :
number of items to replace is not a multiple of replacement length
2: In mean.default(x, na.rm = TRUE) :
argument is not numeric or logical: returning NA
Can anyone please suggest a fix?
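Going by the warning text, one likely cause is that `fit1[d+1] <- ...` uses single brackets, so R tries to spread the many components of the model object over one list slot, and `forecast(fit1, h=1)` then receives the whole list rather than a single model. Note also that the inner `for(d in 0:1)` reuses the outer loop variable, so after the grid search `d` is 1 and no longer the day offset. The sketch below shows the `[[ ]]` storage pattern on simulated data. It uses base R's `arima()` and `predict()` as stand-ins so that it runs without extra packages; the simulated series, the window length and the fixed order c(1,0,0) are assumptions for illustration only.

```r
set.seed(1)

# Simulated stand-in for log prices (assumption, not S&P 500 data)
logPrices <- cumsum(rnorm(120))
ret <- diff(logPrices)                    # 119 "returns"

windowLength <- 100
foreLength   <- length(ret) - windowLength

fits      <- vector(mode = "list", length = foreLength)
forecasts <- numeric(foreLength)

# Use a loop variable that no inner loop overwrites
for (i in seq_len(foreLength)) {
  win <- ret[i:(windowLength + i - 1)]    # rolling window for this step
  fit <- arima(win, order = c(1, 0, 0))   # fixed order, illustration only

  fits[[i]]    <- fit                             # [[ ]] stores the whole model in one slot
  forecasts[i] <- predict(fit, n.ahead = 1)$pred[1]  # forecast from this single fit
}

print(head(forecasts))
```

With the forecast package, the last two steps would presumably become `fits[[i]] <- Arima(...)` and `forecasts[i] <- forecast(fits[[i]], h = 1)$mean[1]`, which is what the commented-out line in the question hints at.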
Related
My course requires me to use Udacity's Enron financial data to build a financial fraud detection model in R.
I wrote a function for the calculation (split_train_set just splits the data into a 70-30 training and testing set).
library(e1071)
library(caret)
nb_runner <- function(dataset, rm.na=FALSE) {
split_df <- split_train_set(dataset, rm.na)
nb <- naiveBayes(x=split_df$x_train_set, y=split_df$y_train_set$poi)
nb_predict <- predict(nb, newdata=split_df$x_test_set, type='class')
cm <- confusionMatrix(nb_predict, split_df$y_test_set$poi, positive='True')
return(cm)
}
It worked fine in the beginning.
However, it stopped working after I cleaned up the data by removing the rows with more than 15 NAs using the following function and reran the same nb_runner():
remove_high_na <- function(dataset, threshold = 0.7) {
# The number of NAs per row ranges from 2 to 17.
# Since we have only 22 features in the dataset, a high share of NAs makes a row useless.
# Hence, we remove rows with a high share of NAs, using a threshold of 0.7.
# Rows with more than floor(22 * 0.7) = 15 NAs are removed.
threshold_cols <- floor(ncol(dataset) * threshold)
df <- subset(dataset, rowSums(is.na(dataset)) <= threshold_cols)
# df <- dataset[-which(rowSums(is.na(dataset)) > threshold_cols),]
return(df)
}
Error in object$levels[apply(L, 2, which.max)] :
invalid subscript type 'list'
The code failed and the traceback is as follows:
4.
factor(object$levels[apply(L, 2, which.max)], levels = object$levels)
3.
predict.naiveBayes(nb, newdata = split_df$x_test_set, type = "class")
2.
predict(nb, newdata = split_df$x_test_set, type = "class") at POI_helpers.R#38
1.
nb_runner(df_1)
I am not quite sure what I was doing wrong since the same dataset worked fine in other classifiers.
Thank you in advance for your help.
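One way this exact message can arise in base R, offered as a hypothesis rather than a diagnosis: `apply(L, 2, which.max)` returns a plain integer vector only when every column yields exactly one index. If a column consists entirely of NA values, `which.max()` returns `integer(0)` for it, `apply()` falls back to a list, and indexing a character vector with a list fails. The matrix `L_ok` and the class labels below are made up for illustration; whether the cleaned dataset actually produces all-NA columns inside `predict.naiveBayes` is an assumption that would need checking.

```r
class_levels <- c("False", "True")

# Hypothetical score matrix: one row per class, one column per test case
L_ok <- matrix(c(-1.2, -0.3,
                 -0.5, -2.0), nrow = 2)
idx_ok <- apply(L_ok, 2, which.max)    # integer vector, one index per column
print(class_levels[idx_ok])

# Same matrix, but the second test case has no usable scores
L_bad <- L_ok
L_bad[, 2] <- NA
idx_bad <- apply(L_bad, 2, which.max)  # lengths differ, so apply() returns a list
print(is.list(idx_bad))

# Indexing with that list reproduces the message from the question
msg <- tryCatch(class_levels[idx_bad],
                error = function(e) conditionMessage(e))
print(msg)
```

If this is what happens, inspecting the test set for rows or columns that became entirely NA (or constant) after `remove_high_na()` would be a reasonable first check.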
I am a student trying to build a neural network for regression for the first time to predict my variable Math_G3. However, I am unable to do so because of the following error:
Error in x - y : non-conformable arrays
In addition: Warning messages:
1: In cbind(1, act.temp) :
number of rows of result is not a multiple of vector length (arg 1)
2: In cbind(1, act.temp) :
number of rows of result is not a multiple of vector length (arg 1)
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
Here is my code for it:
index_train_math<-sample(1:nrow(dat_math),0.6*nrow(dat_math))
#randomly sample 60% of the rows as training data
train_math <- dat_math[index_train_math,]
# training data of math
maxs <-apply(dat_math,2,max)
mins <- apply(dat_math,2,min)
scaled <- as.data.frame(scale(dat_math,center = mins,scale=maxs-mins))
#returns a matrix that needs to be coerced into a data frame
train_math <- scaled[index_train_math,]
test_math <- scaled[-index_train_math,]
n_math <- names(train_math)
f_math <- as.formula(paste("Math_G3 ~ traveltime + studytime + failure + famrel + goout + Dalc + Walc + health + absences + Math_G1 + Math_G2", paste(n_math[!n_math %in% "Math_G3"], collapse = " + ")))
nn <- neuralnet(f_math,data=train_math,hidden=c(5,3),linear.output=T)
Below is a preview of my dataset:
May I know what is wrong with my code and how I can fix it? Thank you!
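One thing that stands out in the code above is the formula: `paste()` is given both a hand-written formula string and the collapsed column names, so the two are joined by a space and the result is not a valid formula. This may or may not be the whole story behind the errors shown, but a sketch of building the formula from the column names alone, using base R's `reformulate()`, looks like this. The column names are a subset of those in the question and stand in for `names(train_math)`.

```r
# Column names assumed from the question (a subset, for brevity)
n_math <- c("traveltime", "studytime", "Math_G1", "Math_G2", "Math_G3")

# What the question's paste() call produces: two strings joined by a space
bad_string <- paste("Math_G3 ~ traveltime + studytime",
                    paste(n_math[!n_math %in% "Math_G3"], collapse = " + "))
bad_result <- tryCatch(as.formula(bad_string),
                       error = function(e) "not a valid formula")
print(bad_result)

# Build the formula from the predictor names only
predictors <- n_math[!n_math %in% "Math_G3"]
f_math <- reformulate(predictors, response = "Math_G3")
print(f_math)
```

The resulting `f_math` can be passed to `neuralnet()` in place of the pasted formula. It is also worth confirming that every column of `dat_math` is numeric before calling `scale()`.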
I am trying to run a rolling-window linear regression. I don't know how to store the output from the forecast in a variable that I could then use for plotting etc.
predict.1 <- function(P){
results <- rep(0, P)
for( i in 0:2){
y1<-window(y,start=1937+i,end=1966+i)
x1<-window(ll,start=1937+i,end=1966+i)
dx2<-window(ll,start=1937+i,end=1966+i)
in.sample<-data.frame(y1,x1,x2)
names(in.sample)<-c("Outcome","Predictor1","Predictor2")
x1.pred<-window(x1,start=1967+i,end=1967+i)
x2.pred<-window(x2,start=1967+i,end=1967+i)
out.sample<-data.frame(x1.pred,x2.pred)
new.data<-out.sample
names(new.data)<-c("Predictor1","Predictor2")
results[i]<-predict(lm(Outcome~Predictor1+Predictor2,data=in.sample),new.data,se.fit=TRUE)
}
results
}
I receive this message:
Warning messages:
1: In results[i] <- predict(lm(Outcome ~ Predictor1 + Predictor2, data = in.sample), :
number of items to replace is not a multiple of replacement length
2: In results[i] <- predict(lm(Outcome ~ Predictor1 + Predictor2, data = in.sample), :
number of items to replace is not a multiple of replacement length.
I don't know how to overcome the problem.
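The warning follows from what `predict()` returns: with `se.fit=TRUE` it gives a list (fit, se.fit, df, residual.scale), which cannot fit into a single element of a numeric vector. In addition, the loop starts at 0, and `results[0]` does not refer to any element. A minimal sketch of one way to store the output, one row per forecast, is below. The series `y`, `x1` and `x2` are simulated stand-ins, since the original data are not shown.

```r
set.seed(42)

# Simulated annual series, 1937 to 1976 (assumption: stand-ins for the real data)
y  <- ts(rnorm(40), start = 1937)
x1 <- ts(rnorm(40), start = 1937)
x2 <- ts(rnorm(40), start = 1937)

predict.1 <- function(P) {
  # One row per forecast: target year, point forecast, standard error
  results <- data.frame(year = numeric(P), fit = numeric(P), se = numeric(P))
  for (i in seq_len(P)) {               # start at 1 so that row i exists
    first <- 1937 + i - 1
    last  <- 1966 + i - 1
    in.sample <- data.frame(
      Outcome    = as.numeric(window(y,  start = first, end = last)),
      Predictor1 = as.numeric(window(x1, start = first, end = last)),
      Predictor2 = as.numeric(window(x2, start = first, end = last)))
    # Predictors for the year right after the estimation window
    new.data <- data.frame(
      Predictor1 = as.numeric(window(x1, start = last + 1, end = last + 1)),
      Predictor2 = as.numeric(window(x2, start = last + 1, end = last + 1)))
    p <- predict(lm(Outcome ~ Predictor1 + Predictor2, data = in.sample),
                 new.data, se.fit = TRUE)
    # Pull the numbers out of the list before storing them
    results[i, ] <- c(last + 1, unname(p$fit), p$se.fit)
  }
  results
}

res <- predict.1(3)
print(res)
```

The returned data frame can be plotted directly, for example with `plot(res$year, res$fit, type = "b")`.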
I've been trying to backtest the predictability of a regression (trying to get one-step-ahead predictions) by implementing a rolling-window regression and, for each day in the past, calculating and recording in a column the difference between the estimate and the last available day.
I tried to apply Christoph_J's answer at "Rolling regression return multiple objects".
There is no syntax error in the code. However, I'm not sure whether there is a semantic error. Is the value in row i of the "predicted" column the ex-ante prediction of the row i value of the OpCl column?
library(zoo)
library(dynlm)
library(quantmod)
sp <- getSymbols("^GSPC", auto.assign=FALSE)
sp$GSPC.Adjusted <- NULL
colnames(sp) <- gsub("^GSPC\\.","",colnames(sp))
sp$Number<-NA
sp$Number<-1:nrow(sp)
sp$OpCl <- OpCl(sp)
sp$ClHi <- HiCl(sp)
sp$LoCl <- LoCl(sp)
sp$LoHi <- LoHi(sp)
#### LAG
spLag <- lag(sp)
colnames(spLag) <- paste(colnames(sp),"lag",sep="")
sp <- na.omit(merge(sp, spLag))
### REGRESSION
f <- OpCl ~ Openlag + Highlag + OpCllag + ClHilag
OpClLM <- lm(f, data=sp)
#sp$OpClForecast <- NA
#sp$OpClForecast <- tail(fitted(OpClLM),1)
#####################################################
rolling.regression <- function(series) {
mod <- dynlm(formula = OpCl ~ L(Open) + L(High) + L(OpCl) + L(ClHi),
data = as.zoo(series))
nextOb <- min(series[,6])+1 # To get the first row that follows the window
if (nextOb<=nrow(sp)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(OpCl=sp[nextOb,'OpCl'],
Open=sp[nextOb,'Open'],High=sp[nextOb,'High'],
OpCl=sp[nextOb,'OpCl'], ClHi=sp[nextOb,'ClHi']))
attributes(predicted)<-NULL
#Solution ; Get column names right
c(predicted=predicted,
AdjR = summary(mod)$adj.r.squared)
}
}
rolling.window <- 300
results.sp <- rollapply(sp, width=rolling.window,
FUN=rolling.regression, by.column=F, align='right')
sp<-cbind(sp,results.sp)
View(sp)
I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when using each setup with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
Euclidean: Error in hclust(distance, method = method.hclust) :
must have n >= 2 objects to cluster
Correlation: Error in cor(x, method = "pearson", use = use.cor) :
supply both 'x' and 'y' or a matrix-like 'x'.
Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust runs fine. Why am I getting these errors for pvclust at 3 dimensions and below but not for hclust? Furthermore, why do the errors disappear when I use a data set with 4 or more dimensions?
At the end of the pvclust function we see the line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
nboot = nboot, method.dist = method.dist, use.cor = use.cor,
method.hclust = method.hclust, store = store, weight = weight)
Digging deeper, we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
nboot, store, weight = F)
{
n <- nrow(data)
size <- round(n * r, digits = 0)
....
smpl <- sample(1:n, size, replace = TRUE)
suppressWarnings(distance <- dist.pvclust(data[smpl,
], method = method.dist, use.cor = use.cor))
....
}
Also note that the default value of the parameter r for pvclust is r=seq(.5,1.4,by=.1). However, the console output shows that this value is adjusted somewhere along the way:
Bootstrap (r = 0.33)...
So we get size <- round(3 * 0.33, digits = 0), which is 1, and data[smpl,] therefore has only 1 row, fewer than the 2 that clustering needs. After adjusting r, the call still emits a warning, which is possibly harmless, and output is produced:
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
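The failing step can be reproduced in base R without pvclust. The sketch below is only an approximation of what happens inside `boot.hclust`: it assumes, based on the two error messages in the question, that the Euclidean branch calls `dist()` on the transposed sample and the correlation branch calls `cor()` on it. With a single sampled row, `mat[smpl, ]` drops to a plain vector (base R's default `drop = TRUE`), and both calls fail.

```r
# First three columns of the matrix from the question
mat <- matrix(c(9000, 2, 238,
                10000, 6, 224,
                1001, 3, 259),
              nrow = 3, dimnames = list(NULL, c("A", "B", "C")))

n    <- nrow(mat)
size <- round(n * 0.33, digits = 0)       # 0.99 rounds to 1
smpl <- sample(1:n, size, replace = TRUE)
sub  <- mat[smpl, ]                       # one row selected: result is a vector

print(size)
print(is.null(dim(sub)))                  # TRUE, the matrix structure is gone

# Euclidean branch: t(sub) is a 1-row matrix, so only one object to cluster
err_euclid <- tryCatch(hclust(dist(t(sub)), method = "average"),
                       error = function(e) conditionMessage(e))

# Correlation branch: cor() needs a matrix-like x when y is missing
err_cor <- tryCatch(cor(sub),
                    error = function(e) conditionMessage(e))

print(err_euclid)
print(err_cor)
```

This is why choosing `r` so that `round(n * r)` is at least 2 for every value avoids both errors.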
Let me know if the results are satisfactory.