storing data from a loop in a matrix - r

I would like to code a loop for cross-validation that computes the MSE for one- to four-step forecasts and stores the results in a matrix. The problem is that the columns for the 1- to 3-step forecasts get overwritten and I get just the 4-step forecast in all columns. Can anybody help?
k<-20
n<-length(xy)-1
h<-4
start <- tsp(xy) [1]+k
j <- n-k
mseQ1 <- matrix(NA,j,h)
colnames(mseQ1) <- paste0('h=',1:h)
for(i in 1:j)
{
xtrain <- window(xy, end=start+(i-1))
xvalid <- window(xy, start=start+i, end=start+i)
qualifiedETS <- ets(xtrain, alpha=NULL, beta=NULL, additive.only=TRUE, opt.crit="mse")
fcastHW <- forecast(qualifiedETS, h=h)
mseQ1[i,] <- ((fcastHW[['mean']]-xvalid)^2)
}
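One likely cause is that xvalid holds a single observation (its window starts and ends at start+i), so the single squared error gets recycled across the whole row. A minimal sketch of a rolling-origin loop with an h-point validation window follows. It assumes a made-up series with frequency 1 and uses a base-R random-walk arima() fit as a stand-in for ets(); the name sqerr is hypothetical.

```r
# Rolling-origin evaluation: one row per forecast origin, one column per horizon.
set.seed(1)
xy <- ts(cumsum(rnorm(60)))       # made-up series (assumption), frequency 1
k  <- 20
h  <- 4
n  <- length(xy) - 1
start <- tsp(xy)[1] + k
j  <- n - k - h + 1               # stop while h validation points still exist

sqerr <- matrix(NA_real_, j, h, dimnames = list(NULL, paste0("h=", 1:h)))
for (i in 1:j) {
  xtrain <- window(xy, end = start + (i - 1))
  # h observations, not one, so each horizon is compared with its own actual value
  xvalid <- window(xy, start = start + i, end = start + i + h - 1)
  fit <- arima(xtrain, order = c(0, 1, 0))   # stand-in model, not ets()
  fc  <- predict(fit, n.ahead = h)$pred
  sqerr[i, ] <- (as.numeric(fc) - as.numeric(xvalid))^2
}
mse <- colMeans(sqerr)            # one MSE per horizon
```

With ets() and forecast() from the forecast package, the same idea would apply with fcastHW[['mean']] in place of fc. Converting both sides with as.numeric() avoids the time-index intersection that ts arithmetic performs.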

Related

Loop that generates all possible linear models and replaces the dependent variable name with a value from a list

So I want to test all possible linear regression models obtainable with 1 to 5 independent variables and one of the 18 dependent variables.
I have the code working for generating all the linear regression models with the 1st dependent variable and the 5 independent ones, but I am unsure how to run this code for each of the 18 dependent variables I want to check for.
GClist <- data.frame(GC1, GC2, GC3, GC4, GC5, GC6, GC7, GC8, GC9, GC10, GC11, GC12, GC13, GC14, GC15, GC16, GC17, GC18)
So far, I have made a list of the 18 DVs. I also tried looping with foreach and parsing a list containing the 18 DV names.
I tried to use:
for(value in GClist)
{
the code below
}
// but I did not manage to make it work or to use "value" inside the code
// I also tried foreach, but using df[j], with df containing all 18 dependent variables, did not seem to work.
foreach (j=1:18) %do% {the code below }
Anyway, the code that works is this:
df1 <- data.frame(GC1, D1, D2, D3, D4, D5)
library(foreach)
#create the linear models list
xcomb <- foreach(i=1:5, .combine=c) %do% {combn(names(df1)[-1], i, simplify=FALSE) }
formlist <- lapply(xcomb, function(l) formula(paste(names(df1)[1], paste(l, collapse="+"), sep="~")))
# get the p value for each model
models.p <- sapply(formlist, function (i) {
f <- summary(lm(i))$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
})
# R squared for each model
models.r.sq <- sapply(formlist, function(i) summary(lm(i))$r.squared)
# adjusted R squared for each model
models.adj.r.sq <- sapply(formlist, function(i) summary(lm(i))$adj.r.squared)
# MSEp squared for each model
models.MSEp <- sapply(formlist, function(i) anova(lm(i))['Mean Sq']['Residuals',])
# Full model MSE for each linear model
MSE <- anova(lm(formlist[[length(formlist)]]))['Mean Sq']['Residuals',]
# Mallow's Cp - skipped for now
df.model.eval <- data.frame(model=as.character(formlist), pF=models.p,
r.sq=models.r.sq, adj.r.sq=models.adj.r.sq, MSEp=models.MSEp)
How can I run this code for each of the 18 dependent variables, so that I can gather all the information in df.model.eval? Right now I have everything I need to know about the models that use GC1 as the dependent variable. My goal is to see all the models (from GC1 to GC18) and highlight all that have statistical significance and those that don't.
I managed to do it by creating a data frame with the possible DVs:
GC <- data.frame(GC1, GC2, GC3, GC4, GC5, GC6, GC7, GC8, GC9, GC10, GC11, GC12, GC13, GC14, GC15, GC16, GC17, GC18)
Then, using George Dontas' code found at https://stats.stackexchange.com/a/6862, I prepare my first dataframe to generate linear regression models.
library(foreach)
finallist <- list()
xcomb <- foreach(i=1:5, .combine=c) %do% {combn(names(df1)[-1], i, simplify=FALSE) }
I am replacing the first element of my dataframe here:
for(j in c(names(GC)))
{
names(df1)[1] <- j #switch DVs
#George's code to create a list with all the possible combinations of the DV and the IVs
formlist <- lapply(xcomb, function(l) formula(paste(names(df1)[1], paste(l, collapse="+"), sep="~")))
#append the list of linear regression models for each DV to the main list
finallist <- append(finallist, formlist)
}
I now have a list of all the linear regression formulas, which I can then use to assess and interpret the relevant statistical indices. The formulas can be listed with:
as.character(finallist)
My final code is:
df1 <- data.frame(GC1, D1, D2, D3, D4, D5)
library(foreach)
library(rlist)
finallist <- list()
xcomb <- foreach(i=1:5, .combine=c) %do% {combn(names(df1)[-1], i, simplify=FALSE) }
#iterate through DV names
for(j in c(names(GC)))
{
names(df1)[1] <- j #switch DVs (I know 1st step replaces same DV name)
#create a list with all the possible combinations of the DV and the IVs
formlist <- lapply(xcomb, function(l) formula(paste(names(df1)[1], paste(l, collapse="+"), sep="~")))
#append the list of linear regression models for each DV to the main list
finallist <- append(finallist, formlist)
}
# get the p value for each model
models.p <- sapply(finallist, function (i) {
f <- summary(lm(i))$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
})
# R squared for each model
models.r.sq <- sapply(finallist, function(i) summary(lm(i))$r.squared)
# adjusted R squared for each model
models.adj.r.sq <- sapply(finallist, function(i) summary(lm(i))$adj.r.squared)
# MSEp squared for each model
models.MSEp <- sapply(finallist, function(i) anova(lm(i))['Mean Sq']['Residuals',])
# Full model MSE for each linear model
MSE <- anova(lm(finallist[[length(finallist)]]))['Mean Sq']['Residuals',]
# Mallow's Cp - skipped for now
df.model.eval <- data.frame(model=as.character(finallist), pF=models.p,
r.sq=models.r.sq, adj.r.sq=models.adj.r.sq, MSEp=models.MSEp)
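An alternative that avoids renaming columns is to build each formula with reformulate() and pass the data explicitly to lm(). This is only a sketch on made-up data: the data frame dat, the three toy DVs, and the helper eval_model are hypothetical stand-ins, not part of the code above.

```r
set.seed(42)
# Toy data (assumption): 3 dependent variables GC1..GC3 and 5 predictors D1..D5
dat <- as.data.frame(matrix(rnorm(40 * 8), ncol = 8))
names(dat) <- c(paste0("GC", 1:3), paste0("D", 1:5))
dvs <- paste0("GC", 1:3)
ivs <- paste0("D", 1:5)

# Every non-empty subset of the predictors (31 subsets for 5 IVs)
xcomb <- do.call(c, lapply(seq_along(ivs),
                           function(i) combn(ivs, i, simplify = FALSE)))

# One formula per (DV, predictor subset) pair
forms <- do.call(c, lapply(dvs, function(dv)
  lapply(xcomb, function(x) reformulate(x, response = dv))))

# Fit one model and collect the same indices as in the code above
eval_model <- function(f) {
  s  <- summary(lm(f, data = dat))
  fs <- s$fstatistic
  data.frame(model    = paste(deparse(f), collapse = " "),
             pF       = unname(pf(fs[1], fs[2], fs[3], lower.tail = FALSE)),
             r.sq     = s$r.squared,
             adj.r.sq = s$adj.r.squared,
             MSEp     = s$sigma^2)   # residual mean square
}
model_eval <- do.call(rbind, lapply(forms, eval_model))
head(model_eval)
```

Fitting each model once and reading all indices from the same summary also avoids refitting every model for each statistic.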

rmse function issue in R

I have R code that contains nested for loops, within which I use the rmse() function from the Metrics package. The function works when I call it on its own, but not inside my nested loops.
Here is what I want to do with R:
1. I have generated a time series of 50 observations.
2. I slice the same time series into chunks of the following sizes: 2,3,...,48,49, which gives me 48 different time series formed from step 1 above.
3. I divide each of the 48 time series into train and test sets so I can use the rmse function in the Metrics package to get the Root Mean Squared Error (RMSE) for the 48 subseries formed in step 2.
4. The RMSE for each series is then tabulated according to its chunk size.
5. I obtain the best ARIMA model for each of the 48 time series.
My R code
# simulate arima(1,0,0)
library(forecast)
library(Metrics)
n <- 50
phi <- 0.5
set.seed(1)
wn <- rnorm(n, mean=0, sd=1)
ar1 <- sqrt((wn[1])^2/(1-phi^2))
for(i in 2:n){
ar1[i] <- ar1[i - 1] * phi + wn[i]
}
ts <- ar1
t<-length(ts)# the length of the time series
li <- seq(n-2)+1 # vector of block sizes(i.e to be between 1 and n exclusively)
RMSEblk<-matrix(nrow = 1, ncol = length(li))#vector to store block means
colnames(RMSEblk)<-li
for (b in 1:length(li)){
l<- li[b]# block size
m <- ceiling(t / l) # number of blocks
blk<-split(ts, rep(1:m, each=l, length.out = t)) # divides the series into blocks
singleblock <- vector() #initialize vector to receive result from for loop
for(i in 1:10){
res<-sample(blk, replace=T, 100) # resamples the blocks
res.unlist<-unlist(res, use.names = F) # unlist the bootstrap series
# Split the series into train and test set
train <- head(res.unlist, round(length(res.unlist) * 0.6))
h <- length(res.unlist) - length(train)
test <- tail(res.unlist, h)
# Forecast for train set
model <- auto.arima(train)
future <- forecast(test, model=model,h=h)
nfuture <- as.numeric(out$mean) # makes the `future` object a vector
# use the `rmse` function from `Metrics` package
RMSE <- rmse(test, nn)
singleblock[i] <- RMSE # Assign RMSE value to final result vector element i
}
#singleblock
RMSEblk[b]<-mean(singleblock) #store into matrix
}
RMSEblk
The error I got
#Error in rmse(test, nn): unused argument (nn)
#Traceback:
But when I wrote
library(forecast)
train <- head(ar1, round(length(ar1) * 0.6))
h <- length(ar1) - length(train)
test <- tail(ar1, h)
model <- auto.arima(train)
#forecast <- predict(model, h)
out <- forecast(test, model=model,h=h)
nn <- as.numeric(out$mean)
rmse(test, nn)
It did work
Please point out what I am missing?
I am able to run your code after making two very small corrections in your for loop. See the two commented lines:
for (b in 1:length(li)){
l<- li[b]
m <- ceiling(t / l)
blk<-split(ts, rep(1:m, each=l, length.out = t))
singleblock <- vector()
for(i in 1:10){
res<-sample(blk, replace=T, 100)
res.unlist<-unlist(res, use.names = F)
train <- head(res.unlist, round(length(res.unlist) * 0.6))
h <- length(res.unlist) - length(train)
test <- tail(res.unlist, h)
model <- auto.arima(train)
future <- forecast(test, model=model,h=h)
nfuture <- as.numeric(future$mean) # EDITED: `future` instead of `out`
RMSE <- rmse(test, nfuture) # EDITED: `nfuture` instead of `nn`
singleblock[i] <- RMSE
}
RMSEblk[b]<-mean(singleblock)
}
It is possible that these typos did not result in errors because nn and out were defined in the global environment while you ran the for loop. A good debugging tip is to restart R and try to reproduce the problem.
Your loop never defines nn, while the standalone code that works does. To start from a clean slate, use this line as the first executable line:
rm(list=ls())

R -fill matrix with pairs of simulated MVN data

I first want to simulate correlated MVN data using the mvrnorm function from the MASS package. Then I want to repeat this simulation i times and fill results in a matrix so that first results are in columns i, i+1, second in i+2, i+3 and so on.
So far I did the following:
SimYCB <- c(73.1,60.6,59.6,54.5,57.9,61.14)
SimPCB <- c(15.7,18.25,22.38,20.22,16.53,18.616)
SimCB <- data.frame(SimYCB,SimPCB)
n=20
m=1000
MVSimCB = matrix()
for(i in 1:m)
{MVSimCB[,i]=mvrnorm(n, mu=mean(SimCB),
Sigma=cov(SimCB))}
What is the mistake?
Maybe this helps:
MVSimCB <- matrix(,ncol=m, nrow=n)
set.seed(24)
for(i in seq(1,m, by=2)){
MVSimCB[, i:(i+1)] <- mvrnorm(n, mu=colMeans(SimCB), Sigma=cov(SimCB))
}
Or you could use replicate
set.seed(24)
MVSimCB2 <- do.call(cbind, replicate(m/2, mvrnorm(n, mu=colMeans(SimCB),
Sigma=cov(SimCB)), simplify=FALSE))
all.equal(MVSimCB, MVSimCB2, check.attributes=FALSE)
#[1] TRUE
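For context, the answer above changes three things relative to the question's code. The base-R sketch below shows each one on the same SimCB data; the small value m = 6 and the name pairs are chosen only for illustration.

```r
SimYCB <- c(73.1, 60.6, 59.6, 54.5, 57.9, 61.14)
SimPCB <- c(15.7, 18.25, 22.38, 20.22, 16.53, 18.616)
SimCB  <- data.frame(SimYCB, SimPCB)

# 1. matrix() with no dimensions is a 1 x 1 matrix, so MVSimCB[, i] fails for i > 1
dim(matrix())

# 2. mean() on a data frame returns NA with a warning; colMeans() gives one mean per column
suppressWarnings(mean(SimCB))
colMeans(SimCB)

# 3. each simulation yields two columns, so the loop index has to advance in steps of 2
m <- 6                      # small value for illustration only
pairs <- lapply(seq(1, m, by = 2), function(i) i:(i + 1))
pairs                       # column pairs 1:2, 3:4, 5:6
```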

Non-Deterministic behaviour of svm{e1071}

I noticed that SVM, when fed with decision.values=T (plus sigmoid to get probabilities), produces non-deterministic results when I permute the data frame under analysis. Does anyone have any idea why? Please try the code yourself:
install.packages("e1071")
library(e1071)
A <- cbind(rnorm(20,1,1),rnorm(20,1,1),rep(1,20))
B <- cbind(rnorm(20,9,1),rnorm(20,9,1),rep(0,20))
dataframe <- as.data.frame(rbind(A,B))
predc <- rep(0,length(dataframe[,1]))
K <- length(dataframe[1,])
permutator <- sample(nrow(dataframe))
dataframe$V3 <- factor(dataframe$V3)
dataframe <- dataframe[permutator, ]
for(i in 1:length(dataframe[,1])) {
frm <- as.formula(object=paste("V",as.character(K), " ~ .",sep=""))
r <- svm(formula=frm, data=(dataframe[-i,]))
predicted <- predict(r,newdata=dataframe[i,],decision.values=TRUE)
predc[i] <- sigmoid(attr(predicted,'decision.values')[1])
}
plot(sort(predc))

Iterate a function R code

I have a function myF(g,m,alpha,gam,theta,beta), which returns three parameter estimates. I want to iterate this function for (i in 1:10). How can I do this in R?
myF <- function(g,m,alpha,gam,theta,beta){
dat <- sim.data(g,m,alpha,gam,theta,beta)
time <- dat$times
delta <- dat$cens
i <- dat$group
X1<-dat$cov #cov~rbinom
n <- length(levels(as.factor(i)))
di <- aggregate(delta,by=list(i),FUN=sum)[,2]
D <- sum(di)
loglik <- function(par){
.........................................
return(-lik)
}
initial=c(0.5,0.5,-0.5,0.5)
maxF <- nlm(loglik, initial)
return(c(theta=exp(maxF$estimate[2]),beta1=maxF$estimate[3],alpha=exp(maxF$estimate[2])))
}
This can easily be done using replicate:
replicate(10, myF(g,m,alpha,gam,theta,beta))
This will create a 3*10 matrix of the parameter estimates, where each column is the result of a separate iteration.
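To see the shape of the result without sim.data, here is a small self-contained sketch. The function toyF is a hypothetical stand-in that returns three named estimates from simulated exponential data; it is not myF.

```r
# Hypothetical stand-in for myF: returns a named vector of three estimates
toyF <- function(n) {
  x <- rexp(n, rate = 2)
  c(mean = mean(x), sd = sd(x), rate = 1 / mean(x))
}

set.seed(123)
est <- replicate(10, toyF(50))   # 3 x 10 matrix, one column per run
dim(est)
rowMeans(est)                    # average of each estimate over the 10 runs
t(est)                           # 10 x 3 layout, one row per run
```

Because replicate() re-evaluates its expression on every iteration, each column comes from a fresh simulated data set.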
