So I want to test all possible linear regression models obtainable with 1 to 5 independent variables and one of the 18 dependent variables.
I have the code working for generating all the linear regression models with the 1st dependent variable and the 5 independent ones, but I am unsure how to run this code for each of the 18 dependent variables I want to check for.
GClist <- data.frame(GC1, GC2, GC3, GC4, GC5, GC6, GC7, GC8, GC9, GC10, GC11, GC12, GC13, GC14, GC15, GC16, GC17, GC18)
So far, I have made a list of the 18 DVs. I also tried looping with a for loop and with foreach, parsing a list that contains the 18 DV names.
I tried to use:
for(value in GClist)
{
the code below
}
but I could not work out how to use "value" inside the loop body.
I also tried foreach, but indexing df[j], with df containing all 18 dependent variables, did not seem to work:
foreach (j=1:18) %do% {the code below }
Anyway, the code that works is this:
df1 <- data.frame(GC1, D1, D2, D3, D4, D5)
library(foreach)
#create the linear models list
xcomb <- foreach(i=1:5, .combine=c) %do% {combn(names(df1)[-1], i, simplify=FALSE) }
formlist <- lapply(xcomb, function(l) formula(paste(names(df1)[1], paste(l, collapse="+"), sep="~")))
# get the p value for each model
models.p <- sapply(formlist, function (i) {
f <- summary(lm(i))$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=FALSE)
attributes(p) <- NULL
return(p)
})
# R squared for each model
models.r.sq <- sapply(formlist, function(i) summary(lm(i))$r.squared)
# adjusted R squared for each model
models.adj.r.sq <- sapply(formlist, function(i) summary(lm(i))$adj.r.squared)
# MSEp (residual mean square) for each model
models.MSEp <- sapply(formlist, function(i) anova(lm(i))['Mean Sq']['Residuals',])
# Full model MSE for each linear model
MSE <- anova(lm(formlist[[length(formlist)]]))['Mean Sq']['Residuals',]
# Mallow's Cp - skipped for now
df.model.eval <- data.frame(model=as.character(formlist), pF=models.p,
r.sq=models.r.sq, adj.r.sq=models.adj.r.sq, MSEp=models.MSEp)
How can I run this code for each of the 18 dependent variables, so that I can gather all the information in df.model.eval? Right now I have everything I need to know about the models that use GC1 as the dependent variable. My goal is to see all the models (from GC1 to GC18) and highlight all that have statistical significance and those that don't.
I managed to do it by creating a data frame with the possible DVs:
GC <- data.frame(GC1, GC2, GC3, GC4, GC5, GC6, GC7, GC8, GC9, GC10, GC11, GC12, GC13, GC14, GC15, GC16, GC17, GC18)
Then, using George Dontas' code found at https://stats.stackexchange.com/a/6862, I prepare my first dataframe to generate linear regression models.
library(foreach)
finallist <- list()
xcomb <- foreach(i=1:5, .combine=c) %do% {combn(names(df1)[-1], i, simplify=FALSE) }
I am replacing the first element of my dataframe here:
for(j in c(names(GC)))
{
names(df1)[1] <- j #switch DVs
#George's code to create a list with all the possible combinations of the DV and the IVs
formlist <- lapply(xcomb, function(l) formula(paste(names(df1)[1], paste(l, collapse="+"), sep="~")))
#append the list of linear regression models for each DV to the main list
finallist <- append(finallist, formlist)
}
I now have a list of all the linear regression model formulas, which I can use to fit the models and assess the relevant statistical indices. The formulas can be printed with:
as.character(finallist)
My final code is:
df1 <- data.frame(GC1, D1, D2, D3, D4, D5)
library(foreach)
library(rlist)
finallist <- list()
xcomb <- foreach(i=1:5, .combine=c) %do% {combn(names(df1)[-1], i, simplify=FALSE) }
#iterate through DV names
for(j in c(names(GC)))
{
names(df1)[1] <- j #switch DVs (I know 1st step replaces same DV name)
#create a list with all the possible combinations of the DV and the IVs
formlist <- lapply(xcomb, function(l) formula(paste(names(df1)[1], paste(l, collapse="+"), sep="~")))
#append the list of linear regression models for each DV to the main list
finallist <- append(finallist, formlist)
}
# get the p value for each model
models.p <- sapply(finallist, function (i) {
f <- summary(lm(i))$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=FALSE)
attributes(p) <- NULL
return(p)
})
# R squared for each model
models.r.sq <- sapply(finallist, function(i) summary(lm(i))$r.squared)
# adjusted R squared for each model
models.adj.r.sq <- sapply(finallist, function(i) summary(lm(i))$adj.r.squared)
# MSEp (residual mean square) for each model
models.MSEp <- sapply(finallist, function(i) anova(lm(i))['Mean Sq']['Residuals',])
# Full model MSE for each linear model
MSE <- anova(lm(finallist[[length(finallist)]]))['Mean Sq']['Residuals',]
# Mallow's Cp - skipped for now
df.model.eval <- data.frame(model=as.character(finallist), pF=models.p,
r.sq=models.r.sq, adj.r.sq=models.adj.r.sq, MSEp=models.MSEp)
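For comparison, here is a minimal sketch of the same idea that keeps everything in one data frame and passes it to lm() through the data argument, so no column has to be renamed. The data frame dat, its simulated columns, and the helper fit_stats are made up for illustration; only the overall approach (all IV subsets crossed with all DVs) comes from the code above.

```r
set.seed(42)
n <- 50
# Hypothetical data: three IVs and two DVs instead of 5 and 18
dat <- data.frame(D1 = rnorm(n), D2 = rnorm(n), D3 = rnorm(n))
dat$GC1 <- dat$D1 + rnorm(n)
dat$GC2 <- rnorm(n)

dvs <- c("GC1", "GC2")
ivs <- c("D1", "D2", "D3")

# Every non-empty subset of the IVs
xcomb <- unlist(lapply(seq_along(ivs),
                       function(i) combn(ivs, i, simplify = FALSE)),
                recursive = FALSE)

# One formula per (DV, IV subset) pair
forms <- unlist(lapply(dvs, function(dv)
  lapply(xcomb, function(x) reformulate(x, response = dv))),
  recursive = FALSE)

# Fit one model and return its summary statistics as a one-row data frame
fit_stats <- function(f) {
  s  <- summary(lm(f, data = dat))
  fs <- s$fstatistic
  data.frame(model    = paste(deparse(f), collapse = ""),
             pF       = unname(pf(fs[1], fs[2], fs[3], lower.tail = FALSE)),
             r.sq     = s$r.squared,
             adj.r.sq = s$adj.r.squared,
             MSEp     = s$sigma^2)   # residual mean square
}

model.eval <- do.call(rbind, lapply(forms, fit_stats))
```

Each model is fitted once here rather than once per statistic, which should matter more as the number of DVs grows.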
I would like to normalize the data this way:
(trainData - mean(trainData)) / sd(trainData)
(testData - mean(trainData)) / sd(trainData)
For the Train set I can use the function scale(). How can I do the same for the test set? I tried the lapply() function in different ways, but I did not succeed.
Many thanks! An example of code:
Train <- data.frame(matrix(c(1:100),10,10))
Test <- data.frame(matrix(sample(1:100),10,10))
scaled.Train <- scale(Train)
ct <- ncol(Test)
rt <- nrow(Test)
ncol(Train)
sdmatrix <- data.frame(matrix(,rt,ct))
for (i in 1:ct){
sdmatrix[1,i] <- mean(Train[,i])
sdmatrix[2,i] <- sd(Train[,i])
}
Test <- rbind(Test, sdmatrix)
normTest <- function(x){
a <- x[rt-1]
b <- x[rt]
x <- (x-a)/b
}
Test <- lapply(Test[1:(rt-2),],normTest)
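One possible approach, sketched with the same Train and Test objects: scale() stores the column means and standard deviations it used as attributes on its result, and its center and scale arguments also accept numeric vectors, so the training statistics can be reused for the test set. The names train.mean, train.sd and scaled.Test are introduced here for illustration.

```r
set.seed(1)
Train <- data.frame(matrix(1:100, 10, 10))
Test  <- data.frame(matrix(sample(1:100), 10, 10))

scaled.Train <- scale(Train)

# Column means and standard deviations that scale() computed on Train
train.mean <- attr(scaled.Train, "scaled:center")
train.sd   <- attr(scaled.Train, "scaled:scale")

# Apply the training statistics, not the test set's own, to Test
scaled.Test <- scale(Test, center = train.mean, scale = train.sd)
```

The result is a matrix; wrap it in as.data.frame() if a data frame is needed.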
I would like to code a cross-validation loop that computes the MSE for one- to four-step forecasts and stores the results in a matrix. The problem is that the columns for the 1- to 3-step forecasts get overwritten, so every column contains just the 4-step forecast. Can anybody help?
k<-20
n<-length(xy)-1
h<-4
start <- tsp(xy) [1]+k
j <- n-k
mseQ1 <- matrix(NA,j,h)
colnames(mseQ1) <- paste0('h=',1:h)
for(i in 1:j)
{
xtrain <- window(xy, end=start+(i-1))
xvalid <- window(xy, start=start+i, end=start+i)
qualifiedETS <- ets(xtrain, alpha=NULL, beta=NULL, additive.only=TRUE, opt.crit="mse")
fcastHW <- forecast(qualifiedETS, h=h)
mseQ1[i,] <- ((fcastHW[['mean']]-xvalid)^2)
}
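A likely cause, assuming xy is a ts object: xvalid holds a single observation, and arithmetic between two ts objects keeps only the time points they share, so one squared error is recycled across the whole row. The sketch below shows a rolling-origin loop that validates against h observations and works on plain vectors to avoid that alignment. The series xy is simulated, and HoltWinters() from base R stands in for ets() only so that the example needs no extra package.

```r
set.seed(1)
xy <- ts(cumsum(rnorm(60)))   # hypothetical example series

k <- 20                       # minimum training length
h <- 4                        # longest forecast horizon
n <- length(xy)
origins <- k:(n - h)          # the last origin still leaves h points to validate

sq_err <- matrix(NA_real_, length(origins), h,
                 dimnames = list(NULL, paste0("h=", 1:h)))

for (r in seq_along(origins)) {
  end_train <- origins[r]
  xtrain <- ts(xy[1:end_train])
  # h future values, one per forecast step
  xvalid <- xy[(end_train + 1):(end_train + h)]
  # Simple exponential smoothing as a stand-in forecaster
  fit <- HoltWinters(xtrain, gamma = FALSE, beta = FALSE)
  # Plain numeric vector, so no time-index alignment happens in the subtraction
  fc <- as.numeric(predict(fit, n.ahead = h))
  sq_err[r, ] <- (fc - xvalid)^2
}

# MSE per horizon, averaged over all forecast origins
mse_by_horizon <- colMeans(sq_err)
```

With ets() and forecast(), the same structure should apply by taking as.numeric() of the forecast's mean component.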
Is there any way to obtain the names of functions from a list? For example, is it possible to do something like this?
sqrt <- function(x){
squared <- (x*x)
}
divide <- function(x){
half <- (x/2)
}
funclist <- list(sqrt, divide)
for(i in seq_along(funclist)){
printnameoffunction(funclist[[i]])
}
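A function stored in a list does not carry the name it was defined under, so one common workaround is to keep the names on the list itself. This is only a sketch: square, halve and the helper named_list are hypothetical names chosen so that the base sqrt() is not masked.

```r
square <- function(x) x * x
halve  <- function(x) x / 2

# Option 1: name the elements when building the list
funclist <- list(square = square, halve = halve)
for (nm in names(funclist)) {
  cat(nm, "applied to 4 gives", funclist[[nm]](4), "\n")
}

# Option 2: a helper that captures the argument expressions as names
named_list <- function(...) {
  nms <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  setNames(list(...), nms)
}
funclist2 <- named_list(square, halve)
fn_names  <- names(funclist2)
```

Option 2 only makes sense when the arguments are plain variable names, since it deparses whatever expression was passed.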
In a fairly large data frame, I have to pick rows at random and apply a function to them. In my example, the first function I use is the variance; the second, called f below, is close to the real one in my script. I will not detail the purpose of f, but it deals with the truncated Gaussian distribution and maximum-likelihood estimation.
My problem is that my code is far too slow with the second function, and I suppose that optimizing the for loop or the call to sample could help.
Here is the code:
df <- as.data.frame(matrix(0,2e+6,2))
df$V1 <- runif(nrow(df),0,1)
df$V2 <- sample(c(1:10),nrow(df), replace=TRUE)
nb.perm <- 100 # number of permutations
res <- c()
for(i in 1:nb.perm) res <- rbind(res,tapply(df[sample(1:nrow(df)),"V1"],df$V2,var))
library(truncnorm)
f <- function(d) # d is a vector
{
f2 <- function(x) -sum(log(dtruncnorm(d, a=0, b=1, mean = x[1], sd = x[2])))
res <- optim(par=c(mean(d),sd(d)),fn=f2)
if(res$convergence!=0) warning("Optimization has not converged")
return(list(res1=res$par[1],res2=res$par[2]^2))
}
res2 <- c()
for(i in 1:nb.perm) res2 <- rbind(res2,tapply(df[sample(1:nrow(df)),"V1"],df$V2,function(x) f(x)$res2))
I hope I am clear enough.
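As a starting point, here is a sketch that removes the overhead of the loop itself: the result matrix is preallocated instead of grown with rbind, the group positions are computed once, and the values are permuted directly rather than through data-frame row indexing. The helper perm_stat is a made-up name, and the data are much smaller than in the question. For f, the optim() calls probably dominate the run time, so this is unlikely to be the whole answer.

```r
set.seed(123)
n <- 1e4                      # far fewer rows than the 2e6 in the question
df <- data.frame(V1 = runif(n), V2 = sample(1:10, n, replace = TRUE))
nb.perm <- 5                  # number of permutations

perm_stat <- function(stat) {
  idx <- split(seq_len(n), df$V2)          # row positions per group, computed once
  out <- matrix(NA_real_, nb.perm, length(idx),
                dimnames = list(NULL, names(idx)))
  for (i in seq_len(nb.perm)) {
    v <- sample(df$V1)                     # permute the values directly
    out[i, ] <- vapply(idx, function(j) stat(v[j]), numeric(1))
  }
  out
}

res <- perm_stat(var)
```

The second statistic would be obtained the same way with perm_stat(function(x) f(x)$res2).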