Excluding an intercept in regsubsets (leaps package)? - r

I am running some model averaging procedures using the output from the regsubsets command from the leaps package. Once I exclude an intercept, I get an error message that I cannot make sense of:
Reordering variables and trying again:
Error in if (any(index[force.out] == -1)) stop("Can't force the same variable in and out") :
  missing value where TRUE/FALSE needed
This problem seems to occur only once my predictor matrix has more columns than the dependent variable has observations (which is one of the reasons for using leaps in the first place). See the example code below:
# Load the package --------------------------------------------------------
require(stats)
require(leaps)
# Some artificial data ----------------------------------------------------
y <- rnorm(20)
x1 <- rnorm(20*20)
dim(x1) <- c(20,20)
x2 <- rnorm(20*21)
dim(x2) <- c(20,21)
# Allow intercept ---------------------------------------------------------
summary(regsubsets(x1,y))$which
summary(regsubsets(x2,y))$which
# Without intercept -------------------------------------------------------
summary(regsubsets(x1,y,intercept=FALSE))$which
summary(regsubsets(x2,y,intercept=FALSE))$which

This usually happens when you have a linear dependency among the input variables; you should see a warning when you run it with intercept = TRUE.
Once you remove the linearly dependent column from the predictor matrix, you will be able to run regsubsets with intercept = FALSE. You will have to remove the linearly dependent column manually; it is usually a derived column, calculated from existing metrics.
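For example, a quick way to check for rank deficiency in the predictor matrix (a minimal sketch using base R on the artificial data above):
# If the rank is smaller than the number of columns, some columns are
# linear combinations of the others; x2 is rank-deficient by construction,
# since it has more columns (21) than rows (20).
qr(x2)$rank   # at most 20 here
ncol(x2)      # 21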

Related

Using a For Loop to Run Multiple Response Variables through a Train function to create multiple separate models in R

I am trying to create a for loop to index through each individual response variable I have and train a model using the train() function within the caret package. I have about 30 response variables and 43 predictor variables. I can train each model individually, but I would like to automate the process and have a for loop run through a model (I would eventually like to scale up to multiple models if possible, i.e. lm, rf, cubist, etc.). I then want to save each model to a dataframe along with R-squared and RMSE values. The individual models that I currently have, and that run for me, go as follows, with column 11 being the response variable and columns 35-68 being predictor variables.
data_Mg <- subset(data_3, !is.na(Mg))
mg.lm <- train(Mg~., data=data_Mg[,c(11,35:68)], method="lm", trControl=control)
mg.cubist <- train(Mg~., data=data_Mg[,c(11,35:68)], method="cubist", trControl=control)
mg.rf <- train(Mg~., data=data_Mg[,c(11,35:68)], method="rf", trControl=control, na.action = na.roughfix)
max(mg.lm$results$Rsquared)
min(mg.lm$results$RMSE)
max(mg.cubist$results$Rsquared)
min(mg.cubist$results$RMSE)
max(mg.rf$results$Rsquared) #Highest R squared
min(mg.rf$results$RMSE)
This gives me 3 models with all the relevant information that I need. Now for the for loop; I've only tried the lm model so far.
bucket <- list()
for (i in 1:ncol(data_4)) { # loops over variables; needs to stop at the response variables, right now it runs through all variables
  data_y <- subset(data_4, !is.na(i)) # get rid of NAs in the "i" column
  predictors_i <- colnames(data_4)[i] # create vector of predictor names
  predictors_1.1 <- noquote(predictors_i)
  i.lm <- train(predictors_1.1 ~ ., data = data_4[, c(i, 35:68)], method = "lm", trControl = control)
  bucket <- i.lm
  # mod_summaries[[i - 1]] <- summary(lm(y ~ ., data_y[, c("i.lm", predictors_i)]))
  # data_y <- data_4
}
Below is the error that I am getting, with Bulk_Densi being the first variable in predictors_1.1. The error is that variable lengths differ, so I originally thought that my issue was that quotes were being added around "Bulk_Densi", but after trying the noquote() function I have not gotten anywhere, so I am unsure of where I am going wrong.
Please let me know if I can provide any extra info, and thanks in advance for the help! I've already tried the info in How to train several models within a loop for and was struggling with that as well.
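For reference, a sketch of one way the loop could be written (an assumption on my part, not a tested answer: train() wants an actual formula object, so build it from the column name with as.formula() rather than relying on noquote()):
# Assumes data_4, control, and the column layout from the question:
# responses in the first 30 columns, predictors in columns 35:68.
library(caret)

bucket <- list()
for (i in 1:30) {
  resp <- colnames(data_4)[i]
  df_i <- data_4[!is.na(data_4[[resp]]), c(i, 35:68)] # drop NAs in response i
  f <- as.formula(paste(resp, "~ ."))                 # e.g. Bulk_Densi ~ .
  bucket[[resp]] <- train(f, data = df_i, method = "lm", trControl = control)
}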

Save customized function inside function in MLFlow log_model

I would like to do something with MLflow, but I cannot find any solution on the Internet. I am working with MLflow and R, and I want to save a regression model. The thing is that when I predict the test data, I first want to apply some transformations to it. Then I have:
data <- #some data with numeric regressors and dependent variable called 'y'
# Divide into train and test
ind <- sample(nrow(data), 0.8*nrow(data), replace = FALSE)
dataTrain <- data[ind,]
dataTest <- data[-ind,]
# Run model in the mlflow framework
with(mlflow_start_run(), {
  model <- lm(y ~ ., data = dataTrain)
  predict_fun <- function(model, data_to_predict){
    data_to_predict[,3] <- data_to_predict[,3]/2
    data_to_predict[,4] <- data_to_predict[,4] + 1
    return(predict(model, data_to_predict))
  }
  predictor <- crate(~predict_fun(model, dataTest), model)
  ### Some code to use the predictor to get the predictions and measure the accuracy as a log_metric
  ##################
  mlflow_log_model(predictor, 'model')
})
As you can see, my prediction function does not only predict the new data being evaluated; it also applies some transformations to the third and fourth columns. All examples I saw on the web use R's default predict function inside the crate.
Once I save this model and run it in another notebook with some test data, I get the error: "predict_fun" doesn't exist. That is because my algorithm has not saved this specific function. Do you know what I can do to save a specific prediction function that I have created, instead of the default functions that are in R?
This is not the real example I am working with, but it is an approximation of it. The fact is that I want to save extra functions apart from the model itself.
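For illustration, this is the direction I am considering (a sketch, assuming carrier::crate() bundles any named objects passed to it, so the helper can be included explicitly alongside the model):
predictor <- carrier::crate(
  function(data_to_predict) predict_fun(model, data_to_predict),
  model = model,             # bundle the fitted model
  predict_fun = predict_fun  # bundle the custom prediction function as well
)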
Thank you very much!

trouble using foreach from doParallel with gamm4

I am trying to use foreach to make use of parallel processing for a complete subsets regression problem. I am trying to fit a complete list of models using the gamm4 package, using the binomial family where the response is provided as a proportion and the weights argument supplies the number of trials. The code works fine when run using %do% but fails under %dopar% (it returns only NAs for AIC and BIC). Strangely, the code does work fine under %dopar% if the weights argument to the gamm4 call is left out, but obviously this is not a viable solution. I have been using similar code with no issues at all based on a gaussian distribution, and on a binomial distribution where the response is entered as 1s and 0s (thus with no need for a call to weights). I am using Windows 7 64-bit, with R version 3.1.2, and I have updated all the relevant packages. A reproducible (but toy) example:
set.seed(666)
# generate a random factor with a random offset effect
random.factor=factor(sort(rep(1:10,10)))
random.effect=sort(rep(rnorm(10),10))
# generate some random predictor variables
X1 = rnorm(100)
X2 = rnorm(100)
X3 = rnorm(100)
X4 = rep(0,100) # make it so one variable fails (just to check the "try" if statement)
#X4 = rnorm(100)
X5 = rnorm(100)
# calculate a response variable based on some of the predictors
z = 1 + 2*X1 + 3*X2 + 2*X3^2 # linear combination with a bias
pr = 1/(1+exp(-(z+random.effect))) # pass through an inv-logit function
y = rbinom(n=100,size=100,pr)/100 # binomial response, expressed as a proportion
# Note that the response variable is a proportion of successes out of 100 trials
# We want to feed the number of trials as a "weights" argument to gamm
# now make a data frame of predictors
pred.dat=data.frame(X1=X1,X2=X2,X3=X3,X4=X4,X5=X5)
pred.vars=colnames(pred.dat)
# make a dataframe for passing to gamm
use.dat = data.frame(random.factor=random.factor,y=y,pred.dat)
# now set up the models to run
# this includes all combinations of variables, but only up to a total of two in
# any one model
model.fits.test = c(combn(1:ncol(pred.dat), 1, simplify = F),
                    combn(1:ncol(pred.dat), 2, simplify = F))
models.use=list(1,2,3,4,5)
n.models=length(model.fits.test)
require(lme4)
require(doParallel)
registerDoParallel(cores=4)
# if I run this using do, it works fine (with error values from the try argument
# returned for models that fail)
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4", "gamm4")) %do% {
  vars.vec <- model.fits.test[[l]]
  formula.l <- as.formula(paste("y~",
    paste(colnames(pred.dat)[vars.vec], collapse = "+"),
    "+(1|random.factor)", sep = ""))
  model.fit <- try(glmer(formula.l,
                         data = use.dat,
                         family = "binomial",
                         weights = rep(100, nrow(use.dat))))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if (success) {
    out.vec["AIC"] <- AIC(model.fit)
    out.vec["BIC"] <- BIC(model.fit)
  }
  return(out.vec)
}
out.dat
# if I run using dopar, nothing is returned.
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4", "gamm4")) %dopar% {
  vars.vec <- model.fits.test[[l]]
  formula.l <- as.formula(paste("y~",
    paste(colnames(pred.dat)[vars.vec], collapse = "+"),
    "+(1|random.factor)", sep = ""))
  model.fit <- try(glmer(formula.l,
                         data = use.dat,
                         family = "binomial",
                         weights = rep(100, nrow(use.dat))))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if (success) {
    out.vec["AIC"] <- AIC(model.fit)
    out.vec["BIC"] <- BIC(model.fit)
  }
  return(out.vec)
}
out.dat
# Now run dopar without the weights argument (not really appropriate,
# but for the sake of demonstration). I get results again, but it doesn't
# really make sense to do this. Also, my real example fails unless I can supply
# weights.
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4", "gamm4")) %dopar% {
  vars.vec <- model.fits.test[[l]]
  formula.l <- as.formula(paste("y~1+",
    paste("s(", colnames(pred.dat)[vars.vec], ")", collapse = "+"), sep = ""))
  model.fit <- try(gamm4(formula.l, random = ~(1|random.factor),
                         data = use.dat, family = "binomial"))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if (success) {
    out.vec["AIC"] <- AIC(model.fit$mer)
    out.vec["BIC"] <- BIC(model.fit$mer)
  }
  return(out.vec)
}
out.dat
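One workaround I have considered (a sketch, untested): since the weights expression is evaluated when the model frame is built on each worker, storing the number of trials as a column of use.dat and referring to it by name keeps everything glmer needs inside the exported data:
# n.trials travels to the workers together with use.dat, instead of being
# built in the calling environment.
use.dat$n.trials <- 100
model.fit <- try(glmer(formula.l,
                       data = use.dat,
                       family = "binomial",
                       weights = n.trials))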

R: How to make column of predictions for logistic regression model?

So I have a data set called x. The contents are simple enough to just write out so I'll just outline it here:
the dependent variable, Report, in the first column is binary yes/no (0 = no, 1 = yes)
the subsequent 3 columns are all categorical variables (race.f, sex.f, gender.f) that have all been converted to factors, and they're designated by numbers (e.g. 1= white, 2 = black, etc.)
I have run a logistic regression on x as follows:
glm <- glm(Report ~ race.f + sex.f + gender.f, data = x,
           family = binomial(link = "logit"))
And I can check the fitted probabilities by looking at summary(glm$fitted).
My question: How do I create a fifth column on the right side of this data set x that will include the predictions (i.e. fitted probabilities) for Report? Of course, I could just insert the glm$fitted as a column, but I'd like to try to write a code that predicts it based on whatever is in the race, sex, gender columns for a more generalized use.
Right now I have the following code, which I hope will create a predicted column as well as lower and upper bounds for the confidence interval.
xnew <- cbind(xnew, predict(glm5, newdata = xnew, type = "link", se = TRUE))
xnew <- within(xnew, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})
Unfortunately I get the error:
Error in eval(expr, envir, enclos) : object 'race.f' not found
after the cbind code.
Anyone have any idea?
There appear to be a few typos in your code. First, xnew calls on glm5, but your model, as far as I can see, is glm (by the way, using glm as the name of your output is probably not a good idea). Secondly, make sure the variable race.f is actually in the dataset you wish to predict from. My guess is that R can't find that variable, hence the error.
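For illustration, a minimal sketch of the corrected pattern (column names assumed from the question; the model is fitted under a non-conflicting name and the prediction data contains the same factor columns):
# Fit under a name that does not shadow stats::glm
mod <- glm(Report ~ race.f + sex.f + gender.f, data = x,
           family = binomial(link = "logit"))

xnew <- x # new data must contain race.f, sex.f and gender.f; here we reuse x
xnew <- cbind(xnew, predict(mod, newdata = xnew, type = "link", se.fit = TRUE))
xnew <- within(xnew, {
  PredictedProb <- plogis(fit)         # fitted probability
  LL <- plogis(fit - (1.96 * se.fit))  # lower 95% bound
  UL <- plogis(fit + (1.96 * se.fit))  # upper 95% bound
})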

use stepAIC on a list of models

I want to do stepwise regression using AIC on a list of linear models. The idea is to create a list of linear models and then apply stepAIC to each list element. It fails.
I tried to track the problem down, and I think I found it. However, I don't understand the cause. Try the code below to see the difference between the three cases:
require(MASS)
n<-30
x1<-rnorm(n, mean=0, sd=1) #create rv x1
x2<-rnorm(n, mean=1, sd=1)
x3<-rnorm(n, mean=2, sd=1)
epsilon<-rnorm(n,mean=0,sd=1) # random error variable
dat<-as.data.frame(cbind(x1,x2,x3,epsilon)) # combine to a data frame
dat$id<-c(rep(1,10),rep(2,10),rep(3,10))
# y is a combination of all three x variables plus random error
dat$y<-x1+x2+x3+epsilon
# apply lm() only resulting in a list of models
dat.lin.model.lst<-lapply(split(dat,dat$id),function(d) lm(y~x1+x2+x3,data=d))
stepAIC(dat.lin.model.lst[[1]]) # FAIL!!!
# apply function stepAIC(lm())- works
dat.lin.model.stepAIC.lst<-lapply(split(dat,dat$id),function(d) stepAIC(lm(y~x1+x2+x3,data=d)))
# create model for particular group with id==1
k<-which(dat$id==1) # manually select records with id==1
lin.model.id1<-lm(dat$y[k]~dat$x1[k]+dat$x2[k]+dat$x3[k])
stepAIC(lin.model.id1) # check stepAIC - works!
I am pretty sure that stepAIC() needs the original data from the data frame "dat". That is what I was thinking before. (I hope I am right on that.)
But there is no parameter in stepAIC() through which I can pass the original data frame. Obviously, for plain models not wrapped in a list, it is enough to pass the model (see the last three lines of the code). So I am wondering:
Q1: How does stepAIC know where to find the original data "dat" (and not only the model, which is passed as a parameter)?
Q2: How can I know that there is another parameter in stepAIC() which is not explicitly stated in the help pages? (Maybe my English is just too bad to find it.)
Q3: How can I pass that parameter to stepAIC()?
It must be something in the environment of the apply function and how the data is passed on. Somewhere between lm() and stepAIC() the pointer/link to the raw data must get lost. I do not have a good understanding of what an environment in R does; to me it was a kind of isolation of local from global variables, but maybe it is more complicated. Can anyone explain this to me in regard to the problem above? Honestly, I don't get much out of the R documentation, and any better understanding would help me.
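To illustrate what I mean, here is what the model object actually seems to store (a small sketch using the objects from the code above): only the unevaluated call and an environment, not the data itself.
m <- lm(y ~ x1 + x2 + x3, data = dat)
m$call                    # lm(formula = y ~ x1 + x2 + x3, data = dat)
environment(formula(m))   # where `dat` must be found when the call is re-evaluated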
OLD:
I have data in a dataframe df that can be split into several subgroups. For that purpose I created a group ID called df$id. lm() returns the coefficients as expected for the first subgroup. I want to do a stepwise regression using AIC as the criterion for each subgroup separately. I use lmList {lme4}, which results in a model for each subgroup (id). But if I use stepAIC {MASS} on the list elements, it throws an error; see below.
So the question is: what mistake is in my procedure/syntax? I get results for single models but not for the ones created with lmList. Does lmList() store different information about the model than lm() does?
But in the help it states:
class "lmList": A list of objects of class lm with a common model.
>lme4.list.lm<-lmList(formula=Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px |df$id,data = df)
>lme4.list.lm[[1]]
Call: lm(formula = formula, data = data)
Coefficients:
(Intercept) Gap.um Standoff.um Voidflaeche.px
62.306133 -0.009878 0.026317 -0.015048
>stepAIC(lme4.list.lm[[1]], direction="backward")
#stepAIC on first element on the list of linear models
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Error in terms.formula(formula, data = data) :
'data' argument is of the wrong type
Obviously something does not work with the list, but I have no idea what it might be. So I tried to do the same with the base package, which creates the same model (at least the same coefficients). Results are below:
>lin.model<-lm(Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px,df[which(df$id==1),])
# id is in order, so should be the same subgroup as for the first list element in lmList
Coefficients:
(Intercept) Gap.um Standoff.um Voidflaeche.px
62.306133 -0.009878 0.026317 -0.015048
Well, this is what I get returned when using stepAIC on my linear model.
As far as I know, the Akaike information criterion can be used to estimate which model better balances fit and generalization given some data.
>stepAIC(lin.model,direction="backward")
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Step: AIC=293.14
Scherkraft.N ~ Gap.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Gap.um 1 28.51 7215.8 291.38
<none> 7187.3 293.14
- Voidflaeche.px 1 717.63 7904.9 296.85
Step: AIC=291.38
Scherkraft.N ~ Voidflaeche.px
Df Sum of Sq RSS AIC
<none> 7215.8 291.38
- Voidflaeche.px 1 795.46 8011.2 295.65
Call: lm(formula = Scherkraft.N ~ Voidflaeche.px, data = df[which(df$id == 1), ])
Coefficients:
(Intercept) Voidflaeche.px
71.7183 -0.0151
I read from the output that I should use the model Scherkraft.N ~ Voidflaeche.px, because it has the minimal AIC. It would be nice if someone could briefly describe the output. My understanding of stepwise regression (assuming backward elimination) is that all regressors are included in the initial model, then the least important one is eliminated, with the AIC as the decision criterion, and so forth. Somehow I have problems interpreting the tables correctly, so it would be nice if someone could confirm my interpretation: the "-" (minus) stands for the eliminated regressor; on top is the "start" model; and in the table below it, the RSS and AIC are calculated for the possible eliminations. So the first row of the first table says that the model Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px without Standoff.um would result in an AIC of 293.14, so we choose the one without Standoff.um: Scherkraft.N ~ Gap.um + Voidflaeche.px.
EDIT:
I replaced lmList{lme4} with dlply() to create the list of models.
stepAIC is still not coping with the list; it throws another error. Actually, I believe it is a problem with the data stepAIC needs to run through. I was wondering how it calculates the AIC value for each step from just the model data. I would take the original data to construct the models, leaving one regressor out each time, and from that calculate the AIC values and compare them. So how does stepAIC work if it has no access to the original data? (I can't see a parameter through which I could pass the original data to stepAIC.) Still, I have no clue why it works with a plain model but not with the model wrapped in a list.
>model.list.all <- dlply(df, .id, function(x)
{return(lm(Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px,data=x)) })
>stepAIC(model.list.all[[1]])
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Error in is.data.frame(data) : object 'x' not found
I'm not sure what may have changed in the versioning to make the debugging so difficult, but one solution would be to use do.call, which evaluates the expressions in the call before executing it. This means that instead of storing just d in the call, so that update and stepAIC need to go find d in order to do their work, it stores a full representation of the data frame itself.
That is, do
do.call("lm", list(y~x1+x2+x3, data=d))
instead of
lm(y~x1+x2+x3, data=d)
You can see what it's trying to do by looking at the call element of the model, perhaps like this:
dat.lin.model.lst <- lapply(split(dat, dat$id), function(d)
  do.call("lm", list(y~x1+x2+x3, data=d)) )
dat.lin.model.lst[[1]]$call
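With the data frame embedded in the stored call, stepAIC should then be able to re-fit the model directly, for example:
stepAIC(dat.lin.model.lst[[1]], direction = "backward")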
It's also possible to make your list of data frames in the global environment and then construct the call so that update and stepAIC look for each data frame in turn, because their environment chains always lead back to the global environment; like this:
dats <- split(dat, dat$id)
dat.lin.model.lst <- lapply(seq_along(dats), function(i)
  do.call("lm", list(y~x1+x2+x3, data=call("[[", quote(dats), i))) )
To see what's changed, run dat.lin.model.lst[[1]]$call again.
As it seems that stepAIC goes outside the loop environment (that is, into the global environment) to look for the data it needs, I trick it using the assign function:
results <- do.call(rbind, lapply(response, function (i) {
  assign("i", i, envir = .GlobalEnv) # make the current response visible globally
  mdl <- gls(as.formula(paste0(i, "~", paste(expvar, collapse = "+"))),
             data = parevt,
             correlation = corARMA(p=1, q=1, form = ~as.integer(Year)),
             weights = varIdent(~1/Linf_var),
             method = "ML")
  mdl <- stepAIC(mdl, direction = "backward")
}))
