How to add specific conditions to stepAIC - r

I am running a regression with 37 variables and am using stepAIC to perform model selection. I do NOT want a predictive model; I just want to find out which variables have the best explanatory power.
My current code looks like:
fitObject <- lm(mydata)
DEP.select <- stepAIC(fitObject, direction = 'both', scope= list(lower = ~AUC), trace = F, k = log(obs))
# DEP is my dependent variable, and AUC is an independent variable I want to keep in the model.
The problem is that many of my variables are highly correlated, and the model stepAIC returns contains several of those highly correlated variables. Since I have forced AUC into the model, multicollinearity is an especially serious problem when variables highly correlated with AUC are selected.
Is there a way to specify in the function some thresholds for correlation or p-value of the coefficients?
Or any comments on other approaches that can solve my problem are welcome.
Thank you!

Perhaps the Variance Inflation Factor will work better for you. This article explains some of the logic: http://en.wikipedia.org/wiki/Variance_inflation_factor
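As a quick diagnostic before any automated selection, you could also inspect the VIFs of the model you already fitted. A minimal sketch, assuming fitObject from the question (note this uses car::vif, which is distinct from the VIF package's vif() used below):
library(car)
# VIFs of the fitted coefficients; common rules of thumb flag values above ~5-10
vif(fitObject)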
Example use:
v <- ezvif(df, yvar = 'columnNameOfWhichYouAreTryingToPredict')
Here is the function I wrote that combines VIF::vif with cross-validation.
require(VIF)
require(cvTools)
# Returns selected variables using VIF::vif combined with k-fold cross-validation:
# each fold votes for the variables vif() selects, and the top vote-getters win.
ezvif <- function(df, yvar, folds = 5, trace = FALSE) {
  f <- cvFolds(nrow(df), K = folds)
  findings <- list()
  for (v in names(df)) {
    if (v == yvar) next
    findings[[v]] <- 0
  }
  for (i in 1:folds) {
    rows <- f$subsets[f$which != i]
    y <- df[rows, yvar]
    xdf <- df[rows, names(df) != yvar]  # remove the output variable
    vifResult <- vif(y, xdf, trace = trace, subsize = min(200, nrow(xdf)))
    for (v in names(xdf)[vifResult$select]) {
      findings[[v]] <- findings[[v]] + 1  # vote
    }
  }
  findings <- sort(unlist(findings), decreasing = TRUE)
  if (trace) print(findings[findings > 0])
  return(c(yvar, names(findings[findings == findings[1]])))
}

I would recommend removing the variables with high correlations. The caret and corrplot packages can help:
library(corrplot)
library(caret)
dm <- data.matrix(mydata[, names(mydata) != 'DEP'])  # without your outcome variable
Visualize the correlations, clustering highly correlated variables together:
corrplot(cor(dm), order = 'hclust')
Then find the indices of variables that you could remove due to high (> 0.75) correlations:
findCorrelation(cor(dm), cutoff = 0.75)
Removing these variables can improve your model. After removing them, continue with stepAIC as described in your question.
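Putting the two steps together, a minimal sketch assuming mydata, DEP, and AUC as in the question (AUC is excluded from the drop list so it can still be forced into the model):
library(caret)
library(MASS)
dm <- data.matrix(mydata[, names(mydata) != 'DEP'])
drop_cols <- colnames(dm)[findCorrelation(cor(dm), cutoff = 0.75)]
drop_cols <- setdiff(drop_cols, 'AUC')  # protect the forced-in variable
reduced <- mydata[, !(names(mydata) %in% drop_cols)]
fitObject <- lm(DEP ~ ., data = reduced)
stepAIC(fitObject, direction = 'both', scope = list(lower = ~AUC),
        trace = FALSE, k = log(nrow(reduced)))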

To assess multicollinearity between predictors when running the dredge function (MuMIn package), include the following max.r function as the "extra" argument:
max.r <- function(x){
  # Maximum absolute correlation among the fixed-effect estimates,
  # computed from the covariance matrix of the model coefficients.
  corm <- cov2cor(vcov(x))
  corm <- as.matrix(corm)
  if (length(corm) == 1){
    # intercept-only model: no predictor correlations to report
    corm <- 0
    max(abs(corm))
  } else if (length(corm) == 4){
    # single-predictor model: the only off-diagonal entry involves the
    # intercept, so the maximum predictor correlation is set to 0
    cormf <- corm[2:nrow(corm), 2:ncol(corm)]
    cormf <- 0
    max(abs(cormf))
  } else {
    # drop the intercept row/column and ignore the diagonal
    cormf <- corm[2:nrow(corm), 2:ncol(corm)]
    diag(cormf) <- 0
    max(abs(cormf))
  }
}
then simply run dredge specifying the number of predictor variables and including the max.r function:
options(na.action = na.fail)
Allmodels <- dredge(Fullmodel, rank = "AIC", m.lim=c(0, 3), extra= max.r)
Allmodels[Allmodels$max.r<=0.6, ] ##Subset models with max.r <=0.6 (not collinear)
NCM <- get.models(Allmodels, subset = max.r<=0.6) ##Retrieve models with max.r <=0.6 (not collinear)
model.sel(NCM) ##Final model selection table
This works for lme4 models. For nlme models see: https://github.com/rojaff/dredge_mc
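For context, a hypothetical Fullmodel that this workflow expects might look like the sketch below; the variable and data names are placeholders, not from the original answer:
library(lme4)
library(MuMIn)
# Placeholder model: 'resp', 'x1'..'x3', 'group', and 'mydata' are illustrative names
Fullmodel <- lmer(resp ~ x1 + x2 + x3 + (1 | group), data = mydata, REML = FALSE)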

Related

for loop in train(caret) to select different predictors in a lm model

I'm just a beginner, so I hope you can help with a problem with a KNN model (via train in caret) in R.
I tried this:
models.list <- as.list(vector(length = ncol(mtcars)))
for (i in 1:ncol(mtcars)) {
  models.list[[i]] <- train(x = mtcars[, i], y = mtcars[, 1], method = "lm")
}
This causes the error "Please use column names for x". Do you know how I can keep the column names in a for loop? My goal is to fit an lm regression with different predictors.
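One common way around this error (a hedged sketch, not from the original thread): subset with drop = FALSE so x stays a one-column data frame and train() still sees a column name:
library(caret)
models.list <- vector("list", ncol(mtcars) - 1)
for (i in 2:ncol(mtcars)) {
  # drop = FALSE keeps the single column as a named data frame
  # (columns 2:ncol are the candidate predictors; column 1 is the response)
  models.list[[i - 1]] <- train(x = mtcars[, i, drop = FALSE],
                                y = mtcars[, 1], method = "lm")
}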

R is only returning non-zero coefficient estimates when using the "poly" function to generate predictors. How do I get the zero values into a vector?

I'm using regsubsets from the leaps library to perform best subset selection. I need to compare the coefficients it generates to the "true" coefficients I specified when simulating the data (by comparison I mean the square root of the sum of their squared differences), for each number of predictors.
Since regsubsets generated 16 different models, I use a loop to do this automatically. It would work, except that when I extract the coefficients from the best model fit with x predictors, it only gives me the non-zero coefficients of the polynomial fit. This changes the size of the coefi vector, making it shorter than the truecoef vector of true coefficients.
If I could somehow force all coefficients to be spat out from the model, I wouldn't have an issue. But after looking extensively, I don't know how to do that.
Alternative ways of solving this problem would also be appreciated.
library(leaps)
regfit.train <- regsubsets(y ~ poly(x, 25, raw = TRUE), data = mydata[train, ], nvmax = 25)
truecoef <- c(3,0,-7,4,-2,8,0,-5,0,2,0,4,5,6,3,2,2,0,3,1,1)
coef.errors <- rep(NA, 16)
for (i in 1:16) {
  coefi <- coef(regfit.train, id = i)
  coef.errors[i] <- mean((truecoef - coefi)^2)  # fails: the vectors differ in length
}
The equation I'm trying to estimate, where $j$ indexes the coefficients and $r$ refers to the best model containing $r$ coefficients, is $\sqrt{\sum_j \left(\beta_j - \hat{\beta}_j^{(r)}\right)^2}$.
Thanks!
This is how I ended up solving it (with some help):
The loop indexes which coefficients are available and performs the subtraction; for those unavailable, it assumes the estimate is zero.
truecoef <- c(3,0,-7,4,-2,8,0,-5,0,2,0,4,5,6,3,2,2,0,3,1,1)
val.errors <- rep(NA, 16)
x_cols <- colnames(x, do.NULL = FALSE, prefix = "x.")
for (i in 1:16) {
  coefis <- coef(regfit.train, id = i)
  # compare the coefficients that were estimated, and treat the missing ones
  # as estimated at zero (their squared error is then the true value squared)
  val.errors[i] <- sqrt(sum((truecoef[x_cols %in% names(coefis)] -
                             coefis[names(coefis) %in% x_cols])^2) +
                        sum(truecoef[!(x_cols %in% names(coefis))]^2))
}
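An alternative sketch (my own suggestion, under the same assumption that x_cols aligns with truecoef): inside the same loop, build a full-length estimate vector of zeros and copy in the fitted coefficients by name, so the comparison becomes a plain vectorized subtraction:
est <- setNames(rep(0, length(truecoef)), x_cols)  # all terms start at zero
coefis <- coef(regfit.train, id = i)
common <- intersect(names(coefis), x_cols)
est[common] <- coefis[common]
val.errors[i] <- sqrt(sum((truecoef - est)^2))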

How to do stepwise regression in r for more independent variables and less observations?

I'm trying to do stepwise regression for following data:
y <- c(1.2748, 1.2574, 1.5571, 1.4178, 0.8491, 1.3606, 1.4747, 1.3177, 1.2896, 0.8453)
x <- data.frame(A = c(2,3,4,5,6,2,3,4,5,6),
B = c(2,4,1,3,5,1,3,5,2,4)*100,
C = c(9,5,11,5,11,7,13,7,13,9),
D = c(6,5,3,7,6,4,3,7,5,4),
E = c(1,1,0.8,0.8,0.6,0.6,0.4,0.4,0.2,0.2))
x$A2 <- x$A^2
x$B2 <- x$B^2
x$C2 <- x$C^2
x$D2 <- x$D^2
x$E2 <- x$E^2
x$AB <- x$A*x$B
As we can see, there are 10 observations and 11 independent variables, so I can't fit an ordinary linear regression model (there are more predictors than observations). In fact, only a few factors are useful, so in this case I need to use stepwise regression with "forward" selection to add independent variables to my formula. But the stats::step function cannot be used here. I wonder if there is a method to do it. I know there is a package called "StepReg", but I don't fully understand how to use it or how to read the results. Thank you!
I just ran stepwise regression on the data you provided using the R package StepReg.
Here is the code; I hope this can help you.
library(StepReg)
df <- data.frame(y, x)
# forward selection with the AIC information criterion; other criteria are available
stepwise(df, y = "y", exclude = NULL, include = NULL, Class = NULL,
         selection = "forward", select = "AIC")
# forward selection by significance level, with significance level for entry = 0.15
stepwise(df, y = "y", exclude = NULL, include = NULL, Class = NULL,
         selection = "forward", select = "SL", sle = 0.15)
You can use the olsrr package, which gives results similar to those of SPSS. Here is the solution:
library(olsrr)
df <- data.frame(y, x)
model <- lm(y ~ ., data = df)
smlr <- ols_step_both_p(model, pent = 0.05, prem = 0.1)
# pent: p-value for entry; variables with p-values below pent enter the model.
# prem: p-value for removal; variables with p-values above prem are removed from the model.
You can get the model details by calling
smlr
smlr$model
smlr$beta_pval #regression coefficients with p-values
I have kept the values of pent and prem the same as the defaults used by SPSS.

Strange glmulti results: Why are interaction variables from the candidate model dropped/not included?

I have been using glmulti to obtain model averaged estimates and relative importance values for my variables of interest. In running glmulti I specified a candidate model for which all variables and interactions were included based on a priori knowledge (see code below).
After running the glmulti model I studied the results using the functions summary() and weightable(). There seem to be a number of strange things going on with the results which I do not understand.
First of all, when I run my candidate model with the lme4 glmer() function I obtain an AIC value of 2086. In the glmulti output this candidate model (with exactly the same formula) has a higher AIC value (2107), as a result of which it appears at position 8 out of 26 in the list of all potential models (as obtained through the weightable() function).
What seems to be causing this problem is that the logArea:Habitat interaction is dropped from the candidate model, despite level=2 being specified. The function summary(output_new@objects[[8]]) provides a different formula (without the logArea:Habitat interaction) compared to the formula provided through weightable(). This explains why the candidate model's AIC value is not the same as that obtained through lme4, but I do not understand why the logArea:Habitat interaction is missing from the formula. The same is happening for other possible models: it seems that for all models with 2 or more interactions, one interaction is dropped.
Does anyone have an explanation for what is going on? Any help would be much appreciated!
Best,
Robert
Note: I have created a subset of my data (https://drive.google.com/open?id=1rc0Gkp7TPdnhW6Bw87FskL5SSNp21qxl) and simplified the candidate model by removing variables in order to decrease model run time. (The problem remains the same)
newdat <- Data_ommited2[, c("Presabs", "logBodymass", "logIsolation", "Matrix", "logArea", "Protection", "Migration", "Habitat", "Guild", "Study", "Species", "SpeciesStudy")]
glmer.glmulti <- function(formula, data, random, ...) {
  glmer(paste(deparse(formula), random), data = data,
        family = binomial(link = "logit"),
        contrasts = list(Matrix = contr.sum, Habitat = contr.treatment,
                         Protection = contr.treatment, Guild = contr.sum),
        glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)))
}
output_new <- glmulti(y = Presabs ~ Matrix + logArea*Protection + logArea*Habitat,
data = sampledata,
random = '+(1|Study)+(1|Species)+(1|SpeciesStudy)',
family = binomial,
method = 'h',
level=2,
marginality=TRUE,
crit = 'aic',
fitfunc = glmer.glmulti,
confsetsize = 26)
print(output_new)
summary(output_new)
weightable(output_new)
I found a post (https://stats.stackexchange.com/questions/341356/glmulti-package-in-r-reporting-incorrect-aicc-values) of someone who encountered the same problem and it appears that the problem was caused by this line of code:
glmer.glmulti <- function(formula, data, random, ...) {
  glmer(paste(deparse(formula), random), data = data, family = binomial(link = "logit"))
}
By changing this part of the code to the following, the problem was solved:
glmer.glmulti <- function(formula, data, random, ...) {
  # Splice the random-effects terms into the formula directly instead of going
  # through deparse()/paste(), which can split long formulas into several strings
  # and silently drop terms (likely the cause of the missing interactions).
  newf <- formula
  newf[[3]] <- substitute(f + r,
                          list(f = newf[[3]],
                               r = reformulate(random)[[2]]))
  glmer(newf, data = data,
        family = binomial(link = "logit"))
}

ROC curve in R using ROCR package

Can someone please explain to me how to plot a ROC curve with ROCR?
I know that I should first run:
prediction(predictions, labels, label.ordering = NULL)
and then:
performance(prediction.obj, measure, x.measure="cutoff", ...)
I am just not clear on what is meant by predictions and labels. I created a model with ctree and cforest, and I want the ROC curve for both of them to compare them in the end. In my case the class attribute is y_n, which I suppose should be used for the labels. But what about the predictions? Here are the steps of what I do (dataset name = bank_part):
pred<-cforest(y_n~.,bank_part)
tablebank<-table(predict(pred),bank_part$y_n)
prediction(tablebank, bank_part$y_n)
After running the last line I get this error:
Error in prediction(tablebank, bank_part$y_n) :
Number of cross-validation runs must be equal for predictions and labels.
Thanks in advance!
Here's another example: I have a training dataset (bank_training) and a testing dataset (bank_testing), and I ran a randomForest as below:
bankrf <- randomForest(y ~ ., bank_training, mtry = 4, ntree = 2,
                       keep.forest = TRUE, importance = TRUE)
bankrf.pred<-predict(bankrf, bank_testing, type='response')
Now bankrf.pred is a factor object with labels c("0", "1"). Still, I don't know how to plot the ROC, because I get stuck at the prediction part. Here's what I do:
library(ROCR)
pred <- prediction(bankrf.pred$y, bank_testing$c(0,1))
But this is still incorrect, because I get the error message
Error in bankrf.pred$y_n : $ operator is invalid for atomic vectors
The predictions are your continuous predictions of the classification; the labels are the binary truth for each observation.
So something like the following should work:
> pred <- prediction(c(0.1,.5,.3,.8,.9,.4,.9,.5), c(0,0,0,1,1,1,1,1))
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
to generate an ROC.
EDIT: It may be helpful for you to include sample reproducible code in the question (I'm having a hard time interpreting your comment).
There's no new code here, but... here's a function I use quite often for plotting an ROC:
plotROC <- function(truth, predicted, ...){
  # truth: binary labels; predicted: continuous scores
  pred <- prediction(abs(predicted), truth)
  perf <- performance(pred, "tpr", "fpr")
  plot(perf, ...)
}
Like @Jeff said, your predictions need to be continuous for ROCR's prediction function. require(randomForest); ?predict.randomForest shows that, by default, predict.randomForest returns predictions on the original scale (class labels, in classification), whereas predict.randomForest(..., type = 'prob') returns probabilities for each class. So:
require(ROCR)
require(randomForest)
data(iris)
iris$setosa <- factor(1*(iris$Species == 'setosa'))
iris.rf <- randomForest(setosa ~ ., data = iris[,-5])
summary(predict(iris.rf, iris[,-5]))
summary(iris.preds <- predict(iris.rf, iris[,-5], type = 'prob'))
preds <- iris.preds[,2]
plot(performance(prediction(preds, iris$setosa), 'tpr', 'fpr'))
gives you what you want. Different classification packages require different commands for getting predicted probabilities -- sometimes it's predict(..., type='probs'), predict(..., type='prob')[,2], etc., so just check out the help files for each function you're calling.
This is how you can do it.
Have your data in a csv file ("data_file.csv"; you may need to give the full path) with column headers, which here I will call
"default_flag", "var1", "var2", "var3", where default_flag is 0 or 1 and the other variables can have any value.
R code:
rm(list=ls())
df <- read.csv("data_file.csv") #use the full path if needed
mylogit <- glm(default_flag ~ var1 + var2 + var3, family = "binomial" , data = df)
summary(mylogit)
library(ROCR)
df$score<-predict.glm(mylogit, type="response" )
pred<-prediction(df$score,df$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
Note that df$score will give you the probability of default.
In case you want to use this logit (same regression coefficients) on another data set df2 for cross-validation, use
df2 <- read.csv("data_file2.csv")
df2$score<-predict.glm(mylogit,newdata=df2, type="response" )
pred<-prediction(df2$score,df2$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
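If you want the numeric AUC value rather than the printed performance object, it is stored in the object's y.values slot:
auc@y.values[[1]]  # extract the AUC as a plain number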
The problem is, as pointed out by others, that prediction in ROCR expects numerical values. If you are inserting predictions from randomForest (as the first argument to prediction in ROCR), that prediction needs to be generated with type='prob' instead of type='response', which is the default. Alternatively, you could take the type='response' results and convert them to numeric (that is, if your responses are, say, 0/1). But when you plot that, ROCR generates only a single meaningful point on the ROC curve. To get many points on your ROC curve, you really need the probability associated with each prediction, i.e. use type='prob' when generating predictions.
The problem may be that you would like to run the prediction function on multiple runs, for example for cross-validation.
In this case, the "predictions" and "labels" arguments of prediction(predictions, labels, label.ordering = NULL) should be lists or matrices, with one element (or column) per run.
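A minimal sketch using the ROCR.xval example data that ships with ROCR, where ten cross-validation runs are stored as parallel lists:
library(ROCR)
data(ROCR.xval)  # $predictions and $labels are lists with one element per run
pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf, avg = "vertical")  # average the per-run ROC curves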
Try this one:
library(ROCR)
pred <- ROCR::prediction(as.numeric(as.character(bankrf.pred)), bank_testing$y)
The function prediction is present in many packages. You should explicitly specify ROCR:: to use the one from ROCR. This one worked for me.
