Computing all subsets of a vector in R - r

I want do write a small function which I can use for automatic feature selection in a logistic regression in R, by testing in a brute force manner all subsets of predictor variables and then evaluate via CV their classification performance.
Surprisingly I did not find a package which does this "all subset feature selection" and thus I would like to implement it myself.
Unfortunately my limited R knowledge makes me fail to write a loop which generates all subsets of a given vector and I was wondering if someone could point me in the right direction

Caveat incernor
The bestglm package is what you are after
The function bestglm selects the best subset of inputs for the glm family. The selec-
tion methods available include a variety of information criteria as well as cross-validation
The vignette goes through a number of examples.
library(bestglm)
data(SAHeart)
# using Cross valiation for selection
out<-bestglm(SAheart,IC = 'CV', family=binomial, t = 10)
out
# CVd(d = 373, REP = 10)
# BICq equivalent for q in (0.190525988534159, 0.901583162187443)
# Best Model:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -6.44644451 0.92087165 -7.000372 2.552830e-12
# tobacco 0.08037533 0.02587968 3.105731 1.898095e-03
# ldl 0.16199164 0.05496893 2.946967 3.209074e-03
# famhistPresent 0.90817526 0.22575844 4.022774 5.751659e-05
# typea 0.03711521 0.01216676 3.050542 2.284290e-03
# age 0.05046038 0.01020606 4.944159 7.647325e-07

Wouldn't drop1() and add1() be helpful for your purpose? They come with the usual caution that automatic feature selection may not always be the most appropriate thing to do, but I presume you have made an informed choice on this.

You can use paste() + combn(), e.g.
varnames <- c("a","b","c")
rhs <- unlist( sapply(1:length(varnames),function(k) apply(combn(varnames,k),2,paste,collapse=" + ") ) )
formulae <- as.formula( quote( paste("z ~", rhs) ) )
... but perhaps there is a more elegant way?

Related

Allowing for aliased coefficients when running `grangertest()` in R

I'm currently trying to run a granger causality analysis in R/R Studio. I am receiving errors about aliased coefficients when using the function grangertest(). From my understanding, this occurs because there is perfect multicolinearity between the variables.
Due to having a very large number of pairwise comparisons (e.g. 200+), I would like to simply run the granger with the aliased coefficients as per normal rather than returning an error. According to one answer here, the solution is or was to add set singular.ok=TRUE, but either I am doing it incorrectly the answer s out of date. I've tried checking the documentation, but have come up empty. Any help would be appreciated.
library(lmtest)
x <- c(0,1,2,3)
y <- c(0,3,6,9)
grangertest(x,y,1) # I want this to run successfully even if there are aliased coefficients.
grangertest(x,y,1, singular.ok=TRUE) # this also doesn't work
"Error in waldtest.lm(fm, 2, ...) :
there are aliased coefficients in the model"
Additionally is there a way to flag x and y are actually aliased variables? There seem to be some answers like here but I'm having issues getting it working properly.
alias((x~ y))
Thanks in advance.
After some investigation and emailing the creator of the grangertest package, they sent me this solution. The solution should run on aliased variables when granger test does not. When the variables are not aliased, the solution should give the same values as the normal granger test.
library(lmtest)
library(dynlm)
# Some data that is multicolinear
x <- c(0,1,2,3,4)
y <- c(0,3,6,9,12)
# Some data that is not multicolinear
# x <- c(0,125,200,230,777)
# y <- c(0,3,6,9,200)
# Convert to time series (this is an important step)
x=ts(x)
y=ts(y)
# This will run even when the data is multicolinear (but also when it is not)
# and is functionally the same as running the granger test (which by default uses the waldtest
m1 = dynlm(x ~ L(x, 1:1) + L(y, 1:1))
m2 = dynlm(x ~ L(x, 1:1))
result <-anova(m1, m2, test="F")
# This will fail if the data is multicolinear or aliased but should give the same results as the anova otherwise (F value and P value etc)
#grangertest(y,x,1)

How to write a function to check model assumptions for a linear model in R?

I'm making a lot of models in R and trying to check the model assumptions for all of them. It would be awesome if I could write a function to do it all in one go, but it doesn't seem to be working.
I have:
assumptionfunction <- function(y, modelobject){
plot(x)
plot(y, x$residuals)
qqnorm(x$residuals)
}
And I'm getting lots of errors.
Instead of creating your own function, you can use an existing one. The beautiful check_model() function from the performance package does just that:
library(performance)
library(see)
model <- lm(mpg ~ wt * cyl + gear, data = mtcars)
check_model(model)
If you insist on using some objective tests, there is the gvlma package.
library(gvlma)
gvlma(model)
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Value p-value Decision
Global Stat 1.770046 0.7780 Assumptions acceptable.
Skewness 0.746520 0.3876 Assumptions acceptable.
Kurtosis 0.003654 0.9518 Assumptions acceptable.
Link Function 0.927065 0.3356 Assumptions acceptable.
Heteroscedasticity 0.092807 0.7606 Assumptions acceptable.
Now if you don't like gvlma because it doesn't explicitly name the tests used and gives Skewness and Kurtosis but not overall normality from, say, Shapiro-Wilk, I made a convenience function. It gets all tests names and assumptions at once with the total number of assumptions that are not respected. You can take it and modify it to suit your needs.
# Load the function:
source("https://raw.githubusercontent.com/RemPsyc/niceplots/master/niceAssFunction.R")
View(niceAss(model))
Interpretation: (p) values < .05 imply assumptions are not respected.
Diagnostic is how many assumptions are not respected for a given model or variable.
Applied to a list of models:
# Define our dependent variables
(DV <- names(mtcars[-1]))
# Make list of all formulas
(formulas <- paste(DV, "~ mpg"))
# Make list of all models
models.list <- sapply(X = formulas, FUN = lm, data = mtcars, simplify = FALSE, USE.NAMES = TRUE)
# Make diagnostic table
(ass.table <- do.call("rbind", lapply(models.list, niceAss)))
# Use the Viewer for better results
View(ass.table)

predict.lm with arbitrary coefficients r

I'm trying to predict an lm object using predict.lm. However, I would like to use manually inserted coefficients.
To do this I tried:
model$coefficients <- coeff
(where "coeff" is a vector of correct coefficients)
which would indeed modify the coefficients as I want. Nevertheless, when I execute
predict.lm(model, new.data)
I just get predictions calculated with the "old" parameters. Is there a way I could force predict.lm to use the new ones?
Post Scriptum: I need to do this to fit a bin-smooth (also called regressogram).
In addition, when I predict "by hand" (i.e. using matrix multiplication) the results are fine, hence I'm quite sure that the problem lies in the predict.lm not recognizing my new coefficients.
Thanks in advance for the help!
Hacking the $coefficients element does indeed seem to work. Can you show what doesn't work for you?
dd <- data.frame(x=1:5,y=1:5)
m1 <- lm(y~x,dd)
m1$coefficients <- c(-2,1)
m1
## Call:
## lm(formula = y ~ x, data = dd)
##
## Coefficients:
## [1] -2 1
predict(m1,newdata=data.frame(x=7)) ## 5 = -2+1*7
predict.lm(...) gives the same results.
I would be very careful with this approach, checking each time you do something different with the hacked model.
In general it would be nice if predict and simulate methods took a newparams argument, but they don't in general ...

How to extract info from package in R and use in function?

I apologize for the vague question title. What I want to do is run a regression in R using geeglm from the geepack R package, then use information from that to calculate a quasilikelihood information criteria (QIC; Pan 2001). I can do this fairly easily for single models but I would like to write a general function that can do this for a variety of different types of models. I guess my real question is whether there is a better alternative than having a long series of nested ifelse statements?
Here's my current code:
library(geepack)
data(dietox) #data from the geepack package
# Run gee regression
dietox$Cu <- as.factor(dietox$Cu)
mf <- formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))
gee1 <- geeglm(mf, data = dietox, id = Pig, family = gaussian, corstr = "ar1")
Then I can run a function to calculate the quasilikelihood:
QlogLik.normal <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
# Quasi Likelihood for Normal
quasi.R <- sum(((y - mu.R)^2)/-2)
quasi.R
}
However, I would like to write a function that is more general because the quasilikelihood function is different for every distribution. The above function would work for gee1 because it had a gaussian (normal) distribution. If I wanted to generalize it for a variety of distributions I could use a series of nested ifelse statements (below), but I don't know if this is the best way to do this. Does anyone have other options or a better solution? This just doesn't seem very elegant to say the least (clearly I don't have much programming or R experience).
QlogLik <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
ifelse(model.R$modelInfo$variance == "poisson",
# Quasi Likelihood for Poisson
quasi.R <- sum((y*log(mu.R)) - mu.R),
ifelse(model.R$modelInfo$variance == "gaussian",
# Quasi Likelihood for Normal
quasi.R <- sum(((y - mu.R)^2)/-2),
ifelse(model.R$modelInfo$variance == "binomial",
# Quasilikelihood for Binomial
quasi.R <- sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
quasi.R <- "Error: distribution not recognized")))
quasi.R
}
In this example, I used the model output from geeglm to extract the type of distribution used to model the variance
model.R$modelInfo$variance
but there may be other ways to determine what distribution was used in the geeglm model. Any help would be appreciated.
You should be able to rewrite your function like this:
QlogLik <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
type <- family(model.R)$family
switch(type,
poisson = sum((y*log(mu.R)) - mu.R),
gaussian = sum(((y - mu.R)^2)/-2),
binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
stop("Error: distribution not recognized"))
}
As #baptise points out, switch useful in these cases. You use family(model.R)$family to automatically detect what family type should be used with switch.
Also, if your commands for what to do in different cases run beyond one line, you can wrap the lines with curly brackets ({ do something here }) instead.
switch(type,
type1 = { something <- do(this)
thisis(something) },
type2 = do(that))
I hope this helps!
You may also use model.R$family$family which gives the type of distribution used to model the variance, but so far I didn't know if you could eliminate those ifelse statements. The quasi.R in your code differs among different distributions, so you have to define each of them separately.
BTW, it is a good question and thanks for posting it: I had similar situations in the past, and hope to get some advice on how to write the codes more efficiently.

use stepAIC on a list of models

I want to do stepwise regression using AIC on a list of linear models. idea is to use e a list of linear models and then apply stepAIC on each list element. It fails.
I tried to track the problem down. I think I found the problem. However, I don't understand the cause. Try the code to see the difference between three cases:
require(MASS)
n<-30
x1<-rnorm(n, mean=0, sd=1) #create rv x1
x2<-rnorm(n, mean=1, sd=1)
x3<-rnorm(n, mean=2, sd=1)
epsilon<-rnorm(n,mean=0,sd=1) # random error variable
dat<-as.data.frame(cbind(x1,x2,x3,epsilon)) # combine to a data frame
dat$id<-c(rep(1,10),rep(2,10),rep(3,10))
# y is combination from all three x and a random uniform variable
dat$y<-x1+x2+x3+epsilon
# apply lm() only resulting in a list of models
dat.lin.model.lst<-lapply(split(dat,dat$id),function(d) lm(y~x1+x2+x3,data=d))
stepAIC(dat.lin.model.lst[[1]]) # FAIL!!!
# apply function stepAIC(lm())- works
dat.lin.model.stepAIC.lst<-lapply(split(dat,dat$id),function(d) stepAIC(lm(y~x1+x2+x3,data=d)))
# create model for particular group with id==1
k<-which(dat$id==1) # manually select records with id==1
lin.model.id1<-lm(dat$y[k]~dat$x1[k]+dat$x2[k]+dat$x3[k])
stepAIC(lin.model.id1) # check stepAIC - works!
I am pretty sure that stepAIC() needs the original data from data.frame "dat". That is what I was thinking of before. (Hope I am right on that)
But there is no parameter in stepAIC() where I can pass the original data frame. Obviously, for plain models not wrapped in a list it's enough to pass the model. (last three lines in code) So I am wondering:
Q1: How does stepAIC knows where to find the original data "dat" (not only the model data which is passed as parameter)?
Q2: How can I possibly know that there is another parameter in stepAIC() which is not explicitly stated in the help pages? (maybe my English is just too bad to find)
Q3: How can I pass that parameter to stepAIC()?
It must be somewhere in the environment of the apply function and passing on the data. Either lm() or stepAIC() and the pointer/link to the raw data must get lost somewhere. I have not a good understanding what an environment in R does. For me it was kind of isolating local from global variables. But maybe its more complicated. Anyone who can explain that to me in regard to the problem above? Honestly, I dont read much out of the R documentation. Any better understanding would help me.
OLD:
I have data in a dataframe df that can be split into several subgroups. For that purpose I created a groupID called df$id. lm() returns the coefficent as expected for the first subgroup. I want to do a stepwise regression using AIC as criterion for each subgroup separately. I use lmList {lme4} which results in a model for each subgroup (id). But if I use stepAIC{MASS} for the list elements it throws an error. see below.
So the question is: What mistake is in my procedure/syntax? I get results for single models but not the ones created with lmList. Does lmList() store different information on the model than lm() does?
But in the help it states:
class "lmList": A list of objects of class lm with a common model.
>lme4.list.lm<-lmList(formula=Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px |df$id,data = df)
>lme4.list.lm[[1]]
Call: lm(formula = formula, data = data)
Coefficients:
(Intercept) Gap.um Standoff.um Voidflaeche.px
62.306133 -0.009878 0.026317 -0.015048
>stepAIC(lme4.list.lm[[1]], direction="backward")
#stepAIC on first element on the list of linear models
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Error in terms.formula(formula, data = data) :
'data' argument is of the wrong type
Obviously something does not work with the list. But I have not an idea what it might be.
Since I tried to do the same with the base package which creates the same model (at least the same coefficients). Results are below:
>lin.model<-lm(Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px,df[which(df$id==1),])
# id is in order, so should be the same subgroup as for the first list element in lmList
Coefficients:
(Intercept) Gap.um Standoff.um Voidflaeche.px
62.306133 -0.009878 0.026317 -0.015048
Well, this is what I get returned using stepAIC on my linear.model .
As far as I know the akaike information criterion can be used to estimate which model better balances between fit and generalization given some data.
>stepAIC(lin.model,direction="backward")
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Step: AIC=293.14
Scherkraft.N ~ Gap.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Gap.um 1 28.51 7215.8 291.38
<none> 7187.3 293.14
- Voidflaeche.px 1 717.63 7904.9 296.85
Step: AIC=291.38
Scherkraft.N ~ Voidflaeche.px
Df Sum of Sq RSS AIC
<none> 7215.8 291.38
- Voidflaeche.px 1 795.46 8011.2 295.65
Call: lm(formula = Scherkraft.N ~ Voidflaeche.px, data = df[which(df$id == 1), ])
Coefficients:
(Intercept) Voidflaeche.px
71.7183 -0.0151
I read from the output I should use the model: Scherkraft.N ~ Voidflaeche.px because this is the minimal AIC. Well, it would be nice if someone could shortly describe the output. My understanding of the stepwise regression (assuming backwards elimination) is all regressors are included in the initial model. Then the least important one is eliminated. The criterion to decide is the AIC. and so forth... Somehow I have problems to get the tables interpreted right. It would be nice if someone could confirm my interpretation. The "-"(minus) stands for the eliminated regressor. On top is the "start" model and in the table table below the RSS and AIC are calculated for possible eliminations. So the first row in the first table says a model Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px - Standoff.um would result in an AIC 293.14. Choose the one without Standoff.um: Scherkraft.N~Gap.um+Voidflaeche.px
EDIT:
I replaced the lmList{lme4} with dlply() to create the list of models.
Still stepAIC is not coping with the list. It throws another error. Actually, I believe it is a problem with the data stepAIC needs to run through. I was wondering how it calculates the AIC-value for each step just from the model data. I would take the original data to construct the models leaving one regressor out each time. Thereof I would calculate the AIC and compare. So how stepAIC is working if it has not access to the original data. (I cant see a parameter where I pass the original data to stepAIC). Still, I have no clue why it works with a plain model but not with the model wrapped in a list.
>model.list.all <- dlply(df, .id, function(x)
{return(lm(Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px,data=x)) })
>stepAIC(model.list.all[[1]])
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Error in is.data.frame(data) : object 'x' not found
I'm not sure what may have changed in the versioning to make the debugging so difficult, but one solution would be to use do.call, which evaluates the expressions in the call before executing it. This means that instead of storing just d in the call, so that update and stepAIC need to go find d in order to do their work, it stores a full representation of the data frame itself.
That is, do
do.call("lm", list(y~x1+x2+x3, data=d))
instead of
lm(y~x1+x2+x3, data=d)
You can see what it's trying to do by looking at the call element of the model, perhaps like this:
dat.lin.model.lst <- lapply(split(dat, dat$id), function(d)
do.call("lm", list(y~x1+x2+x3, data=d)) )
dat.lin.model.lst[[1]]$call
It's also possible to make your list of data frames in the global environment and then construct the call so that update and stepAIC look for each data frame in turn, because their environment chains always lead back to the global environment; like this:
dats <- split(dat, dat$id)
dat.lin.model.list <- lapply(seq_along(dats), function(d)
do.call("lm", list(y~x1+x2+x3, data=call("[[", quote(dats),i))) )
To see what's changed, run dat.lin.model.lst[[1]]$call again.
As it seems that stepAIC goes out of loop environment (that is in global environment) to look for the data it needs, I trick it using the assign function:
results <- do.call(rbind, lapply(response, function (i) {
assign("i", response, envir = .GlobalEnv)
mdl <- gls(as.formula(paste0(i,"~",paste(expvar, collapse = "+")), data= parevt, correlation = corARMA(p=1,q=1,form= ~as.integer(Year)), weights= varIdent(~1/Linf_var), method="ML")
mdl <- stepAIC(mdl, direction ="backward")
}))

Resources