use stepAIC on a list of models - r

I want to do stepwise regression using AIC on a list of linear models. idea is to use e a list of linear models and then apply stepAIC on each list element. It fails.
I tried to track the problem down. I think I found the problem. However, I don't understand the cause. Try the code to see the difference between three cases:
require(MASS)
n<-30
x1<-rnorm(n, mean=0, sd=1) #create rv x1
x2<-rnorm(n, mean=1, sd=1)
x3<-rnorm(n, mean=2, sd=1)
epsilon<-rnorm(n,mean=0,sd=1) # random error variable
dat<-as.data.frame(cbind(x1,x2,x3,epsilon)) # combine to a data frame
dat$id<-c(rep(1,10),rep(2,10),rep(3,10))
# y is combination from all three x and a random uniform variable
dat$y<-x1+x2+x3+epsilon
# apply lm() only resulting in a list of models
dat.lin.model.lst<-lapply(split(dat,dat$id),function(d) lm(y~x1+x2+x3,data=d))
stepAIC(dat.lin.model.lst[[1]]) # FAIL!!!
# apply function stepAIC(lm())- works
dat.lin.model.stepAIC.lst<-lapply(split(dat,dat$id),function(d) stepAIC(lm(y~x1+x2+x3,data=d)))
# create model for particular group with id==1
k<-which(dat$id==1) # manually select records with id==1
lin.model.id1<-lm(dat$y[k]~dat$x1[k]+dat$x2[k]+dat$x3[k])
stepAIC(lin.model.id1) # check stepAIC - works!
I am pretty sure that stepAIC() needs the original data from data.frame "dat". That is what I was thinking of before. (Hope I am right on that)
But there is no parameter in stepAIC() where I can pass the original data frame. Obviously, for plain models not wrapped in a list it's enough to pass the model. (last three lines in code) So I am wondering:
Q1: How does stepAIC knows where to find the original data "dat" (not only the model data which is passed as parameter)?
Q2: How can I possibly know that there is another parameter in stepAIC() which is not explicitly stated in the help pages? (maybe my English is just too bad to find)
Q3: How can I pass that parameter to stepAIC()?
It must be somewhere in the environment of the apply function and passing on the data. Either lm() or stepAIC() and the pointer/link to the raw data must get lost somewhere. I have not a good understanding what an environment in R does. For me it was kind of isolating local from global variables. But maybe its more complicated. Anyone who can explain that to me in regard to the problem above? Honestly, I dont read much out of the R documentation. Any better understanding would help me.
OLD:
I have data in a dataframe df that can be split into several subgroups. For that purpose I created a groupID called df$id. lm() returns the coefficent as expected for the first subgroup. I want to do a stepwise regression using AIC as criterion for each subgroup separately. I use lmList {lme4} which results in a model for each subgroup (id). But if I use stepAIC{MASS} for the list elements it throws an error. see below.
So the question is: What mistake is in my procedure/syntax? I get results for single models but not the ones created with lmList. Does lmList() store different information on the model than lm() does?
But in the help it states:
class "lmList": A list of objects of class lm with a common model.
>lme4.list.lm<-lmList(formula=Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px |df$id,data = df)
>lme4.list.lm[[1]]
Call: lm(formula = formula, data = data)
Coefficients:
(Intercept) Gap.um Standoff.um Voidflaeche.px
62.306133 -0.009878 0.026317 -0.015048
>stepAIC(lme4.list.lm[[1]], direction="backward")
#stepAIC on first element on the list of linear models
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Error in terms.formula(formula, data = data) :
'data' argument is of the wrong type
Obviously something does not work with the list. But I have not an idea what it might be.
Since I tried to do the same with the base package which creates the same model (at least the same coefficients). Results are below:
>lin.model<-lm(Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px,df[which(df$id==1),])
# id is in order, so should be the same subgroup as for the first list element in lmList
Coefficients:
(Intercept) Gap.um Standoff.um Voidflaeche.px
62.306133 -0.009878 0.026317 -0.015048
Well, this is what I get returned using stepAIC on my linear.model .
As far as I know the akaike information criterion can be used to estimate which model better balances between fit and generalization given some data.
>stepAIC(lin.model,direction="backward")
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Step: AIC=293.14
Scherkraft.N ~ Gap.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Gap.um 1 28.51 7215.8 291.38
<none> 7187.3 293.14
- Voidflaeche.px 1 717.63 7904.9 296.85
Step: AIC=291.38
Scherkraft.N ~ Voidflaeche.px
Df Sum of Sq RSS AIC
<none> 7215.8 291.38
- Voidflaeche.px 1 795.46 8011.2 295.65
Call: lm(formula = Scherkraft.N ~ Voidflaeche.px, data = df[which(df$id == 1), ])
Coefficients:
(Intercept) Voidflaeche.px
71.7183 -0.0151
I read from the output I should use the model: Scherkraft.N ~ Voidflaeche.px because this is the minimal AIC. Well, it would be nice if someone could shortly describe the output. My understanding of the stepwise regression (assuming backwards elimination) is all regressors are included in the initial model. Then the least important one is eliminated. The criterion to decide is the AIC. and so forth... Somehow I have problems to get the tables interpreted right. It would be nice if someone could confirm my interpretation. The "-"(minus) stands for the eliminated regressor. On top is the "start" model and in the table table below the RSS and AIC are calculated for possible eliminations. So the first row in the first table says a model Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px - Standoff.um would result in an AIC 293.14. Choose the one without Standoff.um: Scherkraft.N~Gap.um+Voidflaeche.px
EDIT:
I replaced the lmList{lme4} with dlply() to create the list of models.
Still stepAIC is not coping with the list. It throws another error. Actually, I believe it is a problem with the data stepAIC needs to run through. I was wondering how it calculates the AIC-value for each step just from the model data. I would take the original data to construct the models leaving one regressor out each time. Thereof I would calculate the AIC and compare. So how stepAIC is working if it has not access to the original data. (I cant see a parameter where I pass the original data to stepAIC). Still, I have no clue why it works with a plain model but not with the model wrapped in a list.
>model.list.all <- dlply(df, .id, function(x)
{return(lm(Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px,data=x)) })
>stepAIC(model.list.all[[1]])
Start: AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
Df Sum of Sq RSS AIC
- Standoff.um 1 2.81 7187.3 293.14
- Gap.um 1 29.55 7214.0 293.37
<none> 7184.4 295.12
- Voidflaeche.px 1 604.38 7788.8 297.97
Error in is.data.frame(data) : object 'x' not found

I'm not sure what may have changed in the versioning to make the debugging so difficult, but one solution would be to use do.call, which evaluates the expressions in the call before executing it. This means that instead of storing just d in the call, so that update and stepAIC need to go find d in order to do their work, it stores a full representation of the data frame itself.
That is, do
do.call("lm", list(y~x1+x2+x3, data=d))
instead of
lm(y~x1+x2+x3, data=d)
You can see what it's trying to do by looking at the call element of the model, perhaps like this:
dat.lin.model.lst <- lapply(split(dat, dat$id), function(d)
do.call("lm", list(y~x1+x2+x3, data=d)) )
dat.lin.model.lst[[1]]$call
It's also possible to make your list of data frames in the global environment and then construct the call so that update and stepAIC look for each data frame in turn, because their environment chains always lead back to the global environment; like this:
dats <- split(dat, dat$id)
dat.lin.model.list <- lapply(seq_along(dats), function(d)
do.call("lm", list(y~x1+x2+x3, data=call("[[", quote(dats),i))) )
To see what's changed, run dat.lin.model.lst[[1]]$call again.

As it seems that stepAIC goes out of loop environment (that is in global environment) to look for the data it needs, I trick it using the assign function:
results <- do.call(rbind, lapply(response, function (i) {
assign("i", response, envir = .GlobalEnv)
mdl <- gls(as.formula(paste0(i,"~",paste(expvar, collapse = "+")), data= parevt, correlation = corARMA(p=1,q=1,form= ~as.integer(Year)), weights= varIdent(~1/Linf_var), method="ML")
mdl <- stepAIC(mdl, direction ="backward")
}))

Related

LMER Factor vs numeric Interaction

I am attempting to use lmer to model my data.
My data has 2 independent variables and a dependent variable.
The first is "Morph" and has values "Identical", "Near", "Far".
The second is "Response" which can be "Old" or "New".
The dependent variable is "Fix_Count".
So here is a sample dataframe and what I currently have for running the linear model.
Subject <- c(rep(1, times = 6), rep(2, times = 6))
q <- c("Identical", "Near", "Far")
Morph <- c(rep(q, times = 4))
t <- c(rep("old", times = 3),rep("new", times=3))
Response <- c(rep(t, times = 2))
Fix_Count <- sample(1:9, 12, replace = T)
df.main <- data.frame(Subject,Morph, Response, Fix_Count, stringsAsFactors = T)
df.main$Subject <- as.factor(df.main$Subject)
res = lmer(Fix_Count ~ (Morph * Response) + (1|Subject), data=df.main)
summary(res)
And the output looks like this:
The issue is I do not want it to do combination but an overall interaction of Morph:Response.
I can get it to do this by converting Morph to numeric instead of factor. However I'm not sure conceptually that makes sense as the values don't properly represent 1,2,3 but low-mid-high (ordered but qualitative).
So: 1. Is it possible to run lmer to get interaction effects between 2 factor variables?
2. Or do you think numeric is a fine way to class "Identica", "Near", "Far"?
3. I have tried setting contrasts to see if that can help, but sometimes I get an error and other times it seems like nothing is changed. If contrasts would help, could you explain how I would implement this?
Thank you so much for any help you can offer. I have also posted this question to stack exchange as I am unsure if this is a coding issue or a stats issue. However I can remove it from the less relevant forum once I know.
Best, Kirk
Two problems I see. First, you should be using a factor variable for Subject. It's clearly not a continuous or integer variable. And to (possibly) address part of your question, there is an interaction function designed to work with regression formulas. I'm pretty sure that the formula interface will interpret the "*" operator that you used as a call to interaction, but the labeling of the output may be different and perhaps more to your liking. I get the same number of coefficients with:
res = lmer(Fix_Count ~ interaction(Morph , Response) + (1|Subject), data=df.main)
But that's not an improvement.
However, they differ from the model created with Morph*Response. Probably there is a different set of contrast options.
The way to get an overall statistical test of the interaction is to compare nested models:
res_simple = lmer(Fix_Count ~ Morph + Response + (1|Subject), data=df.main)
And then do an anova for the model comparison:
anova(res,res_simple)
refitting model(s) with ML (instead of REML)
Data: df.main
Models:
res_simple: Fix_Count ~ Morph + Response + (1 | Subject)
res: Fix_Count ~ interaction(Morph, Response) + (1 | factor(Subject))
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
res_simple 6 50.920 53.830 -19.460 38.920
res 8 54.582 58.461 -19.291 38.582 0.3381 2 0.8445
My opinion is that it is sufficiently close to the boundary for stats vs coding that it could have been acceptable on either forum. (You are not supposed to cross post, however.) If you are satisfied with a coding answer then we are done. If you need help with understanding model comparison, then you may need to edit your CV.com's question to request a more theory-based answer than mine. (I checked to make sure the anova results are the same regardless of whether you use the interaction function or the "*" operator.)

R - Determine goodness of fit of new data with predict function based on existing lm

I am trying to apply an existing model to a new data set. I try to explain it with an example. I am wondering what an elegant way to determine the goodness of the fit would look like.
Basically, I run a regression and obtain a model. With the summary function I obtain the usual output such as adjusted R-squared, p-value etc.
model.lm <- lm(Sepal.Length ~ Petal.Length, data = iris[1:75,])
summary(model.lm)
Now I want to run the predict function on new data and I am curious to know how the model performs on the new data.
pred.dat <- predict(model.lm, newdata = iris[76:150,])
I wanted to ask how I can for instance get an adjusted R-squared for the predicted values with the new data. For instance, is there something similar like the summary function? Ideally, I would like to find out what the best practice of obtaining the goodness of fit of a an existing model based on new data looks like.
Many thanks
You can translate the formula of R-squared into a function, such as:
r_squared <- function(vals, preds) {
1 - (sum((vals - preds)^2) / sum((vals - mean(preds))^2))
}
# Test
> r_squared(iris[76:150,]$Sepal.Length, pred.dat)
#[1] 0.5675686
Building upon this function, and using the correct formula we can also define adjusted R-squared as:
r_squared_a <- function(vals, preds, k) {
1 - ((1-r_squared(vals, preds))*(length(preds)-1))/(length(preds) - k - 1)
}
Where k is the number of predictors, thus:
> r_squared_a(iris[76:150,]$Sepal.Length, pred.dat, 1)
#[1] 0.5616448

predict.lm with arbitrary coefficients r

I'm trying to predict an lm object using predict.lm. However, I would like to use manually inserted coefficients.
To do this I tried:
model$coefficients <- coeff
(where "coeff" is a vector of correct coefficients)
which would indeed modify the coefficients as I want. Nevertheless, when I execute
predict.lm(model, new.data)
I just get predictions calculated with the "old" parameters. Is there a way I could force predict.lm to use the new ones?
Post Scriptum: I need to do this to fit a bin-smooth (also called regressogram).
In addition, when I predict "by hand" (i.e. using matrix multiplication) the results are fine, hence I'm quite sure that the problem lies in the predict.lm not recognizing my new coefficients.
Thanks in advance for the help!
Hacking the $coefficients element does indeed seem to work. Can you show what doesn't work for you?
dd <- data.frame(x=1:5,y=1:5)
m1 <- lm(y~x,dd)
m1$coefficients <- c(-2,1)
m1
## Call:
## lm(formula = y ~ x, data = dd)
##
## Coefficients:
## [1] -2 1
predict(m1,newdata=data.frame(x=7)) ## 5 = -2+1*7
predict.lm(...) gives the same results.
I would be very careful with this approach, checking each time you do something different with the hacked model.
In general it would be nice if predict and simulate methods took a newparams argument, but they don't in general ...

How to get values of Z - statistic from glm object?

How can I get values of Z - statistics as a vector from glm object?
For example, I have
fit <- glm(y ~ 0 + x,binomial)
How can I access the column Pr(>|z|) the same way I get estimates of coefficients with fit$coef?
I believe that
coef(summary(fit))[,"Pr(>|z|)"]
will get you what you want. (summary.glm() returns an object that has a coef() method that returns the coefficient table.) (By the way, if accessor methods exist it's better to use them than to directly access the components of the fitted model -- e.g. coef(fit) is better than fit$coef.)
pull out p-values and r-squared from a linear regression gives a similar answer.
I would suggest methods(class="summary.glm") to find available accessor methods, but it's actually a little bit trickier than that because the default methods (in this case coef.default()) may also be relevant ...
PS if you want the Z values, coef(summary(fit))[,"z value"] should do it (your question is a little bit ambiguous: usually when people say "Z statistic" they mean the want the value of the test statistic, rather than the p value)
You can access to the info you want by doing
utils::data(anorexia, package="MASS") # Some data
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
family = gaussian, data = anorexia) # a glm model
summary(anorex.1)
summary(anorex.1)$coefficients[,3] # vector of t-values
(Intercept) Prewt TreatCont TreatFT
3.716770 -3.508689 -2.163761 2.138933
summary(anorex.1)$coefficients[,4] # vector of p-values
(Intercept) Prewt TreatCont TreatFT
0.0004101067 0.0008034250 0.0339993147 0.0360350847
summary(anorex.1)$coefficients[,3:4] # matrix
t value Pr(>|t|)
(Intercept) 3.716770 0.0004101067
Prewt -3.508689 0.0008034250
TreatCont -2.163761 0.0339993147
TreatFT 2.138933 0.0360350847
str function will show you where each element within an object is located, and [[ accessors (better than $, as pointed out by #DWin and #Ben Bolker) will extract the info for you. Try str(summary(anorex.1)) to see what this function does.
I use the summary syntax like: summary(my_model)[[1]][[2]]. You can try to use different combinations of numbers in "[[]]" to extract the data required. Of course, if I correctly understood your question :)

Predicting with lm object in R - black box paradigm

I have a function that returns an lm object. I want to produce predicted values based on some new data. The new data is a data.frame in the exact format as the data passed to the lm function, except that the response has been removed (since we're predicting, not training). I would expect to execute the following, but get an error:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
In my case, ModelResponse was the name of the response column in the data I originally trained on. So just for kicks, I tried to insert NA reponse:
newdata$ModelResponse = NA
predict( model , newdata )
Error in terms.default(object, data = data) : no terms component nor attribute
Highly frustrating! R's notion of models/regression doesn't match mine: 1. I train a model with some data and get a model object. 2. I can score new data from any environment/function/frame/etc. so long as I input data into the model object that "looks like" the data I trained on (i.e. same column names). This is a standard black-box paradigm.
So here are my questions:
1. What concept(s) am I missing here?
2. How do I get my scenario to work?
3. How can I get model object to be portable? str(model) shows me that the model object saved the original data it trained on! So the model object is massive. I want my model to be portable to any function/environment/etc. and only contain the data it needs to score.
In the absence of str() on either the model or the data offered to the model, here's my guess regarding this error message:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
I guess that you made a model object named "model" and that your outcome variable (the left-hand-side of the formula( in the original call to lm was named "ModelResponse" and that you then named a column in newdata by the same name. But what you should have done was leave out the "ModelResponse" columns (because that is what you are predicting) and put in the "Model_Predictor1", Model_Predictor2", etc. ... i.e. all the names on the right-hand-side of the formula given to lm()
The coef() function will allow you to extract the information needed to make the model portable.
mod.coef <- coef(model)
mod.coef
Since you expressed interest in the rms/Hmisc package combo Function, here it is using the help-example from ols and comparing the output with an extracted function and the rms Predict method. Note the capitals, since these are designed to work with the package equivalents of lm and glm(..., family="binomial") and coxph, which in rms become ols, lrm, and cph.
> set.seed(1)
> x1 <- runif(200)
> x2 <- sample(0:3, 200, TRUE)
> distance <- (x1 + x2/3 + rnorm(200))^2
> d <- datadist(x1,x2)
> options(datadist="d") # No d -> no summary, plot without giving all details
>
>
> f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
>
> Function(f)
function(x1 = 0.50549065,x2 = 1) {0.50497361+1.0737604* x1-
0.79398383*pmax(x1-0.083887788,0)^3+ 1.4392827*pmax(x1-0.38792825,0)^3-
0.38627901*pmax(x1-0.65115162,0)^3-0.25901986*pmax(x1-0.92736774,0)^3+
0.06374433*x2+ 0.60885222*(x2==2)+0.38971577*(x2==3) }
<environment: 0x11b4568e8>
> ols.fun <- Function(f)
> pred1 <- Predict(f, x1=1, x2=3)
> pred1
x1 x2 yhat lower upper
1 1 3 1.862754 1.386107 2.339401
Response variable (y): sqrt(distance)
Limits are 0.95 confidence limits
# The "yhat" is the same as one produces with the extracted function
> ols.fun(x1=1, x2=3)
[1] 1.862754
(I have learned through experience that the restricted cubic-spline fit functions coming from rms need to have spaces and carriage returns added to improve readability. )
Thinking long-term, you should probably take a look at the caret package. Many or most modeling functions work with data frames and matrices, others have a preference, and there may be other variations of their expectations. It's important to quickly get your head around each, but if you want a single wrapper that will simplify life for you, making the intricacies into a "black box", then caret is as close as you can get.
As a disclaimer: I do not use caret, as I don't think modeling should be a be a black box. I've had more than a few emails to maintainers of modeling packages resulting from looking into their code and seeing something amiss. Wrapping that in another layer would not serve my interests. So, in the very long-run, avoid caret and develop an enjoyment for dissecting what's going into and out of the different modeling functions. :)

Resources