Iteratively adding variables to an lm() function in R? - r

Simple question, but I'm finding myself boggled.
I'm looking to make a loop that keeps adding variables to the IV (independent variable) side of an lm() call. I would test the results of each fit until a condition is met. I'm just having trouble finding a way to dynamically add variables to the IV part of the regression, one at a time.
The 1st iteration would look like:
lm(Y ~ X, data = data)
The second iteration like:
lm(Y ~ X + X2, data = data)
The third iteration like:
lm(Y ~ X + X2 + X3, data = data)
And so on...
If any of you could point me in the right direction, I'd appreciate it very much.
Thanks!

An alternative way is to use Y ~ . as the formula and provide the subset of data as required. Here, . means "all columns not otherwise in the formula" (see ?formula). Using mtcars as an example:
Y <- 'mpg'
Xs <- names(mtcars)[-1]
fits <- lapply(seq_along(Xs), function(x){
lm(paste(Y, '~ .'), data = mtcars[, c(Y, Xs[1:x])])
})

We can use reformulate to create the formula, passing the sets of independent variables as a list:
out <- lapply(list("X", c("X", "X2"), c("X", "X2", "X3")),
function(x) lm(reformulate(x, response = "Y"), data = data))
Or automate it by building the cumulative sets of predictors from the column names:
Xs <- setdiff(names(data), "Y")
ind <- sequence(seq_along(Xs))
lapply(split(Xs[ind], cumsum(ind == 1)), function(x)
lm(reformulate(x, response = "Y"), data = data))
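The question also asks to stop once a condition is met. A minimal sketch of such a loop, assuming mtcars as stand-in data and an arbitrary stopping rule (adjusted R-squared of at least 0.8). The threshold and the names `threshold`, `fits` and `final` are illustrative and do not come from the original post:

```r
Y  <- "mpg"
Xs <- setdiff(names(mtcars), Y)   # candidate predictors, in column order
threshold <- 0.8                  # assumed stopping rule, for illustration only

fits <- list()
for (i in seq_along(Xs)) {
  # Build Y ~ X1 + ... + Xi from the first i predictor names
  f <- reformulate(Xs[seq_len(i)], response = Y)
  fits[[i]] <- lm(f, data = mtcars)
  # Stop as soon as the condition is met
  if (summary(fits[[i]])$adj.r.squared >= threshold) break
}

final <- fits[[length(fits)]]     # last model fitted before the loop ended
formula(final)
```

If the condition is never met, the loop simply ends after the last predictor, so `final` is then the full model.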

Related

how to store results from multiple regressions in a single dataframe in a neat way

I am going to run dozens of regressions of different Ys on the same X. I want to store the coefficients and standard errors of each regression in a single data frame.
The dataframe is like
Y1, Y2, Y3, ... , Y50, X
1, 2, 3, ..., 50, 1
...
For each Y, I can do it like this:
model1 <- lm (Y1~X, data = data)
summary1 <- summary(model1)
list1 <- list(coef=summary1$coefficients[2,1],se=summary1$coefficients[2,2])
# only coef of X is of interest
And then generate the dataframe I want by
df <- as.data.frame(list1,list2,...,list50)
I am a rookie; is there a neater way to do this in R? I tried to write a function that takes the name of the variable as input, but it fails if I define it as function(variable) and use variable directly inside the function.
Thank you so much for your inspiration.
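You can use lapply to loop over the Y variables. Note value = TRUE in grep: reformulate needs the column names, not their positions.
cols <- grep('Y\\d+', names(data), value = TRUE)
do.call(rbind, lapply(cols, function(x) {
model <- lm(reformulate('X', x), data)
# avoid calling the result `summary`, which would mask the function
model_summary <- summary(model)
data.frame(coef = model_summary$coefficients[2,1],
se = model_summary$coefficients[2,2])
})) -> df
df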
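The asker mentions that function(variable) fails when variable is used directly inside the formula. That is because a bare name in a formula is taken literally instead of being replaced by its value. A minimal sketch of a workaround, passing the response name as a string and building the formula from it. The data are simulated, and `fit_one`, `y_names` and `results` are hypothetical names used only for illustration:

```r
set.seed(1)                          # simulated data, for illustration only
data <- data.frame(X = rnorm(30))
data$Y1 <- 1 + 2 * data$X + rnorm(30)
data$Y2 <- 3 - 1 * data$X + rnorm(30)

# Hypothetical helper: takes the response name as a string
fit_one <- function(y_name, data) {
  model <- lm(reformulate("X", response = y_name), data = data)
  est <- coef(summary(model))["X", ]  # row for X: Estimate, Std. Error, ...
  data.frame(response = y_name,
             coef = est[["Estimate"]],
             se   = est[["Std. Error"]])
}

y_names <- grep("^Y\\d+$", names(data), value = TRUE)
results <- do.call(rbind, lapply(y_names, fit_one, data = data))
results
```

Selecting the row by name ("X") instead of by position (2) keeps the extraction correct even if the formula later gains more terms.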

How to correctly pass formulas associated with a variable name with random effects into fitted regression models in `R`?

I currently have a problem in that I have to pre-specify my formulas before sending them into a regression function. For example, using the stan_gamm4 function in R, we have the following example:
dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect `fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~ (1 | fac),
chains = 1, iter = 200) # for example speed
Now, because the formula and random formula were specified explicitly, then if we call:
br$call$random
> ~(1 | fac)
We are able to retrieve the form of the random effects.
NOW, let us then leave everything the same, BUT use an expression for the random part:
formula.rand <- as.formula( '~(1|fac)' )
Then, if we do the same thing as before, but with formula.rand in its place, we have:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand,
chains = 1, iter = 200) # for example speed
BUT NOW: we have that:
br$call$random
> formula.rand
Instead of the original. A lot of Bayesian packages rely on accessing br$call$random, so is there a way to use a variable for the formula, pass it in, AND retain the original formula when calling br$call$random? Thanks.
While I haven't used Stan, this is a problem inherent in the way that R handles storing calls. You can see it happening with lm, for example:
model <- function(formula)
{
lm(formula, data=mtcars)
}
m <- model(mpg ~ disp)
m$call$formula
# formula
The simplest solution is to construct the call using substitute to insert the actual values you want to keep, not the symbol name. In the case of lm, this would be something like
model2 <- function(formula)
{
call <- substitute(lm(formula=.f, data=mtcars), list(.f=formula))
eval(call)
}
m2 <- model2(mpg ~ disp)
m2$call$formula
# mpg ~ disp
For Stan, you can do
stan_call <- substitute(stan_gamm4(y ~ s(x0) + x1 + s(x2), data=dat, random=.rf,
chains=1, iter=200),
list(.rf=formula.rand))
br <- eval(stan_call)
If I understand correctly, your problem is not that stan_gamm4 computes incorrect results (it does not, from what I gather), but only that br$call$random refers to the variable name and not the formula. This seems to be problematic for further post-processing of the model.
Since stan_gamm4 uses match.call inside to find the call, I don't know of a way to specify the model differently to obtain a "correct" br$call$random up front. But you can simply modify it after the fact via:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand)
br$call$random <- formula.rand
br$call$random
#> ~(1 | fac)
and then continue with whatever you are doing.
IMHO, this is not a problem with stan_gamm4. In your second example, if you then do
class(br$call$random)
you will see that it is of class "name". So, it is not as if $call is just some list with stuff in it. In order to access it programmatically in general, you need to evaluate it with
eval(br$call$random)
in order to obtain ~(1 | fac), which is of class "formula".
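The same behaviour can be reproduced without Stan. A small sketch using lm as a stand-in for stan_gamm4 (an assumption made only so the example runs quickly; it is not a claim about how rstanarm handles calls internally). It also shows do.call as another way to get the formula itself into the stored call:

```r
f <- mpg ~ disp
m <- lm(f, data = mtcars)

class(m$call$formula)        # the call stores the symbol `f`, not the formula
eval(m$call$formula)         # evaluating the symbol recovers the formula
formula(m)                   # extractor; no need to touch the call at all

# do.call() evaluates the list elements first, so the formula object itself
# ends up in the call; quote(mtcars) keeps the data argument as a symbol
m2 <- do.call("lm", list(formula = f, data = quote(mtcars)))
m2$call$formula
```

Note that eval(m$call$formula) only works while `f` is still visible from where eval is run, which is why storing the formula itself in the call is more robust.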

Formula from Data.frame Columns

I want to create a regression model from a vector (IC50) against a number of different molecular descriptors (A,B,C,D etc).
I want to use,
model <- lm (IC50 ~ A + B + C + D)
The molecular descriptors are found in the columns of a data.frame. I would like to use a function that takes the IC50 vector and the appropriately subsetted data.frame as inputs.
My problem is that I can't convert the columns to a formula for the model.
Can anyone help?
Sample data and feeble attempt:
IC50 <- c(0.1,0.2,0.55,0.63,0.005)
descs <- data.frame(A=c(0.002,0.2,0.654,0.851,0.654),
B=c(56,25,89,55,60),
C=c(0.005,0.006,0.004,0.009,0.007),
D=c(189,202,199,175,220))
model <- function(x=IC50,y=descs) {
a <- lm(x ~ y)
return(a)
}
I went down the substitute/deparse route but this didn't import the data.
You can simply do the following; the . stands for all columns of the data frame passed as data:
model <- function(x = IC50, y = descs)
lm(x ~ ., data = y)
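If you would rather see the descriptor names spelled out in the fitted formula, one option is to bind the response and the descriptors into a single data frame and build the formula with reformulate. This is a sketch under that assumption; `fit_descs` is a hypothetical helper name, and only two descriptors are used so that the five observations leave some residual degrees of freedom:

```r
IC50 <- c(0.1, 0.2, 0.55, 0.63, 0.005)
descs <- data.frame(A = c(0.002, 0.2, 0.654, 0.851, 0.654),
                    B = c(56, 25, 89, 55, 60),
                    C = c(0.005, 0.006, 0.004, 0.009, 0.007),
                    D = c(189, 202, 199, 175, 220))

# Hypothetical helper: combine response and descriptors, then build the formula
fit_descs <- function(response, descriptors) {
  dat <- data.frame(IC50 = response, descriptors)
  f <- reformulate(names(descriptors), response = "IC50")
  lm(f, data = dat)
}

fit <- fit_descs(IC50, descs[, c("A", "B")])
formula(fit)
```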

Rank a list of models based on AIC values

After applying a model between one response variable and several explanatory variables across a dataframe, I would like to rank each model by its AIC score.
I have found a very similar question, "Using lapply on a list of models", that does exactly what I want to do, but its answer does not seem to work for me and I'm not sure why. Here's an example using the mtcars dataset:
lm_multiple <- lapply(mtcars[,-1], function(x) summary(lm(mtcars$mpg ~ x)))
An approved answer from the link above suggested:
sapply(X = lm_multiple, FUN = AIC)
But this does not work for me; I get this error message:
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "summary.lm"
Here is an answer from the original question...
x <- seq(1:10)
y <- sin(x)^2
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
sapply(X = model.list, FUN = AIC)
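You should remove the call to summary. AIC needs the fitted lm object, whereas summary() returns an object of class "summary.lm", for which there is no logLik method. Like this: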
lm_multiple <- lapply(mtcars[,-1], function(x) lm(mtcars$mpg ~ x))
sapply(X = lm_multiple, FUN = AIC)
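To get from the AIC values to an actual ranking, the named vector can be sorted. A short sketch (the names `aic` and `ranking` are illustrative):

```r
lm_multiple <- lapply(mtcars[, -1], function(x) lm(mtcars$mpg ~ x))
aic <- sapply(lm_multiple, AIC)

# Lower AIC is better, so order the models by increasing AIC
ranking <- data.frame(predictor = names(aic), AIC = unname(aic))
ranking <- ranking[order(ranking$AIC), ]
ranking$rank <- seq_len(nrow(ranking))
ranking
```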

Use of offset in lm regression - R

I have this code
dens <- read.table('DensPiu.csv', header = FALSE)
fl <- read.table('FluxPiu.csv', header = FALSE)
mydata <- data.frame(c(dens),c(fl))
dat = subset(mydata, dens>=3.15)
colnames(dat) <- c("x", "y")
attach(dat)
and I would like to do a least-square regression on the data contained in dat, the function has the form
y ~ a + b*x
and I want the regression line to pass through a specific point P(x0,y0) (which is not the origin).
I'm trying to do it like this
x0 <- 3.15
y0 <-283.56
regression <- lm(y ~ I(x-x0)-1, offset=y0)
(I think that data = dat is not necessary in this case) but I get this error :
Error in model.frame.default(formula = y ~ I(x - x0) - 1, : variable
lengths differ (found for '(offset)').
I don't know why. I guess that I haven't defined correctly the offset value but I couldn't find any example online.
Could someone explain to me how offset works, please?
Your offset term has to be a variable, like x and y, not a numeric constant. So you need to create a column in your dataset with the appropriate values.
dat$o <- 283.56
lm(y ~ I(x - x0) - 1, data=dat, offset=o)
In fact, the real issue here is that offset must be a vector with one value per observation: its length has to equal the number of rows of your data (or the length of y, if the variables are supplied as plain vectors). The following code does the job:
regression <- lm(y ~ I(x-x0)-1, offset = rep(y0, length(y)))
Here is a good explanation for those who are interested:
http://rfunction.com/archives/223
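A self-contained sketch of how the offset forces the line through P(x0, y0). The data are simulated (the original CSV files are not available), and `fit2` and `line_at` are illustrative names. The offset model is y = y0 + b * (x - x0), so it is equivalent to regressing y - y0 on x - x0 without an intercept:

```r
set.seed(42)                          # simulated data, for illustration only
x0 <- 3.15
y0 <- 283.56
dat <- data.frame(x = seq(3.2, 6, length.out = 25))
dat$y <- y0 + 40 * (dat$x - x0) + rnorm(25, sd = 5)

# Offset with one value per row: the model is y = y0 + b * (x - x0)
fit <- lm(y ~ I(x - x0) - 1, data = dat, offset = rep(y0, nrow(dat)))
b <- unname(coef(fit))

# Equivalent fit without an offset: move y0 to the left-hand side
fit2 <- lm(I(y - y0) ~ I(x - x0) - 1, data = dat)

# The fitted line, evaluated at x0, returns y0 by construction
line_at <- function(x) y0 + b * (x - x0)
line_at(x0)
```

Both fits give the same slope; the offset version just keeps y on its original scale.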

Resources