How to use predict.lm in a for loop? - r

I need to fit a sequence of linear regressions. Since the predictors are added to the model one at a time, I use a for loop to fit the models.
set.seed(98274) # Creating example data
y <- rnorm(1000)
x1 <- rnorm(1000) + 0.2 * y
x2 <- rnorm(1000) + 0.2 * x1 + 0.1 * y
x3 <- rnorm(1000) - 0.1 * x1 + 0.3 * x2 - 0.3 * y
data <- data.frame(y, x1, x2, x3)
head(data) # Head of data
mod_summaries <- list() # Create empty list
for(i in 2:ncol(data)) {                 # Head of for-loop
  predictors_i <- colnames(data)[2:i]    # Create vector of predictor names
  mod_summaries[[i - 1]] <- summary(     # Store regression model summary in list
    lm(y ~ ., data[ , c("y", predictors_i)]))
}
Then I tried to predict on the test data using those models in another for loop. My code is below.
## Test
set.seed(44) # Creating test data
y <- rnorm(1000)
x1 <- rnorm(1000) + 0.19 * y
x2 <- rnorm(1000) + 0.2 * x1 + 0.11 * y
x3 <- rnorm(1000) - 0.12 * x1 + 0.28 * x2 - 0.33 * y
test <- data.frame(y, x1, x2, x3)
predict_models <- matrix(nrow = nrow(test), ncol = 3)
for(i in 2:ncol(data)) {                 # Head of for-loop
  predictors_i <- colnames(data)[2:i]    # Create vector of predictor names
  predict_models[, i - 1] <- predict.lm(mod_summaries[[i - 1]], test[, 2:i])
}
predict_models
but it throws the following error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
'data' must be a data.frame, environment, or list
In addition: Warning message:
In predict.lm(mod_summaries[[i - 1]], test[, 2:i]) :
calling predict.lm(<fake-lm-object>) ...

First, you want to store just the models, not the summaries.
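A quick check of what the list actually holds confirms the root cause (a sketch against the training code above): predict.lm() needs a fitted "lm" object, and given anything else it warns about a fake-lm-object and then fails.
class(mod_summaries[[1]])
# [1] "summary.lm"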
mod_summaries <- vector('list', ncol(data) - 1L)  ## preallocate list of known length, it's way more efficient
for (i in seq_len(ncol(data))[-1]) {
  predictors_i <- colnames(data)[2:i]
  mod_summaries[[i - 1]] <- lm(y ~ ., data[, c("y", predictors_i)])
}
Then, the data passed to predict() doesn't actually need to change between iterations: only the columns used in the model are looked up, so passing the full test data frame is sufficient. (It also avoids the test[, 2:i] subsetting, which drops to a plain vector when i = 2 and triggers the "'data' must be a data.frame" error.)
predict_models <- matrix(nrow = nrow(test), ncol = ncol(test) - 1L)
for (i in seq_len(ncol(data))[-1]) {
  predict_models[, i - 1] <- predict.lm(mod_summaries[[i - 1]], test)
}
That's actually it.
head(predict_models)
# [,1] [,2] [,3]
# [1,] -0.115690784 -0.19149611 -0.4815419
# [2,] -0.004721430 0.03814865 0.1894562
# [3,] -0.110812904 0.02312155 0.2579051
# [4,] 0.004264032 -0.06147035 -0.2328833
# [5,] 0.320110168 -0.04145044 -0.3229186
# [6,] -0.040603638 0.01977484 -0.1090088
Alternatively, and more idiomatically in R, you can do the same in just two lines of code, without for loops (the \(i) lambda shorthand requires R 4.1 or later).
ms <- lapply(seq_along(data)[-1], \(i) lm(reformulate(names(data)[2:i], 'y'), data))
pm <- sapply(ms, predict, test)
head(pm)
# [,1] [,2] [,3]
# 1 -0.115690784 -0.19149611 -0.4815419
# 2 -0.004721430 0.03814865 0.1894562
# 3 -0.110812904 0.02312155 0.2579051
# 4 0.004264032 -0.06147035 -0.2328833
# 5 0.320110168 -0.04145044 -0.3229186
# 6 -0.040603638 0.01977484 -0.1090088
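If you still need the model summaries afterwards, they are one lapply away from the stored models (a sketch using the ms list from above; the same call works on the loop version):
summaries <- lapply(ms, summary)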

Related

How to add coefficients to an existing data frame such that their effect on the final intercept is given?

Firstly, suppose I have a data frame df with variables y, x1, and x2, where x1 is a continuous variable and x2 is a factor.
Let's say I have a model:
model <- glm(y ~ x1 + x2, data = df, family = binomial)
This will result in an object where I can extract the coefficients using the command model$coefficients.
However, for use in another program I would like to export the data frame df, but I'd also like to be able to display the results of the model beyond simply adding the fitted values to the data frame.
Therefore I would like to have coeff1*x1 and coeff2*x2 in the same data frame as well, so that I can use them together with the original data to display their effects. The problem is that one of the variables is a multi-level factor, so it is not preferable to simply use a for loop to extract the coefficients and multiply the variables by them.
Is there another way to add two new variables to the dataframe df such that they've been derived from combining the original variables x1, x2 and their respective coefficients?
Try:
set.seed(123)
N <- 10
df <- data.frame(x1 = rnorm(N, 10, 1),
                 x2 = sample(1:3, N, TRUE),
                 y = as.integer(50 - x2 * 0.4 + x1 * 1.2 + rnorm(N, 0, 0.5) > 52))
model <- glm(y ~ x1 + x2, data = df, family = binomial)
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
df$intercept <- df$x0 * model$coefficients["(Intercept)"]
df[["coeff1*x1"]] <- df$x1 * model$coefficients["x1"]
df[["coeff2*x2"]] <- df$x2 * model$coefficients["x2"]
head(df, 3)
#   x0        x1 x2 y intercept     coeff1*x1     coeff2*x2
# 1  1  9.439524  1 1  24.56607 -3.361333e-06 -4.281056e-07
# 2  1  9.769823  1 1  24.56607 -3.478949e-06 -4.281056e-07
# 3  1 11.558708  1 1  24.56607 -4.115956e-06 -4.281056e-07
Alternatively:
# add a column for the intercept (starting again from the original df)
df <- cbind(x0 = rep(1, N), df)
tmp <- as.data.frame(Map(function(x, y) x * y, subset(df, select = -y), model$coefficients))
names(tmp) <- paste0("coeff*", names(model$coefficients))
cbind(df, tmp)
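If x2 were a genuine multi-level factor, multiplying the raw column by a single coefficient would no longer be meaningful, because each level gets its own coefficient. A sketch of a factor-safe alternative, using base R's predict() with type = "terms" (the contrib_ prefix and df_out are just illustrative names; note the contributions are centred, with the overall constant stored as an attribute):
term_contrib <- predict(model, type = "terms")  # one column per model term, factor-safe
attr(term_contrib, "constant")                  # the centring constant
colnames(term_contrib) <- paste0("contrib_", colnames(term_contrib))
df_out <- cbind(df, term_contrib)               # export alongside the original data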

Applying lm() and predict() to multiple columns in a data frame

I have an example dataset below.
train <- data.frame(x1 = c(4,5,6,4,3,5), x2 = c(4,2,4,0,5,4), x3 = c(1,1,1,0,0,1),
                    x4 = c(1,0,1,1,0,0), x5 = c(0,0,0,1,1,1))
Suppose I want to create separate models for column x3, x4, x5 based on column x1 and x2. For example
lm1 <- lm(x3 ~ x1 + x2, data = train)
lm2 <- lm(x4 ~ x1 + x2, data = train)
lm3 <- lm(x5 ~ x1 + x2, data = train)
I want to then take these models and apply them to a testing set using predict, and then create a matrix that has each model outcome as a column.
test <- data.frame(x1 = c(4,3,2,1,5,6), x2 = c(4,2,1,6,8,5))
p1 <- predict(lm1, newdata = test)
p2 <- predict(lm2, newdata = test)
p3 <- predict(lm3, newdata = test)
final <- cbind(p1, p2, p3)
This is a simplified version where you can do it step by step; the actual data is far too large for that. Is there a way to create a function or use a for statement to combine this into one or two steps?
I had an inclination to close your question as a duplicate of Fitting a linear model with multiple LHS, but sadly the prediction issue is not addressed over there. On the other hand, Prediction of 'mlm' linear model object from lm() talks about prediction, but it is a little far from your situation, as you work with the formula interface instead of the matrix interface.
I did not manage to locate a perfect duplicate target in the "mlm" tag, so I think it is a good idea to contribute another answer for this tag. As I said in the linked questions, predict.mlm does not support se.fit, and at the moment this is also a gap in the "mlm" tag. So I will take this chance to fill it.
Here is a function to get standard error of prediction:
f <- function (mlmObject, newdata) {
  ## model formula
  form <- formula(mlmObject)
  ## drop response (LHS)
  form[[2]] <- NULL
  ## prediction matrix for the new data
  X <- model.matrix(form, newdata)
  ## solve R'Q = X', where R is the R factor from the QR decomposition of the
  ## training design matrix, so that colSums(Q^2) = diag(X (X'X)^-1 X')
  Q <- forwardsolve(t(qr.R(mlmObject$qr)), t(X))
  ## unscaled prediction standard error
  unscaled.se <- sqrt(colSums(Q ^ 2))
  ## residual standard error, one per response
  sigma <- sqrt(colSums(residuals(mlmObject) ^ 2) / mlmObject$df.residual)
  ## scaled prediction standard error: rows = new observations, columns = responses
  tcrossprod(unscaled.se, sigma)
}
For your given example, you can do
## fit an `mlm`
fit <- lm(cbind(x3, x4, x5) ~ x1 + x2, data = train)
## prediction (mean only)
pred <- predict(fit, newdata = test)
# x3 x4 x5
#1 0.555956679 0.38628159 0.60649819
#2 0.003610108 0.47653430 0.95848375
#3 -0.458483755 0.48014440 1.27256318
#4 -0.379061372 -0.03610108 1.35920578
#5 1.288808664 0.12274368 0.17870036
#6 1.389891697 0.46570397 0.01624549
## prediction error
pred.se <- f(fit, newdata = test)
# [,1] [,2] [,3]
#[1,] 0.1974039 0.3321300 0.2976205
#[2,] 0.3254108 0.5475000 0.4906129
#[3,] 0.5071956 0.8533510 0.7646849
#[4,] 0.6583707 1.1077014 0.9926075
#[5,] 0.5049637 0.8495959 0.7613200
#[6,] 0.3552794 0.5977537 0.5356451
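For intuition, the unscaled standard error that f obtains through the QR factor is sqrt(diag(X0 (X'X)^-1 X0')), where X0 is the prediction matrix for the new data and X the training design matrix. A dense-matrix sketch of the same quantity (numerically less stable than the QR route; shown only to make the algebra visible):
X0 <- model.matrix(delete.response(terms(fit)), test)  # prediction matrix
Xt <- model.matrix(fit)                                # training design matrix
sqrt(diag(X0 %*% solve(crossprod(Xt)) %*% t(X0)))      # equals unscaled.se inside f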
We can verify that f is correct:
## `lm1`, `lm2` and `lm3` are defined in your question
predict(lm1, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.1974039 0.3254108 0.5071956 0.6583707 0.5049637 0.3552794
predict(lm2, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.3321300 0.5475000 0.8533510 1.1077014 0.8495959 0.5977537
predict(lm3, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.2976205 0.4906129 0.7646849 0.9926075 0.7613200 0.5356451

Shortcut using lm() in R for formula

It is possible to use a shortcut for the formula in lm():
m <- matrix(rnorm(100), ncol = 5)
lm(m[,1] ~ m[,2:5])
which is the same as
lm(m[,1] ~ m[,2] + m[,3] + m[,4] + m[,5])
but when the variables are not all of the same type (at least this is my assumption for now), this does not work and I get the error:
Error in model.frame.default(formula = hm[, 1] ~ hm[, 2:4], drop.unused.levels = TRUE) :
invalid type (list) for variable 'hm[, 2:4]'
Data (hm):
N cor.distance switches time
1 50 0.04707842 2 0.003
2 100 -0.10769441 2 0.004
3 200 -0.01278359 2 0.004
4 300 0.04229509 5 0.008
5 500 -0.04490092 6 0.010
6 1000 0.01939561 4 0.007
Is there some shortcut still possible to avoid having to write the long formula?
Try lm(y ~ ., data), where . means "every other column in data besides y".
m <- matrix(rnorm(100), ncol = 5)
m <- as.data.frame(m)
names(m) <- paste("m", 1:5, sep = "")
lm(m1 ~ ., data = m)
You can reassign m so that it includes only the response plus the columns you want as predictors:
m <- m[, 1:4]          # keep m1 (the response) and m2:m4
lm(m1 ~ ., data = m)   # now only m2, m3, m4 are used as predictors
There is one more shortcut for the case when the dependent variable is in the first column: pass the data frame itself where lm() expects a formula, and the first column is treated as the response with the remaining columns as predictors.
data <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
lm(data)
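If you prefer building the formula explicitly from column names, base R's reformulate() is another shortcut worth knowing (a sketch against the data frame defined just above):
lm(reformulate(names(data)[-1], response = "y"), data = data)  # builds y ~ x1 + x2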

profile confidence intervals in R: mle2

I am trying to use the mle2 command in the bbmle package. I am looking at p. 2 of "Maximum likelihood estimation and analysis with the bbmle package" by Bolker. Somehow I fail to supply the right start values. Here's the reproducible code:
l.lik.probit <- function(par, ivs, dv){
  Y <- as.matrix(dv)
  X <- as.matrix(ivs)
  K <- ncol(X)
  b <- as.matrix(par[1:K])
  phi <- pnorm(X %*% b)
  sum(Y * log(phi) + (1 - Y) * log(1 - phi))
}
n <- 200
set.seed(1000)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
latentz <- 1 + 2.0 * x1 + 3.0 * x2 + 5.0 * x3 + 8.0 * x4 + rnorm(n, 0, 5)
y <- latentz
y[latentz < 1] <- 0
y[latentz >= 1] <- 1
x <- cbind(1, x1, x2, x3, x4)
values.start <- c(1, 1, 1, 1, 1)
library(bbmle)  # provides mle2
foo2 <- mle2(l.lik.probit, start = list(Y = 0, X = values.start),
             method = "BFGS", optimizer = "optim", data = list(Y = y, X = x))
And this is the error I get:
Error in mle2(l.lik.probit, start = list(Y = 0, X = values.start), method = "BFGS", :
some named arguments in 'start' are not arguments to the specified log-likelihood function
Any idea why? Thanks for your help!
You've missed a couple of things, but the most important is that by default mle2 takes a list of parameters; you can make it take a parameter vector instead, but you have to work a little bit harder.
I have tweaked the code slightly in places. (I changed the log-likelihood function to a negative log-likelihood function, without which this would never work!)
l.lik.probit <- function(par, ivs, dv){
  K <- ncol(ivs)
  b <- as.matrix(par[1:K])
  phi <- pnorm(ivs %*% b)
  -sum(dv * log(phi) + (1 - dv) * log(1 - phi))
}
n <- 200
set.seed(1000)
dat <- data.frame(x1 = rnorm(n),
                  x2 = rnorm(n),
                  x3 = rnorm(n),
                  x4 = rnorm(n))
beta <- c(1, 2, 3, 5, 8)
mm <- model.matrix(~ x1 + x2 + x3 + x4, data = dat)
latentz <- rnorm(n, mean = mm %*% beta, sd = 5)
y <- latentz
y[latentz < 1] <- 0
y[latentz >= 1] <- 1
x <- mm
values.start <- rep(1, 5)
Now we do the fit. The main thing is to specify vecpar=TRUE and to use parnames to let mle2 know the names of the elements in the parameter vector ...
library("bbmle")
names(values.start) <- parnames(l.lik.probit) <- paste0("b", 0:4)
m1 <- mle2(l.lik.probit, start = values.start,
           vecpar = TRUE,
           method = "BFGS", optimizer = "optim",
           data = list(dv = y, ivs = x))
As pointed out above, for this particular example you have just re-implemented probit regression (although I understand that you now want to extend this to allow for heteroscedasticity in some way ...)
dat2 <- data.frame(dat,y)
m2 <- glm(y ~ x1 + x2 + x3 + x4, family = binomial(link = "probit"),
          data = dat2)
As a final note, I would say that you should check out the parameters argument, which allows you to specify a linear sub-model for any one of the parameters, and the formula interface:
m3 <- mle2(y ~ dbinom(prob = pnorm(eta), size = 1),
           parameters = list(eta ~ x1 + x2 + x3 + x4),
           start = list(eta = 0),
           data = dat2)
PS: confint(m1) appears to work fine (giving profile CIs as requested) with this set-up.
ae <- function(x, y) all.equal(unname(coef(x)), unname(coef(y)), tol = 5e-5)
ae(m1, m2) && ae(m2, m3)
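And for the profile CIs themselves, a quick comparison sketch (the parameter names differ, b0..b4 in m1 versus (Intercept), x1, ... in m2, but the intervals should agree closely; profiling the glm may require the MASS package on older R versions):
confint(m1)   # profile CIs from bbmle
confint(m2)   # profile-likelihood CIs for the glm fit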
