Shortcut using lm() in R for formula - r

It is possible to use a shortcut for the formula in lm():
m <- matrix(rnorm(100), ncol=5)
lm(m[,1] ~ m[,2:5])
Here it would be the same as
lm(m[,1] ~ m[,2] + m[,3] + m[,4] + m[,5])
But when the variables are not of the same level (at least this is my assumption for now), this does not work and I get the error:
Error in model.frame.default(formula = hm[, 1] ~ hm[, 2:4], drop.unused.levels = TRUE) :
invalid type (list) for variable 'hm[, 2:4]'
Data (hm):
     N cor.distance switches  time
1   50   0.04707842        2 0.003
2  100  -0.10769441        2 0.004
3  200  -0.01278359        2 0.004
4  300   0.04229509        5 0.008
5  500  -0.04490092        6 0.010
6 1000   0.01939561        4 0.007
Is there some shortcut still possible to avoid having to write the long formula?

Try lm(y ~ ., data), where . means "every other column in data besides y".
m <- matrix(rnorm(100), ncol =5)
m <- as.data.frame(m)
names(m) <- paste("m", 1:5, sep="")
lm(m1 ~., data=m)
You can reassign m to include only the response and the columns you want as predictors:
m <- m[ ,1:4] # keep the response m1 plus the predictors m2 to m4
lm(m1 ~ ., data=m)
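The error in the question comes from hm being a data frame: hm[, 2:4] is then itself a data frame (a list), which model.frame() rejects, whereas m[, 2:5] taken from a matrix is still a matrix. A minimal sketch using the hm data from the question; the object names fit_mat, fit_dot and fit_no_time are only illustrative:

```r
hm <- data.frame(
  N            = c(50, 100, 200, 300, 500, 1000),
  cor.distance = c(0.04707842, -0.10769441, -0.01278359,
                   0.04229509, -0.04490092, 0.01939561),
  switches     = c(2, 2, 2, 5, 6, 4),
  time         = c(0.003, 0.004, 0.004, 0.008, 0.010, 0.007)
)

# Converting the predictor block to a matrix makes the original idiom work
fit_mat <- lm(hm[, 1] ~ as.matrix(hm[, 2:4]))

# The dot formula gives the same fit with readable coefficient names
fit_dot <- lm(N ~ ., data = hm)

# Terms can be dropped from the dot expansion with a minus sign
fit_no_time <- lm(N ~ . - time, data = hm)

coef(fit_dot)
```

The dot form is usually easier to work with afterwards, because predict() can then be given new data with the same column names.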

There is another shortcut for cases where the dependent variable is in the first column:
data <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
lm(data)
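This shortcut appears to rely on the formula() method for data frames, which takes the first column as the response and all remaining columns as predictors. A short sketch to make that visible (the seed and the names f, fit_short and fit_long are arbitrary):

```r
set.seed(1)
data <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))

# The data frame method of formula() uses the first column as the response
f <- formula(data)
print(f)

# Both calls should describe the same model
fit_short <- lm(data)
fit_long  <- lm(y ~ x1 + x2, data = data)
all.equal(coef(fit_short), coef(fit_long))
```

Writing the formula out explicitly is more robust if the column order of the data frame might change.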

Related

How to add coefficients to existing data base such that their effect on the final intercept is given?

Firstly, let's say I have a data frame df with variables y, x1 and x2, where x1 is a continuous variable and x2 is a factor.
Let's say I have a model:
model <- glm(y ~ x1 + x2, data = df, family = binomial)
This will result in an object where I can extract the coefficients using the command model$coefficients.
However, for use in another program I would like to export the data frame df, but I'd also like to be able to display the results of the model beyond simply adding the fitted values to the data frame.
Therefore I would like to have coeff1*x1 and coeff2*x2 also in the same dataframe, so that I could use these and the original data together to display their effects. The problem arises from the fact that one of the variables is a multi-level factor and therefore it's not preferable to simply use a for-loop to extract the coefficients and multiply the variables with them.
Is there another way to add two new variables to the dataframe df such that they've been derived from combining the original variables x1, x2 and their respective coefficients?
Try:
set.seed(123)
N <- 10
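# create x1 and x2 first: they cannot be referenced inside the same data.frame() call
x1 <- rnorm(N, 10, 1)
x2 <- sample(1:3, N, TRUE)
df <- data.frame(x1 = x1, x2 = x2,
y = as.integer(50 - x2 * 0.4 + x1 * 1.2 + rnorm(N, 0, 0.5) > 52))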
model <- glm(y ~ x1 + x2, data = df, family = binomial)
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
df$intercept <- df$x0 * model$coefficients["(Intercept)"]
df[["coeff1*x1"]] <- df$x1 * model$coefficients["x1"]
df[["coeff2*x2"]] <- df$x2 * model$coefficients["x2"]
# x0 x1 x2 y intercept coeff1*x1 coeff2*x2
# 1 1 9.439524 1 1 24.56607 -3.361333e-06 -4.281056e-07
# 2 1 9.769823 1 1 24.56607 -3.478949e-06 -4.281056e-07
# 3 1 11.558708 1 1 24.56607 -4.115956e-06 -4.281056e-07
Alternatively, starting again from the original df (columns x1, x2 and y only):
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
tmp <- as.data.frame(Map(function(x, y) x * y, subset(df, select = -y), model$coefficients))
names(tmp) <- paste0("coeff*", names(model$coefficients))
cbind(df, tmp)
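The code above treats x2 as numeric. When x2 is a factor, as in the question, one option is to multiply the columns of the model matrix by the coefficients, since the model matrix already expands the factor into dummy columns. A minimal sketch under that assumption; the simulated data and the names dat2, fit, contrib and by_term are illustrative only:

```r
set.seed(42)
n <- 200
dat2 <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
# Simulated binary outcome with some noise, so the fit is not degenerate
eta <- -0.5 + 0.8 * dat2$x1 + 0.6 * (dat2$x2 == "b") - 0.4 * (dat2$x2 == "c")
dat2$y <- rbinom(n, 1, plogis(eta))

fit <- glm(y ~ x1 + x2, data = dat2, family = binomial)

# One column per coefficient: (Intercept), x1, x2b, x2c
mm <- model.matrix(fit)
contrib <- sweep(mm, 2, coef(fit), `*`)

# Collapse dummy columns that belong to the same term into one column per term
term_id <- attr(mm, "assign")              # 0 = intercept, 1 = x1, 2 = x2
term_names <- c("(Intercept)", attr(terms(fit), "term.labels"))
by_term <- sapply(sort(unique(term_id)), function(i) {
  rowSums(contrib[, term_id == i, drop = FALSE])
})
colnames(by_term) <- paste0("contrib_", term_names)

out <- cbind(dat2, by_term)
head(out)
```

The contribution columns are on the link (log-odds) scale; their row sums equal predict(fit, type = "link"), not the fitted probabilities.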

How can I use predict and expand.grid simultaneously while keeping control variables constant?

I would like to predict the values of an interaction using expand.grid(). The first step I use is to run the model with the interaction alone, and it works with no problem. Here is the example:
dat <- data.frame(time=gl(n = 2,k = 5000),
y= rnorm(10000, mean=1000, sd=400),
a=factor(rep(c(1,2),times=5000)),
b=factor(rep(c(1,2),times=5000)),
c= rnorm(10000, mean=40, sd=10),
d= rnorm(10000, mean=550, sd=10))
m = lm(y ~ a*b, data=dat)
pred <- expand.grid(a= factor(1:2), b= factor(1:2))
pred$y <- predict(m,pred)
However, I would like to add some control variables (c and d) to the model as follows:
m = lm(y ~ a*b + c + d, data=dat)
But then predict() with expand.grid() only works if I include all the variables. I am interested in the interaction, but I would also like to keep the controls. However, adding them to expand.grid() gives me too many scenarios to interpret. Is there a more efficient way?
You can just put the c and d variables in expand.grid() as constants and it should work fine.
dat <- data.frame(time=gl(n = 2,k = 5000),
y= rnorm(10000, mean=1000, sd=400),
a=factor(sample(1:2, 10000, replace=TRUE)),
b=factor(sample(1:2, 10000, replace=TRUE)),
c= rnorm(10000, mean=40, sd=10),
d= rnorm(10000, mean=550, sd=10))
m = lm(y ~ a*b + c + d, data=dat)
pred <- expand.grid(a= factor(1:2), b= factor(1:2),
c=mean(dat$c), d=mean(dat$d))
pred$y_hat <- predict(m,newdata=pred)
# a b c d y_hat
# 1 1 1 39.90915 550.064 1002.733
# 2 2 1 39.90915 550.064 1006.523
# 3 1 2 39.90915 550.064 1015.462
# 4 2 2 39.90915 550.064 1007.281
One problem was that your a and b variables were exactly the same, so the prediction threw a warning. I fixed that in the code above by sampling a and b independently.
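Because c and d enter the model additively, the value at which they are held constant only shifts all predictions by the same amount; the differences between the a-by-b cells stay the same. A small sketch of that point, assuming a model that includes the controls (the seed and the alternative values 30 and 540 are arbitrary choices for illustration):

```r
set.seed(1)
n <- 10000
dat <- data.frame(
  y = rnorm(n, mean = 1000, sd = 400),
  a = factor(sample(1:2, n, replace = TRUE)),
  b = factor(sample(1:2, n, replace = TRUE)),
  c = rnorm(n, mean = 40, sd = 10),
  d = rnorm(n, mean = 550, sd = 10)
)
m <- lm(y ~ a * b + c + d, data = dat)

# Controls held at their sample means
pred_mean <- expand.grid(a = factor(1:2), b = factor(1:2),
                         c = mean(dat$c), d = mean(dat$d))
pred_mean$y_hat <- predict(m, newdata = pred_mean)

# Controls held at other fixed values
pred_other <- expand.grid(a = factor(1:2), b = factor(1:2), c = 30, d = 540)
pred_other$y_hat <- predict(m, newdata = pred_other)

# The gaps between the four cells do not depend on the chosen constants
all.equal(diff(pred_mean$y_hat), diff(pred_other$y_hat))
```

This would no longer hold if c or d interacted with a or b; in that case the chosen constants matter and are worth reporting alongside the predictions.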

How to put a threshold on the number of variables selected by step()?

Thanks to this post regarding the failure of stepwise variable selection in lm, I have data that looks like the example below, as described in that post:
set.seed(1) # for reproducible example
x <- sample(1:500,500) # need this so predictors are not perfectly correlated.
x <- matrix(x,nc=5) # 100 rows, 5 cols
y <- 1+ 3*x[,1]+2*x[,2]+4*x[,5]+rnorm(100) # y depends on variables 1, 2, 5 only
# you start here...
df <- data.frame(y,as.matrix(x))
full.model <- lm(y ~ ., df) # include all predictors
step(full.model,direction="backward")
What I need is to select only the 5 best variables, and then the 6 best variables, out of these 20. Does anyone know how to impose this constraint?
MuMIn::dredge() has an m.lim argument that limits the number of terms in the candidate models. Note that the number of combinations, and hence the time required, grows exponentially with the number of predictors.
set.seed(1) # for reproducible example
x <- sample(100*20)
x <- matrix(x, nc = 20) # 100 rows, 20 predictors
y <- 1 + 2*x[,1] + 3*x[,2] + 4*x[,3] + 5*x[,7] + 6*x[,8] + 7*x[,9] + rnorm(100) # y depends on variables 1,2,3,7,8,9 only
df <- data.frame(y, as.matrix(x))
full.model <- lm(y ~ ., df) # include all predictors
library(MuMIn)
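# dredge() expects the global model to use na.action = "na.fail"; uncomment if it complains
# options(na.action = "na.fail")
# m.lim = c(5, 5): exactly 5 terms; trace = 2: a progress bar is displayed
dredge(full.model, m.lim = c(5, 5), trace = 2) # result: x2, x3, x7, x8, x9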
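To make explicit what a fixed-size search does, here is a base R sketch that fits every model with exactly k predictors and keeps the one with the lowest AIC. It is not how dredge() is implemented, only an illustration on a smaller problem (8 predictors, k = 3) so that it runs quickly; best_subset_k is a hypothetical helper name.

```r
set.seed(1)
x <- matrix(sample(100 * 8), ncol = 8)       # 100 rows, 8 predictors
colnames(x) <- paste0("x", 1:8)
y <- 1 + 2 * x[, 1] + 3 * x[, 2] + 4 * x[, 5] + rnorm(100)  # depends on x1, x2, x5
df <- data.frame(y, x)

# Hypothetical helper: exhaustive search over all subsets of exactly k predictors
best_subset_k <- function(data, response, k) {
  predictors <- setdiff(names(data), response)
  combos <- combn(predictors, k, simplify = FALSE)
  aic <- vapply(combos, function(vars) {
    AIC(lm(reformulate(vars, response = response), data = data))
  }, numeric(1))
  combos[[which.min(aic)]]
}

best3 <- best_subset_k(df, "y", 3)
best3
```

With 20 predictors and k = 5 the same loop would fit choose(20, 5) = 15504 models, which is why the cost grows so quickly with the number of predictors.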

How to create vector of multiple model objects through loop

I have a large dataset with multiple target variables. Currently, I am having trouble writing a loop for one part of the model code, i.e.
mod <- list(ah=ah,bn=bn)
# Detailed code is as follows:
jk<- data.frame(y=runif(40), l=runif(40), m=runif(40), p=runif(40))
ah <- lm(l ~ p, jk)
bn <- lm(m ~ y, jk)
mod <- list(ah=ah,bn=bn)
for (i in names(mod))
{
jk[[i]] <- predict(mod[[i]], jk)
}
The problem is that if there are 200 models, it will be cumbersome to write ah=ah, bn=bn 200 times. Therefore, I need a loop that builds this list so that it can be used with the predict loop above.
If we are only concerned with getting 'mod' as a list, create the objects within a new environment and retrieve them with mget() after listing the objects (ls()) in that environment:
e1 <- new.env()
e1$ah <- lm(l ~ p, jk)
e1$bn <- lm(m ~ y, jk)
mod <- mget(ls(envir=e1), envir = e1)
mod
#$ah
#Call:
#lm(formula = l ~ p, data = jk)
#Coefficients:
#(Intercept) p
# 0.4800 0.0145
#$bn
#Call:
#lm(formula = m ~ y, data = jk)
#Coefficients:
#(Intercept) y
# 0.37895 -0.02564
Another option is to build the object names with paste0:
mod1 <- mget(paste0(c("a", "b"), c("h", "n")), envir = e1)
names(mod1)
#[1] "ah" "bn"
This will be useful if there are many objects and we want to return them in a sequence i.e. suppose we have 'ah1', 'ah2', ... in an environment
e2 <- new.env()
e2$ah1 <- 1:5
e2$ah2 <- 1:6
e2$ah3 <- 3:5
new1 <- mget(paste0("ah", 1:3), envir = e2)
new1
#$ah1
#[1] 1 2 3 4 5
#$ah2
#[1] 1 2 3 4 5 6
#$ah3
#[1] 3 4 5
Now, applying the loop to get the predict based on the 'mod'
for (i in names(mod)){
jk[[i]] <- predict(mod[[i]], jk)
}
head(jk)
# y l m p ah bn
#1 0.2925740 0.47038243 0.5268515 0.9267596 0.4934493 0.3714515
#2 0.2248911 0.37568719 0.1203445 0.5141895 0.4874671 0.3731871
#3 0.7042230 0.27253736 0.5068240 0.6584371 0.4895587 0.3608958
#4 0.5188971 0.21981567 0.2168941 0.7158389 0.4903910 0.3656480
#5 0.6626196 0.04366575 0.3655512 0.3298476 0.4847942 0.3619626
#6 0.9204438 0.07509480 0.3494581 0.7410798 0.4907570 0.3553514
data
set.seed(24)
jk<- data.frame(y=runif(40), l=runif(40), m=runif(40), p=runif(40))
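If the models can be described by their formulas, the list can also be built directly, which avoids creating 200 separate objects in the first place. A sketch assuming each model is an lm() fitted on jk; the list names specs and preds are illustrative:

```r
set.seed(24)
jk <- data.frame(y = runif(40), l = runif(40), m = runif(40), p = runif(40))

# One named formula per model; the names become the names of the model list
specs <- list(ah = l ~ p, bn = m ~ y)
mod <- lapply(specs, function(f) lm(f, data = jk))

# Predict from every model and add the results as new columns
preds <- lapply(mod, predict, newdata = jk)
jk[names(preds)] <- preds
head(jk)
```

For many models, specs could itself be generated, for example with reformulate() over vectors of response and predictor names.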

