Constrain number of predictor variables in stepwise regression in R

I would like to be able to do a forward stepwise linear regression, but constrain the number of predictor variables to a maximum (in my specific case, three). Here is some sample data.
set.seed(123)
myDep <- runif(100)
pred1 <- myDep + runif(100)
pred2 <- myDep + rnorm(100)
pred3 <- myDep + runif(100) + rnorm(100)
pred4 <- myDep + runif(100) + runif(100)
pred5 <- runif(100)
myDF <- data.frame(myDep, pred1, pred2, pred3, pred4, pred5)
If I were simply to run a linear regression using the code below, I would get all five predictor variables, obviously.
myModel <- lm(myDep ~ ., data = myDF)
What I would like to do is use step() or another R command to run a forward stepwise regression that picks only three predictor variables and then stops.
For what it is worth, I tried this:
step(lm(myDep ~ ., data = myDF), steps = 3, direction = "forward")
and the results were the following -- but not what I want because it uses all five predictor variables.
Start: AIC=-378.09
myDep ~ pred1 + pred2 + pred3 + pred4 + pred5
Call:
lm(formula = myDep ~ pred1 + pred2 + pred3 + pred4 + pred5, data = myDF)
Coefficients:
(Intercept) pred1 pred2 pred3 pred4 pred5
-0.16617 0.30043 0.07983 0.03670 0.17869 0.01606
I'm sure there's a way to do this, but I cannot seem to figure out the proper formatting. Thanks in advance.

You could use the regsubsets() function from the leaps package, which lets you limit the number of predictors (nvmax) and choose the selection method ("forward").
https://www.rdocumentation.org/packages/leaps/versions/2.1-1/topics/regsubsets
library(leaps)
b <- regsubsets(myDep ~ ., data=myDF, nbest=1, nvmax=[enter your max # of predictors])
summary(b)
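For the sample data above, a minimal sketch of what that call might look like, capping the model at three predictors with forward selection (assuming the leaps package is installed):
library(leaps)
# forward selection, keeping the single best model of each size up to 3 predictors
b <- regsubsets(myDep ~ ., data = myDF, nbest = 1, nvmax = 3, method = "forward")
summary(b)   # shows which predictors enter at each model size
coef(b, 3)   # coefficients of the best three-predictor model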

How can I apply an interaction term with linear predictors in the mgcv package

I have a question about using an interaction term in the mgcv package with two linear predictors.
I wrote the following code to fit an interaction between x1 and x2:
mgcv::gam(y ~ x1 + x2 + ti(x1, x2, k = 3), method = "REML", data = DATA)
but I'm really not sure this is the correct way to use the ti() function in mgcv.
Should I use a different approach (like glm()) to do this?
Thank you for your answer.
If you want a non-linear interaction, you have two choices:
gam(y ~ te(x1, x2), method = "REML", data = DATA)
or
gam(y ~ s(x1) + s(x2) + ti(x1, x2), method = "REML", data = DATA)
In the first model, the main effects and the interaction are bound up in a single two-dimensional function, te(x1, x2).
In the second model, because ti(x1, x2) excludes the main smooth effects of the marginal terms (x1, x2), you should also include those main smooth effects using s().
With your formulation, you would not get any non-linear main effects, only linear main effects and a non-linear pure interaction, which is likely not what you want.
Here's an example:
library("mgcv")
library("gratia")
library("ggplot2")
library("patchwork")
set.seed(1)
df2 <- data_sim(2, n = 1000, dist = "normal", scale = 1, seed = 2)
m1 <- gam(y ~ s(x, k = 5) + s(z, k = 5) + ti(x, z, k = 5),
data = df2,
method = "REML")
m2 <- gam(y ~ x + z + ti(x, z, k = 5),
data = df2,
method = "REML")
pl <- plot_layout(nrow = 1, ncol = 3)
p1 <- draw(m1) + pl
p2 <- draw(m2, parametric = TRUE) + pl
p1 - p2 + plot_layout(nrow = 2)
which produces a figure comparing the estimated smooths from the two models.
Notice how, in this case, you would be missing the non-linearity in the marginal smooths/terms, which is not accounted for by the ti() term because it has no marginal main effects (the ti() smooths are the same across both models).
If you just want to fit linear "main effects" and their linear interaction, just use the formula as you would with glm():
gam(y ~ x1 + x2 + x1:x2, ....)
Note that the term "linear predictor" refers to the entire model (in this case), or more specifically to the entire formula on the RHS of the ~.

LavaanPlot Floating Circle

I simulated the following data:
library(lavaan)
library(lavaanPlot)
set.seed(2002)
#simulate predictor variables
pred1<- c(1:60)
pred2<- rnorm(60, mean=100, sd=10)
pred3 <- .05 + .05*pred1 + rnorm(length(pred1),1,.5)
#simulate response variables
resp <- 350 -2*pred1 -50*pred3 + rnorm(length(pred1),50,50)
#create df
df <- cbind(resp, pred1, pred3, pred2)
Developed the following model:
#sem model
model <-
'pred2 ~ pred1
resp ~ pred1
resp ~ pred3
pred3 ~ pred1'
Fit the model:
# fit model
fit <- sem(model, data = df)
summary(fit,rsq = T, fit.measures = TRUE, standardized = TRUE)
Using the lavaanPlot() function, I get a floating bubble in the right corner. I would like to know what it means, why it appears, and how to remove it from the output diagram.
lavaanPlot(name = "MODEL1", fit, labels = df, coefs = TRUE)

How to set all coefficients to one in model?

To fix a particular coefficient in a regression at one, we can use the offset() function.
I want to set all coefficients to 1.
Let's take this example:
set.seed(42)
y <- rnorm(100)
df <- data.frame("Uni" = runif(100), "Exp" = rexp(100), "Wei" = rweibull(100, 1))
lm(y ~ offset(Uni) + offset(Exp) + offset(Wei), data = df)
Call:
lm(formula = y ~ offset(Uni) + offset(Exp) + offset(Wei), data = df)
Coefficients:
(Intercept)
-2.712
This code works; however, what if I have a huge amount of data, e.g. 800 variables, and I want to do this for all of them? Writing out all of their names would not be efficient. Is there a solution that lets us do this more cleverly?
I think I found one solution if we do it this way:
set.seed(42)
# Assign everything to one data frame
df <- data.frame("Dep" = rnorm(100), "Uni" = runif(100),
"Exp" = rexp(100), "Wei" = rweibull(100, 1))
varnames <- names(df)[-1]
# Create formula for the sake of model creation
form <- paste0("offset","(",varnames, ")",collapse = "+")
form <- as.formula(paste0(names(df)[1], "~", form))
lm(form, data = df)
1) terms/update: The following one-liner will produce the indicated formula.
update(formula(terms(y ~ ., data = df)), ~ offset(.))
## y ~ offset(Uni + Exp + Wei)
2) reformulate/sprintf: Another approach is:
reformulate(sprintf("offset(%s)", names(df)), "y")
## y ~ offset(Dep) + offset(Uni) + offset(Exp) + offset(Wei)
3) rowSums: Another approach is to simply sum each row:
lm(y ~ offset(rowSums(df)))
4) lm.fit: We could use lm.fit, in which case we don't need a formula:
lm.fit(cbind(y^0), y, offset = rowSums(df))
5) mean: If you only need the coefficient, then it is just:
mean(y - rowSums(df))
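As a quick sanity check (a sketch reusing y and the three-column df from the first example above), approaches 3) and 5) give the same intercept:
set.seed(42)
y <- rnorm(100)
df <- data.frame("Uni" = runif(100), "Exp" = rexp(100), "Wei" = rweibull(100, 1))
coef(lm(y ~ offset(rowSums(df))))   # approach 3): intercept only
mean(y - rowSums(df))               # approach 5): the same value computed directly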

R Predict using multiple models

I am new to R and am trying to predict outcomes on a dataset using 4 different GLMs. I have tried running everything as one large model, and while I do get results, the model doesn't converge properly and I end up with NAs. I therefore have four models:
model_team <- glm(mydata$OUT ~ TEAM + OPPONENT, family = "binomial",data = mydata )
model_conf <- glm(mydata$OUT ~ TCONF + OCONF, family = "binomial",data = mydata)
model_tstats <- glm(mydata$OUT ~ TPace + TORtg + TFTr + T3PAr + TTS. + TTRB. + TAST. + TSTL. + TBLK. + TeFG. + TTOV. + TORB. + TFT.FGA, family = "binomial",data = mydata)
model_ostats<- glm(mydata$OUT ~ OPace + OORtg + OFTr + O3PAr + OTS. + OTRB. + OAST. + OSTL. + OBLK. + OeFG. + OTOV. + OORB. + OFT.FGA, family = "binomial",data = mydata)
I then want to predict the outcomes for a different data set using the four models:
predict(model_team, model_conf, model_tstats, model_ostats, fix, level = 0.95, type = "probs")
Is there a way to use all four models together by joining them into one larger model?
I don't really understand why you are trying to do what you are doing. I also don't have any example data representative of the data you are working with. However, below is an example of how you could combine multiple GLMs into one using the resulting coefficients. Note that this will not work well if there is multicollinearity between the variables in your dataset.
# I used the iris dataset for my example
head(iris)
# Run several models
model1 <- glm(data = iris, Sepal.Length ~ Sepal.Width)
model2 <- glm(data = iris, Sepal.Length ~ Petal.Length)
model3 <- glm(data = iris, Sepal.Length ~ Petal.Width)
# Get combined intercept
intercept <- mean(c(
coef(model1)['(Intercept)'],
coef(model2)['(Intercept)'],
coef(model3)['(Intercept)']))
# Extract coefficients
coefs <- as.matrix(
c(coef(model1)[2],
coef(model2)[2],
coef(model3)[2]))
# Get the feature values for the predictions
ds <- as.matrix(iris[,c('Sepal.Width', 'Petal.Length', 'Petal.Width')])
# Linear algebra: Matrix-multiply values with coefficients
prediction <- ds %*% coefs + intercept
# Let's look at the results
plot(iris$Petal.Length, prediction)
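For comparison (not part of the original answer), a sketch that puts the stitched-together predictions next to those from a single GLM containing all three predictors:
# Fit one GLM with all three predictors at once
model_all <- glm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width)
# Compare its fitted values with the combined-coefficient predictions from above
plot(predict(model_all), prediction,
     xlab = "single-model prediction", ylab = "combined-coefficient prediction")
abline(0, 1)   # points far from this line show where the two approaches disagree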

In R, is there a parsimonious or efficient way to get a regression prediction holding all covariates at their means?

I'm wondering if there is essentially a faster way of getting predictions from a regression model for particular values of the covariates without manually writing out the formula. For example, if I wanted a prediction of the dependent variable at the means of the covariates, I could do something like this:
r_3 <- glm(ins ~ retire + age + hstatusg + qhhinc2 + educyear + married + hisp,
family = binomial, data = dat)
meanRetire <- mean(dat$retire)
meanAge <- mean(dat$age)
meanHStatusG <- mean(dat$hstatusg)
meanQhhinc2 <- mean(dat$qhhinc2)
meanEducyear <- mean(dat$educyear)
meanMarried <- mean(dat$married)
meanHisp <- mean(dat$hisp)
ins_predict <- coef(r_3)[1] + coef(r_3)[2] * meanRetire + coef(r_3)[3] * meanAge +
coef(r_3)[4] * meanHStatusG + coef(r_3)[5] * meanQhhinc2 +
coef(r_3)[6] * meanEducyear + coef(r_3)[7] * meanMarried +
coef(r_3)[8] * meanHisp
Oh... There is a predict function:
fit <- glm(ins ~ retire + age + hstatusg + qhhinc2 + educyear + married + hisp,
family = binomial, data = dat)
newdat <- lapply(dat, mean) ## column means
lppred <- predict(fit, newdata = newdat) ## prediction of linear predictor
To get predicted response, use:
predict(fit, newdata = newdat, type = "response")
or (more efficiently from lppred):
binomial()$linkinv(lppred)
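Since dat isn't shown above, here is a self-contained sketch of the same predict-at-the-means pattern, reusing a few of the question's variable names with simulated values (not the original data):
set.seed(1)
dat <- data.frame(retire   = rbinom(200, 1, 0.5),
                  age      = rnorm(200, 60, 5),
                  educyear = rnorm(200, 12, 2))
dat$ins <- rbinom(200, 1, plogis(-6 + 0.5 * dat$retire + 0.05 * dat$age + 0.2 * dat$educyear))
fit <- glm(ins ~ retire + age + educyear, family = binomial, data = dat)
# one-row data frame holding the column means of the covariates
newdat <- as.data.frame(lapply(dat[c("retire", "age", "educyear")], mean))
lppred <- predict(fit, newdata = newdat)            # linear predictor at the means
binomial()$linkinv(lppred)                          # same value as type = "response"
predict(fit, newdata = newdat, type = "response")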
