Finding specified number of predictors using stepwise regression - r

I am trying to select a limited number of predictors (at most 6) from 104 variables, so I am using stepwise regression (each variable has 10532 values). I tried MATLAB:
mdl = stepwiselm(Pr, obs,'PEnter', 0.06)
However, it gave me about 70 variables.
Also, I tried to solve the problem using R package leaps:
b <- leaps::regsubsets(obs ~ ., data=Pr, nbest=1, nvmax=6)
I get the error below:
"Error in leaps.exhaustive(a, really.big) :
Exhaustive search will be S L O W, must specify really.big=T"
I know there should be an easy way to solve this problem, but I cannot seem to figure out the proper formatting.
Thank you in advance.

Use
leaps::regsubsets(obs ~ ., data=Pr, nbest=1, nvmax=6, really.big=T)
or you can try stepwise selection by AIC (note that stepAIC does not cap the number of predictors at 6):
library(MASS)
# Fit the full model
full.model <- lm(obs ~ ., data=Pr)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both",
                      trace = FALSE)
summary(step.model)
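An exhaustive search over 104 candidates can take a very long time even with really.big=TRUE, so a forward search capped at 6 predictors may be more practical. Below is a minimal sketch of that idea, assuming the leaps package is installed. The data are simulated stand-ins (20 noise columns, three of which drive obs), not the asker's data, and the names n, fwd, coefs6 and selected are made up for illustration.

```r
library(leaps)

# Simulated stand-in for the asker's data (assumption): 20 candidate
# predictors, of which only V1, V5 and V12 actually influence obs.
set.seed(1)
n <- 500
Pr <- as.data.frame(matrix(rnorm(n * 20), nrow = n))
Pr$obs <- 2 * Pr$V1 - 1.5 * Pr$V5 + Pr$V12 + rnorm(n)

# Forward selection, keeping the best model of each size up to 6 predictors.
fwd <- regsubsets(obs ~ ., data = Pr, nvmax = 6, method = "forward")

# Coefficients of the 6-predictor model found by the forward search.
coefs6 <- coef(fwd, id = 6)

# Predictor names, with the intercept dropped.
selected <- setdiff(names(coefs6), "(Intercept)")
print(selected)
```

summary(fwd)$which gives the same information as a logical matrix with one row per model size, which helps if a model with fewer than 6 predictors is good enough.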

Related

R: How to obtain diagnostic plots for a lavaan mediation model?

I wasn't sure whether this was more appropriate to ask here or CrossValidated as I'm specifically asking about using R / lavaan...
I'm not sure if I've completely misunderstood how violations of assumptions are checked. I understand that we can obtain diagnostic plots for linear models with:
model <- lm(data$outcome ~ data$predictor)
plot(model, which = c(1:6))
But I'm having trouble figuring out how I should do this for a mediation model fitted like so:
model <- 'outcome ~ c*predictor + b*mediator
mediator ~ a*predictor
indirect_effect := a*b
total_effect := c + (a*b)
'
model.fit <- lavaan::sem(
  model = model,
  data = data,
  missing = "FIML",
  estimator = "ML")
Then if I try obtaining plots in the same way (plot(model.fit, which = c(1:6))), I get Error in as.double(y) : cannot coerce type 'S4' to vector of type 'double'.
Also, to check for violations of assumptions for Pearson's correlation, would we do so by looking at the structure of each variable individually, or by making a linear model (lm(data$outcome ~ data$predictor)), or using the correlation itself (cor.test(data$var1, data$var2)) in some way?
Try:
lavaanPlot::lavaanPlot(model = model.fit, coefs=T)

Broom::tidy function not working for glm2 object?

I'm trying to 'tidy' up a binary regression (using a log link rather than a logit link, so that I get RR estimates instead of OR) using the broom function 'tidy' on a 'glm2' object. However, it's giving me an error saying
> tidy(model, conf.int=TRUE, exponentiate=TRUE)
Error: no valid set of coefficients has been found: please supply starting values
Here is a reproducible example of what I mean:
library(tidyverse)
library(glm2)
library(broom)
data(iris)
glimpse(iris)
table(iris$Species)
## create an outcome
df <- iris %>%
  mutate(outcome = case_when(Petal.Width > 2 ~ 1,
                             TRUE ~ 0))
# fit standard glm
glm(outcome ~ Sepal.Length + Sepal.Width, data = df,
    family = binomial(link = "log"))
# -> doesn't converge using a log link due to parameter space issues (common in fitting binary regression).
# go to glm2 to fit the model instead, but need starting values for this:
p0 <- sum(as.numeric(df$outcome)) / length(as.numeric(df$outcome))
start.val <- c(log(p0), rep(0, 2))
model <- glm2(outcome ~ Sepal.Length + Sepal.Width, data = df,
              family = binomial(link = "log"),
              start = start.val)
## get warnings, but converges
model$converged
## now tidy up and display model
tidy(model, conf.int = TRUE, exponentiate = TRUE)
# error -> wants starting values again? also shows warnings from previous
# (which are now saying model hasn't converged?)
tidy(model, conf.int = TRUE, exponentiate = TRUE, start = start.val)
# doesn't recognise starting values?
Any ideas on how to get tidy to work, or do I just do it manually?
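One possible explanation, offered as an assumption rather than a confirmed diagnosis: with conf.int=TRUE, tidy asks for confidence intervals through confint(), which for glm objects profiles the likelihood by refitting the model, and those refits do not reuse your starting values. If that is the cause, building the table manually from Wald intervals avoids any refitting. The sketch below uses only base R. The model is a plain logistic glm on mtcars, used purely as a stand-in so the code runs anywhere, and ratio_table is a made-up name.

```r
# Stand-in model (assumption): a plain logistic glm on a built-in data set.
# The same calls work on any fitted object with coef() and vcov() methods.
model <- glm(am ~ wt, data = mtcars, family = binomial())

est <- coef(model)

# Wald intervals: estimate +/- z * standard error, computed from vcov()
# without refitting the model.
ci <- confint.default(model)

# Exponentiate to put estimates and limits on the ratio scale.
ratio_table <- data.frame(
  term      = names(est),
  estimate  = exp(est),
  conf.low  = exp(ci[, 1]),
  conf.high = exp(ci[, 2]),
  row.names = NULL
)
print(ratio_table)
```

Wald intervals can differ noticeably from profile-likelihood intervals when estimates sit near the boundary of the parameter space, which is common for log-binomial models, so treat them as an approximation.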

lm (linear regression) function generated un-removable outliers

I have performed linear regression (lm) on two kinds of adjusted p-values: q-values and Benjamini-Hochberg adjusted p-values. The results show two extreme outliers; however, after removing those, new outliers are always present. Could someone please run the code and see whether the issue persists? What could be the source of the issue?
Here is the full code for easy copy/paste:
library(qvalue)
p = 50
m = 10
pval = c(rbeta(m,1,100), runif(p-m,0,1))
BHpval <- p.adjust(pval,method="BH")
qval_ <- qvalue(pval)
print(qval_$pi0)
fit2 <- lm(qval_$qvalues ~ BHpval)
plot(fit2)

R Variable Length Differ when build linear model for residuals

I am working on a problem where I want to build a linear model using the residuals of two other linear models. I have used the UN3 data set to show my problem, since it is easier to present the problem here than with my actual data set.
Here is my R code:
head(UN3)
m1.lgFert.purban <- lm(log(Fertility) ~ Purban, data=UN3)
m2.lgPPgdp.purban <- lm(log(PPgdp) ~ Purban, data=UN3)
m3 <- lm(residuals(m1.lgFert.purban) ~ residuals(m2.lgPPgdp.purban))
Here is the error I am getting:
> m3 <- lm(residuals(m1.lgFert.purban) ~ residuals(m2.lgPPgdp.purban))
Error in model.frame.default(formula = residuals(m1.lgFert.purban) ~ residuals(m2.lgPPgdp.purban), :
variable lengths differ (found for 'residuals(m2.lgPPgdp.purban)')
I do not really understand why this error occurs. If it were a log-related issue, I should have gotten the error when building the first two models.
Your default na.action is most likely na.omit (check with options("na.action")). This means that NA values get removed silently, resulting in different lengths of the residuals vectors. You probably want to use na.action="na.exclude", which pads the residuals with NAs.
library(alr3)
options("na.action")
#$na.action
#[1] "na.omit"
m1.lgFert.purban <- lm(log(Fertility) ~ Purban, data=UN3,na.action="na.exclude")
m2.lgPPgdp.purban <- lm(log(PPgdp) ~ Purban, data=UN3,na.action="na.exclude")
m3 <- lm(residuals(m1.lgFert.purban) ~ residuals(m2.lgPPgdp.purban))
#Coefficients:
# (Intercept) residuals(m2.lgPPgdp.purban)
# -0.01245 -0.18127
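The length mismatch can be reproduced without alr3. Here is a small sketch using the airquality data set that ships with R, where Ozone and Solar.R are missing in different rows. The object names (fit_omit_a, len_omit and so on) are made up for illustration.

```r
# airquality ships with base R; Ozone and Solar.R have different missing rows.
fit_omit_a <- lm(Ozone   ~ Wind, data = airquality, na.action = na.omit)
fit_omit_b <- lm(Solar.R ~ Wind, data = airquality, na.action = na.omit)

# With na.omit, each residual vector only covers that model's complete rows.
len_omit <- c(length(residuals(fit_omit_a)), length(residuals(fit_omit_b)))

fit_excl_a <- lm(Ozone   ~ Wind, data = airquality, na.action = na.exclude)
fit_excl_b <- lm(Solar.R ~ Wind, data = airquality, na.action = na.exclude)

# With na.exclude, residuals() pads the dropped rows with NA, so both
# vectors have one entry per row of the data.
len_excl <- c(length(residuals(fit_excl_a)), length(residuals(fit_excl_b)))

print(len_omit)
print(len_excl)

# The padded vectors line up row by row, so the second-stage model fits;
# rows with an NA in either vector are dropped at this stage.
m3 <- lm(residuals(fit_excl_a) ~ residuals(fit_excl_b))
```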

How to manually specify outer knots for smoother in gam (mgcv package)

I am fitting GAM models to data using the mgcv package in R. Some of my predictors are circular, so I am using a periodic smoother. I run into an issue in cross validation where my holdout dataset can contain values outside the range of the training data. Since gam automatically chooses knots for the smooths, this leads to an error (see my related question here -- thanks to @nograpes and @DWin for their explanations of the errors there).
How can I manually specify the outer knots in a periodic smooth?
Example code
The first block generates some data.
library(mgcv)
set.seed(223) # produces error.
# set.seed(123) # no error.
# generate data:
x <- runif(100,min=-pi,max=pi)
linPred <- 2*cos(x) # value of the linear predictor
theta <- 1 / (1 + exp(-linPred)) # inverse logit: probability that y = 1
y <- rbinom(100,1,theta)
plot(x,theta)
df <- data.frame(x=x,y=y)
The next block fits the GAM model with the periodic smooth:
gamFit <- gam(y ~ s(x,bs="cc",k=5),data=df,family=binomial())
summary(gamFit)
plot(gamFit)
I assume the knots can be set somewhere in the specification of the smoother term s(x,bs="cc",k=5), but how to do it is not obvious to me from the help for gam or from googling.
This block will fit some holdout data and produce the error if you set the seed as above:
# predict y values for new data:
x.2 <- runif(100,min=-pi,max=pi)
df.2 <- data.frame(x=x.2)
predict(gamFit,newdata=df.2)
Ideally, I would only set the outer knots and let gam pick the rest.
Apologies if this question is better for CrossValidated than SO.
Try this:
gamFit <- gam(y ~ s(x,bs="cc",k=5),
              knots=list( x=seq(-pi,pi, len=5) ),
              data=df, family=binomial())
You will find a worked example at:
?smooth.construct.cr.smooth.spec
I learned in testing this code that the 'k' parameter in s() needs to match the 'len' parameter of the seq() call supplied for 'x' in the knots argument. I had incorrectly thought that the knots argument would get passed to s().
This has been possible in {mgcv} for some years now (though perhaps not when the question was posed and answered). Using the model in @IRTFM's answer, one can specify just the outer knots for a cyclic CRS:
gamFit <- gam(y ~ s(x, bs = "cc"),
              knots = list(x = c(-pi, pi)),
              data = df, family = binomial())
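To check that this addresses the holdout problem from the question, here is a self-contained sketch that refits the model with the outer knots pinned to -pi and pi and then predicts on fresh data. It assumes a reasonably recent mgcv. The data are simulated as in the question, with plogis() standing in for the hand-written inverse logit, and reading the knots from the fitted smooth's xp component relies on how mgcv stores them internally, so treat that last line as an assumption.

```r
library(mgcv)

# Simulated training data, following the question's setup.
set.seed(223)
x <- runif(100, min = -pi, max = pi)
y <- rbinom(100, 1, plogis(2 * cos(x)))
df <- data.frame(x = x, y = y)

# Pin the cyclic smooth to the full circle; the interior knots are
# placed automatically between the two supplied end points.
gamFit <- gam(y ~ s(x, bs = "cc"),
              knots = list(x = c(-pi, pi)),
              data = df, family = binomial())

# Holdout values may fall outside range(df$x) but stay inside [-pi, pi].
df.2 <- data.frame(x = runif(100, min = -pi, max = pi))
pred <- predict(gamFit, newdata = df.2, type = "response")

# End points of the knot sequence used by the smooth (stored in xp).
print(range(gamFit$smooth[[1]]$xp))
```

Because the end points no longer depend on the training sample, every cross-validation fold shares the same period.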
