How to do stepwise regression in R with more independent variables than observations?

I'm trying to do stepwise regression for the following data:
y <- c(1.2748, 1.2574, 1.5571, 1.4178, 0.8491, 1.3606, 1.4747, 1.3177, 1.2896, 0.8453)
x <- data.frame(A = c(2,3,4,5,6,2,3,4,5,6),
                B = c(2,4,1,3,5,1,3,5,2,4)*100,
                C = c(9,5,11,5,11,7,13,7,13,9),
                D = c(6,5,3,7,6,4,3,7,5,4),
                E = c(1,1,0.8,0.8,0.6,0.6,0.4,0.4,0.2,0.2))
x$A2 <- x$A^2
x$B2 <- x$B^2
x$C2 <- x$C^2
x$D2 <- x$D^2
x$E2 <- x$E^2
x$AB <- x$A*x$B
As we can see, there are 10 observations and 11 independent variables, so I can't fit an ordinary linear regression model. In fact, only a few factors are useful, and in this case I need stepwise regression with "forward" selection to add independent variables to my formula. But the stats::step function cannot be used here. I wonder if there is a method to do it. I know there is a package called "StepReg", but I don't fully understand how to use it or how to read the results. Thank you!

I just ran stepwise regression on the data you provided using the R package StepReg.
Here is the code; I hope it helps.
library(StepReg)
df <- data.frame(y,x)
# forward selection with the information criterion 'AIC'; other information criteria can be chosen
stepwise(df, y = "y", exclude = NULL, include = NULL, Class = NULL,
         selection = "forward", select = "AIC")
# forward selection with significance levels; significance level for entry (sle) = 0.15
stepwise(df, y = "y", exclude = NULL, include = NULL, Class = NULL,
         selection = "forward", select = "SL", sle = 0.15)
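The output should show, step by step, which variable enters and the criterion value after each entry, with the final table listing the selected model. As a minimal follow-up sketch (A and E below are hypothetical placeholders, not the actual selection), you can refit whatever stepwise() selects with base R to read the usual coefficient table:
# hypothetical refit: substitute the variables stepwise() actually selected
final <- lm(y ~ A + E, data = df)
summary(final)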

You can use the olsrr package, which gives results similar to those of SPSS. Here is the solution:
library(olsrr)
df <- data.frame(y, x)
model <- lm(y ~ ., data = df)
smlr <- ols_step_both_p(model, pent = 0.05, prem = 0.1)
# pent: p-value for entry; variables with a p-value less than pent enter the model.
# prem: p-value for removal; variables with a p-value greater than prem are removed from the model.
You can get the model details by calling:
smlr
smlr$model
smlr$beta_pval #regression coefficients with p-values
I have kept the values of pent and prem the same as the default values used by SPSS.
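As a small follow-up sketch (assuming the object structure shown above): the stored model is a regular lm object, so the usual summary() applies, and olsrr also provides a plot method for stepwise output, if I recall its API correctly:
summary(smlr$model)  # standard lm summary of the final selected model
plot(smlr)           # step-by-step diagnostic plots (assumed olsrr plot method)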

Related

Why do R and PROCESS render different results for a mediation model (one is significant, the other is not)?

As a newcomer who has just started using R, I am confused about the result of the mediation analysis.
My model is simple: IV 'T1Incivi', Mediator 'T1Envied', DV 'T2PSRB'. I ran the same model in SPSS using PROCESS, but the indirect effect was insignificant in PROCESS while it is significant in R. Since I am not that familiar with R, could you please help me see whether there is anything wrong with my code, and tell me why the result is significant in R but not in SPSS? Thanks a bunch!!!
My code in R:
# X predicts M
apath <- lm(T1Envied~T1Incivi, data=dat)
summary(apath)
# X and M predict Y
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
summary(bpath)
# Bootstrapping for the indirect effect
getindirect <- function(dataset, random){
  d <- dataset[random,]
  apath <- lm(T1Envied~T1Incivi, data=d)
  bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
  indirect <- apath$coefficients["T1Incivi"]*bpath$coefficients["T1Envied"]
  return(indirect)
}
library(boot)
set.seed(6452234)
Ind1 <- boot(data = dat,
             statistic = getindirect,
             R = 5000)
boot.ci(Ind1,
        conf = .95,
        type = "norm")
In your function getindirect, all linear regressions should be based on the freshly resampled data in d.
However, there is the line
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
which wrongly references the variable dat, which should really not be used within this function. That alone can explain the incoherent results.
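For concreteness, here is the corrected function; only the data argument of the second regression changes:
getindirect <- function(dataset, random){
  d <- dataset[random,]
  apath <- lm(T1Envied~T1Incivi, data=d)
  bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=d)  # now fit on the resampled data d
  indirect <- apath$coefficients["T1Incivi"]*bpath$coefficients["T1Envied"]
  return(indirect)
}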

Is it possible to adapt standard prediction interval code for dlm in R with another distribution?

Using the dlm package in R, I fit a dynamic linear model to a time series data set consisting of 20 observations. I then use the dlmForecast function to predict future values (which I can validate against the genuine data for said period).
I use the following code to create a prediction interval:
ciTheory <- (outer(sapply(fut1$Q, FUN = function(x) sqrt(diag(x))), qnorm(c(0.05, 0.95))) +
             as.vector(t(fut1$f)))
However, my data does not follow a normal distribution, and I wondered whether it would be possible to adapt the qnorm function for other distributions. I have tried qt, but am unable to apply qgamma.
Just wondered if anyone knew how you would go about sorting this.
Below is a reproduced version of my code:
Below is a reproduced version of my code...
library(dlm)
data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770, 20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# Returns the most likely estimates of the relevant parameters
mle <- dlmMLE(data, rep(0, 2), mod.build)
if (mle$convergence == 0) print("converged") else print("did not converge")
mod1 <- mod.build(mle$par)        # rebuild the model at the fitted parameters
mod1Filt <- dlmFilter(data, mod1)
fut1 <- dlmForecast(mod1Filt, nAhead = 7)
Cheers
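One possible adaptation, as a sketch: the normal quantiles enter the interval only through qnorm(c(0.05, 0.95)), so the quantile function of any distribution centred on zero can be substituted directly. For example, with t quantiles (the 5 degrees of freedom below are an assumed illustration value, not a recommendation):
# sketch: swap the normal quantiles for t quantiles; df = 5 is an assumption
ciTheoryT <- (outer(sapply(fut1$Q, FUN = function(x) sqrt(diag(x))), qt(c(0.05, 0.95), df = 5)) +
              as.vector(t(fut1$f)))
qgamma does not fit this pattern because gamma quantiles are not centred on zero; for a gamma interval you would instead need to match the gamma's shape and rate to the forecast mean fut1$f and variance fut1$Q rather than adding quantiles to the mean.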

Is there an R function that resolves a second-order linear model?

I'm a beginner in R and programming, and I'm struggling with what is probably a simple task.
I've made a code that creates a second-order model, and I want to input variables into this model and find the "Y value".
I've tried to use the predict function, but it is actually pretty complex and I couldn't get anywhere.
I did this so far:
modFOI <- rsm(Rendimento~FO(x1,x2,x3,x4)+TWI(x1,x2,x3,x4)+PQ(x1,x2,x3,x4), data=CR) # with interactions
summary(modFOI)
print(modFOI)
With that, I found the second-order model, but now I want to create variables like x1, x2, x3, input them into the model, and find Y. I would also like to find the optimum Y.
The simplest way to create a polynomial (2nd-order) model that I can think of is the following:
DF <- data.frame(x = runif(10, 0, 1),
                 y = runif(10, 0, 1))
mod <- lm(y ~ x + I(x^2), data = DF)
predict(mod, newdata = data.frame(x = c(1, 2, 3, 4, 5)))
NB: when using predict, the newdata must be a data.frame, and the variable must have the same name as the variable in the model (here, x). Fitting with a data argument (rather than DF$y ~ DF$x) is what makes predict() honour newdata.
Hope this helps
The optimum value is shown as the stationary point in the output of summary(modFOI). You may also run steepest(modFOI) to see a trace of the estimated values along the path of steepest ascent.
To predict, create a data frame with the desired sets of x values. For example,
testdat <- data.frame(x1 = -1:1, x2 = 0, x3 = 0, x4 = 1)
Then use the predict() function with this as newdata:
predict(modFOI, newdata = testdat)
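As a further sketch (assuming the summary object stores the canonical analysis in $canonical with components xs and eigen, as in current rsm versions), the stationary point can be pulled out programmatically and fed to predict():
cs <- summary(modFOI)$canonical                     # canonical analysis (assumed structure)
predict(modFOI, newdata = as.data.frame(t(cs$xs)))  # predicted Y at the stationary point
Check the eigenvalues in cs$eigen first: the stationary point is a maximum only if they are all negative.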

Predict values from a non-parametric model in R

The target is to predict values for the test set with a model from the monreg package, but this does not work with the predict function, because monreg does not return a model object that the prediction function can use.
Giving an example:
require(monreg) # Package ‘monreg’
x <- rnorm(100)
y <- x + rnorm(100)
x_train <- x[1:80]
y_train <- y[1:80]
x_test <- x[81:100]
y_test <- y[81:100]
mon1 <- monreg(x_train, y_train, hd = .5, hr = .5)  # fit on the training partition
# I was expecting to get predictions for the test partition, the way predict() usually works in R
predict(mon1, h = length(y_test))
But this is not working. In case this package doesn't have any predict function, I would accept any advice on implementing Nadaraya-Watson regression in R to predict values as in this example.
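Since monreg does not return a model object, a minimal sketch of the Nadaraya-Watson estimator itself can stand in for predict(); the helper nw_predict below is hypothetical, not part of monreg:
# Nadaraya-Watson regression with a Gaussian kernel; h is the bandwidth
nw_predict <- function(x_train, y_train, x_new, h = 0.5) {
  sapply(x_new, function(x0) {
    w <- dnorm((x_train - x0) / h)  # kernel weights centred at x0
    sum(w * y_train) / sum(w)       # locally weighted average of the responses
  })
}
y_hat <- nw_predict(x_train, y_train, x_test, h = 0.5)
Base R's ksmooth() implements the same estimator (note its bandwidth is scaled differently):
ksmooth(x_train, y_train, kernel = "normal", bandwidth = 0.5, x.points = x_test)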

How to add specific conditions to stepAIC

I am running a regression with 37 variables, and I am using stepAIC to perform model selection. I do NOT want a predictive model; I just want to find out which variables have the best explanatory power.
My current code looks like:
fitObject <- lm(mydata)
DEP.select <- stepAIC(fitObject, direction = 'both', scope= list(lower = ~AUC), trace = F, k = log(obs))
# DEP is my dependent variable, and AUC is an independent variable I want to have in my model.
The problem is that many of my variables are highly correlated, and the result stepAIC gives me contains several of those highly correlated variables. Note that since I have forced AUC into the model, multicollinearity is a problem, especially when variables highly correlated with AUC are chosen.
Is there a way to specify, in the function, thresholds for the correlation or the p-values of the coefficients?
Any comments on other approaches that could solve my problem are also welcome.
Thank you!
Perhaps the Variance Inflation Factor (VIF) will work better for you. This article explains some of the logic: http://en.wikipedia.org/wiki/Variance_inflation_factor
Example use:
v <- ezvif(df, yvar = 'columnNameOfWhichYouAreTryingToPredict')
Here is the function I wrote, which combines VIF::vif with k-fold cross-validation:
require(VIF)
require(cvTools)
# returns selected variables using VIF and k-fold cross-validation
ezvif <- function(df, yvar, folds = 5, trace = F){
  f <- cvFolds(nrow(df), K = folds)
  findings <- list()
  for (v in names(df)){
    if (v == yvar) next
    findings[[v]] <- 0
  }
  for (i in 1:folds){
    rows <- f$subsets[f$which != i]
    y <- df[rows, yvar]
    xdf <- df[rows, names(df) != yvar]  # remove the output variable
    vifResult <- vif(y, xdf, trace = trace, subsize = min(200, nrow(xdf)))
    for (v in names(xdf)[vifResult$select]){
      findings[[v]] <- findings[[v]] + 1  # vote
    }
  }
  findings <- sort(unlist(findings), decreasing = T)
  if (trace) print(findings[findings > 0])
  return(c(yvar, names(findings[findings == findings[1]])))
}
I would recommend removing the variables with high correlations. The caret and corrplot libraries can help:
library(corrplot)
library(caret)
dm <- data.matrix(mydata[, names(mydata) != 'DEP'])  # without your outcome variable
Visualize your correlations, clustering highly correlated variables together:
corrplot(cor(dm), order = 'hclust')
Then find the indices of variables that you could remove due to high (>0.75) correlations:
findCorrelation(cor(dm), cutoff = 0.75)
Removing these variables can improve your model; a short sketch of the removal step follows below. Afterwards, continue with stepAIC as you described in your question.
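A minimal sketch of that removal step (object names follow the snippet above; the 0.75 cutoff is the same one):
highCorr <- findCorrelation(cor(dm), cutoff = 0.75)  # column indices to drop
mydata_reduced <- mydata[, !(names(mydata) %in% colnames(dm)[highCorr])]  # DEP is retained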
To assess multicollinearity between predictors when running the dredge function (MuMIn package), include the following max.r function as the "extra" argument:
max.r <- function(x){
  # maximum absolute correlation between fixed-effect estimates, ignoring the intercept
  corm <- cov2cor(vcov(x))
  corm <- as.matrix(corm)
  if (length(corm) == 1){
    # intercept-only model: nothing to correlate
    corm <- 0
    max(abs(corm))
  } else if (length(corm) == 4){
    # a single predictor: no predictor-predictor correlation exists
    cormf <- corm[2:nrow(corm), 2:ncol(corm)]
    cormf <- 0
    max(abs(cormf))
  } else {
    cormf <- corm[2:nrow(corm), 2:ncol(corm)]
    diag(cormf) <- 0
    max(abs(cormf))
  }
}
Then simply run dredge, specifying the number of predictor variables and including the max.r function:
options(na.action = na.fail)
Allmodels <- dredge(Fullmodel, rank = "AIC", m.lim=c(0, 3), extra= max.r)
Allmodels[Allmodels$max.r<=0.6, ] ##Subset models with max.r <=0.6 (not collinear)
NCM <- get.models(Allmodels, subset = max.r<=0.6) ##Retrieve models with max.r <=0.6 (not collinear)
model.sel(NCM) ##Final model selection table
This works for lme4 models. For nlme models see: https://github.com/rojaff/dredge_mc
