Selecting the statistically significant variables in an R glm model

I have an outcome variable, say Y and a list of 100 dimensions that could affect Y (say X1...X100).
After running my glm and viewing a summary of my model, I see those variables that are statistically significant. I would like to be able to select those variables and run another model and compare performance. Is there a way I can parse the model summary and select only the ones that are significant?

Although kith's answer paved the way, there is more that can be done: the whole process can be automated. First, let's create some data:
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y <- rnorm(10)
x4 <- y + 5 # this will make a nice significant variable to test our code
(mydata <- as.data.frame(cbind(x1,x2,x3,x4,y)))
Our model is then:
model <- glm(formula=y~x1+x2+x3+x4,data=mydata)
And the Boolean vector of the coefficients can indeed be extracted by:
toselect.x <- summary(model)$coeff[-1,4] < 0.05 # credit to kith
But this is not all! In addition, we can do this:
# select sig. variables
relevant.x <- names(toselect.x)[toselect.x == TRUE]
# formula with only sig variables
sig.formula <- as.formula(paste("y ~",relevant.x))
EDIT: as subsequent posters have pointed out, the latter line should be sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+"))) to include all variables.
And run the regression with only significant variables as OP originally wanted:
sig.model <- glm(formula=sig.formula,data=mydata)
In this case the x4 estimate will be equal to 1, since we defined x4 as y + 5, i.e. a perfect linear relationship.
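To then compare the performance of the two fits, as the question asks, one option is an information-criterion or F-test comparison. A minimal sketch, assuming model and sig.model from above (note that with this toy data x4 fits y almost perfectly, so the numbers themselves are degenerate; on real data they are meaningful):
AIC(model, sig.model)               # lower AIC indicates the better fit/complexity trade-off
anova(sig.model, model, test = "F") # F-test of the terms dropped from the full model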

You can access the p-values of the glm result through the summary function. The last column of the coefficients matrix holds the p-values of the terms used in the model; for a Gaussian glm like the one below it is called "Pr(>|t|)" (for other families it is "Pr(>|z|)").
Here's an example:
#x is a 10 x 3 matrix
x = matrix(rnorm(3*10), ncol=3)
y = rnorm(10)
res = glm(y~x)
#ignore the intercept pval
summary(res)$coeff[-1,4] < 0.05

For people having issues with Maxim.K's line
sig.formula <- as.formula(paste("y ~",relevant.x))
use this instead:
sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+")))
The final code will look like this (where glmText is your fitted glm object):
toselect.x <- summary(glmText)$coeff[-1,4] < 0.05 # credit to kith
# select sig. variables
relevant.x <- names(toselect.x)[toselect.x == TRUE]
# formula with only sig variables
sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+")))
This fixes the bug where only the first variable gets picked up.

In
sig.formula <- as.formula(paste("y ~",relevant.x))
only the first variable of relevant.x ends up in the formula; the others are ignored (try, for example, inverting the condition to > 0.5 to see this).
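To see why, look at how paste() behaves on a vector of names (a small illustration, assuming relevant.x contains two variable names):
relevant.x <- c("x2", "x4")
paste("y ~", relevant.x)                           # "y ~ x2" "y ~ x4"  - two separate strings
paste("y ~", paste(relevant.x, collapse = " + "))  # "y ~ x2 + x4"      - one formula string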

Related

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks one predictor variable at a time from the data frame and fits a model. I further want to extract the regression coefficient (not the intercept) and its sign, and store them in two vectors. Here's my code:
for (x in (1:7))
{
  fit <- lm(distance ~ FAA_unique_with_duration_filtered[x], data = FAA_unique_with_duration_filtered)
  coeff_values <- summary(fit)$coefficients[,1]
  coeff_value <- coeff_values[2]
  append(coeff_value_vector, coeff_value, after = length(coeff_value_vector))
  append(RCs_sign_vector, sign(coeff_values[2]), after = length(RCs_sign_vector))
}
Here x should pick the first column, then the 2nd and so on. However, I am getting the following error.
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use the following loop, where reg is either of the definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
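If you would rather keep a loop over your own data frame, a hedged sketch using reformulate() could look like this (it assumes FAA_unique_with_duration_filtered contains the response distance and that every other column is a predictor; adjust the name filtering otherwise):
predictors <- setdiff(names(FAA_unique_with_duration_filtered), "distance")
coeff_value_vector <- numeric(length(predictors))
for (i in seq_along(predictors)) {
  f   <- reformulate(predictors[i], response = "distance")  # builds distance ~ <predictor i>
  fit <- lm(f, data = FAA_unique_with_duration_filtered)
  coeff_value_vector[i] <- coef(fit)[2]                      # slope, not the intercept
}
RCs_sign_vector <- sign(coeff_value_vector)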

Random forest variable importance AND direction of correlation for binomial response

I am using the randomForest package in R, but am not partial to solutions using other packages.
My RF model uses various continuous and categorical variables to predict extinction risk (Threatened, Non_Threatened). I would like to be able to show the direction of variable importance for the predictors used in my RF model. Other publications have done exactly this: see Figure 1 in https://www.pnas.org/content/pnas/109/9/3395.full.pdf
Any ideas on how to do something similar? One suggestion I read said to simply compare the difference between two partial dependence plots (example below), but I feel this may not be the best way.
Any help would be greatly appreciated.
partialPlot(final_rf, rf_train, size_mat,"Threatened")
partialPlot(final_rf, rf_train, size_mat,"Non_Threatened")
(The two resulting partial dependence plots, for response = Threatened and response = Non_Threatened, are omitted here.)
You could use something like an average marginal effect (or like below, an average first difference) approach.
First, I'll make some data
set.seed(11)
n = 200
p = 5
X = data.frame(matrix(runif(n * p), ncol = p))
yhat = 10 * sin(pi * X[, 1] * X[, 2]) + 20 *
  (X[, 3] - 0.5)^2 + 10 * -X[, 4] + 5 * -X[, 5]
y = as.numeric((yhat+ rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X,y))
Next, we'll estimate the RF model:
library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)
Next, we can loop through each variable. Each time through the loop, we add one standard deviation to a single x variable for all observations; for a categorical variable you could instead switch every observation from one category to another (a sketch of that case follows the output below). We then predict the probability of a positive response under both conditions, the original data and the shifted data, take the difference, and summarize it.
nx <- names(df)
nx <- nx[-which(nx == "y")]                       # predictor names
p1 <- predict(rf, newdata = df, type = "prob")    # baseline predicted probabilities
res <- NULL
for(i in 1:length(nx)){
  df2 <- df
  df2[[nx[i]]] <- df2[[nx[i]]] + sd(df2[[nx[i]]]) # shift one predictor by one SD
  p2 <- predict(rf, newdata = df2, type = "prob")
  diff <- (p2 - p1)[, 2]                          # change in predicted P(y = 1)
  res <- rbind(res, c(mean(diff), sd(diff)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res
# effect sd
# X1 0.11079 0.18491252
# X2 0.10265 0.16552070
# X3 0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836
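For a categorical predictor, the analogous step is to switch every observation from one level to another instead of adding a standard deviation. A hypothetical sketch, assuming the fitted forest final_rf from the question uses a factor column habitat in rf_train (the column name and levels are illustrative, not from the original data):
lev <- levels(rf_train$habitat)                   # hypothetical factor predictor
d_a <- rf_train; d_a$habitat <- factor(lev[1], levels = lev)  # everyone in level 1
d_b <- rf_train; d_b$habitat <- factor(lev[2], levels = lev)  # everyone in level 2
p_a <- predict(final_rf, newdata = d_a, type = "prob")[, "Threatened"]
p_b <- predict(final_rf, newdata = d_b, type = "prob")[, "Threatened"]
c(effect = mean(p_b - p_a), sd = sd(p_b - p_a))   # average first difference between levels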

R: e1071 svm function - is it necessary to convert categorical to dummies?

I know an SVM model needs preprocessing that converts categorical variables into dummy variables. However, when I use e1071's svm function to fit a model with unconverted data (see train and test), no error pops up, so I assume the function converts them automatically.
However, when I use the converted data (see train2 and test2) to fit an SVM model, the function gives me a different result (as indicated, p1 and p2 are not the same).
Could anyone let me know what happens to the unconverted data? Does the function just ignore the categorical variables, or does something else happen?
library(e1071)
library(dummies)
set.seed(0)
x = data.frame(matrix(rnorm(200, 10, 10), ncol = 5)) #fake numerical predictors
cate = factor(sample(LETTERS[1:5], 40, replace=TRUE)) #fake categorical variables
y = rnorm(40, 50, 10) #fake response
data = cbind(y,cate,x)
ind = sample(40, 30, replace=FALSE)
train = data[ind, ]
test = data[-ind, ]
#without dummy
data = cbind(y,cate,x)
svm.model = svm(y~., train)
p1 = predict(svm.model, test)
#with dummy
train2 = cbind(train[,-2], dummy(train[,2]))
colnames(train2) = c('y', paste0('X',1:5), LETTERS[1:4])
test2 = cbind(test[,-2], dummy(test[,2]))
colnames(test2) = c('y', paste0('X',1:5), LETTERS[1:4])
svm.model2 = svm(y~., train2)
p2 = predict(svm.model2, test2)
What you're observing is indeed as you stated: dummies are converted automatically. In fact we can reproduce both svm.model and svm.model2 quite easily.
mf <- model.frame(y ~ . - 1, train) # - 1 because the intercept is unused in svm.
mt <- terms(mf)
X <- model.matrix(mt, mf)
Xtest <- model.matrix(mt, test)
Y <- model.response(mf)
svm.model3 <- svm(X, Y)
Note that I did not use svm(formula, data) but svm(x, y). Now, which model did we actually recreate? Let's compare with p1 and p2:
all.equal(p1, predict(svm.model3, newdata = Xtest))
# [1] "Mean relative difference: 0.03064692"
all.equal(p2, predict(svm.model3, newdata = Xtest))
# [1] TRUE
It seems we've recreated model 2, the one with our manual dummies. The reason this reproduces svm.model2 and not svm.model is the scale parameter. From help(svm) (note in particular the default scaling behaviour):
A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance. The center and scale values are returned and used for later predictions.
From this we can see that the difference (and really the issue) likely comes from svm not identifying manually supplied binary columns as dummies, while apparently being smart enough to skip scaling the dummies it creates itself during automatic conversion. We can test this theory by setting the scale parameter manually:
# labels(mt) = 'cate', 'X1', 'X2', ...
# names(attr(X, 'contrasts')) = 'cate'
# i.e. scale every column except those generated from 'cate'
not_dummies <- !(labels(mt) %in% names(attr(X, 'contrasts')))
n <- table(attr(X, 'assign'))
scale <- rep(not_dummies, n)
svm.model4 <- svm(X, Y, scale = scale)
all.equal(p1, predict(svm.model4, newdata = Xtest))
# [1] TRUE
all.equal(p2, predict(svm.model4, newdata = Xtest))
# [1] "Mean relative difference: 0.03124989"
So what we see is that:
1) svm, as stated, converts factors into dummy variables automatically.
2) It does not, however, check whether supplied columns are already dummies, which can cause unexpected behaviour if you create them manually.
From the documentation it is clear that factors are treated at least slightly differently, hence the comment "If the predictor variables include factors, the formula interface must be used to get a correct model matrix.".
My personal hunch is that the differences have to do with scaling (on by default in svm). Note the difference between...
> svm.model$x.scale$`scaled:center`
X1 X2 X3 X4 X5
10.091157 8.739654 10.395121 7.856475 11.660454
> svm.model2$x.scale$`scaled:center`
X1 X2 X3 X4 X5 A B C D X.NA.
10.0911569 8.7396541 10.3951208 7.8564754 11.6604540 0.2000000 0.1333333 0.1333333 0.2333333 0.3000000

R: one regression model for 2 different data sets to prepare for waldtest

I have two different data sets, each representing one of my two portfolios, with y(p) as the dependent variable and x1(p), x2(p), x3(p), x4(p) as independent variables.
(p) indicates a portfolio-specific value; column 1 of each variable represents portfolio 1 and column 2 represents portfolio 2.
The regression equation is:
y(p) = α(p) + β1(p)*x1(p) + β2(p)*x2(p) + β3(p)*x3(p) + β4(p)*x4(p)
What I have done so far is implement a separate regression model for each portfolio in R:
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
My objective is to compare the two intercepts of both regression models. Within the scope of this comparison I need to test the joint significance of these intercepts. As far as I can tell, the Wald test should be appropriate.
If I use the waldtest function from the lmtest package it does not work, obviously because the response variable is not the same for both models.
library(lmtest)
waldtest(lm1,lm2)
In waldtest.default(object, ..., test = match.arg(test)) :
models with response "y[, 2]" removed because response differs from model 1
All workarounds I tried so far did not work either, e.g. R: Waldtest: "Error in solve.default(vc[ovar, ovar]) : 'a' is 0-diml"
My guess is that the regression needs to be done in a different way to fix the problems regarding the waldtest.
So that leads to my question:
Is there a possibility to do the regression in one model that still generates portfolio-specific intercepts and coefficients? (I assume that this would fix the problems with the waldtest function.)
Any advice or suggestion will be appreciated.
The following data can be used for a reproducible example:
y=matrix(rnorm(10),ncol=2)
x1=matrix(rnorm(10),ncol=2)
x2=matrix(rnorm(10),ncol=2)
x3=matrix(rnorm(10),ncol=2)
x4=matrix(rnorm(10),ncol=2)
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
library(lmtest)
waldtest(lm1,lm2)
Best regards,
Simon
Here are three ways to test the equality of the intercepts. The second one is an implementation of the accepted answer to this question, while the other two are implementations of the second answer to the aforementioned question, under different assumptions.
Let
n <- 5
y <- matrix(rnorm(10), ncol = 2)
x <- matrix(rnorm(10), ncol = 2)
First, we may indeed perform the test with only a single model. For that purpose we create a new vector Y that concatenates y[, 1] and y[, 2]. As for the independent variables, we create a block-diagonal matrix with the regressors of one model at the upper-left block and those for the other model at the lower-right block. Lastly, I create a group factor indicating the hidden model. Hence,
library(Matrix)
Y <- c(y)
X <- as.matrix(bdiag(x[, 1], x[, 2]))
G <- factor(rep(0:1, each = n))
Now the unrestricted model is
m1 <- lm(Y ~ G + X - 1)
while the restricted one is
m2 <- lm(Y ~ X)
Testing for intercepts equality gives
library(lmtest)
waldtest(m1, m2)
# Wald test
#
# Model 1: Y ~ G + X - 1
# Model 2: Y ~ X
# Res.Df Df F Pr(>F)
# 1 6
# 2 7 -1 0.5473 0.4873
so that, as expected, we cannot reject their equality. A problem with this solution, however, is that it amounts to estimating the two models separately while assuming that the errors have the same variance in both. Also, we don't allow for cross-correlation between the errors.
Second, we can relax the assumption of identical error variances by estimating two separate models and employing a Z-test as follows.
M1 <- lm(y[, 1] ~ x[, 1])
M2 <- lm(y[, 2] ~ x[, 2])
Z <- unname((coef(M1)[1] - coef(M2)[1]) / sqrt(coef(summary(M1))[1, 2]^2 + coef(summary(M2))[1, 2]^2))
2 * pnorm(-abs(Z))
# [1] 0.5425736
leading to the same conclusion.
Lastly, we can employ SUR (seemingly unrelated regressions), allowing for model-dependent error variances as well as contemporaneous cross-dependence between the errors (which may not be necessary in your case; it depends on what kind of data you are using). For that we can use the systemfit package as follows:
library(systemfit)
eq1 <- y[, 1] ~ x[, 1]
eq2 <- y[, 2] ~ x[, 2]
m <- systemfit(list(eq1, eq2), method = "SUR")
In this case we are also able to perform the Wald test:
R <- matrix(c(1, 0, -1, 0), nrow = 1) # Restriction matrix
linearHypothesis(m, R, test = "Chisq")
# Linear hypothesis test (Chi^2 statistic of a Wald test)
#
# Hypothesis:
# eq1_(Intercept) - eq2_(Intercept) = 0
#
# Model 1: restricted model
# Model 2: m
#
# Res.Df Df Chisq Pr(>Chisq)
# 1 7
# 2 6 1 0.3037 0.5816
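For completeness, the single-model formulation the question asks for can also be written with a portfolio factor interacting with the regressor; this is equivalent in spirit to the block-diagonal construction above. A sketch reusing y, x and n from above:
long <- data.frame(Y = c(y), x = c(x), portfolio = factor(rep(1:2, each = n)))
m_full  <- lm(Y ~ portfolio * x, data = long)  # portfolio-specific intercepts and slopes
m_restr <- lm(Y ~ x:portfolio, data = long)    # common intercept, portfolio-specific slopes
waldtest(m_full, m_restr)                      # Wald test of the intercept difference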

convert dredge function outputs to data.frame in R

I am using the MuMIn package in R to select the best model for my data. Here is an example using the Cement data set that ships with the package.
require(MuMIn)
data(Cement)
d <- data.frame(Cement)
idx <- seq(11,13)
avgmod.95p <- list()
options(na.action = "na.fail")  # newer MuMIn versions require this for the global model passed to dredge()
for (i in 1:length(idx)){
  d2 <- d[1:idx[i], ]
  fm1 <- lm(y ~ ., data = d2)
  dd <- dredge(fm1, extra = c("R^2", F = function(x)
    summary(x)$fstatistic[[1]]))
  # 95% confidence set:
  confset.95p <- get.models(dd, cumsum(weight) <= .95)
  avgmod.95p[[i]] <- model.avg(confset.95p)
}
As you can see, I'm running an iteration loop to construct the model-averaged estimates for the dataset (whose length I alter here, for illustration). The variable avgmod.95p returns:
> avgmod.95p[[1]][3]
$avg.model
Estimate Std. Error Adjusted SE Lower CI Upper CI
(Intercept) 56.1637849 15.06079485 15.15303057 26.4643908 85.8631791
X1 1.4810616 0.14016773 0.16302190 1.1615446 1.8005787
X2 0.6850913 0.05397343 0.06358329 0.5604704 0.8097123
X4 -0.6063184 0.05919637 0.06964775 -0.7428255 -0.4698113
X3 0.2126228 0.19480789 0.23502854 -0.2480246 0.6732703
which includes the estimated parameter and the lower and upper confidence intervals.
How do I combine all of the outputs from the iteration loop into one data.frame, for example:
Variable Estimate Lower CI Upper CI
X1 1.4810616 1.1615446 1.8005787
X1
X1
X2
i.e. I would have three values for X1, X2 and X3 where three is the number of iterations in the loop.
How can this be done? I have tried:
do.call(rbind.data.frame, avgmod.95p)
but it doesn't work; it throws an error.
You are assigning the results to a list, so let's use lapply.
#get number of rows for each model
no.of.rows <- unlist(lapply(avgmod.95p, function(x) nrow(x$avg.model)))
#use lapply again to rbind the models
foo <- do.call(rbind, lapply(avgmod.95p, function(x) x$avg.model))
Now make it into a nice data.frame, using no.of.rows to indicate which model each row came from:
result.df <- data.frame(Model.No = rep(seq_along(no.of.rows), no.of.rows),
                        Coefs = rownames(foo),
                        foo)
If you modify the index in the for loop to assign names to the elements of avgmod.95p, you can give the models fancy names as well and use those here instead of the plain model numbers.
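If you then only want the columns shown in the question, you can subset result.df. A sketch under the assumption that data.frame() has converted the original column names (e.g. "Lower CI") into syntactic ones such as Lower.CI; check names(result.df) and adjust if your MuMIn version labels them differently:
names(result.df)                                  # confirm the exact column names first
out <- result.df[, c("Model.No", "Coefs", "Estimate", "Lower.CI", "Upper.CI")]
out[order(out$Coefs, out$Model.No), ]             # the three runs of each variable, grouped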
