I am using the MuMIn package in R to select the best model for my data. Here is an example using the Cement data set provided with the package.
require(MuMIn)
data(Cement)
d <- data.frame(Cement)
idx <- seq(11, 13)
avgmod.95p <- list()
for (i in 1:length(idx)) {
  d2 <- d[1:idx[i], ]
  fm1 <- lm(y ~ ., data = d2)
  dd <- dredge(fm1, extra = c("R^2", F = function(x)
    summary(x)$fstatistic[[1]]))
  # 95% confidence set:
  confset.95p <- get.models(dd, cumsum(weight) <= .95)
  avgmod.95p[[i]] <- model.avg(confset.95p)
}
As you can see, I'm running an iteration loop to construct the model-averaged estimates for the dataset (whose length I alter here, for illustration). The variable avgmod.95p returns:
> avgmod.95p[[1]][3]
$avg.model
Estimate Std. Error Adjusted SE Lower CI Upper CI
(Intercept) 56.1637849 15.06079485 15.15303057 26.4643908 85.8631791
X1 1.4810616 0.14016773 0.16302190 1.1615446 1.8005787
X2 0.6850913 0.05397343 0.06358329 0.5604704 0.8097123
X4 -0.6063184 0.05919637 0.06964775 -0.7428255 -0.4698113
X3 0.2126228 0.19480789 0.23502854 -0.2480246 0.6732703
which includes the estimated parameters and the lower and upper confidence intervals.
How do I combine all of the outputs from the iteration loop into one data.frame, for example:
Variable Estimate Lower CI Upper CI
X1 1.4810616 1.1615446 1.8005787
X1 ... ... ...
X1 ... ... ...
X2 ... ... ...
i.e. I would have three rows for each of X1, X2, X3 and so on, where three is the number of iterations in the loop.
How can this be done? I have tried:
do.call(rbind.data.frame, avgmod.95p)
but it doesn't work; it throws an error.
You are assigning the results to a list, so let's use lapply.
# get the number of rows for each model
no.of.rows <- unlist(lapply(avgmod.95p, function(x) nrow(x$avg.model)))
# use lapply again to rbind the models
foo <- do.call(rbind, lapply(avgmod.95p, function(x) x$avg.model))
Now make it into a nice data.frame, using no.of.rows to indicate which model each row came from:
result.df <- data.frame(Model.No = rep(seq_along(no.of.rows), no.of.rows),
                        Coefs = rownames(foo),
                        foo)
If you modify the index in your for loop to assign names to the elements of avgmod.95p, those names will carry through, and you can use them in place of the numeric Model.No.
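If you want just the columns from your desired output, you can subset result.df (note that data.frame() turns the spaces in "Lower CI" and "Upper CI" into dots):
result.df[, c("Model.No", "Coefs", "Estimate", "Lower.CI", "Upper.CI")]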
I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame, fits a model, then extracts the regression coefficient (not the intercept) and its sign, and stores them in 2 vectors. Here's my code:
for (x in 1:7)
{
  fit <- lm(distance ~ FAA_unique_with_duration_filtered[x],
            data = FAA_unique_with_duration_filtered)
  coeff_values <- summary(fit)$coefficients[, 1]
  coeff_value <- coeff_values[2]
  append(coeff_value_vector, coeff_value, after = length(coeff_value_vector))
  append(RCs_sign_vector, sign(coeff_values[2]), after = length(RCs_sign_vector))
}
Here, x will use the first column, then the 2nd, and so on. However, I am getting the following error.
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~ ., a[c(5, i)]))[[2]]  # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use the following, where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
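Applied to your data, the same idea looks something like this (a sketch; it assumes distance is a column of FAA_unique_with_duration_filtered and the remaining seven columns are the predictors, so adjust the names as needed):
dat <- FAA_unique_with_duration_filtered
predictors <- setdiff(names(dat), "distance")   # assumed: your 7 predictors
coeff_value_vector <- sapply(predictors, function(v)
  coef(lm(reformulate(v, response = "distance"), data = dat))[[2]])
RCs_sign_vector <- sign(coeff_value_vector)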
I am using the randomForest package in R, but am not partial to solutions using other packages.
My RF model uses various continuous and categorical variables to predict extinction risk (Threatened, Non_Threatened). I would like to show the direction of variable importance for the predictors used in my RF model. Other publications have done exactly this: see Figure 1 in https://www.pnas.org/content/pnas/109/9/3395.full.pdf
Any ideas on how to do something similar? One suggestion I read was to simply compare the difference between two partial dependence plots (example below), but I feel this may not be the best way.
Any help would be greatly appreciated.
partialPlot(final_rf, rf_train, size_mat,"Threatened")
partialPlot(final_rf, rf_train, size_mat,"Non_Threatened")
[The two calls produce partial dependence plots of size_mat, one for the Threatened class and one for the Non_Threatened class.]
You could use something like an average marginal effect (or like below, an average first difference) approach.
First, I'll make some data:
set.seed(11)
n <- 200
p <- 5
X <- data.frame(matrix(runif(n * p), ncol = p))
yhat <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 -
  10 * X[, 4] - 5 * X[, 5]
y <- as.numeric((yhat + rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X, y))
Next, we'll estimate the RF model:
library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)
Next, we can loop through each variable; each time through the loop, we add one standard deviation to a single x variable for all observations. (In your case, for categorical variables you could instead switch from one category to another.) We then predict the probability of a positive response under both conditions, the original data and the data with a standard deviation added, take the difference, and summarize it.
nx <- names(df)
nx <- nx[-which(nx == "y")]
# predicted probabilities on the original data
p1 <- predict(rf, newdata = df, type = "prob")
res <- NULL
for (i in 1:length(nx)) {
  df2 <- df
  # add one standard deviation to the i-th predictor
  df2[[nx[i]]] <- df2[[nx[i]]] + sd(df2[[nx[i]]])
  p2 <- predict(rf, newdata = df2, type = "prob")
  diff <- (p2 - p1)[, 2]
  res <- rbind(res, c(mean(diff), sd(diff)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res
# effect sd
# X1 0.11079 0.18491252
# X2 0.10265 0.16552070
# X3 0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836
I have written my own function for MLR. However, there seems to be an issue with the output (see the examples at the end). When I run the code line by line, though, the output is correct.
mlr <- function(dependentvar, dataset) {
  x <- model.matrix(dependentvar ~ ., dataset)        # design matrix for x
  y <- dependentvar                                   # dependent variable
  betas <- solve(crossprod(x)) %*% crossprod(x, y)    # beta values
  SST <- t(y) %*% y - sum(y)^2 / dim(dataset)[1]      # total sum of squares
  SSres <- t(y) %*% y - t(betas) %*% crossprod(x, y)  # sum of squares of residuals
  SSreg <- SST - SSres                                # regression sum of squares
  sigmasqr <- SSres / (length(y) - dim(dataset)[2])   # variance (MSE)
  varofbeta <- sigmasqr[1] * solve(crossprod(x))      # variance of beta
  cat("SST:", SST, "SSresiduals:", SSres, "SSregression:", SSreg, sep = "\n")
  return(betas)
}
To see the problem, try
mlr(trees$Height, trees)
I get the same problem even if I get rid of the $:
Height <- trees$Height
mlr(Height, trees)
Use the following:
x <- model.matrix(reformulate(".", dependentvar), dataset)
y <- dataset[[dependentvar]]
and pass dependentvar in as a string. (The problem with the original is that in dependentvar ~ ., the dot expands to every column of dataset, including the response itself, so the response ends up on both sides of the regression.)
Example:
mlr("Height", trees)
I am trying to fit a model for each segment in my data using the lm() function in conjunction with the plyr package because my data is grouped by a key.
I've managed to run the model and get the coefficients along with the R^2 and adjusted r-squared, but I am struggling with the p-values.
library("plyr")
#Sample data
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
y = c(100,180,120,60,140,200,220,240,260,280),
x1 = c(50,60,79,85,90,133,140,120,160,170),
x2 = c(20,18,47,16,15,25,30,25,20,15))
#model
model_1 <- dlply(test_data, .(key),
function(test_data) lm(y ~ x1 + x2,data = test_data))
#coefficients
ldply(model_1, coef)
#adj r-squared
ldply(model_1, function(x) summary(x)$adj.r.squared)
I've tried the following, which gets me the key and the p-value, but it doesn't have the names of the variables, which I need to be able to merge the output with the coefficients from the model later.
#p-values but missing the variable names
ldply(model_1, function(x) summary(x)$coefficients)[,c(1,5)]
I've tried fitting the models using do() and then tidy() (from dplyr and broom), and this works fine with a small data set because it returns everything I need, but my actual data contains over 1,000 different segments and RStudio ends up crashing.
I'm using the "dplyr" package to formatting the output. In the function that you use inside the "dlply" function you should use summary() to the lm(), so when you call "coef" it will also include the p.values.
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
y = c(100,180,120,60,140,200,220,240,260,280),
x1 = c(50,60,79,85,90,133,140,120,160,170),
x2 = c(20,18,47,16,15,25,30,25,20,15))
model <- by(test_data, test_data$key, function(x) summary(lm(y ~ x1 + x2, x)))
R2 <- t(data.frame(lapply(model, function(x) x$adj.r.squared)))
colnames(R2) <- "R2_adj"
R2
R2_adj
a -0.8939647
b 0.4292186
Co <- as.data.frame(t(data.frame(lapply(model, function(x) x$coef))))
colnames(Co) <- c("intercept", "x1", "x2")
library(dplyr)
Co %>%
  mutate(key = substr(rownames(Co), 1, 1),
         variable = substr(rownames(Co), 3, 12)) %>%
  select(key, variable, intercept, x1, x2)
key variable intercept x1 x2
1 a Estimate 162.1822438 -0.6037364 0.07628315
2 a Std..Error 141.3436897 1.8054132 2.29385395
3 a t.value 1.1474318 -0.3344035 0.03325545
4 a Pr...t.. 0.3699423 0.7698867 0.97649134
5 b Estimate 271.0532276 0.3624009 -3.62853907
6 b Std..Error 196.2769562 0.9166979 3.25911570
7 b t.value 1.3809733 0.3953330 -1.11335080
8 b Pr...t.. 0.3013515 0.7307786 0.38142882
No need for plyr I think, sapply will do just fine.
sapply(model_1, function(x) summary(x)$coefficients[, 4])
a b
(Intercept) 0.3699423 0.3013515
x1 0.7698867 0.7307786
x2 0.9764913 0.3814288
And t() will get those in the same configuration as your estimates.
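For example, with the model_1 list from the question:
# transpose so each key is a row, matching the layout of ldply(model_1, coef)
t(sapply(model_1, function(x) summary(x)$coefficients[, 4]))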
By the way, you may want to look at the multidplyr package, which might let you use tidy and dplyr::do after all by spreading your 1,000+ segments across multiple cores.
I have an outcome variable, say Y, and a list of 100 dimensions that could affect Y (say X1...X100).
After running my glm and viewing a summary of my model, I see those variables that are statistically significant. I would like to be able to select those variables and run another model and compare performance. Is there a way I can parse the model summary and select only the ones that are significant?
Although #kith paved the way, there is more that can be done. Actually, the whole process can be automated. First, let's create some data:
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y <- rnorm(10)
x4 <- y + 5 # this will make a nice significant variable to test our code
(mydata <- as.data.frame(cbind(x1,x2,x3,x4,y)))
Our model is then:
model <- glm(formula=y~x1+x2+x3+x4,data=mydata)
And the Boolean vector of the coefficients can indeed be extracted by:
toselect.x <- summary(model)$coeff[-1,4] < 0.05 # credit to kith
But this is not all! In addition, we can do this:
# select sig. variables
relevant.x <- names(toselect.x)[toselect.x == TRUE]
# formula with only sig variables
sig.formula <- as.formula(paste("y ~",relevant.x))
EDIT: as subsequent posters have pointed out, the latter line should be sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+"))) to include all variables.
And run the regression with only significant variables as OP originally wanted:
sig.model <- glm(formula=sig.formula,data=mydata)
In this case the estimate will be equal to 1, as we have defined x4 as y + 5, implying a perfect relationship.
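Since the OP also wanted to compare performance, one simple follow-up is to compare the two fits directly (a sketch; it assumes both models were fit to the same data, as they are here):
AIC(model, sig.model)                    # lower AIC is better
anova(sig.model, model, test = "Chisq")  # deviance test for the nested models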
You can access the p-values of the glm result through the summary function. The last column of the coefficients matrix is called "Pr(>|t|)" and holds the p-values of the terms used in the model.
Here's an example:
# x is a 10 x 3 matrix
x <- matrix(rnorm(3 * 10), ncol = 3)
y <- rnorm(10)
res <- glm(y ~ x)
# ignore the intercept pval
summary(res)$coeff[-1, 4] < 0.05
For people having issues with Maxim.K's line
sig.formula <- as.formula(paste("y ~",relevant.x))
use this instead:
sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+")))
The final code will look like this:
toselect.x <- summary(glmText)$coeff[-1,4] < 0.05 # credit to kith
# select sig. variables
relevant.x <- names(toselect.x)[toselect.x == TRUE]
# formula with only sig variables
sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+")))
This fixes the bug where only the first significant variable gets picked.
In
sig.formula <- as.formula(paste("y ~", relevant.x))
only the first variable of relevant.x is pasted into the formula; the others are ignored, because paste() is vectorized and returns one string per variable. (Try, for example, inverting the condition to > 0.5 to see this.)
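A quick demonstration of the difference (with a hypothetical relevant.x):
relevant.x <- c("x1", "x4")                      # hypothetical selection
paste("y ~", relevant.x)                         # "y ~ x1" "y ~ x4"  -- a vector
paste("y ~", paste(relevant.x, collapse = "+"))  # "y ~ x1+x4"       -- one string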