I ran several time-series regressions (one for every year) and now I'd like to generate a table similar to what coef() returns, but also with the level of significance (the stars), the R-squared and the F-statistic for each year, looking somewhat like this:
b0 b1 b2 b3 b4 R-sq. F-stat.
2010 ...*
2011 ...
2012 ...**
So far I have tried mtable() from the memisc package, which gives me years as columns and coefficients as rows, but I'd prefer the result to be "transposed" (like above).
Since we don't have access to your data or the code you used to run your models, I created my own dummy models using the mtcars dataset:
data("mtcars")
model1 <- lm(mpg ~ wt + cyl, data = mtcars)
model2 <- lm(mpg ~ wt + cyl + hp, data = mtcars)
For future reference, you'll always want to supply some of your data using, for example, dput(head(my_dataframe, 20)). You should also post more of the code you used to get where you are; in fact, the minimum amount of code needed to reproduce your problem. You may want to read How to Create a Great R Reproducible Example for more information; it just helps others help you.
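For instance (using mtcars as a stand-in for your own data), a call like this prints a representation of the first few rows that you can paste straight into your question:
dput(head(mtcars, 5))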
Then I rigged up the following (clumsy) function that I think does roughly what you're looking for. In any event, it should get you started in the right direction:
get_row <- function(x, coef_names) {
  coef_mat <- coef(summary(x))
  these_coef_names <- rownames(coef_mat)
  rows <- match(coef_names, these_coef_names)
  p <- coef_mat[rows, 4]
  # significance stars: * p < 0.05, ** p < 0.01, *** p < 0.001
  stars <- c("", "*", "**", "***")[(p < 0.05) + (p < 0.01) + (p < 0.001) + 1]
  coefs <- round(coef_mat[rows, 1], 3)
  output <- paste0(coefs, stars)
  # coefficients missing from a model become NA
  output <- ifelse(grepl("NA", output), NA, output)
  return(output)
}
get_table <- function(...) {
  models <- list(...)
  if (any(lapply(models, class) != "lm")) {
    stop("This function has only been tested with lm objects.")
  }
  coef_names <- unique(unlist(sapply(models, variable.names)))
  coef_table <- t(sapply(models, get_row, coef_names))
  colnames(coef_table) <- coef_names
  return(coef_table)
}
get_table(model1, model2)
# (Intercept) wt cyl hp
# [1,] "39.686***" "-3.191***" "-1.508**" NA
# [2,] "38.752***" "-3.167***" "-0.942" "-0.018"
I have a linear model with many explanatory variables (independent variables)
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linearly dependent on each other (multicollinearity).
I want to automatically find the name of the explanatory variable with the highest VIF (x2, for example), delete it from the formula, and then rerun lm with the new formula:
model <- lm(y ~ x1 + x3 + ... + x100)
I have already worked out how to retrieve the name of the explanatory variable with the highest VIF:
library(car) # vif() comes from the car package
max_vif <- function(x) {
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
But I still don't understand how to search the needed explaining variable, delete it from the formula and run the function again.
We can use the update function and paste in the variable that needs to be removed. We can first fit a model, and then use update to change that model's formula. The model formula can be expressed as a character string, which allows you to concatenate the general formula .~. and whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x) {
  # find variable(s) needing to be removed
  rm_var <- max_vif(x)
  # concatenate with "-" to remove variable(s) from formula
  rm_var <- paste(paste0("-", rm_var), collapse = " ")
  # update model
  update(x, paste0(".~.", rm_var))
}
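A quick usage sketch (hedged: it assumes the car package is loaded for vif(), and reuses max_vif() from the question together with the mtcars-based fit from above):
library(car)                      # vif() comes from the car package
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
fit_rm <- rm_max_vif(fit1)        # refit without the highest-VIF predictor
formula(fit_rm)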
Problem solved!
I created a list containing all the variables for the lm model:
Price <- list(y,x1,...,x100)
Then I used a different way of setting up the lm model:
model <- lm(y ~ ., data = Price)
So we can just delete the variable with the highest VIF from the Price list.
With the function I already came up with, the code will be:
Price <- list(y,x1,x2,...,x100)
model <- lm(y ~ ., data = Price)
max_vif <- function(x) { # Function for finding the name of the variable with the highest VIF
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
n <- max(data.frame(vif(model)))
while (n >= 5) { # Delete the variable with the highest VIF from `Price`, one at a time, until no VIF is 5 or higher
  m <- max_vif(model)              # name of the variable with the highest VIF
  Price[[m]] <- NULL               # drop it from the data
  model <- lm(y ~ ., data = Price) # refit the model without it
  n <- max(data.frame(vif(model)))
}
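For reference, here is a small self-contained sketch of the same idea on mtcars (hedged: it assumes the car package, and uses a data frame rather than a list, which is the more common way to pass data to lm and vif):
library(car)
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]
model <- lm(mpg ~ ., data = dat)
while (max(vif(model)) >= 5) {
  worst <- names(which.max(vif(model)))  # predictor with the highest VIF
  dat[[worst]] <- NULL                   # drop it and refit
  model <- lm(mpg ~ ., data = dat)
}
coef(model)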
I use the mtcars data to explain my problem. I am trying to estimate the regression coefficient for the variable cyl with mpg as the dependent variable, and to assess how the coefficient changes when other variables are added.
Step 1: fit lm(mpg ~ cyl, data = df) to get the crude coefficient for cyl.
Step 2: add each of the other variables, one at a time, to the model from step 1, choose the one with the largest change (%) in the coefficient of cyl, and add that variable to the model.
Step 3: repeat step 2, adding each of the remaining variables to the model above, and again keep the one with the largest change in the coefficient of cyl.
Steps ....: repeat until all variables are included in the model.
library(dplyr)
df <- mtcars %>% select(mpg, cyl, disp, hp, wt)
my_fun1 <- function(df=data) {
  out_df <- data.frame(matrix(ncol = 0, nrow = (length(df) - 1)))
  md_1 <- lm(mpg ~ cyl, data = df)
  out_df$Models[1] <- "Crude"
  out_df$Estimate[1] <- md_1$coefficients[2]
  pre_change <- 0
  to_rm <- 0
  for (k in 2:(length(df)-1)) {
    for (i in 3:length(df)) {
      if (!i %in% to_rm) {
        md_tmp <- update(md_1, . ~ . + df[[i]])
        change <- abs(100*(md_tmp$coefficients[2] - md_1$coefficients[2])/md_1$coefficients[2])
        dif <- md_tmp$coefficients[2] - md_1$coefficients[2]
        if (change >= pre_change) {
          out_df$Estimate[k] <- md_tmp$coefficients[2]
          out_df$Models[k] <- paste("+", names(df)[[i]])
          out_df$Diff[k] <- md_tmp$coefficients[2] - md_1$coefficients[2]
          picked <- names(df)[[i]]
          picked_i = i
          pre_change <- out_df$`Change (%)`[k] <- change
        }
      }
    }
    to_rm <- c(to_rm, picked_i)
    md_1 <- update(md_1, .~. + eval(as.name(paste(picked))))
    pre_change = 0
  }
  out_df
}
my_fun1(df = df)
After running above, I expected to get regression coefficients of cyl at each step in the following format.
Models Estimate Diff Change (%)
1 Crude -2.875790 NA NA
2 + wt -1.507795 1.367995 47.56937
3 + hp -1.227420 0.280375 18.59504
4 + disp -1.227420 1.037274 45.80194
However, while steps 1 and 2 produce correct results, steps 3 and 4 are incorrect. Any suggestions would be appreciated.
You can probably make this a little easier by using R's vectorized operations and avoiding painful for loops.
my_fun2 <- function(dat, i) {
  fit <- lm(mpg ~ cyl, data=dat)
  crude <- fit$coef[2]
  # vectorized evaluation function
  # fits model and calculates coef and change
  evav <- Vectorize(function(i) {
    # create extension string from the "i"s
    cf.ext <- paste(names(dat)[i], collapse="+")
    # update formula with extensions
    beta <- update(fit, as.formula(
      paste0("mpg~cyl",
             # paste already accepted coefs in case they exist
             if (length(bests) != 0) {
               paste("", names(dat)[bests], sep="+", collapse="")
             },
             "+", cf.ext)
    ))$coe[2]
    # calculate Diff
    beta.d <- abs(crude - beta)
    # calculate Change %
    beta.d.perc <- 100 / crude*beta.d
    # set an attribute "cf.name" to be able to identify the coef later
    return(`attr<-`(c(beta=beta, beta.d=beta.d,
                      beta.d.perc=beta.d.perc),
                    "cf.name", cf.ext))
  }, SIMPLIFY=FALSE) # simplifying would strip off attributes
  # create empty vector bests
  bests <- c()
  # lapply evav() over the "i"s
  res <- lapply(i, function(...) {
    # run evav()
    i.res <- evav(i)
    # find largest change
    largest <- which.max(mapply(`[`, i.res, 2))
    # update "bests" vector within the function environment with `<<-`
    bests <<- c(bests, i[largest])
    # same with the "i"s
    i <<- i[-largest]
    return(i.res[[largest]])
  })
  # summarize everything into a matrix and give dimnames
  res <- `dimnames<-`(rbind(c(crude, NA, NA),
                            do.call(rbind, res)),
                      list(
                        c("crude",
                          paste0("+ ", mapply(attr, res, "cf.name"))),
                        c("Estimate", "Diff", "Change (%)")))
  return(res)
}
Usage
my_fun2(mtcars[c("mpg", "cyl", "disp", "hp", "wt")], i=3:5)
# Estimate Diff Change (%)
# crude -2.8757901 NA NA
# + wt -1.5077950 1.367995 -47.56937
# + hp -0.9416168 1.934173 -67.25711
# + disp -1.2933197 1.582470 -55.02733
Check
Checking the Diffs:
fit <- lm(mpg ~ cyl, data=mtcars[c("mpg", "cyl", "disp", "hp", "wt")])
sapply(c("disp", "hp", "wt"), function(x)
unname(abs(fit$coe[2] - update(fit, as.formula(paste("mpg~cyl+", x)))$coe[2])))
# disp hp wt
# 1.2885133 0.6110965 1.3679952
sapply(c("disp", "hp"), function(x)
unname(abs(fit$coe[2] - update(fit, as.formula(paste("mpg~cyl+wt+", x)))$coe[2])))
# disp hp
# 1.090847 1.934173
sapply(c("disp"), function(x)
unname(abs(fit$coe[2] - update(fit, as.formula(paste("mpg~cyl+wt+hp+", x)))$coe[2])))
# disp
# 1.58247
Should look fine.
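As a quick sanity check on the Change (%) column: it is computed as 100 / crude * Diff, so it carries the sign of the crude coefficient (which is negative here):
100 / -2.8757901 * c(1.367995, 1.934173, 1.582470)
# [1] -47.56937 -67.25711 -55.02733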
Is there an argument, either for lm() or for some other linear-regression function, that sets the reference group to always be the biggest group rather than lm()'s alphabetical/numerical default?
As this is often done in statistics, I suspect I keep missing it when I search the documentation, or that I'm looking in the wrong places. Any help would be appreciated!
Below, even when wrapped in a user-defined function, is what I'd like NOT to have to keep doing.
data(mtcars) # load dataset
mtcars <- mtcars[1:31, ] # remove a row so that there is a single biggest group
lm(mpg ~ gear + carb + disp, data = mtcars) # carb's group 1 is the reference by default
mtcars$carb <- as.factor(mtcars$carb)
mtcars <- within(mtcars, carb <- relevel(carb, ref = "4")) # set carb's group 4 as the reference
lm(mpg ~ gear+carb+disp, data = mtcars )
I don't believe there is a built-in function to do that but it's not that difficult to write one.
largest_ref <- function(DF, col) {
  DF[[col]] <- factor(DF[[col]])
  tbl <- table(DF[[col]])
  largest <- names(tbl)[which.max(tbl)]
  DF[[col]] <- relevel(DF[[col]], ref = largest)
  DF
}
Now I will reload the test dataset and make a copy of it. Then I run regressions on both datasets: the one releveled by your code and the one releveled by the function above.
data(mtcars)
mtcars <- mtcars[1:31, ]
mtc <- mtcars
mtcars$carb <- as.factor(mtcars$carb)
mtcars <- within(mtcars, carb <- relevel(carb, ref = "4")) # set carb's group 4 as the reference
fit1 <- lm(mpg ~ gear + carb + disp, data = mtcars)
mtc <- largest_ref(mtc, "carb")
fit2 <- lm(mpg ~ gear + carb + disp, data = mtc)
identical(coef(fit1), coef(fit2))
#[1] TRUE
As you can see, the results are the same. You can further verify it with the following (output omitted).
summary(fit1)
summary(fit2)
It doesn't look like lm has any option for this, but you can just create a wrapper function that reorders the levels of a factor according to frequency, then use it in the formula.
big.ref <- function(x) {
  if (!is.factor(x)) x <- factor(x)
  counts <- sort(table(x), decreasing = TRUE)
  relevel(x, ref = names(counts)[1])
}
lm(mpg ~ gear + big.ref(carb) + disp, data = mtcars )
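If you prefer plain coefficient names (carb2, carb3, ...) in the output, the same helper can be applied to the column once, up front; a small hedged variation:
mtcars$carb <- big.ref(mtcars$carb)   # relevel once so the largest group is the reference
lm(mpg ~ gear + carb + disp, data = mtcars)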
I am trying to model an outcome as a function of several exposures, adjusting the models for any covariates that may be confounders (≥ 10% change in the outcome coefficient when added to the model). I am looking at many covariates as potential confounders, so I have created a dataframe with all of them and am using lapply (the outcome and exposures are in a separate dataframe which has already been attached). To make sorting through all my outputs easier, I have tried to write a function which will only display the output if the covariate is a confounder. The exposures and their number differ in each model, so I find myself having to write code like the below each time I run my analyses, but I know there must be an easier way. Would there be a function I could write to just lapply with, using the model without confounders and the Covariates dataframe as arguments? Thanks!
lapply(Covariates, function(x) {
  model <- summary(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
  if ((model$coefficients[2, 1] - summary(lm(Outcome ~ Exposure))$coefficients[2, 1]) /
      model$coefficients[2, 1] >= .1)
    return(model)
})
I have written a function to solve this problem!
library(broom) # tidy() comes from the broom package
confounder <- function(model) {
  model.sum <- summary(model)
  model.b <- model.sum$coefficients[2, 1]       # exposure coefficient with the covariate included
  oldmodel <- update(model, . ~ . - x)          # refit without the covariate x
  oldmodel.sum <- summary(oldmodel)
  oldmodel.b <- oldmodel.sum$coefficients[2, 1] # exposure coefficient without the covariate
  model.frame <- tidy(model.sum)
  model.sub <- subset(model.frame, term == "x")
  model.sub.b <- model.sub[, 5]                 # p-value of the covariate
  if ((model.b - oldmodel.b) / model.b >= .1 |
      model.sub.b < .05)
    return(model.sum)
}
I then lapply this function to the model:
lapply(Covariates, function(x) {
confounder(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
})
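For a self-contained illustration, here is a minimal sketch of the same idea using mtcars (hedged stand-ins: mpg for the outcome, wt for a single exposure, and cyl, disp and hp as candidate covariates):
covariates <- mtcars[, c("cyl", "disp", "hp")]
base_fit <- lm(mpg ~ wt, data = mtcars)      # model without the covariate
results <- lapply(covariates, function(x) {
  fit <- lm(mpg ~ wt + x, data = mtcars)     # model with the covariate added
  change <- abs((coef(fit)["wt"] - coef(base_fit)["wt"]) / coef(base_fit)["wt"])
  if (change >= 0.1) summary(fit)            # keep only >= 10% changes in the wt coefficient
})
Filter(Negate(is.null), results)             # drop covariates that were not confounders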
R users, I've still been hashing out bits and pieces related to my initial question as seen here and now I'm quite stuck.
http://stackoverflow.com/questions/12270578/skipping-over-an-error-warning-in-an-lme-loop-in-r
Here is the code, using mtcars as an example dataset. I want to save the lower and upper confidence intervals for every response variable listed below (but not for the intercept, though I still need its other coefficients for the lme; I've got that already), all in one go (my real dataset is very large and I'm trying to automate this as much as possible).
library(nlme)
library(lattice)
responseVariables = c("mpg",
"disp",
"hp")
carModels <- list()
carModelNames <- list()
coint<- list()
coint2<- list()
coint$fixed <- list()
lower<- list()
upper<- list()
carCIlower <- list()
carCIupper <- list()
for (i in responseVariables){
print("Doing: ")
print(i)
mtcars$tmp <- as.numeric(mtcars[,i])
tmpLme <- lme( log10(tmp) ~ I(log10(wt)), random = ~1 | carb / gear / am, data=mtcars,na.action=na.omit )
carModels <- append(carModels, list(tmpLme))
carModelNames <- append(carModelNames,i)
coint <- try(intervals(tmpLme))
if (inherits(coint, "try-error")) {
tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb / gear, data=mtcars, na.action=na.omit);
coint <- try(intervals(tmpLme));
} else if (inherits(coint, "try-error")) {
tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb / gear, data=mtcars, na.action=na.omit, method="ML");
coint <- try(intervals(tmpLme));
} else if (inherits(coint, "try-error")) {
tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb, data=mtcars, na.action=na.omit);
coint <- try(intervals(tmpLme));
#}
coint2<- append(coint, list(tmpLme))
lower <- dim(coint2$fixed)[1]
upper <- dim(coint2$fixed)[1]
carCIlower <- append(carCIlower, coint2$fixed[2,1],lower)
carCIupper <- append(carCIupper, coint2$fixed[2,3],upper)
vs_wt <- cbind(carModelNames , carCIlower , carCIlower )
}
}
Currently I can get the CI values if I run the commands manually for each response variable, but not as part of the loop. The loop doesn't proceed past the coint2 statement; well, it does, but it doesn't give me answers for coint2 and beyond. If I run those lines again, it only gives me values for the last item in the loop (i = hp). Also, I see there's an unbalanced curly bracket (I've commented it out to show you), but if I use it, I get the lower CI for "disp" as the lower and upper CI for all response variables.
Can someone point out what's missing?
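For what it's worth, a minimal sketch (untested, and keeping only the first fallback model for brevity) of how the try/fallback block might be closed so that the interval extraction runs on every iteration of the loop:
coint <- try(intervals(tmpLme))
if (inherits(coint, "try-error")) {
  # refit with a simpler random structure and retry
  tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb / gear,
                data = mtcars, na.action = na.omit)
  coint <- try(intervals(tmpLme))
}
# with the if closed here, the extraction below runs for every response variable
carCIlower <- append(carCIlower, coint$fixed[2, 1])
carCIupper <- append(carCIupper, coint$fixed[2, 3])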