How to show reference level from lm output? - r

I'd like to add a reference level to the final output of linear regression output lm().
For example:
levels(iris$Species)
"setosa" "versicolor" "virginica"
summary(lm(Sepal.Length ~ Petal.Width + Species, iris))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.78044 0.08308 57.543 < 2e-16 ***
Petal.Width 0.91690 0.19386 4.730 5.25e-06 ***
Speciesversicolor -0.06025 0.23041 -0.262 0.794
Speciesvirginica -0.05009 0.35823 -0.140 0.889
I'd like to have it like:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.78044 0.08308 57.543 < 2e-16 ***
Petal.Width 0.91690 0.19386 4.730 5.25e-06 ***
Speciessetosa
Speciesversicolor -0.06025 0.23041 -0.262 0.794
Speciesvirginica -0.05009 0.35823 -0.140 0.889
I've been looking for it a lot for a while but no clues yet. Any help would be highly appreciated.
#EDIT
Data for further expansion:
iris$Petal.Width <- as.factor(ifelse(iris$Petal.Width >1, "Big", "Small"))
levels(iris$Petal.Width)
"Big" "Small"

Here is a basic workflow you can work off of, is uses dplyr and broom to join your levels with your coefficients table. Right now it requires you know which variables are factors. You could change the NA to "" if you prefer. It also organizes the output alphabetically which will not always put the reference group first. Let me know if you have any issues with scaling of this:
library(broom)
library(dplyr)
iris <- datasets::iris
iris$Petal.Width <- factor(ifelse(iris$Petal.Width > 1, "Big", "Small"), levels = c("Small", "Big"))
reg_obj <- lm(Sepal.Length ~ Petal.Width + Species, iris)
factor_levels <- tibble(term = c(paste0("Species", levels(iris$Species)),
paste0("Petal.Width", levels(iris$Petal.Width))))
full_join(tidy(reg_obj), factor_levels, by = "term") %>%
arrange(term)
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.01 0.0709 70.6 1.03e-114
2 Petal.WidthBig 0.607 0.204 2.97 3.51e- 3
3 Petal.WidthSmall NA NA NA NA
4 Speciessetosa NA NA NA NA
5 Speciesversicolor 0.408 0.202 2.02 4.55e- 2
6 Speciesvirginica 0.975 0.228 4.28 3.33e- 5

This produces the desired output:
res <- capture.output(summary(lm(Sepal.Length ~ Petal.Width + Species, data = iris)))
res[14:22] <- res[13:21]
res[13] <- "Speciessetosa"
cat(res, sep = "\n")

Related

how to save the output of regression in a table in R

I want to save my output regression of lmer() from lme4 R package. Is there any good way for this to get the output below in a table e.g .csv or .txt or .html etc?
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 103.989 5.617 139.000 18.52 < 2e‐16 ***
age ‐0.172 0.177 139.000 ‐1.03 0.304
bmi 0.597 0.229 139.000 2.56 0.012 *
gender 1.019 0.325 139.000 3.15 0.002 **
I tried, tab_model() from library sjplot in R, but it does not give the SE, df and t values. I would like to save the output above. I appreciate any advice.
Make sure the class of your model object is lmerMod and it will work with stargazer, which exports beautiful formatted regression tables to plain text, html, latex, etc. and has all sort of options to customize those tables (see the docs).
# class(mod)<- "lmerMod"
mod <- lme4::lmer(Ozone ~ Temp + (1|Month),
data = airquality)
stargazer::stargazer(mod)
stargazer::stargazer(mod, type = "html")
Update:to write to textfile:
library(lme4)
m1 <- lmer(drat ~ wt + (1 + wt|cyl), data=mtcars)
library(broom.mixed)
library(dplyr)
df<- m1 %>%
tidy()
write.table(df,"filename.txt",sep="\t",row.names=FALSE)
OR
m1 %>%
tidy() %>%
write.table(.,"filename.txt",sep="\t",row.names=FALSE)
"effect" "group" "term" "estimate" "std.error" "statistic"
"fixed" NA "(Intercept)" 4.67281034450577 0.344833957358875 13.5508996280279
"fixed" NA "wt" -0.344238767944164 0.0911701519816392 -3.77578363600283
"ran_pars" "cyl" "sd__(Intercept)" 0.374914148920673 NA NA
"ran_pars" "cyl" "cor__(Intercept).wt" -1 NA NA
"ran_pars" "cyl" "sd__wt" 0.0839046849277359 NA NA
"ran_pars" "Residual" "sd__Observation" 0.370192153038516 NA NA
One way could be using broom.mixed package as suggested by #
user63230 in the comments section:
Here is an example:
library(lme4)
m1 <- lmer(drat ~ wt + (1 + wt|cyl), data=mtcars)
library(broom.mixed)
library(dplyr)
m1 %>%
tidy()
effect group term estimate std.error statistic
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 fixed NA (Intercept) 4.67 0.345 13.6
2 fixed NA wt -0.344 0.0912 -3.78
3 ran_pars cyl sd__(Intercept) 0.375 NA NA
4 ran_pars cyl cor__(Intercept).wt -1 NA NA
5 ran_pars cyl sd__wt 0.0839 NA NA
6 ran_pars Residual sd__Observation 0.370 NA NA

How to write a function that will run multiple regression models of the same type with different dependent variables and then store them as lists?

I am trying to write a function that will run multiple regressions and then store the outputs in a vector. What I want is for the function to pick the dependent variables from a list that I will provide, and then run the regressions on the same right hand-side variables. Not sure how to go about doing this. Any hints will be appreciated.
my_data <- data.frame(x1=(1:10) + rnorm(10, 3, 1.5), x2=25/3 + rnorm(10, 0, 1),
dep.var1=seq(5, 28, 2.5), dep.var2=seq(100, -20, -12.5),
dep.var3=seq(1, 25, 2.5))
## The following is a list that tells the function
dep.var <- list(dep.var1=my_data$dep.var1, dep.var2=my_data$dep.var2)
## which dependent variables to use from my_data
all_models <- function(dep.var){lm(dep.var ~ x1 + x2, data=my_data)}
model <- sapply(dep.var, all_models) ## The "sapply" here tells the function to
## take the dependent variables from the list dep.var.
I want the "model" list to have two objects: model1 with dep.var1 and model2 with dep.var2. Then as required, I will use summary(model#) to see the regression output.
I know that this in theory works when a vector is used (i.e., p):
p <- seq(0.25, 0.95, 0.05)
s <- function(p) {1 - pnorm(35, p*1*44, sqrt(44)*sqrt(p*(1 - p)))}
f <- sapply(p, s)
But I can't get the whole thing to work as required for my regression models. It works somewhat because you can run and check "model" and it will show you the two regression outputs - but it is horrible. And the "model" does not show the regression specification, i.e., dep.var1 ~ x1 + x2.
Consider reformulate to dynamically change model formulas using character values for lm calls:
# VECTOR OF COLUMN NAMES (NOT VALUES)
dep.vars <- c("dep.var1", "dep.var2")
# USER-DEFINED METHOD TO PROCESS DIFFERENT DEP VAR
run_model <- function(dep.var) {
fml <- reformulate(c("x1", "x2"), dep.var)
lm(fml, data=data)
}
# NAMED LIST OF MODELS
all_models <- sapply(dep.vars, run_model, simplify = FALSE)
# OUTPUT RESULTS
all_models$dep.var1
all_models$dep.var2
...
From there, you can run further extractions or processes across model objects:
# NAMED LIST OF MODEL SUMMARIES
all_summaries <- lapply(all_models, summary)
all_summaries$dep.var1
all_summaries$dep.var2
...
# NAMED LIST OF MODEL COEFFICIENTS
all_coefficients <- lapply(all_models, `[`, "coefficients")
all_coefficients$dep.var1
all_coefficients$dep.var2
...
You could sapply over the names of the dependent vars which you could nicely identify with grep. In lm use reformulate to build the formula.
sapply(grep('^dep', names(my_data), value=TRUE), \(x)
lm(reformulate(c('x1', 'x2'), x), my_data))
# dep.var1 dep.var2 dep.var3
# coefficients numeric,3 numeric,3 numeric,3
# residuals numeric,10 numeric,10 numeric,10
# effects numeric,10 numeric,10 numeric,10
# rank 3 3 3
# fitted.values numeric,10 numeric,10 numeric,10
# assign integer,3 integer,3 integer,3
# qr qr,5 qr,5 qr,5
# df.residual 7 7 7
# xlevels list,0 list,0 list,0
# call expression expression expression
# terms dep.var1 ~ x1 + x2 dep.var2 ~ x1 + x2 dep.var3 ~ x1 + x2
# model data.frame,3 data.frame,3 data.frame,3
The dep.var* appear nicely in the result.
However, you probably want to use lapply and pipe it into setNames() to get the list elements named. Instead of grep you may of course define the dependent variables manually. To get a clean formular call stored, we use a trick once #g-grothendieck taught me using do.call.
dv <- as.list(grep('^dep', names(my_data), value=TRUE)[1:2])
res <- lapply(dv, \(x) {
f <- reformulate(c('x1', 'x2'), x)
do.call('lm', list(f, quote(my_data)))
}) |>
setNames(dv)
res
# $dep.var1
#
# Call:
# lm(formula = dep.var1 ~ x1 + x2, data = my_data)
#
# Coefficients:
# (Intercept) x1 x2
# -4.7450 2.3398 0.2747
#
#
# $dep.var2
#
# Call:
# lm(formula = dep.var2 ~ x1 + x2, data = my_data)
#
# Coefficients:
# (Intercept) x1 x2
# 148.725 -11.699 -1.373
This allows you to get the summary() of the objects, which probably is what you want.
summary(res$dep.var1)
# Call:
# lm(formula = dep.var1 ~ x1 + x2, data = my_data)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.8830 -1.8345 -0.2326 1.4335 4.2452
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -4.7450 7.2884 -0.651 0.536
# x1 2.3398 0.2836 8.251 7.48e-05 ***
# x2 0.2747 0.7526 0.365 0.726
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.55 on 7 degrees of freedom
# Multiple R-squared: 0.9117, Adjusted R-squared: 0.8865
# F-statistic: 36.14 on 2 and 7 DF, p-value: 0.0002046
Finally you could wrap it in a function
calc_models <- \(dv) {
lapply(dv, \(x) {
f <- reformulate(c('x1', 'x2'), x)
do.call('lm', list(f, quote(my_data)))
}) |>
setNames(dv)
}
calc_models(list('dep.var1', 'dep.var2'))
Here is a way how you could iterate through your dataframe and apply the function to the group you define (here dep.var) and save the different models in a dataframe:
library(tidyverse)
library(broom)
my_data %>%
pivot_longer(
starts_with("dep"),
names_to = "group",
values_to = "dep.var"
) %>%
mutate(group = as.factor(group)) %>%
group_by(group) %>%
group_split() %>%
map_dfr(.f = function(df) {
lm(dep.var ~ x1 + x2, data = df) %>%
tidy() %>% # first output
#glance() %>% # second output
add_column(group = unique(df$group), .before=1)
})
dataframe output:
# A tibble: 9 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var1 (Intercept) -5.29 11.6 -0.456 0.662
2 dep.var1 x1 2.11 0.268 7.87 0.000101
3 dep.var1 x2 0.538 1.23 0.437 0.675
4 dep.var2 (Intercept) 151. 57.9 2.61 0.0347
5 dep.var2 x1 -10.6 1.34 -7.87 0.000101
6 dep.var2 x2 -2.69 6.15 -0.437 0.675
7 dep.var3 (Intercept) -9.29 11.6 -0.802 0.449
8 dep.var3 x1 2.11 0.268 7.87 0.000101
9 dep.var3 x2 0.538 1.23 0.437 0.675
list output:
[[1]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var1 (Intercept) -5.29 11.6 -0.456 0.662
2 dep.var1 x1 2.11 0.268 7.87 0.000101
3 dep.var1 x2 0.538 1.23 0.437 0.675
[[2]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var2 (Intercept) 151. 57.9 2.61 0.0347
2 dep.var2 x1 -10.6 1.34 -7.87 0.000101
3 dep.var2 x2 -2.69 6.15 -0.437 0.675
[[3]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var3 (Intercept) -9.29 11.6 -0.802 0.449
2 dep.var3 x1 2.11 0.268 7.87 0.000101
3 dep.var3 x2 0.538 1.23 0.437 0.675
glance output:
group r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 dep.var1 0.927 0.906 2.32 44.3 0.000106 2 -20.8 49.7 50.9 37.8 7 10
2 dep.var2 0.927 0.906 11.6 44.3 0.000106 2 -36.9 81.9 83.1 944. 7 10
3 dep.var3 0.927 0.906 2.32 44.3 0.000106 2 -20.8 49.7 50.9 37.8 7 10

Include in data table regression coeffiecients, std.errors and Pvalues using R

how do I create a data.table in r with coefficient, std.err and Pvlaues with rqpd regression type? It's easy with the coefficients using summary(myregression)[2] but don't know how to get std.err and Pval. Thanks
Try with broom:
library(broom)
library(dplyr)
#Model
mod <- lm(Sepal.Length~.,data=iris)
#Broom
summaryobj <- tidy(mod)
Output:
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.17 0.280 7.76 1.43e-12
2 Sepal.Width 0.496 0.0861 5.76 4.87e- 8
3 Petal.Length 0.829 0.0685 12.1 1.07e-23
4 Petal.Width -0.315 0.151 -2.08 3.89e- 2
5 Speciesversicolor -0.724 0.240 -3.01 3.06e- 3
6 Speciesvirginica -1.02 0.334 -3.07 2.58e- 3
Found a solution that is working
summ <- summary(myregression, se = "boot")
summ
str(summ)
PValues <- summ$coefficients[,4]

Data frame couldn´t show

I try running the following code, but get error ar shown in the pictures below. Im quite new to R so dont know if its information to the case, but the first column in my data frame called "data" is dates. I get as.Dates.numeric(value) "origin" must be applied, my intuition says it got something to do with the date column, but then again, im a newbie. Just in case, the date column is not supposed to be a part of coef.vec.
v1 <- 2:7
coef.vec <- data.frame(NULL) # create object to keep results
for (i in seq_along(v1)) {
m <- summary(lm(data[,v1[i]] ~ data[,8])) # run model
coef.vec[i, 1] <- names(data)[v1[i]] # print variable name
coef.vec[i, 2] <- m$coefficients[1,1] # intercept
coef.vec[i, 3] <- m$coefficients[2,1] # coefficient
coef.vec[i, 4] <- mean(data[[i]]) # means of variables
}
names(coef.vec) <- c("y.variable", "intercept", "coef.x","variable.mean")
error1
error2
Try this approach using lapply for column 2 to 7 of your data.
coef.vec <- do.call(rbind, lapply(names(data)[2:7], function(x) {
m <- summary(lm(data[[x]] ~ data[[8]]))
data.frame(y.variable = x,
intercept = m$coefficients[1,1],
coef.x = m$coefficients[2,1],
variable.mean = mean(data[[x]]))
}))
We can construct the formula with reformulate, apply the lm, get the summary output with tidy from broom and create a single dataset
library(dplyr)
library(purrr)
library(broom)
map_dfr(names(data)[2:7], ~
tidy(lm(reformulate(names(data)[8], response = .x), data = data)))
Or this can be done in a single step without any loop
tidy(lm(cbind(iris[,1], iris[,2]) ~ Species, iris))
Or
tidy(lm(as.matrix(iris[1:2]) ~ Species, iris))
# A tibble: 6 x 6
# response term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 Sepal.Length (Intercept) 5.01 0.0728 68.8 1.13e-113
#2 Sepal.Length Speciesversicolor 0.93 0.103 9.03 8.77e- 16
#3 Sepal.Length Speciesvirginica 1.58 0.103 15.4 2.21e- 32
#4 Sepal.Width (Intercept) 3.43 0.0480 71.4 5.71e-116
#5 Sepal.Width Speciesversicolor -0.658 0.0679 -9.69 1.83e- 17
#6 Sepal.Width Speciesvirginica -0.454 0.0679 -6.68 4.54e- 10
and check the output from the loop
map_dfr(names(iris)[1:2], ~ tidy(lm(reformulate('Species', response = .x), data = iris)))
# A tibble: 6 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) 5.01 0.0728 68.8 1.13e-113
#2 Speciesversicolor 0.93 0.103 9.03 8.77e- 16
#3 Speciesvirginica 1.58 0.103 15.4 2.21e- 32
#4 (Intercept) 3.43 0.0480 71.4 5.71e-116
#5 Speciesversicolor -0.658 0.0679 -9.69 1.83e- 17
#6 Speciesvirginica -0.454 0.0679 -6.68 4.54e- 10

How to insert in a list in R the variable names of regressors which have pvalue below 5%

I'm trying to build a list in R that contains all the regressor names which have pvalue below the 5% threshold.
For example:
first regression
#gender (male - female)
regr1 <- lm(salary ~ female, data = test)
summary(regr1)
output first regression:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.855618 0.001888 453.24 <2e-16 ***
female -0.054514 0.003088 -17.65 <2e-16 ***
second regression:
#education (PhD - Master - Bachelor - HighSchool - None)
regr2 <- lm(salary ~ Master + Bachelor + HighSchool + None, data = test)
summary(regr2)
output second regression:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.91008 0.02829 32.164 < 2e-16 ***
Master -0.05446 0.02836 -1.920 0.054811 .
Bachelor -0.10291 0.02848 -3.613 0.000303 ***
HighSchool -0.10173 0.02911 -3.495 0.000475 ***
None -0.12590 0.02864 -4.396 1.11e-05 ***
The variable master is not significant, so i don't want it in the List.
This is teh list that I would like to get:
varnames <- c("female", "Bachelor", "HighSchool", "None")
You can use broom::tidy and then manipulate the table, like this:
library(tidyverse)
tab <- lm(data = mtcars, mpg ~ cyl + disp + hp) %>% summary() %>% broom::tidy()
tab
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 34.2 2.59 13.2 1.54e-13
2 cyl -1.23 0.797 -1.54 1.35e- 1
3 disp -0.0188 0.0104 -1.81 8.09e- 2
4 hp -0.0147 0.0147 -1.00 3.25e- 1
Then you filter the p.value column:
tab %>% filter(p.value < 0.05)
# A tibble: 1 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 34.2 2.59 13.2 1.54e-13
So now you can take the regressors name:
tab %>% filter(p.value < 0.05) %>% select(term) %>% as.character()
[1] "(Intercept)"
In base R you can do something like the following:
lr1 <- lm(Sepal.Length ~ ., data = iris)
coef_table <- coef(summary(tab))
row.names(coef_table)[coef_table[, "Pr(>|t|)"] < 0.001]
# "(Intercept)" "Sepal.Width" "Petal.Length"

Resources