How to write a function that will run multiple regression models of the same type with different dependent variables and then store them as lists? - r

I am trying to write a function that will run multiple regressions and then store the outputs in a vector. What I want is for the function to pick the dependent variables from a list that I will provide, and then run the regressions on the same right hand-side variables. Not sure how to go about doing this. Any hints will be appreciated.
my_data <- data.frame(x1=(1:10) + rnorm(10, 3, 1.5), x2=25/3 + rnorm(10, 0, 1),
dep.var1=seq(5, 28, 2.5), dep.var2=seq(100, -20, -12.5),
dep.var3=seq(1, 25, 2.5))
## The following is a list that tells the function
dep.var <- list(dep.var1=my_data$dep.var1, dep.var2=my_data$dep.var2)
## which dependent variables to use from my_data
all_models <- function(dep.var){lm(dep.var ~ x1 + x2, data=my_data)}
model <- sapply(dep.var, all_models) ## The "sapply" here tells the function to
## take the dependent variables from the list dep.var.
I want the "model" list to have two objects: model1 with dep.var1 and model2 with dep.var2. Then as required, I will use summary(model#) to see the regression output.
I know that this in theory works when a vector is used (i.e., p):
p <- seq(0.25, 0.95, 0.05)
s <- function(p) {1 - pnorm(35, p*1*44, sqrt(44)*sqrt(p*(1 - p)))}
f <- sapply(p, s)
But I can't get the whole thing to work as required for my regression models. It works somewhat because you can run and check "model" and it will show you the two regression outputs - but it is horrible. And the "model" does not show the regression specification, i.e., dep.var1 ~ x1 + x2.

Consider reformulate to dynamically change model formulas using character values for lm calls:
# VECTOR OF COLUMN NAMES (NOT VALUES)
dep.vars <- c("dep.var1", "dep.var2")
# USER-DEFINED METHOD TO PROCESS DIFFERENT DEP VAR
run_model <- function(dep.var) {
fml <- reformulate(c("x1", "x2"), dep.var)
lm(fml, data=data)
}
# NAMED LIST OF MODELS
all_models <- sapply(dep.vars, run_model, simplify = FALSE)
# OUTPUT RESULTS
all_models$dep.var1
all_models$dep.var2
...
From there, you can run further extractions or processes across model objects:
# NAMED LIST OF MODEL SUMMARIES
all_summaries <- lapply(all_models, summary)
all_summaries$dep.var1
all_summaries$dep.var2
...
# NAMED LIST OF MODEL COEFFICIENTS
all_coefficients <- lapply(all_models, `[`, "coefficients")
all_coefficients$dep.var1
all_coefficients$dep.var2
...

You could sapply over the names of the dependent vars which you could nicely identify with grep. In lm use reformulate to build the formula.
sapply(grep('^dep', names(my_data), value=TRUE), \(x)
lm(reformulate(c('x1', 'x2'), x), my_data))
# dep.var1 dep.var2 dep.var3
# coefficients numeric,3 numeric,3 numeric,3
# residuals numeric,10 numeric,10 numeric,10
# effects numeric,10 numeric,10 numeric,10
# rank 3 3 3
# fitted.values numeric,10 numeric,10 numeric,10
# assign integer,3 integer,3 integer,3
# qr qr,5 qr,5 qr,5
# df.residual 7 7 7
# xlevels list,0 list,0 list,0
# call expression expression expression
# terms dep.var1 ~ x1 + x2 dep.var2 ~ x1 + x2 dep.var3 ~ x1 + x2
# model data.frame,3 data.frame,3 data.frame,3
The dep.var* appear nicely in the result.
However, you probably want to use lapply and pipe it into setNames() to get the list elements named. Instead of grep you may of course define the dependent variables manually. To get a clean formular call stored, we use a trick once #g-grothendieck taught me using do.call.
dv <- as.list(grep('^dep', names(my_data), value=TRUE)[1:2])
res <- lapply(dv, \(x) {
f <- reformulate(c('x1', 'x2'), x)
do.call('lm', list(f, quote(my_data)))
}) |>
setNames(dv)
res
# $dep.var1
#
# Call:
# lm(formula = dep.var1 ~ x1 + x2, data = my_data)
#
# Coefficients:
# (Intercept) x1 x2
# -4.7450 2.3398 0.2747
#
#
# $dep.var2
#
# Call:
# lm(formula = dep.var2 ~ x1 + x2, data = my_data)
#
# Coefficients:
# (Intercept) x1 x2
# 148.725 -11.699 -1.373
This allows you to get the summary() of the objects, which probably is what you want.
summary(res$dep.var1)
# Call:
# lm(formula = dep.var1 ~ x1 + x2, data = my_data)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.8830 -1.8345 -0.2326 1.4335 4.2452
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -4.7450 7.2884 -0.651 0.536
# x1 2.3398 0.2836 8.251 7.48e-05 ***
# x2 0.2747 0.7526 0.365 0.726
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.55 on 7 degrees of freedom
# Multiple R-squared: 0.9117, Adjusted R-squared: 0.8865
# F-statistic: 36.14 on 2 and 7 DF, p-value: 0.0002046
Finally you could wrap it in a function
calc_models <- \(dv) {
lapply(dv, \(x) {
f <- reformulate(c('x1', 'x2'), x)
do.call('lm', list(f, quote(my_data)))
}) |>
setNames(dv)
}
calc_models(list('dep.var1', 'dep.var2'))

Here is a way how you could iterate through your dataframe and apply the function to the group you define (here dep.var) and save the different models in a dataframe:
library(tidyverse)
library(broom)
my_data %>%
pivot_longer(
starts_with("dep"),
names_to = "group",
values_to = "dep.var"
) %>%
mutate(group = as.factor(group)) %>%
group_by(group) %>%
group_split() %>%
map_dfr(.f = function(df) {
lm(dep.var ~ x1 + x2, data = df) %>%
tidy() %>% # first output
#glance() %>% # second output
add_column(group = unique(df$group), .before=1)
})
dataframe output:
# A tibble: 9 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var1 (Intercept) -5.29 11.6 -0.456 0.662
2 dep.var1 x1 2.11 0.268 7.87 0.000101
3 dep.var1 x2 0.538 1.23 0.437 0.675
4 dep.var2 (Intercept) 151. 57.9 2.61 0.0347
5 dep.var2 x1 -10.6 1.34 -7.87 0.000101
6 dep.var2 x2 -2.69 6.15 -0.437 0.675
7 dep.var3 (Intercept) -9.29 11.6 -0.802 0.449
8 dep.var3 x1 2.11 0.268 7.87 0.000101
9 dep.var3 x2 0.538 1.23 0.437 0.675
list output:
[[1]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var1 (Intercept) -5.29 11.6 -0.456 0.662
2 dep.var1 x1 2.11 0.268 7.87 0.000101
3 dep.var1 x2 0.538 1.23 0.437 0.675
[[2]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var2 (Intercept) 151. 57.9 2.61 0.0347
2 dep.var2 x1 -10.6 1.34 -7.87 0.000101
3 dep.var2 x2 -2.69 6.15 -0.437 0.675
[[3]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var3 (Intercept) -9.29 11.6 -0.802 0.449
2 dep.var3 x1 2.11 0.268 7.87 0.000101
3 dep.var3 x2 0.538 1.23 0.437 0.675
glance output:
group r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 dep.var1 0.927 0.906 2.32 44.3 0.000106 2 -20.8 49.7 50.9 37.8 7 10
2 dep.var2 0.927 0.906 11.6 44.3 0.000106 2 -36.9 81.9 83.1 944. 7 10
3 dep.var3 0.927 0.906 2.32 44.3 0.000106 2 -20.8 49.7 50.9 37.8 7 10

Related

Loop in Cox regression

I am trying to run a cox regression for 1000 variables (exposure) as below
varlist <- names(dataset)[275:1275]
sumtables <- lapply(varlist, function(i) {
iformula <- as.formula(sprintf("Surv(time_cox, events) ~ %s + age +age2 ", i))
x <- coxph(iformula, data=dataset, na.action=na.omit)
summary(x)[7][[1]] ##### summary(x)[8][[1]]
})
it works well, but I don't know how to extract the data (for each variable (beta and se)) and run the benjamini-hochberg on p-values.
any help is appreciated! Thanks
I am assuming here that all the variables in varlist are either binary or numeric.
sumtables <- lapply(varlist, function(i) {
iformula <- as.formula(sprintf("Surv(time_cox, events) ~ %s + age +age2 ", i))
x <- coxph(iformula, data=dataset, na.action=na.omit)
data.frame(pvalue = drop1(x, scope = i, test = "Chisq")[2,4],
coef = coef(x)[i])
})
You could use purrr::map to get a tidy dataframe of all your coefficients, se's and p values etc. from the vector of tested exposures. Modifying a little from your code above to work with veteran dataset:
library(survival)
library(tidyverse)
exp_vars <- names(veteran[, c(1, 2, 5, 6, 8)])
tibble(exp_vars) %>%
group_by(exp_vars) %>%
mutate(cox_mod = map(exp_vars, function(exposure) {
iformula <-
as.formula(sprintf("Surv(time, status) ~ %s + age", exposure))
x <- coxph(iformula, data = veteran, na.action = na.omit)
x
}),
coefs = list(rownames_to_column(data.frame(
summary(cox_mod[[1]])$coefficients
)))) %>%
unnest(coefs)
#> # A tibble: 12 x 8
#> # Groups: exp_vars [5]
#> exp_vars cox_mod rowname coef exp.coef. se.coef. z Pr...z..
#> <chr> <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 trt <coxph> trt -0.00365 0.996 0.183 -0.0200 9.84e- 1
#> 2 trt <coxph> age 0.00753 1.01 0.00966 0.779 4.36e- 1
#> 3 celltype <coxph> celltypesmallc~ 0.992 2.70 0.254 3.91 9.40e- 5
#> 4 celltype <coxph> celltypeadeno 1.16 3.17 0.293 3.94 8.07e- 5
#> 5 celltype <coxph> celltypelarge 0.235 1.27 0.278 0.848 3.97e- 1
#> 6 celltype <coxph> age 0.00590 1.01 0.00935 0.631 5.28e- 1
#> 7 karno <coxph> karno -0.0337 0.967 0.00520 -6.48 8.94e-11
#> 8 karno <coxph> age -0.00239 0.998 0.00908 -0.263 7.92e- 1
#> 9 diagtime <coxph> diagtime 0.00943 1.01 0.00892 1.06 2.90e- 1
#> 10 diagtime <coxph> age 0.00797 1.01 0.00961 0.830 4.07e- 1
#> 11 prior <coxph> prior -0.0135 0.987 0.0201 -0.674 5.00e- 1
#> 12 prior <coxph> age 0.00715 1.01 0.00955 0.749 4.54e- 1
Created on 2022-03-16 by the reprex package (v2.0.1)

How to add interaction terms in multinomial regression

I am using the mlogit function from the mlogit package to run a multinomial logit regression. I am not sure how to add interaction terms into my model. Here is a toy dataset and my attempt to add interactions:
library(mlogit)
data <- data.frame(y=sample(1:3, 24, replace = TRUE),
x1 = c(rep(1,12), rep(2,12)),
x2 = rep(c(rep(1,4), rep(2,4), rep(3,4)),2),
x3=rnorm(24),
z1 = sample(1:10, 24, replace = TRUE))
m0 <- mlogit(y ~ 0|x1 + x2 + x3 + z1, shape = "wide", data = data) #model with only main effects
m1 <- mlogit(y ~ 0|(x1 + x2 + x3 + z1)^2, shape = "wide", data = data) #model assuming with all possible 2-way interactions?
The output from summary(m1) shows:
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):2 86.41088 164.93831 0.5239 0.6003
(Intercept):3 62.43859 163.57346 0.3817 0.7027
x1:2 -32.27065 82.62474 -0.3906 0.6961
x1:3 0.24661 84.07429 0.0029 0.9977
x2:2 -75.09247 81.36496 -0.9229 0.3561
x2:3 -85.16452 81.40983 -1.0461 0.2955
x3:2 113.11778 119.15990 0.9493 0.3425
x3:3 112.77622 117.74567 0.9578 0.3382
z1:2 11.18665 22.32508 0.5011 0.6163
z1:3 13.15552 22.26441 0.5909 0.5546
x1:2 34.01298 39.66983 0.8574 0.3912
x1:3 32.19141 39.48373 0.8153 0.4149
x1:2 -53.86747 59.75696 -0.9014 0.3674
x1:3 -47.97693 59.09055 -0.8119 0.4168
x1:2 -6.98799 11.29920 -0.6185 0.5363
x1:3 -10.41574 11.52313 -0.9039 0.3660
x2:2 0.59185 6.68807 0.0885 0.9295
x2:3 2.63458 4.94419 0.5329 0.5941
x2:2 0.80945 2.03769 0.3972 0.6912
x2:3 2.60383 2.21878 1.1735 0.2406
x3:2 -0.64112 1.64678 -0.3893 0.6970
x3:3 -2.14289 1.98436 -1.0799 0.2802
the first column is not quite clear to me what specific interactions were outputted. Any pointers will be greatly appreciated!
This might be a clearer way to do it:
library(dplyr)
library(broom)
library(nnet)
multinom(formula = y ~ (x1 + x2 + x3 + z1)^2, data = data) %>%
tidy()
# A tibble: 22 x 6
y.level term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 (Intercept) -158. 247. -0.640 0.522
2 2 x1 -388. 247. -1.57 0.116
3 2 x2 -13.4 248. -0.0543 0.957
4 2 x3 120. 334. 0.360 0.719
5 2 z1 173. 968. 0.179 0.858
6 2 x1:x2 337. 248. 1.36 0.174
7 2 x1:x3 40.2 334. 0.120 0.904
8 2 x1:z1 -53.8 968. -0.0555 0.956
9 2 x2:x3 -137. 1018. -0.135 0.893
10 2 x2:z1 -76.6 910. -0.0841 0.933
# … with 12 more rows

Include in data table regression coeffiecients, std.errors and Pvalues using R

how do I create a data.table in r with coefficient, std.err and Pvlaues with rqpd regression type? It's easy with the coefficients using summary(myregression)[2] but don't know how to get std.err and Pval. Thanks
Try with broom:
library(broom)
library(dplyr)
#Model
mod <- lm(Sepal.Length~.,data=iris)
#Broom
summaryobj <- tidy(mod)
Output:
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.17 0.280 7.76 1.43e-12
2 Sepal.Width 0.496 0.0861 5.76 4.87e- 8
3 Petal.Length 0.829 0.0685 12.1 1.07e-23
4 Petal.Width -0.315 0.151 -2.08 3.89e- 2
5 Speciesversicolor -0.724 0.240 -3.01 3.06e- 3
6 Speciesvirginica -1.02 0.334 -3.07 2.58e- 3
Found a solution that is working
summ <- summary(myregression, se = "boot")
summ
str(summ)
PValues <- summ$coefficients[,4]

Data frame couldn´t show

I try running the following code, but get error ar shown in the pictures below. Im quite new to R so dont know if its information to the case, but the first column in my data frame called "data" is dates. I get as.Dates.numeric(value) "origin" must be applied, my intuition says it got something to do with the date column, but then again, im a newbie. Just in case, the date column is not supposed to be a part of coef.vec.
v1 <- 2:7
coef.vec <- data.frame(NULL) # create object to keep results
for (i in seq_along(v1)) {
m <- summary(lm(data[,v1[i]] ~ data[,8])) # run model
coef.vec[i, 1] <- names(data)[v1[i]] # print variable name
coef.vec[i, 2] <- m$coefficients[1,1] # intercept
coef.vec[i, 3] <- m$coefficients[2,1] # coefficient
coef.vec[i, 4] <- mean(data[[i]]) # means of variables
}
names(coef.vec) <- c("y.variable", "intercept", "coef.x","variable.mean")
error1
error2
Try this approach using lapply for column 2 to 7 of your data.
coef.vec <- do.call(rbind, lapply(names(data)[2:7], function(x) {
m <- summary(lm(data[[x]] ~ data[[8]]))
data.frame(y.variable = x,
intercept = m$coefficients[1,1],
coef.x = m$coefficients[2,1],
variable.mean = mean(data[[x]]))
}))
We can construct the formula with reformulate, apply the lm, get the summary output with tidy from broom and create a single dataset
library(dplyr)
library(purrr)
library(broom)
map_dfr(names(data)[2:7], ~
tidy(lm(reformulate(names(data)[8], response = .x), data = data)))
Or this can be done in a single step without any loop
tidy(lm(cbind(iris[,1], iris[,2]) ~ Species, iris))
Or
tidy(lm(as.matrix(iris[1:2]) ~ Species, iris))
# A tibble: 6 x 6
# response term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 Sepal.Length (Intercept) 5.01 0.0728 68.8 1.13e-113
#2 Sepal.Length Speciesversicolor 0.93 0.103 9.03 8.77e- 16
#3 Sepal.Length Speciesvirginica 1.58 0.103 15.4 2.21e- 32
#4 Sepal.Width (Intercept) 3.43 0.0480 71.4 5.71e-116
#5 Sepal.Width Speciesversicolor -0.658 0.0679 -9.69 1.83e- 17
#6 Sepal.Width Speciesvirginica -0.454 0.0679 -6.68 4.54e- 10
and check the output from the loop
map_dfr(names(iris)[1:2], ~ tidy(lm(reformulate('Species', response = .x), data = iris)))
# A tibble: 6 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) 5.01 0.0728 68.8 1.13e-113
#2 Speciesversicolor 0.93 0.103 9.03 8.77e- 16
#3 Speciesvirginica 1.58 0.103 15.4 2.21e- 32
#4 (Intercept) 3.43 0.0480 71.4 5.71e-116
#5 Speciesversicolor -0.658 0.0679 -9.69 1.83e- 17
#6 Speciesvirginica -0.454 0.0679 -6.68 4.54e- 10

Many regressions using tidyverse and broom: Same dependent variable, different independent variables

This link shows how to answer my question in the case where we have the same independent variables, but potentially many different dependent variables: Use broom and tidyverse to run regressions on different dependent variables.
But my question is, how can I apply the same approach (e.g., tidyverse and broom) to run many regressions where we have the reverse situation: same dependent variables but different independent variable. In line with the code in the previous link, something like:
mod = lm(health ~ cbind(sex,income,happiness) + faculty, ds) %>% tidy()
However, this code does not do exactly what I want, and instead, produces:
Call:
lm(formula = income ~ cbind(sex, health) + faculty, data = ds)
Coefficients:
(Intercept) cbind(sex, health)sex
945.049 -47.911
cbind(sex, health)health faculty
2.342 1.869
which is equivalent to:
lm(formula = income ~ sex + health + faculty, data = ds)
Basically you'll need some way to create all the different formulas you want. Here's one way
qq <- expression(sex,income,happiness)
formulae <- lapply(qq, function(v) bquote(health~.(v)+faculty))
# [[1]]
# health ~ sex + faculty
# [[2]]
# health ~ income + faculty
# [[3]]
# health ~ happiness + faculty
Once you have all your formula, you can map them to lm and then to tidy()
library(purrr)
library(broom)
formulae %>% map(~lm(.x, ds)) %>% map_dfr(tidy, .id="model")
# A tibble: 9 x 6
# model term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 19.5 0.504 38.6 1.13e-60
# 2 1 sex 0.755 0.651 1.16 2.49e- 1
# 3 1 faculty -0.00360 0.291 -0.0124 9.90e- 1
# 4 2 (Intercept) 19.8 1.70 11.7 3.18e-20
# 5 2 income -0.000244 0.00162 -0.150 8.81e- 1
# 6 2 faculty 0.143 0.264 0.542 5.89e- 1
# 7 3 (Intercept) 18.4 1.88 9.74 4.79e-16
# 8 3 happiness 0.205 0.299 0.684 4.96e- 1
# 9 3 faculty 0.141 0.262 0.539 5.91e- 1
Using sample data
set.seed(11)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
You could use the combn function to get all combinations of n independent variables and then iterate over them. Let's say n=3 here:
library(tidyverse)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
ivs = combn(names(ds)[names(ds)!="income"], 3, simplify=FALSE)
# Or, to get all models with 1 to 4 variables:
# ivs = map(1:4, ~combn(names(ds)[names(ds)!="income"], .x, simplify=FALSE)) %>%
# flatten()
names(ivs) = map(ivs, ~paste(.x, collapse="-"))
models = map(ivs,
~lm(as.formula(paste("income ~", paste(.x, collapse="+"))), data=ds))
map_df(models, broom::tidy, .id="model")
model term estimate std.error statistic p.value
* <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 happiness-health-sex (Intercept) 1086. 201. 5.39 5.00e- 7
2 happiness-health-sex happiness -25.4 21.4 -1.19 2.38e- 1
3 happiness-health-sex health 3.58 6.99 0.512 6.10e- 1
4 happiness-health-sex sex 11.5 41.5 0.277 7.82e- 1
5 happiness-health-faculty (Intercept) 1085. 197. 5.50 3.12e- 7
6 happiness-health-faculty happiness -25.8 20.9 -1.23 2.21e- 1
7 happiness-health-faculty health 3.45 6.98 0.494 6.23e- 1
8 happiness-health-faculty faculty 7.86 18.2 0.432 6.67e- 1
9 happiness-sex-faculty (Intercept) 1153. 141. 8.21 1.04e-12
10 happiness-sex-faculty happiness -25.9 21.4 -1.21 2.28e- 1
11 happiness-sex-faculty sex 3.44 46.2 0.0744 9.41e- 1
12 happiness-sex-faculty faculty 7.40 20.2 0.366 7.15e- 1
13 health-sex-faculty (Intercept) 911. 143. 6.35 7.06e- 9
14 health-sex-faculty health 3.90 7.03 0.554 5.81e- 1
15 health-sex-faculty sex 15.6 45.6 0.343 7.32e- 1
16 health-sex-faculty faculty 7.02 20.4 0.345 7.31e- 1

Resources