Many regressions using tidyverse and broom: Same dependent variable, different independent variables

This link shows how to answer my question in the case where we have the same independent variables but potentially many different dependent variables: Use broom and tidyverse to run regressions on different dependent variables.
My question is: how can I apply the same approach (e.g., tidyverse and broom) to run many regressions in the reverse situation, where we have the same dependent variable but different independent variables? In line with the code in the previous link, something like:
mod = lm(health ~ cbind(sex,income,happiness) + faculty, ds) %>% tidy()
However, this code does not do exactly what I want and instead produces:
Call:
lm(formula = income ~ cbind(sex, health) + faculty, data = ds)

Coefficients:
             (Intercept)     cbind(sex, health)sex  cbind(sex, health)health                   faculty
                 945.049                   -47.911                     2.342                     1.869
which is equivalent to:
lm(formula = income ~ sex + health + faculty, data = ds)

Basically you'll need some way to create all the different formulas you want. Here's one way
qq <- expression(sex,income,happiness)
formulae <- lapply(qq, function(v) bquote(health~.(v)+faculty))
# [[1]]
# health ~ sex + faculty
# [[2]]
# health ~ income + faculty
# [[3]]
# health ~ happiness + faculty
Once you have all your formulas, you can map them to lm() and then to tidy():
library(purrr)
library(broom)
formulae %>% map(~lm(.x, ds)) %>% map_dfr(tidy, .id="model")
# A tibble: 9 x 6
# model term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 19.5 0.504 38.6 1.13e-60
# 2 1 sex 0.755 0.651 1.16 2.49e- 1
# 3 1 faculty -0.00360 0.291 -0.0124 9.90e- 1
# 4 2 (Intercept) 19.8 1.70 11.7 3.18e-20
# 5 2 income -0.000244 0.00162 -0.150 8.81e- 1
# 6 2 faculty 0.143 0.264 0.542 5.89e- 1
# 7 3 (Intercept) 18.4 1.88 9.74 4.79e-16
# 8 3 happiness 0.205 0.299 0.684 4.96e- 1
# 9 3 faculty 0.141 0.262 0.539 5.91e- 1
Using sample data
set.seed(11)
ds <- data.frame(income = rnorm(100, mean = 1000, sd = 200),
                 happiness = rnorm(100, mean = 6, sd = 1),
                 health = rnorm(100, mean = 20, sd = 3),
                 sex = c(0, 1),
                 faculty = c(0, 1, 2, 3))
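As an aside, the same list of formulas can also be built with base R's reformulate(); a minimal sketch, assuming the same ds as above:
library(purrr)
library(broom)
# build health ~ <iv> + faculty for each candidate independent variable
ivs <- c("sex", "income", "happiness")
formulae2 <- map(ivs, ~ reformulate(c(.x, "faculty"), response = "health"))
formulae2 %>% map(~ lm(.x, data = ds)) %>% map_dfr(tidy, .id = "model")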

You could use the combn function to get all combinations of n independent variables and then iterate over them. Let's say n=3 here:
library(tidyverse)
ds <- data.frame(income = rnorm(100, mean = 1000, sd = 200),
                 happiness = rnorm(100, mean = 6, sd = 1),
                 health = rnorm(100, mean = 20, sd = 3),
                 sex = c(0, 1),
                 faculty = c(0, 1, 2, 3))
ivs = combn(names(ds)[names(ds) != "income"], 3, simplify = FALSE)
# Or, to get all models with 1 to 4 variables:
# ivs = map(1:4, ~combn(names(ds)[names(ds) != "income"], .x, simplify = FALSE)) %>%
#   flatten()
names(ivs) = map(ivs, ~paste(.x, collapse = "-"))
models = map(ivs,
             ~lm(as.formula(paste("income ~", paste(.x, collapse = "+"))), data = ds))
map_df(models, broom::tidy, .id = "model")
model term estimate std.error statistic p.value
* <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 happiness-health-sex (Intercept) 1086. 201. 5.39 5.00e- 7
2 happiness-health-sex happiness -25.4 21.4 -1.19 2.38e- 1
3 happiness-health-sex health 3.58 6.99 0.512 6.10e- 1
4 happiness-health-sex sex 11.5 41.5 0.277 7.82e- 1
5 happiness-health-faculty (Intercept) 1085. 197. 5.50 3.12e- 7
6 happiness-health-faculty happiness -25.8 20.9 -1.23 2.21e- 1
7 happiness-health-faculty health 3.45 6.98 0.494 6.23e- 1
8 happiness-health-faculty faculty 7.86 18.2 0.432 6.67e- 1
9 happiness-sex-faculty (Intercept) 1153. 141. 8.21 1.04e-12
10 happiness-sex-faculty happiness -25.9 21.4 -1.21 2.28e- 1
11 happiness-sex-faculty sex 3.44 46.2 0.0744 9.41e- 1
12 happiness-sex-faculty faculty 7.40 20.2 0.366 7.15e- 1
13 health-sex-faculty (Intercept) 911. 143. 6.35 7.06e- 9
14 health-sex-faculty health 3.90 7.03 0.554 5.81e- 1
15 health-sex-faculty sex 15.6 45.6 0.343 7.32e- 1
16 health-sex-faculty faculty 7.02 20.4 0.345 7.31e- 1
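If you also want model-level fit statistics (R-squared, AIC, etc.) alongside the coefficients, broom::glance can be mapped over the same list; a minimal sketch, assuming the models list built above:
map_df(models, broom::glance, .id = "model") %>%
  select(model, r.squared, adj.r.squared, AIC)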

Related

Run linear regression model on nested dataset after grouping (multiply imputed dataset)

I want to group a nested (multiply imputed) dataset and then apply a linear regression to each dataset. I have tried a number of approaches, including the map option (2) and the for loop (3). I have had no luck at all. I want the model results to look like the results from summary(mod1). Does anyone know what I could be doing wrong?
# get dependencies
library(mice)
library(tidyverse)
library(broom)  # for tidy()
# impute the boys dataset from mice package
boys_imp <- mice(boys)
# 1) I want to run a model like this on my multiply imputed dataset
mod <- boys %>%
  group_by(reg) %>%
  do(tidy(
    lm(data = .,
       formula = wgt ~ bmi),
    conf.int = T))
summary(mod1)
# A tibble: 12 × 8
# Groups: reg [6]
reg term estimate std.error statistic p.value conf.low conf.high
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 north (Intercept) -81.9 9.84 -8.32 2.48e-12 -101. -62.3
2 north bmi 6.84 0.500 13.7 2.53e-22 5.85 7.84
3 east (Intercept) -75.3 7.62 -9.89 3.21e-18 -90.4 -60.3
4 east bmi 6.29 0.420 15.0 4.53e-32 5.46 7.12
5 west (Intercept) -91.9 6.31 -14.6 2.48e-34 -104. -79.4
6 west bmi 7.17 0.347 20.7 3.49e-54 6.49 7.86
7 south (Intercept) -79.8 6.73 -11.9 1.83e-24 -93.1 -66.5
8 south bmi 6.47 0.373 17.3 1.63e-40 5.73 7.20
9 city (Intercept) -92.0 13.9 -6.61 6.75e- 9 -120. -64.2
10 city bmi 6.95 0.757 9.18 1.39e-13 5.44 8.46
11 NA (Intercept) -88.6 43.8 -2.02 2.92e- 1 -645. 468.
12 NA bmi 6.46 2.89 2.24 2.68e- 1 -30.2 43.1
# 2) the map way --------------------------------------------------------
mod_imp <- boys_imp %>%
mice::complete("all") %>%
map(group_by, reg) %>%
map(lm, formula = wgt ~ bmi) %>%
pool()
summary(mod_imp)
term estimate std.error statistic df p.value
1 (Intercept) -85.473428 3.5511961 -24.06891 715.1703 0
2 bmi 6.793622 0.1945322 34.92287 693.7835 0
# 3) for loop way-------------------------------------------------------
# nest the mids dataset
boys_imp2 <- boys_imp %>%
mice::complete("all")
dat1 <- replicate(length(boys_imp2), NULL) # preallocate same size
# run the for loop
for (i in seq_along(boys_imp2)) {
  dat1[[i]] <- boys_imp2[[i]] %>%
    group_by(reg) %>%
    do(lm(wgt ~ bmi, data = boys_imp2[[i]]))
}
|==================================================================| 100% ~0 s remaining
Error in `do()`:
! Results 1, 2, 3, 4, 5, ... must be data frames, not lm.
Run `rlang::last_error()` to see where the error occurred.
I have found a solution to the problem. It involves grouping the data by imputation number and the variable of interest; I then map lm onto the datasets and finish by unnesting the data:
boys_imp %>%
  mice::complete("long", include = FALSE) %>%
  group_by(.imp, reg) %>%
  nest() %>%
  mutate(lm_model = map(data, ~lm(bmi ~ phb, data = .))) %>%
  group_by(reg) %>%
  summarise(model = list(tidy(pool(lm_model), conf.int = T))) %>%
  unnest_wider(model) %>%
  unnest(cols = c(term, estimate, std.error,
                  statistic, p.value, conf.low, conf.high))
# A tibble: 30 × 16
reg term estimate std.error statistic p.value conf.low conf.high b df dfcom fmi lambda m riv ubar
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 north (Intercept) 19.3 0.332 57.9 0 18.6 19.9
2 north phb.L 5.10 0.678 7.53 1.81e-10 3.75 6.46
3 north phb.Q 1.25 0.800 1.56 1.24e- 1 -0.357 2.86
4 north phb.C -0.430 0.882 -0.487 6.30e- 1 -2.25 1.39
5 north phb^4 -1.10 0.948 -1.16 2.57e- 1 -3.07 0.862
6 north phb^5 -0.156 1.08 -0.144 8.87e- 1 -2.41 2.10
7 east (Intercept) 18.7 0.244 76.8 0 18.3 19.2
8 east phb.L 4.83 0.509 9.48 4.44e-15 3.82 5.84
9 east phb.Q 1.10 0.692 1.60 1.27e- 1 -0.343 2.55
10 east phb.C -0.518 0.671 -0.772 4.49e- 1 -1.91 0.878
# … with 20 more rows
# ℹ Use `print(n = ...)` to see more rows
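As a side note on approach 3 in the question: do() requires each result to be a data frame, which is why the raw lm call fails there. A minimal sketch of a fix, assuming boys_imp2 and dat1 as defined in the question (this gives per-group tidy output for each imputed dataset, but unlike the solution above it does not pool across imputations):
library(broom)
for (i in seq_along(boys_imp2)) {
  dat1[[i]] <- boys_imp2[[i]] %>%
    group_by(reg) %>%
    do(tidy(lm(wgt ~ bmi, data = .)))   # data = . uses the current group, not the full dataset
}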

Loop in Cox regression

I am trying to run a Cox regression for 1000 variables (exposures) as below:
varlist <- names(dataset)[275:1275]
sumtables <- lapply(varlist, function(i) {
  iformula <- as.formula(sprintf("Surv(time_cox, events) ~ %s + age + age2", i))
  x <- coxph(iformula, data = dataset, na.action = na.omit)
  summary(x)[7][[1]]  ##### summary(x)[8][[1]]
})
It works well, but I don't know how to extract the results for each variable (beta and SE) and run the Benjamini-Hochberg correction on the p-values.
Any help is appreciated! Thanks
I am assuming here that all the variables in varlist are either binary or numeric.
sumtables <- lapply(varlist, function(i) {
  iformula <- as.formula(sprintf("Surv(time_cox, events) ~ %s + age + age2", i))
  x <- coxph(iformula, data = dataset, na.action = na.omit)
  data.frame(pvalue = drop1(x, scope = i, test = "Chisq")[2, 4],
             coef = coef(x)[i])
})
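To cover the Benjamini-Hochberg part of the question, one option is then to bind this list into a single data frame and use p.adjust(); a minimal sketch, assuming sumtables built as above:
results <- do.call(rbind, sumtables)
results$variable <- varlist
results$p_adj_BH <- p.adjust(results$pvalue, method = "BH")  # Benjamini-Hochberg adjustment
head(results[order(results$p_adj_BH), ])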
You could use purrr::map to get a tidy data frame of all your coefficients, SEs, p values, etc. from the vector of tested exposures. Modifying your code a little to work with the veteran dataset:
library(survival)
library(tidyverse)
exp_vars <- names(veteran[, c(1, 2, 5, 6, 8)])
tibble(exp_vars) %>%
  group_by(exp_vars) %>%
  mutate(cox_mod = map(exp_vars, function(exposure) {
    iformula <-
      as.formula(sprintf("Surv(time, status) ~ %s + age", exposure))
    x <- coxph(iformula, data = veteran, na.action = na.omit)
    x
  }),
  coefs = list(rownames_to_column(data.frame(
    summary(cox_mod[[1]])$coefficients
  )))) %>%
  unnest(coefs)
#> # A tibble: 12 x 8
#> # Groups: exp_vars [5]
#> exp_vars cox_mod rowname coef exp.coef. se.coef. z Pr...z..
#> <chr> <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 trt <coxph> trt -0.00365 0.996 0.183 -0.0200 9.84e- 1
#> 2 trt <coxph> age 0.00753 1.01 0.00966 0.779 4.36e- 1
#> 3 celltype <coxph> celltypesmallc~ 0.992 2.70 0.254 3.91 9.40e- 5
#> 4 celltype <coxph> celltypeadeno 1.16 3.17 0.293 3.94 8.07e- 5
#> 5 celltype <coxph> celltypelarge 0.235 1.27 0.278 0.848 3.97e- 1
#> 6 celltype <coxph> age 0.00590 1.01 0.00935 0.631 5.28e- 1
#> 7 karno <coxph> karno -0.0337 0.967 0.00520 -6.48 8.94e-11
#> 8 karno <coxph> age -0.00239 0.998 0.00908 -0.263 7.92e- 1
#> 9 diagtime <coxph> diagtime 0.00943 1.01 0.00892 1.06 2.90e- 1
#> 10 diagtime <coxph> age 0.00797 1.01 0.00961 0.830 4.07e- 1
#> 11 prior <coxph> prior -0.0135 0.987 0.0201 -0.674 5.00e- 1
#> 12 prior <coxph> age 0.00715 1.01 0.00955 0.749 4.54e- 1
Created on 2022-03-16 by the reprex package (v2.0.1)
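The Benjamini-Hochberg step the question asks about can then be applied directly to this tidy output; a minimal sketch, assuming the tibble above has been saved as cox_results (column names follow the printed output):
cox_results %>%
  ungroup() %>%
  filter(rowname != "age") %>%                            # keep the exposure terms, drop the age-adjustment rows
  mutate(p_adj_BH = p.adjust(Pr...z.., method = "BH"))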

lapply instead of for loop for randomised hypothesis testing in R

I have a df that looks something like this:
set.seed(42)
ID <- sample(1:30, 100, rep=T)
Trait <- sample(0:1, 100, rep=T)
Year <- sample(1992:1999, 100, rep=T)
df <- cbind(ID, Trait, Year)
df <- as.data.frame(df)
ID is an individual organism, Trait is the presence/absence of a phenotype, and Year is the year an observation was made.
I would like to model whether Trait is random between individuals, something like this:
library(MCMCglmm)
m <- MCMCglmm(Trait ~ ID, random = ~ Year, data = df, family = "categorical")
Now, I would like to shuffle the Trait column and run x permutations to check whether my observed mean and CI fall outside of what's expected from random.
I could do this with a for loop, but I'd rather use a tidyverse solution.
I've read that lapply is a better(?) alternative, but I am struggling to find a specific enough walk-through that I can follow.
I'd appreciate any advice offered here.
Cheers!
Jamie
EDIT October 10th: Cleaned up the code and, per the comment below, added the code to give you back a nicely organized tibble/dataframe.
### decide how many shuffles you want and name them
### in an orderly fashion for the output
shuffles <- 1:10
names(shuffles) <- paste0("shuffle_", shuffles)
library(MCMCglmm)
library(dplyr)
library(tibble)
library(purrr)
ddd <- purrr::map(shuffles,
                  ~ df %>%
                    mutate(Trait = sample(Trait)) %>%
                    MCMCglmm(fixed = Trait ~ ID,
                             random = ~ Year,
                             data = .,
                             family = "categorical",
                             verbose = FALSE)) %>%
  purrr::map(~ tibble::as_tibble(summary(.x)$solutions, rownames = "model_term")) %>%
  dplyr::bind_rows(., .id = 'shuffle')
ddd
#> # A tibble: 20 x 7
#> shuffle model_term post.mean `l-95% CI` `u-95% CI` eff.samp pMCMC
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 shuffle_1 (Intercept) 112. 6.39 233. 103. 0.016
#> 2 shuffle_1 ID -6.31 -13.5 -0.297 112. 0.014
#> 3 shuffle_2 (Intercept) 24.9 -72.5 133. 778. 0.526
#> 4 shuffle_2 ID -0.327 -6.33 5.33 849. 0.858
#> 5 shuffle_3 (Intercept) 4.39 -77.3 87.4 161. 0.876
#> 6 shuffle_3 ID 1.04 -3.84 5.99 121. 0.662
#> 7 shuffle_4 (Intercept) 7.71 -79.0 107. 418. 0.902
#> 8 shuffle_4 ID 0.899 -4.40 6.57 408. 0.694
#> 9 shuffle_5 (Intercept) 30.4 -62.4 144. 732. 0.51
#> 10 shuffle_5 ID -0.644 -6.61 4.94 970. 0.866
#> 11 shuffle_6 (Intercept) -45.5 -148. 42.7 208. 0.302
#> 12 shuffle_6 ID 4.73 -0.211 11.6 89.1 0.058
#> 13 shuffle_7 (Intercept) -16.2 -133. 85.9 108. 0.696
#> 14 shuffle_7 ID 2.47 -2.42 10.3 47.8 0.304
#> 15 shuffle_8 (Intercept) 0.568 0.549 0.581 6.60 0.001
#> 16 shuffle_8 ID -0.0185 -0.0197 -0.0168 2.96 0.001
#> 17 shuffle_9 (Intercept) -6.95 -112. 92.2 452. 0.886
#> 18 shuffle_9 ID 2.07 -3.30 8.95 370. 0.476
#> 19 shuffle_10 (Intercept) 43.8 -57.0 159. 775. 0.396
#> 20 shuffle_10 ID -1.36 -7.44 5.08 901. 0.62
Your original data
set.seed(42)
ID <- sample(1:30, 100, rep=T)
Trait <- sample(0:1, 100, rep=T)
Year <- sample(1992:1999, 100, rep=T)
df <- cbind(ID, Trait, Year)
df <- as.data.frame(df)
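With ddd in hand, comparing the observed model to the permutation distribution is just a filter plus a proportion; a minimal sketch, assuming m is the MCMCglmm model fitted to the original (unshuffled) df as in the question:
obs <- summary(m)$solutions["ID", "post.mean"]   # observed fixed effect for ID
perm <- dplyr::filter(ddd, model_term == "ID")
mean(abs(perm$post.mean) >= abs(obs))            # crude two-sided permutation p-value for the ID effect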

Multiple LM model returning the same coefficients

Hello Stack Community,
I am trying to model wage growth across US states and territories using linear models to forecast into the future. I want to create a model for each state/territory (including DC, VI, and PR); however, when I look at the coefficients for my models, they are the same for each state.
I have used a combination of plyr, dplyr, and broom thus far to create and sort my data frame (named stuben_dat) for this project.
#Wage Growth
state_data = stuben_dat %>% group_by(st) %>%
  do(state_wg = lm(wage_growth ~ us_wage_growth + lag_wage_growth + dum1 + dum2 + dum3,
                   data = stuben_dat, subset = yr >= (current_year - 5)))
#The dummy variables adjust for seasonality (q1 vs q2 vs q3 vs q4)
#The current_year = whatever year I last updated the program
#The current_year-5 value lets me change the look back period
#This look back period can be used to exclude recessions or outliers
Here is just a snapshot of my output; as you can see, the beta coefficients and regression statistics are exactly the same for each state (only AK and AL are shown here). However, I want to build a different model for each state.
# A tibble: 318 x 6
# Groups: st [53]
st term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AK (Intercept) -1.75 0.294 -5.97 3.28e- 9
2 AK us_wage_growth 996. 23.6 42.2 1.82e-228
3 AK lag_wage_growth 0.191 0.0205 9.34 5.58e- 20
4 AK dum1 -0.245 0.304 -0.806 4.21e- 1
5 AK dum2 -0.321 0.304 -1.06 2.90e- 1
6 AK dum3 0.0947 0.303 0.312 7.55e- 1
7 AL (Intercept) -1.75 0.294 -5.97 3.28e- 9
8 AL us_wage_growth 996. 23.6 42.2 1.82e-228
9 AL lag_wage_growth 0.191 0.0205 9.34 5.58e- 20
10 AL dum1 -0.245 0.304 -0.806 4.21e- 1
# ... with 308 more rows
It is because you are passing the full dataset (data = stuben_dat) rather than the grouped data in your do() call. Try out:
state_data = stuben_dat %>%
  group_by(st) %>%
  do(state_wg = lm(wage_growth ~ us_wage_growth + lag_wage_growth +
                     dum1 + dum2 + dum3,
                   data = ., subset = (yr >= (current_year - 5))))
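To then get the per-state coefficient table shown in the question out of the resulting list column, a minimal sketch with purrr and broom, assuming state_data from the corrected call above:
library(purrr)
library(broom)
map2_dfr(state_data$state_wg, state_data$st,
         ~ mutate(tidy(.x), st = .y)) %>%   # tidy each state's model and tag it with its state
  select(st, everything())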

Calculate predicted model results by iterating through variables

I have several models fit to predict an outcome y = x1 + x2 + .....+x22. That's a fair number of predictors and a fair number of models. My customers want to know what's the marginal impact of each X on the estimated y. The models may include splines and interaction terms. I can do this, but it's cumbersome and requires loops or a lot of copy paste, which is slow or error prone. Can I do this better by writing my function differently and/or using purrr or an *apply function? Reproducible example is below. Ideally, I could write one function and apply it to longdata.
## create my fake data.
library(tidyverse)
library(rms)
ltrans <- function(l1) {
  newvar <- exp(l1) / (exp(l1) + 1)
  return(newvar)
}
set.seed(123)
mystates <- c("AL","AR","TN")
mydf <- data.frame(idno = seq(1:1500),state = rep(mystates,500))
mydf$x1[mydf$state=='AL'] <- rnorm(500,50,7)
mydf$x1[mydf$state=='AR'] <- rnorm(500,55,8)
mydf$x1[mydf$state=='TN'] <- rnorm(500,48,10)
mydf$x2 <- sample(1:5,500, replace = T)
mydf$x3 <- (abs(rnorm(1500,10,20)))^2
mydf$outcome <- as.numeric(cut2(sample(1:100,1500,replace = T),95))-1
dd<- datadist(mydf)
options(datadist = 'dd')
m1 <- lrm(outcome ~ x1 + x2+ rcs(x3,3), data = mydf)
dothemath <- function(x1 = x1ref, x2 = x2ref, x3 = x3ref) {
  ltrans(-2.1802256 - 0.01114239*x1 + 0.050319692*x2 - 0.00079289232*x3 +
           7.6508189e-10*pmax(x3 - 7.4686271, 0)^3 - 9.0897627e-10*pmax(x3 - 217.97865, 0)^3 +
           1.4389439e-10*pmax(x3 - 1337.2538, 0)^3)
}
x1ref <- 51.4
x2ref <- 3
x3ref <- 217.9
dothemath() ## 0.0591
mydf$referent <- dothemath()
mydf$thisobs <- dothemath(x1 = mydf$x1, x2 = mydf$x2, x3 = mydf$x3)
mydf$predicted <- predict(m1,mydf,type = "fitted.ind") ## yes, matches.
mydf$x1_marginaleffect <- dothemath(x1= mydf$x1)/mydf$referent
mydf$x2_marginaleffect <- dothemath(x2 = mydf$x2)/mydf$referent
mydf$x3_marginaleffect <- dothemath(x3 = mydf$x3)/mydf$referent
## can I do this with long data?
longdata <- mydf %>%
select(idno,state,referent,thisobs,x1,x2,x3) %>%
gather(varname,value,x1:x3)
##longdata$marginaleffect <- dothemath(longdata$varname = longdata$value) ## no, this does not work.
## I need to communicate to the function which variable it is evaluating.
longdata$marginaleffect[longdata$varname=="x1"] <- dothemath(x1 = longdata$value[longdata$varname=="x1"])/
longdata$referent[longdata$varname=="x1"]
longdata$marginaleffect[longdata$varname=="x2"] <- dothemath(x2 = longdata$value[longdata$varname=="x2"])/
longdata$referent[longdata$varname=="x2"]
longdata$marginaleffect[longdata$varname=="x3"] <- dothemath(x3 = longdata$value[longdata$varname=="x3"])/
longdata$referent[longdata$varname=="x3"]
testing<- inner_join(longdata[longdata$varname=="x1",c(1,7)],mydf[,c(1,10)])
head(testing) ## yes, both methods work.
Mostly you're just talking about a grouped mutate, with the caveat that dothemath is built such that you need to specify the variable name, which can be done by using do.call or purrr::invoke to call it on a named list of parameters:
longdata <- longdata %>%
  group_by(varname) %>%
  mutate(marginaleffect = invoke(dothemath, setNames(list(value), varname[1])) / referent)
longdata
#> # A tibble: 4,500 x 7
#> # Groups: varname [3]
#> idno state referent thisobs varname value marginaleffect
#> <int> <fct> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 AL 0.0591 0.0688 x1 46.1 1.06
#> 2 2 AR 0.0591 0.0516 x1 50.2 1.01
#> 3 3 TN 0.0591 0.0727 x1 38.0 1.15
#> 4 4 AL 0.0591 0.0667 x1 48.4 1.03
#> 5 5 AR 0.0591 0.0515 x1 47.1 1.05
#> 6 6 TN 0.0591 0.0484 x1 37.6 1.15
#> 7 7 AL 0.0591 0.0519 x1 60.9 0.905
#> 8 8 AR 0.0591 0.0531 x1 63.2 0.883
#> 9 9 TN 0.0591 0.0780 x1 47.8 1.04
#> 10 10 AL 0.0591 0.0575 x1 50.5 1.01
#> # ... with 4,490 more rows
# the first values look similar
inner_join(longdata[longdata$varname == "x1", c(1,7)], mydf[,c(1,10)])
#> Joining, by = "idno"
#> # A tibble: 1,500 x 3
#> idno marginaleffect x1_marginaleffect
#> <int> <dbl> <dbl>
#> 1 1 1.06 1.06
#> 2 2 1.01 1.01
#> 3 3 1.15 1.15
#> 4 4 1.03 1.03
#> 5 5 1.05 1.05
#> 6 6 1.15 1.15
#> 7 7 0.905 0.905
#> 8 8 0.883 0.883
#> 9 9 1.04 1.04
#> 10 10 1.01 1.01
#> # ... with 1,490 more rows
# check everything is the same
mydf %>%
  gather(varname, marginaleffect, x1_marginaleffect:x3_marginaleffect) %>%
  select(idno, varname, marginaleffect) %>%
  mutate(varname = substr(varname, 1, 2)) %>%
  all_equal(select(longdata, idno, varname, marginaleffect))
#> [1] TRUE
It may be easier to reconfigure dothemath to take an additional parameter of the variable name so as to avoid the gymnastics.
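For example, a minimal sketch of such a reconfigured function (dothemath2 is just an illustrative name); it takes the variable name as a string and swaps that value into the reference set before calling dothemath:
dothemath2 <- function(varname, value, x1 = x1ref, x2 = x2ref, x3 = x3ref) {
  args <- list(x1 = x1, x2 = x2, x3 = x3)
  args[[varname]] <- value          # overwrite the chosen variable with its observed values
  do.call(dothemath, args)
}

longdata <- longdata %>%
  group_by(varname) %>%
  mutate(marginaleffect = dothemath2(varname[1], value) / referent)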
