How to run a linear regression on estimated coefficients - r

Consider the following:
a <- c( 2, 3, 4, 5, 6, 7, 8, 9, 100, 11)
b <- c(5, 6, 7, 80, 9, 10, 11, 12, 13, 14)
c <- c(15, 16, 175, 18, 19, 20, 21, 22, 23, 24)
x <- c(17,18,50,15,64,15,3,5,6,9)
y <- c(55,66,99,83,64,51,23,64,89,101)
z <- c(98,78,56,21,45,34,61,98,45,64)
abc <- data.frame(cbind(a,b,c))
Firstly, I plan to regress the values in abc on x, y and z as follows (this went according to plan):
dep_vars <- as.matrix(abc)
lm <- lm(dep_vars ~ x + y + z, data = abc)
From here, I understand that we can print the summary in the console using either of these two methods:
summary(lm)
map_dfr(summary(lm), tidy, .id = 'dep_var')
My issue is that I would like to run a regression of the coefficients of x, y and z on a, b and c.

Base R Options
The others have already answered your question, but I felt like providing some additional options in case you find them useful in the future. All of these base R approaches do effectively the same thing, albeit in slightly different ways:
#### Call Directly From Saved LM Object ####
coef(lm)
lm$coefficients # functionally equivalent to above
lm[1] # returns the coefficients element inside a one-element list
For example lm$coefficients and coef(lm) both look like this:
a b c
(Intercept) 15.3706338 44.2597407 -51.0505560
x -0.5862903 -0.3488530 1.0575465
y 0.4074877 0.1587221 0.7602451
z -0.2724661 -0.5257350 0.2025181
Whereas lm[1] gives you the same matrix wrapped in a one-element list:
$coefficients
a b c
(Intercept) 15.3706338 44.2597407 -51.0505560
x -0.5862903 -0.3488530 1.0575465
y 0.4074877 0.1587221 0.7602451
z -0.2724661 -0.5257350 0.2025181
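To see the shape of these objects on a self-contained toy example (made-up data, not the asker's), the coefficient matrix of a multivariate lm can be indexed by row and column name:

```r
# Toy multivariate fit: two responses (a, b), one predictor (x)
set.seed(1)
x <- 1:10
Y <- cbind(a = 2 * x + rnorm(10), b = -x + rnorm(10))
fit <- lm(Y ~ x)

coef(fit)          # 2 x 2 matrix: rows (Intercept) and x, columns a and b
coef(fit)[, "a"]   # coefficients for response a only
coef(fit)["x", ]   # slope on x for every response
```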
Broom/Dplyr Options
If you want a tidy version of this, tidy transforms a base R lm object into a tidier format that you can directly manipulate:
#### Broom and Dplyr ####
library(tidyverse)
library(broom)
tidy(lm) # pull all terms
Like so:
# A tibble: 12 × 6
response term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a (Intercept) 15.4 45.8 0.335 0.749
2 a x -0.586 0.538 -1.09 0.318
3 a y 0.407 0.454 0.898 0.404
4 a z -0.272 0.426 -0.639 0.546
5 b (Intercept) 44.3 30.4 1.46 0.195
6 b x -0.349 0.357 -0.978 0.366
7 b y 0.159 0.300 0.528 0.616
8 b z -0.526 0.282 -1.86 0.112
9 c (Intercept) -51.1 69.1 -0.739 0.488
10 c x 1.06 0.812 1.30 0.240
11 c y 0.760 0.684 1.11 0.309
12 c z 0.203 0.643 0.315 0.763
To select specific coefficients, you can simply pipe into slice():
tidy(lm) %>%
slice(2) # pull specific coefficient
This gives you the row for response a and term x:
# A tibble: 1 × 6
response term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a x -0.586 0.538 -1.09 0.318
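For a single-response model the same information is also available in base R, by indexing the summary table by term name; a minimal sketch on made-up data:

```r
# Made-up single-response example
set.seed(42)
d <- data.frame(x = 1:10)
d$a <- 3 * d$x + rnorm(10)
fit <- lm(a ~ x, data = d)

# the summary row for term "x": estimate, std. error, t value, p value
summary(fit)$coefficients["x", ]
```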


R: How can I convert a list of linear regression results to a dataframe?

I am running a multivariate regression on ~150 different outcomes. Because gathering the results table by table or by hand is obviously tedious, I have tried to produce a data frame out of the results. My steps so far:
I made a function for the regression:
f1 <- function(X){summary(lm(X~HFres + age + sex + season + nonalcTE, data=dslin))}
I applied apply() to make a list (I only used a few of the 150 outcomes while trying to make it work)
m1 <- apply(dslin[,c(21:49)], MARGIN=2, FUN=f1)
Then I changed the object into a data frame:
m2 <- m1 %>% {tibble(variables = names(.),coefficient = map(., "coefficients"))} %>% unnest_wider(coefficient)
This is the result:
> m2
>A tibble: 29 x 9
> variables `(Intercept)`[,1] [,2] [,3] [,4] HFres[,1] [,2] [,3] [,4]
> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 C_101_IL8 3.59 0.106 34.0 1.28e-224 0.0000129 0.00367 0.00352 0.997
> 2 C_102_VEGFA 9.28 0.0844 110. 0 0.00425 0.00293 1.45 0.147
> 3 C_103_AM 4.92 0.0820 60.0 0 0.00261 0.00285 0.916 0.360
> 4 C_105_CD40L 7.53 0.164 45.9 0 0.00549 0.00570 0.964 0.335
> 5 C_106_GDF15 6.97 0.0864 80.7 0 0.00196 0.00300 0.653 0.514
> 6 C_107_PlGF 6.25 0.0665 94.0 0 0.00219 0.00231 0.947 0.344
> 7 C_108_SELE 4.89 0.117 41.8 1.14e-321 0.000978 0.00406 0.241 0.810
> 8 C_109_EGF 6.59 0.157 41.9 1.8 e-322 0.00714 0.00546 1.31 0.191
> 9 C_110_OPG 8.21 0.0673 122. 0 0.000320 0.00234 0.137 0.891
>10 C_111_SRC 7.62 0.0511 149. 0 0.000660 0.00177 0.372 0.710
>... with 19 more rows, and 6 more variables: age <dbl[,4]>, sexFemale <dbl[,4]>,
> seasonfall <dbl[,4]>, seasonspring <dbl[,4]>, seasonsummer <dbl[,4]>,
> nonalcTE <dbl[,4]>
It's a bit hard to see here, but before unnesting I had two columns: one with the variables and one with a list. After unnesting I have several columns, each of which still contains 4 sub-columns.
When I export this to Excel (with the rio package), only the [,1] columns show up because the columns '(Intercept)', HFres, etc. are still nested.
I have tried applying the unnest_wider() command again
m2 %>% unnest_wider(list=c('(Intercept)', 'HFres', 'age', 'sexFemale', 'seasonfall', 'seasonspring', 'seasonsummer'))
This didn't work, because it doesn't accept a list of columns in place of a single column.
I then tried it for only one of the variables to start with
m2 %>% unnest_wider(HFres)
This also gave me errors.
So, my remaining problem is I still need to unnest the columns of m2 in order to make them all visible when I export them.
Alternatively, it would be enough for me to have only the [,1] and [,4] sub-columns of each column, if those are easier to extract. I know I can access one sub-column like this: m2[["age"]][,1], so maybe I could build a new data frame from m2 by extracting all the columns I want?
Thank you for your help!
Update: reprex (I hope this is a correct understanding of what a reprex is)
create dataframe
age <- c(34, 56, 24, 78, 56, 67, 45, 93, 62, 16)
bmi <- c(24, 25, 27, 23, 2, 27, 28, 24, 27, 21)
educ <- c(4,2,5,1,3,2,4,5,2,3)
smoking <- c(1,3,2,2,3,2,1,3,2,1)
HF <- c(3,4,2,4,5,3,2,3,5,2)
P1 <- c(5,4,7,9,5,6,7,3,4,2)
P2 <- c(7,2,4,6,5,3,2,5,6,3)
P3 <- c(6,4,2,3,5,7,3,2,5,6)
df <- data.frame(age, bmi, educ, smoking, HF, P1, P2, P3)
function
f1 <- function(X){summary(lm(X~HF + age + bmi + educ + smoking, data=df))}
apply function to columns
m1 <- apply(df[,c(6:8)], MARGIN=2, FUN=f1)
m2 <- m1 %>% {tibble(variables = names(.),coefficient = map(., "coefficients"))} %>% unnest_wider(coefficient)
I basically need the coefficient (beta) which is the [,1] of each column and the p-value which is the [,4]
The broom package is intended for exactly this — turning model results into tidy dataframes. Here’s an example using broom::tidy() to get a table of coefficients for each dv, and purrr::map_dfr() to iterate over dvs, row-bind the coefficient tables, and add a column with the dv for each model:
library(broom)
library(purrr)
f1 <- function(X) {
  tidy(lm(
    as.formula(paste(X, "~ mpg * cyl")),
    data = mtcars
  ))
}
model_results <- map_dfr(
  set_names(names(mtcars)[3:11]),
  f1,
  .id = "dv"
)
model_results
Output:
# A tibble: 36 x 6
dv term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 disp (Intercept) -242. 154. -1.57 0.128
2 disp mpg 10.3 6.47 1.59 0.123
3 disp cyl 103. 22.9 4.52 0.000104
4 disp mpg:cyl -3.24 1.18 -2.75 0.0104
5 hp (Intercept) -86.5 123. -0.704 0.487
6 hp mpg 4.59 5.16 0.889 0.381
7 hp cyl 50.3 18.2 2.75 0.0102
8 hp mpg:cyl -1.47 0.940 -1.57 0.128
9 drat (Intercept) 3.34 1.28 2.61 0.0145
10 drat mpg 0.0541 0.0538 1.01 0.323
# ... with 26 more rows
If you want dvs in rows and coefficients in columns, you can tidyr::pivot_wider():
library(tidyr)
model_coefs <- pivot_wider(
model_results,
id_cols = dv,
names_from = term,
values_from = estimate
)
model_coefs
Output:
# A tibble: 9 x 5
dv `(Intercept)` mpg cyl `mpg:cyl`
<chr> <dbl> <dbl> <dbl> <dbl>
1 disp -242. 10.3 103. -3.24
2 hp -86.5 4.59 50.3 -1.47
3 drat 3.34 0.0541 -0.0354 -0.00533
4 wt 2.98 -0.00947 0.478 -0.0219
5 qsec 25.0 -0.0938 -0.862 0.000318
6 vs 2.38 -0.0194 -0.292 0.00223
7 am -0.908 0.0702 0.0721 -0.00470
8 gear 4.22 0.0115 -0.181 0.00311
9 carb 3.32 -0.0830 0.249 -0.00333
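If you'd rather stay in base R, the same estimate-and-p-value collection can be sketched with reformulate() and lapply() (illustrated on mtcars as above; the column names dv and term are my own choices, mirroring the tidy output):

```r
# Fit the same right-hand side against several outcome columns and
# collect the estimate and p-value for every term
outcomes <- c("disp", "hp", "drat")
res <- do.call(rbind, lapply(outcomes, function(v) {
  cf <- summary(lm(reformulate("mpg", response = v), data = mtcars))$coefficients
  data.frame(dv = v, term = rownames(cf),
             estimate = cf[, 1], p.value = cf[, 4], row.names = NULL)
}))
res
```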

Improve parallel performance with batching in a static-dynamic branching pipeline

BLUF: I am struggling to understand how to use batching in the R targets package to improve performance in a pipeline that combines static and dynamic branching and runs in parallel via tar_make_future(). I presume that I need to batch within each dynamic branch, but I am unsure how to go about doing that.
Here's a reprex that uses dynamic branching nested inside static branching, similar to what my actual pipeline is doing. It first branches statically for each value in all_types, and then dynamically branches within each category. This code produces 1,000 branches and 1,010 targets total. In the actual workflow I obviously don't use replicate, and the dynamic branches vary in number depending on the type value.
# _targets.r
library(targets)
library(tarchetypes)
library(future)
library(future.callr)
plan(callr)
all_types = data.frame(type = LETTERS[1:10])
tar_map(values = all_types, names = "type",
  tar_target(
    make_data,
    replicate(100,
      data.frame(x = seq(1000) + rnorm(1000, 0, 5),
                 y = seq(1000) + rnorm(1000, 20, 20)),
      simplify = FALSE
    ),
    iteration = "list"
  ),
  tar_target(
    fit_model,
    lm(make_data),
    pattern = map(make_data),
    iteration = "list"
  )
)
And here's a timing comparison of tar_make() vs tar_make_future() with eight workers:
# tar_destroy()
t1 <- system.time(tar_make())
# tar_destroy()
t2 <- system.time(tar_make_future(workers = 8))
rbind(serial = t1, parallel = t2)
## user.self sys.self elapsed user.child sys.child
## serial 2.12 0.11 25.59 NA NA
## parallel 2.07 0.24 184.68 NA NA
I don't think the user or system fields are useful here since the work is dispatched to separate R processes, but the elapsed time for the parallel run is about 7 times longer than for the serial run.
I presume this slowdown is caused by the large number of targets. Will batching improve performance in this case, and if so how can I implement batching within the dynamic branch?
You are on the right track with batching. In your case, that is a matter of breaking up your list of 100 datasets into groups of, say, 10 or so. You could do this with a nested list of datasets, but that's a lot of work. Luckily, there is an easier way.
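For intuition, batching here just means grouping the 100 datasets so that each branch processes several of them at once; a minimal base R illustration of the grouping itself:

```r
# 100 small datasets, grouped into 10 batches of 10
datasets <- replicate(100, data.frame(x = rnorm(5)), simplify = FALSE)
batches  <- split(datasets, ceiling(seq_along(datasets) / 10))
length(batches)    # 10 branches instead of 100
lengths(batches)   # 10 datasets per branch
```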
Your question is actually really well-timed. I just wrote some new target factories in tarchetypes that could help. To access them, you will need the development version of tarchetypes from GitHub:
remotes::install_github("ropensci/tarchetypes")
Then, with tar_map2_count(), it will be much easier to batch your list of 100 datasets for each scenario.
library(targets)
tar_script({
  library(broom)
  library(targets)
  library(tarchetypes)
  library(tibble)
  make_data <- function(n) {
    datasets_per_batch <- replicate(
      100,
      tibble(
        x = seq(n) + rnorm(n, 0, 5),
        y = seq(n) + rnorm(n, 20, 20)
      ),
      simplify = FALSE
    )
    tibble(dataset = datasets_per_batch, rep = seq_along(datasets_per_batch))
  }
  tar_map2_count(
    name = model,
    command1 = make_data(n = rows),
    command2 = tidy(lm(y ~ x, data = dataset)), # Need dataset[[1]] in tarchetypes 0.4.0
    values = data_frame(
      scenario = LETTERS[seq_len(10)],
      rows = seq(10, 100, length.out = 10)
    ),
    columns2 = NULL,
    batches = 10
  )
})
tar_make(reporter = "silent")
#> Warning message:
#> `data_frame()` was deprecated in tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
tar_read(model)
#> # A tibble: 2,000 × 8
#> term estimate std.error statistic p.value scenario rows tar_group
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <int>
#> 1 (Intercept) 17.1 12.8 1.34 0.218 A 10 10
#> 2 x 1.39 1.35 1.03 0.333 A 10 10
#> 3 (Intercept) 6.42 14.0 0.459 0.658 A 10 10
#> 4 x 1.75 1.28 1.37 0.209 A 10 10
#> 5 (Intercept) 32.8 7.14 4.60 0.00176 A 10 10
#> 6 x -0.300 1.14 -0.263 0.799 A 10 10
#> 7 (Intercept) 29.7 3.24 9.18 0.0000160 A 10 10
#> 8 x 0.314 0.414 0.758 0.470 A 10 10
#> 9 (Intercept) 20.0 13.6 1.47 0.179 A 10 10
#> 10 x 1.23 1.77 0.698 0.505 A 10 10
#> # … with 1,990 more rows
Created on 2021-12-10 by the reprex package (v2.0.1)
There is also tar_map_rep(), which may be easier if all your datasets are randomly generated, but I am not sure if I am overfitting your use case.
library(targets)
tar_script({
  library(broom)
  library(targets)
  library(tarchetypes)
  library(tibble)
  make_one_dataset <- function(n) {
    tibble(
      x = seq(n) + rnorm(n, 0, 5),
      y = seq(n) + rnorm(n, 20, 20)
    )
  }
  tar_map_rep(
    name = model,
    command = tidy(lm(y ~ x, data = make_one_dataset(n = rows))),
    values = data_frame(
      scenario = LETTERS[seq_len(10)],
      rows = seq(10, 100, length.out = 10)
    ),
    batches = 10,
    reps = 10
  )
})
tar_make(reporter = "silent")
#> Warning message:
#> `data_frame()` was deprecated in tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
tar_read(model)
#> # A tibble: 2,000 × 10
#> term estimate std.error statistic p.value scenario rows tar_batch tar_rep
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <int> <int>
#> 1 (Inter… 37.5 7.50 5.00 0.00105 A 10 1 1
#> 2 x -0.701 1.17 -0.601 0.564 A 10 1 1
#> 3 (Inter… 21.5 9.64 2.23 0.0567 A 10 1 2
#> 4 x -0.213 1.55 -0.138 0.894 A 10 1 2
#> 5 (Inter… 20.6 9.51 2.17 0.0620 A 10 1 3
#> 6 x 1.40 1.79 0.783 0.456 A 10 1 3
#> 7 (Inter… 11.6 11.2 1.04 0.329 A 10 1 4
#> 8 x 2.34 1.39 1.68 0.131 A 10 1 4
#> 9 (Inter… 26.8 9.16 2.93 0.0191 A 10 1 5
#> 10 x 0.288 1.10 0.262 0.800 A 10 1 5
#> # … with 1,990 more rows, and 1 more variable: tar_group <int>
Created on 2021-12-10 by the reprex package (v2.0.1)
Unfortunately, futures do come with overhead. Maybe it will be faster in your case if you try tar_make_clustermq()?

Many regressions using tidyverse and broom: Same dependent variable, different independent variables

This link shows how to answer my question in the case where we have the same independent variables, but potentially many different dependent variables: Use broom and tidyverse to run regressions on different dependent variables.
But my question is, how can I apply the same approach (e.g., tidyverse and broom) to run many regressions where we have the reverse situation: same dependent variables but different independent variable. In line with the code in the previous link, something like:
mod = lm(health ~ cbind(sex,income,happiness) + faculty, ds) %>% tidy()
However, this code does not do exactly what I want; instead, it produces:
Call:
lm(formula = income ~ cbind(sex, health) + faculty, data = ds)
Coefficients:
(Intercept) cbind(sex, health)sex
945.049 -47.911
cbind(sex, health)health faculty
2.342 1.869
which is equivalent to:
lm(formula = income ~ sex + health + faculty, data = ds)
Basically you'll need some way to create all the different formulas you want. Here's one way
qq <- expression(sex,income,happiness)
formulae <- lapply(qq, function(v) bquote(health~.(v)+faculty))
# [[1]]
# health ~ sex + faculty
# [[2]]
# health ~ income + faculty
# [[3]]
# health ~ happiness + faculty
Once you have all your formulas, you can map them to lm() and then to tidy():
library(purrr)
library(broom)
formulae %>% map(~lm(.x, ds)) %>% map_dfr(tidy, .id="model")
# A tibble: 9 x 6
# model term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 19.5 0.504 38.6 1.13e-60
# 2 1 sex 0.755 0.651 1.16 2.49e- 1
# 3 1 faculty -0.00360 0.291 -0.0124 9.90e- 1
# 4 2 (Intercept) 19.8 1.70 11.7 3.18e-20
# 5 2 income -0.000244 0.00162 -0.150 8.81e- 1
# 6 2 faculty 0.143 0.264 0.542 5.89e- 1
# 7 3 (Intercept) 18.4 1.88 9.74 4.79e-16
# 8 3 happiness 0.205 0.299 0.684 4.96e- 1
# 9 3 faculty 0.141 0.262 0.539 5.91e- 1
Using sample data
set.seed(11)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
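As an aside, base R's reformulate() builds the same formulas without bquote(), in case that reads more naturally:

```r
# Build health ~ <v> + faculty for each candidate predictor
ivs   <- c("sex", "income", "happiness")
forms <- lapply(ivs, function(v) reformulate(c(v, "faculty"), response = "health"))
forms[[1]]   # health ~ sex + faculty
```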
You could use the combn function to get all combinations of n independent variables and then iterate over them. Let's say n=3 here:
library(tidyverse)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
ivs = combn(names(ds)[names(ds)!="income"], 3, simplify=FALSE)
# Or, to get all models with 1 to 4 variables:
# ivs = map(1:4, ~combn(names(ds)[names(ds)!="income"], .x, simplify=FALSE)) %>%
# flatten()
names(ivs) = map(ivs, ~paste(.x, collapse="-"))
models = map(ivs,
~lm(as.formula(paste("income ~", paste(.x, collapse="+"))), data=ds))
map_df(models, broom::tidy, .id="model")
model term estimate std.error statistic p.value
* <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 happiness-health-sex (Intercept) 1086. 201. 5.39 5.00e- 7
2 happiness-health-sex happiness -25.4 21.4 -1.19 2.38e- 1
3 happiness-health-sex health 3.58 6.99 0.512 6.10e- 1
4 happiness-health-sex sex 11.5 41.5 0.277 7.82e- 1
5 happiness-health-faculty (Intercept) 1085. 197. 5.50 3.12e- 7
6 happiness-health-faculty happiness -25.8 20.9 -1.23 2.21e- 1
7 happiness-health-faculty health 3.45 6.98 0.494 6.23e- 1
8 happiness-health-faculty faculty 7.86 18.2 0.432 6.67e- 1
9 happiness-sex-faculty (Intercept) 1153. 141. 8.21 1.04e-12
10 happiness-sex-faculty happiness -25.9 21.4 -1.21 2.28e- 1
11 happiness-sex-faculty sex 3.44 46.2 0.0744 9.41e- 1
12 happiness-sex-faculty faculty 7.40 20.2 0.366 7.15e- 1
13 health-sex-faculty (Intercept) 911. 143. 6.35 7.06e- 9
14 health-sex-faculty health 3.90 7.03 0.554 5.81e- 1
15 health-sex-faculty sex 15.6 45.6 0.343 7.32e- 1
16 health-sex-faculty faculty 7.02 20.4 0.345 7.31e- 1

errors in apply lm regression to each row

I want to run an lm regression on each row of my data dt.
My code is
coe <- apply(dt, 1, FUN = function(x) lm(dbl ~ bld, data = as.data.frame(x))$coefficients)
But it returns:
Error in eval(predvars, data, env) : object 'dbl' not found
I confirm that there are dbl and bld in my data dt.
So I do not know how to deal with it.
I am guessing you have mistakenly described running a regression on each row (which is impossible, since each row provides just one observation for x and y in y ~ x). Instead, do you want to run the regression repeatedly for some grouping variable?
This is pretty easy to do with groupedstats:
groupedstats::grouped_lm(
data = ggplot2::diamonds,
grouping.vars = c(cut, color), # grouping variables
formula = price ~ carat * clarity # formula
)
#> # A tibble: 547 x 10
#> cut color term estimate std.error t.value conf.low conf.high
#> <ord> <ord> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Ideal E (Int~ -3085. 64.9 -47.5 -3212. -2958.
#> 2 Ideal E carat 10529. 74.1 142. 10384. 10674.
#> 3 Ideal E clar~ -2088. 267. -7.81 -2612. -1564.
#> 4 Ideal E clar~ 168. 265. 0.633 -352. 688.
#> 5 Ideal E clar~ -926. 217. -4.26 -1352. -500.
#> 6 Ideal E clar~ 625. 157. 3.99 318. 932.
#> 7 Ideal E clar~ -392. 107. -3.65 -602. -181.
#> 8 Ideal E clar~ 83.9 79.1 1.06 -71.1 239.
#> 9 Ideal E clar~ -40.8 67.4 -0.605 -173. 91.4
#> 10 Ideal E cara~ 9746. 287. 34.0 9185. 10308.
#> # ... with 537 more rows, and 2 more variables: p.value <dbl>,
#> # significance <chr>
Created on 2018-08-19 by the reprex package (v0.2.0.9000).
There are two problems with what you're trying to do. First, when you pass dt to apply with MARGIN = 1, each row x is dropped to a (named) numeric vector. Coercing it with as.data.frame then yields a data frame with a single column, which is why dbl is not found.
> x <- c(a = 1, b = 0.58)
> as.data.frame(x)
x
a 1.00
b 0.58
The second problem is that you are trying to fit a regression on a single observation. In essence, you are doing this:
> lm(b ~ a, data = data.frame(a = 1, b = 0.58))
Call:
lm(formula = b ~ a, data = data.frame(a = 1, b = 0.58))
Coefficients:
(Intercept) a
0.58 NA
You will not be able to estimate the parameter of interest because you need more observations than that.
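If the real goal is one regression per group rather than per row, a minimal base R sketch with split() and lapply() (illustrated on mtcars, since dt isn't shown):

```r
# One lm() per cyl group, then one row of coefficients per group
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))
t(sapply(fits, coef))   # rows: cyl = 4, 6, 8
```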

Calculate predicted model results by iterating through variables

I have several models fit to predict an outcome y = x1 + x2 + .....+x22. That's a fair number of predictors and a fair number of models. My customers want to know what's the marginal impact of each X on the estimated y. The models may include splines and interaction terms. I can do this, but it's cumbersome and requires loops or a lot of copy paste, which is slow or error prone. Can I do this better by writing my function differently and/or using purrr or an *apply function? Reproducible example is below. Ideally, I could write one function and apply it to longdata.
## create my fake data.
library(tidyverse)
library (rms)
ltrans <- function(l1) {
  newvar <- exp(l1) / (exp(l1) + 1)
  return(newvar)
}
set.seed(123)
mystates <- c("AL","AR","TN")
mydf <- data.frame(idno = seq(1:1500),state = rep(mystates,500))
mydf$x1[mydf$state=='AL'] <- rnorm(500,50,7)
mydf$x1[mydf$state=='AR'] <- rnorm(500,55,8)
mydf$x1[mydf$state=='TN'] <- rnorm(500,48,10)
mydf$x2 <- sample(1:5,500, replace = T)
mydf$x3 <- (abs(rnorm(1500,10,20)))^2
mydf$outcome <- as.numeric(cut2(sample(1:100,1500,replace = T),95))-1
dd<- datadist(mydf)
options(datadist = 'dd')
m1 <- lrm(outcome ~ x1 + x2+ rcs(x3,3), data = mydf)
dothemath <- function(x1 = x1ref, x2 = x2ref, x3 = x3ref) {
  ltrans(-2.1802256 - 0.01114239*x1 + 0.050319692*x2 - 0.00079289232*x3 +
    7.6508189e-10*pmax(x3 - 7.4686271, 0)^3 - 9.0897627e-10*pmax(x3 - 217.97865, 0)^3 +
    1.4389439e-10*pmax(x3 - 1337.2538, 0)^3)
}
x1ref <- 51.4
x2ref <- 3
x3ref <- 217.9
dothemath() ## 0.0591
mydf$referent <- dothemath()
mydf$thisobs <- dothemath(x1 = mydf$x1, x2 = mydf$x2, x3 = mydf$x3)
mydf$predicted <- predict(m1,mydf,type = "fitted.ind") ## yes, matches.
mydf$x1_marginaleffect <- dothemath(x1= mydf$x1)/mydf$referent
mydf$x2_marginaleffect <- dothemath(x2 = mydf$x2)/mydf$referent
mydf$x3_marginaleffect <- dothemath(x3 = mydf$x3)/mydf$referent
## can I do this with long data?
longdata <- mydf %>%
select(idno,state,referent,thisobs,x1,x2,x3) %>%
gather(varname,value,x1:x3)
##longdata$marginaleffect <- dothemath(longdata$varname = longdata$value) ## no, this does not work.
## I need to communicate to the function which variable it is evaluating.
longdata$marginaleffect[longdata$varname=="x1"] <- dothemath(x1 = longdata$value[longdata$varname=="x1"])/
longdata$referent[longdata$varname=="x1"]
longdata$marginaleffect[longdata$varname=="x2"] <- dothemath(x2 = longdata$value[longdata$varname=="x2"])/
longdata$referent[longdata$varname=="x2"]
longdata$marginaleffect[longdata$varname=="x3"] <- dothemath(x3 = longdata$value[longdata$varname=="x3"])/
longdata$referent[longdata$varname=="x3"]
testing<- inner_join(longdata[longdata$varname=="x1",c(1,7)],mydf[,c(1,10)])
head(testing) ## yes, both methods work.
Mostly you're just talking about a grouped mutate, with the caveat that dothemath is built such that you need to specify the variable name, which can be done by using do.call or purrr::invoke to call it on a named list of parameters:
longdata <- longdata %>%
  group_by(varname) %>%
  mutate(marginaleffect = invoke(dothemath, setNames(list(value), varname[1])) / referent)
longdata
#> # A tibble: 4,500 x 7
#> # Groups: varname [3]
#> idno state referent thisobs varname value marginaleffect
#> <int> <fct> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 AL 0.0591 0.0688 x1 46.1 1.06
#> 2 2 AR 0.0591 0.0516 x1 50.2 1.01
#> 3 3 TN 0.0591 0.0727 x1 38.0 1.15
#> 4 4 AL 0.0591 0.0667 x1 48.4 1.03
#> 5 5 AR 0.0591 0.0515 x1 47.1 1.05
#> 6 6 TN 0.0591 0.0484 x1 37.6 1.15
#> 7 7 AL 0.0591 0.0519 x1 60.9 0.905
#> 8 8 AR 0.0591 0.0531 x1 63.2 0.883
#> 9 9 TN 0.0591 0.0780 x1 47.8 1.04
#> 10 10 AL 0.0591 0.0575 x1 50.5 1.01
#> # ... with 4,490 more rows
# the first values look similar
inner_join(longdata[longdata$varname == "x1", c(1,7)], mydf[,c(1,10)])
#> Joining, by = "idno"
#> # A tibble: 1,500 x 3
#> idno marginaleffect x1_marginaleffect
#> <int> <dbl> <dbl>
#> 1 1 1.06 1.06
#> 2 2 1.01 1.01
#> 3 3 1.15 1.15
#> 4 4 1.03 1.03
#> 5 5 1.05 1.05
#> 6 6 1.15 1.15
#> 7 7 0.905 0.905
#> 8 8 0.883 0.883
#> 9 9 1.04 1.04
#> 10 10 1.01 1.01
#> # ... with 1,490 more rows
# check everything is the same
mydf %>%
gather(varname, marginaleffect, x1_marginaleffect:x3_marginaleffect) %>%
select(idno, varname, marginaleffect) %>%
mutate(varname = substr(varname, 1, 2)) %>%
all_equal(select(longdata, idno, varname, marginaleffect))
#> [1] TRUE
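The do.call()/setNames() mechanics can also be seen in isolation; here f is a stand-in function, not part of the original code:

```r
f <- function(x1 = 1, x2 = 2) x1 * 10 + x2
# setNames() builds a named argument list, do.call() applies it
do.call(f, setNames(list(5), "x2"))   # same as f(x2 = 5), i.e. 15
```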
It may be easier to reconfigure dothemath to take an additional parameter of the variable name so as to avoid the gymnastics.
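One way that reconfiguration could look, sketched with illustrative coefficients standing in for the fitted ones (dothemath2 and refs are hypothetical names):

```r
refs <- list(x1 = 51.4, x2 = 3, x3 = 217.9)
dothemath2 <- function(value, varname, ref = refs) {
  ref[[varname]] <- value   # override only the named variable
  with(ref, plogis(-2.18 - 0.011 * x1 + 0.050 * x2 - 0.0008 * x3))
}
dothemath2(60, "x1")   # value when x1 = 60 and the others sit at their references
```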
