R- iterating over variables names using a loop or function - r

I want to loop over variables within a data frame either using a for loop or function in R. I have coded the following (which doesn't work):
y <- c(0,0,1,1,0,1,0,1,1,1)
var1 <- c("a","a","a","b","b","b","c","c","c","c")
var2 <- c("m","m","n","n","n","n","o","o","o","m")
mydata <- data.frame(y,var1,var2)
myfunction <- function(v){
regressionresult <- lm(y ~ v, data = mydata)
summary(regressionresult)
}
myfunction("var1")
When I try running this, I get the error message:
Error in model.frame.default(formula = y ~ v, data = mydata, drop.unused.levels = TRUE) :
variable lengths differ (found for 'v')
I don't think this is a problem with the data, but with how I refer to the variable name because the following code produces the desired regression results (for one variable that I wanted to loop over):
regressionresult <- lm(y ~ var1, data = mydata)
summary(regressionresult)
How can I fix the function, or put the variables names in the loop?
[I also tried to loop over the variables names, but had a similar problem as with the function:
for(v in c("var1","var2")){
regressionresult <- lm(y ~ v, data = mydata)
summary(regressionresult)
}
When running this loop, it produces the error:
Error in model.frame.default(formula = y ~ v, data = mydata, drop.unused.levels = TRUE) :
variable lengths differ (found for 'v')
Thanks for your help!

We can use paste to create the formula to pass it on the lm
myfunction <- function(v){
regressionresult <- lm(paste0('y ~', v), data = mydata)
summary(regressionresult)
}
out1 <- myfunction("var1")
Or use glue::glue
myfunction <- function(v){
regressionresult <- lm(glue::glue('y ~ {v}'), data = mydata)
summary(regressionresult)
}
myfunction("var1")

You can use functions in the tidyverse to work with tidy data and applying model to different formulas.
y <- c(0,0,1,1,0,1,0,1,1,1)
var1 <- c("a","a","a","b","b","b","c","c","c","c")
var2 <- c("m","m","n","n","n","n","o","o","o","m")
library(tidyverse)
mydata <- data_frame(y,var1,var2)
res <- mydata %>%
# get data in long format - tidy format
gather("var_type", "value", -y) %>%
# we want one model per var_type
nest(-var_type) %>%
# apply lm on each data
mutate(
regressionresult = map(data, ~lm(y ~ value, data = .x))
)
res
#> # A tibble: 2 x 3
#> var_type data regressionresult
#> <chr> <list> <list>
#> 1 var1 <tibble [10 x 2]> <S3: lm>
#> 2 var2 <tibble [10 x 2]> <S3: lm>
summary(res$regressionresult[[1]])
#>
#> Call:
#> lm(formula = y ~ value, data = .x)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.7500 -0.3333 0.2500 0.3125 0.6667
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.3333 0.3150 1.058 0.325
#> valueb 0.3333 0.4454 0.748 0.479
#> valuec 0.4167 0.4167 1.000 0.351
#>
#> Residual standard error: 0.5455 on 7 degrees of freedom
#> Multiple R-squared: 0.1319, Adjusted R-squared: -0.1161
#> F-statistic: 0.532 on 2 and 7 DF, p-value: 0.6094
Broom package can help you work with the result then
library(broom)
#> Warning: le package 'broom' a été compilé avec la version R 3.4.4
res <- res %>%
mutate(tidy_summary = map(regressionresult, broom::tidy))
res
#> # A tibble: 2 x 4
#> var_type data regressionresult tidy_summary
#> <chr> <list> <list> <list>
#> 1 var1 <tibble [10 x 2]> <S3: lm> <data.frame [3 x 5]>
#> 2 var2 <tibble [10 x 2]> <S3: lm> <data.frame [3 x 5]>
You can get one of the summary
res$tidy_summary[[1]]
#> term estimate std.error statistic p.value
#> 1 (Intercept) 0.3333333 0.3149704 1.0583005 0.3250657
#> 2 valueb 0.3333333 0.4454354 0.7483315 0.4786436
#> 3 valuec 0.4166667 0.4166667 1.0000000 0.3506167
or unnest to get a data.frame to work with.
res %>%
unnest(tidy_summary)
#> # A tibble: 6 x 6
#> var_type term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 var1 (Intercept) 0.333 0.315 1.06 0.325
#> 2 var1 valueb 0.333 0.445 0.748 0.479
#> 3 var1 valuec 0.417 0.417 1.000 0.351
#> 4 var2 (Intercept) 0.333 0.315 1.06 0.325
#> 5 var2 valuen 0.417 0.417 1 0.351
#> 6 var2 valueo 0.333 0.445 0.748 0.479
Functions of interest are nest and unnest from [tidyr][http://tidyr.tidyverse.org/) that allow to create list columns easily, map from purrr that allows to iterate over a list and apply a function (here lm) and tidy from broom package that offers functions to tidy results from models (summary results, predict results, ...)
Not used here but know that modelr package helps for doing pipelines when modeling.

Related

How to use purrr to iterate over every combo of covariates and outcomes in lm reg

A common scenario for me is that I need to run basically the same regression model but over a series of different outcomes, and for sensitivity analyses I simultaneously need to iterate over different sets of covariates.
I'm still new to R, but using the below with purrr I'm able to iterate over outcomes and covariates, but it marches through pairs of lists in parallel of course, when I need it to march through every combination from each list.
What are some options for how iterative over all combinations of outcomes and covariates?
Also, does anyone know why the below code won't work with "map2?" I get the error message that "as_mapper(.f, ...) : argument ".f" is missing, with no default"
library(dplyr)
library(purrr)
dataset <- tibble(
y1=rnorm(n=100),
y2=rnorm(n=100),
x1=rnorm(n=100),
x2=rnorm(n=100))
outcomes <- dataset %>%
select(y1,y2)
covars <- dataset %>%
select(x1,x2)
paramlist <- list(covarL,outcomeL)
paramlist %>%
pmap(~lm(.y ~ .x,data=dataset))
There are many ways to do this in the larger tidyverse. I am a fan of dplyr::rowwise for this kind of calculations. We can use the colnames instead of the actual data and then create a matrix like tibble with tidyr::expand_grid which contains all combinations of outcomes and covars. Then we can use dplyr::rowwise and use lm inside list() together with reformulate which takes strings as inputs. To get the result we can use broom::tidy.
library(dplyr)
library(purrr)
library(tidyr)
dataset <- tibble(
y1=rnorm(n=100),
y2=rnorm(n=100),
x1=rnorm(n=100),
x2=rnorm(n=100))
outcomes <- dataset %>%
select(y1,y2) %>% colnames
covars <- dataset %>%
select(x1,x2) %>% colnames
paramlist <- expand_grid(outcomes, covars)
paramlist %>%
rowwise %>%
mutate(mod = list(lm(reformulate(outcomes, covars), data = dataset)),
res = list(broom::tidy(mod)))
#> # A tibble: 4 x 4
#> # Rowwise:
#> outcomes covars mod res
#> <chr> <chr> <list> <list>
#> 1 y1 x1 <lm> <tibble [2 x 5]>
#> 2 y1 x2 <lm> <tibble [2 x 5]>
#> 3 y2 x1 <lm> <tibble [2 x 5]>
#> 4 y2 x2 <lm> <tibble [2 x 5]>
Created on 2021-09-06 by the reprex package (v2.0.1)
We can do the same thing with {purrr} instead of dplyr::rowwise:
paramlist %>%
mutate(mod = map2(outcomes, covars, ~ lm(reformulate(.y, .x), data = dataset)),
res = map(mod, broom::tidy))
#> # A tibble: 4 x 4
#> outcomes covars mod res
#> <chr> <chr> <list> <list>
#> 1 y1 x1 <lm> <tibble [2 x 5]>
#> 2 y1 x2 <lm> <tibble [2 x 5]>
#> 3 y2 x1 <lm> <tibble [2 x 5]>
#> 4 y2 x2 <lm> <tibble [2 x 5]>
Created on 2021-09-06 by the reprex package (v2.0.1)
Another pure {purrr} solution is to use a nested map call. Since it is nested we need to flatten the results before we can use map(summary) on them.
# outcomes and covars are the same strings as above
outcomes %>%
map(~ map(covars, function(.y) lm(reformulate(.y, .x), data = dataset))) %>%
flatten %>%
map(summary)
#> [[1]]
#>
#> Call:
#> lm(formula = reformulate(.y, .x), data = dataset)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.20892 -0.56744 -0.08498 0.55445 2.10146
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.0009328 0.0923062 -0.010 0.992
#> x1 -0.0809739 0.0932059 -0.869 0.387
#>
#> Residual standard error: 0.9173 on 98 degrees of freedom
#> Multiple R-squared: 0.007643, Adjusted R-squared: -0.002483
#> F-statistic: 0.7548 on 1 and 98 DF, p-value: 0.3871
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = reformulate(.y, .x), data = dataset)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.11442 -0.59186 -0.08153 0.61642 2.10575
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.02048 0.09461 -0.216 0.829
#> x2 -0.05159 0.10805 -0.477 0.634
#>
#> Residual standard error: 0.9197 on 98 degrees of freedom
#> Multiple R-squared: 0.002321, Adjusted R-squared: -0.007859
#> F-statistic: 0.228 on 1 and 98 DF, p-value: 0.6341
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = reformulate(.y, .x), data = dataset)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.3535 -0.7389 -0.2023 0.6236 3.8627
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.08178 0.10659 -0.767 0.445
#> x1 -0.08476 0.10763 -0.788 0.433
#>
#> Residual standard error: 1.059 on 98 degrees of freedom
#> Multiple R-squared: 0.006289, Adjusted R-squared: -0.003851
#> F-statistic: 0.6202 on 1 and 98 DF, p-value: 0.4329
#>
#>
#> [[4]]
#>
#> Call:
#> lm(formula = reformulate(.y, .x), data = dataset)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.4867 -0.7020 -0.1935 0.5869 3.7574
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.06575 0.10875 -0.605 0.547
#> x2 0.12388 0.12420 0.997 0.321
#>
#> Residual standard error: 1.057 on 98 degrees of freedom
#> Multiple R-squared: 0.01005, Adjusted R-squared: -5.162e-05
#> F-statistic: 0.9949 on 1 and 98 DF, p-value: 0.321
Created on 2021-09-06 by the reprex package (v2.0.1)

Accessing results from a list column when some elements are NA

Problem: List column contains a few missing values
Data
Consider the following tibble that contains the results of 2 model fits:
> Model_fits
# A tibble: 4 x 4
cyl data model1 model2
<dbl> <list<tibble[,2]>> <list> <list>
1 2 [5 x 2] <dbl [1]> <dbl [1]>
2 4 [11 x 2] <lm> <lm>
3 6 [7 x 2] <lm> <dbl [1]>
4 8 [14 x 2] <lm> <lm>
The data for cyl==2 was missing in this example. Therefore, model1 contains NA_real_ in the first row. Similarly, model2 contains NA_real_ in rows 1 and 3.
Extracting model results
I want to extract the results of model fit using broom::glance. But it does not work due to the missing values:
> Model_fits %>%
+ mutate(summary_res = map(model1, broom::glance))
Error: Problem with `mutate()` input `summary_res`.
x No glance method for objects of class numeric
i Input `summary_res` is `map(model1, broom::glance)`.
Attempt at solution
So, I try to use purrr::possibly, but that does not work either:
> Model_fits %>%
+ mutate(summary_res1 = map(model1, ~ possibly(broom::glance(.x),
+ otherwise = NA_real_)))
Error: Problem with `mutate()` input `summary_res1`.
x No glance method for objects of class numeric
i Input `summary_res1` is `map(model1, ~possibly(broom::glance(.x), otherwise = NA_real_))`.
Expected outcome
I want to get the broom::glance results for all non-missing values and NA_real_ for all missing values. Please guide me how can I get these results?
Code for creating Model_fits
Please note that I created the following as a reproducible example. But this is not my original data/model results.
library(tidyverse)
new_data <- tibble(mpg = rep(NA_real_, 5),
cyl = rep(2, 5),
disp = rep(NA_real_, 5))
mtcars2 <- mtcars %>%
dplyr::select(mpg, cyl, disp)
mt <- bind_rows(mtcars2,
new_data)
model_res_list <- map(mtcars2 %>% group_split(cyl), ~lm(mpg ~ disp, data = .x))
lizt <- list(NA_real_, model_res_list[[1]], model_res_list[[2]], model_res_list[[3]])
lizt2 <- list(NA_real_, model_res_list[[1]], NA_real_, model_res_list[[3]])
Model_fits <- mt %>%
group_nest(cyl) %>%
mutate(model1 = lizt,
model2 = lizt2)
One more thing you could do about this is using tryCatch function, So that you define in case of an error occurring what would be the output of your function. In this case it will not bring the execution of the function to a halt.
Model_fits %>%
mutate(mod01 = map(model1, ~ tryCatch(glance(.x),
error = function(cond) {
NA_real_
}))) %>%
unnest(mod01)
# A tibble: 4 x 17
cyl data model1 model2 mod01 r.squared adj.r.squared sigma statistic p.value df
<dbl> <list<tibbl> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 [5 x 2] <dbl [~ <dbl ~ NA NA NA NA NA NA NA
2 4 [11 x 2] <lm> <lm> NA 0.648 0.609 2.82 16.6 0.00278 1
3 6 [7 x 2] <lm> <dbl ~ NA 0.0106 -0.187 1.58 0.0537 0.826 1
4 8 [14 x 2] <lm> <lm> NA 0.270 0.209 2.28 4.44 0.0568 1
# ... with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
If we want to use possibly or safely instead of tryCatch we should first write a custom function that wraps glance in general and before applying on our data set:
poss_glance <- possibly(glance, otherwise = NA_real_)
Model_fits %>%
mutate(mod01 = map(model1, ~ poss_glance(.x))) %>%
unnest(mod01)
# A tibble: 4 x 17
cyl data model1 model2 mod01 r.squared adj.r.squared sigma statistic p.value df
<dbl> <list<tibbl> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 [5 x 2] <dbl [~ <dbl ~ NA NA NA NA NA NA NA
2 4 [11 x 2] <lm> <lm> NA 0.648 0.609 2.82 16.6 0.00278 1
3 6 [7 x 2] <lm> <dbl ~ NA 0.0106 -0.187 1.58 0.0537 0.826 1
4 8 [14 x 2] <lm> <lm> NA 0.270 0.209 2.28 4.44 0.0568 1
# ... with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
Or even we could use safely in place of possibly so that our function returns an enhanced output in this case NA_real_:
safe_glance <- safely(glance, otherwise = NA_real_)
Model_fits %>%
mutate(mod01 = map(model1, ~ safe_glance(.x)))
You can just check if the value is NA in a wrapper function that you pass to map.
Model_fits %>%
mutate(summary_res = map(model1, function(x) if (length(x) == 1 && is.na(x)) NA_real_ else broom::glance(x)))

Using a loop to run a regression using different datasets in R?

I have the following dataset:
n <- 2
strata <- rep(1:4, each=n)
y <- rnorm(n = 8)
x <- 1:8
df <- cbind.data.frame(y, x, strata)
I want to perform the following processes using a loop
data_1 <- subset(df, strata == 1)
data_2 <- subset(df, strata == 2)
data_3 <- subset(df, strata == 3)
data_4 <- subset(df, strata == 4)
model1 <- lm(y ~ x, data = data_1)
model2 <- lm(y ~ x, data = data_2)
model3 <- lm(y ~ x, data = data_3)
model4 <- lm(y ~ x, data = data_4)
Any help would be appreciated, thanks!
We can split the data by 'strata' into a list and create the model by looping over the list with lapply
out <- lapply(split(df, df$strata), function(dat) lm(y ~ x, data = dat))
-oputut
$`1`
Call:
lm(formula = y ~ x, data = dat)
Coefficients:
(Intercept) x
-2.907 1.924
$`2`
Call:
lm(formula = y ~ x, data = dat)
Coefficients:
(Intercept) x
2.5733 -0.7632
$`3`
Call:
lm(formula = y ~ x, data = dat)
Coefficients:
(Intercept) x
0.9309 -0.1986
$`4`
Call:
lm(formula = y ~ x, data = dat)
Coefficients:
(Intercept) x
8.479 -1.207
try to do it this way
library(tidyverse)
library(broom)
mtcars %>%
group_nest(gear) %>%
mutate(model = map(data, ~lm(disp ~ mpg, data = .x)) %>% map(broom::glance)) %>%
unnest(model)
#> # A tibble: 3 x 14
#> gear data r.squared adj.r.squared sigma statistic p.value df logLik
#> <dbl> <list<tibb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 [15 x 10] 0.526 0.489 67.8 14.4 2.23e-3 1 -83.5
#> 2 4 [12 x 10] 0.812 0.793 17.7 43.2 6.28e-5 1 -50.4
#> 3 5 [5 x 10] 0.775 0.701 63.2 10.4 4.86e-2 1 -26.5
#> # ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
#> # df.residual <int>, nobs <int>
mtcars %>%
group_nest(gear) %>%
mutate(model = map(data, ~lm(disp ~ mpg, data = .x)) %>% map(broom::tidy)) %>%
unnest(model)
#> # A tibble: 6 x 7
#> gear data term estimate std.error statistic p.value
#> <dbl> <list<tibble[,10]>> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 3 [15 x 10] (Intercept) 655. 88.3 7.41 0.00000508
#> 2 3 [15 x 10] mpg -20.4 5.37 -3.80 0.00223
#> 3 4 [12 x 10] (Intercept) 286. 25.3 11.3 0.000000514
#> 4 4 [12 x 10] mpg -6.64 1.01 -6.57 0.0000628
#> 5 5 [5 x 10] (Intercept) 529. 105. 5.02 0.0152
#> 6 5 [5 x 10] mpg -15.3 4.74 -3.22 0.0486
Created on 2021-06-07 by the reprex package (v2.0.0)

Problem with running paired t-test within nested dplyr dataset

I have gone through the vignette for row-wise operations for the new dplyr v1.0.0 and am intrigued by the possibilities of the nest_by function for modelling within different silos of a dataset.
However I am having difficulty getting a repeated-measures analysis to work.
Here's an example to illustrate when it does work
df1 <- data.frame(group = factor(rep(LETTERS[1:3],10)),
pred = factor(rep(letters[1:2],each=5,length.out=30)),
out = rnorm(30))
Now create the nesting based on the group variable.
library(dplyr)
nest1 <- df1 %>% nest_by(group)
nest
We can view this new special nested data frame
# A tibble: 3 x 2
# Rowwise: group
# group data
# <fct> <list<tbl_df[,2]>>
# a [10 x 2]
# b [10 x 2]
# c [10 x 2]
Now we can perform operations on it, like a linear regression, regressing out on pred within each level of the original group variable.
mods <- nest1 %>% mutate(mod = list(lm(out ~ pred, data = data)))
In this new object we have added a new column to the original nested dataset containing the lm() object
mods
# # A tibble: 3 x 3
# # Rowwise: group
# group data mod
# <fct> <list<tbl_df[,2]>> <list>
# 1 A [10 x 2] <lm>
# 2 B [10 x 2] <lm>
# 3 C [10 x 2] <lm>
And we can view the results of these models
library(broom)
mods %>% summarise(broom::tidy(mod))
# A tibble: 6 x 6
# Groups: group [3]
# group term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 A (Intercept) 0.0684 0.295 0.232 0.823
# 2 A predb -0.231 0.418 -0.553 0.595
# 3 B (Intercept) -0.159 0.447 -0.356 0.731
# 4 B predb 0.332 0.633 0.524 0.615
# 5 C (Intercept) -0.385 0.245 -1.57 0.154
# 6 C predb 0.891 0.346 2.58 0.0329
Now I would like to be able to do the same thing but with a repeated measures t-test.
# dataset with grouping factor and two columns, each representing a measure at one of two timepoints
df2 <- data.frame(group = factor(rep(letters[1:3],10)),
t1 = rnorm(30),
t2 = rnorm(30))
# nest by grouping factor
nest2 <- df2 %>% nest_by(group)
nest2
# A tibble: 3 x 2
# Rowwise: group
# group data
# <fct> <list<tbl_df[,2]>>
# 1 a [10 x 2]
# 2 b [10 x 2]
# 3 c [10 x 2]
Now when I try to perform a paired t-test at each level of the new nested dataset, using a similar procedure to the linear model...
mods2 <- nest2 %>% mutate(t = list(t.test(t1, t2, data = data)))
...I get the following error message
Error: Problem with `mutate()` input `t`.
x object 't1' not found
i Input `t` is `list(t.test(t1, t2, data = data))`.
i The error occured in row 1.
Run `rlang::last_error()` to see where the error occurred.
Can anyone help me?
The data option is used with the formula method, while 's3' method with x, y as argument, we can wrap using with
library(dplyr)
library(purrr)
nest2 %>%
mutate(t = list(with(data, t.test(t1, t2))))
# A tibble: 3 x 3
# Rowwise: group
# group data t
# <fct> <list<tbl_df[,2]>> <list>
#1 a [10 × 2] <htest>
#2 b [10 × 2] <htest>
#3 c [10 × 2] <htest>
Or use extractors ($, [[)
nest2 %>%
mutate(t = list(t.test(data$t1, data$t2)))

Running n linear regressions with few lines of code and storing results in a matrix

I have a tibble db of 25 dependent variables (db[,2:26]) and a vector of a single explanatory variable rmrf. All I want to do is to run a regression for each of the 25 dependent variables on the same common explanatory variable.
I want to obtain a table of alphas, betas, t.stat for alphas and R2, hence a matrix of 25 rows (one for each dependent variable) and 4 columns.
Nevertheless, despite it seems to be a pretty simple issue (I am a newbie in R), I do not understand:
how to smartly run all the 25 regressions in few lines of code [ loop, apply?]
how to extract the 4 required quantities.
While for the first issue I may have a solution (not sure though!):
varlist <- names(db)[2:26] #the 25 dependent variables
models <- lapply(varlist, function(x) {
lm(substitute(i ~ rmrf, list(i = as.name(x))), data = db)
})
for the second one I still have no idea (except using the function coefficient() of the lm class, but still cannot integrate the other 2 quantities).
Could you please help me figuring this out?
lm is vectoried across the dependent variables:
Just do
lm(as.matrix(db[,-1]) ~ rmrf, data = db)
Eg. Lets take an example of iris dataset, if we take that Petal.Width is the independent variable while the first 3 variables are the dependent vriable, then we could do:
dat <- iris[-5]
library(tidyverse)
library(broom)
lm(as.matrix(dat[-4]) ~ Petal.Width, dat) %>%
{cbind.data.frame(tidy(.)%>%
pivot_wider(response, names_from = term,
values_from = c(estimate, statistic)),
R.sq = map_dbl(summary(.),~.x$r.squared))}%>%
`rownames<-`(NULL)
response estimate_(Intercept) estimate_Petal.Width statistic_(Intercept) statistic_Petal.Width R.sq
1 Sepal.Length 4.777629 0.8885803 65.50552 17.296454 0.6690277
2 Sepal.Width 3.308426 -0.2093598 53.27795 -4.786461 0.1340482
3 Petal.Length 1.083558 2.2299405 14.84998 43.387237 0.9271098
If I got right, you want to apply LM for each pair independent ~ dependent in the dataset. You can use pivot/nest/broom strategy like this:
library(tidyverse)
library(broom)
# creating some dataset
db <- tibble(
y = rnorm(5),
x1 = rnorm(5),
x2 = rnorm(5),
x3 = rnorm(5)
)
# lets see
head(db)
# A tibble: 5 x 4
y x1 x2 x3
<dbl> <dbl> <dbl> <dbl>
1 -0.994 0.139 -0.935 0.0134
2 1.09 0.960 1.23 1.45
3 1.03 0.374 1.06 -0.900
4 1.63 -0.162 -0.498 -0.740
5 -0.0941 1.47 0.312 0.933
# pivot to long format by "independend var"
db_pivot <- db %>%
gather(key = "var_name", value = "value", -y)
head(db_pivot)
# A tibble: 6 x 3
y var_name value
<dbl> <chr> <dbl>
1 -0.368 x1 -1.29
2 -1.48 x1 -0.0813
3 -2.61 x1 0.477
4 0.602 x1 -0.525
5 -0.264 x1 0.0598
6 -0.368 x2 -0.573
# pipeline
resp <- db_pivot %>%
group_by(var_name) %>% # for each var group
nest() %>% # nest the dataset
mutate(lm_model=map(data,function(.x){ # apply lm for each dataset
lm(y~., data=.x)
})) %>%
mutate( # for each lm model fitted
coef_stats = map(lm_model, tidy), # use broom to extract coef statistics from lm model
model_stats = map(lm_model, glance) # use broom to extract regression stats from lm model
)
head(resp)
# A tibble: 3 x 5
# Groups: var_name [3]
var_name data lm_model coef_stats model_stats
<chr> <list> <list> <list> <list>
1 x1 <tibble [5 x 2]> <lm> <tibble [2 x 5]> <tibble [1 x 11]>
2 x2 <tibble [5 x 2]> <lm> <tibble [2 x 5]> <tibble [1 x 11]>
3 x3 <tibble [5 x 2]> <lm> <tibble [2 x 5]> <tibble [1 x 11]>
# coefs
resp %>%
unnest(coef_stats) %>%
select(-data,-lm_model, -model_stats)
# A tibble: 6 x 6
# Groups: var_name [3]
var_name term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 x1 (Intercept) -1.14 0.548 -2.08 0.129
2 x1 value -1.16 0.829 -1.40 0.257
3 x2 (Intercept) -0.404 0.372 -1.09 0.356
4 x2 value -0.985 0.355 -2.77 0.0694
5 x3 (Intercept) -0.707 0.755 -0.936 0.418
6 x3 value -0.206 0.725 -0.284 0.795
# R2
resp %>%
unnest(model_stats) %>%
select(-data,-lm_model, -coef_stats)
# A tibble: 3 x 12
# Groups: var_name [3]
var_name r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 x1 0.394 0.192 1.12 1.95 0.257 2 -6.37 18.7 17.6 3.74 3
2 x2 0.719 0.626 0.760 7.69 0.0694 2 -4.44 14.9 13.7 1.73 3
3 x3 0.0261 -0.298 1.42 0.0805 0.795 2 -7.55 21.1 19.9 6.01 3

Resources