Problem with running paired t-test within nested dplyr dataset - r

I have gone through the vignette for row-wise operations for the new dplyr v1.0.0 and am intrigued by the possibilities of the nest_by function for modelling within different silos of a dataset.
However I am having difficulty getting a repeated-measures analysis to work.
Here's an example to illustrate when it does work
df1 <- data.frame(group = factor(rep(LETTERS[1:3],10)),
pred = factor(rep(letters[1:2],each=5,length.out=30)),
out = rnorm(30))
Now create the nesting based on the group variable.
library(dplyr)
nest1 <- df1 %>% nest_by(group)
We can view this new special nested data frame
nest1
# A tibble: 3 x 2
# Rowwise: group
# group data
# <fct> <list<tbl_df[,2]>>
# 1 A [10 x 2]
# 2 B [10 x 2]
# 3 C [10 x 2]
Now we can perform operations on it, like a linear regression, regressing out on pred within each level of the original group variable.
mods <- nest1 %>% mutate(mod = list(lm(out ~ pred, data = data)))
In this new object we have added a new column to the original nested dataset containing the lm() object
mods
# # A tibble: 3 x 3
# # Rowwise: group
# group data mod
# <fct> <list<tbl_df[,2]>> <list>
# 1 A [10 x 2] <lm>
# 2 B [10 x 2] <lm>
# 3 C [10 x 2] <lm>
And we can view the results of these models
library(broom)
mods %>% summarise(broom::tidy(mod))
# A tibble: 6 x 6
# Groups: group [3]
# group term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 A (Intercept) 0.0684 0.295 0.232 0.823
# 2 A predb -0.231 0.418 -0.553 0.595
# 3 B (Intercept) -0.159 0.447 -0.356 0.731
# 4 B predb 0.332 0.633 0.524 0.615
# 5 C (Intercept) -0.385 0.245 -1.57 0.154
# 6 C predb 0.891 0.346 2.58 0.0329
Now I would like to be able to do the same thing but with a repeated measures t-test.
# dataset with grouping factor and two columns, each representing a measure at one of two timepoints
df2 <- data.frame(group = factor(rep(letters[1:3],10)),
t1 = rnorm(30),
t2 = rnorm(30))
# nest by grouping factor
nest2 <- df2 %>% nest_by(group)
nest2
# A tibble: 3 x 2
# Rowwise: group
# group data
# <fct> <list<tbl_df[,2]>>
# 1 a [10 x 2]
# 2 b [10 x 2]
# 3 c [10 x 2]
Now when I try to perform a paired t-test at each level of the new nested dataset, using a similar procedure to the linear model...
mods2 <- nest2 %>% mutate(t = list(t.test(t1, t2, data = data)))
...I get the following error message
Error: Problem with `mutate()` input `t`.
x object 't1' not found
i Input `t` is `list(t.test(t1, t2, data = data))`.
i The error occured in row 1.
Run `rlang::last_error()` to see where the error occurred.
Can anyone help me?

The data argument is only used by the formula method of t.test(). The default S3 method takes x and y as vectors, so there is no data argument to look them up in. We can wrap the call in with()
library(dplyr)
library(purrr)
nest2 %>%
mutate(t = list(with(data, t.test(t1, t2))))
# A tibble: 3 x 3
# Rowwise: group
# group data t
# <fct> <list<tbl_df[,2]>> <list>
#1 a [10 × 2] <htest>
#2 b [10 × 2] <htest>
#3 c [10 × 2] <htest>
Or use extractors ($, [[)
nest2 %>%
mutate(t = list(t.test(data$t1, data$t2)))
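Note that neither call above passes paired = TRUE, so as written they run a Welch two-sample test rather than the paired test the question asks for. A minimal sketch of the paired version, assuming the df2 layout from the question (the seed and the column names mean_diff and p.value are only illustrative):

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df2 <- data.frame(group = factor(rep(letters[1:3], 10)),
                  t1 = rnorm(30),
                  t2 = rnorm(30))

paired_res <- df2 %>%
  nest_by(group) %>%
  # one paired t-test per row (i.e. per level of group)
  mutate(test = list(t.test(data$t1, data$t2, paired = TRUE))) %>%
  # in a rowwise data frame, `test` refers to the htest object of that row
  mutate(mean_diff = unname(test$estimate),
         p.value = test$p.value) %>%
  ungroup() %>%
  select(group, mean_diff, p.value)

paired_res
```

Keeping the extraction in a second mutate() mirrors the pattern in the dplyr row-wise vignette, where the list column is created first and used in a later step.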

Related

Running n linear regressions with few lines of code and storing results in a matrix

I have a tibble db of 25 dependent variables (db[,2:26]) and a vector of a single explanatory variable rmrf. All I want to do is to run a regression for each of the 25 dependent variables on the same common explanatory variable.
I want to obtain a table of alphas, betas, t.stat for alphas and R2, hence a matrix of 25 rows (one for each dependent variable) and 4 columns.
Although this seems to be a pretty simple issue (I am new to R), I do not understand:
how to run all 25 regressions in a few lines of code (loop, apply?)
how to extract the 4 required quantities.
While for the first issue I may have a solution (not sure though!):
varlist <- names(db)[2:26] #the 25 dependent variables
models <- lapply(varlist, function(x) {
lm(substitute(i ~ rmrf, list(i = as.name(x))), data = db)
})
for the second one I still have no idea (except using the function coefficient() of the lm class, but still cannot integrate the other 2 quantities).
Could you please help me figure this out?
lm is vectorized across the dependent variables.
Just do:
lm(as.matrix(db[,-1]) ~ rmrf, data = db)
For example, take the iris dataset. If we treat Petal.Width as the independent variable and the first 3 variables as the dependent variables, then we could do:
dat <- iris[-5]
library(tidyverse)
library(broom)
lm(as.matrix(dat[-4]) ~ Petal.Width, dat) %>%
{cbind.data.frame(tidy(.)%>%
pivot_wider(id_cols = response, names_from = term,
values_from = c(estimate, statistic)),
R.sq = map_dbl(summary(.),~.x$r.squared))}%>%
`rownames<-`(NULL)
response estimate_(Intercept) estimate_Petal.Width statistic_(Intercept) statistic_Petal.Width R.sq
1 Sepal.Length 4.777629 0.8885803 65.50552 17.296454 0.6690277
2 Sepal.Width 3.308426 -0.2093598 53.27795 -4.786461 0.1340482
3 Petal.Length 1.083558 2.2299405 14.84998 43.387237 0.9271098
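The same matrix-response fit can also produce the 4-column table from the question (alpha, beta, t-statistic of alpha, R2) in base R. A sketch on made-up data: db, rmrf, and the column names p1 to p3 are stand-ins for the asker's objects, and the simulated values are not real returns.

```r
set.seed(1)
rmrf <- rnorm(50)                        # stand-in explanatory variable
db <- data.frame(date = 1:50,            # first column is not a response
                 p1 = 0.5 * rmrf + rnorm(50),
                 p2 = 1.0 * rmrf + rnorm(50),
                 p3 = 1.5 * rmrf + rnorm(50))

# One call fits a separate regression for every column of the response matrix
fit <- lm(as.matrix(db[, -1]) ~ rmrf)

# summary() of a multi-response fit is a list with one summary.lm per response
res <- t(sapply(summary(fit), function(s) {
  c(alpha   = s$coefficients["(Intercept)", "Estimate"],
    beta    = s$coefficients["rmrf", "Estimate"],
    t_alpha = s$coefficients["(Intercept)", "t value"],
    r2      = s$r.squared)
}))
rownames(res) <- colnames(db)[-1]
res
```

The result is a numeric matrix with one row per dependent variable, which matches the shape asked for.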
If I understood correctly, you want to fit lm for each dependent ~ independent pair in the dataset. You can use a pivot/nest/broom strategy like this:
library(tidyverse)
library(broom)
# creating some dataset
db <- tibble(
y = rnorm(5),
x1 = rnorm(5),
x2 = rnorm(5),
x3 = rnorm(5)
)
# lets see
head(db)
# A tibble: 5 x 4
y x1 x2 x3
<dbl> <dbl> <dbl> <dbl>
1 -0.994 0.139 -0.935 0.0134
2 1.09 0.960 1.23 1.45
3 1.03 0.374 1.06 -0.900
4 1.63 -0.162 -0.498 -0.740
5 -0.0941 1.47 0.312 0.933
# pivot to long format by "independent var"
db_pivot <- db %>%
gather(key = "var_name", value = "value", -y)
head(db_pivot)
# A tibble: 6 x 3
y var_name value
<dbl> <chr> <dbl>
1 -0.368 x1 -1.29
2 -1.48 x1 -0.0813
3 -2.61 x1 0.477
4 0.602 x1 -0.525
5 -0.264 x1 0.0598
6 -0.368 x2 -0.573
# pipeline
resp <- db_pivot %>%
group_by(var_name) %>% # for each var group
nest() %>% # nest the dataset
mutate(lm_model=map(data,function(.x){ # apply lm for each dataset
lm(y~., data=.x)
})) %>%
mutate( # for each lm model fitted
coef_stats = map(lm_model, tidy), # use broom to extract coef statistics from lm model
model_stats = map(lm_model, glance) # use broom to extract regression stats from lm model
)
head(resp)
# A tibble: 3 x 5
# Groups: var_name [3]
var_name data lm_model coef_stats model_stats
<chr> <list> <list> <list> <list>
1 x1 <tibble [5 x 2]> <lm> <tibble [2 x 5]> <tibble [1 x 11]>
2 x2 <tibble [5 x 2]> <lm> <tibble [2 x 5]> <tibble [1 x 11]>
3 x3 <tibble [5 x 2]> <lm> <tibble [2 x 5]> <tibble [1 x 11]>
# coefs
resp %>%
unnest(coef_stats) %>%
select(-data,-lm_model, -model_stats)
# A tibble: 6 x 6
# Groups: var_name [3]
var_name term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 x1 (Intercept) -1.14 0.548 -2.08 0.129
2 x1 value -1.16 0.829 -1.40 0.257
3 x2 (Intercept) -0.404 0.372 -1.09 0.356
4 x2 value -0.985 0.355 -2.77 0.0694
5 x3 (Intercept) -0.707 0.755 -0.936 0.418
6 x3 value -0.206 0.725 -0.284 0.795
# R2
resp %>%
unnest(model_stats) %>%
select(-data,-lm_model, -coef_stats)
# A tibble: 3 x 12
# Groups: var_name [3]
var_name r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 x1 0.394 0.192 1.12 1.95 0.257 2 -6.37 18.7 17.6 3.74 3
2 x2 0.719 0.626 0.760 7.69 0.0694 2 -4.44 14.9 13.7 1.73 3
3 x3 0.0261 -0.298 1.42 0.0805 0.795 2 -7.55 21.1 19.9 6.01 3

Create a correlation matrix based on multiple column values with p-values in R

I'm new to R and I'm trying to create a correlation matrix that will also include p-values.
The main issue I'm having is with computing correlations for specific numeric variables depending on the identity of three factors.
My data looks something like this
data.frame(
cond = c("low", "medium", "high"),
group = c("gr1", "gr2", "gr3"),
rand = c("yes", "no"),
trial1 = rnorm(30),
trial2 = rnorm(30))
I want to correlate trial1 and trial2 for each unique value in cond, group, and rand. Essentially, for each level of those factors, I would like to get an r- and p-value, and save them in a matrix.
I tried it the long way, extracting the observations that I want to correlate by using three logical tests like this: (df$cond == "low") & (df$group == 'gr1') & (df$rand == 'yes'). This gave me what I needed, but the code is very long and doesn't save the values in a matrix.
I've never tried for-loops before so I'd appreciate it if anyone knew either how to do that or another efficient way of doing it.
Thank you!
library(dplyr)
library(tidyr)
library(purrr)
d <- data.frame(
cond = c("low", "medium", "high"),
group = c("gr1", "gr2", "gr3"),
rand = c("yes", "no"),
trial1 = rnorm(30),
trial2 = rnorm(30)
)
x <- d %>%
group_by(cond, rand, group) %>%
nest() %>%
mutate(
cor_test = map(data, function(i) cor.test(i$trial1, i$trial2)),
correlation = map_dbl(cor_test, ~ .x$estimate),
p.value = map_dbl(cor_test, ~ .x$p.value)
)
x
#> # A tibble: 6 x 7
#> cond rand group data cor_test correlation p.value
#> <fct> <fct> <fct> <list> <list> <dbl> <dbl>
#> 1 low yes gr1 <tibble [5 x 2]> <htest> -0.0329 0.958
#> 2 medium no gr2 <tibble [5 x 2]> <htest> 0.489 0.403
#> 3 high yes gr3 <tibble [5 x 2]> <htest> -0.413 0.490
#> 4 low no gr1 <tibble [5 x 2]> <htest> -0.240 0.697
#> 5 medium yes gr2 <tibble [5 x 2]> <htest> -0.144 0.817
#> 6 high no gr3 <tibble [5 x 2]> <htest> 0.0361 0.954
Created on 2019-08-23 by the reprex package (v0.3.0)
First, group the data by all combinations of your factor levels.
Then "nest" the data, i.e. for each group from step 1, create a subset of your data frame and save it in a list-column called data (the default name).
Next, create a new list-column, cor_test, which stores the result of the cor.test() call on trial1 and trial2 for each subset.
Finally, create new variables, correlation and p.value, that extract the r (estimate) and p (p.value) elements from each object stored in cor_test.
This is a very flexible approach: you just need to specify the names of the variables to correlate (trial1 and trial2).
I don't really understand what you are trying to do, but here is how you would estimate a correlation matrix with p-values for each possible combination of the three first variables
by(df[,c("trial1","trial2")],list(df$cond,df$group,df$rand),function(x){
return(list(cor(x),cor.test(x[,1],x[,2])$p.value))
})
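If a flat table of r and p-values is preferred over the nested by() output, a base R sketch along the same lines is to split the data by the three factors and bind one row per subset. This assumes the data layout from the question; the names pieces and cor_table and the seed are only illustrative.

```r
set.seed(123)  # arbitrary seed for reproducibility
d <- data.frame(
  cond   = c("low", "medium", "high"),
  group  = c("gr1", "gr2", "gr3"),
  rand   = c("yes", "no"),
  trial1 = rnorm(30),
  trial2 = rnorm(30)
)

# One data frame per observed combination of the three factors;
# drop = TRUE discards combinations that never occur in the data
pieces <- split(d, list(d$cond, d$group, d$rand), drop = TRUE)

cor_table <- do.call(rbind, lapply(pieces, function(p) {
  ct <- cor.test(p$trial1, p$trial2)
  data.frame(cond = p$cond[1], group = p$group[1], rand = p$rand[1],
             r = unname(ct$estimate), p.value = ct$p.value)
}))
rownames(cor_table) <- NULL
cor_table
```

cor.test() needs at least three complete pairs per subset, so very small groups would have to be filtered out first.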

Multiple tests with pairwise combinations using dplyr/tidyverse

My question is related to this one but a more complex example, in which I would like to statistically compare multiple columns in all combinations, and each of the columns has a different number of samples.
Consider the original data:
# A tibble: 51 x 3
trial person score
<chr> <chr> <dbl>
1 foo a 0.266
2 bar b 0.372
3 foo c 0.573
4 bar a 0.908
5 foo b 0.202
6 bar c 0.898
7 foo a 0.945
8 bar b 0.661
9 foo c 0.629
10 foo b 0.206
For each trial type, I'd like to run a statistical test comparing the scores of each person. So, I need the following test results:
Trial foo, compare all score samples of persons A–B, B–C, C–A
Trial bar, compare all score samples of persons A–B, B–C, C–A
Of course, there are more than two trials, and more than three persons.
Hence, the solution using group_split given in the other question does not work, as it implies always testing against the first person (in my case), not all pairwise combinations.
So, in the following code, I'm stuck at two points:
library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#> method from
#> [.quosures rlang
#> c.quosures rlang
#> print.quosures rlang
library(broom)
set.seed(1)
df = tibble::tibble(
trial = rep(c("foo", "bar"), 30),
person = rep(c("a", "b", "c"), 20),
score = runif(60)
) %>%
filter(score > 0.2)
df %>%
group_by(person, trial) %>%
summarize(scores = list(score)) %>%
spread(person, scores) %>%
group_split(trial) %>%
map_df(function(data) {
data %>%
summarize_at(vars(b:c), function(x) {
wilcox.test(.$a, x, paired = FALSE) %>% broom::tidy
})
})
#> Error in wilcox.test.default(.$a, x, paired = FALSE): 'x' must be numeric
Created on 2019-05-29 by the reprex package (v0.3.0)
The value of x is apparently not just the actual list of scores, but the column vector of scores for a single trial. But I don't know how else to deal with the fact that the number of samples in each person is different.
Also, I still have to manually specify the column names, which would already be a combinatorial nightmare if there were more than, say, four persons.
I can somehow get the combinations as such:
df %>%
group_split(trial) %>%
map_df(function(data) {
combinations = expand(tibble(x = unique(data$person), y = unique(data$person)), x, y) %>% filter(x != y)
})
… but that doesn't really help in creating columns for comparison.
What could I do to make this work?
This will allow you to programmatically specify combinations and get around the error you were hitting in wilcox.test().
combos <- unique(df$person) %>%
combn(2, simplify = FALSE) %>%
set_names(map_chr(., ~ paste(., collapse = "_")))
df %>%
group_split(trial) %>%
set_names(map_chr(., ~ unique(.$trial))) %>%
map_df(function(x) {
map_df(combos, function(y) {
filter(x, person %in% y) %>%
wilcox.test(score ~ person, data = .) %>%
broom::tidy()
}, .id = "contrast")
}, .id = "trial")
# A tibble: 6 x 6
trial contrast statistic p.value method alternative
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 bar a_b 34 0.878 Wilcoxon rank sum test two.sided
2 bar a_c 32 1 Wilcoxon rank sum test two.sided
3 bar b_c 31 0.959 Wilcoxon rank sum test two.sided
4 foo a_b 41 1 Wilcoxon rank sum test two.sided
5 foo a_c 41 1 Wilcoxon rank sum test two.sided
6 foo b_c 43 0.863 Wilcoxon rank sum test two.sided
Since this differs a lot from the pattern you started with, I'm not sure it will work for your real world case, but it works here so I wanted to share.
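For comparison, the same idea (build every pair with combn(), then test each pair within each trial) can be sketched in base R. This assumes the example data from the question; the names pairs and res are only illustrative, and no output is shown because it depends on the random draw.

```r
set.seed(1)
df <- data.frame(trial  = rep(c("foo", "bar"), 30),
                 person = rep(c("a", "b", "c"), 20),
                 score  = runif(60))
df <- df[df$score > 0.2, ]   # groups end up with unequal sizes

# All unordered pairs of persons, e.g. c("a", "b"), c("a", "c"), c("b", "c")
pairs <- combn(sort(unique(df$person)), 2, simplify = FALSE)

res <- do.call(rbind, lapply(split(df, df$trial), function(tr) {
  do.call(rbind, lapply(pairs, function(pr) {
    x <- tr$score[tr$person == pr[1]]
    y <- tr$score[tr$person == pr[2]]
    data.frame(trial    = tr$trial[1],
               contrast = paste(pr, collapse = "_"),
               p.value  = wilcox.test(x, y)$p.value)
  }))
}))
rownames(res) <- NULL
res
```

Because each pair is pulled out as two plain vectors, it does not matter that the persons have different numbers of scores.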
Here is an alternative solution that uses nesting to handle groups (persons) with different numbers of measurements.
library("broom")
library("tidyverse")
set.seed(1)
df <-
tibble(
trial = rep(c("foo", "bar"), 30),
person = rep(c("a", "b", "c"), 20),
score = runif(60)
) %>%
filter(score > 0.2)
comparisons <- df %>%
expand(
trial,
group1 = person,
group2 = person
) %>%
filter(
group1 < group2
)
comparisons
#> # A tibble: 6 × 3
#> trial group1 group2
#> <chr> <chr> <chr>
#> 1 bar a b
#> 2 bar a c
#> 3 bar b c
#> 4 foo a b
#> 5 foo a c
#> 6 foo b c
df <- df %>% nest_by(trial, person)
df
#> # A tibble: 6 × 3
#> # Rowwise: trial, person
#> trial person data
#> <chr> <chr> <list<tibble[,1]>>
#> 1 bar a [8 × 1]
#> 2 bar b [8 × 1]
#> 3 bar c [8 × 1]
#> 4 foo a [9 × 1]
#> 5 foo b [9 × 1]
#> 6 foo c [9 × 1]
comparisons %>%
inner_join(
df, by = c("trial", "group1" = "person")
) %>%
inner_join(
df, by = c("trial", "group2" = "person")
) %>%
mutate(
p.value = map2_dbl(
data.x, data.y, ~ wilcox.test(.x$score, .y$score)$p.value
)
)
#> # A tibble: 6 × 6
#> trial group1 group2 data.x data.y p.value
#> <chr> <chr> <chr> <list<tibble[,1]>> <list<tibble[,1]>> <dbl>
#> 1 bar a b [8 × 1] [8 × 1] 0.878
#> 2 bar a c [8 × 1] [8 × 1] 1
#> 3 bar b c [8 × 1] [8 × 1] 0.959
#> 4 foo a b [9 × 1] [9 × 1] 1
#> 5 foo a c [9 × 1] [9 × 1] 1
#> 6 foo b c [9 × 1] [9 × 1] 0.863
Created on 2022-03-17 by the reprex package (v2.0.1)

Obtaining slopes, intercepts, and coefficients of determination from several linear models, all from the same dataframe [duplicate]

I have the following dataframe:
Index <- seq.int(1:10)
A <- c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3)
B <- c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13)
C <- c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
df <- data.frame(Index, A, B, C)
> df
Index A B C
[1,] 1 5 10 7
[2,] 2 5 11 6
[3,] 3 3 12 7
[4,] 4 4 12 7
[5,] 5 3 12 6
[6,] 6 3 11 5
[7,] 7 2 13 6
[8,] 8 2 13 5
[9,] 9 4 14 5
[10,] 10 3 13 4
I would like to generate linear models (and ultimately obtain slopes, intercepts, and coefficients of determination in an easy-to-work-with dataframe form) with the Index column as the independent variable and with each of the other columns as the response variable, separately. I know I can do this by running the following line of code:
summary(lm(cbind(A, B, C) ~ Index, data = df))
One issue I have with the above line of code is that it uses the cbind function, and thus, I have to input each column separately. I am working with a large dataframe with many columns, and instead of using the cbind function, I'd love to be able to tell the function to use a bunch of columns (i.e., response variables) at once by writing something like df[, 2:ncol(df)] in place of cbind(A, B, C).
Another issue I have with the above line of code is that the output is not really in a user-friendly form. Ultimately, I would like the output (slopes, intercepts, and coefficients of determination) to be in an easy-to-work-with dataframe form:
response <- c("A", "B", "C")
slope <- c(-0.21818, 0.33333, -0.29091)
intercept <- c(4.60000, 10.26667, 7.40000)
r.squared <- c(0.3776, 0.7106, 0.7273)
summary_df <- data.frame(response, slope, intercept, r.squared)
> summary_df
response slope intercept r.squared
1 A -0.21818 4.60000 0.3776
2 B 0.33333 10.26667 0.7106
3 C -0.29091 7.40000 0.7273
What is the most efficient way to do this? There must be a solution using the lapply function that I'm just not getting. Thanks so much!
I would convert the data frame to a tibble. This allows you to use list columns as described in this presentation, to store and manipulate your models.
Let's call the data frame df1, not df. Convert to a tibble, then use tidyr::gather() and tidyr::nest to reshape it:
library(tidyverse)
library(broom)
df1 %>%
as_tibble() %>%
gather(Var, Val, -Index) %>%
nest(data = -Var)
The result is a tibble with a row for each of A, B, C and a column data which stores the Index column and the corresponding value, Val, for each of A, B, C.
# A tibble: 3 x 2
Var data
<chr> <list>
1 A <tibble [10 x 2]>
2 B <tibble [10 x 2]>
3 C <tibble [10 x 2]>
Now we can use dplyr::mutate() and purrr::map to create a column containing the models for each of A, B and C.
df1 %>%
as_tibble() %>%
gather(Var, Val, -Index) %>%
nest(data = -Var) %>%
mutate(model = map(data, ~lm(Index ~ Val, .)))
# A tibble: 3 x 3
Var data model
<chr> <list> <list>
1 A <tibble [10 x 2]> <S3: lm>
2 B <tibble [10 x 2]> <S3: lm>
3 C <tibble [10 x 2]> <S3: lm>
Finally we can use broom::glance() or broom::tidy() to extract the required values from the models, then tidyr::unnest() to get back to a regular tibble.
Using glance:
df1 %>%
as_tibble() %>%
gather(Var, Val, -Index) %>%
nest(data = -Var) %>%
mutate(model = map(data, ~lm(Index ~ Val, .)),
summary = map(model, glance)) %>%
unnest(summary) %>%
select(-data, -model)
# A tibble: 3 x 12
Var r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.378 0.300 2.53 4.85 0.0587 2 -22.4 50.7 51.6 51.3 8
2 B 0.711 0.674 1.73 19.6 0.00219 2 -18.5 43.1 44.0 23.9 8
3 C 0.727 0.693 1.68 21.3 0.00171 2 -18.2 42.5 43.4 22.5 8
Using tidy:
df1 %>%
as_tibble() %>%
gather(Var, Val, -Index) %>%
nest(data = -Var) %>%
mutate(model = map(data, ~lm(Index ~ Val, .)),
summary = map(model, tidy)) %>%
unnest(summary)
# A tibble: 6 x 6
Var term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A (Intercept) 11.4 2.79 4.08 0.00352
2 A Val -1.73 0.786 -2.20 0.0587
3 B (Intercept) -20.3 5.85 -3.47 0.00842
4 B Val 2.13 0.481 4.43 0.00219
5 C (Intercept) 20 3.18 6.28 0.000237
6 C Val -2.5 0.541 -4.62 0.00171
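Note that these pipelines fit Index ~ Val, so the tidy() estimates are the slopes and intercepts of Index regressed on each variable, not the ones requested in the question (R-squared is the same in either direction). A minimal sketch with the formula reversed, assuming the question's data and using pivot_longer() in place of gather(); the name coefs is only illustrative:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

df1 <- data.frame(Index = 1:10,
                  A = c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3),
                  B = c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13),
                  C = c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4))

coefs <- df1 %>%
  pivot_longer(-Index, names_to = "Var", values_to = "Val") %>%
  nest(data = -Var) %>%
  # each variable is the response, Index is the predictor
  mutate(model = map(data, ~ lm(Val ~ Index, data = .x)),
         summary = map(model, tidy)) %>%
  unnest(summary) %>%
  select(Var, term, estimate)

coefs
```

With this direction the Index rows should reproduce the slopes listed in the question (for example about -0.218 for A).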
To address the first part of your query, you can pass matrix objects to lm formula sides:
summary(lm(as.matrix(df[-1]) ~ as.matrix(df[1])))
Checks out in terms of the reported coefficients:
all.equal(
coef(lm(as.matrix(df[-1]) ~ as.matrix(df[1]))),
coef(lm(cbind(A,B,C) ~ Index, data=df)),
check.attributes=FALSE
)
#[1] TRUE
Note the warning from 李哲源 that combining this like matrix(...) ~ . will not work as intended. It may generally be safer to specify both sides as expressions, or both sides as a matrix only.
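To get from the matrix-response fit to the data frame layout requested in the question, the coefficients and per-response R-squared values can be collected directly. A sketch, assuming the question's data; summary_df mirrors the name used in the question:

```r
df <- data.frame(Index = 1:10,
                 A = c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3),
                 B = c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13),
                 C = c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4))

# All columns except Index form the response matrix
fit <- lm(as.matrix(df[-1]) ~ Index, data = df)

# coef(fit) is a 2 x 3 matrix: rows are terms, columns are responses
summary_df <- data.frame(
  response  = colnames(coef(fit)),
  slope     = coef(fit)["Index", ],
  intercept = coef(fit)["(Intercept)", ],
  r.squared = sapply(summary(fit), function(s) s$r.squared),
  row.names = NULL
)
summary_df
```

This avoids typing cbind(A, B, C) by hand, since df[-1] selects every column other than Index.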

R- iterating over variables names using a loop or function

I want to loop over variables within a data frame either using a for loop or function in R. I have coded the following (which doesn't work):
y <- c(0,0,1,1,0,1,0,1,1,1)
var1 <- c("a","a","a","b","b","b","c","c","c","c")
var2 <- c("m","m","n","n","n","n","o","o","o","m")
mydata <- data.frame(y,var1,var2)
myfunction <- function(v){
regressionresult <- lm(y ~ v, data = mydata)
summary(regressionresult)
}
myfunction("var1")
When I try running this, I get the error message:
Error in model.frame.default(formula = y ~ v, data = mydata, drop.unused.levels = TRUE) :
variable lengths differ (found for 'v')
I don't think this is a problem with the data, but with how I refer to the variable name because the following code produces the desired regression results (for one variable that I wanted to loop over):
regressionresult <- lm(y ~ var1, data = mydata)
summary(regressionresult)
How can I fix the function, or put the variable names in the loop?
I also tried to loop over the variable names, but had a similar problem as with the function:
for(v in c("var1","var2")){
regressionresult <- lm(y ~ v, data = mydata)
summary(regressionresult)
}
When running this loop, it produces the error:
Error in model.frame.default(formula = y ~ v, data = mydata, drop.unused.levels = TRUE) :
variable lengths differ (found for 'v')
Thanks for your help!
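In lm(y ~ v, ...) the v is taken literally as a variable name, so the string it holds is never substituted into the formula. We can use paste0 to build the formula as a string and pass it to lm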
myfunction <- function(v){
regressionresult <- lm(paste0('y ~', v), data = mydata)
summary(regressionresult)
}
out1 <- myfunction("var1")
Or use glue::glue
myfunction <- function(v){
regressionresult <- lm(glue::glue('y ~ {v}'), data = mydata)
summary(regressionresult)
}
myfunction("var1")
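Another base R option is reformulate(), which builds the formula object from character strings, combined with lapply() in place of the for loop. A sketch, assuming the question's data; the names fit_one and fits are only illustrative:

```r
y <- c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1)
var1 <- c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c")
var2 <- c("m", "m", "n", "n", "n", "n", "o", "o", "o", "m")
mydata <- data.frame(y, var1, var2)

fit_one <- function(v) {
  # reformulate("var1", response = "y") returns the formula y ~ var1
  lm(reformulate(v, response = "y"), data = mydata)
}

# Named input vector so the resulting list is named by variable
fits <- lapply(c(var1 = "var1", var2 = "var2"), fit_one)

# Inside a loop, results are not auto-printed, so print() is needed
for (v in names(fits)) {
  print(summary(fits[[v]]))
}
```

The explicit print() matters for the original for loop as well: even with a working formula, summary() on its own inside a loop displays nothing.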
You can use tidyverse functions to reshape the data into tidy format and fit one model per variable.
y <- c(0,0,1,1,0,1,0,1,1,1)
var1 <- c("a","a","a","b","b","b","c","c","c","c")
var2 <- c("m","m","n","n","n","n","o","o","o","m")
library(tidyverse)
mydata <- tibble(y,var1,var2)
res <- mydata %>%
# get data in long format - tidy format
gather("var_type", "value", -y) %>%
# we want one model per var_type
nest(data = -var_type) %>%
# apply lm on each data
mutate(
regressionresult = map(data, ~lm(y ~ value, data = .x))
)
res
#> # A tibble: 2 x 3
#> var_type data regressionresult
#> <chr> <list> <list>
#> 1 var1 <tibble [10 x 2]> <S3: lm>
#> 2 var2 <tibble [10 x 2]> <S3: lm>
summary(res$regressionresult[[1]])
#>
#> Call:
#> lm(formula = y ~ value, data = .x)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.7500 -0.3333 0.2500 0.3125 0.6667
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.3333 0.3150 1.058 0.325
#> valueb 0.3333 0.4454 0.748 0.479
#> valuec 0.4167 0.4167 1.000 0.351
#>
#> Residual standard error: 0.5455 on 7 degrees of freedom
#> Multiple R-squared: 0.1319, Adjusted R-squared: -0.1161
#> F-statistic: 0.532 on 2 and 7 DF, p-value: 0.6094
The broom package can then help you work with the results
library(broom)
#> Warning: package 'broom' was built under R version 3.4.4
res <- res %>%
mutate(tidy_summary = map(regressionresult, broom::tidy))
res
#> # A tibble: 2 x 4
#> var_type data regressionresult tidy_summary
#> <chr> <list> <list> <list>
#> 1 var1 <tibble [10 x 2]> <S3: lm> <data.frame [3 x 5]>
#> 2 var2 <tibble [10 x 2]> <S3: lm> <data.frame [3 x 5]>
You can extract one of the summaries
res$tidy_summary[[1]]
#> term estimate std.error statistic p.value
#> 1 (Intercept) 0.3333333 0.3149704 1.0583005 0.3250657
#> 2 valueb 0.3333333 0.4454354 0.7483315 0.4786436
#> 3 valuec 0.4166667 0.4166667 1.0000000 0.3506167
or unnest to get a data.frame to work with.
res %>%
unnest(tidy_summary)
#> # A tibble: 6 x 6
#> var_type term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 var1 (Intercept) 0.333 0.315 1.06 0.325
#> 2 var1 valueb 0.333 0.445 0.748 0.479
#> 3 var1 valuec 0.417 0.417 1.000 0.351
#> 4 var2 (Intercept) 0.333 0.315 1.06 0.325
#> 5 var2 valuen 0.417 0.417 1 0.351
#> 6 var2 valueo 0.333 0.445 0.748 0.479
The functions of interest are nest and unnest from tidyr (http://tidyr.tidyverse.org/), which make it easy to create list columns; map from purrr, which iterates over a list and applies a function (here lm); and tidy from the broom package, which tidies model results (summary results, predictions, ...).
Not used here, but note that the modelr package helps with building pipelines when modeling.
