Working with lists of models using the pipe syntax - R

I often like to fit and examine multiple models that relate two variables in an R dataframe.
I can do that using syntax like this:
require(tidyverse)
require(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
map_df(models, ~tidy(lm(data=mtcars, formula=.x)))
But I'm used to the pipe syntax and was hoping to be able to do something like this:
mtcars %>% map_df(models, ~tidy(lm(data=., formula=.x)))
That makes it clear that I'm "starting" with mtcars and then doing stuff to it to generate my output. But that syntax doesn't work; it fails with the error "Error: Index 1 must have length 1".
Is there a way to write my purrr::map() call so that I can pipe mtcars into it and get the same output as the working code above? I.e.
mtcars %>% <<<something>>>

tl;dr: mtcars %>% {map_df(models, function(.x) tidy(lm(data=., formula=.x)))}
Or mtcars %>% map_df(models, ~tidy(lm(..1,..2)), ..2 = .)
There are two problems with the solution you tried.
The first is that you need curly braces if the dot appears only inside a nested call. Without braces, the pipe still inserts the left-hand side as the first argument.
library(magrittr)
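1 %>% divide_by(2) # 0.5: the left-hand side is inserted as the first argument
1 %>% divide_by(2, .) # 2: an explicit top-level dot suppresses that insertion
1 %>% divide_by(2, mean(., 3)) # fails: the dot is only nested, so 1 is still inserted first
1 %>% divide_by(., 2, mean(., 3)) # the line above is equivalent to this call
1 %>% {divide_by(2, mean(., 3))} # 2: braces stop the automatic insertion, so only explicit dots are used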
The second is that you can't use the dot inside a ~ lambda the way you intended. Try map(c(1,2), ~ 3+.) and map(c(1,2), ~ 3+.x) (or even map(c(1,2), ~ 3+..1)) and you'll see you get the same result. Inside a ~ lambda, the dot refers to the lambda's own first argument, so it is no longer linked to the pipe.
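A quick way to check this, as a minimal sketch: the value 100 below is an arbitrary stand-in for the piped object and is not taken from the question.

```r
library(purrr)
library(magrittr)

# Inside a ~ lambda, `.`, `.x` and `..1` all name the lambda's first argument.
with_dot  <- map(c(1, 2), ~ 3 + .)
with_x    <- map(c(1, 2), ~ 3 + .x)
with_dot1 <- map(c(1, 2), ~ 3 + ..1)
identical(with_dot, with_x) && identical(with_x, with_dot1)

# Pipe in 100 (arbitrary value chosen for illustration).
# The ~ lambda still treats `.` as the current element: 3 + 1 and 3 + 2.
from_lambda <- 100 %>% { map(c(1, 2), ~ 3 + .) }

# An ordinary function has no argument called `.`, so `.` is the piped value:
# 1 + 100 and 2 + 100.
from_function <- 100 %>% { map(c(1, 2), function(x) x + .) }
```

The last two results differ only because of how the anonymous function is written, which is the reason the working solution uses function(.x) rather than ~.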
To make sure the dot is interpreted as mtcars, you need the good old function(.x) ... definition.
This works:
mtcars %>% {map_df(models, function(.x) tidy(lm(data=., formula=.x)))}
Finally, as a bonus, here's what I came up with while trying to find a solution without curly braces:
mtcars %>% map(models,lm,.) %>% map_df(tidy)
mtcars %>% map_df(models, ~tidy(lm(..1,..2)), ..2 = .)

This should work and does not involve the complexity of functions or {}. A purely purrr solution.
library(tidyverse)
library(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
mtcars %>%
  list %>% # make it a list
  cross2(models) %>% # get combinations (cross2() is deprecated as of purrr 1.0.0)
  transpose %>% # put into a nice format
  set_names("data", "formula") %>% # set names to lm arg names
  pmap(lm) %>% # fit models
  map_df(tidy) # tidy it up

This is a little at odds with how purrr::map works. You are mapping over the list of models (one item at a time), not over the dataframe (which would be one column at a time). Because the dataframe stays constant across all the model formulas, I don't think mapping over it will work in this situation.
However, you could get the syntax you want by defining a custom function based on the one you have above.
library(tidyverse)
library(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
models_to_rows <- function(data, models_list) {
  models_list %>%
    map_df(~tidy(lm(data=data, formula=.x)))
}
mtcars %>%
  models_to_rows(models)
#>          term     estimate    std.error statistic      p.value
#> 1 (Intercept)  89.60052274  9.702303069  9.234975 2.823542e-10
#> 2    exp(cyl)   0.04045315  0.004897717  8.259594 3.212750e-09
#> 3 (Intercept) -51.05436157 24.981944312 -2.043650 4.985522e-02
#> 4         cyl  31.95828066  3.883803355  8.228604 3.477861e-09

Related

Using dplyr rowwise to create multiple linear models

Considering this post:
https://www.tidyverse.org/blog/2020/06/dplyr-1-0-0/
I was trying to create multiple models for a data set using multiple formulas. The example there says:
library(dplyr, warn.conflicts = FALSE)
models <- tibble::tribble(
  ~model_name, ~formula,
  "length-width", Sepal.Length ~ Petal.Width + Petal.Length,
  "interaction", Sepal.Length ~ Petal.Width * Petal.Length
)
iris %>%
  nest_by(Species) %>%
  left_join(models, by = character()) %>%
  rowwise(Species, model_name) %>%
  mutate(model = list(lm(formula, data = data))) %>%
  summarise(broom::glance(model))
You can see that the rowwise function is used to get the answer, but when I don't use it, I still get the correct answer:
iris %>%
  nest_by(Species) %>%
  left_join(models, by = character()) %>%
  mutate(model = list(lm(formula, data = data))) %>%
  summarise(broom::tidy(model))
I only lose the "model_name" column. But since the rowwise documentation says this function is for computing row by row, I don't get why the result is still computed that way without it. Why does this happen?
Thanks in advance.
Considering
https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html
You can optionally supply “identifier” variables in your call to rowwise(). These variables are preserved when you call summarise(), so they behave somewhat similarly to the grouping variables passed to group_by():
I didn't understand how identifiers work. As far as I can tell, these "identifiers" (Species, model_name) don't affect how a value is computed, only how the tibble is presented.
So if you have a rowwise tibble created by nest_by, you don't need the rowwise() function to compute by row. In my example, rowwise() only gives you an extra column of information, and the linear model is still the same. It is just a more "elegant" way and doesn't change how the result is computed.
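This can be checked directly. The following is a minimal sketch using iris as in the question; the names nested, fits and n_rows are made up for illustration.

```r
library(dplyr)

nested <- iris %>% nest_by(Species)

# nest_by() already returns a rowwise data frame, one row per Species
class(nested)
inherits(nested, "rowwise_df")

# so mutate() is evaluated once per row without an extra rowwise() call;
# here `data` is the nested data frame belonging to the current row
fits <- nested %>% mutate(n_rows = nrow(data))
fits$n_rows
```

Each species contributes its own row count, which shows that the expression ran per row rather than once on the whole list-column.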
Thanks to tmfmnk

How do I get mean functions to work when I use piping?

This is probably a simple question, but I'm having trouble getting the mean function to work using dplyr.
Using the mtcars dataset as an example, if I type:
data(mtcars)
mtcars %>%
  select(mpg) %>%
  mean()
I get this warning:
Warning message:
In mean.default(.) : argument is not numeric or logical: returning NA
For some reason, though, if I repeat the same code but ask for summary(), range(), or several other statistical functions instead, they work fine:
data(mtcars)
mtcars %>%
  select(mpg) %>%
  summary()
Similarly, if I run the mean function in base R notation, that works fine too:
mean(mtcars$mpg)
Can anyone point out what I've done wrong?
Use pull to pull out the vector.
mtcars %>%
  pull(mpg) %>%
  mean()
# [1] 20.09062
Or use pluck from the purrr package.
mtcars %>%
  purrr::pluck("mpg") %>%
  mean()
# [1] 20.09062
Or summarize first and then pull out the mean.
mtcars %>%
  summarize(mean = mean(mpg)) %>%
  pull(mean)
# [1] 20.09062
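The reason the original attempt returns NA is that select() keeps the result as a one-column data frame, while mean() expects a numeric vector. A minimal sketch of the difference follows; the names as_df and as_vec are made up for illustration.

```r
library(dplyr)

as_df  <- mtcars %>% select(mpg)  # still a data frame with one column
as_vec <- mtcars %>% pull(mpg)    # a plain numeric vector

is.data.frame(as_df)
is.numeric(as_vec)

# mean() on the vector matches the base R result
all.equal(mean(as_vec), mean(mtcars$mpg))
```

summary() works in the original pipeline because it has a method for data frames, whereas mean() falls through to mean.default().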
In dplyr, you can use summarise() whenever you're not changing your original dataframe (reordering it, filtering it, adding to it, etc), but instead are creating a new dataframe that has summary statistics for the first dataframe.
mtcars %>%
  summarise(mean_mpg = mean(mpg))
gives the output:
  mean_mpg
1 20.09062
PS. If you're learning dplyr, learning these five verbs will take you a long way: select(), filter(), group_by(), summarise(), arrange().

Using dplyr to compare models

I was working through the examples in the dplyr documentation of the do() function, and all was well until I came across this snippet to summarize model comparisons:
# compare %>% summarise(p.value = aov$`Pr(>F)`)
The error was "Error: expecting a single value". So I found a way forward by accessing the list of aov elements directly. This question is about subsetting operators, and about whether there is a better way to do this. Here is my full attempt and solution.
models <- group_by(mtcars, cyl) %>%
  do(mod_lin = lm(mpg ~ disp, data = .),
     mod_quad = lm(mpg ~ poly(disp, 2), data = .))
compare <- models %>% do(aov = anova(.$mod_lin, .$mod_quad))
compare %>% summarise(p.value = aov$'Pr(>F)')
# Error: expecting a single value
Looking into the structure of compare:
# select comparison 1
compare$aov[[1]]
# select comparison 1 and all of element 6 (the p-values)
compare$aov[[1]][6]
# just the p-value
compare$aov[[1]][2,6]
compare %>% summarise(pvalue = aov[2,6]) # this gets the p-values by group
So I suppose I'm wondering how, with an object of classes ('rowwise_df', 'tbl_df' and 'data.frame'), summarise can intuit the [[]] operator. I'd also like to know whether there is a better way to do this.
You could try
compare %>% do(.$aov['Pr(>F)']) %>% na.omit()
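For comparison, here is a minimal base R sketch of the same per-group comparison, assuming the mtcars example above; the names by_cyl and p_values are made up for illustration. It shows that each anova() result is a two-row table whose first "Pr(>F)" entry is NA, which is presumably why a single cell such as [2, 6] has to be picked out before summarise() accepts it.

```r
# One data frame per value of cyl
by_cyl <- split(mtcars, mtcars$cyl)

p_values <- sapply(by_cyl, function(d) {
  mod_lin  <- lm(mpg ~ disp, data = d)
  mod_quad <- lm(mpg ~ poly(disp, 2), data = d)
  comparison <- anova(mod_lin, mod_quad)
  # Row 2 holds the test of the quadratic model against the linear one;
  # selecting the column by name avoids relying on its position (6)
  comparison[2, "Pr(>F)"]
})
p_values
```

The result is a named numeric vector with one p-value per cyl group.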

dplyr summarize with a function of a dataframe

I'm having some trouble carrying out a routine using the dplyr package. In short, I have a function which takes a dataframe as an input, and returns a single (numeric) value; I'd like to be able to apply this function to several subsets of a dataframe. It feels like I should be able to use group_by() to specify the subsets of the dataframe, then pipe along to the summarize() function, but I'm not sure how to pass the (subsetted) dataframe along to the function I'd like to apply.
As a simplified example, let's say I'm using the iris dataset, and I've got a fairly simple function which I'd like to apply to several subsets of the data:
data(iris)
lm.func = function(.data){
  lm.fit = lm(Petal.Width ~ Petal.Length, data = .data)
  out = summary(lm.fit)$coefficients[2,1]
  return(out)
}
Now, I'd like to be able to apply this function to subsets of iris based on some other variable, like Species. I'm able to manually filter the data, then pipe along to my function, for example:
iris %>% filter(Species == "setosa") %>% lm.func(.)
But I'd like to be able to apply lm.func to each subset of the data, based on Species. My first thought was to try something like the following:
iris %>% group_by(Species) %>% summarize(coef.val = lm.func(.))
Even though I know this doesn't work, my idea is to try to pass each subset of iris to the lm.func function.
To clarify, I'd like to end up with a dataframe with two columns -- a first with each level of the grouping variable, and a second with the output of lm.func when the data are restricted to a subset specified by the grouping variable.
Is it possible to use summarize() in this way?
You can try with do
iris %>%
  group_by(Species) %>%
  do(data.frame(coef.val=lm.func(.)))
#      Species  coef.val
#1      setosa 0.2012451
#2  versicolor 0.3310536
#3   virginica 0.1602970
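Another option, sketched here on the assumption that your dplyr version provides group_modify(), is to hand each group's rows to lm.func as the lambda argument .x. The name result is made up for illustration, and lm.func is repeated from the question so the block runs on its own.

```r
library(dplyr)

# lm.func as defined in the question: slope of Petal.Width on Petal.Length
lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]
}

result <- iris %>%
  group_by(Species) %>%
  # .x is the data frame of rows for the current group
  group_modify(~ data.frame(coef.val = lm.func(.x)))
result
```

This returns one row per Species with the grouping column kept, which is the two-column layout the question asks for.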
There is an easy way to do this without creating a function.
library(broom)
models <- iris %>%
  group_by(Species) %>%
  do(
    mod = lm(Petal.Width ~ Petal.Length, data = .)
  )
models %>% do(tidy(.$mod))
          term    estimate  std.error  statistic      p.value
1  (Intercept) -0.04822033 0.12164115 -0.3964146 6.935561e-01
2 Petal.Length  0.20124509 0.08263253  2.4354220 1.863892e-02
3  (Intercept) -0.08428835 0.16070140 -0.5245029 6.023428e-01
4 Petal.Length  0.33105360 0.03750041  8.8279995 1.271916e-11
5  (Intercept)  1.13603130 0.37936622  2.9945505 4.336312e-03
6 Petal.Length  0.16029696 0.06800119  2.3572668 2.253577e-02

Combining dplyr::do() with dplyr::mutate?

I would like to achieve the following: for each subgroup of a dataset, I would like to carry out a regression, and the residuals of that regression should be saved as a new variable in the original dataframe. For instance,
group_by(mtcars, gear) %>% mutate(res = residuals(lm(mpg~carb, .)))
indicates what I think should work, but does not (anyone care to explain why it does not work?). One way to get the residuals is to do the following:
group_by(mtcars, gear) %>% do(res = residuals(lm(mpg~carb, .)))
which gives me a dataframe in which dbl-objects are saved, i.e. those contain the residuals for each group. However, it seems they do not contain the original rownames that would help me to merge them back to the original data.
So, my question is: how can I achieve what I want to do in a dplyr-kind of way?
Obviously, it can be achieved in other ways. To give you an example, the following works just fine:
dat <- mtcars
dat$res <- NA
for(i in unique(mtcars$gear)){
  dat[dat$gear==i, "res"] <- residuals(lm(mpg ~ disp, data=dat[dat$gear==i,]))
}
However, my understanding is that dplyr is made for this purpose, so there should be a dplyr-style way?
Any hints / tips / comments are appreciated.
Remark: this question is very similar to lm() called within mutate() except that in that question, only one parameter per group is retained, which makes a merge-approach easy. I have an entire vector with no rownames, so that I would have to rely on the ordering of the vector to do that, and that seems troublesome to me.
library(lazyeval)
eq <- "y ~ x"
dat <- mtcars
dat %>%
  group_by(gear) %>%
  mutate(res=residuals(lm(interp(eq, y = mpg, x = disp))))
Or, without lazyeval:
dat %>%
  group_by(gear) %>%
  mutate(res=residuals(lm(deparse(substitute(mpg~disp)))))
# This gives you the residuals. You can then combine them with the original data.
mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data=.)) %>%
  do((function(reg_mod) {
    data.frame(reg_res = residuals(reg_mod$model))
  })(.))
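As for why the mutate() attempt in the question fails: a likely reason is that the `.` there is the whole piped data frame rather than the current group, so lm() returns 32 residuals for every group. One way around this, sketched here on the assumption that your dplyr version provides group_modify(), is to fit the model on each group's rows (.x) and attach the residuals in the same step. The name with_res is made up for illustration.

```r
library(dplyr)

with_res <- mtcars %>%
  group_by(gear) %>%
  # .x holds the rows of the current gear group; the residuals are added
  # as a column right next to the rows they belong to
  group_modify(~ mutate(.x, res = residuals(lm(mpg ~ carb, data = .x)))) %>%
  ungroup()

head(with_res)
```

Because the residuals are attached inside each group, no merge on row order is needed. Note that the result is a tibble sorted by gear, and the original row names of mtcars are not kept.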

Resources