Problem with pipe within purrr::map2 and mutate - r

nested_numeric <- model_table %>%
group_by(ano_fiscal) %>%
select(-c("ano_estudo", "payout", "div_ratio","ebitda", "name.company",
"alavancagem","div_pl", "div_liq", "div_total")) %>%
nest()
nested_numeric
# A tibble: 7 x 2
# Groups: ano_fiscal [7]
ano_fiscal data
<dbl> <list>
1 2012 <tibble [34 x 10]>
2 2013 <tibble [35 x 10]>
3 2014 <tibble [35 x 10]>
4 2015 <tibble [35 x 10]>
5 2016 <tibble [35 x 10]>
6 2017 <tibble [35 x 10]>
7 2018 <tibble [35 x 10]>
df_ipca$idx
[1] 0.9652515 0.9741318 0.9817300 0.9911546 0.9941281 0.9985022 1.0000000
The list-column named "data" consists of numeric variables. I want to multiply them by a deflator index (i.e., adjust them for inflation).
This works fine:
map2_df(nested_numeric$data, df_ipca$idx, ~ .x * .y)
or even
map2(nested_numeric$data, df_ipca$idx, ~ .x * .y)
but I'm trying to create a new list-column named "adjusted_data" with the result of this operation:
nested_numeric <- model_table %>%
group_by(ano_fiscal) %>%
select(-c("ano_estudo", "payout", "div_ratio","ebitda", "name.company",
"alavancagem","div_pl", "div_liq", "div_total")) %>%
nest() %>%
mutate( adjusted_data = data %>% {
map2(., df_ipca$idx, ~ .x * .y)})
Gives me this error:
Error: Column `adjusted_data` must be length 1 (the group size), not 7
I hope my problem is clear enough: I'm trying to adjust a data frame for inflation, with values nested by year.
I thought that using map2 within mutate would be enough. I've tried everything and couldn't figure out what I'm doing wrong.
I've read similar questions about pipes within map2 here, but still...
Please help :)
Thank you!

A simple solution (which however does break up your pipes) is to just do
nested_numeric$adjusted_data <- map2(nested_numeric$data, df_ipca$idx, ~ .x * .y)
For example, using the iris data:
library(tidyverse)
df_ipca <- data.frame(idx = runif(3))
iris <- iris %>%
group_by(Species) %>%
nest()
iris$adjusted_data <- map2(iris$data, df_ipca$idx, ~.x * .y)
iris
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species data adjusted_data
#> <fct> <list> <list>
#> 1 setosa <tibble [50 × 4]> <df[,4] [50 × 4]>
#> 2 versicolor <tibble [50 × 4]> <df[,4] [50 × 4]>
#> 3 virginica <tibble [50 × 4]> <df[,4] [50 × 4]>
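Solution using mutate
If you want to do the map2 inside mutate after you have grouped and nested your data, you first have to ungroup() before calling mutate. After group_by() and nest(), the tibble is still grouped by ano_fiscal, so mutate evaluates the expression once per group. Within each one-row group, data has length 1, map2 recycles it against the 7 values of df_ipca$idx, and the result has length 7 instead of the group size of 1, which is what the error reports. Ungrouping makes map2 loop over the entire data column, which is what you want: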
nested_numeric %>%
ungroup() %>%
mutate(
adjusted_data = map2(data, df_ipca$idx, ~ .x * .y)
)
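To see why ungroup() matters, a quick check on the iris data can help. This is only a sketch: idx is a made-up stand-in for df_ipca$idx, and the comments describe what I would expect with tidyr 1.0.0 or later, where nest() keeps the grouping.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
idx <- runif(3)  # hypothetical deflators, one per Species

nested <- iris %>%
  group_by(Species) %>%
  nest()

# The nested tibble is still grouped, with one row per group,
# so a grouped mutate() expects results of length 1.
is_grouped_df(nested)  # TRUE
group_size(nested)     # 1 1 1

# After ungroup(), map2() pairs the three data frames with the three deflators.
adjusted <- nested %>%
  ungroup() %>%
  mutate(adjusted_data = map2(data, idx, ~ .x * .y))
```

Each element of adjusted$adjusted_data should then be the matching element of data multiplied by its own deflator.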
For example, using the iris data:
library(tidyverse)
df_ipca <- data.frame(idx = runif(3))
iris_nested <- iris %>%
group_by(Species) %>%
nest() %>%
ungroup() %>%
mutate(
adjusted_data = map2(data, df_ipca$idx, ~ .x * .y)
)
# Original data
map(iris_nested$data, head)
#> [[1]]
#> # A tibble: 6 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#>
#> [[2]]
#> # A tibble: 6 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 7 3.2 4.7 1.4
#> 2 6.4 3.2 4.5 1.5
#> 3 6.9 3.1 4.9 1.5
#> 4 5.5 2.3 4 1.3
#> 5 6.5 2.8 4.6 1.5
#> 6 5.7 2.8 4.5 1.3
#>
#> [[3]]
#> # A tibble: 6 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6.3 3.3 6 2.5
#> 2 5.8 2.7 5.1 1.9
#> 3 7.1 3 5.9 2.1
#> 4 6.3 2.9 5.6 1.8
#> 5 6.5 3 5.8 2.2
#> 6 7.6 3 6.6 2.1
# Adjusted data
map(iris_nested$adjusted_data, head)
#> [[1]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 1.0206142 0.7004215 0.2801686 0.04002409
#> 2 0.9805901 0.6003613 0.2801686 0.04002409
#> 3 0.9405660 0.6403854 0.2601566 0.04002409
#> 4 0.9205540 0.6203733 0.3001807 0.04002409
#> 5 1.0006022 0.7204336 0.2801686 0.04002409
#> 6 1.0806503 0.7804697 0.3402047 0.08004817
#>
#> [[2]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.3256959 0.1488896 0.2186816 0.06513919
#> 2 0.2977791 0.1488896 0.2093760 0.06979199
#> 3 0.3210431 0.1442368 0.2279872 0.06979199
#> 4 0.2559039 0.1070144 0.1861120 0.06048639
#> 5 0.3024319 0.1302784 0.2140288 0.06979199
#> 6 0.2652095 0.1302784 0.2093760 0.06048639
#>
#> [[3]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 2.399749 1.257011 2.285475 0.9522814
#> 2 2.209293 1.028464 1.942654 0.7237339
#> 3 2.704479 1.142738 2.247384 0.7999164
#> 4 2.399749 1.104646 2.133110 0.6856426
#> 5 2.475932 1.142738 2.209293 0.8380076
#> 6 2.894935 1.142738 2.514023 0.7999164
In fact, you can also omit the group_by() and ungroup() calls by providing the non-nested column (in your case, ano_fiscal) to nest():
iris %>%
nest(data = -Species) %>%
mutate(
adjusted_data = map2(data, df_ipca$idx, ~ .x * .y)
)
which should give the same result as before. Note that, to avoid a warning, you should name the -Species argument inside nest() (here, data = -Species).
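All of the variants above pair the nested data with the deflators by position, so they rely on both being in the same order. If the deflator table also carries the key (for you, presumably the year), a join makes the pairing explicit. The sketch below uses iris, and the deflators table with its Species and idx columns is entirely hypothetical:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical deflator table, deliberately in a different order than iris.
deflators <- tibble(
  Species = factor(c("virginica", "setosa", "versicolor"),
                   levels = levels(iris$Species)),
  idx = c(0.5, 2, 1)
)

adjusted <- iris %>%
  nest(data = -Species) %>%
  left_join(deflators, by = "Species") %>%              # attach idx by key
  mutate(adjusted_data = map2(data, idx, ~ .x * .y))    # idx is now a column
```

Because idx is now a column of the nested tibble, map2() can refer to it directly and the row order of deflators no longer matters.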

Related

Return variable name as string with dplyr in wrapper function

I think this is a reasonably straightforward problem. I am creating a custom function that performs a battery of summary statistics and formats the data consistently for some reports. To accomplish this, I wrote a function that takes three inputs: a dataset, a grouping variable, and a response variable of interest. I compute the summary statistics using dplyr::summarize(). I know that to use dplyr::summarize() in a custom function, I have to embrace the grouping variable and the response variable with curly-curly notation ({{ }}) within the dplyr functions. I want to record the name of the response variable in the output tibble. Outside the tidyverse I would use deparse(substitute()) to accomplish this. However, this method apparently does not work within the tidyverse. Here is my reproducible example. I will walk through it piecemeal, and then post the uninterrupted code at the end of my question.
For my first attempt, I tried the deparse(substitute({{}})) approach
library(tidyverse)
data("iris")
fxn1<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute({{var}})))
}
Demo1<-fxn1(iris, Species, Petal.Width)
Demo1
Unfortunately this created some sort of expression in the Var column, and messes up the summarization.
# A tibble: 12 x 3
# Groups: Species [3]
Species Mean_Val Var
<fct> <dbl> <chr>
1 setosa 0.246 "(function (...) "
2 setosa 0.246 "{"
3 setosa 0.246 " .External2(ffi_tilde_eval, sys.call(), environment(), ~
4 setosa 0.246 "})(Petal.Width)"
5 versicolor 1.33 "(function (...) "
6 versicolor 1.33 "{"
7 versicolor 1.33 " .External2(ffi_tilde_eval, sys.call(), environment(), ~
8 versicolor 1.33 "})(Petal.Width)"
9 virginica 2.03 "(function (...) "
10 virginica 2.03 "{"
11 virginica 2.03 " .External2(ffi_tilde_eval, sys.call(), environment(), ~
12 virginica 2.03 "})(Petal.Width)"
For my second attempt, I got rid of the curly curly notation in deparse(substitute())
fxn2<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute(var)))
}
Demo2<-fxn2(iris, Species, Petal.Width)
Demo2
This is almost correct, but instead of putting "Petal.Width" into the Var column, it adds "var".
# A tibble: 3 x 3
Species Mean_Val Var
<fct> <dbl> <chr>
1 setosa 0.246 var
2 versicolor 1.33 var
3 virginica 2.03 var
What I want my data to look like is this
Desired<-iris %>% group_by(Species) %>% summarize(Mean_Val=mean(Petal.Width), Var="Petal.Width")
Desired
Which looks like this:
# A tibble: 3 x 3
Species Mean_Val Var
<fct> <dbl> <chr>
1 setosa 0.246 Petal.Width
2 versicolor 1.33 Petal.Width
3 virginica 2.03 Petal.Width
Does anyone know how to get dplyr:: to do the equivalent of deparse(substitute), but actually return the name of the variable, and not the argument name? Any guidance would be much appreciated.
Here is the uninterrupted reproducible example:
library(tidyverse)
data("iris")
fxn1<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute({{var}})))
}
Demo1<-fxn1(iris, Species, Petal.Width)
Demo1
fxn2<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute(var)))
}
Demo2<-fxn2(iris, Species, Petal.Width)
Demo2
Desired<-iris %>% group_by(Species) %>% summarize(Mean_Val=mean(Petal.Width), Var="Petal.Width")
Desired
fxn3 at the end of the code below roughly provides your expected output (although "Var" becomes a list column, so it is not ideal); does it solve your problem?
library(tidyverse)
data("iris")
fxn1<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute({{var}})))
}
Demo1<-fxn1(iris, Species, Petal.Width)
#> `summarise()` has grouped output by 'Species'. You can override using the
#> `.groups` argument.
Demo1
#> # A tibble: 12 × 3
#> # Groups: Species [3]
#> Species Mean_Val Var
#> <fct> <dbl> <chr>
#> 1 setosa 0.246 "(function (...) "
#> 2 setosa 0.246 "{"
#> 3 setosa 0.246 " .External2(ffi_tilde_eval, sys.call(), environment(…
#> 4 setosa 0.246 "})(Petal.Width)"
#> 5 versicolor 1.33 "(function (...) "
#> 6 versicolor 1.33 "{"
#> 7 versicolor 1.33 " .External2(ffi_tilde_eval, sys.call(), environment(…
#> 8 versicolor 1.33 "})(Petal.Width)"
#> 9 virginica 2.03 "(function (...) "
#> 10 virginica 2.03 "{"
#> 11 virginica 2.03 " .External2(ffi_tilde_eval, sys.call(), environment(…
#> 12 virginica 2.03 "})(Petal.Width)"
fxn2<-function(DF, grp, var){
out<-DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=deparse(substitute(var)))
}
Demo2<-fxn2(iris, Species, Petal.Width)
Demo2
#> # A tibble: 3 × 3
#> Species Mean_Val Var
#> <fct> <dbl> <chr>
#> 1 setosa 0.246 var
#> 2 versicolor 1.33 var
#> 3 virginica 2.03 var
Desired<-iris %>% group_by(Species) %>% summarize(Mean_Val=mean(Petal.Width), Var="Petal.Width")
Desired
#> # A tibble: 3 × 3
#> Species Mean_Val Var
#> <fct> <dbl> <chr>
#> 1 setosa 0.246 Petal.Width
#> 2 versicolor 1.33 Petal.Width
#> 3 virginica 2.03 Petal.Width
fxn3 <- function(DF, grp, var){
DF %>%
group_by({{grp}}) %>%
summarize(Mean_Val=mean({{var}}, na.rm=TRUE),
Var=c(ensym(var)))
}
Demo3 <- fxn3(iris, Species, Petal.Width)
Demo3
#> # A tibble: 3 × 3
#> Species Mean_Val Var
#> <fct> <dbl> <list>
#> 1 setosa 0.246 <sym>
#> 2 versicolor 1.33 <sym>
#> 3 virginica 2.03 <sym>
print.data.frame(Demo3)
#> Species Mean_Val Var
#> 1 setosa 0.246 Petal.Width
#> 2 versicolor 1.326 Petal.Width
#> 3 virginica 2.026 Petal.Width
Created on 2022-04-21 by the reprex package (v2.0.1)
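If a plain character column is preferred over the list column, one possible variation is to capture the argument as text once, outside summarize(). This is a sketch rather than the only way to do it: fxn4 and var_name are names I made up, and it relies on rlang::enquo() and rlang::as_label().

```r
library(dplyr)

fxn4 <- function(DF, grp, var) {
  # Capture the expression supplied for `var` and turn it into a string.
  var_name <- rlang::as_label(rlang::enquo(var))
  DF %>%
    group_by({{ grp }}) %>%
    summarize(Mean_Val = mean({{ var }}, na.rm = TRUE),
              Var = var_name,
              .groups = "drop")
}

Demo4 <- fxn4(iris, Species, Petal.Width)
Demo4
```

Here Var should come out as an ordinary <chr> column holding "Petal.Width" in every row.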

R Dplyr: Summarizing a column, if it is present

In R, working in the tidyverse:
My data sources change. There's a column which is only present some weeks. When it is, I want to summarize it. Using iris as an example, suppose that Sepal.Width is sometimes missing. Conceptually, I want a function like this
library(tidyverse)
summIris <- function(irisDf){
irisDf %>%
group_by(Species) %>%
summarise_ifPresent(
Sepal.Length = mean(Sepal.Length),
Sepal.Width = mean(Sepal.Width))
}
Which'd return
R > summIris(iris )
# A tibble: 3 x 3
Species Sepal.Length Sepal.Width
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97
> summIris(iris %>% select(- Sepal.Width ))
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
I could work around this by wrapping the logic in if/else. But is there something more concise and elegant?
summarize_at lets you choose the columns to summarize, and you can use starts_with, ends_with, matches, or contains to select those columns dynamically.
library(dplyr)
iris %>%
group_by(Species) %>%
summarize_at(vars(starts_with("Sepal")), mean)
# # A tibble: 3 x 3
# Species Sepal.Length Sepal.Width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 3.43
# 2 versicolor 5.94 2.77
# 3 virginica 6.59 2.97
iris %>%
select(-Sepal.Length) %>%
group_by(Species) %>%
summarize_at(vars(starts_with("Sepal")), mean)
# # A tibble: 3 x 2
# Species Sepal.Width
# <fct> <dbl>
# 1 setosa 3.43
# 2 versicolor 2.77
# 3 virginica 2.97
Using one_of() with explicit column names also works, but it gives a warning for columns that are not found:
iris %>%
select(-Sepal.Length) %>%
group_by(Species) %>%
summarize_at(vars(one_of(c("Sepal.Width", "Sepal.Length"))), mean)
# Warning: Unknown columns: `Sepal.Length`
# # A tibble: 3 x 2
# Species Sepal.Width
# <fct> <dbl>
# 1 setosa 3.43
# 2 versicolor 2.77
# 3 virginica 2.97
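With dplyr 1.0.0 or later, the same idea can be sketched with across() and any_of(), which silently skips names that are not present instead of warning. The function name summIris comes from the question; the rest is my assumption about how it could be filled in:

```r
library(dplyr)

summIris <- function(irisDf) {
  irisDf %>%
    group_by(Species) %>%
    # any_of() keeps only the listed columns that actually exist
    summarise(across(any_of(c("Sepal.Length", "Sepal.Width")), mean),
              .groups = "drop")
}

with_width    <- summIris(iris)
without_width <- summIris(select(iris, -Sepal.Width))
```

The first call should return both means, the second only Sepal.Length, without any if/else in the function body.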

when to use map() function and when to use summarise_at()/mutate_at()

Can anyone suggest when to use map() (and the other map_*() functions) and when to use summarise_at()/mutate_at()?
E.g., if we are modifying a column of vectors, do we not need to think about map()?
If we have a df with a column that has a list in it, do we need to use map()?
Does the map() function always need to be used with the nest() function?
Could anyone suggest some learning videos about this? And also, how can I put lists in a df, model multiple lists at the same time, and then store the model results in another column?
Thank you so much!
The biggest difference between {dplyr} and {purrr} is that {dplyr} is designed to work on data.frames only, whereas {purrr} is designed to work on every kind of list. Since data.frames are lists, you can also use {purrr} to iterate over a data.frame.
map_chr(iris, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
"numeric" "numeric" "numeric" "numeric" "factor"
summarise_at and map_at do not behave exactly the same: summarise_at returns just the summary you're looking for, while map_at returns the whole data.frame as a list, with the modification applied where you asked for it:
> library(purrr)
> library(dplyr)
> small_iris <- sample_n(iris, 5)
> map_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
$Sepal.Length
[1] 6.58
$Sepal.Width
[1] 3.2
$Petal.Length
[1] 6.7 1.3 5.7 4.3 4.7
$Petal.Width
[1] 2.0 0.4 2.1 1.3 1.5
$Species
[1] virginica setosa virginica versicolor versicolor
Levels: setosa versicolor virginica
> summarise_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
Sepal.Length Sepal.Width
1 6.58 3.2
map_at always returns a list, mutate_at always returns a data.frame:
> map_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
$Sepal.Length
[1] 0.77 0.54 0.67 0.64 0.67
$Sepal.Width
[1] 0.28 0.39 0.33 0.29 0.31
$Petal.Length
[1] 6.7 1.3 5.7 4.3 4.7
$Petal.Width
[1] 2.0 0.4 2.1 1.3 1.5
$Species
[1] virginica setosa virginica versicolor versicolor
Levels: setosa versicolor virginica
> mutate_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0.77 0.28 6.7 2.0 virginica
2 0.54 0.39 1.3 0.4 setosa
3 0.67 0.33 5.7 2.1 virginica
4 0.64 0.29 4.3 1.3 versicolor
5 0.67 0.31 4.7 1.5 versicolor
So, to sum up on your first question: if you want to do "column-wise" operations on a non-nested df and get a data.frame as a result, you should go for {dplyr}.
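Since dplyr 1.0.0, summarise_at() and mutate_at() have been superseded by across(). As a rough sketch of the same column-wise operations (here small_iris is simply the first five rows of iris rather than a random sample, so the numbers differ from the output above):

```r
library(dplyr)

small_iris <- head(iris, 5)

# Column-wise summary: one row, only the selected columns.
sepal_means <- summarise(small_iris,
                         across(c(Sepal.Length, Sepal.Width), mean))

# Column-wise modification: all rows and columns are kept.
sepal_scaled <- mutate(small_iris,
                       across(c(Sepal.Length, Sepal.Width), ~ .x / 10))
```

Both results are data.frames, which is the point of staying in {dplyr} for this kind of work.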
Regarding nested columns, you have to combine group_by(), nest() from {tidyr}, mutate() and map(). What you're doing here is creating a smaller version of your data frame that contains a column which is a list of data.frames. Then you use map() to iterate over the elements of this new column.
Here is an example with our beloved iris:
library(tidyr)
iris_n <- iris %>%
group_by(Species) %>%
nest()
iris_n
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica <tibble [50 × 4]>
Here, the new object is a data.frame with the column data being a list of smaller data.frames, one per Species (the factor we specified in group_by()). Then, we can iterate over this column by simply doing:
map(iris_n$data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x))
[[1]]
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
Coefficients:
(Intercept) Sepal.Width
2.6390 0.6905
[[2]]
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
Coefficients:
(Intercept) Sepal.Width
3.5397 0.8651
[[3]]
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
Coefficients:
(Intercept) Sepal.Width
3.9068 0.9015
But the idea is to keep everything inside a data.frame, so we can use mutate to create a column that will keep this new list of lm results:
iris_n %>%
mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)))
# A tibble: 3 x 3
Species data lm
<fct> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
So you can chain several steps inside mutate() to get, e.g., the r.squared:
iris_n %>%
mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),
lm = map(lm, summary),
r_squared = map_dbl(lm, "r.squared"))
# A tibble: 3 x 4
Species data lm r_squared
<fct> <list> <list> <dbl>
1 setosa <tibble [50 × 4]> <S3: summary.lm> 0.551
2 versicolor <tibble [50 × 4]> <S3: summary.lm> 0.277
3 virginica <tibble [50 × 4]> <S3: summary.lm> 0.209
But a more compact way is to use compose() from {purrr} to build a function that does it all in one step, instead of repeating the map() calls.
get_rsquared <- compose(as_mapper("r.squared"), summary, lm)
iris_n %>%
mutate(lm = map_dbl(data, ~ get_rsquared(Sepal.Length ~ Sepal.Width, data = .x)))
# A tibble: 3 x 3
Species data lm
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 0.551
2 versicolor <tibble [50 × 4]> 0.277
3 virginica <tibble [50 × 4]> 0.209
If you know you'll always be using Sepal.Length ~ Sepal.Width, you can even prefill lm() with partial():
pr_lm <- partial(lm, formula = Sepal.Length ~ Sepal.Width)
get_rsquared <- compose(as_mapper("r.squared"), summary, pr_lm)
iris_n %>%
mutate(lm = map_dbl(data, get_rsquared))
# A tibble: 3 x 3
Species data lm
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 0.551
2 versicolor <tibble [50 × 4]> 0.277
3 virginica <tibble [50 × 4]> 0.209
Regarding resources, I've written a series of blog posts on {purrr} that you can check out: https://colinfay.me/tags/#purrr
Colin gives a great self-contained answer. Since you asked for more resources on using multiple models with tibbles, I'd also like to add the Many Models chapter of R for Data Science, which gives a broad overview of creating, simplifying, and modeling with list-columns. http://r4ds.had.co.nz/many-models.html

purrr mapping not producing tidy data

Thanks to this site, I'm using the R purrr package to aggregate data based on multiple columns. The aggregation works how I want, but the output does not. Here is a sample using the mtcars dataset.
library(dplyr)
library(purrr)
#pull in data
data <- mtcars
#get colnames
variable1 <- colnames(data)
#map the variables
t1 <- map(variable1, ~ data %>%
group_by_at(.x) %>%
summarize(number = mean(mpg))) %>%
set_names(variable1) %>%
bind_rows(., .id = 'variable')
Where I expect three columns (predictor variable, levels within each of those variables, aggregation), I get 8.
How can I take my code up at the top and turn out a tidy dataset?
A simple way to do this is to reshape your data to long form, which lets you aggregate with ordinary dplyr:
library(tidyverse)
mpg_means <- mtcars %>%
gather(variable, value, -mpg) %>%
group_by(variable, value) %>%
summarise(mean_mpg = mean(mpg))
mpg_means
#> # A tibble: 146 x 3
#> # Groups: variable [?]
#> variable value mean_mpg
#> <chr> <dbl> <dbl>
#> 1 am 0. 17.1
#> 2 am 1. 24.4
#> 3 carb 1. 25.3
#> 4 carb 2. 22.4
#> 5 carb 3. 16.3
#> 6 carb 4. 15.8
#> 7 carb 6. 19.7
#> 8 carb 8. 15.0
#> 9 cyl 4. 26.7
#> 10 cyl 6. 19.7
#> # ... with 136 more rows
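In current versions of tidyr, gather() is superseded by pivot_longer(). Assuming tidyr 1.0.0 or later, the same reshaping could be sketched like this (mpg_means_long is just a name chosen for the example):

```r
library(dplyr)
library(tidyr)

mpg_means_long <- mtcars %>%
  # one row per (car, predictor) pair; mpg is kept as-is
  pivot_longer(-mpg, names_to = "variable", values_to = "value") %>%
  group_by(variable, value) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop")
```

This should produce the same three columns (variable, value, mean_mpg) as the gather() version.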
Note that while mtcars is entirely numeric, if you have different types, converting to long form will coerce variable types. The calculations will be the same, but it may cause issues later. To resolve it, use an output format that can handle diverse types, e.g.
mpg_means_in_list_cols <- mtcars %>%
as_tibble() %>% # compact printing for list columns
summarise_all(list) %>% # collapse each column into a list of itself
gather(group, group_values, -mpg) %>%
mutate(mpg_means = map2(mpg, group_values, # for each mpg/value pair, ...
~tibble(mpg = .x, group_value = .y) %>% # ...reconstruct a data frame...
group_by(group_value) %>%
summarise(mean_mpg = mean(mpg)))) # ...and aggregate
mpg_means_in_list_cols
#> # A tibble: 10 x 4
#> mpg group group_values mpg_means
#> <list> <chr> <list> <list>
#> 1 <dbl [32]> cyl <dbl [32]> <tibble [3 × 2]>
#> 2 <dbl [32]> disp <dbl [32]> <tibble [27 × 2]>
#> 3 <dbl [32]> hp <dbl [32]> <tibble [22 × 2]>
#> 4 <dbl [32]> drat <dbl [32]> <tibble [22 × 2]>
#> 5 <dbl [32]> wt <dbl [32]> <tibble [29 × 2]>
#> 6 <dbl [32]> qsec <dbl [32]> <tibble [30 × 2]>
#> 7 <dbl [32]> vs <dbl [32]> <tibble [2 × 2]>
#> 8 <dbl [32]> am <dbl [32]> <tibble [2 × 2]>
#> 9 <dbl [32]> gear <dbl [32]> <tibble [3 × 2]>
#> 10 <dbl [32]> carb <dbl [32]> <tibble [6 × 2]>
While this is decidedly not as pretty, it's capable of holding many types tidily. To extract the result above, just add %>% unnest(mpg_means). As-is, grouping variables are each held in a list element of group_values and in aggregated form in the first column of each mpg_means tibble.
When grouping your data within the map, you can rename the grouping variable to "level", since those values will form the column containing the levels of the grouping variable in the final data set.
When you have mixed types of grouping variables (e.g. both numeric and character), you'll also need to coerce the grouping variable to character in order to be able to bind the results together.
With those additions, you should get what you expect. (You can also skip the bind_rows by using map_df instead of map, to save a little bit of code, like I've done below.)
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-09
library(purrr)
library(dplyr)
data <- iris
vars <- names(data)
set_names(vars) %>%
map_df(function(var) {
var <- set_names(var, "level")
data %>%
group_by_at(var) %>%
summarize_at("Sepal.Length", "mean") %>%
mutate_at("level", as.character)
}, .id = "variable")
#> # A tibble: 126 x 3
#> variable level Sepal.Length
#> <chr> <chr> <dbl>
#> 1 Sepal.Length 4.3 4.3
#> 2 Sepal.Length 4.4 4.4
#> 3 Sepal.Length 4.5 4.5
#> 4 Sepal.Length 4.6 4.6
#> 5 Sepal.Length 4.7 4.7
#> 6 Sepal.Length 4.8 4.8
#> 7 Sepal.Length 4.9 4.9
#> 8 Sepal.Length 5 5.0
#> 9 Sepal.Length 5.1 5.1
#> 10 Sepal.Length 5.2 5.2
#> # ... with 116 more rows
You could also wrap the process in a function, and allow multiple variables to summarise with multiple functions. You'd have to spend a moment to come up with an evocative name though (I cheated and just used foo here).
foo <- function(data, vars, funs) {
grps <- names(data)
set_names(grps) %>%
map_df(function(grp) {
grp <- set_names(grp, "level")
data %>%
group_by_at(grp) %>%
summarize_at(vars, funs) %>%
mutate_at("level", as.character)
}, .id = "variable")
}
foo(iris, vars(Sepal.Length, Sepal.Width), list(mean = mean, sd = sd))
#> # A tibble: 126 x 6
#> variable level Sepal.Length_mean Sepal.Width_mean Sepal.Length_sd
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Sepal.Length 4.3 4.3 3.000000 NaN
#> 2 Sepal.Length 4.4 4.4 3.033333 0
#> 3 Sepal.Length 4.5 4.5 2.300000 NaN
#> 4 Sepal.Length 4.6 4.6 3.325000 0
#> 5 Sepal.Length 4.7 4.7 3.200000 0
#> 6 Sepal.Length 4.8 4.8 3.180000 0
#> 7 Sepal.Length 4.9 4.9 2.950000 0
#> 8 Sepal.Length 5 5.0 3.120000 0
#> 9 Sepal.Length 5.1 5.1 3.477778 0
#> 10 Sepal.Length 5.2 5.2 3.425000 0
#> # ... with 116 more rows, and 1 more variables: Sepal.Width_sd <dbl>

R: predict new values for groups

I've calculated a different regression for each group in a data frame:
DF.L <- DF %>%
group_by(Channel) %>%
do(Fit = rlm(L ~ -1 + Y + I(Y^2), data = .))
I want to apply this set of regressions to another data frame. To do so, I'm testing how to apply it to the same data frame:
DF %>%
group_by(Channel) %>%
do({
Lfit <- predict(subset(DF.L, Channel == unique(.$Channel))$Fit, .)
data.frame(., Lfit)
})
glimpse(DF)
But I keep getting this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "list"
Calls: %>% ... do_.grouped_df -> eval -> eval -> predict -> predict
What am I doing wrong?
Using the built-in ChickWeight data:
library(dplyr)
library(MASS)
library(broom)
library(tidyr)
library(ggplot2)
head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
Fit some models
ChickWeight_models <- ChickWeight %>%
group_by(Diet) %>%
do(fit = MASS::rlm(weight ~ Time + I(Time^2), data = .))
ChickWeight_models
Source: local data frame [4 x 2]
Groups: <by row>
# A tibble: 4 x 2
Diet fit
* <fctr> <list>
1 1 <S3: rlm>
2 2 <S3: rlm>
3 3 <S3: rlm>
4 4 <S3: rlm>
So I've created a very similar object to your DF.L. It's a frame with the four groups, each with an rlm object in a list-column called fit.
Make up some test data
Now I'll make up some data to test this model on. In this case, I'll just take the original data and add some noise to each of the variables.
ChickWeight_simulated <- ChickWeight %>%
mutate(Time = Time + runif(length(Time)),
weight = weight + rnorm(length(weight)))
ChickWeight_simulated
weight Time Chick Diet
1 42.72075 0.9786272 1 1
2 51.12669 2.8399631 1 1
3 58.64632 4.4576380 1 1
4 63.77617 6.1083591 1 1
5 75.40434 8.1051792 1 1
6 91.75830 10.7899030 1 1
Now we want to combine the dataframe of the models with the new data to test on. First we group_by and tidyr::nest the simulated data. This creates an object that is a dataframe with the four groups and a list-column called data, each element of which contains a rolled-up dataframe.
ChickWeight_simulated %>% group_by(Diet) %>% nest()
# A tibble: 4 x 2
Diet data
<fctr> <list>
1 1 <tibble [220 x 3]>
2 2 <tibble [120 x 3]>
3 3 <tibble [120 x 3]>
4 4 <tibble [118 x 3]>
Add the original models to the new data
Then we can join it to the models dataframe:
ChickWeight_simulated %>% group_by(Diet) %>% nest() %>%
full_join(ChickWeight_models)
# A tibble: 4 x 3
Diet data fit
<fctr> <list> <list>
1 1 <tibble [220 x 3]> <S3: rlm>
2 2 <tibble [120 x 3]> <S3: rlm>
3 3 <tibble [120 x 3]> <S3: rlm>
4 4 <tibble [118 x 3]> <S3: rlm>
Now we group by Diet again, and use broom::augment to make a prediction of each model on the new simulated data. Since each group is one row, there is one element each of fit and data; we have to extract that single element out of each list-column into a usable form by using [[1]].
ChickWeight_simulated_predicted <-
ChickWeight_simulated %>% group_by(Diet) %>% nest() %>%
full_join(ChickWeight_models) %>%
group_by(Diet) %>%
do(augment(.$fit[[1]], newdata = .$data[[1]]))
head(ChickWeight_simulated_predicted)
# A tibble: 6 x 6
# Groups: Diet [1]
Diet weight Time Chick .fitted .se.fit
<fctr> <dbl> <dbl> <ord> <dbl> <dbl>
1 1 42.72075 0.9786272 1 43.62963 2.368838
2 1 51.12669 2.8399631 1 51.80855 1.758385
3 1 58.64632 4.4576380 1 59.67606 1.534051
4 1 63.77617 6.1083591 1 68.43218 1.534152
5 1 75.40434 8.1051792 1 80.00678 1.647612
6 1 91.75830 10.7899030 1 97.26450 1.726331
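The same [[1]] extraction also explains the error in the question: subsetting the models frame and taking the fit column yields a list holding one model, not the model itself. A minimal sketch of this, assuming plain lm() in place of MASS::rlm() and ChickWeight in place of DF (models, fit_col, msg and pred are names made up for the example):

```r
library(dplyr)
library(tidyr)
library(purrr)

models <- as.data.frame(ChickWeight) %>%
  nest(data = -Diet) %>%
  mutate(fit = map(data, ~ lm(weight ~ Time + I(Time^2), data = .x)))

# One row is selected, but the column is still a list of length 1.
fit_col <- subset(models, Diet == "1")$fit
class(fit_col)

# predict() has no method for a bare list, so this call fails.
msg <- tryCatch(predict(fit_col, models$data[[1]]),
                error = function(e) conditionMessage(e))

# Pulling the single element out gives predict() a real model object.
pred <- predict(fit_col[[1]], newdata = models$data[[1]])
```

The message stored in msg should be the same kind of "no applicable method" error as in the question.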
Sanity check
To prove that this really only used the model from a particular level of Diet on the simulated data from that level of Diet, we can visualize the model fit.
ChickWeight_simulated_predicted %>%
ggplot(aes(Time, weight)) +
geom_point(shape = 1) +
geom_ribbon(aes(Time,
ymin = .fitted-1.96*.se.fit,
ymax = .fitted+1.96*.se.fit),
alpha = 0.5, fill = "black") +
geom_line(aes(Time, .fitted), size = 1, color = "red") +
facet_wrap(~Diet)
I think your error comes from how you are calling predict. I can't fix your exact code, but here is a simple way you can get predictions from your model. A more sophisticated way using purrr and nest is outlined here: http://ijlyttle.github.io/isugg_purrr/presentation.html#(1)
UPDATE - the purrr and nest way
Just adding this to show that it can be done pretty easily within the tidyverse, using predict. See link above for more details.
library(tidyverse)
# shuffle the rows to mix up the species
set.seed(1234)
myiris <- iris[sample(nrow(iris), replace = F),]
# create first dataset - use the first 50 rows for running the model
iris_nested <-
myiris[1:50,] %>%
nest(-Species) %>%
rename(myorigdata = data)
# create second dataset - use the other 100 rows for making predictions
new_iris_nested <-
myiris[51:150,] %>%
nest(-Species) %>%
rename(mynewdata = data)
# make a model function
my_rlm <- function(df) {
MASS::rlm(Sepal.Length ~ Petal.Length + Petal.Width, data = df)
}
# get the predictions (see the GitHub link above which breaks this into steps)
predictions_tall <-
iris_nested %>%
mutate(my_model = map(myorigdata, my_rlm)) %>%
full_join(new_iris_nested, by = "Species") %>%
mutate(my_new_pred = map2(my_model, mynewdata, predict)) %>%
select(Species, mynewdata, my_new_pred) %>%
unnest(mynewdata, my_new_pred) %>%
rename(modeled = my_new_pred, measured = Sepal.Length) %>%
gather("Type", "Sepal.Length", modeled, measured)
The nested predictions_tall object looks like this:
predictions_tall %>% nest(-Species, -Type) %>% as_tibble()
# A tibble: 6 x 3
Species Type data
<fctr> <chr> <list>
1 setosa modeled <data.frame [32 x 4]>
2 versicolor modeled <data.frame [33 x 4]>
3 virginica modeled <data.frame [35 x 4]>
4 setosa measured <data.frame [32 x 4]>
5 versicolor measured <data.frame [33 x 4]>
6 virginica measured <data.frame [35 x 4]>
And finally, the plot to show the prediction results:
predictions_tall %>%
ggplot(aes(x = Petal.Length, y = Sepal.Length)) +
geom_line(aes(color = Species, linetype = Type))
ORIGINAL - the broom way
I've updated this now to only calculate predictions for each group using the model for that group.
This way uses the broom package - specifically the augment function - to add fitted values. See more here: https://cran.r-project.org/web/packages/broom/vignettes/broom.html
Since you don't supply data, I use iris here.
library(tidyverse)
library(broom)
# first shuffle around the rows of iris
set.seed(1234)
myiris <- iris[sample(nrow(iris), replace = F),]
# first data - first 25 rows for running the models on
origiris <-
myiris[1:25,] %>%
nest(-Species) %>%
rename(origdata = data)
# second data - last 50 rows for predicting on
prediris <-
myiris[101:150,] %>%
nest(-Species) %>%
rename(preddata = data)
# estimate models on the first 25 rows
# a separate model is estimated for each species
iris_mod <-
origiris %>%
mutate(mod = map(origdata, ~ MASS::rlm(Sepal.Length ~ Petal.Length + Petal.Width, data = .)))
First get fitted values for the original dataset (not essential, just for illustration):
# get fitted values for the first dataset (origdata)
origiris_aug <-
iris_mod %>%
mutate(origpred = map(mod, augment)) %>%
unnest(origpred) %>%
as.tibble()
The origiris_aug predictions dataframe looks like this:
origiris_aug
# A tibble: 25 x 10
Species .rownames Sepal.Length Petal.Length Petal.Width .fitted .se.fit .resid
<fctr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 18 5.1 1.4 0.3 5.002797 0.1514850 0.09720290
2 setosa 2 4.9 1.4 0.2 4.931824 0.1166911 -0.03182417
3 setosa 34 5.5 1.4 0.2 4.931824 0.1166911 0.56817583
4 setosa 40 5.1 1.5 0.2 4.981975 0.1095883 0.11802526
5 setosa 39 4.4 1.3 0.2 4.881674 0.1422123 -0.48167359
6 setosa 36 5.0 1.2 0.2 4.831523 0.1784156 0.16847698
7 setosa 25 4.8 1.9 0.2 5.182577 0.2357614 -0.38257703
8 setosa 31 4.8 1.6 0.2 5.032125 0.1241074 -0.23212531
9 setosa 42 4.5 1.3 0.3 4.952647 0.1760223 -0.45264653
10 setosa 21 5.4 1.7 0.2 5.082276 0.1542594 0.31772411
# ... with 15 more rows, and 2 more variables: .hat <dbl>, .sigma <dbl>
And now what you actually want - making predictions on the new dataset:
# get fitted values for the second dataset (preddata)
# each model is fitted to the appropriate species' nested dataframe
prediris_aug <-
iris_mod %>%
inner_join(prediris, by = "Species") %>%
map2_df(.x = iris_mod$mod, .y = prediris$preddata, .f = ~augment(.x, newdata = .y)) %>%
as.tibble()
The prediris_aug dataframe looks like this:
prediris_aug
# A tibble: 50 x 7
.rownames Sepal.Length Sepal.Width Petal.Length Petal.Width .fitted .se.fit
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 105 6.5 3.0 5.8 2.2 8.557908 3.570269
2 115 5.8 2.8 5.1 2.4 8.348800 3.666631
3 117 6.5 3.0 5.5 1.8 8.123565 3.005888
4 139 6.0 3.0 4.8 1.8 7.772511 2.812748
5 103 7.1 3.0 5.9 2.1 8.537086 3.475224
6 107 4.9 2.5 4.5 1.7 7.551086 2.611123
7 119 7.7 2.6 6.9 2.3 9.180537 4.000412
8 135 6.1 2.6 5.6 1.4 7.889823 2.611457
9 124 6.3 2.7 4.9 1.8 7.822661 2.838502
10 118 7.7 3.8 6.7 2.2 9.009263 3.825613
# ... with 40 more rows
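Note that in the map2_df() call above, .x and .y are taken from iris_mod$mod and prediris$preddata by position, so the result of the inner_join() is not what pairs each model with its data. If the two nested frames could be in a different order, it seems safer to map over the joined columns. A hedged sketch, assuming lm() with predict() instead of rlm() with augment(), and a simple odd/even row split in place of the random one (train, test, models, new_data and predictions are made-up names):

```r
library(dplyr)
library(tidyr)
library(purrr)

train <- iris[seq(1, nrow(iris), by = 2), ]
test  <- iris[seq(2, nrow(iris), by = 2), ]

models <- train %>%
  nest(origdata = -Species) %>%
  mutate(mod = map(origdata,
                   ~ lm(Sepal.Length ~ Petal.Length + Petal.Width, data = .x)))

# Deliberately put the new data in a different row order than the models.
new_data <- test %>%
  nest(preddata = -Species) %>%
  arrange(desc(Species))

predictions <- models %>%
  inner_join(new_data, by = "Species") %>%   # pair model and data by key
  mutate(fitted = map2(mod, preddata,
                       ~ unname(predict(.x, newdata = .y)))) %>%
  select(Species, preddata, fitted) %>%
  unnest(c(preddata, fitted))
```

Because mod and preddata sit in the same row after the join, each species' model is applied only to that species' new data, whatever the row order of the inputs.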
